Priors and predictions in everyday cognition
Tom Griffiths, Cognitive and Linguistic Sciences

data → behavior
• What computational problem is the brain solving?
• Do optimal solutions to that problem help to explain human behavior?

Inductive problems
• Inferring structure from data
• Perception
– e.g. structure of the 3D world from 2D visual data (data: a 2D image; hypotheses: a cube, a shaded hexagon)
• Cognition
– e.g. form of a causal relationship from samples

Reverend Thomas Bayes

Bayes’ theorem
p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
– p(h | d): posterior probability; p(d | h): likelihood; p(h): prior probability
– h: hypothesis; d: data; the denominator sums over the space of hypotheses

Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)

Perception is optimal (Körding & Wolpert, 2004)
Cognition is not

Do people use priors?
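The computation in Bayes’ theorem can be made concrete with a short numeric sketch. Everything here (the coin scenario, the prior and likelihood values) is invented for illustration, not taken from the talk:

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a discrete hypothesis space:
    p(h | d) = p(d | h) p(h) / sum over h' of p(d | h') p(h')."""
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    z = sum(unnorm.values())              # sum over the space of hypotheses
    return {h: v / z for h, v in unnorm.items()}

# Two hypotheses about a coin, with invented numbers: is it fair, or
# biased toward heads?  Data d = three heads in a row.
prior = {"fair": 0.9, "biased": 0.1}                  # p(h)
likelihood = {"fair": 0.5 ** 3, "biased": 0.9 ** 3}   # p(d | h)
post = posterior(prior, likelihood)
# The data shift belief toward "biased", but the strong prior keeps "fair" ahead.
```

The point of the sketch is the interplay the talk turns on next: the conclusion depends jointly on the data (through the likelihood) and on prior knowledge.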
p(h | d) ∝ p(d | h) p(h)
Standard answer: no (Tversky & Kahneman, 1974)

Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?

Prior knowledge matters when…
• …using a single datapoint
– predicting the future
• …using secondhand data
– effects of priors on cultural transmission

Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions

Predicting the future
How often is Google News updated?
t = time since last update
ttotal = time between updates
What should we guess for ttotal given t?

Everyday prediction problems
• You read about a movie that has made $60 million to date. How much money will it make in total?
• You see that something has been baking in the oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long will they live?
• Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of the train station. How many cabs are in this city?

Making predictions
• You encounter a phenomenon that has existed for t units of time.
How long will it continue into the future? (i.e., what’s ttotal?)
• We could replace “time” with any other variable that ranges from 0 to some unknown upper limit.

Bayesian inference
p(ttotal | t) ∝ p(t | ttotal) p(ttotal)
(posterior probability ∝ likelihood × prior)
• Likelihood: assume t is a random sample from the interval, so p(t | ttotal) = 1/ttotal for 0 < t < ttotal
• Prior: assume an “uninformative” prior, p(ttotal) ∝ 1/ttotal (Gott, 1993)
• Posterior: p(ttotal | t) ∝ 1/ttotal × 1/ttotal = 1/ttotal²

What is the best guess for ttotal?
• How about the maximal value of p(ttotal | t)? The posterior is maximized at ttotal = t.
• Instead, compute t* such that p(ttotal > t* | t) = 0.5. This yields Gott’s Rule:
P(ttotal > t* | t) = 0.5 when t* = 2t
i.e., the best guess for ttotal is 2t.

Applying Gott’s rule
• t ≈ 4,000 years → t* ≈ 8,000 years
• t ≈ 130,000 years → t* ≈ 260,000 years

Predicting everyday events
• You read about a movie that has made $78 million to date. How much money will it make in total?
– “$156 million” seems reasonable
• You meet someone who is 35 years old. How long will they live?
– “70 years” seems reasonable
• Not so simple:
– You meet someone who is 78 years old. How long will they live?
– You meet someone who is 6 years old. How long will they live?
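Both Gott’s rule and the sensitivity to the prior can be checked numerically. The sketch below computes the posterior median of ttotal given t on a grid, for an arbitrary prior: with the 1/ttotal prior the median comes out near 2t, while with an illustrative Gaussian lifespan prior (mean 75, sd 15 — invented parameters, not values from the talk) the prediction for a 78-year-old falls far below 2t = 156:

```python
import math

def posterior_median(t, prior, grid_max, n=200_000):
    """Median of p(t_total | t) ∝ prior(t_total) / t_total for t_total > t,
    approximated on a uniform grid (midpoint rule)."""
    step = (grid_max - t) / n
    ts = [t + (i + 0.5) * step for i in range(n)]
    weights = [prior(s) / s for s in ts]   # prior × likelihood p(t | t_total) = 1/t_total
    total, acc = sum(weights), 0.0
    for s, w in zip(ts, weights):
        acc += w
        if acc >= 0.5 * total:             # first grid point past half the mass
            return s
    return ts[-1]

# Gott's "uninformative" prior p(t_total) ∝ 1/t_total gives t* ≈ 2t:
gott = posterior_median(8.0, lambda s: 1.0 / s, grid_max=80_000.0)

# An illustrative Gaussian lifespan prior (mean 75, sd 15 -- invented numbers)
# gives very different predictions for a 78-year-old and a 6-year-old:
lifespan = lambda s: math.exp(-((s - 75.0) ** 2) / (2.0 * 15.0 ** 2))
old = posterior_median(78.0, lifespan, grid_max=200.0)    # well below 2t = 156
young = posterior_median(6.0, lifespan, grid_max=200.0)   # far above 2t = 12
```

The same median-of-the-posterior rule produces sensible answers in both cases once the prior matches the domain, which is exactly the observation the next section develops.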
The effects of priors
• Different kinds of priors p(ttotal) are appropriate in different domains:
– power-law priors, as in Gott’s p(ttotal) ∝ ttotal⁻¹ — e.g., wealth, contacts
– Gaussian priors — e.g., height, lifespan

Evaluating human predictions
• Different domains with different priors:
– A movie has made $60 million
– Your friend quotes from line 17 of a poem
– You meet a 78 year old man
– A movie has been running for 55 minutes
– A U.S. congressman has served for 11 years
– A cake has been in the oven for 34 minutes
• Use 5 values of t for each
• People predict ttotal
• Compare people’s predictions with Gott’s rule, an empirical prior, and a parametric prior

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?
How long did the typical pharaoh reign in ancient Egypt?

…using a single datapoint
• People produce accurate predictions for the duration and extent of everyday events
• Strong prior knowledge
– form of the prior (power-law or exponential)
– distribution given that form (parameters)
– non-parametric distribution when necessary
• Reveals a surprising correspondence between probabilities in the mind and in the world

Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions

Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
– language
– religious concepts
• How do priors affect cultural transmission?

Iterated learning (Briscoe, 1998; Kirby, 2001)
data → hypothesis → data → hypothesis → …
• Each learner sees data, forms a hypothesis, and produces the data given to the next learner
• c.f. the playground game “telephone”

Explaining linguistic universals
• Human languages are a subset of all logically possible communication schemes
– universal properties are common to all languages (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Traditional answer:
– linguistic universals reflect innate constraints specific to a system for acquiring language
• Alternative answer:
– iterated learning imposes an “information bottleneck”
– universal properties survive this bottleneck (Briscoe, 1998; Kirby, 2001)

Analyzing iterated learning
What are the consequences of iterated learning?
• Complex algorithms, simulations: Kirby (2001); Smith, Kirby, & Brighton (2003)
• Simple algorithms, simulations: Brighton (2002)
• Simple algorithms, analytic results: Komarova, Niyogi, & Nowak (2002)
• Complex algorithms, analytic results: ?

Iterated Bayesian learning
d0 →[p(h|d)]→ h1 →[p(d|h)]→ d1 →[p(h|d)]→ h2 → …
• Learners are rational Bayesian agents
– covers a wide range of learning algorithms
• Defines a Markov chain on (h, d) pairs

Analytic results
• The stationary distribution of the Markov chain is p(d, h) = p(d | h) p(h)
• Convergence holds under easily checked conditions
• The rate of convergence is geometric
– iterated learning is a Gibbs sampler on p(d, h)
– the Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)
• Corollaries:
– the distribution over hypotheses converges to p(h)
– the distribution over data converges to p(d)
– the proportion of a population of iterated learners with hypothesis h converges to p(h)

An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means μ of a Gaussian
– the prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(xn+1 | xn) is Gaussian(μn, σx² + σn²), where
μn = (xn/σx² + μ0/σ0²) / (1/σx² + 1/σ0²)
σn² = 1 / (1/σx² + 1/σ0²)
• With μ0 = 0, σ0² = 1, and x0 = 20, iterated learning results in rapid convergence to the prior

Implications for linguistic universals
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Different answers:
– existence is explained through iterated learning
– universal properties depend on the prior
• Focuses inquiry on the priors of the learners

A method for discovering priors
Iterated learning converges to the prior…
…so we can evaluate a prior by running iterated learning in the lab

Iterated function learning
• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– p(θ) is Gaussian(θ0, σ0²)
– e.g., θ0 = 1, σ0² = 0.1, with initial datum y0 = −1 observed at x = 1

Function learning in the lab
• Stimulus → Response (slider) → Feedback
• Examine iterated learning with different initial data, tracking responses over iterations 1–9

…using secondhand data
• Iterated Bayesian learning converges to the prior
• Constrains explanations of linguistic universals
• Provides a method for evaluating priors
– concepts, causal relationships, languages, …
• Open questions in Bayesian language evolution
– variation in priors
– other selective pressures

Outline
• …using a single datapoint
– predicting the future
• …using secondhand data
– effects of priors on cultural transmission
• Conclusions

Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)
inference = f(data, knowledge)
• A unifying principle for explaining inductive inferences
• A means of evaluating the priors that inform those inferences

Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?
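The Gaussian iterated-learning example above can be simulated directly: each learner sees one datum, forms the conjugate posterior over the mean μ, samples a hypothesis from it, and generates the datum for the next learner. The observation variance σx² = 1 below is an assumed value (the slides give μ0 = 0, σ0² = 1, x0 = 20 but not σx²); under these settings the chain forgets the extreme starting point x0 = 20 within a few generations and settles around the prior mean:

```python
import random

def iterated_learning(x0, mu0=0.0, var0=1.0, var_x=1.0, steps=200, seed=0):
    """Chain of Bayesian learners: each sees one datum x, forms the conjugate
    posterior over the Gaussian mean mu, samples a hypothesis from it, and
    generates the datum passed to the next learner."""
    rng = random.Random(seed)
    x, data = x0, []
    for _ in range(steps):
        post_var = 1.0 / (1.0 / var_x + 1.0 / var0)       # sigma_n^2
        post_mean = post_var * (x / var_x + mu0 / var0)   # mu_n
        mu = rng.gauss(post_mean, post_var ** 0.5)   # sample hypothesis h ~ p(h | d)
        x = rng.gauss(mu, var_x ** 0.5)              # produce data d ~ p(d | h)
        data.append(x)
    return data

chain = iterated_learning(x0=20.0)
```

The early data still reflect x0, while the later data fluctuate around μ0 = 0, illustrating the analytic result that the chain converges to the prior regardless of the initial data.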
Markov chain Monte Carlo
• Sample from a Markov chain which converges to the target distribution
• Allows sampling from an unnormalized posterior distribution
• Can compute approximate statistics from intractable distributions (MacKay, 2002)

Markov chain Monte Carlo
x(0) → x(1) → x(2) → …
Transition matrix: P(x(t+1) | x(t)) = T(x(t), x(t+1))
• States of the chain are the variables of interest
• The transition matrix is chosen to give the target distribution as the stationary distribution

Gibbs sampling
• A particular choice of proposal distribution (for single-component Metropolis-Hastings)
• For variables x = x1, x2, …, xn, draw xi(t+1) from P(xi | x−i), where
x−i = (x1(t+1), x2(t+1), …, xi−1(t+1), xi+1(t), …, xn(t))
• a.k.a. the heat bath algorithm in statistical physics (MacKay, 2002)
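The Gibbs scheme described above can be sketched for the standard textbook case of a bivariate normal with correlation ρ, where both conditionals are available in closed form, x_i | x_j ~ N(ρ·x_j, 1 − ρ²). This is a generic illustration, not code from the talk:

```python
import random

def gibbs_bivariate_normal(rho, steps=20_000, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    each sweep draws one coordinate from its conditional given the other."""
    rng = random.Random(seed)
    x = y = 0.0
    sd = (1.0 - rho ** 2) ** 0.5          # conditional standard deviation
    samples = []
    for _ in range(steps):
        x = rng.gauss(rho * y, sd)        # draw x1 from P(x1 | x2)
        y = rng.gauss(rho * x, sd)        # draw x2 from P(x2 | x1), using the fresh x1
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.8)
```

Even though each step only ever samples a one-dimensional conditional, the chain’s stationary distribution is the full joint: the empirical means come out near 0 and the empirical correlation near ρ.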
