Priors and predictions in
everyday cognition
Tom Griffiths
Cognitive and Linguistic Sciences
data
behavior
What computational problem is the brain solving?
Do optimal solutions to that problem help to
explain human behavior?
Inductive problems
• Inferring structure from data
• Perception
– e.g. structure of 3D world from 2D visual data
data
hypotheses
cube
shaded hexagon
Inductive problems
• Inferring structure from data
• Perception
– e.g. structure of 3D world from 2D data
• Cognition
– e.g. form of causal relationship from samples
data
hypotheses
Reverend Thomas Bayes
Bayes’ theorem
p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
posterior probability = likelihood × prior, normalized by a sum over the space of hypotheses H
h: hypothesis
d: data
Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)
h: hypothesis
d: data
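As a minimal sketch of applying this rule over a discrete hypothesis space (the prior and likelihood values below are invented for illustration, borrowing the two hypotheses from the perception example):

```python
import numpy as np

# Minimal sketch of Bayes' rule over a small discrete hypothesis space.
# The prior and likelihood numbers are made up for illustration.
hypotheses = ["cube", "shaded hexagon"]
prior = np.array([0.7, 0.3])        # p(h): prior probability of each hypothesis
likelihood = np.array([0.8, 0.5])   # p(d | h): probability of the image under each h

posterior = likelihood * prior      # numerator of Bayes' rule
posterior /= posterior.sum()        # normalize by the sum over the hypothesis space

for h, p in zip(hypotheses, posterior):
    print(f"p({h} | d) = {p:.3f}")
```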
Perception is optimal
Körding & Wolpert (2004)
Cognition is not
Do people use priors?
p(h | d) ∝ p(d | h) p(h)
Standard answer: no
(Tversky & Kahneman, 1974)
Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?
Prior knowledge matters when…
• …using a single datapoint
– predicting the future
• …using secondhand data
– effects of priors on cultural transmission
Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions
Predicting the future
How often is Google News updated?
t = time since last update
ttotal = time between updates
What should we guess for ttotal given t?
Everyday prediction problems
• You read about a movie that has made $60 million
to date. How much money will it make in total?
• You see that something has been baking in the
oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long
will they live?
• Your friend quotes to you from line 17 of his
favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of
the train station. How many cabs in this city?
Making predictions
• You encounter a phenomenon that has
existed for t units of time. How long will it
continue into the future? (i.e. what’s ttotal?)
• We could replace “time” with any other
variable that ranges from 0 to some
unknown upper limit.
Bayesian inference
p(ttotal | t) ∝ p(t | ttotal) p(ttotal)
posterior
probability
likelihood
prior
Bayesian inference
p(ttotal | t) ∝ p(t | ttotal) p(ttotal)
posterior probability; likelihood; prior
p(ttotal | t) ∝ 1/ttotal × p(ttotal)
assume random sampling: p(t | ttotal) = 1/ttotal for 0 < t < ttotal
Bayesian inference
p(ttotal | t) ∝ p(t | ttotal) p(ttotal)
posterior probability; likelihood; prior
p(ttotal | t) ∝ 1/ttotal × 1/ttotal
assume random sampling (0 < t < ttotal) and an "uninformative" prior p(ttotal) ∝ 1/ttotal (Gott, 1993)
Bayesian inference
p(ttotal | t) ∝ 1/ttotal × 1/ttotal
posterior probability; random sampling; "uninformative" prior
What is the best guess for ttotal?
How about the maximal value of p(ttotal | t)?
[plot: p(ttotal | t) against ttotal; the posterior is maximized at ttotal = t]
Bayesian inference
p(ttotal | t) ∝ 1/ttotal × 1/ttotal
posterior probability; random sampling; "uninformative" prior
What is the best guess for ttotal?
Instead, compute t* such that p(ttotal > t* | t) = 0.5:
[plot: p(ttotal | t) against ttotal, with half the posterior mass above t*]
Bayesian inference
p(ttotal | t) ∝ 1/ttotal × 1/ttotal
posterior probability; random sampling; "uninformative" prior
What is the best guess for ttotal?
Instead, compute t* such that p(ttotal > t* | t) = 0.5.
This yields Gott's Rule: p(ttotal > t* | t) = 0.5 when t* = 2t,
i.e., the best guess for ttotal is 2t.
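A quick numerical check of this rule, as a sketch (the observed t = 10 and the grid are arbitrary choices for illustration, not values from the talk):

```python
import numpy as np

# With likelihood 1/t_total (random sampling) and prior 1/t_total, the posterior
# over t_total given t is proportional to 1/t_total^2 for t_total > t.
t = 10.0
grid = np.linspace(t, 500 * t, 500_000)     # candidate values of t_total above t
posterior = 1.0 / grid**2                   # unnormalized posterior on the grid
cdf = np.cumsum(posterior)
cdf /= cdf[-1]                              # approximate (truncated) CDF
t_star = grid[np.searchsorted(cdf, 0.5)]    # posterior median
print(t_star)                               # close to 2 * t = 20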
Applying Gott’s rule
t ≈ 4000 years, t* ≈ 8000 years
Applying Gott’s rule
t ≈ 130,000 years, t* ≈ 260,000 years
Predicting everyday events
• You read about a movie that has made $78 million to
date. How much money will it make in total?
– “$156 million” seems reasonable
• You meet someone who is 35 years old. How long
will they live?
– “70 years” seems reasonable
• Not so simple:
– You meet someone who is 78 years old. How long will
they live?
– You meet someone who is 6 years old. How long will they
live?
The effects of priors
• Different kinds of priors p(ttotal) are
appropriate in different domains.
Gott: p(ttotal) ∝ ttotal^(-1)
The effects of priors
• Different kinds of priors p(ttotal) are
appropriate in different domains.
[plots of two kinds of priors: e.g., wealth, contacts; e.g., height, lifespan]
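To make the contrast concrete, here is a hedged sketch comparing posterior-median predictions under the uninformative 1/ttotal prior and under a roughly Gaussian lifespan prior; the mean of 75 and standard deviation of 16 are assumed values for illustration, not figures from the talk:

```python
import numpy as np

# Illustrative sketch: how the prior changes the prediction of t_total given t.
def posterior_median(t, prior):
    grid = np.linspace(t, 50 * max(t, 100.0), 500_000)  # candidate t_total values above t
    post = prior(grid) / grid                           # likelihood 1/t_total times the prior
    cdf = np.cumsum(post)
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, 0.5)]

uninformative = lambda tt: 1.0 / tt                           # Gott's p(t_total) ~ 1/t_total
lifespan = lambda tt: np.exp(-0.5 * ((tt - 75.0) / 16.0)**2)  # rough Gaussian lifespan prior (assumed)

for age in (6, 78):
    print(age,
          round(posterior_median(age, uninformative)),   # roughly 2 * age
          round(posterior_median(age, lifespan)))        # pulled toward a typical lifespan
```

Under the Gaussian prior the prediction for a 6-year-old stays near a typical lifespan rather than doubling, matching the intuition on the earlier slide.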
The effects of priors
Evaluating human predictions
• Different domains with different priors:
– A movie has made $60 million
– Your friend quotes from line 17 of a poem
– You meet a 78 year old man
– A movie has been running for 55 minutes
– A U.S. congressman has served for 11 years
– A cake has been in the oven for 34 minutes
• Use 5 values of t for each
• People predict ttotal
[results plots comparing people's predictions with Gott's rule, an empirical prior, and a parametric prior]
You learn that in ancient
Egypt, there was a great
flood in the 11th year of a
pharaoh’s reign. How
long did he reign?
How long did the typical
pharaoh reign in ancient
Egypt?
…using a single datapoint
• People produce accurate predictions for the
duration and extent of everyday events
• Strong prior knowledge
– form of the prior (power-law or exponential)
– distribution given that form (parameters)
– non-parametric distribution when necessary
• Reveals a surprising correspondence between
probabilities in the mind and in the world
Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions
Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
– language
– religious concepts
• How do priors affect cultural transmission?
Iterated learning
(Briscoe, 1998; Kirby, 2001)
[diagram: data → hypothesis → data → hypothesis → …]
• Each learner sees data, forms a hypothesis,
produces the data given to the next learner
• c.f. the playground game “telephone”
Explaining linguistic universals
• Human languages are a subset of all logically
possible communication schemes
– universal properties common to all languages
(Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
Explaining linguistic universals
• Traditional answer:
– linguistic universals reflect innate constraints
specific to a system for acquiring language
• Alternative answer:
– iterated learning imposes “information bottleneck”
– universal properties survive this bottleneck
(Briscoe, 1998; Kirby, 2001)
Analyzing iterated learning
What are the consequences of iterated learning?
[2×2 grid: simple vs. complex learning algorithms × simulations vs. analytic results]
• Simulations: Kirby (2001); Smith, Kirby, & Brighton (2003)
• Analytic results, simple algorithms: Komarova, Niyogi, & Nowak (2002); Brighton (2002)
• Analytic results, complex algorithms: ?
Iterated Bayesian learning
[diagram: d0 → h1 → d1 → h2 → …, where each learner infers a hypothesis from data via p(h | d) and generates data for the next learner via p(d | h)]
• Learners are rational Bayesian agents
– covers a wide range of learning algorithms
• Defines a Markov chain on (h, d) pairs
Analytic results
• Stationary distribution of the Markov chain is p(d, h) = p(d | h) p(h)
• Convergence under easily checked conditions
• Rate of convergence is geometric
– iterated learning is a Gibbs sampler on p(d, h)
– the Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)
Analytic results
• Stationary distribution of the Markov chain is p(d, h) = p(d | h) p(h)
• Corollaries:
– the distribution over hypotheses converges to p(h)
– the distribution over data converges to p(d)
– the proportion of a population of iterated learners with hypothesis h converges to p(h)
An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ
– prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(xn+1 | xn) is Gaussian(μn, σx² + σn²), where
μn = (xn/σx² + μ0/σ0²) / (1/σx² + 1/σ0²)
σn² = 1 / (1/σx² + 1/σ0²)
(simulation: μ0 = 0, σ0² = 1, x0 = 20)
Iterated learning results in rapid convergence to prior
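A small simulation of this chain (the slide's μ0 = 0, σ0² = 1, x0 = 20; σx² = 1 is an assumed value) shows the data drifting from 20 back toward the prior mean within a few iterations:

```python
import numpy as np

# Gaussian iterated-learning chain: each learner sees one x, infers the mean mu,
# then generates the x shown to the next learner.
rng = np.random.default_rng(0)
mu0, var0, var_x = 0.0, 1.0, 1.0
x = 20.0                                       # initial data, far from the prior mean

for n in range(1, 21):
    var_n = 1.0 / (1.0 / var_x + 1.0 / var0)   # posterior variance over mu
    mu_n = var_n * (x / var_x + mu0 / var0)    # posterior mean over mu given x
    mu = rng.normal(mu_n, np.sqrt(var_n))      # learner samples a hypothesis from p(mu | x)
    x = rng.normal(mu, np.sqrt(var_x))         # and generates data for the next learner
    print(n, round(x, 2))                      # drifts quickly toward the prior mean of 0
```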
Implications for linguistic universals
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Different answers:
– existence explained through iterated learning
– universal properties depend on the prior
• Focuses inquiry on the priors of the learners
A method for discovering priors
Iterated learning converges to the prior…
…so priors can be evaluated by reproducing iterated learning in the lab
Iterated function learning
• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– p(θ) is Gaussian(θ0, σ0²)
[figure: linear functions through the origin, with slope shown as the value of y at x = 1; θ0 = 1, σ0² = 0.1, y0 = -1]
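A sketch of the corresponding chain for this regression setup, using the slide's θ0 = 1, σ0² = 0.1, y0 = -1 and an assumed output-noise variance of 0.1:

```python
import numpy as np

# Iterated regression chain: each learner sees (x, y), infers the slope theta,
# then produces a new y at x = 1 for the next learner.
rng = np.random.default_rng(1)
theta0, var0, var_y = 1.0, 0.1, 0.1
x, y = 1.0, -1.0                                 # training pair given to the first learner

for n in range(1, 11):
    var_n = 1.0 / (x**2 / var_y + 1.0 / var0)    # posterior variance over the slope
    theta_n = var_n * (x * y / var_y + theta0 / var0)
    theta = rng.normal(theta_n, np.sqrt(var_n))  # sampled hypothesis (a slope)
    y = rng.normal(theta * x, np.sqrt(var_y))    # response at x = 1 passed on
    print(n, round(y, 2))                        # moves from -1 toward the prior slope of 1
```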
Function learning in the lab
[screenshot of the task: stimulus, response slider, and feedback]
Examine iterated learning with different initial data
[results: chains of responses for different initial data across iterations 1–9]
…using secondhand data
• Iterated Bayesian learning converges to the prior
• Constrains explanations of linguistic universals
• Provides a method for evaluating priors
– concepts, causal relationships, languages, …
• Open questions in Bayesian language evolution
– variation in priors
– other selective pressures
Outline
• …using a single datapoint
– predicting the future
• …using secondhand data
– effects of priors on cultural transmission
• Conclusions
Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)
A unifying principle for explaining
inductive inferences
Bayes’ theorem
inference = f(data,knowledge)
Bayes’ theorem
inference = f(data,knowledge)
A means of evaluating the priors
that inform those inferences
Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?
Markov chain Monte Carlo
• Sample from a Markov chain which
converges to target distribution
• Allows sampling from an unnormalized
posterior distribution
• Can compute approximate statistics
from intractable distributions
(MacKay, 2002)
Markov chain Monte Carlo
[diagram: a chain of states x(1) → x(2) → … → x(t)]
Transition matrix: P(x(t+1) | x(t)) = T(x(t), x(t+1))
• States of chain are variables of interest
• Transition matrix chosen to give target
distribution as stationary distribution
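A minimal random-walk Metropolis sampler illustrates the idea; this is one standard MCMC construction, and the target density and proposal width are assumptions chosen for the example:

```python
import numpy as np

# A Markov chain whose stationary distribution is a target known only up to normalization.
rng = np.random.default_rng(2)
target = lambda x: np.exp(-0.5 * x**2)       # unnormalized target density (standard normal)

x = 5.0                                      # arbitrary starting state
samples = []
for _ in range(10_000):
    proposal = x + rng.normal(0.0, 1.0)      # symmetric random-walk proposal
    if rng.random() < target(proposal) / target(x):
        x = proposal                         # accept with probability min(1, ratio)
    samples.append(x)

print(np.mean(samples), np.std(samples))     # approximately 0 and 1 after burn-in
```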
Gibbs sampling
A particular choice of proposal distribution (for single-component Metropolis-Hastings):
for variables x = x1, x2, …, xn, draw xi(t+1) from P(xi | x-i), where
x-i = x1(t+1), x2(t+1), …, xi-1(t+1), xi+1(t), …, xn(t)
(a.k.a. the heat bath algorithm in statistical physics)
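A minimal Gibbs-sampling sketch for a toy target, a bivariate Gaussian with correlation ρ (the target and ρ = 0.8 are assumptions for illustration); each component is drawn from its exact conditional given the other:

```python
import numpy as np

# Gibbs sampling for a standard bivariate Gaussian with correlation rho.
rng = np.random.default_rng(3)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []
for _ in range(10_000):
    # conditional of each component given the other is N(rho * other, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])          # close to rho
```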
Gibbs sampling
(MacKay, 2002)
Inductive inference in perception and cognition