A Bayesian view of language
evolution by iterated learning
Tom Griffiths
Brown University
Mike Kalish
University of Louisiana
Linguistic universals
• Human languages are a subset of all logically
possible communication schemes
– universal properties common to all languages
(Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
Possible explanations
• Traditional answer:
– linguistic universals reflect innate constraints
specific to a system for acquiring language
(e.g., Chomsky, 1965)
• Alternative answer:
– linguistic universals emerge as the result of the fact
that language is learned anew by each generation
(e.g., Briscoe, 1998; Kirby, 2001)
Iterated learning
(Kirby, 2001)
• Each learner sees data, forms a hypothesis,
produces the data given to the next learner
• c.f. the playground game “telephone”
The “information bottleneck”
(Kirby, 2001)
(figure: size indicates compressibility; "survival of the most compressible")
Analyzing iterated learning
What are the consequences of iterated learning?
(figure: approaches arranged from simple to complex learning algorithms and from simulations to analytic results: Kirby (2001); Smith, Kirby, & Brighton (2003); Brighton (2002); Komarova, Niyogi, & Nowak (2002); a "?" marks analytic results for complex algorithms)
Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion
Bayesian inference
• Rational procedure
for updating beliefs
• Foundation of many
learning algorithms
(e.g., Mackay, 2003)
• Widely used for
language learning
(e.g., Charniak, 1993)
Reverend Thomas Bayes
Bayes’ theorem

p(h | d) = p(d | h) p(h) / Σ_{h'∈H} p(d | h') p(h')

(posterior probability = likelihood × prior, normalized by a sum over the space of hypotheses H)
h: hypothesis
d: data
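As a concrete sketch of the computation (a toy example, not from the talk; the hypotheses, observations, and probabilities below are invented for illustration):

```python
# Bayes' rule over a discrete hypothesis space:
# p(h|d) = p(d|h) p(h) / sum over h' in H of p(d|h') p(h')

def posterior(prior, likelihood, d):
    """prior: {h: p(h)}; likelihood: {h: {d: p(d|h)}}; returns {h: p(h|d)}."""
    unnorm = {h: likelihood[h][d] * prior[h] for h in prior}
    z = sum(unnorm.values())  # normalizing sum over the space of hypotheses
    return {h: p / z for h, p in unnorm.items()}

# Two invented hypotheses and two possible observations.
prior = {"h1": 0.5, "h2": 0.5}
likelihood = {"h1": {"d1": 0.9, "d2": 0.1},
              "h2": {"d1": 0.5, "d2": 0.5}}
post = posterior(prior, likelihood, "d1")
# post["h1"] = 0.45 / (0.45 + 0.25) ≈ 0.643
```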
Iterated Bayesian learning
(figure: a chain of learners; each infers a hypothesis from data via p(h|d) and generates data for the next learner via p(d|h))
Learners are Bayesian agents
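The loop can be sketched in a few lines; the two-language model and its numbers here are invented for illustration, not the talk's model:

```python
import random

def iterated_learning(prior, likelihood, d0, n_gen, rng):
    """Each learner samples h from p(h|d), then produces d from p(d|h)."""
    hyps = list(prior)
    utterances = list(likelihood[hyps[0]])
    d, history = d0, []
    for _ in range(n_gen):
        # inference: sample a hypothesis from the posterior p(h|d)
        weights = [likelihood[h][d] * prior[h] for h in hyps]
        h = rng.choices(hyps, weights=weights)[0]
        # production: sample data for the next learner from p(d|h)
        d = rng.choices(utterances, weights=[likelihood[h][u] for u in utterances])[0]
        history.append(h)
    return history

rng = random.Random(0)
prior = {"h1": 0.2, "h2": 0.8}
likelihood = {"h1": {"d1": 0.9, "d2": 0.1},
              "h2": {"d1": 0.3, "d2": 0.7}}
chain = iterated_learning(prior, likelihood, "d1", 10000, rng)
freq_h2 = chain.count("h2") / len(chain)  # long-run frequency approaches p(h2)
```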
Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion
Markov chains
(figure: a sequence of states x(1) → x(2) → x(3) → …)
Transition matrix
P(x(t+1)|x(t))
• Variables x(t+1) independent of history given x(t)
• Converges to a stationary distribution under
easily checked conditions for ergodicity
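A minimal worked example (toy transition probabilities, invented for illustration): repeatedly applying a two-state transition matrix drives any starting distribution to the stationary one:

```python
# A two-state Markov chain: T[i][j] = P(x(t+1) = j | x(t) = i).
T = [[0.9, 0.1],
     [0.4, 0.6]]

def step(dist, T):
    """Propagate a distribution one step: new[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]  # start deterministically in state 0
for _ in range(100):
    dist = step(dist, T)
# dist approaches the stationary distribution (0.8, 0.2), which solves pi = pi T
```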
Markov chain Monte Carlo
• A strategy for sampling from complex
probability distributions
• Key idea: construct a Markov chain which
converges to a particular distribution
– e.g. Metropolis algorithm
– e.g. Gibbs sampling
Gibbs sampling
For variables x = x1, x2, …, xn
Draw xi(t+1) from P(xi|x-i)
x-i = x1(t+1), x2(t+1),…, xi-1(t+1), xi+1(t), …, xn(t)
Converges to P(x1, x2, …, xn)
(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
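A small illustration under standard assumptions (a bivariate normal with correlation ρ, whose conditional distributions are known in closed form; not an example from the talk):

```python
import random

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each conditional is Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho**2)."""
    x1 = x2 = 0.0
    sd = (1.0 - rho ** 2) ** 0.5
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)  # draw x1 from P(x1 | x2)
        x2 = rng.gauss(rho * x1, sd)  # draw x2 from P(x2 | x1)
        samples.append((x1, x2))
    return samples

rng = random.Random(0)
samples = gibbs_bivariate_normal(rho=0.8, n_iter=20000, rng=rng)
mean_x1 = sum(x1 for x1, _ in samples) / len(samples)  # should be near 0
```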
Gibbs sampling
(figure from MacKay, 2003)
Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion
Analyzing iterated learning
(figure: a chain of learners; each infers a hypothesis from data via p(h|d) and generates data for the next learner via p(d|h))
• Iterated learning is a Markov chain on (h, d)
• Iterated Bayesian learning is a Gibbs sampler for p(d, h) = p(d | h) p(h)
Analytic results
• Iterated Bayesian learning converges to p(d, h) = p(d | h) p(h)
(geometrically; Liu, Wong, & Kong, 1995)
• Corollaries:
– distribution over hypotheses converges to p(h)
– distribution over data converges to p(d)
– the proportion of a population of iterated
learners with hypothesis h converges to p(h)
Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion
A simple language model
(figure: events x, built from "agents" and "actions" ("nouns" and "verbs") coded as binary strings, are mapped by a language function to utterances y; in a compositional language, parts of the utterance correspond to parts of the event)
A simple language model
• Data: m event-utterance pairs
• Hypotheses: languages (compositional or holistic), with production error ε
• Prior: P(h) = α/4 for each compositional language, (1 − α)/256 for each holistic language
Analysis technique
1. Compute transition matrix on languages
P ( h n  i | h n 1  j ) 
 P (h
n
 i | d )P (d | h n 1  j )
d
2. Sample Markov chains
3. Compare language frequencies with prior
(can also compute eigenvalues etc.)
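Step 1 can be sketched directly; the two-language, two-utterance model below is an invented toy, but the stationary distribution of the resulting chain matches the prior exactly, as the convergence results predict:

```python
# Transition matrix on hypotheses for iterated Bayesian learning:
# T[j][i] = P(h_n = i | h_{n-1} = j) = sum_d P(h = i | d) P(d | h = j).
# Toy two-language, two-utterance model; the numbers are invented.
prior = [0.2, 0.8]
lik = [[0.9, 0.1],   # lik[h][d] = P(d | h)
       [0.3, 0.7]]

def posterior(d):
    """p(h|d) over the two hypotheses, by Bayes' rule."""
    w = [lik[h][d] * prior[h] for h in range(2)]
    z = sum(w)
    return [x / z for x in w]

T = [[sum(posterior(d)[i] * lik[j][d] for d in range(2)) for i in range(2)]
     for j in range(2)]

# Iterating the chain: the stationary distribution is exactly the prior p(h).
dist = [1.0, 0.0]
for _ in range(200):
    dist = [sum(dist[j] * T[j][i] for j in range(2)) for i in range(2)]
```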
Convergence to priors
(figure: effect of the prior; chain frequencies converge to the prior over iterations, shown for α = 0.50, ε = 0.05, m = 3 and for α = 0.01, ε = 0.05, m = 3)
The information bottleneck
(figure: no effect of the bottleneck; chains converge to the prior for α = 0.50, ε = 0.05, m = 1; α = 0.01, ε = 0.05, m = 3; and α = 0.50, ε = 0.05, m = 10)
The information bottleneck

Stability ratio = [ Σ_{i∈C} P(h_n = i | h_{n-1} = i) ] / [ Σ_{i∈H} P(h_n = i | h_{n-1} = i) ]

(C: compositional languages; H: holistic languages)
Bottleneck affects the relative stability of languages favored by the prior
Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion
Implications for linguistic universals
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Different answers:
– existence explained through iterated learning
– universal properties depend on the prior
• Focuses inquiry on the priors of the learners
– languages reflect the biases of human learners
Extensions and future directions
• Results extend to
– unbounded populations
– continuous time population dynamics
• Iterated learning applies to other knowledge
– religious concepts, social norms, legends…
• Provides a method for evaluating priors
– experiments in iterated learning with humans
Iterated function learning
(figure: data, a set of points; hypotheses, functions relating x to y)
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
Function learning in the lab
Stimulus
Feedback
Response
Slider
Examine iterated learning with different initial data
(figure: functions produced by successive learners over iterations 1-9, for different initial data; Kalish, 2004)
An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ
– prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(x_{n+1} | x_n) is Gaussian(μn, σx² + σn²), where

μn = (x_n/σx² + μ0/σ0²) / (1/σx² + 1/σ0²)
σn² = 1 / (1/σx² + 1/σ0²)

(figure: simulation with μ0 = 0, σ0² = 1, x0 = 20)
Iterated learning results in rapid convergence to prior
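The Gaussian case can be simulated in a few lines; this sketch uses the conjugate posterior update and assumes a likelihood variance σx² = 1, which is not stated on the slide:

```python
import random

def iterated_gaussian(mu0, var0, var_x, x0, n_gen, rng):
    """Iterated learning with a Gaussian(mu0, var0) prior on the mean.
    Each learner sees one x, samples mu from the conjugate posterior,
    then produces the next learner's datum x ~ N(mu, var_x)."""
    x, xs = x0, []
    for _ in range(n_gen):
        post_var = 1.0 / (1.0 / var_x + 1.0 / var0)
        post_mu = post_var * (x / var_x + mu0 / var0)
        mu = rng.gauss(post_mu, post_var ** 0.5)  # inference step
        x = rng.gauss(mu, var_x ** 0.5)           # production step
        xs.append(x)
    return xs

rng = random.Random(1)
# mu0 = 0, var0 = 1, x0 = 20 as on the slide; var_x = 1 is an assumed value.
xs = iterated_gaussian(mu0=0.0, var0=1.0, var_x=1.0, x0=20.0, n_gen=5000, rng=rng)
mean_x = sum(xs) / len(xs)  # stationary distribution of x is N(mu0, var_x + var0)
```

Despite the extreme starting point x0 = 20, the chain forgets its initial data within a few generations, matching the slide's observation of rapid convergence to the prior.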
An example: Linear regression
• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– p(θ) is Gaussian(θ0, σ0²)
(figure: y against x; with θ0 = 1, σ0² = 0.1, and initial data y0 = −1 at x = 1, iterated learning converges to the prior)