A Bayesian view of language evolution by iterated learning
Tom Griffiths (Brown University)
Mike Kalish (University of Louisiana)

Linguistic universals
• Human languages are a subset of all logically possible communication schemes
  – universal properties are common to all languages (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?

Possible explanations
• Traditional answer:
  – linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965)
• Alternative answer:
  – linguistic universals emerge from the fact that language is learned anew by each generation (e.g., Briscoe, 1998; Kirby, 2001)

Iterated learning (Kirby, 2001)
[figure: a chain of learners]
• Each learner sees data, forms a hypothesis, and produces the data given to the next learner
• cf. the playground game “telephone”

The “information bottleneck” (Kirby, 2001)
[figure: languages over generations; size indicates compressibility]
• “survival of the most compressible”

Analyzing iterated learning
What are the consequences of iterated learning?

                      Simulations                                     Analytic results
Complex algorithms    Kirby (2001); Smith, Kirby, & Brighton (2003)   ?
Simple algorithms     Brighton (2002)                                 Komarova, Niyogi, & Nowak (2002)

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

Bayesian inference
• A rational procedure for updating beliefs
• The foundation of many learning algorithms (e.g., MacKay, 2003)
• Widely used for language learning (e.g., Charniak, 1993)
[portrait: Reverend Thomas Bayes]

Bayes’ theorem

    p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)

where h is a hypothesis and d is data: the posterior probability p(h | d) combines the likelihood p(d | h) with the prior probability p(h), and the sum in the denominator runs over the space of hypotheses H.

Iterated Bayesian learning
[figure: a chain of learners alternating inference p(h | d) and production p(d | h)]
• Learners are Bayesian agents

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

Markov chains
[figure: a chain of variables x(1) → x(2) → …]
• Transition matrix P(x(t+1) | x(t))
• Each variable x(t+1) is independent of the history given x(t)
• Converges to a stationary distribution under easily checked conditions for ergodicity

Markov chain Monte Carlo
• A strategy for sampling from complex probability distributions
• Key idea: construct a Markov chain which converges to a particular distribution
  – e.g., the Metropolis algorithm
  – e.g., Gibbs sampling

Gibbs sampling
• For variables x = x1, x2, …, xn:
  draw xi(t+1) from P(xi | x−i), where x−i = x1(t+1), …, xi−1(t+1), xi+1(t), …, xn(t)
• Converges to P(x1, x2, …, xn) (Geman & Geman, 1984)
• (a.k.a. the heat-bath algorithm in statistical physics)
[figure: Gibbs sampling (MacKay, 2003)]

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

Analyzing iterated learning
[figure: a chain of learners alternating inference p(h | d) and production p(d | h)]
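The Gibbs sampling update above can be sketched in a few lines of Python (an illustrative example, not from the talk): for a standard bivariate Gaussian with correlation ρ, each conditional P(xi | x−i) is itself Gaussian, so the sampler simply alternates two conditional draws.

```python
import random

def gibbs_bivariate_gaussian(rho, n_steps=20000, seed=0):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho.
    Each full conditional is Gaussian: x1 | x2 ~ N(rho*x2, 1 - rho^2),
    and symmetrically for x2 | x1."""
    rng = random.Random(seed)
    sd = (1 - rho**2) ** 0.5      # conditional standard deviation
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, sd)   # draw x1 from P(x1 | x2)
        x2 = rng.gauss(rho * x1, sd)   # draw x2 from P(x2 | x1)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8)
burned = samples[1000:]                              # discard burn-in
mean1 = sum(x for x, _ in burned) / len(burned)      # ≈ 0
corr = sum(x * y for x, y in burned) / len(burned)   # E[x1*x2] ≈ rho (unit variances)
```

The same alternating structure reappears below: iterated learning alternates draws of h from p(h | d) and d from p(d | h).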
• Iterated learning is a Markov chain on (h, d)

Analyzing iterated learning
[figure: a chain of learners alternating inference p(h | d) and production p(d | h)]
• Iterated learning is a Markov chain on (h, d)
• Iterated Bayesian learning is a Gibbs sampler for p(d, h) = p(d | h) p(h)

Analytic results
• Iterated Bayesian learning converges to p(d, h) = p(d | h) p(h) (geometrically; Liu, Wong, & Kong, 1995)
• Corollaries:
  – the distribution over hypotheses converges to p(h)
  – the distribution over data converges to p(d)
  – the proportion of a population of iterated learners with hypothesis h converges to p(h)

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

A simple language model
• events x are built from binary features: “agents” (0/1) and “actions” (0/1)
• a language is a function from events x to utterances y, built from “nouns” and “verbs”
[figure: a compositional language mapping event features to utterance symbols]

A simple language model
• Hypotheses: languages (functions from events to utterances), produced with error rate ε (the intended utterance with probability 1 − ε)
• The prior P(h) divides probability between the few compositional languages and the many holistic ones (4 vs. 256 in the figure)
• Data: m event–utterance pairs
[figure: example compositional and holistic languages]

Analysis technique
1. Compute the transition matrix on languages:

     P(h_n = i | h_{n−1} = j) = Σ_d P(h_n = i | d) P(d | h_{n−1} = j)

2. Sample Markov chains
3. Compare language frequencies with the prior
(can also compute eigenvalues, etc.)
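Step 1 of this analysis can be illustrated on a toy model (a hypothetical two-hypothesis stand-in for the language model, with binomial data; all names here are mine): compute the transition matrix P(h_n = i | h_{n−1} = j) by summing over possible data, then iterate it to check that the chain converges to the prior, as the analytic results predict.

```python
from math import comb

def transition_matrix(prior, thetas, m):
    """T[i][j] = P(h_n = i | h_{n-1} = j) = sum_d P(h_n = i | d) P(d | h_{n-1} = j),
    where d = number of successes in m Bernoulli trials and each hypothesis
    h is a success probability (a stand-in for a 'language')."""
    H = len(prior)
    def p_d_given_h(d, h):           # Binomial(m, thetas[h]) likelihood
        th = thetas[h]
        return comb(m, d) * th**d * (1 - th)**(m - d)
    T = [[0.0] * H for _ in range(H)]
    for j in range(H):
        for d in range(m + 1):
            pd = p_d_given_h(d, j)                                  # P(d | h_{n-1} = j)
            norm = sum(prior[i] * p_d_given_h(d, i) for i in range(H))
            for i in range(H):
                T[i][j] += (prior[i] * p_d_given_h(d, i) / norm) * pd   # posterior * production
    return T

prior = [0.8, 0.2]    # prior over the two hypotheses
thetas = [0.9, 0.3]   # success probability under each hypothesis
T = transition_matrix(prior, thetas, m=3)

# iterate the chain from a uniform start; it converges to the prior
p = [0.5, 0.5]
for _ in range(200):
    p = [sum(T[i][j] * p[j] for j in range(2)) for i in range(2)]
```

After 200 iterations p is numerically indistinguishable from the prior [0.8, 0.2], matching the corollary that the distribution over hypotheses converges to p(h).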
Convergence to priors
[figure: effect of the prior; chains with α = 0.50, ε = 0.05, m = 3 and α = 0.01, ε = 0.05, m = 3; axes: Iteration, Chain, Prior]

The information bottleneck
[figure: no effect of the bottleneck; α = 0.50, ε = 0.05, m = 1; α = 0.01, ε = 0.05, m = 3; α = 0.50, ε = 0.05, m = 10; axes: Iteration, Chain, Prior]

The information bottleneck
• Stability ratio: the average self-transition probability of the compositional languages (i ∈ C) relative to the average over all languages (i ∈ H):

    stability ratio = [ (1/|C|) Σ_{i ∈ C} P(h_n = i | h_{n−1} = i) ] / [ (1/|H|) Σ_{i ∈ H} P(h_n = i | h_{n−1} = i) ]

• The bottleneck affects the relative stability of the languages favored by the prior

Outline
• Iterated Bayesian learning
• Markov chains
• Convergence results
• Example: Emergence of compositionality
• Conclusion

Implications for linguistic universals
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?
• Different answers:
  – their existence is explained through iterated learning
  – which properties become universal depends on the prior
• This focuses inquiry on the priors of the learners
  – languages reflect the biases of human learners

Extensions and future directions
• The results extend to:
  – unbounded populations
  – continuous-time population dynamics
• Iterated learning applies to other knowledge:
  – religious concepts, social norms, legends, …
• Provides a method for evaluating priors:
  – experiments in iterated learning with humans

Iterated function learning
[figure: data and hypotheses for function learning]
• Each learner sees a set of (x, y) pairs
• Makes predictions of y for new x values
• The predictions are the data for the next learner

Function learning in the lab
[figure: stimulus, feedback, and response slider]
• Examine iterated learning with different initial data
[figure: initial data and iterations 1–9 (Kalish, 2004)]

An example: Gaussians
• If we assume…
  – the data d are a single real number, x
  – the hypotheses h are means μ of a Gaussian
  – the prior p(μ) is Gaussian(μ0, σ0²)
• …then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²), where

    μ_n = (x_n / σ_x² + μ0 / σ0²) / (1 / σ_x² + 1 / σ0²)
    σ_n² = 1 / (1 / σ_x² + 1 / σ0²)

• With μ0 = 0, σ0² = 1, and x0 = 20, iterated learning results in rapid convergence to the prior

An example: Linear regression
• Assume:
  – the data d are pairs of real numbers (x, y)
  – the hypotheses h are functions
• An example: linear regression
  – hypotheses have slope θ and pass through the origin
  – the prior p(θ) is Gaussian(θ0, σ0²)
[figure: iterated learning of the value y at x = 1, with θ0 = 1, σ0² = 0.1, and y0 = −1]
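The Gaussian example above is easy to simulate (a minimal sketch; function and variable names are mine): each learner sees one x, draws μ from the conjugate posterior, and generates the next learner’s x. The chain forgets the initial x0 = 20 and settles into the prior predictive distribution, Gaussian(μ0, σ_x² + σ0²).

```python
import random

def iterated_gaussian_learning(x0=20.0, mu0=0.0, var0=1.0, var_x=1.0,
                               n_generations=10000, seed=0):
    """Iterated learning with Gaussian hypotheses (means mu) and a Gaussian prior.
    Each learner sees one x, samples mu from its posterior, then produces the
    next x ~ N(mu, var_x); the chain of x converges to the prior predictive."""
    rng = random.Random(seed)
    x = x0
    xs = []
    for _ in range(n_generations):
        # conjugate posterior over mu given a single observation x:
        post_var = 1.0 / (1.0 / var_x + 1.0 / var0)
        post_mean = post_var * (x / var_x + mu0 / var0)
        mu = rng.gauss(post_mean, post_var ** 0.5)   # learner's hypothesis
        x = rng.gauss(mu, var_x ** 0.5)              # data for the next learner
        xs.append(x)
    return xs

xs = iterated_gaussian_learning()
tail = xs[100:]                                      # drop the burn-in
mean_x = sum(tail) / len(tail)                       # ≈ mu0 = 0, despite x0 = 20
var_est = sum((v - mean_x) ** 2 for v in tail) / len(tail)   # ≈ var_x + var0 = 2
```

With μ0 = 0 and σ0² = σ_x² = 1, the chain is an AR(1) process with coefficient 1/2, so the influence of x0 = 20 decays geometrically, illustrating the “rapid convergence to the prior” claimed on the slide.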
