```Part IV: Inference algorithms
Estimation and inference
• Actually working with probabilistic models
requires solving some difficult computational
problems…
• Two key problems:
– estimating parameters in models with latent variables
– computing posterior distributions involving large
numbers of variables
Part IV: Inference algorithms
• The EM algorithm
– for estimation in models with latent variables
• Markov chain Monte Carlo
– for sampling from posterior distributions involving
large numbers of variables
SUPERVISED
[Figure: example images, each labeled "dog" or "cat"]
Supervised learning
Category A
Category B
What characterizes the categories?
How should we categorize a new observation?

Parametric density estimation
• Assume that p(x|c) has a simple form, characterized by parameters \theta
• Given stimuli X = x1, x2, …, xn from category c, find \theta by maximum-likelihood estimation
  \hat{\theta} = \arg\max_{\theta} p(X | c, \theta)
  or some form of Bayesian estimation
  \hat{\theta} = \arg\max_{\theta} [ \log p(X | c, \theta) + \log p(\theta) ]
Spatial representations
• Assume a simple
parametric form for
p(x|c): a Gaussian
• For each category,
estimate parameters
– mean
– variance
[Diagram: category c, with distribution P(c), generates x, with density p(x|c); plot of probability density p(x) against x]
The Gaussian distribution
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x - \mu)^2 / 2\sigma^2 \}
mean: \mu
standard deviation: \sigma
variance = \sigma^2
[Figure: probability density plotted against (x - \mu)/\sigma]
Estimating a Gaussian
X = {x1, x2, …, xn} independently sampled from a Gaussian
p(X | \mu, \sigma) = \prod_{i=1}^{n} p(x_i | \mu, \sigma)
                   = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x_i - \mu)^2 / 2\sigma^2 \}
                   = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \exp\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \}
maximum likelihood parameter estimates:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
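Multivariate Gaussians
Univariate:
p(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x - \mu)^2 / 2\sigma^2 \}
Multivariate, with mean \mu and variance/covariance matrix \Sigma:
p(x | \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\{ -(x - \mu)^T \Sigma^{-1} (x - \mu) / 2 \}
Example covariance matrices:
\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0.25 \end{pmatrix}
\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}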
Estimating a Gaussian
X = {x1, x2, …, xn} independently sampled from a Gaussian
maximum likelihood parameter estimates:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T
Bayesian inference
P(c | x) = \frac{P(x | c) P(c)}{\sum_{c} P(x | c) P(c)}
[Figure: probability plotted against x]
UNSUPERVISED
Unsupervised learning
What latent structure is present?
What are the properties of a new observation?
An example: Clustering
Assume each observed xi is from a cluster ci,
where ci is unknown
What characterizes the clusters?
What cluster does a new x come from?
Density estimation
• We need to estimate some probability distributions
– what is P(c)?
– what is p(x|c)?
• But… c is unknown, so we only know the value of x
[Diagram: c, with distribution P(c), generates x, with density p(x|c)]
Supervised and unsupervised
Supervised learning: categorization
• Given x = {x1, …, xn} and c = {c1, …, cn}
• Estimate parameters \theta of p(x|c) and P(c)
  \hat{\theta} = \arg\max_{\theta} p(x, c | \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i | c_i, \theta) P(c_i | \theta)
Unsupervised learning: clustering
• Given x = {x1, …, xn}
• Estimate parameters \theta of p(x|c) and P(c)
  \hat{\theta} = \arg\max_{\theta} p(x | \theta) = \arg\max_{\theta} \prod_{i=1}^{n} \sum_{c_i} p(x_i | c_i, \theta) P(c_i | \theta)
Mixture distributions
p(x_i | \theta) = \sum_{c_i} p(x_i | c_i, \theta) P(c_i | \theta)
• mixture distribution: p(x_i | \theta)
• mixture components: p(x_i | c_i, \theta)
• mixture weights: P(c_i | \theta)
[Figure: probability plotted against x]
More generally…
Unsupervised learning is density estimation using distributions with latent variables
P(x) = \sum_{z} P(x | z) P(z)
Marginalize out (i.e. sum over) latent structure
[Diagram: z, latent (unobserved), with distribution P(z), generates x, observed, with distribution P(x|z)]
A chicken and egg problem
• If we knew which cluster the observations were
from we could find the distributions
– this is just density estimation
• If we knew the distributions, we could infer which
cluster each observation came from
– this is just categorization
Alternating optimization algorithm
0. Guess initial parameter values
1. Given parameter estimates, solve for maximum a posteriori assignments c_i:
   c_i = \arg\max_{c_i} P(c_i | x_i, \theta) = \arg\max_{c_i} p(x_i | c_i, \theta) P(c_i | \theta)
2. Given assignments c_i, solve for maximum likelihood parameter estimates:
   \hat{\theta} = \arg\max_{\theta} p(x, c | \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i | c_i, \theta) P(c_i | \theta)
3. Go to step 1
Alternating optimization algorithm
x: observed data
c: assignments to cluster
\mu, \sigma, P(c): parameters \theta
For simplicity, assume \sigma, P(c) fixed: “k-means” algorithm
[Figure sequence: Step 0: initial parameter values; Step 1: update assignments; Step 2: update parameters; Step 1: update assignments; Step 2: update parameters]
Alternating optimization algorithm
0. Guess initial parameter values
1. Given parameter estimates, solve for maximum a posteriori assignments c_i:
   c_i = \arg\max_{c_i} P(c_i | x_i, \theta) = \arg\max_{c_i} p(x_i | c_i, \theta) P(c_i | \theta)
   (why “hard” assignments?)
2. Given assignments c_i, solve for maximum likelihood parameter estimates:
   \hat{\theta} = \arg\max_{\theta} p(x | c, \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i | c_i, \theta)
3. Go to step 1
Estimating a Gaussian
(with hard assignments)
X = {x1, x2, …, xn} independently sampled from a Gaussian
p(X | \mu, \sigma) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \exp\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \}
maximum likelihood parameter estimates:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
Estimating a Gaussian
(with soft assignments)
the “weight” of each point is the probability of being in the cluster
P(c_i = j | x_i, \theta) = \frac{P(x_i | c_i = j, \theta) P(c_i = j | \theta)}{\sum_{c} P(x_i | c, \theta) P(c | \theta)}
maximum likelihood parameter estimates:
\mu_j = \frac{\sum_{i=1}^{n} x_i P(c_i = j | x_i, \theta)}{\sum_{i=1}^{n} P(c_i = j | x_i, \theta)}
\sigma_j^2 = \frac{\sum_{i=1}^{n} (x_i - \mu_j)^2 P(c_i = j | x_i, \theta)}{\sum_{i=1}^{n} P(c_i = j | x_i, \theta)}
The Expectation-Maximization algorithm
(clustering version)
0. Guess initial parameter values
1. Given parameter estimates, compute posterior distribution over assignments c_i:
   P(c_i | x_i, \theta) \propto p(x_i | c_i, \theta) P(c_i | \theta)
2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster
3. Go to step 1
The Expectation-Maximization algorithm
(more general version)
0. Guess initial parameter values
1. Given parameter estimates, compute posterior distribution over latent variables z:
   P(z | x, \theta) \propto P(x | z, \theta) P(z | \theta)
2. Find parameter estimates
   \hat{\theta}^{new} = \arg\max_{\theta^{new}} \sum_{z} P(z | x, \theta^{old}) \log P(x, z | \theta^{new})
3. Go to step 1
A note on expectations
• For a function f(x) and distribution P(x), the expectation of f with respect to P is
  E_{P(x)}[f(x)] = \sum_{x} f(x) P(x)
• The expectation is the average of f, when x is drawn from the probability distribution P
Good features of EM
• Convergence
– guaranteed to converge to at least a local maximum of
the likelihood (or other extremum)
– likelihood is non-decreasing across iterations
• Efficiency
– big steps initially (other algorithms better later)
• Generality
– can be defined for many probabilistic models
– can be combined with a prior for MAP estimation
Limitations of EM
• Local maxima
– e.g., one component poorly fits two clusters, while
two components split up a single cluster
• Degeneracies
– e.g., two components may merge, a component may
lock onto one data point, with variance going to zero
• May be intractable for complex models
– dealing with this is an active research topic
EM and cognitive science
• The EM algorithm seems like it might be a good
way to describe some “bootstrapping”
– anywhere there’s a “chicken and egg” problem
– a prime example: language learning
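Probabilistic context free grammars
Rule            Probability
S  → NP VP      1.0
NP → T N        0.7
NP → N          0.3
VP → V NP       1.0
T  → the        0.8
T  → a          0.2
N  → man        0.5
N  → ball       0.5
V  → hit        0.6
V  → took       0.4
Parse tree for "the man hit the ball" (the probability of the rule applied follows each label):
[S 1.0 [NP 0.7 [T 0.8 the] [N 0.5 man]] [VP 1.0 [V 0.6 hit] [NP 0.7 [T 0.8 the] [N 0.5 ball]]]]
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5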
EM and cognitive science
• The EM algorithm seems like it might be a good
way to describe some “bootstrapping”
– anywhere there’s a “chicken and egg” problem
– a prime example: language learning
• Fried and Holyoak (1984) explicitly tested a
model of human categorization that was almost
exactly a version of the EM algorithm for a
mixture of Gaussians
The Monte Carlo principle
• The expectation of f with respect to P can be approximated by
  E_{P(x)}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)
  where the x_i are sampled from P(x)
• Example: the average # of spots on a die roll
The Monte Carlo principle
The law of large numbers
E_{P(x)}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)
[Figure: illustration of the law of large numbers]
Markov chain Monte Carlo
• Sometimes it isn’t possible to sample directly from
a distribution
• Sometimes, you can only compute something
proportional to the distribution
• Markov chain Monte Carlo: construct a Markov
chain that will converge to the target distribution,
and draw samples from that chain
– just uses something proportional to the target
Markov chains
[Diagram: chain of states x → x → x → …]
Transition matrix
T = P(x^{(t+1)} | x^{(t)})
Variables x^{(t+1)} independent of all previous variables given immediate predecessor x^{(t)}
An example: card shuffling
• Each state x(t) is a permutation of a deck of
cards (there are 52! permutations)
• Transition matrix T indicates how likely one
permutation will become another
• The transition probabilities are determined by
the shuffling procedure
– riffle shuffle
– overhand
– one card
Convergence of Markov chains
• Why do we shuffle cards?
• Convergence to a uniform distribution takes
only 7 riffle shuffles…
• Other Markov chains will also converge to a
stationary distribution, if certain simple
conditions are satisfied (called “ergodicity”)
– e.g. every state can be reached in some number of
steps from every other state
Markov chain Monte Carlo
[Diagram: chain of states x → x → x → …]
Transition matrix
T = P(x^{(t+1)} | x^{(t)})
• States of chain are variables of interest
• Transition matrix chosen to give target distribution as stationary distribution
Metropolis-Hastings algorithm
• Transitions have two parts:
– proposal distribution: Q(x^{(t+1)} | x^{(t)})
– acceptance: take proposals with probability
  A(x^{(t)}, x^{(t+1)}) = \min\left( 1, \frac{P(x^{(t+1)}) Q(x^{(t)} | x^{(t+1)})}{P(x^{(t)}) Q(x^{(t+1)} | x^{(t)})} \right)
Metropolis-Hastings algorithm
[Figure sequence: proposed moves on a target density p(x); one move shown with A(x^{(t)}, x^{(t+1)}) = 0.5, another with A(x^{(t)}, x^{(t+1)}) = 1]
Gibbs sampling
Particular choice of proposal distribution
For variables x = x_1, x_2, …, x_n
Draw x_i^{(t+1)} from P(x_i | x_{-i})
x_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_n^{(t)}
(this is called the full conditional distribution)
Gibbs sampling
(MacKay, 2002)
MCMC vs. EM
EM: converges to a single solution
MCMC: converges to a distribution of solutions
MCMC and cognitive science
• The Metropolis-Hastings algorithm seems like a
good metaphor for aspects of development…
• Some forms of cultural evolution can be shown
to be equivalent to Gibbs sampling
(Griffiths & Kalish, 2005)
• For experiments based on MCMC, see talk by
• The main use of MCMC is for probabilistic
inference in complex models
A selection of topics
Each line lists the words of one topic:
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
[Diagram labels: syntactic classes; semantic classes; semantic “gist” of document]
Semantic classes (one class per line):
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY
BOOK BOOKS INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS
Syntactic classes (one class per line):
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
Summary
• Probabilistic models can pose significant
computational challenges
– parameter estimation with latent variables,
computing posteriors with many variables
• Clever algorithms exist for solving these
problems, easing use of probabilistic models
• These algorithms also provide a source of new
models and methods in cognitive science
Generative models for language
latent structure
observed data
Generative models for language
meaning
words
Topic models
• Each document (or conversation, or segment of either) is a mixture of topics
• Each word is chosen from a single topic
  P(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)
  where
  w_i is the ith word
  z_i is the topic of the ith word
  T is the number of topics
Generating a document
[Diagram: g (distribution over topics) → z (topic assignments) → w (observed words)]
w             P(w|z = 1)   P(w|z = 2)
              (topic 1)    (topic 2)
HEART         0.2          0.0
LOVE          0.2          0.0
SOUL          0.2          0.0
TEARS         0.2          0.0
JOY           0.2          0.0
SCIENTIFIC    0.0          0.2
KNOWLEDGE     0.0          0.2
WORK          0.0          0.2
RESEARCH      0.0          0.2
MATHEMATICS   0.0          0.2
Choose mixture weights for each document, generate “bag of words”
g = {P(z = 1), P(z = 2)}
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
Inferring topics from text
• The topic model is a generative model for a set of
documents (assuming a set of topics)
– a simple procedure for generating documents
• Given the documents, we can try to find the topics
and their proportions in each document
• This is an unsupervised learning problem
– we can use the EM algorithm, but it’s not great
– instead, we use Markov chain Monte Carlo
A selection from 500 topics [P(w|z = j)]
Each line lists the words of one topic:
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP SHOULD CLASSES PUPIL GIVEN
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
A selection from 500 topics [P(w|z = j)]
Each line lists the words of one topic:
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
Gibbs sampling for topics
• Need full conditional distributions for variables
• Since we only sample z we need P(z_i = j | z_{-i}, w), which depends on
– number of times word w assigned to topic j
– number of times topic j used in document d
Gibbs sampling
[Slide sequence: the iteration 2 column is filled in one entry at a time; the completed table is shown]
 i    w_i           d_i   z_i (iteration 1)   z_i (iteration 2)   …   z_i (iteration 1000)
 1    MATHEMATICS   1     2                   2                       2
 2    KNOWLEDGE     1     2                   1                       2
 3    RESEARCH      1     1                   1                       2
 4    WORK          1     2                   2                       1
 5    MATHEMATICS   1     1                   2                       2
 6    RESEARCH      1     2                   2                       2
 7    WORK          1     2                   2                       2
 8    SCIENTIFIC    1     1                   1                       1
 9    MATHEMATICS   1     2                   2                       2
 10   WORK          1     1                   2                       2
 11   SCIENTIFIC    2     1                   1                       2
 12   KNOWLEDGE     2     1                   2                       2
 .    .             .     .                   .                       .
 50   JOY           5     2                   1                       1
A visual example: Bars
sample each pixel from
a mixture of topics
pixel = word
image = document
When Bayes is useful…
• Clarifying computational problems in cognition
• Providing rational explanations for behavior
• Characterizing knowledge informing induction
• Capturing inferences at multiple levels
```
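
The Monte Carlo principle slide gives the average number of spots on a die roll as its example. A minimal sketch of that estimate, using only the Python standard library, might look like the following. The function names `monte_carlo_expectation` and `roll_die` and the fixed seed are illustrative assumptions, not part of the slides.

```python
import random


def monte_carlo_expectation(f, sampler, n, seed=0):
    """Approximate E_P[f(x)] by (1/n) * sum of f(x_i), with x_i drawn from P.

    `sampler(rng)` must return one draw from P (hypothetical helper signature).
    """
    rng = random.Random(seed)  # fixed seed only so that runs are repeatable
    return sum(f(sampler(rng)) for _ in range(n)) / n


def roll_die(rng):
    """One draw from the uniform distribution over the faces 1..6."""
    return rng.randint(1, 6)


if __name__ == "__main__":
    # f is the identity, so this estimates the mean number of spots.
    estimate = monte_carlo_expectation(lambda x: x, roll_die, n=100_000)
    print("estimated mean number of spots:", estimate)
```

The exact expectation is (1 + 2 + … + 6)/6 = 3.5, and by the law of large numbers the estimate should approach it as n grows.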
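
The probabilistic context free grammar slide multiplies the probabilities of the rules used in a parse tree. The sketch below recomputes that product for the tree on the slide. The nested-tuple tree encoding and the names `RULES`, `TREE`, and `tree_prob` are assumptions made for illustration; only the rules and probabilities come from the slide.

```python
# Rule probabilities from the slide, keyed by (left-hand side, right-hand side).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("T", "N")): 0.7,
    ("NP", ("N",)): 0.3,
    ("VP", ("V", "NP")): 1.0,
    ("T", ("the",)): 0.8,
    ("T", ("a",)): 0.2,
    ("N", ("man",)): 0.5,
    ("N", ("ball",)): 0.5,
    ("V", ("hit",)): 0.6,
    ("V", ("took",)): 0.4,
}

# Hypothetical encoding: (label, child, ...) where a child is a subtree or a word.
TREE = (
    "S",
    ("NP", ("T", "the"), ("N", "man")),
    ("VP", ("V", "hit"), ("NP", ("T", "the"), ("N", "ball"))),
)


def tree_prob(tree):
    """Probability of a parse tree: the product of the rules it applies."""
    label, *children = tree
    # The right-hand side is the sequence of child labels (or the word itself).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p


if __name__ == "__main__":
    print("P(tree) =", tree_prob(TREE))
```

Multiplying the nine factors on the slide by hand gives 0.04704, which is what the recursion should reproduce up to floating-point rounding.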
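
The clustering version of the EM algorithm in the slides alternates between computing P(c_i = j | x_i, θ) and re-estimating each Gaussian with those probabilities as weights. The following is a minimal one-dimensional sketch under stated assumptions: the function names, the fixed iteration count, and the small variance floor (a guard against the "variance going to zero" degeneracy mentioned under the limitations of EM) are illustrative choices, not taken from the slides.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D mixture of Gaussians.

    Returns (mus, sigmas, weights, log_liks), where log_liks[t] is the data
    log-likelihood under the parameters at the start of iteration t.
    """
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    log_liks = []
    for _ in range(n_iter):
        # E-step: soft assignments P(c_i = j | x_i, theta) for every point.
        resp = []
        log_lik = 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        log_liks.append(log_lik)
        # M-step: weighted maximum likelihood estimates for each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))  # assumed floor, not in the slides
            weights[j] = nj / len(xs)
    return mus, sigmas, weights, log_liks
```

Replacing the soft assignments with a 0/1 indicator for the most probable cluster would turn this into the alternating optimization ("hard" assignment) algorithm from the earlier slides. Recording the log-likelihood each iteration makes it easy to check that it never decreases.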
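
The Metropolis-Hastings slides define the acceptance probability in terms of a target P and a proposal Q. As a hedged illustration, the sketch below uses a symmetric Gaussian random-walk proposal, for which the Q terms cancel and only the ratio of target values remains. The function names, the step size, and the example target (an unnormalized Gaussian with mean 3) are assumptions chosen for the example.

```python
import math
import random


def metropolis_hastings(unnorm_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1-D target.

    `unnorm_p` only needs to be proportional to the target density, and must
    be positive at the starting point x0.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Symmetric proposal: Q(x'|x) = Q(x|x'), so the Q ratio equals 1.
        proposal = x + rng.gauss(0.0, step)
        accept_prob = min(1.0, unnorm_p(proposal) / unnorm_p(x))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def example_target(x):
    """Unnormalized Gaussian with mean 3 and standard deviation 1 (illustrative)."""
    return math.exp(-0.5 * (x - 3.0) ** 2)


if __name__ == "__main__":
    draws = metropolis_hastings(example_target, x0=0.0, n_samples=20_000)
    kept = draws[2_000:]  # discard an initial burn-in period
    print("sample mean:", sum(kept) / len(kept))
```

Because only the ratio of target values is used, the normalizing constant is never needed, which is the point made on the Markov chain Monte Carlo slide about using something proportional to the target.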
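
The Gibbs sampling for topics slides say that each z_i is resampled from its full conditional, which depends on how often word w is assigned to topic j and how often topic j is used in document d. The slides' own formula did not survive extraction, so the sketch below uses the standard collapsed Gibbs update for a topic model with symmetric Dirichlet priors; the hyperparameters `alpha` and `beta`, their default values, and all function and variable names are assumptions rather than the slides' notation.

```python
import random


def gibbs_topics(docs, n_topics, vocab_size, alpha=1.0, beta=0.1, n_iter=100, seed=0):
    """Collapsed Gibbs sampler over topic assignments z.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (z, n_wt, n_dt): assignments, word-topic counts, document-topic counts.
    """
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # times word w assigned to topic j
    n_dt = [[0] * n_topics for _ in docs]               # times topic j used in document d
    n_t = [0] * n_topics                                # total tokens assigned to topic j
    z = []
    # Iteration 0: random initial assignments.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment so the counts exclude token i.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized full conditional P(z_i = j | z_-i, w).
                probs = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta) * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=probs)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt
```

With the ten-word vocabulary from the "Generating a document" slide mapped to ids 0 to 9, a call such as `gibbs_topics(docs, n_topics=2, vocab_size=10)` would return one sampled set of assignments like the columns in the Gibbs sampling table. Because the result is a sample, different seeds can give different assignments, in line with the point that MCMC converges to a distribution of solutions.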