Does RL Occur Naturally?
C. R. Gallistel
Rutgers Center for Cognitive Science
NIPS Workshops 12/10/05
Turing’s Vision (‘47-’48)
“It would be quite possible to have the machine try out
behaviors and accept or reject them…”
“What we want is a machine that can learn from experience.
The possibility of letting the machine alter its own
instructions provides the mechanism for this…”
“It might be possible to carry through the organizing [of a
learning machine] with only two interfering inputs, one for
reward (R) or pleasure and the other for pain or
punishment (P). It is intended that pain stimuli occur when
the machine’s behavior is wrong, pleasure stimuli when it
is particularly right.”
A Different Vision
• Policy (what to do given a state of the
world) is pre-specified and immutable
• Learning consists in determining the state of
the world; it’s all model estimation
• Appropriate sampling behavior is itself
The Deep Reasons
• Wolpert & Macready’s “No Free Lunch”
• Chomsky’s “Poverty of the Stimulus” argument
• Bottom line: reinforcement learning takes too long
• Because there is not enough information in the R
& P signals
• Because learning in the absence of a highly
structured hypothesis space is a practical
impossibility (we don’t live long enough)
Learning by Integrating
• Ant knows where it is
• This knowledge is
acquired (learned)
• It is acquired by path integration
--Harkness & Maroudas,
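The idea of learning one's position by integrating can be sketched as dead reckoning: sum each leg of the outbound journey as a displacement vector, and the negated sum is the course home. This is a minimal illustration, not a model from the talk; the function names and the (heading, distance) step format are assumptions made for the example.

```python
import math

def integrate_path(steps):
    """Sum (heading_deg, distance) legs into an (east, north) position.

    Headings are compass bearings: degrees clockwise from north.
    The step format is an assumption made for this sketch.
    """
    east = north = 0.0
    for heading_deg, distance in steps:
        theta = math.radians(heading_deg)
        east += distance * math.sin(theta)
        north += distance * math.cos(theta)
    return east, north

def home_vector(position):
    """Compass bearing and distance from `position` back to the start."""
    east, north = position
    bearing = math.degrees(math.atan2(-east, -north)) % 360.0
    return bearing, math.hypot(east, north)

# Outbound: 3 units north, then 4 units east.
position = integrate_path([(0.0, 3.0), (90.0, 4.0)])
bearing_home, distance_home = home_vector(position)
```

Nothing here is discovered by trial and error: the position estimate is updated on every step, whether or not food is found.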
Building a Map
• Ant remembers where the food was (records its
• Bees & ants make a map by the GPS principle
(record location coordinates--& views)
• They do not discover by trial and error that this is
a good thing to do
• As in the GPS, the computational machinery to
determine a course from an arbitrary location to an
arbitrary location is built in
• No RL learning here
Ranging Behavior
• When leaving a new food
source or a new nest
(hive), bees & wasps fly
backwards in an ever-increasing
zigzag
• Determining visual feature
distances by parallax
• Innately specified
sampling (model building)
Wehner, 1981
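Distance from parallax can be sketched with one line of trigonometry. Assume, for illustration only, that a feature starts straight ahead and the observer then moves sideways (perpendicular to the line of sight) by a known amount; the change in the feature's bearing gives its distance. The function and parameter names are hypothetical.

```python
import math

def distance_from_parallax(sideways_shift, bearing_change_deg):
    """Distance to a feature that was straight ahead before the observer
    moved `sideways_shift` perpendicular to the line of sight.

    Geometry assumed for this sketch: tan(bearing change) = shift / distance.
    """
    return sideways_shift / math.tan(math.radians(bearing_change_deg))

near = distance_from_parallax(1.0, 45.0)   # large bearing change: near feature
far = distance_from_parallax(1.0, 5.0)     # small bearing change: distant feature
```

A wider zigzag increases the sideways shift, which makes the bearing change of distant features large enough to measure.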
Also in the Locust
• Locust scanning
• Sobel, 1990
• Moved target, so as to
make a independent
of D
• Reproduced function
relating take off
velocity to D
Learning by Parameter
• Animals (including insects) use the sun as a
compass reference
• To do this, must learn solar ephemeris: sun’s
compass bearing as a function of the time of
day--where it is when
• Solar ephemeris varies with latitude and
Learning from the Dance
• Returning forager does a
dance to tell other foragers
the location (range &
bearing) of source
• Compass bearing, γ,
specified by specifying
current solar bearing, σ
• Range specified by
number of waggles
• Hopeless as an RL
α = compass bearing of sun
γ = compass bearing of source
σ = solar bearing of source
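The three angles in the legend are related by simple addition: the compass bearing of the source is the sun's compass bearing plus the solar bearing of the source. The sketch below shows the recruit's computation and its inverse, which is how an observer can read off where a dancer believes the sun to be. The function names are hypothetical, and angles are assumed to be in degrees clockwise.

```python
def source_compass_bearing(sun_compass_bearing, solar_bearing_of_source):
    """Recruit's computation: compass bearing of the source from the danced
    solar bearing and the sun's current compass bearing (from the ephemeris)."""
    return (sun_compass_bearing + solar_bearing_of_source) % 360.0

def believed_sun_bearing(source_bearing, danced_solar_bearing):
    """Observer's inference: where the dancer takes the sun to be, given the
    known compass bearing of the source and the solar bearing it danced."""
    return (source_bearing - danced_solar_bearing) % 360.0

# Sun in the west-southwest, source 20 degrees clockwise of the sun.
to_source = source_compass_bearing(250.0, 20.0)
sun_estimate = believed_sun_bearing(270.0, 20.0)
```

Because the dance is expressed relative to the sun, it is only usable by a recruit that already knows the sun's compass bearing at the current time of day.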
Ephemeris Framework
Deceived Dancing
Dyer, 1987
Poverty of Stimulus
• Dyer & Dickinson, 1994
• Incubator raised bees allowed to forage to station
due west of hive but only in late afternoon when
sun declining in west
• On heavy overcast day, moved to new field line
with different compass orientation and allowed to
forage in morning (with feeder “west” of hive)
• Experimenter observes dance of returning foragers
to estimate where they believe the sun to be
Bees Believe Earth is Round
• Form of solar ephemeris equation is built
into the nervous system
• Only its parameters are estimated from
• Solves poverty of the stimulus problem: the
information about universal properties of
the ephemeris is in the priors
• Neural net without this prior information
could not generalize as bees do
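To make the point about built-in form concrete, here is a sketch in which the functional form of the ephemeris is fixed and only one parameter is estimated from a few afternoon sightings. The form used is the standard astronomical azimuth approximation, chosen only for illustration; it is not a claim about the form bees actually use. Declination is held at zero and latitude is found by a coarse grid search, both assumptions of the example.

```python
import math

def sun_compass_bearing(hour, latitude_deg, declination_deg=0.0):
    """Sun's compass bearing (degrees clockwise from north) at local solar
    time `hour`, from the standard azimuth formula (assumed form)."""
    h = math.radians(15.0 * (hour - 12.0))          # hour angle, 0 at noon
    lat = math.radians(latitude_deg)
    dec = math.radians(declination_deg)
    az_from_south = math.atan2(
        math.sin(h),
        math.cos(h) * math.sin(lat) - math.tan(dec) * math.cos(lat))
    return (math.degrees(az_from_south) + 180.0) % 360.0

def angular_error(a, b):
    """Signed difference between two bearings, in (-180, 180]."""
    return (a - b + 180.0) % 360.0 - 180.0

def fit_latitude(observations):
    """Pick the latitude (1..89 degrees) that best fits (hour, bearing) pairs."""
    def loss(lat):
        return sum(angular_error(sun_compass_bearing(hour, lat), bearing) ** 2
                   for hour, bearing in observations)
    return min(range(1, 90), key=loss)

# Late-afternoon sightings only, generated at a true latitude of 40 degrees.
afternoon = [(hour, sun_compass_bearing(hour, 40)) for hour in (15, 16, 17)]
fitted_latitude = fit_latitude(afternoon)
morning_prediction = sun_compass_bearing(9, fitted_latitude)
```

With the form fixed, three afternoon observations are enough to place the morning sun in the east, a region of the function that was never sampled. A learner with no built-in form has no basis for that extrapolation.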
Language Learning
• Same story?
• Innate universal grammar specifies structure
common to all languages
• Distinctions between languages are due to
differences in parameters (e.g., head final versus
head first)
• Learning a language reduces to learning the
(binary?) parameter values
• Mark Baker (2001) The Atoms of Language
Natural Learning Curves
• Gallistel et al (PNAS 2004)
• Analyzed individual(!) learning curves from standard
paradigms in pigeons, rats, rabbits and mice
  - Pavlovian (autoshaping in pigeon, rat & mouse)
  - Eyeblink in rabbit
  - Plus maze in rat
  - Water maze in mouse
• Regardless of paradigm, the typical curve cannot be
distinguished from a step function
• Latency and size of step vary between subjects
• Averaging across these steps produces a gradual learning
curve: its gradualness is an averaging artifact
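The averaging artifact is easy to reproduce. In the sketch below, every simulated subject follows a pure step function, with made-up latencies and step sizes; the group mean nevertheless rises gradually. The numbers are illustrative and are not data from the paper.

```python
def step_curve(latency, height, n_trials):
    """Performance that jumps from 0 to `height` at trial `latency`."""
    return [height if trial >= latency else 0.0 for trial in range(n_trials)]

def average_curves(curves):
    """Trial-by-trial mean across subjects."""
    return [sum(values) / len(values) for values in zip(*curves)]

# Hypothetical subjects: (latency, step size) differ, shape does not.
subjects = [
    step_curve(5, 1.0, 40),
    step_curve(12, 0.8, 40),
    step_curve(20, 1.2, 40),
    step_curve(30, 0.9, 40),
]
group_curve = average_curves(subjects)
```

Each individual curve takes exactly two values, while the group curve climbs through several intermediate levels, so its gradual shape describes no individual subject.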
• Subjects foraging back and forth between
locations where food becomes available
unpredictably (on random rate schedules with
unlimited holds)
• Subjects match the ratio of the time they invest in
the locations (expected stay duration, T1/T2) to the
ratio of the incomes they have derived from them
• Matching equates returns: Ri = Ii/Ti;
I1/T1 = I2/T2 iff T1/T2 = I1/I2
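A small worked example of the identity above: if time is divided in proportion to income, the returns Ri = Ii/Ti come out equal. The incomes and session length are made-up numbers, and the function names are hypothetical.

```python
import math

def matching_times(incomes, total_time):
    """Divide `total_time` between locations in proportion to their incomes."""
    total_income = sum(incomes)
    return [total_time * income / total_income for income in incomes]

def returns(incomes, times):
    """Return at each location: income per unit of time invested there."""
    return [income / time for income, time in zip(incomes, times)]

incomes = [30.0, 10.0]                  # I1, I2 (e.g. rewards per session)
times = matching_times(incomes, 60.0)   # T1, T2
r1, r2 = returns(incomes, times)
```

Here T1/T2 = 45/15 = 3 = I1/I2, and both returns equal 2/3 reward per unit time.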
RL Models
• Most assume hill-climbing discovery of the
policy that equates returns
• Policy is one-dimensional
(ratio of expected stay durations)
• Try out a given policy (stay ratio)
• Determine direction of inequality
• Adjust investment ratio accordingly
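The steps listed above can be sketched as a simple hill-climbing loop. This is a generic illustration, not any specific published model. It assumes that, because rewards are held until collected, the income from each location is roughly independent of how time is divided, so the return at a location is its income divided by the share of time spent there. The function name, step size, and iteration count are all hypothetical.

```python
def hill_climb_stay_share(income1, income2, share=0.5, step=0.01, n_iter=200):
    """Adjust the share of time at location 1 until the two returns are equal.

    Assumes incomes do not depend on the time allocation (a simplification).
    Each iteration stands for one try-out period of the current policy.
    """
    for _ in range(n_iter):
        return1 = income1 / share            # return at location 1
        return2 = income2 / (1.0 - share)    # return at location 2
        if return1 > return2:
            share += step                    # location 1 underinvested
        elif return1 < return2:
            share -= step                    # location 1 overinvested
        share = min(max(share, step), 1.0 - step)
    return share

final_share = hill_climb_stay_share(3.0, 1.0)
```

With incomes in a 3:1 ratio the share settles near 0.75, the matching allocation, but only after about 25 try-out periods from an even split. Every period must be long enough to estimate both returns, which is what makes this kind of discovery slow.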
But (Gallistel et al 2001)
• Adjustment of
investment ratio after a
step change in the
relative rates of
reward is quick and
Bayesian Ideal Detector Analysis
Second Example
Δ Incomes, Not Δ Returns
• Evidence of a change
in behavior appears as
soon as there is
evidence of a change
in incomes
• And (often) before
there is evidence of a
change in returns
Evidence of
Absence of Evidence
• Upper panel: Odds
that subject’s stay
durations had changed
as a function of
session time
• Lower panel: Odds
that subject’s returns
had changed. There
was no evidence--in
the returns!
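The odds plotted in such panels can be illustrated with a simple Bayesian change detector. The sketch below is not the analysis used in the cited work. It assumes the observed quantities (for example, stay durations) are exponentially distributed, puts a Gamma(1, 1) prior on the rate, and compares "one rate throughout" against "the rate changed at some unknown point", with every change point equally likely a priori.

```python
import math

def log_marginal(n, total, a=1.0, b=1.0):
    """Log marginal likelihood of n exponential intervals summing to `total`,
    with the rate integrated out under a Gamma(a, b) prior (assumed prior)."""
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + n) - (a + n) * math.log(b + total))

def odds_of_change(intervals):
    """Posterior odds (with even prior odds) that the rate changed once."""
    n = len(intervals)
    log_no_change = log_marginal(n, sum(intervals))
    log_terms = [
        log_marginal(k, sum(intervals[:k]))
        + log_marginal(n - k, sum(intervals[k:]))
        for k in range(1, n)
    ]
    # Average over change points in a numerically stable way.
    peak = max(log_terms)
    mean_term = sum(math.exp(t - peak) for t in log_terms) / len(log_terms)
    return math.exp(peak + math.log(mean_term) - log_no_change)

steady = [1.0] * 40                  # no change in the intervals
shifted = [1.0] * 20 + [0.1] * 20    # tenfold change halfway through
odds_steady = odds_of_change(steady)
odds_shifted = odds_of_change(shifted)
```

Odds below 1 favor no change and odds above 1 favor a change. Applying the same computation separately to stay durations and to returns is one way to ask which of the two shows evidence of change first.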
• Matching is an innate policy
• Depends only on estimates of incomes
• Anti-aliasing sampling behavior to detect periodic
structure in reward provision built into policy
• Estimates of incomes to be expected based on
small samples taken only when a change in
income detected
• Here, too, learning is model updating, not policy
value updating
• Subjects perversely ignore returns (policy values)
• Most (all?) natural learning looks like
model estimation
• Efficient model estimation is made possible by
  - Informative priors (a highly structured
problem-specific hypothesis space)
  - Innately specified efficient sampling routines

The Nature of Learning