Learning to Interpret
Natural Language Navigation Instructions
from Observation
Ray Mooney
Department of Computer Science
University of Texas at Austin
Joint work with
David Chen
Joohyun Kim
Lu Guo
Challenge Problem:
Learning to Follow Directions in a Virtual World
• Learn to interpret navigation instructions in a
virtual environment by simply observing
humans giving and following such directions
(Chen & Mooney, AAAI-11).
• Eventual goal: Virtual agents in video games
and educational software that automatically
learn to take and give instructions in natural
language.
Sample Environment
(MacMahon et al., AAAI-06)
[Map figure. Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair]
Sample Instructions
• Take your first left. Go all the way down until you hit a dead end.
• Go towards the coat hanger and turn left at it. Go straight down the hallway and the dead end is position 4.
• Walk to the hat rack. Turn left. The carpet should have green octagons. Go to the end of this alley. This is p-4.
• Walk forward once. Turn left. Walk forward twice.
[Map figure: route from Start (position 3) to End (position 4), passing the hat rack H]
Observed primitive actions:
Forward, Left, Forward, Forward
Observed Training Instance in Chinese
Executing Test Instance in English
Formal Problem Definition
Given:
{ (e1, w1, a1), (e2, w2, a2), …, (en, wn, an) }
ei – A natural language instruction
wi – A world state
ai – An observed action sequence
Goal:
Build a system that produces the correct aj given
a previously unseen (ej, wj).
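In code, the training data is just a list of such triples. A minimal Python sketch (the class and field names are illustrative, not from the actual system):

```python
from dataclasses import dataclass

@dataclass
class TrainingInstance:
    instruction: str    # e_i: natural language instruction
    world_state: dict   # w_i: observable map state (position, objects, ...)
    actions: list[str]  # a_i: observed primitive actions

# Example modeled on the slide above:
example = TrainingInstance(
    instruction="Walk forward once. Turn left. Walk forward twice.",
    world_state={"position": 3, "facing": "north"},
    actions=["Forward", "Left", "Forward", "Forward"],
)
```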
Learning System for Parsing Navigation Instructions

[Architecture figure]
Training: each observation (Instruction, World State, Action Trace) is fed
to a Navigation Plan Constructor; its output is used by a Semantic Parser
Learner to produce a Semantic Parser.
Testing: a novel (Instruction, World State) pair is parsed by the learned
Semantic Parser, and the resulting plan is executed by the Execution
Module (MARCO) to produce an Action Trace.
Representing Linguistic Context
Context is represented as the sequence of observed actions, each followed
by verification of all observable aspects of the resulting world state:

Turn ( LEFT ),
Verify ( front: BLUE HALL, front: SOFA ),
Travel ( steps: 2 ),
Verify ( at: SOFA )
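For concreteness, this fully-observed context could be encoded as an ordered list of steps, each an action paired with the world-state facts verified after it (a hypothetical encoding, not the system's actual data structure):

```python
# Fully-observed linguistic context: each step is an action followed by
# the world-state attributes verified after executing it.
context = [
    ("Turn",   {"direction": "LEFT"}),
    ("Verify", {"front": ["BLUE HALL", "SOFA"]}),
    ("Travel", {"steps": 2}),
    ("Verify", {"at": "SOFA"}),
]
```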
Possible Plans
An instruction can refer to a combinatorial number of
possible plans, each composed of some subset of this
full contextual description.
Turn ( LEFT ),
Verify ( front: BLUE HALL, front: SOFA ),
Travel ( steps: 2 ),
Verify ( at: SOFA )

Possible Plan #1: “Turn and walk to the couch”
Possible Plan #2: “Face the blue hall and walk 2 steps”
Possible Plan #3: “Turn left. Walk forward twice.”
[Each plan corresponds to a different subset of the full context graph above.]
Disambiguating Sentence Meaning
• Too many possible meanings to tractably enumerate them all.
• Therefore, we cannot use EM to align sentences with enumerated
meanings and thereby disambiguate the training data.
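A quick sanity check on why enumeration fails: a context with n independent components yields 2^n candidate subsets (illustrative arithmetic only):

```python
# Candidate meanings grow as 2**n in the number of context components.
for n in (10, 30, 60):
    print(n, 2 ** n)
# 10 1024
# 30 1073741824
# 60 1152921504606846976   (~10**18; cf. the PCFG slide later)
```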
Learning System for Parsing Navigation Instructions (revised)

[Architecture figure]
The Navigation Plan Constructor is replaced by three components: a Context
Extractor, a Lexicon Learner, and Plan Refinement. Their output plans train
the Semantic Parser Learner; the testing path (Instruction + World State →
Semantic Parser → Execution Module (MARCO) → Action Trace) is unchanged.
Lexicon Learning
• Learn meanings of words and short phrases by
finding correlations with meaning fragments.
face       → Turn ( )
blue hall  → Verify ( front: BLUE HALL )
walk       → Travel ( )
2 steps    → Travel ( steps: 2 )
Lexicon Learning Algorithm
To learn the meaning of a word or short phrase w:
1. Collect all landmark plans that co-occur with w
and add them to the set PosMean(w).
2. Repeatedly take intersections of all possible
pairs of members of PosMean(w) and add any
new entries, g, to PosMean(w).
3. Rank the entries by the scoring function:
Score(w, g) = p(g | w) − p(g | ¬w)
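A simplified sketch of this algorithm in Python. Real landmark plans are graphs, so the frozenset-of-components encoding and the `intersect` and `subsumes` helpers below are toy stand-ins for the paper's subgraph operations:

```python
from itertools import combinations

def intersect(g1, g2):
    """Toy graph intersection: meaning fragments as frozensets of components."""
    return g1 & g2

def subsumes(g, plan):
    """Does a plan contain the meaning fragment g?"""
    return g <= plan

def learn_lexicon_entry(w, corpus):
    """corpus: list of (ngram_set, plan) pairs, plans as frozensets."""
    # 1. Collect all plans that co-occur with the n-gram w.
    pos_mean = {plan for ngrams, plan in corpus if w in ngrams}
    # 2. Close PosMean(w) under pairwise intersection.
    changed = True
    while changed:
        changed = False
        for g1, g2 in combinations(list(pos_mean), 2):
            g = intersect(g1, g2)
            if g and g not in pos_mean:
                pos_mean.add(g)
                changed = True
    # 3. Rank entries by Score(w, g) = p(g|w) - p(g|not w).
    with_w    = [p for n, p in corpus if w in n]
    without_w = [p for n, p in corpus if w not in n]
    def score(g):
        return (sum(subsumes(g, p) for p in with_w) / max(len(with_w), 1)
              - sum(subsumes(g, p) for p in without_w) / max(len(without_w), 1))
    return sorted(pos_mean, key=score, reverse=True)
```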
Graph Intersection

Graph 1: “Turn and walk to the sofa.”
Turn ( LEFT ),
Verify ( front: BLUE HALL, front: SOFA ),
Travel ( steps: 2 ),
Verify ( at: SOFA )

Graph 2: “Walk to the sofa and turn left.”
Travel ( steps: 1 ),
Verify ( at: SOFA ),
Turn ( LEFT ),
Verify ( front: BLUE HALL )

Intersections:
Turn ( LEFT ), Verify ( front: BLUE HALL )
Travel ( ), Verify ( at: SOFA )
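Under the same toy frozenset encoding, the sofa example plays out as follows. Including a bare ("Travel",) component alongside its attributes lets the intersection generalize away the differing step counts, loosely mimicking the paper's graph generalization:

```python
g1 = frozenset({("Turn", "LEFT"), ("Verify", "front: BLUE HALL"),
                ("Verify", "front: SOFA"), ("Travel",),
                ("Travel", "steps: 2"), ("Verify", "at: SOFA")})
g2 = frozenset({("Travel",), ("Travel", "steps: 1"),
                ("Verify", "at: SOFA"), ("Turn", "LEFT"),
                ("Verify", "front: BLUE HALL")})
# Shared structure: Turn(LEFT)/Verify(front: BLUE HALL), Travel()/Verify(at: SOFA)
print(sorted(g1 & g2))
# [('Travel',), ('Turn', 'LEFT'), ('Verify', 'at: SOFA'), ('Verify', 'front: BLUE HALL')]
```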
Plan Refinement
• Use the learned lexicon to determine the subset of
the context that represents the sentence meaning.

“Face the blue hall and walk 2 steps”

Full context:
Turn ( LEFT ),
Verify ( front: BLUE HALL, front: SOFA ),
Travel ( steps: 2 ),
Verify ( at: SOFA )

Refined plan (lexicon-selected subset):
Turn ( ),
Verify ( front: BLUE HALL ),
Travel ( steps: 2 )
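A toy sketch of refinement with the same encoding (the lexicon entries and substring matching are simplifying assumptions):

```python
def refine_plan(sentence, context, lexicon):
    """Keep only context components licensed by lexicon entries whose
    n-grams appear in the sentence (toy version of plan refinement)."""
    selected = set()
    for ngram, fragment in lexicon.items():
        if ngram in sentence.lower():
            selected |= fragment & context  # keep only parts actually observed
    return selected

lexicon = {
    "face":      frozenset({("Turn",)}),
    "blue hall": frozenset({("Verify", "front: BLUE HALL")}),
    "2 steps":   frozenset({("Travel",), ("Travel", "steps: 2")}),
}
context = frozenset({("Turn",), ("Turn", "LEFT"),
                     ("Verify", "front: BLUE HALL"), ("Verify", "front: SOFA"),
                     ("Travel",), ("Travel", "steps: 2"),
                     ("Verify", "at: SOFA")})
print(sorted(refine_plan("Face the blue hall and walk 2 steps", context, lexicon)))
# [('Travel',), ('Travel', 'steps: 2'), ('Turn',), ('Verify', 'front: BLUE HALL')]
```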
Evaluation Data Statistics
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps

                    Paragraph       Single-Sentence
# Instructions      706             3,236
Avg. # sentences    5.0 (±2.8)      1.0 (±0)
Avg. # words        37.6 (±21.1)    7.8 (±5.1)
Avg. # actions      10.4 (±5.7)     2.1 (±2.4)
End-to-End Execution Evaluation
• Test how well the system follows novel directions.
• Leave-one-map-out cross-validation.
• Strict metric: only correct if the final position exactly
matches the goal location.
• Lower baselines:
– Simple probabilistic generative model of executed plans w/o language
– Semantic parser trained on full context plans
• Upper baselines:
– Semantic parser trained on human annotated plans
– Human followers
End-to-End Execution Accuracy
                                    Single-Sentence   Paragraph
Simple Generative Model             11.08%            2.15%
Trained on Full Context             21.95%            2.66%
Trained on Refined Plans            57.28%            19.18%
Trained on Human Annotated Plans    62.67%            29.59%
Human Followers                     N/A               69.64%
Sample Successful Parse
Instruction:
“Place your back against the wall of the ‘T’ intersection.
Turn left. Go forward along the pink-flowered carpet hall
two segments to the intersection with the brick hall. This
intersection contains a hatrack. Turn left. Go forward three
segments to an intersection with a bare concrete hall,
passing a lamp. This is Position 5.”

Parse:
Turn ( ),
Verify ( back: WALL ),
Turn ( LEFT ),
Travel ( ),
Verify ( side: BRICK HALLWAY ),
Turn ( LEFT ),
Travel ( steps: 3 ),
Verify ( side: CONCRETE HALLWAY )
Mandarin Chinese Experiment
• Translated all the instructions from English to
Chinese.
                           Single Sentences   Paragraphs
Trained on Refined Plans   58.70%             20.13%
Problem with
Purely Correlational Lexicon Learning
• The correlation between an n-gram w and a
graph g can be distorted by the context.
• Example:
– Bigram: “the wall”
– Sample uses:
• “turn so the wall is on your right side”
• “with your back to the wall turn left”
– Co-occurring aspects of context:
• TURN()
• VERIFY(direction: WALL)
– But “the wall” simply denotes an object and involves no action.
Syntactic Bootstrapping
• Children sometimes use syntactic
information to guide learning of word
meanings (Gleitman, 1990).
• Complement to Pinker’s semantic
bootstrapping in which semantics is used to
help learn syntax.
Using POS to Aid Lexicon Learning
• Annotate each n-gram, w, with POS tags.
– dead/JJ end/NN
• Annotate each node in the meaning graph, g, with a
semantic-category tag.
– TURN/Action VERIFY/Action FORWARD/Action
Constraints on Lexicon Entry (w, g)
• The n-gram w should contain a noun if and
only if the graph g contains an Object.
• The n-gram w should contain a verb if and
only if the graph g contains an Action.

Rejected: dead/JJ end/NN ↔ TURN/Action VERIFY/Action FORWARD/Action
(“dead end” is often followed by the action of turning around to face
another direction with a way forward, but the phrase contains no verb,
so this all-Action meaning is filtered out.)
Accepted: dead/JJ end/NN ↔ front/Relation WALL/Object
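A minimal sketch of this filter, assuming Penn Treebank POS tags and plain-string semantic categories (the tagging itself is done elsewhere):

```python
def satisfies_constraints(tagged_ngram, tagged_graph):
    """tagged_ngram: [("dead", "JJ"), ("end", "NN")]
       tagged_graph: [("TURN", "Action"), ...]"""
    has_noun   = any(pos.startswith("NN") for _, pos in tagged_ngram)
    has_verb   = any(pos.startswith("VB") for _, pos in tagged_ngram)
    has_object = any(cat == "Object" for _, cat in tagged_graph)
    has_action = any(cat == "Action" for _, cat in tagged_graph)
    # Keep the entry only if noun <-> Object and verb <-> Action both hold.
    return has_noun == has_object and has_verb == has_action

dead_end = [("dead", "JJ"), ("end", "NN")]
print(satisfies_constraints(dead_end, [("TURN", "Action"), ("VERIFY", "Action"),
                                       ("FORWARD", "Action")]))   # False: rejected
print(satisfies_constraints(dead_end, [("FRONT", "Relation"),
                                       ("WALL", "Object")]))      # True: accepted
```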
Experimental Results
[Results chart]
PCFG Induction Model for Grounded
Language Learning (Borschinger et al. 2011)
• PCFG rules describe the generative process from
MR components to corresponding NL words.
Series of Grounded Language Learning
Papers that Build Upon Each Other
• Kate & Mooney, AAAI-07
• Chen & Mooney, ICML-08
• Liang, Jordan, and Klein, ACL-09
• Kim & Mooney, COLING-10
– Also integrates Lu, Ng, Lee, & Zettlemoyer, EMNLP-08
• Borschinger, Jones, & Johnson, EMNLP-11
• Kim & Mooney, EMNLP-12
PCFG Induction Model for Grounded
Language Learning (Borschinger et al. 2011)
• Generative process
– Select complete MR to describe
– Generate atomic MR constituents in order
– Each atomic MR generates NL words by a unigram
Markov process
• Parameters learned using EM (Inside-Outside)
• Parse new NL sentences by reading the top MR
nonterminal from the most probable parse tree
– Output MRs are limited to those included in the
PCFG rule set constructed from the training data
Limitations of Borschinger et al. 2011
PCFG Approach
• Only works in low-ambiguity settings,
where each sentence can refer to only a few
possible MRs.
• Can only output MRs explicitly included in the
PCFG constructed from the training data.
• Produces intractably large PCFGs for
complex MRs with high ambiguity:
– would require ~10^18 productions for our
navigation data.
Our Enhanced PCFG Model
(Kim & Mooney, EMNLP-2012)
• Use learned semantic lexicon to constrain the
constructed PCFG.
• Limit each MR to generate only words and
phrases paired with this MR in the lexicon.
– Only ~18,000 productions are produced for the
navigation data, compared to ~33,000 produced by
Borschinger et al. for the far simpler RoboCup data.
• Output novel MRs not appearing in the PCFG
by composing subgraphs from the overall
context.
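An illustrative sketch of the pruning idea: emit a production for an MR component only when the learned lexicon pairs it with a phrase (names and rule format invented for illustration, not the paper's grammar code):

```python
def build_productions(mr_components, lexicon):
    """Emit one production per (MR component, licensed phrase) pair,
    instead of pairing every component with every phrase."""
    productions = []
    for mr in mr_components:
        # Only phrases paired with this MR fragment in the learned lexicon.
        for phrase in lexicon.get(mr, []):
            productions.append((mr, phrase))
    return productions

lexicon = {
    "Turn(LEFT)":      ["turn left", "take a left"],
    "Travel(steps:2)": ["walk 2 steps", "go forward twice"],
}
print(build_productions(["Turn(LEFT)", "Travel(steps:2)"], lexicon))
```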
End-to-End Execution Evaluations
                                          Single Sentences   Paragraphs
Mapping to supervised semantic parsing    57.28%             19.18%
Our PCFG model                            57.22%             20.17%
Conclusions
• Challenge problem: Learn to follow NL instructions
by just watching people follow them.
• Our goal: Learn without assuming any prior linguistic
knowledge.
– Easily adapt to new languages
• Exploit existing work on learning for semantic parsing
in order to produce structured meaning representations
that can handle complex instructions.
• Encouraging initial results on learning to navigate in a
virtual world, but still far from human-level
performance.