Hybrid Systems for Information
Extraction and Question Answering
Presented By
Rani Qumsiyeh
What is Question Answering?
• Being able to retrieve the exact piece of information the user
is looking for rather than a set of relevant documents.
• Who was the president of the US in 2004?
George W. Bush
What is Summarization?
• “Text summarization can be regarded as the most interesting and
promising Natural Language Understanding task computational linguists
are currently faced with” Rodolfo Delmonte
• Summarization means taking a large piece of text and extracting the most
important ideas out of it.
• The story of the 3 little pigs
Once upon a time there were three little pigs who lived happily in the countryside. But in the same place lived a wicked wolf who fed
precisely on plump and tender pigs. The little pigs therefore decided to build a small house each, to protect themselves from the wolf.
The oldest one, Jimmy who was wise, worked hard and built his house with solid bricks and cement. The other two, Timmy and Tommy,
who were lazy settled the matter hastily and built their houses with straw and pieces of wood. The lazy pigs spent their days playing
and singing a song that said, "Who is afraid of the big bad wolf?" And one day, lo and behold, the wolf appeared suddenly behind their
backs. "Help! Help!", shouted the pigs and started running as fast as they could to escape the terrible wolf. He was already licking his
lips thinking of such an inviting and tasty meal. The little pigs eventually managed to reach their small house and shut themselves in,
barring the door. They started mocking the wolf from the window singing the same song, "Who is afraid of the big bad wolf?" In the
meantime the wolf was thinking a way of getting into the house. He began to observe the house very carefully and noticed it was not
very solid. He huffed and puffed a couple of times and the house fell down completely. Frightened out of their wits, the two little pigs
ran at breakneck speed towards their brother's house. "Fast, brother, open the door! The wolf is chasing us!" They got in just in time
and pulled the bolt. Within seconds the wolf was arriving, determined not to give up his meal. Convinced that he could also blow the
little brick house down, he filled his lungs with air and huffed and puffed a few times. There was nothing he could do. The house didn't
move an inch. In the end he was so exhausted that he fell to the ground. The three little pigs felt safe inside the solid brick house.
Grateful to their brother, the two lazy pigs promised him that from that day on they too would work hard.
Could this be automated?
• When understanding a text, a human reader or listener draws on his or her
encyclopedic knowledge parsimoniously.
• To do it automatically, the system should simulate the actual human
behavior in that the access to extra linguistic knowledge is triggered by
contextual factors independently present in the text and detected by the
system itself.
• The simplest approach is to use a Bag of Words (BOW).
– For question answering, out of the first n documents retrieved, extract the words in the
question along with a certain number of neighboring words.
– For summarization, extract all sentences with title keywords in them.
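A minimal sketch of the two Bag of Words baselines described above. The function names, the stop-word list, and the default window size are illustrative assumptions, not part of GETARUNS.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bow_answer_windows(question, documents, window=3):
    """QA baseline: for each question word found in a document,
    return that word with `window` neighboring words on each side."""
    q_words = set(tokenize(question)) - STOPWORDS
    snippets = []
    for doc in documents:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok in q_words:
                start = max(0, i - window)
                snippets.append(" ".join(tokens[start:i + window + 1]))
    return snippets

def bow_summary(title, text):
    """Summarization baseline: keep every sentence that shares
    at least one keyword with the title."""
    keywords = set(tokenize(title)) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if keywords & set(tokenize(s))]
```

Neither function looks at word order or grammatical roles, which is the weakness the following slides illustrate.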
What is the Problem?
• The problem the researchers are trying to tackle is taken from P. Bosch's
contribution to a book by Herzog & Rollinger (eds.), “Text Understanding in
– Identifying in a text "inferentially unstable" concepts, which are to be kept
distinct from "inferentially stable" ones. The latter should be analyzed solely
on the basis of linguistic description, while the former should tap external
(extra-linguistic) knowledge of the world.
• We identify tout court with contextual reasoning, i.e. performing
inferential processes on the basis of linguistic information while keeping
under control the contribution of external knowledge in order to achieve
understanding of a text
Example of the Problem
• More information from query
– Bill surprised Hillary with his answer
– The word "his" refers to Bill; hence, the answer is Bill's.
• Same Head Problem
– The president of Russia visited the president of China
– Who visited the president?
• Reversible Arguments Problem
– What do frogs eat?
– What eats frogs?
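The Reversible Arguments Problem can be shown in a few lines. In this sketch the stop-word set, the lemma table, and the predicate-argument readings are all hand-written assumptions for illustration; they are not produced by any parser.

```python
# Hypothetical stop words and lemma table, just enough for the two questions.
STOPWORDS = {"what", "do", "does"}
LEMMAS = {"eats": "eat"}

def content_bag(question):
    """Reduce a question to its unordered set of content-word lemmas."""
    words = question.lower().rstrip("?").split()
    return frozenset(LEMMAS.get(w, w) for w in words if w not in STOPWORDS)

# Hand-written readings showing what a syntactic analysis would need to
# recover: the same relation, with the known argument in opposite roles.
READINGS = {
    "What do frogs eat?": {"relation": "eat", "agent": "frogs", "patient": "?"},
    "What eats frogs?": {"relation": "eat", "agent": "?", "patient": "frogs"},
}

q1, q2 = "What do frogs eat?", "What eats frogs?"
same_bag = content_bag(q1) == content_bag(q2)      # True: BOW cannot tell them apart
same_reading = READINGS[q1] == READINGS[q2]        # False: the roles differ
```

Once the words are reduced to a bag, the two questions become indistinguishable, although they ask for opposite things.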
The solution, A Hybrid System
• Symbolic processing is defined as those computations that are performed
at the same or more abstract level than the word level.
• Statistical natural-language processing uses stochastic, probabilistic and
statistical methods to resolve some of the ambiguities of text.
• Syntactic processing deals with certain aspects of meaning that can be
determined only from the underlying structure and not simply from the
linear string of words.
• Semantic analysis involves extracting context-independent aspects of a
sentence's meaning.
• To act and think like a human, a system needs both symbolic and statistical processing.
GETARUNS (General Text And
Reference UNderstander)
• Works in the following way:
– Performs semantic analysis on the basis of syntactic parsing.
– Performs Anaphora Resolution.
– Builds a quasi logical form with flat indexed Augmented
Dependency Structures (Discourse Model)
– Uses a centering algorithm to individuate the topics or discourse
centers which are weighted on the basis of a relevance score.
– This logical form can then be used to individuate the best
sentence candidates to answer queries or provide appropriate
The parser
• Rule-based deterministic parser.
• Uses a lookahead and a Well-Formed
Substring Table to reduce backtracking.
• It also implements Finite State Automata in
the task of tag disambiguation.
• It is based on a top down, depth-first search
Example of the F-Structure
produced by the Parser
John went into a restaurant
lex_form:[np/subj/agent/[human, object], pp/obl/locat/[to, in, into]/[object, place]]
voice:active; mood:ind; tense:past
gen:mas; num:sing; pers:3; spec:def:'0'
tab_ref:[+ref, -pro, -ana, -class]
num:sing; pers:3; spec:def:tab_ref:[+ref, -pro, -ana, +class]
rel2:[included(tr(f1_res2), tes(f1_res2))]
qops:qop:q(q1, indefinite)
Building the Discourse Model
• A set of entities and relations between them, as
“specified” in a discourse.
• Discourse Entities can be used as Discourse
• Entities and relations in a Discourse Model can
be interpreted as representations of the
cognitive objects of a mental model.
• Representation inspired by Situation Semantics.
• Implemented as Prolog facts.
DM and infons
• Any piece of information is added to the DM as an infon.
List of Arguments - with Semantic Roles,
Polarity - 1 affirmative, 0 negation,
Temporal Location Index,
Spatial Location Index)
• An infon consists of a relation name, its arguments, a polarity
(yes/no), and a couple of indexes anchoring the relation to a spatiotemporal location.
– EX: meet, (arg1:john, arg2:mary), yes, 22-sept-2008, venice
• Each infon has a unique identifier and can be referred to by other
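The infon structure described above can be sketched as a small record type. The slides state that the actual system stores infons as Prolog facts; the Python class, field names, identifier scheme, and the second ("say") infon below are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Infon:
    ident: str                               # unique identifier
    relation: str                            # relation name
    arguments: Tuple[Tuple[str, str], ...]   # (role, value) pairs
    polarity: int                            # 1 affirmative, 0 negation
    time: str                                # temporal location index
    place: str                               # spatial location index

class DiscourseModel:
    def __init__(self):
        self.infons: Dict[str, Infon] = {}

    def add(self, relation, arguments, polarity, time, place):
        """Store a new infon and return its generated identifier."""
        ident = f"infon{len(self.infons) + 1}"
        self.infons[ident] = Infon(ident, relation, tuple(arguments.items()),
                                   polarity, time, place)
        return ident

    def resolve(self, ident, role):
        """Return the filler of a role; if it names another infon, return that infon."""
        value = dict(self.infons[ident].arguments)[role]
        return self.infons.get(value, value)

dm = DiscourseModel()
# The example from the slide.
meeting = dm.add("meet", {"arg1": "john", "arg2": "mary"}, 1, "22-sept-2008", "venice")
# Hypothetical complex infon that takes the first one as an argument.
report = dm.add("say", {"arg1": "bill", "arg2": meeting}, 1, "23-sept-2008", "venice")
```

Because every infon carries its own identifier, one infon can appear as an argument of another, which is how the complex infons on the next slide are built.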
Kinds of Infons
• Full infons
– Situations: sit/6
– Facts: fact/6
– Complex infons: have other sit/fact as argument
• Simplified infons
– Entities: ind/2, set/2, class/2
– Cardinalities: card/3
– Membership: in/3
– Spatio-temporal rels: includes/2, during/2, …
Entities, Cardinalities, Membership
• Entities are represented in the DM without any commitment about
their “existence” in reality.
– Individual entities (“John”): ind(infon1, id5).
– Extensional plural entities (“his kids”): set(infon2, id6).
– Intensional plural entities (“lions”): class(…, id7).
• Cardinality (only for sets: “four kids”)
– card(…, id6, 4).
• Membership (between individuals and sets: “one of them”)
– in(…, id5, id6).
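As a rough illustration, the simplified infons above can be mirrored as plain tuples and queried with a few helpers. The helper names and the infon identifiers that the slide elides with "…" (here infon3 to infon5) are placeholders, and the cardinality follows the "four kids" example.

```python
# Each fact is (functor, infon_id, ...), mirroring ind/2, set/2, class/2, card/3, in/3.
facts = [
    ("ind", "infon1", "id5"),          # "John"
    ("set", "infon2", "id6"),          # "his kids"
    ("class", "infon3", "id7"),        # "lions" (placeholder infon id)
    ("card", "infon4", "id6", 4),      # "four kids" (placeholder infon id)
    ("in", "infon5", "id5", "id6"),    # "one of them" (placeholder infon id)
]

def entity_kind(facts, entity_id):
    """Return 'ind', 'set' or 'class' for an entity, or None if it is unknown."""
    for fact in facts:
        if fact[0] in ("ind", "set", "class") and fact[2] == entity_id:
            return fact[0]
    return None

def cardinality(facts, set_id):
    """Return the recorded cardinality of a set, or None if none was asserted."""
    for fact in facts:
        if fact[0] == "card" and fact[2] == set_id:
            return fact[3]
    return None

def members(facts, set_id):
    """Return the individuals asserted to belong to a set."""
    return [f[2] for f in facts if f[0] == "in" and f[3] == set_id]
```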
Anaphora Resolution
• An anaphor is an expression that refers to
another expression.
• Anaphora Resolution means identifying which
expression an anaphor refers to.
Two Types of Anaphora
Noun/Noun Phrase (i.e. Nominal)
He doesn’t like this book. Show him a more interesting one.
One refers to the book.
If you want a typewriter, they will provide you with one.
One refers to the typewriter.
Slang disappears quickly, especially the juvenile sort.
Sort refers to slang.
Nominal substitutes also include some indefinite pronouns, such as all, both, some, any,
enough, several, none, many, much, (a) few, (a) little, the other, others, another, either,
neither, etc., e.g.:
Can you get me some nails? I need some.
Some refers to nails.
Pronoun/Pronoun Phrase (i.e. Pronominal)
The Prime Minister of New Zealand visited us yesterday. The visit was the first time she had come to
New York since 1998.
She refers to the Prime Minister.
Us refers to the people of New York.
The monkey took the banana and ate it.
it refers to the banana.
How does it work?
• Computed by a Module of Discourse Anaphora
• Decides on the basis of semantic categories attached
to predicates and arguments of predicates whether
to bind a pronoun to the locally available antecedent
or to the discourse level one.
• Creates a list of candidates or possible arguments of
discourse which includes all external pronouns and
referential expressions. The algorithm creates a
Weighted List of Candidates Arguments of
Ontology Behind Anaphora Resolution
• On first occurrence of a referring expression
– It is asserted as an INDividual if it is a definite or indefinite expression.
– It is asserted as a CLASS if it is quantified or has no determiner.
– We have LOCs for main locations, both spatial and temporal.
– Whenever there is cardinality determined by a digit, the referring expression is asserted
as a SET.
• On second occurrence of the same nominal head
– The semantic index is recovered from the history list
– In case it is definite or indefinite with a predicative role and no attributes or modifiers,
nothing is done;
– In case it has a different number (singular, while the one present in the DM is a set or a
class), nothing happens;
– In case it has attributes and modifiers which are different and the one present in the DM
has none, nothing happens;
– In case it is a quantified expression with no cardinality, and the one present in the DM
is a set or a class, again nothing happens.
– Otherwise a new entity is asserted in the DM.
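The first-occurrence rules can be sketched as a small decision function. The function name, its parameters, and in particular the order in which the rules are checked (LOC, then SET, then CLASS, then IND) are assumptions; the slides do not say how conflicts between rules are settled.

```python
from typing import Optional

def first_occurrence_type(determiner: Optional[str],
                          quantified: bool = False,
                          digit_cardinality: bool = False,
                          main_location: bool = False) -> str:
    """Choose the semantic type asserted for a new referring expression.

    `determiner` is 'definite', 'indefinite', or None when there is none.
    The precedence of the checks below is an assumption, not stated in the source.
    """
    if main_location:
        return "loc"      # main spatial or temporal location
    if digit_cardinality:
        return "set"      # cardinality given by a digit
    if quantified or determiner is None:
        return "class"    # quantified or bare nominal
    if determiner in ("definite", "indefinite"):
        return "ind"      # definite or indefinite expression
    raise ValueError(f"unknown determiner: {determiner!r}")
```

For instance, under these assumptions a bare plural such as "lions" would be typed as a class, while "the wolf" would be typed as an individual.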
GETARUNS as a QA system
• Uses Bag Of Words to search through Google.
• It builds the Discourse Model for the first five
• It looks for the answer using the Discourse
• It retrieves the snippet with the right answer.
Examples of QA
• No other system out there that does text
• 74% F-measure for Anaphora Resolution.
• Is very effective in retrieving the “gist” of the
• Can answer natural language questions.
• Introduces very important algorithms to the
NLP community.
• Very slow when dealing with large text.
• When summarizing it only manages to
maintain 73% of the “important” text.
– No actual data to test on.
– If data is lost, can we really use such a system?
• Achieves a 63% accuracy with question
• Cannot answer WHO questions.
Future Work
• Consider more than 2 sentences in advance of
the current one being processed.
• Find a way to deal with all types of questions.
– This work is currently in progress; no
publication yet.
• Try to increase accuracy, especially in the
summarization aspect of the system.
• Consider Categories of questions to further
“pin down” the answer.