The New “Bill of Rights” of the Information Society
Raj Reddy and Jaime Carbonell
Carnegie Mellon University
Talk at Google, March 23, 2006
New Bill of Rights

Get the right information
  e.g. search engines
To the right people
  e.g. categorizing, routing
With the right level of detail
  e.g. summarization
In the right language
  e.g. machine translation
At the right time
  e.g. Just-in-Time (task modeling, planning)
In the right medium
  e.g. access to information in non-textual media
Relevant Technologies

“…right information”       search engines
“…right people”            classification, routing
“…right time”              anticipatory analysis
“…right language”          machine translation
“…right level of detail”   summarization
“…right medium”            speech input and output
“…right information”
Search Engines
The Right Information

Right information from future search engines
  How to go beyond mere “relevance to query” and “popularity”
  Eliminate massive redundancy, e.g. for the query “web-based email”:
    Should not result in multiple links to different Yahoo sites promoting
    their email, or even non-Yahoo sites discussing just Yahoo email.
    Should result in a link to Yahoo email, one to MSN email, one to Gmail,
    one that compares them, etc.
First show trusted info sources and user-community-vetted sources
  At least for important info (medical, financial, educational, …), I want
  to trust what I read. For new medical treatments, that means info first
  from hospitals, medical schools, the AMA, medical publications, etc., and
  NOT from Joe Shmo’s quack practice page or from the National Enquirer.

Relevant techniques: Maximal Marginal Relevance, Novelty Detection,
Named Entity Extraction
Beyond Pure Relevance in IR

Current information retrieval technology only maximizes relevance to the
query. What about information novelty, timeliness, appropriateness,
validity, comprehensibility, density, medium, …?

Novelty is approximated by non-redundancy.

We really want to maximize the utility of the retrieved documents, given
the query, the collection, the user profile, and the interaction history:

  P(U(f_1, ..., f_n) | Q & {C} & U & H)
  where Q = query, {C} = collection set,
  U = user profile, H = interaction history

…but we don’t yet know how. Darn.
Maximal Marginal Relevance vs. Standard Information Retrieval

[Figure: standard IR ranks documents purely by similarity to the query;
MMR also penalizes similarity to documents already selected, spreading
the retrieved set across the document space.]
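A minimal sketch of MMR re-ranking, under the assumption that some
similarity function sim (e.g. cosine over tf-idf vectors) is available;
the parameter lam trades relevance to the query against novelty with
respect to documents already selected. All names are illustrative.

def mmr_rerank(query, documents, sim, lam=0.7, k=10):
    # Greedily pick k documents maximizing lam*relevance - (1-lam)*redundancy.
    selected = []
    candidates = list(documents)
    while candidates and len(selected) < k:
        def mmr_score(d):
            relevance = sim(d, query)
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected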
Novelty Detection

Find the first report of a new event
  (Unconditional) dissimilarity with the past
    Decision threshold on the most-similar prior story
    (Linear) temporal decay
    Length filter (for teasers)
  Cosine similarity with standard tf-idf weights:
    (1 + log(tf)) * log(N / df)
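A minimal sketch of first-story detection along these lines, assuming
tf-idf vectors (term → weight dicts) are already built: a story is
flagged as novel when its best match among past stories, discounted by
a linear temporal decay, still falls below a threshold. The constants
are illustrative, not tuned values from the talk.

import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_first_story(story_vec, story_time, past, threshold=0.3, decay=0.01):
    # past: list of (vector, timestamp) for previously seen stories
    best = 0.0
    for vec, t in past:
        age = story_time - t
        best = max(best, cosine(story_vec, vec) * max(0.0, 1.0 - decay * age))
    return best < threshold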
New First Story Detection Directions

Topic-conditional models
  e.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties”
  → topic, not event
  “TWA 800,” “March 12, 1997” → event
  First categorize into a topic, then use maximally-discriminative
  terms within the topic
Rely on situated named entities
  e.g. “Arcan as victim,” “Sharon as peacemaker”
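A sketch of the topic-conditional idea, reusing is_first_story from the
previous sketch: assign the story to a topic, re-weight its terms so
topic-generic words stop dominating, and compare only against past
stories of the same topic. classify_topic and topic_weights are assumed
helpers, not components described in the talk.

def topic_conditional_fsd(story_vec, story_time, archive, classify_topic,
                          topic_weights):
    topic = classify_topic(story_vec)
    weights = topic_weights[topic]   # down-weights topic-generic terms
    reweighted = {t: w * weights.get(t, 1.0) for t, w in story_vec.items()}
    past = archive.get(topic, [])    # only same-topic stories compete
    novel = is_first_story(reweighted, story_time, past)
    archive.setdefault(topic, []).append((reweighted, story_time))
    return novel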
Link Detection in Texts

Find texts (e.g. news stories) that mention the same underlying events.
  Could be combined with novelty detection (e.g. find something new
  about an interesting event).
Techniques: text similarity, NEs, situated NEs, relations,
topic-conditioned models, …
Named-Entity Identification

Purpose: to answer questions such as:
  Who is mentioned in these 100 society articles?
  What locations are listed in these 2000 web pages?
  What companies are mentioned in these patent applications?
  What products were evaluated by Consumer Reports this year?
Named Entity Identification

Example text:

President Clinton decided to send special trade envoy Mickey Kantor to
the special Asian economic meeting in Singapore this week. Ms. Xuemei
Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s
Ministry of Trade and Industry will also attend. Singapore, who is
hosting the meeting, will probably be represented by its foreign and
economic ministers. The Australian representative, Mr. Langford, will
not attend, though no reason has been given. The parties hope to reach
a framework for currency stabilization.
Methods for NE Extraction

Finite-state transducers with variables
  Example output:
    FNAME: “Bill”  LNAME: “Clinton”  TITLE: “President”
  FSTs learned from labeled data
Statistical learning (also from labeled data)
  Hidden Markov Models (HMMs)
  Exponential (maximum-entropy) models
  Conditional Random Fields [Lafferty et al.]
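A toy sketch in the spirit of a finite-state pattern with variables: one
hand-written regular expression that fills TITLE, FNAME, and LNAME slots.
Real FST systems compile many such patterns (or learn them from labeled
data); this single pattern is only illustrative.

import re

PERSON = re.compile(
    r"(?P<TITLE>President|Mr\.|Ms\.|Dr\.)\s+"
    r"(?P<FNAME>[A-Z][a-z]+)\s+"
    r"(?P<LNAME>[A-Z][a-z]+)"
)

text = "President Bill Clinton decided to send envoy Mickey Kantor."
for m in PERSON.finditer(text):
    print(m.groupdict())
# {'TITLE': 'President', 'FNAME': 'Bill', 'LNAME': 'Clinton'}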
Named Entity Identification

Extracted Named Entities (NEs):

People               Places
------               ------
President Clinton    Singapore
Mickey Kantor        Japan
Ms. Xuemei Peng      China
Mr. Hideto Suzuki    Australia
Mr. Langford
Role-Situated NEs

Motivation: it is useful to know the roles of NEs:
  Who participated in the economic meeting?
  Who hosted the economic meeting?
  Who was discussed in the economic meeting?
  Who was absent from the economic meeting?
Emerging Methods for Extracting Relations

Link parsers at the clause level
  Based on dependency grammars
  Probabilistic enhancements [Lafferty, Venable]
Island-driven parsers
  GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rosé]
Tree-bank-trained probabilistic CF parsers [IBM, Collins]

These herald the return of deep(er) NLP techniques:
  Relevant to the new Q/A-from-free-text initiative
  Too complex for inductive learning (today)
Relational NE Extraction

Example (Who does What to Whom):

“John Snell reporting for Wall Street. Today Flexicon Inc. announced a
tender offer for Supplyhouse Ltd. for $30 per share, representing a 30%
premium over Friday’s closing price. Flexicon expects to acquire
Supplyhouse by Q4 2001 without problems from federal regulators.”
Fact Extraction Application

Useful for relational DB filling, to prepare data for “standard”
DM/machine-learning methods:

Acquirer   Acquiree      Sh. price   Year
-----------------------------------------
Flexicon   Logi-truck    18          1999
Flexicon   Supplyhouse   30          2001
buy.com    reel.com      10          2000
...        ...           ...         ...
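A small sketch of how extracted acquisition facts might fill such a
relational table for downstream data mining. The record layout mirrors
the table above; the extractor producing these tuples is assumed.

from dataclasses import dataclass

@dataclass
class Acquisition:
    acquirer: str
    acquiree: str
    share_price: float
    year: int

table = [
    Acquisition("Flexicon", "Supplyhouse", 30.0, 2001),
    Acquisition("buy.com", "reel.com", 10.0, 2000),
]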
“…right people”
Text Categorization
The Right People

User-focused search is key
  If a 7-year-old working on a school project about taking good care of
  one’s heart types in “heart care”, she will want links to pages like
    “You and your friendly heart”,
    “Tips for taking good care of your heart”,
    “Intro to how the heart works”, etc.,
  NOT the latest New England Journal of Medicine article on
  “Cardiological implications of immuno-active proteases”.
  If a cardiologist issues the query, exactly the opposite is desired.
Search engines must know their users better, and the users’ tasks
Social affiliation groups for search and for automatically categorizing,
prioritizing, and routing incoming info or search results. New machine
learning technology allows for scalable, high-accuracy hierarchical
categorization.
  Family group
  Organization group
  Country group
  Disaster-affected group
  Stockholder group
Text Categorization

Assign labels to each document or web page
  Labels may be topics such as Yahoo categories
    finance, sports, News→World→Asia→Business
  Labels may be genres
    editorials, movie reviews, news
  Labels may be routing codes
    send to marketing, send to customer service
Text Categorization Methods

Manual assignment
  as in Yahoo
Hand-coded rules
  as in Reuters
Machine learning (the dominant paradigm)
  Words in the text become predictors
  Category labels become the “to be predicted”
  Predictor-feature reduction (SVD, χ², …)
  Apply any inductive method: kNN, NB, DT, …
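A minimal sketch of the machine-learning approach: words become
predictor features and category labels are the targets. scikit-learn
and the tiny training set are assumptions for illustration; any
inductive learner (kNN, NB, DT, …) would fit the same mold.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stocks fell sharply", "the team won the final",
              "quarterly earnings rose", "striker scores twice"]
train_labels = ["finance", "sports", "finance", "sports"]

# tf-idf features feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["market earnings report"]))  # likely ['finance']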
Multi-tier Event Classification

[Hierarchy: News Event → Terrorist Event (Bombing, Shooting);
 Economic Disaster (Asian Crisis, US Tech Crisis)]
“…right timeframe”
Just-in-Time: no sooner, no later
Just-in-Time Information

Get the information to the user exactly when it is needed
  Immediately, when the information is requested
  Prepositioned, if it requires time to fetch & download (e.g. HDTV video)
    requires anticipatory analysis and pre-fetching
How about “push technology” for, e.g., stock alerts, reminders,
breaking news?
  Depends on user activity:
    Sleeping, Don’t Disturb, or in a meeting → wait your chance
    Reading email → now if info is urgent, later otherwise
    Group info before delivering (e.g. show 3 stock alerts together)
    Info directly relevant to user’s current task → immediately
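A rule-based sketch of the delivery policy just described. The activity
states and the batching rule come from the slide; the function name, the
info object’s attributes, and the batch size of 3 are illustrative.

def delivery_decision(user_activity, info, pending_alerts):
    # info is assumed to carry .relevant_to_current_task and .urgent flags
    if info.relevant_to_current_task:
        return "deliver_now"
    if user_activity in ("sleeping", "do_not_disturb", "in_meeting"):
        return "wait"
    if user_activity == "reading_email":
        if info.urgent:
            return "deliver_now"
        # group related alerts (e.g. 3 stock alerts) before delivering
        pending_alerts.append(info)
        return "deliver_batched" if len(pending_alerts) >= 3 else "hold"
    return "hold"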
“…right language”
Translation
Access to Multilingual Information

Language identification (from text, speech, handwriting)
Trans-lingual retrieval (query in one language, results in multiple
languages)
Full translation (e.g. of a web page, of search-result snippets, …)
  General reading quality (as targeted now)
  Focused on getting entities right (who, what, where, when mentioned)
Partial on-demand translation
  Requires more than query-word out-of-context translation to do well
  (see Carbonell et al., IJCAI 1997)
  Reading assistant: translation in context while reading an original
  document, by highlighting unfamiliar words, phrases, passages
On-demand text-to-speech
Transliteration
“…in the Right Language”

Knowledge-engineered MT
  Transfer-rule MT (commercial systems)
  High-accuracy interlingual MT (domain-focused)
Parallel-corpus-trainable MT
  Statistical MT (noisy channel, exponential models)
  Example-based MT (generalized G-EBMT)
  Transfer-rule learning MT (corpus & informants)
Multi-engine MT
  Omnivorous approach: combines the above to maximize coverage &
  minimize errors
Types of Machine Translation

[Figure: the MT pyramid. A source sentence (Arabic) goes through
syntactic parsing and semantic analysis up to an interlingua; sentence
planning and text generation produce the target (English). Transfer
rules cut across at the intermediate level; direct approaches such as
EBMT map source to target directly.]
EBMT Example

English:              I would like to meet her.
Mapudungun:           Ayükefun trawüael fey engu.

English:              The tallest man is my father.
Mapudungun:           Chi doy fütra chi wentru fey ta inche ñi chaw.

English:              I would like to meet the tallest man.
Mapudungun (new):     Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
Multi-Engine Machine Translation

MT systems have different strengths
  Rapidly adaptable: statistical, example-based
  Good grammar: rule-based (linguistic) MT
  High precision in narrow domains: KBMT
  Minority-language MT: learnable from an informant
Combine results of parallel-invoked MT engines
  Select the best of multiple translations
  Selection based on optimizing a combination of:
    Target-language joint-exponential model score
    Confidence scores of the individual MT engines
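A minimal sketch of that selection step: each engine proposes a
translation with its own confidence, and we pick the hypothesis
maximizing a weighted combination of target-language-model score and
engine confidence. lm_score is an assumed helper (e.g. a log-probability
from a target-language model), and alpha is an illustrative weight.

def select_translation(hypotheses, lm_score, alpha=0.6):
    # hypotheses: list of (translation_text, engine_confidence) pairs
    def combined(h):
        text, confidence = h
        return alpha * lm_score(text) + (1 - alpha) * confidence
    return max(hypotheses, key=combined)[0]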
Illustration of Multi-Engine MT

Source (Spanish), sent to three engines:
  El punto de descarge se cumplirá en el puente Agua Fria

Engine 1: The drop-off point will comply with The cold Bridgewater
Engine 2: The discharge point will self comply in the “Agua Fria” bridge
Engine 3: Unload of the point will take place at the cold water of bridge
State of the Art in MEMT for New “Hot” Languages

We can do now:
  Gisting MT for any new language in 2-3 weeks (given parallel text)
  Medium-quality MT in 6 months (given more parallel text, an
  informant, a bilingual dictionary)
  Improve-as-you-go MT
  Field MT systems on PCs

We cannot do yet:
  High-accuracy MT for open domains
  Cope with spoken-only languages
  Reliable speech-to-speech MT (but BABYLON is coming)
  MT on your wristwatch
“…right level of detail”
Summarization
Right Level of Detail

Automate summarization, with one-click hyperlink drilldown on
user-selected section(s).
Purpose-driven: summaries are in service of an information need, not
one-size-fits-all (as in Shaom’s outline and the DUC NIST evaluations).
  EXAMPLE: a summary of a 650-page clinical study can focus on
    effectiveness of the new drug for the target disease
    methodology of the study (control group, statistical rigor, …)
    deleterious side effects, if any
    target population of the study (e.g. acne-suffering teens, not
    eczema-suffering adults)
  …depending on the user’s task or information query.
Information Structuring and Summarization

Hierarchical multi-level pre-computed summary structure, or on-the-fly
drilldown expansion of info:
  Headline    <20 words
  Abstract    ~1% or 1 page
  Summary     5-10% or 10 pages
  Document    100%
Scope of summary:
  Single big document (e.g. a big clinical study)
  Tight cluster of search results (e.g. Vivisimo)
  Related set of clusters (e.g. conflicting opinions on how to cope
  with Iran’s nuclear capabilities)
  Focused area of knowledge (e.g. what’s known about Pluto? Lycos has
  a good project in this via HotBot)
  Specific kinds of commonly asked information (e.g. synthesize a bio
  on person X from any web-accessible info)
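A sketch of a pre-computed hierarchical summary structure with one-click
drilldown, mirroring the four levels above. The summarizer that produces
each level is assumed; here each node simply stores progressively longer
text.

from dataclasses import dataclass

@dataclass
class SummaryNode:
    headline: str   # < 20 words
    abstract: str   # ~1% of the document
    summary: str    # 5-10% of the document
    document: str   # full text, 100%

    def drill_down(self, level):
        # one click per level: headline -> abstract -> summary -> document
        return [self.headline, self.abstract,
                self.summary, self.document][level]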
Document Summarization: Types of Summaries

Task                        Query-relevant (focused)    Query-free (generic)
----------------------------------------------------------------------------
INDICATIVE, for filtering   Filter search-engine        Short abstracts
(Do I read further?)        results
CONTENTFUL, for reading     Solve problems for busy     Executive summaries
in lieu of the full doc     professionals
“…right medium”
Finding information in Non-textual Media
Indexing and Searching Non-textual (Analog) Content

Speech → text (speech recognition)
Text → speech
  TTS: FESTVOX is by far the most popular high-quality system
Handwriting → text (handwriting recognition)
Printed text → electronic text (OCR)
Picture → caption keywords (automatically) for indexing and searching
Diagrams, tables, graphs, maps → caption keywords (automatically)
Conclusion: What Is Text Mining?

Search documents, web, news
Categorize by topic, taxonomy
  Enables filtering, routing, multi-text summaries, …
Extract names, relations, …
  Who did what to whom, and where?
Summarize text, rules, trends, …
Detect redundancy, novelty, anomalies, …
Predict outcomes, behaviors, trends, …
Data Mining vs. Text Mining

Data Mining
  Data: relational tables
  DM universe: huge
  DM tasks:
    DB “cleanup”
    Taxonomic classification
    Supervised learning with predictive classifiers
    Unsupervised learning: clustering, anomaly detection
    Visualization of results

Text Mining
  Text: HTML, free form
  TM universe: ~10³ × the DM universe
  TM tasks: all the DM tasks, plus:
    Extraction of roles, relations, and facts
    Machine translation for multi-lingual sources
    Parse NL queries (vs. SQL)
    NL generation of results