CS276A
Text Information Retrieval, Mining, and Exploitation
Lecture 9
5 Nov 2002
Recap: Relevance Feedback
- Rocchio Algorithm
- Typical weights: alpha = 8, beta = 64, gamma = 64
- Tradeoff alpha vs. beta/gamma: if we have a lot of judged documents, we want a higher beta/gamma. But we usually don't …
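As a refresher, here is a minimal sketch of the Rocchio update in Python, using the weights above as defaults. The dict-based vector representation and helper name are illustrative assumptions, not the course's reference implementation:

```python
from collections import defaultdict

def rocchio(query_vec, relevant, nonrelevant, alpha=8.0, beta=64.0, gamma=64.0):
    """Rocchio update: q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    All vectors are dicts mapping term -> weight."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}
```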
Pseudo Feedback
[Diagram: the pseudo-feedback loop. The initial query is used to retrieve documents; the top k documents are labeled relevant; relevance feedback is applied to the query, and retrieval is repeated.]
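A minimal sketch of this loop, reusing the rocchio() sketch above; search() is a placeholder for any ranked-retrieval function that returns document vectors in ranked order:

```python
def pseudo_feedback(query_vec, search, k=10, rounds=1):
    """Pseudo (blind) relevance feedback: assume the top-k results are relevant,
    expand the query with them, and re-retrieve."""
    results = search(query_vec)
    for _ in range(rounds):
        top_k = results[:k]                        # pretend these are relevant
        query_vec = rocchio(query_vec, top_k, [])  # no non-relevant judgments
        results = search(query_vec)
    return results
```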
Pseudo-Feedback: Performance
Today's topics
- User Interfaces
- Browsing
- Visualization
The User in Information Access
[Diagram: the information access loop. Starting from an information need, the user finds a starting point, formulates/reformulates a query, sends it to the system, receives results, and explores them; if done, stop, otherwise reformulate and repeat.]
The User in Information Access
[The same information access loop diagram, with the "send to system / receive results" step highlighted: this is the focus of most IR!]
Information Access in Context
[Diagram: information access in context. Driven by a high-level goal, the user loops through information access, analysis, and synthesis until done.]
The User in Information Access
[The information access loop diagram again; we now walk through its stages, beginning with finding a starting point.]
Starting points
- Source selection
  - Highwire Press
  - Lexis-Nexis
  - Google!
- Overviews
  - Directories/hierarchies
  - Visual maps
  - Clustering
Highwire Press
Source Selection
Hierarchical browsing
[Diagram: a category tree browsed top-down, from Level 0 to Level 1 to Level 2.]
Visual Browsing: Themescape
Browsing
[Diagram: browsing as a walk through a network of documents (x's), from a starting point to the answer. Credit: William Arms, Cornell]
Scatter/Gather
- Scatter/Gather allows the user to find a set of documents of interest through browsing.
- Take the collection and scatter it into n clusters.
- Pick the clusters of interest and merge them.
- Iterate (a minimal sketch of the loop follows below).
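A minimal sketch of one Scatter/Gather iteration. The original system used fast custom clustering (Buckshot/Fractionation); this sketch assumes scikit-learn is available and simply uses k-means over tf-idf vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def scatter(docs, n_clusters=5):
    """Scatter a document collection into n clusters; return a list of document lists."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = [[] for _ in range(n_clusters)]
    for doc, label in zip(docs, labels):
        clusters[label].append(doc)
    return clusters

def gather(clusters, selected):
    """Gather: merge the clusters the user picked into one sub-collection."""
    return [doc for i in selected for doc in clusters[i]]

# One iteration: scatter, let the user pick clusters, gather, scatter again.
# clusters = scatter(collection)
# subset = gather(clusters, selected=[0, 2])
# clusters = scatter(subset)
```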
Scatter/Gather
Scatter/gather
How to Label Clusters
- Show titles of typical documents
  - Titles are easy to scan
  - Authors create them for quick scanning!
  - But you can only show a few titles, which may not fully represent the cluster
- Show words/phrases prominent in the cluster
  - More likely to fully represent the cluster
  - Use distinguishing words/phrases (see the sketch below)
  - But harder to scan
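One simple way to pick distinguishing words is to score each term by how much more frequent it is in the cluster than in the whole collection. This is just one heuristic sketch, not the labeling method used in Scatter/Gather:

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top_n=5):
    """Return terms that are unusually frequent in the cluster relative to the collection."""
    def term_freqs(docs):
        counts = Counter(w.lower() for d in docs for w in d.split())
        total = sum(counts.values()) or 1
        return {t: c / total for t, c in counts.items()}

    in_cluster = term_freqs(cluster_docs)
    in_collection = term_freqs(all_docs)
    # Ratio of in-cluster frequency to collection frequency (lightly smoothed).
    scored = {t: f / (in_collection.get(t, 0.0) + 1e-6) for t, f in in_cluster.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_n]]
```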
Visual Browsing: Hyperbolic Tree
Visual Browsing: Hyperbolic Tree
Study of Kohonen Feature Maps
H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)
Comparison: Kohonen Map and Yahoo
Task: "window shop" for an interesting home page, then repeat with the other interface.
Results:
- Starting with the map, users could repeat the find in Yahoo (8/11)
- Starting with Yahoo, users were unable to repeat it in the map (2/14)
Credit: Marti Hearst
Study (cont.)
- Participants liked:
  - Correspondence of region size to # documents
  - Overview (but also wanted zoom)
  - Ease of jumping from one topic to another
  - Multiple routes to topics
  - Use of category and subcategory labels
Credit: Marti Hearst
Study (cont.)
- Participants wanted:
  - Hierarchical organization
  - Other orderings of concepts (alphabetical)
  - Integration of browsing and search
  - Correspondence of color to meaning
  - More meaningful labels
  - Labels at the same level of abstraction
  - More labels fit into the given space
  - Combined keyword and category search
  - Multiple category assignment (sports + entertainment)
Credit: Marti Hearst
Browsing
- Effectiveness depends on
  - Starting point
  - Ease of orientation (are similar docs "close", is the organization intuitive?)
  - How adaptive the system is
- Compare to physical browsing (library, grocery store)
Searching vs. Browsing
- Information need dependent
  - Open-ended (find an interesting quote on the virtues of friendship) -> browsing
  - Specific (directions to Pacific Bell Park) -> searching
- User dependent
  - Some users prefer searching, others browsing (confirmed in many studies: some hate to type)
  - You don't need to know the vocabulary for browsing.
- System dependent (some web sites don't support search)
- Searching and browsing are often interleaved.
Searchers vs. Browsers
- 1/3 of users do not search at all
- 1/3 rarely search (or use URLs only)
- Only 1/3 understand the concept of search
- (ISP data from 2000)
Exercise
- Observe your own information seeking behavior
  - WWW
  - University library
  - Grocery store
- Are you a searcher or a browser?
- How do you reformulate your query?
  - Read bad hits, then minus terms
  - Read good hits, then plus terms
  - Try a completely different query
  - …
The User in Information Access
[The information access loop diagram again; next stage: formulating/reformulating the query.]
Query Specification
- Recall:
  - Relevance feedback
  - Query expansion
  - Spelling correction
  - Query-log mining based
- Interaction styles for query specification
- Queries on the Web
- Parametric search
- Term browsing
Query Specification: Interaction Styles
- Shneiderman 97:
  - Command Language
  - Form Fillin
  - Menu Selection
  - Direct Manipulation
  - Natural Language
- Example: how does each apply to Boolean queries?
Credit: Marti Hearst
Command-Based Query Specification
- command attribute value connector …
  - find pa shneiderman and tw user# (parsed in the sketch below)
- What are the attribute names?
- What are the command names?
- What are the allowable values?
Credit: Marti Hearst
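A toy sketch of parsing such a command query into a command plus (attribute, value) clauses. The field codes ('pa' for personal author, 'tw' for title word) and '#' as a truncation symbol follow old OPAC conventions and are assumptions about the example, not a specification:

```python
def parse_command_query(command):
    """Parse 'find pa shneiderman and tw user#' into a command plus
    (attribute, value) clauses joined by 'and'."""
    tokens = command.split()
    cmd, rest = tokens[0], tokens[1:]
    clauses, i = [], 0
    while i < len(rest):
        attribute = rest[i]
        i += 1
        value_tokens = []
        while i < len(rest) and rest[i] != "and":
            value_tokens.append(rest[i])
            i += 1
        if i < len(rest) and rest[i] == "and":
            i += 1
        clauses.append((attribute, " ".join(value_tokens)))
    return cmd, clauses

# parse_command_query("find pa shneiderman and tw user#")
# -> ("find", [("pa", "shneiderman"), ("tw", "user#")])
```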
Form-Based Query Specification (AltaVista)
Credit: Marti Hearst
Form-Based Query Specification (Melvyl)
Credit: Marti Hearst
Form-Based Query Specification (Infoseek)
Credit: Marti Hearst
Menu-Based Query Specification (Young & Shneiderman 93)
Credit: Marti Hearst
Query Specification/Reformulation
- A good user interface makes it easy for the user to reformulate the query.
- Challenge: one user interface is not ideal for all types of information needs.
Types of Information Needs
- Need an answer to a question (who won the game?)
- Re-find a particular document
- Find a good recipe for tonight's dinner
- Authoritative summary of information (HIV review)
- Exploration of a new area (browse sites about Baja)
Queries on the Web
Most Frequent on 2002/10/26
Queries on the Web (2000)
Intranet Queries (Aug 2000)
 3351  bearfacts             773  bookstore
 3349  telebears             741  class+pass
 1909  extension             738  housing
 1874  schedule+of+classes   721  tele-bears
 1780  bearlink              716  directory
 1737  bear+facts            667  schedule
 1468  decal                 627  recipes
 1443  infobears             602  transcripts
 1227  calendar              582  tuition
  989  career+center         577  seti
  974  campus+map            563  registrar
  920  academic+calendar     550  info+bears
  840  map                   543  class+schedule
                             470  financial+aid
Source: Ray Larson
Intranet Queries
- Summary of sample data from 3 weeks of UCB queries
  - 13.2%  Telebears/BearFacts/InfoBears/BearLink (12297)
  -  6.7%  Schedule of classes or final exams (6222)
  -  5.4%  Summer Session (5041)
  -  3.2%  Extension (2932)
  -  3.1%  Academic Calendar (2846)
  -  2.4%  Directories (2202)
  -  1.7%  Career Center (1588)
  -  1.7%  Housing (1583)
  -  1.5%  Map (1393)
- Average query length over the last 4 months: 1.8 words
- This suggests what is difficult to find from the home page.
Source: Ray Larson
Query Specification: Feast or Famine
[Two result screenshots: "Feast" (a flood of hits) and "Famine" (almost none)]
- Specifying a well-targeted query is hard.
- This is a bigger problem for Boolean queries.
Parametric search
- Each document has, in addition to text, some "meta-data", e.g.,
  - Language = French
  - Format = pdf
  - Subject = Physics
  - Date = Feb 2000
- A parametric search interface allows the user to combine a full-text query with selections on these parameters, e.g., language, date range, etc. (see the sketch below).
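A minimal sketch of parametric search over an in-memory collection: filter on the metadata parameters first, then run the text match over the survivors. The record layout, field names, and the naive substring match are illustrative assumptions, not how a real engine indexes metadata:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    meta: dict = field(default_factory=dict)  # e.g. {"language": "French", "format": "pdf"}

def parametric_search(docs, text_query, **params):
    """Combine a full-text query with exact-match selections on metadata fields."""
    hits = []
    for doc in docs:
        if all(doc.meta.get(k) == v for k, v in params.items()):  # metadata filter
            if text_query.lower() in doc.text.lower():            # naive text match
                hits.append(doc)
    return hits

# Usage with hypothetical data:
# docs = [Document("Notes on quantum mechanics", {"language": "French", "format": "pdf"})]
# parametric_search(docs, "quantum", language="French", format="pdf")
```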
Parametric search example
Notice that the output is a (large) table.
Various parameters in the table (column headings) may be clicked on to sort the results.
Parametric search example
We can add text search.
Interfaces for term browsing
The User in Information Access
[The information access loop diagram again; next stage: exploring results.]
Explore Results
- Determine: do these results answer my question?
  - Summarization
  - More generally: provide context
  - Hypertext navigation: can I find the answer by following a link?
- Browsing and clustering (again)
  - Browse to explore results
Explore Results: Context
- We can't present complete documents in the result set – too much information.
- Present information about each doc
  - Must be concise (so we can show many docs)
  - Must be informative
- Typical information about each document
  - Summary
  - Context of query words
  - Meta-data: date, author, language, file name/URL
  - Context of document in collection
Context in Collection: Cha-Cha
Category Labels
- Advantages:
  - Interpretable
  - Capture summary information
  - Describe multiple facets of content
  - Domain dependent, and so descriptive
- Disadvantages:
  - Do not scale well (for organizing documents)
  - Domain dependent, so costly to acquire
  - May mismatch users' interests
Credit: Marti Hearst
Evaluate Results
Context in Hierarchy: Cat-a-Cone
Explore Results: Summarization
- Query-dependent summarization
  - KWIC (keyword in context) lines (à la Google); see the sketch below
- Query-independent summarization
  - Summary written by author (if available)
  - Exploit genre (news stories)
  - Sentence extraction
  - Natural language generation
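A minimal sketch of KWIC-style snippet extraction: show each occurrence of a query term with a window of surrounding words. The window size, tokenization, and line limit are arbitrary choices for illustration:

```python
import re

def kwic_lines(text, query_terms, window=5, max_lines=3):
    """Return short 'keyword in context' snippets for query term occurrences."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    terms = {t.lower() for t in query_terms}
    lines = []
    for i, tok in enumerate(lowered):
        if tok in terms:
            left = max(0, i - window)
            snippet = " ".join(tokens[left:i + window + 1])
            lines.append("… " + snippet + " …")
            if len(lines) >= max_lines:
                break
    return lines

# kwic_lines("Relevance feedback often improves recall in interactive retrieval.", ["feedback"])
```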
Evaluate Results
Structure of document: SeeSoft
Personalization
Outride Personalized Search System
[Architecture diagram: the user query plus a personal profile (interests, demographics, click stream, search history, application usage) feed query augmentation; intranet and web search return a result set that goes through result processing and is displayed in the Outride side bar interface.]
- Outride schema: User x Content x History x Demographics
- Search engine schema: Keyword x Doc ID x Link Rank
How Long to Get an Answer?
[Bar chart: average task completion time in seconds – Outride 38.9, Google 75.4, Yahoo! 81, Excite 83.5, AOL 89.6]
SOURCE: ZDLabs/eTesting, Inc. October 2000
Search Engine   User Actions   Difference (%)
Outride             11.2
Google              21.2            89.6
Yahoo!              22.4           100.5
AOL                 23.1           107.0
Excite              23.3           108.5
Average             22.5           101.4

Table 1. User actions study results.
Engine    Expert Time   Novice Time   Average      % Difference
Outride   32.8 (1)      45.1 (1)      38.9 (1)       0%
AOL       92.3 (5)      87.0 (4)      89.6 (5)     130.2%
Excite    75.7 (3)      91.3 (5)      83.5 (4)     114.5%
Google    72.5 (2)      78.4 (3)      75.4 (2)      93.7%
Yahoo!    85.1 (4)      76.9 (2)      81.0 (3)     107.9%

Table 2. Overall timing results (in seconds, with placement in parentheses).
SOURCE: ZDLabs/eTesting, Inc. October 2000
Novices versus Experts (Average Time to Complete Task)
[Bar chart, time in seconds by user skill level: Novices – Others 91.30, Outride 45.07; Experts – Others 75.70, Outride 32.83]
SOURCE: ZDLabs/eTesting, Inc. October 2000
Performance of Interactive Retrieval
Boolean Queries: Interface Issues
- Boolean logic is difficult for the average user.
- Much research was done on interfaces facilitating the creation of Boolean queries by non-experts.
- Much of this research was made obsolete by the web.
- The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean queries (pioneered by AltaVista); see the sketch below.
- But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
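A minimal sketch of evaluating a simple +/- query in the AltaVista style against one document: '+' terms are required, '-' terms are forbidden, bare terms only contribute to the score. The count-of-matches scoring is an illustrative placeholder:

```python
def matches_plus_minus(query, doc_text):
    """Return (matches, score) for a simple +term / -term query against one document."""
    doc_terms = set(doc_text.lower().split())
    score = 0
    for token in query.lower().split():
        if token.startswith("+"):
            if token[1:] not in doc_terms:   # required term missing
                return False, 0
            score += 1
        elif token.startswith("-"):
            if token[1:] in doc_terms:       # forbidden term present
                return False, 0
        else:
            score += token in doc_terms      # optional term: just add to the score
    return True, score

# matches_plus_minus("+relevance feedback -spam", "relevance feedback in interactive retrieval")
```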
User Interfaces: Other Issues
- Technical HCI issues
  - How to use screen real estate
  - One monolithic window or many?
  - Undo operator
  - Give access to history
- Alternative interfaces for novice/expert users
- Disabilities
Take-Away
- Don't ignore the user in information retrieval.
- Finding matching documents for a query is only part of information access and "knowledge work".
- In addition to core information retrieval, information access interfaces need to support
  - Finding starting points
  - Formulation/reformulation of queries
  - Exploring/evaluating results
Exercise
- Current information retrieval user interfaces are designed for typical computer screens.
- How would you design a user interface for a wall-size screen?
Resources
MIR Ch. 10.0 – 10.7
Donna Harman. Overview of the Fourth Text REtrieval Conference (TREC-4). National Institute of Standards and Technology.
Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR.
Hearst. Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results in a large category hierarchy. ACM SIGIR.