CSA1013
Historical Perspectives of
Information Search and Retrieval
Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
1 of 24
CSA1013:Information Search and Retrieval
[email protected]
© 2003- Chris Staff
University of Malta
Aims and Objectives
• What is Information Search and Retrieval?
• What’s the “state-of-the-art”?
• How did we get here?
• What are the issues?
• Where are we likely to go next?
What’s Information Search and
Retrieval?
• What’s information?
– Structured vs. unstructured
• Where is it?
• Question answering vs. Information lack or
information need
What’s the “state-of-the-art”?
• Information Retrieval in the “real” world
– Web-based search engines
• Google, AllTheWeb, AltaVista, etc.
• Web directories
– Yahoo, Excite, etc.
What’s the “state-of-the-art”?
• Google, and Google-like search engines
– Index > 24 billion web pages (pdf, doc, html, …)
– User expresses “Query”
• terms, natural language query, etc.
– System “compares” query to indexed documents
– Returns “list” of “relevant” documents
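The index/query/compare/return loop above can be sketched in a few lines. This is a toy illustration only, not how Google or any real engine works: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the scoring rule (sum of matching term counts) are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; punctuation handling is omitted."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: map each term to {doc_id: occurrences of the term}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query, top_k=10):
    """'Compare' the query to the indexed documents and return a ranked list.

    Assumed scoring rule: a document's score is the summed count of the
    query terms it contains. Real engines use far richer signals.
    """
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id so output is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical three-document collection.
docs = {
    "d1": "papyrus scrolls in the library of alexandria",
    "d2": "the printing press and the book index",
    "d3": "web search engines index web pages",
}
index = build_index(docs)
print(search(index, "web index"))
```

With these sample documents, the two-term query ranks d3 ahead of d2 because d3 matches both terms; d1 matches neither and is not returned.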
What’s the “state-of-the-art”?
• Recent study by Jansen & Spink [Jansen] shows:
– Average query length |Query| = 2.14 terms [Spink]
– Queries with 1 term = 53%!
– 54% of users are satisfied with the first page of results (a list of 10 documents)
– 80% of users view no more than 10-20 results
– 27.6% read only one document!
– 66% read < 5 documents
Has life always been this good?
• It would seem that we’re living in information heaven
• Any info we seek is just a couple of query terms away
• Although the majority of queries appear to be “trivial”, the reality is quite different
Has life always been this good?
• What if we want to find all relevant
information? (“The Invisible Web”)
• What if we want to find something that is
difficult to describe?
• What if we don’t know what we’re looking
for?
– What tools do we use to find info in
encyclopaedias, dictionaries, newspapers,
reference manuals, novels and other books?
Here beginneth the history
lesson…
• People have devised tools to find
information again ever since we learnt to
write things down…
• Think of information stored on your
personal computers… how do you find
something that you wrote last month, last
year?
Prehistory!
• Well, nearly!
• Early writings
– Papyrus scrolls
– No paragraph, page numbers, etc.
– Couldn’t “scroll to the end” to read an index
– Instead, Greek/Roman libraries used “sillybus”/“index” of title
Greeks/Romans
• 3rd century BC: Greeks probably use alphabetization in the Library of Alexandria
• Around the 2nd century BC (Rome), evidence of hierarchies of information/classification systems
– Greeks probably earlier
• Also, Tables of Contents date from around the 2nd century BC (Pliny the Elder reports before 79AD)
Printing Press
• Not much else was to happen until 1455,
with the advent of the printing press
• Previously, still difficult to refer to
information “within” a book, because
copies were inaccurate
– Info on one page in one book could be on a
different page in other copies
Indices and the Printing Press
• Still, alphabetization was on initial letter,
then on first four letters…
• Not until 18th Century did full
alphabetization occur!
The Second World War and beyond
• In 1945, Vannevar Bush publishes “As We May
Think” in the Atlantic Monthly
• In 1949, Warren Weaver writes that if Chinese is
English + codification, then Machine Translation
should be possible
• These give rise to “intelligent” and “statistical” (or
surface-based) approaches to Information Search
and Retrieval respectively (amongst other things :))
Intelligent vs. Surface-based
Intelligent (“Concepts”), 1950’s
• Lay waiting for years, because the hardware/software was not around
Surface-based (“Words”), 1950’s
• First approaches were “Key Words in Context” (KWIC)
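As a rough illustration of the KWIC idea, the sketch below lists every significant word of every title alphabetically, each shown with the words to its left and right. The function name `kwic_index`, the tiny stopword list, and the sample titles are assumptions for the example; historical KWIC systems differed in layout and in how they chose keywords.

```python
# Illustrative stopword list only; not a standard one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def kwic_index(titles):
    """Return (left context, keyword, right context) entries, sorted by keyword.

    Each title contributes one entry per significant (non-stopword) word.
    """
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOPWORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((word.lower(), left, word, right))
    # Alphabetical by keyword, so a reader can scan for a term and see its context.
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    return [(left, word, right) for _, left, word, right in entries]


# Hypothetical input: two short titles.
titles = ["Key Words in Context", "Information Search and Retrieval"]
for left, word, right in kwic_index(titles):
    # Right-align the left context so the keywords form a column.
    print(f"{left:>25} | {word} | {right}")
```

The approach is purely surface-based: it needs no understanding of the titles, only the words themselves, which is why it was feasible on 1950’s hardware.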
Intelligent vs. Surface-based
Intelligent, 1960’s
• Generality in AI (John McCarthy)
Surface-based, 1960’s
• Boolean Search
• Measures of performance effectiveness
• Thesaural Lookup
• Vector Space Model
Intelligent vs. Surface-based
Intelligent, 1970’s
• Expert Systems
• Still about “understanding” information and reasoning with and about it
Surface-based, 1970’s
• Explosion in availability of electronic text collections
• Library Retrieval Systems
• Full-text indexing
• Probabilistic IR
• Relevance Feedback
Intelligent vs. Surface-based
Intelligent, 1980’s
• Conceptual IR
• Knowledge Rep Langs
• Lenat’s CYC
• Contextual Reasoning
• 5th Generation Computing, Japan
• LSI feeds Statistical IR
Surface-based, 1980’s
• OPACs
• IR used by nonspecialists
• Extended Boolean IR
• Word Sense Disambiguation
• Statistical IR (LSI, etc.)
• Internet
Intelligent vs. Surface-based
Intelligent, 1990’s
• Better language processing
• Information extraction
• Entity name recognition
• Advances in contextual reasoning, ontologies
Surface-based, 1990’s
• WWW (1995 c. 10M pages, 2003 c. 3B!)
• Multimedia Indexing & Retrieval
• Web-based search engines
Intelligent vs. Surface-based
Intelligent, 2000’s
• Semantic Web
Surface-based, 2000’s
• Faster processors
• More memory
• Cheaper storage space
• More superficial comparisons
Intelligent vs. Surface-based
Intelligent, the future
• Computers that can find precisely the information you seek
– Even if the answer is non-obvious
– Or the answer needs to be the result of reasoning
• MyLifeBits
Surface-based, the future
• Computers that can approximate the information you seek
– At much less cost
– At the expense of “correctness”
• MyLifeBits
Main Issues
• Architecture to handle ever increasing
numbers of docs + efficient data structures
• Freshness, indexing and retrieval speed
(Efficient algorithms)
• What is “relevance”? (Better, cheaper and
more accurate algorithms to understand
what the user really wants)
Main References
• Paijmans, J.J., last updated 2004, “The Retrieval of Information from historical perspective”, http://pi0959.kub.nl/Paai/Onderw/VI/Content/history.html
• American Society of Indexers, last updated 2005, “How Information Retrieval Started”, http://www.asindexing.org/site/history.shtml
• [Jansen] Jansen, B.J., and Spink, A., 2003, “An Analysis of Web Documents Retrieved and Viewed”, in Proceedings of the 4th International Conference on Internet Computing, Las Vegas, Nevada, 23-26 June 2003. http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/pages_viewed.pdf
• [Spink] Spink, A., et al., 2001, “Searching the Web: The Public and their Queries”, in JASIST 2001. http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html