Search and the ‘Net in 2007
Trends, Challenges and Cutting-Edge
Developments in Internet Search
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
For
Rochester Regional Library Council
Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library 2007
For Today . . . . . . .


The Landscape of Search in 2007
A Look at the Major Services







Google, Yahoo!, Ask, Windows Live (MSN)
Test Drive Time
New Services
Wikipedia: Looking Under the Hood
Tagging and Search
Explore on Your Own
Current Trends, Future Directions
Web Search in 2007
Who’s crawling the Web?

Yahoo






Owns AlltheWeb, AltaVista, Inktomi, Overture
Google
Live Search (MSN)
Ask owns Teoma
Gigablast
NOTE: Ownership is different from database
affiliation
Google
Database Affiliates
Google
AOL
Excite
Netscape
The Indexable Web
(Gulli and Signorini, 2005)




Defined as that part of the Web
available to be crawled by search
engines
Estimated at more than 11.5 billion
pages
Based on sample data set of about a
half-million URL’s
Gulli, A and Signorini, A “The Indexable Web is More than 11.5
Billion Pages” in Proceedings of 14th International World Wide
Web Conference p. 900-1. Chiba, Japan, 2005 available
http://www.cs.uiowa.edu/~asignori/web-size/
Percent of sample indexed by
each engine
Google – 76%
Yahoo! – 69%
Live (MSN) – 62%
Ask – 58%
Estimated Size of 4 Major
Services (Gulli and Signorini, 2005)
[Bar chart, scale 0-100%: Google, Yahoo!, Ask, Live (MSN)]
Share of Searches – April 2007
Source: Nielsen/NetRatings 2007
[Bar chart, scale 0-60: Google, Yahoo, Live, AOL, Ask, All Others]
Web Composition by TLD
(Koehler, 2004)
Web Document Persistence
(Koehler, 2004)




Collection of randomly selected URL’s
After a 4-year period two-thirds could not
be accessed
The remaining one-third continue to be
stable for a total of 6 years
Legal, scholarly and educational sites have
“limited lifecycles not dissimilar to web
sites in general”
Koehler, W. “A longitudinal study of Web pages..”
Information Research v. 9 no. 2 Jan. 2004 available
http://informationr.net/ir/9-2/paper174.html

The Major Services
Google, Yahoo!, Ask and Live
What is Google these days???

Print, radio advertising company?

Deals with Viacom and others

E-mail utility (gmail), eBay clone (Base),
TV network (YouTube)?
Bank, video store?
Microsoft killer (Docs & Spreadsheets)?
Force in US and world politics?

Losing its laser-like focus on search???



New & Notable at Google in
Search






Usage Rights limit (Adv. Search)
Suggest (labs.google.com)
Transit (labs)
Froogle Mobile (labs)
News Alerts (www.google.com/alerts)
iGoogle (re-design)
iGoogle
formerly the Personalized Homepage


Personalized Homepage (PHP) available since 2004
Part of G’s mission





Search your own stuff (desktop)
Traditional web search (unmediated)
Mediated web search (your preferences,
search history, G’s recommendations, etc)
IP geolocation as filter for results
Facilitates use of Google Gadgets
Google gadgets


Currently over 250,000 gadgets
available for your iGoogle page
GadgetMaker allows you to create 7
different types, “with no programming”
Photo sharing
GoogleGram
Daily Me notepad Countdown gadget
Personal list
YouTube video favorites
“Free Form” (meld text and images)
New & Notable at Google
Beyond Search


Notebook (labs) Web clipping and note-taking
service
Accessible Search (labs)


Favors ad-free search results that render well
for machine readers used by blind and visually
impaired users
Docs & Spreadsheets (docs.google.com)


Free web-based word processing and
spreadsheet programs (Web 2.0)
Create, update, store, share content in real time
Google Desktop 3.0’s
“Search Across Computers”




Allows users to search across all their
computers
Requires user to install and configure the
feature
G uploads files from your computers, indexes
them, transfers them to your other
computers, then deletes them from its servers
All computers involved must be online at the
time
Google Desktop 3.0’s
“Search Across Computers”



If one computer is not online no data
transfer can occur and the files remain on
G’s servers for up to 30 days, when they
are deleted.
If service is deactivated “some personal
account information” may stay on G
servers for up to 60 days
“Gadget” apps available: news, weather,
animations, etc.
Google Video Search/Store
video.google.com



Index of closed captioning and text
descriptions from selected TV and other
video content after Dec. 2004
Results can include preview or full view,
description, source, date, duration and
hyperlink
Advanced Search limits
Language
38 “genres”
Length
Free or For Sale
Google’s Aug. 22 Ranking Patent
Query Themes & Editorial Opinion



Relevance ranking processing patent
granted to Google
Is Google interested in direct human
intervention in results ranking (???)
Query Themes – topics commonly occurring
in search queries identified, e.g.
Free software download sites
Travel accommodation sites
Editorial opinion parameters

For any given query theme human editors





Survey user search query logs
Examine search results lists
Identify sets of sites that are “favored” and
“non-favored”
Favored - non-spamming sites verified to
offer content relevant to the query theme or
originate from a reputable subject directory
Non-favored – sites exhibiting spam or other
deceptive characteristics
Google Print’s Library Project



Confusion and uncertainty over
copyright issues cloud Publisher and
Library Projects
U. of California system joined Library
Project in August
“Find this book in a library” link to OCLC
Worldcat holdings, searchable by zip
code, state, country
World Digital Library






Joint venture of Google and Library of Congress
Initially an expansion of LC’s American Memory
Standards and cooperative structures to be
worked out
UNESCO endorsement sought
One to keep your eye on
http://www.digitalpreservation.gov/about/index.html
Google’s Legal Challenges



Google is a magnet for lawsuits
Any lawsuit that reaches the pre-trial
“discovery” phase can threaten the
secrecy of G.’s proprietary software
Recent cases have centered on
trademark/advertising and other
copyright issues
Child Online Protection Act of 1998


Justice Dept: Parental controls and filters
insufficient to protect children against online
pornography. Stricter governmental controls
needed
Aug., 2005 – G, Y, Microsoft and AOL were served
subpoenas for all data relating to search
terms and the sites users visited between
June 1 and July 31, 2005
Child Online Protection Act of 1998




Y, MSN and AOL “have provided some of the
information requested and taken steps to
guard users’ privacy” G refused
To date no request for IP address or other
data linking search behavior to individual
users
DoJ request upheld by US Dist. Court
3/18/06, but reduced from 1 million search
results to 50 K, with 5,000 random search
queries
Google will comply
Implications


For Users – Invasion of privacy/search
behavior, online identity, 1st Amendment
For Search Engine Industry –



R&D focused on offering search
results customized to an individual
Requires tracking individual’s search
behavior
Can privacy be guaranteed?
Yahoo!
Three ways in

www.yahoo.com
Portal home page (all services)

search.yahoo.com
Crawler only

dir.yahoo.com
Subject directory only

“People mediated” search via tagging and
personalization
New & Notable at Yahoo!


Y!Q (toolbar)
Mindset (Disambiguation)



http://mindset.research.yahoo.com/
Music Engine & other verticals
Recommendations: Movies, Music,
etc.
New & Notable at Ask





The Butler is gone! Teoma is in his place!
Smart Search
Web Answers
Zoom
Superior Mapping Tools
Windows Live Search
MSN’s live.com






Launched Sept. 2006
Successor to MSN Search
Maintains its own database
Tabs for Web, News, Images, Questions
and Answers
Also Local, Video, (RSS) Feeds
Academic – academic.live.com
Current strengths: Computer Science,
Physics, Electrical Engineering
Live.com




Advanced Search available only after a
search has been done
Full Boolean and nesting
Limits: Site/Domain, Country/Region,
Filetype (6), Languages (37), SafeSearch
Default S.E. bundled in IE 7
Live vs. others
Result counts on 4/23/07 (change since 12/1/06)

Web Search - Imre Kertesz



G – 417,000 +15%
Y – 184,000 +7%
L – 37,266 +9%
A – 11,800 +33%
“Academic” vs “Scholar” verticals
For “string theory”
Scholar – 85,000 -6%
Academic – 10,470 +30%
For Siena frescoes
Scholar – 3,420 +7%
Academic – 32 +113%
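
The percentages on this slide compare result counts on two dates. As a minimal sketch of the arithmetic (the function names are illustrative, and the earlier counts are back-calculated from the slide's rounded percentages, so they are only approximate):

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("percent change is undefined when the old count is 0")
    return (new - old) / old * 100.0


def implied_previous(new: float, pct: float) -> float:
    """Back out the earlier count implied by a new count and its percent change."""
    return new / (1.0 + pct / 100.0)


# Counts and percentages as shown on the slide (4/23/07 vs. 12/1/06).
reported = {"G": (417_000, 15), "Y": (184_000, 7), "L": (37_266, 9), "A": (11_800, 33)}

for engine, (count, pct) in reported.items():
    earlier = implied_previous(count, pct)
    print(f"{engine}: {count:,} now, roughly {earlier:,.0f} on 12/1/06")
```

A large percentage on a small base (Academic's +113% on 32 results) can mean far fewer added pages than a small percentage on a large base.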

Live’s Collections

User-saved search results featuring
locations or points or objects of interest





Best sushi in NY
Stops along old Rt. 66
Listings may be annotated with personal
notes
Collections can be shared
Will become a user-created directory
(Web 2.0)
New Services
Exalead - www.exalead.com





Launched October 2004, based in France
Maintains its own database
Smaller than most US services (8 billion)
Offers “Narrowing Options”
Advanced features:



Phonetic spelling with “soundslike”
Approximate spelling with “spellslike”
Limits: Site (URL), Filetype (8), Adult
content, Language (57!!!)
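
How Exalead implements “soundslike” is not described here. As an illustration of phonetic matching in general, the sketch below uses classic Soundex, a different and much older technique, which maps names that sound alike to the same four-character code:

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat after a vowel counts
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ("Kertesz", "Kertes", "Robert", "Rupert"):
    print(name, soundex(name))
```

A phonetic search would index each term under its code, so a query for one spelling also retrieves pages that use the other.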
Factbites – www.factbites.com



Returns relevant, whole sentences from
sites retrieved, not just snippets
Incorporates clustering
Based in Australia
eTools – eTools.ch
Meta Engine






Searches Google, Yahoo, Live, Ask, AV,
Lycos and 4 European engines
Parses each query to conform to each
engine’s search syntax
Limit by country (8) and language (5)
Change weighting of results from each
source engine
Results clustered by “topic” and source
Save Results as pdf or rss
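
One way a meta engine could let users change the weighting of each source is to score every URL by its rank in each engine's list, multiplied by that engine's weight. The sketch below is only an assumption about how such merging might work, not eTools' actual method; the scoring rule (weighted reciprocal rank) and the example URLs are hypothetical.

```python
from collections import defaultdict


def merge_results(results_by_engine, weights):
    """Merge ranked URL lists into one list using weighted reciprocal ranks.

    results_by_engine: dict of engine name -> list of URLs, best first.
    weights: dict of engine name -> non-negative weight (default 1.0).
    """
    scores = defaultdict(float)
    for engine, urls in results_by_engine.items():
        weight = weights.get(engine, 1.0)
        for rank, url in enumerate(urls, start=1):
            scores[url] += weight / rank
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists for one query.
results = {
    "Google": ["a.example", "b.example", "c.example"],
    "Yahoo": ["b.example", "a.example", "d.example"],
    "Ask": ["d.example", "b.example"],
}

print(merge_results(results, {"Google": 1.0, "Yahoo": 1.0, "Ask": 1.0}))
print(merge_results(results, {"Google": 1.0, "Yahoo": 1.0, "Ask": 3.0}))
```

With equal weights the URL that ranks well in all three lists comes first; tripling Ask's weight moves Ask's top result to the head of the merged list.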
Wikipedia:
Looking Under the Hood
Wikipedia’s Many Faces



Free, international, open content online
encyclopedias created and edited by
thousands of volunteers in wiki format
A world-wide online community
dedicated to sharing information freely
3rd most popular news and information
source on the Web (Nielsen, June 2006)
Background and History





2000 – Jimmy Wales, successful options
trader, began Nupedia, an online
encyclopedia “with content by experts”
2001 – Wikipedia started as a “holding pen”
for Nupedia content awaiting review
Now maintained by Wikimedia Foundation
Financed mainly by donations with current
annual budget of $ 1.5 million
Growth rate of 13-18 % per month
Wikipedias Worldwide



Over 250, each in a different language
Content is unique to each Wikipedia
Article counts (7/5/07)







English – 1,864,000+
German – 606,000+
French – 521,000+
Polish – 398,000+
Japanese – 387,000+
Dutch – 314,000+
Languages include Nahuatl, Sardu, Tetun,
Wolof
Structure and Features






Articles and relevant links
Article Talk page – for discussions related
to that article only
User page – Each registered user can
introduce themselves and chat with others
Media namespace – Uploaded image and
media files described
Category namespace – Categories and
tags that can be assigned to articles
“Watch List” – Users notified of changes to
specific articles
Authorship






Most content contributed by a core of
about 1,000 “regular, registered users”
IP addresses of anonymous editors are
recorded in the page history
Registered users’ IP’s are concealed
Almost ½ of articles have fewer than 5
distinct authors
About ¼ of articles have only one author
Articles average 2.7 authors each
Editorial Control (?)




Volunteer “administrators”
 Monitor changes in a section or topic area
 Arbitrate conflicts i.e. “edit wars” and decide
when to “protect” an article from further revision
Peer Review Status - granted by a larger number
of reviewers as a sign of higher quality
Featured Content Status (“The Best of W.”) – peer-reviewed
articles selected for this honor by further review and
labeled with a star (18-20 per month)
Featured Portals – Large subject metasites of
high quality
Nature’s Wikipedia Study (2005)




50 entries on scientific topics from W.
and latest Britannica sent to relevant
experts
Articles’ sources not revealed (blind
study)
Only 8 serious errors identified: 4 from
each source
“Factual errors, omissions or misleading
statements”: B – 123; W - 162
Nature’s Wikipedia Study (2005)


Reviewers noted many W. articles were
“poorly structured and confusing”
Nature surveyed over 1,000 of its own
authors




70% had heard of Wikipedia
17% consult it at least weekly
10% author or help to update its content
Giles, Jim “Internet Encyclopedias go head to head”
Nature v. 438 Dec. 15, 2005 p. 900-1; suppl. info. at
www.nature.com/news/2005/051212/full/438900a.html
Tagging
Search and Social Networks
The Democratization of Metadata
aka Social Bookmarking

Folksonomies – Taxonomies by and for (???) people
 Del.icio.us – Personal bookmark service for URL’s
with tagging capability
 Flickr.com – online photo storage system that
provides users with a set of tags and allows new
ones to be created
 Facetious – del.icio.us tag database reworked into a
faceted classification, grouping tags under “by
place” “by technology” “by attribute” etc.
http://demo.siderean.com/facetious/facetious.jsp
Tag Creation and the “Tag Cloud”





Follows a “power law” scenario
The most used tags attract even more usage
A large number of tags are used by only a
few people
A huge number of tags will be used by only
1 or 2 people
Guy, Marieke and Tonkin, Emma “Folksonomies: Tidying up Tags?” D-Lib Magazine
12 (1) Jan., 2006 avail.
http://www.dlib.org/dlib/january06/guy/01guy.html
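
The “power law” pattern can be reproduced with a very simple simulation. The sketch below is a toy model, not data from any tagging service: it assumes each new tag assignment either coins a new tag (probability `p_new`, a made-up parameter) or copies an earlier assignment, so popular tags attract even more use.

```python
import random
from collections import Counter


def simulate_tagging(n_assignments, p_new=0.1, seed=0):
    """Simulate tag use where popular tags attract more use.

    With probability p_new a user coins a brand-new tag; otherwise the user
    copies the tag of a randomly chosen earlier assignment, so a tag is
    picked in proportion to how often it has already been used.
    """
    rng = random.Random(seed)
    history = []          # one entry per tag assignment
    next_tag_id = 0
    for _ in range(n_assignments):
        if not history or rng.random() < p_new:
            history.append(next_tag_id)
            next_tag_id += 1
        else:
            history.append(rng.choice(history))
    return Counter(history)


counts = simulate_tagging(10_000)
singles = sum(1 for c in counts.values() if c == 1)
print("distinct tags:", len(counts))
print("five most used tags (id, uses):", counts.most_common(5))
print("tags used exactly once:", singles)
```

A run typically shows a handful of tags with very many uses and a long tail of tags used only once, which is the shape a tag cloud makes visible.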
Tagging – the Advantages




Not a controlled vocabulary (user-created)
Easy, quick, free and enjoyable
Instantly incorporates buzz words, current
terminology, jargon, new concepts
Can be automated by mapping a resource
to tags that appear often in its text
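
Automated tagging of this kind can be sketched as simple word counting. The example below is a minimal illustration under stated assumptions: the stopword list, the thresholds and the sample text are invented for the demonstration, and a real service would likely match against its existing tag vocabulary as well.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real service would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "to", "with"}


def suggest_tags(text, max_tags=3, min_count=2):
    """Suggest tags: the most frequent non-stopwords in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Most frequent first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, n in ranked if n >= min_count][:max_tags]


sample = ("Search engines crawl the Web and index pages. "
          "A search engine ranks pages by relevance, and users "
          "search the index rather than the Web itself.")
print(suggest_tags(sample))
```

Note that the sketch treats “engine” and “engines” as different words, one of the synonym and word-form problems that uncontrolled tagging raises.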
Tagging – the Challenges


Not a controlled vocabulary
Lack of






Linkage among synonyms, homonyms
Hierarchical concept organization
Disambiguation
Often only single word tags allowed
No standardization across services
Tags can be over-personalized
Improving Tagging Practices


Services must allow compound tags
“black cat”
Educate users to advantages of a few
basic classification conventions
Controlled Vocabulary 101


Use of tag bundles (informal hierarchies)
Allow for cross-cultural tagging practices
within and across languages
Trends and the Future of Search
Search Engine Trends

Search is expanding (or being
absorbed) into other enterprises



G, Y develop Web utilities and services
unrelated to search
Ask bought by Barry Diller (owns Ingenta,
hotels around the world)
Implications for R & D?
Search Engine Trends

Privacy concerns explode



The dark side of personalization
Convenience and utility in search retrieval
vs. confidentiality
US Dept. of Justice requests: First step on
a slippery slope?
Search in the Future (???)

Local-for-mobile

Search engines will become more pervasive
as personalization and alerts merge with
mobile service
“Your favorite band is playing in town
tonight. Want to book a ticket?”
“What is the price and availability of hotel
and dining facilities in this area tonight?”
Search in the Future (???)

Search engines may be incorporating
data from blogs and tagging sites into
ranking processes



Yahoo! purchased del.icio.us
Ask purchased Bloglines
Google purchased Blogger in 2003
Search in the Future (???)

Increased use of radio frequency
identification chips (RFID) will enable
mobile devices to recognize products
and compare prices among the stores in
your area.
What will the next generation
of search engines look like?

They will begin to “understand” what
you are searching for, not just respond
to the characters you type. HOW??



User feedback (personalization)
Use of content-bearing metadata supplied
by author and others (social tagging
services, etc.)
Automatic resolution of semantic
differences among terms and metadata
through natural language processing
Thank You!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]