Entropy of Search Logs
- How Big is the Web?
- How Hard is Search?
- With Personalization? With Backoff?
Qiaozhu Mei†, Kenneth Church‡
† University of Illinois at Urbana-Champaign
‡ Microsoft Research
1
How Big is the Web?
Small
5B? 20B? More? Less?
• What if a small cache of millions of pages
– Could capture much of the value of billions?
• Could a Big bet on a cluster in the clouds
– Turn into a big liability?
• Examples of Big Bets
– Computer Centers & Clusters
• Capital (Hardware)
• Expense (Power)
• Dev (MapReduce, GFS, Bigtable, etc.)
– Sales & Marketing >> Production & Distribution
2
Millions (Not Billions)
3
Population Bound
• With all the talk about the Long Tail
– You’d think that the Web was astronomical
– Carl Sagan: Billions and Billions…
• Lower Distribution $$ → Sell Less of More
• But there are limits to this process
– NetFlix: 55k movies (not even millions)
– Amazon: 8M products
– Vanity Searches: Infinite???
• Personal Home Pages << Phone Book < Population
• Business Home Pages << Yellow Pages < Population
• Millions, not Billions (until market saturates)
4
It Will Take Decades
to Reach Population Bound
• Most people (and products) don’t have a web page (yet)
• Currently, I can find famous people (and academics), but not my neighbors
– There aren’t that many famous people (and academics)…
– Millions, not billions (for the foreseeable future)
5
Equilibrium: Supply = Demand
• If there is a page on the web,
– And no one sees it,
– Did it make a sound?
• How big is the web?
– Should we count “silent” pages
– That don’t make a sound?
• How many products are there?
– Do we count “silent” flops
– That no one buys?
6
Demand Side Accounting
• Consumers have limited time
– Telephone Usage: 1 hour per line per day
– TV: 4 hours per day
– Web: ??? hours per day
• Suppliers will post as many pages as
consumers can consume (and no more)
• Size of Web: O(Consumers)
7
How Big is the Web?
• Related questions come up in language
• How big is English? How many words do people know?
– (What is a word? Which people? What counts as knowing?)
– Dictionary Marketing
– Education (Testing of Vocabulary Size)
– Psychology
– Statistics
– Linguistics
• Two Very Different Answers
– Chomsky: language is infinite
– Shannon: 1.25 bits per character
8
Chomskian Argument:
Web is Infinite
• One could write a malicious spider trap
– http://successor.aspx?x=0
– http://successor.aspx?x=1
– http://successor.aspx?x=2
• Not just an academic exercise
• Web is full of benign examples like
– http://calendar.duke.edu/
– Infinitely many months
– Each month has a link to the next
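The spider trap above can be sketched in a few lines; the generator name and page body here are illustrative, not taken from any real site:

```python
import itertools

def successor_pages():
    """Yield an unbounded stream of (url, html) pairs: each page links to the next."""
    for x in itertools.count():
        url = f"http://successor.aspx?x={x}"
        html = f'<a href="http://successor.aspx?x={x + 1}">next</a>'
        yield url, html

# A naive crawler that follows every link never terminates:
for url, _ in itertools.islice(successor_pages(), 3):
    print(url)
```

The benign calendar case works the same way: each month's page contains a link to the next month, so the set of reachable pages is unbounded.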
9
How Big is the Web?
5B? 20B? More? Less?
• MSN Search Log: 1 month (×18 for extrapolation)
• More (Chomsky): http://successor?x=0
• Less (Shannon): the more practical answer is Millions (not Billions)
• Comp Ctr ($$$$) → Walk in the Park ($)
• Cluster in Cloud → Desktop → Flash

Entropy (H) in bits (1 month | ×18):
  Query           21.1 | 22.9
  URL             22.1 | 22.4
  IP              22.1 | 22.6
  All But IP      23.9
  All But URL     26.0
  All But Query   27.1
  All Three       27.2
10
Entropy (H)
• H(X) = −Σ_{x∈X} p(x) log p(x)
• Difficulty of encoding information (a distribution)
– Size of search space; difficulty of a task
• H = 20 → 1 million items distributed uniformly
• Powerful tool for sizing challenges and
opportunities
– How hard is search?
– How much does personalization help?
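As a quick sanity check on the "H = 20 means a million items" reading: a uniform distribution over 2^20 items carries exactly 20 bits (the function name is ours):

```python
import math

def entropy(counts):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), computed from raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# 2**20 equally likely items carry exactly 20 bits:
print(entropy([1] * 2**20))  # 20.0
```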
11
How Hard Is Search?
• Traditional Search
– H(URL | Query)
– 2.8 (= 23.9 – 21.1)
• Personalized Search
– H(URL | Query, IP)
– 1.2 (= 27.2 – 26.0)
Personalization
cuts H in Half!
Entropy (H): Query 21.1; URL 22.1; IP 22.1; All But IP 23.9; All But URL 26.0; All But Query 27.1; All Three 27.2
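The subtractions above are the chain rule, H(URL | Query) = H(URL, Query) − H(Query). A toy version on made-up click counts (all names and numbers are illustrative, not from the MSN log):

```python
import math
from collections import Counter

def H(counts):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Toy click log of (query, url) pairs: "ebay" always goes one place,
# "movies" splits evenly between two destinations.
log = ([("ebay", "ebay.com")] * 8
       + [("movies", "imdb.com")] * 4
       + [("movies", "fandango.com")] * 4)

H_joint = H(Counter(log).values())                # H(URL, Query) = 1.5
H_query = H(Counter(q for q, _ in log).values())  # H(Query)      = 1.0
print(H_joint - H_query)                          # H(URL | Query) = 0.5
```

The navigational query contributes 0 bits and the ambiguous one contributes 1 bit, so the average difficulty is 0.5 bits, mirroring how easy and hard queries mix in the real log.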
12
Difficulty of Queries
• Easy queries (low H(URL|Q)):
– google, yahoo, myspace, ebay, …
• Hard queries (high H(URL|Q)):
– dictionary, yellow pages, movies, “what is may day?”
13
How Hard are Query Suggestions?
The Wild Thing? C* Rice → Condoleezza Rice
• Traditional Suggestions
– H(Query)
– 21 bits
• Personalized
– H(Query | IP)
– 5 bits (= 26 – 21)
Personalization
cuts H in Half!
Twice
Entropy (H): Query 21.1; URL 22.1; IP 22.1; All But IP 23.9; All But URL 26.0; All But Query 27.1; All Three 27.2
14
Personalization with Backoff
• Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate
• Disambiguate based on user’s prior clicks
• When we don’t have data
– Backoff to classes of users
• Proof of Concept:
– Classes defined by IP addresses
• Better:
– Market Segmentation (Demographics)
– Collaborative Filtering (Other users who click like me)
15
Backoff
• Proof of concept: bytes of IP define classes of users
• If we only know some of the IP address, does it help?
Bytes of IP address    H(URL | IP, Query)
156.111.188.243        1.17
156.111.188.*          1.20
156.111.*.*            1.39
156.*.*.*              1.95
*.*.*.*                2.74

• Some of the IP is better than none
• Even the first two bytes of the IP cut H nearly in half (1.39 v. 2.74)
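The classes in the table come from wildcarding the trailing bytes of the IP address; a helper like this (the name is ours) produces them:

```python
def truncate_ip(ip, k):
    """Keep the first k bytes of a dotted-quad IP address; wildcard the rest."""
    parts = ip.split(".")
    return ".".join(parts[:k] + ["*"] * (4 - k))

print(truncate_ip("156.111.188.243", 2))  # 156.111.*.*
print(truncate_ip("156.111.188.243", 0))  # *.*.*.*
```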
16
Lambda
[Chart: estimated interpolation weights λ0–λ4 for backing off by bytes of the IP address (y-axis 0 to 0.3). The middle λs carry most of the weight: full personalization (λ4) runs into sparse data; no personalization (λ0) is a missed opportunity.]

• Personalization with Backoff
• λs estimated with EM and CV
• A little bit of personalization
– Better than too much
– Or too little

P(Url | IP, Q) = Σ_{i=0}^{4} λ_i · P(Url | IP_i, Q)

λ4: weights for first 4 bytes of IP
λ3: weights for first 3 bytes of IP
λ2: weights for first 2 bytes of IP
……
17
Personalization with Backoff
→ Market Segmentation
• Traditional Goal of Marketing:
– Segment Customers (e.g., Business v. Consumer)
– By Need & Value Proposition
• Need: Segments ask different questions at different times
• Value: Different advertising opportunities
• Segmentation Variables
– Queries, URL Clicks, IP Addresses
– Geography & Demographics (Age, Gender, Income)
– Time of day & Day of Week
18
Query Frequency
[Chart: daily query frequency, Jan 2006 (the 1st is a Sunday), y-axis 0 to 0.08. Business queries (yahoo, mapquest, cnn) peak on business days; consumer queries (sex, movie, mp3) peak on weekends or run at the same level every day.]
19
Business Days v. Weekends:
More Clicks and Easier Queries
[Chart: by day over Jan 2006 (the 1st is a Sunday): total clicks (3,000,000 to 9,000,000, left axis) and H(Url | IP, Q) (1.00 to 1.20, right axis). Business days show more clicks and easier (lower-entropy) queries.]
20
Day vs. Night: More Queries, More Diversified Queries
[Chart: daytime shows more clicks and more diversified queries; nighttime shows fewer clicks and more uniform queries.]
21
Harder Queries at TV Time
[Chart: queries are harder at TV time; weekends are harder.]
22
Conclusions: Millions (not Billions)
• How Big is the Web?
– Upper bound: O(Population)
• Not Billions
• Not Infinite
• Shannon >> Chomsky
– How hard is search?
– Query Suggestions?
– Personalization?
Entropy is a
great hammer
• Cluster in Cloud ($$$$) → Walk-in-the-Park ($)
23
Conclusions: Personalization with
Backoff
• Personalization with Backoff
– Cuts search space (entropy) in half
– Backoff → Market Segmentation
• Example: Business v. Consumer
– Need: Segments ask different questions at different times
– Value: Different advertising opportunities
• Demographics:
– Partition by IP, day, hour, business/consumer query…
• Future Work:
– Model combinations of surrogate variables
– Group users by similarity → collaborative search
24
Thanks!
25
Prediction Task:
Coverage
Historical Logs → Coverage of Future Demand
• Training: Estimate Pr(x)
– Given what we know today,
– Estimate, Pr(x), tomorrow’s demand for url x
– (There is an infinite set of URLs x.)
• Test: Score Pr(x) by Cross Entropy
– Given tomorrow’s demand, x1…xk
– Score ≡ −log2 of geometric mean of Pr(x1) … Pr(xk)
• One forecast is better than another
– If it has lower cross entropy
• Cross entropy ≥ Entropy (H)
– H is the score of the best possible forecast (that only God knows)
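The scoring rule above, as code; the toy forecast and URLs are made up for illustration:

```python
import math

def cross_entropy_score(pr, future):
    """-log2 of the geometric mean of pr(x1)..pr(xk) = average of -log2 pr(xi)."""
    return -sum(math.log2(pr(x)) for x in future) / len(future)

# A forecast that spreads mass 1/2, 1/4, 1/4 over three URLs, scored on
# four observed clicks:
forecast = {"ebay.com": 0.5, "imdb.com": 0.25, "fandango.com": 0.25}
clicks = ["ebay.com", "ebay.com", "imdb.com", "fandango.com"]
print(cross_entropy_score(forecast.get, clicks))  # (1 + 1 + 2 + 2) / 4 = 1.5
```

Here the forecast happens to match the empirical click distribution exactly, so its score equals the entropy of that distribution; any other forecast would score higher.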
26
Millions, Not Billions
(Until Market Saturates)
• Telephones are a Mature Market
– Saturated
• Universal Service is limited by population
• Loops (telephone numbers) ≈ population
• Everyone (and every business) is listed in phonebook
– (unless they have opted out)
• Web is Growth Market
– Decades from saturation
– When everybody and every product has a page
• The number of pages will be bounded by the population
– In the meantime, millions are good enough
27
Smoothing (Cont.)
• Use interpolation smoothing:

P(Url | IP, Q) = Σ_{i=0}^{4} λ_i · P(Url | IP_i, Q)

where IP_i is the first i bytes of an IP address,
e.g., IP4 = 156.111.188.243; IP2 = 156.111.*.*
• Use one month’s search log (Jan 06) as training data, and the new incoming log (Feb 06) as the test set
• λi determined by the EM algorithm, minimizing the conditional cross entropy on held-out data
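A sketch of the EM update for the λs, assuming each held-out event has already been reduced to its five component probabilities P(Url | IP_i, Q); the function name and toy numbers are ours:

```python
def em_lambdas(component_probs, n_iter=100):
    """Estimate mixture weights by EM on held-out events.

    component_probs: one row per held-out (IP, Q, Url) event;
    row[i] = P(Url | IP_i, Q) under the i-byte backoff model.
    """
    k = len(component_probs[0])
    lam = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in component_probs:
            mix = sum(l * p for l, p in zip(lam, row))
            if mix == 0.0:
                continue
            # E-step: responsibility of each backoff level for this event
            for i in range(k):
                totals[i] += lam[i] * row[i] / mix
        # M-step: new weights are the normalized responsibilities
        total = sum(totals)
        lam = [t / total for t in totals]
    return lam

# Events where the 2-byte model is consistently the best predictor pull
# most of the weight onto lambda_2 (toy numbers):
rows = [[0.1, 0.2, 0.9, 0.3, 0.0]] * 10
lam = em_lambdas(rows)
```

With identical rows the update concentrates weight on the strongest component; real held-out events vary, which is what keeps the estimated λs spread across the middle levels as in the chart on slide 17.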
28
Cross Validation
• An IP seen in the future might not have been seen in the history
• But parts of it have been seen

[Chart: cross entropy H(future | history) for three settings: no personalization, personalization with backoff (knows at least two bytes of the IP), and complete personalization (knows every byte). Backoff sits between the extremes.]
29
Cross Validations
• Weekends are harder to predict
• Weekdays predict weekdays >> weekends predict weekdays
• Daytime predicts daytime >> nights predict daytime
30
Partition by Day-Week (CV)
[Chart: search difficulty H(Url | IP, Q) (y-axis 0 to 1.2) on test days Wednesday 02/01/06 through Monday 02/06/06, for different partitions of the training data (Jan 2006; 01/01 is a Sunday). Weekends are more difficult to predict; weekdays are easier to predict from a history of weekdays.]
31