Empirical Investigations of
WWW Surfing Paths
Jim Pitkow
User Interface Research
Xerox Palo Alto Research Center
Agenda
• A few characteristics of the Web and clicks
• Aggregate click models
– user surfing behaviors
– post-hoc hit prediction
• Individual click models
– entropy
– path prediction
August 1999
Web is big, Web is good
10000000
1000000
Servers
100000
10000
1000
1 new server every 2 seconds
7.5 new pages per second
100
10
1
Aug92
Feb93
Aug93
Feb94
Aug94
Feb95
Aug95
Source: World Wide Web Consortium, Mark Gray, Netcraft Server Survey
August 1999
Feb96
Aug96
Feb97
Aug97
Feb98
Aug98
Users, sessions, and clicks
Current Internet Universe Estimate
Time spent/month
Number of unique sites visited/month
Page views/month
Number of sessions/month
Page views/session
Time spent/session
Time spent/site
Duration of a page viewed
Source Nielsen//NetRatings February 1999
August 1999
97.1 million
7:28:16
15
313
16
19
0:28:01
0:31:27
0:01:28
Popularity of pages—Zipf
• Zipf Distribution:
– frequency is
inversely
proportionate to
rank
• Zipf Law:
– slope equals
minus one
August 1999
Users enter a website at various
pages and begin surfing
(a)
Surfers arrive at pages having
traveled different paths
(b)
p
(c)
Enter
(d)
August 1999
1
p
p
2
Continuing surfers distribute
themselves down various paths
3
Exit
After some number of page visits
surfers leave the web site
Model of surfing
VL = VL-1 + L

• Where L is the number of clicks and is L varies
as independent and identically distributed
Gaussian random variables
• Surfing proceeds until the perceived cost is
larger than the discounted expected future value
August 1999
Random walk with a
stopping threshold
• Two parameter inverse Gaussian distribution
P(L) 
3


  L    
exp 

3
2
2 L
2 L


• mean (L) =  and variance (L) = 3/
August 1999
Experimental design
• Client data
– Georgia Tech (3 weeks August, 1994)
– Boston University (1995)
• tens of thousands of requests
• Proxy data
– AOL (5 days in December, 1997)
• tens of millions of requests
• Server data
– Xerox WWW Site (week during May 1997)
August 1999
Probability distribution function
20,000
Clicks
1 click/site mode
3-4 clicks/site median
8-10 clicks/site mean
Frequency
16,000
12,000
8,000
4,000
0
0
5
10
15
20
25
30
Clicks
August 1999
35
40
45
50
Cumulative distribution function
Probability
1.00
75% of the distribution
accounted for in three clicks
0.75
0.50
*** Experimental
Experimental
0.25
— inverse
inverseGaussian
Gaussian
0.00
0
25
50
Clicks
August 1999
75
100
Two observations
• Inverse Gaussian distribution has very long tail
– expect to see large deviations from average
• Due to asymmetric nature of the Inverse
Gaussian distribution, typical behavior does not
equal average behavior
August 1999
An interesting derivation
l og P ( L )  
3
l og L 
 (L   )
2
2 L
2
2

 l og 


 

2 
• Up to a constant given by the third term, the
probability of finding a group surfing at a given
level scales inversely in proportion to its depth
P(L) L
August 1999
3 / 2
Number of surfers at each level
log(Frequency)
1000
P(L) L
100
3 / 2
10
1
1
10
100
log(Clicks)
August 1999
1000
Implications of the Law of Surfing
• Implications on techniques designed to
enhance performance
– HTTP Keep-Alive, Pre-fetching of content
• Can adapt content based location on curve in a
cost sensitive manner
– different user modalities (browser, searcher, etc.)
– expend different CPU resources for different users
• Web site visitation modeling
August 1999
Spreading Activation
Pump activation
into source
Activation settles into
asymptotic pattern
Activation
spreads through
the network
August 1999
Matrix Formulation
123456
1
2
3
4
5
6
0
0
10
0
0
0
C
Sources
000010
001000
110010
010010
001100
011000
.5
.7
4
3
1
1
R
Network
A
Activation
A(t) = C + M A(t - 1)
M = (1 -  ) I +  R
Networks
User Paths
Text Similarity
Content + Usage + Topology
Technique
August 1999
Topology
Application to hit prediction
• Let fL be the fraction of users who, having surfed along
L-1 links, continue to surf to depth L.
fL 
1  F (L,  ,  )
1  F ( L  1,  ,  )
• Define the activation value Ni,L as the number of users
who are at node i after surfing through L clicks
N i , L 1  f L  S i , k N k , L
k
August 1999
Predicted versus observed
log(Predicted Hits)
100,000
10,000
1,000
100
10
1
1
10
100
1,000
10,000
0
log(Observed Hits)
August 1999
100,000
2
N o . L in k s , L i
 (d f = 6 )
0 < Li  3
2 5 .2 1
3 < Li  6
2 6 .4 4
6 < Li  9
1 7 2 .9 9
9 < Li  12
5 9 .7 0
12 < Li
3 6 6 6 .3 0
August 1999
F requency
Surfing
probabilitie
s by outlink
density
Proportion of surfers traversing link
Investigating user paths
• Each user path can be thought of as an ngram
and represented as tuples of the form <X1, X2,
… Xn> to indicate sequences of page clicks
– Distribution of ngrams is the Law of Surfing
• Determine the conditional probability of seeing
the next page given a matching prior ngram
Pr( X n  x n | X n 1 ,..., X n  k )
August 1999
Uncertainty and entropy
• Conditional probabilities are also know as as
kth-order Markov approximations/models
• Entropy is the expected (average) uncertainty
of the random variable measured in bits
– minimal number of bits to encode information
– uncertainty in the sequence of letters in languages
August 1999
Mathematics of entropy

H ( X )  E  log





p( X ) 
1
2
p ( x ) log
1
2
x X
   p ( x ) log
x X
August 1999
p(x)
2
p( x)
Conditional entropy
Conditional probability
H ( X n | X n 1 ,... X n  k ) 

p ( x n ) H ( X n | X n 1  x n 1 ,..., X n  k  x n  k )
xn  X n
Chaining rule
H ( X 1 ,..., X n )  H ( X 1 )  H ( X 2 | X 1 )    H ( X n | X 1 ,..., X n 1 )
Joint probabilities
H ( X ,Y )   

x  X y Y
August 1999
p ( x , y ) log
2
p ( x, y )
Entropy versus ngram length
9
8
9
7
7
2
3
4
5
6
7
8
9
10
Total Bits Gained
8
6
Bits
NgramTotal Information Gain
5
4
6
5
4
3
2
1
3
0
2
2
3
4
5
6
7
8
9
10
Path Length
1
0
1
2
3
4
5
6
7
Order of transition
August 1999
8
9
10
Path predictions
 Pr(PPM) the probability that a penultimate path, <xn-1,…xn-k>,
observed in the test data was matched by the same penultimate
path in the training data
 Pr(Hit|PPM) the probability that page xn is visited, given that
<xn-1,…xn-k>, is the penultimate path and the highest probability
conditional on that path is p(xn|xn-1,…xn-k )
 Pr(Hit) = Pr(Hit|PPM)*Pr(PPM), the probability that the page visited
in the test set is the one estimated from the training as the most
likely to occur
August 1999
Pr(Path Match)
1.00
2
3
4
5
6
Prob
0.80
0.60
7
8
9
10
0.40
0.20
0.00
0
2
4
6
8
Length of Path Used in Prediction
August 1999
10
Pr(Hit|Path Match)
1.00
2
3
4
5
6
7
8
9
10
Prob
0.80
0.60
0.40
0.20
0.00
0
2
4
6
Length of Path Used in Prediction
August 1999
8
10
Pr(Hit)
0.25
2
3
4
5
6
7
8
9
10
Prob
0.20
0.15
0.10
0.05
0.00
0
2
4
6
8
Length of Path Used in Prediction
August 1999
10
Principles of path prediction
• Paths are not completely modeled by 1st order
Markov approximations
• Path Specificity Principle: longer paths contain
more predictive power than lower order paths
• Complexity Reduction: keep models as simple
and as small as possible
August 1999
Agenda revisited
• A few characteristics of the Web and clicks
• Aggregate click models
– user surfing behaviors
– post-hoc hit prediction
• Individual click models
– entropy
– path prediction
August 1999
Areas of further investigation
• Advance the state of predictive modeling
– Move from post-hoc to a-priori prediction of user
interactions on the Web
– Test hypothetical models of Web site usage
• Validate existing and new models on more
representative data sets
• Understand new and emerging applications
– streaming video and audio, mobile, etc.
August 1999
More information
[email protected]
http://www.parc.xerox.com/
istl/projects/uir/projects/Webology.html
August 1999
Descargar

No Slide Title