Empirical Investigations of WWW Surfing Paths Jim Pitkow User Interface Research Xerox Palo Alto Research Center Agenda • A few characteristics of the Web and clicks • Aggregate click models – user surfing behaviors – post-hoc hit prediction • Individual click models – entropy – path prediction August 1999 Web is big, Web is good 10000000 1000000 Servers 100000 10000 1000 1 new server every 2 seconds 7.5 new pages per second 100 10 1 Aug92 Feb93 Aug93 Feb94 Aug94 Feb95 Aug95 Source: World Wide Web Consortium, Mark Gray, Netcraft Server Survey August 1999 Feb96 Aug96 Feb97 Aug97 Feb98 Aug98 Users, sessions, and clicks Current Internet Universe Estimate Time spent/month Number of unique sites visited/month Page views/month Number of sessions/month Page views/session Time spent/session Time spent/site Duration of a page viewed Source Nielsen//NetRatings February 1999 August 1999 97.1 million 7:28:16 15 313 16 19 0:28:01 0:31:27 0:01:28 Popularity of pages—Zipf • Zipf Distribution: – frequency is inversely proportionate to rank • Zipf Law: – slope equals minus one August 1999 Users enter a website at various pages and begin surfing (a) Surfers arrive at pages having traveled different paths (b) p (c) Enter (d) August 1999 1 p p 2 Continuing surfers distribute themselves down various paths 3 Exit After some number of page visits surfers leave the web site Model of surfing VL = VL-1 + L • Where L is the number of clicks and is L varies as independent and identically distributed Gaussian random variables • Surfing proceeds until the perceived cost is larger than the discounted expected future value August 1999 Random walk with a stopping threshold • Two parameter inverse Gaussian distribution P(L) 3 L exp 3 2 2 L 2 L • mean (L) = and variance (L) = 3/ August 1999 Experimental design • Client data – Georgia Tech (3 weeks August, 1994) – Boston University (1995) • tens of thousands of requests • Proxy data – AOL (5 days in December, 1997) • tens of millions of requests • Server data – Xerox WWW Site (week during May 1997) August 1999 Probability distribution function 20,000 Clicks 1 click/site mode 3-4 clicks/site median 8-10 clicks/site mean Frequency 16,000 12,000 8,000 4,000 0 0 5 10 15 20 25 30 Clicks August 1999 35 40 45 50 Cumulative distribution function Probability 1.00 75% of the distribution accounted for in three clicks 0.75 0.50 *** Experimental Experimental 0.25 — inverse inverseGaussian Gaussian 0.00 0 25 50 Clicks August 1999 75 100 Two observations • Inverse Gaussian distribution has very long tail – expect to see large deviations from average • Due to asymmetric nature of the Inverse Gaussian distribution, typical behavior does not equal average behavior August 1999 An interesting derivation l og P ( L ) 3 l og L (L ) 2 2 L 2 2 l og 2 • Up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its depth P(L) L August 1999 3 / 2 Number of surfers at each level log(Frequency) 1000 P(L) L 100 3 / 2 10 1 1 10 100 log(Clicks) August 1999 1000 Implications of the Law of Surfing • Implications on techniques designed to enhance performance – HTTP Keep-Alive, Pre-fetching of content • Can adapt content based location on curve in a cost sensitive manner – different user modalities (browser, searcher, etc.) – expend different CPU resources for different users • Web site visitation modeling August 1999 Spreading Activation Pump activation into source Activation settles into asymptotic pattern Activation spreads through the network August 1999 Matrix Formulation 123456 1 2 3 4 5 6 0 0 10 0 0 0 C Sources 000010 001000 110010 010010 001100 011000 .5 .7 4 3 1 1 R Network A Activation A(t) = C + M A(t - 1) M = (1 - ) I + R Networks User Paths Text Similarity Content + Usage + Topology Technique August 1999 Topology Application to hit prediction • Let fL be the fraction of users who, having surfed along L-1 links, continue to surf to depth L. fL 1 F (L, , ) 1 F ( L 1, , ) • Define the activation value Ni,L as the number of users who are at node i after surfing through L clicks N i , L 1 f L S i , k N k , L k August 1999 Predicted versus observed log(Predicted Hits) 100,000 10,000 1,000 100 10 1 1 10 100 1,000 10,000 0 log(Observed Hits) August 1999 100,000 2 N o . L in k s , L i (d f = 6 ) 0 < Li 3 2 5 .2 1 3 < Li 6 2 6 .4 4 6 < Li 9 1 7 2 .9 9 9 < Li 12 5 9 .7 0 12 < Li 3 6 6 6 .3 0 August 1999 F requency Surfing probabilitie s by outlink density Proportion of surfers traversing link Investigating user paths • Each user path can be thought of as an ngram and represented as tuples of the form <X1, X2, … Xn> to indicate sequences of page clicks – Distribution of ngrams is the Law of Surfing • Determine the conditional probability of seeing the next page given a matching prior ngram Pr( X n x n | X n 1 ,..., X n k ) August 1999 Uncertainty and entropy • Conditional probabilities are also know as as kth-order Markov approximations/models • Entropy is the expected (average) uncertainty of the random variable measured in bits – minimal number of bits to encode information – uncertainty in the sequence of letters in languages August 1999 Mathematics of entropy H ( X ) E log p( X ) 1 2 p ( x ) log 1 2 x X p ( x ) log x X August 1999 p(x) 2 p( x) Conditional entropy Conditional probability H ( X n | X n 1 ,... X n k ) p ( x n ) H ( X n | X n 1 x n 1 ,..., X n k x n k ) xn X n Chaining rule H ( X 1 ,..., X n ) H ( X 1 ) H ( X 2 | X 1 ) H ( X n | X 1 ,..., X n 1 ) Joint probabilities H ( X ,Y ) x X y Y August 1999 p ( x , y ) log 2 p ( x, y ) Entropy versus ngram length 9 8 9 7 7 2 3 4 5 6 7 8 9 10 Total Bits Gained 8 6 Bits NgramTotal Information Gain 5 4 6 5 4 3 2 1 3 0 2 2 3 4 5 6 7 8 9 10 Path Length 1 0 1 2 3 4 5 6 7 Order of transition August 1999 8 9 10 Path predictions Pr(PPM) the probability that a penultimate path, <xn-1,…xn-k>, observed in the test data was matched by the same penultimate path in the training data Pr(Hit|PPM) the probability that page xn is visited, given that <xn-1,…xn-k>, is the penultimate path and the highest probability conditional on that path is p(xn|xn-1,…xn-k ) Pr(Hit) = Pr(Hit|PPM)*Pr(PPM), the probability that the page visited in the test set is the one estimated from the training as the most likely to occur August 1999 Pr(Path Match) 1.00 2 3 4 5 6 Prob 0.80 0.60 7 8 9 10 0.40 0.20 0.00 0 2 4 6 8 Length of Path Used in Prediction August 1999 10 Pr(Hit|Path Match) 1.00 2 3 4 5 6 7 8 9 10 Prob 0.80 0.60 0.40 0.20 0.00 0 2 4 6 Length of Path Used in Prediction August 1999 8 10 Pr(Hit) 0.25 2 3 4 5 6 7 8 9 10 Prob 0.20 0.15 0.10 0.05 0.00 0 2 4 6 8 Length of Path Used in Prediction August 1999 10 Principles of path prediction • Paths are not completely modeled by 1st order Markov approximations • Path Specificity Principle: longer paths contain more predictive power than lower order paths • Complexity Reduction: keep models as simple and as small as possible August 1999 Agenda revisited • A few characteristics of the Web and clicks • Aggregate click models – user surfing behaviors – post-hoc hit prediction • Individual click models – entropy – path prediction August 1999 Areas of further investigation • Advance the state of predictive modeling – Move from post-hoc to a-priori prediction of user interactions on the Web – Test hypothetical models of Web site usage • Validate existing and new models on more representative data sets • Understand new and emerging applications – streaming video and audio, mobile, etc. August 1999 More information pitkow@parc.xerox.com http://www.parc.xerox.com/ istl/projects/uir/projects/Webology.html August 1999

Descargar
# No Slide Title