Pay for Placement Search
Agenda




Search Engines
 Where did they come from?
 How do they work?
 Who’s the biggest?
 Why GoTo is the coolest.
What type of stuff do you need to support the web’s 2nd* largest search
engine?
 Architecture, infrastructure, nuts and bolts
 Performance
 Operations
What kind of people (and how many) do you need to do this kind of
business?
Where is the Internet going? What's going to happen to search engines?
*Don’t quote me 
Copyright GoTo.com, 2/19/2001, 2
Ancient History



The Precursors
 Archie (1990) – FTP-based file indexing and retrieval
 Gopher (1992) – document network (non-FTP)
The early ‘bots (1992-1993)
 WWW Wanderer (Wandex) – servers, then URLs
 Aliweb – index web like Archie w/site index retrieval
Then came the spiders (1993+)
 WWW Worm
 Excite (Architext), 2/93 from Stanford
All Done? Wrong!

Problems with Spiders:
 Get lots of data, but no intelligence to map pages to
concept space
 Problem still exists today (spamming)

The Solution? Searchable Directories. Human-crafted
hierarchies.
 Tradewave Galaxy (1/94)
 Yahoo! (4/94), Filo and Yang of Stanford
I Give Up – Let’s Search Everyone!

Here Come the Metasearchers!
 MetaCrawler, Go2Net, Dogpile (1995)
 Mamma
 Search.com (CNet)

Spray out searches to several engines – combine
the results
The Universe Divides (kinda)

The Crawler-based Search Engines
 Lycos (7/94) – the wolf spider
 Infoseek (4/94)
 AltaVista (12/95)
 Inktomi (Slurp) – HotBot (5/96) – the Plains Indians spider myth
 Google, Northern Light, Excite, FAST, Direct Hit, and more…

The Directory/Editorial-based Search Engines
 Yahoo! (4/94)
 LookSmart (5/95)
 Snap.com
 ODP (NewHoo) -- dmoz (1/98)
 Ask Jeeves (4/97)
 GoTo (6/98)




How Crawlers Work (or don’t)



Start with list of URLs (submitted, generated from somewhere)
For each Site
 Get the base page
 ‘Catalog’ the page based on crawler-specific implementation
 Follow links on page and recurse
Some Details
 META tags
<META NAME="ROBOTS" CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">
 Robots.txt
# /robots.txt file for http://goto.com/
# disallow all robots from crawling GoTo
User-agent: *
Disallow: /
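The crawl loop and the two exclusion mechanisms above can be sketched roughly as follows. This is a minimal illustration, not any particular engine's crawler: `crawl`, `LinkExtractor`, the `example-bot` agent name, and the in-memory `PAGES` site are all hypothetical, a single robots.txt is assumed to cover every URL, 'cataloging' is reduced to storing the raw page, and a queue stands in for recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects <a href> links and the ROBOTS meta directive of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots_meta = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_meta = (attrs.get("content") or "").lower()


def crawl(seed_urls, fetch, robots_lines, agent="example-bot"):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    robots = RobotFileParser()
    robots.parse(robots_lines)            # rules from the site's robots.txt
    queue, seen, catalog = deque(seed_urls), set(seed_urls), {}
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue                      # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue
        page = LinkExtractor()
        page.feed(html)
        meta = page.robots_meta
        if "noindex" not in meta and "none" not in meta:
            catalog[url] = html           # stand-in for real cataloging
        if "nofollow" in meta or "none" in meta:
            continue                      # page asked us not to follow links
        for href in page.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return catalog


# Hypothetical four-page site used only for illustration.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/private/x">P</a>',
    "http://example.test/a": '<meta name="robots" content="NOINDEX"><a href="/b">B</a>',
    "http://example.test/b": "<p>leaf page</p>",
    "http://example.test/private/x": "<p>secret</p>",
}
ROBOTS = ["User-agent: *", "Disallow: /private/"]
catalog = crawl(["http://example.test/"], PAGES.get, ROBOTS)
```

Here the NOINDEX page is followed but not cataloged, and the `/private/` page is never fetched.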
Some Search Engine Examples


Inktomi
 Infrastructure only – you pay for the search results
 Used to power Yahoo! (now Google), HotBot, many
others
 Now typically a fall-through placement (bidded or
other paid inclusion first, then Inktomi results)
Google
 Sergey and Larry
 Powers Yahoo!, virgin.net, some others
 Searching for a revenue model
Inktomi ‘Slurp’ Crawler
 Slurp Characteristics
• Starts with active submitted URLs
• Hierarchy of Importance
– Page Title
– Description meta
– Keyword meta
– Text in document (not in images)
• No frames
• Looks for spoofing tricks (drop page)
 4 week full cycle (constant incremental)
• Many different indices created (for various customers),
different depths, etc.
Some Cataloging Approaches (cont.)

Google
 Backrub/Googlebot crawler
 PageRank™
• Page A; pages linking to A: T1..Tn; number of links going out of A: C(A); damping factor: d
• PR(A) = (1-d) + d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))
• ~probability distribution that random surfer hits a page based on
links
 Cache the documents (no kidding)
 All kinds of tweaks to the PageRank, including:
• Domain tweaks (.org, .gov, .edu)
• Serious bias against large pages
• Bias against dynamic pages (.asp, .jhtml, .jsp)
 Check out http://www.searchengineworld.com/google
 Original design at
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
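The PageRank formula on this slide can be evaluated by simple iteration. The sketch below is only an illustration of that formula, not Google's implementation: the `pagerank` function, the three-page `links` graph, the starting value of 1.0, and the fixed iteration count are assumptions, and none of the tweaks listed above are modeled.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}          # arbitrary starting value
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            incoming = sum(
                rank[source] / len(targets)       # PR(T) / C(T)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this form of the formula the ranks sum to the number of pages when every page has at least one outgoing link; dividing by the page count would give the probability-distribution reading mentioned on the slide.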
Who’s the ‘biggest’ Search Engine

What is ‘big’
 Number of documents indexed (SearchEngineWatch, 11/8/2000)
KEY: GG=Google, FAST=FAST, WT=WebTop.com, INK=Inktomi, AV=AltaVista,
NL=Northern Light, EX=Excite, Go=Go (Infoseek).
Who’s the ‘biggest’ Search Engine

What is ‘big’
 Searches/Day – Total Web 500mm/day (ptr estimate)
• Yahoo! – 100mm
• Alta Vista – 50mm (International too)
• Google – 50mm
• Inktomi – 40mm
• Everyone else – 10mm or fewer
 Where’s GoTo? Hint 
Let’s Talk About GoTo

Basic Business Model – Middlemen for Textual
Advertisements (Search Results)
 Advertisers provide us Search Listings (Title, URL,
Description, bid) for a search term
 We charge advertisers for user clicks on Search
Listings
 We serve search listings to our own site
(www.goto.com - 5%) and to partner sites
(affiliates like Alta Vista, AOL, Netscape, Cnet, etc.
– 95%)


Since we make money when people search (and click),
we pay sites to include our listings
Live auction for search results
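The model above can be sketched in a few lines. This is a hedged illustration only: the `Listing` fields mirror the slide (Title, URL, Description, bid), but `search`, `charge_click`, the advertiser names, the amounts, and the rule that a click costs exactly the listing's bid are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    advertiser: str
    title: str
    url: str
    description: str
    bid_cents: int        # what the advertiser offers per click (assumed unit)


def search(listings_by_term, term):
    """Return the listings for a search term, highest bid first."""
    listings = listings_by_term.get(term, [])
    return sorted(listings, key=lambda listing: listing.bid_cents, reverse=True)


def charge_click(balances_cents, listing):
    """Deduct one click from the advertiser's balance (assumed: cost == bid)."""
    balances_cents[listing.advertiser] -= listing.bid_cents


# Hypothetical advertisers bidding on the same term.
listings_by_term = {
    "flowers": [
        Listing("acme", "Acme Flowers", "http://acme.example/", "Fresh flowers", 12),
        Listing("blooms", "Blooms Direct", "http://blooms.example/", "Same-day delivery", 30),
    ]
}
balances_cents = {"acme": 5000, "blooms": 5000}

results = search(listings_by_term, "flowers")
charge_click(balances_cents, results[0])      # a user clicks the top result
```

Because `sorted` is stable, listings with equal bids keep their original order in this sketch; a real auction would need an explicit tie-break rule.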
The Scale of Operations

Search Volume – 70mm+/day, capacity for 210mm/day

300mm impressions/day

10mm clicks/day – Med/Large Phone company

6mm+ search listings

40,000+ advertisers

Wow
Systems Strategic Bombing View
It Can’t be that Simple, Right?

Right!
[Diagram: deployment map of GoTo's hosts and the services running on each, across the Sunnyvale, Pasadena, and Reston sites, with VPN links between sites. Systems shown: OLS, DTC, CRM, AM, EPS, Stats, LWES, HWES, and CMS. Annotations on the diagram:]
 AM periodically updates finance and gets balance updates from OF
 DTC/OLS Dynamos talk to EJB services
 All EJBs are accessed via the load-balanced name (DNS round robin)
 All EJBs access the database
 All instances of EPS EJBs or Agents talk to the databases
 eService instances talk directly to the OF database
 All eService instances talk to both databases
 CRM uses Stats for RunRate data
 DTC uses Stats for Reports/prediction
 Stats pushes CTP2.0 data to the AM table in live_CRM
It Can’t be that Simple, Right?
GoTo’s systems seem deceptively simple.


GoTo’s pay-for-performance search product seems
simple to execute – advertisers provide the content
in the form of search listings, the content is
ordered by bid price, and advertisers are charged
for resulting clicks.
The complexity of these systems is based on the
scale of the problem (number of advertisers,
search listings, searches per day, etc.), in addition
to some non-apparent complications (e.g. fraud
detection).
Architecture Features

High Availability -- Noah’s Ark Approach – no single point
of failure
 Load balancers
 State migration



Scalability:
no architectural changes to scale serving capacity.
Extensibility:
can add search features incrementally.
Distributed content:
multiple sites currently serving all partners.
Advertiser Management
Advertiser Tools

DirecTraffic Center®
 Functions – manage account balance, report
on activity, real-time bid charges,
add/modify/delete search listings
 ATG/Dynamo (jhtml)/Java, EJB search Listing
services (BEA/Weblogic), custom cache
reporting scheme based on Oracle 8i
Advertiser Management Systems
Account Monitoring

The real ‘special sauce’
 Listens to real-time clicks and monitors
account activity to process notifications,
automated changes, status changes
 Manages credit limits, monthly advertiser
budgets, activation and de-activation of
accounts, and over 300 different business
rules around accounts
 EJB – Weblogic
Editorial Processing

We are a publishing business
 100 editors
 Workflow for 50,000-100,000 work orders a
month
 Review all listings (with some help)
 EJB/Desktop App (Swing)
Fraud Detection and Reporting
Event Processing – What Are Events?

LWES – Light Weight Event System
 UDP-multicast based events thrown by front
end systems
 Events include
• Searches
• Clicks (redirects)
• Navigation
 Events are Key/Value pairs
 ‘Caught’ by separate Journaling Systems
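The idea of front ends throwing key/value events over UDP for a separate journaler to catch can be sketched as below. This is not the LWES wire format or API: the JSON encoding, the function names, and the event fields are assumptions, and plain UDP to the loopback address stands in for multicast so the sketch can run on a single machine.

```python
import json
import socket


def encode_event(event_type, **fields):
    """Serialize one event as key/value pairs (JSON is an assumed encoding)."""
    payload = {"type": event_type, **fields}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_event(datagram):
    """Inverse of encode_event: bytes back to a dict of key/value pairs."""
    return json.loads(datagram.decode("utf-8"))


def emit(sock, address, event_type, **fields):
    """Front end side: fire one datagram and forget about it."""
    sock.sendto(encode_event(event_type, **fields), address)


def journal(sock, log, count):
    """Journaler side: catch `count` datagrams and append them to `log`."""
    for _ in range(count):
        datagram, _sender = sock.recvfrom(65535)
        log.append(decode_event(datagram))


def demo():
    """Send a search event and a click event to a local journaler."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))       # any free local port
    listener.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    log = []
    try:
        address = listener.getsockname()
        emit(sender, address, "Search", term="flowers")
        emit(sender, address, "Click", term="flowers", rank=1)
        journal(listener, log, 2)
    finally:
        sender.close()
        listener.close()
    return log
```

UDP gives the sender no delivery guarantee, which keeps the serving path cheap; a design like this accepts that a journaler may occasionally miss an event.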
What do we do with these events?

Result clicks (i.e. clicks we charge the advertiser for) go to fraud
detection
• Patent-pending system that monitors behavior on our web site
to detect potentially fraudulent activity. The systems
analyze millions of transactions daily for suspicious
behavior, whether malicious or benign, and perform
sophisticated rule-based and statistically-derived event
filtering.
• GoTo’s Fraud Squad of 8 developers and analysts
constantly monitors and improves the fraud detection
techniques and tools, and manages the issue treatment and
resolution processes.
More About Fraud

Fraud Detection -- Attacks and Filters
 Attacks
• Inadvertent
– Crawling spiders run amok
– Advertisers testing their own listings
• Malicious
– Stockholder -- the revenue goosers
– Advertiser Vs. Advertisers
– Bored Crackers
 Filters
• Deterministic - rules based filters covering user sessions, IP addresses
and search terms. The deterministic filters catch all the blatant abuses
(repetitive clicking, repetitive searching, “speed” clicking).
• Probabilistic -- behavior pattern based, these filters discard anomalous
click groupings. The probabilistic filters are very good at catching subtle
abuses of advertiser resources: traversal of consecutive paid listings,
randomized but obviously scripted clicking, expensive clicking.
• Both deterministic and probabilistic filters are routinely updated to reflect
changes in site usage patterns.
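A deterministic filter of the kind described above might look like the following sketch. It is not GoTo's filter: the `Click` fields, the two rules, and every threshold (`max_per_ip_term`, `window`, `min_gap`) are hypothetical values chosen only to illustrate rules over sessions, IP addresses, and search terms.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    timestamp: float      # seconds
    ip: str
    session: str
    term: str


def filter_clicks(clicks, max_per_ip_term=3, window=3600.0, min_gap=1.0):
    """Split clicks into billable ones and (click, reason) rejections."""
    billable, rejected = [], []
    recent = defaultdict(list)        # (ip, term) -> times of accepted clicks
    last_in_session = {}              # session -> time of its previous click
    for click in sorted(clicks, key=lambda c: c.timestamp):
        key = (click.ip, click.term)
        # Keep only accepted clicks that still fall inside the time window.
        recent[key] = [t for t in recent[key] if click.timestamp - t < window]
        previous = last_in_session.get(click.session)
        last_in_session[click.session] = click.timestamp
        if previous is not None and click.timestamp - previous < min_gap:
            rejected.append((click, "speed clicking"))
        elif len(recent[key]) >= max_per_ip_term:
            rejected.append((click, "repetitive clicking"))
        else:
            recent[key].append(click.timestamp)
            billable.append(click)
    return billable, rejected


# Hypothetical click stream: one session hammering a term, one normal user.
clicks = [
    Click(0.0, "10.0.0.1", "s1", "flowers"),
    Click(0.2, "10.0.0.1", "s1", "flowers"),    # too fast after the first
    Click(10.0, "10.0.0.1", "s1", "flowers"),
    Click(20.0, "10.0.0.1", "s1", "flowers"),
    Click(30.0, "10.0.0.1", "s1", "flowers"),   # fourth accepted try: over limit
    Click(40.0, "10.0.0.2", "s2", "flowers"),
]
billable, rejected = filter_clicks(clicks)
```

Probabilistic filters would need behavioral history rather than fixed rules, so they are not attempted here.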
How do you do this in near-real-time?

Data Pipeline
 The ‘backbone’ of fraud detection
 A flexible array (~30) of commodity machines that
perform simple aggregations and other arithmetic
calculations in a networked and coordinated way
 A control and processing language used to describe
the required calculations, and processed by the data
pipeline machines.

Click Scoring
 Assignment of a click score for click events that
classifies them into various ‘buckets’ of validity.
 Formulas that define the ‘buckets’ based on
historical patterns of behavior of the site, and
analysis of previous fraudulent attempts.
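One way to picture click scoring is to compare an observed click count with the site's historical pattern and map the result into validity buckets. The sketch below is an assumption throughout: the slides do not give the actual formulas, so the z-score, the bucket names, the thresholds, and the sample history are purely illustrative.

```python
from statistics import mean, pstdev


def make_scorer(historical_counts):
    """Build a scoring function from historical clicks-per-visitor counts."""
    mu = mean(historical_counts)
    sigma = pstdev(historical_counts) or 1.0     # avoid dividing by zero

    def score(observed_count):
        # How many standard deviations above normal behavior this is.
        return (observed_count - mu) / sigma

    return score


def bucket(score, suspect_at=2.0, invalid_at=4.0):
    """Classify a score into a validity bucket (thresholds are made up)."""
    if score >= invalid_at:
        return "invalid"
    if score >= suspect_at:
        return "suspect"
    return "valid"


# Hypothetical history: clicks per visitor per hour on a typical day.
score = make_scorer([1, 2, 1, 3, 2, 1, 2])
labels = [bucket(score(count)) for count in (2, 4, 40)]
```

In practice the thresholds would themselves be revised as usage patterns and earlier fraud attempts are analyzed.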
Search Serving Systems
Search Serving Systems
[Diagram: serving tiers with the technology and hardware used at each. Tiers shown: Internet, Load Balancing, Application/Web Servers, Content Load Balancing, Data, Event Journalers, Fraud Detection, Data Warehouse.]

Common to All Serving Sites (Multiple Sites)
 Technology/Product Utilized
• Apache
• mod_perl
• GoTo Cache Server
• Oracle OCI Drivers
• JDBC Connection Pooling
• Custom-Developed Load Balancing
• Oracle 8i
• Quest SharePlex for Oracle
 Hardware Platforms
• Foundry ServerIron
• Foundry BigIron/FastIron
• Sun 420R
• Solaris 2.6
• Sun E4500 (Database)
• Sun 420R (Event Journalers)

Backoffice Sites Only
 Technology/Product Utilized
• Custom-Developed Fraud Detection
• Oracle 8i
• Informatica
• Business Objects
 Hardware Platforms
• RedHat Linux 6.2
• VA Linux
• 2x Sun E4500 (Database)
• 2x Sun E4500 (Informatica ETL)
• 2x Sun E450 (Data Marts/Business Objects)
• 10.5 TB SAN (StorageTek/MTI)
The Nitty-Gritty

Search Serving Platforms:
 100+ Sun e420R, 450mhz
(4), 4GB
 ATG/Dynamo/Java, and
Apache/mod_perl
 Gigabit site backbone
 InterNAP
 Multiple (3) co-location
facilities
 Search serving feeds
include HTML and XML all
through HTTP (1.0 or 1.1)
 Global Load Balancing
(Arrowpoint)
 Distributed content caching
(Akamai)

Backend Platforms:
 Data repository (16TB) for
search and click events –
several (4) e4500
Sun/Oracle 8i machines
connected to an MTI SAN
 Fraud Detection through an
array (3) of Intel/Linux
machines, utilizing custom
detection systems.
 CRM via Silknet (NT/2000)
 N-tier application backbone
via EJB (Weblogic) servers
– application integration all
through XML
 Complete DR site for fast
recovery
Facilities

6 Facilities:

Search Serving Sites
• Global Center – Sunnyvale, CA
• Cable & Wireless – Reston, VA
• ESAT – Dublin, Ireland

Offices
• Pasadena
• San Mateo
• Raleigh-Durham
• London

Development & Test Site
• Qwest CyberCenter – Burbank, CA

Backend Processing Site (New)
• Las Vegas, Nevada
Search Serving Performance
Network Operations Center
GoTo Technology Organization

Three Major Technology Groups (groupings):
 Development Groups (4)
 Technical Operations
 Architecture and Planning

About 115 people.

Number/Email to Remember:
 Me – 626-685-5743, [email protected]
The perils of an open office plan
The future…

Stickiness models are dead

The vultures are circling…

The end for ‘search engines’
 Everyone needs a revenue model
 Search → Portal → ?
 Pay for placement the norm
References

Web Sites about Search Engines
 www.searchenginewatch.com
 www.searchengineworld.com

Services
 www.wordtracker.com

Articles