HUMAN EXPERTISE AND ARTIFICIAL
INTELLIGENCE IN VERTICAL SEARCH
Peter Jackson & Khalid Al-Kofahi
Corporate Research & Development
HORIZONTAL VERSUS VERTICAL SEARCH

HORIZONTAL                    VERTICAL
Consumer focus                Professional focus
General interest              Specialist interests
Average user                  Expert user
Shallow information need      Deep information need
THE PARADOX OF SEARCH
• The further you get from keyword indexing and
retrieval, the harder it is to explain a search result
– Professional searchers demand transparency
• Tool versus appliance
• You need an ‘explanatory model’ that people can
relate to and understand, even if it is actually just a
cartoon of the real process
– Examples: Basic PageRank, Collaborative Filtering
• Such models don’t work so well in vertical domains
– Links aren’t always endorsements
– Sparsity of data in smaller communities
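An 'explanatory model' of the kind the slide mentions can be made concrete. The sketch below is illustrative only (not the authors' code, and not the production algorithm of any engine): basic PageRank by power iteration, the sort of cartoon of link analysis that users can relate to. The link graph is invented for the example.

```python
# Illustrative sketch: basic PageRank as an 'explanatory model' --
# a score users can relate to, even if the production ranker is far
# more complex. The link graph below is a toy example.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            # Each page shares its rank equally among its out-links.
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)  # "c", with two in-links, ends up with the highest rank
```

Note how the model itself suggests why it breaks down in vertical domains: it assumes every link is an endorsement, which the next bullet points out is not true for, e.g., legal citations.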
RECENT TRENDS IN SEARCH
• Fragmentation of ‘horizontal’ search
– Media, location, demographics (Weber & Castillo, 2010)
• More sophisticated models of user behavior
– Post-click behaviors (Zhong, Wang, et al., 2010)
• ‘Practical semantics’ versus Semantic Web
– Maps as search results for local, micro-results
• Incorporation of domain knowledge into search
– Taxonomies, vocabularies, use cases, work flows
THE EXAMPLE OF LEGAL SEARCH
• The completeness requirement
– Recall as important as precision
• Less redundancy than on the Web
• The authority requirement
– Court superiority, jurisdiction
– Highly cited cases and statutes
• Supersession by statute or regulation
• The multi-topical nature of documents
– A case may cover many points of law but be cited for only one
– Citations can be negative as well as positive per topic
> These factors also apply to scientific documents
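The last two bullets imply that citation evidence must be keyed on (cited document, topic), not on the document alone. A hypothetical sketch of that data model (case names and polarities invented for illustration):

```python
# Hypothetical sketch: legal citations carry a polarity *per topic*,
# not per document -- a case can be followed on one point of law and
# criticized on another. Data below is invented for illustration.
from collections import defaultdict

citations = [
    # (citing_case, cited_case, topic, polarity)
    ("Case X", "Burger King Corp. v. Rudzewicz", "personal jurisdiction", +1),
    ("Case X", "Burger King Corp. v. Rudzewicz", "choice of law", -1),
    ("Case Y", "Burger King Corp. v. Rudzewicz", "personal jurisdiction", +1),
]

# Aggregate citation signal per (cited case, topic) rather than per case.
signal = defaultdict(int)
for _, cited, topic, polarity in citations:
    signal[(cited, topic)] += polarity

print(dict(signal))  # positive on jurisdiction, negative on choice of law
```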
POWER LAW AND LEGAL TOPICS
[Figure: power-law distribution over legal topics]
POWER LAW AND WESTLAW USERS
[Figure: power-law distribution over Westlaw user activity]
EXPERT SEARCH
• In many verticals, there are at least two sources of
expertise available for enhancing search
– Editors and authors, who generate useful metadata
– Users, who generate clickstreams and other data
• Editorial value addition improves recall especially
– Helps find both fat-neck and long-tail documents on a topic
• Aggregate user behavior mostly improves precision
– Power users find most relevant and important documents
• The model of expert search enables and explains the portfolio
of results rather than individual results
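The bullets above can be sketched as a simple blend of evidence sources: editorial metadata mainly widens the candidate set (recall), while aggregate user behavior mainly reorders it (precision). The feature names and weights below are purely illustrative assumptions, not the system's actual features:

```python
# Hypothetical sketch of blending evidence sources. Features and
# weights are illustrative, not the production model.
def score(doc):
    return (0.4 * doc["text_similarity"]      # query-document match
            + 0.3 * doc["editorial_overlap"]  # shared headnotes/key numbers
            + 0.3 * doc["coclick_rate"])      # aggregate user behavior

docs = [
    {"id": "A", "text_similarity": 0.9, "editorial_overlap": 0.1, "coclick_rate": 0.0},
    {"id": "B", "text_similarity": 0.5, "editorial_overlap": 0.8, "coclick_rate": 0.7},
]
ranked = sorted(docs, key=score, reverse=True)
print([d["id"] for d in ranked])  # B outranks A despite weaker text match
```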
SOURCES OF EVIDENCE:
AUTHORS & EDITORS
[Figure: cases for the issue of long-arm jurisdiction, including Burger King
Corp. v. Rudzewicz, linked through editorial headnotes (HN) and key numbers
(KN); the result set comprises 12 key cases (A) and 54 highly relevant
cases (B).]
SOURCES OF EVIDENCE: AUTHORS & EDITORS
[Figure: Burger King Corp. v. Rudzewicz linked, through its headnotes
(HN1–HN35) and key numbers (KN1–KN14) and through secondary sources
(ALR, CJS, AMJUR), to other sets of related cases.]
SOURCES OF EVIDENCE: USERS (I)
[Figure: user sessions containing queries and actions (CLICK, PRINT,
KEYCITE) on the Burger King case. Two kinds of evidence: (1) link query
language to document language via click, print, and cite-checking
behaviors; (2) identify documents that are co-clicked, co-printed, etc.,
with the Burger King case across user sessions.]
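The co-click idea above can be sketched in a few lines. The session data shapes are assumptions for illustration; the real pipeline runs over billions of log records:

```python
# Sketch (assumed data shapes): count documents co-clicked with a
# target case across user sessions. Sessions here are sets of
# clicked document IDs; real logs are far richer.
from collections import Counter

sessions = [
    {"bk_case", "doc1", "doc2"},
    {"bk_case", "doc1"},
    {"doc2", "doc3"},
]

target = "bk_case"
coclicks = Counter()
for clicked in sessions:
    if target in clicked:
        for doc in clicked - {target}:
            coclicks[doc] += 1

print(coclicks.most_common())  # doc1 is co-clicked most often with the target
```

The same pattern extends to co-prints and co-cite-checks by swapping the action type.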
SOURCES OF EVIDENCE: USERS (II)
IN THE LAST 3 MONTHS
[Figure: user evidence for Burger King Corp. v. Rudzewicz: 9,758 total
sessions and 10,417 user actions (click and print actions). Most frequent
associated queries: "personal jurisdiction" (176), "minimum contacts" (50),
"forum selection clause" (39), "personal jurisdiction" (39), "forum non
conveniens" (32), "choice of law" (29). Note: the original breach of
contract and trademark infringement case turned into a civil procedure
case about jurisdiction on appeal.]
AI & THE RANKING PROBLEM
• Supervised Machine Learning (Ranking SVM)
– Iteratively retrieve and rank documents
– Incorporate all available cues: text similarity,
classifications, citations, user behavior and query logs
– All of this requires lots of data!
• Training & Validation
– Gold data: hand-crafted research reports covering a
variety of legal issues
– Each report contains an issue statement, multiple queries, all
seminal and highly relevant documents, and some relevant documents
• > 100K documents judged against ~400 legal issues
– The system was also tested by an independent third party
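The pairwise idea behind a Ranking SVM can be illustrated with a much-simplified stand-in (this is not the authors' system): learn weights over cues from preference pairs of the form "document i should outrank document j". A real Ranking SVM adds margins and regularization; the sketch below is a pairwise perceptron on feature differences, with invented feature values.

```python
# Much-simplified stand-in for a Ranking SVM (illustrative only):
# learn weights over cues (text similarity, citations, user behavior)
# from pairwise preferences. A real Ranking SVM optimizes a margin
# with regularization; this is a pairwise perceptron.

def train(pairs, dim, epochs=20, lr=0.1):
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # Update only when the preferred document is not ranked higher.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Features (invented): [text_similarity, citation_count, coclick_rate]
pairs = [
    ([0.9, 0.8, 0.7], [0.4, 0.2, 0.1]),  # seminal > merely relevant
    ([0.6, 0.9, 0.8], [0.7, 0.1, 0.0]),
]
w = train(pairs, dim=3)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```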
HADOOP FOR BIG DATA PROCESSING
• At launch, query logs contained ~2 billion records
– Queries & user actions
• Relied on a Hadoop cluster to
– Run Extract, Transform, and Load (ETL) processes
– Cluster similar queries together
– Extract, normalize, and collate citation contexts
• Dramatic improvement in processing times
– From tens of hours to tens of minutes
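The query-clustering step above maps naturally onto the map/reduce pattern. The toy sketch below simulates the two phases in plain Python (it is not Hadoop code, and the normalization key is an invented, crude example); on the cluster, the map output would be shuffled to reducers by key:

```python
# Toy map/reduce sketch (not Hadoop itself) of one ETL step: group
# similar queries under a normalized key, as a Hadoop job would shard
# this across mappers and reducers. The normalization is deliberately
# crude -- lowercase and sort the words.
from collections import defaultdict

def map_phase(query):
    key = " ".join(sorted(query.lower().split()))
    return key, query

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, query in pairs:
        groups[key].append(query)
    return groups

logs = ["personal jurisdiction", "Jurisdiction personal", "choice of law"]
clusters = reduce_phase(map_phase(q) for q in logs)
print(list(clusters.values()))  # the two jurisdiction queries cluster together
```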
HADOOP: TYPICAL SPEED-UPS

COMPUTATION                                     NORMAL TIME   HADOOP TIME
Building complete Westlaw dictionary            2.5 days      1 hour
Clustering similar Westlaw queries              1.5 days      3 minutes
Citation extraction from over 10M documents     1.25 days     3 hours
CLUSTER CONFIGURATION: QUERIES
• 8 machines, each with 16 cores
• Only 14 cores/machine were available for
processing
– Giving a total of 112 cores
• Block size of 64 MB
– Each core processes one block at a time
• Cluster can process 7 GB at each step
• Latest cluster is twice the size: 224 cores
– Almost 1 TB of memory and over 1 PB of storage
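The per-step capacity figure follows directly from the slide's numbers (each core holding one 64 MB HDFS block at a time):

```python
# Checking the capacity arithmetic from the slide.
machines, usable_cores_per_machine = 8, 14   # 2 cores/machine reserved
cores = machines * usable_cores_per_machine  # total processing cores
block_mb = 64                                # one HDFS block per core at a time
step_capacity_gb = cores * block_mb / 1024
print(cores, step_capacity_gb)               # 112 cores, 7.0 GB per step
```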
THE POWER OF EXPERT SEARCH
• Leverages expertise of community: authors,
editors, & users
– We know why documents are linked
– We know exactly who our users are
• Metadata, authority & aggregated user data all
contribute to relevance, importance & popularity
• Can still benefit from Power Law phenomena so
common on the Web
• Can exploit data parallelism to achieve the same
kind of scale as horizontal search
LESSONS LEARNED
• Vertical search is not just about search
– It’s about findability
• Includes navigation, recommendations, clustering, faceted
classification, etc.
– It’s about satisfying a set of well-understood tasks
• Usually on enhanced content
• Usually for expert customers
• Leveraging human value addition is key
– None of the human actors set out to improve search
• Difficult to design complete solution upfront
– Need platform for experimentation and validation at scale
QUESTIONS?
• A relevant paper is downloadable from
http://labs.thomsonreuters.com