Learning Based Web Query Processing
Yanlei Diao
Computer Science Department
Hong Kong U. of Science & Technology
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
2
Searching the Web
Want to find a piece of information on the Web?
Huge Size
Heterogeneity
Lack of
Structure
Diversified
User Bases
EverChanging
Mphil Thesis, Yanlei Diao
3
Search Engines

Maintain indices, keyword input, match input
keywords with indices, return relevant documents.

Problems
 Large hit lists with low precision. Users find
relevant documents by browsing.
 URLs but not the required information are
returned. Users read the pages for the required
information.
Mphil Thesis, Yanlei Diao
4
Web Information Retrieval
 IR: Vector-space model, search and browse
capabilities
 Web
IR: Web navigation, indexing, query
languages, query-document matching, output
ranking, user relevance feedback
 Recent
Improvement: Hierarchical classification,
better presentation of results, hypertext study,
metasearching...
Mphil Thesis, Yanlei Diao
5
Web IR for Query Processing
Problems
 A list of URLs or documents is returned. Users
browse a lot to find information.
 It asks users for precise query requirements,
which is hard for casual users.
 It lacks a well-defined underlying model. Vectorspace model does not convey as much as
Hypertext.
Large hit lists with low precision, rely on input
queries
Mphil Thesis, Yanlei Diao
6
Intelligent Agents
The agents learn user profiles/models from their search
behaviors and employ the knowledge to predict
URLs of interest to the user.



Some rely on search engines and heuristics to find targets of a
specific type: e.g. papers or homepages
Some help users in an interactive mode: They learn while
users are browsing.
Some adaptive agents work autonomously: They use
heuristics, recommend pages of interest and take user
feedback to improve.
Mphil Thesis, Yanlei Diao
7
Agents for Query Processing
Problems
 Recommending pages of interest, but not
information of interest to the user
 Using vector-space model or converting HTML to
text documents
 Requiring a prior knowledge, such as user
profiles, or using heuristics for a particular
domain
Not well suited for ad hoc queries
Mphil Thesis, Yanlei Diao
8
Database Approaches
 The
Web is a directed graph: nodes are Web
pages and edges are hyperlinks between pages.
 Query
languages: 1st generation combines
content-based and structure-based queries. 2nd
generation accesses structure of Web objects and
creates complex objects.
 Wrappers
and mediators: they present an
integrated view of the resources.
Mphil Thesis, Yanlei Diao
9
DB Approaches for Query Processing
Problems
 Wrapper generation is only feasible for a number
of sites in a domain. The Web is growing very
fast!
 Web query languages require knowledge of the
Web sites (content and linkage) and the language
syntax. They are hard to use.
Not scalable, good for Web site management but not
queries on the entire Web.
Mphil Thesis, Yanlei Diao
10
Our Goal
A Web query processing system for any Web users that
 processes ad hoc queries on HTML pages
 automatically extracts succinct and precise query
results ( a result may take the form of a table, a list or
a paragraph).
 Learn the knowledge for query processing from the
User!
Mphil Thesis, Yanlei Diao
11
Proposed Approach
An approach with learning capabilities:
 Keyword input (probably not precise)
 Search engines return a URL list
 During browsing, learns from users
 to navigate through the web pages
 to identify the required information on a web page
 Processes the rest URLs automatically
 Returns succinct and precise results
Mphil Thesis, Yanlei Diao
12
Unique Features
Returning succinct and precise results, i.e. segments
of pages;
 No a prior knowledge or preprocessing, suited for ad
hoc queries;


exploiting page formatting and linkage information
simultaneously, good use of rich information
conveyed by HTML.
Mphil Thesis, Yanlei Diao
13
Benefits from Learning
Bridging the gap between keyword input and real
query requirements
 Capable of navigating in the neighborhoods of
documents returned by search engines
 Automating the processing of all possibly relevant
documents in one query
 Almost imperceptible to users, user-friendly

Mphil Thesis, Yanlei Diao
14
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
15
Modeling a Web Page


Segment: a group of tag delimited elements, unit in query
processing, e.g. paragraph, table, list, nested (atomic segments
to the document), Segment Tree
Attributes of a segment
 content: text in the scope of the segment
 description: summary of the content

Hyperlink: represented as segments to be comparable
 content: URL
 description: anchor text
 associated with the parent segment
Mphil Thesis, Yanlei Diao
16
A Sample
<html><head>
<title> … Hotel </title></head>
<body><p>1999 Room Rates</p>
<table><tr><td><ul>
<li><a href="ac01a.html">
Guest Room</a></li>
<li><a href="ac02a.html">
Executive Suite</a></li></ul></td>
<td> Special Promotion <br>
<table><tr><td>Room Type</td>
<td>Single/Double (HK$)</td>
<tr><td>Standard</td>
<td>1000</td></tr>
<tr><td>Excutive Suite</td>
<td>2750</td>
</tr></table></td></tr></table>
</body></html>
Mphil Thesis, Yanlei Diao
Document
Paragraph
Content
Content
Content
Table
List
Table
"1999 Room
Rates"
Link
1.
ac01a.html
2.
ac02a.html
Content
& contents of
child
paragraph and
table
"Special
Promotion"
& the
content of
the child
table
Content
"Room Type Single
/Double (HK$)
Standard 1000
Executive Suite 2750"
17
Modeling a Web Site
S1
Ignore backward links,
links pointing to
themselves, links outside
a site.
L1
S21
A Web site is modeled as
hyperlink-connected
segment trees, called
Segment Graph.
Mphil Thesis, Yanlei Diao
S12
S11
S13
S131
L3
L2
S2
S3
S31
S32
L4
Definition:
Sijk: Segment
Lm:Hyperlin
k
S41
S4
18
Knowledge for the Locating Task
The locating task is to find a segment in the Segment Graph of a site as the
query result.
1) Exhaustive search simplifies it, but is impractical.
2) Navigation in the graph should terminate if a segment answers the query
well enough or conclusion of irrelevancy can be drawn.
A decision of following a link or choosing a segment should be made on
each page.
Segments and links on a page should be comparable!
Mphil Thesis, Yanlei Diao
19
Two Types of Knowledge
A link conveys description of the pointed page while a queried segment
contains both description and the result itself.
Segments and links on a page are not comparable by content!
Two types of knowledge are needed!

One only concerns descriptive information and helps find
the navigational path.

The other checks if a segment meets query requirements
on both descriptive information and the result.
Mphil Thesis, Yanlei Diao
20
Navigation Knowledge
concerns descriptive information and helps find the
navigational path
 a set of (term, weight) pairs
 Term: a selected word f the description of
segments and links on the navigational path
 Weight: indicating the importance of the term in
leading to the queried segment

Mphil Thesis, Yanlei Diao
21
Learning Navigation Knowledge
Navigational path, (link)*segment, e.g. L2L4S41.
Extended navigational path, ((segment )*link)* ((segment )*
segment), e.g. (S1S11L2)  (S3S31L4)  (S4S41).
Step1. Assign a weight to each component on the path, e.g. L2, S31, S41. The
closer to the target, the higher the weight.
Step2. Assign a weight to each term in the description of a component on
the path.
The weight of a term can be summed up over navigational paths. The set of
(term, weight) pairs is stored into the navigation knowledge base.
Mphil Thesis, Yanlei Diao
22
Classification knowledge
Checks if a segment meets query requirements on
both descriptive information and the result.
 Cast in the Bayesian learning framework.


Set of triples: (feature, NP, NN)
 Feature: word, integer, real, symbol, …, date, time, email
address, …, contained in a segment
 NP: #occurrences of the feature in positive samples
 NN: #occurrences of the feature in negative samples
Mphil Thesis, Yanlei Diao
23
Learning Classification knowledge
The queried segment is a positive sample. All other segments
on the same page are negative samples.
The content of each segment is parsed into a set of features,
either simple and complex types.
Count NP and NN accumulatively for each feature over all
samples. Store all triples (feature, NP, NN) into the
classification knowledge base.
Mphil Thesis, Yanlei Diao
24
Query Processing Using Learned Knowledge




After a Web page is retrieved, the segment graph is built
For each segment and link, a score is computed by applying
the navigation knowledge (ApplyNavigation).
Segments/links are sorted on the score
 If a link has the highest score, the system navigates
through the link
 If a segment has the highest score, all segments on the
page are checked to see if there is a queried segment
The process is repeated until either a segment is found or
conclusion can be made that the site does not contain queried
information.
Mphil Thesis, Yanlei Diao
25
Locating Algorithm
S1
On each page, if the
result is not found:
choosing an unprocessed
component with highest
score:
if a link is chosen 
if a segment is chosen
Mphil Thesis, Yanlei Diao
S12
S11
L1
S21
S13
S131
L3
L2
S2
S3
S31
S32
L4
Definition:
Sijk: Segment
Lm:Hyperlin
k
S41
S4
26
Locating Algorithm
On each page, if the
result is not found:
choosing an unprocessed
component with highest
score:
if a link is chosen
if a segment is chosen 
(ApplyClassification)
Mphil Thesis, Yanlei Diao
S1
S12
S11
L1
S21
S13
S131
L3
L2
S2
S3
S31
S32
L4
Definition:
Sijk: Segment
Lm:Hyperlin
k
S41
S4
27
Applying Learned Knowledge

Application of Navigation Knowledge:
 extracts terms in the description of a link/segment
 reads the weights of the terms and assigns a score to the
link/segment by a certain function (max currently)
 sorts all links and segments by their scores

Application of Classification Knowledge:
 computes the confidence C to classify a segment as the
queried result
 chooses the segment on a page with the largest C. If the
largest C is over a threshold, returns the segment
Mphil Thesis, Yanlei Diao
28
forward
Hotel 1
3
Hotel 2
User browses it!
Mphil Thesis, Yanlei Diao
done
29
User clicks here!
Mphil Thesis, Yanlei Diao
30
Room information
User marks it!
Mphil Thesis, Yanlei Diao
31
Generating Navigation Knowledge
The navigation path looks like:
Hotel Reservation->single hk$ double hk$ standard
room deluxe room +executive room
 By our weighting scheme, a weight is assigned to
each term

hotel
0.25
reservation
0.25
Mphil Thesis, Yanlei Diao
single
0.2
double
0.2
standard
0.2
deluxe
0.2
executive
0.2
32
Generating Classification Knowledge

Training Samples
Positive
standard room
deluxe room
+executive room
Negative
single hk$
999.00
1,199.00
1,399.00
double hk$
1,039.00
1,239.00
1,499.00
Holiday Inn Golden Mile
In the heart of Tsim Sha Tsui - Kowloon,
Holiday Inn Golden Mile is your number
one choice for accommodation, dining,
meetings and banquets.
Ideally situated in the heart of ...

Occurrences of each feature are counted
accomodation banquet
positive
0
0
negative
1
1
Mphil Thesis, Yanlei Diao
…
deluxe double &'$' executive &float
1
1
2
1
6
1
0
2
1
0
…
single standard
1
1
1
2
33
back
Fact starts here!
Mphil Thesis, Yanlei Diao
34
Mphil Thesis, Yanlei Diao
35
Applying Navigation Knowledge
The page contains
Paragraph
Links
57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
Main
Paragraph
Features & Services
Located in the hub of Wanchai, the Wharney Hotel is within walking
distance of the Hong Kong Arts Centre, Convention and Exhibition Centre,
busy commercial complexes and shopping malls.
Dining and
Banqueting
...
Paragraph
Reservation
Hotel Rates
...
TEL: (852) 2861-1000
FAX: (852) 2865-6023
Navigation knowledge shows
…
execut
0.2
hong
hotel
0.285714 0.392857
Mphil Thesis, Yanlei Diao
…
kong
0.285714
…
rate
3
servic
reserve
suit
0.066667 0.25 0.230769
36
Navigation Knowledge
assigns scores
0.285714
0.392857
0.230769
0.392857
0
0
Current
Mphil Thesis, Yanlei Diao
0.0666667
0
3.0
Fact chooses it!
0.25
0
37
Navigation Knowledge
assigns scores
Table: 0.586447
Paragraph: 3.0
Paragraph: 0.25
List: 0.25
Visited
Mphil Thesis, Yanlei Diao
0.0666667
0
Current
0.25
0
38
Classification Knowledge
computes confidence
C=1.0
C=0.3569
C=2.5e-007
Apply
Classification
Knowledge to
all Segments
C=6.3e-008
C=0.0001
Mphil Thesis, Yanlei Diao
39
Fact finds it!
Mphil Thesis, Yanlei Diao
40
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
41
A Query Processing System
A learning based query processing system:

User Interface: accepts user queries, presents query results, a
browser capable of capturing user actions

Query Analyzer: analyzes and transforms user queries

Session Controller: coordinates learning and locating


Learner: generates knowledge from captured user actions
Locator: applies knowledge and locates query results
Retriever & Parser: retrieves pages and parses to trees

Knowledge Base: stores learned knowledge

Mphil Thesis, Yanlei Diao
42
Reference Architecture
User
User Interface
Learner
Query Analyzer
Session
Controller
Knowledge
Base
Locator
Retriever & Parser
Search Engine
Web
Mphil Thesis, Yanlei Diao
43
A Query Session
Learning Process
Scripts
Learner
Browser
User
Actions
URLs
Session Controller
Result
Buffer
Query
results
Knowledge
Base
Checking
Query Result
Presenter
Mphil Thesis, Yanlei Diao
Training Segment
Strategy Graph
Locator
Locating Process
44
Training Strategies



Sequential
 First n sites: user browses and system learns
 Next N-n sites: system processes
Random
 Randomly choose n sites: user browses and system learns
 the system processes the rest
Interleaved
 First n0 sites, user browses and system learns
 Next n - n0 site, system makes decision. For incorrect ones,
user browses and system re-learns
 Next N-n sites: system processes
Mphil Thesis, Yanlei Diao
45
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
46
System Evaluation
System Capabilities
 Performance

 Effectiveness: precision, recall, correctness
 Efficiency: in a site, how many pages the system visits to
find a result or to recognize the irrelevancy
 Training efficiency: how many training samples are
needed

Key Issues
 Effectiveness of the knowledge
 Effectiveness of training strategies

Tests on A Range of Queries
Mphil Thesis, Yanlei Diao
47
A System Output Sample
Mphil Thesis, Yanlei Diao
48
System Capabilities
The system returns segments of the Web pages
 The segments may not contain any input keyword
but meet the requirement of room rates.

The system learned the query requirement from the user!

Segments can be from pages whose URLs are not
directly returned by Yahoo!.
The system learned how to follow the hyperlinks to the
queried segment!
Mphil Thesis, Yanlei Diao
49
System Evaluation - Effectiveness

Given a set of URLs in a query session, the
system makes N decisions
Found Not Found
N =N1 + N2 + N3 + N4
Precision
Recall
Correctness
Mphil Thesis, Yanlei Diao
Right
Wrong
N1
N3
N2
N4
= N1 / (N1+N3) ,
= N1 / # sites that contain results,
= (N1+N2) / N .
50
System Evaluation - Efficiency

How efficiently the system finds a queried
segment in a site?
Level of a Queried Segment = the length of the
shortest path to find it
Absolute Path length = # Visited pages,
Relative Path Length = # Visited pages / Level
of the Queried Segment .
Mphil Thesis, Yanlei Diao
51
Basic Performance
Q11: Hong Kong Hotel Room Rate
Q12: Hong Kong Hotel
URL
URL
URL URL for URL
Relevant
Returned Selected Used Training Processed URL
Q11
424 100
69
9
60
29
Q12 69533 100
71
9
62
38
Precision Recall Correctness
Q11 76.7
79.3
81.7
Q12 87.5
73.7
79.0
Mphil Thesis, Yanlei Diao
Sequential training
52
Effectiveness of Knowledge
Other two systems implemented for comparison
 Classification Knowledge Only: treat links and
segments the same by the Bayes classifier
 Learning Action
positive
negative
click a link
the link
other links on the page
mark a segment the segment other segments on the page
 Locating
Mphil Thesis, Yanlei Diao
Classify all segments and links
If a link has the highest confidence, follow the link;
If a segment has the highest confidence and passes
the threshold, return it.
53
Effectiveness of Knowledge

Navigation Knowledge Only: only checks the
descriptive information of links and segments
 Learning
Navigational path  Navigation Knowledge
 Locating
Mphil Thesis, Yanlei Diao
Assigns scores to all links and segments using
navigation knowledge
If a link has the highest score, follow the link;
If a segment has the highest score, return it.
54
Effectiveness of Knowledge
Effectiveness of three
Query 1
Query 2
systems
Correctness Precision Recall
Correctness Precision Recall
Both Types of Knowledge
81.70% 76.70% 79.30%
79.00% 87.50% 73.70%
Bayesian Only
58.30% 51.20% 69.00%
38.70% 34.00% 42.10%
Navigation Only
36.70% 28.80% 55.60%
29.00% 26.80% 39.50%
Correctness of three
Irrelevant
systems
31(60)
Both Types of Knowledge
83.80%
Bayesian Only
48.40%
Navigation Only
21.20%
Bad filtering capability!
Navigation only checks description,
nearly not workable
Mphil Thesis, Yanlei Diao
Query 1
Query 2
Level=1 Level=2 Irrelevant Level=1 Level=2
27(60)
2(60)
24(62)
12(62)
26(62)
81.50%
50% 87.50% 91.70% 65.40%
70.40%
50% 33.30%
75% 26.90%
60%
0% 12.50% 58.30% 30.80%
Only works for results
on the first page
Poor navigation
capability!
55
Effects of Training Strategies
Query 12 - Precision
Query 12 - Correctness
1
1
Precision
0.6
0.4
Sequential
Random
Interactive
0.2
0
0.6
0.4
Sequential
0.2
Random
Interactive
0
3
3
5
7
Training Size
9
10
5
7
9
10
Training Size
Query 12 - Recall
1
Query Q12
Training Size 3-10
0.8
Recall
Correctness
0.8
0.8
0.6
0.4
Sequential
Random
0.2
Interactive
0
3
5
7
Training Size
Mphil Thesis, Yanlei Diao
9
10
56
Effects of Training Strategies
Random training performs badly, low in recall
 As the training size increases, interleaved training
outperforms sequential training
 Best accuracy reaches or exceeds 90% in all metrics
when the interleaved training strategy is used
 Enlarging the training size for
random and sequential
training is not effective

Query 2 - Correctness (3-20)
Correctness
1
0.8
0.6
0.4
Sequent ial
0.2
Random
0
3
5
7
9
10
12
15
20
Training Size
Mphil Thesis, Yanlei Diao
57
Improved Performance
Q11
Correctness
0.93
Precision
0.92
Recall
0.88
Relative Path Length
1
(Found)
Absolute Path Length 1.3
(Not Found)
Q12
0.9
0.92
0.94
1.21
1.57
Interleaved training
Mphil Thesis, Yanlei Diao
58
A Range of Queries
Hotel room rates: targets at prices, easy to identify
 Admission requirements on graduate student: includes

items such as degree, GPA, GRE, etc. that are not easy to
specify in keywords but easy to show by marking

Data Mining Researcher: concept, subjective, evidence
including research interests, projects, professional activity, etc
Query Requirement (QR)
1: room rates of Hong Kong hotels
2: admission requirements on
graduate applicants
3: data mining researcher
Mphil Thesis, Yanlei Diao
Keyword Query KQ1
Keyword Query KQ2
11: “Hong Kong hotel room rate” 12: "Hong Kong hotel"
21: “requirements graduate
applicant”
22: “graduate applicant”
3: “data mining researcher”
59
Results of A Range of Queries
Interleaved training
QR1
Correctness
Precision
Recall
Relative Path Length (Found)
Absolute Path Length (Not Found)
KQ11
0.93
0.92
0.88
1
1.3
QR2
KQ12
0.9
0.92
0.94
1.21
1.57
More precise
Mphil Thesis, Yanlei Diao
KQ21
0.84
0.85
0.94
1.08
2.5
KQ22
0.91
0.88
0.91
1.1
1.76
QR3
KQ3
0.83
0.64
0.67
1
1.67
More precise
60
Performance for the Queries

Effectiveness
 first 4 queries: accuracy is 80% to above 90%
 the last query: still capable of filtering out irrelevant sites

Efficiency
 relative path length to locate a queried segment is close to 1
 absolute path length to conclude irrelevancy is no more than
2.5 pages.

The performance is not affected much by how precise
the keyword query is. The system learns query
requirements
Mphil Thesis, Yanlei Diao
61
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
62
Conclusions
Proposed and implemented learning based Web
query processing with the following features
 Returning succinct results: segments of pages;
 No a prior knowledge or preprocessing, suited for
ad hoc queries;
 exploiting page formatting and linkage
information simultaneously.
 The preliminary results are promising

Mphil Thesis, Yanlei Diao
63
Future Work
Better segmentation for HTML documents
 Better knowledge, key factor that affects system
performance
 other weighting schemes for navigation
knowledge
 other implementation of classification knowledge
 More system evaluation
 Dynamic web pages

Mphil Thesis, Yanlei Diao
64
Outline
Background
 Learning Based Web Query Processing
 FACT: A Prototype System
 Preliminary System Evaluation
 Conclusions
 Demonstration

Mphil Thesis, Yanlei Diao
65
Demonstration
Mphil Thesis, Yanlei Diao
66
Descargar

Learning Based Web Query Processing