The Usage of Various Lexical Resources and
Tools to Improve the Performance
of Web Search Engines
Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović
Human Language Technology Group, University of Belgrade, Serbia
Contents
• Typical problems when retrieving documents using a web search engine
• The lexical resources used
• The system options
• Technical implementation
• Results and evaluation
HLT Group, University of Belgrade
Typical problems when retrieving
documents using a web search engine
Highly inflective language
• Donošenjem odluke… (By making a decision…)
• Odluka o priređivanju igara… (A decision to organize games)
• Ministarstvo donosi odluku… (The Ministry shall make a decision…)
• Sastojci za 10 porcija: 3 glavice crnog luka, 1 šoljica ulja, 1/2 čaša belog vina, 1 čaša soka od paradajza (The ingredients for 10 portions: 3 onions, 1 cup of oil, ½ glass of white wine, 1 glass of tomato juice.)
Typical problems
Bilingual search
• search in order to find documents on the chosen subject in two languages, e.g. English and Serbian
Lexical realization of a concept
• synonyms: beli luk ‘garlic’ → češnjak
• hyponyms: muzički instrument ‘musical instrument’ → klavir ‘piano’, gitara ‘guitar’, etc.
• derivations: Beograd ‘Belgrade’ → Beograđanin ‘man from Belgrade’, Beograđanka ‘woman from Belgrade’, etc.
• and other relations
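The relations listed on this slide can be pictured as lookup tables keyed by the query term. The sketch below is only an illustration under that assumption: the table layout and the `expand` function are hypothetical, and the entries are just the examples from the slide, whereas a real system would consult a wordnet and a proper-name database.

```python
# Toy relation tables; the entries are the examples from the slide.
# The layout is an assumption, not the structure of any real resource.
RELATIONS = {
    "synonym": {"beli luk": ["češnjak"]},
    "hyponym": {"muzički instrument": ["klavir", "gitara"]},
    "derivation": {"Beograd": ["Beograđanin", "Beograđanka"]},
}


def expand(term, relations=("synonym", "hyponym", "derivation")):
    """Return the term followed by its related words for the chosen relations."""
    expanded = [term]
    for relation in relations:
        for related in RELATIONS[relation].get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(expand("beli luk"))
print(expand("muzički instrument", relations=("hyponym",)))
```

Keeping the relation name as a parameter mirrors the idea that the user chooses which kind of expansion to apply.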
The lexical resources used
WS4QE: Work Station for Query Expansion
• Morphological dictionaries: the Serbian morphological dictionary of both simple and compound words is in LADL format, developed for the Unitex system (http://www-igm.univ-mlv.fr/~unitex): 117,000 lemmas with 1,400,000 different lexical words
• Inflectional finite state transducers (FST): FSTs for inflection of the Serbian morphological dictionary
• WordNets: the Serbian WN, conceived within the Balkanet project, with 14,593 synsets, and the Princeton WN are used for query expansion with related words and for bilingual searches
• Prolex: multilingual database of proper names organized around a conceptual proper name that represents the same concept in different languages (http://www.cnrtl.fr/lexiques/prolex/)
The lexical resources used
• For the query beli luk ‘garlic’, two FSTs for the components and one for the compound are used, producing only 12 instead of 216 possible combinations:
beli luk AND belim lukom AND beli lukovi AND belih lukova AND belima lukovima AND belim lukovima AND bele lukove AND bela luka AND beloga luka AND belog luka AND belome luku AND belom luku
• thus preventing false retrievals such as:
...posmatrano sa dna vidika, izgleda kao da iz širokih lukova belog mosta teče i razliva se ne samo zelena Drina…
(Thus, from a bottom view, it appears that not only the green Drina flows and spills over under the wide arches of the white bridge…)
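The reduction from all component combinations to the valid compound forms can be illustrated with a toy agreement filter. This is a minimal sketch, not the FST mechanism itself: the form lists are heavily reduced, the `case.number` tags are assumed labels, and the real dictionaries hold many more forms per lemma.

```python
from itertools import product

# Hypothetical, reduced form lists: (surface form, case.number tag).
BELI = [("beli", "nom.sg"), ("belog", "gen.sg"), ("belim", "ins.sg"), ("beli", "nom.pl")]
LUK = [("luk", "nom.sg"), ("luka", "gen.sg"), ("lukom", "ins.sg"), ("lukovi", "nom.pl")]


def all_combinations(adjectives, nouns):
    """Every adjective form paired with every noun form, valid or not."""
    return [f"{a} {n}" for (a, _), (n, _) in product(adjectives, nouns)]


def agreeing_combinations(adjectives, nouns):
    """Keep only the pairs whose components agree in case and number."""
    phrases = []
    for (a, a_tag), (n, n_tag) in product(adjectives, nouns):
        phrase = f"{a} {n}"
        if a_tag == n_tag and phrase not in phrases:
            phrases.append(phrase)
    return phrases


print(len(all_combinations(BELI, LUK)))
print(agreeing_combinations(BELI, LUK))
```

With these toy lists the unconstrained product gives 16 strings, while the agreement filter keeps 4, all of which appear among the 12 forms listed on the slide.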
The system options
• Alternate alphabet usage
  - štrajk ‘strike’ → штрајк
• Inclusion of inflectional forms
  - štrajk ‘strike’ → štrajk, štrajka, štrajkovi, etc.
• Addition of related words
  - štrajk ‘strike’ → obustava rada ‘work stoppage’
  - solarni sistem ‘solar system’ → Merkur, Venera, Zemlja, Mars
  - Engleska ‘England’ → Englez ‘Englishman’, Engleskinja ‘Englishwoman’, plus Albion
• Inflection of free phrases
  - inflection of free phrases by predicting their syntactic structure
Result: improved query
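The alternate alphabet option amounts to transliterating the query between the Serbian Latin and Cyrillic scripts. The function below is a minimal sketch of the Latin-to-Cyrillic direction; the name `to_cyrillic` is hypothetical, and the code assumes that every lj, nj and dž is a digraph, which does not hold for a few words (for example injekcija or nadživeti).

```python
# Serbian Latin digraphs map to single Cyrillic letters.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
# The remaining letters map one to one.
SINGLE = dict(zip("abcčćdđefghijklmnoprsštuvzž", "абцчћдђефгхијклмнопрсштувзж"))


def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic, digraphs first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # spaces, digits, punctuation stay unchanged
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(to_cyrillic("štrajk"))
print(to_cyrillic("beli luk"))
```

A production version would need an exception list for the words in which the letter pairs are not digraphs.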
Rule based procedure for inflection
• Procedure for automatic inflection of compounds and phrases based on a set of rules
• Rule design strategy: the result of expert knowledge of morphology and the analysis of existing manually created compound dictionaries
• Experiments with various rule strategies are possible; the final strategy is the result of several iterations
• The rule based strategy presently consists of 53 rules with a total of 1014 rule subtypes (rule parts)
EXAMPLE OF RULE NUMBER 43, CLASS NC_N6X
Class     Gramm. condition   Frequency
NC_N6X    _:fs1q__           3
          _:ms1q__           2
          _:ms1v__           2
          _:ns1q__           1
          _:fs1v__           0
          _:ns1v__           0
Additional conditions:
(The first component is a noun) AND ((the second, the third and the fourth component are in the genitive) OR (the second word is a preposition and the third word agrees with it))
Rule based procedure for inflection
<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="1">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2"/>
    <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2"/>
    <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2"/>
  </RuleType>
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
  <RulePart ID="1" Frequency="3" Example="princ na belom konju">
    <WordRP ID="1" GramCats="ms1v" />
  </RulePart>
  <RulePart ID="2" Frequency="2">
    <WordRP ID="1" GramCats="ms1q" />
  </RulePart>
  <RulePart ID="3" Frequency="2">
    <WordRP ID="1" GramCats="ns1q" />
  </RulePart>
  <RulePart ID="4" Frequency="1">
    <WordRP ID="1" GramCats="fs1q" />
  </RulePart>
  <RulePart ID="5" Frequency="0">
    <WordRP ID="1" GramCats="ns1v" />
  </RulePart>
  <RulePart ID="6" Frequency="0">
    <WordRP ID="1" GramCats="fs1v" />
  </RulePart>
</Rule>
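A rule of this shape can be read with a standard XML parser and checked against the part-of-speech sequence of a phrase. The sketch below is an assumption-laden illustration: it embeds only the RuleType part of rule 43, ignores the Condition attributes (GramCats, PrepAgr), and assumes the tagging N PREP A N for the example phrase princ na belom konju ‘prince on a white horse’; the function names and the tag "A" for adjectives are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced copy of rule 43: only the RuleType patterns are kept.
RULE_XML = """
<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="1">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2" />
  </RuleType>
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
</Rule>
"""


def matching_rule_types(rule, pos_tags):
    """IDs of rule types whose POS pattern fits the tags ('*' fits anything).

    Condition attributes are NOT evaluated in this sketch.
    """
    matches = []
    for rule_type in rule.findall("RuleType"):
        pattern = [word.get("POS") for word in rule_type.findall("WordRT")]
        if len(pattern) == len(pos_tags) and all(
            p in ("*", t) for p, t in zip(pattern, pos_tags)
        ):
            matches.append(rule_type.get("ID"))
    return matches


def inflecting_positions(rule, rule_type_id):
    """1-based positions of the words that the rule type marks as inflecting."""
    rule_type = rule.find(f"RuleType[@ID='{rule_type_id}']")
    return [int(w.get("ID")) for w in rule_type.findall("WordRT") if w.get("Flex") == "true"]


rule = ET.fromstring(RULE_XML)
print(matching_rule_types(rule, ["N", "PREP", "A", "N"]))
print(inflecting_positions(rule, "2"))
```

Because the conditions are skipped, both rule types fit the example on POS alone; evaluating the agreement conditions would be needed to choose between them.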
Rule based procedure for inflection
• System evaluation on three separate sets of data that differ both in content and in structure:
  - compound toponyms (238)
  - formal names of professions (356)
  - search engine queries (728), taken from the log file of one of the Serbian professional journals
• Evaluation indicated that the strategy can be integrated into the morphological query expansion mechanism for compounds and phrases which do not exist in the compound dictionary
Technical implementation
• The process
  - The developed web application receives the user query,
  - subsequently uses the local web service WS4QE to expand the query, and
  - forwards it to the Google search engine using the Google AJAX Search API (which enables the embedding of Google searches into personal web pages or web applications)
• Interface
  - Query expansion is implemented with different possibilities and levels of detail, so the web user can choose from several options, from simple query expansion to complex wordnet advanced search
  - Search results are displayed within our own web pages for different types of query expansion, depending on the resources and the type of expansion
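The step between expansion and forwarding can be sketched as composing one query string from the expanded variants. How WS4QE actually composes the string is not stated on the slide; the sketch assumes that variants are joined with the search engine's OR operator and that multi-word variants are quoted as exact phrases. The function name `build_query` is hypothetical, and no search API is called.

```python
def build_query(variants):
    """Join query variants into one OR query, quoting multi-word phrases.

    Joining with OR is an assumption made for this sketch.
    """
    unique = []
    for variant in variants:
        variant = variant.strip()
        if variant and variant not in unique:
            unique.append(variant)  # drop blanks and duplicates, keep order
    quoted = [f'"{v}"' if " " in v else v for v in unique]
    return " OR ".join(quoted)


print(build_query(["beli luk", "бели лук", "češnjak", "чешњак"]))
```

The resulting string would then be handed to whatever search interface the web application uses.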
Technical implementation
• The web service WS4QE uses classes from .NET dll components developed within WS4LR (WorkStation for Lexical Resources)
• WS4LR enables the usage of lexical resources for query expansion
• Diagram: the components that make up the WS4LR system and their inter-relationships
Technical implementation
WS4QE home page
• Wordnet advanced search
Compare
• The query submitted directly to Google with only the initial string ‘beli luk’ returned a total of 54,900 documents
• The query expanded with ‘бели лук’, ‘češnjak’, ‘чешњак’ and then submitted by WS4QE to Google returned a total of 92,700 documents
Results for expanded query
Compare
• The query submitted directly to Google obtained 66,600 documents
• The query expanded with a hypernym, in both alphabets, obtained 160,000 documents
• Morphological expansion in two alphabets (without semantic expansion) obtained 285,000 documents
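The relative effect of each expansion can be read off as a growth factor over the unexpanded query. The short calculation below uses only the three counts on this slide; the labels and the function name are illustrative. Hit counts reported by a search engine are estimates, so the factors indicate a change in the number of retrieved documents, not in precision.

```python
BASELINE = 66_600  # documents for the query submitted directly
EXPANDED = {
    "hypernym, both alphabets": 160_000,
    "morphological, both alphabets": 285_000,
}


def growth_factor(baseline, expanded):
    """How many times more documents the expanded query returned."""
    return expanded / baseline


for label, count in EXPANDED.items():
    print(f"{label}: x{growth_factor(BASELINE, count):.2f}")
```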
Conclusion
Formulation of queries
• Queries often need to be ‘fine tuned’ in order to obtain an optimal balance between recall and precision
Approach
• Lexical resources can be put to the aid of the user by offering them various possibilities of query expansion
Further endeavors
1. We shall continue to develop our lexical resources
2. We will strive to broaden the scope of tasks that can be solved with our tools