Mining Wiki Resources for
Multilingual Named Entity
Recognition
Xiejun
2008.07.31
Outline








Target
Strategy
Major Wikipedia features to exploit
English language categorization
Multilingual categorization
Full system
Results
Summary
Target

To use the multilingual characteristics of
Wikipedia to annotate a large corpus of text
with named entity recognition (NER) tags,
with minimal human intervention and no
linguistic expertise.
Strategy


• Use the category structure inherent to
Wikipedia to determine the named entity type
of a proposed entity.
• Use English-language data to bootstrap
the NER process in other languages.
Five major Wikipedia features to exploit (1)





• Article links: links from one article to another in the
same language.
• Category links: links from an article to special
“Category” pages.
• Interwiki links: links from an article to a presumably
equivalent article in another language.
• Redirect pages: short pages that often provide
equivalent names for an entity.
• Disambiguation pages: pages with little content that
link to multiple similarly named articles.

The first three types are collectively referred to as wikilinks.
Five major Wikipedia features to exploit (2)

A typical sentence in database format

Article links
“Nescopeck Creek is a [[tributary]] of the [[North Branch
Susquehanna River]] in [[Luzerne County,
Pennsylvania|Luzerne County]].”

Category links
Found near the end of the same article, for example
[[Category:Luzerne County, Pennsylvania]] and
[[Category:Rivers of Pennsylvania]].

Interwiki links
For example, the Turkish-language article “Kanuni
Sultan Suleyman” contains a set of links including
[[en:Suleiman the Magnificent]] and [[ru:Сулейман I]].
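
The three wikilink types above can be told apart from the raw markup alone. The following is a minimal sketch, not the system's actual code: the function name `classify_wikilinks` and both regular expressions are assumptions made for illustration, and a real implementation would check the prefix against the list of known Wikipedia language codes rather than accept any short lowercase prefix.

```python
import re

# Matches [[target]] or [[target|display text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# Assumed shape of an interwiki prefix, e.g. "en" in "en:Suleiman the Magnificent".
LANG_PREFIX = re.compile(r"^([a-z]{2,3}(?:-[a-z]+)*):(.+)$")


def classify_wikilinks(wikitext):
    """Split the wikilinks in `wikitext` into article, category and interwiki links."""
    links = {"article": [], "category": [], "interwiki": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        display = (match.group(2) or target).strip()
        if target.lower().startswith("category:"):
            # Keep only the category name, without the "Category:" prefix.
            links["category"].append(target.split(":", 1)[1].strip())
            continue
        lang = LANG_PREFIX.match(target)
        if lang:
            links["interwiki"].append((lang.group(1), lang.group(2).strip()))
        else:
            links["article"].append((target, display))
    return links
```

Applied to the Nescopeck Creek sentence, this yields three article links, with “Luzerne County” kept as the display text of the piped link.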
English Language Categorization(1)
Some Useful Category Phrases
(manually derived)
English Language Categorization(2)
Procedure
1. For each article, search the category
hierarchy until a threshold of reliability is
passed or a preset limit on search distance
is reached.
2. If an article is not classified by this method,
check whether it is a disambiguation
page (Category:Disambiguation). If it is, the
links within are checked to see whether
there is a dominant type.
3. Finally, use Wiktionary to eliminate some
common nouns.
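
Steps 1 and 2 of this procedure can be sketched as a level-by-level search up the category graph. This is only an illustration under stated assumptions: the key phrases in `KEY_PHRASES` are placeholders (the manually derived table is not reproduced here), the "threshold of reliability" is modelled as a simple vote count `min_votes`, the search limit as `max_depth`, and all function names are hypothetical.

```python
from collections import Counter

# Placeholder key phrases for illustration; not the manually derived table.
KEY_PHRASES = {
    "PERSON": ("people by", "births", "deaths"),
    "GPE": ("cities", "towns", "countries"),
    "ORG": ("companies", "organizations"),
}


def phrase_type(category):
    """Return the entity type whose key phrase occurs in the category name, if any."""
    name = category.lower()
    for entity_type, phrases in KEY_PHRASES.items():
        if any(phrase in name for phrase in phrases):
            return entity_type
    return None


def classify_by_categories(categories, parents, min_votes=2, max_depth=3):
    """Search upward one level at a time; stop at the vote threshold or depth limit."""
    votes = Counter()
    frontier, seen = list(categories), set(categories)
    for _ in range(max_depth):
        for category in frontier:
            entity_type = phrase_type(category)
            if entity_type:
                votes[entity_type] += 1
        if votes:
            best, count = votes.most_common(1)[0]
            if count >= min_votes:
                return best
        # Move to the parent categories not visited yet.
        next_frontier = []
        for category in frontier:
            for parent in parents.get(category, ()):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        if not frontier:
            break
    return None


def dominant_type(link_types, min_share=0.5):
    """For a disambiguation page: return the type held by a majority of its links."""
    counts = Counter(t for t in link_types if t)
    if not counts:
        return None
    best, count = counts.most_common(1)[0]
    return best if count / len(link_types) > min_share else None
```

Here `parents` is assumed to be a dictionary mapping each category name to its parent categories, and `dominant_type` receives the types already assigned to the articles linked from a disambiguation page.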
English Language Categorization(3)
Example

To classify “Jacqueline Bhabha”:

• Extract the article's categories: “British lawyers”,
“Jewish American writers”, and “Indian Jews”.
• Extract the second-order categories: “Lawyers by
nationality”, “British legal professionals”,
“American writers by ethnicity”, “Indian people by
origin”, “Indian people by ethnic or national
origin” and so on.
• Result: PERSON
Multilingual Categorization(1)

Make the decision based on English-language
information.

• First, whenever possible, find the title of an
associated English-language article by searching
for a wikilink beginning with “en:”.
• If such a title is found, categorize the English
article and assign the same type to the
non-English title.
• If not, attempt a decision based on category
information, associating the categories
with their English equivalents when possible.
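
The decision order above can be sketched as a small function. This is a hedged illustration, not the system's code: every name here is hypothetical, the two English-side classifiers are passed in as callables, and `english_category_of` is assumed to be a dictionary built beforehand from the [[en:...]] links found on the non-English category pages.

```python
def english_title(interwiki_links):
    """Return the title of the first 'en:' interwiki link, or None if there is none."""
    for lang, title in interwiki_links:
        if lang == "en":
            return title
    return None


def classify_non_english(interwiki_links, categories, classify_english_article,
                         english_category_of, classify_english_categories):
    """Classify a non-English article using English-language information only."""
    title = english_title(interwiki_links)
    if title is not None:
        # An equivalent English article exists: reuse its type.
        return classify_english_article(title)
    # Otherwise map each category to its English equivalent, dropping unknown ones.
    mapped = [english_category_of.get(category) for category in categories]
    known = [category for category in mapped if category is not None]
    return classify_english_categories(known) if known else None
```

Categories with no English equivalent are simply skipped, so a single mapped category can still be enough to reach a decision.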
Multilingual Categorization(2)
Example

The Breton town of Erquy has a substantial article in
the French-language Wikipedia, but no article in
English.

• Extract the categories: “Catégorie:Commune des Côtes-d'Armor,”
“Catégorie:Ville portuaire de France,”
“Catégorie:Port de plaisance,” and “Catégorie:Station
balnéaire.”
• Associate these categories respectively with “Category:
Communes of Côtes-d'Armor,” UNKNOWN, “Category:
Marinas,” and “Category: Seaside resorts” by looking in
the French-language page of each for wikilinks of the
form [[en:...]].
• The first is a subcategory of “Category: Cities, towns and villages
in France”, so the result is GPE.
Full system

The main processing of each article takes place in
several stages:

• A first pass uses the explicit article links within the text.
• Search an associated English-language article, if
available, for additional information.
• A second pass checks for multi-word phrases that exist
as titles of Wikipedia articles.
• Look for certain types of person and organization
instances.
• Perform additional processing for alphabetic or
space-separated languages, including a third pass looking for
single-word Wikipedia titles, to identify more names of people.
• Use regular expressions to locate additional entities such as
numeric dates.
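
Two of these stages lend themselves to a short illustration: the second pass over multi-word phrases and the regular-expression step for numeric dates. The sketch below is an assumption-laden approximation: `tag_known_titles` uses a greedy longest-match lookup (the slides do not say how phrases are matched), `title_types` is a hypothetical dictionary from article title to entity type, and the date pattern covers only simple day/month/year forms.

```python
import re

# Assumed pattern for simple numeric dates such as 31/07/2008 or 1-5-97.
NUMERIC_DATE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")


def tag_known_titles(tokens, title_types, max_len=4):
    """Greedy longest-match lookup of multi-word phrases among known article titles."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first; stop at two words (multi-word pass only).
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in title_types:
                spans.append((i, i + n, title_types[phrase]))
                i += n
                break
        else:
            # No phrase starting at this token matched.
            i += 1
    return spans


def find_numeric_dates(text):
    """Return the numeric date strings found in `text`."""
    return [match.group(0) for match in NUMERIC_DATE.finditer(text)]
```

Each span is a `(start, end, type)` triple over token positions, with `end` exclusive.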
Results
• Spanish: 25,000 words of human-annotated
newswire derived from the
ACE 2007 test set vs. 335,000 words
of Wiki-derived data
held out during training (from
290,000 articles, Oct. 2007)
• French: 25,000 words of human-annotated
newswire (Agence France
Presse, 30 April and 1 May 1997)
covering diverse topics vs. 920,000
words of Wiki-derived data (from
570,000 articles, Oct. 2007)
Summary


• More suitable for bilingual or multilingual
dictionaries
• More suitable for known entities