Link Analysis: Current State of
the Art
Ronen Feldman
Computer Science Department
Bar-Ilan University, ISRAEL
[email protected]
Introduction to Text Mining
Find documents matching the query             Display information relevant to the query
Actual information buried inside documents    Extract information from within the documents
Long lists of documents                       Aggregate over entire collection
Let Text Mining Do the Legwork for You
Text Mining
Find Material
Read
Understand
Consolidate
Absorb / Act
What Is Unique in Text Mining?
• Feature extraction.
• Very large number of features that
represent each of the documents.
• The need for background knowledge.
• Even patterns supported by a small number
of documents may be significant.
• Huge number of patterns, hence need for
visualization, interactive exploration.
Document Types
• Structured documents
– Output from CGI
• Semi-structured documents
– Seminar announcements
– Job listings
– Ads
• Free format documents
– News
– Scientific papers
Text Representations
• Character Trigrams
• Words
• Linguistic Phrases
• Non-consecutive phrases
• Frames
• Scripts
• Role annotation
• Parse trees
The 100,000 foot Picture
[Architecture diagram. Unstructured content: web sites/HTML, news feeds, internal documents, other "raw" data. ClearTags Suite, Intelligent Tagging (Intelligent Auto-Tagging): semantic tagging, statistical tagging, structural tagging. Rich XML/API integration with external systems: corporate databases, file systems, business intelligence suites, workflow systems.]
Intelligent Auto-Tagging
<Facility>Finsbury Park Mosque</Facility>
(c) 2001, Chicago Tribune.
Visit the Chicago Tribune on the Internet at
http://www.chicago.tribune.com/
Distributed by Knight Ridder/Tribune
Information Services.
By Stephen J. Hedges and Cam Simpson
<Country>England</Country>
…….
<Country>United States</Country>
The Finsbury Park Mosque is the center of
radical Muslim activism in England. Through
its doors have passed at least three of the men
now held on suspicion of terrorist activity in
France, England and Belgium, as well as one
Algerian man in prison in the United States.
``The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet
Union in Afghanistan and he advocates the
elimination of Western influence from Muslim
countries. He was arrested in London in 1999
for his alleged involvement in a Yemen bomb
plot, but was set free after Yemen failed to
produce enough evidence to have him
extradited.''
……
<Country>France </Country>
<Country>England</Country>
<Country>Belgium</Country>
<Person>Abu Hamza al-Masri</Person>
<PersonPositionOrganization>
<OFFLEN OFFSET="3576" LENGTH="33" />
<Person>Abu Hamza al-Masri</Person>
<Position>chief cleric</Position>
<Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<City>London</City>
<PersonArrest>
<OFFLEN OFFSET="3814" LENGTH="61" />
<Person>Abu Hamza al-Masri</Person>
<Location>London</Location>
<Date>1999</Date>
<Reason>his alleged involvement in a Yemen bomb
plot</Reason>
</PersonArrest>
Intelligence Article
Google’s Article
Merger
Leveraging Content Investment
Any type of content
• Unstructured textual content (current focus)
• Structured data; audio; video (future)
In any format
• Documents; PDFs; E-mails; articles; etc
• “Raw” or categorized
• Formal; informal; combination
From any source
• WWW; file systems; news feeds; etc.
• Single source or combined sources
Information Extraction
Relevant IE Definitions
• Entity: an object of interest such as a
person or organization.
• Attribute: a property of an entity such as
its name, alias, descriptor, or type.
• Fact: a relationship held between two or
more entities such as Position of a
Person in a Company.
• Event: an activity involving several
entities such as a terrorist act, airline
crash, management change, new
product introduction.
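As a rough illustration of how these four notions relate, the sketch below models them as plain Python data classes and fills them with the Abu Hamza al-Masri annotations from the Intelligent Auto-Tagging example. The class and field names (`Entity`, `Fact`, `Event`, `attributes`, `roles`) are assumptions made for illustration, not the schema of any particular IE system.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An object of interest; attributes hold properties such as alias or type."""
    kind: str
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Fact:
    """A relationship held between two or more entities."""
    kind: str
    roles: dict


@dataclass
class Event:
    """An activity involving several entities, plus details such as date or reason."""
    kind: str
    roles: dict


# Entities taken from the tagged article excerpt.
person = Entity("Person", "Abu Hamza al-Masri")
mosque = Entity("Organization", "Finsbury Park Mosque")
london = Entity("City", "London")

# Fact: Position of a Person in an Organization.
position = Fact(
    "PersonPositionOrganization",
    {"Person": person, "Position": "chief cleric", "Organization": mosque},
)

# Event: an arrest involving a person, a location, a date and a reason.
arrest = Event(
    "PersonArrest",
    {
        "Person": person,
        "Location": london,
        "Date": "1999",
        "Reason": "his alleged involvement in a Yemen bomb plot",
    },
)
```

Because the fact and the event share the same `Entity` object, they can later be joined on that entity, which is what the link analysis part of this talk builds on.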
IE Accuracy by Information Type
Information Type    Accuracy
Entities            90-98%
Attributes          80%
Facts               60-70%
Events              50-60%
MUC Conferences
Conference    Year    Topic
MUC 1         1987    Naval Operations
MUC 2         1989    Naval Operations
MUC 3         1991    Terrorist Activity
MUC 4         1992    Terrorist Activity
MUC 5         1993    Joint Ventures and Microelectronics
MUC 6         1995    Management Changes
MUC 7         1997    Space Vehicles and Missile Launches
Applications of Information
Extraction
• Routing of Information
• Infrastructure for IR and for
Categorization (higher level features)
• Event Based Summarization.
• Automatic Creation of Databases and
Knowledge Bases.
Where would IE be useful?
• Semi-Structured Text
• Generic documents like News articles.
• Most of the information in the document is
centered around a set of easily identifiable
entities.
Approaches for Building IE
Systems
• Knowledge Engineering Approach
– Rules are crafted by linguists in cooperation with
domain experts.
– Most of the work is done by inspecting a set of
relevant documents.
– Can take a lot of time to fine tune the rule set.
– Best results were achieved with KB based IE
systems.
– Skilled/gifted developers are needed.
– A strong development environment is a MUST!
Approaches for Building IE
Systems
• Automatically Trainable Systems
– The techniques are based on pure statistics and
almost no linguistic knowledge
– They are language independent
– The main input is an annotated corpus
– Building the rules takes relatively little effort;
however, creating the annotated corpus is extremely
laborious.
– A huge number of training examples is needed in order
to achieve reasonable accuracy.
– Hybrid approaches can utilize the user input in the
development loop.
Components of IE System
Must
Advisable
Tokenization
Zoning
Nice to have
Part of Speech Tagging
Can pass
Morphological and
Lexical Analysis
Sense Disambiguation
Shallow Parsing
Syntactic Analysis
Deep Parsing
Anaphora Resolution
Domain Analysis
Integration
Why is IE Difficult?
• Different Languages
– Morphology is very easy in English, much harder in German and
Hebrew.
– Identifying word and sentence boundaries is fairly easy in
European languages, much harder in Chinese and Japanese.
– Some languages use orthography (like English) while others (like
Hebrew, Arabic, etc.) do not have it.
• Different types of style
– Scientific papers
– Newspapers
– Memos
– Emails
– Speech transcripts
• Type of Document
– Tables
– Graphics
– Small messages vs. Books
Link Analysis on Large
Textual Networks
Social Network Analysis
The Kevin Bacon Game
• The game works as follows: given any actor,
find a path between the actor and Kevin Bacon
that has fewer than 6 edges.
• For instance, Kevin Costner links to Kevin Bacon
by using one direct link: Both were in JFK.
• Julia Louis-Dreyfus of TV's Seinfeld, however,
needs two links to make a path: Julia Louis-Dreyfus
was in Christmas Vacation (1989) with
Keith MacKechnie. Keith MacKechnie was in We
Married Margo (2000) with Kevin Bacon.
• You can play the game by using the following
URL http://www.cs.virginia.edu/oracle/.
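Finding such a path is a shortest-path search on the co-appearance graph. The sketch below is a minimal breadth-first search over the two examples given above; the adjacency-list dictionary `costars` and the function name `shortest_path` are assumptions made for illustration, and the code is not a description of how the Oracle site itself works.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path from source to target as a list, or None."""
    parent = {source: None}          # also serves as the "visited" set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk the parent pointers back to the source.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for w in graph.get(u, ()):
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None


# Undirected co-appearance links from the two examples above.
costars = {
    "Kevin Bacon": ["Kevin Costner", "Keith MacKechnie"],
    "Kevin Costner": ["Kevin Bacon"],
    "Keith MacKechnie": ["Kevin Bacon", "Julia Louis-Dreyfus"],
    "Julia Louis-Dreyfus": ["Keith MacKechnie"],
}

path = shortest_path(costars, "Julia Louis-Dreyfus", "Kevin Bacon")
print(path, "links:", len(path) - 1)
```

Breadth-first search visits vertices in order of distance from the source, so the first time the target is reached the path found has the minimal number of links.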
The Erdös Number
• A similar idea is also used in the mathematical
community and is called the Erdös number of a
researcher.
• Paul Erdös (1913–1996) wrote hundreds of
mathematical research papers in many different
areas, many in collaboration with others.
• There is a link between any two mathematicians
if they co-authored a paper.
• Paul Erdös is the root of the mathematical
research network and his Erdös number is 0.
• Erdös’s co-authors have Erdös number 1.
• People other than Erdös who have written a joint
paper with someone with Erdös number 1 but
not with Erdös have Erdös number 2, and so on.
Running Example
Hijackers by Flight
Flight 77 : Pentagon    Flight 11 : WTC 1     Flight 175 : WTC 2    Flight 93 : PA
Khalid Al-Midhar        Satam Al Suqami       Marwan Al-Shehhi      Saeed Alghamdi
Majed Moqed             Waleed M. Alshehri    Fayez Ahmed           Ahmed Alhaznawi
Nawaq Alhamzi           Wail Alshehri         Ahmed Alghamdi        Ahmed Alnami
Salem Alhamzi           Mohamed Atta          Hamza Alghamdi        Ziad Jarrahi
Hani Hanjour            Abdulaziz Alomari     Mohald Alshehri
Automatic layout of networks
Pretty Graph Drawing
Motivation I
• In order to display large networks on the
screen we need to use automatic layout
algorithms. These algorithms display the
graphs in an aesthetic way without any
user intervention.
• The most commonly used aesthetic
criteria are to expose symmetries and
make the drawing as compact as possible or,
alternatively, fill the space available for the
drawing.
Motivation II
• Many of the “higher-level” aesthetic criteria
are implicit consequences of:
– minimized number of edge crossings
– evenly distributed edge length
– evenly distributed vertex positions on the
graph area
– sufficiently large vertex-edge distances
– sufficiently large angular resolution between
edges.
Disadvantages of the Spring based
methods
• They are computationally expensive and hence
minimizing the energy function when dealing with large
graphs is computationally prohibitive.
• Since all methods rely on heuristics, there is no
guarantee that the “best” layout will be found.
• The methods behave as black boxes and hence it is
almost impossible to integrate additional constraints on
the layout (such as fixing the positions of certain
vertices, or specifying the relative ordering of the
vertices)
• Even when the graphs are planar it is quite possible that
we will get edge crossings.
• The methods try to optimize just the placement of
vertices and edges while ignoring the exact shape of the
vertices or the fact that the vertices may have labels.
Kamada and Kawai’s (KK)
Method
Fruchterman Reingold (FR)
Method
Classic Graph Operations
Finding the shortest Path (from
Atta)
A better Visualization
Centrality
Degree
• If the graph is undirected then the degree
of a vertex $v \in V$ is the number of other
vertices that are directly connected to it.
– $\mathrm{degree}(v) = |\{(v_1, v_2) \in E \mid v_1 = v \text{ or } v_2 = v\}|$
• If the graph is directed then we can talk
about in-degree or out-degree. An edge
$(v_1, v_2) \in E$ in the directed graph is leading
from vertex $v_1$ to $v_2$.
– $\text{In-degree}(v) = |\{(v_1, v) \in E\}|$
– $\text{Out-degree}(v) = |\{(v, v_2) \in E\}|$
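The three definitions above amount to counting edge endpoints. A minimal sketch, assuming the graph is given as a list of `(v1, v2)` pairs with no self-loops or duplicate edges (the function names and the toy edge list are illustrative, not part of the original slides):

```python
from collections import Counter


def degrees(edges):
    """Degree of each vertex when the edges are read as undirected."""
    deg = Counter()
    for v1, v2 in edges:
        deg[v1] += 1
        deg[v2] += 1
    return deg


def in_out_degrees(edges):
    """In-degree and out-degree when each (v1, v2) leads from v1 to v2."""
    indeg, outdeg = Counter(), Counter()
    for v1, v2 in edges:
        outdeg[v1] += 1
        indeg[v2] += 1
    return indeg, outdeg


# Toy graph, made up for this example.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

deg = degrees(edges)
indeg, outdeg = in_out_degrees(edges)
print(dict(deg))
print(dict(indeg), dict(outdeg))
```

A `Counter` returns 0 for vertices it has never seen, so a vertex with no incoming edges (here `a`) simply reports an in-degree of 0.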
Degree of the Hijackers
Name                  Degree
Mohamed Atta          11
Abdulaziz Alomari     11
Ziad Jarrahi          9
Fayez Ahmed           8
Waleed M. Alshehri    7
Wail Alshehri         7
Satam Al Suqami       7
Salem Alhamzi         7
Marwan Al-Shehhi      7
Majed Moqed           7
Khalid Al-Midhar      6
Hani Hanjour          6
Nawaq Alhamzi         5
Ahmed Alghamdi        5
Saeed Alghamdi        3
Mohald Alshehri       3
Hamza Alghamdi        3
Ahmed Alnami          1
Ahmed Alhaznawi       1
Closeness Centrality Motivation
• Degree centrality measures might be
criticized because they only take into
account the direct connections that an
entity has, rather than indirect connections
to all other entities.
• One entity might be directly connected to a
large number of entities that might be
pretty isolated from the network. Such an
entity is central only in a local
neighborhood of the network.
Closeness Centrality
• This measure is based on the calculation of the
geodesic distance between the entity and all
other entities in the network.
• We can either use directed or undirected
geodesic distances between the entities.
• The sum of these geodesic distances for each
entity is the "farness" of the entity from all other
entities.
• We can convert this into a measure of closeness
centrality by taking the reciprocal.
• In addition, we can normalize the closeness
measure by dividing it by the closeness measure
of the most central entity.
Closeness : Formally
• Let $d(v_1, v_2)$ = the minimal distance
between $v_1$ and $v_2$, i.e., the minimal
number of edges that we need to traverse
on the way from $v_1$ to $v_2$.
$$C_i = \frac{|V| - 1}{\sum_{j \neq i} d(v_i, v_j)}$$
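The closeness score, (|V| - 1) divided by the sum of geodesic distances from the vertex to all others, can be computed with one breadth-first search per vertex. This is a minimal sketch that assumes an undirected, connected graph stored as an adjacency-list dictionary; the star-shaped toy graph and the function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Geodesic distance (number of edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(graph):
    """C_i = (|V| - 1) / sum_j d(v_i, v_j); assumes the graph is connected."""
    n = len(graph)
    scores = {}
    for v in graph:
        farness = sum(bfs_distances(graph, v).values())
        scores[v] = (n - 1) / farness
    return scores


# Toy star network: one hub connected to three leaves.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
scores = closeness(star)
print(scores)
```

In the star, the hub is at distance 1 from everyone (farness 3, closeness 1.0), while each leaf is at distance 1 from the hub and 2 from the other leaves (farness 5, closeness 0.6).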
Closeness of the Hijackers
Name                  Closeness
Abdulaziz Alomari     0.6
Ahmed Alghamdi        0.5454545
Ziad Jarrahi          0.5294118
Fayez Ahmed           0.5294118
Mohamed Atta          0.5142857
Majed Moqed           0.5142857
Salem Alhamzi         0.5142857
Hani Hanjour          0.5
Marwan Al-Shehhi      0.4615385
Satam Al Suqami       0.4615385
Waleed M. Alshehri    0.4615385
Wail Alshehri         0.4615385
Hamza Alghamdi        0.45
Khalid Al-Midhar      0.4390244
Mohald Alshehri       0.4390244
Nawaq Alhamzi         0.3673469
Saeed Alghamdi        0.3396226
Ahmed Alnami          0.2571429
Ahmed Alhaznawi       0.2571429
Betweenness Centrality
• Betweenness centrality measures the
effectiveness with which a vertex connects
the various parts of the network.
• The main idea behind betweenness
centrality is that entities that are mediators
have more power. Entities that are on
many geodesic paths between other pairs
of entities are more powerful since they
control the flow of information between the
pairs.
Betweenness - Formally
• Highest possible betweenness: $\frac{(|V| - 1)(|V| - 2)}{2}$
• $g_{jk}$ = the number of geodesic paths that
connect $v_j$ with $v_k$
• $g_{jk}(v_i)$ = the number of geodesic paths that
connect $v_j$ with $v_k$ and pass via $v_i$.
$$B_i = \sum_{j < k} \frac{g_{jk}(v_i)}{g_{jk}}$$
$$NB_i = \frac{2 B_i}{(|V| - 1)(|V| - 2)}$$
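To make the definitions concrete, the sketch below computes the betweenness score and its normalized version directly from the definitions: a breadth-first search from every vertex yields distances and shortest-path counts, and a shortest path from j to k passes via i exactly when d(j,i) + d(i,k) = d(j,k). This is a straightforward quadratic-per-vertex illustration for small undirected graphs with more than two vertices, not the algorithm used to produce the figures in this talk; the function names and toy graphs are assumptions.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (dist, sigma): geodesic distances and numbers of geodesic paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # Every geodesic path to u extends to a geodesic path to w.
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph):
    """Return (B, NB) following B_i = sum_{j<k} g_jk(v_i) / g_jk."""
    info = {v: bfs_counts(graph, v) for v in graph}
    n = len(graph)
    B = dict.fromkeys(graph, 0.0)
    for j, k in combinations(graph, 2):          # unordered pairs, i.e. j < k
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue                             # no path between j and k
        for i in graph:
            if i == j or i == k:
                continue
            dist_i, sigma_i = info[i]
            if i in dist_j and k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                B[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]
    NB = {v: 2 * b / ((n - 1) * (n - 2)) for v, b in B.items()}
    return B, NB


# Toy star network: every leaf-to-leaf path runs through the hub.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
B, NB = betweenness(star)
print(B["hub"], NB["hub"])

# Toy 4-cycle: each pair of opposite vertices has two geodesic paths.
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
B_cycle, NB_cycle = betweenness(cycle)
```

The hub of the star lies on the single geodesic path of all three leaf pairs, so it reaches the highest possible betweenness, (|V| - 1)(|V| - 2)/2 = 3, and a normalized score of 1. In the 4-cycle each vertex carries half of the paths of one opposite pair, giving 0.5.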
Betweenness of the Hijackers
Name                  Betweenness (Bi)
Hamza Alghamdi        0.3059446
Saeed Alghamdi        0.2156863
Ahmed Alghamdi        0.210084
Abdulaziz Alomari     0.1848669
Mohald Alshehri       0.1350763
Mohamed Atta          0.1224783
Ziad Jarrahi          0.0807656
Fayez Ahmed           0.0686275
Majed Moqed           0.0483901
Salem Alhamzi         0.0483901
Hani Hanjour          0.0317955
Khalid Al-Midhar      0.0184832
Nawaq Alhamzi         0
Marwan Al-Shehhi      0
Satam Al Suqami       0
Waleed M. Alshehri    0
Wail Alshehri         0
Ahmed Alnami          0
Ahmed Alhaznawi       0
Eigenvector Centrality
• The main idea behind eigenvector centrality is
that entities receiving many communications
from other well-connected entities will be better
and more valuable sources of information, and
hence be considered central. The eigenvector
centrality scores correspond to the values of the
principal eigenvector of the adjacency matrix M.
• Formally, the vector v satisfies the equation
$$\lambda v = M v$$
where $\lambda$ is the corresponding eigenvalue and M
is the adjacency matrix.
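One common way to obtain the principal eigenvector is power iteration: repeatedly multiply a start vector by the matrix and renormalize. The sketch below is an illustration under stated assumptions, not the procedure used for the scores in this talk: it assumes a connected undirected graph given as a dense 0/1 adjacency matrix, and it iterates with M + I instead of M, which has the same eigenvectors but avoids the oscillation that plain power iteration shows on bipartite graphs such as a star.

```python
import math


def eigenvector_centrality(M, iterations=1000, tol=1e-12):
    """Return (eigenvalue, unit-length principal eigenvector) of adjacency matrix M."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One step with (M + I): w = M v + v.
        w = [v[i] + sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        converged = max(abs(a - b) for a, b in zip(v, w)) < tol
        v = w
        if converged:
            break
    # Rayleigh quotient v^T M v recovers the eigenvalue of M itself.
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, Mv))
    return eigenvalue, v


# Toy star network: vertex 0 is the hub, vertices 1..3 are leaves.
M = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
eigenvalue, scores = eigenvector_centrality(M)
print(round(eigenvalue, 4), [round(s, 4) for s in scores])
```

For this star the principal eigenvalue is the square root of 3, and the hub scores higher than the leaves because each leaf's only neighbor is the hub while the hub collects the scores of all three leaves.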
Eigenvector Centralities of the Hijackers
Name                  E1
Mohamed Atta          0.518
Marwan Al-Shehhi      0.489
Abdulaziz Alomari     0.296
Ziad Jarrahi          0.246
Fayez Ahmed           0.246
Satam Al Suqami       0.241
Waleed M. Alshehri    0.241
Wail Alshehri         0.241
Salem Alhamzi         0.179
Majed Moqed           0.165
Hani Hanjour          0.151
Khalid Al-Midhar      0.114
Ahmed Alghamdi        0.085
Nawaq Alhamzi         0.064
Mohald Alshehri       0.054
Hamza Alghamdi        0.015
Saeed Alghamdi        0.002
Ahmed Alnami          0
Ahmed Alhaznawi       0
Power Centrality
• Given an adjacency matrix M, the power
centrality of vertex i (denoted $c_i$), is given by
$$c_i = \sum_{j \neq i} M_{ij}\,(a + b \cdot c_j)$$
• a is used to normalize the score; the
normalization parameter is automatically
selected so that the sum of squares of the
vertices' centralities is equal to the number of
vertices in the network.
• b is an attenuation factor that controls the effect
that the power centralities of the neighboring
vertices should have on the power centrality of
the vertex.
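Because each score depends on the neighbors' scores, the definition is a linear system: with a = 1 it reads (I - bM)c = M·1, and the solution is then rescaled so that the sum of squares equals the number of vertices. The sketch below solves that system with a small hand-written Gaussian elimination to stay dependency-free. It is an illustration under assumptions (dense adjacency matrix with zero diagonal, a value of b for which I - bM is invertible), and the helper names are made up for this example.

```python
import math


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: choose a different b")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def power_centrality(M, b):
    """Scores c with c_i = sum_j M_ij (a + b c_j), scaled so sum(c_i^2) = |V|."""
    n = len(M)
    # With a = 1 the definition becomes (I - b M) x = M 1.
    A = [[(1.0 if i == j else 0.0) - b * M[i][j] for j in range(n)] for i in range(n)]
    rhs = [float(sum(row)) for row in M]
    x = solve(A, rhs)
    a = math.sqrt(n / sum(v * v for v in x))     # normalization parameter
    return [a * v for v in x]


# Toy path network 0 - 1 - 2.
M = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
c_zero = power_centrality(M, 0.0)   # b = 0: proportional to degree
c_pos = power_centrality(M, 0.5)
print(c_zero, c_pos)
```

With b = 0 the neighbors' scores are ignored and the result is just the degree vector rescaled; changing b changes how strongly the neighbors' power feeds into each vertex's score.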
Power - Motivation
• In a similar way to the eigenvector centrality, the
power centrality of each vertex is determined by
the centrality of the vertices it is connected to.
• By specifying positive or negative values for b the
user can control whether the fact that a vertex is
connected to powerful vertices should have a
positive effect on its score or a negative effect.
• The rationale for specifying a positive b is that if
you are connected to powerful colleagues it
makes you more powerful.
• On the other hand, the rationale for a negative b
is that powerful colleagues have many
connections and hence are not controlled by
you, while isolated colleagues have no other
sources of information and hence are pretty
much controlled by you.
Power of the Hijackers
Name                  Power : b = 0.99    Power : b = -0.99
Mohamed Atta          2.254               2.214
Marwan Al-Shehhi      2.121               0.969
Abdulaziz Alomari     1.296               1.494
Ziad Jarrahi          1.07                1.087
Fayez Ahmed           1.07                1.087
Satam Al Suqami       1.047               0.861
Waleed M. Alshehri    1.047               0.861
Wail Alshehri         1.047               0.861
Salem Alhamzi         0.795               1.153
Majed Moqed           0.73                1.029
Hani Hanjour          0.673               1.334
Khalid Al-Midhar      0.503               0.596
Ahmed Alghamdi        0.38                0.672
Nawaq Alhamzi         0.288               0.574
Mohald Alshehri       0.236               0.467
Hamza Alghamdi        0.07                0.566
Saeed Alghamdi        0.012               0.656
Ahmed Alnami          0.003               0.183
Ahmed Alhaznawi       0.003               0.183
Network Centralization
• In addition to the individual vertex centralization
measures, we can assign a number between 0 and 1
that will signal the level of centralization of the whole
network.
• The network centralization measures will be computed
based on the centralization values of its vertices and
hence we will have for each type of individual centralization
measure an associated network centralization measure.
• A network that is structured like a circle will have a
network centralization value of 0 (since all vertices have
the same centralization value), while a network that is
structured like a star will have a network centralization
value of 1.
• We will now provide some of the formulas for the
different network centralization measures.
Degree
$$\mathrm{Degree}^*(V) = \max_{v \in V} \mathrm{Degree}(v)$$
$$\mathrm{NET}_{Degree} = \frac{\sum_{v \in V} \left(\mathrm{Degree}^*(V) - \mathrm{Degree}(v)\right)}{(n - 1)(n - 2)}$$
For the Hijackers network NetDegree = 0.31
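As a quick check of the degree centralization formula (sum of the gaps to the largest degree, divided by (n - 1)(n - 2)), the sketch below applies it to the degree values listed in the Degree of the Hijackers table, plus a star and a circle as the two extreme cases. The function name is an assumption made for illustration.

```python
def degree_centralization(degrees):
    """Sum of (max degree - degree) over all vertices, divided by (n - 1)(n - 2)."""
    n = len(degrees)
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


# Degrees copied from the "Degree of the Hijackers" table (19 vertices).
hijacker_degrees = [11, 11, 9, 8, 7, 7, 7, 7, 7, 7, 6, 6, 5, 5, 3, 3, 3, 1, 1]
net_degree = degree_centralization(hijacker_degrees)
print(round(net_degree, 2))

star = degree_centralization([3, 1, 1, 1])      # hub plus three leaves
circle = degree_centralization([2, 2, 2, 2])    # every vertex has two neighbors
print(star, circle)
```

The gaps to the maximum degree of 11 add up to 95, and 95 / (18 · 17) is about 0.31, in line with the value reported above; the star and the circle give 1 and 0 respectively.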
Betweenness
$$NB^*(V) = \max_{v \in V} NB(v)$$
$$\mathrm{NET}_{Bet} = \frac{\sum_{v \in V} \left(NB^*(V) - NB(v)\right)}{n - 1}$$
For the Hijackers network NetBet = 0.24
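The betweenness centralization can be checked the same way: sum the gaps to the largest normalized betweenness and divide by n - 1. The sketch below assumes that the values in the Betweenness of the Hijackers table are the normalized scores NB(v); that reading is an assumption, supported only by the fact that it reproduces the reported 0.24. The function name is illustrative.

```python
def betweenness_centralization(nb):
    """Sum of (max NB - NB) over all vertices, divided by (n - 1)."""
    n = len(nb)
    top = max(nb)
    return sum(top - x for x in nb) / (n - 1)


# Values copied from the "Betweenness of the Hijackers" table (19 vertices),
# assumed here to be normalized betweenness scores.
hijacker_nb = [
    0.3059446, 0.2156863, 0.210084, 0.1848669, 0.1350763, 0.1224783,
    0.0807656, 0.0686275, 0.0483901, 0.0483901, 0.0317955, 0.0184832,
    0, 0, 0, 0, 0, 0, 0,
]
net_bet = betweenness_centralization(hijacker_nb)
print(round(net_bet, 2))

# Star with four vertices: the hub has NB = 1, the leaves have NB = 0.
star = betweenness_centralization([1.0, 0.0, 0.0, 0.0])
print(star)
```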
Summary Diagram