On Detecting Association-Based Clique
Outliers in Heterogeneous Information
Networks
Manish Gupta1, Jing Gao2, Xifeng Yan3, Hasan Cam4,
Jiawei Han5
1Microsoft
2SUNY,
Buffalo
3UCSB
ASONAM 2013
10/7/2015
4US-ARL
5UIUC
Association-Based Clique Outliers
• Given a network, a conjunctive select query
consists of (type, predicate) pairs
• Expected result is a clique with unranked results
• ABCOutliers: cliques ranked with respect to the
unusual associations among entities
• Applications
– Discovering interesting relationships
– Data de-noising (removing incorrect data attributes or
entity associations)
– Explaining the future behavior of objects participating
in such associations
10/7/2015
ABCOutlier Examples
• CS Research (DBLP Network) Example
– Authors usually publish in conferences of their research
area, and use terms related to their research area
– Query
Outlier
Research
Area
Author
Conference
computer
networking
author
energy
and
Sustain
-ability
data
engineering
conference
3
Abstract
• ABCOutliers: Cliques containing rare and
interesting associations between constituent
entities
• Introduce a low-cost shared neighbors index to
assist clique matching
• Outlierness of a clique is computed based on
association outlierness of the attributes of nodes
within the clique
• TopK algorithm to rank cliques
• Experiments on several synthetic datasets and a
10-type Wikipedia dataset
10/7/2015
Concept Definitions: A Network
•
•
•
•
 = 〈, , 1 , 2 〉
1 :  → Τ where Τ is set of entity types
2 maps every node to attribute vector
Each node has multiple attributes each with a different
data type ( = (1 , 2 , … ,  ) such that | | =  )

• Attributes could be numeric, categorical, sets of strings,
time durations
• Edges are not typed for our work
– Person and movie; Edge could be actor, producer, director
10/7/2015
Concept Definitions
• Conjunctive Select Query (Q) of length L on a network
– 〈 1 , 1 , 2 , 2 , … ,  ,  〉
–  is defined only for type 
• Match C for set query Q
Mahatma
Gandhi
Civil Rights
Movement
India
South Africa
–  =
– Contains the same number and type of nodes as mentioned in the query
– All nodes in C are connected to each other in G and satisfy the predicates in Q
 ×1
 ×1
〈11 , … ,  〉
• Association-Based Clique Outlier
– Outlierness of a clique =f(outlierness of associations between the attributes of
these nodes)
– Outlierness of association (1 , 2 ) for attributes 1 and 2 is high if
• 1 and 2 are frequent
• Co-occurrence of 1 and 2 is rare
• High correlation between 1 and 2
• Association-Based Clique Outlier Detection Problem
– Given: a network G and a query Q
– Find: top K cliques that satisfy the query with highest outlier scores.
10/7/2015
Q=<(T1,P1), (T2,P2), …, (TL,PL)>
Q1=<(T1,P1)>
Q2=<(T2,P2)>
T1
T3
T2
…
…
QL=<(TL,PL)>
TT
Matching
L1
Network G
L2
LL
⋮
Cluster
Computation
for an Attribute
Score Computation
for a Query Edge
Candidate
Computation by
Matching
 ×
 ×
>
 ×
 ×
 , … , 
>
 =< 
 =<
⋮
, … , 
TopK
Quit?
TopK
ABCOutliers
Outlier Detection
Candidate Computation by Matching
Graph Indexing
• Relational database:
Attribute information
associated with each of the
vertices (entities) in G
• Memory: Connectivity
information of the graph
• Shared neighbors index:
For each entity, store the
number of shared
neighbors of each type,
shared between the entity
and its neighbors of a
particular type
T1
A
2
Network G
3
A
A
5
6
B
B
8
9
C
4
C
7
B
11
A
C
1 B
( 2 )
10/7/2015
TT
T2
10
A
B
B
C
A
B
C
A
B
C
A
B
C
1
0
0
1
0
0
0
1
0
0
2
0
0
0
0
0
0
0
0
0
3
0
0
0
0
0
1
0
1
0
4
0
1
0
1
0
0
0
0
0
5
0
0
0
0
0
0
0
0
0
6
0
0
0
0
0
0
0
0
0
7
0
0
1
0
0
0
1
0
0
8
0
0
1
0
0
0
1
0
0
9
0
2
1
2
0
0
1
0
0
10
0
0
0
0
0
1
0
2
0
11
0
0
0
0
0
1
0
1
1
12
0
0
1
0
0
0
2
0
0
Candidate Computation by Matching
Candidate Filtering
• Given: L lists 1 , 2 , … , 
• Find: Cliques of size L such that each clique has a
node from each list
• Start with size 1 cliques and grow them
•  is list of min size and has type 
• Prune 
– Prune the node if its typed neighbors cannot satisfy
the requirements of the query
– Prune the node if its typed neighbors do not have
enough shared neighbors
10/7/2015
Candidate Computation by Matching
Generating Candidates
Size 1 cliques: Elements of list 
Grow length-l cliques to length-l+1 cliques
Randomly choose next type t’
A node n of type t’ is added to length-l clique
if it is connected to all nodes in clique
• Length-l clique is pruned off if it cannot grow
• Algorithm terminates when l=L
•
•
•
•
10/7/2015
Outlier Score Computation
Scoring Attribute Value Pairs (1)
• Outlier score between values  and  should be
high if
0≤≤1
–
–
–
–
Values  and  co-occur rarely
Values  and  are individually frequent
∃ |co-occur freq( ,  ) >  ×freq( ) and  ∉ 
∃ |co-occur freq( ,  ) >  ×freq( ) and  ∉ 
• ComputationHindi
for individual values
China may be noisy
– Compute clusters for every attribute
• KMeans for numbers, time durations
Mandarin
• Category label
Indiafor categorical attributes
• Sets of strings:
create networkMongolian
and then partition (METIS)
Pakistan
Southern
10/7/2015
Outlier Score Computation
Scoring Attribute Value Pairs, Edges, Cliques
• Peakedness of Cluster Co-occurrence Curves
– ( ,  ) = max(0, 1 −  × (| | − 1))
• Outlier Score of an Association
Hindi Speaking

Countries
=1
•    =
Others
∈ (
Mongolian
Others
Nepal
Pakistan
India
China

=1 ( , )
 
•    =
10/7/2015
Languages inNon-Peaked
Latitude
Southern
1983
2
Mandarin
–
  × 
( ,  ) =
× Peaked
Hindi
Country
 ,
( , )+( , )
)
Baselines
• EBC (Entity Based Clique Outlier Detection)
– Cluster attributes and find outliers
–   =
∈ ()
• CBA (Community Based Association Outlier
Detection)
– Use community distributions as attributes
– Outlierness(edge)=KL-divergence between
community distributions of end points
10/7/2015
Synthetic Dataset Results
N
(%)
10000 2
5
10
20000 2
5
10
50000 2
5
10
#Attributes=4
ABC EBC
95.4 75.4
96.7 72.3
95.6
73
90.4 71.8
93.8
64
94.4 73.3
91.5 71.2
94.1 69.2
96.4 72.6
#Attributes=6
ABC EBC
97
71.2
97.6 72.4
97.8 72.8
95.9 73.8
94.8 71.4
97
74.5
96.2 73.3
97.5 73.2
97.7 73.7
#Attributes=10
ABC
EBC
95
69.1
95.2 69.7
95.7 73.2
97.3 68.8
95.2 75.2
95.6 73.6
94.8 70.8
95.2 67.7
95.4 75.6
N
(%)
10000 2
5
10
20000 2
5
10
50000 2
5
10
#Types = 5
•
•
•
Min support = 1%
ABC=Association Based Clique Outlier
Detection
EBC=Entity Based Clique Outlier Detection
10/7/2015
#Attributes=4
ABC EBC
86.6 75.8
93.7 77.6
93.4 73.8
96.9 72.7
97.3 75.1
97.4 74.8
90.3 69.5
92.9 68.8
90.8 79.3
#Attributes=6
ABC EBC
91.6 74.7
93
79.9
93.1 76.5
94.6
73
94.4 78.6
96.7 76.7
95.8 76.9
94.5 73.1
97.5
78
#Attributes=10
ABC
EBC
95.5
68
94.2
72
96
72.3
92.3 64.4
90.9 75.1
94.5 74.7
95.5 65.8
95.5 77.6
94.9 66.5
#Types = 10
•
•
Variances: 2% and 3% for ABC and EBC
resp
Average #matches: 2136, 4252 and 10621
for N=10000, 20000 and 50000 resp
Wikipedia Dataset Details
• Jan 2012 Wikipedia dump
• Entity e and e’ are connected if
– e’ appears on Wikipedia page of e and
– Both e and e’ have Wikipedia Infoboxes
• 10 entity types
• 45% of Wikipedia entities: 760K entities and 4.1M
edges
• 28 attributes per entity
• Min support=1%
10/7/2015
Case Studies
Query: 〈(film, country=“us”), (person, true), (settlement, true)〉
(film="the road to el dorado", person="hernan cortes", settlement="seville")
No.
Type1
Attribute1
Value1
Value2
screenplay
comarca
ted elliott, terry rossio
2 settlement subdivision_type3 person birth_place
comarca
Castile
es
ted elliott, terry rossio
comarca
1485
1 settlement subdivision_type3
Type2 Attribute2
film
3 settlement coordinates_region film
screenplay
4 settlement subdivision_type3 person death_date
5 settlement subdivision_type1
film
studio
autonomous community dreamworks animation, stardust pictures
Query: (company, country=“us"), (film, lang="english"), (person, birthplace=“us"), (tv, true)
(company="viacom", film="mission:impossible iii", person="tom cruise", tv="south park")
No.
Type1
1
film
Attribute1 Type2
writers company
Attribute2
divisions
Value1
Value2
alex kurtzman, roberto mtv networks, bet networks, paramount
orci, j. j. abrams
pictures corporation
2 television creator company #employees trey parker, matt stone
3 television #episodes company
divisions
223
4 television network company
divisions
comedy central
10900
mtv networks, bet networks, paramount
pictures corporation
mtv networks, bet networks, paramount
pictures corporation
1962
1971
5
person birth date company foundation
10/7/2015
Experiments
ABCOutlier Method
person
settlement
film
the road to el
dorado
hernan cortes
seville, spain
drums along
the mohawk adam helmer german flatts, ny, us
what the bleep
do we know!?
j. z. knight
yelm, wa, us
zacharias
atanarjuat
kunuk
igloolik, canada
stardust
stephen fry
norwich, england
#Nodes #Edges
10/7/2015
CBA Method
person
haruko
sugimura
film
red beard
passenger 57
settlement
tokyo, japan
wesley snipes orlando, fl, us
alfred
hitchcock
london, uk
psycho
skidoo
she’s gotta have it
pete the pup richmond, ny, us
joie lee
brooklyn, ny, us
#Types Index Size (MB) Time (sec)
10K
100K
5
0.1
1
10K
100K
10
0.4
1.5
20K
200K
5
0.2
1.8
20K
200K
10
0.7
2.3
50K
500K
5
0.5
4.1
50K
500K
10
1.8
5.5
760K
4.1M
10
22
96.7
Experiments
10/7/2015
Conclusions
• Motivated the idea of query-based outlier
detection for heterogeneous information
networks: ABCOutliers
• Proposed a methodology to compute
outlierness of a clique based on association
outlierness of the properties of nodes within
the clique
• Experiments on several synthetic datasets and
on Wikipedia entity network
10/7/2015
Thanks!
10/7/2015
Descargar

Document