Social Network Inspired Models
of NLP and Language Evolution
Monojit Choudhury (Microsoft Research India)
Animesh Mukherjee (IIT Kharagpur)
Niloy Ganguly (IIT Kharagpur)
What is a Social Network?
 Nodes: Social entities (people,
organization etc.)
 Edges: Interaction/relationship
between entities (Friendship,
collaboration, sex)
Courtesy: http://blogs.clickz.com
Social Network Inspired Computing
 Society and the nature of human interaction form a
Complex System
 Complex Network: a generic tool to model
complex systems
There is a growing body of work on Complex Network Theory (CNT),
applied to a variety of fields – Social, Biological,
Physical & Cognitive sciences, Engineering &
Technology
Language is a complex system
Objective of this Tutorial
 To show that SNIC (Soc. Net. Inspired Comp.) is
an emerging and promising technique
 To apply it to model Natural Languages
NLP, Quantitative Linguistics, Language Evolution,
Historical Linguistics, Language acquisition
 To familiarize the audience with tools and techniques in SNIC
 To compare it with other standard approaches to
NLP
Outline of the Tutorial
 Part I: Background
Introduction [25 min]
Network Analysis Techniques [25 min]
Network Synthesis Techniques [25 min]
 Break [3:20pm – 3:40pm]
 Part II: Case Studies
Self-organization of Sound Systems [20 min]
Modeling the Lexicon [20 min]
Unsupervised Labeling (Syntax & Semantics) [20 min]
 Conclusion and Discussions [20 min]
Complex System
 Non-trivial properties and patterns emerging
from the interaction of a large number of simple
entities
 Self-organization: The process through which
these patterns evolve without any external
intervention or central control
 Emergent Property or Emergent Behavior: The
pattern that emerges due to self-organization
The best example from nature
A termite "cathedral" mound
produced by a termite colony
Emergence of a networked life
[Figure: levels of organization – Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities]
Language – a complex system
 Language: medium for communication through
an arbitrary set of symbols
 Constantly evolving
 An outcome of self-organization at many levels
Neurons
Speakers and listeners
Phonemes, morphemes, words …
 80-20 Rule in every level of structure
Three Views of a System
Macroscopy: may not give a complete picture
or explanation of what goes on
Microscopy: may be too difficult to analyze or
simulate the macroscopic behavior from
Mesoscopy: a useful tradeoff between
the two
Language as a physical system
Microscopic: a collection of utterances by
individual speakers
Mesoscopic: an interaction between
phonemes, syllables, words, phrases
Macroscopic: A set of grammar rules with
a lexicon
Syntactic Network of Words
[Figure: a weighted graph over words such as color, sky,
weight, light, blue, blood, heavy and red, with edge weights
like 1, 20 and 100]
Complex Network Theory
Handy toolbox for modeling mesoscopy
Marriage of Graph theory and Statistics
Complex because:
Non-trivial topology
Difficult to specify completely
Usually large (in terms of nodes and edges)
Provides insight into the nature and
evolution of the system being modeled
Internet
Genetic interaction network
9-11 Terrorist Network
Social Network Analysis is a
mathematical methodology for
connecting the dots -- using
science to fight terrorism.
Connecting multiple pairs of
dots soon reveals an emergent
network of organization.
CNT Examples:
Road and Airlines Network
What Questions can be asked
 Do these networks display some
symmetry?
 Are these networks the creation of intelligent
agents, or have they emerged?
 How have these networks emerged?
What are the underlying simple rules leading
to their complex formation?
Bi-directional Approach
Analysis of the real-world networks
Global topological properties
Community structure
Node-level properties
Synthesis of the network by means of
some simple rules
Preferential attachment models
Small-world models ……..
Application of CNT in Linguistics - I
 Quantitative linguistics
Invariance and typology (Zipf’s law, syntactic
dependencies)
 Natural Language Processing
Unsupervised methods for text labeling (POS tagging,
NER, WSD, etc.)
Textual similarity (automatic evaluation, document
clustering)
Evolutionary Models (NER, multi-document
summarization)
Application of CNT in Linguistics - II
 Language Evolution
How did sound systems evolve?
Development of syntax
 Language Change
Innovation diffusion over social networks
Language as an evolving network
 Language Acquisition
Phonological acquisition
Evolution of the mental lexicon of the child
Linguistic Networks

Name              | Nodes        | Edges                                     | Why?
PhoNet            | Phonemes     | Co-occurrence likelihood in languages     | Evolution of sound systems
WordNet           | Words        | Ontological relation                      | Host of NLP applications
Syntactic Network | Words        | Similarity between syntactic contexts     | POS Tagging
Semantic Network  | Words, Names | Semantic relation                         | IR, Parsing, NER, WSD
Mental Lexicon    | Words        | Phonetic similarity and semantic relation | Cognitive modeling, Spell Checking
Tree-banks        | Words        | Syntactic dependency links                | Evolution of syntax
Word Cooccurrence | Words        | Co-occurrence                             | IR, WSD, LSA, …
Summarizing
 SNIC and CNT are emerging techniques for
modeling complex systems at mesoscopic level
 Applied to Physics, Biology, Sociology,
Economics, Logistics …
 Language - an ideal application domain for SNIC
 SNIC models in NLP, Quantitative linguistics,
language change, evolution and acquisition
Topological Characterization of Networks
Types Of Networks and Representation
Unipartite or Bipartite
Binary or Weighted
Undirected or Directed

Representation (example: the triangle over nodes a, b, c)

1. Adjacency Matrix
      a  b  c
   a  0  1  1
   b  1  0  1
   c  1  1  0

2. Adjacency List
   a : {b, c}
   b : {a, c}
   c : {a, b}
Properties of Adjacency Matrix
A = {aij}, where i and j are nodes and aij = 1 if
there is an edge between i and j.
A² = A·A; its entries count the paths of
length 2 between any two nodes: (A²)ij = Σk aik akj
In general, Aⁿ counts the paths of
length n
Trace(A) = Σi aii
How is the trace of A³ related to the
number of triangles in the n/w?
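A minimal numpy sketch of these properties, using the triangle a–b–c from the representation example (hint on the question above: each triangle contributes six closed walks of length 3 to the diagonal of A³):

import numpy as np

# Adjacency matrix of the triangle a-b-c
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

A2 = A @ A       # (A^2)[i, j] = number of length-2 paths from i to j
A3 = A2 @ A      # (A^n)[i, j] = number of length-n paths from i to j

# Each triangle is counted 6 times on the diagonal of A^3
# (3 starting nodes x 2 directions)
print(np.trace(A3) // 6)   # -> 1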
Characterization of Complex N/ws??
 They have a non-trivial topological structure
 Properties:
Heavy tail in the degree distribution (non-negligible
probability mass towards the tail; more than in the case of
an exp. distribution)
High clustering coefficient
Centrality Properties
Social Roles & Equivalence
Assortativity
Community Structure
Random Graphs & Small avg. path length
Preferential attachment
Small World Properties
Degree Distribution (DD)
 Let pk be the fraction of vertices in the network that have
degree k.
 The k versus pk plot is defined as the degree distribution
of a network
 For most real-world networks these distributions are
right-skewed, with a long right tail of values far above
the mean – pk varies as k^(−α)
 Due to noisy and insufficient data the definition is
sometimes slightly modified:
the cumulative degree distribution is plotted instead
 Probability that the degree of a node is greater than or
equal to k
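A sketch of both plots with networkx and matplotlib; the Barabási-Albert graph below is only a stand-in for real data:

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

G = nx.barabasi_albert_graph(10000, 3)        # stand-in network
degrees = np.array([d for _, d in G.degree()])

ks = np.arange(degrees.max() + 1)
pk = np.bincount(degrees) / len(degrees)      # fraction with degree k
Pk = pk[::-1].cumsum()[::-1]                  # P(degree >= k), smoother

plt.loglog(ks, pk, '.', label='p_k')
plt.loglog(ks, Pk, '-', label='P(deg >= k)')
plt.xlabel('k'); plt.legend(); plt.show()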
A Few Examples
Power law: Pk ~ k^(−α)
Friend of Friends
 Consider the following scenario
 Sourish and Ravi are friends
 Sourish and Shaunak are friends
 Are Shaunak and Ravi friends?
 If so then …
[Figure: triangle over Ravi, Sourish and Shaunak]
 This property is known as transitivity
Measuring Transitivity: Clustering Coefficient
 The clustering coefficient for a vertex ‘v’ in a network
is defined as the ratio between the total number of
connections among the neighbors of ‘v’ to the total
number of possible connections between the
neighbors
 High clustering coefficient means my friends know
each other with high probability – a typical property
of social networks
Mathematically…
 The clustering coefficient of a vertex i with n neighbors is
Ci = (# of links between the n neighbors) / (n(n−1)/2)
 The clustering coefficient of the whole network is
the average
C = (1/N) Σi Ci
 Alternatively,
C = (# triangles in the n/w) / (# triples in the n/w)
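A quick networkx sketch of both definitions; the built-in karate-club graph is just a convenient example:

import networkx as nx

G = nx.karate_club_graph()

# Per-vertex: links among neighbors / n(n-1)/2
Ci = nx.clustering(G, 0)

# Network average: C = (1/N) * sum_i C_i
C = nx.average_clustering(G)

# Triangle/triple based version (transitivity); generally
# differs slightly from the averaged C
T = nx.transitivity(G)
print(Ci, C, T)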
Network      | C        | Crand   | L        | N
WWW          | 0.1078   | 0.00023 | 3.1      | 153127
Internet     | 0.18–0.3 | 0.001   | 3.7–3.76 | 3015–6209
Actor        | 0.79     | 0.00027 | 3.65     | 225226
Coauthorship | 0.43     | 0.00018 | 5.9      | 52909
Metabolic    | 0.32     | 0.026   | 2.9      | 282
Foodweb      | 0.22     | 0.06    | 2.43     | 134
C. elegans   | 0.28     | 0.05    | 2.65     | 282
Centrality
 Centrality measures are commonly described as
indices of the 4 Ps – prestige, prominence,
importance, and power
Degree – count of immediate neighbors
Betweenness – nodes that form a bridge between two
regions of the n/w
CB(v) = Σ s≠v≠t σst(v) / σst
 where σst is the total number of shortest paths between s and
t, and σst(v) is the number of shortest paths from s to t
via v
Eigenvector centrality – Bonacich (1972)
It is not just how many people know me that
counts toward my popularity (or power), but how
many people know the people who know
me – this is recursive!
In the context of HIV transmission – a person
x with one sex partner is less prone to the
disease than a person y with multiple
partners
But imagine what happens if the partner of x
has multiple partners
This is the basic idea of eigenvector centrality
Definition
 Eigenvector centrality is defined as the principal
eigenvector of the adjacency matrix
 An eigenvector of a symmetric matrix A = {aij} is
any vector e such that Ae = λe, i.e., λei = Σj aij ej
where λ is a constant and ei is the centrality of the node i
 What does it imply – the centrality of a node is
proportional to the centrality of the nodes it is
connected to (recursively)…
 Practical Example: Google PageRank
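A sketch of the principal eigenvector computed by power iteration on a small, assumed example graph:

import numpy as np

# Symmetric adjacency matrix of a 4-node example graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Power iteration converges to the principal eigenvector,
# the fixed point of lambda * e_i = sum_j a_ij e_j
e = np.ones(A.shape[0])
for _ in range(100):
    e = A @ e
    e /= np.linalg.norm(e)   # renormalize each step

print(e)   # node 2, linked to every other node, scores highest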
Node Equivalence
Social Roles – Nodes (actors) in a social
n/w who have similar patterns of relations
(ties) with other nodes.
Three Different ways to find equivalence
classes:
Structural Equivalence
Automorphic Equivalence
Regular Equivalence
Structural Equivalence
Two nodes are said to be exactly
structurally equivalent if they have the
same relationships to all other nodes.
Computation:
Let A be the adjacency matrix.
Compute the Euclidean distance
/ Pearson correlation between a
pair of rows/columns representing
the neighbor profiles of two nodes
(say i and j). This value shows how
structurally similar i and j are.
Automorphic Equivalence
 The idea of automorphic equivalence is that sets
of actors can be equivalent by being embedded
in local structures that have the same patterns of
ties – "parallel" structures.
Swap (B, D) along with all their
neighbors: the distances among
all the actors in the graph would
remain exactly identical
Path vector of i: how many
nodes are at distance 1, 2, …
from node i
Amount of equivalence:
distance between path vectors
Regular Equivalence
Two nodes are said to be regularly
equivalent if they have the same profile of
ties with members of other sets of actors
that are also regularly equivalent.
Class I: 1 tie with Class II, no tie with Class III
Class II: 1 tie with Class I, 1/2 tie(s) with Class III
Class III: 1 tie with Class II, no tie with Class I
Assortativity (homophily)
The rich go with the rich (selective linking)
A famous actor (e.g., Shah Rukh Khan) would
prefer to pair up with some other famous actor
(e.g., Rani Mukherjee) in a movie rather than with
a newcomer to the film industry.
[Figure: an assortative vs. a disassortative scale-free network]
Measures of Assortativity
ANND (Average nearest neighbor degree)
Find the average degree of the neighbors of each node
i with degree k
Find the Pearson correlation (r) between the degree of i
and the average degree of its neighbors
For further reference see the supplementary material
Community structure
Community structure: a group of vertices that
have a high density of edges within them and a
low density of edges in between groups
Example:
•Friendship n/w of children
•Citation n/ws: research interest
•World Wide Web: subject matter
of pages
•Metabolic networks: Functional
units
•Linguistic n/ws: similar linguistic
categories
Some Examples
Community Structure in
Political Books
Community structure in a Social n/w of
Students (American High School)
Community Identification Algorithms
 Hierarchical
 Girvan-Newman
 Radicchi et al.
 Chinese Whispers
 Spectral Bisection
See (Newman 2004) for a comprehensive
survey (you will find the ref. in the
supplementary material)
Girvan-Newman Algorithm
Bisection Method
1. Calculate the betweenness for all edges in the
network.
2. Remove the edge with the highest
betweenness.
3. Recalculate betweennesses for all edges
affected by the removal.
4. Repeat from step 2 until no edges remain.
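networkx ships an implementation of this procedure; a minimal usage sketch on its built-in karate-club example:

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()

# Each step removes highest-betweenness edges until the number
# of components grows; iterating yields finer and finer partitions
splits = girvan_newman(G)
first = next(splits)              # the first bisection
print([sorted(c) for c in first])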
Evolution of Networks
Processes on Networks
Random Graphs & Small
Average Path Length
Q: What do we mean by a ‘random graph’?
A: Erdos-Renyi random graph model:
For every pair of nodes, draw an edge
between them with equal probability p.
Degrees of Separation in a Random Graph
Poisson distribution
• N nodes
• z neighbors per node, on average: z = <k>
• D degrees of separation
z^D ≈ N  ⇒  D ≈ log N / log z
P(k) ~ e^(−<k>) <k>^k / k!
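A quick check of D ≈ log N / log z on a generated Erdős-Rényi graph:

import math
import networkx as nx

N, p = 1000, 0.01                      # so z = p(N-1) ~ 10
G = nx.erdos_renyi_graph(N, p)
Gc = G.subgraph(max(nx.connected_components(G), key=len))

z = sum(d for _, d in Gc.degree()) / len(Gc)
print(nx.average_shortest_path_length(Gc))   # measured D
print(math.log(N) / math.log(z))             # predicted, ~3 here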
Degree Distributions
[Figure: airlines network; a Poisson distribution characterizes
an exponential network, a power-law distribution a scale-free
network]
Degree distributions for various networks
(a) World-Wide Web
(b) Coauthorship
networks: computer
science, high energy
physics, condensed
matter physics,
astrophysics
(c) Power grid of the
western United
States and Canada
(d) Social network of 43
Mormons in Utah
How do Power-law DDs arise?
Barabási-Albert Model of Preferential Attachment
(Rich gets Richer)
(1) GROWTH: Starting with a small number of nodes (m0), at
every timestep we add a new node with m (≤ m0) edges
(connected to the nodes already present in the system).
(2) PREFERENTIAL ATTACHMENT: The probability Π that
a new node will be connected to node i depends on the
connectivity ki of that node:
Π(ki) = ki / Σj kj
A.-L. Barabási, R. Albert, Science 286, 509 (1999)
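A compact sketch of the growth rule; keeping every edge endpoint in a repeated list makes node i appear ki times, so uniform sampling from it realizes Π(ki) = ki / Σj kj:

import random
import networkx as nx

def ba_graph(n, m):
    """Grow a BA network: each new node attaches m edges,
    preferring high-degree targets."""
    G = nx.complete_graph(m + 1)      # small seed of m0 = m+1 nodes
    targets = [v for v, d in G.degree() for _ in range(d)]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:        # m distinct targets
            chosen.add(random.choice(targets))
        for t in chosen:
            G.add_edge(new, t)
            targets.extend([new, t])  # update degree multiplicities
    return G

G = ba_graph(10000, 3)                # DD approaches P(k) ~ k^-3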
Mean Field Theory
[Figure: mean-field derivation of the model's degree distribution, P(k) ~ k^(−3)]
The World is Small!
 "Registration fees for IJCNLP 2008 are being
waived for all participants – get the refund from
the registration counter"
 How long do you think the above information will
take to spread among yourselves?
 Experiments say it will spread very fast – within 6
hops from the initiator it would reach everyone
 This is Milgram's famous six degrees of
separation
The Small World Effect
Even in very large social networks, the average distance
between nodes is usually quite short.
Milgram’s small world experiment:
 Target individual in Boston
 Initial senders in Omaha, Nebraska
 Each sender was asked to forward a packet to a friend
who was closer to the target
 Friends asked to do the same
Result: Average of ‘six degrees’ of separation.
S. Milgram, The small world problem, Psych. Today, 2 (1967), pp. 60-67.
Measure of Small-Worldness
 Low average geodesic path length
 High clustering coefficient
 Geodesic path – shortest path through the
network from one vertex to another
 Mean path length:
 ℓ = (2 / n(n+1)) Σ i≥j dij, where dij is the geodesic distance from
vertex i to vertex j
Most of the networks observed in the real world have ℓ ≤ 6
 Film actors          3.48
 Company Directors    4.60
 Emails               4.95
 Internet             3.33
 Electronic circuits  4.34
Clustering
C = Probability that two of a node’s neighbors
are themselves connected
In a random graph: Crand ~ 1/N (if the average
degree is held constant)
Watts-Strogatz ‘Small World’ Model
Watts and Strogatz introduced this simple model to
show how networks can have both short path lengths
and high clustering.
D. J. Watts and S. H. Strogatz, Collective dynamics of “small-world”
networks, Nature, 393 (1998), pp. 440–442.
Small-world model
 Used for modeling network transitivity
 Many networks assume some kind of
geographical proximity
 Small-world model:
Start with a low-dimensional regular lattice
Rewire:
Add/remove edges to create shortcuts to join
remote parts of the lattice
For each edge with prob p move the other end to a
random vertex
 Rewiring allows us to interpolate between a regular
lattice and a random graph
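A sketch sweeping the rewiring probability; networkx's connected variant of the Watts-Strogatz generator keeps the path-length measurement well defined:

import networkx as nx

# Ring lattice with k nearest neighbors, each edge rewired w.p. p
for p in [0.0, 0.01, 0.1, 1.0]:
    G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=p)
    print(p, round(nx.average_clustering(G), 3),
          round(nx.average_shortest_path_length(G), 2))
# A little rewiring already collapses the path length,
# while the clustering coefficient stays high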
Small-world model
 Regular lattice (p = 0):
Clustering coefficient C = (3k−3)/(4k−2) ≈ 3/4
Mean distance ~ L/4k
 Almost random graph (p = 1):
Clustering coefficient C ≈ 2k/L
Mean distance ~ log L / log k
 No power-law degree distribution
[Figure: clustering and mean distance as a function of the
rewiring probability p; degree distribution]
Resilience of Networks
 We consider the resilience of the
network to the removal of its vertices
(site percolation) or edges (bond
percolation).
 As vertices (or edges) are removed
from the network, the average path
length will increase.
 Ultimately, the giant component will
disintegrate.
 Networks vary according to their
level of resilience to vertex (or edge)
removal.
Stability Metric:
Percolation Threshold
[Figure: starting from a single connected component, after a
fraction f of the nodes is removed a giant component still
exists; once a critical fraction fc is removed, the entire graph
breaks into smaller fragments]
Therefore fc = 1 − qc becomes the percolation threshold
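A site-percolation sketch: remove a random fraction f of the nodes and track the relative size of the giant component (for an Erdős-Rényi graph with mean degree z, the collapse is expected near fc = 1 − 1/z):

import random
import networkx as nx

def giant_frac(G, f):
    """Relative giant-component size after deleting a random
    fraction f of the nodes (site percolation)."""
    H = G.copy()
    H.remove_nodes_from(random.sample(list(H.nodes()),
                                      int(f * len(H))))
    if len(H) == 0:
        return 0.0
    return max(len(c) for c in nx.connected_components(H)) / len(G)

G = nx.erdos_renyi_graph(10000, 0.0005)     # z = 5, so fc ~ 0.8
for f in [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]:
    print(f, round(giant_frac(G, f), 3))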
Ordinary Percolation on Lattices
Fill in each link (bond percolation) or site (site percolation)
with probability p and ask questions about the sizes of
connected components.
Percolation in Poisson and Scale free
networks
Exponential Network
Scale free Network
CASE STUDY I: Self-Organization
of the Sound Inventories
Human Communication
Human beings and many other living
organisms produce sound signals
Unlike other organisms, humans can
concatenate these sounds to produce new
messages – Language
Language is one of the primary
causes/effects of human intelligence
Human Speech Sounds
Human speech sounds are called
phonemes – the smallest units of a
language
Phonemes are characterized by certain
distinctive features:
I. Place of articulation
II. Manner of articulation
III. Phonation
[Figure: Mermelstein's Model]
Types of Phonemes
Consonants: e.g., /t/, /p/, /k/
Vowels: e.g., /i/, /a/, /u/
Diphthongs: e.g., /ai/
Choice of Phonemes
How does a language choose a set of
phonemes to build its sound
inventory?
Is the process arbitrary?
Certainly not!
What are the forces affecting this choice?
Forces of Choice
A Linguistic System – How does it look?
Speaker: desires "ease of articulation"
Listener / Learner: desires "perceptual contrast" / "ease of learnability"
The forces shaping the choice are opposing – hence there has to be a non-trivial solution
Vowels: A (Partially) Solved Mystery
Languages choose vowels based on
maximal perceptual contrast.
For instance, if a language has three
vowels, then in more than 95% of the
cases they are the maximally distinct
triple /a/, /i/ and /u/.
Consonants: A Jigsaw Puzzle
Research: From 1929 to Date
No single satisfactory explanation of the
organization of the consonant inventories
The set of features that characterize
consonants is much larger than that of vowels
No single force is sufficient to explain this
organization
Rather, a complex interplay of forces goes on in
shaping these inventories
Principle of Occurrence
 PlaNet – The "Phoneme-Language Network"
A bipartite network N = (VL, VC, E)
VL : Nodes representing languages of the world
VC : Nodes representing consonants
E : Set of edges which run between VL and VC
 There is an edge e Є E between two nodes
vl Є VL and vc Є VC if the consonant c occurs
in the language l.
[Figure: language nodes L1–L4 linked to consonant nodes
such as /ŋ/, /m/, /d/, /θ/, /s/ and /p/]
Choudhury et al. 2006 ACL
Mukherjee et al. 2007 Int. Jnl of Modern Physics C
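A toy version of the construction in networkx; the three inventories below are invented purely for illustration (the real PlaNet is built from UPSID):

import networkx as nx

# Hypothetical mini-inventories; 'N' stands in for /ng/
inventories = {
    'L1': ['p', 't', 'k', 'm'],
    'L2': ['p', 't', 'd', 'm', 'N'],
    'L3': ['t', 'k', 's'],
}

PlaNet = nx.Graph()
for lang, consonants in inventories.items():
    PlaNet.add_node(lang, side='language')
    for c in consonants:
        PlaNet.add_node(c, side='consonant')
        PlaNet.add_edge(lang, c)      # c occurs in lang

# Degree of a consonant node = number of languages using it
print(PlaNet.degree('t'))             # -> 3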
The Structure of PlaNet
Construction of PlaNet
Data Source : UCLA Phonological
Inventory Database (UPSID)
Number of nodes in VL is 317
Number of nodes in VC is 541
Number of edges in E is 7022
Degree Distribution of PlaNet
DD of the language nodes follows a β-distribution,
pk = beta(k) with α = 7.06 and β = 47.64:
pk = [Γ(54.7) / (Γ(7.06) Γ(47.64))] k^6.06 (1−k)^46.64
DD of the consonant nodes follows a
power-law with an exponential cut-off:
Pk = k^(−0.71); kmin = 5, kmax = 173, kavg = 21
[Figure: pk vs. language inventory size (degree k); log-log plot
of Pk vs. consonant degree k – the distribution of consonants
over languages follows a power-law with an exponential cut-off]
Synthesis of PlaNet
 Non-linear preferential attachment
 Iteratively construct the language inventories
given their inventory sizes
Pr(Ci) = (di^α + ε) / Σ x Є V* (dx^α + ε)
[Figure: language nodes L1–L4 attaching edges to consonant
nodes, shown after steps 3 and 4]
Simulation Result
[Figure: log-log degree distributions of PlaNet, PlaNetsyn
and PlaNetrand]
The parameters α and ε are 1.44 and 0.5 respectively.
The results are averaged over 100 runs.
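A sketch of the synthesis model: for each language, consonants are drawn without repetition with probability proportional to di^α + ε, and degrees are updated once the inventory is complete (one plausible reading of the model; details may differ from the paper):

import random

def synthesize(inventory_sizes, n_consonants, alpha=1.44, eps=0.5):
    """Fill each language's inventory by non-linear
    preferential attachment over consonant degrees."""
    degree = [0] * n_consonants
    inventories = []
    for size in inventory_sizes:          # size <= n_consonants
        weights = [degree[c] ** alpha + eps
                   for c in range(n_consonants)]
        total = sum(weights)
        chosen = set()
        while len(chosen) < size:         # roulette-wheel draw
            r, acc = random.uniform(0, total), 0.0
            for c, w in enumerate(weights):
                acc += w
                if acc >= r:
                    break
            chosen.add(c)
        for c in chosen:
            degree[c] += 1
        inventories.append(chosen)
    return inventories

# e.g., UPSID-like scale: synthesize([20] * 317, 541)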
Principle of Co-occurrence
 Consonants tend to co-occur in groups or
communities
 These groups tend to be organized around a few
distinctive features (based on: manner of
articulation, place of articulation & phonation) –
Principle of feature economy
If a language has the voiced plosives – bilabial /b/ and
dental /d/ – in its inventory, then it will also tend to have
their voiceless counterparts /p/ and /t/.
How to Capture these Co-occurrences?
 PhoNet – the "Phoneme-Phoneme Network"
A weighted network N = (VC, E)
VC : Nodes representing consonants
E : Set of edges which run between the nodes in VC
 There is an edge e Є E between two nodes vc1, vc2 Є VC
if the consonants c1 and c2 co-occur in a language. The
number of languages in which c1 and c2 co-occur defines
the edge-weight of e. The number of languages in which c1
occurs defines the node-weight of vc1.
[Figure: e.g., /k/ (node-weight 283) linked to /k′/, /kw/ and
/d′/, with edge-weights such as 50, 42, 39, 38, 17, 14 and 13]
Construction of PhoNet
 Data Source : UPSID
 Number of nodes in VC is 541
 Number of edges is 34012
[Figure: PhoNet]
Community Structures in PhoNet
 Radicchi et al. algorithm (for unweighted networks) –
Counts number of triangles that an edge is a part of.
Inter-community edges will have low count so remove
them.
 Modification for a weighted network like PhoNet
 Look for triangles where the weights on the edges are
comparable.
 If they are comparable, then the group of consonants
co-occurs frequently; otherwise it does not.
 Measure a strength S for each edge (u,v) in PhoNet,
where
S = wuv / √(Σ i Є VC−{u,v} (wui − wvi)²)
if √(Σ i Є VC−{u,v} (wui − wvi)²) > 0, else S = ∞
 Remove edges with S less than a threshold η
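A direct sketch of that strength computation on a weighted networkx graph (the edge attribute name 'weight' is an assumption of this sketch):

import math
import networkx as nx

def edge_strength(G, u, v):
    """S(u,v) = w_uv / sqrt(sum over other nodes i of
    (w_ui - w_vi)^2); infinite for identical profiles."""
    diff2 = 0.0
    for i in set(G.nodes()) - {u, v}:
        wui = G[u][i]['weight'] if G.has_edge(u, i) else 0.0
        wvi = G[v][i]['weight'] if G.has_edge(v, i) else 0.0
        diff2 += (wui - wvi) ** 2
    if diff2 == 0:
        return math.inf
    return G[u][v]['weight'] / math.sqrt(diff2)

def prune(G, eta):
    """Keep only edges whose strength reaches the threshold eta."""
    H = G.copy()
    H.remove_edges_from([(u, v) for u, v in G.edges()
                         if edge_strength(G, u, v) < eta])
    return H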
Community Formation
[Figure: a small weighted network; each edge is labeled with
its strength S (values from 0.06 to 11.11); thresholding at
η > 1 removes the weak edges and splits the nodes into
communities]
For different values of η we get
different sets of communities
Consonant Societies!
[Figure: communities obtained at η = 0.35, 0.60, 0.72 and 1.25]
That the communities are
good can be
quantitatively shown by
measuring the
feature entropy
Problems to ponder on …
Physical significance of PA:
Functional forces
Historical/Evolutionary process
Labeled synthesis of PlaNet and PhoNet
Language diversity vs. Preferential
attachment
CASE STUDY II: Modeling the
Mental Lexicon
Mental Lexicon (ML) – Basics
It refers to the repository of the word forms
that resides in the human brain
Two Questions:
How are words stored in the long-term memory,
i.e., what is the organization of the ML?
How are words retrieved from the ML (lexical
access)?
The above questions are highly inter-related – to
predict the organization one can investigate how
words are retrieved, and vice versa.
Different Possible Ways of Organization
Un-organized (a bag full of words) or,
Organized
By sound (phonological similarity)
E.g., start the same: banana, bear, bean …
End the same: look, took, book …
Number of phonological segments they share
By Meaning (semantic similarity)
Banana, apple, pear, orange …
By age at which the word is acquired
By frequency of usage
By POS
Orthographically
The Hierarchical Model of ML
Proposed by Collins and Quillian in 1969
Concepts are organized in a hierarchy
Taxonomic and attributive relations are
represented
[Figure: Animal (reproduces, eats) branches into Mammal
(warm-blooded, has mammary glands) and Fish (cold-blooded,
has gills)]
Cognitive Economy: Put the attributes at the
highest of all appropriate levels – e.g.,
'reproduces' applies to the whole animal
kingdom
Hierarchical Model
According to the principle of cognitive
economy
Animals eat < mammals eat < humans eat
However, shark is a fish = salmon is a fish
 What do < and = mean?
< : Less time to judge
= : Equal time to judge
Spreading Activation Model of ML
Not a hierarchical structure but a web of
inter-connected nodes (first proposed by
Collins and Loftus in 1975)
Distance between nodes is determined by
the structural characteristics of the
word-forms (by sound, by meaning, by age, by …)
Combining the above two: a plethora of
complex networks
Phonological Neighborhood Network
(Vitevitch 2004)
(Gruenenfelder & Pisoni, 2005)
(Kapatsinski 2006)
Sound Similarity Relations in the Mental
Lexicon: Modeling the Lexicon as a
Complex Network
N/W Definition
Nodes: Words
Edge: An edge is drawn from node A to
node B if at least 2/3 of the segments that
occur in the word represented by A also
occur in the word represented by B
i.e., if the word represented by A is 6 segments
long, then one can derive all its neighbors (B)
from it by two phoneme changes (insertions,
deletions or substitutions).
N/W Construction
Database:
Hoosier Mental Lexicon (Nusbaum et al., 1984)
 phonologically transcribed words → n/w using the
metric defined earlier
Nodes with no links correspond to hermit words, i.e.,
words that have no neighbors
Random networks (E-R) for comparison
Directed n/w → a long word can have a
short word as a neighbor, but not vice versa
Have a link only if the duration of the difference
in the word pair ≤ (duration of a word)/3 (the
factor 1/3 is experimentally derived … see the
paper for further info.)
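A toy sketch of this construction, using Levenshtein distance over phoneme strings as the "2/3 of the segments" criterion (the transcriptions below are invented single-character phoneme codes, not the Hoosier ones):

import networkx as nx
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance over phoneme sequences."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ca != cb))
    return d[-1]

words = ['k&t', 'b&t', 'k&st', 'st&mp']   # hypothetical codes

G = nx.DiGraph()
G.add_nodes_from(words)
for a, b in permutations(words, 2):
    # B is a neighbor of A if reachable within len(A)/3 changes
    if edit_distance(a, b) <= len(a) / 3:
        G.add_edge(a, b)
print(list(G.edges()))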
Neighborhood Density
The node whose neighbors are searched
→ the base word
The neighborhood density of a base word is
expressed as the out-degree of the node
representing the base word
It is an estimate of the number of words
activated when the base word is presented
→ spreading activation
Something like semantic priming (but at
the phonological level)
Results of the N/W Analysis
 Small-world Properties
 High clustering but also long average path
length – like a SW network, the lexicon has
densely connected neighborhoods, but the links
between two nodes of different neighborhoods are
harder to find than in SW networks
Visualization – A Disconnected
Graph with a Giant Component (GC)
GC is elongated – there are some nodes that have
really long chain of intermediates and hence the
mean path length is long
Low Degree Nodes are Important!!!
Removal of low-degree nodes renders the
n/w almost disconnected
A bottleneck is formed between longer
(more than 7 segments long) and shorter
words
This bottleneck consists of the '-tion'-final words:
coalition, passion, nation, fixation/fission – they
form short-cuts between the high-degree nodes
(i.e., they are low-degree stars that connect
mega-neighborhoods)
Removal of Nodes with Degree ≤ 40
Removal of low-degree nodes
disconnects the n/w,
as opposed to the
removal of hubs like
"pastor" (deg. = 112)
Why low connectivity
between neighborhoods?
 Spreading activation should not inhibit
neighbors of the stimulus' neighbors that are
non-neighbors of the stimulus itself (and are
therefore not similar to the stimulus)
 Low mean path → complete traversal of n/ws,
e.g., in general-purpose search
 Search in the lexicon does not need to traverse links
between distant nodes; rather, it involves the
activation of a structured neighborhood that
shares a single sub-lexical chunk that could be
acoustically related during word-recognition
(Marslen-Wilson, 1990).
Degree Distribution (DD)
Exponential rather than power-law
Other Works (see supplementary
material for references)
 Vitevitch (2005)
Similar to the above work, but builds n/ws of nodes that
are just one segment apart
 Choudhury et al. (2007)
 Builds weighted n/ws of Hindi, Bengali and English
based on orthographic proximity (nodes: words; edges:
orthographic edit-distance) – SpellNet
Thresholds (θ) the n/ws to make them binary (at θ = 1,
3, 5)
They also obtain exponential DDs
Observe that the occurrence of real-word errors in a
language is proportional to the average weighted degree
of the SpellNet of that language
Other Works
 Sigman et al. (2002)
Analyzes the English WordNet
All semantic relationships are scale-invariant
Inclusion of polysemy makes the n/w SW
 Ferrer i Cancho et al. (2000, 2001)
Word co-occurrence (within a sentence) based definitions
of the lexicon
Lexicon = Kernel Lexicon + Peripheral Lexicon
Finds a 2-regime DD … one regime comprises words in the
kernel lexicon and the other words in the peripheral
lexicon
Finds that these n/ws are small-world
Some Unsolved Mysteries –
You can Give it a Try
What can be a model for the evolution of
the ML?
How is the ML acquired by a child learner?
Is there a single optimal structure for the
ML, or is it organized based on multiple
criteria (i.e., a combination of the different
n/ws)? – Towards a single framework for
studying the ML!!!
CASE STUDY III: Syntax
Unsupervised POS Tagging
Labeling of Text
 Lexical Category (POS tags)
 Syntactic Category (Phrases, chunks)
 Semantic Role (Agent, theme, …)
 Sense
 Domain dependent labeling (genes, proteins, …)
How to define the set of labels?
How to (learn to) predict them automatically?
“Nothing makes sense, unless in context”
Distribution-based definition of:
Lexical category – e.g., "The X is …", "If you X then I shall …"
Sense (meaning) – e.g., "… looking at the star …"
General Approach
 Represent the context
of a word (token)
 Define some notion of
similarity between the
contexts
 Cluster the contexts
of the tokens
 Get the label of the
tokens
[Figure: tokens w1 w2 w3 w4 … grouped into clusters]
Issues
How to define the context?
How to define similarity?
How to cluster?
How to evaluate?
Unsupervised Parts-of-Speech Tagging
Employing Efficient Graph Clustering
Chris Biemann
COLING-ACL 2006
Stages
 Input: raw text corpus
 Identify feature words and define a graph for
high- and medium-frequency words (10000)
 Cluster the graph to identify the word classes
 For low-frequency words, use context similarity
 Lexicon of word classes → tag the same text →
learn a Viterbi tagger
Feature Words
 Estimate the unigram
frequencies
 Feature words: the 200
most frequent words
Feature Vector
Example: "From the familiar to the exotic, the collection is a delight"
The context of a token is encoded over the positions p−2, p−1,
p+1, p+2 and the 200 feature words (fw1 = the, fw2 = to, …,
fw199 = is, fw200 = from)
[Table: binary feature vector – a 1 marks a feature word
occurring at that context position]
Syntactic Network of Words
[Figure: graph over words (color, sky, weight, light, blue,
blood, heavy, red); the weight of the edge between two words
is 1 − cos(their feature vectors), e.g., 1 − cos(red, blue)]
The Chinese Whispers Algorithm
[Figure: the word graph with edge weights such as 0.9, 0.8,
0.7, 0.5 and −0.5; class labels spread from node to node over
successive iterations]
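A sketch of Chinese Whispers itself: every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbors (the karate-club graph stands in for the word graph):

import random
from collections import defaultdict
import networkx as nx

def chinese_whispers(G, iterations=20):
    """Label propagation: each node takes the strongest
    class among its neighbors, in random order."""
    label = {v: v for v in G.nodes()}
    nodes = list(G.nodes())
    for _ in range(iterations):
        random.shuffle(nodes)
        for v in nodes:
            scores = defaultdict(float)
            for u in G.neighbors(v):
                scores[label[u]] += G[v][u].get('weight', 1.0)
            if scores:
                label[v] = max(scores, key=scores.get)
    return label

G = nx.karate_club_graph()
print(set(chinese_whispers(G).values()))   # emergent classes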
Medium and Low Frequency Words
 Neighboring (window 4) co-occurrences ranked
by log-likelihood thresholded by θ
 Two words are connected iff they share at least
4 neighbors
Language | Nodes  | Edges
English  | 52857  | 691241
Finnish  | 85627  | 702349
German   | 137951 | 1493571
Construction of Lexicon
Each word assigned a unique tag based
on the word class it belongs to
Class 1: sky, color, blood, weight
Class 2: red, blue, light, heavy
Ambiguous words:
High and medium frequency words that formed
singleton cluster
Possible tags of neighboring clusters
Training and Evaluation
Unsupervised training of a trigram HMM
using the clusters and lexicon
Evaluation:
Tag a text for which a gold standard is available
Estimate the conditional entropy H(T|C) and the
related perplexity 2^H(T|C)
Final Results:
English – 2.05 (619/345), Finnish – 3.22
(625/466), German – 1.79 (781/440)
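A sketch of that evaluation measure over aligned gold tags T and induced cluster labels C (toy data; lower perplexity means the clusters predict the tags better):

from collections import Counter
from math import log2

def conditional_entropy(tags, clusters):
    """H(T|C) = -sum p(c,t) log2 p(t|c); perplexity is 2**H."""
    joint = Counter(zip(clusters, tags))
    cmarg = Counter(clusters)
    n = len(tags)
    return -sum((f / n) * log2(f / cmarg[c])
                for (c, t), f in joint.items())

tags     = ['N', 'V', 'N', 'D', 'N', 'V']   # toy gold tags
clusters = [ 1 ,  2 ,  1 ,  3 ,  1 ,  3 ]   # toy induced classes
print(2 ** conditional_entropy(tags, clusters))   # ~1.26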
Word Sense Disambiguation
 Véronis, J. 2004. HyperLex: lexical cartography
for information retrieval. Computer Speech &
Language 18(3):223-252.
 Let the word to be disambiguated be “light”
 Select a subcorpus of paragraphs which have at
least one occurrence of “light”
 Construct the word co-occurrence graph
HyperLex
Sample contexts for "light":
"A beam of white light is dispersed
into its component colors by its
passage through a prism."
"Energy efficient light fixtures
including solar lights, night lights,
energy star lighting, ceiling lighting,
wall lighting, lamps"
"What enables us to see the light
and experience such wonderful
shades of colors during the course
of our everyday lives?"
[Figure: co-occurrence graph over beam, prism, dispersed,
white, colors, shades, energy, efficient, fixtures, lamps]
Hub Detection and MST
[Figure: hubs of the co-occurrence graph of "light" (e.g.,
beam, lamps) and the minimum spanning tree rooted at
"light"; one subtree gathers prism, dispersed, white, colors,
shades; the other gathers energy, efficient, fixtures, lamps]
"White fluorescent lights consume less
energy than incandescent lamps"
Other Related Works
 Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005.
Unsupervised learning of natural languages. PNAS, 102
(33): 11629-11634
 Ferrer i Cancho, R. 2007. Why do syntactic links not
cross? Europhysics Letters
 Also applied to: IR, Summarization, sentiment
detection and categorization, script evaluation,
author detection, …
Discussions & Conclusions
What we learnt
Advantages of SNIC in NLP
Comparison to standard techniques
Open problems
Concluding remarks and Q&A
What we learnt
 What is SNIC and Complex Networks
 Analytical tools for SNIC
 Applications to human languages
 Three Case-studies:

    | Area               | Perspective                                       | Technique
I   | Sound systems      | Language evolution and change                     | Synthesis models
II  | Lexicon            | Psycholinguistic modeling and linguistic typology | Topology and search
III | Syntax & Semantics | Applications to NLP                               | Clustering
What we saw
 Language features complex structure at every
level of organization
 Linguistic networks have non-trivial properties:
scale-free & small-world
 Therefore, Language and Engineering systems
involving language should be studied within the
framework of complex systems, esp. CNT
Advantages of SNIC
 Fully Unsupervised techniques:
No labeled data required: a good solution to
resource scarcity
Problem of evaluation: circumvented by
semi-supervised techniques
 Ease of computation:
Simple and scalable
Distributed and parallel computable
 Holistic treatment:
 Language evolution & psycho-linguistic theories
Comparison to Standard Techniques
Rule-based vs. Statistical NLP
Graphical Models
Generative models in machine learning:
HMM, CRF, Bayesian belief networks
[Figure: a graphical model over POS states JJ, NN, RB, VF]
Graphical Models vs. SNIC
GRAPHICAL MODEL
 Principled: based on
Bayesian Theory
 Structure is assumed and
parameters are learnt
 Focus: Decoding &
parameter estimation
 Data-driven or
computationally intensive
 The generative process is
easy to visualize, but no
visualization of the data
COMPLEX NETWORK
 Heuristic, but underlying
principles of linear algebra
 Structure is discovered
and studied
 Focus: Topology and
evolutionary dynamics
 Unsupervised and
computationally easy
 Easy visualization of the
data
Language Modeling
 A network of words as a model of language vs.
n-gram models
 Hierarchical, hyper-graph based models
 Smoothing through holistic analysis of the
network topology
Jedynak, B. and Karakos, D. 2007. Unigram Language
Models using Diffusion Smoothing over Graphs. Proc. of
TextGraphs - 2
Open Problems
 Universals and variables of linguistic networks
 Superimposition of networks: phonetic, syntactic,
semantic
 Which clustering algorithm for which topology?
 Metrics for network comparison – important for
language modeling
 Unsupervised dependency parsing using
networks
 Mining translation equivalents
Resources
 Conferences
TextGraphs, Sunbelt, EvoLang, ECCS
 Journals
PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS,
Complexity, Social Networks
 Tools
Pajek, C#UNG,
http://www.insna.org/INSNA/soft_inf.html
 Online Resources
Bibliographies, courses on CNT
Contact
 Monojit Choudhury
[email protected]
http://www.cel.iitkgp.ernet.in/~monojit/
 Animesh Mukherjee
[email protected]
http://www.cel.iitkgp.ernet.in/~animesh/
 Niloy Ganguly
[email protected]
http://www.facweb.iitkgp.ernet.in/~niloy/
Thank you!!