Term associations:
from ontologies to text
mining and back
Irena Spasić
[email protected]
http://www.cbr-masterclass.org/
Combining the strengths of UMIST and
The Victoria University of Manchester
Outline
• introduction
• resources
• term similarity approaches
– ontology-based
– internal
– contextual
• term relations
• SOLD: Syntactic + Ontology-driven + Lexical Distance
• MaSTerClass: Machine Supported Term Classification
• conclusion
Introduction
Biomedical literature
• text is the predominant medium for information
exchange among experts
• literature
– the primary source of information
– most up to date
– easily shared
Problems
• major bottleneck: information overload
• MEDLINE®:
–  13M references
–  2K added daily
– > 571K added during 2004
• no human could ever hope to manage such huge
amounts of text
• unstructured format unsuitable for complex searching
• terminological variability and ambiguity: “biologists
would rather share a toothbrush than a gene’s name”
(Prof. David Botstein)
• mapping to databases and ontologies
A solution
• text mining:
– information retrieval: gather and filter relevant documents
– information extraction: select facts of interest
– data mining: discover unsuspected associations between the
known facts
• text mining can aid biomedical experts by
automatically:
– distilling information
– extracting facts
– discovering implicit links
– generating hypotheses
Text processing
raw (unstructured) text → tokenization → lexical processing → syntactic processing → semantic processing → annotated (structured) text
Text processing: tokenization
raw text:
... 5 alpha-dihydrotestosterone inhibited [3H]R1881 binding to the androgen receptor in kidney ...
tokens:
| 5 | alpha-dihydrotestosterone | inhibited | [3H]R1881 | binding | to | the | androgen | receptor | in | kidney |
Text processing: lexical processing
tagged tokens:
<NUM>5</NUM>
<N>alpha-dihydrotestosterone</N>
<V>inhibited</V> <N>[3H]R1881</N>
<V>binding</V> <PREP>to</PREP>
<DET>the</DET> <N>androgen</N>
<N>receptor</N>
<PREP>in</PREP> <N>kidney</N>
Text processing: syntactic processing
[Figure: parse tree of the tagged sentence, grouping its tokens into NP, PP and VP constituents]
5 alpha-dihydrotestosterone inhibited [3H]R1881 binding to the androgen receptor in kidney
Text processing: semantic processing
<HORMONE>5 alpha-dihydrotestosterone</HORMONE>
inhibited <LIGAND>[3H]R1881</LIGAND>
binding to the <RECEPTOR>androgen receptor</RECEPTOR>
in <ORGAN>kidney</ORGAN>
[Figure: event structure of the sentence; node labels: event, inhibit, biologically active substance, bind, [3H]R1881, androgen receptor]
Ontologies and text mining
Terms
[Figure: terms in text (term_1 to term_4) linked to concepts in an ontology (concept_1 to concept_3), illustrating term variation, term ambiguity, term relations and conceptual relations]
• term - textual realization of a specialized concept, e.g. gene, protein, disease, etc.
• the principal link between text and an ontology
• aims to map concepts to terms
Term-based text mining
• terms are the essential means of scientific discourse:
– to access and integrate biomedical information, biomedical
concepts and their relations need to be recognised
– no possibility to communicate biomedical knowledge without
recognition and association of biomedical terms
• basic TM tasks:
– ATR / NER
– term association:
• similarities
• clustering
• relations
• classification
• used to support more advanced TM tasks: IE, IR, DM
Knowledge integration
(Hoffmann and Valencia, 2004)
Resources
Corpora
• MEDLINE
– 3 million biomedical abstracts
– http://www.nlm.nih.gov/pubs/factsheets/medline.html
• GENIA
– 2000 manually annotated MEDLINE abstracts
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
• BioMed Central
– 10,406 peer-reviewed biomedical articles (full text)
– http://www.biomedcentral.com/info/about/datamining/
Biomedical ontologies
Name    URL
UMLS    http://www.nlm.nih.gov/research/umls/
SNOMED  http://www.snomed.org/snomedct
GENIA   http://www-tsujii.is.s.u-tokyo.ac.jp/~genia
GALEN   http://www.opengalen.org/about.html
TaO     http://imgproj.cs.man.ac.uk/tambis
GO      http://www.geneontology.org/
• OBO (Open Biomedical Ontologies)
• http://obo.sourceforge.net/
Term similarity approaches
Term similarities
• similarity measures
• choice of features:
– domain specific → ontology
– linguistic → text
• ontology-based similarity
• textual similarity
– internal features
– contextual features
• combined approaches
Similarity measures
• regardless of the types of features used, three general
similarity measures have been commonly used to
compare terms
• Dice
• Jaccard
• cosine
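The three measures can be sketched over term feature sets as follows. This is a minimal illustration, assuming binary features stored as Python sets; the feature values are made up, and cosine is given in its set form, |X ∩ Y| / sqrt(|X| · |Y|).

```python
import math


def dice(x: set, y: set) -> float:
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: set, y: set) -> float:
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y|."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def cosine(x: set, y: set) -> float:
    """Cosine for binary features: |X ∩ Y| / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


# Hypothetical feature sets for two terms (assumption, for illustration only).
a = {"receptor", "nuclear", "hormone"}
b = {"receptor", "nuclear", "orphan"}
print(dice(a, b), jaccard(a, b), cosine(a, b))
```

All three return a value in [0, 1]; they differ only in how the size of the overlap is normalised.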
Ontology-based similarity
Term comparison
• two terms should match if they are:
– identified as variants
– siblings in the is-a hierarchy
– in the is-a or part-whole relation
• the distance between the corresponding nodes in the
ontology should be transformed into the matching score
Term comparison – dealing with variation
<tok><sur>vitamin A</sur>
<lem cat="term">vitamin A</lem></tok>
<tok><sur>A vitamin</sur>
<lem cat="term">vitamin A</lem></tok>
<tok><sur>vitamin-A</sur>
<lem cat="term">vitamin A</lem></tok>
<tok><sur>retinol</sur>
<lem cat="term">vitamin A</lem></tok>
Term comparison – dealing with classes
<tok>
<sur>retinol</sur>
<lem cat="term">vitamin A</lem>
</tok>
→ Vitamin
<tok>
<sur>ascorbic acid</sur>
<lem cat="term">vitamin C</lem>
</tok>
→ Vitamin
terms: similar
classes: identical
Term comparison – dealing with hierarchies
<tok>
<sur>insulin</sur>
<lem cat="term">insulin</lem>
</tok>
→ Hormone → Biologically Active Substance
<tok>
<sur>glycosidase</sur>
<lem cat="term">glycosidase</lem>
</tok>
→ Enzyme → Biologically Active Substance
terms: similar
Term comparison – dealing with hierarchies
• ontologies are typically organised in a hierarchy using
the is-a relation
• this property can be used to quantify the similarity
between the concepts, and, implicitly, semantic
similarity between the terms used to designate these
concepts
• numerical information that can be inferred from an
ontology on top of the symbolic information it explicitly
stores
Term comparison – dealing with hierarchies
• Dice’s coefficient
• Wu & Palmer
• Resnik
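As an illustration of a hierarchy-based measure, the Wu & Palmer score can be sketched as 2 · depth(lcs) / (depth(c1) + depth(c2)), where lcs is the lowest common subsumer of the two concepts. The is-a hierarchy below is a hypothetical toy built from the examples in these slides, not an excerpt of any real ontology, and depth is counted in nodes so that the root has depth 1. Resnik's measure additionally needs corpus-derived concept probabilities and is not shown.

```python
# Hypothetical toy is-a hierarchy (child -> parent); an assumption for illustration.
PARENT = {
    "hormone": "biologically active substance",
    "enzyme": "biologically active substance",
    "vitamin": "biologically active substance",
    "insulin": "hormone",
    "glycosidase": "enzyme",
    "vitamin A": "vitamin",
    "vitamin C": "vitamin",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(c1, c2):
    """Deepest node that is an ancestor of (or equal to) both concepts."""
    ancestors = set(path_to_root(c1))
    for node in path_to_root(c2):  # walks upward, so the first hit is the deepest
        if node in ancestors:
            return node
    return None


def wu_palmer(c1, c2):
    lcs = lowest_common_subsumer(c1, c2)
    if lcs is None:
        return 0.0
    return 2 * depth(lcs) / (depth(c1) + depth(c2))


print(wu_palmer("vitamin A", "vitamin C"))    # siblings under "vitamin"
print(wu_palmer("insulin", "glycosidase"))    # related only through the root
```

Siblings score higher than terms that meet only at the root, which matches the intuition behind the class and hierarchy examples.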
Internal term similarity
Term comparison – dealing with incomplete information
<tok>
<sur>retinol</sur>
<lem cat="term">vitamin A</lem>
</tok>
→ Vitamin
<tok>
<sur>vitamin C</sur>
<lem cat="term">vitamin C</lem>
</tok>
→ ?
terms: similar (lex. similar)
classes: ?
Edit distance
• edit distance (ED) – the minimal number (or cost) of
changes needed to transform one string into the other
• edit operations:
insertion:     ...a-c... → ...abc...
deletion:      ...abc... → ...a-c...
replacement:   ...abc... → ...adc...
transposition: ...abc... → ...acb...
• easily calculated: dynamic programming approach
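The dynamic programming computation can be sketched as below. This is a minimal version assuming unit cost for every operation and allowing transposition of adjacent characters only (the variant usually called optimal string alignment); a cost-weighted version would replace the constants 1 with operation-specific costs.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, replacement and
    adjacent transposition, each at unit cost (an assumption)."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or replacement
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]


print(edit_distance("vitamin A", "vitamin-A"))
print(edit_distance("vitamin A", "A vitamin"))
```

The table has (m + 1) × (n + 1) cells, so the running time is proportional to the product of the two string lengths.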
Edit distance
• examples:
ED( vitamin A, vitamin–A) = 1 (1 replacement)
ED( vitamin A, vitamin C) = 1 (1 replacement)
ED( vitamin A, A vitamin) = 4 (2 insertions, 2 deletions)
Edit distance
• strings are usually treated as sequences of characters
• problem: word permutations
ED( stone in kidney, stone in bladder) = 5
ED( stone in kidney, kidney stone) = 15
• sometimes it is more useful to treat strings as
sequences of words, where each word is treated as a
sequence of characters
Word edit distance
• pairing up their words so as to minimize their ED
• word edit distance (WED) - the sum of edit distances
resulting from the minimal matching
WED( stone in kidney, kidney stone) = 2
• WED(x, y) ≤ ED(x, y) is not true in general
• WED(x, y) << ED(x, y) applies to the majority of word
permutations
• WED(x, y) ≈ ED(x, y) when no word permutations apply
• useful for applications where lexical similarities are
examined
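A minimal sketch of the word edit distance follows. It assumes that words are separated by whitespace, that an unmatched word is paired with the empty string (and so costs its full length), and that the minimal matching is found by brute force over all pairings, which is only practical for short terms. The character-level distance here is plain unit-cost Levenshtein.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or replacement
        prev = curr
    return prev[-1]


def word_edit_distance(x: str, y: str) -> int:
    """Sum of character-level distances under the cheapest word pairing."""
    xs, ys = x.split(), y.split()
    # Pad the shorter word list with empty strings for unmatched words.
    n = max(len(xs), len(ys))
    xs += [""] * (n - len(xs))
    ys += [""] * (n - len(ys))
    # Brute force over all pairings: factorial in the number of words.
    return min(
        sum(edit_distance(a, b) for a, b in zip(xs, pairing))
        for pairing in permutations(ys)
    )


print(word_edit_distance("stone in kidney", "kidney stone"))
```

For longer terms the brute-force search could be replaced by an assignment algorithm; that choice does not change the definition.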
Lexical and syntactic information
• terms sharing a head are likely to be hyponyms of the
same term
– progesterone receptor & estrogen receptor
• a term derived by modifying another term is likely to
be its hyponym
– nuclear receptor & orphan nuclear receptor
• differentiating between heads and modifiers when
assessing lexical similarity:
External lexical resources
• biomedical terms are often imaginative, but can
sometimes be descriptive
• sevenless - a mutated gene that caused blindness in
fruit flies by deleting "cell seven" from the eye
• other related genes:
– bride of sevenless
– daughter of sevenless
– son of sevenless
• general-purpose lexical resources (e.g. WordNet) can
be used to estimate semantic similarity between term
constituents, and consequently between the terms they belong to
Contextual term similarity
Term context
• issues:
– information about individual context elements
– scope of the context
– structure of the context
• relevant information about context elements:
– syntactic category
– terminological status
– position relative to the term
– syntactic relation between a context element and the term
– semantic properties
– semantic relation between a context element and the term
– statistical information about a context element
– etc.
Lexical and syntactic information
• a lexico-syntactic pattern:
. . . Term (, Term)* [,] and other Term . . .
• the leading Terms are supposed to be the hyponyms of
the last Term
... antiandrogens, hydroxyflutamide, bicalutamide,
cyproterone acetate, RU58841, and other compounds ...
• candidate instances of the hyponymy relation:
hyponym( antiandrogens, compound )
hyponym( hydroxyflutamide, compound )
hyponym( bicalutamide, compound )
hyponym( cyproterone acetate, compound )
hyponym( RU58841, compound )
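The "and other" pattern can be approximated with a regular expression. The sketch below is an assumption-laden simplification: it works on raw text rather than on recognised terms, treats everything before "and other" as a comma-separated list of hyponyms, takes the text up to the next punctuation mark as the hypernym, and does no normalisation (so it yields "compounds" rather than "compound").

```python
import re

# Raw-text approximation of: Term (, Term)* [,] and other Term
AND_OTHER = re.compile(
    r"(?P<hyponyms>.+?),?\s+and other\s+(?P<hypernym>[^,.;]+)"
)


def extract_hyponymy(text: str):
    """Return candidate (hyponym, hypernym) pairs found in text."""
    match = AND_OTHER.search(text)
    if match is None:
        return []
    hypernym = match.group("hypernym").strip()
    hyponyms = [t.strip() for t in match.group("hyponyms").split(",") if t.strip()]
    return [(hyponym, hypernym) for hyponym in hyponyms]


text = ("antiandrogens, hydroxyflutamide, bicalutamide, "
        "cyproterone acetate, RU58841, and other compounds")
for pair in extract_hyponymy(text):
    print("hyponym( %s, %s )" % pair)
```

A production system would run the pattern over term-annotated text, so that multi-word terms and sentence context before the list are handled correctly.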
Hyponymy patterns
Pattern: Term such as Term (, Term)* [,] (and | or) Term
Example: aromatic hydrocarbons such as 2,3,7,8-tetrachlorodibenzo-p-dioxin and 3-methylcholanthrene
Pattern: Term (, Term)* [,] (and | or) other Term
Example: Bacillus and other Gram-positive bacteria
Pattern: Term, (especially | including) Term (, Term)* [,] (and | or) Term
Example: fat soluble vitamins, especially vitamins A and carotenoid
Parallel patterns
Pattern: both Term and Term
Example: DR7/DR5 motif binds both RAR*RXR heterodimers and RXR homodimers
Pattern: either Term or Term
Example: ligand activation of either RAR.RXR heterodimers or RXR homodimers
Pattern: neither Term nor Term
Example: neither RXR homodimers nor RXR/RAR heterodimers are able to substitute for LXR alpha
Statistical methods
• mutual information measures the dependence between
the observed terms
• Tanimoto’s coefficient can be used to locate terms that
appear more frequently in co-occurrence than isolated
• t-score compares a pair of terms by comparing the
strength of collocation with a given context
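For the first of these measures, a common choice in term association work is the pointwise variant of mutual information, log2( p(x, y) / (p(x) · p(y)) ). The sketch below assumes that reading and estimates the probabilities from raw counts; the counts in the example are hypothetical.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information (in bits) of terms x and y.

    pair_count: contexts containing both terms
    x_count, y_count: contexts containing each term
    total: number of contexts observed
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts (assumption): the terms co-occur ten times more
# often than independence would predict.
print(pmi(pair_count=10, x_count=20, y_count=50, total=1000))
```

A value near zero indicates that the two terms co-occur about as often as chance would predict; large positive values indicate a strong association.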
Latent semantic analysis (LSA)
• deeper statistical analysis using patterns of occurrences
and not simply first-order co-occurrences
• IR: cognitive or semantic model of search terms as
opposed to keyword matching in order to tackle the
problems caused by synonymy and polysemy
• a statistical model of text used to infer its semantic
properties by estimating the contextual usage
substitutability of words (e.g. tumo[u]r vs. cancer)
• key assumption:
– there is an underlying pattern of term usage across different
contexts that can be used to infer such latent structure
– terms occurring in similar contexts (with respect to their latent
structure) should be strongly correlated at the semantic level
Latent semantic analysis (LSA)
1. term-context matrix: frequencies of occurrences for
each term in each context
2. singular-value decomposition: approximate the original
matrix as a linear combination of k (typically 100 ≤ k ≤
300) orthogonal vectors (indexing dimensions)
• each term is then represented as k-dimensional feature
vector where each coordinate represents a value for
the corresponding indexing dimension
• terms occurring in similar contexts have similar feature
vectors (reflecting the fact that their usage is highly
correlated)
• contextual term similarity is calculated as the cosine
measure for the corresponding feature vectors
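The two steps and the cosine comparison can be sketched with NumPy. The matrix below is a hypothetical toy with three terms and four contexts, and k = 2 is chosen only because the toy matrix is tiny; real applications use far larger matrices and the values of k mentioned above.

```python
import numpy as np


def lsa_term_vectors(term_context: np.ndarray, k: int) -> np.ndarray:
    """Rank-k term vectors from a term-by-context frequency matrix."""
    u, s, _vt = np.linalg.svd(term_context, full_matrices=False)
    # Keep the k largest singular values; scale term coordinates by them.
    return u[:, :k] * s[:k]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical counts (assumption): rows are terms, columns are contexts.
terms = ["tumour", "cancer", "receptor"]
counts = np.array([
    [2.0, 1.0, 0.0, 0.0],   # tumour
    [4.0, 2.0, 0.0, 0.0],   # cancer: same usage pattern as tumour
    [0.0, 0.0, 3.0, 1.0],   # receptor: different contexts
])

vectors = lsa_term_vectors(counts, k=2)
print(cosine(vectors[0], vectors[1]))  # tumour vs. cancer
print(cosine(vectors[0], vectors[2]))  # tumour vs. receptor
```

In this toy case the two terms with proportional usage patterns end up with parallel feature vectors, while the third term is orthogonal to both.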
Beyond similarity:
term relations
Term relations
Verb form: simple indicative
Pattern: Term activates Term
Example: RXR activates GAL4-T3RVP16
Verb form: passive
Pattern: Term is activated by Term
Example: Cre-ERT is activated by tamoxifen
Verb form: gerund
Pattern: Term activating Term
Example: guanosine triphosphatase activating protein
Verb form: nominalization
Pattern: activation of Term by Term
Example: activation of genes by steroid receptors
Verb form: nominalization
Pattern: Term activation by Term
Example: oestrogen receptor activation by oestrogen
Pattern matching for classification
• the meaning of terms is related to the restrictions
according to which these elements may be combined
• this distributional hypothesis states that specific
linguistic relations apply to semantically similar terms
• only terms from restricted semantic classes can appear
in certain predicate-argument structure
• automatically discover the facts such as the one that
specifies that proteins activate genes:
1. extract facts that x activates y through pattern matching
2. map the extracted terms to their semantic classes
3. statistical analysis
4. result: x is a protein in the majority of cases, while y is a gene
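The four steps can be sketched as follows. Everything specific here is an assumption made for illustration: the term names (P1, G1, C1, ...) are placeholders, the class dictionary stands in for an ontology lookup, only two of the activation patterns are covered, and terms are restricted to single tokens.

```python
import re
from collections import Counter

# Step 1 patterns: active and passive forms of "x activates y".
ACTIVE = re.compile(r"(?P<x>\S+) activates (?P<y>\S+)")
PASSIVE = re.compile(r"(?P<y>\S+) is activated by (?P<x>\S+)")

# Hypothetical term-to-class mapping standing in for an ontology lookup.
SEMANTIC_CLASS = {
    "P1": "protein", "P2": "protein", "P3": "protein", "C1": "compound",
    "G1": "gene", "G2": "gene", "G3": "gene",
}


def extract_facts(sentences):
    """Step 1: collect (x, y) pairs such that x activates y."""
    facts = []
    for sentence in sentences:
        for pattern in (ACTIVE, PASSIVE):
            match = pattern.search(sentence)
            if match:
                facts.append((match.group("x"), match.group("y")))
    return facts


def argument_classes(facts):
    """Steps 2 and 3: map terms to classes and count them per argument."""
    agents = Counter(SEMANTIC_CLASS[x] for x, _ in facts)
    targets = Counter(SEMANTIC_CLASS[y] for _, y in facts)
    return agents, targets


sentences = [
    "P1 activates G1",
    "G2 is activated by P2",
    "P3 activates G3",
    "C1 activates G1",
]
facts = extract_facts(sentences)
agents, targets = argument_classes(facts)
# Step 4: the dominant class of each argument.
print(agents.most_common(1), targets.most_common(1))
```

The dominant classes can then be used in the opposite direction, to suggest a class for an unclassified term found in the same argument position.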
Lexico-syntactic patterns
• linguistic relations between term occurrences in the
text reflect the corresponding conceptual relations
• this correspondence is not one-to-one: each
conceptual relation may be linguistically represented
by a number of different syntactic constructions
• although the variability of a natural language cannot be
predicted, certain constructions are frequently used
and can be used in a pattern-matching approach to
extract the specified information
• high precision (98%), but typically low recall (<20%)
SOLD approach
SOLD measure
• SOLD = Syntactic, Ontology-driven & Lexical Distance
• hybrid approach to comparing term contexts, which
relies on:
– linguistic information (acquired through tagging and
parsing)
– domain-specific knowledge (obtained from the
ontology)
• roughly based on the approximate pattern matching
(i.e. ED)
• combines ontology-based similarity with corpus-based
similarity using both internal and contextual features
SOLD measure
• the ED is used to account for structural differences in
term contexts while making it more flexible with
respect to lexical and terminological variations
• approximate matching not only for a term context as a
whole, but for its individual constituents as well
• different types of features combined:
– syntactic
– lexical
– semantic
Context alignment
The ecdysone receptor (EcR) is a member of
the large family of nuclear hormone
receptors, which are ligand regulated
transcription factors.
The classical receptor for estradiol is a
member of a super-family of nuclear
receptors that function as hormone regulated
transcription factors.
Context alignment
The|ecdysone receptor|(|EcR|)|is|a member|
of|the|large family|of|nuclear hormone receptors|,|which|
are|ligand|regulated|transcription factors|.
The|classical|receptor|for|estradiol|is|a member|
of|a|super-family|of|nuclear receptors|that|
function|as|hormone|regulated|transcription factors|.
Context alignment
The|---------|ecdysone receptor|(  |EcR      |)|is|a member|
The|classical|receptor         |for|estradiol|-|is|a member|

of|the|large family|of|nuclear hormone receptors|,|which|
of|a  |super-family|of|nuclear receptors        |-|that |

are     |--|ligand |regulated|transcription factors|.
function|as|hormone|regulated|transcription factors|.
Context alignment – another example
----------------------------------------|-|----|-|The|
Human 1,25-dihydroxyvitamin D-3 receptor|(|hVDR|)|and|

ecdysone receptor      |(|EcR|)|is|a member|of|the|
glucocorticoid receptor|(|GR |)|, |members |of|the|

large family                           |of|
steroid/thyroid hormone receptor family|, |

nuclear hormone receptors|,|which|are|ligand        |
-------------------------|-|-----|are|heterologously|

regulated|--|-----|transcription factors|.
regulated|by|other|steroids             |.
Applications
• term...
– ... recognition / NER
– ... disambiguation
– ... association:
• ... variation
• ... clustering
• ... classification
• ... relations
Term relations
• matching a pair of terms against a pair of terms linked by
a domain-specific relation in the ontology instead of
matching a single term to a classified term
• example: interact( COUP-TF II, p300 ) specified in the
ontology
• alignment:
COUP-TF II|-----|directly    |interacts|with|p300
ARA70     |which|specifically|interacts|with|androgen receptor
• hypothesis: interact( ARA70, androgen receptor )
• no need for patterns to be generalised and explicitly
specified!
Flexible IE: approximate rule matching
• traditional IE systems rely on prespecified sets of pattern
matching rules in order to extract information about
entities and relations of interest
• pattern: Term [Adv] V:interact Prep:with Term
• text: COUP-TF II directly interacts with p300
• hypothesis: interact( COUP-TF II, p300 )
• text:
ARA70 which specifically interacts with
androgen receptor
• no exact match!
• the similarity between two contexts can be used to
approximately match the second context to the given
pattern
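Approximate rule matching can be illustrated with an edit distance computed over pattern slots instead of characters. This is a simplified stand-in rather than the SOLD measure itself: it assumes unit costs, hand-assigned slot labels for the two example contexts, a hypothetical acceptance threshold of one edit, and it treats the optional [Adv] slot as an ordinary slot.

```python
def sequence_distance(xs, ys):
    """Unit-cost edit distance between two sequences of slot labels."""
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        curr = [i]
        for j, y in enumerate(ys, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or replacement
        prev = curr
    return prev[-1]


PATTERN = ["Term", "Adv", "V:interact", "Prep:with", "Term"]


def matches(tagged_context, pattern=PATTERN, max_distance=1):
    """Accept a context that is within max_distance edits of the pattern."""
    return sequence_distance(tagged_context, pattern) <= max_distance


# COUP-TF II directly interacts with p300
exact = ["Term", "Adv", "V:interact", "Prep:with", "Term"]
# ARA70 which specifically interacts with androgen receptor
approximate = ["Term", "Pron", "Adv", "V:interact", "Prep:with", "Term"]

print(sequence_distance(exact, PATTERN), matches(exact))
print(sequence_distance(approximate, PATTERN), matches(approximate))
```

The second context differs from the pattern by a single inserted pronoun, so it is accepted under the threshold even though no exact match exists.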
Flexible IE: rule induction
• as natural languages are complex phenomena, rule-based
NLP methods necessarily give rise to numerous exceptions
• the SOLD measure can be used to identify and learn new
rules automatically
Term      |-----|[Adv]       |V:interact|Prep:with|Term
COUP-TF II|-----|directly    |interacts |with     |p300
ARA70     |which|specifically|interacts |with     |androgen receptor
• new rule:
Term [Pron:which] [Adv] V:interact Prep:with Term
• generalise:
Term [Pron] [Adv] V:interact Prep:with Term
Flexible IE: rule management
• SOLD can directly be applied to lexico-syntactic
patterns (as part of the IE rules) to detect the
redundant rules
• patterns can be compared and clustered automatically
based on the received values of the SOLD measure and
one rule (e.g. the shortest one) per cluster retained
Term|------|[Adv]|V:interact|Prep:with|Term
Term|[Pron]|[Adv]|V:interact|Prep:with|Term
• only the insertion of a pronoun is required
• e.g. the longer rule can be removed and still be
covered by combining the SOLD measure and the
remaining rule
Advantages
• generality
• implicit vs. explicit
• flexibility
• adaptability
• portability
• versatility
MaSTerClass
http://www.cbr-masterclass.org/
MaSTerClass approach
[Figure: MaSTerClass case-based reasoning cycle. A new case (an unclassified term occurrence: term (---) + context) is matched against a case-base of classified cases (term (class) + context) built from an annotated corpus. Retrieval and matching yield similar cases, using a similarity measure that draws on linguistic, domain-specific and general knowledge; the matched cases vote, producing a classified case.]
• retrieve
• reuse
• revise
• retain
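The retrieve and reuse steps can be sketched as a nearest-neighbour vote. This is a minimal illustration under stated assumptions, not the MaSTerClass implementation: contexts are reduced to sets of words, the Jaccard coefficient stands in for the similarity measure, the case base is a hypothetical toy, and the revise and retain steps are left out.

```python
from collections import Counter


def jaccard(x: set, y: set) -> float:
    """Stand-in similarity between two contexts (assumption)."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


# Hypothetical case base: (context words, semantic class of the term).
CASE_BASE = [
    ({"binds", "ligand", "nuclear"}, "receptor"),
    ({"binds", "hormone", "nuclear"}, "receptor"),
    ({"catalyses", "hydrolysis", "substrate"}, "enzyme"),
]


def classify(context: set, case_base=CASE_BASE, k: int = 2) -> str:
    """Retrieve the k most similar cases and let them vote on the class."""
    ranked = sorted(case_base,
                    key=lambda case: jaccard(context, case[0]),
                    reverse=True)
    votes = Counter(cls for _context, cls in ranked[:k])
    return votes.most_common(1)[0][0]


print(classify({"binds", "nuclear", "DNA"}))
print(classify({"catalyses", "substrate"}, k=1))
```

Retaining a newly classified case would amount to appending it to the case base, so that later term occurrences can be matched against it.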
Ontologies and text mining
Conclusion
Ontology update
• extracted information about term associations including
term ... :
– ... variants
– ... classes
– ... relations
can be used to update biomedical ontologies by ... :
– ... improving their terminological coverage
– ... positioning new terms in hierarchies of semantic types
– ... linking new terms and existing terms
Ontology verification
• biomedical ontologies are not free from inconsistencies
regarding terms and their relations
• term associations can be used to detect:
– lexical inconsistencies (inconsistent use of linguistic
phenomena in the formation of terms)
– structural inconsistencies (inconsistent organisation
of terms)
• clustering of co-related terms can be used to verify
instances of semantic relations
• classification of terms can be used to verify the
taxonomic aspects
Conclusion
• knowledge integration is the key to progress in
biomedicine
• different sources and types of information need to be
integrated
• text mining & ontologies: a marriage of mutual interest
• ontologies are needed to support the semantic aspects
of text mining
• the results of text mining can be incorporated into
ontologies to fill the gap between the existing
knowledge and its formal representation
Conclusion
• textual evidence needs to be linked to ontologies as the
main repositories of formally represented knowledge
• terminology is the principal link between text and an
ontology
• terms as the essential means of scientific discourse need
to be automatically identified and associated through:
similarities, clustering, (co)relations, classification
• terms and their associations (both retrieved from formal
repositories of knowledge and identified automatically)
represent a basis for sophisticated text mining in
biomedicine
Acknowledgements
• Dr Sophia Ananiadou
NaCTeM & University of Salford
• Prof. Douglas Kell
University of Manchester
• Dr Goran Nenadić
University of Manchester
Manchester Interdisciplinary Biocentre