BioChain:
Using Lexical Chaining
Approaches for Biomedical
Text Summarization
Lawrence Reeve
INFO780 - Final Report – Summer 2005
Discussions
  BioChain
    Goal & Approach
    BioChain Process
    Evaluation
      Using other summarization systems
      Comparing abstract vs full-text
  Summarization
    DUC 2004 System Examples
  Summary
BioChain Goal

Take biomedical abstract (or full text) and generate a
summary:
Adjuvant Chemotherapy for Adult Soft Tissue Sarcomas of the Extremities and Girdles: Results of the Italian Randomized Cooperative
Trial. (Frustaci et al, 2001)
Adjuvant chemotherapy for soft tissue sarcoma is controversial because previous trials reported conflicting results. The present study was
designed with restricted selection criteria and high dose-intensities of the two most active chemotherapeutic agents.
Patients and Methods: Patients between 18 and 65 years of age with grade 3 to 4 spindle-cell sarcomas (primary diameter >= 5 cm or any
size recurrent tumor) in extremities or girdles were eligible. Stratification was by primary versus recurrent tumors and by tumor diameter
greater than or equal to 10 cm versus less than 10 cm. One hundred four patients were randomized, 51 to the control group and 53 to the
treatment group (five cycles of 4'-epidoxorubicin 60 mg/m2 days 1 and 2 and ifosfamide 1.8 g/m2 days 1 through 5, with hydration,
mesna, and granulocyte colony-stimulating factor).
Results: After a median follow-up of 59 months, 60 patients had relapsed and 48 died (28 and 20 in the treatment arm and 32 and 28 in
the control arm, respectively). The median disease-free survival (DFS) was 48 months in the treatment group and 16 months in the
control group (P = .04); and the median overall survival (OS) was 75 months for treated and 46 months for untreated patients
(P = .03). For OS, the absolute benefit deriving from chemotherapy was 13% at 2 years and increased to 19% at 4 years (P = .04).
Conclusion: Intensified adjuvant chemotherapy had a positive impact on the DFS and OS of patients with high risk extremity soft tissue
sarcomas at a median follow-up of 59 months. Therefore, our data favor an intensified treatment in similar cases. Although cure is still
difficult to achieve, a significant delay in death is worthwhile, also considering the short duration of treatment and the absence of toxic deaths.
BioChain Goal

Work done in conjunction with DUCoM


Ari Brooks, M.D.
What’s the latest, best information on cancer treatment?

Current focus is on clinical trial papers

Database of ~1,200 manually processed papers

Current goal: Summarize a single clinical trial paper

Ultimate goal: Summarize multiple clinical trial
documents
BioChain Approach

Apply methods/concepts from lexical chaining:
  Cluster (chain) words together based on semantic-relatedness
    Words are chained together based on word ‘senses’ (concepts)

Lexical Chaining…
  identifies lexical cohesion
    property causing sentences to ‘hang together’ (Morris & Hirst, 1991)
  captures core themes of a text (aboutness)
  is an intermediate format

Example: (Doran et al., 2004)
  “The house contains an attic. The home is a cabin.”
  Lexical Chain: dwelling → {house, attic, home, cabin}
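
Sketch (not from the original slides): the house/attic/home/cabin example can be reproduced with a toy lookup table. TOY_SENSES and build_chains are hypothetical names; a real lexical chainer would consult a thesaurus such as WordNet and disambiguate senses instead of using a hand-written table.

```python
from collections import defaultdict

# Hypothetical word -> sense table standing in for a thesaurus lookup.
TOY_SENSES = {
    "house": "dwelling",
    "attic": "dwelling",
    "home": "dwelling",
    "cabin": "dwelling",
}

def build_chains(words, senses):
    """Group words that share a sense into one chain per sense."""
    chains = defaultdict(list)
    for word in words:
        sense = senses.get(word.lower())
        if sense is not None:  # words without a known sense are skipped
            chains[sense].append(word.lower())
    return dict(chains)

text = "The house contains an attic. The home is a cabin."
tokens = [t.strip(".,") for t in text.split()]
chains = build_chains(tokens, TOY_SENSES)
print(chains)  # {'dwelling': ['house', 'attic', 'home', 'cabin']}
```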
Implemented Using UMLS

Key UMLS resources used:
  Metathesaurus
    Maps terms into concepts
  Semantic Network
    organizes related concepts
  MetaMap Transfer Application
    text-to-concept mapping tool
BioChain Process
Source Text Input

Abstract or full text from PubMed
Need to identify noun phrases within each sentence
  concepts are derived from noun phrases using vocabulary in Metathesaurus
Sentences must be sequentially ordered
PDF conversion issues
  Columns
  Captions
  Bibliography
  Reference numbers
  Images of documents
  Text tables
MetaMap Transfer

Maps noun phrases
  to UMLS Metathesaurus concepts
  to UMLS Semantic Types

[Figure labels: Sentence/Phrase, Candidate Scores, Candidate Concepts, Final Mapping Concept, Semantic Type(s)]
Source: http://mmtx.nlm.nih.gov/runMMTx.shtml
UMLS Metathesaurus

Vocabulary database:
  Contains concepts, terms and relationships
  Incorporates more than 100 source vocabularies (SNOMED-CT, CPT, others)
  1 million concepts
  5 million terms
links alternative terms of the same concept together
identifies relationships between different concepts
  co-occurrence
  parent, child, sibling
  synonymy
(National Library of Medicine, 2005d)
UMLS Metathesaurus
[Figure labels: Concept, Terms]
Source: http://www.nlm.nih.gov/research/umls/meta2.html
UMLS Semantic Network

Provides:
  categorization of all concepts in the UMLS Metathesaurus
  relationships between concepts
Consists of:
  135 semantic types
  54 relationships
(National Library of Medicine, 2005d)
UMLS Semantic Network
Source: http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
Concept Chaining

Use semantic network to link together related concepts:
  Ex: T081 - Quantitative Concept (semantic type)
    High dose (concept)
    cm (concept)
    Size (concept)
    Median Statistical Measurement (concept)
MetaMap Transfer:
  Noun phrase → concept → semantic type
BioChain:
  Semantic type → concept, concept, concept
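
Sketch (not from the original slides): the change of direction, from MetaMap Transfer's per-phrase mapping to BioChain's per-type chain, amounts to grouping concepts by semantic type. The concept names are those of the T081 example; the noun phrases and the function name chain_by_semantic_type are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (noun phrase, concept, semantic type id) triples. The concept
# names come from the T081 example; the noun phrases are invented.
mappings = [
    ("high dose", "High dose", "T081"),
    ("10 cm", "cm", "T081"),
    ("any size", "Size", "T081"),
    ("median follow-up", "Median Statistical Measurement", "T081"),
]

def chain_by_semantic_type(triples):
    """Invert phrase -> concept -> type into type -> [concepts]."""
    chains = defaultdict(list)
    for _phrase, concept, type_id in triples:
        chains[type_id].append(concept)
    return dict(chains)

chains = chain_by_semantic_type(mappings)
print(chains["T081"])
```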
Concept Chaining

Internal storage:
  Array of semantic types formed
    135 semantic types, each has a type id
      Ex: T061 - Therapeutic or Preventive Procedure
    135 entries indexed by semantic id
  Each semantic type entry holds a list of concepts found in the source text
  Each concept instance in semantic type entry contains:
    Original noun phrase
    Sentence number
    Section (paragraph) number
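
Sketch (not from the original slides): one possible shape for this storage. All class and function names are hypothetical, and a dict keyed by type id is swapped in for the 135-entry array; the two sample instances are taken from the T061 chain example.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptInstance:
    concept: str   # Metathesaurus concept name
    phrase: str    # original noun phrase
    sentence: int  # sentence number
    section: int   # section (paragraph) number

@dataclass
class SemanticTypeEntry:
    type_id: str
    name: str
    instances: list = field(default_factory=list)

def make_store(semantic_types):
    """One entry per semantic type (a dict here, not a fixed array)."""
    return {tid: SemanticTypeEntry(tid, name)
            for tid, name in semantic_types.items()}

store = make_store({
    "T061": "Therapeutic or Preventive Procedure",
    "T081": "Quantitative Concept",
})
store["T061"].instances.append(
    ConceptInstance("Chemotherapy, Adjuvant", "Adjuvant Chemotherapy", 0, 0))
store["T061"].instances.append(
    ConceptInstance("Therapeutic procedure", "intensified treatment", 14, 4))
print(len(store["T061"].instances), len(store["T081"].instances))
```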
Sample Abstract (Frustaci et al, 2001)
Adjuvant Chemotherapy for Adult Soft Tissue Sarcomas of the Extremities and Girdles: Results of
the Italian Randomized Cooperative Trial.
Adjuvant chemotherapy for soft tissue sarcoma is controversial because previous trials reported
conflicting results. The present study was designed with restricted selection criteria and high dose-intensities of
the two most active chemotherapeutic agents.
Patients and Methods: Patients between 18 and 65 years of age with grade 3 to 4 spindle-cell sarcomas
(primary diameter >= 5 cm or any size recurrent tumor) in extremities or girdles were eligible.
Stratification was by primary versus recurrent tumors and by tumor diameter greater than or equal to 10 cm
versus less than 10 cm. One hundred four patients were randomized, 51 to the control group and 53 to the
treatment group (five cycles of 4'-epidoxorubicin 60 mg/m2 days 1 and 2 and ifosfamide 1.8 g/m2 days 1
through 5, with hydration, mesna, and granulocyte colony-stimulating factor).
Results: After a median follow-up of 59 months, 60 patients had relapsed and 48 died (28 and 20 in the
treatment arm and 32 and 28 in the control arm, respectively). The median disease-free survival (DFS) was 48
months in the treatment group and 16 months in the control group (P = .04); and the median overall survival
(OS) was 75 months for treated and 46 months for untreated patients (P = .03). For OS, the absolute benefit
deriving from chemotherapy was 13% at 2 years and increased to 19% at 4 years (P = .04).
Conclusion: Intensified adjuvant chemotherapy had a positive impact on the DFS and OS of
patients with high risk extremity soft tissue sarcomas at a median follow-up of 59 months. Therefore, our data
favor an intensified treatment in similar cases. Although cure is still difficult to achieve, a significant
delay in death is worthwhile, also considering the short duration of treatment and the absence of toxic deaths.
Concept Chain - Example
T061 - Therapeutic or Preventive Procedure: 6.0    [Semantic Type]
  phrase: ‘Adjuvant Chemotherapy’
  concept: Chemotherapy, Adjuvant    [Metathesaurus Concept]
  sentence#0, section#0

  phrase: ‘Adjuvant chemotherapy’
  concept: Chemotherapy, Adjuvant
  sentence#2, section#1

  phrase: ‘primary diameter cm’
  concept: Primary operation (qualifier value)
  sentence#5, section#2

  phrase: ‘Intensified adjuvant chemotherapy’
  concept: Chemotherapy, Adjuvant
  sentence#13, section#4

  phrase: ‘intensified treatment’
  concept: Therapeutic procedure
  sentence#14, section#4
Chain Scoring

Each chain has a score
  Indicates degree a semantic type is discussed in text
Lexical chaining research identified 3 factors for strength: (Morris & Hirst, 1991)
  Reiteration: more repetition is better
  Density: shorter distance between concepts is better
  Length: longer chain length is better
Using method from University College Dublin (Doran, Stokes, Dunnion, & Carthy, 2004)
  Frequency of most frequent concept (reiteration) * number of unique concept occurrences
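
Sketch (not from the original slides): one reading of the scoring rule. It assumes "number of unique concept occurrences" means the number of distinct concepts in the chain; the slide wording also admits other readings (for example, counting only concepts that occur once), so this may not reproduce the scores shown in the examples. The function name and the sample chain are hypothetical.

```python
from collections import Counter

def chain_score(concepts):
    """Frequency of the most frequent concept * number of distinct concepts.

    Assumption: "unique concept occurrences" = distinct concepts.
    """
    if not concepts:
        return 0.0
    counts = Counter(concepts)
    return float(max(counts.values()) * len(counts))

# Made-up chain: "cm" occurs 3 times, and there are 3 distinct concepts.
toy_chain = ["cm", "Size", "cm", "High dose", "cm"]
print(chain_score(toy_chain))  # 9.0
```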
Chain Scoring (cont’d)

Assign score of 0 unless the chain's semantic type is one of these:

Semantic Type ID   Semantic Type Name
T037               Injury or Poisoning
T051               Event
T052               Activity
T061               Therapeutic or Preventive Procedure
T062               Research Activity
T067               Phenomenon or Process
T081               Quantitative Concept
T169               Functional Concept
T170               Intellectual Product
T191               Neoplastic Process
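
Sketch (not from the original slides): applying the filter to a table of chain scores. ALLOWED_TYPES and apply_filter are hypothetical names; type ids are written zero-padded, as on the other slides (T061, T081). The sample scores are illustrative.

```python
# Semantic types that keep their score; all others are set to 0.
ALLOWED_TYPES = {
    "T037", "T051", "T052", "T061", "T062",
    "T067", "T081", "T169", "T170", "T191",
}

def apply_filter(scores, allowed=ALLOWED_TYPES):
    """Return a copy of scores with non-allowed semantic types zeroed."""
    return {tid: (score if tid in allowed else 0.0)
            for tid, score in scores.items()}

raw = {"T081": 14.0, "T061": 6.0, "T079": 4.0}
filtered = apply_filter(raw)
print(filtered)  # {'T081': 14.0, 'T061': 6.0, 'T079': 0.0}
```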
Strong Chains

Strong chains identify ‘best’ semantic types in text
Lexical chaining research identifies 3 factors for strength: (Morris & Hirst, 1991)
  Reiteration: more repetition is better
  Density: shorter distance between concepts is better
  Length: longer chain length is better
Lexical chaining research generally uses:
  two standard deviations above the mean of the scores computed for every chain in the document (Barzilay and Elhadad, 1997)
Strong Chains – Example

Top chains:
  T081-Quantitative Concept, score: 14.0
  T061-Therapeutic or Preventive Procedure, score: 6.0
  T169-Functional Concept, score: 6.0
  T079-Temporal Concept, score: 4.0
  T080-Qualitative Concept, score: 4.0
  T082-Spatial Concept, score: 4.0
  T073-Manufactured Object, score: 2.0
  T109-Organic Chemical, score: 2.0
  T170-Intellectual Product, score: 2.0
  T121-Pharmacologic Substance, score: 1.0

Strong chains: (2 StdDev)
  Avg score: 1.6666666666666667
  Std Dev: 3.0671497204093914
  Strong Score: 7.80096610748545
  T081-Quantitative Concept: 14.0

Strong chains: (1 StdDev)
  Avg score: 1.6666666666666667
  Std Dev: 3.0671497204093914
  Strong Score: 4.733816387076058
  T081-Quantitative Concept: 14.0
  T061-Therapeutic or Preventive Procedure: 6.0
  T169-Functional Concept: 6.0
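
Sketch (not from the original slides): the strong-chain cutoff as mean plus a multiple of the standard deviation. It assumes a population standard deviation and 27 chains in total, with the 17 chains not listed on the slide all scoring 0 (their ids below are placeholders); under those assumptions the cutoffs come out at roughly 7.80 and 4.73. The function name is hypothetical.

```python
from statistics import mean, pstdev

def strong_chains(scores, num_std=2.0):
    """Return (threshold, chains whose score exceeds the threshold)."""
    values = list(scores.values())
    threshold = mean(values) + num_std * pstdev(values)
    strong = {tid: s for tid, s in scores.items() if s > threshold}
    return threshold, strong

# The ten scores listed on the slide.
scores = {
    "T081": 14.0, "T061": 6.0, "T169": 6.0, "T079": 4.0, "T080": 4.0,
    "T082": 4.0, "T073": 2.0, "T109": 2.0, "T170": 2.0, "T121": 1.0,
}
# Assumption: 17 further chains, all with score 0 (placeholder ids).
scores.update({f"other{i}": 0.0 for i in range(17)})

cut2, strong2 = strong_chains(scores, num_std=2.0)
cut1, strong1 = strong_chains(scores, num_std=1.0)
print(round(cut2, 4), sorted(strong2))
print(round(cut1, 4), sorted(strong1))
```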
Identifying Top Concepts

Part of sentence extraction process
Get top chains (top semantic types)
  based on chain strength
Perform frequency count on concepts within chains
  concept(s) with highest frequency is top concept
Another approach:
  Identify concept relationship types
    assign weight to each relationship type (synonymy, siblings, parent, child)
  Score each concept based on contribution to chain
  Choose highest scoring concept
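
Sketch (not from the original slides): the frequency-count approach for picking the top concept(s) of a chain. The function name is hypothetical; the sample data is the T061 chain from the concept chain example. Ties are all returned, matching "concept(s)".

```python
from collections import Counter

def top_concepts(chain_concepts):
    """Return every concept that shares the highest frequency in a chain."""
    counts = Counter(chain_concepts)
    if not counts:
        return []
    best = max(counts.values())
    return [concept for concept, n in counts.items() if n == best]

t061_chain = [
    "Chemotherapy, Adjuvant",
    "Chemotherapy, Adjuvant",
    "Primary operation (qualifier value)",
    "Chemotherapy, Adjuvant",
    "Therapeutic procedure",
]
print(top_concepts(t061_chain))  # ['Chemotherapy, Adjuvant']
```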
Sentence Extraction

Use extractive approach
Identify main concepts in text using semantic types
Identify which sentences discuss the main concepts the most
  Using chain strength and concept frequency
Sentence Extraction – Examples
Top Concepts – 2 standard deviations
T081-Quantitative Concept
  Concept: Median Statistical Measurement, sentence#9
  Sentence: The median disease-free survival (DFS) was 48 months in the treatment group and 16 months in the control group (P = .04);
  Concept: Median Statistical Measurement, sentence#10
  Sentence: and the median overall survival (OS) was 75 months for treated and 46 months for untreated patients (P = .03).
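
Sketch (not from the original slides): extracting the sentences that contain a top concept, in document order. The function name is hypothetical, and the ("cm", 5) instance is an illustrative non-matching entry rather than data from the slides.

```python
def extract_sentences(sentences, instances, top_concepts):
    """Return sentences, in order, that contain any of the top concepts.

    sentences: mapping of sentence number -> sentence text
    instances: (concept, sentence number) pairs from a strong chain
    """
    wanted = set(top_concepts)
    picked = sorted({num for concept, num in instances if concept in wanted})
    return [sentences[num] for num in picked]

sentences = {
    5: "(primary diameter >= 5 cm or any size recurrent tumor)",
    9: "The median disease-free survival (DFS) was 48 months ...",
    10: "and the median overall survival (OS) was 75 months ...",
}
instances = [
    ("cm", 5),  # illustrative entry that should not be selected
    ("Median Statistical Measurement", 9),
    ("Median Statistical Measurement", 10),
]
summary = extract_sentences(
    sentences, instances, ["Median Statistical Measurement"])
print(summary)
```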
Evaluation

Qualitative
  Domain expert: Dr. Ari Brooks
  Provided concept filtering
Quantitative
  Concept chains: Compare abstract vs. full text (Silber and McCoy, 2002)
    Recall: Percentage of strong chains from the main text that are in the abstract
    Precision: Percentage of concept instances in the abstract that also appear in strong chains in the document
  Summarization:
    Compare with Word 2002, SweSum, Copernic
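
Sketch (not from the original slides): the two measures written as set operations. It assumes a strong chain counts as "in the abstract" when the abstract contains at least one concept of that semantic type; function names and sample data are hypothetical.

```python
def chain_recall(strong_types, abstract_types):
    """Share of the document's strong chains that also occur in the abstract."""
    strong = set(strong_types)
    if not strong:
        return 0.0
    return len(strong & set(abstract_types)) / len(strong)

def chain_precision(abstract_instances, strong_types):
    """Share of abstract concept instances whose type is a strong chain.

    abstract_instances: (concept, semantic type id) pairs from the abstract
    """
    if not abstract_instances:
        return 0.0
    strong = set(strong_types)
    hits = sum(1 for _concept, tid in abstract_instances if tid in strong)
    return hits / len(abstract_instances)

strong = ["T081", "T061", "T169"]          # strong chains in the full text
abstract = [                               # made-up abstract instances
    ("cm", "T081"),
    ("Size", "T081"),
    ("Chemotherapy, Adjuvant", "T061"),
    ("months", "T079"),
]
recall = chain_recall(strong, {tid for _c, tid in abstract})
precision = chain_precision(abstract, strong)
print(round(recall, 2), precision)  # 0.67 0.75
```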
Evaluation
How similar are sentences extracted by BioChain to other systems?
Evaluation
Do abstracts adequately represent the full-text?
Evaluation

Avg p=0.90, r=0.92
Avg # of strong chains in full-text is 3
  Represents 2% of all possible semantic types
Avg unique UMLS concepts in abstract is 8
  Avg 80% coverage of concepts in filter
Diversity test
  p=0.00, r=0.33
DUC 2004 Summarization Approaches

Systems:
  News Story
  LAKE
  KMS
  GISTexter
All used extractive sentence approach
DUC 2004 – News Story

C5.0 decision tree to predict words in a summary
Used 8 features:
  TF of word in document
  IDF of term in external news corpus
  position of word from start of document
  Lexical cohesion score between word and document
  Binary Flags: noun, verb, adjective, noun phrase
Results:
  TF, word position and IDF have greatest impact on summary quality
  lexical cohesion adds little as feature in decision tree
DUC 2004 – LAKE

keyphrase extraction approach
  extracts all uni-grams, bi-grams, tri-grams, and four-grams and filters them with part-of-speech patterns
Naïve Bayes classifier, trained on manual keyphrases, identifies relevant keyphrases using:
  keyphrase head TF*IDF
  distance of keyphrase from the start of document
Classifier identifies candidate phrases that maximize TF*IDF and occur at beginning of document
Results:
  Scored in middle of all submissions
  Possible improvement: add features that capture the semantic properties of keyphrases, such as lexical chains
DUC 2004 – KMS

Text decomposed into a parse tree format
  identify noun phrases and score them based on a frequency analysis of terms in the noun phrases
Results:
  frequency-based approach performs better than systems based on other approaches
  Simple to implement
DUC 2004 – GISTexter

computes weight for each term in collection based on term frequency in a relevant set of documents
Sentence score = sum of weights of each term in sentence
Top scoring sentences are then extracted
Results
  Performed among the best systems
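
Sketch (not from the original slides): sentence scoring as a sum of term weights. It assumes the weight of a term is simply its raw frequency in the relevant documents, which is a simplification and not necessarily GISTexter's actual weighting; tokenization is whitespace-only, and all names and sample texts are hypothetical.

```python
from collections import Counter

def term_weights(relevant_docs):
    """Assumed weighting: raw term frequency across the relevant documents."""
    return Counter(w for doc in relevant_docs for w in doc.lower().split())

def score_sentence(sentence, weights):
    """Sentence score = sum of the weights of its terms."""
    return sum(weights.get(w, 0) for w in sentence.lower().split())

def top_sentences(sentences, weights, k=1):
    """Return the k highest-scoring sentences."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, weights),
                    reverse=True)
    return ranked[:k]

docs = ["tumor survival treatment", "survival treatment", "survival"]
weights = term_weights(docs)
candidates = ["the tumor was large", "survival improved with treatment"]
print(top_sentences(candidates, weights))
```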
Summary

Want to summarize biomedical texts (specifically oncology)

Use lexical chaining approaches with existing UMLS resources to
identify the ‘aboutness’ of a text using concepts vs terms

Extract sentences containing strongest concepts within a semantic
type chain

Result is an indicative summary of what text is about

Evaluation shows concept chaining is strong between human
summary and full-text
References

Afantenos, S. D., Karkaletsis, V., & Stamatopoulos, P. (2005). Summarization from Medical Documents: A Survey. Artificial Intelligence in Medicine, 33(2), 157-177.
Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium 2001, 17-21.
Barzilay, R., & Elhadad, M. (1997). Using Lexical Chains for Text Summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), ACL, Madrid, Spain, 10-18.
Copernic Technologies Inc. (2005). Copernic Summarizer. Canada. Retrieved August 7, 2005, from http://www.copernic.com
D’Avanzo, E., Magnini, B., & Vallin, A. (2004). Keyphrase Extraction for Summarization Purposes: The LAKE System at DUC-2004. Proceedings of the 2004 Document Understanding Conference, Boston, USA. Retrieved June 3, 2005.
Dalianis, H. (2000). SweSum - A Text Summarizer for Swedish (No. TRITA-NA-P0015). Stockholm, Sweden: NADA, KTH.
Doran, W., Stokes, N., Carthy, J., & Dunnion, J. (2004). Comparing Lexical Chain-based Summarisation Approaches using an Extrinsic Evaluation. Proceedings of the Global WordNet Conference (GWC 2004).
Doran, W. P., Stokes, N. S., Dunnion, J., & Carthy, J. (2004). Assessing the Impact of Lexical Chain Scoring Methods and Sentence Extraction Schemes on Summarization. Proceedings of the 5th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2004).
Doran, W., Stokes, N., Newman, E., Dunnion, J., Carthy, J., & Toolan, F. (2004). News Story Gisting at University College Dublin. Proceedings of the Document Understanding Conference (DUC-2004).
Fellbaum, C. (1998). WORDNET: An Electronic Lexical Database. The MIT Press.
Galley, M., & McKeown, K. (2003). Improving Word Sense Disambiguation in Lexical Chaining. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 1486-1488.
Lacatusu, F., Hickl, A., Harabagiu, S., & Nezda, L. (2004). Lite-GISTexter at DUC 2004. Proceedings of the 2004 Document Understanding Conference. Retrieved June 10, 2005.
Lin, C. (2005). Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Retrieved August 20, 2005 from http://www.isi.edu/~cyl/ROUGE/
Litkowski, K. C. (2004). Summarization Experiments in DUC 2004. Proceedings of the 2004 Document Understanding Conference, Boston, USA. Retrieved June 5, 2005.
Microsoft Corporation. (2002). Microsoft Word 2002. Redmond, Washington, USA. Retrieved August 7, 2005, from http://office.microsoft.com
Morris, J., & Hirst, G. (1991). Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics, 17(1), 21-43.
National Institute of Standards and Technology (NIST). (2005). Document Understanding Conferences. Retrieved August 20, 2005 from http://www-nlpir.nist.gov/projects/duc/
Silber, G. H., & McCoy, K. F. (2002). Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization. Computational Linguistics, 28(4).
SNOMED International. (2005). SNOMED Clinical Terms. Retrieved July 31, 2005 from http://www.snomed.org/
Turney, P. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 303-336.
United States National Library of Medicine. (2005a). ClinicalTrials.gov. Retrieved July 31, 2005 from http://www.clinicaltrials.gov/
United States National Library of Medicine. (2005b). MetaMap Transfer. Retrieved July 31, 2005 from http://mmtx.nlm.nih.gov/
United States National Library of Medicine. (2005c). PubMed. Retrieved July 31, 2005 from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
United States National Library of Medicine. (2005d). Unified Medical Language System (UMLS). Retrieved July 5, 2005 from http://www.nlm.nih.gov/research/umls/
United States National Library of Medicine. (2004a). UMLS Metathesaurus Fact Sheet. Retrieved July 31, 2005 from http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
United States National Library of Medicine. (2004b). UMLS Semantic Network Fact Sheet. Retrieved July 31, 2005 from http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html