Biological Data Extraction
and Integration
A Research Area
Background Study
Cui Tao
Department of Computer Science
Brigham Young University
Research Field Overview
My research
Semantic Web
Data Integration
Schema Matching
Information Extraction
Bioinformatics
2
Information Extraction
• “Information extraction systems process
text documents and locate a specific set of
relevant items.” [Califf99]
• “Because the WWW consists primarily of
text, information extraction is central to all
effort that would use the web as a
resource for knowledge discovery.”
[Freitag98]
4
Information Extraction
• Traditional information extraction
• Hidden web crawling
• Biological data extraction
5
Traditional Information Extraction
• Different groups of IE tools: [Laender02]
– Wrapper generation tools
– NLP-based and learning-based tools
– Ontology-based tools
6
Traditional Information Extraction
• Wrapper generation tools
– Lixto [Baumgartner01]
• Supervised wrapper generation
• Semi-automatically
• Not robust; does not work well with unstructured
data
– ROADRUNNER [Crescenzi01]
• Fully automatic wrapper generation
• Does not generate robust and general wrappers
• Only works for highly regular web pages
7
Traditional Information Extraction
• NLP-based and learning-based tools
– SRV [Freitag98]
• Top-down learner
• Learns based on simple and relational features
• Single slot filling
– RAPIER [Califf99]
• Bottom-up learner
• Learns pre-filler, slot filler, and post-filler patterns
• Only works for free text
• Single slot filling
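A minimal sketch of the pre-filler / slot-filler / post-filler idea. The rule below is hand-written for illustration, whereas RAPIER learns its rules from training examples; the pattern, the "location" slot, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical hand-written rule (RAPIER would learn this from examples):
#   pre-filler  = the words "located in"
#   slot filler = one or more capitalized words
#   post-filler = a comma or a period
RULE = re.compile(
    r"(?P<pre>located in)\s+"
    r"(?P<filler>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
    r"(?P<post>[,.])"
)

def fill_slot(text):
    """Return the first filler for the single 'location' slot, or None."""
    match = RULE.search(text)
    return match.group("filler") if match else None

print(fill_slot("The lab is located in Salt Lake City, near the campus."))
```

Only one slot is filled per rule, which mirrors the single-slot limitation listed above.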
8
Traditional Information Extraction
• Ontology-based tools
– BYU Ontos [Embley99]
• Based on domain-specific extraction ontologies
• Robust to changes
• Multiple slot filling
• Ontologies have to be built manually
9
Hidden Web Crawling
• Traditional IE tools: publicly indexable web
pages
• Hidden web crawling
– Crawl the hidden web according to a user’s
query
– HiWE (Hidden Web Exposer) [Raghavan01]
• Source form representation → task-specific DB
concepts
• Fill out and submit forms
• Retrieve information hidden behind the form
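The form-filling steps above can be sketched as follows. This is an illustration under stated assumptions, not HiWE's actual implementation: the task-specific database, the label synonyms, and the form labels are hypothetical, and the real HTTP submission step is left out.

```python
from itertools import product

# Hypothetical task-specific database: concept -> candidate values.
TASK_DB = {
    "organism": ["Homo sapiens", "Mus musculus"],
    "gene": ["BRCA1", "TP53"],
}

# Hypothetical synonyms used to match a form label to a concept.
LABEL_SYNONYMS = {
    "organism": {"organism", "species"},
    "gene": {"gene", "gene name", "gene symbol"},
}

def match_label(label):
    """Map a form label to a task-specific concept, or None if unknown."""
    normalized = label.strip().lower().rstrip(":")
    for concept, synonyms in LABEL_SYNONYMS.items():
        if normalized in synonyms:
            return concept
    return None

def build_submissions(form_labels):
    """Return one {label: value} assignment per combination of candidate values."""
    fillable = [(label, match_label(label)) for label in form_labels]
    fillable = [(label, concept) for label, concept in fillable if concept]
    value_lists = [TASK_DB[concept] for _, concept in fillable]
    return [
        {label: value for (label, _), value in zip(fillable, values)}
        for values in product(*value_lists)
    ]

for submission in build_submissions(["Species:", "Gene Name", "Comments"]):
    print(submission)
```

Labels that match no concept ("Comments" here) are simply left unfilled; each resulting assignment would then be submitted and the response page retrieved.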
10
Biological Data Extraction
• Mainly from plain text
• Extract biological terms
– Dictionary-based
– Rule-based
• Extract relationships between biological
terms/elements
• Example systems
– BLAST-based name identifier [Krauthammer00]
– PASTA (Protein Active Site Template Acquisition)
[Gaizauskas03]
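A minimal sketch of dictionary-based term extraction, assuming a tiny hypothetical dictionary of surface forms. It uses exact, case-insensitive string matching and prefers the longest entry; it is not the BLAST-based approximate matching of [Krauthammer00].

```python
import re

# Hypothetical dictionary: surface form (lowercase) -> canonical name.
TERM_DICTIONARY = {
    "p53": "TP53",
    "tumor protein p53": "TP53",
    "brca1": "BRCA1",
}

def find_terms(text):
    """Return sorted (start, matched text, canonical name) tuples.

    Longer dictionary entries are tried first so that a short entry
    nested inside a longer match is not reported twice.
    """
    hits = []
    taken = set()  # character positions already covered by a match
    for surface in sorted(TERM_DICTIONARY, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                hits.append((m.start(), m.group(0), TERM_DICTIONARY[surface]))
    return sorted(hits)

print(find_terms("Tumor protein p53 interacts with BRCA1."))
```

A rule-based extractor would replace the dictionary lookup with patterns over word shape or context, which helps with names that are missing from the dictionary.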
11
The Semantic Web
• Machine-understandable web
• Gives information a well-defined meaning
• Allows automation of tasks
• Provides biologists with
– Intelligent information services
– Personalized web resources
– Semantically empowered search engines
12
The Semantic Web
• Semantic web languages
– XOL (XML-based Ontology Exchange Language)
– SHOE (Simple HTML Ontology Extension)
– OML (Ontology Markup Language)
– RDF(S) (Resource Description Framework (Schema))
– OIL (Ontology Interchange Language)
– DAML+OIL (DARPA Agent Markup Language + OIL)
– OWL (Web Ontology Language)
• Semantic Annotation
– Old: indexing of publications in libraries
– New: information extraction
13
Schema Matching
• Previous methods [Raghavan01]:
– Individual matchers vs. combining matchers
– Schema-based matchers vs. instance-based
matchers
– Learning-based matchers vs. rule-based
matchers
– Element-level matchers vs. structure-level
matchers
14
Schema Matching
• LSD (Learning Source Description)
[Doan01]
– Semi-automatic
– Learning-based
– Both schema-level and instance-level
– Only 1-1 mappings
• GLUE & CGLUE [DMD+03]
– Ontology alignment
– CGLUE: Complex (non-1-1) mappings
15
Schema Matching
• Cupid [Madhavan01]
– Rule-based matcher
– Both element-level and structure-level
– Schema-based
– Works on hierarchical schemas with schema tree
– Linguistic similarity & structure similarity
– Matches tree elements by weighted similarities
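The weighted combination of linguistic and structural similarity can be sketched as below. The weight, the threshold, the element names, and the input scores are hypothetical; in Cupid the two scores come from linguistic analysis of element names and from the schema tree, which this sketch does not reproduce.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Combine linguistic (lsim) and structural (ssim) similarity in [0, 1]."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be between 0 and 1")
    return w_struct * ssim + (1.0 - w_struct) * lsim

def best_matches(scores, threshold=0.5, w_struct=0.5):
    """Pick the best target per source element.

    scores maps (source_element, target_element) -> (lsim, ssim).
    Pairs whose weighted similarity falls below the threshold are dropped.
    """
    best = {}
    for (src, tgt), (lsim, ssim) in scores.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > best.get(src, (None, -1.0))[1]:
            best[src] = (tgt, wsim)
    return best

# Hypothetical precomputed scores for two candidate pairs.
scores = {
    ("PO.City", "Order.City"): (0.9, 0.7),
    ("PO.City", "Order.Street"): (0.2, 0.6),
}
print(best_matches(scores))
```

Raising `w_struct` makes the matcher trust the tree structure more than the element names, which matters when names are abbreviated or inconsistent.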
16
Schema Matching
• COMA (COmbining MAtch) [Do02]
– Combines different matchers
– Interactive with users
– Also an evaluation platform for different
matchers
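A minimal sketch of combining independent matchers, assuming two simple hypothetical matchers (name similarity and data type equality) and a pluggable aggregation function. It illustrates the general idea only and does not reproduce COMA's matcher library or its selection strategies.

```python
from difflib import SequenceMatcher

def name_matcher(a, b):
    """String similarity of the element names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def type_matcher(a, b):
    """1.0 if the declared data types agree, otherwise 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0

def average(values):
    return sum(values) / len(values)

def combine(a, b, matchers, aggregate=average):
    """Run every matcher on the pair and aggregate the scores."""
    return aggregate([matcher(a, b) for matcher in matchers])

# Hypothetical schema elements.
source = {"name": "CustName", "type": "string"}
target = {"name": "custname", "type": "string"}
print(combine(source, target, [name_matcher, type_matcher]))
print(combine(source, target, [name_matcher, type_matcher], aggregate=max))
```

Because the aggregation function is a parameter, the same setup can be used to compare how different matcher combinations behave, in the spirit of an evaluation platform.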
17
Biological Data Integration
• Challenge:
– Huge amount, growing rapidly
– Highly diverse in granularity and variety
– Different terminologies, ID systems, units
– Unstable and unpredictable
– Different interface and querying capabilities
18
Biological Data Integration
• SRS (Sequence Retrieval System) [Etzold96]
– Keyword-based retrieval system
– Returns simple aggregation of matched records
– Only works for relational databases
• BioKleisli [Davidson97]
– Integrated digital library in biomedical domain
– No global schema or ontology
– A mediator works on top of source-specific wrappers
– Horizontal integration
19
Biological Data Integration
• DiscoveryLink [Haas01]
– Mediator-based, wrapper-oriented
– Provides virtual DB access from different sources
– Cannot deal with complex source data
– Hard to add new sources
– Requires knowledge of specific query language
• TAMBIS (Transparent Access to Multiple Bioinformatics
Information Sources) [Stevens00]
– Mediator-based
– Uses global ontology and schema
– Maps source and target concepts manually
– Not robust to changes
– Hard to add new sources
20
Bioinformatics
• Biological ontology
• Bioinformatics data source discovery
• Trustworthiness and provenance
21
Bioinformatics
• Biological ontology
– GO (Gene Ontology) [Ashburner00]
• Controlled vocabulary
– Molecular Function (7278 terms)
– Biological Process (8151 terms)
– Cellular Component (1379 terms)
• Represents knowledge hierarchically
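A minimal sketch of what a hierarchical representation allows: walking from a term up to all of its more general terms. The fragment below is a simplified, hypothetical hierarchy written for illustration, not copied from GO; it assumes a term may have more than one parent.

```python
# Hypothetical, simplified term hierarchy: term -> list of parent terms.
PARENTS = {
    "mitochondrion": ["organelle", "cytoplasm"],
    "organelle": ["cellular component"],
    "cytoplasm": ["cellular component"],
    "cellular component": [],
}

def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

print(sorted(ancestors("mitochondrion")))
```

With such a lookup, an annotation made with a specific term can also be found by a query on any of its more general terms.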
22
Bioinformatics
• Biological ontology
– LinKBase [Verschelde03]
• Originally a biomedical ontology
– Over 2,000,000 medical concepts
– Over 5,300,000 instantiations
– 543 relations
• Expanded using GO
• Only describes simple binary relationships
23
Bioinformatics
• Bioinformatics data source discovery
– First step in integrating or answering queries
– Example system [Rocco03]:
• Pre-defined classes with class descriptions
• Tries to map a source to a class
• Trustworthiness and provenance
– Trustworthiness:
• Consistency
• Reliability
• Competence
• Honesty
– Provenance
• Record history
• Transformations
• Annotations
• Updates
Summary and Future Work
My research
Semantic Web
Schema Matching
Information Extraction
Bioinformatics
• Overcome drawbacks of existing systems
• Develop new algorithms to solve the problem of
locating and extracting data from heterogeneous
biological sources
28

Autumn 2003 - 2 - Brigham Young University