Intelligent Information Systems
7. Bioinformatics
based on NRC* Bioinformatics workshop on Data Integration,
Washington DC, February 2000
To be published as …..
Gio Wiederhold
April-June 2000, at 14:15 - 15:15, room INJ 211
*NRC = National Research Council,
Analysis and publication arm of the
U.S. National Academy of Sciences
EPFL7B - Gio spring 2000
Presentations in English -- but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in
databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in
processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [with D. Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Educational challenges. Expected changes in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
• to learn about ourselves,
– our origins, our place in the world
• Primates, mice, zebrafish,
fruit flies* (Drosophila), roundworms* (C. elegans),
and viruses such as HIV*, yeast*, plants
– modesty, seeing how much we share with all organisms
– not just of philosophical interest, but also
• to help humanity to lead healthy lives
– to create new scientific methods
– to create new diagnostics
– to create new therapeutics
* substantially/completely sequenced;
also a bacterium* (Haemophilus influenzae)
Information systems applied to biology and healthcare
• Biomedical statistics, …, …
• Genomics - a subset of major interest, dealing with information
related to gene-derived data
• boundary often unclear: nature versus nurture & lifestyle
(example: exposure to smoke & lung cancer)
– A person’s genomic make-up has a major effect on
susceptibility to diseases: positive and negative
– Major genomic errors prevent birth, hence
we deal with differences that are relatively minor
289 / ~10 000 genes suspected/identified
– complexity: most health effects are also combinatorial:
multiple genes, promoters, inhibitors, metabolic cross-roads
[Figure labels: scale of the information]
The human genome: ~3 200 000 000 base pairs
Genes, and gene abnormalities: ~10 000
Everybody’s genes: 6 000 000 000
Metabolic pathways
Small organic molecules - affect proteins - suitable for drugs: ~2 000 000
• Base pairs: pairs formed by 4 nucleotide bases: A, C, G, T
• adenine, cytosine, guanine, thymine (A pairs with T, C with G),
combine in double helix
• 3 bases (a codon) define 1 of 20 amino acids (fewer than 4^3 = 64 codons)
• Proteins:
– determined by certain sequences of amino acids, coded by genes
– assembled by the ribosome according to an RNA template
– coded in ~3% of the genome -- but where?
– 97% is miscellaneous: historical junk / promoters / inhibitors,
multiple genes for many proteins
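The codon arithmetic above can be checked with a short sketch. It assumes the standard genetic code, written in the conventional T, C, A, G ordering; the names CODON_TABLE and translate are illustrative and not from the slides.

```python
from itertools import product

BASES = "TCAG"  # conventional ordering used for genetic code tables
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 three-base codons to its amino acid (or stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))             # number of possible codons
    print(len(set(AMINO) - {"*"}))      # number of distinct amino acids
    print(translate("ATGGCCTGA"))       # Met, Ala, then a stop codon
```

Because 64 codons map onto 20 amino acids plus stop signals, several codons code for the same amino acid.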
Human Genome project
(NIH-NHGRI & Wellcome Trust)
1988 -- 2005, but likely to be roughly completed in 2000/2001?
Technology and strategies caused exponential rates of improvement
work at universities, related research labs, split per 24 chromosomes
collected in public databases
100 M in 1998
(well annotated, with paper publishing)
• 2 100 M by March 2000. ~12,000 base pairs per day in 1999.
automation [Perkin-Elmer Biosystems, Affymetrix…]
piece-wise (100-1 000) analysis and subsequent assembly versus walking the gene
pieces overlap, software to match
Private enterprises at various levels
not-for-profit [The Institute for Genomic Research (TIGR), dir. Craig Venter]
for profit [Celera Genomics (Venter), Incyte] sell leads to pharmaceutical companies
Early discovery pharmaceuticals [HGS Inc, Millennium Ph.]
Established pharmaceutical companies in-house [all now], support
drug development, trials on animals and humans {toxicity, then benefit}, marketing.
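The "pieces overlap, software to match" step can be illustrated with a toy sketch. This is a naive greedy merge under strong assumptions (error-free reads, no repeats, same strand); it is not the method used by any of the groups named above, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list, min_len: int = 3) -> list:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return frags  # no remaining overlaps: return the contigs found
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

Real assemblers must also cope with sequencing errors and repeated regions, which is where most of the difficulty lies.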
Heterogeneity inhibits Integration
• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms
• A feature of technological progress
– standards require stability
– yesterday’s innovations are today’s infrastructure
• Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
Integrating knowledge
Bring together biologists and computer scientists from academia, industry and
government to discuss salient issues in biological computing.
The following topics will be covered:
the generation and integration of biological databases;
interoperability of heterogeneous databases;
integrity of databases;
modeling and simulation;
data mining;
visualization of "model fit" to data.
The format of this workshop is designed to facilitate lively interaction between
speakers and audience participants.
Electronic Publication
The Signal Transduction Knowledge Environment:
Brian Ray,
American Assoc. for the Advancement of Science
STKE: virtual journal, developed jointly with HighWire Press. Uses the web for
summarizing relevant articles from other (electronic) journals.
A prototype for a future publication model: all academic papers are placed into a
pile, classified into one or more discipline categories, and aggregated and
retrieved by secondary specialists - a new role for editors, requiring scientific
competence and authority. Maintains a pathway map for attaching … . Has a
controlled vocabulary. Does caching of retrieved referenced Medline articles.
Generating and Integrating Biological Data
Methods for data collection: Virtual Cell Project
Dong-Guk Shin, Univ. Connecticut; also available
without DB support, from www.nrcam,
NIH supported: physiology modeling; NSF: computational modeling approach. Bottom-up approach to
cell modeling. Cross-checking of models and HXs: geometry from segmented images, 2-D visualization of specified reactions: channels, pumps, for extra, intra (cytosol), and core cellular
compartments. Generates equations for simulation. Result is a DB publication cycle, supporting
model copying and adaptation. For access to remote DBs will need more than a browser, but also a
query system, with join over association. DBs need APIs <<and mediation for scalability and
Data characteristics
Stephen Koslow, Office on Neuroinformatics, NIMH
Data integration Jim Garrels, Proteome, Inc.
Moderated Discussion: By Susan Davidson, Univ. Pennsylvania
Generating and Integrating Biological Data
Methods for data collection Dong-Guk Shin, Univ. Conn.
Data characteristics Need interoperation
Stephen Koslow, Office on Neuroinformatics, NIMH
The human brain has 100 billion (10^11) neural cells and 10^15 connections, and uses 15 watts. Neuroscience is
a growing field, includes neuroinformatics. Initially broad journals, then reductionist journals. Numerical,
symbolic, literature and image data. Volume of publication only for serotonin, discovered in 1948,
now 70 000 papers, is becoming impossible to follow. Dozens of cell types. Voluminous 3-D MRI
data. UCLA brain mapping. Basis for localization of diagnostic EEG, MEG observations.
Data integration
Jim Garrels, Proteome, Inc.
Moderated Discussion: By Susan Davidson, Univ. Pennsylvania
Generating and Integrating Biological Data
Methods for data collection. Dong-Guk Shin, Univ. Conn
Data characteristics Stephen Koslow, Neuroinform., NIMH
Data integration
Jim Garrels, Proteome, Inc. - free
Literature: 50 billion bytes of text covering the 5 billion bytes in GenBank.
BioKnowledge Library. Pages {title with brief functional description, family, properties (mutant phenotype),
sequence annotations, related proteins: orthologs and interologs (in different species) [Marc Vidal,
MGH], classification following [Ashburner?]} curated by experts. Integrated from cDNA
microarrays and chips, systematic 2-hybrids, … .
Model organisms: started with yeast, now worms [Stuart Kim, Stanford], pombe. Several 1000 physical
associations and interactions. Authors should not publish experimental data directly into a DB and
curate their own papers, but submit their results, publish expression studies, and update
their own results.
Need portal sites as well as content sites.
Moderated Discussion: By Susan Davidson, Univ. Pennsylvania
Matching of sequences
• Difficult because of
– errors in amino-acid sequence
– missing subsequences, extra strands
– meaningful variation: HIV reverse transcriptase (RT) & protease
is characterized by many mutations
– Loops and repeats in sequences
• Several tools: BLAST, GRAIL
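Why errors, gaps, and variation make matching hard can be seen in a small scoring sketch. This is a textbook global alignment score (Needleman-Wunsch dynamic programming), not what BLAST or GRAIL do internally; the scoring values are arbitrary assumptions.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b, computed row by row."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two characters, or insert a gap in one sequence.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_score("GATTACA", "GATTACA"))  # identical sequences
    print(alignment_score("GATTACA", "GATACA"))   # one missing base
```

A missing subsequence costs gap penalties, a substitution costs a mismatch penalty; meaningful biological variation and experimental error look the same to the score.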
2D to 3D conversion
Protein folding
• Strand of DNA, snipped off by …, assumes a tight,
3-D shape
• The shape determines the attachment points to
cells, ...
– nature does it in a few nanoseconds
– computation based on finding minimum energy
conformations would take many years
– current research tries to break up the computation by recognizing
common substructure types: alpha-helices, beta sheets, ...
Interoperability of Databases
Design features of interoperable databases
Daniel Gardner, Cornell University
Interoperability in a 4.5D space:
1. User - platforms, software, open to new data: model journal to define scope and views, but include
data - reanalyzable. Data quality is domain-dependent. Data sets presented via a virtual
2. Common data model (XML based, with capability for interdomain queries) for neuroscience.
Hierarchical with a controlled vocabulary, for selected granularity. Much metadata (physiological
site, data, reference, method and model elements) used in query terms as well. Data compatibility federated, and evolving.
3. Temporal - legacy, current, future (IBM card -- XML)
4. Technical - proprietary versus open (as PNAS papers)
4.5 Domain versus interdisciplinary. Just interfaces.
XML BDML for brains. Will be longer lived than CORBA.
<<The problem of interoperation is not the syntax of XML, but the semantics of the DTD tags. Scalability beyond
neurosciences. Federation versus articulation>>
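The remark that the difficulty lies in tag semantics rather than XML syntax can be made concrete with a sketch. Both documents below parse without trouble; relating them needs a hand-written mapping. The tag names, units, and the MAPPINGS table are invented for illustration and are not taken from BDML or any real DTD.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources describing the same observation with different vocabularies.
SOURCE_A = "<recording><site>cerebellum</site><rate unit='Hz'>40</rate></recording>"
SOURCE_B = "<trace><region>cerebellum</region><freq unit='kHz'>0.04</freq></trace>"

# Hand-written articulation: source tag -> (shared term, unit conversion factors to Hz).
MAPPINGS = {
    "A": {"site": ("brain_region", None), "rate": ("firing_rate_hz", {"Hz": 1.0})},
    "B": {"region": ("brain_region", None), "freq": ("firing_rate_hz", {"kHz": 1000.0})},
}


def normalize(xml_text: str, source: str) -> dict:
    """Rewrite one source record into the shared vocabulary; unmapped tags are skipped."""
    record = {}
    for child in ET.fromstring(xml_text):
        if child.tag not in MAPPINGS[source]:
            continue
        term, units = MAPPINGS[source][child.tag]
        if units is None:
            record[term] = child.text
        else:
            record[term] = float(child.text) * units[child.get("unit")]
    return record


if __name__ == "__main__":
    print(normalize(SOURCE_A, "A"))
    print(normalize(SOURCE_B, "B"))
```

The parser does none of the real work here; the value is in the mapping table, which has to be written and maintained by someone who understands both domains.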
Interoperability of Databases
Information retrieval and complex queries
Peter Karp, SRI Int., Bioinformatics Res. Group
Databases are supplanting journals. They are re-analyzable; results published in journals are not. Estimate
now about 500 public databases for bioinformatics. Not all even have APIs. Want seamless interoperation.
Differing models and units of measurement lead to semantic problems.
Progress in interoperation
Follow-up includes K2 at UPenn, OPM (Gene Logic), hyperlinking at SB-Glaxo.
Warehouse (SRS, 130 sources) versus multi-databases. Text (SRS) vs. structured.
150 metabolic pathways known in E. coli.
Sources lack DBMS, ontologies, formal models; irregular flat files, inconsistent semantics (example even in
GenBank entries), no web APIs.
Proposed XOL = ontology exchange language. <<CS545>>
Databases often don’t have the right fields (SwissProt: inferred versus being observed), maintenance over time.
<<Need mediating help>>
Interoperability of Databases
Definition of data elements and database structure
William Gelbart, Harvard University, Flybase
Moving from being hunter-gatherers in science to harvesters, moving to an agronomical society <<new laws>>
Phenome <--> complexome <--> genome <--> transcriptome <--> proteome.
Classical genomics is being superseded by expression and interaction of gene products and gene perturbation.
How do we organize DBs for those objectives? Things {biological objects, relationships among the objects -- with sources} ->
robust object classifiers with controlled vocabularies. <<by guilds>> Many sorting methods.
<<moving from agronomic to the medieval guilds, the predecessors of professional societies, sitting around the market
square, where the farmers deliver their source, as wholesalers and intermediaries. Well-maintained derived databases
also have added value by expertise focused on some objective.>>
FlyBase collects more, such as exons and their mutations. Transposon insertion sites.
Foundation DBs vs derived DBs -- define ownership of foundation sources. Histories must be maintained. Version tracking.
Presentation standards
Interoperability of Databases
Novel approaches to achieving interoperability
Scheduled: Jaron Lanier, National Tele-Immersion Initiative
Actual speaker: James Bower, California Institute of Technology
Historians are important, past models for the future.
In biology individuals cannot provide contributions,
Organizations can first be folkloric - can ignore data -- all non-quantified diagrams. Commonly used by
biologists. Need commitment to quantification. Aristotelian - model starts with some data, but models
are then worked out independent of all the data.
Timeline of experimental data (1904 ...) versus (lagging) structural theories for Purkinje cells (1958 ...).
In 1918 bad model used in DB [Holmes]. Later experimental data is ignored.
Database Integrity
Curation and quality control (SGD database)
Michael Cherry, Stanford Univ.
Curation is the act of establishing and maintaining a database, here SGD. Similar task to what a
journal editor does; also functions as an educator, ontologist [Yahoo]
Learn what aids the community needs, and build the museum to satisfy those needs [John Cotton
Dana, 1850]
Set limits according to what you can do and obtain.
Find missing details in literature.
GO [Michael Ashburner] for fly, mouse, and yeast (Saccharomyces): Gene Ontology for molecular function,
cellular location (abs or rel), … . Format: DAGs. Used for annotating microarrays. Included summary
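The point that GO is formatted as DAGs, not trees, matters for queries: a term can have several parents, so an annotation is inherited along every upward path. A minimal sketch follows; the term names and gene annotations are invented for illustration and are not actual GO identifiers.

```python
# Hypothetical miniature ontology: each term lists its parents (a DAG, not a tree).
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "nucleotide_binding": ["binding"],
    "kinase_activity": ["catalytic_activity"],
    "ATP_dependent_kinase": ["kinase_activity", "nucleotide_binding"],  # two parents
}


def ancestors(term: str, parents: dict = PARENTS) -> set:
    """All terms reachable from term by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen


def annotated_with(gene_terms: dict, query: str) -> list:
    """Genes annotated with the query term or with any of its descendants."""
    return sorted(
        gene for gene, terms in gene_terms.items()
        if any(t == query or query in ancestors(t) for t in terms)
    )


if __name__ == "__main__":
    genes = {
        "geneA": ["ATP_dependent_kinase"],
        "geneB": ["nucleotide_binding"],
        "geneC": ["catalytic_activity"],
    }
    print(annotated_with(genes, "binding"))
```

A query at a coarse term picks up genes annotated at any finer granularity below it, which is what makes a controlled, hierarchical vocabulary useful for annotating microarray results.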
Database Integrity
Error detection protocols
Chris Overton, Univ. of Pennsylvania
Works in genome annotation, to predict and archive landmarks.
Want links to data, to encoded proteins.
Errors come from experimental data, manual curation from the literature, computational …
Errors are propagated in computation and integration.
K2 (GAIA DB) uses GenBank, SwissProt, TRRD, GERD, TRANSFAC, MEDLINE. Some
have moderately or highly restrictive licenses.
Look for syntax errors: matching introns and exons (implied in GD, also actual coding
Spelling problems are propagated.
GenBank: majority have annotation ambiguity.
PDB does not list all binding sites found in proteins - lack of motivation [Weissig, Bioinformatics 99].
Predictions [GRAIL] get propagated.
Poor advice of changes other than to sequence.
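A syntactic check of the kind mentioned above (do exon annotations fit together?) might look like the sketch below. The coordinate convention (1-based, inclusive) and the specific rules are assumptions for illustration; they are not the K2 or GAIA checks.

```python
def check_exons(gene_start: int, gene_end: int, exons: list) -> list:
    """Return human-readable problems found in a list of (start, end) exon intervals.

    Coordinates are assumed 1-based and inclusive; input order is not assumed.
    """
    problems = []
    ordered = sorted(exons)
    for start, end in ordered:
        if start > end:
            problems.append(f"exon {start}-{end}: start after end")
        if start < gene_start or end > gene_end:
            problems.append(f"exon {start}-{end}: outside gene {gene_start}-{gene_end}")
    # Adjacent exons must leave room for an intron between them.
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} overlap")
    return problems


if __name__ == "__main__":
    print(check_exons(100, 900, [(100, 200), (300, 400), (500, 900)]))
    print(check_exons(100, 900, [(100, 250), (200, 400), (850, 950)]))
```

Checks like this only flag entries; deciding which source is wrong still needs a curator.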
Database Correctness
Methods for correcting errors
Bill Anderson (Knowledge Bus Inc., Hanover MD; all their work (Data Alive) is based on an
ontology for biochemical databases) for EML (European Media Lab) and EMBL,
Heidelberg:
Syntactic errors: formats
Semantic errors: interpretation of relations -- ontology
Pragmatic errors: true data differences (experiment, transcription)
biochemical ontology --> microanatomy --> {spatial, events}, chemistry --> {spatial,
events} (several 100 axioms as constraint rules)
Either the database or the constraining ontology is wrong.
When a fault occurs go back to pragmatics; no automatic curation.
Modeling & Simulation
Modeling and simulation
James Bower, California Institute of Technology - kids learning relationships, including … .
Web site, Purkinje Park, allows ongoing collaboration with students.
Purkinje cell (6 M in human), 100 micrometers, has 250 000 inputs, 10-12 distinct conductances modelled by Erik
De Schutter [now Belgium]. Tested with electrical probes. Found differences with published information: here the
dendrite is a current sink. Rethinking of the cerebellum: it is a sensory device, not a motor control device.
Shown by experiments on motor and sensing, and observing brain activity. Still, linking images and actual
activity of neurons in that area is hard.
Levels: cognitive - system - network - cellular - subcellular - molecular - atomic.
Corresponding simulators:
ACT, SOAR (connects 2 levels) - GENESIS (4 levels) - NEURON (2) - MCELL/VCELL (2) / RasMol/WebLab, GEPASI/GAMESS/Psl.
Analytical Approaches
Data Mining, Douglas Brutlag, Stanford University
Many types of relevant DBs: sequence, sequence variation. Now also relationship
DBs (phylogenetic, gene fusion [Eisenberg], pathways, gene expression,
protein-ligand, signal transduction).
Challenge: finding them, syntax, semantics (MeSH inadequate).
DoubleTwist [Pangea] - an agent-based specific journal - summaries and
notifications of subsequent published findings.
• Match patterns of two samples
– label amino acids with fluorescent markers
– does not require functional genomic knowledge
– PCR multiplies sample size
– Fluorescence-activated cell sorters can separate cells.
Ex.: separate embryo cells from mother’s blood
by labeling with father’s genes and matching
– Familial ties, human migrations, ...
child that died in French prison was shown to be Louis XVII by tissue
comparison with current relatives
– Ancestry of species by creating hierarchical difference trees
uses “junk portions” of genome - functions no longer needed
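The idea of a hierarchical difference tree can be sketched with a toy clustering: count the positions at which aligned sequences differ, then repeatedly join the two closest groups. The sequences and names below are invented, and average linkage over Hamming distance is just one simple choice, not a claim about the phylogenetic methods actually used.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def leaves(tree) -> list:
    """Flatten a nested-tuple tree into its leaf names."""
    return [tree] if isinstance(tree, str) else [leaf for sub in tree for leaf in leaves(sub)]


def difference_tree(seqs: dict):
    """Agglomerative clustering with average linkage over Hamming distances."""
    clusters = list(seqs)

    def dist(c1, c2) -> float:
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(hamming(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [(c1, c2)]
    return clusters[0]


if __name__ == "__main__":
    toy = {"sp1": "ACGTACGT", "sp2": "ACGTACGA", "sp3": "ACCTTCGA", "sp4": "TGCTTAGC"}
    print(difference_tree(toy))
```

Groups joined early differ least and are read as the most closely related; the nesting of the tuples is the tree.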
Clinical: Diagnosis
Diagnosis is more advanced than treatment
• Match patient tissue sample pattern to rich pattern
– VLSI technology used to place 10 000 known genes on a chip surface
– look for matches of expressed genes vs expectations in cells from
diseased tissue (skin for melanoma, …)
– can distinguish, say, cancers, that require specific treatment,
but are indistinguishable by pathologists
• Follow with
– traditional treatments, if any
– but earlier / more aggressive / more specific
– being careful
– haemophilia
– being emotionally more prepared
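The pattern-matching step of diagnosis (expressed genes in a patient sample versus expectations) can be sketched as picking the reference profile that correlates best with the sample. The expression values and subtype names are invented; real chips measure thousands of genes and need normalization and statistical care that this sketch omits.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_match(sample: list, references: dict) -> str:
    """Name of the reference expression profile most correlated with the sample."""
    return max(references, key=lambda name: pearson(sample, references[name]))


if __name__ == "__main__":
    # Hypothetical expression levels for five genes in two disease subtypes.
    references = {
        "subtype_1": [9.0, 1.0, 8.0, 2.0, 1.5],
        "subtype_2": [1.0, 9.0, 2.0, 8.0, 7.5],
    }
    patient = [8.2, 1.4, 7.1, 2.9, 2.0]
    print(best_match(patient, references))
```

Two samples that look alike to a pathologist can still separate cleanly on such profiles, which is the case the slide describes.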
Clinical Treatment
Only few choices now, take many years to develop, test
Two ways to get good genes to work
• in vivo -- problem: rejection
• put virus (can penetrate cells) with repaired gene into cells
• those cells now generate proper protein
• expect cells to replicate, and create more protein
• in vitro -- problem: getting protein to right places
• use bacteria to replicate gene
• let them manufacture needed proteins
• inject proteins
Clinical Treatments 2
Or, block bad genes,
all in vivo -- problem: knowledge, getting there
• flood area with decoy promoters
– fool the ribosome, prevent transcription from DNA to RNA
• block RNA from being a model for more DNA
– use anti-sense molecules to create wrong double helix segments
• stifle cells by synthetic antibodies (for cancers)
– block growth factor attachment for its proteins, by providing fakes
Visualization of model fit to data
John Mazziotta, Univ. of California Los Angeles
Human brain atlas
Data and Models to represent
understanding of data
• Sharing and publishing electronically at two levels
1. Sources, i.e.: data -- with provenance - incl. predictions, fixes.
recognize owners’ objectives - they may not be your objectives (PDB does not list all
binding sites found - lack of motivation)
2. Models, incorporating knowledge, with means to populate the model
3. Added value by secondary processing - shared ownership (c)
Expanding on Prof. Gelbart’s example by moving from agronomic to the medieval guilds - the predecessors of professional societies -- sitting around the market square, where the
farmers deliver their source, as wholesalers and intermediaries. Well-maintained derived
databases also have added value by expertise focused on some objective.
A focus of Knowledge generation is integration of data
The problem of interoperation is not the syntax of XML, but the semantics of the DTD
tags. Scalability beyond neurosciences. Federation versus articulation.
Yes, keep the fundamental sources, but get added value in derived data (as SwissProt):
error correction for a specific objective (U,
adding entries
Does not require federation and terminological alignment of all sources.
Rules and ontologies provide incremental help: they help much but don’t solve problems of
semantic errors
The People Problem
The demand for people in bioinformatics is high, at all levels
• Critical is a lack of
– training opportunities - programs and teachers
– available trainees
Being in a multi-disciplinary field is scary
– tenure for faculty
– load for students
– salary and growth differentials in biology and CS
Some institutions [Caltech, U Penn] are moving aggressively
– must compete with World-Wide Web visions
Privacy requires Ethics
Knowledge carries responsibilities.
also, always some error rates
How will people feel about your knowledge about them?
their genetic make-up,
physical & psychological propensities.
Privacy is hard to formalize,
but that does not mean it is not real to people.
Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.
Securing Collaboration
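[Figure: a Security Filter sits between the collaborator and the Private Patient Data.
Inbound: source query -> Security Filter -> certified query.
Outbound: unfiltered result -> Security Filter -> certified result.]
Gio Wiederhold, TIHI, Oct. 1996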
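The two-way filtering in the figure can be sketched as a small mediator that checks the query on the way in and the result on the way out. The roles, field names, patient records, and the minimum-group-size rule are all assumptions made for illustration; they are not the actual TIHI rule set.

```python
# Hypothetical policy: which fields a collaborator role may request and receive.
ALLOWED_FIELDS = {"researcher": {"age_group", "diagnosis", "genotype"}}
MIN_GROUP_SIZE = 3  # assumed rule: withhold results describing too few patients

PATIENTS = [  # stand-in for the private patient data behind the filter
    {"name": "P1", "age_group": "40-49", "diagnosis": "melanoma", "genotype": "g1"},
    {"name": "P2", "age_group": "50-59", "diagnosis": "melanoma", "genotype": "g2"},
    {"name": "P3", "age_group": "40-49", "diagnosis": "melanoma", "genotype": "g1"},
    {"name": "P4", "age_group": "20-29", "diagnosis": "haemophilia", "genotype": "g3"},
]


def certify_query(role: str, fields: list) -> list:
    """Inbound check: pass the query on only if every requested field is permitted."""
    allowed = ALLOWED_FIELDS.get(role, set())
    rejected = [f for f in fields if f not in allowed]
    if rejected:
        raise PermissionError(f"fields not permitted for {role}: {rejected}")
    return list(fields)


def run_query(fields: list, diagnosis: str) -> list:
    """The source answers the certified query without any knowledge of the policy."""
    return [{f: p[f] for f in fields} for p in PATIENTS if p["diagnosis"] == diagnosis]


def certify_result(rows: list) -> list:
    """Outbound check: release only results that cover enough patients."""
    return rows if len(rows) >= MIN_GROUP_SIZE else []


def mediated_query(role: str, fields: list, diagnosis: str) -> list:
    certified = certify_query(role, fields)
    unfiltered = run_query(certified, diagnosis)
    return certify_result(unfiltered)


if __name__ == "__main__":
    print(mediated_query("researcher", ["age_group", "genotype"], "melanoma"))
    print(mediated_query("researcher", ["age_group", "genotype"], "haemophilia"))
```

Checking results as well as queries matters because an innocuous-looking query can still return data that identifies a person.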
To sustain the trend
1. The value of the results has to keep increasing
precision, relevance not volume
2. Value is provided by experts,
encoded as models of
diverse resources, customers
Problems to be addressed
Clear models
temporal extensions

1. History - The Stanford University InfoLab