Part 3: Integrating Services
Life Science Identifiers & Information model.
Data and Metadata management – the MIR.
Domain Services – Native, Soaplab and Gowlab.
Taverna/Freefluo Workbench and Workflow Enactor.
Professor Carole Goble
University of Manchester, UK
http://www.mygrid.org.uk
GGF Summer School 24th July 2004, Italy
20,000 feet
[Architecture diagram. Components shown: Provenance and Data browser (Haystack or Portal); Taverna Workbench; Semantic Discovery & Registration View Service; Portal; LSID Authority; UDDI; mIR data; mIR metadata; Store Service; Event Notification Service; Freefluo Workflow Engine; Web services, local tools, user interaction etc.]
Information Model v2
myGrid components form a loosely coupled system
An Information Model for e-Science experiments
Based on CCLRC scientific metadata model
XML messages between services conform to the IMv2
[UML class diagram of the Information Model, divided into a domain specific part and a domain neutral part. Classes shown: Agent, Study, scmInvestigator, PeopleAndTeams.Person, StudyParticipation, StudyRole, Subject, Object, LabBookView, Resources.Resource, ProgrammeResource, Programme, Investigation, Operations.Operation, ExperimentDesign, ExperimentInstance. Attributes shown include +name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +status:String, +rule:String, +roleName:String and +getId:URIString. Associations shown: has participants, participates in, acts in, contains, selected studies, labBooks, uses, method, has instances.]
http://cvs.mygrid.org.uk/cgi-bin/viewcvs.cgi/mygrid/MIR/model/
Nick Sharman, Nedim Alpdemir, Justin Ferris, Mark Greenwood, Peter Li, Chris Wroe, The myGrid Information Model, Proc UK e-Science 2nd All Hands Meeting, Nottingham, UK 1-3 Sept 2004.
Information Model v2
Domain specific
Scientific data and the Life Science Identifier
Types, Identifier Types, Values and Documents
Molecular Biology
Bioinformatics
Resources and Ids
Domain neutral
Provenance information
Annotation and Argumentation
e-Science process, experimental methods
People, teams and organizations
Layered Semantics
• Domain Semantics layered on top of domain neutral but scientific data model
• Reducing the activation energy, lowering barriers of entry.
[Layer diagram. Labels shown: Ontologies; IMv2; Format; XSD types; MIME types; Domain Semantics; Data metadata; Workflow metadata; Experiment Semantics; Service metadata; Provenance metadata; Syntax; Workflow; OGSA-DQP]
Experimental entities
[UML class diagram of the experimental entities. Classes shown: Study, LabBookView, Subject, Object, Resources.Resource, scmInvestigator, ProgrammeResource, Programme, Annotation.SemanticConcept, Investigation, Operations.Operation, ExperimentDesign, Agent, ExperimentInstance. Attributes shown include +name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +rule:String, +status:String and +getId:URIString. Associations shown: has participants, contains, selected studies, labBooks, researchFocus, uses, initiate, method, has instances.]
View over the MIR
Life Science IDs
• Each database on the web has:
– Different policies for assigning and maintaining identifiers, dealing with versioning etc.
– Different mechanisms for retrieving an item given an ID.
• Life Science IDs are designed to harmonise the retrieval of data.
• Emerging standard for bioinformatics
– I3C, OMG Life Sciences Group, W3C
• Defines:
– URN for life science resources
– SOAP (and other) interfaces for LSID assignment, LSID resolution & resolution discovery services
T. Clark, S. Martin & T. Liefeld: Globally distributed object identification for biological knowledge bases, Briefings in Bioinformatics Vol 5 No 1 pp 59-70, March 2004
What is an LSID?
urn:lsid:AuthorityID:NamespaceID:ObjectID[:RevisionID]
urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
urn:lsid:ebi.ac.uk:SWISS-PROT.accession:P34355:3
urn:lsid:rcsb.org:PDB:1D4X:22
• LSID Designator: A mandatory preface that notes that the item being
identified is a life science-specific resource
• Authority Identifier: An Internet domain owned by the organization
that assigns an LSID to a resource
• Namespace Identifier: The name of the resource (e.g., a database)
chosen by the assigning organization
• Object Identifier: The unique name of an item (e.g., a gene name or
a publication tracking number) as defined within the context of a
given database
• Revision Identifier: An optional parameter to keep track of different
versions of the same item
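The field layout described above can be illustrated with a small parser. This is only a sketch: the `LSID` class and `parse_lsid` function are names made up for this example and are not part of any LSID toolkit, and the validation is deliberately minimal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LSID:
    """The four components that follow the 'urn:lsid' designator."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None  # the revision is optional


def parse_lsid(urn: str) -> LSID:
    """Split an LSID string into its components (hypothetical helper)."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in: {urn!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# One of the example identifiers from the slide.
example = parse_lsid("urn:lsid:ebi.ac.uk:SWISS-PROT.accession:P34355:3")
print(example)
```

An identifier without a revision, such as `urn:lsid:rcsb.org:PDB:1D4X`, parses with `revision` set to `None`.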
LSID Properties
• Unique authority for each identifier
• Multiple resolution services, supporting:
– Data retrieval – data immutable: data returned for a
given LSID must always be the same
• caches
– Metadata retrieval – mutable and resolver-specific
• annotation services. More on this in Part 4
• Resolution discovery service
– Implemented over DNS/DDNS (Optional)
• Authority commitment: must always maintain an
authority at e.g. pdb.org that can point to data and
metadata resolvers.
How is data retrieved?
[Diagram. Participants: Application; LSID client; PDB Authority @ pdb.org; PDB Data resolver; PDB Metadata resolver; PDB database]
1. Application to LSID client: Get me info for urn:lsid:pdb.org:1AFT
2. LSID client to PDB Authority @ pdb.org: Where can I get data and metadata for urn:lsid:pdb.org:1AFT?
3. LSID client to the PDB Data resolver and PDB Metadata resolver: Get me the data and metadata for urn:lsid:pdb.org:1AFT
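The retrieval sequence can be mimicked in a few lines: ask the authority which resolvers to use, then ask those resolvers for the data and the metadata. The sketch below is an in-memory stand-in only. It makes no network calls and does not use the real LSID SOAP interfaces; the registry, the resolver functions and the record text are all invented for illustration.

```python
from typing import Callable, Dict, Tuple

Resolver = Callable[[str], str]

# Stand-in for the PDB database; the record text is a placeholder.
PDB_DATABASE: Dict[str, str] = {
    "urn:lsid:pdb.org:1AFT": "placeholder structure record for 1AFT",
}


def pdb_data_resolver(lsid: str) -> str:
    """Return the (immutable) data for an LSID."""
    return PDB_DATABASE[lsid]


def pdb_metadata_resolver(lsid: str) -> str:
    """Return resolver-specific metadata for an LSID."""
    return f"placeholder metadata about {lsid}"


# Hypothetical authority registry: authority name -> (data, metadata) resolvers.
AUTHORITIES: Dict[str, Tuple[Resolver, Resolver]] = {
    "pdb.org": (pdb_data_resolver, pdb_metadata_resolver),
}


def resolve(lsid: str) -> Tuple[str, str]:
    """Look up the authority, then fetch data and metadata from its resolvers."""
    authority_name = lsid.split(":")[2]
    data_resolver, metadata_resolver = AUTHORITIES[authority_name]
    return data_resolver(lsid), metadata_resolver(lsid)


data, metadata = resolve("urn:lsid:pdb.org:1AFT")
print(data)
print(metadata)
```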
LSID Components
• IBM built client and server implementations in Perl, Java, C++
• Straightforward to wrap an existing database as a source of data or metadata
• Client simple to use
• LSID Launchpad adds LSID resolution to Internet Explorer
• LSID aware client applications, e.g. Haystack (see Part 4).
http://www-124.ibm.com/developerworks/oss/lsid/
Use within myGrid
• Needed an identifier for our own experimental resources
– workflows, experiments, new data results etc
• Anything and everything is identified with LSIDs
• LSID saves us having to invent our own conventions and code.
• Can pass references to data around and be reassured the other party will know how to resolve that reference
• Resolution services:
– Data: myGrid Information Repository (MIR)
– Metadata: myGrid Metadata Store (RDF-based)
• As a client:
– Uniform access to myGrid and external resources
• Retrieval
• Annotation (see Part 4)
LSID Assignment
[Diagram. Components shown: Enactor; Services; Store plug-in; Metadata plug-in (fed by workflow design and user context); LSID Assigning Service; mIR; Metadata Store; LSID Authority; LSID Data Resolver; LSID Metadata Resolver; Client application (sends LSIDs and metadata requests, receives data and metadata)]
1. Data sent to / received from services
2. New LSIDs assigned to data
3. Data / Metadata stored
4. Data and metadata retrieved
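The assign-then-store part of this flow can be sketched with two dictionaries standing in for the mIR and the metadata store. Everything here is an assumption made for illustration: the `LsidAssigner` class, the `example.org` authority, the `results` namespace and the counter-based object IDs are not taken from myGrid.

```python
import itertools
from typing import Dict


class LsidAssigner:
    """Hypothetical assigning service that mints sequential LSIDs."""

    def __init__(self, authority: str, namespace: str) -> None:
        self.authority = authority
        self.namespace = namespace
        self._counter = itertools.count(1)

    def assign(self) -> str:
        return f"urn:lsid:{self.authority}:{self.namespace}:{next(self._counter)}"


# In-memory stand-ins for the data store and the metadata store.
data_store: Dict[str, bytes] = {}
metadata_store: Dict[str, Dict[str, str]] = {}


def store_result(assigner: LsidAssigner, data: bytes,
                 metadata: Dict[str, str]) -> str:
    """Assign a new LSID to a result, then store data and metadata under it."""
    lsid = assigner.assign()
    data_store[lsid] = data
    metadata_store[lsid] = dict(metadata)
    return lsid


assigner = LsidAssigner("example.org", "results")
lsid = store_result(assigner, b"ATGC", {"producedBy": "demo-workflow"})
print(lsid, data_store[lsid], metadata_store[lsid])
```

A client holding only the returned LSID can then fetch the data and the metadata independently, which mirrors the separation between data and metadata resolvers.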
Part 3: Integrating Services
Life Science Identifiers & Information model.
Data and Metadata management – the MIR.
Domain Services – Native, Soaplab and Gowlab.
Taverna/Freefluo Workbench and Workflow Enactor.
Information Storage
• The MIR data store
• Stores experimental components
– Workflow specs as XML Scufl docs
– Data, XML notes
– Types: XML docs, Relational
• Every entry has Dublin Core provenance attributes
• Every entry can have (multiple) ontology expressions
• Multiple mIRs
• The (MIR) metadata store
– RDF using Jena 2.0
Metamodel for Types
• Necessary to identify the type and format of each datum of interest so that it can (only) be input to type-compatible viewers, services and workflows.
• Can’t fix this – working in an open world. There are many established, de facto and locally preferred types & formats. Defining common bio-types is a fool’s errand.
Intermediate Results
Results Management
• The Taverna/Freefluo WfEE is agnostic about the data flowing through it.
• As objects progress through a workflow they are tagged with terms from ontologies, free text descriptions and MIME types, and they may contain arbitrary collection structures.
• Using the metadata hints we can locate and launch pluggable view components.
• One WBS workflow can produce ~130 files. (Intermediate) results management and presentation is a major headache.
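Choosing a view component from metadata hints can be sketched as a registry keyed by MIME type. This is not Taverna's renderer mechanism; `VIEWERS`, `register_viewer` and `pick_viewer` are hypothetical names, and `application/x-example` is a made-up type used to show the fallback through the list of hints.

```python
from typing import Callable, Dict, List

Viewer = Callable[[object], str]

# Hypothetical registry mapping a MIME type to a view component.
VIEWERS: Dict[str, Viewer] = {}


def register_viewer(mime_type: str, viewer: Viewer) -> None:
    VIEWERS[mime_type] = viewer


def pick_viewer(mime_types: List[str]) -> Viewer:
    """Return the first registered viewer matching any of the MIME hints."""
    for mime_type in mime_types:
        if mime_type in VIEWERS:
            return VIEWERS[mime_type]
    return repr  # generic fallback when no hint matches


register_viewer("text/plain", lambda value: str(value))
register_viewer("text/xml", lambda value: f"[xml viewer] {value}")

# A result tagged with two hints; only the second has a registered viewer.
result = {"value": "MKVLAA", "mime_types": ["application/x-example", "text/plain"]}
rendered = pick_viewer(result["mime_types"])(result["value"])
print(rendered)
```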
Results Amplification
• Automated annotation workflows produce lots of heterogeneous data
• The workflows changed how the scientist works.
• Before: analyse results as you go along
• After: all results, all the analysis, in one go
• Intermediate results management and associated provenance management essential
• Domain specific visualisation
[Diagram: one input, many outputs]
Part 3: Integrating Services
Life Science Identifiers & Information model.
Data and Metadata management – the MIR.
Domain Services – Native, Soaplab and Gowlab.
Taverna/Freefluo Workbench and Workflow Enactor.
Domain Services
• Native WSDL Web services
– DDBJ, NCBI BLAST, PathPort
• BioMOBY Web services
– Single function stereotype
• Wrapped legacy services
– Stateful interaction stereotype
– One button wrapping
– SoapLab for command-line tools
– GowLab for screen scraped web pages
– http://industry.ebi.ac.uk/soaplab/
– Leveraged the EMBOSS Suite and others
– Circa 300 services
For each application: CreateJob, Run, WaitFor, GetResults, Destroy
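The stateful interaction stereotype listed for each application can be hidden behind a single call. The sketch below uses an in-memory stand-in rather than a real Soaplab endpoint; the class `FakeAnalysisService` and its snake_case methods simply mirror the five operation names on the slide and are not the actual Soaplab API.

```python
import itertools
from typing import Callable, Dict

Tool = Callable[[Dict[str, str]], Dict[str, str]]


class FakeAnalysisService:
    """In-memory stand-in for one wrapped command-line tool."""

    def __init__(self, tool: Tool) -> None:
        self._tool = tool
        self._ids = itertools.count(1)
        self._inputs: Dict[str, Dict[str, str]] = {}
        self._results: Dict[str, Dict[str, str]] = {}

    def create_job(self, inputs: Dict[str, str]) -> str:
        job_id = f"job-{next(self._ids)}"
        self._inputs[job_id] = dict(inputs)
        return job_id

    def run(self, job_id: str) -> None:
        # A real service would start the tool asynchronously; here it runs inline.
        self._results[job_id] = self._tool(self._inputs[job_id])

    def wait_for(self, job_id: str) -> None:
        # Nothing to wait for in this synchronous stand-in.
        if job_id not in self._results:
            raise RuntimeError("job was never run")

    def get_results(self, job_id: str) -> Dict[str, str]:
        return self._results[job_id]

    def destroy(self, job_id: str) -> None:
        self._inputs.pop(job_id, None)
        self._results.pop(job_id, None)


def invoke(service: FakeAnalysisService, inputs: Dict[str, str]) -> Dict[str, str]:
    """Run the whole job lifecycle as one conceptual operation."""
    job_id = service.create_job(inputs)
    try:
        service.run(job_id)
        service.wait_for(job_id)
        return dict(service.get_results(job_id))  # copy before cleanup
    finally:
        service.destroy(job_id)  # always release the server-side job


def reverse_tool(inputs: Dict[str, str]) -> Dict[str, str]:
    """Toy 'tool' that reverses its input sequence."""
    return {"reversed": inputs["sequence"][::-1]}


service = FakeAnalysisService(reverse_tool)
outputs = invoke(service, {"sequence": "ATGC"})
print(outputs)
```

Wrapping the five calls in one function is the same idea as presenting a stateful service to the workflow author as a single step.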
Domain Services
• Lots of them ~ 300
• Open world: we don’t own them
• Many produce text not numbers
• Many are unique, single site
• Need lots of genuine redundant replica services
• Unreliable and unstable
– Research level software
– Reliant on other people’s servers
• Services in the wild are rare: significant time to wrap applications as web services (licensing, installation, maintenance)
• WSDL in the wild is poor
• Firewalls
• Licensing
– Can’t be used outside of licensing body
– No license = access third-party web services
• Copyright
Domain Services in WBS
• Repeatmasker
• NCBI_BLAST
• Modified BLAST
• GenScan
• PSORTII
• iPSORT
• TargetP
• Various EMBOSS services
• InterProScan
• BLAST2
• NIX
• TESS
• TWINSCAN
• Alibaba2
• SignalScan
• Promotorscan
• SumoPlot
• SignalP
Can you guess what it is yet?
SHIM Services
[Diagram: a SHIM sitting between main bioinformatics applications/services]
• Explicitly capturing the process
• Unrecorded ‘steps’ which aren’t realised until attempting to build something
• Services that enable domain services to fit together
• “experimentally neutral”
• Libraries of SHIMs
• Possible candidates for automatic selection, composition and substitution
• Reusable
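A shim is easiest to see with a concrete mismatch. In the sketch below one service returns a FASTA record while the next accepts only a bare sequence, and a small conversion function makes them fit. Both services and the record are invented for illustration; only the FASTA convention of a `>` header line is assumed.

```python
def fasta_to_raw_sequence(fasta_text: str) -> str:
    """Shim: drop FASTA header lines and line breaks, keep the bare sequence."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def upstream_service() -> str:
    """Hypothetical service that returns a single FASTA record."""
    return ">example_seq demo record\nATGGCC\nTTAA\n"


def downstream_service(raw_sequence: str) -> int:
    """Hypothetical service that accepts only a bare sequence string."""
    if not raw_sequence.isalpha():
        raise ValueError("expected a bare sequence")
    return len(raw_sequence)


# The shim does no science of its own; it only makes the two services fit.
length = downstream_service(fasta_to_raw_sequence(upstream_service()))
print(length)
```

Because the conversion carries no experimental meaning, it can be reused in any workflow with the same mismatch, which is what makes shims candidates for libraries.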
Part 3: Integrating Services
Life Science Identifiers & Information model.
Data and Metadata management – the MIR.
Domain Services – Native, Soaplab and Gowlab.
Taverna/Freefluo Workbench and Workflow Enactor.
Workflow development and enactment
• Freefluo workflow enactment engine
– Processor & event observer plugin support
• Taverna development and execution environment
– Workbench, workflow editor, tool plug-in support
• http://taverna.sourceforge.net
• Simple conceptual unified flow language (XScufl) wraps up units of activity
– More user friendly, more abstract, more directly in user terms
• “tethered” programme: own open source development community
[Screenshot of the Taverna workbench. Callouts: tree structure explorer; graphical diagram; results in enactor invocation window; service palette, which shows a range of operations that can be used in the composition of a workflow]
Workflow environment
• Taverna API acts as an intermediate layer
between user level applications and
workflow enactors such as FreeFluo.
• Includes object models using a standard
MVC design for both workflow definitions
and data objects within a workflow
• Implicit iteration and data flow
• Data sets and nested flows
• Configurable failure handling
• Life Science ID resolution
• Plug-in framework
• Event notification
• Provenance and status reporting
• Permissive type management
• Graphical display
• Data entry wizard
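The slide names implicit iteration without describing it. One common reading, taken here as an assumption, is that a processor written for a single item is invoked once per element when it is handed a collection, with nesting preserved. The function below sketches that reading and is not Taverna code.

```python
from typing import Callable, List, Union

# A value is either a single string or an arbitrarily nested list of values.
Value = Union[str, List["Value"]]


def invoke_with_implicit_iteration(processor: Callable[[str], str],
                                   data: Value) -> Value:
    """Apply a single-item processor to every leaf of a nested collection."""
    if isinstance(data, list):
        return [invoke_with_implicit_iteration(processor, item) for item in data]
    return processor(data)


def to_upper(text: str) -> str:
    """Toy processor that expects exactly one string."""
    return text.upper()


single = invoke_with_implicit_iteration(to_upper, "acgt")
many = invoke_with_implicit_iteration(to_upper, ["acgt", ["tt", "ga"]])
print(single)
print(many)
```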
Scufl-Taverna-FreeFluo
• SCUFL - Simple Conceptual Unified
Flow Language
• Started with WSFL  … SCUFL
provides a much higher level view on
workflows, and therefore simpler and
more user-focused.
• Simple – relies upon an inherently
connected environment to reduce the
quantity of information explicitly stated in
the workflow definition.
– No port definitions in XScufl
– Processor metadata intelligently gathered
from underlying sources i.e. WSDL, Soaplab
– Allows optional typing information, can specify as little or as much as is available
Scufl
• Conceptual – one Processor in a SCUFL workflow maps as far as is possible to one conceptual operation as viewed by a non expert user
– Wrap up stateful service interactions into custom Processor implementations
– Lowers the barrier preventing experts in other domains such as bioinformatics entering or using e-Science
[Diagram: Taverna Workbench; Scufl language parser; Freefluo Workflow Enactor Core, with Processors for Web Service, Soaplab, BioMOBY, Local App and Enactor]
Scufl
• Unified Flow Language – SCUFL does not dictate how the workflow is to be enacted; it is inherently declarative in intent.
• Can potentially be translated to other workflow languages.
• Can be arbitrarily abstract; any given workflow engine may require further definition of the language before it can be enacted.
[Diagram: Taverna Workbench; Scufl language parser; Freefluo Workflow Enactor Core, with Processors for Web Service, Soaplab, BioMOBY, Local App and Enactor]
• One input, three outputs and eight processors.
• All the processors are labelled top to bottom with input ports, processor name and output ports.
• All the processors here are standard WSDL-described web services, except for “Pepstats”, which is a Soaplab processor.
• All the links are data links except for two coordination links on the right hand side.
• The links are labelled with syntactic type information: “l(text/plain)” indicates a list of plain text strings.
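The difference between data links and coordination links can be shown with a toy enactor: data links carry a value from one processor to the next, while coordination links only constrain the order of execution. This is a sketch under simplifying assumptions (one input per processor, Python 3.9+ for `graphlib`); the three processors are invented and it is not how Freefluo is implemented.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Toy processors, each taking one string and returning one string.
processors: Dict[str, Callable[[str], str]] = {
    "fetch": lambda text: text.strip(),
    "translate": lambda text: text.upper(),
    "report": lambda text: f"report({text})",
}

# Data links: (source, sink). The source's output becomes the sink's input.
data_links: List[Tuple[str, str]] = [("fetch", "translate"), ("fetch", "report")]

# Coordination links: (first, then). Ordering only; no data is passed.
coordination_links: List[Tuple[str, str]] = [("translate", "report")]


def enact(workflow_input: str) -> Tuple[List[str], Dict[str, str]]:
    """Run every processor once, in an order that respects both link kinds."""
    sorter: TopologicalSorter = TopologicalSorter(
        {name: set() for name in processors}
    )
    for before, after in data_links + coordination_links:
        sorter.add(after, before)
    order = list(sorter.static_order())

    feeds = {sink: source for source, sink in data_links}
    outputs: Dict[str, str] = {}
    for name in order:
        # Processors without an incoming data link read the workflow input.
        value = outputs[feeds[name]] if name in feeds else workflow_input
        outputs[name] = processors[name](value)
    return order, outputs


order, outputs = enact("  atg ")
print(order)
print(outputs)
```

Here `report` consumes the output of `fetch`, not `translate`; the coordination link only guarantees that `translate` has finished first.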
[Diagram of the Enactor and its surroundings. Inputs shown: workflow script; workflow ins and outs; failure policy; alternates list (from Service Discovery); metadata template. Interactions shown: invocation + data with Services; LSID and data with an external data store; LSID + data with the MIR data store; LSIDs + metadata with the MIR metadata store; events to the Event Notification Service]
Fault tolerance
• Failure of workflow engine
– P2P architecture
– XML serialisation
– Checkpointing
• Failure of services or network
– User defined retry policy
– Alternate replicas
– Alternate list
• Automatic choices for domain services undesired by users
[Screenshot callouts: retry, delay and backoff configuration; alternate processor]
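A retry policy with delay and backoff, followed by a user-supplied list of alternates, can be sketched as follows. The parameter names (`max_retries`, `initial_delay`, `backoff`) are assumptions chosen for this example rather than Taverna's configuration keys, and the services are toy functions. The alternates are passed in explicitly, in keeping with the point that users do not want automatic choices.

```python
import time
from typing import Callable, List, Optional, Sequence


def invoke_with_fault_tolerance(
    candidates: Sequence[Callable[[str], str]],
    data: str,
    max_retries: int = 2,
    initial_delay: float = 0.01,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary service with retries, then each alternate in turn."""
    last_error: Optional[Exception] = None
    for service in candidates:  # primary first, then user-chosen alternates
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return service(data)
            except Exception as error:
                last_error = error
                if attempt < max_retries:
                    sleep(delay)      # wait before retrying
                    delay *= backoff  # and wait longer next time
    raise RuntimeError("all candidate services failed") from last_error


calls = {"flaky": 0}


def flaky(data: str) -> str:
    """Toy service that fails twice, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise IOError("temporary failure")
    return data.upper()


def broken(data: str) -> str:
    """Toy service that is always down."""
    raise IOError("service down")


def replica(data: str) -> str:
    """Toy alternate that always works."""
    return data[::-1]


delays: List[float] = []
first = invoke_with_fault_tolerance([flaky], "acgt", sleep=delays.append)
second = invoke_with_fault_tolerance([broken, replica], "acgt", sleep=lambda _: None)
print(first, delays, second)
```

Passing `sleep` as a parameter keeps the example fast and lets the recorded delays show the backoff.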
Fault tolerance
[State diagram of processor execution. States shown: scheduled and waiting for data; creating alternate processor; constructing iterator; invoking; invoking with implicit iteration; iterating; adding item to result data set; waiting to retry; complete; aborted. Transition labels shown: data ready; types match; data mismatch; can iterate; instantiation error; error; timeout; retries left; alternate available; service failure; allow partials; success; done]
Status reporting
Whither BPEL?
• Focus: scripting simple request/response services vs.
choreographing business processes
• Complexity: Scufl is simple enough for bioinformaticians
to develop workflows
• Generality: Extensible processor support vs. Web
Services only
• Provenance generation
What needs to be done
• Free-standing web service
• Long-running workflows
– Computationally-intensive services
– Access to a reliable high performance BLAST service that
reflects NCBI Blast – NCBioGrid?
• Scalability
– Large documents – data staging
• Debugging environment – services / workflows are brittle.
• Interactivity
– Version 1 had user proxy as an actor
– The Original Process split into 3 steps:
• Identification of candidate overlapping nucleotide sequences
• Characterisation of nucleotide sequence
• Characterisation of any gene product in the sequence
OGSA-DQP
http://www.ogsa-dai.org.uk/dqp
• Used in Graves’ Disease
• Uses OGSA-DAI data access services to access individual data resources.
• A single query to access and join data from more than one OGSA-DAI wrapped data resource.
• Supports orchestration of computational as well as data access services.
• Interactive interface for integrating resources and executing requests.
• Implicit, pipelined and partitioned parallelism.
Publications
• T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, P. Li. Taverna: A tool for the composition and enactment of bioinformatics workflows. Accepted for Bioinformatics Journal, 16 June 2004.
• T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, C. Goble, A. Wipat, P. Li, T. Carver. Delivering Web Service Coordination Capability to Users. In Thirteenth International World Wide Web Conference (WWW2004), pp. 438-439, New York, May 2004.
• M. Addis, J. Ferris, M. Greenwood, D. Marvin, P. Li, T. Oinn and A. Wipat. Experiences with eScience workflow specification and enactment in bioinformatics. Proceedings of UK e-Science All Hands Meeting 2003, pp. 459-467.
• M.N. Alpdemir, A. Mukherjee, N.W. Paton, P. Watson, A.A.A. Fernandes, A. Gounaris and J. Smith. Service-based Distributed Querying on the Grid. In Proceedings of the First International Conference on Service Oriented Computing, 15-18 December 2003, Trento, Italy. Springer.
• J. Smith, A. Gounaris, P. Watson, N.W. Paton, A.A.A. Fernandes and Rizos Sakellariou. Distributed Query Processing on the Grid. International Journal of High Performance Computing Applications, Volume 17, Issue 04, November 2003.
• Nick Sharman, Nedim Alpdemir, Justin Ferris, Mark Greenwood, Peter Li, Chris Wroe. The myGrid Information Model. Proc UK e-Science 2nd All Hands Meeting, Nottingham, UK 1-3 Sept 2004.