Sherpa DP – a Technical Architecture for a Disaggregated Preservation Service
Mark Hedges
Arts and Humanities Data Service
King’s College London
Funded by:
© AHDS
SHERPA DP Project
Development Partners: AHDS at King’s College London (Lead),
Nottingham, Glasgow, Edinburgh, White Rose Consortium,
London Leap Consortium
Objective: To create a shared, distributed preservation
environment for the SHERPA project framed around the OAIS
Reference Model.
Notes:
Participating repositories all based on DSpace or EPrints.
Relatively simple data objects (eprints).
Digital repositories: Dealing with the digital deluge, Manchester, 5 June 2007
Distributed OAIS Model
[Figure: two linked OAIS instances, the Institutional Repository (Content Provider) and the Preservation Service (Service Provider).
Functional entities shown in both: Ingest, Archival Storage, Data Management, Administration, Access.
Also shown: Preservation Planning, Producer, Consumer.
Information packages exchanged: SIP, AIP, DIP.]
Distributed Workflow
[Figure: workflow flowchart across two lanes, Content Provider (Institutional Repository) and Service Provider (Preservation Service). Labels recoverable from the diagram:]
• Steps: Submit data & metadata; Request resubmission; Enhance metadata; Copy SIP to repository store; Migrate to dissemination format; Record details of migration action; Create technical metadata; Risk assessment; Resolve issues; Implement preservation strategy; Generate AIP; Transfer AIP to preservation store; Generate replacement DIP; Transfer DIP to storage; Make available in catalogue; Schedule obsolescence monitoring; Researcher (Consumer) accesses data
• Decision points (Yes/No): Validation successful? Is metadata complete? E-print in appropriate deposit format?
• Risk assessment outcomes: Issues identified; No problems identified; Format at risk; Format considered at risk; No obsolescence problems identified
• Transfers between the two lanes: Data transfer; Metadata transfer
System Architecture
[Figure: layered architecture diagram. Labels recoverable from the diagram:]
• Web Interface: Ingest Functions, Post-Ingest Functions, Enquiries (reached over HTTP, SOAP, REST)
• Sherpa DP Services: Ingest Services (HTTP), Post-Ingest Services (HTTP), Enquiry Services (SOAP)
• Fedora Services (SOAP): Fedora Generic Search, Fedora Core Repository Service
• External services, e.g. DROID
• Storage: Relational DB, local filesystem, Externally Referenced Content
Key preservation actions at ingest
• Integrity/fixity checks
• File format identification
• Preservation metadata creation
• Implement preservation strategy
• File format normalisation
• Others …
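The first action in the list, an integrity/fixity check, can be illustrated with a minimal sketch. It assumes SHA-256 as the digest algorithm and a digest recorded at deposit time; neither choice comes from the slides, and the function names and demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected_digest):
    """Compare a freshly computed digest with the one recorded at deposit."""
    return sha256_of(path) == expected_digest.lower()


def demo():
    """Record a digest for a small file, then simulate an unintended change."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "eprint.pdf")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"%PDF-1.4 example content")
        recorded = sha256_of(path)
        before = fixity_ok(path, recorded)
        with open(path, "ab") as handle:
            handle.write(b"x")  # one extra byte is enough to change the digest
        after = fixity_ok(path, recorded)
    return before, after
```

Calling `demo()` checks the file once before and once after the simulated change, so the two results show a passing and a failing fixity check.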
Requirements
• Scalability: need to handle increasingly
large quantities of data
• Generation and management of
extensive set of preservation metadata
• Audit trail/provenance metadata:
knowledge held in explicit machine-processable form
More Requirements
• Distributed architecture
• Integration of specialised tools
• Follow standards to allow flexible
integration of future tools
• Automate workflow where possible, but
also allow human interaction
Approach
• Web services encapsulating preservation
actions
• Web interface for points in the process
where human input required
• Linked by workflow management tool
Workflow management
• Large number of tools available
– Taverna
– BPEL (ActiveBPEL)
– jBPM
– others …
• Settled on jBPM
jBPM
• Web services and UI functions chained
together to form a workflow or “Business
Process”
• Open source, flexible, extensible workflow
management system
• Bridges the gap between users and developers
by giving them a common language
• Packaged as a J2EE application - can run on
any J2EE application server like JBoss, Tomcat,
etc.
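The idea of chaining service calls and user-interface steps into a process can be sketched in a few lines. This is not jBPM code and does not use its API: it is a plain-Python analogue in which each node carries a handler and a table of outgoing transitions. The node names, outcome labels and the rule that enhanced metadata proceeds straight to storage are assumptions loosely based on the validation and metadata-completeness decisions in the Distributed Workflow slide.

```python
class Node:
    """A named step: a handler plus a map from outcome label to next node."""

    def __init__(self, name, handler, transitions):
        self.name = name
        self.handler = handler          # callable(context) -> outcome label
        self.transitions = transitions  # outcome label -> next node name, or None to stop


def run_process(nodes, start, context):
    """Walk the graph from `start` until a transition leads to None."""
    visited = []
    current = start
    while current is not None:
        node = nodes[current]
        visited.append(node.name)
        outcome = node.handler(context)
        current = node.transitions[outcome]
    return visited


def validate(context):
    return "ok" if context.get("valid") else "failed"


def check_metadata(context):
    return "complete" if context.get("title") else "incomplete"


def enhance_metadata(context):
    # Stand-in for the point where a person would supply missing metadata.
    context["title"] = "(supplied by curator)"
    return "done"


def request_resubmission(context):
    context["resubmission_requested"] = True
    return "done"


def store_sip(context):
    context["stored"] = True
    return "done"


# Hypothetical process definition; names and edges are illustrative only.
PROCESS = {
    "validate": Node("validate", validate,
                     {"ok": "check_metadata", "failed": "request_resubmission"}),
    "check_metadata": Node("check_metadata", check_metadata,
                           {"complete": "store_sip", "incomplete": "enhance_metadata"}),
    "enhance_metadata": Node("enhance_metadata", enhance_metadata, {"done": "store_sip"}),
    "request_resubmission": Node("request_resubmission", request_resubmission, {"done": None}),
    "store_sip": Node("store_sip", store_sip, {"done": None}),
}
```

Running `run_process(PROCESS, "validate", {"valid": True, "title": "A"})` returns the list of node names visited. Keeping the graph in data and the behaviour in handlers is what allows a step to be swapped for a web-service call or a human task without touching the rest of the process.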
Preservation Metadata
• Approach based on PREMIS data
dictionary
• PREMIS data model based on five
categories: intellectual entities, objects,
agents, events, rights
• Implementing a subset of this model
• … with some format-specific extensions
(e.g. MIX for images)
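As an illustration of the "events" category, the sketch below builds a single event record. Element names follow the semantic unit names in the PREMIS data dictionary, but the output is simplified: it carries no XML namespace, is not validated against the PREMIS schema, and the identifier type `local` and all example values are assumptions. It does not show which subset of PREMIS the project implemented.

```python
import xml.etree.ElementTree as ET


def premis_event(event_id, event_type, date_time, outcome, object_id):
    """Build a simplified, un-namespaced PREMIS-style event element."""
    event = ET.Element("event")

    identifier = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(identifier, "eventIdentifierType").text = "local"  # assumed type
    ET.SubElement(identifier, "eventIdentifierValue").text = event_id

    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = date_time

    outcome_info = ET.SubElement(event, "eventOutcomeInformation")
    ET.SubElement(outcome_info, "eventOutcome").text = outcome

    # Link the event to the object it was performed on.
    linked = ET.SubElement(event, "linkingObjectIdentifier")
    ET.SubElement(linked, "linkingObjectIdentifierType").text = "local"
    ET.SubElement(linked, "linkingObjectIdentifierValue").text = object_id
    return event


def to_xml(element):
    """Serialise an element to a Unicode string."""
    return ET.tostring(element, encoding="unicode")
```

For example, `to_xml(premis_event("evt-1", "fixity check", "2007-06-05T10:00:00", "success", "obj-1"))` yields one record that could be appended to an object's audit trail.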
Available Tools
• Stand-alone specialised tools that perform
preservation-related tasks
• File format identification, e.g. DROID-PRONOM
– Developed by The National Archives
– Identification of file formats based on their file
signatures
• Technical metadata generation, e.g. JHOVE
– Extensible framework for format validation
– Perform format-specific identification, validation,
and characterization of a digital object
• File format migration tools (e.g. Xena, OpenOffice.org)
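Identification "based on file signatures" can be sketched as a lookup of leading bytes against a table of known patterns. This is not DROID and does not use PRONOM data; the table below is a small hand-written assumption covering a few well-known signatures, and real signature registries also describe offsets, end-of-file sequences and multiple patterns per format.

```python
# Illustrative signature table (leading bytes -> format label).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data):
    """Return a format label for the given bytes, or None if nothing matches."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def identify_file(path, header_size=16):
    """Identify a file from its first few bytes only."""
    with open(path, "rb") as handle:
        return identify(handle.read(header_size))
```

Identification by content rather than by file extension matters for a preservation service because deposited files may be misnamed; an unrecognised format (a `None` result here) is itself useful input to risk assessment.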
Available tools and workflow
• Tools written in different languages
• Define generic interfaces for preservation
actions
• Wrap the tools used as web services to
promote:
– Interoperability
– Loose coupling, flexibility
– Reusability
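A generic interface for preservation actions might look like the sketch below. The class and method names are hypothetical, and the adapters wrap ordinary Python callables; in the architecture described here each adapter would instead sit behind a web-service endpoint and delegate to an external tool.

```python
from abc import ABC, abstractmethod


class PreservationAction(ABC):
    """Common contract: take the object's bytes, return a result record."""

    name = "unnamed"

    @abstractmethod
    def run(self, data):
        """Perform the action and return a dict describing the outcome."""


class CallableAction(PreservationAction):
    """Adapter that exposes any callable through the common contract."""

    def __init__(self, name, func):
        self.name = name
        self._func = func

    def run(self, data):
        return {"action": self.name, "result": self._func(data)}


def run_all(actions, data):
    """Apply each action in turn; callers never see tool-specific details."""
    return [action.run(data) for action in actions]


# Two stand-in "tools" with different native behaviour, unified by the adapter.
ACTIONS = [
    CallableAction("size", len),
    CallableAction("looks_like_pdf", lambda data: data.startswith(b"%PDF-")),
]
```

Because `run_all` depends only on the `PreservationAction` contract, a tool can be replaced or added without changing the workflow that calls it, which is the loose coupling the slide refers to.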
Workflow in jBPM
jBPM (jPDL)
Node and ActionHandler
Workflow Inputs & Outputs
[Figure: a Submission Information Package (SIP) enters the workflow, which outputs an Archival Information Package (AIP) and Dissemination Information Packages (DIPs).]
Workflow Outputs
• Multiple METS packages (atomic model),
each containing (some of):
– Data
– Descriptive metadata
– PREMIS object metadata (technical)
– PREMIS event metadata
– PREMIS relationship metadata
– Format-specific technical metadata (e.g. MIX)
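The overall shape of one such package can be sketched as a METS skeleton. Element and attribute names are standard METS, but everything else is an assumption: the section IDs, the use of `DC` for descriptive metadata, the placement of PREMIS object metadata in `techMD` and event metadata in `digiprovMD`, and the SHA-256 checksum type. The metadata wrappers are left empty, so the output shows structure only and is not a complete, schema-valid package.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


def build_package(file_id, href, checksum):
    """Return a skeletal METS document describing a single file."""
    mets = ET.Element(q("mets"))

    # Descriptive metadata (wrapper left empty in this sketch).
    dmd = ET.SubElement(mets, q("dmdSec"), ID="DMD1")
    ET.SubElement(dmd, q("mdWrap"), MDTYPE="DC")

    # Administrative metadata: technical (object) and provenance (event).
    amd = ET.SubElement(mets, q("amdSec"), ID="AMD1")
    tech = ET.SubElement(amd, q("techMD"), ID="TECH1")
    ET.SubElement(tech, q("mdWrap"), MDTYPE="PREMIS")
    prov = ET.SubElement(amd, q("digiprovMD"), ID="PROV1")
    ET.SubElement(prov, q("mdWrap"), MDTYPE="PREMIS")

    # The data file itself, referenced by location.
    file_sec = ET.SubElement(mets, q("fileSec"))
    group = ET.SubElement(file_sec, q("fileGrp"))
    entry = ET.SubElement(group, q("file"), ID=file_id, ADMID="TECH1 PROV1",
                          CHECKSUM=checksum, CHECKSUMTYPE="SHA-256")
    ET.SubElement(entry, q("FLocat"),
                  {"LOCTYPE": "URL", "{%s}href" % XLINK_NS: href})

    # Structural map tying the file to its descriptive metadata.
    struct = ET.SubElement(mets, q("structMap"))
    div = ET.SubElement(struct, q("div"), DMDID="DMD1")
    ET.SubElement(div, q("fptr"), FILEID=file_id)
    return mets
```

A call such as `build_package("FILE1", "file:///eprint.pdf", "abc123")` returns an element tree that can be serialised with `ET.tostring`; the argument values here are placeholders.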
Fedora object model
[Figure: graph of related Fedora objects.
Objects shown: e-print; original manifestation; normalised manifestation; original file 1; original file 2; migrated file 1; updated version of e-print.
Relationships shown: hasManifestation, hasPart, hasVersion, isDerivedFrom.]
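One way to reason about such an object model is as a set of subject/predicate/object triples. The object and relationship names below are taken from the diagram, but the specific edges are an assumption, since the extracted text does not preserve which objects each arrow connects. The helper functions are hypothetical and do not use the Fedora API.

```python
# Assumed edges; only the node and relationship names come from the slide.
TRIPLES = [
    ("e-print", "hasManifestation", "original manifestation"),
    ("e-print", "hasManifestation", "normalised manifestation"),
    ("original manifestation", "hasPart", "original file 1"),
    ("original manifestation", "hasPart", "original file 2"),
    ("normalised manifestation", "hasPart", "migrated file 1"),
    ("migrated file 1", "isDerivedFrom", "original file 1"),
    ("e-print", "hasVersion", "updated version of e-print"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked from `subject` by `predicate`, in order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def derivation_chain(triples, node):
    """Follow isDerivedFrom links back towards the original file."""
    chain = [node]
    while True:
        sources = objects_of(triples, chain[-1], "isDerivedFrom")
        if not sources or sources[0] in chain:  # stop at the origin or on a cycle
            return chain
        chain.append(sources[0])
```

With relationships held explicitly like this, a query such as `derivation_chain(TRIPLES, "migrated file 1")` traces a migrated file back to its source, which is the kind of provenance question the model is meant to answer.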
Issues with automation
• Preserving content – what do we actually want
to preserve?
• Significant properties – soft concept, hard to
quantify (INSPECT)
• Lack of suitable tools – expensive, outputs
unreliable
Next Steps
• SHERPA DP 2 (2007-2008), looking at:
– Additional repository types
– More complex object types
– Different methods of data transfer
• Generalise system
• Add post-ingest preservation actions
• Add semantics for dynamic service discovery
• Resource discovery metadata generation
Questions
Contact: [email protected]