Preserving Scientific Data
Jamie Shiers, Information Technology Department, CERN, Geneva, Switzerland
Agenda
• Motivation for preserving scientific data – examples
from a range of sciences
• Volume of data involved and related issues
• Some concrete archiving examples from Particle Physics
• Remaining challenges
• Conclusions
UNESCO Information Preservation debate, April 2007 - [email protected]
Motivation
• Climate data: in an era when climate change is hotly debated, the
motivations appear clear…
• Medical data: important for understanding issues such as historical
pandemics and cross-species diseases: avian flu, HIV, …
• Cosmological data: plays a vital role in our evolving understanding
of the Universe – astrophysics community has an explicit policy
(data is made public after 1 year – data volume doubles each year)
• Particle Physics data: similar arguments – will we ever be able to
build accelerators comparable to those of today? If we ‘lose’ this
data, what of our scientific heritage? We may also need to re-examine
old data for a signal that should have been seen (this has happened
several times)
Standard Cosmology
[Figure: timeline of the Universe (energy, density and temperature versus time), from http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html. Standard cosmology provides a good model from 0.01 sec after the Big Bang, supported by considerable observational evidence. Elementary particle physics takes us from the Standard Model into the unknown, towards energies of 1 TeV and beyond: the Terascale. Towards quantum gravity: from the unknown into the unknown...]
Issues
• How much data is involved?
• Preserving the bits
• Understanding the bits
How much data is involved?
• In 1998, the following estimates were made regarding the data from
LEP (1989 – 2000) that should be kept:

  Experiment   Analysis dataset   Reconstructable dataset
  ALEPH        250GB              1-2TB
  DELPHI                          2-6TB
  L3           500GB              5TB
  OPAL         300GB              1-2TB

[Figure: data volumes shown as stack heights: a CD stack holding one year of LHC data would be ~20 km tall, compared with a balloon at 30 km, Concorde at 15 km and Mont Blanc at 4.8 km]

 By today’s standards, these data volumes are trivial
• Even though the total volume of data at the LHC is much, much
higher, the data that must be kept beyond the life of the machine
(2007 to ~2020) will be easily handled by then
 The LHC will generate some 15PB of data per year!
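The “CD stack” comparison can be sanity-checked with a little arithmetic. In this sketch the per-disc capacity and thickness are assumptions, not figures from the talk:

```python
# Back-of-the-envelope check of the "CD stack" height for one year
# of LHC data (~15 PB/year, as quoted on this slide).
# Assumed figures: ~700 MB per CD, ~1.2 mm per disc in a stack.
PB = 1e15  # bytes

lhc_data_per_year = 15 * PB   # quoted LHC data rate
cd_capacity = 700e6           # bytes per CD (assumption)
cd_thickness_mm = 1.2         # mm per disc in a stack (assumption)

num_cds = lhc_data_per_year / cd_capacity
stack_height_km = num_cds * cd_thickness_mm / 1e6  # mm -> km

print(f"{num_cds:.2e} CDs, stack ~{stack_height_km:.0f} km high")
```

With these assumptions the stack comes out at roughly 26 km, the same order of magnitude as the ~20 km on the slide (which presumably used slightly different disc figures).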
The LHC machine - Overview
[Figure: aerial view of the LHC, a 27 km ring 100 m underground, with the four experiments marked: ATLAS (general purpose, pp), CMS + TOTEM (general purpose, pp, heavy ions), ALICE (heavy ions, pp) and LHCb (pp, B-physics, CP violation)]
The size of HEP detectors
[Figure: photographs conveying the scale of the ATLAS and CMS detectors, with CERN Building 40 for comparison]
Understanding the bits
• In the mid-1990s, a successful re-analysis of 10-year-old data from
the JADE collaboration at the PETRA accelerator at DESY was carried out
• A sub-set of the data was found abandoned in an office corner. The
programs to read the data were written in an obsolete language and
were unusable. The data format was proprietary (but de-codable).
 This provided valuable input into the LEP data archive
• Data format: will this be readable in 5 / 10 / 100 years? 1000?
• Programs: languages / operating systems / hardware platforms have
very short life-spans with respect to an archive
• Metadata: essential to understand what the data means
 The best solution to date is a so-called ‘Museum system’, but
this is still a very short-term solution with respect to even
Einstein, let alone Tycho Brahe, Kepler and Newton…
Preserving the bits
• Lifetimes of Particle Physics experiments are extremely
long! Currently measured in decades…
• Ironically, one of the solutions proposed for the LEP data
archive (the then-current proposal for the LHC) was later
abandoned (technical / commercial reasons)
• This necessitated a ‘triple migration’:
 Of 300TB of data between storage media;
 Of the same data from one data format to another;
 Of the accompanying processing codes.
• In the end, the exercise took around 2 months per 100TB
of data migrated, as well as a significant amount of effort
(~1 FTE / 100TB) and hardware resources
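As a minimal sketch, the quoted rates imply the following for the 300TB migration (treating “2 months per 100TB” and “1 FTE per 100TB” as linearly scalable, which is an assumption):

```python
# Scale the quoted per-100TB migration rates up to the full 300 TB.
data_tb = 300            # total volume migrated (TB)
months_per_100tb = 2     # elapsed time per 100 TB (quoted rate)
fte_per_100tb = 1        # staff effort per 100 TB (quoted rate)

duration_months = data_tb / 100 * months_per_100tb   # elapsed time
effort_fte = data_tb / 100 * fte_per_100tb           # effort, in FTEs

print(f"~{duration_months:.0f} months, ~{effort_fte:.0f} FTE")  # → ~6 months, ~3 FTE
```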
Outstanding Issues
• There are no data formats, programming languages,
computing hardware or operating systems with lifetimes
that can be guaranteed beyond the short term
• Virtual machine technology may extend an environment’s
(see above) natural life – perhaps doubling it
• Reducing the data into a much simplified and widely-used
format can have significant advantages, but only allows
restricted analyses to be performed
• Preserving the detailed knowledge of the experimental
apparatus is beyond current technology – it would require
extreme discipline on the part of the researchers, as well as
major advances in the understanding and description of
metadata
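The “reduce the data into a much simplified and widely-used format” option above can be illustrated with a toy sketch. The binary layout and field names here are invented for illustration; they are not the actual LEP or LHC formats:

```python
# Hypothetical sketch: reducing a (made-up) fixed-size binary event
# record to CSV, a widely-used format that survives tool churn at the
# cost of detail, so only restricted analyses remain possible.
import csv
import io
import struct

# Assumed toy binary layout: run number, event number, energy (float32)
RECORD = struct.Struct("<IIf")

def reduce_to_csv(raw: bytes) -> str:
    """Unpack fixed-size binary records and emit a simple CSV table."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["run", "event", "energy_gev"])
    for offset in range(0, len(raw), RECORD.size):
        run, event, energy = RECORD.unpack_from(raw, offset)
        writer.writerow([run, event, round(energy, 3)])
    return out.getvalue()

# Example: two toy records packed into a byte blob
blob = RECORD.pack(1, 42, 91.2) + RECORD.pack(1, 43, 45.6)
print(reduce_to_csv(blob))
```

The CSV output is self-describing enough to be read decades later with generic tools, which is exactly the trade-off the bullet describes: longevity in exchange for a restricted view of the data.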
Conclusions
• As long as advances in storage capacity continue, there
are no significant issues related to the volume of
scientific data that must be kept
• Periodic migration between different types of storage
media must be foreseen
• Specific storage formats must also be catered for – this
can require much more significant (time consuming and
expensive) migrations
 By far the biggest problem concerns understanding the
data – there is currently no clear solution in this domain
References
• LEP Data archive
• 1997: http://s.web.cern.ch/s/sticklan/www/archive/
• 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
• 2003: http://cern.ch/pfeiffer/LEP-DataArchive/proposal/ProposalForTheLEPDataArchive.html
• http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf
• Lisbon workshop
• http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
• http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf
• COMPASS / HARP data migrations
• http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
• http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
• http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0
Acknowledgements
The following people provided material and / or pointers for
this talk (knowingly or otherwise):
• LEP Data Archive coordinators:
• David Stickland, [email protected] (L3)
• Andreas Pfeiffer, [email protected]
• Marcello Maggi, [email protected] (ALEPH)
• COMPASS / HARP migrations:
• Andrea Valassi, [email protected]
• ERPANET/CODATA Workshop
• Jürgen Knobloch, [email protected]
The End