Project sponsors
Earth System Grid - DOE/SciDAC
Coupled Energetics and Dynamics of Atmospheric Regions (CEDAR) - NSF/GEO/ATM
Virtual Solar-Terrestrial Observatory - NSF/CISE/SCI
Related DODS/OPeNDAP work - NASA and NCAR/HAO
January 4, 2005
Fox
Overview
• Report on experience with data ‘systems’ and data ‘frameworks’
• CEDARWEB
• Earth System Grid
• Compare and contrast success in terms of use(rs)
• Technology integration - when and how does it work and scale?
• Outline a merged approach for the Virtual Observatory concept
CEDARWEB
CEDARWEB: heritage
• CEDAR is a large scientific and technical community focusing on the Earth’s middle and upper atmosphere. The program features ground-based observing networks, models and integrative studies. Funded by NSF, now in its third phase (third decade)
• CEDAR data history
  • Started in 1983 as an incoherent scatter radar database held as a tape archive (data back to 1966)
  • Grew by the late 80s, adding other instruments, models, indices
  • Went on-line in the early 90s (became a single-tiered data system)
  • Web access in 1996, three versions of the interface
• Holdings - some satellite data, geophysical indices, models (GCM, empirical, tides, etc.), ISRs, HF Radars, Digisondes, FPIs, IR Michelson Interferometers, Spectrometers, Airglow Imagers, All-Sky Cameras, LIDARs, Multi-Channel Photometers, MST Radars, MF Radars, LF Radars, Meteor Wind Radars, Campaigns, Presentations, Surveys, Jobs, Workshops, etc.
• Community of 600+, 300+ registered users, ~100 active data users per year
• NCAR tasked with community support and, especially in the early days, to ‘take care’ of the data and work with data providers and users
• Significant effort in catalogs, metadata, controlled vocabulary
• System has labored to get past the code/mnemonic schemes of the past and the base data format
CEDAR pre-web
Data query, selection and retrieval interface, without any integrated tools or ability to preview data before retrieving it.
CEDARWEB 2.0
CEDARWEB 2.0
CEDARWEB 3.x
Data query, selection and retrieval interface, with integrated tools, e.g. ability to plot (preview) data before retrieving it.
CEDARWEB - OPeNDAP
CEDARWEB - OPeNDAP
CEDARWEB 3.1
Ability to quickly plot data to assess suitability and quality, and to produce a quick copy with some customization for a preliminary study.
Experience: CEDARWEB
Don’t just provide data, but also build in community information and ancillary information that is of value.
Inside CEDARWEB
• Rich metadata; categorized
• OPeNDAP for data access and transport
• MySQL for catalog and user records
• HTTPS and cookies for session authentication
• Script-enabled interface with plotting built in (ION) delivers HTML to browsers
• ‘Hides’ organizational data record structure (sort of)
• Low-level data product, but also high-level
• Disconnect between delivery of data and attributes
• Today: framework is inside the data system!
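The OPeNDAP access path listed above can be illustrated with a small request-URL builder. This is only a sketch, assuming a DAP2-style server where a response suffix (`.dds`, `.das`, `.dods`) is appended to the dataset URL and a constraint expression selects array hyperslabs. The host, dataset path and variable name are hypothetical and are not taken from CEDARWEB.

```python
from urllib.parse import quote


def dap_url(base, response="dods", projections=None):
    """Build a DAP2-style request URL: <dataset>.<response>?<constraint>."""
    url = f"{base}.{response}"
    if projections:
        parts = []
        for name, slices in projections.items():
            # Each slice is (start, stride, stop), written [start:stride:stop].
            hyperslab = "".join(f"[{a}:{s}:{b}]" for a, s, b in slices)
            parts.append(name + hyperslab)
        # Keep the constraint punctuation readable; escape anything else.
        url += "?" + quote(",".join(parts), safe=",[]:")
    return url


# Hypothetical dataset URL and variable name, for illustration only.
BASE = "http://example.org/opendap/data/example"
print(dap_url(BASE, "dds"))                             # structure request
print(dap_url(BASE, "dods", {"density": [(0, 1, 9)]}))  # subset request
```

Requesting only the first ten samples of one variable, rather than the whole file, is the kind of server-side subsetting that keeps data movement small.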
Experience: CEDARWEB
CEDARWEB has been developed and improved over more than 10 years of interaction with users, data providers, and a community steering committee. Each of these elements has directly contributed to changes in what services are provided, what information and materials are made available via the web site, and what levels of authorization and authentication are required.
Biggest lesson: the systems approach has worked because of the heritage of the data collection, but users (esp. new or very experienced) see a barrier to entry and don’t understand where the system starts and stops.
http://cedarweb.hao.ucar.edu
Earth System Grid Overview
• The goal of ESG is to make climate data, particularly climate model data, an easily accessible community resource. The project is funded by the SciDAC program: Scientific Discovery through Advanced Computing.
• Enabling researchers to understand and make effective use of very large, distributed climate datasets is critical. The broad strategy is to develop a collection of server-side capabilities that minimize the amount of data movement.
• Multiple interfaces to ESG will allow researchers to focus on science rather than issues of data transfer, format, and data set manipulation.
• Foundation is Globus Grid technology
ESG: U.S. Collaborations & Development
ANL: Computational grids & grid-based applications
LBNL: Climate storage facility
LLNL: Model diagnostics & inter-comparison
USC/ISI: Computational grids & grid-based applications
NCAR: Climate change prediction and scenarios
LANL: Next generation coupled models & computing
ORNL: Climate storage & computational resources
ESG leverages existing software and projects
DODS/OPeNDAP: Distributed Oceanographic Data System (Unidata)
Integrations of Globus GridFTP, DODS data access
THREDDS: THematic Real-time Environmental Distributed Data Services (Unidata)
LAS: Live Access Server (NOAA Pacific Marine Environmental Laboratory)
Works with CDAT, Ferret, GrADS, …
CDAT: Climate Data Analysis Tools (PCMDI), includes CDMS: Climate Data Management System, VCDAT visualization
Community Data Portal project (NCAR)
NCL (NCAR)
Globus Grid technology (ANL, ISI): GridFTP, CAS (Community Authorization Service)
ESG: Requirements & Priority Matrix
ESG Services:
Framework
Automatic Installation
Distributed Computing
Authorization & Authentication
Registration
Event Services
Task Management
Logging Services
Data Systems
Search and Discovery
data movement (transport)
meta-data framework
collaboratories
Tools
analysis
visualization
collaboration
ESG Developer
ESG Administrator
ESG User
H
L
H
L
H
H
H
H
L
L
L
H
H
L
L
H
M
L
M
L
H
M
L
H
M
H
H
H
L
H
H
M
H
M
L
M
M
L
M
H
H
H
L = LOW, M = MEDIUM, H = HIGH
ESG: ESG-II Architecture
The Earth System Grid
[Deployment diagram; labels only]
Sites: LBNL, ANL, NCAR, NERSC, LLNL, ORNL, ISI
Service categories: DATA storage, SECURITY services, METADATA services, TRANSPORT services, ANALYSIS & VIZ services, MONITORING services, FRAMEWORK services
Components: gridFTP server/client, HRM, DISK, HPSS, MSS, MySQL, Xindice, RLS, MCS, OGSA-DAIS, THREDDS catalogs, Auth metadata, GSI, CAS client, CAS server, MyProxy client, MyProxy server, GRAM, TOMCAT, AXIS, SLAMON daemon, LAS server, openDAPg server, NCL openDAPg client, CDAT openDAPg client
Earth System Grid Portal
Community Data Portal
Free text search
Authentication
Applications
Live Access
News
THREDDS catalog
Community Data Portal
LAS/CDAT: Example of a Web-based Data Portal
• Technology: Web-based (end user requirements)
• LAS, DODS, ESG (i.e., Globus), CDAT
• Portal should hide/simplify the Grid for users
• Single sign-on
• Community-based authorization
• Simplified resource location
• Remote job submission, management
• Accesses the ESG Grid Testbed
ESG: Example of a Web-based Data Portal (serving 40+ simulations: AMIP, CMIP, and PCM)
ESG: Example of a Client Application
Metadata-centric view of ESG services
[Diagram; labels only]
Hub: METADATA SERVICES
Services: DATA TRANSPORT; USER AUTHENTICATION AND AUTHORIZATION; DATA ANALYSIS & VISUALIZATION; DATA BROWSING; SYSTEM MONITORING AND CONTROL; DATA SEARCH & DISCOVERY
Metadata types: LOCATION METADATA; ACCESS AND AUTHORIZATION METADATA; AGGREGATION METADATA; CATALOGUING METADATA; CONTENT METADATA; LOGGING METADATA; ANNOTATION & HISTORY METADATA
ESG Metadata Services Architecture
3-layer architecture:
• Metadata Holdings: physical metadata content, stored in a system of relational and/or XML native databases
• Core Metadata Services: modules and libraries that mediate all access to the Metadata Holdings (insert, update, delete, query) and expose an API that hides the specific implementation of the databases and query languages
• High Level Metadata Services: system of applications that use the Core Metadata Services to fulfill a specific atomic functionality; these will be invoked by external clients
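The three layers can be sketched in a few lines. This is an illustrative sketch only, not ESG code: the class and function names are hypothetical, and an in-memory dictionary stands in for the relational or XML-native holdings.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Core-service API: callers never see the underlying database."""

    @abstractmethod
    def insert(self, key, record): ...

    @abstractmethod
    def update(self, key, fields): ...

    @abstractmethod
    def delete(self, key): ...

    @abstractmethod
    def query(self, **criteria): ...


class InMemoryStore(MetadataStore):
    """Stand-in holding; a real one would wrap an SQL or XML database."""

    def __init__(self):
        self._records = {}

    def insert(self, key, record):
        self._records[key] = dict(record)

    def update(self, key, fields):
        self._records[key].update(fields)

    def delete(self, key):
        self._records.pop(key, None)

    def query(self, **criteria):
        # Return the keys of records matching every field=value criterion.
        return sorted(k for k, r in self._records.items()
                      if all(r.get(f) == v for f, v in criteria.items()))


def search_by_project(store, project):
    """High-level service: one atomic function built only on the core API."""
    return store.query(project=project)


store = InMemoryStore()
store.insert("run-001", {"project": "PCM", "variable": "tas"})
store.insert("run-002", {"project": "AMIP", "variable": "tas"})
print(search_by_project(store, "PCM"))  # ['run-001']
```

Because `search_by_project` depends only on `MetadataStore`, the holding could be swapped for a different database without changing the high-level service.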
[Layer diagram, top to bottom; labels only]
ESG CLIENTS API & USER INTERFACES: PUBLISHING; SEARCH & DISCOVERY; ANALYSIS & VISUALIZATION; ADMINISTRATION; BROWSING & DISPLAY
HIGH LEVEL METADATA SERVICES: METADATA EXTRACTION; METADATA ANNOTATION; METADATA & DATA REGISTRATION; METADATA BROWSING; METADATA AGGREGATION; METADATA VALIDATION; METADATA CONVERSION; METADATA DISPLAY; METADATA SEARCH, QUERY & DISCOVERY
CORE METADATA SERVICES: METADATA ACCESS (update, insert, delete, query); SERVICE TRANSLATION LIBRARY
METADATA HOLDINGS: Replica Location Services; Metadata Cataloguing Services; THREDDS catalogs; XML DB
ESG Metadata Services Goal Functionality
• Services responsible for the creation, management and utilization of metadata associated with geophysical data
• Functionality:
  • Metadata extraction (automatically, from files in different formats and according to various possible metadata standards)
  • Metadata conversion (from one standard to another)
  • Metadata aggregation (associated with data collections)
  • Metadata annotation (manually by humans)
  • Metadata validation (basic quality control of metadata)
  • Registration (population of metadata holdings)
  • Harvesting (combination of metadata from different repositories)
  • Metadata browsing and display (for humans)
  • Search and discovery of data through metadata
  • Metadata query (by agents or clients for data analysis and visualization)
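Two of the functions listed above, conversion and validation, can be sketched as follows. This is a minimal illustration under assumed conventions: the field names, the mapping and the list of required fields are hypothetical and do not come from any ESG schema.

```python
# Hypothetical "common" convention: fields every record should carry.
REQUIRED = ("title", "time_start", "time_end")

# Hypothetical mapping from one provider's field names to the common ones.
FIELD_MAP = {"name": "title", "begin": "time_start", "end": "time_end"}


def convert(record, field_map=FIELD_MAP):
    """Metadata conversion: rename known fields, pass unknown ones through."""
    return {field_map.get(key, key): value for key, value in record.items()}


def validate(record, required=REQUIRED):
    """Basic quality control: return the required fields that are missing or empty."""
    return [field for field in required if not record.get(field)]


provider_record = {"name": "PCM run", "begin": "1990-01", "end": ""}
common_record = convert(provider_record)
print(common_record)
print(validate(common_record))  # ['time_end']
```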
ESG Metadata Services Current Development
Currently have in production the following technologies:
• Replica Location Services: database to manage and index multiple copies of the same data stored at different centers
• Metadata Cataloguing Services: relational database to store scientific metadata (developed for high energy physics and geophysical data)
• XML native (**) and SQL databases
• THREDDS (by Unidata): system for hierarchical cataloguing of datasets and associated metadata (http://www.unidata.ucar.edu/projects/THREDDS)
• NcML (NetCDF Markup Language): XML language for encoding of metadata associated with data in netCDF format (and more…)
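The idea behind a replica location service, mapping one logical name to the physical copies held at different centers, can be sketched as below. This is a toy in-memory illustration, not the Globus RLS API; the class, host names and file paths are hypothetical.

```python
from collections import defaultdict


class ReplicaIndex:
    """Toy replica index: logical file name -> set of physical URLs."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_url):
        self._replicas[logical_name].add(physical_url)

    def locate(self, logical_name, prefer_host=None):
        """Return all known copies, optionally listing a preferred host first."""
        urls = sorted(self._replicas.get(logical_name, ()))
        if prefer_host:
            # Stable sort: URLs containing the preferred host move to the front.
            urls.sort(key=lambda url: prefer_host not in url)
        return urls


index = ReplicaIndex()
# Hypothetical hosts and paths, for illustration only.
index.register("pcm/tas_monthly.nc", "gsiftp://ncar.example.org/data/tas_monthly.nc")
index.register("pcm/tas_monthly.nc", "gsiftp://ornl.example.org/data/tas_monthly.nc")
print(index.locate("pcm/tas_monthly.nc", prefer_host="ornl.example.org"))
```

A client asks for the logical name only; which center actually serves the bytes becomes a lookup result rather than something the user has to know.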
ESG Metadata Policy
• Premise: geophysical sciences are too broad and complex to impose a single, all-encompassing metadata standard to capture the relevant information for all datasets, projects, instruments, scientists
• ESG will not mandate use of any metadata schema or convention
• Allow data providers and scientists to use their metadata of choice; provide technologies and tools to store and access metadata through common services (MCS, XML DB, THREDDS catalogs)
• Encourage development and reuse of a limited set of domain-specific standards (climate data, radar data, airborne instrumentation etc.), encoding in XML (according to community-developed schemas), interoperability and combination of schemas (XML namespaces and RDF-based ontologies - developed but not used)
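Combining schemas through XML namespaces, as the last point suggests, might look like the sketch below. The namespace URIs and element names are hypothetical placeholders; real community schemas would define their own.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs standing in for two community schemas.
CLIMATE = "http://example.org/schema/climate"
RADAR = "http://example.org/schema/radar"
ET.register_namespace("clim", CLIMATE)
ET.register_namespace("radar", RADAR)

# One record carrying elements from both schemas side by side.
record = ET.Element("dataset", {"id": "example-001"})
ET.SubElement(record, f"{{{CLIMATE}}}model").text = "PCM"
ET.SubElement(record, f"{{{RADAR}}}instrument").text = "ISR"

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# A climate-only client reads its own namespace and ignores the rest.
parsed = ET.fromstring(xml_text)
print(parsed.find(f"{{{CLIMATE}}}model").text)
```

Each community keeps its own vocabulary, and a shared service can still store and return the combined record without understanding every element in it.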
OPeNDAP for ESG II
DODS, since ~1995, has been based on an HTTP and CGI-style architecture
Two concerns
Application support and performance of HTTP
Housekeeping abilities of the CGI architecture
Solution: evolve OPeNDAP, the discipline-neutral aspect of DODS
OPeNDAP ctd.
Data transport protocol and access protocol separated
Revised server architecture
Address Grid-style authentication
Memory management
Exception handling
Make all these changes and retain interoperation with HTTP and CGI
Advanced requirements: a URL should support more than one dataset or object, i.e. aggregation
OPeNDAP 3.x vs OPeNDAP-g Architecture
OPeNDAP 3.x
• Simple and easy to install
• One CGI process per URL request
• Limited memory management - external
• Limited scalability
• Limited status reporting to web server
• Returns data stream from one format
OPeNDAP-g
• Standalone server or httpd module
• Can manage multiple daemon processes
• Strong memory management - internal
• Reuse processes, scales
• Coupled to OPeNDAP server for status
• Returns multiple formats in a single stream, multiple protocols
Application development
Status
• Refactor core classes to remove http/libwww, etc.
• Operational/production release of standalone OPeNDAP server (no dependence on web server)
• Multi-protocol support: file, http, GridFTP, ftp, etc.
• Re-architected for aggregation support and performance
• Run OPeNDAP server as a client to GridFTP server
• Portal application client in production, prototype of netCDF client operational
• Authentication is handled outside OPeNDAP server
• URL syntax is more complex
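Multi-protocol support of the kind listed above usually comes down to choosing a transport from the URL. The sketch below shows one way to do that and is not the OPeNDAP-g implementation; the table and function name are hypothetical. It assumes `gsiftp` as the URL scheme for GridFTP.

```python
from urllib.parse import urlparse

# Transport names keyed by URL scheme; "gsiftp" is assumed here for GridFTP.
TRANSPORTS = {
    "file": "local file read",
    "http": "HTTP",
    "ftp": "FTP",
    "gsiftp": "GridFTP",
}


def select_transport(url):
    """Pick a transport from the URL scheme; bare paths count as local files."""
    scheme = urlparse(url).scheme or "file"
    if scheme not in TRANSPORTS:
        raise ValueError(f"unsupported protocol: {scheme}")
    return TRANSPORTS[scheme]


# Hypothetical locations, for illustration only.
for location in ("/data/pcm/tas.nc",
                 "http://example.org/opendap/pcm/tas.nc",
                 "gsiftp://example.org/pcm/tas.nc"):
    print(location, "->", select_transport(location))
```

Keeping the choice in one table is what lets a client fall back from GridFTP to HTTP without touching the code that interprets the data.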
ESG: Framework experience
• ESG is a highly collaborative effort and will allow users to quickly access data storage facilities storing petabytes of raw or processed data in an application-independent manner.
• Payoffs of this distributed collaborative infrastructure have included:
  • Distributed data-sharing, RLS works! SRM/HRM work! OPeNDAP-g works!
  • Simplified data discovery of climate data, the work on metadata paid off! Scalability?
  • Large-scale climate data processing and analysis via highly integrated portal
  • Increased collaboration among climate research scientists, people use it!
  • Aid in climate assessments and estimates of future climate variability and trends, IPCC!
• Authentication and authorization have been a significant challenge
  • GSI to CAS
  • MyProxy - session based and seems to work well, more compatible with heterogeneous framework services
  • SAML is working for multi-file batch transfer
ESG: Framework experience
• Privatization
  • Portal interface (and much of the holdings) are cloned
  • Closed communities are breeding dead-end alley developments, e.g. delivering netCDF
• Transport - GridFTP versus HTTP
  • Server to server
  • Very good performance
  • Depends on a very specific version of the GridFTP server (stripped)
  • Clients are not as capable due to the ‘weight’ of Globus, revert to HTTP
• Scalability and response times (data AND metadata)
  • Framework architecture supports re-layering for tuning
• Service monitoring
  • to support the distributed collaborative infrastructure
  • need lots or all services to really make a production environment work
• Many Globus services not used (GRIS, MDS, GIIS, … )
• Feeling lucky? Try out ESG by visiting the website at: http://www.earthsystemgrid.org
Success?
• Users are generally happy
• Exploited new technology components
  • Integration - when and how does it work and scale?
  • XML
  • SQL
  • DODS
  • OPeNDAP and OPeNDAP-g
  • Portals
  • P2P - clients are not as ready as we think
• Globus provides a suite of framework components; some are easier to integrate than others, some just don’t fit our use-cases and architecture
• Data framework - e.g. OPeNDAP has been extremely successful
User needs
In discussions with data providers and users, the needs are clear:
“Fast access to ‘portable’ data, in a way that works with the tools we have; information must be easy to access, retrieve and work with.”
Too often users (and data providers) have to deal with the organizational structure of the data sets, which varies significantly: data may be stored at one site in a small number of large files while similar data may be stored at another site in a large number of relatively smaller files. There is an equally large problem with the range of metadata descriptions for the data.
Users often only want subsets of the data and struggle with getting them efficiently. One user expresses it as:
“(Please) solve the interface problem.”
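What users are asking for can be made concrete with a small sketch: one subsetting call that returns the same result whether a site stores one large file or many small ones. The data layout here (lists of `(time, value)` samples) and the function name are hypothetical simplifications.

```python
from itertools import chain


def subset(files, start, stop):
    """Yield (time, value) samples with start <= time < stop from any file layout."""
    for time, value in chain.from_iterable(files):
        if start <= time < stop:
            yield (time, value)


# The same twelve samples, organized two different ways.
one_large_file = [[(t, t * 10) for t in range(12)]]
many_small_files = [[(t, t * 10) for t in range(i, i + 3)]
                    for i in range(0, 12, 3)]

# The caller sees identical results and never learns how the files are split.
print(list(subset(one_large_file, 4, 7)))
print(list(subset(many_small_files, 4, 7)))
```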
Vision for building science cyberinfrastructure
• Use-case, then requirements
• Then derive architecture and choose technology components
• Build a working system for users from the start
• Get your funding source and community to commit to an evolving architecture
• If you choose a major framework technology, e.g. Globus, OPeNDAP, THREDDS, partner with them
• Data framework - e.g. OPeNDAP has been extremely successful
One paradigm
Goal - find the right balance of data/model holdings, portals and client software that a researcher can use without effort or interference, as if all the materials were available on his/her local computer.
E.g.
The Virtual Solar-Terrestrial Observatory (VSTO) is proposed to be:
• a distributed, scalable education and research environment for searching, integrating, and analyzing observational, experimental and model databases in the fields of solar, solar-terrestrial and space physics
Comprises:
• a system-like framework which provides virtual access to specific data, model, tool and material archives containing items from a variety of space- and ground-based instruments and experiments, as well as individual and community modeling and software efforts bridging research and educational use
Virtual Observatory? Need better glue
• Basic problem: schemas are categorized rather than developed from an object model/class hierarchy -> significantly limits non-human use. However, they all form the basis to organize catalog interfaces for all types of data, images, etc.
• This limits data systems utilizing frameworks and prevents frameworks from truly interoperating (SOAP, WSDL only a start)
• Directories, e.g. NASA GCMD, CEDAR catalog, FITS (flat) keyword/value pairs, are being turned into ontologies (SWEET, VSTO)
• Markup languages, e.g. ESML, SPDML, ESG/ncML are excellent bases
• Evolve, recast, merge (where appropriate) using formal processes, tools with intended use in mind - for interface specifications, reasoning, validation, etc. beyond the usual search and access
Summary
• Basic success in both data systems and data framework approaches
• Satisfying user and sponsor needs (from ‘just’ to ‘outstanding’)
• Experience with Globus ranges from very good to not ready for our needs
• Experience with OPeNDAP is very good, especially with core services
• Scalability and performance require an adaptable architecture, which is something system-level interfaces can still hide from the user
• Challenge - to bring these attributes to a framework, i.e. one in which the user is more exposed
• Interoperate, interoperate, interoperate - interface, interface, interface
• User interfaces still require significant HCI efforts
• Metadata services are extremely important