The Data Author’s Perspective: Lessons
Learned From Data Creation to Data Curation
Jeff Dozier
James E. Frew
[Figure: data life-cycle diagram with stages labeled Collect, Store, Search, Retrieve, Analyze, Present]
Snow spectral reflectance and absorption
coefficient of ice
[Figure: snow spectral reflectance (left axis, 0.0 to 1.0; curves labeled 0.05, 0.2, 0.5, and 1.0 mm) and absorption coefficient of ice (right axis, logarithmic, 1.0E-09 to 1.0E-02), versus wavelength, 0.4 to 2.4 μm]
Landsat Thematic Mapper (TM) band
combinations
[Figure: two Landsat TM color composites: Bands 4,3,2 (R,G,B) and Bands 5,4,2 (R,G,B)]
What you see, through Earth’s atmosphere
[Figure: reflectance (%) versus wavelength (0.3 to 2.3 μm) for snow, vegetation, and rock, and for three mixtures: equal snow-veg-rock; 80% snow, 10% veg, 10% rock; 20% snow, 50% veg, 30% rock]
Spatial, spectral characteristics of Landsat and MODIS
[Figure: IFOV (m, 10 to 1000) versus wavelength (0.4 to 15 μm) for Landsat bands (panchromatic, visible, NIR, SWIR, thermal) and for MODIS band groups (high-resolution, “land”, ocean/atmosphere) across the visible, NIR, SWIR, mid IR, and thermal]
What a multispectral sensor sees
[Figure: reflectance (%) versus wavelength (0.3 to 2.3 μm), as seen by a multispectral sensor, for snow, vegetation, and rock, and for three mixtures: equal snow-veg-rock; 80% snow, 10% veg, 10% rock; 20% snow, 50% veg, 30% rock]
Set of equations for each pixel
\[
\begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_N \end{bmatrix}
=
\begin{bmatrix}
\rho_{\mathrm{snow}}(r, c, \lambda_1, a_1) & \rho_{12} & \cdots & \rho_{1M} \\
\rho_{\mathrm{snow}}(r, c, \lambda_2, a_2) & \rho_{22} & \cdots & \rho_{2M} \\
\vdots & \vdots & & \vdots \\
\rho_{\mathrm{snow}}(r, c, \lambda_N, a_N) & \rho_{N2} & \cdots & \rho_{NM}
\end{bmatrix}
\begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_M \end{bmatrix}
\]
where $0 \le F_i \le 1$ and $\sum_i F_i = 1$
• Solve for r, c and F (least squares)
• Still to consider: better corrections for illumination angle, viewing angle, subpixel topography, and vegetation
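As a rough illustration of the per-pixel fit, the sketch below solves only for the fractions F, holding every endmember spectrum fixed. The slide's full problem also solves for r and c inside the snow spectrum, which this sketch does not attempt. The seven-band reflectance values, the function names, and the projected-gradient solver are assumptions chosen for illustration; they are not the authors' algorithm or data.

```python
# Hypothetical 7-band reflectances for three endmembers (illustrative values only).
SNOW = [0.95, 0.95, 0.90, 0.70, 0.40, 0.10, 0.05]
VEG = [0.05, 0.08, 0.05, 0.50, 0.45, 0.25, 0.12]
ROCK = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.40]


def project_to_simplex(values):
    """Closest point with every F_i >= 0 and sum(F) = 1."""
    ordered = sorted(values, reverse=True)
    running_sum = 0.0
    shift = 0.0
    for count, value in enumerate(ordered, start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / count
        if value - candidate > 0:
            shift = candidate
    return [max(v - shift, 0.0) for v in values]


def unmix(endmembers, pixel, iterations=2000):
    """Least-squares fractions F for one pixel by projected gradient descent."""
    n_end = len(endmembers)
    n_bands = len(pixel)
    # The sum of squared entries bounds the largest eigenvalue, so this step is safe.
    step = 1.0 / sum(x * x for spectrum in endmembers for x in spectrum)
    fractions = [1.0 / n_end] * n_end
    for _ in range(iterations):
        modeled = [sum(fractions[j] * endmembers[j][i] for j in range(n_end))
                   for i in range(n_bands)]
        residual = [modeled[i] - pixel[i] for i in range(n_bands)]
        gradient = [sum(endmembers[j][i] * residual[i] for i in range(n_bands))
                    for j in range(n_end)]
        fractions = project_to_simplex(
            [fractions[j] - step * gradient[j] for j in range(n_end)])
    return fractions


# Synthetic pixel: 60% snow, 30% vegetation, 10% rock.
true_fractions = [0.6, 0.3, 0.1]
pixel = [sum(f * s[i] for f, s in zip(true_fractions, (SNOW, VEG, ROCK)))
         for i in range(7)]
estimated = unmix([SNOW, VEG, ROCK], pixel)
print([round(f, 3) for f in estimated])
```

Extending this toward the slide's formulation would mean repeating the fit over a family of candidate snow spectra (one per r, c pair) and keeping the candidate with the smallest residual.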
Fractional snow cover, Sierra Nevada, March 7 2004
Sierra Nevada topography
Daily MODIS acquisition, processing for
Sierra Nevada snow cover and albedo
[Figure: processing flow with stages labeled Ingest from NASA DAACs; MODIS snow cover & albedo algorithm; reproject, mosaic, subset, format; Database; with connections to MODster, TerraServer, and Alexandria. Data volumes shown: Sierra Nevada = 36 MB/day and snow-covered land = 8 GB/day; Sierra Nevada = 10 MB/day and snow-covered land = 2 GB/day]
Examples of fractional snow cover, January
through April 2004
[Figure: four panels: Jan 01 2004, Jan 17 2004, Mar 26 2004, Apr 08 2004]
Examples of grain size, January through April
2004
[Figure: four panels: Jan 01 2004, Jan 17 2004, Mar 26 2004, Apr 08 2004]
Effect of vegetation
[Figure: image pairs for 2004: March 3 vs March 4, March 4 vs March 5, March 5 vs March 7, March 7 vs March 8]

Date      SCA (%)   CCA (%)   Sensor zenith (degrees)
March 3   73        0         50
March 4   74        18        48
March 5   78        27        36
March 7   69        9         15
March 8   55        31        62
Applications: snowmelt modeling, Marble
Fork of the Kaweah River
(Molotch et al., GRL, 2004)
\[
\text{Melt Flux} = \left( R_{\mathrm{net}}\, m_q + T_d\, a_r \right) \times \mathrm{SCA}
\]
where:
$R_{\mathrm{net}}$ = net radiation > 0
$T_d$ = degree days > 0
SCA = Snow Covered Area
$m_q$ = energy to water depth conversion, 0.026 cm W^-1 m^2 day^-1
$a_r$ = convection parameter, based on wind speed, humidity, and roughness
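A minimal sketch of this relation, reading it as (R_net × m_q + T_d × a_r) × SCA. Several things here are assumptions rather than statements from the slide: the "> 0" annotations are interpreted as clamping negative drivers to zero, SCA is passed as a fraction between 0 and 1, and a_r has no value on the slide, so the caller must supply one (the 0.1 in the usage line is purely a placeholder).

```python
M_Q = 0.026  # energy-to-water-depth conversion from the slide, cm W^-1 m^2 day^-1


def melt_flux(net_radiation, degree_days, sca_fraction, a_r):
    """Melt flux in cm/day for one cell.

    net_radiation: W m^-2; values <= 0 contribute nothing (assumed reading of "> 0").
    degree_days:   values <= 0 contribute nothing (assumed reading of "> 0").
    sca_fraction:  snow-covered area as a fraction, 0 to 1 (assumed).
    a_r:           convection parameter; no value is given on the slide.
    """
    radiation_term = max(net_radiation, 0.0) * M_Q
    temperature_term = max(degree_days, 0.0) * a_r
    return (radiation_term + temperature_term) * sca_fraction


# Placeholder inputs, not observations.
print(melt_flux(net_radiation=100.0, degree_days=2.0, sca_fraction=0.5, a_r=0.1))
```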
Magnitude of snowmelt: Modeled – Observed snow water equivalence
[Figure: maps of SWE difference (cm), Tokopah basin, Sierra Nevada, for three albedo treatments: assumed albedo, assumed w/ update, and AVIRIS albedo]
The data author’s perspective on drivers and
constraints
• The science information user:
– I want reliable, timely, usable science information products
» Accessibility
» Accountability
• The funding agencies and the science community:
– We want this to be done by a distributed federation of providers,
not just by data centers
» Scalability
• The science information provider:
– I’m doing just fine, thanks.
» Transparency
Research vs. production computing
Research computing is …
• Heterogeneous
– multiple platforms,
applications, languages
• Idiosyncratic
– researchers typically have
highly customized computing
environments
• Problem-driven
– focus on results, not processes
Production computing is …
• Robust
– reliable, not just correct
• Standardized
– can easily substitute
components for repair,
upgrade, etc.
• Scalable
– accommodates steady or
increasing demand for
product
Principles
• Goal
– Help scientists become information providers in a
federated data system
• Prime Directive
– Minimal disruption of a working scientist’s
computational environment
• Ultimate product
– Software, system architecture, and procedures for
turning science projects into a federation of providers
Model structure for MODIS snow-covered
area and albedo
[Figure: block diagram of the MODIS snow cover and grain size model. Input labels: MODIS 7 land bands (112 bits), MODIS cloud mask (48 bits), MODIS view angles, MODIS quality flags, solar zenith and azimuth, topography, watershed info, basin mask. Output labels: snow fraction, albedo, veg fraction, soil fraction, shade fraction, open water fraction, RMS error, quality flag, processing lineage]
Lineage: current best practice
ESSW: Our Earth System Science Workbench
Producer and consumer issues can both be addressed
by a laboratory metaphor
• Experiment
– Network of models
– … ingesting / synthesizing data
– … generating products
• Laboratory
– Experiment execution environment
» Computing + storage = accessibility + scalability
• Lab Notebook
– Persistent storage that can be queried
– Keeps track of all experiments
» Documentation + lineage = accountability
Use existing science applications
• No “standard” Earth science computing
environment
– commercial packages (ArcGIS, ENVI,
MATLAB, …)
– public packages/models (MM5, MODTRAN,
…)
– locally-developed codes
• Example: Snow cover from AVHRR
– commercial + standalone programs
– parameters highly customized for UCSB
• How do we get these programs to
– communicate
– cooperate
with the Earth System Science
Workbench (ESSW), without rewriting?
[Figure: AVHRR processing chain with steps labeled Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps]
Wrap Your App: Scripts talk to ESSW
• No changes, just additions
– Wrapper scripts
» Make program (groups) look like ESSW experiments
– ESSW daemon
» Converts wrapper output to database input
– ESSW database
» Stores converted wrapper output
[Figure: the AVHRR processing steps (Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps) connected to the ESSW daemon and the ESSW Database. Technology labels: Perl API, XML + SQL, Perl, Java, JDBC, MySQL]
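The wrapper idea can be sketched as follows. The slides describe the ESSW wrappers as Perl scripts emitting XML; this Python version, the `run_wrapped` function, and the XML element names are hypothetical stand-ins, not the ESSW API or its record format. The point is only that the science program runs unmodified while a thin script around it describes the run.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_wrapped(experiment, command, inputs, outputs):
    """Run an unmodified program, then describe the run as an XML record."""
    started = datetime.now(timezone.utc).isoformat()
    completed = subprocess.run(command, capture_output=True, text=True)

    record = ET.Element("experiment", {"name": experiment})
    ET.SubElement(record, "command").text = " ".join(command)
    ET.SubElement(record, "started").text = started
    ET.SubElement(record, "exit_status").text = str(completed.returncode)
    # Inputs and outputs are whatever the script author declares here;
    # nothing checks them against what the program actually touched.
    for path in inputs:
        ET.SubElement(record, "input").text = path
    for path in outputs:
        ET.SubElement(record, "output").text = path
    return ET.tostring(record, encoding="unicode")


# Stand-in for a science program: the command is a placeholder, and the
# experiment and product names are used only as labels.
xml_record = run_wrapped(
    "avhrr_ingest",
    [sys.executable, "-c", "print('ingest done')"],
    inputs=["avhrr_L0"],
    outputs=["avhrr_l1b"],
)
print(xml_record)
```

In this sketch a separate process, playing the role of the daemon, would pick up each record and translate it into database rows.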
Detailed example
[Figure: lineage graph for the AVHRR snow-cover chain. Node labels and their descriptions:]
avhrr_L0: AVHRR Level 0 product
avhrr_ingest: AVHRR telemetry ingest
avhrr_l1b: AVHRR Level 1B product
avhrr_handNav: Hand navigation procedure
avhrr_navd_l1b: AVHRR Level 1B: navigated
avhrr_snowModel: Multi-channel snow-covered area algorithm
avhrr_sca: Snow-covered area
avhrr_copyNav: Copy navigated image
avhrr_navd_sca: SCA: navigated
(additional description in the figure: Hand navigation details)
ESSW Lessons
• Providers are customers
– Federations aren’t much good unless scientists are happy to put information in them
• A light touch is the right touch
– Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
• Scientists do write scripts, but not necessarily Perl
– Scripting (gluing stuff together) comes naturally to scientists
• Scientists don’t write DTDs
• Nobody calls metadata APIs
ESSW was automatic, but not automatic enough…
ES3: Earth System Science Server
[Figure: ES3 system diagram, annotated “data lineage tracking”. Labels: MODIS, AVHRR, watershed-scale snow product, global-scale snow product, MODster, OpenDAP, Microsoft TerraServer, Alexandria Digital Library, Corona, BUB data storage, ROCKS processing clusters]
From ESSW to ES3: Summary
• Perl wrappers → Probulators
• Perl API → web services + RDF messages
• SQL → XML database(s)
From wrappers to probulators
Wrappers: active lineage
• Good
– Complete control over what gets recorded
– Single language/API for all wrapped events
– Not tied to execution
» You can even lie about what happened
• Bad
– Must explicitly script everything
– Scripts can drift from reality
» You can even lie about what happened
From wrappers to probulators
Probulators: passive lineage
• Good
– Record what actually happened
» Not just what you think happened
» Not what didn’t happen
– Automatic: don’t have to write new scripts for
everything
• Bad
– Different flavors for different environments
» Can’t just do everything in Perl…
Probulator flavors
• Instrumentation
– Insert lineage capture instructions directly into science codes
» e.g. “I just created file ‘foo’”
– Typical implementation: preprocessor/precompiler
• Overriding
– Replace standard routines/libraries with lineage-capturing versions
» e.g. open(…) → snoopy_open(…)
– Typical implementation: modify execution environment
» environment variables
» configuration files
• Passive monitoring
– Trace program execution
» e.g. “called open() with args foo, bar, …”
– Typical implementation: strace’d shell
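The "overriding" flavor can be sketched in Python by swapping the built-in `open` for a lineage-capturing version. This is an in-process analogue chosen for illustration: the slide's typical implementation works through environment variables and configuration files, and `capture_lineage`, the log format, and the file names `foo` and `bar` are all hypothetical.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def capture_lineage(log):
    """Temporarily replace open() with a version that records each call."""
    original_open = builtins.open

    def snoopy_open(file, mode="r", *args, **kwargs):
        # Plain "r" modes count as reads; anything that can modify is a write.
        event = "read" if mode.startswith("r") and "+" not in mode else "write"
        log.append((event, os.path.basename(str(file))))
        return original_open(file, mode, *args, **kwargs)

    builtins.open = snoopy_open
    try:
        yield log
    finally:
        builtins.open = original_open  # always restore the real open()


def science_code(source, product):
    """Unmodified 'science code': it just calls open() as usual."""
    with open(source) as src, open(product, "w") as dst:
        dst.write(src.read().upper())


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "foo")
product = os.path.join(workdir, "bar")
with open(source, "w") as handle:
    handle.write("snow")

log = []
with capture_lineage(log):
    science_code(source, product)
print(log)
```

The science code is untouched; only its execution environment changes, which is the property that distinguishes overriding from instrumentation.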
ES3 lineage architecture
[Figure: architecture diagram with components probulator 1 … probulator n, logger, log files, transmitter, and ES3 core]
Now What?
• Probulator reports not universally unique
– Q: How hook separate reports together?
– A: Logger assigns UUIDs to
» Data streams
» Processes
» Jobs (workflows)
• Lineage not explicit
– Q: How publish lineage?
– A: ES3 Core builds serialized graph
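The two answers above can be sketched together: a logger that gives each data stream, process, and job one UUID, so that separate reports naming the same stream join up, and a serializer that publishes the result as an explicit graph. The class, the relation names, and the JSON layout are hypothetical; the slides do not specify the ES3 report or graph format. The process and stream names are used only as labels, and which process reads which stream is assumed for the example.

```python
import json
import uuid


class Logger:
    """Assigns one UUID per data stream, process, and job (hypothetical design)."""

    def __init__(self):
        self.ids = {}    # (kind, local name) -> UUID string
        self.edges = []  # (from UUID, relation, to UUID)

    def identify(self, kind, name):
        """Return the UUID for this item, creating it the first time it is seen."""
        key = (kind, name)
        if key not in self.ids:
            self.ids[key] = str(uuid.uuid4())
        return self.ids[key]

    def report(self, job, process, reads, writes):
        """Record one probulator report about a single process."""
        process_id = self.identify("process", process)
        self.edges.append((process_id, "part_of", self.identify("job", job)))
        for name in reads:
            self.edges.append((self.identify("stream", name), "read_by", process_id))
        for name in writes:
            self.edges.append((process_id, "wrote", self.identify("stream", name)))


def serialize(logger):
    """Publish the lineage as an explicit node and edge list."""
    nodes = [{"id": uid, "kind": kind, "name": name}
             for (kind, name), uid in logger.ids.items()]
    edges = [{"from": src, "relation": rel, "to": dst}
             for src, rel, dst in logger.edges]
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)


# Two independent reports that happen to mention the same stream.
logger = Logger()
logger.report("job-1", "avhrr_ingest", reads=["avhrr_L0"], writes=["avhrr_l1b"])
logger.report("job-1", "avhrr_handNav", reads=["avhrr_l1b"], writes=["avhrr_navd_l1b"])
serialized = serialize(logger)
print(serialized)
```

Because both reports resolve `avhrr_l1b` to the same UUID, the serialized graph links the two processes without either report knowing about the other.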
Products available from
http://www.snow.ucsb.edu (forthcoming)
• Fractional snow-covered area, grain size (and
contaminants) from daily MODIS images
– Quality flags for cloud cover, highly oblique viewing
– Fractional coverage of other endmembers
• Best estimate of snow-covered area and broadband
albedo on that date
– Extrapolating from previous values to that date and
smoothing
• End-of-season reanalysis of daily snow-covered
area and broadband albedo
– Interpolation, smoothing, comparison with in situ snow
pillow data
AAAS Feb 2006