Comparing NetCDF and a multidimensional array database on managing and querying large hydrologic datasets: a case study of SciDB – P5
Haicheng Liu
7-10-2015
Delft University of Technology
Challenge the future
Outline
 Background
 Query design
 Selection of multidimensional (MD) array database
 Test environment setup
 Benchmark test and analysis
 Conclusions
Geomatics for the Built Environment
Background
NetCDF
 A term that can refer to a data model, a file format or an API
 Data model
 Dimension: physical dimension or index such as time step
 Variable: core data stored, e.g. precipitation
 Attribute: metadata of variables or file
 Format
 Classic and 64-bit offset formats: a header followed by data arrays stored contiguously
 NetCDF-4 and NetCDF-4 classic model formats: support for dynamic schema and chunked storage
Problem for query
 Contiguous storage structure adopted by classic
and 64-bit offset format
[Figure: the cells of Grid 1, Grid 2, Grid 3, … are stored one after another in a single one-dimensional array]
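Why contiguous storage hurts some queries can be seen from the position of each cell in the one-dimensional array. The sketch below assumes a row-major layout with time as the slowest-varying dimension and a file of 4 time steps on a 4000 x 4000 grid; the cell coordinates are hypothetical.

```python
def flat_index(index, shape):
    """Row-major position of a multidimensional index in a 1D array."""
    pos = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        pos = pos * n + i
    return pos


shape = (4, 4000, 4000)   # (time, y, x): assumed dimension order
y, x = 1000, 2000         # hypothetical cell

# Positions read by a time series extraction at one cell.
offsets = [flat_index((t, y, x), shape) for t in range(shape[0])]
strides = [b - a for a, b in zip(offsets, offsets[1:])]
print(offsets[0], strides)
```

Under these assumptions, consecutive time steps of the same cell lie one full grid (16,000,000 values) apart, so a time series extraction jumps through the file, whereas one row of a single grid is read in one contiguous run.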
MD array database
 A database whose abstract model for data management and querying is a multidimensional array consisting of dimensions and attributes
 Many solutions
 Open source: Rasdaman, SciDB, MonetDB, etc.
 Commercial: Essbase, Caché, Oracle spatial, etc.
 Most use a chunked storage structure
Possible solution
 Chunked storage structure of the NetCDF-4 format and of a multidimensional (MD) array database
[Figure: an index pointing to MD chunks]
 An MD array database also has a smarter caching strategy
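With chunked storage, a query only needs the chunks that its bounding box intersects. The following is a minimal sketch of that lookup, not the indexing scheme of NetCDF-4 or SciDB; the query box is hypothetical and the dimension order (time, y, x) is an assumption.

```python
from itertools import product


def chunks_touched(lo, hi, chunk_shape):
    """Chunk coordinates intersected by the half-open box [lo, hi)."""
    ranges = [
        range(l // c, (h - 1) // c + 1)
        for l, h, c in zip(lo, hi, chunk_shape)
    ]
    return list(product(*ranges))


# Hypothetical sub grid query: 1 time step, 200 rows, 300 columns.
lo = (0, 1000, 2000)
hi = (1, 1200, 2300)

small = chunks_touched(lo, hi, (1, 100, 100))   # 100 x 100 cell chunks
large = chunks_touched(lo, hi, (1, 800, 800))   # 800 x 800 cell chunks
print(len(small), len(large))
```

Smaller chunks mean more chunks to look up but fewer cells read that fall outside the query box; larger chunks mean the opposite.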
Research question
 Can an MD array database process frequently used queries faster than NetCDF solutions for large hydrological datasets?
Roadmap
 Query design (Dataset selection)
 Selection of MD array database
 Test environment setup (HydroNET-4): NetCDF connector; MD array database connector
 Benchmark: 64-bit offset storage; NetCDF-4 (normal chunk); NetCDF-4 (compressed chunk); MD array database (normal chunk); MD array database (compressed chunk)
Query design
Query and dataset collection
 6 experts interviewed in total
 19 conceptual queries categorized into 5 classes
 Selection based on dimension value
 Selection based on variable value
 Masking query, e.g. data quality check
 Statistical operation, e.g. Sum, Avg and Max
 Spatial operation, e.g. intersection
 Datasets include 1D time series records, 2D
satellite images, 5D forecast datasets, etc.
Datasets
| Dataset | Information stored | Dimension count | Dimensions | Span (single file) | Temporal resolution | Spatial resolution and coverage | Single file size | Data format |
|---|---|---|---|---|---|---|---|---|
| MPE (MultiSensor Precipitation Estimate) rainfall rate from satellite data product | Rainfall rate; Availability; Quality | 3 | x, y, time | (4000, 4000, 4) | 15 minutes | 0.03 degree (3.3 km), 1/3 world | 250 MB | 64-bit offset |
| GEFS (Global Ensemble Forecast System) weather forecast data | Temperature 2m above ground; Maximum temperature 2m above ground; Minimum temperature 2m above ground; Relative humidity 2m above ground; Total precipitation; Total Cloud Cover; U-Component of Wind 10m above ground; V-Component of Wind 10m above ground; Data status | 5 | Longitude, latitude, forecast, ensemble, model run | (360, 181, 40, 20, 1) | 6 hours | 1 degree (111 km), Global | 1.55 GB | 64-bit offset |
MPE & GEFS
[Figure: the 3D MPE array (dimensions: Longitude, Latitude, Time) and the 5D GEFS array (dimensions: Longitude, Latitude, Forecast, Ensemble, Modelrun)]
Queries designed
 MPE dataset
 Sub grid selection (Delft and northern part of the
Netherlands)
 Time series extraction (A spot location in the Indian
Ocean)
 Pyramid query (the Netherlands)
 Average calculation (the Netherlands)
 Maximum calculation (the Netherlands)
 GEFS dataset
 Time series extraction (Delft, one cell in GEFS)
 Percentile calculation (Delft, one cell)
 Ensemble mean calculation (the Netherlands and Europe)
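To make the GEFS query types concrete, the sketch below runs them on a tiny in-memory example using only the Python standard library. The forecast values are hypothetical, and the percentile definition (linear interpolation between sorted values) is an assumption, since the slides do not state which definition the benchmark uses.

```python
import statistics

# Hypothetical values for one cell: forecasts[forecast_step][ensemble_member]
forecasts = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 4.0, 8.0],
    [0.0, 1.0, 1.0, 2.0],
]


def time_series(series, member):
    """Values of one ensemble member along the forecast dimension."""
    return [members[member] for members in series]


def ensemble_mean(series):
    """Mean over all ensemble members for each forecast step."""
    return [statistics.fmean(members) for members in series]


def percentile(values, p):
    """p-th percentile (0 to 100), linearly interpolated between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


print(time_series(forecasts, 0))
print(ensemble_mean(forecasts))
print(percentile(forecasts[0], 50))
```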
Selection of MD array database
MD array database selection
 The comparison focuses on Rasdaman and SciDB
 9 criteria in total and different approaches are
employed to assess each criterion, e.g.
 Implementation of MD data storage structure: paper study,
official documentation, forums, source code and
discussion with developers
 No practical tests are performed
MD array database selection
| Criterion | Rasdaman | SciDB |
|---|---|---|
| License (i.e. commercial open-source) | 1 | 1 |
| Implementation of MD data storage structure | 1 | 1 |
| Lossless compression support | 0 | 1 |
| Parallelization | 1 | 1 |
| .Net API | 0.5 | 0 |
| Query language | 1 | 1 |
| Spatial calculating capability | 0 | 0 |
| NetCDF importer | 1 | 0.5 |
| Maintenance | 0.5 | 1 |
| Overall grade | 6 | 6.5 |
Final grade shows SciDB scores higher
Test environment setup
Benchmark architecture
Benchmark test and analysis
64-bit offset NetCDF files
 MPE dataset
 One file contains 4 time steps, 250 MB
 A folder contains 1722 files
 GEFS dataset
 Only one file stored, containing 1 modelrun, 20
ensembles, 40 forecast steps, 181 latitudes and 360
longitudes, 1.55 GB
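The GEFS file size can be cross-checked from the dimension sizes. The sketch below assumes 4-byte values and eight stored weather variables; neither is stated on this slide, so the result is a plausibility check rather than a description of the actual file layout.

```python
from math import prod

shape = (360, 181, 40, 20, 1)   # longitude, latitude, forecast, ensemble, modelrun
cells = prod(shape)             # values per variable

bytes_per_value = 4             # assumption: 4-byte values
variables = 8                   # assumption: eight weather variables in the file

size_gib = cells * bytes_per_value * variables / 1024**3
print(cells, round(size_gib, 2))
```

Under these assumptions the estimate rounds to 1.55 GiB, in line with the reported file size; the header and any further variables would add a little on top.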
NetCDF-4 files
 MPE dataset
| Data store name | Chunk size (X x Y x Time) | Single file size |
|---|---|---|
| NetCDF4_C2 | 4000 x 4000 x 1 | 250 MB |
| NetCDF4_C2_C (compression) | 4000 x 4000 x 1 | 3 MB |
 One file contains 4 time steps
 Two folders created for the two data stores, each with 720
files
 GEFS dataset (1 file for one data store)
| Data store name | Chunk size (X x Y x Forecast x Ensemble x Modelrun) | Single file size |
|---|---|---|
| NetCDF4_GEFS_S3 | 360 x 181 x 1 x 20 x 1 | 1.55 GB |
| NetCDF4_GEFS_S3_C (compression) | 360 x 181 x 1 x 20 x 1 | 654 MB |
| NetCDF4_GEFS_S5 | 360 x 181 x 1 x 1 x 1 | 1.55 GB |
| NetCDF4_GEFS_S5_C (compression) | 360 x 181 x 1 x 1 x 1 | 561 MB |
SciDB arrays
 MPE dataset
| Array level | MPE data stored | Time step count | SciDB array size | Original size of files in 64-bit offset format |
|---|---|---|---|---|
| Tiny | First 2 hours of 1st September, 2013 | 8 | 37 MB | 488 MB |
| Small | First 6 hours of 1st September, 2013 | 24 | 112 MB | 1.3 GB |
| Medium | 1st September, 2013 | 96 | 448 MB | 5.7 GB |
| Large | 7 days from 1st to 7th September, 2013 | 672 | 3 GB | 40 GB |
| Very large | 30 days of September, 2013 | 2880 | 13 GB | 171.6 GB |
 Diverse chunk sizes and compression settings
 GEFS dataset
 4 data schemas for storage -> modification of order of
dimensions
6 chunk sizes for MPE arrays
 C1: 4 x 4000 x 4000
 C2: 1 x 4000 x 4000
 C3: 4 x 800 x 800
 C4: 1 x 800 x 800
 C5: 4 x 100 x 100
 C6: 1 x 100 x 100
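The chunk size determines how many chunks an array is split into. As a rough illustration, the sketch below counts chunks per scheme for an array of 2880 time steps (the 30-day MPE array) on the 4000 x 4000 grid. It assumes the chunk sizes are listed as time x Y x X and counts chunks for a single attribute only.

```python
from math import prod

SCHEMES = {
    "C1": (4, 4000, 4000),
    "C2": (1, 4000, 4000),
    "C3": (4, 800, 800),
    "C4": (1, 800, 800),
    "C5": (4, 100, 100),
    "C6": (1, 100, 100),
}


def chunk_count(array_shape, chunk_shape):
    """Number of chunks needed to tile the array (ceiling division per dimension)."""
    return prod(-(-n // c) for n, c in zip(array_shape, chunk_shape))


array_shape = (2880, 4000, 4000)   # (time, y, x): assumed dimension order
counts = {name: chunk_count(array_shape, c) for name, c in SCHEMES.items()}
for name, count in counts.items():
    print(name, count)
```

Under these assumptions the count grows from 720 chunks for C1 to 4,608,000 chunks for C6, which gives a sense of how quickly per-chunk bookkeeping grows as chunks shrink.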
GEFS: effect of dimensions order
[Figure: GEFS chunk layouts under different dimension orders (axes: M = modelrun, E = ensemble, F = forecast, Y, X)]
Benchmark test
 Two data solutions (NetCDF and SciDB) are benchmarked
 Each specific query is run 20 times and the average of the middle 12 records is used as the query response time
 For SciDB, network delay and query parsing add an extra cost of between 0.05 s and 0.2 s
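The response time measure can be sketched as a trimmed mean. This is a minimal illustration that assumes "middle 12 records" means the 20 timings are sorted and the 4 fastest and 4 slowest runs are discarded; the function name and the sample timings are hypothetical.

```python
def response_time(samples, keep=12):
    """Average of the middle `keep` values after sorting the samples."""
    extra = len(samples) - keep
    if extra < 0 or extra % 2:
        raise ValueError("cannot drop an equal number of runs from both ends")
    drop = extra // 2
    middle = sorted(samples)[drop:len(samples) - drop]
    return sum(middle) / len(middle)


# Hypothetical timings in seconds: 18 typical runs, one very fast, one very slow.
timings = [1.0] * 18 + [0.1, 30.0]
print(response_time(timings))
```

Discarding both ends keeps a single cold-cache run or a cached repeat from dominating the reported average.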
MPE sub grid selection
[Figure: sub grid selected from the MPE array (axes: X, Y, Time)]
MPE sub grid selection
| Scheme | Chunk size |
|---|---|
| C1, C1_C | 4 x 4000 x 4000 |
| C2, C2_C | 1 x 4000 x 4000 |
| C3, C3_C | 4 x 800 x 800 |
| C4, C4_C | 1 x 800 x 800 |
| C5, C5_C | 4 x 100 x 100 |
| C6, C6_C | 1 x 100 x 100 |
Selecting grid covering the northern part of the Netherlands
GEFS forecast time series extraction
[Figure: forecast time series extraction from the GEFS array for 1 modelrun (axes: X, Y, Forecast, Ensemble)]
GEFS forecast time series extraction
| Scheme | Dimensions order | Chunk size |
|---|---|---|
| S1, S1_C | MEFYX | 1 x 20 x 1 x 181 x 360 |
| S2, S2_C | MFYXE | 1 x 1 x 181 x 360 x 20 |
| S3, S3_C | XYFEM | 360 x 181 x 1 x 20 x 1 |
| S5, S5_C | XYFEM | 360 x 181 x 1 x 1 x 1 |

[Figure: bar chart of average query response time (s) per data store; bar labels in extraction order: 2.702, 2.281, 1.430, 1.061, 1.142, 0.910, 0.109, 23.112, 48.031]
Extracting precipitation forecast time series from Delft, a spot location
Overall evaluation
| Data solution | 64-bit offset | NetCDF-4 | NetCDF-4 DEFLATE compression | SciDB array | SciDB array DEFLATE compression |
|---|---|---|---|---|---|
| Management: Data loading | 5 | 4 | 3 | 1 | 1 |
| Management: Storage | 1 | 1 | 5 | 3 | 4 |
| Management: Scheme transformation | 1 | 1 | 1 | 4 | 4 |
| Management overall score | 7 | 6 | 9 | 8 | 9 |
| Query: MPE sub grid selection | 4 | 5 | 1 | 3 | 2 |
| Query: MPE time series extraction | 2 | 5 | 1 | 4 | 3 |
| Query: MPE average calculation | 4 | 5 | 1 | 3 | 3 |
| Query: MPE maximum calculation | 4 | 5 | 1 | 3 | 3 |
| Query: GEFS forecast time series extraction | 5 | 4 | 2 | 3 | 1 |
| Query: GEFS percentile calculation | 5 | 4 | 3 | 2 | 1 |
| Query: GEFS ensemble mean calculation | 5 | 4 | 3 | 2 | 1 |
| Query overall score | 29 | 32 | 12 | 20 | 14 |
| Compound score (management * 0.33 + query * 0.14) | 6.48 | 6.57 | 4.71 | 5.52 | 5.00 |
NetCDF-4 ranks first, followed by 64-bit offset; the SciDB solutions come after
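The compound scores can be recomputed from the two overall scores. The listed values are reproduced when the weights are taken as exactly 1/3 and 1/7 (which round to 0.33 and 0.14), i.e. the mean score over the 3 management criteria plus the mean score over the 7 queries; reading the weights this way is an inference from the numbers, not something stated on the slide.

```python
SOLUTIONS = [
    "64-bit offset",
    "NetCDF-4",
    "NetCDF-4 DEFLATE",
    "SciDB",
    "SciDB DEFLATE",
]
management = [7, 6, 9, 8, 9]      # management overall scores
query = [29, 32, 12, 20, 14]      # query overall scores


def compound(m, q, n_management=3, n_query=7):
    """Mean management score plus mean query score (weights 1/3 and 1/7)."""
    return m / n_management + q / n_query


scores = [round(compound(m, q), 2) for m, q in zip(management, query)]
ranking = [name for _, name in sorted(zip(scores, SOLUTIONS), reverse=True)]
print(scores)
print(ranking)
```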
Conclusions and future work
Summary
 Within the scope of research, NetCDF-4 without
compression is the best solution for managing and
querying large hydrologic datasets
 For SciDB, a small chunk size is preferable, but the overhead of the huge in-memory chunk metadata (i.e. <InstanceID, ArrayID, ChunkID, VersionID>) is a problem
 DEFLATE compression of SciDB arrays has either a negative effect or no effect on query performance
Summary
 A correlation between SciDB DEFLATE compression and chunk size is observed in time series extraction
 With hypercubic and modest chunk sizes, the internal data structure of chunks in SciDB has an insignificant influence on query performance
 Masking queries (e.g. data quality checks) and spatial operations should be included in a comprehensive benchmark
Future work
 Generic chunk model, to determine best chunk size
for querying
 More realistic benchmark test, e.g. analyze
Hydrologic Research query log and simulate
scenarios
 Test with less memory capacity with focus on
NetCDF
 Parallel query processing and parallel loading for
SciDB
Reflection
 Knowledge gained from the Geo-database and Geoweb courses is utilized, e.g. blocks to store images, HTTP communication
 The research makes use of geomatics techniques
to solve water problems
 The research fulfills organizations’ needs (Hydrologic, Deltares, etc.) and contributes water services to the public
Questions?