XLDB ‘09
Luke Lonergan
[email protected]
10/3/2015
“Big” numbers for GP today
• 70K/day – Query Rate
• 6.5PB – Dataset Size
• +100GB/s – Analysis Rate
• +3GB/s – Net Loading Rate
• 100,000/s – Transaction Rate
• 56 TB / kW, 1.6 GB/s/kW – Power Rate
• 100s – Number of Data/Compute nodes
Things I’ve Heard
• Tiered computing
– Organizational / Political / Geographic boundaries require it
• Metadata computing for HEP
– “10TB sounds small but it’s not easy”
• Processing for Radio Astronomy, HEP
– Data intensive computing
– Requires an efficient pipeline from raw to consumables
Thoughts
• A lot of plumbing! Moving data around, pipeline processing
– Core engine should do this so the plumbing isn’t done over and over
• Need for specialized access methods and storage classes
• “Computing in data” is key to success
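As a rough sketch of the “computing in data” idea (not Greenplum code), the Python below lets each node reduce its own rows to a small partial result and ships only that to a coordinator, instead of shipping every row. The three-node layout, the `local_partial` and `combine` helpers, and the sample values are all hypothetical.

```python
# Hypothetical layout: three "nodes", each holding its own slice of the data.
nodes = {
    "node1": list(range(0, 1000)),
    "node2": list(range(1000, 2000)),
    "node3": list(range(2000, 3000)),
}

def local_partial(rows):
    """Runs where the rows live: reduce them to a (sum, count) pair."""
    return (sum(rows), len(rows))

def combine(partials):
    """Runs on the coordinator: merge the small partial results into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partials = [local_partial(rows) for rows in nodes.values()]
mean = combine(partials)

# Values that cross the network under each strategy.
values_moved_naive = sum(len(rows) for rows in nodes.values())  # every row
values_moved_in_data = 2 * len(partials)                        # (sum, count) per node

print(mean, values_moved_naive, values_moved_in_data)
```

In this toy setup the in-data strategy moves two values per node regardless of how many rows each node holds, which is the point of pushing the computation to the data.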
GP Basic Features
• Access Methods
– Compression, Column Store, Heap Store, External Tables, Indexes (GiST, GIN, R-Tree, Bitmap, B-Tree, …)
– Network Ingest / Export directly into parallel pipeline
– Logical Partitioning by Range, List
• Parallel Programming Languages
– SQL 2003 with Analytics
– MapReduce in Perl, Python, C, SQL, …
– PL/R, Python, Perl, C, pgSQL, SQL, …
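To make the MapReduce programming model mentioned above concrete, here is a minimal single-process sketch in plain Python. It does not use the Greenplum MapReduce interface; the `map_phase`, `shuffle`, and `reduce_phase` helpers and the word-count job are illustrative assumptions only.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as a parallel engine would do across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical job: count word occurrences.
def word_mapper(line):
    for word in line.lower().split():
        yield (word, 1)

def count_reducer(key, values):
    return sum(values)

lines = ["think big think fast", "big data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```

In a parallel engine the map and reduce steps would run on the nodes that hold the data, with the shuffle redistributing pairs by key; here all three run in one process to keep the sketch self-contained.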
From Enterprise Data Clouds
• Elastic / adaptive infrastructure for data warehousing and analytics
– IT Operations deploy pools of low-cost commodity infrastructure
  • Physical servers, virtual infrastructure, or onramp to public cloud
– DBAs and Analysts provision sandboxes and warehouses in minutes
  • Assemble the data they need (common, private, etc) for agile analytics
[Figure: Infrastructure pool with free capacity (IT Operations); Warehouses for Consumer Division, Packaged Goods, and Finance (DBA, Analyst); numeric labels omitted]
Proprietary & Confidential
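The provisioning model described on this slide (a shared pool of nodes from which warehouses and sandboxes are carved out and later returned) can be sketched roughly as below. The `NodePool` class, its method names, and the node counts are hypothetical and are not part of any Greenplum interface.

```python
class NodePool:
    """Hypothetical pool of commodity nodes shared by several warehouses."""

    def __init__(self, total_nodes):
        self.free = total_nodes
        self.warehouses = {}  # warehouse name -> nodes currently assigned

    def provision(self, name, nodes):
        """Create a warehouse, or expand an existing one, from free capacity."""
        if nodes <= 0:
            raise ValueError("nodes must be positive")
        if nodes > self.free:
            raise ValueError("not enough free nodes in the pool")
        self.free -= nodes
        self.warehouses[name] = self.warehouses.get(name, 0) + nodes

    def release(self, name):
        """Drop a warehouse and return its nodes to the free pool."""
        self.free += self.warehouses.pop(name)

# Assumed numbers, for illustration only.
pool = NodePool(total_nodes=100)
pool.provision("analyst_sandbox", 16)   # self-serve creation
pool.provision("analyst_sandbox", 8)    # elastic expansion
free_after_expand = pool.free
pool.release("analyst_sandbox")
free_after_release = pool.free
print(free_after_expand, free_after_release)
```

Treating creation and expansion as the same operation keeps the sketch small; a real system would also have to redistribute data when a warehouse grows.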
Use Case: Big Telco
Data Mart Consolidation
Goals:
• Reduce maintenance and support costs from proliferation of data mart platforms
• Reduce risks and exposure due to data in shadow IT systems
• Break down silo walls – provide a unified way to find and access all data
Approach:
• Embrace data – encourage ‘physical consolidation’ in advance of data model unification
• Provide ‘self serve’ model to bring shadow IT into the light
• Allow unified data access and pragmatic ‘logical’ data model unification incrementally
[Figure: Data Sources; US-West, 100 nodes]
Use Case: Big Ad Network
Project Sandboxes
Goals:
• Remove IT barriers to analyst productivity and value creation
• Dramatically reduce IT resource constraints and delays – i.e. realize ideas sooner
• Combine centralized ‘EDW’ data with freshly discovered feeds and other useful sources
Approach:
• Self-serve creation of project warehouses in minutes – and elastically expand as needed
• Load new data feeds without requiring formal modeling
• Bring together any data within the EDC – even if globally distributed – and analyze
[Figure: EDC with Self-Serve Dashboard; sites US-West (200 nodes), US-East (100 nodes), Europe (100 nodes), Asia (200 nodes); Analyst’s Private Data Feed; Analyst’s New Warehouse; numeric labels omitted]
GP is Software – Develop Now
• Download at:
– gpn.greenplum.com
– Get the VMware image or use it on OSX, Linux, Solaris
Think Big. Think Fast.
Greenplum Overview - Stanford University