InfoSphere Streams
Tushar
TusharKale
Kale
Big Data Evangelist – Streams Architect
[email protected]
Nov 2013
© 2013 IBM Corporation
Information Management
Agenda

Overview

Architecture

Customer Use Cases
2
© 2013 IBM Corporation
Information Management
Big Data = Variety, Velocity, and Volume
Extracting insight from an immense volume, variety and velocity of
data, in context, beyond what was previously possible.
3
Variety
Manage the complexity of multiple
relational and non-relational data
types and schemas
Velocity
Streaming data and large volume data
movement
Volume
Scale from terabytes to zettabytes
© 2013 IBM Corporation
Information Management
InfoSphere Streams
A Platform to Run In-Motion Analytics on BIG Data
Real time delivery
Volume
Handles up to Petabytes of
data per day
Variety
Supports traditional as well as
ICU
Monitoring
Algo
Trading
Cyber
Security
non-traditional data (Audio,
Video etc.)
Velocity
Complex
Analytics
Agility
4
Environment
Monitoring
Powerful
Analytics
Government /
Law enforcement
Telco churn
predict
Smart
Grid
Delivers insights with
microsecond latencies
Millions of
events per
second
Microsecond
Latency
Supports custom analytics
written in C++/Java and
warehouse analytic models
Traditional / Non-traditional
data sources
Single instance can support
multiple applications
© 2013 IBM Corporation
Information Management
Stream Computing Illustrated
tuple
directory: directory: directory: directory:
”/img" ”/img"
”/opt" ”/img"
filename: filename: filename: filename:
“farm” “bird” “java” “cat”
5
5
height:
640
width:
480
data:
height:
1280
width:
1024
data:
height:
640
width:
480
data:
© 2013 IBM Corporation
Information Management
What can Streams do for you?
 Analyze and react to events as they are happening
 Take advantage of more sources of data in “true” real time
 Build models on your most up-to-the-second information that will help
predict what happens next
 Streams is a middleware and language for building and running analytic
applications operating on data in motion
• Scale – easily handles a few events per second through multiple
millions of events per second
• Reaction time – possible to get actionable results in much less than a
second (< 20 micros possible)
 Enables TRUE situational awareness
6
© 2013 IBM Corporation
Information Management
BIG Data – Extending the Warehouse
Warehouse
Traditional /
Relational
Data Sources
Database &
Warehouse
Streams
Non-Traditional /
Non-Relational
Data Sources
At-Rest Data
Analytics
Results
In-Motion
Analytics
InfoSphere
Streams
Ultra Low Latency
Results
Non-Traditional/
Non-Relational
Data Sources
Internet
Scale
7
Internet Scale
Traditional/Relational
Data Sources
Data Analytics,
InfoSphere
BigInsights Data Operations &
Model Building
Results
© 2013 IBM Corporation
Information Management
Adaptive Analytics
Integrating Analytics on Data in Motion and Data at Rest
Visualization of realtime and historical
insights
Data Integration,
data mining,
machine learning,
statistical modeling
InfoSphere
Streams
1. Data Ingest
Data
2. Bootstrap/Enrich
Data ingest,
preparation,
online analysis,
model validation
Control
flow
InfoSphere
BigInsights,
Database &
Warehouse
3. Adaptive Analytics Model
8
© 2013 IBM Corporation
Information Management
Agenda

Overview

Architecture

Customer Use Cases
9
© 2013 IBM Corporation
Information Management
What are key differentiating technical capabilities of Streams?
Language built for Streaming
applications:
Reusable operators
Rapid application development
Continuous “pipeline” processing
Performance and Scaling:
Operator Fusing and Threading
Efficient use of cores
Distributed execution
Very fast data exchange
Use the data that
gives you a
competitive
advantage:
Can handle virtually
any data type
Use data that is too
expensive and time
sensitive for traditional
approaches
Easy to extend:
Built in adaptors
Users add capability with
familiar C++ and Java
Easy to manage:
Automatic placement
Extend applications incrementally
without downtime
Multi-user / multiple applications
10
10
Dynamic analysis:
Flexible and high
performance transport:
Programmatically change
topology at runtime
Create new subscriptions
Create new port properties
Very low latency
High data rates
© 2013 IBM Corporation
Information Management
InfoSphere Streams
Streams Processing
Language and IDE
Runtime
Environment
Tools and Technology
Integration
Front Office 3.0
Streams Studio
Eclipse IDE for SPL
Highly Scalable
stream processing runtime
Streams Console & Monitoring,
Built-in Stream Relational Analytics,
Adapters, Toolkits
Supported on x86 hardware, RedHat Enterprise Linux Version 5 (5.3 and up)
11
© 2013 IBM Corporation
Information Management
Terminology
stream A
 Application
stream
connection
• Data flow graph of operator instances connected to
each other via stream connections
O2
MySink
A
 Operator
O1
MySrc
• Reusable stream analytic
• Input ports: receives data / Output ports: produces data
• Source: No input ports / Sink: No output ports
 Operator Instance
O3
MySink
(stream<Type> A) as O1 = MySrc() {}
() as O2 = MySink(A) {}
() as O3 = MySink(A) {}
• A specific instantiation of an operator
 Stream
• Continuous series of tuples, generated by an operator instance’s output port
 Stream connection
• A stream connected to a specific operator instance input port
 PE
• A runtime process that executes a set of operator instances
 Job
• An application instance running on a set of hosts
12
© 2013 IBM Corporation
Information Management
InfoSphere Streams Programming Model
Source Adapters
Operator Repository
Sink Adapters
Application Programming (SPL)
Platform optimized compilation
13
© 2013 IBM Corporation
Information Management
Streams Core Analytical Capabilities
Streams Built-in Relational and Utility Operators
The Split operator is
used for dividing
incoming tuples into
separate streams for
parallel processing
The Delay operator is
used to “artificially”
slowdown a stream
The Aggregate
operator is used for
grouping and
summarization of
incoming tuples
The Functor operator is
used for performing tuplelevel manipulations
The Join operator is
used for correlating two
streams
The Punctor operator
is for inserting
punctuation marks in
streams
The Sort operator is used for
imposing an order on
incoming tuples in a stream
The Barrier operator
is used as a
synchronization point
And more!
14
© 2013 IBM Corporation
Information Management
Streams Core Adapter Capabilities
Streams Built-in Adapters and DB Toolkit
The ODBCSource
operator is used for
reading data from
databases, such as
DB2, IDS, Oracle
The ODBCAppend
operator is used for
writing data to
databases, such as
DB2, IDS, Oracle
The solidDBEnrich
operator is used for
extending streaming
data based on
lookups performed
from in-memory
database tables
The FileSource
operator is used for
reading data from
files in formats such
as csv, line, or binary
The TCP /
UDPSource operator
is used for reading
data from sockets in
formats such as csv,
line, or binary
15
The ODBCEnrich
operator is used for
extending streaming
data based on
lookups performed
from database tables
The FileSink
operator is used for
writing data to files in
formats such as csv,
line, or binary
The TCP / UDPSink
operator is used for
writing data to
sockets in formats
such as csv, line, or
binary
© 2013 IBM Corporation
Information Management
Extensibility
 User-defined operators that extend the language
– A reusable, generic operator model
• written in general purpose programming languages (C++/Java)
 User-defined functions that extend the language
 Toolkits: Set of domain-specific operators/functions
– Toolkits available as part of Streams
• DB toolkit
• Data mining toolkit
• Financial toolkit
– Streams Exchange on developerWorks
• Re-usable Assets and Forum
 Developers in two categories
– Application developers
– Toolkit developers
16
© 2013 IBM Corporation
Information Management
Static vs. Dynamic Composition
 Static connections
– Fully specified at application development-time and do not change at run-time
 Dynamic connections
– Partially specified at application development-time (Name or Properties)
– Established at run-time, as new jobs come and go
• Specifications can also be updated at run-time
 Dynamic application composition
– Incremental deployment of applications
– Dynamic adaptation of applications
17
© 2013 IBM Corporation
Information Management
Static vs. Dynamic Composition
 Static connections
– Fully specified at application development-time and do not change at run-time
 Dynamic connections
– Partially specified at application development-time (Name or Properties)
– Established at run-time, as new jobs come and go
• Specifications can also be updated at run-time
 Dynamic application composition
– Incremental deployment of applications
– Dynamic adaptation of applications
18
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime Architecture
Eclipse IDE and Management Tools
Language/Optimizing
Compiler
Admin Config /
Console
Management APIs
InfoSphere Streams Runtime running on a cluster – 125 blades
streamtool
Running anywhere inside the cluster
Streams
Web Service
Name Service
Root Service
Name Service
Partition Service
Streams Application
Manager
Streams Resource
Manager
Authorization and
Authentication
Service
Scheduler
Components running on management hosts
Host Controller
Processing
Element
Container Agent
Subset of a
SPL application
(a collection of operators)
Components running on application hosts
19
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime
 Streams is a distributed, multi-user, multi-instance system
• Multiple instances can run at the same time
• Can run jobs from multiple users
• A security model is provided for authentication and authorization
 Application management
• New jobs can be added/removed at any time
• New and existing jobs can connect to each other
• Scheduler assigns PEs to Hosts based on load
 Resource management
• Hosts & Services configuration and state
• System & Application Metrics
 Failure semantics
• Recovery of management services state
• PEs can be restarted or relocated upon failure
• All connections will be re-established once a PE restarts
• All state and in transit tuples are lost
• Checkpointing can be used to restore operator state
20
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime - cont’d
 Runs on commodity hardware
• From single node to blade centers to high performance multi-rack clusters
 Adapts to changes :
X86 Host
21
X86 Host
X86 Host
X86 Host
X86 Host
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime – cont’d
 Runs on commodity hardware
• From single node to blade centers to high performance multi-rack clusters
 Adapts to changes :
• In workloads
X86 Host
22
X86 Host
X86 Host
X86 Host
X86 Host
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime – cont’d
 Runs on commodity hardware
• From single node to blade centers to high performance multi-rack clusters
 Adapts to changes :
• In workloads
X86 Host
23
X86 Host
X86 Host
X86 Host
X86 Host
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime – cont’d
 Runs on commodity hardware
• From single node to blade centers to high performance multi-rack clusters
 Adapts to changes :
• In workloads
• In resources
X86 Host
24
X86 Host
X86 Host
X86 Host
X86 Host
© 2013 IBM Corporation
Information Management
InfoSphere Streams Runtime – cont’d
 Runs on commodity hardware
• From single node to blade centers to high performance multi-rack clusters
 Adapts to changes :
• In workloads
• In resources
X86 Host
25
X86 Host
X86 Host
X86 Host
X86 Host
© 2013 IBM Corporation
Information Management
Streams Studio Eclipse IDE
26
© 2013 IBM Corporation
Information Management
Streams Console – Metrics
27
© 2013 IBM Corporation
Information Management
Agenda

Overview

Architecture

Customer Use Case
28
© 2013 IBM Corporation
Information Management
Streaming Analytics in Action
Stock Market
 Impact of weather on securities prices
 Analyze market data at ultra-low latencies
Natural Systems
 Wildfire management
 Water management
Law Enforcement,
Defense & Cyber Security
 Real-time multimodal surveillance
 Situational awareness
 Cyber security detection
Transportation
 Intelligent traffic
management
Fraud Prevention
 Detecting multi-party fraud
 Real time fraud prevention
Manufacturing
 Process control for
microchip fabrication
e-Science
 Space weather prediction
 Detection of transient events
 Synchrotron atomic research
Health & Life Sciences
 Neonatal ICU monitoring
 Epidemic early warning
system
 Remote healthcare
monitoring
29
Other
Telephony
 CDR processing
 Social analysis
 Churn prediction
 Geomapping
 Smart Grid
 Text analysis
 Who’s talking to whom?
 ERP for commodities
 FPGA acceleration
© 2013 IBM Corporation
Information Management
Smarter Faster Cheaper CDR Processing
6 Billion CDRs per day, dedups over 7 days, processing latency from 12 hours to a few seconds
6 machines (using ½ processor capacity)
InfoSphere Streams
xDR Hub
Key Requirements:
Price/Performance and Scaling
30
© 2013 IBM Corporation
Information Management
Telco: Beyond CDR processing, building on existing insight
Mobile
Network
Call Data
Analytics
Network
Analytics
Customer
Interactions
31
Audio
Analytics
Weather
Location
Analytics
Social
Media
Social
Analytics
Call Quality
Analytics
Churn
Analytics
Campaign
Analytics
Database &
Warehouse
Business
Rules
…
…
…
Analytics
Analytics
Analytics
InfoSphere
Streams
© 2013 IBM Corporation
Information Management
Surveillance and Physical Security: TerraEchos (Business Partner)
 Use scenario
• State-of-the-art covert surveillance system
based on Streams platform
• Acoustic signals from buried fiber optic cables
are monitored, analyzed and reported in real
time for necessary action
• Currently designed to scale up to 1600 streams
of raw binary data
 Requirement
• Real-time processing of multi-modal signals
(acoustics. video, etc)
• Easy to expand, dynamic
• 3.5M data elements per second
 Winner 2010 IBM CTO Innovation Award
32
© 2013 IBM Corporation
Information Management
Cyber Security Analytics
IT I/S Firewalls
Live Packet
Capture
Processing Element
Container
Processing Element
Container
Processing Element
Container
Processing Element
Container
Processing Element
Container
InfoSphere
Streams

DNS / DHCP / Netflow sources


Botnet Behavior modeling
 Botnet nodes / Malware
 IP/MAC identifying suspects
External C&C Feeds (live DB queries)
Remediation Infrastructure / Ticketing
33
33
© 2013 IBM Corporation
Information Management
University of Ontario Institute of Technology (UOIT)
and Sick Kids Hospital
IBM Data Baby
http://youtu.be/ZiqY7p1v950
34
© 2013 IBM Corporation
Information Management
Intelligent Transportation
 Multimodal Data Streams
•
•
•
•
•
GPS
Counts, speeds, travel times
Public Transport
Pollution measurements
Weather Conditions
 Archiving of cleansed data
 Real Time Traffic Monitoring
 Real Time Traffic Information
GPS
Data
Streams
Real Time
Transformatio
n Logic
Real Time
Geo
Mapping
Real Time
Speed &
Heading
Estimation
Real Time
Aggregates
& Statistics
 (Multimodal) Travel Planner
Interactive
visualization
Only 4 x86 Blade servers to process
250,000 GPS probes per second
Web
Server
Google
Earth
35
Storage
adapters
Data
Warehouse
Offline
statistical
analysis
© 2013 IBM Corporation
Information Management
THINK
36
Nov 2013
© 2013 IBM Corporation
Information Management
Questions?
Nov 2013
© 2013 IBM Corporation
Information Management
38
© 2013 IBM Corporation
Descargar

IBM Presentations: Smart Planet Template