Queries over Streaming Sensor Data
Sam Madden
DB Lunch
October 12, 2001
Outline
• Background
• Server Side Solutions
  • Fjords, Sensor Proxies, CACQ
• Sensor Side Solutions
  • Catalog Management
  • Aggregation
• Future Work
Background: Sensor Networks
• Sensor networks
  • Small, low-cost, battery-powered microprocessors with 1–4 sensors
  • Light, temperature, vibration, acceleration, AC power, humidity
  • 10 kbit – 1 Mbit wireless networks, ~100 ft range
  • "Ad-hoc" networking – no predefined routes
• Cal, MIT, and UCLA OS and networking communities are committed
SmartDust
• Sensor nets are motivated by the "SmartDust vision" – millimeter-scale microprocessors, sensors, and wireless communication for pennies
• Deployed in the thousands, with no concern for the reliability of any single sensor
• Requires: position detection, fault tolerance, aggregation, etc.
Rene / Mica Motes
• SmartDust stand-in: ~2 cm x 3 cm, off-the-shelf parts
• Processor: Atmel 8535, 4 MHz, 5 mA
• Radio: RFM TR1000, 916 MHz, 10 kbit/s, ~25 mJ/msg, 20–30 msg/sec
• Memory: 512 B RAM, 8 KB flash (read-only), 32 KB EEPROM (slow)
• Power: 575 mAh battery; peak load 19.5 mA, idle 3.1 mA, sleeping 10 µA
TinyOS
• Lightweight OS for sensors
  • Event-based
  • Active-message, multi-hop networking
  • Auto-idling
  • Network reprogramming, time synchronization, etc.

J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister. System Architecture Directions for Networked Sensors. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
Applications of Sensor Nets
• Space Monitoring
• Power, light, temp in buildings
• Temperature, humidity
• Traffic
• Military
• Structural
• Personal Networks
Database Opportunities
• All applications depend on data processing
• A declarative query language over sensors is attractive
• Want "to combine and aggregate data streaming from motes"
• Sounds like a database…
Database Challenges
• Sensors are unreliable
  • Come on- and offline, with variable bandwidth
• Sensors push data
• Sensors stream data
• Sensors have limited memory, power, and bandwidth
• Sensors have processors
Outline
• Background
• Server Side Solutions
  • Fjords, Sensor Proxies, CACQ
• Sensor Side Solutions
  • Catalog Management
  • Aggregation
• Future Work
Fjords
• A query plan abstraction to handle unreliable, streaming, push-based data
• Combines push and pull in arbitrary combinations
  • Uses connectors between operators to isolate them from the direction of flow
  • "Bracket Model" – Graefe '93
Fjords (Continued)
• Operators assume a non-blocking queue interface between each other
• Queues implement push vs. pull (sketch below)
  • Pull from A to B: suspend A, schedule B until it produces data. A cannot go forward until B produces data.
  • Push from B to A: A polls; a scheduler thread invokes B until it produces data. A can process other inputs while waiting for B.
• Supports parallelism between operators via queues, state machines, and the OS (e.g. NIC buffers, DMA) in an operator-transparent way
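To make the push/pull distinction concrete, here is a minimal Python sketch of the queue idea; the class and method names are illustrative, not the Telegraph implementation:

from collections import deque

# A pull queue drives its producer until data appears; a push queue is
# non-blocking, so a consumer that gets None can process other inputs.

class PullQueue:
    def __init__(self, producer):
        self.producer = producer            # callable returning a tuple or None

    def get(self):
        tup = None
        while tup is None:                  # schedule the producer until it yields
            tup = self.producer()
        return tup

class PushQueue:
    def __init__(self):
        self.buf = deque()

    def put(self, tup):                     # scheduler deposits arriving tuples
        self.buf.append(tup)

    def get(self):
        return self.buf.popleft() if self.buf else None   # never blocks

An operator calls get() on all of its input queues the same way, without knowing which side of each queue drives the dataflow.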
Fjords Example
[Figure: animation of a Fjord query plan – tuples from two push-based sensor inputs and one pull-based input flow up through the operator tree via queues.]
Samuel Madden, Michael J. Franklin. Fjording the Stream: An Architecture for Queries over Streaming Sensor Data. International Conference on Data Engineering, February 2002. To appear.
Fjords Applications
• Combine traffic sensor streams with web-based accident reports
Francis Li, Sam Madden, Megan Thomas. Traffic Visualization. http://www.cs.berkeley.edu/~mct/infovis/project/traffic.html
Operators for Streaming Data
• Need special operators for dealing with streams (see P. Seshadri et al. The Design and Implementation of a Sequence Database System. VLDB '96)
  • In particular, streams can't be joined or sorted in the traditional sense
  • Solution: use windows – e.g. a "zipper join" (sketch below)
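The slides don't spell the zipper join out, so the following is a hedged Python sketch of one plausible reading: walk two timestamp-ordered streams in lockstep, keep a sliding window of recent tuples on each side, and join each new arrival against the opposite window:

from collections import deque

def window_join(stream_a, stream_b, window):
    """Join two timestamp-ordered streams of (ts, value) pairs, emitting
    (a, b) pairs whose timestamps differ by at most `window`."""
    buf_a, buf_b = deque(), deque()
    it_a, it_b = iter(stream_a), iter(stream_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None or b is not None:
        # "Zip" through both streams, always consuming the earlier tuple.
        if b is None or (a is not None and a[0] <= b[0]):
            tup, buf_own, buf_other, from_a = a, buf_a, buf_b, True
            a = next(it_a, None)
        else:
            tup, buf_own, buf_other, from_a = b, buf_b, buf_a, False
            b = next(it_b, None)
        # Expire tuples that have slid out of the window on the other side.
        while buf_other and tup[0] - buf_other[0][0] > window:
            buf_other.popleft()
        # Join the new arrival against everything still in the window.
        for other in buf_other:
            yield (tup, other) if from_a else (other, tup)
        buf_own.append(tup)

quakes = [(1, 'q1'), (4, 'q2')]
trucks = [(2, 'v1'), (9, 'v2')]
print(list(window_join(quakes, trucks, window=3)))
# [((1, 'q1'), (2, 'v1')), ((4, 'q2'), (2, 'v1'))]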
Sensor Proxy
• An energy-sensitive database operator
  • Buffers sensor tuples and routes them to multiple user queries, hiding query load from the sensors (sketch below)
  • Pushes aggregation operators into the sensors to reduce communication load
  • Dynamically adjusts the sample rate based on user demand
  • Pushes results into Fjords so that other operators don't block waiting on slow or dead sensors
Some Results
• Pushing predicates into sensors can vastly reduce costs:
[Chart: power drain (W) vs. sampling method, measured on the Atmel simulator at 100 samples/sec with 5 vehicles/sec. Transmitting a reading per detected vehicle instead of per sample yields a 7x power savings.]
CACQ
• Expect hundreds to thousands of queries over the same sensor sources
• Continuously Adaptive Continuous Queries
  • Continuous queries: long-running queries whose selections and joins are combined across queries to improve efficiency (see Chen et al., NiagaraCQ, SIGMOD 2000)
[Figure: two continuous queries over the same stock-quote stream – Query 1 selects Stocks.symbol = 'MSFT', Query 2 selects Stocks.symbol = 'AAPL' – sharing a single scan of the source.]
CACQ (Continued)
• Continuous adaptivity comes from Eddies
• Route tuples differently depending on the selectivity and cost estimates of operators (sketch below)
[Figure: a static dataflow plan vs. an eddy that routes tuples adaptively among the operators.]
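As a hedged illustration only (real eddies use lottery scheduling over observed selectivities; the cheapest-first policy below is a stand-in), an eddy can be sketched as a per-tuple router:

def eddy(tuples, operators):
    """operators: list of (name, predicate, est_cost). Tuples that pass
    every operator are emitted; the visiting order is chosen per tuple."""
    for tup in tuples:
        done, alive = set(), True
        while alive and len(done) < len(operators):
            # Stand-in policy: cheapest remaining operator first.
            name, pred, _ = min((op for op in operators if op[0] not in done),
                                key=lambda op: op[2])
            done.add(name)
            alive = pred(tup)
        if alive:
            yield tup

ops = [('sel_light', lambda t: t['light'] > 100, 1.0),
       ('sel_temp',  lambda t: t['temp'] < 30,   0.5)]
stream = [{'light': 150, 'temp': 25}, {'light': 80, 'temp': 10}]
print(list(eddy(stream, ops)))   # only the first tuple survives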
CACQ (Continued)
• Combining CA with CQ is a win:
  • CQ increases the number of simultaneous queries
  • Adaptivity is well suited to long-running queries
  • Eddies let us avoid the ugly query-optimization phase of traditional CQ
  • Eddies + streams == few copies, unlike traditional CQ
• Look for a paper in SIGMOD 2002 (fingers crossed!)
Outline
• Background
• Server Side Solutions
  • Fjords, Sensor Proxies, CACQ
• Sensor Side Solutions
  • Catalog Management
  • Aggregation
• Future Work
Sensor Side Solutions
• CACQ + Fjords provide the interface and performance on the query-processing side, but sensors still need help:
  • Locate / identify sensors
  • Reduce power consumption
    • Take advantage of on-board processors?
  • Improve responsiveness
Cataloging Sensors
• To query sensors, we need a way to locate them, identify their properties, and extract values
• Goal: drop a bunch of sensors around the DBMS and allow them to be queried without manual effort
• Idea: add a layer to each sensor which advertises its capabilities
Catalog (Continued)
• Compiles into 27 bytes of memory
• A layer registers the sensor with Telegraph
• Can be "push" or "pull"

#temperature sensor
field {
  name : "temp"            #optional
  type : int
  units : celsius
  min : -20
  max : 100
  bits : 8
  sample_cost : 10.0 J     #optional -- for use in costing
  sample_time : 10.0 ms    #optional -- for use in costing
  input : adc2             #optional -- read from ADC channel 2
  sends : ondemand
  accessorEvent : GET_TEMPERATURE_DATA
  responseEvent : TEMPERATURE_DATA_READY
}
Aggregating Over Sensors
• The sensor proxy combines user queries and pushes down aggregates
• Goal: save energy, increase efficiency
• Idea: take advantage of the routing hierarchy (example soon!)
Why Bother with Aggregation?
• Individual sensor readings are of limited use
  • The interest is in higher-level properties, e.g. which vehicles drove through, or the spread of temperatures in the building
  • We have a processor and a network on board – let's use them
• We cannot survive without aggregation
  • Delivering a message to all nodes is much easier than delivering a message from each node to a central point
  • Delivering a large amount of data from every node is harder still (see the connectivity experiment)
  • Forwarding raw information is too expensive
    • Scarce energy
    • Scarce bandwidth
    • Multihop performance penalty
Aggregation Challenges
• Inherently unreliable environment; certain information is unavailable or expensive to obtain:
  • How many nodes are present?
  • How many nodes are supposed to respond?
  • What is the error distribution (in particular, what about malicious nodes)?
• Information trickles in one message at a time
• Trying to build an infrastructure that removes all uncertainty from the application may not be feasible – do we want to build distributed transactions?
• We never have complete and up-to-date information about the neighborhood
• What type of information should we expect from aggregation?
  • Streams
  • Robust estimates
Scenario: Count
Goal: count the number of nodes in the network. The number of children is unknown.
[Figure: a five-node multihop network with node 1 as the root.]

The counts held at each sensor evolve one communication step at a time (shown originally as an animation). A node reports 1 for itself plus the counts reported by its children; a node with two parents splits its count in half between them (hence the 1/2 terms), so the root converges on the true total of 5:

Time | Sensor 1 | Sensor 2 | Sensor 3 | Sensor 4 | Sensor 5
t1   | 1        | -        | -        | -        | -
t2   | 1        | 1        | 1        | -        | -
t3   | 1+2      | 1        | 1        | 1        | -
t4   | 1+2      | 1+1/2    | 1+1/2    | 1        | 1
t5   | 1+3      | 1+1/2    | 1+1/2    | 1+1      | 1
t6   | 1+3      | 1+2/2    | 1+2/2    | 1+1      | 1
t7   | 1+4      | 1+2/2    | 1+2/2    | 1+1      | 1
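A hedged Python simulation of the scheme the table suggests. The topology is an assumption read off the 1/2 terms: node 1 is the root, nodes 2 and 3 are its children, node 4 reports to both 2 and 3, and node 5 reports to 4:

parents = {2: [1], 3: [1], 4: [2, 3], 5: [4]}   # node -> list of parents

def epoch_counts(rounds):
    count = {n: 0.0 for n in range(1, 6)}        # last reported counts
    for _ in range(rounds):
        new = {n: 1.0 for n in range(1, 6)}      # every node counts itself
        for child, ps in parents.items():
            for p in ps:
                new[p] += count[child] / len(ps) # split among the parents
        count = new
    return count

print(epoch_counts(5)[1])   # 5.0 -- the root's count stabilizes at the
                            # network size after a few epochs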
Counting Lessons
• Take advantage of redundancy to improve accuracy (reply to all parents, not just one)
• Use broadcast to reduce the number of messages
• The result is a stream of values: much more robust to failures, movement, or collisions than a single value
Aggregation in Network Programming
• The network programming problem
  • Reliable delivery of a large number of messages to all nodes in range, while exploiting the broadcast nature of the medium
• Basic setup
  • Broadcast a known number of idempotent program fragments
  • Each node keeps a bitmap of fragments received (1 = packet received)
  • Two stages of the problem: single hop and multihop
Solutions
• Single hop, dense cell
  • Broadcasting the program: trivial – the central node broadcasts
  • Feedback from nodes: the central node broadcasts a request – is anyone missing packets in this packet range?
  • Convergence: no replies to the request
Aggregation in Multihop Network Programming
• Broadcasting the program – use flooding
  • Remember the last 8 packets forwarded; use that cache to decide whether or not to forward
• Feedback from nodes (sketch below)
  • Distribute requests for feedback using the same flooding
  • After some delay, respond if any packets are missing locally
  • Responses from children: AND them with the local bitmap, store the result locally, forward the request
    • Suboptimal because there are no local fixups
• Convergence: no replies to the request
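A minimal sketch of the bitmap-AND aggregation (fragment count and values are made up for illustration):

NUM_FRAGMENTS = 8

def combine_reports(local_bitmap, child_reports):
    """Bitmaps are ints with bit i set iff fragment i was received.
    AND-ing keeps a 0 wherever *anyone* below is missing a fragment."""
    combined = local_bitmap
    for report in child_reports:
        combined &= report
    return combined

combined = combine_reports(0b11111011, [0b11111111, 0b10111011])
missing = [i for i in range(NUM_FRAGMENTS) if not (combined >> i) & 1]
print(missing)   # [2, 6] -- fragments the root must rebroadcast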
Aggregation over Streams
• Inherent uncertainty in the system
  • Can nodes communicate? Do they have enough power? Have they moved?
  • Computing a complete, single answer can be very expensive, and may not even be possible
• Partial estimates have their own value
Aggregation over Streams (Continued)
• Values reflect the current best estimates
• Self-stabilizing: in the absence of changes, converges to the desired value within N steps
What Does It Mean to Aggregate? (The DB Perspective)
• General-purpose solution: apply standard aggregation operators like COUNT, MIN, MAX, AVERAGE, and SUM to any set of sensors
  • The previous examples were application-specific
  • In sensors, operators may be arbitrary signal-processing functions
• Provide grouping semantics, e.g. 'select avg(temp) group by trunc(light/10)'
  • In sensor networks, groups may be random samples
[Figure: sensor tuples t1–t9 partitioned into groups.]
Identifying Groups
• Need a way to identify groups
  • Idea: a set of membership criteria is pushed down (sketch below)
  • Nodes determine their membership based on those criteria
  • Nodes can be in multiple, but not unlimited, groups
  • E.g. "Group 1: 0 <= t < 10, Group 2: 10 <= t < 20, …"
• Need a way to evaluate aggregation predicates by group
• May want to allow grouping and aggregation predicates to be expressed together, to take advantage of broadcast effects
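A hedged sketch of pushed-down membership criteria, using the ranges from the slide (the (group_id, low, high) encoding and the group cap are assumptions):

criteria = [(1, 0, 10), (2, 10, 20)]   # (group_id, low, high) over t

def groups_for(reading, criteria, max_groups=4):
    # A node tags its reading with every group whose range contains it,
    # capped because nodes can't track unlimited groups.
    ids = [gid for gid, lo, hi in criteria if lo <= reading < hi]
    return ids[:max_groups]

print(groups_for(7, criteria))    # [1]
print(groups_for(15, criteria))   # [2]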
Local Query Rewrite
• An intermediate node may determine that it is faster to evaluate an aggregate by asking its children a different question
  • Example 1: MAX(t). Once we have a guess T for the MAX, ask children to report iff t > T, rather than asking all children to compute a local maximum (sketch below).
  • Example 2: network programming. Rather than asking nodes what packets they have, ask them to report iff packets are missing.
• Is this a general technique? Maybe:
  • Inform a child of a guess at the aggregate, and ask it to refute it
  • Works for AVERAGE (within an error bound), but not for COUNT
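A small sketch of the MAX(t) rewrite (guess and readings are made up for illustration): the parent broadcasts its guess, and a child transmits only if it can refute it, so most children stay silent and save radio energy:

def child_response(local_max, guess):
    """A child transmits only if it can refute the parent's guess."""
    return local_max if local_max > guess else None

def collect_max(guess, child_maxima):
    refutations = [r for r in (child_response(m, guess) for m in child_maxima)
                   if r is not None]
    return max(refutations, default=guess), len(refutations)

best, messages = collect_max(guess=70, child_maxima=[55, 68, 73, 61])
print(best, messages)   # 73 1 -- one radio message instead of four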
Wins and Pitfalls of Aggregation
• Aggregate over the natural network topology
  • Aggregation over an arbitrary subset of the network may be a loss
• Really dense cells
  • Aggregation does not help with the starvation problem
  • Use message suppression via the query-rewrite technique
  • Still beneficial in a multihop scenario
Advanced Aggregation Tricks
• Break the network protocol boundary
• Use the analog reading of the channel over time to determine aggregates. Simple example: nodes transmit their readings one bit per time slot, least-significant bit first, and the channel level in each slot adds the transmitted bits:

  Reading = 11 = 110100 (LSB first)
  Reading = 21 = 101010 (LSB first)
  Sum: per-slot channel levels are 2, 1, 1, 1, 1, 0, so the total is 2 + 2 + 4 + 8 + 16 = 32
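The arithmetic of this trick, as a short Python check (the digital simulation stands in for the analog channel):

def analog_sum(readings, bits=6):
    # Channel level in each slot = number of nodes that transmitted a 1.
    levels = [sum((r >> slot) & 1 for r in readings) for slot in range(bits)]
    # Weight each level by its slot's place value to recover the sum.
    return sum(level << slot for slot, level in enumerate(levels))

print(analog_sum([11, 21]))   # 32, matching the slide's example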
Outline
• Background
• Server Side Solutions
  • Fjords, Sensor Proxies, CACQ
• Sensor Side Solutions
  • Catalog Management
  • Aggregation
• Future Work
Future Work
• DBMS side
  • Efficient catalog management
  • Query optimization techniques
• Sensor side
  • Moving object databases
  • Efficient grouping
  • Joins over network topology
  • Non-standard aggregate functions
• Somewhere in between
  • Histograms and other correlations
  • Sampling and compression for streams