```Something for almost nothing:
algorithms
Ronitt Rubinfeld
MIT and Tel Aviv U.
Algorithms for REALLY big data
No time
What can we hope to do without viewing
most of the data?
Small world phenomenon
• The social network is a
graph:
– “node’’ is a person
– “edge’’ between people that
know each other
• “6 degrees of separation’’
• Are all pairs of people
connected by path of
distance at most 6?
Vast data
• Impossible to access all of it
• Accessible data is too enormous to be
viewed by a single individual
• Once accessed, data can change
The Gold Standard
• Linear time algorithms:
– for inputs encoded by n
bits/words, allow cn time
steps (constant c)
What can we hope to do without viewing
most of the data?
• Can’t answer “for all” or “exactly” type statements:
• are all individuals connected by at most 6 degrees of
separation?
• exactly how many individuals on earth are left-handed?
• Compromise?
• is there a large group of individuals connected by at most 6
degrees of separation?
• approximately how many individuals on earth are left-handed?
What types of approximation?
Property Testing
• Does the input object have crucial properties?
• Example Properties:
•
•
•
•
•
•
Clusterability,
Small diameter graph,
Close to a codeword,
Linear or low degree polynomial function,
Increasing order
Lots and lots more…
“In the ballpark” vs. “out of the ballpark”
tests
• Property testing: Distinguish inputs that have specific property
from those that are far from having that property
• Benefits:
– Can often answer such questions much faster
– May be the natural question to ask
• When some “noise” always present
• When data constantly changing
• Gives fast sanity check to rule out very “bad” inputs (i.e., restaurant
bills)
• Model selection problem in machine learning
Examples
• Can test if a function is a homomorphism in CONSTANT
TIME [Blum Luby R.]
• Can test if the social network has 6 degrees of separation
in CONSTANT TIME [Parnas Ron]
Constructing a property tester:
• Find characterization of property that is
• Efficiently (locally) testable
• Robust • objects that have the property satisfy characterization,
• and objects far from having the property are unlikely to
PASS
Usually the
bigger
challenge
Example: Homomorphism property of
functions
∀x,y f(x)+f(y) = f(x+y)
For most x f(x)+f(1) = f(x+1)
• Good characterization:
For most x,y f(x)+f(y) = f(x+y)
Example: 6 degrees of separation
For every node, all other nodes within distance 6.
For most nodes, all other nodes within distance 6.
• Good characterization:
For most nodes, there are many other nodes within distance 6.
An example in depth
Monotonicity of a sequence
• Given: list y1 y2 ... yn
• Question: is the list sorted?
• Clearly requires n steps – must look at each yi
Monotonicity of a sequence
• Given: list y1 y2 ... yn
• Question: can we quickly test if the list close to sorted?
What do we mean by ``quick’’?
• query complexity measured in terms of list size n
• Our goal (if possible):
• Very small compared to n, will go for clog n
What do we mean by “close’’?
Definition: a list of size n is .99-close to sorted if can delete at
most .01n values to make it sorted. Otherwise, .99-far.
Sorted: 1 2 4 5 7 11 14 19 20 21 23 38 39 45
Close: 1 4 2 5 7 11 14 19 20 39 23 21 38 45
1 4
5 7 11 14 19 20
23
38 45
Far: 45 39 23 1 38 4 5 21 20 19 2 7 11 14
1
4 5
7 11 14
Requirements for algorithm:
• pass sorted lists
• if list passes test, can change at most .01 fraction of list to
make it sorted
An attempt:
• Proposed algorithm:
• Pick random i and test that yi≤yi+1
• 1,2,3,4,5,…j, 1,2,3,4,5,….j, 1,2,3,4,5,…j, 1,2,3,4,5,…,j
• Difficult for this algorithm to find “breakpoint”
• But other algorithms work well
yi
i
A second attempt:
• Proposed algorithm:
• Pick random i<j and test that yi≤yj
• n/m groups of m elements
m,m-1,m-2,…,1, 2m,2m-1,2m-2,…m+1, 3m,3m-1,3m-2,…,
• must pick i,j in same group
• need at least (n/m)1/2 choices to do this
yi
i
A test that works
• The test: (for distinct yi)
• Test several times:
•
•
•
•
•
Pick random i
Look at value of yi
Do binary search for yi
Does the binary search find any inconsistencies? If yes, FAIL
Do we end up at location i? If not FAIL
• Pass if never failed
• Running time: O(log n) time
• Why does this work?
• If list is in order, then test always passes
• If the test passes on choice i and j, then yi and yj are in correct order
• Since test usually passes, most yi’s in the right order
Many more properties studied!
• Graphs, functions, point sets, strings, …
• Amazing characterizations of problems testable in
graph and function testing models!
Properties of functions
• Linearity and low total degree polynomials
[Blum Luby R.] [Bellare Coppersmith Hastad Kiwi Sudan] [R. Sudan] [Arora Safra]
[Arora Lund Motwani Sudan Szegedy] [Arora Sudan] ...
• Functions definable by functional equations –
trigonometric, elliptic functions [R.]
• All locally characterized affine invariant function
classes! [Bhattacharyya Fischer Hatami Hatami Lovett]
• Groups, Fields [Ergun Kannan Kumar R. Viswanathan]
• Monotonicity [EKKRV] [Goldreich Goldwasser Lehman Ron
Samorodnitsky] [Dodis Goldreich Lehman Ron Raskhodnikova
Samorodnitsky][Fischer Lehman Newman Raskhodnikova R. Samorodnitsky]
• Convexity, submodularity
[Parnas Ron R.] [Fattal Ron] [Seshadri
Vondrak]…
• Low complexity functions, Juntas
[Parnas Ron Samorodnitsky] [Fischer Kindler Ron Safra Samorodnitsky]
[Diakonikolas Lee Matulef Onak R. Servedio Wan]…
Properties of sparse and general graphs
• Easily testable dense and hyperfinite graph properties are
completely characterized! [Alon Fischer Newman
Shapira][Newman Sohler]
• General Sparse graphs: bipartiteness, connectivity,
diameter, colorability, rapid mixing, triangle free,…
[Goldreich Ron] [Parnas Ron] [Batu Fortnow R. Smith White]
[Kaufman Krivelevich Ron] [Alon Kaufman Krivelevich Ron]…
• Tools: Szemeredi regularity lemma, random walks, local
search, simulate greedy, borrow from parallel algorithms
Some other combinatorial properties
• Set properties – equality, distinctness,...
• String properties – edit distance, compressibility,…
• Metric properties – metrics, clustering, convex hulls,
embeddability...
• Membership in low complexity languages –
regular languages, constant width branching programs,
context-free languages, regular tree languages…
• Codes – BCH and dual-BCH codes, Generalized ReedMuller,…
• Output number close to value of the optimal solution (not
enough time to construct a solution)
• Some examples:
•
•
•
•
•
Minimum spanning tree,
vertex cover,
max cut,
positive linear program,
edit distance, …
Example: Vertex Cover
• Given graph G(V,E), a vertex cover (VC) C is a
subset of V such that it “touches” every edge.
• What is minimum size of a vertex cover?
• NP-complete
• Poly time multiplicative 2-approximation based on
relationship of VC and maximal matching
Vertex Cover and Maximal Matching
• Maximal Matching:
• M ⊆ E is a matching if no node in in more than one edge.
• M is a maximal matching if adding any edge violates the
matching property
• Note: nodes matched by M form a pretty good
Vertex Cover!
• Vertex cover must include at least one node in each edge
of M
• Union of nodes of M form a vertex cover
• So |M| ≤ VC ≤ 2 |M|
“Classical” approximation examples
• Can get CONSTANT TIME approximation for vertex cover
on sparse graphs!
• Output y which is at most 2∙ OPT + ϵn
How?
• Oracle reduction framework [Parnas Ron]
•
•
Construct “oracle” that tells you if node u in 2-approx vertex
cover
Use oracle + standard sampling to estimate size of cover
But how do you implement the oracle?
Implementing the oracle – two
approaches:
• Sequentially simulate computations of a local
distributed algorithm [Parnas Ron]
• Figure out what greedy maximal matching
algorithm would do on u [Nguyen Onak]
Greedy algorithm for maximal matching
• Algorithm:
• ←∅
• For every edge (u,v)
• If neither of u or v matched
• Output M
• Why is M maximal?
• If e not in M then either u or v already matched by earlier
edge
Implementing the Oracle via Greedy
• To decide if edge e in matching:
• Must know if adjacent edges that come before e
in the ordering are in the matching
• Do not need to know anything about edges
coming after
• Arbitrary edge order can have long
dependency chains!
1
2
4
8
25
36
47
88
89
Odd or even
steps from
beginning?
110 111 112 113
Breaking long dependency chains
[Nguyen Onak]
• Assign random ordering to edges
• Greedy works under any ordering
• Important fact: random order has short
dependency chains
Better Complexity for VC
• Always recurse on least ranked edge first
• Heuristic suggested by [Nguyen Onak]
• Yields time poly in degree [Yoshida Yamamoto Ito]
• Additional ideas yield query complexity nearly
linear in average degree for general graphs [Onak Ron
Rosen R.]
Further work
• More complicated arguments for maximum
matching, set cover, positive LP… [Parnas Ron + Kuhn
Moscibroda Wattenhofer] [Nguyen Onak] [Yoshida Yamamoto Ito]
Can dependence be
degree?
• Even better results for hyperfinite graphs [Hassidim
Kelner Nguyen Onak][Newman Sohler]
• e.g., planar
No samples
What if data only accessible via random
samples?
Distributions
Play the lottery?
Is the lottery unfair?
• From Hitlotto.com: Lottery experts agree,
past number histories can be the key to
predicting future winners.
True Story!
• Polish lottery Multilotek
• Choose “uniformly” at random distinct 20 numbers
out of 1 to 80.
• Initial machine biased
• e.g., probability of 50-59 too small
• Past results:
http://serwis.lotto.pl:8080/archiwum/wyniki_wszystkie.php?id_gra=2
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
New Jersey Pick 3,4 Lottery
• New Jersey Pick k ( =3,4) Lottery.
• Pick k digits in order.
• 10k possible values.
• Assume lottery draws iid
• Data:
• Pick 3 - 8522 results from 5/22/75 to 10/15/00
• 2-test gives 42% confidence
• Pick 4 - 6544 results from 9/1/77 to 10/15/00.
• fewer results than possible values
• 2-test gives no confidence
Distributions on BIG domains
• Given samples of a distribution, need to know, e.g.,
•
•
•
•
entropy
number of distinct elements
“shape” (monotone, bimodal,…)
closeness to uniform, Gaussian, Zipfian…
• No assumptions on shape of distribution
• i.e., smoothness, monotonicity, Normal distribution,…
• Considered in statistics, information theory, machine
learning, databases, algorithms, physics, biology,…
Key Question
• How many samples do you need in terms of
domain size?
• Do you need to estimate the probabilities of each
domain item?
• Can sample complexity be sublinear in size of the
domain?
Rules out standard statistical techniques,
learning distribution
Our Aim:
Algorithms with sublinear sample complexity
Similarities of distributions
Are p and q close or far?
• p is given via samples
• q is either
• known to the tester (e.g. uniform)
• given via samples
Is p uniform?
• Theorem: ([Goldreich Ron] [Batu
p
Fortnow R. Smith White] [Paninski])
samples
Test
Pass/Fail?
Sample complexity of
distinguishing
=
/2
1
from |  −  |1 >  is ( )
• Nearly same complexity to test
if p is any known distribution
[Batu Fischer Fortnow Kumar R.
1
White]:“Testing
identity”
−  1 = Σ  −

Upper bound for L2 distance [Goldreich Ron]
• L2 distance:
−
2
2
= ∑  −
• ||p-U||22 = S(pi -1/n)2
= Spi2 - 2Spi /n + S1/n2
= Spi2 - 1/n
2
========
• Estimate collision probability to estimate L2
distance from uniform
Testing uniformity [GR, Batu et. al.]
• Upper bound: Estimate collision probability and use
known relation between between L1 and L2 norms
• Issues:
• Collision probability of uniform is 1/n
• Use O(sqrt(n)) samples via recycling
• Comment: [P] uses different estimator
• Easy lower bound: (n½)
• Can get  (n½/2) [P]
Back to the lottery…
plenty of samples!
Is p uniform?
• Theorem: ([Goldreich Ron][Batu
p
Fortnow R. Smith White] [Paninski])
samples
Test
Pass/Fail?
Sample complexity of
distinguishing
p=U
from |p-U|1> is (n1/2)
• Nearly same complexity to test
if p is any known distribution
[Batu Fischer Fortnow Kumar R.
White]: “Testing identity”
Testing closeness of two distributions:
Transactions of 20-30 yr olds
trend change?
Transactions of 30-40 yr olds
Testing closeness
p
q
Test
Pass/Fail?
Theorem: ([BFRSW] [P. Valiant])
Sample complexity of
distinguishing
p=q
from ||p-q||1 >
~ 2/3
is (n )
Why so different?
• Collision statistics are all that matter
• Collisions on “heavy” elements can hide collision
statistics of rest of the domain
• Construct pairs of distributions where heavy
elements are identical, but “light” elements are
either identical or very different
Output ||p-q||1 ±
need (n/log n) samples [G. Valiant P. Valiant]
Information theoretic quantities
• Entropy
• Support size
Information in neural spike trails
[Strong, Koberle, de Ruyter van Steveninck, Bialek ’98]
• Each application of stimuli
gives sample of signal (spike
trail)
Neural signals
time
• Entropy of (discretized) signal
indicates which neurons
respond to stimuli
Compressibility of data
Can we get multiplicative
approximations?
• In general, no….
• 0 entropy distributions are hard to distinguish
• What if entropy is bigger?
• Can g-multiplicatively approximate the entropy with Õ(n1/g2)
samples (when entropy >2g/) [Batu Dasgupta R. Kumar]
• requires (n1/g2) [Valiant]
• better bounds when support size is small [Brautbar
Samorodnitsky]
• Similar bounds for estimating support size [Raskhodikova Ron
R. Smith] [Raskhodnikova Ron Shpilka Smith]
More properties:
• Independence: [Batu Fischer Fortnow Kumar R. White]
• Limited Independence: [Alon Andoni Kaufman Matulef R.
Xie] [Haviv Langberg]
• K-flat distributions [Levi Indyk R.]
• K-modal distributions [Daskalakis Diakonikolas Servedio]
• Poisson Binomial Distributions [Daskalakis Diakonikolas
Servedio]
• Monotonicity over general posets [Batu Kumar R.]
[Bhattacharyya Fischer R. P. Valiant]
• Properties of multiple distributions [Levi Ron R.]
Conclusion:
• For many problems, we need a lot less time
and samples than one might think!
• Many cool ideas and techniques have been
developed
• Lots more to do!
Thank you!
```