Challenges in Combinatorial Scientific Computing
John R. Gilbert
University of California, Santa Barbara
SIAM Annual Meeting
July 10, 2009
Support: DOE Office of Science, NSF, DARPA, SGI
Combinatorial Scientific Computing
“I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were ‘sparse’ in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care.”

  - Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics
2002 SIAM Annual Meeting
IC10:  Mathematical Analysis of Hyperlinks in the World-Wide Web
IP3:   Discrete Mathematics and Theoretical Computer Science
VNL:   The Human Genome and Beyond
IC18:  Finding Provably Near-Optimal Solutions to Discrete Optimization Problems
CP6:   Computer Science: Discrete Algorithms
MS54:  New Approaches for Scalable Sparse Linear System Solution
MS71:  Special Session on IR Algorithms and Software
MS73:  Reverse Engineering Gene Networks
MS86/110:  Combinatorial Algorithms in Scientific Computing
2007 SIAM Annual Meeting
IC7:   Combinatorics Inspired by Biology
IC12:  Parallel Network Analysis
IC14:  On the Complexity of Game, Market, and Network Equilibria
MS1/8:      Adaptive Algebraic Multigrid Methods
MS4/11/26:  Mathematical Challenges in Cyber Security
MS36/43/55: High Performance Computing on Massive Real-World Graphs
MS47:  Optimization and Graph Algorithms
MS79:  Enumerative and Geometric Combinatorics
MS71:  Modeling of Large-Scale Metabolic Networks
MS75:  Combinatorial Scientific Computing
An analogy?
Continuous physical modeling  →  Linear algebra  →  Computers

As the “middleware” of scientific computing, linear algebra has supplied or enabled:
• Mathematical tools
• “Impedance match” to computer operations
• High-level primitives
• High-quality software libraries
• Ways to extract performance from computer architecture
• Interactive environments
An analogy?
Continuous physical modeling  →  Linear algebra  →  Computers
Discrete structure analysis   →  Graph theory    →  Computers
An analogy? Well, we’re not there yet ….
Discrete structure analysis  →  Graph theory  →  Computers

For discrete structure analysis, graph theory has not yet fully supplied:
• Mathematical tools
• “Impedance match” to computer operations
• High-level primitives
• High-quality software libraries
• Ways to extract performance from computer architecture
• Interactive environments
Five Challenges in Combinatorial Scientific Computing
Challenges I’m not going to talk about
• How do we know we’re solving the right problem?
  – Harder in graph analytics than in numerical linear algebra
  – I’ll pretend this is a math modeling question, not a CSC question
• How will applications keep us honest in the future?
  – Tony Chan, 1991: T(A, B) = elapsed time before field A would notice the disappearance of field B
  – min_A T(A, CSC) ?
#1: The Architecture & Algorithms Challenge
[Images: two Nvidia 8800 GPUs, > 1 TFLOPS; LANL / IBM Roadrunner, > 1 PFLOPS; Intel 80-core chip, > 1 TFLOPS]

• Parallelism is no longer optional…
• … in every part of a computation.
High-Performance Architecture
• Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
• Originally, because linear algebra is the middleware of scientific computing.
• Nowadays, largely for bragging rights.

[Figure: PA = LU]
The Memory Wall Blues
• Most of memory is hundreds or thousands of cycles away from the processor that wants it.
• You can buy more bandwidth, but you can’t buy less latency. (Speed of light, for one thing.)
• You can hide latency with either locality or (parallelism + bandwidth).
• You can hide lack of bandwidth with locality, but not with parallelism.
• Most emerging graph problems have lousy locality.
• Thus the algorithms need even more parallelism!
Architectural impact on algorithms
Naïve 3-loop matrix multiply [Alpern et al., 1992]:

[Plot: log cycles/flop vs. log problem size. Measured performance fits T = N^4.7; size 2000 took 5 days, and size 12000 would take 1095 years.]

• Naïve algorithm is O(N^5) time under the UMH model.
• BLAS-3 DGEMM and recursive blocked algorithms are O(N^3).
Slide from Larry Carter
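A minimal sketch (not from the slide) of the two loop structures being contrasted. The naïve triple loop streams whole rows and columns through the memory hierarchy for every output element, while a blocked version reuses small tiles that fit in fast memory; the block size bs is an illustrative parameter.

import numpy as np

def matmul_naive(A, B):
    """Naive triple loop: each C[i, j] re-reads a whole row of A and column of B."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_blocked(A, B, bs=64):
    """Blocked version: works on bs-by-bs tiles so operands stay in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

The cache effects the plot measures show up in compiled kernels rather than in Python, but the access pattern is the point: the blocked loop moves roughly a factor of bs less data between slow and fast memory than the naïve one.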
The parallel computing challenge
• Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies.
• Computation per edge traversal is often small.
• Little locality or opportunity for reuse.
• A big opportunity exists for architecture to influence combinatorial algorithms.
• Maybe even vice versa.
Strongly connected components
[Figure: a directed graph G(A) on vertices 1–7 and its symmetric permutation PAP^T to block triangular form, with the strongly connected components as diagonal blocks]

• Symmetric permutation to block triangular form
• Diagonal blocks are strong Hall (irreducible / strongly connected)
• Sequential: linear time by depth-first search [Tarjan]
• Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]
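A minimal sketch (my illustration of the divide-and-conquer idea, not the authors' code) of the [Fleischer, Hendrickson, Pinar] approach: pick a pivot vertex; the intersection of its forward- and backward-reachable sets is one strongly connected component, and the three remaining pieces can be processed recursively and independently, which is where the parallelism would come from. Here vertices is a set, and succ and pred are assumed callables returning out- and in-neighbors.

def scc_divide_and_conquer(vertices, succ, pred):
    """Divide-and-conquer SCC via forward/backward reachability from a pivot."""
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fwd = reach(pivot, succ, vertices)   # vertices reachable from the pivot
    bwd = reach(pivot, pred, vertices)   # vertices that can reach the pivot
    scc = fwd & bwd                      # one strongly connected component
    rest = [fwd - scc, bwd - scc, vertices - fwd - bwd]
    return [scc] + [c for part in rest for c in scc_divide_and_conquer(part, succ, pred)]

def reach(source, nbrs, allowed):
    """Reachability search restricted to the 'allowed' vertex set."""
    seen, stack = {source}, [source]
    while stack:
        v = stack.pop()
        for w in nbrs(v):
            if w in allowed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

How evenly the pivots split the graph determines the work and span, which matches the slide's caveat that they depend on the input.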
[Figure: strong components of a 1M-vertex RMAT graph]
One architectural approach: Cray MTA / XMT
• Hide latency by massive multithreading
• Per-tick context switching
• Uniform (sort of) memory access time
• But the economic case for the machine is less than completely clear.
#2: The Data Size Challenge
“Can we understand anything interesting about our data when we do not even have time to read all of it?”

  - Ronitt Rubinfeld, 2002
Approximating min spanning tree weight in sublinear time [Chazelle, Rubinfeld, Trevisan]

• Key subroutine: estimate the number of connected components of a graph, in time depending on relative error but not on the size of the graph
• Idea: for each vertex v define f(v) = 1 / (size of component containing v)
• Then Σ_v f(v) = number of connected components
• Estimate Σ_v f(v) by breadth-first search from a few vertices
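A minimal sketch (my own illustration of the idea, not the authors' algorithm) of the sampling estimator: since Σ_v f(v) equals the number of components, averaging f(v) over a few sampled vertices and scaling by n gives an unbiased estimate. The real sublinear algorithm also truncates each search after a bounded number of steps; this sketch runs each BFS to completion. Here vertices is a list and neighbors is a callable returning adjacent vertices.

import random
from collections import deque

def estimate_num_components(vertices, neighbors, samples=100):
    """Estimate the number of connected components by sampling f(v) = 1/|component(v)|."""
    n = len(vertices)
    total = 0.0
    for _ in range(samples):
        v = random.choice(vertices)
        total += 1.0 / component_size(v, neighbors)
    return n * total / samples            # E[n * f(v)] = number of components

def component_size(source, neighbors):
    """Size of the connected component containing 'source', by breadth-first search."""
    seen, queue = {source}, deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)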
Features of (many) large graph applications
• “Feasible” means O(n), or even less.
• You can’t scan all the data
  – you want to poke at it in various ways
  – maybe interactively.
• Multiple simultaneous queries to the same graph
  – maybe to differently filtered subgraphs
  – throughput and response time are both important.
• Benchmark data sets are a big challenge!
• Think of data not as a finite object but as a statistical sample of an infinite process . . .
#3: The Uncertainty Challenge
“You could not step twice into the same river; for other waters are ever flowing on to you.”

  - Heraclitus
Change and uncertainty
• Statistical perspectives; robustness to fluctuations; streaming, noisy data; dynamic structure; analysis of sensitivity and propagation of uncertainty
• K-betweenness centrality for robustness [Bader]
• Stochastic centrality for infection vulnerability analysis [Vullikanti]
• Dynamic community updating [Bhowmick]
• Detection theory [Kepner]: “You should never worry about implementation; you should worry about whether you can correctly model the background and foreground.”
Horizontal-vertical decomposition [Mezic et al.]
[Figure: example graph on vertices 1–9, with its strongly connected components arranged into DAG levels 1–4]

• Strongly connected components, ordered by levels of the DAG
• Linear time sequentially; no work/span-efficient parallel algorithms known
• … but we don’t want an exact algorithm anyway; edges are probabilities, and the application is a kind of statistical model reduction …
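A minimal sketch (my illustration, not the Mezic group's code) of the exact sequential computation the first bullet describes: collapse the graph to its strongly connected components, then give each component a level equal to the longest path from a source of the condensation DAG, using NetworkX as a stand-in for the SCC and topological-sort machinery.

import networkx as nx

def hv_levels(G):
    """Assign each vertex of directed graph G a horizontal level:
    SCCs are collapsed, and a component's level is 1 + the maximum level
    of its predecessors in the condensation DAG (sources get level 1)."""
    C = nx.condensation(G)                     # DAG whose nodes are the SCCs of G
    level = {}
    for c in nx.topological_sort(C):
        preds = list(C.predecessors(c))
        level[c] = 1 + max((level[p] for p in preds), default=0)
    # map component levels back to the original vertices
    return {v: level[c] for c, data in C.nodes(data=True) for v in data["members"]}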
Model reduction and graph decomposition
A spectral graph decomposition technique, combined with dynamical systems analysis, deconstructs a possibly unknown network into inputs, outputs, forward paths, and feedback loops, and allows identification of a minimal functional unit (MFU) of the system.

Approach:
1. Decompose networks
2. Propagate uncertainty through components
3. Iteratively aggregate component uncertainty

Trim the network, preserve dynamics!

[Figure: H-V decomposition of an example network into input (initiator), forward production unit, feedback loops, and output (execution); node 4 and several connections are pruned with no loss of performance. Plots of output level for the MFU alone and with feedback loops show that sensitive edges (leading to lack of production) are easily identifiable, and allow identification of the roles of different feedback loops.]

Mezic group, UCSB
#4: The Primitives Challenge
• By analogy to numerical scientific computing . . .

[Figure: Basic Linear Algebra Subroutines (BLAS): speed (MFlops) vs. matrix size (n) for C = A*B, y = A*x, and μ = x^T y]

• What should the combinatorial BLAS look like?
Primitives should…
• Supply a common notation to express computations
• Have broad scope but fit into a concise framework
• Allow programming at the appropriate level of abstraction and granularity
• Scale seamlessly from desktop to supercomputer
• Hide architecture-specific details from users
Frameworks for graph primitives
Many possibilities; none completely satisfactory; relatively little work on common frameworks or interoperability.

• Visitor-based, distributed-memory: PBGL
• Visitor-based, multithreaded: MTGL
• Heterogeneous, tuned kernels: SNAP
• Sparse array-based: Matlab dialects, CBLAS (see the sketch below)
• Scan-based vectorized: NESL
• Map-reduce: lots of visibility, utility unclear
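A minimal sketch (my own illustration of the sparse array-based style, not CBLAS itself) of a graph computation expressed through an array primitive: breadth-first search as repeated sparse matrix-vector products, with SciPy standing in for a tuned sparse library. A is assumed to be the 0/1 adjacency matrix in CSR form, with A[i, j] = 1 for an edge i -> j.

import numpy as np
import scipy.sparse as sp

def bfs_levels(A, source):
    """BFS expressed as sparse matrix-vector products.
    Each iteration makes one bulk primitive call, A.T @ frontier,
    which finds every vertex adjacent to the current frontier."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        reached = (A.T @ frontier).astype(bool)   # expand the frontier in one call
        frontier = reached & (levels == -1)       # keep only unvisited vertices
        level += 1
    return levels

Changing the underlying semiring turns the same primitive into other algorithms, for example (min, +) for shortest-path relaxation, which is the sense in which a small set of array primitives can cover many graph computations.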
Distributed-memory parallel sparse matrix-matrix multiplication
• 2D block layout
• Outer product formulation
• Sequential “hypersparse” kernel
• C_ij += A_ik * B_kj
• Asynchronous MPI-2 implementation
• Experiments: TACC Lonestar cluster
• Good scaling to 256 processors

[Figure: block sketch of C_ij += A_ik * B_kj; plot of time vs. number of cores for a 1M-vertex RMAT graph]
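A minimal serial sketch (my own illustration of the 2D block formulation, not the MPI-2 code from the slide) of how each output block accumulates products of blocks: on a q-by-q grid, block C_ij sums A_ik * B_kj over the q stages k. In the parallel algorithm each block lives on its own processor, and every stage is a round of communication followed by a local hypersparse multiply.

import scipy.sparse as sp

def spgemm_2d_blocks(A_blocks, B_blocks):
    """Serial simulation of 2D-block sparse matrix-matrix multiplication.
    A_blocks and B_blocks are q-by-q nested lists of SciPy sparse blocks."""
    q = len(A_blocks)
    C_blocks = [[None] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            acc = None
            for k in range(q):                           # one stage per k
                prod = A_blocks[i][k] @ B_blocks[k][j]   # local "hypersparse" kernel stand-in
                acc = prod if acc is None else acc + prod
            C_blocks[i][j] = acc
    return C_blocks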
Recursive All-Pairs Shortest Paths
Based on the R-Kleene algorithm; well suited for GPU architecture:

• Fast matrix-multiply kernel
• In-place computation => low memory bandwidth
• Few, large MatMul calls => low GPU dispatch overhead
• Recursion stack on host CPU, not on multicore GPU
• Careful tuning of GPU code

Partition the matrix into blocks [A B; C D]; + is “min”, × is “add”:

A = A*;           % recursive call
B = AB;  C = CA;
D = D + CB;
D = D*;           % recursive call
B = BD;  C = DC;
A = A + BC;
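A minimal dense sketch (my illustration, not the GPU code) of the same recursion over the (min, +) semiring. The input is a square distance matrix with 0 on the diagonal and np.inf for missing edges; the base case is a small Floyd-Warshall, and the block updates mirror the slide's pseudocode line by line.

import numpy as np

def min_plus(X, Y):
    """(min, +) semiring product: Z[i, j] = min_k X[i, k] + Y[k, j]."""
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

def r_kleene(A, base=64):
    """Recursive all-pairs shortest paths on a dense matrix, updated in place."""
    n = A.shape[0]
    if n <= base:                                   # base case: Floyd-Warshall
        for k in range(n):
            A = np.minimum(A, A[:, k:k+1] + A[k:k+1, :])
        return A
    m = n // 2
    A12, A21, A22 = A[:m, m:], A[m:, :m], A[m:, m:]
    A[:m, :m] = r_kleene(A[:m, :m], base)                               # A = A*
    A[:m, m:] = min_plus(A[:m, :m], A12)                                # B = AB
    A[m:, :m] = min_plus(A21, A[:m, :m])                                # C = CA
    A[m:, m:] = np.minimum(A22, min_plus(A[m:, :m], A[:m, m:]))         # D = D + CB
    A[m:, m:] = r_kleene(A[m:, m:], base)                               # D = D*
    A[:m, m:] = min_plus(A[:m, m:], A[m:, m:])                          # B = BD
    A[m:, :m] = min_plus(A[m:, m:], A[m:, :m])                          # C = DC
    A[:m, :m] = np.minimum(A[:m, :m], min_plus(A[:m, m:], A[m:, :m]))   # A = A + BC
    return A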
APSP: Experiments and observations
128-core Nvidia 8800 GPU. Speedup relative to:

  1-core CPU:                120x – 480x
  16-core CPU:               17x – 45x
  Iterative, 128-core GPU:   40x – 680x
  MSSSP, 128-core GPU:       ~3x

[Plot: time vs. matrix dimension]

Conclusions:
• High performance is achievable but not simple
• Carefully chosen and optimized primitives will be key
#5: The Education Challenge
• Or, the interdisciplinary challenge.
• Where do you go to take courses in
  – Graph algorithms …
  – … on massive data sets …
  – … in the presence of uncertainty …
  – … analyzed on parallel computers …
  – … applied to a domain science?

CSC needs to be both coherent and interdisciplinary!
Five Challenges In CSC
1. Architecture & algorithms
2. Data size & management
3. Uncertainty
4. Primitives & software
5. Education
Morals
• Things are clearer if you look at them from multiple perspectives
• Combinatorial algorithms are pervasive in scientific computing and will become more so
• This is a great time to be doing combinatorial scientific computing!
Thanks …
David Bader, Jon Berry, Nadya Bliss, Aydin Buluc, Larry Carter, Alan Edelman, John Feo, Bruce Hendrickson, Jeremy Kepner, Jure Leskovec, Kamesh Madduri, Michael Mahoney, Igor Mezic, Cleve Moler, Steve Reinhardt, Eric Robinson, Ronitt Rubinfeld, Rob Schreiber, Viral Shah, Sivan Toledo, Anil Vullikanti