# Challenges in Combinatorial Scientific Computing

John R. Gilbert, University of California, Santa Barbara
SIAM Annual Meeting, July 10, 2009
Support: DOE Office of Science, NSF, DARPA, SGI

## Combinatorial Scientific Computing

"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were 'sparse' in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."
- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics

## 2002 SIAM Annual Meeting

- IC10: Mathematical Analysis of Hyperlinks in the World-Wide Web
- IP3: Discrete Mathematics and Theoretical Computer Science
- VNL: The Human Genome and Beyond
- IC18: Finding Provably Near-Optimal Solutions to Discrete Optimization Problems
- CP6: Computer Science: Discrete Algorithms
- MS54: New Approaches for Scalable Sparse Linear System Solution
- MS71: Special Session on IR Algorithms and Software
- MS73: Reverse Engineering Gene Networks
- MS86/110: Combinatorial Algorithms in Scientific Computing

## 2007 SIAM Annual Meeting

- IC7: Combinatorics Inspired by Biology
- IC12: Parallel Network Analysis
- IC14: On the Complexity of Game, Market, and Network Equilibria
- MS1/8: Adaptive Algebraic Multigrid Methods
- MS4/11/26: Mathematical Challenges in Cyber Security
- MS36/43/55: High Performance Computing on Massive Real-World Graphs
- MS47: Optimization and Graph Algorithms
- MS79: Enumerative and Geometric Combinatorics
- MS71: Modeling of Large-Scale Metabolic Networks
- MS75: Combinatorial Scientific Computing

## An analogy?
For continuous physical modeling, linear algebra is the "middleware" of scientific computing, sitting between the models and the computers. It has supplied or enabled:

- Mathematical tools
- An "impedance match" to computer operations
- High-level primitives
- High-quality software libraries
- Ways to extract performance from computer architecture
- Interactive environments

The analogy: discrete structure analysis should sit on graph theory the way continuous physical modeling sits on linear algebra, with computers underneath both.

Well, we're not there yet. Discrete structure analysis still needs the same stack from graph theory: mathematical tools, an "impedance match" to computer operations, high-level primitives, high-quality software libraries, ways to extract performance from computer architecture, and interactive environments.

## Five Challenges in Combinatorial Scientific Computing

## Challenges I'm not going to talk about

- How do we know we're solving the right problem?
  - Harder in graph analytics than in numerical linear algebra
  - I'll pretend this is a math modeling question, not a CSC question
- How will applications keep us honest in the future?
  - Tony Chan, 1991: T(A, B) = elapsed time before field A would notice the disappearance of field B
  - min_A( T(A, CSC) ) ?

## #1: The Architecture & Algorithms Challenge

Parallelism is no longer optional … in every part of a computation.

- Two Nvidia 8800 GPUs: > 1 TFLOPS
- LANL / IBM Roadrunner: > 1 PFLOPS
- Intel 80-core chip: > 1 TFLOPS

## High-Performance Architecture

Most high-performance computer designs allocate resources to optimize Gaussian elimination (PA = LU) on large, dense matrices. Originally, because linear algebra is the middleware of scientific computing. Nowadays, largely for bragging rights.

## The Memory Wall Blues

- Most of memory is hundreds or thousands of cycles away from the processor that wants it.
- You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.)
- You can hide latency with either locality or (parallelism + bandwidth).
- You can hide lack of bandwidth with locality, but not with parallelism.
- Most emerging graph problems have lousy locality. Thus the algorithms need even more parallelism!

## Architectural Impact on Algorithms

Naïve 3-loop matrix multiply [Alpern et al., 1992]: measured running time fits T = N^4.7. A size-2000 multiply took 5 days; size 12000 would take 1095 years.

[Figure: log cycles/flop vs. log problem size. Slide from Larry Carter.]

The naïve algorithm is O(N^5) time under the UMH (uniform memory hierarchy) model. BLAS-3 DGEMM and recursive blocked algorithms are O(N^3).

## The Parallel Computing Challenge

- Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies.
- Computation per edge traversal is often small.
- Little locality or opportunity for reuse.

A big opportunity exists for architecture to influence combinatorial algorithms. Maybe even vice versa.
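The naïve-versus-blocked contrast above can be made concrete. A minimal sketch (illustrative only, not the benchmark code behind the slide): both routines perform the same O(N^3) arithmetic, but the blocked version reorders the loops so small tiles of A, B, and C are reused while they are still in fast memory, which is what makes BLAS-3 style kernels fast in practice.

```python
def matmul_naive(A, B):
    """Textbook 3-loop multiply: each element of A and B is streamed
    from memory with essentially no reuse."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, bs=2):
    """Same flops, reordered into bs x bs tiles so each tile of A and B
    is reused bs times before being evicted from fast memory."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

In pure Python the reordering does not pay off (the interpreter dominates), but the access pattern is the same one that takes tuned DGEMM from the N^4.7 curve down to O(N^3).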
## Strongly Connected Components

[Figure: a 7-vertex graph G(A) symmetrically permuted to block triangular form PAP^T.]

- Symmetric permutation to block triangular form
- Diagonal blocks are strong Hall (irreducible / strongly connected)
- Sequential: linear time by depth-first search [Tarjan]
- Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]

[Figure: strong components of a 1M-vertex RMAT graph.]

## One Architectural Approach: Cray MTA / XMT

- Hide latency by massive multithreading
- Per-tick context switching
- Uniform (sort of) memory access time
- But the economic case for the machine is less than completely clear.

## #2: The Data Size Challenge

"Can we understand anything interesting about our data when we do not even have time to read all of it?"
- Ronitt Rubinfeld, 2002

## Approximating Minimum Spanning Tree Weight in Sublinear Time [Chazelle, Rubinfeld, Trevisan]

- Key subroutine: estimate the number of connected components of a graph, in time depending on the relative error but not on the size of the graph.
- Idea: for each vertex v, define f(v) = 1 / (size of the component containing v).
- Then Σ_v f(v) = number of connected components.
- Estimate Σ_v f(v) by breadth-first search from a few sampled vertices.

## Features of (Many) Large Graph Applications

- "Feasible" means O(n), or even less: you can't scan all the data.
- You want to poke at the data in various ways, maybe interactively.
- Multiple simultaneous queries to the same graph, maybe to differently filtered subgraphs; throughput and response time are both important.
- Benchmark data sets are a big challenge!
- Think of data not as a finite object but as a statistical sample of an infinite process . . .
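The connected-components estimator above falls straight out of the slide's definition of f(v). A hedged sketch, assuming adjacency-list input and running each BFS to completion for clarity (the actual sublinear algorithm truncates each search at a threshold so that no single sample can cost too much):

```python
import random
from collections import deque

def component_size(adj, v):
    """Size of the connected component containing v, by breadth-first search."""
    seen = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)

def estimate_num_components(adj, samples, rng=random):
    """Estimate the number of connected components.

    Since f(v) = 1/|component(v)| sums over all v to the component count,
    E[f(v)] = (#components)/n for a uniformly random v, and
    n * (sample mean of f) is an unbiased estimate of #components.
    """
    n = len(adj)
    total = sum(1.0 / component_size(adj, rng.randrange(n))
                for _ in range(samples))
    return n * total / samples
```

On a 3-vertex graph with two components, the exact sum Σ_v f(v) is 2, and the estimator concentrates around that value as the sample count grows.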
## #3: The Uncertainty Challenge

"You could not step twice into the same river; for other waters are ever flowing on to you."
- Heraclitus

## Change and Uncertainty

- Statistical perspectives; robustness to fluctuations; streaming, noisy data; dynamic structure; analysis of sensitivity and propagation of uncertainty
- K-betweenness centrality for robustness [Bader]
- Stochastic centrality for infection vulnerability analysis [Vullikanti]
- Dynamic community updating [Bhowmick]
- Detection theory [Kepner]: "You should never worry about implementation; you should worry about whether you can correctly model the background and foreground."

## Horizontal-Vertical Decomposition [Mezic et al.]

[Figure: a 9-vertex graph with its strongly connected components arranged into levels 1 through 4 of the condensation DAG.]

- Strongly connected components, ordered by levels of the DAG
- Linear time sequentially; no work/span-efficient parallel algorithms known
- … but we don't want an exact algorithm anyway; edges are probabilities, and the application is a kind of statistical model reduction …

## Model Reduction and Graph Decomposition

A spectral graph decomposition technique, combined with dynamical systems analysis, deconstructs a possibly unknown network into inputs, outputs, and forward and feedback loops, and allows identification of a minimal functional unit (MFU) of the system.

Approach:
1. Decompose the network
2. Propagate uncertainty through components
3. Iteratively aggregate component uncertainty

Trim the network, preserve dynamics!
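The horizontal-vertical structure above (strong components ordered by DAG level) can be computed sequentially in linear time. A minimal sketch, assuming adjacency-list input; this follows the standard Kosaraju-style SCC route plus one level sweep, not the spectral machinery of Mezic et al.:

```python
def hv_levels(adj):
    """Strongly connected components, plus each component's level
    (longest-path depth) in the condensation DAG."""
    n = len(adj)
    # Pass 1: record vertices by DFS finish time (iterative DFS).
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(v)
                stack.pop()
    # Pass 2: sweep the transpose graph in reverse finish order; each
    # sweep labels one strong component.  Components are discovered in
    # topological order of the condensation DAG.
    radj = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            radj[v].append(u)
    comp, ncomp = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = ncomp
        stack = [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = ncomp
                    stack.append(w)
        ncomp += 1
    # Level sweep: component ids are topologically ordered, so one
    # forward pass over cross-component edges computes the levels.
    members = [[] for _ in range(ncomp)]
    for v in range(n):
        members[comp[v]].append(v)
    level = [0] * ncomp
    for c in range(ncomp):
        for u in members[c]:
            for v in adj[u]:
                if comp[v] != c:
                    level[comp[v]] = max(level[comp[v]], level[c] + 1)
    return comp, level
```

On a chain of a 3-cycle feeding a single vertex feeding a 2-cycle, this yields three components at levels 0, 1, and 2, matching the level structure in the figure.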
## H-V Decomposition (Mezic group, UCSB)

[Figure: H-V decomposition of a network into an input (initiator), a forward production unit, and feedback loops; node 4 and several connections are pruned with no loss of performance. Panels compare the level of output for the MFU alone and with feedback loops added.]

- Minimal functional units: sensitive edges (those whose loss leads to lack of production) are easily identifiable.
- The decomposition allows identification of the roles of different feedback loops.

## #4: The Primitives Challenge

By analogy to numerical scientific computing: the Basic Linear Algebra Subroutines (BLAS), μ = x^T y, y = A*x, and C = A*B.

[Figure: speed (MFlops) vs. matrix size (n) for the three BLAS levels.]

What should the combinatorial BLAS look like?

## Primitives Should …

- Supply a common notation to express computations
- Have broad scope but fit into a concise framework
- Allow programming at the appropriate level of abstraction and granularity
- Scale seamlessly from desktop to supercomputer
- Hide architecture-specific details from users

## Frameworks for Graph Primitives

Many possibilities; none completely satisfactory; relatively little work on common frameworks or interoperability.
- Visitor-based, distributed-memory: PBGL
- Visitor-based, multithreaded: MTGL
- Heterogeneous, tuned kernels: SNAP
- Sparse array-based: Matlab dialects, CBLAS
- Scan-based vectorized: NESL
- Map-reduce: lots of visibility, utility unclear

## Distributed-Memory Parallel Sparse Matrix-Matrix Multiplication

- 2D block layout; outer-product formulation; sequential "hypersparse" kernel computing Cij += Aik * Bkj
- Asynchronous MPI-2 implementation
- Experiments: TACC Lonestar cluster
- Good scaling to 256 processors

[Figure: time vs. number of cores on a 1M-vertex RMAT graph.]

## Recursive All-Pairs Shortest Paths

Based on the R-Kleene algorithm over the closed semiring where + is "min" and × is "add", on a matrix partitioned into blocks [A B; C D]:

    A = A*;            % recursive call
    B = AB;  C = CA;
    D = D + CB;
    D = D*;            % recursive call
    B = BD;  C = DC;
    A = A + BC;

Well suited for GPU architecture:

- Fast matrix-multiply kernel
- In-place computation => low memory bandwidth
- Few, large MatMul calls => low GPU dispatch overhead
- Recursion stack on host CPU, not on the GPU
- Careful tuning of GPU code

## APSP: Experiments and Observations

128-core Nvidia 8800 GPU. Speedup relative to:

- 1-core CPU: 120x - 480x
- 16-core CPU: 17x - 45x
- Iterative algorithm on the 128-core GPU: 40x - 680x
- Multiple-source SSSP on the 128-core GPU: ~3x

[Figure: time vs. matrix dimension.]

Conclusions:

- High performance is achievable but not simple
- Carefully chosen and optimized primitives will be key

## #5: The Education Challenge

Or, the interdisciplinary challenge. Where do you go to take courses in

graph algorithms …
… on massive data sets …
… in the presence of uncertainty …
… analyzed on parallel computers …
… applied to a domain science?

CSC needs to be both coherent and interdisciplinary!

## Five Challenges in CSC

1. Architecture & algorithms
2. Data size & management
3. Uncertainty
4. Primitives & software
5. Education

## Morals

- Things are clearer if you look at them from multiple perspectives.
- Combinatorial algorithms are pervasive in scientific computing and will become more so.
- This is a great time to be doing combinatorial scientific computing!

## Thanks …

David Bader, Jon Berry, Nadya Bliss, Aydin Buluc, Larry Carter, Alan Edelman, John Feo, Bruce Hendrickson, Jeremy Kepner, Jure Leskovec, Kamesh Madduri, Michael Mahoney, Igor Mezic, Cleve Moler, Steve Reinhardt, Eric Robinson, Ronitt Rubinfeld, Rob Schreiber, Viral Shah, Sivan Toledo, Anil Vullikanti
