UNCLASSIFIED
Sequoia RFP and Benchmarking Status
Scott Futral
Mark K. Seager
Tom Spelce
Lawrence Livermore National Laboratory
2008 SciComp Summer Meeting
Overview
Sequoia Objectives
• 25-50x BlueGene/L (367TF/s) on Science Codes
• 12-24x Purple on Integrated Design Codes
Sequoia Procurement Strategy
• Sequoia is actually a cluster of procurements
• Risk management pervades everything
Sequoia Target Architecture
• Driven by programmatic requirements and technical realities
• Requires innovation on several fronts
Sequoia will deliver petascale computing for the mission
and push the envelope by 10-100x in every dimension!
By leveraging industry trends, Sequoia will successfully deliver a petascale
UQ engine for the stockpile
Sequoia Production Platform Programmatic Drivers
• UQ Engine for mission deliverables in the 2011-2015 timeframe
Programmatic drivers require unprecedented leap forward in
computing power
Program needs both Capability and Capacity
• 25-50x BGL (367TF/s) for science codes (knob removal)
• 12-24x Purple for capability runs on Purple (8,192 MPI task UQ Engine)
These requirements, combined with current industry trends, drive
us to a target architecture different from Purple or BGL
Predicting stockpile performance drives five separate classes of
petascale calculations
1. Quantifying uncertainty (for all classes of simulation)
2. Identifying and modeling missing physics
3. Improving accuracy in material property data
4. Improving models for known physical processes
5. Improving the performance of complex models and algorithms in macro-scale simulation codes
Each of these mission drivers requires petascale computing
Sequoia Strategy
Two major deliverables
• Petascale Scaling “Dawn” Platform in 2009
• Petascale “Sequoia” Platform in 2011
Lessons learned from previous capability and capacity procurements
• Leverage best-of-breed for platform, file system, SAN and storage
• Major Sequoia procurement is for long term platform partnership
• Three R&D partnerships to incentivize bidders to stretch goals
• Risk reduction built into overall strategy from day-one
Drive procurement with a single mandatory requirement: peak
• Target Peak+Sustained on marquee benchmarks
• Timescale, budget, technical details as target requirements
• Include TCO factors such as power
To Minimize Risk, Dawn Deployment Extends the Existing
Purple and BG/L Integrated Simulation Environment
 ASC Dawn is the initial delivery system for Sequoia
 Code development platform and scaling for Sequoia
 0.5 petaFLOP/s peak for ASC production usage
 Target production 2009-2014
 Dawn Component Scaling
• Memory B:F = 0.3
• Mem BW B:F = 1.0
• Link BW B:F = 2.0
• Min Bisect B:F = 0.001
• SAN GB/s:PF/s = 384
• F is peak FLOP/s
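As a rough worked illustration (derived here from the ratios above and the 0.5 petaFLOP/s peak, not quoted from the RFP), the Dawn targets imply approximately:

\[ \mathrm{Memory} \approx 0.3\ \mathrm{B/F} \times 0.5\times10^{15}\ \mathrm{F/s} \approx 1.5\times10^{14}\ \mathrm{B} \approx 150\ \mathrm{TB} \]
\[ \mathrm{SAN\ BW} \approx 384\ \tfrac{\mathrm{GB/s}}{\mathrm{PF/s}} \times 0.5\ \mathrm{PF/s} = 192\ \mathrm{GB/s} \]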
Sequoia Target Architecture in Integrated Simulation
Environment Enables a Diverse Production Workload
 Diverse usage models drive platform and simulation environment requirements
• Will be 2D ultra-res and 3D high-res Quantification of Uncertainty engine
• 3D Science capability for known unknowns and unknown unknowns
 Peak of 14 petaFLOP/s with option for 20 petaFLOP/s
 Target production 2011-2016
 Sequoia Component Scaling
• Memory B:F = 0.08
• Mem BW B:F = 0.2
• Link BW B:F = 0.1
• Min Bisect B:F = 0.03
• SAN BW GB/s:PF/s = 25.6
• F is peak FLOP/s
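For scale (again derived here from the ratios, not RFP numbers), at the 20 petaFLOP/s option these targets correspond to roughly:

\[ \mathrm{Memory} \approx 0.08\ \mathrm{B/F} \times 20\times10^{15}\ \mathrm{F/s} = 1.6\times10^{15}\ \mathrm{B} = 1.6\ \mathrm{PB} \]
\[ \mathrm{SAN\ BW} \approx 25.6\ \tfrac{\mathrm{GB/s}}{\mathrm{PF/s}} \times 20\ \mathrm{PF/s} = 512\ \mathrm{GB/s} \]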
Sequoia Targets A Highly Scalable Operating System
Light weight kernel on compute node
[Diagram: compute node software stack – applications running on glibc with dynamic loading, NPTL POSIX threads, OpenMP and SE/TM, MPI over its ADI, shared memory regions, futex, function-shipped syscalls, and RAS, all on the hardware transport of the Sequoia CN and interconnect.]
Compute Nodes
 Optimized for scalability and reliability
 As simple as possible. Full control
 Extremely low OS noise
 Direct access to interconnect hardware
 OS features
 Linux compatible, with OS functions forwarded to the I/O node OS (see the sketch after this list)
 Support for dynamic libs runtime loading
 Shared memory regions
 Open source
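The forwarding of OS functions can be pictured with the minimal sketch below. It is illustrative only: the message layout, the shipped_write/ion_server names, and the single-process socketpair transport are assumptions made for the example, not the actual compute-node kernel or I/O-node daemon interfaces.

/* Illustrative sketch (not the CNK/ION protocol): "function shipping" forwards a
 * syscall request from a compute-side caller to an I/O-side server, which performs
 * the real Linux write() and returns the result. Both sides live in one process
 * here, connected by a socketpair, purely to keep the example runnable. */
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct { int fd; size_t len; char buf[128]; } ship_req_t;  /* hypothetical */
typedef struct { ssize_t ret; } ship_rep_t;

static int chan[2];   /* chan[0] = compute side, chan[1] = I/O side */

/* "I/O node": receive a shipped request, run the real syscall, reply. */
static void *ion_server(void *arg) {
    ship_req_t req;
    ship_rep_t rep;
    while (recv(chan[1], &req, sizeof req, 0) == (ssize_t)sizeof req) {
        rep.ret = write(req.fd, req.buf, req.len);   /* real Linux syscall */
        send(chan[1], &rep, sizeof rep, 0);
    }
    return NULL;
}

/* "Compute node": ship a write() instead of performing it locally. */
static ssize_t shipped_write(int fd, const char *buf, size_t len) {
    ship_req_t req;
    ship_rep_t rep;
    if (len > sizeof req.buf) len = sizeof req.buf;  /* keep the sketch simple */
    req.fd = fd;
    req.len = len;
    memcpy(req.buf, buf, len);
    send(chan[0], &req, sizeof req, 0);
    recv(chan[0], &rep, sizeof rep, 0);
    return rep.ret;
}

int main(void) {
    pthread_t t;
    const char msg[] = "hello from the compute side\n";
    socketpair(AF_UNIX, SOCK_STREAM, 0, chan);
    pthread_create(&t, NULL, ion_server, NULL);
    shipped_write(STDOUT_FILENO, msg, sizeof msg - 1);
    close(chan[0]);                 /* ends the server loop */
    pthread_join(t, NULL);
    return 0;
}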
Linux on I/O Node
[Diagram: Linux I/O node software stack – FSD, SLURMD, performance tools, TotalView, Lustre client, NFSv4, function-shipped syscalls, LNet, and UDP/TCP/IP on Linux/Unix, over the Sequoia ION and interconnect.]
 Leverage huge Linux base & community
 Enhance TCP offload, PCIe, I/O
 Standard file systems: Lustre, NFSv4, etc.
 Factor to Simplify:
 Aggregates N CN for I/O & admin
 Open source
Sequoia Target Application Programming Model Leverages
Factor and Simplify to Scale Applications to O(1M) Parallelism
MPI Parallelism at top level
• Static allocation of MPI tasks to nodes and sets of cores+threads
• Allow for MPI everywhere, just in case…
Effectively absorb multiple cores+threads in an MPI task (see the sketch below)
Support multiple languages
• C/C++/Fortran03/Python
Allow different physics packages to express node concurrency in different ways
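As a deliberately minimal illustration of one MPI task absorbing a node's cores and threads, here is a hedged hybrid MPI+OpenMP sketch; it shows the general pattern, not a Sequoia-specific API:

/* One MPI task per node (or per core group); OpenMP threads fill the rest. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls, matching the model above. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}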
With Careful Use of Node Concurrency We Can Support a Wide Variety of Complex Applications
[Diagram: timeline of one MPI task with four threads (Thread0-Thread3), from MAIN/MPI_INIT to Exit/MPI_FINALIZE. Thread0 runs MAIN, Funct1 (OpenMP regions with MPI calls), and Funct2 (TM/SE regions with MPI calls); Threads 1-3 serve as workers (W) for the parallel regions.]
1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN
 MPI tasks on a node are processes (one shown) with multiple OS threads (Thread0-3 shown)
 Thread0 is the "main thread"; Thread1-3 are helper threads that morph from Pthreads to OpenMP workers to TM/SE compiler-generated threads via runtime support
 Hardware support significantly reduces overheads for thread repurposing and for OpenMP loops and locks
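A hedged sketch of the call-chain idea (steps 2-6 above) using plain OpenMP; the TM/SE step is omitted because it relies on compiler and hardware support not shown here, and funct1/funct2 are hypothetical names:

#include <omp.h>
#include <stdio.h>

/* funct2 stands in for a finer-grained package; here it is serial work run by
 * whichever thread calls it (in the slide it would be a TM/SE region). */
static double funct2(int i) {
    return i * 0.5;
}

/* funct1 expresses its node concurrency with OpenMP. */
static double funct1(int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* helper threads become OpenMP workers */
    for (int i = 0; i < n; i++)
        sum += funct2(i);
    return sum;                                 /* control returns to the caller (MAIN) */
}

int main(void) {
    /* MAIN plays the role of Thread0: only it calls down into funct1. */
    double result = funct1(1000);
    printf("result = %g (up to %d OpenMP threads)\n", result, omp_get_max_threads());
    return 0;
}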
Sequoia Distributed Software Stack Targets a Familiar Environment for Easy Application Porting
[Diagram: Sequoia distributed software stack. User space: code development tools (C/C++/Fortran compilers, Python), the application, parallel and optimized math libs, OpenMP/threads/SE-TM, Clib/F03 runtime, MPI2 over the ADI, and sockets. Kernel space (LWK or Linux): function-shipped syscalls, Lustre client, LNet, TCP/UDP/IP, and the interconnect interface. Supporting infrastructure: code development tools infrastructure, RAS and control system, SLURM/Moab, and the external network.]
Sequoia Platform Target Performance is a Combination of Peak
and Application Sustained Performance
“Peak” of the machine is absolute maximum performance
• FLOP/s = FLoating point OPeration per second
Sustained is a weighted average of five "marquee" benchmark code "Figures of Merit"
• Four IDC (integrated design code) package benchmarks and one "science workload" benchmark from SNL
• FOM chosen to mimic "grind times" and factor out scaling issues
[Images: Purple – 0.1 PF/s; BlueGene/L – 0.4 PF/s]
Sequoia Benchmarks have already incentivized the industry to
work on problems relevant to our mission needs
ASC Sequoia Benchmarks (each code is tagged by language – Fortran, Python, C, C++ – and parallelism – MPI, OpenMP, Pthreads)

Tier 1 (marquee performance codes)
• UMT – Single physics package code. Unstructured-mesh deterministic radiation transport
• AMG – Algebraic multigrid linear system solver for unstructured-mesh physics packages
• IRS – Single physics package code. Implicit Radiation Solver for the diffusion equation on a block-structured mesh
• SPhot – Single physics package code. Monte Carlo Scalar PHOTon transport code
• LAMMPS – Full-system science code. Classical molecular dynamics simulation code (as used)

Tier 2 (subsystem functionality and performance tests)
• Pynamic – Dummy application that closely models the footprint of an important Python-based multi-physics ASC code
• CLOMP – Measures OpenMP overheads and other performance impacts due to threading
• FTQ – Fixed Time Quantum test; measures operating system noise
• IOR – Interleaved Or Random I/O benchmark; used for testing the performance of parallel file systems with various interfaces and access patterns
• Phloem MPI Benchmarks – Collection of independent MPI benchmarks to measure the health and stability of various aspects of MPI performance, including interconnect messaging rate, latency, aggregate bandwidth, and collective latencies under heavy network loads
• Memory Benchmarks – Collection of STREAMS and STRIDE memory benchmarks to measure the memory subsystem under a variety of memory access patterns

Tier 3 (single-node kernels)
• UMTMk – Threading compiler test and single-core performance
• AMGMk – Sparse matrix-vector operations, single-core and OpenMP performance
• IRSMk – Single-core optimization and SIMD compiler challenge
• SPhotMk – Single-core integer arithmetic and branching performance
• CrystalMk – Single-core optimization and SIMD compiler challenge

What's missing?
– Hydrodynamics
– Structural mechanics
– Quantum MD
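To make the FTQ entry above concrete, here is a hedged, simplified sketch of a fixed-time-quantum noise probe; it follows the description in the table, not the actual ASC FTQ source:

/* Count how much unit work fits into each fixed time quantum; dips in the
 * per-quantum count indicate OS interference ("noise"). */
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const double quantum = 0.001;          /* 1 ms time quantum */
    const int nquanta = 1000;
    volatile unsigned long sink = 0;
    for (int q = 0; q < nquanta; q++) {
        unsigned long work = 0;
        double start = now_sec();
        while (now_sec() - start < quantum) {
            sink += work;                  /* one unit of "work" */
            work++;
        }
        printf("%d %lu\n", q, work);       /* low counts reveal OS noise */
    }
    return 0;
}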
Validation and Benchmark Efforts
Platforms
 Purple (IBM Power5, AIX)
 BGL (IBM PPC440, LWK)
 BGP (IBM PPC450, LWK, SMP)
 ATLAS (AMD Opteron, TOSS)
 Red Storm (AMD Opteron, Catamount)
 Franklin (AMD Opteron, CNL)
 Phoenix (Vector, UNICOS)
The strategy for aggregating performance incentivizes vendors in two ways:
1 – Peak (petaFLOP/s)
2 – #MPI tasks per node <= memory per node / 2 GB
AMG: wFOM = A x "solution vector size" * iter / sec
IRS: wFOM = B x "temperature variables" * iter / sec
SPhot: wFOM = C x "tracks" / sec
UMT: wFOM = D x corners * angles * groups * zones * iter / sec
LAMMPS: wFOM = E x atom updates / sec
awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS
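A minimal sketch of the aggregation arithmetic above; the weights A-E and the raw figures used here are placeholders, not RFP values:

#include <stdio.h>

int main(void) {
    /* Placeholder weights A..E and raw per-code figures of merit (hypothetical). */
    const char *code[5]   = { "AMG", "IRS", "SPhot", "UMT", "LAMMPS" };
    double      weight[5] = { 1.0, 1.0, 1.0, 1.0, 1.0 };
    double      raw[5]    = { 1.0e9, 2.0e9, 3.0e9, 4.0e9, 5.0e9 };
    double awfom = 0.0;
    for (int i = 0; i < 5; i++) {
        double wfom = weight[i] * raw[i];       /* weighted FOM per code */
        printf("wFOM_%s = %.3e\n", code[i], wfom);
        awfom += wfom;                          /* aggregate weighted FOM */
    }
    printf("awFOM = %.3e\n", awfom);
    return 0;
}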
[Chart: Weak Scaling on Purple – weighted Figure of Merit vs. # PEs (thousands) for AMG, IRS, SPhot, UMT, and LAMMPS.]
AMG Results
[Chart: AMG Weak Scaling – raw Figure of Merit (billions) vs. PEs (thousands) for BG/L, Atlas-6A, Red Storm-6A, and Purple-6A.]
AMG message size distribution
[Chart: AMG (4,096 PEs) MPI characteristics – message count and time by message size (bytes).]
An improved messaging rate would significantly impact AMG
communication performance.
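One hedged way to collect a message-size histogram like the one above is the standard MPI profiling (PMPI) interface; this sketch intercepts MPI_Isend only and is not the tool used for these measurements:

#include <mpi.h>
#include <stdio.h>

#define NBINS 24                      /* power-of-two bins up to 8 MB */
static long long bin_count[NBINS];

static int bin_of(int bytes) {
    int b = 0;
    while ((1 << b) < bytes && b < NBINS - 1) b++;
    return b;
}

/* Wrapper: record the size, then forward to the real MPI_Isend. */
int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
              int tag, MPI_Comm comm, MPI_Request *req) {
    int size;
    MPI_Type_size(type, &size);
    bin_count[bin_of(count * size)]++;
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

/* Dump the histogram from rank 0 when the application shuts down. */
int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int b = 0; b < NBINS; b++)
            if (bin_count[b])
                printf("<= %d bytes: %lld messages\n", 1 << b, bin_count[b]);
    return PMPI_Finalize();
}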
UMT and SPhot Results
[Charts: SPhot Weak Scaling and UMT Weak Scaling – raw Figure of Merit (billions) vs. PEs (thousands) for Purple, BG/L, Atlas, and Red Storm (the UMT chart also includes a Purple rerun).]
Observed UMT messaging rates indicate that messaging rate must be an interconnect requirement
[Charts: UMT messaging rate (thousands of messages per second) vs. window size (microseconds) for the S12 and S6 problem sizes, and a histogram of occurrences vs. messages/sec showing the maximum message rate.]
Messaging is very bursty, and most messaging occurs at a high
messaging rate.
IRS (Implicit Radiation Solver) Results
[Chart: IRS Weak Scaling – raw FOM (billions) vs. # PEs (thousands) for Purple, BG/L, Atlas, and Red Storm.]
IRS Load Imbalance has two components: compute and
communications
[Chart: IRS "Send/Receive" load balance – percentage of work vs. processor count (512 to 8,000 PEs), broken down by 6_Core, 5_Face, 4_Edge, and 3_Corner zone categories.]
IMBALANCE (MAX/AVG) at #PE = 512, 1,000, 2,197, 4,096, 8,000:
• Model: 1.1429, 1.1111, 1.0833, 1.0667, 1.0526
• Power5: 1.521, 1.487, 1.428, 1.352
• BG/L: 1.061, 1.064, 1.052, 1.030
• Red Storm: 1.092, 1.080, 1.067, 1.052
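The MAX/AVG metric in the table can be measured with the standard reduction recipe sketched below; this is a generic example, not the IRS instrumentation, and the dummy workload exists only to create an imbalance:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Dummy compute region: ranks deliberately do different amounts of work. */
    double t0 = MPI_Wtime();
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L * (rank + 1); i++)
        x += 1.0;
    double local = MPI_Wtime() - t0;

    /* MAX/AVG of the measured region, reported on rank 0. */
    double tmax, tsum;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("imbalance (MAX/AVG) = %.3f\n", tmax / (tsum / nprocs));

    MPI_Finalize();
    return 0;
}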
[Chart: percentage of MPI time on BG/L by call (MPI_Allreduce, MPI_Waitany, MPI_Waitall, MPI_Bcast, MPI_Recv, MPI_Isend, MPI_Send, MPI_Wait, MPI_Irecv) vs. PEs (thousands); an accompanying diagram decomposes application time into computation, communication prep, MPI, and wire communication.]
Summary
Sequoia is a carefully choreographed risk-mitigation strategy to develop and deliver a huge leap forward in computing power to the National Stockpile Stewardship Program
Sequoia will work for weapons science and integrated design codes when delivered because our evolutionary approach yields a revolutionary advance on multiple fronts
The groundwork on system requirements, benchmarks, and the SOW is in place for the launch of a successful procurement competition for Sequoia