Overview of Extreme-Scale
Software Research in China
Depei Qian
Sino-German Joint Software Institute (JSI)
Beihang University
China-USA Computer Software Workshop
Sep. 27, 2011
Outline








Related R&D efforts in China
Algorithms and Computational Methods
HPC and e-Infrastructure
Parallel programming frameworks
Programming heterogeneous systems
Advanced compiler technology
Tools
Domain specific programming support
Related R&D efforts in China
• NSFC
  – Basic algorithms and computable modeling for high performance scientific computing
  – Network based research environment
  – Many-core parallel programming
• 863 program
  – High productivity computer and Grid service environment
  – Multicore/many-core programming support
  – HPC software for earth system modeling
• 973 program
  – Parallel algorithms for large scale scientific computing
  – Virtual computing environment
Algorithms and Computational
Methods
NSFC’s Key Initiative on Algorithm
and Modeling




Basic algorithms and computable modeling
for high performance scientific computing
8-year, launched in 2011
180 million Yuan funding
Focused on



Novel computational methods and basic
parallel algorithms
Computable modeling for selected domains
Implementation and verification of parallel
algorithms by simulation
HPC & e-Infrastructure
863’s key projects on HPC and
Grid


“High productivity Computer and Grid
Service Environment”

Period: 2006-2010

940 million Yuan from the MOST and more than
1B Yuan matching money from other sources
Major R&D activities

Developing PFlops computers

Building up a grid service environment--CNGrid

Developing Grid and HPC applications in
selected areas
CNGrid GOS Architecture
Other Domain Specific Applications
GSML Workshop.
Cmd Line Tools
IDE Debugger Compiler
GSML
Composer
HPCG App & Mgmt Portal
Gsh & cmd tools
GSML
Browser
Tool/App
VegaSSH
System Mgmt Portal
Core, System and App Level
Services
GOS Library (Batch, Message, File, etc)
GOS System Call (Resource mgmt,Agora mgmt, User mgmt, Grip mgmt, etc)
HPCG Backend
Axis Handlers
for Message Level Security
CA Service
metainfo mgmt
File mgmt
BatchJob mgmt
Account mgmt
MetaSchedule
Message
Service
Dynamic
DeployService
Grip
DataGrid
GridWorkflow
DB Service
Work Flow
Engine
System
Tomcat(5.0.28) +
Axis(1.2 rc2)
Agora
Security
Resource Space
J2SE(1.4.2_07, 1.5.0_07)
Res AC & Sharing
Grip Instance Mgmt
User Mgmt
Agora Mgmt
Core
Res Mgmt
OS (Linux/Unix/Windows)
Naming
Grip Runtime
ServiceController
Other
RController
Tomcat(Apache)+Axis, GT4, gLite, OMII
Java J2SE
Grid Portal, Gsh+CLI, GSML
Workshop and Grid Apps
Other 3rd
software &
tools
Hosting
Environment
PC Server (Grid Server)
Abstractions

Grid community: Agora


persistent information storage and
organization
Grid process: Grip

runtime control
CNGrid GOS deployment





CNGrid GOS deployed
on 11 sites and some
application Grids
Support heterogeneous
HPCs: Galaxy, Dawning,
DeepComp
Support multiple
platforms
Unix, Linux, Windows
Using public network
connection, enable only
HTTP port
Flexible client



Web browser
Special client
GSML client
CNIC: 150TFlops,
1.4PB storage,30
applications, 269 users
all over the country,
IPv4/v6 access
Tsinghua University:
1.33TFlops, 158TB
storage, 29
applications, 100+
users, IPv4/v6
access
IAPCM: 1TFlops,
4.9TB storage, 10
applications, 138
users, IPv4/v6 access
Shandong University
10TFlops, 18TB
storage, 7
applications, 60+
users, IPv4/v6 access
GSCC: 40TFlops,
40TB, 6
applications, 45
users , IPv4/v6
access
SSC: 200TFlops,
600TB storage, 15
applications, 286
users, IPv4/v6 access
XJTU: 4TFlops, 25TB
storage, 14
applications, 120+
users, IPv4/v6 access
HUST: 1.7TFlops,
15TB storage,
IPv4/v6 access
SIAT: 10TFlops,
17.6TB storage,
IPv4/v6 access
USTC: 1TFlops, 15TB
storage, 18
applications, 60+ users,
IPv4/v6 access
HKU: 20TFlops, 80+
users, IPv4/v6 access
CNGrid: resources




11 sites
>450TFlops
2900TB storage
Three PF-scale
sites will be
integrated into
CNGrid soon
CNGrid:services and users


230 services
>1400 users






China commercial
Aircraft Corp
Bao Steel
automobile
institutes of CAS
universities
……
CNGrid:applications

Supporting >700 projects

973, 863, NSFC, CAS Innovative, and
Engineering projects
Parallel programming
frameworks
Jasmin: A parallel programming
Framework
separate
Library
Models
Special
Applications
Stencils
Codes
Algorithms
Models
Common Stencils
Algorithms
extract
Data Dependency
form
Data Structures
Parallel
Computing
Models
Communications
support
Load Balancing
Promote
Computers
Also supported by the
973 and 863 projects
Basic ideas

Hide the complexity of programming millions
of cores

Integrate the efficient implementations of
parallel fast numerical algorithms

Provide efficient data structures and solver
libraries

Support software engineering for code
extensibility.
Basic Ideas
PetaFlops MPP
Applications
Codes
TeraFlops
Cluster
Serial
Programming
Personal
Computer
JASMIN
Structured
Grid
Inertial
Confinement
Fusion
Global
Climate
Particle
Modeling Simulation
CFD
Material
Simulations
J parallel
Adaptive
Structured Mesh
INfrastructure
JASMIN
http://www.iapcm.ac.cn/jasmin,
2010SR050446
2003-now
……
Unstructured
Grid
JASMIN
User provides: physics, parameters, numerical methods,
expert experiences, special algorithms, etc.
User Interfaces:Components based Parallel
Programming models. ( C++ classes)
JASMIN
Numerical Algorithms:geometry, fast solvers,
mature numerical methods, time integrators, etc.
V. 2.0
HPC implementations( thousands of CPUs):data
structures, parallelization, load balancing, adaptivity,
visualization, restart, memory, etc.
Architecture:Multilayered, Modularized, Object-oriented;
Codes: C++/C/F90/F77+MPI/OpenMP,500,000 lines;
Installation: Personal computers, Cluster, MPP.
Numerical simulations on TianHe-1A
Codes
# CPU cores Codes
# CPU cores
LARED-S
32,768
RH2D
1,024
LARED-P
72,000
HIME3D
3,600
LAP3D
16,384
PDD3D
4,096
MEPH3D
38,400
LARED-R
512
MD3D
80,000
LARED Integration
128
RT3D
1,000
Simulation duration : several hours to tens of hours.
Programming heterogeneous
systems
GPU programming support



Source to source translation
Runtime optimization
Mixed programming model for
multi-GPU systems
S2S translation for GPU

A source-to-source translator, GPUS2S, for GPU programming

Facilitate the development of
parallel programs on GPU by
combining automatic mapping and
static compilation
S2S translation for GPU (cont’d)


Insert directives into the source program

Guide implicit call of CUDA runtime libraries

Enable the user to control the mapping from the
homogeneous CPU platform to GPU’s streaming
platform
Optimization based on runtime profiling

Take full advantage of GPU according to the
application characteristics by collecting runtime
dynamic information.
The GPU-S2S architecture
PGAS programming model
MPI message transfer model
Layer of
software
productivity
Pthread thread model
GPU-S2S
Profile
information
GPU
supporting library
Calling shared
library
User standard
library
Layer of
Running-time performance collection performance
discover
Operating system
GPU platform
Program translation by GPU-S2S
[Figure: program frameworks before and after translation by GPU-S2S]
• Source code before translation (homogeneous platform program framework): homogeneous platform code with directives, user defined part, calls to the shared library, templates library of optimized computing intensive applications, profile library
• GPU-S2S generates the kernel program of the GPU according to the templates, and the control program of the CPU
• Source code after translation (GPU streaming architecture platform program framework): control program of CPU, kernel program of GPU, general purpose computing interface, templates library of optimized computing intensive applications, profile library, user standard library, calls to the shared library
Runtime optimization based on profiling
First level
profiling
(function level)
GPU-S2S
*.c、*.h
homogeneous
platform
code
C
language
compiler
Pretreatment
First level dynamic
instrumentation
Second level dynamic
instrumentation
Automatically
inserting directives
Second level profiling
(memory access and
kernel improvement )
Third level
profiling (data
partition)
Generate CUDA code
containing optimized
kernel
Compile
and run
Compile
and run
Extract profile
information:
computing kernel
First
Level
Profile
Extract profile information:
Data block size, Share memory
configuration parameters,
Judge whether can use stream
Second
Level
Profile
Don’t
need to
optimize
further
Termination
Need to optimize
further
Third level dynamic
instrumentation in
CUDA code
Compile
and run
Extract profile information:
Number of stream, Data size
of every stream
*.o
Executable
code on
GPU
CUDA
Compiler
tool
Generate
CUDA code
using
stream
*.h、
*.cu、
*.c
CUDA code
Third
Level
Profile
First level profiling
Homogeneous platform code
Allocate address
space
initialization
function0
Source-to-source compiler
instrumentation0
instrumentation1
function1
...
Free address space
Identify computing
kernels

instrumentation0
functionN

instrumentation1
instrumentationN
instrumentationN
Scan and instrument the source code, get the execution time of every function, and identify the computing kernels
Second level profiling

Homogeneous platform
code
Computing
kernel1
Computing
kernel2
instrumentation

instrumentation

...
...
Computing
kernel3
Source-to-source
compiler
Identify the memory
access pattern and
improve the kernels
instrumentation
Instrument the computing kernels, extract and analyze the profile information, optimize according to the features of the application, and finally generate the CUDA code with optimized kernels
Third level profiling
CUDA control code
Allocate address
space
initialization
Allocate global
address space
function0--copyin
function0--kernel
Source-to-source
compiler

Optimization by improving data partition

instrumentationi
instrumentationi
instrumentationk

instrumentationk
function0--copyout
...
Free address space
instrumentationo
instrumentationo

Get copy time and
computing time by
instrumentation
Compute the number
of streams and data
size of each stream
Generate the optimized
CUDA code with
stream
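How copy time and computing time might be turned into a stream count can be sketched with a toy pipeline model. The slides do not give the formula GPU-S2S uses, so everything below is an assumption: the data is split into `n` equal chunks, the copy of one chunk overlaps the kernel of the previous one, and each stream adds a fixed hypothetical `overhead`. The function names are illustrative only.

```python
def pipelined_time(t_copy, t_kernel, n, overhead):
    """Estimated total time (ms) with n streams under the toy overlap model.

    The first chunk pays copy + kernel in full; each later chunk adds only the
    slower of the two stages; every stream adds a fixed launch overhead.
    """
    per_copy = t_copy / n
    per_kernel = t_kernel / n
    return per_copy + per_kernel + (n - 1) * max(per_copy, per_kernel) + n * overhead


def choose_streams(t_copy, t_kernel, total_bytes, overhead=0.05, max_streams=16):
    """Pick the stream count that minimizes the model, and the bytes per stream."""
    best = min(range(1, max_streams + 1),
               key=lambda n: pipelined_time(t_copy, t_kernel, n, overhead))
    bytes_per_stream = -(-total_bytes // best)  # ceiling division
    return best, bytes_per_stream
```

With a measured copy time of 40 ms, a kernel time of 60 ms, and an assumed overhead of 1 ms per stream, the model reduces to 60 + 40/n + n, so `choose_streams(40.0, 60.0, 6000, overhead=1.0)` selects 6 streams of 1000 bytes each. Real stream behavior depends on the device and driver, so such a model would need to be calibrated against the instrumentation data.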
[Figure: Matrix multiplication performance comparison before and after profile optimization. Two panels plot time (ms) against input data size (1024 and 2048; 4096 and 8192) for four variants: only using global memory, memory access optimization, second level profile optimization, and third level profile optimization.]
[Figure: Execution performance comparison on different platforms. Time (ms) against input data size (1024, 2048, 4096, 8192) for the CPU and for the CUDA code with three level profile optimization.]
The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
[Figure: FFT (1048576 points) performance comparison before and after profile optimization. Time (ms) against number of batches (15, 30, 45, 60) for four variants: only using global memory, memory access optimization, second level profile optimization, and third level profile optimization.]
[Figure: FFT (1048576 points) execution performance comparison on different platforms. Time (ms) against input data size (15, 30, 45, 60) for the CPU and for the CUDA code with three level profile optimization.]
The CUDA code after three-level profile optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
Programming Multi-GPU systems
The memory of a CPU+GPU system is both distributed
and shared, so it is feasible to use the MPI and PGAS
programming models for this new kind of system.
[Figure: memory organization of a CPU+GPU system. Each CPU has main memory and each GPU has device memory. MPI view: private spaces exchanging message data. PGAS view: a share space holding share data.]
Using message passing or shared data for
communication between parallel tasks or GPUs
Mixed Programming Model
NVIDIA GPU
—— CUDA
Traditional Programming
model
—— MPI/UPC
MPI+CUDA/UPC+CUDA
[Figure: execution flow of the mixed programming model. Host side (CPU, main memory, MPI/UPC runtime as the communication interface of the upper programming model): program start, device choosing, program initialization, source data copy in, computing start call, communication between parallel tasks, result data copy out, end. Device side (GPU, device memory, CUDA runtime): the computing kernel runs as a CUDA program. Data moves between main memory and device memory with cudaMemcpy.]
MPI+CUDA experiment

Platform






2 NF5588 servers, each equipped with
• 1 Xeon CPU (2.27GHz), 12GB main memory
• 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4GB device memory)
1 Gbit Ethernet
RedHat Linux 5.3
CUDA Toolkit 2.3 and CUDA SDK
OpenMPI 1.3
Berkeley UPC 2.1
MPI+CUDA experiment (cont’d)

Matrix Multiplication program




Using block matrix multiply for UPC programming.
Data spread on each UPC thread.
The computing kernel carries out the multiplication of two
blocks at one time, using CUDA to implement.
The total time of execution:
Tsum=Tcom+Tcuda=Tcom+Tcopy+Tkernel
Tcom: UPC thread communication time
Tcuda: CUDA program execution time
Tcopy: Data transmission time between host and device
Tkernel: GPU computing time
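The decomposition above can be written down as a small Python sketch. The `TimeBreakdown` class mirrors the four terms on the slide; `toy_model` and all of its rate constants are hypothetical and only illustrate how a fixed communication cost can dominate at small problem sizes. They are not measurements from the experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeBreakdown:
    t_com: float     # UPC/MPI task communication time (ms)
    t_copy: float    # host <-> device data transmission time (ms)
    t_kernel: float  # GPU computing time (ms)

    @property
    def t_cuda(self) -> float:
        # Tcuda = Tcopy + Tkernel
        return self.t_copy + self.t_kernel

    @property
    def t_sum(self) -> float:
        # Tsum = Tcom + Tcuda
        return self.t_com + self.t_cuda


def toy_model(n, gpus, kernel_rate=1e7, copy_rate=1e5, com_latency=20.0):
    """Hypothetical cost model for an n x n matrix multiply on `gpus` devices.

    kernel_rate, copy_rate and com_latency are made-up constants chosen only
    to show the qualitative trade-off between computing and communication.
    """
    t_kernel = n ** 3 / kernel_rate / gpus
    t_copy = 3 * n * n / copy_rate / gpus
    t_com = 0.0 if gpus == 1 else com_latency * (gpus - 1) + n * n / copy_rate
    return TimeBreakdown(t_com, t_copy, t_kernel)
```

Under these assumed constants, `toy_model(256, 2).t_sum` exceeds `toy_model(256, 1).t_sum`, while for `n = 4096` the two-GPU estimate is lower: the extra communication term only pays off once the kernel time is large enough to amortize it.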
MPI+CUDA experiment (cont’d)
2 servers, at most 8 MPI tasks
1 server with 2 GPUs

For the 4096*4096 case, 1 MPI+CUDA task (using 1 GPU for
computing) achieves a 184x speedup over the case with 8 MPI tasks.

For small-scale data, such as 256 and 512, the execution time with 2 GPUs
is even longer than with 1 GPU:
the computing scale is too small, so the communication between the two tasks
outweighs the reduction in computing time.
PKU Manycore Software Research
Group

Software tool development for GPU
clusters



Software porting service


Unified multicore/manycore/clustering
programming
Resilience technology for very-large GPU
clusters
Joint project, <3k-line Code, supporting Tianhe
Advanced training program
PKU-Tianhe Turbulence Simulation
PKUFFT(using GPUs)




Reaches a scale 43
times higher than that
achieved on the Earth
Simulator
7168 nodes / 14336
CPUs / 7168 GPUs
FFT speed: 1.6X of
Jaguar
Proof of feasibility of
GPU speed up for
large scale systems
MKL(not using GPUs)
Jaguar
Advanced Compiler Technology
Advanced Compiler Technology
(ACT) Group at the ICT, CAS

ACT’s Current research



Parallel programming languages and models
Optimized compilers and tools for HPC (Dawning) and multicore processors (Loongson)
Will lead the new multicore/many-core programming
support project
PTA: Process-based TAsk parallel
programming model

new process-based task construct


With properties of isolation, atomicity and deterministic submission
Annotate a loop into two parts, prologue and task
segment
#pragma pta parallel [clauses]
#pragma pta task
#pragma pta propagate (varlist)


Suitable for expressing coarse-grained, irregular
parallelism on loops
Implementation and performance



PTA compiler, runtime system and assistant tool (help writing correct
programs)
Speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83
(average 6.72 on 8 cores)
Code changes are within 10 lines, far fewer than with OpenMP
UPC-H : A Parallel Programming Model
for Deep Parallel Hierarchies

Hierarchical UPC


Provide multi-level data distribution
Implicit and explicit hierarchical loop parallelism



Hybrid execution model: SPMD with fork-join
Multi-dimensional data distribution and super-pipelining
Implementations on CUDA clusters and Dawning 6000 cluster

Based on Berkeley UPC




Enhance optimizations as localization and communication optimization
Support SIMD intrinsics
CUDA cluster:72% of hand-tuned version’s performance, code
reduction to 68%
Multi-core cluster: better process mapping and cache reuse than
UPC
OpenMP and Runtime Support for
Heterogeneous Platforms

Heterogeneous platforms consisting of CPUs and GPUs



OpenMP extension



Specify partitioning ratio to optimize data transfer globally
Specify heterogeneous blocking sizes to reduce false sharing among
computing devices
Runtime support



Multiple GPUs, or CPU-GPU cooperation brings extra data transfer
hurting the performance gain
Programmers need unified data management system
DSM system based on the blocking size specified
Intelligent runtime prefetching with the help of compiler analysis
Implementation and results


On OpenUH compiler
Gains 1.6X speedup through prefetching on NPB/SP (class C)
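The two OpenMP extensions named above (a partitioning ratio and heterogeneous blocking sizes) can be pictured with a short Python sketch. The actual directive syntax is not given on the slides, so the function names, the ratio, and the block sizes here are hypothetical and only show the bookkeeping such clauses imply.

```python
def partition_iterations(n, gpu_ratio):
    """Split the iteration space range(n) into a GPU part and a CPU part."""
    if not 0.0 <= gpu_ratio <= 1.0:
        raise ValueError("gpu_ratio must be between 0 and 1")
    split = round(n * gpu_ratio)
    return range(0, split), range(split, n)


def blocks(iterations, block_size):
    """Cut a contiguous iteration range into blocks of at most block_size."""
    return [range(start, min(start + block_size, iterations.stop))
            for start in range(iterations.start, iterations.stop, block_size)]


# Example: 75% of 1000 iterations go to the GPU in large blocks,
# the rest go to the CPU in smaller blocks.
gpu_part, cpu_part = partition_iterations(1000, 0.75)
gpu_blocks = blocks(gpu_part, 256)
cpu_blocks = blocks(cpu_part, 64)
```

Keeping each device on its own contiguous blocks is one way a block-based DSM layer could limit the data that has to move between CPU and GPU memory; the slide's runtime may organize this differently.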
Analyzers based on Compiling
Techniques for MPI programs

Communication slicing and process mapping tool

Compiler part



Optimized mapping tool



Weighted graph, Hardware characteristic
Graph partitioning and feedback-based evaluation
Memory bandwidth measuring tool for MPI programs


PDG Graph Building and slicing generation
Iteration Set Transformation for approximation
Detect the burst of bandwidth requirements
Enhance the performance of MPI error checking



Redundant error checking removal by dynamically turning on/off the global
error checking
With the help of compiler analysis on communicators
Integrated with a model checking tool (ISP) and a runtime checking tool
(MARMOT)
LoongCC: An Optimizing Compiler for
Loongson Multicore Processors

Based on Open64-4.2 and supporting C/C++/Fortran


Powerful optimizer and analyzer with better
performances






Open source at http://svn.open64.net/svnroot/open64/trunk/
SIMD intrinsic support
Memory locality optimization
Data layout optimization
Data prefetching
Load/store grouping for 128-bit memory access instructions
Integrated with Aggressive Auto Parallelization
Optimization (AAPO) module



Dynamic privatization
Parallel model with dynamic alias optimization
Array reduction optimization
Tools
Testing and evaluation of HPC
systems




A center led by Tsinghua University
(Prof. Wenguang Chen)
Developing accurate and efficient
testing and evaluation tools
Developing benchmarks for HPC
evaluation
Provide services to HPC developers
and users
LSP3AS: large-scale parallel program
performance analysis system
Source Code


Designed for performance
tuning on peta-scale HPC
systems
Method:




Source code is
instrumented
Instrumented code is
executed, generating
profiling&tracing data files
The profiling&tracing data
is analyzed and
visualization report is
generated
Instrumentation: based on
TAU from University of
Dynamic Compensation
TAU Instrumentation
Measurement API
RDMA Transmission and
Buffer Management
Instrumented Code
Compiler/Linker
External Libraries
RDMA Library
Executable Datafile
Environment
Clustering Analysis
Based on Iteration
Performance Datafile
Clustering Visualization
Based on hierarchy
classify
Profiling Tools
Visualization and Analysis
Traditional Process of
performance analysis
Tracing Tools
Analysis based on
hierarchical clustering
Dependency of Each Step
Innovations
LSP3AS: large-scale parallel
program performance analysis
system

Scalable performance data
collection



Distributed data collection and
transmission: eliminate
bottlenecks in network and
data processing
Dynamic Compensation:
reduce the influence of
performance data volume
Efficient Data Transmission:
use Remote Direct Memory
Access (RDMA) to achieve
high bandwidth and low
latency
[Figure: scalable performance data collection. On each compute node, user processes write performance data to shared memory and sender threads transmit it by RDMA to receiver threads on the IO nodes (Lustre client or GFS), which connect to the storage system over FC.]
LSP3AS: large-scale parallel
program performance analysis
system

Analysis & Visualization


Data Analysis: Iteration-based
clustering is used
Visualization: Clustering
visualization Based on
Hierarchy Classification
SimHPC: Parallel Simulator

Challenge for HPC Simulation: performance


Target system: >1,000 nodes and processors
Difficult for traditional architecture simulators


e.g. Simics
Our solution

Parallel simulation


Use the same kind of node in the host system as in the target


Using cluster to simulate cluster
Advantage: no need to model and simulate detailed components,
such as pipeline in processors and cache
Execution-driven, full-system simulation; supports execution of
Linux and applications, including benchmarks (e.g. Linpack)
SimHPC: Parallel Simulator (cont’d)

Analysis

Execution time of a process in target system is
composed of:
$T_{process} = T_{run} + T_{IO} + T_{ready}$
− $T_{run}$: execution time of instruction sequences; equal to that on the host, can be obtained in the Linux kernel
− $T_{IO}$: I/O blocking time, such as r/w files, send/recv msgs; unequal to that on the host, needs to be simulated
− $T_{ready}$: waiting time in ready-state; unequal to that on the host, needs to be re-calculated
So, Our simulator needs to:
①Capture system events
• process scheduling
• I/O operations: read/write files, MPI send()/recv()
②Simulate I/O and interconnection network subsystems
③Synchronize timing of each application process
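The three-term decomposition can be sketched in Python as follows. This is only an illustration of the bookkeeping, not SimHPC's code: the `Event` record, the `io_model` and `ready_model` callbacks, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str         # "run", "io", or "ready"
    host_ms: float    # duration observed on the host node
    size: int = 0     # bytes moved, used only for "io" events


def target_process_time(events, io_model, ready_model):
    """Estimate a process's execution time on the target system.

    Run time is taken directly from the host measurement, I/O time comes
    from a simulated I/O or network model, and ready-queue waiting time is
    re-calculated by a separate model.
    """
    t_run = sum(e.host_ms for e in events if e.kind == "run")
    t_io = sum(io_model(e) for e in events if e.kind == "io")
    t_ready = sum(ready_model(e) for e in events if e.kind == "ready")
    return t_run + t_io + t_ready


# Example with made-up models: 0.5 ms latency plus 1 ms per 1000 bytes for
# I/O, and ready-state waiting assumed to be half of what the host observed.
trace = [Event("run", 10.0), Event("io", 3.0, size=2000),
         Event("ready", 4.0), Event("run", 5.0)]
estimate = target_process_time(
    trace,
    io_model=lambda e: 0.5 + e.size / 1000.0,
    ready_model=lambda e: 0.5 * e.host_ms,
)
```

Here the host-measured I/O duration (3.0 ms) is deliberately ignored in favor of the modeled value, which reflects the point that only the run time carries over unchanged from host to target.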
SimHPC: Parallel Simulator (cont’d)

System Architecture

Application processes of multiple target nodes
allocated to one host node
number of host nodes << number of target nodes



Events captured on host node while application is
running
Events sent to the central node for time analysis,
synchronization, and simulation
[Figure: SimHPC system architecture. Each host node (host hardware platform, host Linux) runs the processes of several target nodes of the parallel application together with an event capture module. The simulator performs event collection, control, architecture simulation (interconnection network, disk I/O), and analysis and time-axis synchronization, and produces the simulation results.]
SimHPC: Parallel Simulator (cont’d)
• Experiment Results
–
–
–
–
Host: 5 IBM Blade HS21 (2-way Xeon)
Target: 32 – 1024 nodes
OS: Linux
App: Linpack HPL
Simulation Slowdown
Simulation Error Test
Linpack performance for Fat-tree
and 2D-mesh Interconnection
Communication time for Fat-tree
and 2D-mesh Interconnection
System-level Power Management

Power-aware Job
Scheduling algorithm
Suspend a node if its idle time > threshold
Wake up nodes if there are not
enough nodes to execute
jobs, while
avoiding node thrashing
between the busy and suspended
states


The algorithm is
integrated into OpenPBS
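The suspend/wake-up rule can be sketched as a single scheduling pass in Python. This is an assumption-laden illustration rather than the algorithm integrated into OpenPBS: the `Node` record, the `schedule_step` function, and the choice to limit thrashing by resetting the idle timer on wake-up are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    state: str = "idle"       # "busy", "idle", or "suspended"
    idle_since: float = 0.0   # time at which the node last became idle


def schedule_step(nodes, needed, now, idle_threshold):
    """Run one pass of the policy and return the list of actions taken.

    needed is the number of idle nodes the queued jobs require right now.
    """
    actions = []
    shortfall = needed - sum(1 for n in nodes if n.state == "idle")

    # Wake up suspended nodes only when the idle pool cannot cover the demand.
    for node in nodes:
        if shortfall <= 0:
            break
        if node.state == "suspended":
            node.state = "idle"
            node.idle_since = now  # restart the idle timer to limit thrashing
            actions.append(("wakeup", node.name))
            shortfall -= 1

    # Suspend idle nodes that are not needed and have been idle long enough.
    spare = [n for n in nodes if n.state == "idle"][needed:]
    for node in spare:
        if now - node.idle_since > idle_threshold:
            node.state = "suspended"
            actions.append(("suspend", node.name))
    return actions
```

Because a freshly woken node restarts its idle timer, it cannot be suspended again until a full threshold has elapsed, which is one simple way to keep nodes from bouncing between states.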
System-level Power Management (cont’d)

Power Management Tool



Monitor the power-related status of the system
Reduce runtime power consumption of the machine
Multiple power management policies




Manual-control
On-demand control
Suspend-enable
…
[Figure: Layers of Power Management]
• Policy level: power management policies
• Management / interface level: power management software / interfaces
• Node level: power management agent in node (node sleep/wakeup, node on/off, CPU frequency control, fan speed control, power control of I/O equipment)
System-level Power Management (cont’d)
• Power Management Test
– On 5 IBM HS21 blades
[Figure: test setup. Labels: Control & Monitor, Commands, Status, Power data, Power, Power Measurement System, Comparison.]

Task load (tasks per hour) | Power management policy | Task exec. time (s) | Power consumption (J) | Performance slowdown | Power saving
20                         | On-demand               | 3.55                | 1778077               | 5.15%                | -1.66%
20                         | Suspend                 | 3.60                | 1632521               | 9.76%                | -12.74%
200                        | On-demand               | 3.55                | 1831432               | 4.62%                | -3.84%
200                        | Suspend                 | 3.65                | 1683161               | 10.61%               | -10.78%
800                        | On-demand               | 3.55                | 2132947               | 3.55%                | -7.05%
800                        | Suspend                 | 3.66                | 2123577               | 11.25%               | -9.34%

Power management test for different task loads
(compared to no power management)
Domain specific programming
support
Parallel Computing Platform
for Astrophysics

Joint work





Shanghai Astronomical Observatory, CAS (SHAO),
Institute of Software, CAS (ISCAS)
Shanghai Supercomputer Center (SSC)
Build a high performance parallel computing software
platform for astrophysics research, focusing on the
planetary fluid dynamics and N-body problems
New parallel computing models and parallel algorithms
studied, validated and adopted to achieve high
performance.
Architecture
Web Portal on CNGrid
Software Platform for Astrophysics
Data Processing
Scientific Visualization
Physical and
Mathematical
Model
Numerical
Methods
Fluid Dynamics
PETSc
Aztec
Improved
Preconditioner
MPI
OpenMP
N-body Problem
FFTW
SpMV
Fortran
GSL
Improved Lib. for
Collective
Communication
C
100T Supercomputer
Lustre
Parallel
Computing
Model
Software
Development
PETSc Optimized (Speedup 15-26)

Method 1: Domain Decomposition Ordering Method for Field
Coupling

Method 2: Preconditioner for Domain Decomposition Method

Method 3: PETSc Multi-physics Data Structure
Left: mesh 128 x 128 x 96
Right: mesh 192 x 192 x 128
Computation Speedup: 15-26
Strong scalability: Original code normal, New code ideal
Test environment: BlueGene/L at NCAR (HPCA2009)
Strong Scalability on TianHe-1A
CLeXML Math Library
Task
Parallel
Self Adaptive
Tuning
Multi-core
parallel
Iterative Solver
LAPACK
BLAS
Computational Model
FFT
CPU
Self Adaptive
Tuning,
Instruction
Reordering,
Software
Pipelining…
BLAS2 Performance: MKL vs.
CLeXML
HPC Software support for
Earth System Modeling

Led by Tsinghua University






Tsinghua
Beihang University
Jiangnan Computing Institute
Peking University
…
Part of the national effort on climate
change study
Development
Wizard and
Editor
Source Code
Compiler/
Debugger/
Optimizer
Other Data
Executable
Algorithm
(Parallel)
Initial Field and
Boundary Condition
Running
Environment
Earth System Model
Development Workflow
Earth
System
Model
Computation
Output
Result Evaluation
Result Visualization
Data
Visualization
and Analysis
Tools
Data
Management
Subsystem
Standard
Data Set
Major research activities
Subproject I
• Efficient integration and management of massive heterogeneous data
Subproject II
• Fast visualization of massive data analysis and diagnosis of model data
Subproject III
• MPMD program debugging, analysis and high-availability technologies
Subproject IV
• Integrated development environment (IDE) and demonstrative applications for earth system model
Demonstrative Applications
Expected Results
research on
global change
model application
systems
development
tools:
data conversion
diagnosis
debugging
performance analysis
high availability
integrated high performance
computing environment for
earth system model
Existing tools:
compiler
system monitor
version control
editor
software
standards
international
resources
template library
module library
high performance computers in China
Potential cooperation areas

Software for exa-scale computer systems







Power
Performance
Programmability
resilience
CPU/GPU hybrid programming
Parallel algorithms and parallel program
frameworks
Large scale parallel applications support

Applications requiring ExaFlops computers
Thank you!