Future mass apps reflect a concurrent world
• Exciting applications in the future mass computing market represent and model the physical world.
• These have traditionally been considered "supercomputing apps", or super-apps:
  physiological simulation, molecular dynamics simulation, video and audio manipulation, medical imaging, consumer game and virtual reality products.
• Attempts to grow current architectures "out" or domain-specific architectures "in" have lacked success; a broader approach that covers more domains is promising.
[Figure labels: traditional applications; current architecture coverage; new applications; domain-specific architecture coverage; obstacles.]
Wen-mei W. Hwu—University of Illinois at Urbana-Champaign
ISCA Panel June 7, 2005
MPEG Encoding Parallelism
• Independent IPPP sequences
  [Figure: video stream split into sequences I1, P1.1, P1.2, ... through Ix, Px.1, Px.2, ...]
• Frames: independent 16x16 pel macroblocks
  [Figure: frame I1 divided into macroblocks MB1, MB2, MB3, MB4; frame order = I, P1, P2, ...]
• Localized dependence of P-frame macroblocks on the previous frame
  [Figure: MB2 in frame P2 and the P1 macroblocks containing the region of P1 required to process MB2.]
• Steps of macroblock processing exhibit finer-grained parallelism; each block spans function boundaries
  [Figure: per-macroblock steps: Mot. Est.; Mot. Comp. & Frame Sub.; DCT & Quantizer; Dequantizer & IDCT; Output.]
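The localized dependence of a P-frame macroblock on the previous frame can be sketched as a small index computation. This is a minimal illustration, not code from the talk: the function name `reference_macroblocks`, the `search_range` parameter (motion search radius in pels), and the clamping behavior at frame edges are all assumptions.

```python
def reference_macroblocks(mb_col, mb_row, search_range, cols, rows, mb_size=16):
    """Return (col, row) indices of previous-frame macroblocks that overlap
    the motion search window of macroblock (mb_col, mb_row).

    search_range is an assumed motion search radius in pels; cols and rows
    give the frame size in macroblocks.
    """
    # Pel bounds of the search window, clamped to the frame.
    x0 = max(mb_col * mb_size - search_range, 0)
    y0 = max(mb_row * mb_size - search_range, 0)
    x1 = min((mb_col + 1) * mb_size - 1 + search_range, cols * mb_size - 1)
    y1 = min((mb_row + 1) * mb_size - 1 + search_range, rows * mb_size - 1)
    # Every macroblock touched by the window is a dependence.
    return [(c, r)
            for r in range(y0 // mb_size, y1 // mb_size + 1)
            for c in range(x0 // mb_size, x1 // mb_size + 1)]


# Second macroblock of a frame that is 4 macroblocks wide and 1 high.
needed = reference_macroblocks(mb_col=1, mb_row=0, search_range=8, cols=4, rows=1)
```

Macroblocks whose dependence sets are already complete in the previous frame could, under this model, be processed concurrently.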
Alternative Forms of MPEG-4 Threading
[Figure: timelines (time axis 0 to 7) of Frame 1 and Frame 2 under three threading schemes: loop partitioning, loop fusion + memory privatization, and macropipelining.]
Legend, operations performed on 16x16 macroblocks:
o Motion Estimation
o Motion Compensation, Frame Subtraction
o DCT & Quantization
o Dequantization, IDCT, Frame Addition
o Main Memory Access
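The difference between loop partitioning and loop fusion with memory privatization can be sketched as follows. The stage functions are hypothetical arithmetic stand-ins for the four macroblock operations, and the `shared_accesses` counter is an assumed cost model for main memory traffic, not a measurement from the talk.

```python
# Hypothetical stand-ins for the per-macroblock stages; real stages
# operate on 16x16 pel blocks rather than single integers.
STAGES = [
    lambda mb: mb + 1,   # motion estimation
    lambda mb: mb * 2,   # motion compensation, frame subtraction
    lambda mb: mb - 3,   # DCT & quantization
    lambda mb: mb * mb,  # dequantization, IDCT, frame addition
]


def encode_partitioned(frame):
    """One loop per stage; each stage reads and writes a frame-sized buffer."""
    buffer = list(frame)
    shared_accesses = 0
    for stage in STAGES:
        buffer = [stage(mb) for mb in buffer]
        shared_accesses += 2 * len(frame)  # one read and one write per macroblock
    return buffer, shared_accesses


def encode_fused(frame):
    """One loop over macroblocks; intermediates stay in a private temporary."""
    out = []
    shared_accesses = 0
    for mb in frame:
        tmp = mb                 # single shared read into private storage
        for stage in STAGES:
            tmp = stage(tmp)     # intermediate results never leave tmp
        out.append(tmp)          # single shared write
        shared_accesses += 2
    return out, shared_accesses


frame = [1, 2, 3, 4]
partitioned_out, partitioned_cost = encode_partitioned(frame)
fused_out, fused_cost = encode_fused(frame)
```

Both versions compute the same result; in this model only the number of shared-buffer accesses differs.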
Building on HPF Compilation: what’s new?
• Applicability to the mass software base requires pointer analysis, control flow analysis, and data structure and object analysis, beyond traditional dependence analysis
• Domain-specific, application model languages
  o More intuitive than C for inherently parallel problems
  o Increased productivity, increased portability
  o Will still likely have C as the implementation language
  o There is room for a new app language or a family of languages
• Role for the compiler in model language environments
  o The model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
  o The compiler can magnify the usefulness of model information with its low-level analysis
Pointer analysis: sensitivity, stability and safety
Fulcra in OpenIMPACT [SAS2004, PASTE2004] and others
• Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
• Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools
[Figure: 132.ijpeg, Discovered Objects versus Observed Connectivity (axis ticks from 1 to 10000), with analysis scope running from WORSE to BETTER. Annotations: a few, highly connected objects; a multitude of distinct objects.]
[Figure: call graphs, blurred view (above) and more accurate view (below).]
Thoughts from the VLIW/EPIC Experience
• Any significant compiler work for a new computing platform takes 10-15 years to mature
  o 1989-1998: initial academic results from IMPACT
  o 1995-2005: technology collaboration with Intel/HP
  o 2000-2005: SPEC 2000, Itanium 1 and 2, open source apps
  o This was built on significant work from the Multiflow, Cydrome, RISC, and HPC teams
• Real work in compiler development begins when hardware arrives
  o IMPACT output code performance has improved by more than 20% since the arrival of Itanium hardware, with much better stability
  o Most apps were brought up with IMPACT after Itanium systems arrived: debugging!
  o Real performance effects can only be measured on hardware
  o Early access to hardware for academic compiler teams is crucial and must be a priority for industry development teams
• Quantitative methodology driven by large apps is key
  o Innovations are evaluated in a whole-system context
How the next-generation compiler will do it (1)
[Figure: callgraph and memory diagram with nodes Upsample, Table Initialization, Load Scanline, and Color Conversion; heavyweight loops are marked.]
Acceleration opportunities:
o Heavyweight loops identified for acceleration
o However, they are isolated in separate functions called through pointers
To-do list:
o Identify acceleration opportunities
o Localize memory
o Stream data and overlap computation
How the next-generation compiler will do it (2)
[Figure: callgraph and memory diagram with nodes Upsample, Table Initialization, Load Scanline, and Color Conversion, with Accelerator 1 and Accelerator 2 marked. Annotations: initialization code identified; large constant lookup tables identified.]
Localize memory:
o Pointer analysis identifies indirect callees
o Pointer analysis identifies localizable memory objects
o Private tables inside accelerator initialized once, saving traffic
To-do list:
✓ Identify acceleration opportunities
o Localize memory
o Stream data and overlap computation
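How pointer analysis can identify indirect callees may be easier to see in a toy form. The sketch below is a generic flow-insensitive, inclusion-based points-to analysis, not the Fulcra algorithm from OpenIMPACT; the statement encoding and all variable and function names (`fp`, `slot`, `upsample`, `color_convert`, and so on) are hypothetical.

```python
from collections import defaultdict


def points_to(statements):
    """Flow-insensitive, inclusion-based points-to analysis (toy version).

    Each statement is a tuple (kind, dst, src):
      ("addr",  p, x)  models  p = &x
      ("copy",  p, q)  models  p = q
      ("load",  p, q)  models  p = *q
      ("store", p, q)  models  *p = q
    Returns a dict mapping each name to the set of objects it may point to.
    """
    pts = defaultdict(set)
    changed = True
    while changed:  # repeat until no points-to set grows (fixed point)
        changed = False
        for kind, dst, src in statements:
            if kind == "addr":
                new = {(dst, src)}
            elif kind == "copy":
                new = {(dst, o) for o in pts[src]}
            elif kind == "load":
                new = {(dst, o2) for o in pts[src] for o2 in pts[o]}
            elif kind == "store":
                new = {(o, o2) for o in pts[dst] for o2 in pts[src]}
            else:
                raise ValueError(f"unknown statement kind: {kind}")
            for var, obj in new:
                if obj not in pts[var]:
                    pts[var].add(obj)
                    changed = True
    return dict(pts)


# Hypothetical program: two method tables, each holding one function pointer.
program = [
    ("addr", "fp", "upsample"),          # fp = &upsample
    ("addr", "slot", "method_table"),    # slot = &method_table
    ("store", "slot", "fp"),             # *slot = fp
    ("addr", "fq", "color_convert"),     # fq = &color_convert
    ("addr", "other", "convert_table"),  # other = &convert_table
    ("store", "other", "fq"),            # *other = fq
    ("load", "callee", "slot"),          # callee = *slot; then (*callee)(...)
]
result = points_to(program)
```

Here the indirect call through `callee` resolves to a single target, which is the kind of precision that lets a heavyweight loop behind a function pointer be matched to an accelerator.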
How the next-generation compiler will do it (3)
[Figure: callgraph and memory diagram with nodes Upsample, Table Initialization, Load Scanline, and Color Conversion, with Accelerator 1 and Accelerator 2 marked. Annotations: summarize input access pattern; summarize output access pattern; constant table privatized.]
Streaming and computation overlap:
o Memory dataflow summarizes array/pointer access patterns
o Opportunities for streaming are automatically identified
o Unnecessary memory operations replaced with streaming
To-do list:
✓ Identify acceleration opportunities
✓ Localize memory
o Stream data and overlap computation
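One simple way to picture an access-pattern summary is as a (base, stride, count) triple. The sketch below is an illustrative simplification, not the memory dataflow analysis described in the talk; the function name `summarize_accesses` and the triple representation are assumptions.

```python
def summarize_accesses(offsets):
    """Summarize a sequence of accessed offsets as (base, stride, count).

    Returns None when the sequence is not a constant-stride stream, in which
    case this simple model finds no streaming opportunity.
    """
    if not offsets:
        return None
    if len(offsets) == 1:
        return (offsets[0], 0, 1)
    stride = offsets[1] - offsets[0]
    for prev, cur in zip(offsets, offsets[1:]):
        if cur - prev != stride:
            return None  # irregular access, e.g. a data-dependent table lookup
    return (offsets[0], stride, len(offsets))


# A loop that reads every third sample of an interleaved row is a regular stream.
regular = summarize_accesses([3 * i for i in range(8)])
# Data-dependent lookups into a table have no constant stride.
irregular = summarize_accesses([5, 1, 7, 2])
```

A loop whose input and output both summarize to such triples is a candidate for replacing individual loads and stores with a stream.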
How the next-generation compiler will do it (4)
[Figure: callgraph and memory diagram with nodes Upsample, Table Initialization, Load Scanline, and Color Conversion, with Accelerator 1 and Accelerator 2 marked.]
Achieve macropipelining of parallelizable accelerators:
o Upsampling and color conversion can stream to each other
o Optimizations can have substantial effect on both efficiency and performance
To-do list:
✓ Identify acceleration opportunities
✓ Localize memory
✓ Stream data and overlap computation
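The structure of macropipelining, where one stage streams its output directly to the next, can be sketched with threads and bounded queues. This is only a structural illustration: the `upsample` and `color_convert` bodies are hypothetical stand-ins, the `depth` parameter is an assumed buffer size, and Python threads do not model accelerator performance.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item until the sentinel."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass the end-of-stream marker downstream
            return
        outbox.put(func(item))


def upsample(row):
    # Hypothetical 2x horizontal upsampling by sample replication.
    return [s for s in row for _ in range(2)]


def color_convert(row):
    # Hypothetical stand-in for color conversion: a fixed per-sample offset.
    return [s + 128 for s in row]


def macropipeline(rows, funcs, depth=2):
    """Stream rows through funcs, one thread per stage, with bounded queues."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(funcs) + 1)]

    def feed():
        for row in rows:
            queues[0].put(row)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
                for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    results = []
    while True:  # drain the last queue so that no stage blocks forever
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


rows = [[1, 2], [3, 4], [5, 6]]
streamed = macropipeline(rows, [upsample, color_convert])
```

Because each queue is first-in first-out and each stage runs in a single thread, row order is preserved while the stages overlap in time.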
Memory dataflow in the pointer world
[Figure: sample storage with Rows and Cols dimensions and Y, C, C components. Annotations: array of constant pointers; row arrays never overlap.]
• Arrays are not true 3D arrays (unlike in Fortran)
• Actual implementation: array of pointers to arrays of samples
• New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
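The property that row arrays never overlap can be stated as a check over the pointer table. The sketch below models addresses as plain integers; the names `rows_are_disjoint` and `sample_address`, the three-components-per-sample layout, and the example addresses are all assumptions made for illustration.

```python
def rows_are_disjoint(row_table, row_bytes):
    """Return True if no two row arrays overlap in memory.

    row_table models the array of pointers as a list of base addresses;
    row_bytes is the size of each row array.
    """
    bases = sorted(row_table)
    return all(b - a >= row_bytes for a, b in zip(bases, bases[1:]))


def sample_address(row_table, row, col, component, components=3):
    """Address of one sample: an indirection through the pointer table,
    then an offset within the row (one byte per component assumed)."""
    return row_table[row] + col * components + component


# Hypothetical layout: rows allocated separately, not in index order.
table = [0x1000, 0x3000, 0x2000]
disjoint = rows_are_disjoint(table, row_bytes=0x400)
addr = sample_address(table, row=2, col=5, component=1)
```

Unlike a true 3D array, consecutive rows here are not a fixed distance apart, so independence between rows has to be established from facts like the disjointness check rather than from index arithmetic alone.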