Low Power Multimedia
Reconfigurable Platforms
Jun-Dong Cho
SungKyunKwan Univ.
Dept. of ECE, Vada Lab.
http://vada.skku.ac.kr
What are the Challenges?
[STMicroelectronics, MorphICs, Dataquest, eASIC]
[Figure: growth factor versus time in months.]
VLSI Algorithmic Design Automation Lab. at SKKU
Reconfigurable System



Reconfigurable systems suit the dynamic application and communication environment of wireless multimedia devices such as software-defined radio (SDR).
A hierarchical system model is used in which Quality of Service and energy consumption play a crucial role.
The tasks of an application are partitioned dynamically.
Reconfigurable SOC



As technology (and supply voltage) scales down, logic (transistors) becomes virtually free, while interconnect becomes the bottleneck and a major consumer of power.
Parallel execution of nested Do-loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
It offers a compromise among three competing concerns: design time, power consumption, and performance.
Context

SoC and Customizable Platform Based-Design
[Figure: one set of specifications (processing power, area, power consumption, etc.) can be mapped onto a DSP, ASIC 1, ASIC 2, fine-grain reconfigurable hardware, or coarse-grain reconfigurable hardware.]
We need metrics to compare!
First choose the right architecture …
Jan Rabaey
[Figure: flexibility versus efficiency, spanning a factor of 100-1000 in area or power. Architectures shown: embedded processor (lpArm), DSP (e.g. TI 320CXX), reconfigurable processors (Maia), embedded FPGA, and direct-mapped hardware. Block labels: Prog Mem, µP, MAC Unit, Addr Gen. Efficiency bands shown: 0.5-5 MIPS/mW, 10-100 MOPS/mW, and 100-1000 MOPS/mW.]
Design Space of Reconfigurable
Architecture
RECONFIGURABLE ARCHITECTURES (R-SOC)
FINE GRAIN (FPGA)
  Island or hierarchical topology
    • Xilinx Virtex
    • Xilinx Spartan
    • Atmel AT40K
    • Lattice ispXPGA
    • Altera Stratix
    • Altera Apex
    • Altera Cyclone
MULTI GRANULARITY (Heterogeneous)
  Processor + coarse-grain coprocessor
    • Chameleon
    • REMARC
    • Morphosys
    • Pleiades
    • Garp
  Processor + fine-grain coprocessor
    • FIPSOC
    • Triscend E5
    • Triscend A7
    • Xilinx Virtex-II Pro
    • Altera Excalibur
    • Atmel FPSIC
  Tile-based architecture
    • aSoC
    • E-FPFA
COARSE GRAIN (Systolic)
  Mesh or linear topology
    • RAW
    • Systolic Ring
    • CHESS
    • RaPiD
    • MATRIX
    • PipeRench
    • KressArray
    • Systolix Pulsedsp
  Hierarchical topology
    • DART
    • FPFA
Semiconductor
Revolutions
“Mainstream Silicon Application
is switching every 10 Years”
[Figure: timeline from 1957 to 2007 in ten-year steps, with eras labeled standard and custom. Labeled technologies: TTL; LSI and MSI; µproc. and memory; ASICs and accelerators. Annotations: hardware people, software people, 1st design crisis, 2nd design crisis, new breed needed, new breed (M&C).]
3 different mind sets
[Figure: timeline from 1957 to 2007 (TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGAs; soft CPUs; coarse grain), annotated with three groups: hardware people, CS people, and a new breed needed.]
Common terminology needed
Machine paradigms
[Figure: two machine paradigms. von Neumann machine: a CPU with an instruction sequencer, memory (M), and I/O, driven by an instruction stream (software). Data-stream machine: a DPU or rDPU, or an array of them ((r)DPA), surrounded by memories (M) with address generators acting as data sequencers (asM*), driven by data streams (flowware, configware).]
*embedded memory architecture
FPGA Chip versus DSP Chip
Programming language
  FPGA: VHDL, Verilog
  DSP: C, assembly language
Ease of software programming
  FPGA: Fairly easy, but the programmer needs to understand the hardware architecture before programming
  DSP: Easy
Performance
  FPGA: Can be very fast if an appropriate architecture is designed
  DSP: Speed is limited by the clock speed of the DSP chip
Reconfigurability
  FPGA: SRAM-type FPGAs can be reconfigured an unlimited number of times
  DSP: Can be reconfigured by changing the program memory content
Reconfiguration method
  FPGA: Downloading configuration data to the chip electronically
  DSP: Reading a program at a memory address
Area
  FPGA: FIR filter, IIR filter, correlator, convolver, FFT
  DSP: A signal processing program
Power consumption
  FPGA: Can be minimized if the circuit is designed to save power
  DSP: Power consumption does not change
Speed of MAC
  FPGA: Can be fast if a parallel algorithm is used
  DSP: Limited by the speed of the DSP chip
Parallelism
  FPGA: Can be parallelized to achieve high performance
  DSP: DSP chip programming is usually sequential
Architecture Choices for
Real-time Embedded System
Greg Delagi, TI
Fine-Grained RSOCs Xilinx Virtex II-Pro







Xilinx, Inc., San Jose, CA
Up to 4 PowerPC 405 processor cores
Up to 160k reconfigurable logic cells (4-input, 1-output lookup table)
Up to 216 dedicated 18-bit x 18-bit multipliers
Up to 216 on-chip 18-kbit distributed memory blocks
Up to 852 I/O pins
www.xilinx.com
Xilinx’s Xtreme
Fine-Grained RSOCs
Altera Excalibur
Altera, San Jose, CA
32-bit ARM9 Based
Microprocessor @200 MHz
Up to 256kbytes SRAM
Up to 1M programmable
logic gates
200 MHz Bus
Built-in SDRAM Controller
Fine-Grained RSOCs:
Triscend A7 CSOC
A7 Family, Triscend
32-bit ARM7 with 8 kB cache
Up to 3200 logic cells (40K gates)
Up to 3800 flip-flops
Up to 300 programmable I/O pins
www.triscend.com
Chameleon Structure
Coarse-Grained RSOCs
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters
32-bit ARC control processor
Up to 84 32-bit datapath units (DPUs)
DPU = a 32-bit ALU + a 32-bit barrel shifter
Up to 24 16x24-bit multipliers
Up to 48 128x32-bit local memory modules
Up to 160 programmable I/O pins
Targeted at 3rd-generation wireless base stations, wireless local loop, software radio, etc.
www.chameleonsystems.com
Chameleon Systems Inc.
Architectural Rationale and Motivation
Scott Weber
University of California at Berkeley
Configurable processors have shown orders of magnitude performance improvements
Tensilica has shown ~2x to ~50x performance improvements
  Specialized functional units
  Memory configurations
Tensilica matches the architecture with software development tools
[Figure: configuration of a base processor (memory, register file, FU, ICache): set memory parameters, and add DCT and Huffman (HUF) blocks for a JPEG application.]
Architectural Rationale and Motivation

In order to continue this performance improvement trend



Architectural features which exploit more concurrency are required
Heterogeneous configurations need to be made possible
Software development tools support new configuration options
[Figure sequence: a single PE (memory, register file, ICache, and FUs including DCT and HUF blocks) that "...begins to look like a VLIW..."; an array of PEs, because "...concurrent processes are required in order to continue performance improvement trend..."; a regular PE mesh, although a "...generic mesh may not suit the application’s topology..."; and finally "...configurable VLIW PEs and network topology...".]
IXP1200 Network Processors
Six micro-engines
Support 24 contexts
Hash instructions
StrongARM core
Bus and memory controllers
Example of an architecture that we want the template to be configurable into
[Figure: IXP1200 Network Processor (Intel) block diagram: StrongARM (SA) core with ICache, DCache, and mini DCache; six MicroEngs; hash engine; scratch pad SRAM; SDRAM and SRAM controllers; PCI and IX Bus interfaces.]
Architecture Goals

Provide template for the exploration of a range of architectures

Retarget compiler and simulator to the architecture

Enable compiler to exploit the architecture

Concurrency
  Multiple instructions per processing element
  Multiple threads per and across processing elements
  Multiple processes per and across processing elements
Support for efficient computation
  Special-purpose functional units, intelligent memory, processing elements
Support for efficient communication
  Configurable network topology
  Combined shared memory and message passing
Architecture Template

Prototyping template for array of processing elements



Configure processing element for efficient computation
Configure memory elements for efficient retiming
Configure the network topology for efficient communication
[Figure sequence: "...configure PE..." (memory, register file, FUs, ICache); "...configure memory elements..."; "...configure PEs and network to match the application..." (PEs with FU, DCT, and HUF blocks).]
Architecture Template

Templates provide prototyping platform for constrained refinement

Estimators feed back system performance and guide configuration
The system designer refines the configuration, or the process is automated

Refined elements have a compatible interface in the system

[Figure: tool flow relating the programmer’s model, the compiler (producing .o files), the µArch designer, the simulator, and estimation; the compiler and simulator are generated (gen).]
Synthesis of Architectures


Not inventing new architectures
We are providing a tool for the prototyping and synthesis of a family of architectures
  Gives a micro-architecture, ISA, compiler, and simulator
  Refine within an instance to improve characteristics of the design
Most existing architectures are a point in the architecture spectrum
We want to allow a wide range of architectures to be realized
  Each coupled with supporting software development tools
Initial Processing Element

VLIW class architecture
  HPL-PD architecture
  Exploit ILP
Malleable elements
  Memory size
  Cache size
  Register file size
  Number of functional units
  Specialized functional units
[Figure: processing element with memory system, register file, four FUs, one SFU, and instruction cache.]
Future Processing Element

Specialized memory systems for efficient memory utilization
  Multi-ported, banked, multi-level, and intelligent memory
Split register file allows greater register bandwidth to FUs
  Groups of functional units have dedicated register files
Sticky state for specialized FUs saves register file reads and writes
Multiple contexts for a processing element provide latency tolerance
  Hardware for efficient context switching to fill empty instruction slots
Specialized functional units and processing elements
  SIMD instructions
  Re-configurable fabrics for bit-level operations
  Re-use IP blocks for more efficient computation
  Custom hardware for the highest performance
Initial Distributed Architecture
Array of concurrent PEs and supporting network
Malleable network topology
  Topology matches application
  Efficient communication
[Figure: array of PEs connected by a network.]
Initial Distributed Architecture

Array of concurrent PEs and supporting network
Malleable network and PEs
  Topology matches application
  Refine to meet system constraints
Memory organized around a PE
  Each PE has physical memory
  Message passing between PEs
[Figure: irregular array of PEs, some merged, connected by a network.]
Future Distributed Architecture

Multiple processing elements share a memory space
Shared memory communication
  Snooping cache coherency protocol
  Directory-based protocol required if the number of PEs sharing a memory space is large
Introspective processing elements
  Use processing elements to analyze the computation or communication
  Identify dynamic bottlenecks and remove them on the fly
  Reschedule and bind tasks as the introspective elements report
Communication Models

Shared memory
  Hardware handles loads and stores from PEs to a common memory
  Synchronization is separate from communication
  Interacting threads on a single processing element or a group of processing elements
Message passing
  Hardware to send and receive messages and invoke a handler
  Synchronization and communication are together
  Interacting processes between single processing elements or groups of processing elements
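The contrast between the two communication models can be sketched in software. The example below is only an illustration under stated assumptions: Python threads stand in for processing elements, a `threading.Lock` stands in for hardware synchronization, and a `queue.Queue` stands in for message hardware. The function names and the `None` end-of-stream marker are hypothetical, not part of the presented architecture.

```python
import queue
import threading


def shared_memory_demo(num_threads=2, increments=1000):
    """Threads communicate through one common location; the lock is separate."""
    shared = {"value": 0}          # the common memory
    guard = threading.Lock()       # synchronization, distinct from the data

    def worker():
        for _ in range(increments):
            with guard:            # synchronize first, then load and store
                shared["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


def message_passing_demo(payloads, handler):
    """A receive both transfers data and synchronizes; each message runs a handler."""
    inbox = queue.Queue()
    handled = []

    def receiver():
        while True:
            message = inbox.get()  # blocks until the sender has delivered
            if message is None:    # hypothetical end-of-stream marker
                return
            handled.append(handler(message))

    thread = threading.Thread(target=receiver)
    thread.start()
    for payload in payloads:
        inbox.put(payload)
    inbox.put(None)
    thread.join()
    return handled
```

In the first function the data transfer (the store) and the synchronization (the lock) are two separate mechanisms; in the second, the blocking receive provides both at once.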
Memory Model

Relax the consistency model
Hardware implements lock and unlock mutex instructions
Synchronization instructions inserted in program
Loads and stores before a lock must complete before loads and stores after the lock are started
Relaxes the ordering of reads and writes in order to increase memory utilization
Compiler is constrained on reordering around synchronization barriers
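The reordering constraint above can be illustrated with a small checker. This is a sketch under simplifying assumptions, not the project's compiler logic: operations are hypothetical `(kind, name)` tuples, only `lock` and `unlock` count as synchronization, and data dependences between loads and stores inside a segment are ignored.

```python
from collections import Counter

SYNC_OPS = {"lock", "unlock"}


def split_at_sync(ops):
    """Split an operation list into segments separated by sync operations."""
    segments, syncs, current = [], [], []
    for op in ops:
        if op[0] in SYNC_OPS:
            segments.append(current)
            syncs.append(op)
            current = []
        else:
            current.append(op)
    segments.append(current)
    return segments, syncs


def is_legal_reordering(original, reordered):
    """True if no load or store has moved across a lock or unlock."""
    orig_segments, orig_syncs = split_at_sync(original)
    new_segments, new_syncs = split_at_sync(reordered)
    if orig_syncs != new_syncs:    # sync operations must keep their order
        return False
    # Within a segment any order is accepted (dependences ignored here).
    return all(Counter(a) == Counter(b)
               for a, b in zip(orig_segments, new_segments))


program = [("store", "a"), ("load", "b"), ("lock", "m"),
           ("store", "c"), ("unlock", "m"), ("load", "d")]
swapped_before_lock = [("load", "b"), ("store", "a"), ("lock", "m"),
                       ("store", "c"), ("unlock", "m"), ("load", "d")]
moved_past_lock = [("store", "a"), ("lock", "m"), ("load", "b"),
                   ("store", "c"), ("unlock", "m"), ("load", "d")]
```

Here `swapped_before_lock` only permutes accesses between synchronization points and is accepted, while `moved_past_lock` sinks a load below the lock and is rejected.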
Range of Architectures
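Scalar Configuration
EPIC Configuration
EPIC with special FUs
Mesh of HPL-PD PEs
Customized PEs, network
Supports a family of architectures
Plan to extend the family with the micro-architectural features presented
[Figure sequence: one PE (memory system, register file, FU, instruction cache) shown first as a scalar configuration, then as an EPIC configuration with several FUs, then with special FUs (DES, DCT, FFT), then replicated as a mesh of PEs, and finally as customized PEs with a customized network.]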
Range of Architectures (Future)
Template support for such an architecture
Prototype architecture
Software development tools generated
  Generate compiler
  Generate simulator
[Figure: IXP1200 Network Processor (Intel) block diagram, as shown earlier.]
The Research Playground
[Figure: the research areas of the project: application, algorithm, and software implementation; the question "What is the Programmer’s Model?"; compilation and SW environment; architecture and microarchitecture; component assembly and synthesis; verification and manufacture test.]
Mescal Compiler
Manish Vachharajani
Princeton University
Outline



Compiler goals
Compiler research issues
Compiler infrastructure requirements



Trimaran 2.0 compiler infrastructure
Ongoing work
Summary
So What’s Different?

General purpose compiler hand tuned to:
  SPEC benchmarks
  A particular general purpose machine
Need compiler tuned to:
  Specific application
  A particular application specific machine
And…
  Meet code density, real-time, and power constraints
  Do this automatically for a range of applications/architectures
So What’s Different?

Traditional application hw/sw design requires
  Hand selection of traditional general purpose OS components
  Hand written customization of
    device drivers
    memory management…
Instead…
  Application specific synthesis of traditional OS components
    scheduling
    synchronization…
  Automatic synthesis of hardware specific code from specifications
    device drivers
    memory management…
Compiler Goals


Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.
10 Year Vision:
  Will have fully automatically retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
  Compiled code size and performance will be within 10% of hand-coding
Compiler Research Issues

Synthesis of RTOS elements in the compiler
  On the application side: Generation of an efficient application-specific static/run-time scheduler and synchronization
  On the hardware side: Generation of device drivers, memory management primitives, etc. using hardware specifications
Automatic retargetability for family of target architectures while preserving aggressive optimization
Automatic application partitioning
  Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in programmer’s model
Effective visualization for family of target architectures
Compiler Infrastructure Requirements

High level of usability
  good documentation, well coded
Large suite of machine-independent code optimizations
Significant level of retargetability
Strong support for instruction-level parallelism
Support for memory as a first-class citizen
Simulation tools
Preferably
  visualization tools
  a good support team
Trimaran 2.0 Compiler Overview
www.trimaran.org
IMPACT/ELCOR features strong VLIW data structure and algorithm support
  Data structures
    basic, hyper, super blocks
    loop analysis
    procedure analysis
    miscellaneous, e.g. lists, sets
  Algorithms
    if-conversion
    software pipelining
    scheduling/register allocation
[Figure: C source enters the IMPACT front-end (U. of Illinois IMPACT Group), then the ELCOR back-end driven by MDES (HP Labs CAR Group), then the simulator and visualization tools (NYU ReaCT-ILP Group).]
Trimaran 2.0 Overview:
Simulator and Visualization Tools

Cycle-level simulator easily extensible to support new specialized operations
  Simply augment table specifying operation semantics
Visualization tools visualize assortment of useful static / dynamic information
  Instruction schedule
  Data-dependency graphs
  Total cycles per function / region
  Percentage of total function operations that are branches, loads, stores, integer ALU, floating-point ALU, etc.
Trimaran 2.0 Overview:
Machine Description (MDES)
Target specified in high-level machine-description language
  Translated into low-level language
ELCOR supports PlayDoh
  Parameterized non-clustered VLIW architecture
  Support for speculative/predicated execution, software pipelining
User may modify the following PlayDoh parameters:
  number of registers
  number of integer, floating-point, memory, branch FUs
  operation latencies
[Figure: a high-level PlayDoh MDES is translated into a low-level PlayDoh MDES, which drives the Trimaran flow (C source, IMPACT front-end, ELCOR back-end, simulator and visualization).]
Extensions to Trimaran 2.0:
Support for Multiple PEs

ELCOR does not provide MDES and data structure support for multiple PlayDoh PEs
  New MDES format has been devised to support multiple PEs with varying connectivity
  Array of MDES data structures maintained, one per PE
  Each code region must be associated with an MDES PE prior to code generation
  Communication channels between PEs currently not modeled
MESCAL Machine Description
  PE1: machine description
  PE2: machine description
  ...
  PEm: machine description
  Channel1: from PE1 to PE2
  Channel2: from PE1 to PE3
  ...
  Channeln: from PEi to PEj
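The shape of such a multi-PE machine description can be sketched as a small data structure. This is a hypothetical illustration only: the class names, fields, and methods below are assumptions made for the example and do not reproduce the actual MDES format or the ELCOR data structures.

```python
from dataclasses import dataclass, field


@dataclass
class PEDescription:
    """Hypothetical per-PE machine description."""
    name: str
    num_registers: int
    functional_units: dict      # e.g. {"integer": 2, "memory": 1}


@dataclass
class Channel:
    """Directed channel from one PE to another."""
    source: str
    destination: str


@dataclass
class MachineDescription:
    pes: dict = field(default_factory=dict)           # PE name -> description
    channels: list = field(default_factory=list)
    region_to_pe: dict = field(default_factory=dict)  # code region -> PE name

    def add_pe(self, pe):
        self.pes[pe.name] = pe

    def add_channel(self, source, destination):
        if source not in self.pes or destination not in self.pes:
            raise KeyError("both channel endpoints must be declared PEs")
        self.channels.append(Channel(source, destination))

    def bind_region(self, region, pe_name):
        """Associate a code region with a PE before code generation."""
        if pe_name not in self.pes:
            raise KeyError(pe_name)
        self.region_to_pe[region] = pe_name

    def description_for(self, region):
        return self.pes[self.region_to_pe[region]]


machine = MachineDescription()
machine.add_pe(PEDescription("PE1", 64, {"integer": 2, "memory": 1}))
machine.add_pe(PEDescription("PE2", 32, {"integer": 1, "memory": 1}))
machine.add_channel("PE1", "PE2")
machine.bind_region("inner_loop", "PE1")
```

Keeping one description per PE and looking it up by code region mirrors the requirement that every region be bound to a PE before code generation.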
Support for Specialized FUs and
Operations



ELCOR lacks support for specialized FUs and operations
MESCAL supports specialized FUs and operations via function intrinsics, which get translated into special operations.
Implementing a special operation requires only a mapping from its intrinsic.
[Figure: the intrinsic x = NORM(y) maps to the assembly operation NORM B, which executes on normalization hardware.]
Mescal Compiler Framework

MESCAL source code layer exists on top of ELCOR
  All Trimaran source code needing modification is copied over to the MESCAL layer
  MESCAL source code is compatible with future Trimaran releases
[Figure: C source flows through the Trimaran IMPACT front-end, the ELCOR back-end with its MESCAL layer (ELCOR MDES and MESCAL MDES), and the simulator and visualization tools.]
What Do You Get?
Mescal Compiler will feature:
  Automatic retargetability for architectures consisting of multiple heterogeneous PEs and a configurable communication topology
  Mapping of coarse-grain parallelism onto multiple PEs via guidance from programmer’s model
  Programmer’s model will allow code generation with size and performance comparable to hand-coding
  Synthesis of RTOS elements and synchronization that are tuned to the application
[Figure: a hardware description and application code in the programmer’s model feed the Mescal compiler (front end, back end, RTOS synthesis), which produces system code.]
Ongoing Work
Automatic device driver synthesis from a system specification
  Xiaoling Xu, Minxi Gao: UCB
Support for additional classes of processors (e.g. DSPs) within Mescal framework
  Subbu Rajagopalan: Princeton
  involves adding support for memory as a first-class citizen
Tuning of front/back end code optimizations, based on application and micro-architectural characteristics
  Manish Vachharajani: Princeton
Automatic synthesis of RTOS elements in compiler
  Shaojie Wang: Princeton
Dynamically-reconfigurable computing for systems-on-a-chip
  Zhining Huang: Princeton
MESCAL compiler overview
  Niraj Shah, Michael Shilman: UCB
MESCAL Programmer’s Model
Niraj Shah
University of California at Berkeley
Outline





Motivation
Goals
Our Approach
Initial Model
Ongoing Research
Motivation

Silicon integration is allowing for high micro-architectural complexity on a die (e.g. Intel IXP1200)
  multiple processors
  specialized execution units
  hardware context swap
Circuit architects are designing more complex devices
How do we program these architectures?
[Figure: Intel IXP1200 block diagram: StrongARM core with I-cache, D-cache, and mini D-cache; six µengines; hash engine; scratch pad SRAM; SDRAM and SRAM controllers; PCI and IX Bus interfaces.]
Example: C Language



C compilers of the early 70’s were not good, but C became the standard for writing efficient code.
C provided an abstraction (programmer’s model) of standard processors that allowed programmers to write efficient code.
Its designers found the 20% of assembler capability that captures 80% of program efficiency:
  register keyword
  pointer arithmetic
  bit-level operations
Goals

Capture the 20% of architectural features of new architectural platforms to get 80% of the performance
  Concurrency
    processor level
    functional unit level
    bit level
  Memory
    useful characteristics of specialized memories
    address generation units
Present the programmer with an abstraction of the architecture while giving them the power to write efficient code
Our Approach

Combine bottom-up and top-down views
  Bottom-up: create an abstraction of the architecture
    visibility - sufficient detail of the architecture to allow the programmer to improve the efficiency of the program
    opacity - hide micro-architectural details from the programmer
  Top-down: expressive enough for the programmer to relay all the information he/she knows about the program to the compiler
Bottom-Up View

Visible
  Specialized hardware
    FU’s
    PE’s
Opaque
  Micro-architectural features
    pipelines
    cache details
Communication
  explicit message passing
  shared address space
[Figure: array of PEs; one PE expanded to show four FUs and one SFU.]
Top-Down View

Parallelism at different levels
  Process level - communicate via message passing
  Task/thread level - communicate via shared memory
OS capabilities
  Scheduling
  Binding
  Synchronization
Initial Programmer’s Model



Start with C
View specialized FU’s through intrinsics (e.g. normalization)
  Intrinsic: x = NORM(y)
  Assembly: NORM B
Model process level concurrency through a hybrid communication model
  Processes - subset of Message Passing Interface (MPI)
  Threads - shared memory
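To make the value of an intrinsic concrete, the sketch below models what a normalization operation might compute. The slides do not define NORM, so its semantics here are an assumption borrowed from common DSP practice: the number of left shifts needed to normalize a 16-bit two's-complement value. Both function names are hypothetical. The loop version shows the kind of code a compiler would otherwise have to recognize; the closed form is what a dedicated functional unit could return in one operation.

```python
def norm(value, width=16):
    """Assumed NORM semantics: left shifts needed to normalize `value`.

    Equivalent to counting redundant sign bits. By the convention chosen
    here, 0 and -1 both return width - 1.
    """
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    magnitude = ~value if value < 0 else value
    return (width - 1) - magnitude.bit_length()


def norm_by_loop(value, width=16):
    """The same result computed by an explicit shift loop in plain code."""
    low, high = -(1 << (width - 2)), 1 << (width - 2)
    shifts = 0
    # Keep shifting while the value still fits in one bit fewer.
    while shifts < width - 1 and low <= value < high:
        value *= 2
        shifts += 1
    return shifts
```

For example, `norm(1)` is 14 because shifting 1 left by 14 places sets the bit just below the sign bit, and `norm(0x4000)` is 0 because that value is already normalized.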
Message Passing Interface (MPI)


A standard interface for communication on multiprocessor systems
Messages are passed between processes, which the user must specify
“Push” style communication – sender specifies data rate
Types of Communication
  Blocking: stall until send/receive buffer can be used
  Non-blocking: allows overlap of computation and communication
Simulator included
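The difference between blocking and non-blocking communication can be sketched without an MPI library. The example below is not MPI: the `Mailbox` class and its `send`, `isend`, and `recv` methods are hypothetical names that only mimic the behaviour described above, using a bounded `queue.Queue` as the message buffer and a helper thread as the in-flight transfer.

```python
import queue
import threading


class Mailbox:
    """Hypothetical one-way channel with a bounded buffer."""

    def __init__(self, capacity=1):
        self._slots = queue.Queue(maxsize=capacity)

    def send(self, message):
        """Blocking send: stall until the buffer can accept the message."""
        self._slots.put(message)

    def isend(self, message):
        """Non-blocking send: start the transfer and return a handle at once."""
        transfer = threading.Thread(target=self._slots.put, args=(message,))
        transfer.start()
        return transfer            # call .join() to wait for completion

    def recv(self):
        """Blocking receive: stall until a message is available."""
        return self._slots.get()


def demo():
    box = Mailbox(capacity=1)
    box.send("first")                  # fills the only buffer slot
    request = box.isend("second")      # returns although the buffer is full
    overlapped = sum(range(1000))      # computation overlaps the pending send
    received = [box.recv(), box.recv()]
    request.join()
    return overlapped, received
```

A second blocking `send` in place of `isend` would stall `demo` forever, since nothing has yet drained the buffer; the non-blocking call lets the computation proceed while the transfer waits.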
Ongoing Research
The programmer’s model is the Holy Grail of the MESCAL project
  Right abstraction for memory
  Incorporate bit level concurrency
  Compiler for Intel IXP1200 - test initial programmer’s model
with
Embedded Programmable
Components
Tim Cheng
University of California, Santa Barbara
Goals


Reuse of on-chip programmable components for test
Processor/DSP/FPGA cores for on-chip test generation, measurement, response analysis and even diagnosis
  Self-test a processor/DSP using its instruction set for high structural fault coverage
  Use the tested processor/DSP to test buses, interfaces and other components, including analog and mixed-signal components
  Extend for self-diagnosis
Test and diagnosis are applications of a highly programmable system!
Faults Motivation

At-speed testing of GHz ICs is increasingly difficult with external testers
  Growing gap between IC and tester performance
  Growing cost of high performance testers
  Increasing yield loss caused by inherent tester inaccuracy
Self-testing using instructions enables natural application of at-speed test of GHz processors and SoCs
Potential advantages over structural BIST (such as scan-based BIST) include: area, performance, design time, power consumption during test
Functional Self-Test vs. Structural
BIST



A good understanding of the capabilities and limitations of functional self-test could support further development of hybrid solutions combining the strengths of functional and structural self-test
Lesson from memory self-test: from functional, to structural, now back to functional self-test
Logic self-test?
Initial Projects on Processor Functional
Self-Test

Self-Testing of Embedded Processor Cores and SoC (UCSD)
  Delivering deterministic tests using instruction set
  Automatic synthesis of programs for:
    on-chip test generation (constraint-aware software LFSR)
    test pattern delivery
    test response analysis
Self-Testing of Processor Cores for Delay Faults (UCSB)
  Automatic synthesis of test programs for path delay faults
  Applying deterministic delay tests by execution of test program
  Tests generated by integrated process combining structural ATPG and instruction-level ATPG
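A software LFSR of the kind mentioned for on-chip test generation can be sketched in a few lines. This is an illustration, not the UCSD method: the 16-bit Fibonacci LFSR below uses the widely published maximal-length taps (16, 14, 13, 11), and "constraint-aware" is modelled simply as a caller-supplied predicate `is_allowed` that discards patterns violating a hypothetical instruction-level constraint.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def generate_patterns(seed, count, is_allowed=lambda pattern: True):
    """Return up to `count` pseudo-random patterns that satisfy `is_allowed`.

    The search stops after one full LFSR period, so fewer than `count`
    patterns are returned if the constraint rejects too many states.
    """
    if not 0 < seed < (1 << 16):
        raise ValueError("seed must be a non-zero 16-bit value")
    patterns = []
    state = seed
    for _ in range((1 << 16) - 1):     # one period of a maximal-length LFSR
        if len(patterns) == count:
            break
        state = lfsr16_step(state)
        if is_allowed(state):
            patterns.append(state)
    return patterns


def lfsr_period(seed=0xACE1):
    """Count the steps until the LFSR returns to its seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps
```

A call such as `generate_patterns(0xACE1, 8, lambda p: p % 2 == 0)` keeps only patterns whose least significant bit is clear, standing in for an operand constraint imposed by an instruction.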
Components
[Figure: an external tester connected over the processor bus to the CPU. Instruction memory holds the on-chip test generation program, the test pattern delivery program, and the test response analysis program. Data memory holds the test patterns, the test responses, and the response signature. The self-test signature is the result read back.]
Functional Self-Testing of Processor
Cores for Path Delay Faults
Inputs: instruction set architecture, µ-architecture, and netlist
Automatic Constraint Extraction
  Spatial and temporal constraints between registers and control signals
Path Classification
  Some structurally testable paths are not functionally testable by instructions
  Identifying functionally testable paths
Constrained Structural ATPG
  Vector generation for functionally testable paths
Test Program Synthesis
  Mapping test vectors to instruction sequences
Output: test program
Path Classification: DLX - A 32-bit RISC
Processor


Automatic identification of paths testable by instructions
Structurally testable but functionally untestable paths need not be tested.
[Table: path counts and testability for the DLX datapath and controller. No. of paths: ~430K, ~18K. Structurally testable: ~97%, ~51%. Functionally testable: ~40%, ~46%.]
Components in Highly Programmable
Systems

Reuse on-chip digital programmable components and A/D and/or D/A converters for test signal generation, on-chip measurement and response analysis for analog/mixed signal components
  To relieve the need for expensive mixed-signal testers
  To avoid noisy external measurement
  To provide maximum flexibility for customized/optimized self-test solutions for different types of analog components
Analog/Mixed-Signal Self-Test
Approaches

DSP-based analog self-test
  Targeting systems with both DAC and ADC
Pulse-Density-Modulation-based analog self-test
  Targeting systems without an ADC and/or a DAC
DSP-Based Self-Testing
[Figure: DSP/programmable components drive a D/A converter into the analog component under test; the response returns through an A/D converter, with synchronization between the converters.]
Test signal:
• digitized sinusoid
• digitized multi-tone
• pseudo random
Response analysis:
• FFT
• IEEE 1057 sinewave fitting
• cross-correlation
• auto-correlation
Pros
• more efficient
• single setup for multiple types of tests
Cons
• limited measurement resolution (improving)
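As a small illustration of the test-signal and response-analysis steps listed above, the sketch below generates a digitized sinusoid and estimates its amplitude from a single DFT bin. It is a simplified stand-in for the FFT-based analysis named on the slide, assuming coherent sampling (an integer number of cycles in the record); the function names and the 12-bit converter width in the usage note are assumptions made for the example.

```python
import cmath
import math


def digitized_sinusoid(amplitude, cycles, num_samples, bits):
    """Quantized sine wave; `amplitude` is a fraction of full scale."""
    full_scale = (1 << (bits - 1)) - 1
    return [round(amplitude * full_scale
                  * math.sin(2 * math.pi * cycles * n / num_samples))
            for n in range(num_samples)]


def tone_amplitude(samples, cycles):
    """Amplitude of the tone at bin `cycles`, from one DFT bin.

    Valid for coherent sampling with 0 < cycles < len(samples) / 2.
    """
    n = len(samples)
    bin_value = sum(x * cmath.exp(-2j * math.pi * cycles * k / n)
                    for k, x in enumerate(samples))
    return 2 * abs(bin_value) / n
```

For a half-scale tone on an assumed 12-bit converter, `tone_amplitude(digitized_sinusoid(0.5, 7, 256, 12), 7)` comes out close to 0.5 × 2047, with the small difference due to quantization.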
Pulse-Density-Modulation-Based Self-Test
Targeting designs without a DAC and/or an ADC
  Use simple yet highly tolerant D/A and A/D conversion techniques
  Use DSP techniques for test synthesis and response analysis
  Excellent flexibility
[Figure: test synthesis: software passes the test stimulus through a 1-bit ΔΣ modulator and stores the resulting bit stream (..0101...) in memory; on the SOC, a 1-bit DAC and low-pass filter drive the analog CUT. Response analysis: a 1-bit ΔΣ modulator digitizes the response (..0101...) and the DSP checks it against the spec for pass/fail. The ATE sits outside the SOC.]
PDM-Based Analog Self-Test: Current Status

A general self-test architecture for mixed-signal systems
  Use ΔΣ modulation principle for stimulus generation and signal
  acquisition
Characterization and calibration of 1-bit first-order ΔΣ
modulator for on-chip signal analysis
  For compensating the error caused by the imperfections
  associated with the ΔΣ modulator
A self-test scheme for testing on-chip ADC and DAC
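To make the modulation principle concrete, here is a minimal behavioral sketch of a 1-bit first-order delta-sigma modulator followed by a 1-bit DAC and a crude low-pass filter. It is an idealized software model under stated assumptions (input normalized to [-1, 1], moving-average filter, no circuit imperfections); the function names are hypothetical and it is not the modulator characterized in this work.

```python
import math

def delta_sigma_1bit(samples):
    """First-order 1-bit delta-sigma modulator; input in [-1, 1], output bits 0/1."""
    integrator = 0.0
    feedback = 0.0  # previous output level, fed back with negative sign
    bits = []
    for x in samples:
        integrator += x - feedback
        bit = 1 if integrator >= 0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits

def moving_average_dac(bits, window):
    """1-bit DAC (0 -> -1, 1 -> +1) followed by a moving-average low-pass filter."""
    levels = [1.0 if b else -1.0 for b in bits]
    out = []
    acc = 0.0
    for i, v in enumerate(levels):
        acc += v
        if i >= window:
            acc -= levels[i - window]
        out.append(acc / min(i + 1, window))
    return out

# Pulse-density-modulated sine stimulus and its low-pass reconstruction.
stimulus = [0.5 * math.sin(2 * math.pi * k / 256) for k in range(1024)]
bitstream = delta_sigma_1bit(stimulus)
reconstructed = moving_average_dac(bitstream, 32)
```

The density of ones in `bitstream` tracks the input level, which is why a simple 1-bit DAC plus low-pass filter is enough to recover the analog stimulus.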
Directions for the Next Three Years

Processor self-test and self-diagnosis
  Adding new “test instructions” to aid self-test and self-diagnosis
  Test program synthesis for response analysis and self-diagnosis
Analog/mixed signal self-test
  Hardware validation of proposed PDM-based schemes
  High-frequency applications
  Defect-oriented test synthesis and response analysis
Full-chip self-test using self-tested processors
  Testing buses, interfaces and other digital components
  Reconfiguration of bus arbiters and communication protocols for test
  delivery
The Research Playground
Application
Algorithm
Software
Implementation
What is the
Programmer’s
Model?
Architecture
Microarchitecture
Component Assembly
and Synthesis
Compilation
Verification and
Manufacture Test
Functional Verification for a
Family of Microarchitectures
Serdar Tasiran
University of California at Berkeley
Outline



Verification goal
State-of-the-art in processor verification
Our strategy


Rationale
Implementation



Current projects
Future extensions
Three year goals
Verification Goal
Develop comprehensive, focused functional verification support
for identified microarchitectural family
Ideally, the verification approach…
…must be adaptable: must not require new theory and tools for
  different configurations
  different environments for design
  different verification requirements
  (cache coherence, consistency with programmer’s model, …)
…must lend itself to incremental changes in design
…must degrade gracefully
Processor Verification: State-of-the-art
Heated research activity on verification of pipelines, superscalar processors,
out-of-order and speculative execution.

Datapath abstraction
  Reduce width
  Symbolic representations (e.g. multiway decision graphs)
Symbolic simulation
Theorem proving
  Verifying functional units (ALUs, FPUs, etc.)
  Compositional (assume-guarantee) reasoning
    Divide verification problem into pieces
    Can use a variety of methods for each piece
  Reduce problem to equivalence checking of formulas
    Propositional logic with uninterpreted functions and predicates
Processor Verification: State-of-the-art

Formal verification valuable when applicable, but
  each technique addresses only one aspect of the problem
  verification expertise required from designer
  methods not incremental or adaptable
  difficult to use in large design groups
  capacity much short of current processor complexity
Validation relies heavily on simulation
  Even more likely to be the case for complex, highly programmable
  systems
Our Verification Strategy

Validation of complex, highly programmable systems will require semi-formal methods
The natural way to verify these systems is to simulate and debug
Practical goal: Make “optimal” use of simulation resources
  IDEAL: Comprehensive validation with minimal redundant effort
OUR APPROACH:
Use coverage analysis to guide verification
  Identify good verification coverage metrics
  Develop corresponding vector generation methodology
Validation using Simulation: Current Picture
[Figure: simulation driver, simulation engine, monitors]
SHORTCOMINGS:
Vector generation
  Manual: A lot of user effort, ad hoc
  Random: Little control over what gets exercised
Quantifying comprehensiveness
  Low bug detection rate is the main criterion
  Likely interpretation: Not generating quality vectors any more.
[Figure: bugs per week plotted over weeks, with phases labeled Functional testing, Purgatory and Tapeout. Courtesy Prof. Dill]
Verification Using Intelligent Simulation
[Figure: simulation driver, simulation engine and monitors, extended with symbolic simulation, coverage analysis, diagnosis of non-verified portions, and vector generation]
Verification Using Intelligent Simulation – Rationale
[Figure: simulation driver, simulation engine and monitors, extended with symbolic simulation, coverage analysis, diagnosis of non-verified portions, and vector generation]
Need formal means to:
  Gauge status and progress of verification
  Automate generation of quality vectors
Coverage Analysis – Why?

What aspects of design haven’t been exercised?
  Guides vector generation
How comprehensive is the verification so far?
  A heuristic stopping criterion
Coordinate and compare
  Separate sets of simulation runs
  Model checking, symbolic simulation, …
Helps allocate verification resources
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of unverified portions, vector generation]
Observability and Coverage Analysis
Portion of design covered only when
  it is exercised (controllability)
  a discrepancy originating there causes
  discrepancy in a monitored variable (observability)
We initially focus on tag coverage [Devadas, Keutzer, Ghosh ’96]
  Code coverage metrics + observability requirement.
  All other verification metrics overlook observability
Tag coverage:
  Bugs modeled as errors in assignments.
  A buggy assignment may be stimulated, but still missed
    Wrong value generated speculatively, but never used.
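The difference between stimulating an assignment and observing its effect can be shown with a toy model. This is a simplified sketch, not the tag calculus of the cited work: a tag is approximated by perturbing the value of one assignment and checking whether a monitored output changes. The `design` function and its signals are hypothetical.

```python
def design(a, b, c, sel, inject_error=False):
    """Toy design: `t` is always computed, but only used when `sel` is 1."""
    t = a + b                # assignment carrying the tag
    if inject_error:
        t += 1               # model a bug as an error in the assigned value
    out = t if sel else c    # monitored variable
    return out

def tag_covered(vector):
    """The tag is covered only if the injected error reaches the monitored output."""
    return design(**vector) != design(**vector, inject_error=True)

stimulated_only = tag_covered({"a": 1, "b": 2, "c": 7, "sel": 0})
stimulated_and_observed = tag_covered({"a": 1, "b": 2, "c": 7, "sel": 1})
```

Both vectors execute the tagged assignment, so plain code coverage counts it as covered in both cases; only the second vector propagates the discrepancy to the monitored variable.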
Biased-Random Vector Generation Rationale

Vector generation methods trade-off between
  Time to find “good” vectors
  Time to simulate vectors
[Figure: portion of computation time, from 0% to 100%, split between Find and Simulate]
Typically > 50% of time spent on biased random simulation.
  Improved random vectors → Improved overall validation quality
Less intelligence for selecting next step but many more vectors
  Can explore deeper into state space
  Deterministic methods bad at “deep errors”
    Example: 8-bit counter must expire for bug to be exercised
Contrast with Alternatives

Elaborate vector generation methods justified if
  they yield better verification quality for given computation time, or
  if they exercise difficult corner cases
BUT: Hard to judge “quality” of test vectors a-priori.
Heavyweight methods have limited application
  Can’t handle large sequential depth.
  Too costly to use all the time
We spend most effort on initial determination of weights
  Can run many simulation/emulation cycles fast
Our target: Get all but the most difficult bugs out.
[Figure: portion of computation time, from 0% to 100%, split between Find and Simulate]
Our Approach to Biased Random Vector Generation

Primary inputs at each clock cycle selected according to a probability
distribution
  Distributions are functions of circuit state
Distributions ( “weights” ) determined prior to simulation
  Faster simulation
Algorithm determines weights chosen based on
  Set of tags targeted
  A structural netlist describing the circuit
Goal of weight determination algorithm:
  Maximize expected number of tags covered in a given # of
  simulation cycles
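A small sketch of state-dependent input weights fixed before simulation, using the 8-bit counter example from the rationale slide. The two-state toy design, the state names and the weight values are assumptions chosen for illustration; the slides do not specify how weights are computed or stored.

```python
import random

def cycles_to_exercise_bug(weights, rng, width=8, max_cycles=1_000_000):
    """Simulate a toy design with one primary input `go`.

    IDLE: go=1 starts counting. COUNT: go=1 increments the counter.
    The bug is exercised when the counter reaches its maximum value.
    `weights[state]` is P(go = 1) in that state, fixed before simulation.
    """
    state, count = "IDLE", 0
    expire = (1 << width) - 1
    for cycle in range(1, max_cycles + 1):
        go = rng.random() < weights[state]
        if state == "IDLE":
            if go:
                state = "COUNT"
        elif go:
            count += 1
            if count == expire:
                return cycle
    return None  # bug not exercised within the cycle budget

rng = random.Random(0)
uniform = cycles_to_exercise_bug({"IDLE": 0.5, "COUNT": 0.5}, rng)
biased = cycles_to_exercise_bug({"IDLE": 0.9, "COUNT": 0.95}, rng)
```

Because the weights are a lookup by state, choosing the next vector costs one random draw per input, which keeps the simulation loop fast.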
Current Projects
Biased-Random Vector Generation for Tag Coverage
(Chinnery, Jin, Keutzer, Tasiran, Weber, UCB)
Select primary input distributions based on
  Circuit structure
  Current state
  Tags to be covered
Heuristic based on circuit structure and set of tags
IDEA: At each gate, bias inputs towards pins with more tags
in their transitive fan-in.
  Estimate and optimize detectability of tags
  Propagate input probability distributions across circuit
  Estimate steady-state distributions of latches
  Estimate detectability of tags along “most likely” paths
  Modify input weights to maximize expected number of detected tags
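The fan-in idea can be sketched as a graph traversal. The example below assumes a netlist given as a dictionary from each gate to its input nets and a dictionary from nets to the tags placed on them; both data structures, the function names and the proportional weighting are hypothetical, and how pin biases translate into primary-input weights is not specified in the slides.

```python
def tags_in_fanin(netlist, tags, node, cache=None):
    """Set of tags on `node` or anywhere in its transitive fan-in."""
    if cache is None:
        cache = {}
    if node not in cache:
        found = set(tags.get(node, ()))
        for driver in netlist.get(node, ()):
            found |= tags_in_fanin(netlist, tags, driver, cache)
        cache[node] = found
    return cache[node]

def pin_biases(netlist, tags, gate):
    """Relative weight of each input pin of `gate`, proportional to fan-in tag count."""
    counts = {pin: len(tags_in_fanin(netlist, tags, pin)) for pin in netlist[gate]}
    total = sum(counts.values())
    if total == 0:
        return {pin: 1 / len(counts) for pin in counts}
    return {pin: n / total for pin, n in counts.items()}

# Toy netlist: g3 = f(g1, g2), g1 = f(a, b), g2 = f(c); a, b, c are primary inputs.
netlist = {"g3": ["g1", "g2"], "g1": ["a", "b"], "g2": ["c"]}
tags = {"a": ["t1"], "b": ["t2"], "g1": ["t3"], "c": ["t4"]}
biases = pin_biases(netlist, tags, "g3")
```

Using sets rather than counters avoids counting a tag twice when two paths reconverge on the same net.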
Current Projects
Vector Generation for Tag Coverage
of Processor Datapaths
(Keutzer, Meyerowitz, Tasiran, UCB)
Identify commonly encountered
structures in processor datapaths
  Determine input distributions that
  increase tag coverage of
  these structures
(In collaboration with configurable
processor IP provider)
Initial approach:
  Model control by hand-written
  abstract machine
[Figure: datapath block and control state machine with states sinit, s2, s3, s4, s5, s6]
Directions for the Next Three Years

Now: Biased-random vector generation
  Initial focus: Configurable processor
  control and datapaths
  Topology-based heuristics with
  tag coverage goal
Later:
  More sophisticated methods for bias selection
  Methods that address control and datapath together
Overall: “Closed feedback loop” that integrates a variety of
  Coverage metrics, analysis and feedback methods
  Coverage guided, automatic vector generation methods
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of unverified portions, vector generation]
The Research Playground
Application
Algorithm
Software
Implementation
What is the
Programmer’s
Model?
Architecture
Microarchitecture
Component Assembly
and Synthesis
Compilation and
Software Environment
Verification and
Manufacture Test
Evaluation Strategy

Quantify quality of results of final implementation according to:
  Speed
  Power
  Area/cost
  Design time
  Design cost
Compare to:
  Other purely programmable solutions
    FPGA, microprocessor, specialized processor
  ASIC solutions
Ten Year Vision Elaborated




Significant percentage of embedded system applications fielded
using only fully programmable components.
Supporting efficient but fully programmable solutions in areas of
emerging standards.
Design-time brought within acceptable limits to achieve time-to-market goals.
Enabling new applications:
 Supporting greater complexity.
 Reducing overall design cost.
What Will Get Us There?…

Flexible architectural templates covering a large design
space.


Multiple levels of support for concurrency
Automated software development environment.




Retargetable compilers/assemblers/debuggers
Architectural simulators
Run-time environments – schedulers/synchronizers
Analysis tools – design visualization, performance monitoring,
power analysis…
First Year Progress Against Strategies




Identified and assembled key application – VPN router.
Identified and assembled compiler infrastructure:
Trimaran 2.0.
Initiated multiple compiler/run-time environment
projects.
(Mostly) identified initial architectural family.


Simulator for one processing element of the architectural family
assembled.
Test strategy for one processing element determined.
Further Progress Against Strategies
In two years:
 Automatic retargeting onto a family of architectures and
microarchitectures from a hardware-description language.



Automatically generated performance estimator, simulator.
Automatic generation of assembler, compiler, run-time system.
Automatically generated hardware for special purpose units?
In five years:

Much like the above, but across a much broader range of
architectures/microarchitectures.
Real breakthrough will be in the development of a
natural programmer’s model
Field Programmable Function Array:
Chameleon
Reconfigurable
Computing(FPFA)
Energy-efficient
wireless
communication
System
architecture for
mobile multimedia
computers
Security
8
Montium Processing Tile
Montium Tile Processor
U-P vs XPP
A SDR/Multimedia Solution
PACT’s SDR XPP
Current Multimedia Processors





Digital Signal Processor => Multimedia Processor
RISC instruction set and pipelining to gain higher clock frequency
Instruction level parallelism (ILP)
Growing concern with data movement and I/O interfaces
More attention paid to low-power design
Current Multimedia Processors
Name                                TMS320C82     Mpact 2                    Trimedia TM1   MSP
Architecture                        Multiproc.    VLIW                       VLIW           Multiproc.
CMOS Technology                     0.5 µm        0.35 µm                    0.35 µm        0.35 µm
Vcc (Volts)                         3.3           3.3                        3.3            3.3
Power (Watts)                       3 (@50 MHz)   4.45                       4              4
Clock frequency (MHz)               50, 60        125                        100            100
Performance (BOPS, 8-bit integer)   1.5           6                          4              6.4
Manufacturer                        TI            Toshiba & Chromatic Res.   Philips        Samsung
TMS320C6x VelociTI









Highest Performance (1 GFLOPS) Floating point DSP
6-ns Instruction Cycle Time
167-MHz Clock Rate
Eight 32-Bit Instructions/Cycle
Instruction packing
Complex programming model
Poor energy and memory efficiency
600Mhz, $110
Good tools and third party support
StarCore SC140, Infineon
6-issue 16-bit fixed-point architecture
Up to four 16-bit MACs per cycle
5-stage pipeline with single-cycle latency
Strong Performance on most metrics
Multi-vendor Architecture :Motorola, Agere and now Infineon
Limited Product Offerings:
poor cost-efficiency, 300Mhz, $132
Analog Devices TigerSHARC







4-issue fixed- and floating-point hierarchical SIMD
architecture
Up to eight 16-bit fixed-point MACs per cycle
Special CDMA-oriented instructions
High memory bandwidth (8Gb/s)
250Mhz, $175
2-level SIMD complicates programming
Good tools
LSI Logic ZSP400
A 4-Way Superscalar DSP Core

Up to 2 16-bit MACs per cycle

Five-stage pipeline with single-cycle
latencies

Available as core, ASIC library component ,ASSP

200 Mhz, $36
 Cost, energy and memory efficient

Superscalar architecture simplifies, complicates programming

Unproven tools and third party support

 Target Applications




Video - DVD, MPEG 1 & 2 decoding
Audio - Dolby AC-3, 3D Audio, MPEG Decode,
Wavetable Synthesis
Graphics - 2D & 3D acceleration
Communication




Vocoder
ADSL, Fax/MODEM : V.34, 56k
Echo canceller
Desktop Videoconferencing
Advanced DSP
130 nm Copper Technology
Greg Delagi, TI
Reconfigurable Computing Research
Group








DARPA’s Adaptive Computing Systems Project
Virginia Tech
University of California at Berkeley
Brigham Young University
Chameleon Systems Inc.
Morphic Inc.
Quicksilver Technology Inc.
Sirius Inc.
Quicksilver's ACM
SDR-processing requirements
or Mobile Communications (GSM)
Modem w/ basic equalizer
2 MFLOPS for CDMA sector
2.5 MFLOPS for a wideband CDMA
4 MFLOPS for a G4
Requires high-performance devices such as
PowerPC G4
PowerPC with Altivec CPUs
TMS320-C6x
SHARC/Tiger-SHARC DSPs
The need for a software configurable
platform


Capable of handling standards such as AM, FM, GSM, UMTS,
digital broadcasting standards (DAB, Sirius, XM-Sat Radio), analog
and digital television, and other data links.
A fully software reconfigurable multi-channel broadband
sampling receiver for standards in the 100 MHz band
Versatile Reconfigurable Block Array
[Figure: six versatile reconfigurable blocks VRB1 to VRB6, paired with F1 to F6, together with a power management block, a master MPU, and high-speed and low-speed IN/OUT paths]
Advantages
  No waiting latency; requires little silicon area.
  Data transfer compatible with IP blocks through a simple wrapper
Disadvantages
  Timing accuracy degrades in large systems
  Testing is difficult for complex systems
  Arbiter delay grows as the number of masters increases
Comparisons
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$

Name            Type              NPE    Nc     F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
MorphoSys       Coarse Grain RA   128    16     100       8
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
TMS320C62       DSP VLIW          8      8      300       1

Only 1 cycle to (re)configure the DSP
Few cycles to (re)configure coarse grain RA (8)
Many cycles to (re)configure fine grain RA
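The R column can be reproduced from the NPE and Nc columns. A small sketch, assuming the two frequencies in the formula are equal so that R reduces to NPE / Nc (an assumption that matches every row of the table); `reconfig_ratio` is a hypothetical helper name.

```python
def reconfig_ratio(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * F_e) / (N_c * F_c); frequencies default to equal (assumption)."""
    return (n_pe * f_e) / (n_c * f_c)

# (NPE, Nc) pairs copied from the comparison table.
architectures = {
    "ARDOISE": (2304, 0.14),
    "MorphoSys": (128, 16),
    "Systolic Ring": (24, 4),
    "DART": (24, 4),
    "TMS320C62": (8, 8),
}

ratios = {name: round(reconfig_ratio(n_pe, n_c))
          for name, (n_pe, n_c) in architectures.items()}
```

A large R means many cycles are needed to reconfigure the whole array, which is the case for the fine-grain architecture; R = 1 corresponds to the DSP that can change its operation every cycle.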
Multi-DSP Tree Structure
A. K. Salkintzis, N. Hong and P. T. Mathiopoulos
Multi-DSP Network Structure
Data traffic is reduced with each connection
Multiplexing &
Burst Construction
Modulation
Encryption
Interleaving
Channel
Coding
CRC
insertion
Data
Processing
Sequencer
Spreading
Rate matching
Channelization
Radio
Resource
Equalization
Segmentation
Platform Classification

Application platform:
  3G wireless platform: Infineon's M-gold
  Bluetooth platform: Parthus
  Wireless platform: ARM's PrimeXsys
  Multimedia platform: Nexperia, TI's OMAP
Processor-centric platform:
  Improv Systems, ARC, Tensilica, Triscend
Communication-centric platform:
  Sonics, Palmchip
Recent Computing Machines


ACM (Adaptive Computing Machine) – Quicksilver:
www.qstech.com (image appl.)
RCF (Reconfigurable Compute Fabric) – Motorola (SDR
base-station), array of DSP cores connected through
high-bandwidth interconnect and high-speed local
memory, controlled by a RISC.
What is Software Radio
A transceiver in which all aspects of its operation are determined
using versatile general purpose hardware whose configuration is
under software control
Flexible all-purpose radios that can implement new and different
standards or protocols through reprogramming.
Same hardware for all air interfaces and modulation schemes
Key Technological Constraints




High speed wide band ADCs.
High speed DSPs.
Real Time Operating Systems (isochronous software)
Power Consumption
Applications




User Applications and Base Station Applications
Evolve as a universal terminal
Spectrum management: Reconfigurability is a big
advantage
Application updates, service enhancements and
personalization
Research and Commercialization


DARPA’s Adaptive computing system project
Virginia Tech – algorithms and architecture; multi-user receiver based
on reconfigurable computing; generic soft radio architecture for
reconfigurable hardware
UC Berkeley – Pleiades, ultra low power, high performance multimedia
computing; high power efficiency by providing programmability
Sirius Inc – (CDMAx) Software Reconfigurable Code Division Multiple Access
Research and Commercialization

Brigham Young University – Development of JHDL to facilitate
hardware synthesis in reconfigurable processors
Chameleon Systems – Reconfigurable Platform Architecture for
wireless base station
MorphIC Inc – Programmable hardware reconfigurable code using DRL
Quicksilver Tech. Inc – Universal Wireless `Ngine (WunChip)
baseband algorithms
Programmable OFDM-CDMA Transceiver



CDMA suffers from Multiple access interference and ISI.
OFDM reduces interference and helps better spectrum
utilization and attainment of satisfactory BER.
It is proposed that this might be implemented by using
SDR.
SDR Architecture
[Figure: SDR block diagram (Hitachi Kokusai Electric Inc.). RF unit: receive/transmit paths with Rx SYN, Tx SYN, LNA, PA, RX and TX blocks. Signal processing/control unit: data converter, quadrature MODEM, baseband MODEM, control, EX. interface and input/output to an HMI terminal, connected by a C-PCI bus.]
Signal processing/control unit


The signal processing/control unit consists of the following
modules
  Data converter
  Quadrature Modem
  Baseband Modem
  Interface/Control
The modules are connected to each other by a PCI bus, and every module
provides a CPU in addition to the FPGA and DSP devices.
Quadrature modem module

The Quadrature modem uses FPGAs to process
  Quadrature modulation
  Quadrature detection
  Sampling rate conversion
  Filtering
to generate baseband sampling rate
[Figure: SDR block diagram, as on the SDR Architecture slide]
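The receive-side steps listed for the quadrature modem (quadrature detection, filtering, sampling rate conversion) can be sketched in software. The example uses the 70 MHz Rx IF and 40 MHz Rx IF sampling frequency from the prototype specification; the boxcar filter, the decimation factor of 4 and the function name are illustrative assumptions, and the prototype performs these steps in FPGA logic rather than in Python.

```python
import cmath
import math

FS = 40e6    # Rx IF sampling frequency from the prototype specification
F_IF = 70e6  # Rx IF frequency from the prototype specification

def quadrature_detect(samples, f_if=F_IF, fs=FS, decim=4):
    """Mix real IF samples down to complex baseband, then average and decimate."""
    mixed = [x * cmath.exp(-2j * math.pi * f_if * n / fs)
             for n, x in enumerate(samples)]
    baseband = []
    for start in range(0, len(mixed) - decim + 1, decim):
        block = mixed[start:start + decim]
        # Boxcar average: a crude low-pass filter that removes the image term.
        baseband.append(sum(block) / decim)
    return baseband

# Unmodulated carrier at the IF with a fixed phase offset.
phase = 0.3
if_samples = [math.cos(2 * math.pi * F_IF * n / FS + phase) for n in range(64)]
iq = quadrature_detect(if_samples)
```

Each output sample is a complex I/Q value at one quarter of the input rate; for the unmodulated carrier above, every value carries half the carrier amplitude and the carrier phase.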
Baseband modem module

The Baseband modem processes
  Multi-channel modulation
  Multi-channel demodulation
  Using four floating-point DSP devices
An individual DSP is assigned to each
channel. Therefore, even while one channel
is processing, a program can be downloaded
to another channel.
[Figure: SDR block diagram, as on the SDR Architecture slide]
Specification of Prototype
RF range: 2~500 MHz
Waveform: SSB, AM, FM, BPSK, QPSK, 8PSK, 16QAM
Number of channels: Four full-duplex
Radio relay: Repeat/Bridge
Frequency accuracy: <0.1 ppm
Rx IF frequency: 70 MHz
Tx IF frequency: 25 MHz
Dynamic range: 14 bits
Rx IF sampling frequency: 40 MHz
Tx IF sampling frequency: 100 MHz
Specification of Prototype
Signal processing: FPGA : Quadrature MODEM; DSP : Baseband MODEM
FPGA: XCV2000E x 3
DSP: TMS320C6701 x 4
CPU: Control module : Celeron Peripheral module
System bus: cPCI
Operating system: Linux
HMI: Operates from web browser
Interface: Audio I/O, Serial I/O, Ethernet (100BASE-TX)