Verification and debugging of hardware
designs utilizing C-based high- level
design descriptions
Masahiro Fujita
VLSI Design and Education Center (VDEC)
University of Tokyo
System level design flow
• Typical design flow used in industry
• Mostly C language based design descriptions for
embedded system and SoC designs
conception
Ideas
Design as a
Meeting Point
Components
Component
Libraries
Performance Analysis
HW/SW partitioning
Functional
decomposition
Structural
Decomposition
Function-Separated
Description
C/TLM
High Level
Synthesis
Architectural
Exploration
SW
High Level / Transactional
HW2
HW3
HW1
RTL
C/HL
RTL
Synthesis
Automatic / Manual
RTL to RTL
Optimization
(Electric) System level design tools: Elegant
• Joint development by JAXA, Toshiba, NEC, Fujitsu, UC
Irvine, and U. of Tokyo (formal verification)
• Help assigning functions onto components
Static/model checking, Equivalence checking
System level design
A
C
Device driver
C
Protocol
B
Protocol
A
Hardware
implementation
design (RTL)
HW?
HW?
D?
Interface
design
Communication
network design
Interface circuits
HW/SW
partitioning
Software
implementation
design
Processor model
SystemC/SpecC
Algorithm
design
Traditional design starts here…
JAXA:A ?Japanese space exploration agency
B?
Elegant tool has Cactually
been
D used Bfor space
D satellites
D
C?
SW?
3
SW?
SW
HW
SW
HW
A
B
Outline
• Overviews of our verification research activities
regarding to C-based high-level design
descriptions
– Based on Extended System Dependence Graph
(ExSDG)
• Equivalence checking as well as static/model
checking with ExSDG
• Post-silicon verification method through mapping
chip traces with ExSDG
• Verification/debugging methods for large
arithmetic circuits
• Automatic generation of on-chip bus protocol
transducers
System Dependence Graph
• Sufficient representation since verification methods for
net-list are applicable to SDG
• Problems
– Existing SDGs are too complicated (# nodes/edges are huge)
– Not directly corresponding to abstract syntax trees
main
int i = 1;
main(){
int a = 0;
int b = 0;
a = a + i;
if(a == 1)
b = a++;
}
int a
int i
Dependency
Analysis
Control dependence
Data dependence
Declaration dependence
int b
i = 1;
b = 0;
a = 0;
a = a + i;
if(a==1)
a++
b = a++;
System Dependence Graph (SDG)
Extended Syntaxes in ExSDG from C
Module
Sub
Module 1
port
Sub
Module 2
Hierarchical Structure
bit a[3:0];
a
bit b[3];
a
b
c
Concurrency
b
a[2:0]@b[1]
Bit Vector
buffered bit[1] a;
a = 1;
waitfor(1);
waitfor(1);
a = 0;
waitfor(1);
par{ a.main();
b.main();
c.main();}
・・・
wait(a);
・・・
・・・
・・・
notify(b);
・・・
・・・
・・・
notify(a);
・・・
wait(b);
・・・
・・・
Synchronization
clk
a
Timed Behavior
Translation Flow to ExSDG
Untimed Behavior Level:
・ No notion of time
・ For software design or
behavior specification
SystemC
Design
System Level
/ Behavior Level
SpecC
Design
SystemVerilog
Design
RTL
Verilog
Design
VHDL
Design
AST
(UBL)
AST
(TBL)
AST
(RTL)
Complex or redundant
syntaxes are removed
ex) switch statement
Simplification
ExSDG
Timed Behavior Level:
・ Including timing specification
・ For design with timing estimation
Register Transfer Level:
・ Cycle accurate
・ For hardware design
Three Design Stages
module TEST(
int in1, int out1,
event e) {
Proc p1(); Proc p2();
void main(){
par{
p1.main();
p2.main();
}
wait(e);
}
};
module TEST(
int in1, int out1
event e){
Proc p1(); Proc p2();
void main(){
par{
p1.main(); p2.main();
}
waitfor(1);
wait(e);
}
};
Untimed Behavior Level:
・ Concurrency and
Synchronization
・ Times expressions and
buffered/wire variables
are prohibited
Timed Behavior Level:
・ Untimed Behavior Level
+ Timed expressions
・ buffered/wire variables
are prohibited
module TEST(
wire int in1,
wire int out1){
buffered int a;
void init(){
out1 = a;
}
void run_one_cycle(){
a = in1;
}
main(){}
};
Register Transfer Level:
・ par and wait/notify are
prohibited
・ init(): Wire connections
are declared
・ run_one_cycle():
Executed per clock
Edges in ExSDG
•
•
•
•
•
•
Control Flow Edge
Data Dependence Edge
Control Dependence Edge
Declaration Dependence Edge
Parameter In/Out
Summary Edge (Interprocedural
Dependence Edge)
• Parallel Edge
• Communication Edge
• Port Reference Edge
Concurrent and Synchronization
Dependence
int x, y;
event e1,e2;
void main(void){
y = 0;
x = 5;
par{
func1();
func2();
}
}
par
func1();
void func1(){
y = x;
notify(e1);
wait(e2);
x = y;
}
void func2(){
wait(e1);
y++;
y = y * y;
notify(e2);
}
Process 2
Process 1
notify(e1);
wait(e1);
wait(e2);
notify(e2);
func2();
end
Parallel Edge
Communcation Edge
Hierarchical Dependence
module M1(int in,
int out,
event e1){
void main(void){
wait(e1);
out = in * in;
}
};
module M3(int inout){
int w1;
event e1;
M1 m1(w1, inout, e1);
M2 m2(inout, w1, e1);
void main(void){
par{
m1.main();
m2.main();
}
}
};
module M2(int in
int out
event e1){
void main(void){
out = in + in;
notify(e1);
}
};
int M::in;
int M1::out;
event M1::e1;
int M3::w1;
event M3::e1;
int M3::inout;
int M2::in;
int M2::out;
event M2::e1;
Port Reference Edge
Experimental Results
• Compared with a SpecC SDG generation method using
CodeSurfer
• Performed on a PC with Xeon 3.2GB and 2GB memory
SDG Generation Time
Example
NoL
CodeSurfer
ExSDG
IDCT
420
8.7s
5.2s
Elevator
3055
9.1s
222.2s
MPEG4
5657
30.4s
902.9s
Num. of Nodes and Edges in IDCT Example
# of nodes
ExSDG
453
SDG generated by CodeSurfer 6380
# of edges
2061
48073
Outline
• Overviews of our verification research activities
regarding to C-based high-level design
descriptions
– Based on Extended System Dependence Graph
(ExSDG)
• Equivalence checking as well as static/model
checking with ExSDG
• Post-silicon verification method through mapping
chip traces with ExSDG
• Verification/debugging methods for large
arithmetic circuits
• Automatic generation of on-chip bus protocol
transducers
From the viewpoint of verification
•
•
•
•
Keep entire descriptions as correct as possible
Static/Model checking each description
Equivalence checking between two descriptions
Besides simulation, formal methods should be applied
Design
optimization
Algorithmic design
description
Many steps of
manual refinement
Static/
Model
checking
Design
optimization
High level synthesizable
description
High level
synthesis
Design
optimization
Many steps of
manual refinement
Register Transfer Level
description
Sequential
Equivalence
Checking
Basic Procedure
• Symbolic simulation
– Generates a set of equations from designs
– Every variable/operation is an uninterpreted symbol
– Every expression is a formula of symbols
• Equivalence Class (EqvClass)
– A class containing all equivalent
expressions/variables
– Generated during symbolic simulation based on
• Assignment statement
• Substitution with equivalent expressions/variables
• Solving with SMT solvers
– If generated equations have to be solved to prove the
desired equivalence, we use SMT solvers to interpret
arithmetics
– Public ones and our own solves
Example
a = v1;
b = v2;
add1 = a + b;
add2 = v1 + v2;
Description 1
Description 2
E1
E2
E3
E4
(a, v1)
(b, v2)
(add1, a+b)
(add2, v1+v2)
4 EqvClasses are generated
from 4 assignments
E1 (a, v1)
E2 (b, v2)
E3’ (add1, a+b,
add2, v1+v2)
E3 and E4 can be merged by
substituting a with v1, b with v2
Representing Symbolic Expressions:
Maximally Shared Graph
VC:
x0=a+b;
if x0<0 then
x1 = -x0;
y1 = y0;
else
x1 = x0;
y1 = y0+x0;
assert(0<=y1);
<=
x1
y1
ITE
ITE
cond
cond
false
true
false
–
>
0
true
x0
a
+
+
y0
b
Maximally Shared Graph
• Linear-sized representation
VC:
– Mathematically equivalent to
standard logical representation
• Advantages
x1
– Structure explicit
[flow of data in the graph
corresponds to flow of data in
the program]
– Simple slicing
• No structural redundancies
• Not functionally
canonical
– Practical trade-off
<=
y1
ITE
ITE
cond
cond
false
true
false
–
>
0
true
x0
a
+
+
y0
b
Problem and Our Approach
• Symbolic simulation cannot be applied to a
whole large designs
– Because of the path explosion problem
– Approach: Localizing the areas that are
symbolically simulated utilizing differences
: difference
return a;
return a;
return a;
return a;
Equivalence Checking Flow
Seq.
desc1
Seq.
desc2
Extension of the Area
and decision of its
inputs/outputs
Identification of diff.
Is there any
diff. left?
No
“Equivalent”
Decision of initial verification
area and its inputs/outputs
Verification
Verification
Not eqv
Yes
Eqv
Eqv
Not eqv
Reach to primary
inputs/outputs?
Yes
“Not equivalent”
No
Verification Area and its Input/Output
• Verification area: A set of statements
• Input variables
– Used in the area, and assigned outside
• Output variables
– Assigned in the area, and used outside
• Verification problem for the area
– Are all pairs of output variables with the same
name are equivalent?
– Using proved equivalences of the input
variables
Example
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + in3;
out0 = a1 * c0;
• Initial verification
Diff
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + 2 * in3;
out0 = a1 * c0;
(※) in1, in2, in3 are the
primary inputs of both
the descriptions
– Input … b0, in3
– Output … a1
– Result … Equivalence cannot be proved
Forward extension from a1
Example
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + in3;
out0 = a1 * c0;
• 2nd verification
Diff
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + 2 * in3;
out0 = a1 * c0;
(※) in1, in2, in3 are the
primary inputs of both
the descriptions
– Input … b0, c0, in3
– Output … out0
– Result … Equivalence cannot be proved
Backward extensions from all inputs
(forward extension cannot be applied any more)
Example
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + in3;
out0 = a1 * c0;
• 3rd verification
Diff
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + 2 * in3;
out0 = a1 * c0;
(※) in1, in2, in3 are the
primary inputs of both
the descriptions
– Input … a0, in2, c0 (= 0), in3
– Output … out0
– Result … Equivalent (due to c0 = 0)
Though the different part are not equivalent,
the primary output is equivalent
Implementation
• FLEC: An Framework for Verification, Debugging Support,
and Static Checking
– Several engines developed by ourselves
• Symbolic simulator, difference extraction, input pattern
generation, static deadlock checker, slicing, …
– Dependence graph-based internal representation of
system-level designs
• A number of APIs are provided to help development of
engines
• ExSDG (Extended System Dependence Graph)
– Designs are represented in form of ExSDG in FLEC
– Frontend from SpecC to ExSDG is already developed
FLEC Structure
SpecC
Design1
(.sc)
SpecC
Design2
(.sc)
Equivalence
checked
ExSDG Generation
ExSDG
(.fls)
ExSDG
(.fls)
Eqv. Spec
Eqv. Classes
Control
Rule-based
Engine
Symbolic
Simulator
Verification procedure
(Applied Method & Order)
Control
Diff Extraction
Sequentializing
Parallel Behaviors
SMT solvers
Result
Sequential equivalence checking
• Definition of equivalence
– Intension of designers, management of differences
• For efficient checking:
– Identification of matching states
Bounded equivalence checking between matching states
– Identification of equivalent internal points
• Use of SMT solvers
reset
spec
spec
align
falsified
inputs
clocks
outputs
1
Verify
counterexample
proven
impl
align
bounded or full
proof
reset
impl
Transactions : State View
SL
transaction
SLM
Refinement mapping
RTL transaction
RTL
• Encapsulates one or more units of computation for the
design being verified
• Self-contained since it brings the machine back to a
synchronizing state
Transaction : Memory
• Design 1 transaction :
a single memory
read/write occuring in a
single cycle
• Design 2 transaction:
single memory
read/write (potentially)
happening over
multiple cycles
RD
WR
ADDR
OUT
ADDR
DATA
Design 1
Mem
DATA
Cache
RD
WR
Mem
Design 2
Cache ctl
OUT
Transaction-Based Equivalence Checking
• The states in RTL that correspond to states in systemlevel model (SLM), are referred to as synchronizing
states
• Based on this, definitions of equivalence are generated
manually
• The total equivalence is based on inductions on
synchronizing states
Complete
SLM
Verification
Refinement mapping
RTL
or
Transient states
Sequential
counterexample
Equivalence Specification
• Equivalence is specified by
– (Port, Throughput, Latency, Condition)
behavior Adder1(in int in1,
in int in2, in int in3,
out int out1) {
void main() {
out1 = in1 + in2 + in3;
}
};
behavior Adder2(in int in1,
in int in2, in int in3,
out int out1) {
void main() {
int tmp;
while(1) {
tmp = in1 + in2;
waitfor(5);
tmp = tmp + in3;
waitfor(5);
out1 = tmp; }
}
};
behavior Adder3(in int in1,
out int out1) {
void main() {
int tmp;
while(1) {
tmp = in1; waitfor(1);
tmp = tmp + in1; waitfor(1);
tmp = tmp + in1; waitfor(1);
out1 = tmp; waitfor(1);
}
}
};
(in1, 1, 0, TRUE)
(in2, 1, 0, TRUE)
(in3, 1, 0, TRUE)
(out1, 1, 0, TRUE)
(in1, 10, 0, TRUE)
(in2, 10, 0, TRUE)
(in3, 10, 5, TRUE)
(out1, 10, 10, TRUE)
(in1, 4, 0, TRUE)
(in1, 4, 1, TRUE)
(in1, 4, 2, TRUE)
(out1, 4, 3, TRUE)
Sequentialization
• Concurrent behaviors are sequentialized
• If st1 and st2 running concurrently are “write-write” or
“read-write” relation, check the following properties:
– P1: always T(st1) > T(st2)
– P2: always T(st1) < T(st2)
– T(s) … execution time of a statement s
• The checks are based on ILP
– Is there any assignment satisfying P1 (P2) ?
– With the timing constraints generated from SpecC designs
(P1, P2) = (pass, pass) Impossible
Always st1st2
(P1, P2) = (fail, pass)
Always st2st1
(P1, P2) = (pass, fail)
(P1, P2) = (fail, fail)
Order is undecidable
Can be
sequentialized
Sequentialization
a = 10;
b = 10;
c = a + b;
x = 20;
y = 20;
z = x + y;
No dependence
No check is needed
a = 10;
b = 10;
c = a + b;
x = 20;
y = 20;
z = x + y;
a = 10;
wait e;
c = a + x;
x = 20;
notify e;
y = 20;
z = x + y;
Synchronized
always x=20  c=a+x?
Result: YES
always x=a+x  x=20?
Result: NO
Can be sequentialized!!
a = 10;
x = 20;
y = 20;
z = x + y;
c = a + x;
x = 10;
a = 10;
c = a + x;
x = 20;
y = 20;
z = x + y;
Not synchronized
always x=10  x=20?
Result: NO
always x=20  x=10?
Result: NO
Cannot be sequentialized!!
For HW/SW co-design
Software
part
RTL
C
Abstraction
on
Comm.
SW
(FSMD)
Transform
to FSMD
HW
(FSMD)
Identify
Synch.
points
Model checking
HW+SW
(FSMD)
Equivalence checking
Recude
# states
Sequentialization
Case Study 1: MPEG4
• Difference between designs
– Constant propagation, constant folding, common
sub-expression elimination in IDCT function
• Design size
– About 6300 lines in SpecC
– About 50k nodes and 36k edges in ExSDG
• ExSDG generation time: 780 sec
Nodes in
diff
Result
Run time
# of ext.
MPEG4_org MPEG4_rev1
96
Eqv
3.3 sec
0
MPEG4_org MPEG4_rev2
96
Not eqv
13.2 sec
80
Case Study 2: Elevator Controller
• Difference between designs
– Speculative code motion in control paths
• Design size
– About 3300 lines in SpecC
– About 20k nodes and 20k edges in ExSDG
• ExSDG generation time: 178 sec
Nodes in
diff
Result
Run time
# of ext.
Elv_org
Elv_rev1
4
Eqv
1.8 sec
1
Elv_org
Elv_rev2
3
----
> 12 hours
4
Rule-based Equivalence Checking
Checks the equivalence between two highlevel (e.g. SpecC) design descriptions
• Assuming the equivalences of variables,
equivalence rules are applied in a bottom-up
manner
– Equivalence rules are defined in terms of static
dependence relations and control flows
• Verification result is either "equivalent" or
"cannot prove the equivalence"
– It cannot prove that they are not equivalent
– Equivalence rules are heuristically picked up
Equivalence Rules (1/3)
Rule 1: Expression
• Checks the equivalence considering the
commutative, associative, distributive laws
d * a * b + c * d
d * (b * a + c)
Original design
Modified design
+
*
*
*
d
a
Distributive law
*
c
b
*
d
Commutative law
+
d
b
c
a
Equivalence Rules (2/3)
Rule 2: Assignment
• The variable in LHS is equivalent to RHS until
the variable is re-assigned
{
{
int c = a - b;
return c;
return a - b;
}
}
Original design
Introducing an intermediate variable
return
return
=
Rule 2
ー
a
ー
c
b
Rule 1
a
c
b
Equivalence Rules (3/3)
Rule 3: Sequential composition
• Execution order can be changed unless it
destroys the data dependence relations
{
L1: c = a + b;
L2: d = a + c;
L3: e = b + c;
}
Original design
{
L2: d = a + c;
L1: c = a + b;
L3: e = b + c;
}
Swapping L1 and L2
L2
Swapping L2 and L3
seq
seq
L1
{
L1: c = a + b;
L3: e = b + c;
L2: d = a + c;
}
L3
L2
L1
seq
L3
L1
L3
L2
Example: Bottom-Up Application of Rules
{
{
c = a - b;
f = d + e;
f = e + d;
c = a - b;
}
}
Original design
Modified design
Rule 3
seq
Internal
Rule 1
equivalences
f
ー
a
Rule 2
=
=
c
b
seq
f
+
d
e
=
=
e
ー
c
+
d
a
b
A Known Issue of Rule-based
Checking
How can we find internal equivalences?
• Our initial method finds them by "name"
– That is, all variables with the same name are
identified to be equivalent
• This approach fails the equivalence
checking when variable names change
– Typically, variable names are changed
through design transformations
– If variables in different places have the same
name, the result may be false positive
Examples of Checking Failure
Though following examples are all equivalent,
name-based equivalence checker fails
int ex1(int a, int b) {
return a - b;
}
int ex2(int b, int a) {
return b - a;
}
Original design
Swapping the variable names
int ex3(int c, int d) {
return c - d;
}
int ex4(int a, int b) {
int c = a - b;
return c;
}
Modifying the variable names
Introducing an intermediate
variable
Identifying Potential Internal Equivalences
• Perform a random simulation, then identify a set of
variables having the same signature (i.e. sequence
of simulated values)
– Well-known technique in RTL verification
• In RTL, the values of registers are uniquely known at every cycle
e = a - b;
f = a + b;
a=(35,-4,712)
b=(-220,1151,-3)
Random Pattern
Design 1
x = b + a;
y = a - b;
Design 2
Simulation
– However, there is no concept of "cycle" in behaviorallevel design descriptions
• The concept of "a variable with context" is introduced
e=(255,-1155,715)
f=(-185,1147,709)
x=(-185,1147,709)
y=(255,-1155,715)
Signatures
{e, y}
{f, x}
Potential
Internal
Equivalences
Internal Equivalences at Behavioral Level
Definition of a cycle
• A period of the execution from accepting an input pattern
to generating a set of output values
• Internal variables must be assigned once in a cycle
– Designs are in static single assignment (SSA) form
– The concept of "a variable with context" is introduced for multiple
instantiations of modules and function calls
A variable with context
• Context: a runtime path information from the top-level
module to the current function
• Guaranteed to be assigned only once in a cycle
Method Based on Potential Internal
Equivalences
Issue: Potential internal equivalences may include
false equivalences
• False equivalences may lead to false positive results
• Non-equivalent variables may be identified as equivalent
– Equivalent variables are always identified as equivalent
– True equivalences are included in potential equivalences
Solution: Explores all possible subsets of internal
equivalences when applying the rules
• Still practical since each set of internal equivalences is
typically small
Implementation
Implemented in C++ on top of our system-level
verification framework FLEC
– Given a SpecC design description as an input, ExSDG
representation is generated by parser and dependence
analyzer
• Potential internal equivalence identifier
– Generates SpecC description from ExSDG
representation as well as a random input pattern
generator module
– Compiles the random simulator using SpecC reference
compiler
• Rule-based equivalence checker
– Each rule is implemented as a callback function
– Given two nodes in ExSDGs, the equivalence between
two sub-trees are checked
Case Study: A Practical Design
Example: Two IDCT designs before and after
parallelization
• Column & row processes are parallelized
• Existing method failed checking since it could
not find the correspondences between variables
• Proposed method checked the equivalence
– 10 cycles of random simulation
– Runtime: ~3 seconds
• Mainly compilation and execution time of simulator
Outline
• Overviews of our verification research activities
regarding to C-based high-level design
descriptions
– Based on Extended System Dependence Graph
(ExSDG)
• Equivalence checking as well as static/model
checking with ExSDG
• Post-silicon verification method through mapping
chip traces with ExSDG
• Verification/debugging methods for large
arithmetic circuits
• Automatic generation of on-chip bus protocol
transducers
initiator
Debug Hardware
channel
target
put(req)
• Event Extractor
get(req)
true/false
…
true/false
– Extract basic required information
•
•
•
•
Transaction type (read/write)
get(res)
Start of transaction
true/false
End of transaction
master
Initial and target address of transaction
• Trace Buffer
put(res)
true/false
bus
slave
m_request
opb_grant
– Store extracted transactions in am_select
compact form
opb_xferac
k
…
…
opb_dbus
s_dbus
s_xferack
Post
Debug
with
ExSDG
Focus
on Silicon
communication
parts
of designs
 Establish mapping between ExSDG and chip traces
Debug Flow:
1- Extract basic transactions
2- Store in a trace buffer
* Run system until a failure *
3- Read the trace buffer
4- Analyze traces with software
Module 1
channel
Transaction
Extractor 1
Transaction type (read/write)
Start & end of transaction
Initial and target of transaction
ExSDG for C
Software
4
…
Trace
Buffer
Module k
On-chip bus,
Network, …
Initial System
Find wrong behaviors
using debug patterns
Potential race
Potential deadlock
Read Out 3
2
Debug HW
Debug SW  Store extracted
transactions
Post Silicon Debug: some results
• Hardware overhead is low
• Trace buffer not large


OPB Bus: 1966 gates
PLB Bus: 2206 gates
– Trace buffer fields: 1 or 2 bytes per transaction
Master ID Slave ID
Address
R/W Command Tag
• Debug patters as assertions
1) Master1 locks first semaphore.
2) Master2 locks second semaphore
3) Master1 waits for second semaphore
4) Master2 waits for first semaphore
5) Steps 3 and 4 are repeated
assert never
SoTr(m1, s1, Wr, -, t1) ;
SoTr(m2, s1, Wr, SAME, t2) ;
EoTr(m1, s1, Wr, SAME, t1)
| EoTr(m1, s1, Wr, -, -) ;
SoTr(m2, s1, Wr, SAME, -)
filter (*,*,*)
• Analysis is quick: 20 seconds for 100,000
transactions
– Working with ExSDG (C descriptions)
Our approach to debugging high-level
• Concrete simulation
– Depth-first search: long range,
narrow width
• Formal method: Symbolic
state traversal
Reachable states
DFS
• Our approach: user-driven
BFS combines DFS and BFS
– DFS: To collect reachable states
– BFS: To search exhaustively
around states of interest (user
specified)
– User switches DFS and BFS
using various commands
including execution path
specification
F
F
F
Initial
States
DFS
Reachable states
– Breadth-First Search: short range,
wide width
Faulty
states
F
F
F
Initial
States
Reachable states
BFS
F
F
Initial
States
BFS
F
DFS
jump
BFS Infeasible
states
Some experiences
• Filter design from a company
– 170 LoC in SpecC, part of buggy real design
– BMC cannot detect the bug (simply too deep)
– Designer specifies a set of execution paths that he has concerns
• Concern: some portions of the codes in particular sequences
may not work
• Successfully generate 61 cycles pattern with the proposed
approach
• Elevator controller
– 9500 LoC in SpecC
– Target assertion
• Door must open within 30 cycles after pushing up/down button
while elevator is stopping on a floor
– BMC failed at 120th cycle (after >10 days run)
– Looks like there is no such issue from BMC
• User-guided analysis realizes failure !
Outline
• Overviews of our verification research activities
regarding to C-based high-level design
descriptions
– Based on Extended System Dependence Graph
(ExSDG)
• Equivalence checking as well as static/model
checking with ExSDG
• Post-silicon verification method through mapping
chip traces with ExSDG
• Verification/debugging methods for large
arithmetic circuits
• Automatic generation of on-chip bus protocol
transducers
Data-path Dominated Applications
Real Number
Specification
Modular-HED
Representation
Arithmetic
Bit Level
Polynomial
Optimization
Equivalence Checking
Debugging
HED
Representation
Floating point
Model
MATLAB Model
(Fixed point)
Automatic Fixedpoint Generation
Refinement and
Optimization
RTL Model
(Fixed bit-width)
RTL Synthesis
Gate-level Model
(bit level)
Physical Design
Horner Expansion Diagram (HED)
f(x)
• Horner Expansion: f ( x , y ,...)  f const  x . f linear
– Const (Left) child (dashed line)
– Linear (Right) child (solid line)
x
fconst
• Example
– f(x,y,z) = x2y+xz-4z+2; Order: x>y>z
– f(x,y,z) = [-4z+2]+x[xy+z]
= fconst+x.flinear
-4z+2
flinear
f(x,y,z)
x
xy+z
x
• fconst = -4z+2 = f(z)
• flinear =f1(x,y,z)= xy+z
y
z
– f1(x,y,z) = xy+z = z+x[y]
y
• f1const = z; f1linear = y
– Horner form
• f = x(x(y)+z)+z(-4)+2
z
2
z
-4
0
1
0
1
High-level Polynomial Datapath Verification
Modular Equivalence checking
– Anti-aliasing function F  1 ( 2 ( a 2  b 2 ) )  1 ( 2 x )
– Expand into Taylor series
coefficients
1
F 
x 
6
64

279
9
x 
5
115
32
x 
2
64
x 
4
64
81
32
x
75
x
3
16
a
85
64
– Implemented as a fixed size datapath
• F1[15:0], F2[15:0], x[15:0]
• F1 =
+
43593 x2 + 40244x + 13281
• F2 = 156x6 + 5380x5 + 1584x4 + 10469x3 +
27209 x2 + 7456x + 13281
156x6
b
62724x5 +
17968x4 +
18661x3 +
x = a2 + b2
x
MAC
Reg
• F1 ≠ F2 over Z
• F1[15:0] = F2[15:0] mod 216
F
coefficients
Modular-HED (M-HED)
• The Smarandache function in number theory is
defined for a given positive integer b as the
smallest positive integer such that its factorial S (b )!
is divisible by b
– Example, the number 8 does not divide 1!, 2!, 3!, but
does divide 4! (4!/8 = 3), so S(8) =4
• 1*2*3*4 % 8 = 0
• 5*6*7*8 % 8 = 0
• 100*101*102*103 % 8 = 0
– A product of 4 consecutive numbers is divisible by 8
• x(x+1)(x+2)(x+3)  0 mod 8
• x(x+1)(x+2)(x+3) = x4 + 6x3 + 11x2 + 6x can be freely
added or subtracted under mod 8 !
• Co-efficients are modified transforming given polynomials
to normal forms
Modular-HED (M-HED)
• Vanishing Polynomial:
g(x) =  ( x  i )  0 mod n
• If we can factorize a polynomial g(x) into a
product of S(n) consecutive numbers, then
it can be reduced to 0 in Z n
S (n)
i 1
– f ( x )  x  21 x
– S(24) = 6
6
5
 175 x  735 x  1624 x  1764 x  720
4
3
2
over
Z 24
• f(x) = (x+1)(x+2)(x+3)(x+4)(x+5)(x+6)  0 mod 16
– It can be reduced to ZERO!
Preliminary Experimental Results
Benchm
ark
Specs
ModularHED
Var/Deg/n
Node/Time
AAF
1 / 6 / 16
D4F
Method
[9]
CUDDBDD
*BMD
miniSat [13]
MILP
[12]
Time (s)
Node/Time
Node/Time
Vars/Clauses/Time
Time (s)
8 / 0.016
6.81
1.1M / 32.2
NA / >500
3.9K / 107K / >500
>500
1 / 4 / 16
6 / 0.031
4.95
27M / 20.3
NA / >1000
25K / 76K / >1000
>1000
CHEB
1 / 5 / 16
7 / 0.01
5.95
1M / 26.9
NA / >500
3.5K / 86K / >500
>500
PSK
2 / 4 / 16
MI
Boolean
reasoning
methods
never52Kworks
16 / 0.032
13.48
NA
/ >500
NA / >500
/ 142K / >500
2Need
/ 4 / 16
9 / 0.016
14.4
NA / >1000
NA / >1000
10K
/ 30K / >1000
to
use word
level
techniques
such
as
the
proposed
one
2 / 9 / 16
26 / 0.2
17.5
23M / 39.4
NA / >1000
24K / 69K / >1000
SG
5 / 3 / 16
35 / 0.24
6.1
NA / >1000
NA / >1000
64K / 190K / >1000
>1000
QS
7 / 4 / 16
19 / 0.09
32.4
NA / >1000
NA / >1000
76K / 211K / >1000
>1000
DIRU
NA: Not Applicable;
K: Thousand;
M: Million
>500
>1000
>1000
High-level Polynomial Datapath Optimization
A partitioning and compensation heuristic
• A polynomial is given, how we can
optimize it in terms of the number of
adders and multipliers on fixed bit-width
• Our Solution
Modular-HED
– Apply Modular-HED
• To reduce over Z2n
– Partitioning approach
Partitioning
• Poly = p1*p2 + p3
• Minimize p3
– Compensation approach
• Compute Coefficients
Compensation
Partitioning Heuristic
• poly = p1p2+p3 with unknown coefficients
• Minimize the cost of p3
poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz + 14y
+5z +3
poly
• After Partitioning
p1 = a1x2 + a2y + a3z;
p2 = b1x + b2y +b3;
p3 = w4 + 3
poly = p1*p2 + p3
Partitioning
p1, p2, p3
Compensation Heuristic
• In our example
poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz
+ 14y +5z +3
p 1 = a 1 x 2 + a 2 y + a3 z
p2 = b1x + b2y +b3
p3 = w4 +3
• Set the coefficients (ai,bj) in order to
achieve the minimum cost p3
• First consider all the equations produced
by p1*p2 = poly - p3
– a1b1=-1, a1b2=-1,a1b3=-5,a2b1=2,a2b2=2,a3b1=1,
a3b2=1, a2b3=14, a3b3=5
– These equations may not have an answer!
Preliminary Experimental Results
Applicatio
n
Horner
Function
Enhanced CSE
Our Approach
M/V/D/n
#Gate
Delay
Time
#Gate
Delay
Time
#Gate
Delay
Time
Graphics
CosineWavelet
9/2/3/16
7850
23.7
0.04
5109
18.3
3.47
3678
18.5
6.52
Image
Processing
SavitzkyGolay 1
10/2/3/16
7218
20.7
0.04
3757
22.7
4.75
2879
17.8
14.9
Image
Processing
SavitzkyGolay 2
5/2/2/16
1697
20.8
0.03
1433
18.1
0.48
1057
16.2
1.63
Filter
Intensive optimization on polynomials greatly
Quad 1
5/2/2/16
2737
17.7
0.03 2269
16.4 0.63 1763
17.1
reduce
the
gerated
custom
circuits/software
1.78
Filter
Quad 2
5/2/2/16
2569
16.4
0.03
2032
16.4
0.63
1571
15.3
1.56
Automotive
Mibench
9/3/2/16
2058
15.7
0.04
2046
15.6
0.46
1303
14.1
8.75
0%
0%
-
31%
6.3%
-
49.2%
13.8%
-
Average Saving w.r.t Horner
M: the number of monomial
D: the maximum degree
V: the number of variables
n: the number of bits (word-length)
Bit Level Adder (BLA) Model
• In an arithmetic circuit
several addition processes
are possible
• However, each addition
can be represented
with BLA model
• Represent A+2mB, by some
custom full-adders and halfadders, which represent two
cascade XORs and one
XOR, respectively
– Realization of the carry
signals is not uniquely
modeled
• BLA is robust for
–
–
–
–
Carry-look-ahead
Ripple-carry
Carry-select
Carry-skip adders
Theproduct
Proposed
Debugging
• Partial
initialization
Algorithm
– Extract each column of 2i from S
– Provide a bit-level ADD_SET with
different bit-orders
– Bits in each column must be added
with each other while the
generated carries are sent to
higher order column
• Column-based XOR extraction
– Search XORs over initial partial
product terms of each ADD_SET
column
– At least one input from column 2i
– XOR extraction is performed
without carry-logic blocks
• Much faster than other XOR extraction
techniques
P1(2)
P2(2)
Unknown
P3(2)
The Proposed Debugging Algorithm
4-bit multiplier
• Carry-signal mapping using HED
– For each FA with no unknown
input build a new reference (in
HED)
– For each HA/FA with an unknown
input Cm
• Find the backward logic of Cm
• Map Cm to the HED of the most
similar reserved carry from previous
column
FA
Experimental Setup and Results
• F1 = A*B;
F2 = C*D+3E+120;
– A, B : 32-bit
F3 = A*B+C*D
C,D,E,F1 : 64-bit
F2,F3 : 129-bit
Can deal with large (64, 128) bit arithmetic
functions for efficient verification and debugging
Outline
• Overviews of our verification research activities
regarding to C-based high-level design
descriptions
– Based on Extended System Dependence Graph
(ExSDG)
• Equivalence checking as well as static/model
checking with ExSDG
• Post-silicon verification method through mapping
chip traces with ExSDG
• Verification/debugging methods for large
arithmetic circuits
• Automatic generation of on-chip bus protocol
transducers
Communication parts can be very complicated
• Need interface bw protocol A and B
• Protocols can be very complicated
– Over 30 different commands defined in the
state-of-the-art protocols (OCP)
– Manuals over 200 pages
– Bust, out-of-order modes, …
• We have developed an automatic
generator of protocol transducers
CPU
(IP)
Protocol A
Trans
-ducer
MPEG
DMAC
Protocol B
– Protocols are formally defined with
FSM/automaton
– State-of-the-art protocols can be dealt with
in a couple of minutes
– Now being extended to:
On/off-chip bus interconnections
Formal verification of interfaces
DMAC
(IP)
Protocol A
RAM
RAM
(IP)
Custom
HW
Protocol B
71
How protocol transducer is realized
• Intuitive understanding of the problem
– Follow the two protocols ⇒ compute the product of
the two FSM/automata
(stb==1)
ack<=0
設計対象
ack<=0
Clock-wise
behavior
ack<=1
Protocol Request
A
Master Response
Request
Protocol
B
Response Slave
Protocol A
Exploration
[1] + ours
Protocol B
Definition of protocol
Protocol transducer
In FSM/automaton
[1] R.Passerone, J.A.Rowson, A.Sangiovanni-Vincentelli,
“Automatic Transducer Synthesis of Interfaces between Incompatible Protocols” ,DAC’98 pp.8-13
Simple computation of products
ctrl
Protocol
data
A
8
8
Master
{Ctrl=1, data1}
Some paths are
avoidable
Transducer
A
{Ctrl=0}
B
{Ctrl=1, data1}
Invalid due to
Illegal dependency
A
E
A
B
D
D
B
Unavoidable
path
E
{Ctrl=0, data2}
C
{Ctrl=0, data2}
C
D
D
C
E
E
F
Protocol
B
Slave
8
ctrl
data
F
B
F
{Ctrl=0}
{Ctrl=1, data1}
{Ctrl=0, data2}
C
C
E
C
F
C
F
Invalid due to
Illegal dependency
{Ctrl=0}
Transducer
Minimum latency path
8
• Protocol definition automaton should not have any loops
– Even in the same state, data values are different
Need separation between
– Expansion can be infinite
comp. and comm. !
Separation of computation and communication
inside protocol transducers
• In protocol definition,
control and data are
separately specified
• Introduce two FSMs for
request and control to
describe complicated
protocols uniformly
• FIFO can be made
arbitrary complicated if
we like
Protocol
A
Master Res.
FSM
Transducer
Protocol
B
Slave
Res.
Even arithmetic
computation
possible
Protocol
A
Master
Req.
Req.
FSM
Req.
Res.
Res.
Res.
FSM
Transducer
Protocol
B
Slave
Protocols can be very complicated
• State-of-the-art protocols introduces many
features for faster throughputs
t
Req1 → Res1
Req2 → Res2
t
t
Req1
Req2
Req3
Res1
Res2
Split transaction
(Non blocking)
Blocking
(Low throughput)
t
Req1
Req2
Res1
Req3
Res3
Res2
Out of order
transaction
Protocol
Master
Request
(Address / Data)
Request
Response
(Data)
t
Protocol
Slave
Request
Addr1
Data1
Addr2
Data2
Data2
Addr3
Data3
Data3
Addr4
Data4
Data4
Burst
transaction
Addr1
Data1
Single address
Burst trans.
For more complicated protocols…
入力オートマトン
Req.
Req.
Req
Res.
Protocol A
Req
Res.
Protocol B
Control for FIFO
Read
X
Write
Pros: Can deal with more
complicated protocols
Cons: Need more latency
delay due to multiple FIFO
Protocol
A
Master
Newly introduced
FIFO
Req.
Req. Transducer
FSM
Res.
Send
FSM
FIFO
RD
X
Res
Req.
Recv
FSM
Res.
Res
X
Protocol
B
Slave
FIFO
WR
Now we can resolve it
• Elimination of loops
(to initial states)。i
i
C
D
B
e
i
C
D
e
Introduction
of ending state
Z
W
e
U
SS
=
Loops are
replaced with
super states
i
Y
X
Z
W
Elimination
pf ending states
Exploration
B
X
Y
A
• Elimination of
intermediate loops
Exploration
A
i
Exploration
i
[2]
U
• Concentrating on controls only
– Date parts are processed separately !
[2] S.Watanabe, K.Seto, Y.Ishikawa, S.Komatsu, M.Fujita, “Protocol Transducer Synthesis using Divide and Conquer approach, “
Proc. of the 12th. Asia and South Pacific Design Automation Conference, pp.280-285, 2007.
How to deal with multiple complicated
transactions [2]
• A protocol is a collection of sequences
• Each sequence can operate independently
– True for state-of-the-art protocols with separation
between computation and communication
Sequence1 (Read)
Sequence2 (Write)
Automaton2 For response
Sequence3 (4 burst read)
Sequence4 (4 burst write)
・・・
Protocol
Automaton1 For request or
blocking
Hardware
definition
(stb==1)
ack<=0
i
ack<=0
ack<=1
All sequences share initial state
Port, signal names, etc.
Hierarchical synthesis owing to comp.
and comm. separation [2]
i
Merge generated FSM with
the same initial state
i
i
+
Protocol
A
Sequence
A1
グラフ
探索
Exploration
Sequence
A2
=
Partial transducer
1
Transducer
Protocol
B
Sequence
B1
グラフ
探索
Exploration
Partial transducer
2
Sequence
B2
Sequence level synthesis followed by merge process
Tool implementation
• Planned to be distributed freely from OCP-IP
Experimental results
• Atholon64 2GHz + 1GB RAM
• Implemented as over 12,000 loc in C++
– Input: Hierarchical automaton descriptions in XML
– Output: RTL synthesizable Verilog
• Logic synthesis: Xilinx ISE
• RTL simulator: Model Sim XE
Mater's
Protocol
Slave's
Protocol
Type
Sequences
Synth.Time
Gate counts
OCP
AHB
(NB,BK)
4
1.1[s]
2,352
AHB
OCP
(BK,NB)
4
1.3[s]
1,843
OCP
OCP
(NB,NB)
2
1.9[s]
1,568
OCP
Tagged OCP
(NB,OoO)
2
2.2[s]
3,514
Tagged OCP
AXI
(OoO,OoO)
2
4.8[s]
1,377
AXI
OCP
(OoO,NB)
2
4.9[s]
1,731
OCP
AXI
(NB,OoO)
26
257.8[s]
61,205
Descargar

スライド 1