Verification and debugging of hardware designs utilizing C-based high- level design descriptions Masahiro Fujita VLSI Design and Education Center (VDEC) University of Tokyo System level design flow • Typical design flow used in industry • Mostly C language based design descriptions for embedded system and SoC designs conception Ideas Design as a Meeting Point Components Component Libraries Performance Analysis HW/SW partitioning Functional decomposition Structural Decomposition Function-Separated Description C/TLM High Level Synthesis Architectural Exploration SW High Level / Transactional HW2 HW3 HW1 RTL C/HL RTL Synthesis Automatic / Manual RTL to RTL Optimization (Electric) System level design tools: Elegant • Joint development by JAXA, Toshiba, NEC, Fujitsu, UC Irvine, and U. of Tokyo (formal verification) • Help assigning functions onto components Static/model checking, Equivalence checking System level design A C Device driver C Protocol B Protocol A Hardware implementation design (RTL) HW？ HW？ D? Interface design Communication network design Interface circuits HW/SW partitioning Software implementation design Processor model SystemC/SpecC Algorithm design Traditional design starts here… JAXA:A ?Japanese space exploration agency B? Elegant tool has Cactually been D used Bfor space D satellites D C? SW？ 3 SW？ SW HW SW HW A B Outline • Overviews of our verification research activities regarding to C-based high-level design descriptions – Based on Extended System Dependence Graph (ExSDG) • Equivalence checking as well as static/model checking with ExSDG • Post-silicon verification method through mapping chip traces with ExSDG • Verification/debugging methods for large arithmetic circuits • Automatic generation of on-chip bus protocol transducers System Dependence Graph • Sufficient representation since verification methods for net-list are applicable to SDG • Problems – Existing SDGs are too complicated (# nodes/edges are huge) – Not directly corresponding to abstract syntax trees main int i = 1; main(){ int a = 0; int b = 0; a = a + i; if(a == 1) b = a++; } int a int i Dependency Analysis Control dependence Data dependence Declaration dependence int b i = 1; b = 0; a = 0; a = a + i; if(a==1) a++ b = a++; System Dependence Graph (SDG) Extended Syntaxes in ExSDG from C Module Sub Module 1 port Sub Module 2 Hierarchical Structure bit a[3:0]; a bit b[3]; a b c Concurrency b a[2:0]@b[1] Bit Vector buffered bit[1] a; a = 1; waitfor(1); waitfor(1); a = 0; waitfor(1); par{ a.main(); b.main(); c.main();} ・・・ wait(a); ・・・ ・・・ ・・・ notify(b); ・・・ ・・・ ・・・ notify(a); ・・・ wait(b); ・・・ ・・・ Synchronization clk a Timed Behavior Translation Flow to ExSDG Untimed Behavior Level: ・ No notion of time ・ For software design or behavior specification SystemC Design System Level / Behavior Level SpecC Design SystemVerilog Design RTL Verilog Design VHDL Design AST (UBL) AST (TBL) AST (RTL) Complex or redundant syntaxes are removed ex) switch statement Simplification ExSDG Timed Behavior Level: ・ Including timing specification ・ For design with timing estimation Register Transfer Level: ・ Cycle accurate ・ For hardware design Three Design Stages module TEST( int in1, int out1, event e) { Proc p1(); Proc p2(); void main(){ par{ p1.main(); p2.main(); } wait(e); } }; module TEST( int in1, int out1 event e){ Proc p1(); Proc p2(); void main(){ par{ p1.main(); p2.main(); } waitfor(1); wait(e); } }; Untimed Behavior Level: ・ Concurrency and Synchronization ・ Times expressions and buffered/wire variables are prohibited Timed Behavior Level: ・ Untimed Behavior Level + Timed expressions ・ buffered/wire variables are prohibited module TEST( wire int in1, wire int out1){ buffered int a; void init(){ out1 = a; } void run_one_cycle(){ a = in1; } main(){} }; Register Transfer Level: ・ par and wait/notify are prohibited ・ init(): Wire connections are declared ・ run_one_cycle(): Executed per clock Edges in ExSDG • • • • • • Control Flow Edge Data Dependence Edge Control Dependence Edge Declaration Dependence Edge Parameter In/Out Summary Edge (Interprocedural Dependence Edge) • Parallel Edge • Communication Edge • Port Reference Edge Concurrent and Synchronization Dependence int x, y; event e1,e2; void main(void){ y = 0; x = 5; par{ func1(); func2(); } } par func1(); void func1(){ y = x; notify(e1); wait(e2); x = y; } void func2(){ wait(e1); y++; y = y * y; notify(e2); } Process 2 Process 1 notify(e1); wait(e1); wait(e2); notify(e2); func2(); end Parallel Edge Communcation Edge Hierarchical Dependence module M1(int in, int out, event e1){ void main(void){ wait(e1); out = in * in; } }; module M3(int inout){ int w1; event e1; M1 m1(w1, inout, e1); M2 m2(inout, w1, e1); void main(void){ par{ m1.main(); m2.main(); } } }; module M2(int in int out event e1){ void main(void){ out = in + in; notify(e1); } }; int M::in; int M1::out; event M1::e1; int M3::w1; event M3::e1; int M3::inout; int M2::in; int M2::out; event M2::e1; Port Reference Edge Experimental Results • Compared with a SpecC SDG generation method using CodeSurfer • Performed on a PC with Xeon 3.2GB and 2GB memory SDG Generation Time Example NoL CodeSurfer ExSDG IDCT 420 8.7s 5.2s Elevator 3055 9.1s 222.2s MPEG4 5657 30.4s 902.9s Num. of Nodes and Edges in IDCT Example ＃ of nodes ExSDG 453 SDG generated by CodeSurfer 6380 # of edges 2061 48073 Outline • Overviews of our verification research activities regarding to C-based high-level design descriptions – Based on Extended System Dependence Graph (ExSDG) • Equivalence checking as well as static/model checking with ExSDG • Post-silicon verification method through mapping chip traces with ExSDG • Verification/debugging methods for large arithmetic circuits • Automatic generation of on-chip bus protocol transducers From the viewpoint of verification • • • • Keep entire descriptions as correct as possible Static/Model checking each description Equivalence checking between two descriptions Besides simulation, formal methods should be applied Design optimization Algorithmic design description Many steps of manual refinement Static/ Model checking Design optimization High level synthesizable description High level synthesis Design optimization Many steps of manual refinement Register Transfer Level description Sequential Equivalence Checking Basic Procedure • Symbolic simulation – Generates a set of equations from designs – Every variable/operation is an uninterpreted symbol – Every expression is a formula of symbols • Equivalence Class (EqvClass) – A class containing all equivalent expressions/variables – Generated during symbolic simulation based on • Assignment statement • Substitution with equivalent expressions/variables • Solving with SMT solvers – If generated equations have to be solved to prove the desired equivalence, we use SMT solvers to interpret arithmetics – Public ones and our own solves Example a = v1; b = v2; add1 = a + b; add2 = v1 + v2; Description 1 Description 2 E1 E2 E3 E4 (a, v1) (b, v2) (add1, a+b) (add2, v1+v2) 4 EqvClasses are generated from 4 assignments E1 (a, v1) E2 (b, v2) E3’ (add1, a+b, add2, v1+v2) E3 and E4 can be merged by substituting a with v1, b with v2 Representing Symbolic Expressions: Maximally Shared Graph VC: x0=a+b; if x0<0 then x1 = -x0; y1 = y0; else x1 = x0; y1 = y0+x0; assert(0<=y1); <= x1 y1 ITE ITE cond cond false true false – > 0 true x0 a + + y0 b Maximally Shared Graph • Linear-sized representation VC: – Mathematically equivalent to standard logical representation • Advantages x1 – Structure explicit [flow of data in the graph corresponds to flow of data in the program] – Simple slicing • No structural redundancies • Not functionally canonical – Practical trade-off <= y1 ITE ITE cond cond false true false – > 0 true x0 a + + y0 b Problem and Our Approach • Symbolic simulation cannot be applied to a whole large designs – Because of the path explosion problem – Approach: Localizing the areas that are symbolically simulated utilizing differences : difference return a; return a; return a; return a; Equivalence Checking Flow Seq. desc1 Seq. desc2 Extension of the Area and decision of its inputs/outputs Identification of diff. Is there any diff. left? No “Equivalent” Decision of initial verification area and its inputs/outputs Verification Verification Not eqv Yes Eqv Eqv Not eqv Reach to primary inputs/outputs? Yes “Not equivalent” No Verification Area and its Input/Output • Verification area: A set of statements • Input variables – Used in the area, and assigned outside • Output variables – Assigned in the area, and used outside • Verification problem for the area – Are all pairs of output variables with the same name are equivalent? – Using proved equivalences of the input variables Example a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + in3; out0 = a1 * c0; • Initial verification Diff a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + 2 * in3; out0 = a1 * c0; (※) in1, in2, in3 are the primary inputs of both the descriptions – Input … b0, in3 – Output … a1 – Result … Equivalence cannot be proved Forward extension from a1 Example a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + in3; out0 = a1 * c0; • 2nd verification Diff a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + 2 * in3; out0 = a1 * c0; (※) in1, in2, in3 are the primary inputs of both the descriptions – Input … b0, c0, in3 – Output … out0 – Result … Equivalence cannot be proved Backward extensions from all inputs (forward extension cannot be applied any more) Example a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + in3; out0 = a1 * c0; • 3rd verification Diff a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + 2 * in3; out0 = a1 * c0; (※) in1, in2, in3 are the primary inputs of both the descriptions – Input … a0, in2, c0 (= 0), in3 – Output … out0 – Result … Equivalent (due to c0 = 0) Though the different part are not equivalent, the primary output is equivalent Implementation • FLEC: An Framework for Verification, Debugging Support, and Static Checking – Several engines developed by ourselves • Symbolic simulator, difference extraction, input pattern generation, static deadlock checker, slicing, … – Dependence graph-based internal representation of system-level designs • A number of APIs are provided to help development of engines • ExSDG (Extended System Dependence Graph) – Designs are represented in form of ExSDG in FLEC – Frontend from SpecC to ExSDG is already developed FLEC Structure SpecC Design1 (.sc) SpecC Design2 (.sc) Equivalence checked ExSDG Generation ExSDG (.fls) ExSDG (.fls) Eqv. Spec Eqv. Classes Control Rule-based Engine Symbolic Simulator Verification procedure (Applied Method & Order) Control Diff Extraction Sequentializing Parallel Behaviors SMT solvers Result Sequential equivalence checking • Definition of equivalence – Intension of designers, management of differences • For efficient checking: – Identification of matching states Bounded equivalence checking between matching states – Identification of equivalent internal points • Use of SMT solvers reset spec spec align falsified inputs clocks outputs 1 Verify counterexample proven impl align bounded or full proof reset impl Transactions : State View SL transaction SLM Refinement mapping RTL transaction RTL • Encapsulates one or more units of computation for the design being verified • Self-contained since it brings the machine back to a synchronizing state Transaction : Memory • Design 1 transaction : a single memory read/write occuring in a single cycle • Design 2 transaction: single memory read/write (potentially) happening over multiple cycles RD WR ADDR OUT ADDR DATA Design 1 Mem DATA Cache RD WR Mem Design 2 Cache ctl OUT Transaction-Based Equivalence Checking • The states in RTL that correspond to states in systemlevel model (SLM), are referred to as synchronizing states • Based on this, definitions of equivalence are generated manually • The total equivalence is based on inductions on synchronizing states Complete SLM Verification Refinement mapping RTL or Transient states Sequential counterexample Equivalence Specification • Equivalence is specified by – (Port, Throughput, Latency, Condition) behavior Adder1(in int in1, in int in2, in int in3, out int out1) { void main() { out1 = in1 + in2 + in3; } }; behavior Adder2(in int in1, in int in2, in int in3, out int out1) { void main() { int tmp; while(1) { tmp = in1 + in2; waitfor(5); tmp = tmp + in3; waitfor(5); out1 = tmp; } } }; behavior Adder3(in int in1, out int out1) { void main() { int tmp; while(1) { tmp = in1; waitfor(1); tmp = tmp + in1; waitfor(1); tmp = tmp + in1; waitfor(1); out1 = tmp; waitfor(1); } } }; (in1, 1, 0, TRUE) (in2, 1, 0, TRUE) (in3, 1, 0, TRUE) (out1, 1, 0, TRUE) (in1, 10, 0, TRUE) (in2, 10, 0, TRUE) (in3, 10, 5, TRUE) (out1, 10, 10, TRUE) (in1, 4, 0, TRUE) (in1, 4, 1, TRUE) (in1, 4, 2, TRUE) (out1, 4, 3, TRUE) Sequentialization • Concurrent behaviors are sequentialized • If st1 and st2 running concurrently are “write-write” or “read-write” relation, check the following properties: – P1: always T(st1) > T(st2) – P2: always T(st1) < T(st2) – T(s) … execution time of a statement s • The checks are based on ILP – Is there any assignment satisfying P1 (P2) ? – With the timing constraints generated from SpecC designs (P1, P2) = (pass, pass) Impossible Always st1st2 (P1, P2) = (fail, pass) Always st2st1 (P1, P2) = (pass, fail) (P1, P2) = (fail, fail) Order is undecidable Can be sequentialized Sequentialization a = 10; b = 10; c = a + b; x = 20; y = 20; z = x + y; No dependence No check is needed a = 10; b = 10; c = a + b; x = 20; y = 20; z = x + y; a = 10; wait e; c = a + x; x = 20; notify e; y = 20; z = x + y; Synchronized always x=20 c=a+x? Result: YES always x=a+x x=20? Result: NO Can be sequentialized!! a = 10; x = 20; y = 20; z = x + y; c = a + x; x = 10; a = 10; c = a + x; x = 20; y = 20; z = x + y; Not synchronized always x=10 x=20? Result: NO always x=20 x=10? Result: NO Cannot be sequentialized!! For HW/SW co-design Software part RTL C Abstraction on Comm. SW (FSMD) Transform to FSMD HW (FSMD) Identify Synch. points Model checking HW+SW (FSMD) Equivalence checking Recude # states Sequentialization Case Study 1: MPEG4 • Difference between designs – Constant propagation, constant folding, common sub-expression elimination in IDCT function • Design size – About 6300 lines in SpecC – About 50k nodes and 36k edges in ExSDG • ExSDG generation time: 780 sec Nodes in diff Result Run time # of ext. MPEG4_org MPEG4_rev1 96 Eqv 3.3 sec 0 MPEG4_org MPEG4_rev2 96 Not eqv 13.2 sec 80 Case Study 2: Elevator Controller • Difference between designs – Speculative code motion in control paths • Design size – About 3300 lines in SpecC – About 20k nodes and 20k edges in ExSDG • ExSDG generation time: 178 sec Nodes in diff Result Run time # of ext. Elv_org Elv_rev1 4 Eqv 1.8 sec 1 Elv_org Elv_rev2 3 ---- > 12 hours 4 Rule-based Equivalence Checking Checks the equivalence between two highlevel (e.g. SpecC) design descriptions • Assuming the equivalences of variables, equivalence rules are applied in a bottom-up manner – Equivalence rules are defined in terms of static dependence relations and control flows • Verification result is either "equivalent" or "cannot prove the equivalence" – It cannot prove that they are not equivalent – Equivalence rules are heuristically picked up Equivalence Rules (1/3) Rule 1: Expression • Checks the equivalence considering the commutative, associative, distributive laws d * a * b + c * d d * (b * a + c) Original design Modified design + * * * d a Distributive law * c b * d Commutative law + d b c a Equivalence Rules (2/3) Rule 2: Assignment • The variable in LHS is equivalent to RHS until the variable is re-assigned { { int c = a - b; return c; return a - b; } } Original design Introducing an intermediate variable return return = Rule 2 ー a ー c b Rule 1 a c b Equivalence Rules (3/3) Rule 3: Sequential composition • Execution order can be changed unless it destroys the data dependence relations { L1: c = a + b; L2: d = a + c; L3: e = b + c; } Original design { L2: d = a + c; L1: c = a + b; L3: e = b + c; } Swapping L1 and L2 L2 Swapping L2 and L3 seq seq L1 { L1: c = a + b; L3: e = b + c; L2: d = a + c; } L3 L2 L1 seq L3 L1 L3 L2 Example: Bottom-Up Application of Rules { { c = a - b; f = d + e; f = e + d; c = a - b; } } Original design Modified design Rule 3 seq Internal Rule 1 equivalences f ー a Rule 2 = = c b seq f + d e = = e ー c + d a b A Known Issue of Rule-based Checking How can we find internal equivalences? • Our initial method finds them by "name" – That is, all variables with the same name are identified to be equivalent • This approach fails the equivalence checking when variable names change – Typically, variable names are changed through design transformations – If variables in different places have the same name, the result may be false positive Examples of Checking Failure Though following examples are all equivalent, name-based equivalence checker fails int ex1(int a, int b) { return a - b; } int ex2(int b, int a) { return b - a; } Original design Swapping the variable names int ex3(int c, int d) { return c - d; } int ex4(int a, int b) { int c = a - b; return c; } Modifying the variable names Introducing an intermediate variable Identifying Potential Internal Equivalences • Perform a random simulation, then identify a set of variables having the same signature (i.e. sequence of simulated values) – Well-known technique in RTL verification • In RTL, the values of registers are uniquely known at every cycle e = a - b; f = a + b; a=(35,-4,712) b=(-220,1151,-3) Random Pattern Design 1 x = b + a; y = a - b; Design 2 Simulation – However, there is no concept of "cycle" in behaviorallevel design descriptions • The concept of "a variable with context" is introduced e=(255,-1155,715) f=(-185,1147,709) x=(-185,1147,709) y=(255,-1155,715) Signatures {e, y} {f, x} Potential Internal Equivalences Internal Equivalences at Behavioral Level Definition of a cycle • A period of the execution from accepting an input pattern to generating a set of output values • Internal variables must be assigned once in a cycle – Designs are in static single assignment (SSA) form – The concept of "a variable with context" is introduced for multiple instantiations of modules and function calls A variable with context • Context: a runtime path information from the top-level module to the current function • Guaranteed to be assigned only once in a cycle Method Based on Potential Internal Equivalences Issue: Potential internal equivalences may include false equivalences • False equivalences may lead to false positive results • Non-equivalent variables may be identified as equivalent – Equivalent variables are always identified as equivalent – True equivalences are included in potential equivalences Solution: Explores all possible subsets of internal equivalences when applying the rules • Still practical since each set of internal equivalences is typically small Implementation Implemented in C++ on top of our system-level verification framework FLEC – Given a SpecC design description as an input, ExSDG representation is generated by parser and dependence analyzer • Potential internal equivalence identifier – Generates SpecC description from ExSDG representation as well as a random input pattern generator module – Compiles the random simulator using SpecC reference compiler • Rule-based equivalence checker – Each rule is implemented as a callback function – Given two nodes in ExSDGs, the equivalence between two sub-trees are checked Case Study: A Practical Design Example: Two IDCT designs before and after parallelization • Column & row processes are parallelized • Existing method failed checking since it could not find the correspondences between variables • Proposed method checked the equivalence – 10 cycles of random simulation – Runtime: ~3 seconds • Mainly compilation and execution time of simulator Outline • Overviews of our verification research activities regarding to C-based high-level design descriptions – Based on Extended System Dependence Graph (ExSDG) • Equivalence checking as well as static/model checking with ExSDG • Post-silicon verification method through mapping chip traces with ExSDG • Verification/debugging methods for large arithmetic circuits • Automatic generation of on-chip bus protocol transducers initiator Debug Hardware channel target put(req) • Event Extractor get(req) true/false … true/false – Extract basic required information • • • • Transaction type (read/write) get(res) Start of transaction true/false End of transaction master Initial and target address of transaction • Trace Buffer put(res) true/false bus slave m_request opb_grant – Store extracted transactions in am_select compact form opb_xferac k … … opb_dbus s_dbus s_xferack Post Debug with ExSDG Focus on Silicon communication parts of designs Establish mapping between ExSDG and chip traces Debug Flow: 1- Extract basic transactions 2- Store in a trace buffer * Run system until a failure * 3- Read the trace buffer 4- Analyze traces with software Module 1 channel Transaction Extractor 1 Transaction type (read/write) Start & end of transaction Initial and target of transaction ExSDG for C Software 4 … Trace Buffer Module k On-chip bus, Network, … Initial System Find wrong behaviors using debug patterns Potential race Potential deadlock Read Out 3 2 Debug HW Debug SW Store extracted transactions Post Silicon Debug: some results • Hardware overhead is low • Trace buffer not large OPB Bus: 1966 gates PLB Bus: 2206 gates – Trace buffer fields: 1 or 2 bytes per transaction Master ID Slave ID Address R/W Command Tag • Debug patters as assertions 1) Master1 locks first semaphore. 2) Master2 locks second semaphore 3) Master1 waits for second semaphore 4) Master2 waits for first semaphore 5) Steps 3 and 4 are repeated assert never SoTr(m1, s1, Wr, -, t1) ; SoTr(m2, s1, Wr, SAME, t2) ; EoTr(m1, s1, Wr, SAME, t1) | EoTr(m1, s1, Wr, -, -) ; SoTr(m2, s1, Wr, SAME, -) filter (*,*,*) • Analysis is quick: 20 seconds for 100,000 transactions – Working with ExSDG (C descriptions) Our approach to debugging high-level • Concrete simulation – Depth-first search: long range, narrow width • Formal method: Symbolic state traversal Reachable states DFS • Our approach: user-driven BFS combines DFS and BFS – DFS: To collect reachable states – BFS: To search exhaustively around states of interest (user specified) – User switches DFS and BFS using various commands including execution path specification F F F Initial States DFS Reachable states – Breadth-First Search: short range, wide width Faulty states F F F Initial States Reachable states BFS F F Initial States BFS F DFS jump BFS Infeasible states Some experiences • Filter design from a company – 170 LoC in SpecC, part of buggy real design – BMC cannot detect the bug (simply too deep) – Designer specifies a set of execution paths that he has concerns • Concern: some portions of the codes in particular sequences may not work • Successfully generate 61 cycles pattern with the proposed approach • Elevator controller – 9500 LoC in SpecC – Target assertion • Door must open within 30 cycles after pushing up/down button while elevator is stopping on a floor – BMC failed at 120th cycle (after >10 days run) – Looks like there is no such issue from BMC • User-guided analysis realizes failure ! Outline • Overviews of our verification research activities regarding to C-based high-level design descriptions – Based on Extended System Dependence Graph (ExSDG) • Equivalence checking as well as static/model checking with ExSDG • Post-silicon verification method through mapping chip traces with ExSDG • Verification/debugging methods for large arithmetic circuits • Automatic generation of on-chip bus protocol transducers Data-path Dominated Applications Real Number Specification Modular-HED Representation Arithmetic Bit Level Polynomial Optimization Equivalence Checking Debugging HED Representation Floating point Model MATLAB Model (Fixed point) Automatic Fixedpoint Generation Refinement and Optimization RTL Model (Fixed bit-width) RTL Synthesis Gate-level Model (bit level) Physical Design Horner Expansion Diagram (HED) f(x) • Horner Expansion: f ( x , y ,...) f const x . f linear – Const (Left) child (dashed line) – Linear (Right) child (solid line) x fconst • Example – f(x,y,z) = x2y+xz-4z+2; Order: x>y>z – f(x,y,z) = [-4z+2]+x[xy+z] = fconst+x.flinear -4z+2 flinear f(x,y,z) x xy+z x • fconst = -4z+2 = f(z) • flinear =f1(x,y,z)= xy+z y z – f1(x,y,z) = xy+z = z+x[y] y • f1const = z; f1linear = y – Horner form • f = x(x(y)+z)+z(-4)+2 z 2 z -4 0 1 0 1 High-level Polynomial Datapath Verification Modular Equivalence checking – Anti-aliasing function F 1 ( 2 ( a 2 b 2 ) ) 1 ( 2 x ) – Expand into Taylor series coefficients 1 F x 6 64 279 9 x 5 115 32 x 2 64 x 4 64 81 32 x 75 x 3 16 a 85 64 – Implemented as a fixed size datapath • F1[15:0], F2[15:0], x[15:0] • F1 = + 43593 x2 + 40244x + 13281 • F2 = 156x6 + 5380x5 + 1584x4 + 10469x3 + 27209 x2 + 7456x + 13281 156x6 b 62724x5 + 17968x4 + 18661x3 + x = a2 + b2 x MAC Reg • F1 ≠ F2 over Z • F1[15:0] = F2[15:0] mod 216 F coefficients Modular-HED (M-HED) • The Smarandache function in number theory is defined for a given positive integer b as the smallest positive integer such that its factorial S (b )! is divisible by b – Example, the number 8 does not divide 1!, 2!, 3!, but does divide 4! (4!/8 = 3), so S(8) =4 • 1*2*3*4 % 8 = 0 • 5*6*7*8 % 8 = 0 • 100*101*102*103 % 8 = 0 – A product of 4 consecutive numbers is divisible by 8 • x(x+1)(x+2)(x+3) 0 mod 8 • x(x+1)(x+2)(x+3) = x4 + 6x3 + 11x2 + 6x can be freely added or subtracted under mod 8 ! • Co-efficients are modified transforming given polynomials to normal forms Modular-HED (M-HED) • Vanishing Polynomial: g(x) = ( x i ) 0 mod n • If we can factorize a polynomial g(x) into a product of S(n) consecutive numbers, then it can be reduced to 0 in Z n S (n) i 1 – f ( x ) x 21 x – S(24) = 6 6 5 175 x 735 x 1624 x 1764 x 720 4 3 2 over Z 24 • f(x) = (x+1)(x+2)(x+3)(x+4)(x+5)(x+6) 0 mod 16 – It can be reduced to ZERO! Preliminary Experimental Results Benchm ark Specs ModularHED Var/Deg/n Node/Time AAF 1 / 6 / 16 D4F Method [9] CUDDBDD *BMD miniSat [13] MILP [12] Time (s) Node/Time Node/Time Vars/Clauses/Time Time (s) 8 / 0.016 6.81 1.1M / 32.2 NA / >500 3.9K / 107K / >500 >500 1 / 4 / 16 6 / 0.031 4.95 27M / 20.3 NA / >1000 25K / 76K / >1000 >1000 CHEB 1 / 5 / 16 7 / 0.01 5.95 1M / 26.9 NA / >500 3.5K / 86K / >500 >500 PSK 2 / 4 / 16 MI Boolean reasoning methods never52Kworks 16 / 0.032 13.48 NA / >500 NA / >500 / 142K / >500 2Need / 4 / 16 9 / 0.016 14.4 NA / >1000 NA / >1000 10K / 30K / >1000 to use word level techniques such as the proposed one 2 / 9 / 16 26 / 0.2 17.5 23M / 39.4 NA / >1000 24K / 69K / >1000 SG 5 / 3 / 16 35 / 0.24 6.1 NA / >1000 NA / >1000 64K / 190K / >1000 >1000 QS 7 / 4 / 16 19 / 0.09 32.4 NA / >1000 NA / >1000 76K / 211K / >1000 >1000 DIRU NA: Not Applicable; K: Thousand; M: Million >500 >1000 >1000 High-level Polynomial Datapath Optimization A partitioning and compensation heuristic • A polynomial is given, how we can optimize it in terms of the number of adders and multipliers on fixed bit-width • Our Solution Modular-HED – Apply Modular-HED • To reduce over Z2n – Partitioning approach Partitioning • Poly = p1*p2 + p3 • Minimize p3 – Compensation approach • Compute Coefficients Compensation Partitioning Heuristic • poly = p1p2+p3 with unknown coefficients • Minimize the cost of p3 poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz + 14y +5z +3 poly • After Partitioning p1 = a1x2 + a2y + a3z; p2 = b1x + b2y +b3; p3 = w4 + 3 poly = p1*p2 + p3 Partitioning p1, p2, p3 Compensation Heuristic • In our example poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz + 14y +5z +3 p 1 = a 1 x 2 + a 2 y + a3 z p2 = b1x + b2y +b3 p3 = w4 +3 • Set the coefficients (ai,bj) in order to achieve the minimum cost p3 • First consider all the equations produced by p1*p2 = poly - p3 – a1b1=-1, a1b2=-1,a1b3=-5,a2b1=2,a2b2=2,a3b1=1, a3b2=1, a2b3=14, a3b3=5 – These equations may not have an answer! Preliminary Experimental Results Applicatio n Horner Function Enhanced CSE Our Approach M/V/D/n #Gate Delay Time #Gate Delay Time #Gate Delay Time Graphics CosineWavelet 9/2/3/16 7850 23.7 0.04 5109 18.3 3.47 3678 18.5 6.52 Image Processing SavitzkyGolay 1 10/2/3/16 7218 20.7 0.04 3757 22.7 4.75 2879 17.8 14.9 Image Processing SavitzkyGolay 2 5/2/2/16 1697 20.8 0.03 1433 18.1 0.48 1057 16.2 1.63 Filter Intensive optimization on polynomials greatly Quad 1 5/2/2/16 2737 17.7 0.03 2269 16.4 0.63 1763 17.1 reduce the gerated custom circuits/software 1.78 Filter Quad 2 5/2/2/16 2569 16.4 0.03 2032 16.4 0.63 1571 15.3 1.56 Automotive Mibench 9/3/2/16 2058 15.7 0.04 2046 15.6 0.46 1303 14.1 8.75 0% 0% - 31% 6.3% - 49.2% 13.8% - Average Saving w.r.t Horner M: the number of monomial D: the maximum degree V: the number of variables n: the number of bits (word-length) Bit Level Adder (BLA) Model • In an arithmetic circuit several addition processes are possible • However, each addition can be represented with BLA model • Represent A+2mB, by some custom full-adders and halfadders, which represent two cascade XORs and one XOR, respectively – Realization of the carry signals is not uniquely modeled • BLA is robust for – – – – Carry-look-ahead Ripple-carry Carry-select Carry-skip adders Theproduct Proposed Debugging • Partial initialization Algorithm – Extract each column of 2i from S – Provide a bit-level ADD_SET with different bit-orders – Bits in each column must be added with each other while the generated carries are sent to higher order column • Column-based XOR extraction – Search XORs over initial partial product terms of each ADD_SET column – At least one input from column 2i – XOR extraction is performed without carry-logic blocks • Much faster than other XOR extraction techniques P1(2) P2(2) Unknown P3(2) The Proposed Debugging Algorithm 4-bit multiplier • Carry-signal mapping using HED – For each FA with no unknown input build a new reference (in HED) – For each HA/FA with an unknown input Cm • Find the backward logic of Cm • Map Cm to the HED of the most similar reserved carry from previous column FA Experimental Setup and Results • F1 = A*B; F2 = C*D+3E+120; – A, B : 32-bit F3 = A*B+C*D C,D,E,F1 : 64-bit F2,F3 : 129-bit Can deal with large (64, 128) bit arithmetic functions for efficient verification and debugging Outline • Overviews of our verification research activities regarding to C-based high-level design descriptions – Based on Extended System Dependence Graph (ExSDG) • Equivalence checking as well as static/model checking with ExSDG • Post-silicon verification method through mapping chip traces with ExSDG • Verification/debugging methods for large arithmetic circuits • Automatic generation of on-chip bus protocol transducers Communication parts can be very complicated • Need interface bw protocol A and B • Protocols can be very complicated – Over 30 different commands defined in the state-of-the-art protocols (OCP) – Manuals over 200 pages – Bust, out-of-order modes, … • We have developed an automatic generator of protocol transducers CPU (IP) Protocol A Trans -ducer MPEG DMAC Protocol B – Protocols are formally defined with FSM/automaton – State-of-the-art protocols can be dealt with in a couple of minutes – Now being extended to: On/off-chip bus interconnections Formal verification of interfaces DMAC (IP) Protocol A RAM RAM (IP) Custom HW Protocol B 71 How protocol transducer is realized • Intuitive understanding of the problem – Follow the two protocols ⇒ compute the product of the two FSM/automata (stb==1) ack<=0 設計対象 ack<=0 Clock-wise behavior ack<=1 Protocol Request A Master Response Request Protocol B Response Slave Protocol A Exploration [1] + ours Protocol B Definition of protocol Protocol transducer In FSM/automaton [1] R.Passerone, J.A.Rowson, A.Sangiovanni-Vincentelli, “Automatic Transducer Synthesis of Interfaces between Incompatible Protocols” ,DAC’98 pp.8-13 Simple computation of products ctrl Protocol data A 8 8 Master {Ctrl=1, data1} Some paths are avoidable Transducer A {Ctrl=0} B {Ctrl=1, data1} Invalid due to Illegal dependency A E A B D D B Unavoidable path E {Ctrl=0, data2} C {Ctrl=0, data2} C D D C E E F Protocol B Slave 8 ctrl data F B F {Ctrl=0} {Ctrl=1, data1} {Ctrl=0, data2} C C E C F C F Invalid due to Illegal dependency {Ctrl=0} Transducer Minimum latency path 8 • Protocol definition automaton should not have any loops – Even in the same state, data values are different Need separation between – Expansion can be infinite comp. and comm. ! Separation of computation and communication inside protocol transducers • In protocol definition, control and data are separately specified • Introduce two FSMs for request and control to describe complicated protocols uniformly • FIFO can be made arbitrary complicated if we like Protocol A Master Res. FSM Transducer Protocol B Slave Res. Even arithmetic computation possible Protocol A Master Req. Req. FSM Req. Res. Res. Res. FSM Transducer Protocol B Slave Protocols can be very complicated • State-of-the-art protocols introduces many features for faster throughputs t Req1 → Res1 Req2 → Res2 t t Req1 Req2 Req3 Res1 Res2 Split transaction （Non blocking） Blocking （Low throughput） t Req1 Req2 Res1 Req3 Res3 Res2 Out of order transaction Protocol Master Request (Address / Data) Request Response (Data) t Protocol Slave Request Addr1 Data1 Addr2 Data2 Data2 Addr3 Data3 Data3 Addr4 Data4 Data4 Burst transaction Addr1 Data1 Single address Burst trans. For more complicated protocols… 入力オートマトン Req. Req. Req Res. Protocol A Req Res. Protocol B Control for FIFO Read X Write Pros: Can deal with more complicated protocols Cons: Need more latency delay due to multiple FIFO Protocol A Master Newly introduced FIFO Req. Req. Transducer FSM Res. Send FSM FIFO RD X Res Req. Recv FSM Res. Res X Protocol B Slave FIFO WR Now we can resolve it • Elimination of loops (to initial states)。i i C D B e i C D e Introduction of ending state Z W e U SS = Loops are replaced with super states i Y X Z W Elimination pf ending states Exploration B X Y A • Elimination of intermediate loops Exploration A i Exploration i [2] U • Concentrating on controls only – Date parts are processed separately ! [2] S.Watanabe, K.Seto, Y.Ishikawa, S.Komatsu, M.Fujita, “Protocol Transducer Synthesis using Divide and Conquer approach, “ Proc. of the 12th. Asia and South Pacific Design Automation Conference, pp.280-285, 2007. How to deal with multiple complicated transactions [2] • A protocol is a collection of sequences • Each sequence can operate independently – True for state-of-the-art protocols with separation between computation and communication Sequence1 （Read） Sequence2 （Write） Automaton2 For response Sequence3 （4 burst read） Sequence4 （4 burst write） ・・・ Protocol Automaton1 For request or blocking Hardware definition (stb==1) ack<=0 i ack<=0 ack<=1 All sequences share initial state Port, signal names, etc. Hierarchical synthesis owing to comp. and comm. separation [2] i Merge generated FSM with the same initial state i i ＋ Protocol A Sequence A1 グラフ 探索 Exploration Sequence A2 ＝ Partial transducer 1 Transducer Protocol B Sequence B1 グラフ 探索 Exploration Partial transducer 2 Sequence B2 Sequence level synthesis followed by merge process Tool implementation • Planned to be distributed freely from OCP-IP Experimental results • Atholon64 2GHｚ + 1GB RAM • Implemented as over 12,000 loc in C++ – Input: Hierarchical automaton descriptions in XML – Output: RTL synthesizable Verilog • Logic synthesis: Xilinx ISE • RTL simulator: Model Sim XE Mater's Protocol Slave's Protocol Type Sequences Synth.Time Gate counts OCP AHB (NB,BK) 4 1.1[s] 2,352 AHB OCP (BK,NB) 4 1.3[s] 1,843 OCP OCP (NB,NB) 2 1.9[s] 1,568 OCP Tagged OCP (NB,OoO) 2 2.2[s] 3,514 Tagged OCP AXI (OoO,OoO) 2 4.8[s] 1,377 AXI OCP (OoO,NB) 2 4.9[s] 1,731 OCP AXI (NB,OoO) 26 257.8[s] 61,205

Descargar
# スライド 1