Outline
What is a “Soft” Processor
 What is the NIOS II?
 Architecture for NIOS II, what are the
implications

• TigerSHARC VS. NIOS II
• Pipeline Issues
• Issues related to FIR

Hardware acceleration, using FPGA
logic
What’s is a “Soft” Processor?
Processor implemented in VHDL, Verilog,
etc., and downloaded onto FPGA hardware
 Can implement many parallel processors
on one FPGA
 Can use addition FPGA resources on the
same chip that is not part of the processor
core.


NIOS II is a “Soft” Processor
Why “Soft” Processor?
Higher level of design reuse
 Reduced obsolescence risk
 Simplified design update or change
 Increased design implementation
options
 Lower latency between processor and
FPGA components

What is NIOS II?
Software-defined processor
 The processor core is loaded onto
FPGA
 Programmed using ‘normal’
programming tools (C, asm), not
hardware description languages
 Can use the rest of the FPGA hardware
for accelerating parts of the code

How Is NIOS II Implemented
The custom FPGA logic that interacts
with the processor is implemented in
Altera Quartus II
 The Avalon Interface bus (common
instruction/data bus) is implemented in
Quartus II
 The architecture is generated in Quartus
II and used for programming in Eclipse
IDE

NIOS II IDE

Coding is implemented in Eclipse rather than
VisualDSP.
The Different NIOS II Cores

There are 3 cores available from Altera
 NIOSII/e: Economical Core
 NIOSII/s: Standard Core
 NIOSII/f: Fast Core
What’s the Difference between
the Cores?
An LE is equivalent to a 8-1 NAND gate + 1 D-Flip Flop
An ALM is equivalent to 2 LE’s
Comparison of TigerSHARC and NIOS
II architecture
TigerSHARC Architecture
NIOS II Architecture
-thirty two 32-bit general registers, six 32-bit control registers
-variable cache based on how much FPGA space you have
-ALU- 32bit two input to one input, does shifts, logic and arithmetic. Shifter is
not separate like TigerSHARC
Avalon Interface
-separate address, data and control lines
-up to 1024-bit data width transfer, can be set to any width (not power of 2)
-one transfer per clock cycle.
NIOS II/f pipeline
Six stages
 One instruction can be dispatched and/or
retired pre cycle
 Dynamic branch prediction: 2-bit branch
history table (no BTB like in TigerSHARC)

NIOS II/f pipeline
The pipeline stalls for:
•
Multi-cycle instructions
• Cache misses
• Data dependencies (2 cycles between
calculating and using result)
Mispredicted branch penalty: 3 cycles
Hardware multiply

Can use different options for multiplier
(at the processor design stage)
 No h/w multiply (saves FPGA gates)
○ Speed depends on algorithm
 Use embedded multipliers (if FPGA has
those)
○ 1-5 cycles (depends on FPGA)
 Implement multipliers on FPGA gates
○ 11 cycles
 Division 4-66 cycles on hardware
Compare to TigerSHARC
No support for parallel instructions
 No support for SIMD operations
 Multicycle instructions stall the pipeline

All the above limitations can be overcome
by using FPGA space unoccupied by the
processor itself
Comparison of NIOS II and
TigerSHARC on an FIR Algorithm
Integer FIR algorithm
int coeff[]={1, 2, 3, 4, 5, 6, 7, 8};
int data1[] = {1, 0, 0, 0, 0 ,0 ,0 ,0};
int output[8];
int i=0, j=0, k=0;
for(k=0; k<8; k++) output[k] =0;
for( j =0; j< 8; j++)
{
for( i= 0; i< 8; i++)
{
output[j] += data1[i]*coeff[7-i];
}
}
Speed analysis
movi r4,8
i=8
ldw r2,0(r6)
load data
2
ldw r3,0(r7)
load coefficient
3
addi r4,r4,-1
i--
4
addi r6,r6,4
coeffPt++
5
mul r2,r2,r3
data = data * coeff
6
addi r7,r7,-4
dataPt--
7
stall
data stall – waiting for multiplication
result
8
add r5,r5,r2
output += data
9
bne
r4,zero,0x10002a0
will mispredict 2 times in the
beginning, and 1 time in the end of
the loop (waste 3 cycles each time)
0
1
Loop:
Speed analysis





9 cycles per iteration except the first two
(branch predicted not taken) and the last
(branch predicted taken) – those will be
9+3=12 cycles
1 data stall – can remove by moving
instruction from line 4 to 7
Speed: 8 cycles * (N-3) + 11 cycles * 3 =
8*(N-3)+33 cycles
For 1024-tap FIR: 8201 cycles
Clock cycle is 3 times longer (200MHz vs
600MHz)
Speed comparison
•
•
8201 NIOS II cycles equivalent to 24603
TigerSHARC cycles
Lab3 timing:
– 56000 cycles Debug mode
– 13000 unoptimized ASM
– 4000 Optimized ASM
Worse than unoptimized assembly, but no
hardware acceleration used, so this is not
that bad
Hardware Acceleration
Profiling tool in Eclipse can show how
long each function takes
 If function takes too long, it can be sped
up by

 Custom instructions
 Hardware Acceleration

Hardware Acceleration is to take the
function and transform it into FPGA
circuitry
Hardware Acceleration
Can be done using C2H compiler from Altera
 Trades off Logic Size for Speed up.

Table 1. User Application Results Example
Algorithm
Speed Increase
(vs. Nios II CPU)
System fMAX
(Mhz)
System Resource
Increase (1)
Autocorrelation
41.0x
115
124%
Bit Allocation
42.3x
110
152%
Convolution Encoder
13.3x
95
133%
Fast Fourier Transform
(FFT)
15.0x
85
208%
High Pass Filter
42.9x
110
181%
Matrix Rotate
73.6x
95
106%
RGB to CMYK
41.5x
120
84%
RGB to YIQ
39.9x
110
158%
Conclusion
“Soft” Processors such as the NIOSII
offers another alternative in the
embedded system scene.
 The NIOSII offers the advantage of
added configurability, and customization
that blur the line between FPGAs and
DSPs

References
[1] http://www.fpgajournal.com/articles/behere.htm
Describes an FPGA-DSP project based on Altera Nios
[2] http://www.altera.com/products/ip/processors/nios2/ni2-index.html
Official Nios II page
[3] http://www.hunteng.co.uk/dsp-fpga.htm
DSP or FPGA? What is better when?
[4] http://www.hunteng.co.uk/pdfs/tech/DSP1736FPGA.pdf
Article from Xilinx about FPGA DSPs
[5] http://www.niosforum.com
Community forum for NIOS
[6] http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf
NIOSII Processor Handbook –Altera Corporation
[7] http://www.altera.com/literature/manual/mnl_avalon_spec.pdf
Avalon Memory-Mapped Interface Specifications – Altera Corporation
[8] http://www.analog.com/en/prod/0,2877,ADSP%252DTS201S,00.html
ADSP-TS201S 500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded
DRAM
Descargar

NIOS II Processor - University of Calgary