Multithreading and Dataflow
CPSC 321
Andreas Klappenecker
• T November 16: Multithreading
• R November 18: Quantum Computing
• T November 23: QC + Exam prep
• R November 25: Thanksgiving
• M November 29: Review ???
• T November 30: Exam
• R December 02: Summary and Outlook
• T December 07: move to November 29?
• Office hours 2:00pm-3:00pm
• Bonfire memorial
• Hardware parallelism
• all current architectures
• Instruction-level parallelism
• superscalar processor, VLIW processor
• Thread-level parallelism
• Niagara, Pentium 4, ...
• Process parallelism
• MIMD computer
What is a Thread?
A thread is a sequence of instructions that can be executed in parallel with other threads. Threads typically share the same resources and have a minimal context.
• Definition 1: Different threads within the same process share the same address space, but have separate copies of the register file, PC, and stack (see the sketch below).
• Definition 2: Alternatively, different threads have separate copies of the register file, PC, and page table (more relaxed than the previous definition).
• One can use a multiple-issue, out-of-order execution engine.
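A minimal C sketch of Definition 1 (using POSIX threads; an illustration, not part of the lecture): both workers update the same global variable because they share one address space, while each worker's local variable lives on that thread's private stack.

#include <pthread.h>
#include <stdio.h>

int shared = 0;                                /* one copy: shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    int local = *(int *)arg;                   /* lives on this thread's own stack */
    pthread_mutex_lock(&lock);
    shared += local;                           /* both threads touch the same location */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);           /* prints 3 */
    return 0;
}

Compile with -pthread; the mutex merely serializes the update and is not essential to the shared-address-space point.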
Why Thread-Level Parallelism?
Extracting instruction-level parallelism is limited by:
• hazards and stalls
• data dependencies (see the example below)
• structural limitations
• static optimization limits
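To make the data-dependency limit concrete, here is a small C example (not from the slides): every iteration needs the previous iteration's result, so a wider issue window cannot execute these additions in parallel, whereas two threads could each sum half the array independently.

/* A serial dependency chain: the add in iteration i cannot start
 * before iteration i-1 completes (read-after-write dependency),
 * so a superscalar processor finds little parallelism here.     */
double running_sum(const double *x, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}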
Von Neumann Execution Model
Each node is an instruction. The pink arrow indicates the static schedule of the instructions. If an instruction stalls (e.g., due to a cache miss), then the entire program must wait for the stalled instruction to resume execution.
The Dataflow Execution Model
Each node represents an instruction. The instructions are not
scheduled until run-time. If an instruction stalls, other
instructions can still execute, provided their input data is available.
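A toy illustration of the dataflow firing rule (a hypothetical encoding, not a real dataflow ISA): a node executes as soon as all of its operands have arrived, regardless of the order in which the nodes appear in the program.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical two-input add node: it fires once both operands
 * have arrived, independent of any static instruction order.  */
typedef struct {
    const char *name;
    bool has_a, has_b;
    int a, b;
} node;

/* Deliver one operand; fire the node if it is now complete. */
static void deliver(node *n, int which, int value)
{
    if (which == 0) { n->a = value; n->has_a = true; }
    else            { n->b = value; n->has_b = true; }
    if (n->has_a && n->has_b)
        printf("%s fires: %d\n", n->name, n->a + n->b);
}

int main(void)
{
    node add1 = { "add1", false, false, 0, 0 };
    node add2 = { "add2", false, false, 0, 0 };
    deliver(&add1, 0, 1);   /* add1 still waits for its second input    */
    deliver(&add2, 0, 3);
    deliver(&add2, 1, 4);   /* add2 fires first, "out of program order" */
    deliver(&add1, 1, 2);   /* add1 fires only when its data arrives    */
    return 0;
}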
The Multithreaded Execution Model
Each node represents an instruction and each gray region
represents a thread. The instructions within each thread are
statically scheduled, while the threads themselves are dynamically scheduled. If an instruction stalls, its thread stalls, but other threads can continue execution.
Single-Threaded Processors
Memory access latency can dominate the processing time: each time a cache miss occurs, hundreds of clock cycles can be lost while a single-threaded processor waits for memory.
Top: Increasing the clock speed improves the processing time,
but does not affect the memory access time.
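A back-of-the-envelope calculation with assumed numbers (base CPI of 1, 2% misses per instruction, 300-cycle miss penalty; none of these figures are from the slides) shows how misses dominate and why a faster clock does not help:

#include <stdio.h>

int main(void)
{
    double base_cpi     = 1.0;
    double miss_rate    = 0.02;    /* misses per instruction (assumed) */
    double miss_penalty = 300.0;   /* cycles; set by DRAM latency      */

    double cpi = base_cpi + miss_rate * miss_penalty;
    printf("effective CPI = %.1f\n", cpi);   /* 7.0: memory dominates */

    /* Doubling the clock halves the time per cycle, but the miss
     * penalty in nanoseconds stays fixed, so it costs twice as many
     * of the faster cycles -- the memory time does not shrink.      */
    return 0;
}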
Multi-Threaded Processors
Multithreading Types
Coarse-grained multithreading:
If a thread faces a costly stall, switch to another thread. Usually flushes the pipeline before switching.
Fine-grained multithreading:
Interleave the issue of instructions from multiple threads (cycle-by-cycle), skipping threads that are stalled. All instructions issued in any given cycle come from the same thread.
Scalar Execution
Dependencies reduce throughput and utilization.
Superscalar Execution
Chip Multiprocessor
Fine-Grained Multithreading
Instructions issued in the same cycle come from the same thread.
Fine-Grained Multithreading
• Threads are switched every clock cycle, in round-robin fashion, among the active threads.
• Throughput is improved, since instructions can be issued every cycle.
• Single-thread performance is decreased, because each thread gets only about every n-th clock cycle when n threads are active.
• Fine-grained multithreading requires hardware modifications to keep track of threads (separate register files, renaming tables, and commit buffers); a toy model of the issue policy follows below.
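A toy round-robin issue model in C (a sketch only; real hardware tracks far more state): each cycle the next non-stalled thread issues, so one thread's stall is hidden by the others, but with n active threads each thread issues only about every n-th cycle.

#include <stdio.h>

#define THREADS 4
#define CYCLES  12

int main(void)
{
    int stalled[THREADS] = { 0, 3, 0, 0 };   /* remaining stall cycles (toy input) */
    int next = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int issued = -1;
        /* Round robin: try each thread once, skipping stalled ones. */
        for (int i = 0; i < THREADS; i++) {
            int t = (next + i) % THREADS;
            if (stalled[t] == 0) {
                issued = t;
                next = (t + 1) % THREADS;
                break;
            }
        }
        for (int t = 0; t < THREADS; t++)
            if (stalled[t] > 0)
                stalled[t]--;
        if (issued >= 0)
            printf("cycle %2d: issue from thread %d\n", cycle, issued);
        else
            printf("cycle %2d: all threads stalled\n", cycle);
    }
    return 0;
}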
Multithreading Types
A single thread cannot effectively use all functional units of a multiple-issue processor.
Simultaneous multithreading uses multiple issue slots in each clock cycle for different threads. It is more flexible than fine-grained MT.
Simultaneous Multithreading
• Superscalar:
• looks at multiple instructions from the same process; suffers both horizontal and vertical waste.
• Multithreaded:
• minimizes vertical waste by tolerating long-latency operations.
• Simultaneous multithreading:
• selects instructions from any "ready" thread (sketched after the figure below).
[Figure: issue slots per processor cycle for Threads 1-5, showing which threads fill the slots over time.]
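A schematic sketch of slot filling (an invented greedy policy with four issue slots, not any real scheduler): a superscalar can only draw on one thread, while SMT may fill every slot from any ready thread, attacking horizontal waste (empty slots within a cycle) as well as vertical waste (entirely empty cycles).

#include <stdio.h>

#define SLOTS   4
#define THREADS 3

/* ready[t] = instructions thread t could issue this cycle (toy input). */
static int smt_issue(int ready[])
{
    int used = 0;
    for (int t = 0; t < THREADS && used < SLOTS; t++)
        while (ready[t] > 0 && used < SLOTS) {
            ready[t]--;
            used++;
        }
    return used;                     /* SLOTS - used = horizontal waste */
}

int main(void)
{
    int ready[THREADS] = { 2, 1, 3 };  /* no single thread can fill 4 slots */
    printf("SMT fills %d of %d slots\n", smt_issue(ready), SLOTS);  /* 4 of 4 */
    return 0;
}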
SMT Issues
A Glance at a Pentium 4 Chip
Picture courtesy of Tom's Hardware Guide
The Pipeline
Trace cache
Intel’s Hyperthreading Patent
Pentium 4 Pipeline
1. Trace cache access, branch prediction: 5 clock cycles
(microoperation queue)
2. Reorder buffer allocation, register renaming: 4 clock cycles
(functional unit queues)
3. Scheduling and dispatch unit: 5 clock cycles
4. Register file access: 2 clock cycles
5. Execution: 1 clock cycle
(reorder buffer)
6. Commit: 3 clock cycles
Total: 20 clock cycles
The PACT XPP Architecture
The XPP processes a stream of data using configurable arithmetic-logic units. The architecture owes much to dataflow computing.
A Matrix-Vector Multiplication
Graphic courtesy of PACT
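For reference, a plain C version of the computation in the graphic; on the XPP, the multiply-accumulates would be spatially configured onto the ALU array and fed by data streams rather than executed as this sequential instruction loop.

/* y = A * x for an m-by-n matrix A: a stream of multiply-accumulates. */
void matvec(int m, int n, const double A[m][n],
            const double x[n], double y[m])
{
    for (int i = 0; i < m; i++) {
        double acc = 0.0;
        for (int j = 0; j < n; j++)
            acc += A[i][j] * x[j];
        y[i] = acc;
    }
}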
Basic Idea
• Replace the von Neumann instruction stream, with its fixed instruction scheduling, by a configuration stream.
• Process streams of data, as opposed to processing small data entities one at a time.
von Neumann vs. XPP
Basic Components of XPP
• Processing Arrays
• Packet-oriented communication network
• Hierarchical configuration manager tree
• A set of I/O modules
Supports the execution of multiple dataflow applications running in parallel.
Four Processing Arrays
Graphics courtesy of PACT. SCM is short for supervising configuration manager.
Data Processing
Event Packets
XPP 64-A
Further Reading
• "PACT XPP – A Reconfigurable Data Processing Architecture" by Baumgarte, May, Nueckel, Vorbach, and Weinhardt.
