On Cosmic Rays, Bat Droppings, and What to Do About Them

David Walker
Princeton University
with Jay Ligatti, Lester Mackey, George Reis and David August
A Little-Publicized Fact
1 + 1 = 3 (on hardware that has just suffered a soft fault)
How do Soft Faults Happen?
“Solar particles”: affect satellites; cause < 5% of terrestrial problems
“Galactic particles”: high-energy particles that penetrate to Earth’s surface, through buildings and walls
Alpha particles: from, among other sources, bat droppings

A high-energy particle passes through a device and collides with a silicon atom.
The collision generates an electric charge that can flip a single bit.
How Often do Soft Faults Happen?
IBM Soft Fail Rate Study; Mainframes; 83-86 [Zeiger-Puchner 2004]
[Chart: city altitude in feet (NYC at sea level, Tucson AZ, Denver CO, Leadville CO at ~12,000 ft) vs. cosmic ray flux / fail rate multiplier (0–15x); fail rate rises steeply with altitude]
Some Data Points:
• 83-86: Leadville (highest incorporated city in the US): 1 fail/2 days
• 83-86: Subterranean experiment, under 50 ft of rock: no fails in 9 months
• 2004: 1 fail/year for a laptop with 1GB RAM at sea level
• 2004: 1 fail/trans-Pacific roundtrip [Zeiger-Puchner 2004]
How Often do Soft Faults Happen?
[Chart: relative soft error rate increase vs. chip feature size (180, 130, 90, 65, 45, 32, 22, 16 nm); ~8% degradation/bit/generation; a marker shows where manufacturing stood as of the talk, with the 16 nm node roughly 6 years out] [Shekhar Borkar, Intel, 2004]

• Soft error rates go up as:
  • voltages decrease
  • feature sizes decrease
  • transistor density increases
  • clock rates increase
• ...and these are all future manufacturing trends
How Often do Soft Faults Happen?
• In 1948, Presper Eckert noted that the cascading effects of a single-bit error destroyed hours of ENIAC’s work. [Zeiger-Puchner 2004]
• In 2000, Sun server systems deployed at America Online, eBay, and others crashed due to cosmic rays. [Baumann 2002]
• “The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip.” [Zeiger-Puchner 2004]
• Los Alamos National Lab’s Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation. [Michalak 2005]
What Problems do Soft Faults Cause?
• a single bit in memory gets flipped, or
• a single bit in the processor logic gets flipped, and then either:
  • there is no difference in externally observable behavior,
  • the processor locks up, or
  • the computation is silently corrupted:
    • a register value is corrupted (simple data fault)
    • a control-flow transfer goes to the wrong place (control-flow fault)
    • a different opcode is interpreted (instruction fault)
FT Solutions
• Redundancy in Information
  • eg: error-correcting codes (ECC)
  • pros: protects stored values efficiently
  • cons: difficult to design for arithmetic units and control logic
• Redundancy in Space
  • multiple redundant hardware devices
  • eg: Compaq NonStop Himalaya runs two identical programs on two processors, comparing pins on every cycle
  • pros: efficient in time
  • cons: expensive in hardware (double the space)
• Redundancy in Time
  • perform the same computations at different times (eg: in sequence; see the sketch below)
  • pros: efficient in hardware (space is reused)
  • cons: expensive in time (slower --- but not twice as slow)
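As a minimal OCaml sketch of the redundancy-in-time idea, assuming a deterministic computation f whose only failures are transient (the function names here are illustrative, not from the talk):

```ocaml
(* Redundancy in time: run the same deterministic computation twice in
   sequence and compare the results. A transient fault during one run
   makes the two results differ; recomputing until two consecutive runs
   agree masks the fault. *)
let rec protected (f : unit -> int) : int =
  let r1 = f () in
  let r2 = f () in
  if r1 = r2 then r1
  else protected f  (* mismatch: a soft fault occurred, so retry *)
```

For example, `protected (fun () -> 2 + 2)` returns 4 even if one of the runs is corrupted, at the cost of at least doubling the running time.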
Solutions in Time
Compiler generates code containing replicated computations, fault detection checks, and recovery routines
• eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...
• pros: flexibility --- new code with better reliability properties may be deployed whenever, wherever needed
• cons: for a fixed reliability policy, slower than specialized hardware solutions
• cons: it might not actually work
“It might not actually work”
Agenda
• Answer basic scientific questions about software-controlled fault tolerance:
  • Do software-only or hybrid SW/HW techniques actually work?
  • For what fault models? How do we specify them?
  • How can we prove it?
• Build compilers that produce software that runs reliably on faulty hardware
  • Moreover: let’s not replace faulty hardware with faulty software. Let’s prove every binary we produce is fault tolerant relative to the specified fault model.
A Possible Compiler Architecture

compiler front end → ordinary program → reliability transform → fault tolerant program → optimization → optimized FT program

Testing requirements: all combinations of features, multiplied by all combinations of faults.
A More Reliable Compiler Architecture

compiler front end → ordinary program → reliability transform → fault tolerant program + reliability proof → optimization → optimized FT program + modified proof → proof checker

Testing requirements: all combinations of features, multiplied by all combinations of faults.
Central Technical Challenges
• Designing ...
Step 1: Lambda Zap
• Lambda Zap [ICFP 06]:
  • a lambda calculus that exhibits intermittent data faults, plus operators to detect and correct them
  • a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault
    • types act as the “proofs” of fault tolerance
    • expressive enough to implement an ordinary typed lambda calculus
• End result: the foundation for a fault-tolerant typed intermediate language
The Fault Model
• Lambda Zap models simple data faults only:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

• Not modelled:
  • memory faults (better protected using ECC hardware)
  • control-flow faults (ie: faults during control-flow transfer)
  • instruction faults (ie: faults in instruction opcodes)
• Goal: construct programs that tolerate 1 fault
  • observers cannot distinguish between fault-free and 1-fault runs
Lambda to Lambda Zap: The main idea

Source program:

let x = 2 in
let y = x + x in
out y

Translation (replicate instructions; atomic majority vote + output):

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]
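Purely as an illustration, here is the replicated program rendered in OCaml, with the majority vote written as an ordinary function named vote (in Lambda Zap the vote built into out is an atomic primitive; a software vote like this is only safe if the vote itself cannot be hit by a fault):

```ocaml
(* Majority vote over three copies: under the single-fault model at most
   one copy is corrupted, so if the first two agree they are the majority;
   otherwise one of those two is faulty and the third must be correct. *)
let vote a b c = if a = b then a else c

(* The replicated program from the slide, written out by hand. *)
let () =
  let x1 = 2 in
  let x2 = 2 in
  let x3 = 2 in
  let y1 = x1 + x1 in
  let y2 = x2 + x2 in
  let y3 = x3 + x3 in
  print_int (vote y1 y2 y3)  (* plays the role of out [y1, y2, y3] *)
```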
Lambda to Lambda Zap: The main idea

Same source program, but now a fault corrupts one copy (x3 becomes 7):

let x1 = 2 in
let x2 = 2 in
let x3 = 7 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

Corrupted values are copied and percolate through the computation, but the final output is unchanged: the vote at out still sees the agreeing majority y1 = y2 = 4.
Lambda to Lambda Zap: Control-flow

Source program:

let x = 2 in
if x then e1 else e2

Translation (recursively translate subexpressions; majority vote on the control-flow transfer):

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]

(Function calls replicate arguments, results, and the function itself.)
Almost too easy, can anything go wrong?...

Yes!
• Optimization reduces replication overhead dramatically (eg: ~43% for 2 copies), but can be unsound!
• The original implementation of SWIFT [Reis et al.] optimized away all redundancy, leaving an unreliable implementation!!
Faulty Optimizations

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

--- CSE --->

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

In general, optimizations eliminate redundancy; fault tolerance requires redundancy.
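Rendered as OCaml, the unsoundness is easy to see: after CSE every voter input flows from the single binding x1, so one fault defeats the vote (vote as in the earlier sketch):

```ocaml
let vote a b c = if a = b then a else c

(* After CSE, all three "copies" flow from the single binding x1: a
   fault in x1 corrupts y1, and the vote sees three identical wrong
   values, so the corruption passes through undetected. *)
let () =
  let x1 = 2 in               (* a fault here poisons every copy *)
  let y1 = x1 + x1 in
  print_int (vote y1 y1 y1)
```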
The Essential Problem

bad code:

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

(voters depend on a common value, x1)

good code:

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

(voters do not depend on a common value: red depends on red, green on green, blue on blue)
A Type System for Lambda Zap
• Key idea: types track the “color” of the underlying value and prevent interference between colors

Colors  C ::= R | G | B
Types   T ::= C int | C bool | C (T1,T2,T3) -> (T1’,T2’,T3’)
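One illustrative way to render this syntax as an OCaml datatype (the constructor names are ours, not the paper’s):

```ocaml
(* Colors tag each value with the replica it belongs to. *)
type color = R | G | B

type ty =
  | Int   of color                                   (* C int  *)
  | Bool  of color                                   (* C bool *)
  | Arrow of color * (ty * ty * ty) * (ty * ty * ty)
      (* C (T1,T2,T3) -> (T1',T2',T3'): replicated functions
         take and return triples, one component per color *)
```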
Sample Typing Rules

Judgement form: G |--z e : T, where z ::= C | .

Simple value typing rules:

(x : T) in G
--------------
G |--z x : T

--------------------
G |--z C n : C int

-----------------------
G |--z C true : C bool
Sample Typing Rules

Judgement form: G |--z e : T, where z ::= C | .

Sample expression typing rules:

G |--z e1 : C int    G |--z e2 : C int
--------------------------------------
G |--z e1 + e2 : C int

G |--z e1 : R int    G |--z e2 : G int    G |--z e3 : B int    G |--z e4 : T
----------------------------------------------------------------------------
G |--z out [e1, e2, e3]; e4 : T

G |--z e1 : R bool    G |--z e2 : G bool    G |--z e3 : B bool
G |--z e4 : T    G |--z e5 : T
--------------------------------------------------------------
G |--z if [e1, e2, e3] then e4 else e5 : T
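A fragment of a checker for the first two of these rules might look as follows, reusing the ty and color datatypes sketched earlier (the function names are hypothetical, not from the paper):

```ocaml
exception Type_error of string

(* e1 + e2 : C int --- both operands must be ints of the same color,
   so values of different colors can never interfere through +. *)
let check_add t1 t2 =
  match t1, t2 with
  | Int c1, Int c2 when c1 = c2 -> Int c1
  | _ -> raise (Type_error "+ expects two ints of one color")

(* out [e1, e2, e3]; e4 : T --- the three voted copies must be colored
   red, green, and blue respectively, so no two copies share a color. *)
let check_out t1 t2 t3 t4 =
  match t1, t2, t3 with
  | Int R, Int G, Int B -> t4
  | _ -> raise (Type_error "out expects one red, one green, one blue int")
```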
Sample Typing Rules

Judgement form: G |--z e : T, where z ::= C | .

Recall the “zap rule” from the operational semantics:

( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Before the fault, |-- v1 : T; afterwards, v2 need not have type T at all.
==> How will we obtain type preservation?

Answer: place no conditions on the zapped value. Before the fault, |-- v1 : C U; afterwards, |--C v2 : C U, by the rule:

--------------------
G |--C C v : C U

“Faulty typing” occurs within a single color only.
Theorems
• Theorem 1: Well-typed programs are safe, even when there is a single error.
• Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat].
• Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into Lambda Zap [that satisfies the caveat].

[ICFP 06]
The Caveat

Goal: 0-fault and 1-fault executions should be indistinguishable.

bad, but well-typed code:

out [2, 3, 3]                      outputs 3 after no faults
out [2, 3, 3] ---> out [2, 2, 3]   outputs 2 after 1 fault

More importantly:
• out [2, 3, 3] is obviously a symptom of a compiler bug
• out [2, 3, 4] is even worse --- good runs never come to consensus

Solution: computations must be independent, but equivalent.
The Caveat

modified typing:

G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U    G |--z e4 : T
G |--z e1 ~~ e2    G |--z e2 ~~ e3
----------------------------------------------------------------------
G |-- out [e1, e2, e3]; e4 : T
The Caveat

More generally, programmers may form “triples” of equivalent values.

Introduction form: [e1, e2, e3]
• a collection of 3 items
• each of the 3 stored in a separate register
• a single fault affects at most one

Elimination form: let [x1, x2, x3] = e1 in e2
The Caveat

More generally, programmers may form “triples” of equivalent values.

Introduction form:

G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U
G |--z e1 ~~ e2    G |--z e2 ~~ e3
-----------------------------------------------------
G |-- [e1, e2, e3] : [R U, G U, B U]

Elimination form:

G |--z e1 : [R U, G U, B U]
G, x1:R U, x2:G U, x3:B U, x1 ~ x2, x2 ~ x3 |--z e2 : T
-------------------------------------------------------
G |-- let [x1, x2, x3] = e1 in e2 : T
Theorems*
• Theorem 1: Well-typed programs are safe, even when there is a single error.
• Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.
• Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into Lambda Zap.

* There is still one “i” to be dotted in the proofs of these theorems. Lester Mackey, a brilliant Princeton undergrad, has proven all key theorems modulo the dotted “i”.
Step 2: Fault Tolerant Typed Assembly Language (TAL/FT)
• Lambda Zap is a playground for studying the principles of fault tolerance in an idealized setting
• TAL/FT is a more realistic assembly-level, hybrid HW/SW fault tolerance scheme, with:
  (1) a formal fault model
  (2) a formal definition of fault tolerance relative to memory-mapped I/O
  (3) a sound type system for proving compiled programs are fault tolerant
A More Reliable Compiler Architecture

compiler front end → ordinary program → reliability transform → TAL/FT program (the reliability proof is its types) → optimization → optimized TAL/FT (the modified proof is its types) → proof checker (a type checker)
TAL/FT: Key Ideas (Fault Model)
• Fault model:
  • registers may incur arbitrary faults between the execution of any two instructions (see the sketch below)
  • memory (including code) is protected by ECC
  • the fault model is formalized as part of the hardware operational semantics
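As promised above, here is a toy OCaml rendering of this fault model as a nondeterministic step an interpreter could take between any two instructions (entirely illustrative; names are ours):

```ocaml
(* The register file as an int array; memory (including code) is assumed
   ECC-protected, so only registers are subject to corruption. *)
type regs = int array

(* Between any two instructions, the machine may nondeterministically
   corrupt at most one register with an arbitrary value (here, 30 random
   bits). Assumes a nonempty register file. *)
let maybe_zap (r : regs) : unit =
  if Random.bool () then
    r.(Random.int (Array.length r)) <- Random.bits ()
```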
TAL/FT: Key Ideas (Properties)

[Diagram: processor issues stores to, and reads from, an ECC-protected memory and a memory-mapped I/O device]

Primary goal: if there is one fault, then either
(1) the memory-mapped I/O device sees exactly the same sequence of stores as a fault-free execution, or
(2) the hardware detects and signals a fault, and memory-mapped I/O sees a prefix of the stores from a fault-free execution.

Secondary goal: no false positives.
TAL/FT: Key Ideas (Mechanisms)
• Compiler strategy
  • create two redundant computations, as in Lambda Zap
    • two copies ==> fault detection
    • fault recovery is handled by a higher-level process
• Hardware support
  • special instructions & a modified store buffer for implementing reliable stores (a sketch of the idea follows this list)
  • special instructions for reliable control-flow transfers
• Type system
  • simple typing mechanisms based on the original TAL [Morrisett, Walker, et al.]
  • redundant values with separate colors, as in Lambda Zap
  • value identities needed for equivalence checking, tracked using singleton types combined with ideas drawn from traditional Hoare logics
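In spirit, the reliable-store mechanism behaves like the following sketch, in which a store reaches memory-mapped I/O only if two redundant copies of the address and value agree (hypothetical code; the real design uses special instructions and a modified store buffer in hardware):

```ocaml
exception Fault_detected

(* A store is released to the memory-mapped I/O device only when both
   redundant copies of the address and the value agree; otherwise a
   detected fault is signaled, so the device sees a prefix of the
   fault-free store sequence rather than a corrupted store. *)
let reliable_store (commit : int -> int -> unit)
    ((a1, v1) : int * int) ((a2, v2) : int * int) : unit =
  if a1 = a2 && v1 = v2 then commit a1 v1
  else raise Fault_detected
```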
Current & Future Work
• Build the first compiler that can automatically generate reliability proofs for compiled programs
  • TAL/FT refinement and implementation
  • type- and reliability-preserving optimizations
• Study alternative fault detection schemes
  • fault detection & recovery on current hardware
  • exploit multi-core alternatives
• Understand the foundational theoretical principles that allow programs to tolerate transient faults
  • general-purpose program logics for reasoning about faulty programs
Other Research I Do
• PADS [popl 06, sigmod 06 demo, popl 07]
  • automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for “ad hoc” data formats
  • with Kathleen Fisher (AT&T)
• Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]
  • semantics, design, and implementation of programs that monitor other programs for security (or other purposes)
• TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ...]
  • theory, design, and implementation of type systems for compiler target and intermediate languages
Conclusions

Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10+ years out).

Using proofs and types, I think we are going to be able to develop highly reliable software that runs on unreliable hardware.

end!
The Caveat (function operational semantics follows)
Lambda to Lambda Zap: Control-flow

Source program:

let f = \x.e in
f 2

Translation (majority vote on the control-flow transfer):

let [f1, f2, f3] = \x. [[ e ]] in
[f1, f2, f3] [2, 2, 2]

Operational semantics:

(M; let [f1, f2, f3] = \x.e1 in e2)
--->
(M, l = \x.e1; e2[ l / f1 ][ l / f2 ][ l / f3 ])
TAL/FT Hardware

[Diagram: processor with replicated program counters and a store queue, attached to ECC-protected caches/memory]
Related Work Follows
Software Mitigation Techniques
• Examples:
  • N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc.
  • hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], etc.
• Pros:
  • immediate deployment: if your system is suffering soft-error-related failures, you may deploy new software immediately
    • would have benefitted Los Alamos Labs, etc.
  • policies may be customized to the environment and application
  • reduced hardware cost
• Cons:
  • for the same universal policy, slower (but not as much as you’d think)
  • IT MIGHT NOT ACTUALLY WORK!
Mitigation Techniques

Hardware:
• error-correcting codes
• redundant hardware
• Pros:
  • fast for a fixed policy
• Cons:
  • FT policy decided at hardware design time (mistakes cost millions)
  • one-size-fits-all policy
  • expensive

Software and hybrid schemes:
• replicate computations
• Pros:
  • immediate deployment
  • policies customized to environment, application
  • reduced hardware cost
• Cons:
  • for the same universal policy, slower (but not as much as you’d think)
  • it may not actually work!
    • much research in the HW/compilers community is completely lacking proof
Solutions in Time
• Solutions in Hardware
  • replication of instructions and checking implemented in special-purpose hardware
  • eg: Reinhardt & Mukherjee 2000
  • pros: transparent to software
  • cons: one-size-fits-all reliability policy
  • cons: can’t fix an existing problem; specialized hardware has a reduced market
• Solutions in Software (or hybrid Hardware/Software)
  • compiler generates replicated instructions and checking code
  • eg: Reis et al. 05
  • pros: flexibility --- new reliability policies may be deployed whenever needed
  • cons: for a fixed reliability policy, slower than specialized hardware solutions
  • cons: it might not actually work