```The Hardness of Cache
Conscious Data Placement
Erez Petrank, Technion
Dror Rawitz, Caesarea Rothschild Institute
Appeared in
29th ACM Conference on Principles of Programming Languages
Portland, Oregon, January 16, 2002
Agenda


Background & motivation
The problem of cache conscious
data / code placement is:





extremely difficult
in various models
Positive matching results (weak…).
Some proof techniques and details
Conclusion
2
Cache Structure





Large memory divided into
blocks.
Small cache - k blocks.
Mapping of memory blocks
to cache blocks.
(e.g., modulus function)
Cache hit - accessed block
is in the cache.
Cache miss - required block
from memory.
Direct mapping
4
What can we do to improve
program cache behavior?


Arrange code / data to
minimize cache misses
Write cache-conscious
programs
In this work we
concentrate on the first.
5
How do we place data (or code)
optimally?

Step 1: Discover future accesses to data.
Step 2: Find placement of data that
minimizes the number of cache misses.
Step 3: Rearranged the data in memory.
Step 4: Run program.

Some “minor” problems:





In Step 1: We cannot tell the future
In Step 2: We don’t know how to do that
6
Step 1: Discover future
accesses to data



Static analysis.
Profiling.
Runtime monitoring.
This work:
Even if future accesses are known exactly,
Step 2 (placing data optimally) is extremely difficult.
7
The Problem

Input: a set of objects O={o1,…,om}, and a
sequence of accesses =(1,…,n).
E.g.  = (o1,o3,o7,o1,o2,o1,o3,o4,o1).


Solution: a placement, f:ON.
Measure: number of misses.
We want: placement of o1,…,om in memory
that obtains minimum number of cache misses
(over all possible placements).
8
Our Results
Can we (efficiently) find an optimal
placement?
No!
Unless, P=NP.
9
Our Results
Can we (efficiently) find an “almost” optimal
placement?
Almost = # misses  twice the optimum
No!
Unless, P=NP.
Can we (eff.) find “fairly” optimal placement?
Fairly = # misses  100 times the optimum
No!
Unless, P=NP.
10
Our Results
Can we (eff.) find a “reasonable” placement?
reasonable = # misses  log(n) the optimum
No!
Unless, P=NP.
Can we (eff.) find an “acceptable” placement?
Acceptable = # misses  n0.99 times the
optimum
No!
Unless, P=NP.
11
The Main Theorem
Let ε be any real number, 0<ε<1.
If there is a polynomial time algorithm that
finds a placement which is within a factor
of n(1-) from the optimum, then P=NP.
(Theorem holds for caches with > 2 blocks)
12
Extend to t-way Associative
Caches
t-way Associative Caches:
 t·k blocks in cache, k
sets, t blocks in a set.
 memory block mapped
to a set.
 Inside a set: a
replacement protocol.
.
.
.
1
2
3
.
.
.
Theorem 2: same hardness holds for
t-way associative cache systems.
13
Result is “robust”

Holds for a variety of models. E.g.,





Mapping of memory block to cache is
not by modulus,
Replacement policy is not standard,
Object sizes are fixed, (or they are
not),
Objects must be aligned to cache
blocks, (or not),
Etc…
14
More Difficulties:
Pairwise Information


A practical problem: sequence  of
accesses is long. Processing it is costly.
Solution in previous work: keep
relations between pairs of objects.


E.g., for each pair how beneficial is putting
them in the same memory block.
E.g., for each pair how many misses would
be caused by mapping the two to the same
cache block
15
Pairwise Information is Lossy
Theorem 3:
There exists a sequence  such that
#misses(f)  (k-3)  #misses(f*)
f - optimal pairwise placemet
f* - optimal placement
Conclusion:
Even when given unrestricted time,
finding an optimal pairwise placement is a
16
Pairwise Information:
Hardness Result
Theorem 4: Let ε be any real number, 0<ε<1.
If there is a polynomial time algorithm that
finds a placement which is within a factor
of n(1- ε) from the optimum with respect to
pairwise information, then P=NP.
Proof is similar to the direct mapping case.
17
A Simple Observation





Input:
Objects O={o1,…,om}, and access sequence
=(1,…,n).
Any placement yields at most n cache misses.
Any placement yields at least 1 cache miss.
Therefore, any placement is within a factor of
n from the optimum.
(Recall: a solution within n(1-ε) is not possible.)
18
results?
In light of the lower bound not much
can be done in general. Yet…
Theorem 5: There exists a polynomial
time approximation algorithm that
outputs a placement (always) within a
n
factor of c  log n from the optimal
placement for any c.
Compare: impossible: n(1-), possible: n/c logn
19
Other Problems

Our Problem:



Inapproximable
n/
c logn -approximation algorithm
Famous problems with similar results:


Minimum graph coloring
Maximum clique
20
Implications:


We cannot hope to find an algorithm that
will always give a good placement.
We must use heuristics.
We cannot estimate the potential benefit
of rearranging data in memory to the
cache behavior.
We can only check what a proposed
heuristic does for common benchmarks.
21
Some Proof Ideas
(simplest – direct mapping)
Theorem 1: Let  be any real number, 0<<1.
If there is a polynomial time algorithm that
finds a placement which is within a factor
of n(1-) from the optimum, then P=NP.
Proof: We show that if the above algorithm
exists, then we can decide for any given
graph G, if G is k-colorable.
22
The k-colorability Problem

Problem: Given G=(V,E), is G k-colorable?

Known to be NP-complete for k>2.
23
Translating a Graph G into
a Cache Question
Graph
Cache Question
Color
Cache line
Vertex vi
Object ov
Edge e=(vi,vj)
Subsequence e=(oi,oj)M
(i.e., M times (oi,oj).)
Coloring  Placement
24
Translating a Graph G into
a Cache Question


A vertex vi is represented by an object oi:
OG = { oi : vi  V }
Let =O(1/).
Each edge (vi,vj) is represented by |E|
repetitions of the two objects oi,oj:
G 

( v i , v j ) E
o , o 
i
E

j
25
Examples

G1 (not 3-colorable)
 1   o1 , o 2 

G2
6

1
2
3
4
 o1 , o 3   o1 , o 4   o 2 , o 3   o 2 , o 4   o 3 , o 4 
6

6

6

6

6

1
2
3
4
(3-colorable)
 2   o1 , o 2 
5

 o1 , o 3   o1 , o 4   o 2 , o 3   o 2 , o 4 
5

5

5

5

26
Properties of the Translation




Length of  : n = O(|E|+1)
Case I: G is k-colorable.
Then, Opt(OG,G) = O(|E|) = O(n1/(+1)) = O(n/2)
Case II: G is not k-colorable.
Then, Opt(OG,G) = Ω(|E|) = Ω(n/(+1)) = Ω(n1-/2)
Thus, an algorithm that provides a placement
within n(1-) from the optimum can distinguish
between the two cases!
27
Replacement Protocol




Relevant in t-way caches
Which object is removed
when set is full?
We need a replacement
policy or protocol.
Examples:



LRU
FIFO
…
.
.
.
1
2
3
.
.
.
28
Replacement Protocol
Problem: replacement protocol may


Recently used objects are being removed
(e.g., most recently used).
Any placement is good.
Solution: Only reasonable caches are
considered.
Problem: How do we define reasonable?
29
Sensible Caches


=(o1,…,oq), s.t. ij, oi  oj
No more than t objects are mapped to the
same cache set
 causes at most q+C misses,
when accessed after any ’.
E.g., LRU is 0-sensible.
30
Pseudo-LRU



Used in Intel® Pentium®
4-way associative
Each set:

Two pairs
LRU block flag (for each pair)
LRU pair flag

LRU block in LRU pair is removed




Replacement Protocol:
Pseudo-LRU is 0-sensible.
31
t-way associative caches:
Some Proof Ideas
Theorem 2: Let  be any real number, 0<<1.
If there is a polynomial time algorithm that finds a
placement which is within a factor of n(1- ) from
the optimum, then P=NP.
Proof: If the above algorithm exists, then we can
decide for any given graph G, if G is k-colorable.
Main idea: a sensible replacement protocol is
“forced” to behave as a direct mapping cache.


We do this by using dummy objects
32
How do we construct the
approximation algorithm?

If there are  c logn objects



Opt  c logn.
Any placement is
Otherwise,


n
c log n
-approximate.
There are < c logn objects in .
In this case, we can find an optimal placement
by examining k c log n  n c log k possible
placements.
33
Unrestricted Alignments and
non Uniform Size Objects
Two simplifying assumptions:



Uniform size objects
Aligned objects
Can we get the same results
without them?



Lower bound?
Upper bound?
34
Problem with Unrestricted
Alignments





Cache with 3 blocks.
2 objects o1, o2 each of size 1.5 blocks.
The sequence: (o1,o2)M
Restricted alignment: O(M) misses.
Unrestricted alignment: 3 misses.
(If they are placed consecutively in
memory.)
35
Unrestricted Alignments
(uniform instances)



Aligned version of a
placement:
Direct mapping:
#misses(g)  #misses(f)
H

f
g
Associative caches:
#misses(g) = O(#misses(f)) + O(|E|)
when (O,) = H(G)
Lower bounds
36
Conclusion




Computing the best placement of data
in memory (w.r.t. reducing cache
misses) is extremely difficult.
We cannot even get close (if PNP).
There exists a matching (weak)
positive result.
Implications:


using heuristics cannot be avoided.
We cannot hope to evaluate potential
benefit.
37
An Open Question:
Can we classify programs for which
the problem becomes simpler?
38
The end
39
```