The Hardness of Cache Conscious Data Placement
Erez Petrank, Technion
Dror Rawitz, Caesarea Rothschild Institute
Appeared in the 29th ACM Symposium on Principles of Programming Languages (POPL), Portland, Oregon, January 16, 2002.

Agenda
- Background & motivation
- The problem of cache conscious data / code placement is extremely difficult in various models
- Positive matching results (weak ones)
- Some proof techniques and details
- Conclusion

Cache Structure
- Large memory divided into blocks.
- Small cache of k blocks.
- Mapping of memory blocks to cache blocks (e.g., a modulus function): direct mapping.
- Cache hit: the accessed block is in the cache.
- Cache miss: the required block must be read into the cache from memory.

What can we do to improve program cache behavior?
- Arrange code / data to minimize cache misses.
- Write cache-conscious programs.
In this work we concentrate on the first.

How do we place data (or code) optimally?
- Step 1: Discover future accesses to the data.
- Step 2: Find a placement of the data that minimizes the number of cache misses.
- Step 3: Rearrange the data in memory.
- Step 4: Run the program.
Some "minor" problems:
- In Step 1: we cannot tell the future.
- In Step 2: we don't know how to do that.

Step 1: Discover future accesses to data
- Static analysis.
- Profiling.
- Runtime monitoring.
This work: even if future accesses are known exactly, Step 2 (placing the data optimally) is extremely difficult.

The Problem
- Input: a set of objects O = {o1,…,om} and a sequence of accesses σ = (σ1,…,σn). E.g., σ = (o1,o3,o7,o1,o2,o1,o3,o4,o1).
- Solution: a placement f: O → N.
- Measure: the number of misses.
- We want: a placement of o1,…,om in memory that obtains the minimum number of cache misses (over all possible placements).

Our Results
- Can we (efficiently) find an optimal placement? No! Unless P=NP.
- Can we (efficiently) find an "almost" optimal placement (almost = at most twice the optimal number of misses)? No! Unless P=NP.
- Can we (efficiently) find a "fairly" optimal placement (fairly = at most 100 times the optimum)? No! Unless P=NP.
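The cost model of the problem statement can be made concrete with a short simulation. The sketch below is illustrative (the function and variable names are not from the talk): a direct-mapped cache with k blocks, a placement f mapping each object to a memory block, and the miss count for an access sequence.

```python
def count_misses(sequence, placement, k):
    """Count direct-mapped cache misses: object o lands in cache
    block placement[o] % k; a miss evicts the previous occupant."""
    cache = [None] * k              # current contents of each cache block
    misses = 0
    for obj in sequence:
        line = placement[obj] % k   # modulus mapping of memory block to cache block
        if cache[line] != obj:      # not resident: miss, load from memory
            misses += 1
            cache[line] = obj
    return misses

# The example sequence from the slides: sigma = (o1,o3,o7,o1,o2,o1,o3,o4,o1).
sigma = ["o1", "o3", "o7", "o1", "o2", "o1", "o3", "o4", "o1"]
f_bad = {o: 0 for o in set(sigma)}                      # everything collides
f_good = {"o1": 0, "o2": 1, "o3": 2, "o7": 3, "o4": 1}  # spread over k=4 blocks
print(count_misses(sigma, f_bad, 4))   # every access misses
print(count_misses(sigma, f_good, 4))
```

Finding the placement that minimizes this count, over all placements, is exactly the optimization problem whose hardness the talk establishes.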
Our Results (cont.)
- Can we (efficiently) find a "reasonable" placement (reasonable = at most log(n) times the optimum)? No! Unless P=NP.
- Can we (efficiently) find an "acceptable" placement (acceptable = at most n^0.99 times the optimum)? No! Unless P=NP.

The Main Theorem
Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement within a factor of n^(1-ε) of the optimum, then P=NP. (The theorem holds for caches with more than 2 blocks.)

Extension to t-Way Associative Caches
- A t-way associative cache has t·k blocks: k sets with t blocks per set.
- Each memory block is mapped to a set.
- Inside a set: a replacement protocol.
Theorem 2: the same hardness holds for t-way associative cache systems.

The Result is "Robust"
It holds for a variety of models, e.g.:
- the mapping of memory blocks to the cache is not by modulus,
- the replacement policy is not standard,
- object sizes are fixed (or they are not),
- objects must be aligned to cache blocks (or not),
- etc.

More Difficulties: Pairwise Information
- A practical problem: the sequence of accesses is long, and processing it is costly.
- Solution in previous work: keep relations between pairs of objects. E.g., for each pair, how beneficial it is to put the two in the same memory block, or how many misses would be caused by mapping the two to the same cache block.

Pairwise Information is Lossy
Theorem 3: There exists a sequence such that #misses(f) ≥ (k-3) · #misses(f*), where f is an optimal pairwise placement and f* is an optimal placement.
Conclusion: even given unrestricted time, finding an optimal pairwise placement is a bad idea (in the worst case).

Pairwise Information: Hardness Result
Theorem 4: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement within a factor of n^(1-ε) of the optimum with respect to pairwise information, then P=NP. The proof is similar to the direct-mapping case.

A Simple Observation
- Input: objects O = {o1,…,om} and an access sequence σ = (σ1,…,σn).
- Any placement yields at most n cache misses.
- Any placement yields at least 1 cache miss (the first access always misses).
- Therefore, any placement is within a factor of n of the optimum. (Recall: a solution within n^(1-ε) is not possible.)

What About Positive Results?
In light of the lower bound, not much can be done in general. Yet:
Theorem 5: There exists a polynomial-time approximation algorithm that outputs a placement (always) within a factor of n / (c log n) of the optimal placement, for any constant c.
Compare: n^(1-ε) is impossible, while n / (c log n) is achievable.

Other Problems
Our problem is inapproximable, yet has an n / (c log n)-approximation algorithm. Famous problems with similar results:
- Minimum graph coloring
- Maximum clique

Implications
- We cannot hope to find an algorithm that will always produce a good placement; we must use heuristics.
- We cannot estimate the potential benefit of rearranging data in memory for cache behavior.
- We can only check what a proposed heuristic does on common benchmarks.

Some Proof Ideas (simplest case: direct mapping)
Theorem 1: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement within a factor of n^(1-ε) of the optimum, then P=NP.
Proof: We show that if such an algorithm exists, then we can decide, for any given graph G, whether G is k-colorable.

The k-Colorability Problem
Given G = (V,E), is G k-colorable? Known to be NP-complete for k > 2.

Translating a Graph G into a Cache Question
- Color → cache line
- Vertex vi → object oi
- Edge e = (vi,vj) → subsequence (oi,oj)^M (i.e., M repetitions of (oi,oj))
- Coloring → placement

A vertex vi is represented by an object oi: O_G = { oi : vi ∈ V }. Let β = O(1/ε).
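The graph-to-sequence translation in this reduction can be sketched in code. The helper name below is illustrative, and the concrete β value is only an example of the repetition exponent:

```python
# Sketch of the reduction's sequence construction: each edge (vi, vj) of G
# becomes |E|**beta back-to-back repetitions of the pair (oi, oj), where
# beta = O(1/epsilon) is the repetition exponent from the slides.

def graph_to_sequence(edges, beta=1):
    """Translate a graph (a list of vertex-index pairs) into an access
    sequence sigma_G over objects o_i, one object per vertex."""
    m = len(edges) ** beta          # repetitions per edge: |E|**beta
    sigma = []
    for (i, j) in edges:
        sigma.extend([f"o{i}", f"o{j}"] * m)
    return sigma

# G1 = K4 (not 3-colorable) has 6 edges, so each pair repeats 6 times.
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
sigma1 = graph_to_sequence(k4, beta=1)
print(len(sigma1))   # 6 edges * 6 repetitions * 2 accesses = 72
```

The intuition: if two adjacent vertices get the same color (the same cache line), their long run of alternating accesses thrashes; if the graph is k-colorable, a coloring-derived placement makes every run cheap after its first two misses.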
Each edge (vi,vj) ∈ E is represented by |E|^β repetitions of the two objects oi, oj:
σ_G = concatenation over (vi,vj) ∈ E of (oi,oj)^(|E|^β)

Examples
- G1 = K4 (not 3-colorable), |E| = 6:
  σ_G1 = (o1,o2)^6 (o1,o3)^6 (o1,o4)^6 (o2,o3)^6 (o2,o4)^6 (o3,o4)^6
- G2 = K4 minus the edge (v3,v4) (3-colorable), |E| = 5:
  σ_G2 = (o1,o2)^5 (o1,o3)^5 (o1,o4)^5 (o2,o3)^5 (o2,o4)^5

Properties of the Translation
- Length of σ: n = O(|E|^(β+1)).
- Case I: G is k-colorable. Then Opt(O_G, σ_G) = O(|E|) = O(n^(1/(β+1))) = O(n^(ε/2)).
- Case II: G is not k-colorable. Then Opt(O_G, σ_G) = Ω(|E|^β) = Ω(n^(β/(β+1))) = Ω(n^(1-ε/2)).
- Thus, an algorithm that produces a placement within n^(1-ε) of the optimum can distinguish between the two cases!

Replacement Protocol
- Relevant in t-way caches: which object is removed when a set is full?
- We need a replacement policy, or protocol. Examples: LRU, FIFO, …

Replacement Protocol (cont.)
- Problem: a replacement protocol may behave badly. If recently used objects are removed (e.g., most-recently-used replacement), then any placement is good.
- Solution: only reasonable caches are considered.
- Problem: how do we define "reasonable"?

Sensible Caches
A cache is C-sensible if, for every sequence σ = (o1,…,oq) of distinct objects (oi ≠ oj for i ≠ j) in which no more than t objects are mapped to the same cache set, σ causes at most q + C misses when accessed after any sequence σ' (i.e., only O(1) extra misses after the first access to each object). E.g., LRU is 0-sensible.

Pseudo-LRU
- Used in the Intel Pentium (4-way associative).
- Each set holds two pairs, with an LRU-block flag for each pair and an LRU-pair flag.
- The LRU block in the LRU pair is removed.
- The Pseudo-LRU replacement protocol is 0-sensible.

t-Way Associative Caches: Some Proof Ideas
Theorem 2: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement within a factor of n^(1-ε) of the optimum, then P=NP.
Proof: If such an algorithm exists, then we can decide, for any given graph G, whether G is k-colorable. Main idea: a sensible replacement protocol is "forced" to behave like a direct-mapped cache; we do this by using dummy objects.

How do we construct the approximation algorithm?
- If σ contains at least c log n distinct objects, then Opt ≥ c log n (each distinct object misses at least once), so any placement is n / (c log n)-approximate.
- Otherwise, there are fewer than c log n objects in σ. In this case we can find an optimal placement by examining all k^(c log n) = n^(c log k) possible placements, which is polynomial in n.

Unrestricted Alignments and Non-Uniform Size Objects
Two simplifying assumptions so far: uniform-size objects, and objects aligned to cache blocks. Can we get the same results without them? Lower bound? Upper bound?

A Problem with Unrestricted Alignments
- A cache with 3 blocks; 2 objects o1, o2, each of size 1.5 blocks; the sequence (o1,o2)^M.
- Restricted alignment: Θ(M) misses.
- Unrestricted alignment: 3 misses (if the objects are placed consecutively in memory).

Unrestricted Alignments (uniform instances)
Consider the aligned version g of a placement f:
- Direct mapping: #misses(g) = O(#misses(f)).
- Associative caches: #misses(g) = O(#misses(f)) + O(|E|) when (O,σ) = H(G).
The lower bounds extend to this setting as well.

Conclusion
- Computing the best placement of data in memory (with respect to reducing cache misses) is extremely difficult; we cannot even get close (if P ≠ NP).
- There exists a matching (weak) positive result.
- Implications: using heuristics cannot be avoided, and we cannot hope to evaluate the potential benefit.

An Open Question
Can we classify programs for which the problem becomes simpler?

The end
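As a closing illustration, the two-case approximation algorithm described above can be sketched in code. All names are illustrative, and the exhaustive search is written naively for clarity:

```python
# Sketch of the n/(c log n)-approximation from the talk.
# Case 1: >= c*log(n) distinct objects: any placement is already
#         n/(c log n)-approximate (Opt >= #distinct objects, misses <= n).
# Case 2: < c*log(n) distinct objects: try all k**(c log n) = n**(c log k)
#         placements (polynomial in n) and keep the best.
import itertools
import math

def count_misses(sequence, placement, k):
    """Direct-mapped miss count for a placement (object -> memory block)."""
    cache = [None] * k
    misses = 0
    for obj in sequence:
        line = placement[obj] % k
        if cache[line] != obj:
            misses += 1
            cache[line] = obj
    return misses

def approx_placement(sequence, k, c=1):
    objects = sorted(set(sequence))
    n = len(sequence)
    if len(objects) >= c * math.log2(n):
        # Case 1: any placement will do; the all-zero placement is one choice.
        return {o: 0 for o in objects}
    # Case 2: exhaustive search over all mappings of objects to cache blocks.
    best = None
    for assignment in itertools.product(range(k), repeat=len(objects)):
        f = dict(zip(objects, assignment))
        if best is None or count_misses(sequence, f, k) < count_misses(sequence, best, k):
            best = f
    return best
```

For example, with `sequence = ["a", "b"] * 10`, `k = 2`, and a large `c` (forcing Case 2), the search finds a placement that separates the two objects, yielding only the two compulsory misses.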
