improving performance of caches
avg. memory access time = hit time + miss rate * miss penalty
accordingly:
- reduce hit time
- reduce miss rate
- reduce miss penalty
the first term (hit time) is small;
the second term (miss rate * miss penalty) is large.
therefore concentrate on reducing the miss rate and the miss penalty,
since they are more important than the hit time.
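for example (the numbers here are made up for illustration): with
hit time = 1 cycle, miss rate = 5%, and miss penalty = 40 cycles,

    avg. memory access time = 1 + 0.05 * 40 = 3 cycles

halving the miss rate gives 1 + 0.025 * 40 = 2 cycles, while shaving 10%
off the hit time only gives 0.9 + 2.0 = 2.9 cycles, so the second term
dominates.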
reducing cache misses
the three "C"s of cache design
- compulsory: first access to a block that's not cached.
- capacity: if the cache cannot contain all the blocks needed during
execution of a program, misses will occur because blocks are discarded
and later retrieved.
- conflict: if the block placement strategy is set associative or direct mapped,
then misses will occur when too many blocks map to the same set. e.g. in a
direct mapped cache, two blocks that map to the same line evict each other
on alternating accesses even if the rest of the cache is empty.
the 7 ways of reducing cache misses
- increase associativity. tradeoff: more hardware complexity and cost,
and it may decrease the clock rate, slowing the overall system.
- increase cache capacity or make block sizes larger. reduces miss rate.
tradeoff: larger amounts of info have to be moved between cache and
next lower level. increases miss penalty.
- use a victim cache (a small cache holding blocks
discarded due to a miss). the victim cache is checked on the next miss to
see if it contains the desired data before going down to the next lower level of
the memory hierarchy. typically a small fully associative victim cache is used
with a direct mapped main cache. e.g. a 4 entry victim cache typically
removes about 50% of the misses in a 4k direct mapped cache. (a C sketch
of the lookup appears after the loop example below.)
- pseudo associative caches: work like a direct mapped cache on a hit;
but on a miss, one alternate cache entry is checked before going down to
the next lower memory level. thus there is a fast hit time and a slow hit
time. (a C sketch appears after the loop example below.)
- hardware prefetching of instructions and data.
instruction prefetch is done in hardware, outside the cache;
the same can be done for data,
with multiple data streams each fetching at a different address.
- compiler controlled prefetching: the compiler can insert prefetches
to request data before it is needed:
- register prefetch: loads the value into a register.
- cache prefetch: accesses a block before it is needed in order to have
the data in the cache at a later time.
loops are prime targets for compiler controlled prefetching.
(a hand-written C sketch appears after the loop example below.)
- compiler optimizations:
- done without affecting hardware
- rearranging instructions can keep instruction blocks in cache longer.
- goal is to improve spatiotemporal (spatial AND temporal) locality.
- data accesses can usually be rearranged more freely than code.
example:
/* mypicq is, say, float mypicq[480][640], stored row major in C */
for (n = 0; n < 640; n++) {
    for (m = 0; m < 480; m++) {
        mypicq[m][n] = w1 * mypicq[m][n];
    } // end for m
} // end for n
what's wrong with the above??? mypicq is stored row major, so the inner
loop over m strides 640 elements between accesses; nearly every access
touches a different cache block, and spatial locality is wasted.
now we rearrange:
for (m = 0; m < 480; m++) {
    for (n = 0; n < 640; n++) {
        mypicq[m][n] = w1 * mypicq[m][n];
    } // end for n
} // end for m
note the direction of the axes, e.g. x down, y across to the right: the
inner loop now walks across a row, so consecutive accesses fall in the
same cache block.
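to make the victim cache idea concrete, here is a minimal C sketch of the
lookup path. everything here (sizes, names, FIFO replacement) is an
illustrative assumption, not from the notes:

#include <stdbool.h>
#include <stdint.h>

#define SETS    64   /* lines in the direct mapped main cache */
#define VICTIMS 4    /* entries in the small victim cache     */

typedef struct { bool valid; uint32_t block; } entry_t;

static entry_t cache[SETS];     /* indexed by block % SETS              */
static entry_t victim[VICTIMS]; /* fully associative, searched linearly */
static int     vnext;           /* FIFO replacement in the victim cache */

/* returns true on a hit in either the main cache or the victim cache */
bool victim_access(uint32_t block)
{
    uint32_t set = block % SETS;

    if (cache[set].valid && cache[set].block == block)
        return true;                        /* ordinary fast hit */

    /* main-cache miss: check the victim cache before going down
       to the next lower level of the memory hierarchy */
    for (int i = 0; i < VICTIMS; i++) {
        if (victim[i].valid && victim[i].block == block) {
            entry_t tmp = cache[set];       /* swap: the found block */
            cache[set]  = victim[i];        /* moves into the main   */
            victim[i]   = tmp;              /* cache, the displaced  */
            return true;                    /* line becomes a victim */
        }
    }

    /* miss everywhere: fetch from the next level (not shown) and
       push the displaced main-cache line into the victim cache */
    if (cache[set].valid) {
        victim[vnext] = cache[set];
        vnext = (vnext + 1) % VICTIMS;
    }
    cache[set].valid = true;
    cache[set].block = block;
    return false;
}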
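a pseudo associative lookup can be sketched the same way; one common
choice (assumed here, not specified in the notes) finds the alternate
entry by flipping the top bit of the set index, and swaps on a slow hit
so the next access to that block is fast:

#include <stdbool.h>
#include <stdint.h>

#define PSETS 64               /* power of two, so the bit flip works */

typedef struct { bool valid; uint32_t block; } pentry_t;
static pentry_t pcache[PSETS];

/* returns 0 for a fast hit, 1 for a slow hit, -1 for a miss */
int pseudo_access(uint32_t block)
{
    uint32_t set = block % PSETS;
    uint32_t alt = set ^ (PSETS / 2);   /* flip the top index bit */

    if (pcache[set].valid && pcache[set].block == block)
        return 0;                        /* fast hit */

    if (pcache[alt].valid && pcache[alt].block == block) {
        pentry_t tmp = pcache[set];      /* slow hit: swap entries so */
        pcache[set] = pcache[alt];       /* the next access to this   */
        pcache[alt] = tmp;               /* block takes the fast path */
        return 1;
    }

    pcache[set].valid = true;            /* fetch from the next level */
    pcache[set].block = block;           /* (not shown) into the      */
    return -1;                           /* primary entry             */
}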
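and a sketch of what compiler controlled prefetching amounts to, written
by hand with the gcc/clang __builtin_prefetch intrinsic. this is a cache
prefetch (it brings a block into the cache without naming a register);
the lookahead distance of 16 elements is a made-up tuning value:

/* scale an array, prefetching ahead of the loop index */
void scale(float *a, int n, float w1)
{
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]); /* request the block early   */
        a[i] = w1 * a[i];                   /* used ~16 iterations later */
    }
}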