improving performance of caches
avg. memory access time = hit time + miss rate * miss penalty
accordingly:
- reduce hit time
- reduce miss rate
- reduce miss penalty
the first term (hit time) is small;
the second term (miss rate * miss penalty) is large.
therefore concentrate on reducing the miss rate and the miss penalty,
since they are more important than the hit time.
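for example (the numbers here are made up for illustration): with
hit time = 1 cycle, miss rate = 5%, and miss penalty = 40 cycles,

    avg. memory access time = 1 + 0.05 * 40 = 3 cycles

halving the miss rate gives 1 + 0.025 * 40 = 2 cycles, while shaving 10%
off the hit time only gives 0.9 + 2.0 = 2.9 cycles, so the second term
dominates.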
reducing cache misses
the three "C"s of cache design
- compulsory: first access to a block that's not cached.
- capacity: if the cache cannot contain all the blocks needed during
execution of a program, misses will occur because blocks are discarded
and later retrieved.
- conflict: if the block placement strategy is set associative or direct mapped,
then misses will occur when too many blocks map to the same set. e.g. in a
direct mapped cache, two blocks that map to the same line evict each other
on alternating accesses even if the rest of the cache is empty.
the 7 ways of reducing cache misses
- increase associativity. tradeoff: more hardware complexity and cost,
and it may decrease the clock rate, slowing the overall system.
- increase cache capacity or make block sizes larger. reduces miss rate.
tradeoff: larger amounts of info have to be moved between cache and
next lower level. increases miss penalty.
- use a victim cache (a small cache holding blocks
discarded due to a miss). the victim cache is checked on the next miss to
see if it contains the desired data before going down to the next lower level of
the memory hierarchy. typically a small fully associative victim cache is used
with a direct mapped main cache. e.g. a 4 entry victim cache typically
removes about 50% of the misses in a 4k direct mapped cache. (a C sketch
of the lookup appears after the loop example below.)
- pseudo associative caches: work like a direct mapped cache on a hit;
but on a miss, one alternate cache entry is checked before going down to
the next lower memory level. thus there is a fast hit time and a slow hit
time. (a C sketch appears after the loop example below.)
- hardware prefetching of instructions and data.
instruction prefetch is done in hardware, outside the cache;
the same can be done for data,
with multiple data streams each fetching at a different address.
- compiler controlled prefetching: the compiler can insert prefetches
to request data before it is needed:
- register prefetch: loads the value into a register.
- cache prefetch: accesses a block before it is needed in order to have
the data in the cache at a later time.
loops are prime targets for compiler controlled prefetching.
(a hand-written C sketch appears after the loop example below.)
- compiler optimizations:
- done without affecting hardware
- rearranging instructions can keep instruction blocks in cache longer.
- goal is to improve spatiotemporal (spatial AND temporal) locality.
- data accesses can usually be rearranged more freely than code.
example:
/* mypicq is, say, float mypicq[480][640], stored row major in C */
for (n = 0; n < 640; n++) {
    for (m = 0; m < 480; m++) {
        mypicq[m][n] = w1 * mypicq[m][n];
    } // end for m
} // end for n
what's wrong with the above??? mypicq is stored row major, so the inner
loop over m strides 640 elements between accesses; nearly every access
touches a different cache block, and spatial locality is wasted.
now we rearrange:
for (m = 0; m < 480; m++) {
    for (n = 0; n < 640; n++) {
        mypicq[m][n] = w1 * mypicq[m][n];
    } // end for n
} // end for m
note the direction of the axes, e.g. x down, y across to the right: the
inner loop now walks across a row, so consecutive accesses fall in the
same cache block.
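to make the victim cache idea concrete, here is a minimal C sketch of the
lookup path. everything here (sizes, names, FIFO replacement) is an
illustrative assumption, not from the notes:

#include <stdbool.h>
#include <stdint.h>

#define SETS    64   /* lines in the direct mapped main cache */
#define VICTIMS 4    /* entries in the small victim cache     */

typedef struct { bool valid; uint32_t block; } entry_t;

static entry_t cache[SETS];     /* indexed by block % SETS              */
static entry_t victim[VICTIMS]; /* fully associative, searched linearly */
static int     vnext;           /* FIFO replacement in the victim cache */

/* returns true on a hit in either the main cache or the victim cache */
bool victim_access(uint32_t block)
{
    uint32_t set = block % SETS;

    if (cache[set].valid && cache[set].block == block)
        return true;                        /* ordinary fast hit */

    /* main-cache miss: check the victim cache before going down
       to the next lower level of the memory hierarchy */
    for (int i = 0; i < VICTIMS; i++) {
        if (victim[i].valid && victim[i].block == block) {
            entry_t tmp = cache[set];       /* swap: the found block */
            cache[set]  = victim[i];        /* moves into the main   */
            victim[i]   = tmp;              /* cache, the displaced  */
            return true;                    /* line becomes a victim */
        }
    }

    /* miss everywhere: fetch from the next level (not shown) and
       push the displaced main-cache line into the victim cache */
    if (cache[set].valid) {
        victim[vnext] = cache[set];
        vnext = (vnext + 1) % VICTIMS;
    }
    cache[set].valid = true;
    cache[set].block = block;
    return false;
}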
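a pseudo associative lookup can be sketched the same way; one common
choice (assumed here, not specified in the notes) finds the alternate
entry by flipping the top bit of the set index, and swaps on a slow hit
so the next access to that block is fast:

#include <stdbool.h>
#include <stdint.h>

#define PSETS 64               /* power of two, so the bit flip works */

typedef struct { bool valid; uint32_t block; } pentry_t;
static pentry_t pcache[PSETS];

/* returns 0 for a fast hit, 1 for a slow hit, -1 for a miss */
int pseudo_access(uint32_t block)
{
    uint32_t set = block % PSETS;
    uint32_t alt = set ^ (PSETS / 2);   /* flip the top index bit */

    if (pcache[set].valid && pcache[set].block == block)
        return 0;                        /* fast hit */

    if (pcache[alt].valid && pcache[alt].block == block) {
        pentry_t tmp = pcache[set];      /* slow hit: swap entries so */
        pcache[set] = pcache[alt];       /* the next access to this   */
        pcache[alt] = tmp;               /* block takes the fast path */
        return 1;
    }

    pcache[set].valid = true;            /* fetch from the next level */
    pcache[set].block = block;           /* (not shown) into the      */
    return -1;                           /* primary entry             */
}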
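and a sketch of what compiler controlled prefetching amounts to, written
by hand with the gcc/clang __builtin_prefetch intrinsic. this is a cache
prefetch (it brings a block into the cache without naming a register);
the lookahead distance of 16 elements is a made-up tuning value:

/* scale an array, prefetching ahead of the loop index */
void scale(float *a, int n, float w1)
{
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]); /* request the block early   */
        a[i] = w1 * a[i];                   /* used ~16 iterations later */
    }
}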