# CSSE132 Introduction to Computer Systems

23 : Cache

April 17, 2013

# **Today**

- Cache memory organization and operation
- Performance impact of caches
  - The memory mountain

See book for more details and examples

## **Cache Memories**

- Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
- Typical system structure:



# General Cache Organization (S, E, B)





## **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set

Assume: cache block size 8 bytes






No match: old line is evicted and replaced

# **Direct-Mapped Cache Simulation**

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| x   | xx  | x   |

M=16 byte addresses (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 blocks/set

Address trace (reads, one byte per read):

| Address | Binary (set bits underlined) | Result |
|---------|------------------------------|--------|
| 0 | 0<u>00</u>0<sub>2</sub> | miss |
| 1 | 0<u>00</u>1<sub>2</sub> | hit  |
| 7 | 0<u>11</u>1<sub>2</sub> | miss |
| 8 | 1<u>00</u>0<sub>2</sub> | miss |
| 0 | 0<u>00</u>0<sub>2</sub> | miss |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 0   | M[0-1] |
| Set 1 |   |     |        |
| Set 2 |   |     |        |
| Set 3 | 1 | 0   | M[6-7] |
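The trace above can be replayed with a few lines of code. This is a minimal sketch, not part of the original slides: the function name `simulate_direct_mapped` and its parameters are made up for illustration, and the cache stores only tags (no data), which is enough to decide hit or miss.

```python
def simulate_direct_mapped(trace, num_sets=4, block_size=2):
    """Replay byte-address reads through a direct-mapped cache (E = 1)."""
    # One line per set: None means invalid, otherwise the stored tag.
    lines = [None] * num_sets
    results = []
    for addr in trace:
        block = addr // block_size    # drop the block-offset bits
        set_index = block % num_sets  # low bits of the block number
        tag = block // num_sets       # remaining high bits
        if lines[set_index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            lines[set_index] = tag    # old line (if any) is evicted
    return results, lines


results, lines = simulate_direct_mapped([0, 1, 7, 8, 0])
print(results)  # ['miss', 'hit', 'miss', 'miss', 'miss']
print(lines)    # [0, None, None, 0]
```

The final read of address 0 misses because address 8 maps to the same set and evicted the block M[0-1] in between.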

## E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set

Assume: cache block size 8 bytes

Address of short int: t tag bits, set index bits `0...01`, block offset bits `100`

[Figure: the set index bits find the set; each line in the set holds a tag and an 8-byte block (bytes 0-7)]




#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

# 2-Way Set Associative Cache Simulation

| t=2 | s=1 | b=1 |
|-----|-----|-----|
| xx  | x   | x   |

M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/set

Address trace (reads, one byte per read):

| Address | Binary (set bit underlined) | Result |
|---------|-----------------------------|--------|
| 0 | 00<u>0</u>0<sub>2</sub> | miss |
| 1 | 00<u>0</u>1<sub>2</sub> | hit  |
| 7 | 01<u>1</u>1<sub>2</sub> | miss |
| 8 | 10<u>0</u>0<sub>2</sub> | miss |
| 0 | 00<u>0</u>0<sub>2</sub> | hit  |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
| Set 1 | 1 | 01  | M[6-7] |
|       | 0 |     |        |
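The same trace can be replayed through a 2-way set associative cache. This is a rough sketch under stated assumptions: the function name is hypothetical, and LRU is assumed as the replacement policy (the slide does not say which policy is used, and this particular trace never forces an eviction).

```python
def simulate_set_associative(trace, num_sets=2, ways=2, block_size=2):
    """Replay byte-address reads through an E-way set associative cache."""
    # Each set is a list of tags, ordered least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in trace:
        block = addr // block_size
        set_index = block % num_sets
        tag = block // num_sets
        lines = sets[set_index]
        if tag in lines:
            results.append("hit")
            lines.remove(tag)      # re-appended below as most recent
        else:
            results.append("miss")
            if len(lines) == ways:
                lines.pop(0)       # assumed policy: evict the LRU tag
        lines.append(tag)
    return results, sets


results, sets = simulate_set_associative([0, 1, 7, 8, 0])
print(results)  # ['miss', 'hit', 'miss', 'miss', 'hit']
print(sets)     # [[2, 0], [1]]
```

Tags 2 and 0 in set 0 are the binary tags 10 and 00 from the table. With two lines per set, addresses 0 and 8 no longer evict each other, so the last read of address 0 hits.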

## **Eviction**

#### Must evict old lines if no room for new lines

- Cache miss occurs
- New data is fetched
- If cache is full, old line must be evicted to make room

## Eviction strategies

- Random
  - Works, but not very smart
- Least recently used
  - Track last time line was referenced
  - Evict 'oldest' line
- Least frequently used
  - Track number of times used
  - Evict infrequently used line

## What about writes?

### Multiple copies of data exist:

L1, L2, Main Memory, Disk

#### What to do on a write-hit?

- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
  - Need a dirty bit (line different from memory or not)

#### What to do on a write-miss?

- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
- No-write-allocate (writes immediately to memory)

## Typical

- Write-through + No-write-allocate
- Write-back + Write-allocate
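To see why the pairing matters, the following sketch counts how many writes reach memory under each typical combination. It is a deliberately simplified model, not how real hardware is built: the cache holds a single line, only writes are traced, and the function name `count_memory_writes` is made up for this example.

```python
def count_memory_writes(write_addrs, policy, block_size=2):
    """Count writes that reach memory for a one-line cache (toy model)."""
    cached_block = None  # block number held by the single line, if any
    dirty = False        # write-back only: line differs from memory
    memory_writes = 0
    for addr in write_addrs:
        block = addr // block_size
        if policy == "write-through":
            # Every write goes straight to memory. With no-write-allocate
            # a write miss does not load the block into the cache.
            memory_writes += 1
        elif policy == "write-back":
            if block != cached_block:
                if dirty:
                    memory_writes += 1  # write the evicted dirty line back
                cached_block = block    # write-allocate: load the block
            dirty = True                # update only the cached copy
        else:
            raise ValueError(f"unknown policy: {policy}")
    return memory_writes, dirty


trace = [0, 0, 1, 0, 8]
print(count_memory_writes(trace, "write-through"))  # (5, False)
print(count_memory_writes(trace, "write-back"))     # (1, True)
```

Under write-back, the four writes to block M[0-1] cost one memory write when the line is replaced, and the line for address 8 is still dirty at the end.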

## Real caches

- Different caches for different data
  - Tune each cache for specific purpose
- Data cache (d-cache)
  - Stores program data
- Instruction cache (i-cache)
  - Stores program instructions
- Unified cache
  - Stores data and instructions

## **Intel Core i7 Cache Hierarchy**

#### Processor package



#### L1 i-cache and d-cache:

32 KB, 8-way, Access: 4 cycles

#### L2 unified cache:

256 KB, 8-way, Access: 11 cycles

#### L3 unified cache:

8 MB, 16-way, Access: 30-40 cycles

**Block size**: 64 bytes for all caches.
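From these figures the number of sets in each cache follows from the standard relation capacity = S × E × B. The short calculation below is a worked example only; the helper name `num_sets` is an assumption, and the sizes are the ones listed above.

```python
KB, MB = 1024, 1024 * 1024


def num_sets(capacity_bytes, ways, block_bytes=64):
    """S = C / (E * B) for a cache of C bytes, E ways, B-byte blocks."""
    return capacity_bytes // (ways * block_bytes)


caches = [("L1", 32 * KB, 8), ("L2", 256 * KB, 8), ("L3", 8 * MB, 16)]
for name, capacity, ways in caches:
    s = num_sets(capacity, ways)
    # For a power-of-two S, bit_length() - 1 is log2(S).
    print(name, s, "sets,", s.bit_length() - 1, "set index bits")
# L1 64 sets, 6 set index bits
# L2 512 sets, 9 set index bits
# L3 8192 sets, 13 set index bits
```

With 64-byte blocks, every level uses 6 block-offset bits; the remaining address bits above the set index form the tag.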

## **Cache Performance Metrics**

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses)
  - Equal to 1 - hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

#### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 1-2 clock cycles for L1
  - 5-20 clock cycles for L2

#### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory (Trend: increasing!)

## Let's think about those numbers

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time:

```
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
```

- This is why "miss rate" is used instead of "hit rate"
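The arithmetic above is easy to experiment with. The helper below is a small sketch (the name `average_access_time` is chosen here for illustration) using the formula from the slide: hit time plus miss rate times miss penalty.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average cycles per access: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


for hit_rate_percent in (97, 99):
    miss_rate = (100 - hit_rate_percent) / 100
    cycles = average_access_time(1, miss_rate, 100)
    print(f"{hit_rate_percent}% hits: {cycles:.1f} cycles")
# 97% hits: 4.0 cycles
# 99% hits: 2.0 cycles
```

A two-point change in hit rate cuts the miss rate to a third (3% to 1%), which is what halves the average access time.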

## **Writing Cache Friendly Code**

- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)

Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
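One way to make this concrete is to count misses for two traversal orders of the same array. The sketch below assumes a hypothetical 4 KB direct-mapped cache (64 sets, 64-byte blocks) and a 64 x 64 array of 4-byte integers stored row by row; these parameters and the function name `count_misses` are illustrative, not measurements of a real machine.

```python
def count_misses(addresses, num_sets=64, block_size=64):
    """Count misses for byte-address reads in a direct-mapped cache."""
    lines = [None] * num_sets  # stored tag per set, None = invalid
    misses = 0
    for addr in addresses:
        block = addr // block_size
        set_index = block % num_sets
        tag = block // num_sets
        if lines[set_index] != tag:
            misses += 1
            lines[set_index] = tag
    return misses


ROWS, COLS, ELEM = 64, 64, 4  # assumed array shape and element size


def address(r, c):
    """Byte address of element (r, c) in a row-major array."""
    return (r * COLS + c) * ELEM


# Stride-1: the inner loop walks along a row.
row_order = [address(r, c) for r in range(ROWS) for c in range(COLS)]
# Stride-COLS: the inner loop walks down a column.
col_order = [address(r, c) for c in range(COLS) for r in range(ROWS)]

print(count_misses(row_order))  # 256: one miss per 16-element block
print(count_misses(col_order))  # 4096: every access misses
```

Both loops read the same 4096 elements, but in this model the stride-1 order misses 16 times less often because each fetched block is fully used before it is evicted.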


## **The Memory Mountain**

- Read throughput (read bandwidth)
  - Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured read throughput as a function of spatial and temporal locality.
  - Compact way to characterize memory system performance.

# **The Memory Mountain**





**Intel Core i7**

- 32 KB L1 i-cache
- 32 KB L1 d-cache
- 256 KB unified L2 cache
- 8 MB unified L3 cache

All caches on-chip


