## Sequential Computer Architecture

Read section 1.2 to learn about memory hierarchies, and read section 1.4 to learn about locality and data reuse.

#### 1.2 Memory Hierarchies

We will now refine the picture of the Von Neuman architecture, in which data is loaded immediately from memory to the processors, where it is operated on. This picture is unrealistic because of the so-called memory wall: the processor is much too fast to load data this way. Specifically, a single load can take 1000 cycles, while a processor can perform several operations per cycle. (After this long wait for a load, the next load can come faster, but still too slow for the processor. This matter of wait time versus throughput will be addressed below in section 1.2.2.)

In reality, there will be various memory levels in between the floating point unit and the main memory: the registers and the caches. Each of these will be faster to a degree than main memory; unfortunately, the faster the memory on a certain level, the slower it will be. This leads to interesting programming problems, which we will discuss in the rest of this chapter, and particularly section 1.5.

The use of registers is the first instance you will see of measures taken to counter the fact that loading data from memory is slow. Access to data in registers, which are built into the processor, is almost instantaneous, unlike main memory, where hundreds of clock cycles can pass between requesting a data item, and it being available to the processor.

One advantage of having registers is that data can be kept in them during a computation, which obviates the need for repeated slow memory loads and stores. For example, in

s = 0;for (i=0; i<n; i++)  s += a[i]*b[i];

the variable s only needs to be stored to main memory at the end of the loop; all intermediate values are almost immediately overwritten. Because of this, a compiler will keep s in a register, eliminating the delay that would result from continuously storing and loading the variable. Such questions of data reuse will be discussed in more detail below; we will first discuss the components of the memory hierarchy.

##### 1.2.1 Busses

The wires that move data around in a computer, from memory to cpu or to a disc controller or screen, are called busses. The most important one for us is the Front-Side Bus (FSB) which connects the processor to memory. In one popular architecture, this is called the ‘north bridge’, as opposed to the ‘south bridge’ which connects to external devices, with the exception of the graphics controller.

The bus is typically much slower than the processor, operating between 500MHz and 1GHz, which is the main reason that caches are needed.

##### 1.2.2 Latency and Bandwidth

Above, we mentioned in very general terms that operating on data in registers is almost instantaneous, whereas loading data from memory into the registers, a necessary step before any operation, incurs a substantial delay. We will now make this story slightly more precise.

There are two important concepts to describe the movement of data: latency and bandwidth. The assumption here is that requesting an item of data incurs an initial delay; if this item was the first in a stream of data, usually a consecutive range of memory addresses, the remainder of the stream will arrive with no further delay at a regular amount per time period.

Latency is the delay between the processor issuing a request for a memory item, and the item actually arriving. We can distinguish between various latencies, such as the transfer from memory to cache, cache to register, or summarize them all into the latency between memory and processor. Latency is measured in (nano) seconds, or clock periods.

If a processor executes instructions in the order they are found in the assembly code, then execution will often stall while data is being fetched from memory; this is also called memory stall. For this reason, a low latency is very important. In practice, many processors have ‘out-of-order execution’ of instructions, allowing them to perform other operations while waiting for the requested data. Programmers can take this into account, and code in a way that achieves latency hiding.

Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Bandwidth is measured in bytes (kilobyes, megabytes, gigabyes) per second or per clock cycle. The bandwidth between two memory levels is usually the product of the cycle speed of the channel (the bus speed) and width of the bus: the number of bits that can be sent simultaneously in every cycle of the bus clock.

The concepts of latency and bandwidth are often combined in a formula for the time that a message takes from start to finish:

$T(n)= \alpha + \beta n$

where $\alpha$ is the latency and $\beta$ is the inverse of the bandwidth: the time per byte.

Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth. These two factors make it important to program in such a way that, if at all possible, the processor uses data from cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition

for (i)  a[i] = b[i]+c[i]

Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes4 of memory traffic. Typical memory bandwidth figures (see for instance figure 1.2) are nowhere near 32 bytes per cycle. This means that, without caches, algorithm performance can be bounded by memory performance.

The concepts of latency and bandwidth will also appear in parallel computers, when we talk about sending data from one processor to the next.

##### 1.2.3 Registers

The registers are the memory level that is closest to the processor: any operation acts on data in register and leaves its result in register. Programs written in assembly language explicitly use these registers:

addl %eax, %edx

which is an instruction to add the content of one register to another. As you see in this sample instruction, registers are not numbered in memory, but have distinct names that are referred to in the assembly instruction.

Registers have a high bandwidth and low latency, because they are part of the processor. That also makes them a very scarce resource.

##### 1.2.4 Caches

In between the registers, which contain the data that the processor operates on, and the main memory where lots of data can reside for a long time, are various levels of cache memory, that have lower latency and higher bandwidth than main memory. Data from memory travels the cache hierarchy to wind up in registers. The advantage to having cache memory is that if a data item is reused shortly after it was first needed, it will still be in cache, and therefore it can be accessed much faster than if it would have to be brought in from memory.

The caches are called ‘level 1’ and ‘level 2’ (or, for short, L1 and L2) cache; some processors can have an L3 cache. The L1 and L2 caches are part of the die, the processor chip, although for the L2 cache that is a recent development; the L3 cache is off-chip. The L1 cache is small, typically around 16Kbyte. Level 2 (and, when present, level 3) cache is more plentiful, up to several megabytes, but it is also slower. Unlike main memory, which is expandable, caches are fixed in size. If a version of a processor chip exists with a larger cache, it is usually considerably more expensive.

Figure 1.2: Memory hierarchy, characterized by speed and size.

Data that is needed in some operation gets copied into the various caches on its way to the processor. If, some instructions later, a data item is needed again, it is first searched for in the L1 cache; if it is not found there, it is searched for in the L2 cache; if it is not found there, it is loaded from main memory. Finding data in cache is called a cache hit, and not finding it a cache miss.

Figure 1.2 illustrates the basic facts of caches, in this case for the AMD Opteron chip: the closer caches are to the floating point units, the fast, but also the smaller they are. Some points about this figure.

• Loading data from registers is so fast that it does not constitute a limitation on algorithm execution speed. On the other hand, there are few registers. The Opteron5 has 16 general purpose registers, 8 media and floating point registers, and 16 SIMD registers.
• The L1 cache is small, but sustains a bandwidth of 32 bytes, that is 4 double precision number, per cycle. This is enough to load two operands each for two operations, but note that the Opteron can actually perform 4 operations per cycle. Thus, to achieve peak speed, certain operands need to stay in register. The latency from L1 cache is around 3 cycles.
• The bandwidth from L2 and L3 cache is not documented and hard to measure due to cache policies (see below). Latencies are around 15 cycles for L2 and 50 for L3.
• Main memory access has a latency of more than 100 cycles, and a bandwidth of 4.5 bytes per cycle, which is about 1/7th of the L1 bandwidth. However, this bandwidth is shared by the 4 cores of the opteron chip, so effectively the bandwidth is a quarter of this number. In a machine like Ranger, which has 4 chips per node, some bandwidth is spent on maintaining cache coherence (see section 1.3) reducing the bandwidth for each chip again by half.

On level 1, there are separate caches for instructions and data; the L2 and L3 cache contain both data and instructions.

You see that the larger caches are increasingly unable to supply data to the processors fast enough. For this reason it is necessary to code in such a way that data is kept as much as possible in the highest cache level possible. We will discuss this issue in detail in the rest of this chapter.

Exercise 1.5. The L1 cache is smaller than the L2 cache, and if there is an L3, the L2 is smaller than the L3. Give a practical and a theoretical reason why this is so.

1.2.4.1 Reuse is the name of the game

The presence of one or more caches is not immediately a guarantee for high performance: this largely depends on the memory access pattern of the code, and how well this exploits the caches. The first time that an item is referenced, it is copied from memory into cache, and through to the processor registers. The latency and bandwidth for this are not mitigated in any way by the presence of a cache. When the same item is referenced a second time, it may be found in cache, at a considerably reduced cost in terms of latency and bandwidth: caches have shorter latency and higher bandwidth than main memory.

We conclude that, first, an algorithm has to have an opportunity for data reuse. If every data item is used only once (as in addition of two vectors), there can be no reuse, and the presence of caches is largely irrelevant. A code will only benefit from the increased bandwidth and reduced latency of a cache if items in cache are referenced more than once; see section 1.4.1 for a detailed discussion. An example would be the matrix-vector multiplication y = Ax where each element of x is used in n operations, where n is the matrix dimension.

Secondly, an algorithm may theoretically have an opportunity for reuse, but it needs to be coded in such a way that the reuse is actually exposed. We will address these points in section 1.4.3. This second point especially is not trivial.

1.2.4.2 Replacement policies

Data in cache and registers is placed there by the system, outside of programmer control. Likewise, the system decides when to overwrite data in the cache or in registers if it is not referenced in a while, and as other data needs to be placed there. Below, we will go into detail on how caches do this, but as a general principle, a Least Recently Used (LRU) cache replacement policy is used: if a cache is full and new data needs to be placed into it, the data that was least recently used is flushed, meaning that it is overwritten with the new item, and therefore no longer accessible. LRU is by far the most common replacement policy; other possibilities are FIFO (first in first out) or random replacement.

Exercise 1.6. Sketch a simple scenario, and give some (pseudo) code, to argue that LRU is preferable over FIFO as a replacement strategy.

1.2.4.3 Cache lines

Data is not moved between memory and cache, or between caches, in single bytes. Instead, the smallest unit of data moved is called a cache line. A typical cache line is 64 or 128 bytes long, which in the context of scientific computing implies 8 or 16 double precision floating point numbers. The cache line size for data moved into L2 cache can be larger than for data moved into L1 cache.

It is important to acknowledge the existence of cache lines in coding, since any memory access costs the transfer of several words (see section 1.5.3 for some practical discussion). An efficient program then tries to use the other items on the cache line, since access to them is effectively free. This phenomenon is visible in code that accesses arrays by stride: elements are read or written at regular intervals.

Stride 1 corresponds to sequential access of an array:

for (i=0; i<N; i++)  ... = ... x[i] ...

A larger stride

for (i=0; i<N; i+=stride)  ... = ... x[i] ...

implies that in every cache line only certain elements are used; a stride of 4 with a cache line size of 4 double precision numbers implies that in every cache line only one number is used, and therefore that 3/4 of the used bandwidth has been wasted.

Some applications naturally lead to strides greater than 1, for instance, accessing only the real parts of an array of complex numbers; section 3.4.4. Also, methods that use recursive doubling often have a code structure that exhibits non-unit strides

for (i=0; i<N/2; i++)  x[i] = y[2*i];

1.2.4.4 Cache mapping

Caches get faster, but also smaller, the closer to the floating point units they get; even the largest cache is consid-erably smaller than the main memory size. We already noted that this has implications for the cache replacement strategy. Another issue we need to address in this context is that of cache mapping, which is the question of ‘if an item is placed in cache, where does it get placed’. This problem is generally addressed by mapping the (main memory) address of the item to an address in cache, leading to the question ‘what if two items get mapped to the same address’.

1.2.4.5 Direct mapped caches

The simplest cache mapping strategy is direct mapping. Suppose that memory addresses are 32 bits long, so that they can address 4G bytes; suppose further that the cache has 8K words, that is, 64k bytes, needing 16 bits to address. Direct mapping then takes from each memory address the last (‘least significant’) 16 bits, and uses these as the address of the data item in cache.

Direct mapping is very efficient because of its address calculations can be performed very quickly, leading to low latency, but it has a problem in practical applications. If two items are addressed that are separated by 8K words, they will be mapped to the same cache location, which will make certain calculations inefficient. Example:

double A[3][8192]; int i;for (i=0; i<512; i++)  a[2][i] = ( a[0][i]+a[1][i] )/2.;

Here, the locations of a[0][i], a[0][i], and a[2][i] are 8K from each other for every i, so the last 16 bits of their addresses will be the same, and hence they will be mapped to the same location in cache. The execution of the loop will now go as follows:

• The data at a[0][0] is brought into cache and register. This engenders a certain amount of latency. Together with this element, a whole cache line is transferred.
• The data at a[1][0] is brought into cache (and register, as we will not remark anymore from now on), together with its whole cache line, at cost of some latency. Since this cache line is mapped to the same location as the first, the first cache line is overwritten.
• In order to write the output, the cache line containing a[2][0] is brought into memory. This is again mapped to the same location, causing flushing of the cache line just loaded for a[1][0].
• In the next iteration, a[0][1] is needed, which is on the same cache line as a[0][0]. However, this cache line has been flushed, so it needs to be brought in anew from main memory or a deeper cache level. In doing so, it overwrites the cache line that holds a[1][0].
• A similar story hold for a[1][1]: it is on the cache line of a[1][0], which unfortunately has been overwritten in the previous step.

If a cache line holds four words, we see that each four iterations of the loop involve eight transfers of elements of a, where two would have sufficed, if it were not for the cache conflicts.

Exercise 1.7. In the example of direct mapped caches, mapping from memory to cache was done by using the final 16 bits of a 32 bit memory address as cache address. Show that the problems in this example go away if the mapping is done by using the first (‘most significant’) 16 bits as the cache address. Why is this not a good solution in general?

1.2.4.6 Associative caches

The problem of cache conflicts, outlined in the previous section, would be solved if any data item could go to any cache location. In that case there would be no conflicts, other than the cache filling up, in which case a cache replacement policy (section 1.2.4.2) would flush data to make room for the incoming item. Such a cache is called fully associative, and while it seems optimal, it is also very costly to build, and much slower in use.

For this reason, the most common solution is to have a k-way associative cache, where k is at least two. In this case, a data item can go to any of k cache locations. Code would have to have a k + 1-way conflict before data would be flushed prematurely as in the above example. In that example, a value of k = 2 would suffice, but in practice higher values are often encountered.

For instance, the Intel Woodcrest processor has

• an L1 cache of 32K bytes, that is 8-way set associative with a 64 byte cache line size;
• an L2 cache of 4M bytes, that is 8-way set associative with a 64 byte cache line size.

On the other hand, the AMD Barcelona chip has 2-way associativity for the L1 cache, and 8-way for the L2. A higher associativity (‘way-ness’) is obviously desirable, but makes a processor slower, since determining whether an address is already in cache becomes more complicated. For this reason, the associativity of the L1 cache, where speed is of the greatest importance, is typically lower than of the L2.

Exercise 1.8.  Write a small cache simulator in your favourite language. Assume a k-way associative cache of 32 entries and an architecture with 16 bit addresses. Run the following experiment for k = 1, 2, 4:

1. Let k be the associativity of the simulated cache.
2. Write the translation from 16 bit address to $32/2^k$ bit cache address.
3. Generate 32 random machine addresses, and simulate storing them in cache.

Since the cache has 32 entries, optimally the 32 addresses can all be stored in cache. The chance of this actually happening is small, and often the data of one address will be removed from the cache when it conflicts with another address. Record how many addresses, out of 32, are actually stored in the cache. Do step 3 100 times, and plot the results; give median and average value, and the standard deviation. Observe that increasing the associativity improves the number of addresses stored. What is the limit behaviour? (For bonus points, do a formal statistical analysis.)

##### 1.2.5 Prefetch streams

In the traditional von Neumann model (section 1.1), each instruction contains the location of its operands, so a CPU implementing this model would make a separate request for each new operand. In practice, often subsequent data items are adjacent or regularly spaced in memory. The memory system can try to detect such data patterns by looking at cache miss points, and request a prefetch data stream.

In its simplest form, the CPU will detect that consecutive loads come from two consecutive cache lines, and au-tomatically issue a request for the next following cache line. This process can be repeated or extended if the code makes an actual request for that third cache line. Since these cache lines are now brought from memory well before they are needed, prefetch has the possibility of eliminating the latency for all but the first couple of data items.

The concept of cache miss now needs to be revisited a little. From a performance point of view we are only interested in stalls on cache misses, that is, the case where the computation has to wait for the data to be brought in. Data that is not in cache, but can be brought in while other instructions are still being processed, is not a problem. If an ‘L1 miss’ is understood to be only a ‘stall on miss’, then the term ‘L1 cache refill’ is used to describe all cacheline loads, whether the processor is stalling on them or not.

Since prefetch is controlled by the hardware, it is also described as hardware prefetch. Prefetch streams can some-times be controlled from software, though often it takes assembly code to do so.

##### 1.2.6 Memory banks

Above, we discussed issues relating to bandwdith. You saw that memory, and to a lesser extent caches, have a bandwidth that is less than what a processor can maximally absorb. The situation is actually even worse than the above discussion made it seem. For this reason, memory is often divided into memory banks that are interleaved: with four memory banks, words 0, 4, 8 . . . are in bank 0, words 1, 5, 9 . . . are in bank 1, et cetera.

Suppose we now access memory sequentially, then such 4-way interleaved memory can sustain four times the bandwidth of a single memory bank. Unfortunately, accessing by stride 2 will halve the bandwidth, and larger strides are even worse. In practice the number of memory banks will be higher, so that strided memory access with small strides will still have the full advertised bandwidth.

This concept of banks can also apply to caches. For instance, the cache lines in the L1 cache of the AMD Barcelona chip are 16 words long, divided into two interleaved banks of 8 words. This means that sequential access to the elements of a cache line is efficient, but strided access suffers from a deteriorated performance.

##### 1.2.7 TLB and virtual memory

All of a program’s data may not be in memory simultaneously. This can happen for a number of reasons:

• The computer serves multiple users, so the memory is not dedicated to any one user;
• The computer is running multiple programs, which together need more than the physically available memory;
• One single program can use more data than the available memory.

For this reason, computers use Virtual memory: if more memory is needed than is available, certain blocks of memory are written to disc. In effect, the disc acts as an extension of the real memory. This means that a block of data can be anywhere in memory, and in fact, if it is swapped in and out, it can be in different locations at different times. Swapping does not act on individual memory locations, but rather on memory pages: contiguous blocks of memory, from a few kilobytes to megabytes in size. (In an earlier generation of operating systems, moving memory to disc was a programmer’s responsibility. Pages that would replace each other were called overlays.)

For this reason, we need a translation mechanism from the memory addresses that the program uses to the actual addresses in memory, and this translation has to be dynamic. A program has a ‘logical data space’ (typically starting from address zero) of the addresses used in the compiled code, and this needs to be translated during program execution to actual memory addresses. For this reason, there is a page table that specifies which memory pages contain which logical pages.

However, address translation by lookup in this table is slow, so CPUs have a Translation Look-aside Buffer (TLB). The TLB is a cache of frequently used Page Table Entries: it provides fast address translation for a number of pages. If a program needs a memory location, the TLB is consulted to see whether this location is in fact on a page that is remembered in the TLB. If this is the case, the logical address is translated to a physical one; this is a very fast process. The case where the page is not remembered in the TLB is called a TLB miss, and the page lookup table is then consulted, if necessary bringing the needed page into memory. The TLB is (sometimes fully) associative (section 1.2.4.6), using an LRU policy (section 1.2.4.2).

A typical TLB has between 64 and 512 entries. If a program accesses data sequentially, it will typically alternate between just a few pages, and there will be no TLB misses. On the other hand, a program that access many random memory locations can experience a slowdown because of such misses.

Section 1.5.4 and appendix D.5 discuss some simple code illustrating the behaviour of the TLB.

[There are some complications to this story. For instance, there is usually more than one TLB. The first one is associated with the L2 cache, the second one with the L1. In the AMD Opteron, the L1 TLB has 48 entries, and is is fully (48-way) associative, while the L2 TLB has 512 entries, but is only 4-way associative. This means that there can actually be TLB conflicts. In the discussion above, we have only talked about the L2 TLB. The reason that this can be associated with the L2 cache, rather than with main memory, is that the translation from memory to L2 cache is deterministic.]

#### 1.4 Locality and data reuse

By now it should be clear that there is more to the execution of an algorithm than counting the operations: the data
transfer involved is important, and can in fact dominate the cost. Since we have caches and registers, the amount of data transfer can be minimized by programming in such a way that data stays as close to the processor as possible.
Partly this is a matter of programming cleverly, but we can also look at the theoretical question: does the algorithm
allow for it to begin with.

##### 1.4.1 Data reuse

In this section we will take a look at data reuse: are data items involved in a calculation used more than once, so
that caches and registers can be exploited? What precisely is the ratio between the number of operations and the
amount of data transferred.

We define the data reuse of an algorithm as follows:

If n is the number of data items that an algorithm operates on, and f(n) the number of operations it takes, then the reuse factor is f(n)/n.

Consider for example the vector addition

$\forall_i:x_i\leftarrow x_i+y_i$

This involves three memory accesses (two loads and one store) and one operation per iteration, giving a data reuse
of 1/3. The axpy (for 'a times x plus y') operation

$\forall_i:x_i\leftarrow x_i+a\cdot y$

has two operations, but the same number of memory access since the one-time load of a is amortized. It is therefore more efficient than the simple addition, with a reuse of 2/3.

The inner product calculation

$\forall_i:s\leftarrow s+x_i\cdot y_i$

is similar in structure to the axpy operation, involving one multiplication and addition per iteration, on two vectors and one scalar. However, now there are only two load operations, since s can be kept in register and only written
back to memory at the end of the loop. The reuse here is 1.

##### 1.4.2 Matrix-matrix product

Next, consider the matrix-matrix product:

$\forall_{i,j}:c_{ij}=\sum_{k}a_{ik}b_{kj}$.

This involves 3n2 data items and 2n3 operations, which is of a higher order. The data reuse is $O(n)$, meaning that every data item will be used $O(n)$ times. This has the implication that, with suitable programming, this operation has the potential of overcoming the bandwidth/clock speed gap by keeping data in fast cache memory.

Exercise 1.10. The matrix-matrix product, considered as operation, clearly has data reuse by the above definition. Argue that this reuse is not trivially attained by a simple implementation. What determines whether the naive implementation has reuse of data that is in cache?

[In this discussion we were only concerned with the number of operations of a given implementation, not the
mathematical operation. For instance, there are ways or performing the matrix-matrix multiplication and Gaussian
elimination algorithms in fewer than $O(n^3)$ operations. However, this requires a different implementation, which has its own analysis in terms of memory access and reuse.]

The matrix-matrix product is the heart of the 'LINPACK benchmark'. The benchmark may give an optimistic view of the performance of a computer: the matrix-matrix product is an operation that has considerable data reuse, so it is relatively insensitive to memory bandwidth and, for parallel computers, properties of the network. Typically, computers will attain 60–90% of their peak performance on the Linpack benchmark. Other benchmarks may give considerably lower figures.

##### 1.4.3 Locality

Since using data in cache is cheaper than getting data from main memory, a programmer obviously wants to code in such a way that data in cache is reused. While placing data in cache is not under explicit programmer control, even from assembly language, in most CPUs, it is still possible, knowing the behaviour of the caches, to know what data is in cache, and to some extent to control it.

The two crucial concepts here are temporal locality and spatial locality. Temporal locality is the easiest to explain: this describes the use of a data element within a short time of its last use. Since most caches have a LRU replacement policy (section 1.2.4.2), if in between the two references less data has been referenced than the cache size, the element will still be in cache and therefore quickly accessible. With other replacement policies, such as random replacement, this guarantee can not be made.

As an example, consider the repeated use of a long vector:

for (loop=0; loop<10; loop++) {
for (i=0; i<N; i++) {
... = ... x[i] ...
}
}

Each element of x will be used 10 times, but if the vector (plus other data accessed) exceeds the cache size, each element will be flushed before its next use. Therefore, the use of x[i] does not exhibit temporal locality: subsequent uses are spaced too far apart in time for it to remain in cache.

If the structure of the computation allows us to exchange the loops:

for (i=0; i<N; i++) {  for (loop=0; loop<10; loop++) {    ... = ... x[i] ...  }}

the elements of x are now repeatedly reused, and are therefore more likely to remain in the cache. This rearranged code displays better temporal locality in its use of x[I].

The concept of spatial locality is slightly more involved. A program is said to exhibit spatial locality if it references memory that is ‘close’ to memory it already referenced. In the classical von Neumann architecture with only a processor and memory, spatial locality should be irrelevant, since one address in memory can be as quickly retrieved as any other. However, in a modern CPU with caches, the story is different. Above, you have seen two examples of spatial locality:

• Since data is moved in cache lines rather than individual words or bytes, there is a great benefit to coding in such a manner that all elements of the cacheline are used. In the loop
for (i=0; i<N*s; i+=s) {   ... x[i] ...}
spatial locality is a decreasing function of the stride s.

Let S be the cacheline size, then as s ranges from 1 . . . S, the number of elements used of each cacheline goes down from S to 1. Relatively speaking, this increases the cost of memory traffic in the loop: if s = 1, we load 1/S cachelines per element; if s = S, we load one cacheline for each element. This effect is demonstrated in section 1.5.3.
• A second example of spatial locality worth observing involves the TLB (section 1.2.7). If a program references elements that are close together, they are likely on the same memory page, and address translation will be fast. On the other hand, if a program references many widely disparate elements, it will also be referencing many different pages. The resulting TLB misses are very costly; see also section 1.5.4.

Let us examine locality issues for a realistic example. The matrix-matrix multiplicatiom $C \leftarrow C + A \cdot B$ can be computed in several ways. We compare two implementations, assuming that all matrices are stored by rows, and that the cache size is insufficient to store a whole row or column.

for i=1..n  for j=1..n    for k=1..n      c[i,j] = c[i,j]               + a[i,k]*b[k,j]
for i=1..n  for k=1..n    for j=1..n      c[i,j] = c[i,j]               + a[i,k]*b[k,j]

Our first observation is that both implementations indeed compute $C \leftarrow C + A \cdot B$, and that they both take roughly 2n3 operations. However, their memory behaviour, including spatial and temporal locality is very different.

c[i,j]: In the first implementation, c[i,j] is invariant in the  inner iteration, which constitutes temporal locality, so it can be kept in register. As a result, each element of C will be loaded and stored only once. In the second implementation, c[i,j] will be loaded and stored in each inner iteration. In particular, this implies that there are now n3 store operations, a factor of n more than the first implementation.

a[i,k]: In both implementations, a[i,j] elements are  accessed by rows, so there is good spatial locality, as each loaded cacheline will be used entirely. In the second implementation, a[i,k] is invariant in the inner loop, which constitutes temporal locality; it can be kept in register. As a result, in the second case A will be loaded only once, as opposed to n times in the first case.

b[k,j]: The two implementations differ greatly in how they access the matrix B. First of all, b[k,j] is never invariant so it will not be kept in register, and B engenders n3 memory loads in both cases. However, the access patterns differ. In second case, b[k,j] is access by rows so there is good spatial locality: cachelines will be fully utilized after they are loaded. In the first implementation, b[k,j] is accessed by columns. Because of the row storage of the matrices, a cacheline contains a part of a row, so for each cacheline loaded, only one element is used in the columnwise  traversal. This means that the first implementation has more loads for B by a factor of the cacheline length.

Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs.

Exercise 1.11. Consider the following pseudocode of an algorithm for summing n numbers x[i] where n is a power of 2:

for s=2,4,8,...,n/2,n:  for i=0 to n-1 with steps s:    x[i] = x[i] + x[i+s/2]sum = x[0]

Analyze the spatial and temporal locality of this algorithm, and contrast it with the standard algorithm

sum = 0
for i=0,1,2,...,n-1
sum = sum + x[i]

Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf