## Programming Strategies for High Performance

Read section 1.5, which discusses the relationship of pipelining and caching to programming.

#### 1.5 Programming strategies for high performance

In this section we will look at how different ways of programming can influence the performance of a code. This will only be an introduction to the topic; for further discussion see the book by Goedeker and Hoisie [42].

The full listings of the codes and explanations of the data graphed here can be found in appendix D. All performance results were obtained on the AMD processors of the Ranger computer [10].

##### 1.5.1 Pipelining

In section 1.1.1.1 you learned that the floating point units in a modern CPU are pipelined, and that pipelines require a number of independent operations to function efficiently. The typical pipelineable operation is a vector addition; an example of an operation that can not be pipelined is the inner product accumulation

for (i=0; i<N; i++)  s += a[i]*b[i]

The fact that s gets both read and written halts the addition pipeline. One way to fill the floating point pipeline is to apply loop unrolling:

for (i = 0; i < N/2-1; i ++) {  sum1 += a[2*i] * b[2*i];  sum2 += a[2*i+1] * b[2*i+1];}

With a little indexing optimization this becomes:

for (i = 0; i < N/2-1; i ++) {  sum1 += *(a + 0) * *(b + 0);  sum2 += *(a + 1) * *(b + 1);
  a += 2; b += 2;}

A first observation about this code is that we are implicitly using associativity and commutativity of addition: while the same quantities are added, they are now in effect added in a different order. As you will see in chapter 3, in computer arithmetic this is not guaranteed to give the exact same result.

In a further optimization, we disentangle the addition and multiplication part of each instruction. The hope is that while the accumulation is waiting for the result of the multiplication, the intervening instructions will keep the processor busy, in effect increasing the number of operations per second.

for (i = 0; i < N/2-1; i ++) {  temp1 = *(a + 0) * *(b + 0);  temp2 = *(a + 1) * *(b + 1);
  sum1 += temp1; sum2 += temp2;
  a += 2; b += 2;}

Finally, we realize that the furthest we can move the addition away from the multiplication, is to put it right in front of the multiplication of the next iteration:

for (i = 0; i < N/2-1; i ++) {
  sum1 += temp1;				  temp1 = *(a + 0) * *(b + 0);
  sum2 += temp2;
temp2 = *(a + 1) * *(b + 1);

a += 2; b += 2;

}
s = temp1 + temp2;

Of course, we can unroll the operation by more than a factor of two. While we expect an increased performance, large unroll factors need large numbers of registers. Asking for more registers than a CPU has is called register spill, and it will decrease performance.

Another thing to keep in mind is that the total number of operations is unlikely to be divisible by the unroll factor. This requires cleanup code after the loop to account for the final iterations. Thus, unrolled code is harder to write than straight code, and people have written tools to perform such source-to-source transformations automatically.

Cycle times for unrolling the inner product operation up to six times are given in table 1.2. Note that the timings do not show a monotone behaviour at the unrolling by four. This sort of variation is due to various memory-related factors.

 1 2 3 4 5 6 6794 507 340 359 334 528

Table 1.2: Cycle times for the inner product operation, unrolled up to six times

##### 1.5.2 Cache size

Above, you learned that data from L1 can be moved with lower latency and higher bandwidth than from L2, and L2 is again faster than L3 or memory. This is easy to demonstrate with code that repeatedly access the same data:

for (i=0; i<NRUNS; i++)  for (j=0; j<size; j++)    array[j] = 2.3*array[j]+1.2;

If the size parameter allows the array to fit in cache, the operation will be relatively fast. As the size of the dataset grows, parts of it will evict other parts from the L1 cache, so the speed of the operation will be determined by the latency and bandwidth of the L2 cache. This can be seen in figure 1.5. The full code is given in section D.2.

Exercise 1.12. Argue that with a LRU replacement policy (section 1.2.4.2) essentially all data in the L1 will be replaced in every iteration of the outer loop. Can you write an example code that will let some of the L1 data stay resident?

Often, it is possible to arrange the operations to keep data in L1 cache. For instance, in our example, we could write

for (i=0; i<NRUNS; i++) {  blockstart = 0;  for (b=0; b<size/l1size; b++)    for (j=0; j<l1size; j++)      array[blockstart+j] = 2.3*array[blockstart+j]+1.2;

assuming that the L1 size divides evenly in the dataset size. This strategy is called cache blocking or blocking for cache reuse.

Figure 1.5: Average cycle count per operation as function of the dataset size

##### 1.5.3 Cache lines

Since data is moved from memory to cache in consecutive chunks named cachelines (see section 1.2.4.3), code that does not utilize all data in a cacheline pays a bandwidth penalty. This is born out by a simple code

for (i=0,n=0; i<L1WORDS; i++,n+=stride)  array[n] = 2.3*array[n]+1.2;

Here, a fixed number of operations is performed, but on elements that are at distance stride. As this stride increases, we expect an increasing runtime, which is born out by the graph in figure 1.6.

The graph also shows a decreasing reuse of cachelines, defined as the number of vector elements divided by the number of L1 misses (on stall; see section 1.2.5).

The full code is given in section D.3.

##### 1.5.4 TLB

As explained in section 1.2.7, the Translation Look-aside Buffer (TLB) maintains a list of currently in use memory pages, and addressing data that is located on one of these pages is much faster than data that is not. Consequently, one wants to code in such a way that the number of pages in use is kept low.

Figure 1.6: Run time in kcycles and L1 reuse as a function of stride

Consider code for traversing the elements of a two-dimensional array in two different ways.

#define INDEX(i,j,m,n) i+j*m  array = (double*) malloc(m*n*sizeof(double));
  /* traversal #1 */  for (j=0; j<n; j++)    for (i=0; i<m; i++)      array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;
  /* traversal #2 */    for (i=0; i<m; i++)      for (j=0; j<n; j++)        array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;

The results (see Appendix D.5 for the source code) are plotted in figures 1.8 and 1.7.

Using m = 1000 means that, on the which has pages of 512 doubles, we need roughly two pages for each column. We run this example, plotting the number ‘TLB misses’, that is, the number of times a page is referenced that is not recorded in the TLB.

1. In the first traversal this is indeed what happens. After we touch an element, and the TLB records the page it is on, all other elements on that page are used subsequently, so no further TLB misses occur. Figure 1.8 shows that, with increasing n, the number of TLB misses per column is roughly two.
2. In the second traversal, we touch a new page for every element of the first row. Elements of the second row will be on these pages, so, as long as the number of columns is less than the number of TLB entries, these pages will still be recorded in the TLB. As the number of columns grows, the number of TLB increases, and ultimately there will be one TLB miss for each element access. Figure 1.7 shows that, with a large enough number of columns, the number of TLB misses per column is equal to the number of elements per column.

Figure 1.7: Number of TLB misses per column as function of the number of columns; columnwise traversal of the array.

Figure 1.8: Number of TLB misses per column as function of the number of columns; rowwise traversal of the array.

##### 1.5.5 Cache associativity

There are many algorithms that work by recursive division of a problem, for instance the Fast Fourier Transform (FFT) algorithm. As a result, code for such algorithms will often operate on vectors whose length is a power of two. Unfortunately, this can cause conflicts with certain architectural features of a CPU, many of which involve powers of two.

Consider the operation of adding a small number of vectors

$\forall_j: y_j = y_j + \sum_{i=1}^{m} x_{i,j}$

If the length of the vectors $y, x_i$ is precisely the right (or rather, wrong) number, $y_j$ and $x_{i,j}$ will all be mapped to the same location in cache. As an example we take the AMD , which has an L1 cache of 64K bytes, and which is  two-way set associative. Because of the set associativity, the cache can handle two addresses being mapped to the same cache location, but not three of more. Thus, we let the vectors be of size n = 4096 doubles, and we measure the effect in cache misses and cycles of letting m = 1, 2 . . ..

First of all, we note that we use the vectors sequentially, so, with a cacheline of eight doubles, we should ideally see a cache miss rate of 1/8 times the number of vectors m. Instead, in figure 1.9 we see a rate approximately proportional to m, meaning that indeed cache lines are evicted immediately. The exception here is the case  = 1, where the two-way associativity allows the cachelines of two vectors to stay in cache.

Compare this to figure 1.10, where we used a slightly longer vector length, so that locations with the same j are no longer mapped to the same cache location. As a result, we see a cache miss rate around 1/8, and a smaller number of cycles, corresponding to a complete reuse of the cache lines.

Two remarks: the cache miss numbers are in fact lower than the theory predicts, since the processor will use prefetch streams. Secondly, in figure 1.10 we see a decreasing time with increasing m; this is probably due to a progressively more favourable balance between load and store operations. Store operations are more expensive than loads, for various reasons.

Figure 1.9: The number of L1 cache misses and the number of cycles for each j column accumulation, vector length 4096

Figure 1.10: The number of L1 cache misses and the number of cycles for each j column accumulation, vector length 4096 + 8

##### 1.5.6 Loop tiling
for (n=0; n<10; n++)  for (i=0; i<100000; i++)    x[i] = ...
for (b=0; b<100; b++)  for (n=0; n<10; n++)    for (i=b*1000; i<(b+1)*1000; i++)      x[i] = ...
##### 1.5.7 Case study: Matrix-vector product

Let us consider in some detail the matrix-vector product

$\forall_{i,j}: y_i \leftarrow a_{i j} \cdot x_j$

This involves $2n^2$ operations on $n^2 + 2n$ data items, so reuse is $O(1)$ : memory accesses and operations are of the same order. However, we note that there is a double loop involved, and the $x, y$vectors have only a single index, so each element in them is used multiple times.

Exploiting this theoretical reuse is not trivial. In

/* variant 1 */for (i)  for (j)    y[i] = y[i] + a[i][j] * x[j];

the element y[i] seems to be reused. However, the statement as given here would write y[i] to memory in every inner iteration, and we have to write the loop as

/* variant 2 */for (i) {  s = 0;  for (j)    s = s + a[i][j] * x[j];  y[i] = s;}

to ensure reuse. This variant uses $2n^2$ loads and n stores.

This code fragment only exploits the reuse of y explicitly. If the cache is too small to hold the whole vector x plus a column of a, each element of a is still repeatedly loaded in every outer iteration.

Reversing the loops as

/* variant 3 */for (j)  for (i)    y[i] = y[i] + a[i][j] * x[j];

exposes the reuse of x, especially if we write this as

/* variant 3 */for (j) {  t = x[j];  for (i)    y[i] = y[i] + a[i][j] * t;}

but now y is no longer reused. Moreover, we now have $2n^2 + n$ loads, comparable to variant 2, but $n^2$ stores, which is of a higher order.

It is possible to get reuse both of x and y, but this requires more sophisticated programming. The key here is split the loops into blocks. For instance:

for (i=0; i<M; i+=2) {  s1 = s2 = 0;  for (j) {    s1 = s1 + a[i][j] * x[j];    s2 = s2 + a[i+1][j] * x[j];  }  y[i] = s1; y[i+1] = s2;}

This is also called loop unrolling, loop tiling, or strip mining. The amount by which you unroll loops is determined by the number of available registers.

##### 1.5.8 Optimization strategies

Figure 1.11: Performance of naive and optimized implementations of the Discrete Fourier Transform

Figure 1.12: Performance of naive and optimized implementations of the matrix-matrix product

Figures 1.11 and 1.12 show that there can be wide discrepancy between the performance of naive implementations of an operation (sometimes called the ‘reference implementation’), and optimized implementations. Unfortunately, optimized implementations are not simple to find. For one, since they rely on blocking, their loop nests are dou-ble the normal depth: the matrix-matrix multiplication becomes a six-deep loop. Then, the optimal block size is dependent on factors like the target architecture.

We make the following observations:

• Compilers are not able to extract anywhere close to optimal performance8.
• There are autotuning projects for automatic generation of implementations that are tuned to the architec-ture. This approach can be moderately to very successful. Some of the best known of these projects are Atlas [84] for Blas kernels, and Spiral [73] for transforms.

##### 1.5.9 Cache aware programming

Unlike registers and main memory, both of which can be addressed in (assembly) code, use of caches is implicit. There is no way a programmer can load data explicitly to a certain cache, even in assembly language.

However, it is possible to code in a ‘cache aware’ manner. Suppose a piece of code repeatedly operates on an amount of data that less data than the cache size. We can assume that the first time the data is accessed, it is brought into cache; the next time it is accessed it will already be in cache. On the other hand, if the amount of data is more than the cache size, it will partly or fully be flushed out of cache in the process of accessing it.

We can experimentally demonstrate this phenomenon. With a very accurate counter, the code fragment

for (x=0; x<NX; x++)  for (i=0; i<N; i++)    a[i] = sqrt(a[i]);

will take time linear in N up to the point where a fills the cache. An easier way to picture this is to compute a normalized time, essentially a time per execution of the inner loop:

t = time();for (x=0; x<NX; x++)  for (i=0; i<N; i++)    a[i] = sqrt(a[i]);t = time()-t;t_normalized = t/(N*NX);

The normalized time will be constant until the array a fills the cache, then increase and eventually level off again.

The explanation is that, as long as a[0]...a[N-1] fit in L1 cache, the inner loop will use data from the L1 cache. Speed of access is then determined by the latency and bandwidth of the L1 cache. As the amount of data grows beyond the L1 cache size, some or all of the data will be flushed from the L1, and performance will be determined by the characteristics of the L2 cache. Letting the amount of data grow even further, performance will again drop to a linear behaviour determined by the bandwidth from main memory.

##### 1.5.10 Arrays and programming languages

In section B.10.1.1 you can find a discussion of the different ways arrays are stored in C/C++ and Fortran. These storage modes have some ramifications on performance. Both from a point of cache line usage (section 1.5.3) and prevention of TLB misses (section 1.5.4) it is best to traverse a multi-dimensional array so as to access sequential memory locations, rather than strided. This means that

• In Fortran you want to loop through an array first by columns (that is, in the inner loop), then by rows (in the outer loop);
• In C/C++ you want to loop first through rows (inner loop), then through columns (outer loop).

Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf