## The Von Neumann Architecture

Read this section to learn about sequential or Von Neumann computer architecture. Computer architecture is the high-level computer design comprising components that perform the functions of data storage, computations, data transfer, and control.

#### The Von Neumann architecture

While computers, and most relevantly for this chapter, their processors, can differ in any number of details, they also have many aspects in common. On a very high level of abstraction, many architectures can be described as von Neumann architectures. This describes a design with an undivided memory that stores both program and data (‘stored program’), and a processing unit that executes the instructions, operating on the data.

This setup distinguishes modern processors for the very earliest, and some special purpose contemporary, designs where the program was hard-wired. It also allows programs to modify themselves, since instructions and data are in the same storage. This allows us to have editors and compilers: the computer treats program code as data to operate on. In this book we will not explicitly discuss compilers, the programs that translate high level languages to machine instructions. However, on occasion we will discuss how a program at high level can be written to ensure efficiency at the low level.

In scientific computing, however, we typically do not pay much attention to program code, focusing almost exclusively on data and how it is moved about during program execution. For most practical purposes it is as if program and data are stored separately. The little that is essential about instruction handling can be described as follows.

The machine instructions that a processor executes, as opposed to the higher level languages users write in, typically specify the name of an operation, as well as of the locations of the operands and the result. These locations are not expressed as memory locations, but as registers: a small number of named memory locations that are part of the CPU. As an example, here is a simple C routine

void store(double *a, double *b, double *c) {  *c = *a + *b;}

and its X86 assembler output, obtained by

gcc -O2 -S -o - store.c
  .text  .p2align 4,,15.globl store  .type store, @functionstore:  movsd (%rdi), %xmm0 # Load *a to %xmm0  addsd (%rsi), %xmm0 # Load *b and add to %xmm0  movsd %xmm0, (%rdx) # Store to *c  ret

The instructions here are:

• A load from memory to register;
• Writing back the result to memory.
Each instruction is processed as follows:

• Instruction fetch: the next instruction according to the program counter is loaded into the processor. We will ignore the questions of how and from where this happens.
• Instruction decode: the processor inspects the instruction to determine the operation and the operands.
• Memory fetch: if necessary, data is brought from memory into a register.
• Execution: the operation is executed, reading data from registers and writing it back to a register.
• Write-back: for store operations, the register contents is written back to memory.

Complicating this story, contemporary CPUs operate on several instructions simultaneously, which are said to be ‘in flight’, meaning that they are in various stages of completion. This is the basic idea of the superscalar CPU architecture, and is also referred to as instruction-level parallelism. Thus, while each instruction can take several clock cycles to complete, a processor can complete one instruction per cycle in favourable circumstances; in some cases more than one instruction can be finished per cycle.

The main statistic that is quoted about CPUs is their Gigahertz rating, implying that the speed of the processor is the main determining factor of a computer’s performance. While speed obviously correlates with performance, the story is more complicated. Some algorithms are cpu-bound, and the speed of the processor is indeed the most important factor; other algorithms are memory-bound, and aspects such as bus speed and cache size become important.

In scientific computing, this second category is in fact quite prominent, so in this chapter we will devote plenty of attention to the process that moves data from memory to the processor, and we will devote relatively little attention to the actual processor.

##### Floating point units

Many modern processors are capable of doing multiple operations simultaneously, and this holds in particular for the arithmetic part. For instance, often there are separate addition and multiplication units; if the compiler can find addition and multiplication operations that are independent, it can schedule them so as to be executed simultaneously, thereby doubling the performance of the processor. In some cases, a processor will have multiple addition or multiplication units.

Another way to increase performance is to have a ‘fused multiply-add’ unit, which can execute the instruction $x \leftarrow ax+b$

in the same amount of time as a separate addition or multiplication. Together with pipelining (see below), this means that a processor has an asymptotic speed of several floating point operations per clock cycle.
 Processor floating point units max operations per cycle Pentium4, Opteron 2 add or 2 mul 2 Woodcrest, Barcelona 2 add + 2 mul 4 IBM POWER4, POWER5, POWER6 2 FMA 4 IBM BG/L, BG/P 1 SIMD FMA 4 SPARC IV 1 add + 1 mul 2 Itanium2 2 FMA 4

Table 1.1: Floating point capabilities of several current processor architectures

Pipelining

The floating point add and multiply units of a processor are pipelined, which has the effect that a stream of inde-pendent operations can be performed at an asymptotic speed of one result per clock cycle.

The idea behind a pipeline is as follows. Assume that an operation consists of multiple simpler operations, and that for each suboperation there is separate hardware in the processor. For instance, an multiply instruction can have the following components:

• Decoding the instruction, including finding the locations of the operands. Copying the operands into registers (‘data fetch’).
• Aligning the exponents; the multiplication $.3 \times 10^{-1} \times .2 \times 10^2$ becomes $.0003 \times 10^2 \times .2 \times 10^2$.
• Executing the multiplication, in this case giving $.00006 \times 10^0$.
• Normalizing the result, in this example to $.6 \times 10^0$.
• Storing the result.

These parts are often called the ‘stages’ or ‘segments’ of the pipeline.

If every component is designed to finish in 1 clock cycle, the whole instruction takes 6 cycles. However, if each has its own hardware, we can execute two multiplications in less than 12 cycles:

• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• Et cetera.

You see that the first operation still takes 6 clock cycles, but the second one is finished a mere 1 cycle later. This idea can be extended to more than two operations: the first operation still takes the same amount of time as before, but after that one more result will be produced each cycle. Formally, executing n operations on a s-segment pipeline takes s + n - 1 cycles.

Exercise 1.1. Let us compare the speed of a classical floating point unit, and a pipelined one. If the pipeline has s stages, what is the asymptotic speedup? That is, with T0(n) the time for n operations on a classical CPU, and Ts(n) the time for n operations on an s-segment pipeline, what is $\lim_{n\rightarrow \infty}(T_0(n)/T_s(n))$?

Next you can wonder how long it takes to get close to the asymptotic behaviour. Define Ss(n) as the speedup achieved on n operations. The quantity n1/2 is defined as the value of n such that Ss(n) is half the asymptotic speedup. Give an expression for n1/2.

Since a vector processor works on a number of instructions simultaneously, these instructions have to be independent. The operation $\forall_i: a_i \leftarrow b_i +c_i$ has independent additions; the operation $\forall_i: a_{i+1} \leftarrow a_ib_i+c_i$ feeds the result of one iteration $(a_i)$ to the input of the next $(a_{i+1} = ...)$so the operations are not independent.

[A pipelined processor can speed up operations by a factor of 4; 5; 6 with respect to earlier CPUs. Such numbers were typical in the 1980s when the first successful vector computers came on the market. These days, CPUs can have 20-stage pipelines. Does that mean they are incredibly fast? This question is a bit complicated. Chip designers continue to increase the clock rate, and the pipeline segments can no longer finish their work in one cycle, so they are further spit up. Sometimes there are even segments in which nothing happens: that time is needed to make sure data can travel to a different part of the chip in time.]

Figure 1.1: Schematic depiction of a pipelined operation

The amount of improvement you can get from a pipelined CPU is limited, so in a quest for ever higher performance several variations on the pipeline design have been tried. For instance, the Cyber 205 had separate addition and multiplication pipelines, and it was possible to feed one pipe into the next without data going back to memory first. Operations like $\forall_i: a_i \leftarrow b_i + c \cdot d_i$ were called ‘linked triads’ (because of the number of paths to memory, one input operand had to be scalar).

Another way to increase performance is to have multiple identical pipes. This design was perfected by the NEC SX series. With, for instance, 4 pipes, the operation \[\forall_i: a_i \leftarrow b_i + c_i\) would be split module 4, so that the first pipe operated on indices $i = 4 \cdot j$, the second on $i = 4 \cdot j +1$, et cetera.

Exercise 1.3. Analyze the speedup and n1/2 of a processor with multiple pipelines that operate in parallel. That is, suppose that there are p independent pipelines, executing the same instruction, that can each handle a stream of operands.

(The reason we are mentioning some fairly old computers here is that true pipeline supercomputers hardly exist anymore. In the US, the Cray X1 was the last of that line, and in Japan only NEC still makes them. However, the functional units of a CPU these days are pipelined, so the notion is still important.)

Exercise 1.4. The operation

for (i) {  x[i+1] = a[i]*x[i] + b[i];}

can not be handled by a pipeline or SIMD processor because there is a dependency between input of one iteration of the operation and the output of the previous. However, you can transform the loop into one that is mathematically equivalent, and potentially more efficient to compute. Derive an expression that computes x[i+2] from x[i] without involving x[i+1]. This is known as recursive doubling. Assume you have plenty of temporary storage. You can now perform the calculation by

• Doing some preliminary calculations;
• computing x[i],x[i+2],x[i+4],..., and from these,
• compute the missing terms x[i+1],x[i+3],....

Analyze the efficiency of this scheme by giving formulas for T0(n) and Ts(n). Can you think of an argument why the preliminary calculations may be of lesser importance in some circumstances?

Peak performance

For marketing purposes, it may be desirable to define a ‘top speed’ for a CPU. Since a pipelined floating point unit can yield one result per cycle asymptotically, you would calculate the theoretical peak performance as the product of the clock speed (in ticks per second), number of floating point units, and the number of cores (see section 1.3). This top speed is unobtainable in practice, and very few codes come even close to it; see section 2.11. Later in this chapter you will learn the reasons that it is so hard to get this perfect performance.

Pipelining beyond arithmetic: instruction-level parallelism

In fact, nowadays, the whole CPU is pipelined. Not only floating point operations, but any sort of instruction will be put in the instruction pipeline as soon as possible. Note that this pipeline is no longer limited to identical instructions: the notion of pipeline is now generalized to any stream of partially executed instructions that are simultaneously “in flight”.

This concept is also known as instruction-level parallelism, and it is facilitated by various mechanisms:

• multiple-issue: instructions that are independent can be started at the same time;
• pipelining: already mentioned, arithmetic units can deal with multiple operations in various stages of completion;
• branch prediction and speculative execution: a compiler can ‘guess’ whether a conditional instruction will evaluate to true, and execute those instructions accordingly;
• out-of-order execution: instructions can be rearranged if they are not dependent on each other, and if the resulting execution will be more efficient;
• prefetching: data can be speculatively requested before any instruction needing it is actually encountered (this is discussed further in section 1.2.5).

As clock frequency has gone up, the processor pipeline has grown in length to make the segments executable in less time. You have already seen that longer pipelines have a larger n1/2, so more independent instructions are needed to make the pipeline run at full efficiency. As the limits to instruction-level parallelism are reached, making pipelines longer (sometimes called ‘deeper’) no longer pays off. This is generally seen as the reason that chip designers have moved to multi-core architectures as a way of more efficiently using the transistors on a chip; section 1.3.

There is a second problem with these longer pipelines: if the code comes to a branch point (a conditional or the test in a loop), it is not clear what the next instruction to execute is. At that point the pipeline can stall. CPUs have taken to speculative execution’ for instance, by always assuming that the test will turn out true. If the code then takes the other branch (this is called a branch misprediction), the pipeline has to be cleared and restarted. The resulting delay in the execution stream is called the branch penalty.

8-bit, 16-bit, 32-bit, 64-bit

Processors are often characterized in terms of how big a chunk of data they can process as a unit. This can relate to

• The width of the path between processor and memory: can a 64-bit floating point number be loaded in one cycle, or does it arrive in pieces at the processor.
• The way memory is addressed: if addresses are limited to 16 bits, only 64,000 bytes can be identified. Early PCs had a complicated scheme with segments to get around this limitation: an address was specified with a segment number and an offset inside the segment.
• The number of bits in a register, in particular the size of the integer registers which manipulate data address; see the previous point. (Floating point register are often larger, for instance 80 bits in the x86 architecture.) This also corresponds to the size of a chunk of data that a processor can operate on simul-taneously.
• The size of a floating point number. If the arithmetic unit of a CPU is designed to multiply 8-byte numbers efficiently (‘double precision’; see section 3.2) then numbers half that size (‘single precision’) can some-times be processed at higher efficiency, and for larger numbers (‘quadruple precision’) some complicated scheme is needed. For instance, a quad precision number could be emulated by two double precision numbers with a fixed difference between the exponents.

These measurements are not necessarily identical. For instance, the original Pentium processor had 64-bit data busses, but a 32-bit processor. On the other hand, the Motorola 68000 processor (of the original Apple Macintosh) had a 32-bit CPU, but 16-bit data busses.

The first Intel microprocessor, the 4004, was a 4-bit processor in the sense that it processed 4 bit chunks. These days, processors are 32-bit, and 64-bit is becoming more popular.

Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf