Read section 1.3 to learn about multi-core chips. These two pages give a summary of processor and chip trends to overcome the challenge of increasing performance and addressing the heat problem of a single core.
1.3 Multi-core chips
In recent years, the limits of performance have been reached for the traditional processor chip design.
- Clock frequency cannot be increased further, since doing so increases energy consumption and heats the chips too much. Figure 1.3 gives a dramatic illustration of the heat that a chip would give off if single-processor trends had continued. The reason is that the power dissipation of a chip is proportional to the voltage squared times the frequency. Since voltage and frequency are proportional, that makes power proportional to the third power of the frequency.
- It is not possible to extract more instruction-level parallelism from codes, either because of compiler limitations, because of the limited amount of intrinsically available parallelism, or because branch prediction makes it impossible (see section 188.8.131.52).
One of the ways of getting a higher utilization out of a single processor chip is then to move from a strategy of further sophistication of the single processor, to a division of the chip into multiple processing ‘cores.’ The separate cores can work on unrelated tasks, or, by introducing what is in effect data parallelism (section 2.2.1), collaborate on a common task at a higher overall efficiency.
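The cubic relation between power and frequency stated above gives a rough sense of why several slower cores can be attractive. The following is a minimal sketch under idealized assumptions: power scales exactly as the cube of the frequency, static power is ignored, and the work divides perfectly over the cores. The function names `relative_power` and `multicore_power` are hypothetical and chosen only for this illustration.

```python
def relative_power(freq_ratio: float) -> float:
    """Power relative to a baseline core when its clock is scaled by freq_ratio.

    Assumes P ~ V^2 * f with V proportional to f, hence P ~ f^3.
    """
    return freq_ratio ** 3


def multicore_power(cores: int, freq_ratio: float) -> float:
    """Total relative power of `cores` identical cores, each scaled by freq_ratio."""
    return cores * relative_power(freq_ratio)


if __name__ == "__main__":
    # Doubling the clock of a single core costs eight times the power.
    print(relative_power(2.0))      # 8.0
    # Two cores at half the clock have the same nominal throughput
    # (2 * 0.5 = 1) at a quarter of the power, in this idealized model.
    print(multicore_power(2, 0.5))  # 0.25
```

In practice the gain is smaller, since not all work parallelizes and voltage cannot be lowered indefinitely, but the model shows the direction of the trade-off.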
While the first multi-core chips were simply two processors on the same die, later generations incorporated L2 caches that were shared between the two processor cores. This design makes it efficient for the cores to work jointly on the same problem. The cores would still have their own L1 cache, and these separate caches lead to a cache coherence problem; see section 1.3.1 below.
We note that the term ‘processor’ is now ambiguous: it can refer to either the chip, or the processor core on the chip. For this reason, we mostly talk about a socket for the whole chip and a core for the part containing one arithmetic and logic unit and having its own registers. Currently, CPUs with 4 or 6 cores are on the market and 8-core chips will be available shortly. The core count is likely to go up in the future: Intel has already shown an 80-core prototype that was developed into the 48-core ‘Single-chip Cloud Computer’, illustrated in figure 1.4. This chip has a structure with 24 dual-core ‘tiles’ that are connected through a 2D mesh network. Only certain tiles are connected to a memory controller; the others cannot reach memory other than through the on-chip network.
Figure 1.3: Projected heat dissipation of a CPU if trends had continued – this graph courtesy Pat Gelsinger
With this mix of shared and private caches, the programming model for multi-core processors is becoming a hybrid between shared and distributed memory:
Core: The cores have their own private L1 cache, which is a sort of distributed memory. The above-mentioned Intel 80-core prototype has the cores communicating in a distributed memory fashion.
Socket: On one socket, there is often a shared L2 cache, which is shared memory for the cores.
Node: There can be multiple sockets on a single ‘node’ or motherboard, accessing the same shared memory.
Network: Distributed memory programming (see the next chapter) is needed to let nodes communicate.
1.3.1 Cache coherence
With parallel processing, there is the potential for a conflict if more than one processor has a copy of the same data item. The problem of ensuring that all cached data are an accurate copy of main memory is referred to as cache coherence.
In distributed memory architectures, a dataset is usually partitioned disjointly over the processors, so conflicting copies of data can only arise with knowledge of the user, and it is up to the user to deal with the problem. The case of shared memory is more subtle: since processes access the same main memory, it would seem that conflicts are in fact impossible. However, processors typically have some private cache, which contains copies of data from memory, so conflicting copies can occur. This situation arises in particular in multi-core designs.
Figure 1.4: Structure of the Intel Single-chip Cloud Computer chip
Suppose that two cores have a copy of the same data item in their (private) L1 cache, and one modifies its copy. Now the other has cached data that is no longer an accurate copy of its counterpart, so it needs to reload that item. This will slow down the computation, and it wastes bandwidth to the core that could otherwise be used for loading or storing operands.
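The scenario just described can be imitated with a toy model. The sketch below assumes a write-through, write-invalidate policy: a write updates main memory immediately and removes the item from every other private cache. That is one simple choice, not necessarily what any given chip does. The class `ToySystem` and its methods are hypothetical names introduced for this illustration.

```python
class ToySystem:
    """Main memory plus one private cache per core (assumed write-invalidate)."""

    def __init__(self, memory, n_cores=2):
        self.memory = dict(memory)                   # address -> value
        self.caches = [{} for _ in range(n_cores)]   # one private cache per core
        self.memory_loads = 0                        # loads that reach main memory

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            # Cache miss: fetch the item from main memory.
            cache[addr] = self.memory[addr]
            self.memory_loads += 1
        return cache[addr]

    def write(self, core, addr, value):
        # Write-through: update the writer's cache and main memory.
        self.caches[core][addr] = value
        self.memory[addr] = value
        # Invalidate the copies held by all other cores.
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)


if __name__ == "__main__":
    system = ToySystem({"x": 1})
    system.read(0, "x")           # core 0 loads x from memory
    system.read(1, "x")           # core 1 loads x from memory
    system.write(0, "x", 5)       # core 0 modifies x; core 1's copy is dropped
    print(system.read(1, "x"))    # 5, after a reload from memory
    print(system.memory_loads)    # 3
```

The third memory load is the reload forced by the other core's write; without the invalidation, core 1 would have kept reading the stale value 1.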
The state of a cache line with respect to a data item in main memory is usually described as one of the following:
Scratch: the cache line does not contain a copy of the item;
Valid: the cache line is a correct copy of data in main memory;
Reserved: the cache line is the only copy of that piece of data;
Dirty: the cache line has been modified, but not yet written back to main memory;
Invalid: the data on the cache line is also present on other processors (it is not reserved), and another process has modified its copy of the data.
Exercise 1.9. Consider two processors, a data item x in memory, and cache lines x1, x2 in the private caches of the two processors to which x is mapped. Describe the transitions between the states of x1 and x2 under reads and writes of x on the two processors. Also indicate which actions cause memory bandwidth to be used. (This list of transitions is a Finite State Automaton (FSA); see section A.3.)
Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf
This work is licensed under a Creative Commons Attribution 3.0 License.