Read section 2.6 to learn about network topologies. If a task cannot be performed by a computer with one processor, we decompose the task into subtasks, which will be allocated to multiple hardware devices, say processors or arithmetic units or memory. These multiple hardware devices need to communicate so that the original task can be done with acceptable cost and performance. The hardware devices and their interconnections form a network.
Consider another situation: suppose we have a large software system made up of a large number of subsystems, in turn composed of many software components, subcomponents, etc. Suppose we list all the lowest level subcomponent's names across the top of a sheet of paper. These will be our column headings. Also, let's list down the side of the same sheet the same subcomponent names. These will be our row headings. This forms a table or two by two matrix. Finally, suppose we put a 1 in the table whenever or wherever there is a connection between the subcomponent named in the column heading and the subcomponent named in the row heading. Let's put a 0 everywhere else. This table now represents a topology for our network of software components; it could also be done for hardware components. These components and their interconnections are part of the software architecture. Realize that the matrix could be huge: 100 by 100, 1000 by 1000, and so on. The complexity of the interconnections is a factor in the reliability and performance of the architecture.
Read section 2.7, which reviews Amdahl's law, an extension of Amdahl's law that includes communication overhead, and Gustafson's law. These laws express expected performance as the number of processors increases, or as both the size of the problem and number of processors increases.
Then, read section 2.9 to learn about load balancing. This section looks at the situation where a processor is idle and another is busy, which is referred to as a load imbalance. If the work were to be distributed differently among the processors, then the idle time might be able to be eliminated. In this case, the load balance problem is explained as a graph problem.
If a number of processors are working together on a single task, most likely they need to communicate data. For this reason there needs to be a way for data to make it from any processor to any other. In this section we will discuss some of the possible schemes to connect the processors in a parallel machine.
In order to get an appreciation for the fact that there is a genuine problem here, consider two simple schemes that do not ‘scale up’:
In this section we will see a number of schemes that can be increased to large numbers of processors.
The network that connects the processors in a parallel computer can conveniently be described with some elementary graph theory concepts. We describe the parallel machine with a graph where each processor is a node, and two nodes are connected if there is a direct connection between them.
We can then analyze two important concepts of this graph.
First of all, the degree of a node in a graph is the number of other nodes it is connected to. With the nodes representing processors, and the edges the wires, it is clear that a high degree is not just desirable for efficiency of computing, but also costly from an engineering point of view. We assume that all processors have the same degree.
Secondly, a message traveling from one processor to another, through one or more intermediate nodes, will most incur some delay at each intermediate node. For this reason, the diameter of the graph is important. The diameter is defined as the maximum shortest distance, counting numbers of wires, between any two processors:
d(G) = maxij | shortest path between i and j | :
If d is the diameter, and if sending a message over one wire takes unit time (more about this in the next section), this means a message will always arrive in at most time d.
Exercise 2.4. Find a relation between the number of processors, their degree, and the diameter of the connectivity graph.
In addition to the question ‘how long will a message from processor A to processor B take’, we often worry about conflicts between two simultaneous messages: is there a possibility that two messages, under way at the same time, will need to use the same network link? This sort of conflict is called congestion or contention. Clearly, the more links the graph of a parallel comupter has, the smaller the chance of congestion.
A precise way to describe the likelihood of congestion, is to look at the bisection width . This is defined as the minimum number of links that have to be removed to partition the processor graph into two unconnected graphs. For instance, consider processors connected as a linear array, that is, processor Pi is connected to Pi-1 and Pi+1. In this case the bisection width is 1.
The bisection width w describes how many messages can, guaranteed, be under way simultaneously in a parallel computer. Proof: take w sending and w receiving processors. The w paths thus defined are disjoint: if they were not, we could separate the processors into two groups by removing only w - 1 links.
In practice, of course, more than w messages can be under way simultaneously. For instance, in a linear array, which has w = 1, P/2 messages can be sent and received simultaneously if all communication is between neighbours, and if a processor can only send or receive, but not both, at any one time. If processors can both send and receive simultaneously, P messages can be under way in the network.
Bisection width also describes redundancy in a network: if one or more connections are malfunctioning, can a message still find its way from sender to receiver?
Exercise 2.5. What is the diameter of a 3D cube of processors? What is the bisection width? How does that change if you add wraparound torus connections?
While bisection width is a measure express as a number of wires, in practice we care about the capacity through those wires. The relevant concept here is bisection bandwidth : the bandwidth across the bisection width, which is the product of the bisection width, and the capacity (in bits per second) of the wires. Bisection bandwidth can be considered as a measure for the bandwidth that can be attained if an arbitrary half of the processors communicates with the other half. Bisection bandwidth is a more realistic measure than the aggregate bandwidth which is some-times quoted: it is defined as the total data rate if every processor is sending: the number of processors times the bandwidth of a connection times the number of simultaneous sends a processor can perform. This can be quite a high number, and it is typically not representative of the communication rate that is achieved in actual applications.
A simple way to hook up multiple processors is to connect them in a linear array: every processor has a number i,
and processor Pi is connected to Pi-1 and Pi+1. The first and last processor are possible exceptions: if they are connected to each other, we call the architecture a ring network .
This solution requires each processor to have two network connections, so the design is fairly simple.
Exercise 2.6. What is the bisection width of a linear array? Of a ring?
Exercise 2.7. With the limited connections of a linear array, you may have to be clever about how to program parallel algorithms. For instance, consider a ‘broadcast’ operation: processor 0 has a data item that needs to be sent to every other processor.
We make the following simplifying assumptions:
In a fully connected network you can simply write
for i = 1 . . . N - 1:
send the message to processor i
in a fully connected network this means that the operation is done in one step.
Now consider a linear array. Show that, even with this unlimited capacity for sending, the above algorithm runs into trouble because of congestion.
Find a better way to organize the send operations. Hint: pretend that your processors are connected as a binary tree. Assume that there are N = 2n processors. Show that the broadcast can be done in log N stages, and that processors only need to be able to send a single message simultaneously.
This exercise is an example of embedding a ‘logical’ communication pattern in a physical one.
A popular design for parallel computers is to organize the processors in a two-dimensional or three-dimensional cartesian mesh . This means that every processor has a coordinate (i, j) or (i, j, k), and it is connected to its neigh-bours in all coordinate directions. The processor design is still fairly simple: the number of network connections (the degree of the connectivity graph) is twice the number of space dimensions (2 or 3) of the network.
It is a fairly natural idea to have 2D or 3D networks, since the world around us is three-dimensional, and computers are often used to model real-life phenomena. If we accept for now that the physical model requires nearest neighbour type communications (which we will see is the case in section 4.2.3), then a mesh computer is a natural candidate for running physics simulations.
Exercise 2.8. Analyze the diameter and bisection width of 2D and 3D meshes and toruses.
Exercise 2.9. Your parallel computer has its processors organized in a 2D grid. The chip manufacturer comes out with a new chip with same clock speed that is dual core instead of single core, and that will fit in the existing sockets. Critique the following argument: ”the amount work per second that can be done (that does not involve communication) doubles; since the network stays the same, the bisection bandwidth also stays the same, so I can reasonably expect my new machine to become twice as fast.”
Above we gave a hand-waving argument for the suitability of mesh-organized processors, based on the prevalence of nearest neighbour communications. However, sometimes sends and receives between arbitrary processors occur. One example of this is the above-mentioned broadcast. For this reason, it is desirable to have a network with a smaller diameter than a mesh. On the other hand we want to avoid the complicated design of a fully connected network.
A good intermediate solution is the hypercube design. An n-dimensional hypercube computer has 2n processors, with each processor connected to one other in each dimension; see figure 2.6. The nodes of a hypercube are num-bered by bit patterns as in figure 2.7.
Figure 2.6: Hypercubes
Figure 2.7: Numbering of the nodes of a hypercube
An easy way to describe this is to give each processor an address consisting of d bits. A processor is then connected to all others that have an address that differs by exactly one bit.
The big advantages of a hypercube design are the small diameter and large capacity for traffic through the network.
Exercise 2.10. Diameter? Bisection width?
One disadvantage is the fact that the processor design is dependent on the total machine size. In practice, processors will be designed with a maximum number of possible connections, and someone buying a smaller machine then will be paying for unused capacity. Another disadvantage is the fact that extending a given machine can only be done by doubling it: other sizes than 2p are not possible.
Exercise 2.11. Consider the parallel summing example of section 2.1, and give the execution time of a parallel implementation, first on a linear array, then on a hypercube. Assume that sending one number from a processor to a neighbour takes time and that a floating point operation takes time . Show that on a linear array the algorithm gives at best a factor speedup over the sequential algorithm; show that the theoretical speedup from the example is attained (up to a factor) for the implementation on a hypercube.
22.214.171.124 Embedding grids in a hypercube
Above we made the argument that mesh-connected processors are a logical choice for many applications that model physical phenomena. How is that for hypercubes? The answer is that a hyercube has enough connections that it can simply pretend to be a mesh by ignoring certain connections. However, we can not use the obvious numbering of nodes as in figure 2.7. For instance, node 1 is directly connected to node 0, but has a distance of 2 to node 2. The left neighbour of node 0 in a ring, node 7, even has the maximum distance of 3. To explain how we can embed a mesh in a hypercube, we first show that it’s possible to walk through a hypercube, touching every corner exactly once.
The basic concept here is a (binary reflected) Gray code . This is a way of ordering the binary numbers 0 . . . 2d - 1 as g0, . . . g2d-1 so that gi and gi+1 differ in only one bit. Clearly, the ordinary binary numbers do not satisfy this: the binary representations for 1 and 2 already differ in two bits. Why do Gray codes help us? Well, since gi and gi+1 differ only in one bit, it means they are the numbers of nodes in the hypercube that are directly connected.
Figure 2.8 illustrates how to construct a Gray code. The procedure is recursive, and can be described informally as ‘divide the cube into two subcubes, number the one subcube, cross over to the other subcube, and number its nodes in the reverse order of the first one’.
Since a Gray code offers us a way to embed a one-dimensional ‘mesh’ into a hypercube, we can now work our way up.
Exercise 2.12. Show how a square mesh of 22d nodes can be embedded in a hypercube by appending the bit patterns of the embeddings of two 2d node cubes. How would you accomodate a mesh of 2d1+d2 nodes? A three-dimensional mesh of 2d1+ d2 + d3 nodes?
Above, we briefly discussed fully connected processors. They are impractical if the connection is made by making a large number of wires between all the processors. There is another possibility, however, by connecting all the processors to a switch or switching network. Some popular network designs are the crossbar butterfly exchange and the fat tree .
Figure 2.8: Gray codes
Switching networks are made out of switching elements, each of which have a small number (up to about a dozen) of inbound and outbound links. By hooking all processors up to some switching element, and having multiple stages of switching, it then becomes possible to connect any two processors by a path through the network.
126.96.36.199 Cross bar
Figure 2.9: A simple cross bar connecting 6 inputs to 6 outputs
The simplest switching network is a cross bar, an arrangement of n horizontal and vertical lines, with a switch element on each intersection that determines whether the lines are connected; see figure 2.9. If we designate the horizontal lines as inputs the vertical as outputs, this is clearly a way of having n inputs be mapped to n outputs. Every combination of inputs and outputs (sometimes called a ‘permutation’) is allowed.
188.8.131.52 Butterfly exchange
Butterfly exchanges are typically built out of small switching elements, and they have multiple stages; as the number of processors grows, the number of stages grows with it. As you can see in figure 2.11, butterfly exchanges allow several processors to access memory simultaneously. Also, their access times are identical, see exchange networks are a way of implementing a UMA architecture; see section 2.3.1.
Figure 2.10: A butterfly exchange network for two and four processors/memories
Exercise 2.13. For both the simple cross bar and the butterfly exchange, the network needs to be expanded as the number of processors grows. Give the number of wires (of some unit length) and the number of switching elements that is needed in both cases to connect n processors and memories. What is the time that a data packet needs to go from memory to processor, expressed in the unit time that it takes to traverse a unit length of wire and the time to traverse a switching element?
If we were to connect switching nodes like a tree, there would be a big problem with congestion close to the root since there are only two wires attached to the root note. Say we have a k-level tree, so there are 2k leaf nodes. If all leaf nodes in the left subtree try to communicate with nodes in the right subtree, we have 2k-1 messages going through just one wire into the root, and similarly out through one wire. A fat-tree is a tree network where each level has the same total bandwidth, so that this congestion problem does not occur: the root will actually have 2k-1 incoming and outgoing wires attached.
The first successful computer architecture based on a fat-tree was the Connection Machines CM5.
In fat-trees, as in other switching networks, each message carries its own routing information. Since in a fat-tree the choices are limited to going up a level, or switching to the other subtree at the current level, a message needs to carry only as many bits routing information as there are levels, which is log2 n for n processors.
The theoretical exposition of fat-trees in  shows that fat-trees are optimal in some sense: it can deliver messages as fast (up to logarithmic factors) as any other network that takes the same amount of space to build. The underlying assumption of this statement is that switches closer to the root have to connect more wires, therefore take more components, and correspondingly are larger.
This argument, while theoretically interesting, is of no practical significance, as the physical size of the network hardly plays a role in the biggest currently available computers that use fat-tree interconnect. For instance, in the Ranger supercomputer of The University of Texas at Austin, the fat-tree switch connects 60,000 processors, yet takes less than 10 percent of the floor space.
A fat tree, as sketched above, would be costly to build, since for every next level a new, bigger, switch would have to be designed. In practice, therefore, a network with the characteristics of a fat-tree is constructed from simple switching elements; see figure 2.12. This network is equivalent in its bandwidth and routing possibilities to a fat-tree. Routing algorithms will be slightly more complicated: in a fat-tree, a data packet can go up in only one way, but here a packet has to know to which of the two higher switches to route.
This type of switching network is one case of a Clos network .
The statement above that sending a message can be considered a unit time operation, is of course unrealistic. A large message will take longer to transmit than a short one. There are two concepts to arrive at a more realistic description of the transmission process; we have already seen this in section 1.2.2 in the context of transferring data between cache levels of a processor.
Latency: Setting up a communication between two processors takes an amount of time that is independent of the message size. The time that this takes is known as the latency of a message. There are various causes for this delay.
Bandwidth: After a transmission between two processors has been set up, the main number of interest is the number of bytes per second that can go through the channel. This is known as the bandwidth . The bandwidth can usually be determined by the channel rate, the rate at which a physical link can deliver bits, and the channel width , the number of physical wires in a link. The channel width is typically a multiple of 16, usually 64 or 128. This is also expressed by saying that a channel can send one or two 8-byte words simultaneously.
Bandwidth and latency are formalized in the expression
for the transmission time of an n-byte message. Here, is the time per byte, that is, the inverse of bandwidth.is the latency and
There are two important reasons for using a parallel computer: to have access to more memory or to obtain higher performance. It is easy to characterize the gain in memory, as the total memory is the sum of the individual memo-ries. The speed of a parallel computer is harder to characterize. A simple approach is to let the same program run on a single processor, and on a parallel machine with p processors, and to compare runtimes.
With T1 the execution time on a single processor and Tp the time on p processors, we define the speedup as Sp = T1/Tp. (Sometimes T1 is defined as ‘the best time to solve the problem on a single processor’, which allows for using a different algorithm on a single processor than in parallel.) In the ideal case, Tp = T1/p, but in practice we don’t expect to attain that, so. To measure how far we are from the ideal speedup, we introduce the efficiency . Clearly, .
There is a practical problem with this definition: a problem that can be solved on a parallel machine may be too large to fit on any single processor. Conversely, distributing a single processor problem over many processors may give a distorted picture
since very little data will wind up on each processor.
There are various reasons why the actual speed is less than P . For one, using more than one processors necessitates communication, which is overhead. Secondly, if the processors do not have exactly the same amount of work to do, they may be idle part of the time, again lowering the actually attained speedup. Finally, code may have sections that are inherently sequential.
Communication between processors is an important source of a loss of efficiency. Clearly, a problem that can be solved without communication will be very efficient. Such problems, in effect consisting of a number of completely independent calculations, is called embarassingly parallel ; it will have close to a perfect speedup and efficiency.
Exercise 2.14. The case of speedup larger than the number of processors is called superlinear speedup.
Give a theoretical argument why this can never happen.
In practice, superlinear speedup can happen. For instance, suppose a problem is too large to fit in memory, and a single processor can only solve it by swapping data to disc. If the same problem fits in the memory of two processors, the speedup may well be larger than 2 since disc swapping no longer occurs. Having less, or more localized, data may also improve the cache behaviour of a code.
One reason for less than perfect speedup is that parts of a code can be inherently sequential. This limits the parallel efficiency as follows. Suppose that 5% of a code is sequential, then the time for that part can not be reduced, no matter how many processors are available. Thus, the speedup on that code is limited to a factor of 20. This phenomenon is known as Amdahl’s Law , which we will now formulate.
Let Fp be the parallel fraction and Fs be the parallel fraction (or more strictly: the ‘parallelizable’ fraction) of a code, respectively. Then Fp + Fs = 1. The parallel execution time Tp is the sum of the part that is sequential T1Fs and the part that can be parallelized T1Fs/p:
As the number of processors grows . We conclude that speedup is limited by and efficiency is a decreasing function ., the parallel execution time now approaches that of the sequential fraction of the code:
The sequential fraction of a code can consist of things such as I/O operations. However, there are also parts of a code that in effect act as sequential. Consider a program that executes a single loop, where all iterations can be computed independently. Clearly, this code is easily parallelized. However, by splitting the loop in a number of parts, one per processor, each processor now has to deal with loop overhead: calculation of bounds, and the test for completion. This overhead is replicated as many times as there are processors. In effect, loop overhead acts as a sequential part of the code.
In practice, many codes do not have significant sequential parts, and overhead is not important enough to affect parallelization adversely. One reason for this is discussed in the next section.
Exercise 2.15. Investigate the implications of Amdahls’s law: if the number of processors P increases, how does the parallel fraction of a code have to increase to maintain a fixed efficiency?
In a way, Amdahl’s law, sobering as it is, is even optimistic. Parallelizing a code will give a certain speedup, but it also introduces communication overhead that will lower the speedup attained. Let us refine our model of equation (2.2) (see [62, p. 367]):
where Tc is a fixed communication time.
To assess the influence of this communication overhead, we assume that the code is fully parallelizable, that is, fp = 1. We then find that
For this to be close to p, we need In other words, the number of processors should not grow beyond the ratio of scalar execution time and communication overhead.or .
Above, we remarked that splitting a given problem over more and more processors does not make sense: at a certain point there is just not enough work for each processor to operate efficiently. Instead, in practice, users of a parallel code will either choose the number of processors to match the problem size, or they will solve a series of increasingly larger problems on correspondingly growing numbers of processors. In both cases it is hard to talk about speedup. Instead, the concept of scalability is used.
We distinguish two types of scalability. So-called strong scalability is in effect the same as speedup, discussed above. We say that a program shows strong scalability if, partitioned over more and more processors, it shows perfect or near perfect speedup. Typically, one encounters statements like ‘this problem scales up to 500 processors’, meaning that up to 500 processors the speedup will not noticeably decrease from optimal.
More interesting, weak scalability is a more vaguely defined term. It describes that, as problem size and number of processors grow in such a way that the amount of data per processor stays constant, the speed in operations per second of each processor also stays constant. This measure is somewhat hard to report, since the relation between the number of operations and the amount of data can be complicated. If this relation is linear, one could state that the amount of data per processor is kept constant, and report that parallel execution time is constant as the number of processors grows.
Scalability depends on the way an algorithm is parallelized, in particular on the way data is distributed. In section 6.3 you will find an analysis of the matrix-vector product operation: distributing a matrix by block rows turns out not to be scalable, but a two-dimensional distribution by submatrices is.
Amdahl’s law describes speedup in the strong scaling sense discussed above. Gustafson’s law is an attempt to formalize weak scaling. Let the computation be normalized again so that Fp + Fs = 1, and assume that we keep the amount of parallel work per processor constant. This means that the total amount of work performed by p processors is now Fs + pFp, and the corresponding speedup formula is
This is a linearly decreasing function of FS, rather than the 1/FS function as before.
In much of this chapter, we assumed that a problem could be perfectly divided over processors, that is, a processor would always be performing useful work, and only be idle because of latency in communication. In practice, however, a processor may be idle because it is waiting for a message, and the sending processor has not even reached the send instruction in its code. Such a situation, where one processor is working and another is idle, I described as load unbalance: there is no intrinsic reason for the one processor to be idle, and it could have been working if we had distributed the work load differently.
Let us consider the case of a job that can be partitioned into independent tasks, for instance computing the pixels of a Mandelbrot set picture, where each pixel is set according to a mathematical function that does not depend on surrounding pixels. If we could predict the time it would take to draw an arbitrary part of the picture, we could make a perfect division of the work and assign it to the processors. This is known as static load balancing.
More realistically, we can not predict the running time of a part of the job perfectly, and we use an over-decomposition of the work: we divide the work in more tasks than there are processors. These tasks are then assigned to a work pool , and processors take the next job from the pool whenever they finish a job. This is known as dynamic load balancing. Many graph and combinatorial problems can be approached this way.
A parallel computation can be formulated as a graph (see Appendix A.6 for an introduction to graph theory) where the processors are the vertices, and there is an edge between two vertices if their processors need to communicate at some point. Such a graph is often derived from an underlying graph of the problem being solved. Let us consider for example the matrix-vector product y = Ax where A is a sparse matrix, and look in detail at the processor that is computing yi for some i. The statementimplies that this processor will need the value xj, so, if this variable is on a different processor, it needs to be sent over.
We can formalize this: Let the vectors x and y be distributed disjointly over the processors, and define uniquely P(i) as the processor that owns index i. Then there is an edge (P, Q) if there is a nonzero element aij with P = P(i) and Q = P(j). This graph is undirected if the matrix is structurally symmetric, that is.
The distribution of indices over the processors now gives us vertex and edge weights: a processor has a vertex weight that is the number of indices owned by it; an edge (P, Q) has a weight that is the number of vector components that need to be sent from Q to P , as described above.
The load balancing problem can now be formulated as follows:
Find a partitioning , such the variation in vertex weights is minimal, and simulta-neously the edge weights are as low as possible.
Minimizing the variety in vertex weights implies that all processor have approximately the same amount of work. Keeping the edge weights low means that the amount of communication is low. These two objectives need not be satisfiable at the same time: some trade-off is likely.
Exercise 2.16. Consider the limit case where processors are infinitely fast and bandwidth between pro-cessors is also unlimited. What is the sole remaining factor determining the runtime? What graph problem do you need to solve now to find the optimal load balance? What property of a sparse matrix gives the worst case behaviour?
Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf
This work is licensed under a Creative Commons Attribution 3.0 License.