Parallel Programming

Read section 2.5, which covers two extreme approaches to parallel programming. First, parallelism is handled by the lower software and hardware layers. OpenMP is applicable in this first case. Secondly, parallelism is handled by the programmer. MPI is applicable in the second case.

2.5 Parallel programming

Parallel programming is more complicated than sequential programming. While for sequential programming most programming languages operate on similar principles (some exceptions such as functional or logic languages aside), there is a variety of ways of tackling parallelism. Let’s explore some of the concepts and practical aspects.

There are various approaches to parallel programming. One might say that they fall in two categories:

  • Let the programmer write a program that is essentially the same as a sequential program, and let the lower software and hardware layers solve the parallelism problem; and
  • Expose the parallelism to the programmer and let the programmer manage everything explicitly.

The first approach, which is more user-friendly, is in effect only possible with shared memory, where all processes have access to the same address space. We will discuss this in the section on OpenMP 2.5.1. The second approach is necessary in the case of distributed memory programming. We will have a general discussion of distributed programming in section 2.5.2; section 2.5.3 will discuss the MPI library.

2.5.1 OpenMP

OpenMP is an extension to the programming languages C and Fortran. Its main approach to parallelism is the parallel execution of loops: based on compiler directives, a preprocessor can schedule the parallel execution of the loop iterations.

The amount of parallelism is flexible: the user merely specifies a parallel region, indicating that all iterations are independent to some extent, and the runtime system will then use whatever resources are available. Because of this dynamic nature, and because no data distribution is specified, OpenMP can only work with threads on shared memory.

OpenMP is neither a language nor a library: it operates by inserting directives into source code, which are interpreted by the compiler. Many compilers, such as GCC or the Intel compiler, support the OpenMP extensions. In Fortran, OpenMP directives are placed in comment statements; in C, they are placed in #pragma CPP directives, which indicate compiler specific extensions. As a result, OpenMP code still looks like legal C or Fortran to a compiler that does not support OpenMP. Programs need to be linked to an OpenMP runtime library, and their behaviour can be controlled through environment variables.

OpenMP features dynamic parallelism: the number of execution streams operating in parallel can vary from one part of the code to another.

For more information about OpenMP, see [21].

2.5.1.1 Processes and threads

OpenMP is based on ‘threads’ working in parallel; see section 2.5.1.3 for an illustrative example. A thread is an independent instruction stream, but as part of a Unix process. While processes can belong to different users, or be different programs that a single user is running concurrently, and therefore have their own data space, threads are part of one process and therefore share each other’s data. Threads do have a possibility of having private data, for instance, they have their own data stack.

Threads serve two functions:

  1. By having more than one thread on a single processor, a higher processor utilization can result, since the instructions of one thread can be processed while another thread is waiting for data.
  2. In a shared memory context, multiple threads running on multiple processors or processor cores can be an easy way to parallelize a process. The shared memory allows the threads to all see the same data.

2.5.1.2 Issues in shared memory programming

Shared memory makes life easy for the programmer, since every processor has access to all of the data: no explicit data traffic between the processor is needed. On the other hand, multiple processes/processors can also write to the same variable, which is a source of potential problems.

Suppose that two processes both try to increment an integer variable I by one:

process 1: I=I+2

process 2: I=I+3

If the processes are not completely synchronized, one will read the current value, compute the new value, write it back, and leave that value for the other processor to find. In this scenario, the parallel program has the same result (I=I+5) as if all instructions were executed sequentially.

However, it could also happen that both processes manage to read the current value simultaneously, compute their own new value, and write that back to the location of I. Even if the conflicting writes can be reconciled, the final result will be wrong: the new value will be either I+2 or I+3, not I+5. Moreover, it will be indeterminate, depending on details of the execution mechanism.

For this reason, such updates of a shared variable are called a critical section of code. OpenMP has a mechanism to declare a critical section, so that it will be executed by only one process at a time. One way of implementing this, is to set a temporary lock on certain memory areas. Another solution to the update problem, is to have atomic operations: the update would be implemented in such a way that a second process can not get hold of the data item being updated. One implementation of this is transactional memory, where the hardware itself supports atomic operations; the term derives from database transactions, which have a similar integrity problem.

Finally, we mention the semaphore mechanism for dealing with critical sections [26]. Surrounding each critical section there will be two atomic operations controlling a semaphore. The first process to encounter the semaphore will lower it, and start executing the critical section. Other processes see the lowered semaphore, and wait. When the first process finishes the critical section, it executes the second instruction which raises the semaphore, allowing one of the waiting processes to enter the critical section.

2.5.1.3 Threads example

The following example spawns a number of tasks that all update a global counter. Since threads share the same memory space, they indeed see and update the same memory location.

#include <stdlib.h>
#include <stdio.h>
#include "pthread.h"
int sum=0;
void adder() {
sum = sum+1;
return;
}
#define NTHREADS 50
int main() {
int i;
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
if (pthread_create(threads+i,NULL,&adder,NULL)!=0) return i+1; printf("joining\n");
for (i=0; i<NTHREADS; i++)
    if (pthread_join(threads[i],NULL)!=0) return NTHREADS+i+1; printf("Sum computed: %d\n",sum);
  return 0;
}

The fact that this code gives the right result is a coincidence: it only happens because updating the variable is so much quicker than creating the thread. (On a multicore processor the chance of errors will greatly increase.) If we artificially increase the time for the update, we will no longer get the right result:

void adder() {
int t = sum; sleep(1); sum = t+1;
return;
}

Now all threads read out the value of sum, wait a while (presumably calculating something) and then update.

This can be fixed by having a lock on the code region that should be ‘mutually exclusive’:

pthread_mutex_t lock;
void adder() {
int t,r;
pthread_mutex_lock(&lock);
t = sum; sleep(1); sum = t+1;
pthread_mutex_unlock(&lock);
return;
}
int main() {
....
pthread_mutex_init(&lock,NULL);

The lock and unlock commands guarantee that no two threads can interfere with each other’s update.

2.5.1.4 OpenMP examples

The simplest example of OpenMP use is the parallel loop.

#pragma omp for
for (i=0; i<ProblemSize; i++) {
a[i] = b[i];
}

Clearly, all iterations can be executed independently and in any order. The pragma CPP directive then conveys this fact to the compiler.

Some loops are fully parallel conceptually, but not in implementation:

for (i=0; i<ProblemSize; i++) {
t = b[i]*b[i];
a[i] = sin(t) + cos(t);
}

Here it looks as if each iteration writes to, and reads from, a shared variable t. However, t is really a temporary variable, local to each iteration. OpenMP indicates that as follows:

#pragma parallel for shared(a,b), private(t)
for (i=0; i<ProblemSize; i++) {
t = b[i]*b[i];
a[i] = sin(t) + cos(t);
}

If a scalar is indeed shared, OpenMP has various mechanisms for dealing with that. For instance, shared variables commonly occur in reduction operations:

  s = 0;
#pragma parallel for reduction(+:sum)
for (i=0; i<ProblemSize; i++) {
s = s + a[i]*b[i];
}


Figure 2.4: Static or round-robin (left) vs dynamic (right) thread scheduling; the task numbers are indicated.

As you see, a sequential code can be easily parallelized this way.

The assignment of iterations to threads is done by the runtime system, but the user can guide this assignment. We are mostly concerned with the case where there are more iterations than threads: if there are P threads and N iterations and N > P, how is iteration i going to be assigned to a thread?

The simplest assignment uses round-robin scheduling, a static scheduling strategy where thread p get iterations p, p + N, p + 2N, . . .. This has the advantage that if some data is reused between iterations, it will stay in the data cache of the processor executing that thread. On the other hand, if the iterations differ in the amount of work involved, the process may suffer from load unbalance with static scheduling. In that case, a dynamic scheduling strategy would work better, where each thread starts work on the next unprocessed iteration as soon as it finishes its current iteration. The example in figure 2.4 shows static versus dynamic scheduling of a number of tasks that gradually decrease in individual running time. In static scheduling, the first thread gets tasks 1 and 4, the second 2 and 5, et cetera. In dynamic scheduling, any thread that finishes its task gets the next task. This clearly gives a better running time in this particular example.

2.5.2 The global versus the local view in distributed programming

There can be a marked difference between how a parallel algorithm looks to an observer, and how it is actually programmed. Consider the case where we have an array of processors \left \{ P_i \right \}_{i=0..p-1}, each containing one element of the arrays x and y, and Pi computes

\left\{\begin{matrix}y_i \leftarrow y_i + x_{i-1}\; \; \; i>0 \\ y_i \; \; \; \mathrm{unchanged}\; \; \; \; i = 0 \end{matrix}\right.

The global description of this could be

  • Every processor Pi except the last sends its x element to Pi+1;
  • every processor Pi except the first receive an x element from their neighbour Pi-1, and
  • they add it to their y element.

However, in general we can not code in these global terms. In the SPMD model (section 2.2.2) each processor executes the same code, and the overall algorithm is the result of these individual behaviours. The local program has access only to local data – everything else needs to be communicated with send and receive operations – and the processor knows its own number. 

A naive attempt at the processor code would look like:

  • If I am processor 0, do nothing; otherwise
  • receive an x element from my left neighbour, add it to my y element, and
  • send my x element to my right neighbour, unless I am the last processor.
This code is correct, but it is also inefficient: processor i + 1 does not start working until processor i is done. In other words, the parallel algorithm is in effect sequential, and offers no speedup.

We can easily make our algorithm fully parallel, by changing the processor code to:

  • If am not the last processor, send my x element to the right.
  • If am not the first processor, receive an x element from the left, and
  • add it to my y element.

Now all sends, receives, and additions can happen in parallel.

There can still be a problem with this solution if the sends and receives are so-called blocking communication instructions: a send instruction does not finish until the sent item is actually received, and a receive instruction waits for the corresponding send. In this scenario:

All processors but the last start their send instruction, and consequently block, waiting for their neighbour to execute the corresponding receive.

The last processor is only one not sending, so it executes its receive instruction.

The processor one-before-last now completes its send, and progresses to its receive; Which allows its left neighbour to complete its send, et cetera.

You see that again the parallel execution becomes serialized.

Exercise 2.1. Suggest a way to make the algorithm parallel, even with blocking operations. (Hint: you won’t be able to get all processors active at the same time.)

If the algorithm in equation 2.1 had been cyclic:

\left\{\begin{matrix} y_i \leftarrow y_i + x_{x-1}\; \; \; \; i=1...n-1 \\ y_0 \leftarrow y_0 + x_{n-1} \; \; \; \; \; \;\;\;\ i=0 \end{matrix}\right.

the problem would be even worse. Now the last processor can not start its receive since it is blocked sending xn 1 to processor 0. This situation, where the program can not progress because every processor is waiting for another, is called deadlock.

The reason for blocking instructions is to prevent accumulation of data in the network. If a send instruction were to complete before the corresponding receive started, the network would have to store the data somewhere in the mean time. Consider a simple example:

buffer = ... ;	// generate some data
send(buffer,0); // send to processor 0
buffer = ... ; // generate more data
send(buffer,1); // send to processor 1

After the first send, we start overwriting the buffer. If the data in it hasn’t been received, the first set of values would have to be buffered somewhere in the network, which is not realistic. By having the send operation block, the data stays in the sender’s buffer until it is guaranteed to have been copied to the recipient’s buffer.

One way out of the problem of sequentialization or deadlock that arises from blocking instruction is the use of non-blocking communication instructions, which include explicit buffers for the data. With non-blocking send instruction, the user needs to allocate a buffer for each send, and check when it is safe to overwrite the buffer.

buffer0 = ... ; // data for processor 0
send(buffer0,0); // send to processor 0
buffer1 = ... ; // data for processor 1
send(buffer1,1); // send to processor 1
...
// wait for completion of all send operations.
2.5.3 MPI

If OpenMP is the way to program shared memory, MPI [77] is the standard solution for programming distributed memory. MPI (‘Message Passing Interface’) is a specification for a library interface for moving between processes that do not otherwise share data. The MPI routines can be divided roughly in the following categories:

  • Process management. This includes querying the parallel environment and constructing subsets of processors.
  • Point-to-point communication. This is a set of calls where two processes interact. These are mostly variants of the send and receive calls.
  • Collective calls. In these routines, all processors (or the whole of a specified subset) are involved. Exam-ples are the broadcast call, where one processor shares its data with every other processor, or the gather call, where one processor collects data from all participating processors.

Let us consider how the OpenMP examples can be coded in MPI. First of all, we no longer allocate

double a[ProblemSize];

but

double a[LocalProblemSize];

where the local size is roughly a 1/P fraction of the global size. (Practical considerations dictate whether you want this distribution to be as evenly as possible, or rather biased in some way.)

The parallel loop is trivially parallel, with the only difference that it now operates on a fraction of the arrays:

for (i=0; i<LocalProblemSize; i++) {
a[i] = b[i];
}

However, if the loop involves a calculation based on the iteration number, we need to map that to the global value:

for (i=0; i<LocalProblemSize; i++) {
a[i] = b[i]+f(i+MyFirstVariable);
}

(We will assume that each process has somehow calculated the values of LocalProblemSize and MyFirstVariable. Local variables are now automatically local, because each process has its own instance:

for (i=0; i<LocalProblemSize; i++) {
t = b[i]*b[i];
a[i] = sin(t) + cos(t);
}

However, shared variables are harder to implement. Since each process has its own data, the local accumulation has to be explicitly assembled:

for (i=0; i<LocalProblemSize; i++) {
s = s + a[i]*b[i];
}
MPI_Allreduce(s,globals,1,MPI_DOUBLE,MPI_SUM);

The ‘reduce’ operation sums together all local values s into a variable globals that receives an identical value on each processor. This is known as a collective operation.

Let us make the example slightly more complicated:

for (i=0; i<ProblemSize; i++) {
if (i==0)
a[i] = (b[i]+b[i+1])/2
else if (i==ProblemSize-1)
a[i] = (b[i]+b[i-1])/2
else
a[i] = (b[i]+b[i-1]+b[i+1])/3

The basic form of the parallel loop is:

for (i=0; i<LocalProblemSize; i++) {
bleft = b[i-1]; bright = b[i+1];
a[i] = (b[i]+bleft+bright)/3

First we account for the fact that bleft and bright need to be obtained from a different processor for i==0 (bleft), and for i==LocalProblemSize-1 (bright). We do this with a exchange operation with our left and right neighbour processor:

get bfromleft and bfromright from neighbour processors, then for (i=0; i<LocalProblemSize; i++) {
if (i==0) bleft=bfromleft;
     else bleft = b[i-1]
   if (i==LocalProblemSize-1) bright=bfromright;
     else bright = b[i+1];
a[i] = (b[i]+bleft+bright)/3

Obtaining the neighbour values is done as follows. First we need to ask our processor number, so that we can start a communication with the processor with a number one higher and lower.

MPI_Comm_rank(MPI_COMM_WORLD,&myTaskID); MPI_Sendrecv
(/* to be sent: */ &b[LocalProblemSize-1],
/* result: */ &bfromleft,
/* destination */ myTaskID+1, /* some parameters omited */ );
MPI_Sendrecv(&b[0],&bfromright,myTaskID-1 /* ... */ );

There are still two problems with this code. First, the sendrecv operations need exceptions for the first and last processors. This can be done elegantly as follows:

MPI_Comm_rank(MPI_COMM_WORLD,&myTaskID); MPI_Comm_size(MPI_COMM_WORLD,&nTasks);
if (myTaskID==0) leftproc = MPI_PROC_NULL;
  else leftproc = myTaskID-1;
if (myTaskID==nTasks-1) rightproc = MPI_PROC_NULL;
  else rightproc = myTaskID+1;
MPI_Sendrecv( &b[LocalProblemSize-1], &bfromleft, rightproc );
MPI_Sendrecv( &b[0], &bfromright, leftproc);

Exercise 2.2. There is still a problem left with this code: the boundary conditions from the original, global, version have not been taken into account. Give code that solves that problem.

MPI gets complicated if different processes need to take different actions, for example, if one needs to send data to another. The problem here is that each process executes the same executable, so it needs to contain both the send and the receive instruction, to be executed depending on what the rank of the process is.

if (myTaskID==0) {
MPI_Send(myInfo,1,MPI_INT,/* to: */ 1,/* labeled: */,0, MPI_COMM_WORLD);
} else {
MPI_Recv(myInfo,1,MPI_INT,/* from: */ 0,/* labeled: */,0,
/* not explained here: */&status,MPI_COMM_WORLD);
}

Although MPI is sometimes called the ‘assembly language of parallel programming’, for its perceived difficulty and level of explicitness, it not all that hard to learn, as evinced by the large number of scientific codes that use it. The main issues that make MPI somewhat intricate to use, are buffer management and blocking semantics.

These issues are related, and stem from the fact that, ideally, data should not be in two places at the same time. Let us briefly consider what happens if processor 1 sends data to processor 2. The safest strategy is for processor 1 to execute the send instruction, and then wait until processor 2 acknowledges that the data was successfully received. This means that processor 1 is temporarily blocked until processor 2 actually executes its receive instruction, and the data has made its way through the network. Alternatively, processor 1 could put its data in a buffer, tell the system to make sure that it gets sent at some point, and later checks to see that the buffer is safe to reuse. This second strategy is called non-blocking communication, and it requires the use of a temporary buffer.

2.5.3.1 Collective operations

In the above examples, you saw the MPI_Allreduce call, which computed a global sum and left the result on each processor. There is also a local version MPI_Reduce which computes the result only on one processor. These calls are examples of collective operations or collectives. The collectives are:

Reduction: each processor has a data item, and these items need to be combined arithmetically with an addition, multiplication, max, or min operation. The result can be left on one processor, or on all, in which case we call this an allreduce operation.

Broadcast: one processor has a data item that all processors need to receive.

Gather: each processor has a data item, and these items need to be collected in an array, without combining them in an operations such as an addition. The result can be left on one processor, or on all, in which case we call this an allgather.

Scatter: one processor has an array of data items, and each processor receives one element of that array.

All-to-all: each processor has an array of items, to be scattered to all other processors.

We will analyze the cost of collective operations in detail in section 6.3.1.

2.5.3.2 MPI version 1 and 2

The first MPI standard [68] had a number of notable omissions, which are included in the MPI 2 standard [48]. One of these concerned parallel input/output: there was no facility for multiple processes to access the same file, even if the underlying hardware would allow that. A separate project MPI-I/O has now been rolled into the MPI-2 standard.

A second facility missing in MPI, though it was present in PVM [28, 38] which predates MPI, is process manage-ment: there is no way to create new processes and have them be part of the parallel run.

Finally, MPI-2 has supported for one-sided communication: one process can do a send, without the receiving process actually doing a receive instruction.

2.5.3.3 Non-blocking communication

In a simple computer program, each instruction takes some time to execute, in a way that depends on what goes on in the processor. In parallel programs the situation is more complicated. A send operation, in its simplest form, declares that a certain buffer of data needs to be sent, and program execution will then stop until that buffer has been safely sent and received by another processor. This sort of operation is called a non-local operation since it depends on the actions of other processes, and a blocking communication operation since execution will halt until a certain event takes place.

Blocking operations have the disadvantage that they can lead to deadlock , if two processes wind up waiting for each other. Even without deadlock, they can lead to considerable idle time in the processors, as they wait without performing any useful work. On the other hand, they have the advantage that it is clear when the buffer can be reused: after the operation completes, there is a guarantee that the data has been safely received at the other end.

The blocking behaviour can be avoided, at the cost of complicating the buffer semantics, by using non-blocking operations. A non-blocking send (MPI_Isend) declares that a data buffer needs to be sent, but then does not wait for the completion of the corresponding receive. There is a second operation MPI_Wait that will actually block until the receive has been completed. The advantage of this decoupling of sending and blocking is that it now becomes possible to write:

ISend(somebuffer,&handle); // start sending, and
get a handle to this particular communication { ... } // do useful work on local data
Wait(handle); // block until the communication is completed; { ... } // do useful work on incoming data

With a little luck, the local operations take more time than the communication, and you have completely eliminated the communication time.

In addition to non-blocking sends, there are non-blocking receives. A typical piece of code then looks like

ISend(sendbuffer,&sendhandle);
IReceive(recvbuffer,&recvhandle);
{ ... } // do useful work on local data Wait(sendhandle); Wait(recvhandle);
{ ... }  // do useful work on incoming data

Exercise 2.3.  Go back to exercise 1 and give pseudocode that solves the problem using non-blocking sends and receives. What is the disadvantage of this code over a blocking solution?

2.5.4 Parallel languages

One approach to mitigating the difficulty of parallel programming is the design of languages that offer explicit support for parallelism. There are several approaches, and we will see some examples.

  • Some languages reflect the fact that many operations in scientific computing are data parallel (sec-tion 2.4.1). Languages such as High Performance Fortran (HPF) (section 2.5.4.4) have an array syntax, where operations such addition of arrays can be expressed A = B+C. This syntax simplifies program-ming, but more importantly, it specifies operations at an abstract level, so that a lower level can make specific decision about how to handle parallelism. However, the data parallelism expressed in HPF is only of the simplest sort, where the data are contained in regular arrays. Irregular data parallelism is harder; the Chapel language (section 2.5.4.6) makes an attempt at addressing this.
  • Another concept in parallel languages, not necessarily orthogonal to the previous, is that of Partitioned Global Address Space (PGAS) model: there is only one address space (unlike in the MPI model), but this address space is partitioned, and each partition has affinity with a thread or process. Thus, this model encompasses both SMP and distributed shared memory. The typical PGAS languages, Unified Parallel C (UPC), allows you to write programs that for the most part looks like regular C code. However, by indicating how the major arrays are distributed over processors, the program can be executed in parallel.

2.5.4.1 Discussion

Parallel languages hold the promise of making parallel programming easier, since they make communication oper-ations appear as simple copies or arithmetic operations. However, by doing so they invite the user to write code that may not be efficient, for instance by inducing many small messages.

As an example, consider arrays a,b that have been horizontally partitioned over the processors, and that are shifted:

for (i=0; i<N; i++)
for (j=0; j<N/np; j++)
a[i][j+joffset] = b[i][j+1+joffset]

If this code is executed on a shared memory machine, it will be efficient, but a naive translation in the distributed case will have a single number being communicated in each iteration of the i loop. Clearly, these can be combined in a single buffer send/receive operation, but compilers are usually unable to make this transformation. As a result, the user is forced to, in effect, re-implement the blocking that needs to be done in an MPI implementation:

for (i=0; i<N; i++)
t[i] = b[i][N/np+joffset]
for (i=0; i<N; i++)
for (j=0; j<N/np-1; j++) {
a[i][j] = b[i][j+1]
a[i][N/np] = t[i]

On the other hand, certain machines support direct memory copies through global memory hardware. In that case, PGAS languages can be more efficient than explicit message passing, even with physically distributed memory.

2.5.4.2 Unified Parallel C

Unified Parallel C (UPC) [11] is an extension to the C language. Its main source of parallelism is data parallelism, where the compiler discovers indepdence of operations on arrays, and assigns them to separate processors. The language has an extended array declaration, which allows the user to specify whether the array is partitioned by blocks, or in a round-robin fashion.

The following program in UPC performs a vector-vector addition.

//vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
int i;
for(i=MYTHREAD; i<N; i+=THREADS)
v1plusv2[i]=v1[i]+v2[i];
}

The same program with an explicitly parallel loop construct:

//vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}

2.5.4.3 Titanium

Titanium is comparable to UPC in spirit, but based on Java rather than on C.

2.5.4.4 High Performance Fortran

High Performance Fortran3 (HPF) is an extension of Fortran 90 with constructs that support parallel computing, published by the High Performance Fortran Forum (HPFF). The HPFF was convened and chaired by Ken Kennedy of Rice University. The first version of the HPF Report was published in 1993.

Building on the array syntax introduced in Fortran 90, HPF uses a data parallel model of computation to support spreading the work of a single array computation over multiple processors. This allows efficient implementation on both SIMD and MIMD style architectures. HPF features included:

  • New Fortran statements, such as FORALL, and the ability to create PURE (side effect free) procedures
  • Compiler directives for recommended distributions of array data
  • Extrinsic procedure interface for interfacing to non-HPF parallel procedures such as those using message passing
  • Additional library routines - including environmental inquiry, parallel prefix/suffix (e.g., ’scan’), data scattering, and sorting operations

Fortran 95 incorporated several HPF capabilities. While some vendors did incorporate HPF into their compilers in the 1990s, some aspects proved difficult to implement and of questionable use. Since then, most vendors and users have moved to OpenMP-based parallel processing. However, HPF continues to have influence. For example the proposed BIT data type for the upcoming Fortran-2008 standard contains a number of new intrinsic functions taken directly from HPF.

2.5.4.5 Co-array Fortran

Co-array Fortran (CAF) is an extension to the Fortran 95/2003 language. The main mechanism to support paral-lelism is an extension to the array declaration syntax, where an extra dimension indicates the parallel distribution. For instance,

real,allocatable,dimension(:,:,:)[:,:] :: A

declares an array that is three-dimensional on each processor, and that is distributed over a two-dimensional pro-cessor grid.

Communication between processors is now done through copies along the dimensions that describe the processor grid:

    COMMON/XCTILB4/ B(N,4)[*]
SAVE/XCTILB4/
C
CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) )
B(:,3) = B(:,1)[IMG_S]
B(:,4) = B(:,2)[IMG_N]
CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) )

The Fortran 2008 standard will include co-arrays.

2.5.4.6 Chapel

Chapel [1] is a new parallel programming language4 being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS). Chapel is designed to improve the productivity of high-end computer users while also serving as a portable parallel programming model that can be used on commodity clusters or desktop multicore systems. Chapel strives to vastly improve the programmability of large-scale parallel computers while matching or beating the performance and portability of current programming models like MPI.

Chapel supports a multithreaded execution model via high-level abstractions for data parallelism, task parallelism, concurrency, and nested parallelism. Chapel’s locale type enables users to specify and reason about the placement of data and tasks on a target architecture in order to tune for locality. Chapel supports global-view data aggregates with user-defined implementations, permitting operations on distributed data structures to be expressed in a natural manner. In contrast to many previous higher-level parallel languages, Chapel is designed around a multiresolution philosophy, permitting users to initially write very abstract code and then incrementally add more detail until they are as close to the machine as their needs require. Chapel supports code reuse and rapid prototyping via object-oriented design, type inference, and features for generic programming.

Chapel was designed from first principles rather than by extending an existing language. It is an imperative block-structured language, designed to be easy to learn for users of C, C++, Fortran, Java, Perl, Matlab, and other popular languages. While Chapel builds on concepts and syntax from many previous languages, its parallel features are most directly influenced by ZPL, High-Performance Fortran (HPF), and the Cray MTA’s extensions to C and Fortran.

Here is vector-vector addition in Chapel:

const BlockDist= newBlock1D(bbox=[1..m], tasksPerLocale=...);
const ProblemSpace: domain(1, 64)) distributed BlockDist = [1..m];
varA, B, C: [ProblemSpace] real;
forall(a, b, c) in(A, B, C) do
a = b + alpha * c;

2.5.4.7 Fortress

Fortress [7] is a programming language developed by Sun Microsystems. Fortress5 aims to make parallelism more tractable in several ways. First, parallelism is the default. This is intended to push tool design, library design, and programmer skills in the direction of parallelism. Second, the language is designed to be more friendly to paral-lelism. Side-effects are discouraged because side-effects require synchronization to avoid bugs. Fortress provides transactions, so that programmers are not faced with the task of determining lock orders, or tuning their locking code so that there is enough for correctness, but not so much that performance is impeded. The Fortress looping constructions, together with the library, turns ”iteration” inside out; instead of the loop specifying how the data is accessed, the data structures specify how the loop is run, and aggregate data structures are designed to break into large parts that can be effectively scheduled for parallel execution. Fortress also includes features from other languages intended to generally help productivity – test code and methods, tied to the code under test; contracts that can optionally be checked when the code is run; and properties, that might be too expensive to run, but can be fed to a theorem prover or model checker. In addition, Fortress includes safe-language features like checked array bounds, type checking, and garbage collection that have been proven-useful in Java. Fortress syntax is designed to resemble mathematical syntax as much as possible, so that anyone solving a problem with math in its specification can write a program that can be more easily related to its original specification.

2.5.4.8 X10

X10 is an experimental new language currently under development at IBM in collaboration with academic part-ners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the DARPA program on High Productivity Computer Systems. The PERCS project is focused on a hardware-software co-design methodology to integrate advances in chip technology, architecture, operating systems, compilers, pro-gramming language and programming tools to deliver new adaptable, scalable systems that will provide an order-of-magnitude improvement in development productivity for parallel applications by 2010.

X10 aims to contribute to this productivity improvement by developing a new programming model, combined with a new set of tools integrated into Eclipse and new implementation techniques for delivering optimized scalable parallelism in a managed runtime environment. X10 is a type-safe, modern, parallel, distributed object-oriented language intended to be very easily accessible to Java(TM) programmers. It is targeted to future low-end and high-end systems with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in scalable cluster configurations. A member of the Partitioned Global Address Space (PGAS) family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; constructs for termination detection (finish) and phased computation (clocks); the use of lock-free synchronization (atomic blocks); and the manipulation of global arrays and data structures.

2.5.4.9 Linda

As should be clear by now, the treatment of data is by far the most important aspect of parallel programing, far more important than algorithmic considerations. The programming system Linda [39], also called a coordination language, is designed to address the data handling explicitly. Linda is not a language as such, but can, and has been, incorporated into other languages.

The basic concept of Linda is tuple space: data is added to a pool of globally accessible information by adding a label to it. Processes then retrieve data by their label, and without needing to know which processes added them to the tuple space.

Linda is aimed primarily at a different computation model than is relevant for High-Performance Computing (HPC): it addresses the needs of asynchronous communicating processes. However, is has been used for scientific compu-tation [25]. For instance, in parallel simulations of the heat equation (section 4.3), processors can write their data into tuple space, and neighbouring processes can retrieve their ghost region without having to know its provenance. Thus, Linda becomes one way of implementing .

2.5.5 OS-based approaches

It is possible to design an architecture with a shared address space, and let the data movement be handled by the operating system. The Kendall Square computer [5] had an architecture name ‘all-cache’, where no data was natively associated with any processor. Instead, all data was considered to be cached on a processor, and moved through the network on demand, much like data is moved from main memory to cache in a regular CPU. This idea is analogous to the numa support in current SGI architectures.

2.5.6 Latency hiding

Communication between processors is typically slow, slower than data transfer from memory on a single processor, and much slower than operating on data. For this reason, it is good to think about the relative volumes of network traffic versus ‘useful’ operations (see also section 2.9.2) when designing a parallel program. Another way of coping with the relative slowness of communication is to arrange the program so that the communication actually happens while some computation is going on.

For example, consider the parallel execution of a matrix-vector product y = Ax (there will be further discussion of this operation in section 6.2). Assume that the vectors are distributed, so each processor p executes

\forall_{i \in I_p}: y_i = \sum_{j}a_{ij}x_i


Figure 2.5: The parallel matrix-vector product with a blockrow distribution.

Since x is also distributed, we can write this as

\forall_{i \in I_p}: y_i = \left ( \sum_{j \; \mathrm{local}} +\sum_{j \; \mathrm{not \: local}} \right ) a_{ij}x_j

This scheme is illustrated in figure 2.5. We can now proceed as follows:

  • Start the transfer of non-local elements of x;
  • Operate on the local elements of x while data transfer is going on;
  • Make sure that the transfers are finished;
  • Operate on the non-local elements of x.

Of course, this scenario presupposes that there is software and hardware support for this overlap. MPI allows for this, through so-called asynchronous communication or non-blocking comumnication routines. This does not immediately imply that overlap will actually happen, since hardware support is an entirely separate question.


Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf
Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 License.

Last modified: Wednesday, 15 July 2020, 11:36 PM