Study this material, which focus on the problem of parallel software and discusses scaling by using an example to explain shared memory and message passing, and identify some problems related to cache and memory consistency.

Bhanu Kapoor, Ph.D.

- Parallel software is the problem
- Need to get significant performance improvement
- Otherwise, just use a faster uniprocessor, since it’s easier!
- Difficulties
- Partitioning
- Coordination
- Communications overhead

- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
- T
_{new}= T_{parallelizable}/100 + T_{sequential} - Speedup = 1 / [ (1 - F
_{parallelizable}) + F_{parallelizable}/ 100 ] = 90 - Solving: F
_{parallelizable}= 0.999 - Need sequential part to be 0.1% of original time

- Workload: sum of 10 scalars, and 10 × 10 matrix sum
- Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t
_{add} - 10 processors
- Time = 10 × t
_{add}+ 100/10 × t_{add}= 20 × t_{add} - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
- Time = 10 × t
_{add}+ 100/100 × t_{add}= 11 × t_{add} - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t
_{add} - 10 processors
- Time = 10 × t
_{add}+ 10000/10 × t_{add}= 1010 × t_{add} - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
- Time = 10 × t
_{add}+ 10000/100 × t_{add}= 110 × t_{add} - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced

- Strong scaling: problem size fixed
- As in example
- Weak scaling: problem size proportional to number of processors
- 10 processors, 10 × 10 matrix
- Time = 20 × t
_{add} - 100 processors, 32 × 32 matrix
- Time = 10 × t
_{add}+ 1000/100 × t_{add}= 20 × t_{add} - Constant performance in this example

- SMP: shared memory multiprocessor
- Hardware provides single physical address space for all processors
- Synchronize shared variables using locks
- Memory access time
- UMA (uniform) vs. NUMA (nonuniform)

- Sum 100,000 numbers on 100 processor UMA
- Each processor has ID: 0 ≤ Pn ≤ 99
- Partition 1000 numbers per processor
- Initial summation on each processor

sum[Pn] = 0;

for (i = 1000*Pn;

i < 1000*(Pn+1); i = i + 1)

sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
- Reduction: divide and conquer
- Half the processors add pairs, then quarter, …
- Need to synchronize between reduction steps

half = 100;

repeat

synch();

if (half%2 != 0 && Pn == 0) sum[0] = sum[0] + sum[half-1];

/* Conditional sum needed when half is odd; Processor0 gets missing element */

half = half/2; /* dividing line on who sums */ if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];

until (half == 1);

- Each processor has private physical address space
- Hardware sends/receives messages between processors

- Network of independent computers
- Each has private memory and OS
- Connected using I/O system
- E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
- Web servers, databases, simulations, …
- High availability, scalable, affordable
- Problems
- Administration cost (prefer virtual machines)
- Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP

- Sum 100,000 on 100 processors
- First distribute 1000 numbers to each
- The do partial sums

sum = 0;

for (i = 0; i<1000; i = i + 1) sum = sum + AN[i];

- Reduction
- Half the processors send, other half receive and add
- The quarter send, quarter receive and add, …

- Given send() and receive() operations

limit = 100; half = 100;/* 100 processors */ repeat

half = (half+1)/2; /* send vs. receive dividing line */

if (Pn >= half && Pn < limit)

send(Pn - half, sum);

if (Pn < (limit/2))

sum = sum + receive();

limit = half; /* upper limit of senders */ until (half == 1); /* exit with final sum */

- Send/receive also provide synchronization
- Assumes send/receive take similar time to addition

- Suppose two CPU cores share a physical address space
- Write-through caches

- Informally: Reads return most recently written value
- Formally:
- P writes X; P reads X (no intervening writes)
- read returns written value
- P
_{1}writes X; P_{2}reads X (sufficiently later) - read returns written value
- c.f. CPU B reading X after step 3 in example
- P
_{1}writes X, P_{2}writes X - all processors see writes in the same order
- End up with the same final value for X

- When are writes seen by other processors
- “Seen” means a read returns the written value
- Can’t be instantaneously
- Assumptions
- A write completes only when all processors have seen it
- A processor does not reorder writes with other accesses
- Consequence
- P writes X then writes Y
- all processors that see new Y also see new X
- Processors can reorder reads, but not writes

- Caches provide [for shared items]
- Migration
- Replication
- Migration Reduces
- Latency
- Bandwidth demands
- Replication reduces
- Latency
- Contention for a read of shared item

Source: Saylor Academy

This work is licensed under a Creative Commons Attribution 4.0 License.

Last modified: Wednesday, July 15, 2020, 10:37 PM