Shared Memory and Distributed Multiprocessing

Bhanu Kapoor, Ph.D.

Issue with Parallelism

  • Parallel software is the problem
  • Need to get significant performance improvement
    • Otherwise, just use a faster uniprocessor, since it’s easier!
  • Difficulties
    • Partitioning
    • Coordination
    • Communications overhead

Amdahl’s Law

  • Sequential part can limit speedup
  • Example: 100 processors, 90× speedup?
    • Tnew = Tparallelizable/100 + Tsequential
    • Speedup = 1 / [ (1 - Fparallelizable) + Fparallelizable / 100 ] = 90
    • Solving: Fparallelizable = 0.999
  • Need sequential part to be 0.1% of original time
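
To check the arithmetic, here is a minimal sketch in plain C (our own illustration; the name required_parallel_fraction is not from the slides) that solves the speedup equation for the parallelizable fraction:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - F) + F/p).
 * Solving for F given a target speedup S on p processors:
 *   F = (1 - 1/S) / (1 - 1/p)                                   */
static double required_parallel_fraction(double target_speedup, double p)
{
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / p);
}

int main(void)
{
    double f = required_parallel_fraction(90.0, 100.0);
    /* Prints roughly 0.9989: the sequential part may be at most ~0.1% */
    printf("F_parallelizable = %.4f\n", f);
    return 0;
}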

Scaling Example

  • Workload: sum of 10 scalars, and 10 × 10 matrix sum
    • Speed up from 10 to 100 processors
  • Single processor: Time = (10 + 100) × tadd
  • 10 processors
    • Time = 10 × tadd + 100/10 × tadd = 20 × tadd
    • Speedup = 110/20 = 5.5 (55% of potential)
  • 100 processors
    • Time = 10 × tadd + 100/100 × tadd = 11 × tadd
    • Speedup = 110/11 = 10 (10% of potential)
  • Assumes load can be balanced across processors

Scaling Example (cont)

  • What if matrix size is 100 × 100?
  • Single processor: Time = (10 + 10000) × tadd
  • 10 processors
    • Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
    • Speedup = 10010/1010 = 9.9 (99% of potential)
  • 100 processors
    • Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
    • Speedup = 10010/110 = 91 (91% of potential)
  • Assuming load balanced
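
The timing model behind both matrix sizes is easy to reproduce; a minimal sketch in plain C (our own illustration; scaled_speedup is an assumed name), with time measured in units of tadd:

#include <stdio.h>

/* Timing model from the example:
 *   T(p) = 10 + (n*n)/p   (10 sequential scalar adds, matrix adds split across p)
 *   speedup(p) = T(1) / T(p)                                                      */
static double scaled_speedup(int p, int n)
{
    double t1 = 10.0 + (double)n * n;
    double tp = 10.0 + (double)n * n / p;
    return t1 / tp;
}

int main(void)
{
    /* 10 x 10 matrix:   5.5x on 10 processors, 10x on 100 processors  */
    printf("10x10:   p=10 -> %.1f  p=100 -> %.1f\n",
           scaled_speedup(10, 10), scaled_speedup(100, 10));
    /* 100 x 100 matrix: 9.9x on 10 processors, 91x on 100 processors  */
    printf("100x100: p=10 -> %.1f  p=100 -> %.1f\n",
           scaled_speedup(10, 100), scaled_speedup(100, 100));
    return 0;
}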

Strong vs Weak Scaling

  • Strong scaling: problem size fixed
    • As in example
  • Weak scaling: problem size proportional to number of processors
    • 10 processors, 10 × 10 matrix
      • Time = 20 × tadd
    • 100 processors, 32 × 32 matrix (≈1000 elements)
      • Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
    • Constant performance in this example

Shared Memory

  • SMP: shared memory multiprocessor
    • Hardware provides single physical address space for all processors
    • Synchronize shared variables using locks
    • Memory access time
      • UMA (uniform) vs. NUMA (nonuniform)
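
As an illustration of synchronizing a shared variable with a lock in a single address space, a minimal pthreads sketch (our own example; the counter and thread count are arbitrary):

#include <pthread.h>
#include <stdio.h>

/* One shared counter in the single physical address space,
 * protected by a lock so that updates are serialized.       */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);    /* acquire the lock   */
        counter = counter + 1;                /* safe shared update */
        pthread_mutex_unlock(&counter_lock);  /* release the lock   */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);       /* always 400000 */
    return 0;
}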


Example: Sum Reduction

  • Sum 100,000 numbers on 100 processor UMA
    • Each processor has ID: 0 ≤ Pn ≤ 99
    • Partition 1000 numbers per processor
    • Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
  • Now need to add these partial sums
    • Reduction: divide and conquer
    • Half the processors add pairs, then quarter, …
    • Need to synchronize between reduction steps

Example: Sum Reduction

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* conditional sum needed when half is odd;
           processor 0 gets the missing element */
    half = half/2;  /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
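
One way to make the pseudocode above concrete: a sketch assuming a POSIX barrier stands in for synch(), with 4 threads standing in for the 100 processors (P, N, and the thread scaffolding are our choices):

#include <pthread.h>
#include <stdio.h>

#define P 4           /* number of "processors" (threads) in this sketch */
#define N 100000      /* numbers to sum */

static double A[N];
static double sum[P];
static pthread_barrier_t barrier;             /* plays the role of synch() */

static void *reduce(void *arg)
{
    int Pn = *(int *)arg;

    /* Phase 1: each thread sums its own slice of A */
    sum[Pn] = 0;
    for (int i = (N / P) * Pn; i < (N / P) * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];

    /* Phase 2: tree reduction, halving the active threads each step */
    int half = P;
    do {
        pthread_barrier_wait(&barrier);       /* synch()                 */
        if (half % 2 != 0 && Pn == 0)         /* odd count: thread 0     */
            sum[0] = sum[0] + sum[half - 1];  /* picks up the extra term */
        half = half / 2;
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void)
{
    pthread_t t[P];
    int id[P];

    for (int i = 0; i < N; i++) A[i] = 1.0;   /* known answer: N */
    pthread_barrier_init(&barrier, NULL, P);

    for (int i = 0; i < P; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, reduce, &id[i]);
    }
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);

    printf("total = %.0f\n", sum[0]);         /* prints 100000 */
    pthread_barrier_destroy(&barrier);
    return 0;
}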

Message Passing

  • Each processor has private physical address space
  • Hardware sends/receives messages between processors


Loosely Coupled Clusters

  • Network of independent computers
    • Each has private memory and OS
    • Connected using I/O system
      • E.g., Ethernet/switch, Internet
  • Suitable for applications with independent tasks
    • Web servers, databases, simulations, …
  • High availability, scalable, affordable
  • Problems
    • Administration cost (prefer virtual machines)
    • Low interconnect bandwidth
      • c.f. processor/memory bandwidth on an SMP

Sum Reduction (Again)

  • Sum 100,000 numbers on 100 processors
  • First distribute 1000 numbers to each
    • Then do partial sums
sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];
  • Reduction
    • Half the processors send, other half receive and add
    • Then a quarter send, a quarter receive and add, …

Sum Reduction (Again)

  • Given send() and receive() operations
limit = 100; half = 100;  /* 100 processors */
repeat
    half = (half+1)/2;    /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;         /* upper limit of senders */
until (half == 1);        /* exit with final sum */

    • Send/receive also provide synchronization
    • Assumes send/receive take similar time to addition
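
On a real cluster this reduction would typically be written with a message-passing library; a minimal MPI sketch (MPI is our choice here, the slides only assume generic send() and receive()), where MPI_Reduce performs the send/receive tree internally:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process sums its own 1000 locally held numbers (AN). */
    double AN[1000], sum = 0.0, total = 0.0;
    for (int i = 0; i < 1000; i++) AN[i] = 1.0;   /* placeholder data */
    for (int i = 0; i < 1000; i++) sum = sum + AN[i];

    /* Reduction: the library does the send/receive halving for us
     * and leaves the final sum on process 0.                      */
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (Pn == 0)
        printf("total = %.0f over %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}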

Cache Coherence Problem

  • Suppose two CPU cores share a physical address space
    • Write-through caches
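
The example behind this slide (and behind the "step 3" reference on the next one) is the standard two-core, write-through sequence; a tiny simulation sketch in C, with the caches modeled as plain local variables:

#include <stdio.h>

/* Two cores each keep a private cached copy of X.  Writes go through
 * to memory (write-through) but do not update the other core's cache. */
int main(void)
{
    int memory_X = 0;
    int cacheA_X, cacheB_X;

    cacheA_X = memory_X;     /* step 1: CPU A reads X, caches 0          */
    cacheB_X = memory_X;     /* step 2: CPU B reads X, caches 0          */
    cacheA_X = 1;            /* step 3: CPU A stores 1 into X ...        */
    memory_X = cacheA_X;     /*         ... write-through updates memory */

    /* CPU B still sees its stale cached copy: 0, not 1 */
    printf("memory=%d cacheA=%d cacheB=%d\n", memory_X, cacheA_X, cacheB_X);
    return 0;
}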


Coherence Defined

  • Informally: Reads return most recently written value
  • Formally:
    • P writes X; P reads X (no intervening writes)
      • read returns written value
    • P1 writes X; P2 reads X (sufficiently later)
      • read returns written value
        • cf. CPU B reading X after step 3 in the write-through example sketched above
    • P1 writes X, P2 writes X
      • all processors see writes in the same order
        • End up with the same final value for X

Memory Consistency

  • When are writes seen by other processors?
    • “Seen” means a read returns the written value
    • Cannot happen instantaneously
  • Assumptions
    • A write completes only when all processors have seen it
    • A processor does not reorder writes with other accesses
  • Consequence
    • P writes X then writes Y
      • all processors that see new Y also see new X
    • Processors can reorder reads, but not writes
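
The consequence shows up in the usual data-then-flag pattern; a minimal C sketch (our example; C11 release/acquire atomics stand in for the assumption that a processor does not reorder its writes):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int X = 0;            /* data: "X" in the slide */
static atomic_int Y = 0;     /* flag: "Y" in the slide */

static void *writer(void *arg)
{
    (void)arg;
    X = 42;                                               /* write X first */
    atomic_store_explicit(&Y, 1, memory_order_release);   /* then write Y  */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    /* Spin until the new value of Y is seen ...                  */
    while (atomic_load_explicit(&Y, memory_order_acquire) == 0)
        ;
    /* ... at which point the new value of X must be visible too. */
    printf("X = %d\n", X);                                /* prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}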

Multiprocessor Caches


  • Caches provide, for shared data items
    • Migration
    • Replication
  • Migration reduces
    • Latency
    • Bandwidth demands
  • Replication reduces
    • Latency
    • Contention for reads of a shared item

Source: Saylor Academy
This work is licensed under a Creative Commons Attribution 4.0 License.
