Topic  Name  Description 

Course Introduction  Course Syllabus  
Course Terms of Use  
Unit 1: Introduction to Computer Technology  Unit 1 Learning Outcomes  
1.1: Introduction to Computer Processors  John R. Harris' "Computer History"  Explore each of posts in the Computer History collection of the Virtual Travelog site, which focus on the early history of computers. These articles will provide you with insight into the early history of computers and will introduce you to the powerful ideas that enabled computer architecture of our day and that will influence computer architecture of tomorrow. 
Wikipedia: "History of Computing Hardware (1960Present)"  Read this article, which serves as a continuation of the other reading in this subunit. The primary purpose of this reading is to inform you of the history of computers from the third generation computers of the 1960s to the today's technology of microcomputers, which has allowed for a computer presence in people's homes. 

1.2: Components of a Computer  Wikipedia: "Personal Computer Hardware"  Read this article for a solid overview of various components of a computer, including the motherboard, power supply, removable media devices, secondary storage, sound cards, and input and output peripherals. 
1.3: The Role of Processor Performance  University of Maryland, Baltimore County: Dr. Jon Squire's "Benchmarks"  Read these lecture notes. 
University of Maryland, Baltimore County: Dr. Jon Squire's "Lecture 3, Performance"  Read these lecture notes. 

1.4: The Power Problem  Tom Foremski's "The Need For A Radical New Type Of Computer Architecture"  This article is about the challenges facing computer architecture in building more powerful computers for high performance applications and for faster, cheaper, more efficient computers for IT applications. Foremski responds to Irving WladawskyBerger's article, "Extreme Scale Computing"; you may click on the embedded link to read WladawskyBerger's article. 
1.5: The Switch to Parallel Processing  Stanford University: Dr. Dave Patterson's "Computer Architecture Is Back: Parallel Computing Landscape"  Watch this lecture for an understanding of the reasons behind the switch to parallel computing. This lecture provides motivation, insight into thinking about computer architecture, and an explanation computing trends. 
1.6: Case Study: A Recent Intel Processor  Massachusetts Institute of Technology: Eric Grimson's "Data Types, Operators, and Variables"  The beginning of the lecture is administrative, so you may skip to around 16:14. It introduces the concept of computational thinking. While the course is an introduction to programming, computational thinking applies to both software, i.e., programming, and to hardware, i.e., computer architecture. 
Unit 2: Instructions: Hardware Language  Unit 2 Learning Outcomes  
2.1: Computer Hardware Operations  University of Maryland, Baltimore Country: Dr. Jon Squire's "Lecture 4, CPU Operation"  Study these lecture notes to learn about the basic operations, or machine instructions, of a computer processor. 
2.2: Number Representation in Computers  University of California, Santa Barbara: Behrooz Parhami's "Part I: Number Representation"  Read through slide 37. These slides explain how numbers are represented, i.e., encoded by using a string of bits  or binary digits. These lecture notes also describe how the sign of a number is represented and how 2's complement representation. Finally, skim slides 3888 to get a basic understanding of what they cover: redundant number systems and residue number systems. This material may or may not be difficult, depending on your mathematical background; make sure to take your time as you study this material. 
Tony R. Kuphalt's "Numeration Systems"  Read this chapter on numeration systems. This is an alternative reading for number representation used for digital hardware devices. 

2.3: Instruction Representation  MIPS Assembly: "Instructions"  Read this article for an introduction to the three different instruction formats for the MIPS processor: the RFormat, the IFormat, and the JFormat instructions. MIPS is an acronym that stands for Microprocessor Instructions without Interlocked Pipeline Stages. MIPS is a RISC (Reduced Instruction Set Computer) introduced by MIPS technologies. Also, ISA, if you encounter it, stands for Instruction Set Architecture. 
2.4: Logical and Arithmetic Instructions  MIPS Assembly: "Arithmetic Instructions"  Read this article to learn about arithmetic and logical instructions for the MIPS processor. 
2.5: Control Instructions  MIPS Assembly: "Control Flow Instructions"  Read this article to learn about the control flow instructions for the MIPS processor, including the basic ones: jump and branch instructions. 
2.6: Instructions for Memory Operations  MIPS Assembly: "Memory Instructions"  Read this article to learn about memory instructions for the MIPS processor. 
2.7: Different Modes for Addressing Memory  Wikipedia: "Addressing Mode"  Read this article to study the various formats for addressing memory. 
MIPS Assembly: "Instruction Format"  MIPS assembly is an assembly language, which is a mnemonic or meaningful code for the machine language format in computer programming. Read this section to get a sense of how the addresses to memory are coded in a MIPS microprocessor. 

2.8: Case Study: Intel and ARM Instructions  X86 Assembly: "X86 Instructions"  Read this article. This article and the following provide two examples of instructions set architectures (ISAs). Look over how the different microprocessors address memory. Take note of similarities and differences of format, instructions and type of instructions, and addressing modes between these two as well as between these and the MIPS instructions of the previous sections. 
Wikipedia: "ARM Architecture"  Read this article. It and the previous provide two examples of instructions set architectures (ISAs). Look over how the different microprocessors address memory. Take note of similarities and differences of format, instructions and type of instructions, and addressing modes between these two as well as between these and the MIPS instructions of the previous sections. 

Unit 3: Fundamentals of Digital Logic Design  Unit 3 Learning Outcomes  
3.1: Beginning Design: Logic Gates, Truth Table, and Logic Equations  Massachusetts Institute of Technology: Jerome H. Saltzer and M. Frans Kaashoek's "Design Principles"  Study these principles. This reading provides a list of important design principles, applicable to any type of design and, in particular to computer system design, software or hardware. Consider these principles as well as other design considerations as a guide to computer system design. 
Wikipedia: "Logic Gates"  Read this article. Logic devices are physical implementations of Boolean logic and are built from components, which have gotten larger and more complex over time, for example: relays and transistors, gates, registers, multiplexors, adders, multipliers, ALUs (arithmetic logic units), data buses, memories, interfaces, and processors. These devices respond to control and data signals specified in machine instructions to perform the functions for which they were designed. 

3.2: Combinational Logic  Tony R. Kuphalt's "Combinational Logic Functions"  Read this chapter, which describes the design of several components using logic gates, including adders, encoders and decoders, multiplexers, and demultiplexers. This chapter also mentions ladder logic. If you are not familiar with ladder logic, you may also optionally read Chapter 6 as a reference. 
3.3: FlipFlops, Latches, and Registers  Tony R. Kuphalt's "Multivibrators"  Read this chapter, which discusses how logic gates are connected to store bits, i.e., 0's and 1's. Combinational circuits, described in the previous section, do not have memory. Using logic gates, latches and flip flops are designed for storing bits. Groups of flip flops are used to build registers which hold strings of bits. For each storage device in Chapter 10, focus on the overview at the beginning of the section and the review of the device's characteristics at the end of its section. While you do not absolutely need to know the details of how latches and flip flops work, you might find the material of interest. We strongly recommend that you read the details of the design of each storage device, because it will give you a stronger background. 
3.4: Sequential Logic Design  Tony R. Kuphalt's "Sequential Circuits"  Read this chapter on sequential circuits. Combinatorial circuits, discussed in a previous unit, have outputs that depend on the inputs. Sequential circuits and finite state machines have outputs that depend on the inputs AND the current state, values stored in memory. 
Tony R. Kuphalt's "Karnaugh Mapping"  Read this chapter on Karnaugh mapping, a tabular way for simplifying Boolean logic. There are several ways for representing Boolean logic: algebraic expressions which use symbols and Boolean operations; Venn diagrams which use distinct and overlapping circles; and tables relating inputs to outputs (for combinational logic) or tables relating inputs and current state to outputs and next state (for sequential logic). When designing sequential logic, some of the components are memory devices. Cost and processing time are considerations in using memory devices, which can be expensive. To reduce the cost or processing time the logic can be simplified. This simplification can be done using algebraic rules to manipulate the symbols and operations, analysis of the areas inside the circles for Venn diagrams, or Karnaugh maps for input/output tables. Some of you may be familiar with Karnaugh mapping from previous courses or work experience. 

3.5: Case Study: Design of a Finite State Machine (FSM) to Control a Vending Machine  Finite State Automata  Read this article for an example of a finite state machine design of a simple vending machine. A sequential circuit is also called a sequential machine or a finite state machine (FSM) or a finite state automaton. This case study gives an example that illustrates the concepts of this unit for the design of a sequential circuit. A binary table represents the input/output behavior of the circuit. We use a sequential circuit, because the output also depends on the state. Recall from your readings, that state requires memory, i.e., flip flops. Thus, a binary table with entries that give the output and next state for given inputs and current state represents the design of the machine. A finite state machine diagram can also represent the design: circles represent states; arrows represent transitions next to states; and inputs and outputs label the arrows (sometimes written as input/ output). Finally, Boolean equations can also represent the design. Lastly, Karnaugh maps or Boolean logic rules can be used to simplify, i.e., minimize, the equations and, thus, the design. 
Unit 4: Computer Arithmetic  Unit 4 Learning Outcomes  
4.1: Number Representation  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 3: Computer Arithmetic"  Read sections 3.1 and 3.2 on pages 8894 on representation of integers and real numbers. In subunit 2.2, you have previously read about number systems and the representation of numbers used for computing. This reading will give you a chance to review that material. Computer architecture comprises components which perform the functions of storage of data, transfer of data from one component to another, computations, and interfacing to devices external to the computer. Data is stored in terms of units, called words. A word is made up of a number of bits, typically, depending on the computer, 32 bits or 64 bits. Words keep getting longer, i.e., larger number of bits. Instructions are also stored in words. In previous subunits, you have seen examples of how instructions are stored in a word or words. In this subunit, you will see how numbers are stored in words. 
Wikipedia: "Floating Point"  Read this article, which supplements the prior reading on representation of real numbers. 

4.2: Addition and Subtraction Hardware  University of California, Santa Barbara: Behrooz Parhami's "Addition"  Study slides 129 to learn how addition is implemented and carried out at the gate level. In the state of the practice, i.e., in the current profession, computers are architected using larger components. For example, to perform addition and subtraction, computer architects utilize ALU's, arithmetic logic units. You can design a computer without knowing the details of an ALU or of an adder, similar to using a calculator to find the square root of a number without knowing how to manually compute the square root (or in computer science terminology, without knowing the algorithm that the calculator performs to find the square root). However, we want you to have the strongest foundation in your study of computer architecture. Hence, study the assigned slides for Chapter 5. If you feel very ambitious, optionally you can study Chapters 68, which expand on basic addition. Knowing the underlying algorithm for larger components, you will be able to better use them in constructing larger components, for example, using half adders to construct a full adder. A half adder takes 2 bits as input and outputs a sum and a carry bit; a full adder takes 2 operand bits and a carry bit as input, and outputs a sum bit and a carry bit. Note that to add two floating point numbers, one or both need to be put in a form such that they have the same exponent. Then, the mantissas are added. Lastly, the result is normalized. We will cover floating point addition in subunit 4.4. Subtraction is not discussed explicitly, because it is done via addition by reversing the sign of the number to be subtracted (subtrahend) and adding the result to the number subtracted from (minuend). You have to be careful when subtracting floating point numbers, because of the possible large round off error in the result. 
4.3: Multiplication  University of California, Santa Barbara: Behrooz Parhami's "Multiplication"  Study slides 126. In particular, focus on section 9.1 to learn how multiplication is performed using basic steps that are carried out by elemental logic components, such as an adder and shifter. Two numbers are loaded into registers (fast word storage) and multiplication is carried out by repeated addition and shifting (moving the bits in a register to the right or left). The presentation first shows how multiplication is done for unsigned integers and then how signed numbers are handled. When you read the slides, follow along by using a pad of paper and multiply two small binary numbers. Note k is the position of the binary digit; from right to left, k takes on 0, 1, 2, etc. j is the partial product; j goes from 1, 2, 3, etc. However, for consistency the 0th partial product is defined to be 0. Chapters 1012 are optional; these chapters discuss ways of speeding up multiplication. Note that to multiply two floating point numbers, add the exponents, multiply the mantissas, normalize the mantissa of the product, and adjust the exponent accordingly. Floating point multiplication will be covered in a following subunit 4.4. 
4.4: Floating Point Arithmetic  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 3: Computer Arithmetic"  Read section 3.3 on pages 94100 and section 3.4 on pages 100102. Section 3.3 provides an additional explanation of error analysis when numbers are represented using a fixed number of digits. This issue mostly arises when using floating point numbers. Real numbers are represented using a fixed number of bits. The number of bits is the precision of the representation. The accuracy of the representation is described in terms of the difference between the actual number and its representation using a fixed number of bits. This difference is the error of the representation. The accuracy becomes more significant, because computations can cause the error to get so large that the result is meaningless and potentially even high risk depending on the application. Section 3.4 explains how programming languages approach number representation. 
4.5: Division  University of California, Santa Barbara: Behrooz Parhami's "Division"  Study slides 134. In particular, focus on section 13.1, which explain the basics of division using subtraction and shifting. Chapters 1416 are optional; these chapters cover ways of speeding up division. 
4.6: Case Study: Floating Point Arithmetic in an x86 Processor  Wikipedia: "Extended Precision"  Read this article to learn about minimizing roundoff and overflow in floating point arithmetic using extended precision. 
Unit 5: Designing a Processor  Unit 5 Learning Outcomes  
5.1: Von Neumann Architecture  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 1: Sequential Computer Architecture"  Study section 1.1 of Chapter 1 on pages 713 to learn about sequential or Von Neumann computer architecture. Computer architecture is the high level computer design comprising components which perform the functions of data storage, computations, data transfer, and control. 
5.2: Simple MIPS Processor Components  Indian Institute of Technology, Delhi: Anshul Kumar's "Processor Design: An Introduction"  Watch this lecture, which introduces the components of MIPS architecture that are required to process a subset of MIPS instructions. This is the first lecture of a series of video lectures on the design of a MIPS processor. This series of six videos incrementally design a processor that implements a subset of eight MIPS five arithmetic instructions, two memory reference instructions, and one flow control instruction. Using hardware components, referred to as building blocks, and a microprogram control, the lecturer executes these eight instructions to develop a simple design of a processor. In this lecture, Kumar discusses the performance of the design and performance improvement using a multicycle design. Also, Kumar identifies an extension of the design to deal with exceptional cases that could occur when executing programs written using the eight instructions (exception handling). Lastly, the lecturer identifies increments to the design to include additional instructions. 
5.3: Designing a Datapath for a Simple Processor  Indian Institute of Technology, Delhi: Anshul Kumar's "Processor Design: Datapath"  Watch this lecture, which is the second video in a series of six video lectures on the design of a MIPS processor. The previous video lecture presented the processor building blocks that will be used in the series. This video lecture explains how to build a datapath of the MIPS architecture to process a subset of the MIPS instructions. We will take Rformat instructions and memory instructions and look into the datapath requirements to process them. Then, the other instructions will be addressed one at a time, and incremental changes to the design will be made to handle them. The data path and controller will be interconnected, and the (micro) control signals to perform the correct hardware operations at the right time will be identified. The next video in subunit 5.4 will look at the design of the controller. 
5.4: Alternative Approach to Datapath Design and Design of a Control for a Simple Processor  Indian Institute of Technology, Delhi: Anshul Kumar's "Processor Design: Control"  Watch this lecture, which is the third video of the series of video lectures on the design of a simple processor. This lecture explains how to build the control part of the MIPS architecture that is required to process a subset of MIPS instructions. This builds on the datapath design from the last lecture. That datapath design approach started with a design for the R class instructions  instructions having operands in registers, e.g., add, subtract, 'and', 'or', 'less than'. It then included the other instructions, one at a time, and incremented the design to accommodate them. In this video lecture, an alternative approach is used to arrive at the same design. Here, the datapath for the arithmetic and logic instructions is designed. Then, the data path for the store, then for the load, then the branch on equal, and, finally, the jump are designed individually. Next, datapath design for all eight instructions is the union of the five individual designs. The control signals for each instruction are identified and combined to form a truth table for a controller, which is implemented using a PLA (program logic array). The video concludes with a performance/delay analysis of the design to show the limitations of a single cycle datapath. The next video in subunit 5.5 will look at pipelining for increased performance. 
5.5: Pipelining and Hazards  Indian Institute of Technology, Delhi: Anshul Kumar's "Pipelined Processor Design"  Watch this lecture, which presents basic ideas on improving processor performance through the use of pipelining. The previous video showed the limitations of a single cycle datapath design. To overcome the limitations and improve performance, a pipeline datapath design is considered. This video lecture explains pipelining. A pipeline datapath is analogous to an assembly line in manufacturing. First, you develop a skeleton design of a pipeline datapath. Performance analysis shows that several types of delays can arise, called hazards: structure, data, or control hazards. Design can address those arising from structure. However, data and control delays cannot always be prevented. The next video lecture will complete the design of a pipeline datapath for the simple processor. 
5.6: Pipelined Processor  Indian Institute of Technology, Delhi: Anshul Kumar's "Pipelined Processor Design: Datapath"  Watch this lecture, which explains how to design a pipelined MIPS processor. The previous video introduced pipelining as a way to increase performance. It showed how hazards can limit the performance improvement of a pipeline datapath. This video lecture completes the design of a pipeline datapath. Ignoring hazards, the lecturer designs a control for the pipeline, integrates all the components including the control with the pipeline, and then considers the behavior with respect to hazards. 
5.7: Instruction Level Parallelism  Charles Severance and Kevin Dowd's "Understanding Parallelism  Introduction"  Read this article, which discusses granularity of parallelism. On a uniprocessor, instruction level parallelism includes pipelining techniques and multiple functional units. Parallel processing using multiprocessors will be covered in Unit 8. 
Wikipedia: "InstructionLevel Parallelism"  Read the explanation of instruction level parallelism. Parallelism can occur at different levels of granularity, and when discussing parallelism, we need to be clear on exactly what is being done in parallel. 

Unit 6: The Memory Hierarchy  Unit 6 Learning Outcomes  
6.1: Elements of Memory Hierarchy and Caches  Indian Institute of Technology, Delhi: Anshul Kumar's "Memory Hierarchy: Basic Idea"  Watch this lecture. Previously, you have focused on processor design to increase performance. Now, you will turn to memory. This video introduces various methods of improving processor performance through the use of memory hierarchy. The lecture discusses memory technologies, which vary in cost and speed. We have to assume that memory is flat, but with current technology, flat memory does not meet performance demands placed on it. You will take a look at hierarchical memory and the use of cache. This video will also discuss analysis of memory hierarchies and cache performance with respect to miss rates and block size. Finally, the lecturer considers cache policy. 
Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 1: Sequential Computer Architecture"  Read section 1.2 of Chapter 1 on pages 1423 to learn about memory hierarchies, and read section 1.4 on pages 2529 to learn about locality and data reuse. Subsections 1.2.6 and 1.2.7 apply to the topics outlined below in subunits 6.2 and 6.3. These readings supplement the memory topics discussed in Kumar's video. 

6.2: Cache Architectures and Improving Cache Performance  Indian Institute of Technology, Delhi: Anshul Kumar's "Memory Hierarchy: Cache Organization"  Watch this lecture on memory hierarchy design with caches. This lecture discusses the impact that memory operations have on overall processor performance and identifies different cache architectures that can improve overall processor performance. In the memory hierarchy, from top to bottom, there are: processor registers, cache, main memory, and secondary memory. The order goes from faster, smaller capacity, and higher cost to slower, higher capacity, and lower cost. Data moves from the main memory to cache. To do this requires a mapping from addresses in the main memory to cache addresses. How this is done affects performance. Performance analysis of cache and memory organization involves miss rate and miss penalty. Policies for reading, loading, fetching, replacing, and writing memory also affect cost and performance. 
Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 1: Sequential Computer Architecture"  Read section 1.5 of Chapter 1 on pages 3041. This reading discusses the relationship of pipelining and cache to programming. 

6.3: Main Memory and Virtual Memory  Indian Institute of Technology, Delhi: Anshul Kumar's "Memory Hierarchy: Virtual Memory"  Watch this lecture, which addresses the subject of virtual memory. This topic pertains to the relationship of main memory and secondary memory and has similarities and differences to the relationship of cache and main memory. Differences arise due to the speed of the various memories and their capacities. This lecture discusses virtual memory, mapping from virtual memory addresses to physical addresses in main and paging techniques used with the main memory. The lecture discusses the use of page tables in translating virtual addresses to physical addresses. Issues that arise with page tables include structure, location, and large size. 
6.4: Performance Tuning  Blaise Barney's "Introduction to Parallel Computing, Chapters 15"  Read Chapters 15. Before reading these chapters, list the factors that you can think of that can affect performance, e.g., memory performance, cache, memory hierarchy, multicores, etc. and what you might suggest as ways to increase performance. After reading these chapters, what might you add, if anything, to your list? 
Unit 7: Storage and I/O  Unit 7 Learning Outcomes  
7.1: I/O Devices  Indian Institute of Technology, Delhi: Anshul Kumar's "I/O Subsystem: Introduction"  Watch this lecture, which discusses the basic ideas behind the input/output (I/O) subsystem of a computer system. The lecturer also looks into performance measurement for I/O devices and interfaces used to interconnect I/O devices to the processor. This is the first of two video lectures on I/O. A computer subsystem consists of three major components: processor, memory, and connections. The key words in the previous sentence are subsystem and connections. To be useful, a computer system needs to have connections with external devices to get data and control signals into the computer and to put data and control signals out. The external devices can be other systems or other systems may be connected to the same devices. Thus, our computer system is part of a network of a few or many other subsystems interconnected to perform useful tasks. We are always interested in how well a task is performed, in terms of time, capacity, and cost. In considering performance relative to our useful task, we have to consider processor performance, memory performance, and the performance of the connections, including the performance of the external devices. This first video lecture looks at external or peripheral devices and I/O performance. The next video in subunit 7.2 discusses interfaces, buses, and I/O transfer. 
7.2: Connecting I/O Devices to the Processor  Indian Institute of Technology, Delhi: Anshul Kumar's "I/O Subsystem: Interfaces and Buses"  Watch this lecture for an introduction to the interconnection schemes used for the input/output (I/O) subsystem of a computer system. This is the second of two lectures on I/O devices. The previous lecture looked at the connection of memory, either cache or main memory, with peripheral devices and the transfer and transformation of data between them. This video lecture analyzes alternative interconnection schemes with a focus on buses. Also, it discusses protocols for the data that flows on the buses: asynchronous and synchronous. A synchronous protocol uses a clock to time sequence the information flow. An asynchronous protocol does not use a clock; the signal carries the sequencing information. Then, the lecture shows a performance comparison of two different protocols. 
7.3: Measuring Disk Performance  Wikipedia: "Hard Disk Drive Performance Characteristics"  Read this article. Disks have various characteristics, which determine quality attributes, such as reliability, performance, etc. Here, we are interested in performance. Think of some characteristics that affect performance. How can performance of a disk be measured? 
7.4: Redundant Array of Inexpensive Disks (RAID)  Wikipedia: "RAID"  This article will provide you with a description of RAID technology and will introduce you to techniques that are used for studying performance and reliability in general. 
Unit 8: Parallel Processing  Unit 8 Learning Outcomes  
8.1: The Reason for the Switch to Parallel Processing  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 1: Sequential Computer Architecture"  Read section 1.3 of Chapter 1 on pages 23 and 24 to learn about multicore chips. These two pages give a summary of processor and chip trends to overcome the challenge of increasing performance and addressing the heat problem of a single core. 
8.2: Limitations in Parallel Processing: Amdahl's Law  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 2: Parallel Computer Architecture"  Read sections 2.1 and 2.2 of Chapter 2 on pages 4145 to learn about parallel computer architectures. There are different types of parallelism: there is instructionlevel parallelism, where a stream of instructions is simultaneously in partial stages of execution by a single processor; there are multiple streams of instructions, which are simultaneously executed by multiple processors. The former was addressed in Unit 5. The latter is addressed in this reading. A quote from the beginning of the chapter states the key ideas: "In this chapter, we will analyze this more explicit type of parallelism, the hardware that supports it, the programming that enables it, and the concepts that analyze it." This reading begins with a simple scientific computation example, followed by a description of SISD, SIMD, MISD, and MIMD architectures. 
Ariel Ortiz Ramírez's "Parallelism and Performance"  Study the section titled "Amdahl's Law." Amdahl's law explains the limitations to performance gains achievable through parallelism. Over the last several decades or so, increases in computer performance have largely come from improvements to current hardware technologies and less from software technologies. Now, however, the limits to these improvements may be near. For significant continued performance improvement, either new physical technology needs to be discovered and/or transitioned to practice, or software techniques will have to be developed to get significant gains in computing performance. In the equation for Amdahl's law, P is the fraction of code that can be parallelized, i.e., that must be executed serially; S is the fraction of code that cannot be parallelized; and n is the number of processors. Note P + S is 1. If there are n processors, then P + S can be executed in the same time that P/n + S can be executed. Thus, the ratio of the time using 1 processor to the time of using n processors is 1/(P/n + S). This is the speedup in going from 1 processor to n processors. Note that the speedup is limited, even for large n. If n is 1, the speedup is 1. If n is very large, then the speedup is 1/S. If P is very small, then P/n is even smaller, and P/n + S is approximately S, i.e., the speedup is 1/S, but S is approximately S + P, which is 1. Therefore, the speed of execution of this code using 1 processor is about the same as using n processors. Another way of writing Amdahl's law is 1/(P/n + [1  P]). Thus, if P is close to 1, the speedup is 1/(P/n) or n/P, which is approximately n. Apply Amdahl's law to better understand how it works by substituting a variety of numeric values into this equation and sketching the graph of the equation. 

Blaise Barney's "Introduction to Parallel Computing, Chapter 6, Section 10: Limits and Costs of Parallel Programming"  In section 10 of Chapter 6, study the section titled "Amdahl's Law" up to the section titled "Complexity." 

8.3: Shared Memory and Distributed Memory Multiprocessing  Multiprocessing  Study these slides. This reading focuses on the problem of parallel software. It discusses scaling, uses a single example to explain shared memory and message passing, and identifies problems related to cache and memory consistency. 
8.4: Multicore Processors and Programming with OpenMP and MPI  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 2: Parallel Computer Architecture"  Read section 2.5 of Chapter 2 on pages 5268. The reading covers two extreme approaches to parallel programming. First, parallelism is handled by the lower software and hardware layers. OpenMP is applicable in this first case. Secondly, parallelism is handled by the programmer. MPI is applicable in the second case. 
Norm Matloff's "Programming on Parallel Machines"  Read Chapter 1 on pages 120. If you go to the table of contents, selecting the section will jump you to the desired page to avoid scrolling through the text. Chapter 1 uses a matrix times (multiplication) vector example in section 1.3.1. This chapter goes on to describe parallel approaches for computing a solution: section 1.3.2 describes a sharedmemory and threads approach; section 1.3.3 describes a message passing approach; section 1.3.4 describes the MPI and R language approach. Study these sections to get an overview of the idea of software approaches to parallelism. Read Chapter 2 on pages 21  30. This chapter presents issues that slow the performance of parallel programs. Read Chapter 3 on pages 31  66 to learn about shared memory parallelism. Parallel programming and parallel software are extensive topics and our intent is to give you an overview of them; more in depth study is provided by the following chapters. Read Chapter 4 on pages 67  100. This chapter discusses MP directives and presents a variety of examples. Read Chapter 5 on pages 101  136. This chapter presents GPUs (Graphic Processing Units) and the CUDA language. This chapter also discusses parallel programming issues in the context of GPUs and CUDA and illustrates them with various examples. Read Chapter 7 on pages 161  166. This chapter illustrates the message passing approach using various examples. Read Chapter 8 on pages 167  169 for a description of MPI (Message Passage Interface), which applies to networks of workstations (NOWs). The rest of the chapter illustrates this approach with various examples. Read Chapter 9 on pages 193  206 for an overview of cloud computing and the hadoop platform, which are interesting topics for today not just for parallel computing. Lastly, read section 10.1 of Chapter 10 on pages 207 and 208, which explains what R is. The rest of the chapters of the text and the four appendices cover other interesting topics. These chapters and the appendices are optional. 

Unit 9: Look Back and Look Ahead  Unit 9 Learning Outcomes  
9.1: Theory and Laws  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 2: Parallel Computer Architecture"  Read section 2.6 of Chapter 2 on pages 68  76 to learn about network topologies. If a task cannot be performed by a computer with one processor, we decompose the task into subtasks, which will be allocated to multiple hardware devices, say processors or arithmetic units or memory. These multiple hardware devices need to communicate so that the original task can be done with acceptable cost and performance. The hardware devices and their interconnections form a network. Consider another situation: suppose we have a large software system made up of a large number of subsystems, in turn composed of many software components, subcomponents, etc. Suppose we list all the lowest level subcomponent's names across the top of a sheet of paper. These will be our column headings. Also, let's list down the side of the same sheet the same subcomponent names. These will be our row headings. This forms a table or two by two matrix. Finally, suppose we put a 1 in the table whenever or wherever there is a connection between the subcomponent named in the column heading and the subcomponent named in the row heading. Let's put a 0 everywhere else. This table now represents a topology for our network of software components; it could also be done for hardware components. These components and their interconnections are part of the software architecture. Realize that the matrix could be huge: 100 by 100, 1000 by 1000, etc. The complexity of the interconnections is a factor in the reliability and performance of the architecture. Read section 2.7 of Chapter 2 on pages 77  79. This reading reviews Amdahl's law, an extension of Amdahl's law that includes communication overhead, and Gustafson's law. These laws express expected performance as the number of processors increases, or as both the size of the problem and number of processors increases. Then, read section 2.9 of Chapter 2 on pages 80 and 81 to learn about load balancing. This reading looks at the situation where a processor is idle and another is busy, which is referred to as a load imbalance. If the work were to be distributed differently among the processors, then the idle time might be able to be eliminated. This brief reading poses the load balance problem as a graph problem. 
9.2: Special Purpose Computing Architectures  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapter 2: Parallel Computer Architecture"  Read section 2.8 of Chapter 2 on pages 79 and 80 to learn about GPU computing. GPU stands for Graphics Processing Unit. These are of interest because of the increase in the amount of graphics data handled by popular laptops and desktop computers. Furthermore, it turns out that since a GPU does primarily arithmetic computations, the architecture of a GPU is applicable to other types of applications that involve arithmetic on large amounts of data. Read section 2.10 of Chapter 2 on pages 82 and 83 to learn about distributed, grid, and cloud computing. These are configurations of multiple computers for increasing performance and/or decreasing cost for high volume database access, sharing of resources, or accessing remote computer resources, respectively. 
Wikipedia: "TOP500"  Read this article, paying particular attention to the rankings of supercomputers, based on performance in running a LINPACK benchmark for computing a dense set of linear equations. Towards the end of the article, note the large number of cores in these supercomputers. 

9.3: Case Study: Special Purpose Applications of Parallel Computing  Eijkhout, Chow, and van de Geijn's "Introduction to HighPerformance Scientific Computing, Chapters 48"  Read Chapters 48, which describe problems pertaining to special application areas, called domains. 
Optional Course Evaluation Survey  Optional Course Evaluation Survey  Please take a few moments to provide some feedback about this course. Consider completing the survey whether you have completed the course, you are nearly at that point, or you have just come to study one unit or a few units of this course. Your feedback will focus our efforts to continually improve our course design, content, technology, and general easeofuse. Additionally, your input will be considered alongside our consulting professors' evaluation of the course during its next round of peer review. As always, please report urgent course experience concerns to [email protected] and/or our discussion forums. 