|Course Syllabus||Course Syllabus|
|1.1: Introduction to Computer Processors||Computer History||
Explore each of these articles, which focus on the early history of computers and give some insight into the early history of computers. They will introduce you to the powerful ideas that enabled computer architecture of today and that will influence computer architecture of tomorrow.
|History of Computing Hardware (1960–Present)||
The main purpose of this article is to take a look at the history of computers from the third generation computers of the 1960s to today's technology of microcomputers, which have allowed for a computer presence in people's homes.
|1.2: Components of a Computer||Personal Computer Hardware||
Read this article for a solid overview of various components of a computer, including the motherboard, power supply, removable media devices, secondary storage, sound cards, and input and output peripherals.
|1.3: The Role of Processor Performance||CPU and Processor Time Counter||
Read this introduction to computer performance and information about computing processor time. In most programming languages, there is a process to measure elapsed time, such as time() in C. By subtracting the time at the beginning of a process from the time at the end you get the total time for a particular operation. Usually short operations are put in a loop that repeats the operation a sufficient number of times to get an accurate measurement.
|Microprocessor Design and Performance||
This article gives more information on computer performance, including some helpful definitions.
This article discusses benchmarks, which help measure computer performance.
This article introduces Amdahl's law, which can be used to measure how certain improvements to performance can be measured. A general form of Amdahl's law is: (that part of the time that is improved/the improvement factor) plus that part of the time that is not improved equals the new improved time. Speedup factor is the old time divided by the new time. Be sure to study the examples.
|1.4: The Power Problem||The Need For A Radical New Type Of Computer Architecture||
This article is about the challenges facing computer architecture in building more powerful computers for high performance applications and for faster, cheaper, more efficient computers for IT applications. The author is responding to another article on extreme-scale computing, which you may read by following the link if you desire.
|1.5: The Switch to Parallel Processing||Parallel Computing Landscape||
Watch this lecture for an understanding of the reasons behind the switch to parallel computing. This lecture provides motivation, insight into thinking about computer architecture, and an explanation computing trends.
|1.6: Case Study: A Recent Intel Processor||Data Types, Operators, and Variables||
The beginning of the lecture is administrative, so you may skip to around 16:14. It introduces the concept of computational thinking. While the course is an introduction to programming, computational thinking applies to both software (via programming) and to hardware (via computer architecture). Pay particular attention to the computer block diagram, the program counter and the way instructions relate to the computer hardware.
|2.1: Computer Hardware Operations||Introduction to Programming Languages||
Read this article to see how a program is edited, compiled (or assembled), linked, and executed in the computer.
Read this article, which discusses how a MIPS assembly language statement is assembled into machine code. Carefully study the example in the article.
|The Machine Cycle||
Read this article about how a computer actually executes the machine code produced by the compiler (assembler). Registers are simply small memory units that are usually 8, 16, 32, or more bits long. The program counter and instruction register are used in the machine cycle.
|2.2: Number Representation in Computers||Introduction to Number Systems||
Read this introduction to number systems.
|More on Number Systems||
Read this article on how numbers can be represented in 1's and 0's that the computer can understand. Study the way numbers can be converted into other representations, like binary, 2's complement, and so on. Also be sure to watch the videos, including the one on floating point representation. Study the examples of binary addition to make sure you understand how it works.
Read this article on the representation of real numbers. Read up to the section labeled "Addition and subtraction."
|Practice with Number Systems||
Study this article if you need more practice at understanding number systems.
|Converting Decimal Numbers to Binary||
This article demonstrates one method for converting from decimal to binary. This conversion is often used as a first step in converting from decimal to any other representation.
|Another Way to Convert from Decimal to Binary||
This article illustrates a second method for converting from decimal to binary. Use whichever method you prefer.
This page contains an interactive decimal to binary conversion when there are fractions involved. Click on the 0's to change them into 1's. You can also see the 2's complement representation for negative numbers.
|2.3: Instruction Representation||MIPS Instructions||
Read this article for an introduction to the three different instruction formats for the MIPS processor: the R-Format, the I-Format, and the J-Format instructions. MIPS is an acronym that stands for Microprocessor Instructions without Interlocked Pipeline Stages. MIPS is a RISC (Reduced Instruction Set Computer) introduced by MIPS technologies. Also, ISA, if you encounter it, stands for Instruction Set Architecture.
|2.4: Logical and Arithmetic Instructions||MIPS Arithmetic Instructions||
Read this article to learn about arithmetic and logical instructions for the MIPS processor.
|2.5: Control Instructions||MIPS Control Flow Instructions||
Read this article to learn about the control flow instructions for the MIPS processor, including the basic ones: jump and branch instructions.
|2.6: Instructions for Memory Operations||MIPS Memory Instructions||
Read this article to learn about memory instructions for the MIPS processor.
|2.7: Different Modes for Addressing Memory||Addressing Memory||
Read this article to study the various formats for addressing memory.
|MIPS Instruction Format||
MIPS assembly is an assembly language, which is a mnemonic or meaningful code for the machine language format in computer programming. Read this section to get a sense of how the addresses to memory are coded in a MIPS microprocessor.
|2.8: Case Study: Intel and ARM Instructions||X86 Instructions and ARM Architecture||
Read this article, which gives two examples of instructions set architectures (ISAs). Look over how the different microprocessors address memory. Take note of similarities and differences of format, instructions and type of instructions, and addressing modes between these two as well as between these and the MIPS instructions of the previous sections.
|3.1: Beginning Design: Logic Gates, Truth Table, and Logic Equations||Logic Design Principles||
Study these important design principles, which are applicable to any type of design, particularly computer system design, software, and hardware. Consider these principles alongside the other design considerations we use as a guide to computer system design.
Read the introduction, look at the different logic families under "Electronic Gates" and study carefully the sections "Symbols" and "Truth Tables." Scan the rest of the article. Truth tables list the total possible input combinations of 1's and 0's and the corresponding outputs for each gate. Logic devices are physical implementations of Boolean logic and are built from components, which have gotten larger and more complex over time, for example: relays and transistors, gates, registers, multiplexors, adders, multipliers, ALUs (arithmetic logic units), data buses, memories, interfaces, and processors. These devices respond to control and data signals specified in machine instructions to perform the functions for which they were designed.
This section on ladder logic is optional. If you would like to know more, read these sections.
Read this chapter on Karnaugh mapping, which is a tabular way for simplifying Boolean logic. There are several ways for representing Boolean logic: algebraic expressions, which use symbols and Boolean operations; Venn diagrams, which use distinct and overlapping circles; and tables relating inputs to outputs (for combinational logic) or tables relating inputs and current state to outputs and next state (for sequential logic). When designing sequential logic, some of the components are memory devices. Cost and processing time are considerations in using memory devices, which can be expensive. To reduce the cost or processing time the logic can be simplified. This simplification can be done using algebraic rules to manipulate the symbols and operations, analysis of the areas inside the circles for Venn diagrams, or Karnaugh maps for input/output tables.
|3.2: Combinational Logic||Combinational Logic Functions||
Read this chapter, which describes the design of several components using logic gates, including adders, encoders and decoders, multiplexers, and demultiplexers. This chapter also mentions ladder logic. If you are not familiar with ladder logic, you may also optionally read Chapter 6.1 and 6.2 as a reference. Note that ladder logic is not generally used in computer design and may be omitted. For the one input decoder, note that for a 0 input, the D0 output is a 1 and when the input is a 1, the D1 output is a 1. All the other decoders work the same way. One output line is a 1and the rest are 0's, indicating which binary number has been placed on the input lines. So an input in binary of the number 6 would cause D6 to be a 1.
|3.3: Flip-Flops, Latches, and Registers||Multivibrators||
Read this chapter, which discusses how logic gates are connected to store bits (0s and 1s). Combinational circuits, described in the previous section, do not have memory. Using logic gates, latches and flip flops are designed for storing bits. Groups of flip flops are used to build registers which hold strings of bits. For each storage device in Chapter 10, focus on the overview at the beginning of the section and the review of the device's characteristics at the end of its section. While you do not absolutely need to know the details of how latches and flip flops work, you might find the material of interest. We strongly recommend that you read the details of the design of each storage device, because it will give you a stronger background.
|3.4: Sequential Logic Design||Sequential Circuits||
Read this chapter on sequential circuits. Combinatorial circuits have outputs that depend on the inputs. Sequential circuits and finite state machines have outputs that depend on the inputs AND the current state, values stored in memory.
|3.5: Case Study: Design of a Finite State Machine (FSM) to Control a Vending Machine||Finite State Automata||
Read this article for an example of a finite state machine design of a simple vending machine.A sequential circuit is also called a sequential machine or a finite state machine (FSM) or a finite state automaton. This case study gives an example of the design of a sequential circuit. Each state is assigned a distinct set of 1's and 0's, not using more variables than necessary. Thus 4 states require 2 variables, 00, 01, 10 and 11. A binary table represents the input/output behavior of the circuit. We use a sequential circuit, because the output also depends on the state. Recall that state requires memory (that is, flip flops). Thus, a binary table with entries that give the output and next state for given inputs and current state represents the design of the machine. A finite state machine diagram can also represent the design: circles represent states; arrows represent transitions next to states; and inputs and outputs label the arrows (sometimes written as input/ output). Finally, Boolean equations can also represent the design. Lastly, Karnaugh maps or Boolean logic rules can be used to simplify (or minimize) the equations, and thus the design.
|4.1: Number Representation||Floating Points||
Read this article on the representation of real numbers.
|Integers and the Representation of Real Numbers||
Read these sections on the representation of integers and real numbers. Earlier, you read about number systems and the representation of numbers used for computing. This will give you a chance to review that material. Computer architecture comprises components which perform the functions of storage of data, transfer of data from one component to another, computations, and interfacing to devices external to the computer. Data is stored in terms of units, called words. A word is made up of a number of bits, typically, depending on the computer, 32 bits or 64 bits. Words keep getting longer, with larger numbers of bits. Instructions are also stored in words. Before, you saw examples of how instructions are stored in a word or words. Now, you will see how numbers are stored in words.
|4.2: Addition and Subtraction Hardware||Add and Subtract Blocks||
Study this article to learn how addition is implemented and carried out at the gate level. Nowadays, computers are architected using larger components. For example, to perform addition and subtraction, computer architects utilize ALUs, arithmetic logic units. You can design a computer without knowing the details of an ALU or of an adder, similar to using a calculator to find the square root of a number without knowing how to manually compute the square root (or in computer science terminology, without knowing the algorithm that the calculator performs to find the square root). However, we want you to have the strongest foundation in your study of computer architecture. Knowing the underlying algorithm for larger components, you will be able to better use them in constructing larger components, for example, using half adders to construct a full adder. A half adder takes 2 bits as input and outputs a sum and a carry bit; a full adder takes 2 operand bits and a carry bit as input, and outputs a sum bit and a carry bit. To add two floating point numbers, one or both need to be put in a form such that they have the same exponent. Then, the mantissas are added. Lastly, the result is normalized. We will cover floating point addition later. Subtraction is not discussed explicitly, because it is done via addition by reversing the sign of the number to be subtracted (subtrahend) and adding the result to the number subtracted from (minuend). You have to be careful when subtracting floating point numbers, because of the possible large round off error in the result.
|4.3: Multiplication||Binary Multipliers||
Study this article to learn how multiplication is performed using basic steps that are carried out by elemental logic components, such as an adder and shifter. Two numbers are loaded into registers (fast word storage) and multiplication is carried out by repeated addition and shifting (moving the bits in a register to the right or left). Multiplication by an integer is performed by shifting the bits in the word to the left. Each shift multiplys by 2. Note that the ALU consists of the adder and associated circuitry that can also do ands, ors, inversions and other operations, so multiplications can be done by the adder or even by programs steps which can do the same process. A hardware multiplier can also be used, which is faster than using the adder and shifter or using programmming. To multiply two floating point numbers, add the exponents, multiply the mantissas, normalize the mantissa of the product, and adjust the exponent accordingly. Floating point multiplication will be covered later.
Read this article which gives another overview of multipliers.
|4.4: Floating Point Arithmetic||Floating Point Arithmetic and Error Analysis||
Read these sections, which explain error analysis when numbers are represented using a fixed number of digits. This issue mostly arises when using floating-point numbers. Real numbers are represented using a fixed number of bits. The number of bits is the precision of the representation. The accuracy of the representation is described in terms of the difference between the actual number and its representation using a fixed number of bits. This difference is the error of the representation. The accuracy becomes more significant because computations can cause the error to get so large that the result is meaningless and potentially even high risk depending on the application. These sections also explain how programming languages approach number representation.
Read this article to get the basics of division. Again one can divide by an integer by shifting right. Each shift is a divide by 2. Note also that using the add and shift hardware or using an algorithm to do the same thing is slower that using a hardware divider.
|Arithmetic for Computers||
This article nicely sums up what the ALU does. It includes block diagrams of the ALU, not present in the other articles.
|4.6: Case Study: Floating Point Arithmetic in an x86 Processor||Extended Precision||
Read this article to learn about minimizing roundoff and overflow in floating point arithmetic using extended precision.
|5.1: Von Neumann Architecture||The Von Neumann Architecture||
Read this section to learn about sequential or Von Neumann computer architecture. Computer architecture is the high-level computer design comprising components that perform the functions of data storage, computations, data transfer, and control.
|5.2: Simple MIPS Processor Components||An Introduction to Processor Design||
Watch this lecture, which introduces the components of MIPS architecture that are required to process a subset of MIPS instructions. This is the first lecture of a series of video lectures on the design of a MIPS processor. This series of six videos incrementally design a processor that implements a subset of eight MIPS five arithmetic instructions, two memory reference instructions, and one flow control instruction. Using hardware components, referred to as building blocks, and a microprogram control, the lecturer executes these eight instructions to develop a simple design of a processor. In this lecture, Kumar discusses the performance of the design and performance improvement using a multi-cycle design. Also, Kumar identifies an extension of the design to deal with exceptional cases that could occur when executing programs written using the eight instructions (exception handling). Lastly, the lecturer identifies increments to the design to include additional instructions.
|5.3: Designing a Datapath for a Simple Processor||Datapaths||
Watch this lecture, which is the second video in a series of six video lectures on the design of a MIPS processor. The previous video lecture presented the processor building blocks that will be used in the series. This video lecture explains how to build a datapath of the MIPS architecture to process a subset of the MIPS instructions. We will take R-format instructions and memory instructions and look into the datapath requirements to process them. Then, the other instructions will be addressed one at a time, and incremental changes to the design will be made to handle them. The data path and controller will be interconnected, and the (micro) control signals to perform the correct hardware operations at the right time will be identified. Later, we will look at the design of the controller.
|5.4: Alternative Approach to Datapath Design and Design of a Control for a Simple Processor||Processor Design Control||
Watch this lecture, which is the third video of the series of video lectures on the design of a simple processor. This lecture explains how to build the control part of the MIPS architecture that is required to process a subset of MIPS instructions. This builds on the datapath design from the last lecture. That datapath design approach started with a design for the R class instructions with operands in registers, such as add, subtract, 'and', 'or', and 'less than'. It then included the other instructions, one at a time, and incremented the design to accommodate them. In this video lecture, an alternative approach is used to arrive at the same design. Here, the datapath for the arithmetic and logic instructions is designed. Then, the data path for the store, then for the load, then the branch on equal, and, finally, the jump are designed individually. Next, datapath design for all eight instructions is the union of the five individual designs. The control signals for each instruction are identified and combined to form a truth table for a controller, which is implemented using a PLA (program logic array). The video concludes with a performance/delay analysis of the design to show the limitations of a single cycle datapath. Next, we will look at pipelining for increased performance.
|5.5: Pipelining and Hazards||Pipelined Processor Design||
Watch this lecture, which presents basic ideas on improving processor performance through the use of pipelining. The previous video showed the limitations of a single cycle datapath design. To overcome the limitations and improve performance, a pipeline datapath design is considered. This video lecture explains pipelining. A pipeline datapath is analogous to an assembly line in manufacturing. First, you develop a skeleton design of a pipeline datapath. Performance analysis shows that several types of delays can arise, called hazards: structure, data, or control hazards. Design can address those arising from structure. However, data and control delays cannot always be prevented. The next video lecture will complete the design of a pipeline datapath for the simple processor.
|5.6: Pipelined Processors||Pipelined Processor Datapaths||
Watch this lecture, which explains how to design a pipelined MIPS processor. The previous video introduced pipelining as a way to increase performance. It showed how hazards can limit the performance improvement of a pipeline datapath. This video lecture completes the design of a pipeline datapath. Ignoring hazards, the lecturer designs a control for the pipeline, integrates all the components including the control with the pipeline, and then considers the behavior with respect to hazards.
|5.7: Instruction-Level Parallelism||Understanding Parallelism||
Read this article, which discusses granularity of parallelism. On a uniprocessor, instruction level parallelism includes pipelining techniques and multiple functional units. Parallel processing using multi-processors will be covered in Unit 8.
Read the explanation of instruction level parallelism. Parallelism can occur at different levels of granularity, and when discussing parallelism, we need to be clear on exactly what is being done in parallel.
|6.1: Elements of Memory Hierarchy and Caches||The Basics of Memory Hierarchy||
Watch this lecture. Previously, we focused on processor design to increase performance. Now, we will turn to memory. This video introduces various methods of improving processor performance through the use of memory hierarchy. The lecture discusses memory technologies, which vary in cost and speed. We have to assume that memory is flat, but with current technology, flat memory does not meet performance demands placed on it. You will take a look at hierarchical memory and the use of caches. This video will also discuss the analysis of memory hierarchies and cache performance with respect to miss rates and block size. Finally, the lecturer considers cache policy.
|Sequential Computer Architecture||
Read section 1.2 to learn about memory hierarchies, and read section 1.4 to learn about locality and data reuse.
|6.2: Cache Architectures and Improving Cache Performance||Cache Organization||
Watch this lecture on memory hierarchy design with caches. This lecture discusses the impact that memory operations have on overall processor performance and identifies different cache architectures that can improve overall processor performance. In the memory hierarchy, from top to bottom, there are: processor registers, cache, main memory, and secondary memory. The order goes from faster, smaller capacity, and higher cost to slower, higher capacity, and lower cost. Data moves from the main memory to cache. To do this requires a mapping from addresses in the main memory to cache addresses. How this is done affects performance. Performance analysis of cache and memory organization involves miss rate and miss penalty. Policies for reading, loading, fetching, replacing, and writing memory also affect cost and performance.
|Programming Strategies for High Performance||
Read section 1.5, which discusses the relationship of pipelining and caching to programming.
|6.3: Main Memory and Virtual Memory||Virtual Memory||
Watch this lecture, which addresses the subject of virtual memory. This topic pertains to the relationship of main memory and secondary memory and has similarities and differences to the relationship of cache and main memory. Differences arise due to the speed of the various memories and their capacities. This lecture discusses virtual memory, mapping from virtual memory addresses to physical addresses in main and paging techniques used with the main memory. The lecture discusses the use of page tables in translating virtual addresses to physical addresses. Issues that arise with page tables include structure, location, and large size.
|6.4: Performance Tuning||Parallel Computing||
Read Chapters 1–5. Before reading these chapters, list the factors that you can think of that can affect performance, like memory performance, cache, memory hierarchy, multi-cores, and so on, and what you might suggest as ways to increase performance. After reading these chapters, what might you add, if anything, to your list?
|7.1: I/O Devices||Introduction to I/O Subsystems||
Watch this lecture, which discusses the basic ideas behind the input/output (I/O) subsystem of a computer system. The lecturer also looks into performance measurement for I/O devices and interfaces used to interconnect I/O devices to the processor. This is the first of two video lectures on I/O. A computer subsystem consists of three major components: processor, memory, and connections. The key words in the previous sentence are subsystem and connections. To be useful, a computer system needs to have connections with external devices to get data and control signals into the computer and to put data and control signals out. The external devices can be other systems or other systems may be connected to the same devices. Thus, our computer system is part of a network of a few or many other subsystems interconnected to perform useful tasks. We are always interested in how well a task is performed, in terms of time, capacity, and cost. In considering performance relative to our useful task, we have to consider processor performance, memory performance, and the performance of the connections, including the performance of the external devices. This first video lecture looks at external or peripheral devices and I/O performance. Next, we will discuss interfaces, buses, and I/O transfer.
|7.2: Connecting I/O Devices to the Processor||Interfaces and Buses||
Watch this lecture for an introduction to the interconnection schemes used for the input/output (I/O) subsystem of a computer system. This is the second of two lectures on I/O devices. The previous lecture looked at the connection of memory, either cache or main memory, with peripheral devices and the transfer and transformation of data between them. This video lecture analyzes alternative interconnection schemes with a focus on buses. Also, it discusses protocols for the data that flows on the buses: asynchronous and synchronous. A synchronous protocol uses a clock to time sequence the information flow. An asynchronous protocol does not use a clock; the signal carries the sequencing information. Then, the lecture shows a performance comparison of two different protocols.
Be aware that the last example on this resource has an error. When finding the BW, it says 256 x 4 x 1/14400. It should be 256 x 8 x 1/14400. A byte is 8 bits.
|7.3: Measuring Disk Performance||Hard Disk Drive Performance Characteristics||
Read this article. Disks have various characteristics, which determine quality attributes, such as reliability, performance, etc. Here, we are interested in performance. Think of some characteristics that affect performance. How can performance of a disk be measured?
|7.4: Redundant Array of Inexpensive Disks (RAID)||RAID||
This article will provide you with a description of RAID technology and will introduce you to techniques that are used for studying performance and reliability in general.
|8.1: The Reason for the Switch to Parallel Processing||Multi-Core Chips||
Read section 1.3 to learn about multi-core chips. These two pages give a summary of processor and chip trends to overcome the challenge of increasing performance and addressing the heat problem of a single core.
|8.2: Limitations in Parallel Processing: Amdahl's Law||Introduction to Parallel Computer Architecture||
Read sections 2.1 and 2.2 to learn about parallel computer architectures. There are different types of parallelism: there is instruction-level parallelism, where a stream of instructions is simultaneously in partial stages of execution by a single processor; there are multiple streams of instructions, which are simultaneously executed by multiple processors. A quote from the beginning of the chapter states the key ideas:
"In this chapter, we will analyze this more explicit type of parallelism, the hardware that supports it, the programming that enables it, and the concepts that analyze it."
This chapter begins with a simple scientific computation example, followed by a description of SISD, SIMD, MISD, and MIMD architectures."
|Parallelism and Performance||
Study the "Amdahl's Law" section. Amdahl's law explains the limitations to performance gains achievable through parallelism. Over the last several decades or so, increases in computer performance have largely come from improvements to current hardware technologies and less from software technologies. Now, however, the limits to these improvements may be near. For significant continued performance improvement, either new physical technology needs to be discovered and/or transitioned to practice, or software techniques will have to be developed to get significant gains in computing performance.
In the equation for Amdahl's law, P is the fraction of code that can be parallelized (that is, the part that must be executed serially); S is the fraction of code that cannot be parallelized; and n is the number of processors. P + S is 1. If there are n processors, then P + S can be executed in the same time that P/n + S can be executed. Thus, the ratio of the time using 1 processor to the time of using n processors is 1/(P/n + S). This is the speedup in going from 1 processor to n processors.
The speedup is limited, even for large n. If n is 1, the speedup is 1. If n is very large, then the speedup is 1/S. If P is very small, then P/n is even smaller, and P/n + S is approximately S. The speedup is 1/S, but S is approximately S + P, which is 1. Therefore, the speed of execution of this code using 1 processor is about the same as using n processors.
Another way of writing Amdahl's law is 1/(P/n + [1 - P]). Thus, if P is close to 1, the speedup is 1/(P/n) or n/P, which is approximately n.
Apply Amdahl's law to better understand how it works by substituting a variety of numeric values into this equation and sketching the graph of the equation.
|Limits and Costs of Parallel Programming||
From section 4 of Chapter 3 of this resource, study the section titled "Amdahl's Law."
|8.3: Shared Memory and Distributed Memory Multiprocessing||Multiprocessing||
Study this material, which focus on the problem of parallel software and discusses scaling by using an example to explain shared memory and message passing, and identify some problems related to cache and memory consistency.
|8.4: Multicore Processors and Programming with OpenMP and MPI||Parallel Programming||
Read section 2.5, which covers two extreme approaches to parallel programming. First, parallelism is handled by the lower software and hardware layers. OpenMP is applicable in this first case. Secondly, parallelism is handled by the programmer. MPI is applicable in the second case.
|Programming on Parallel Machines||
If you want to know more about programming on parallel machines, you may read this book.
Chapter 1 uses a matrix times (multiplication) vector example in section 1.3.1. It goes on to describe parallel approaches for computing a solution. Section 1.3.2 describes a shared-memory and threads approach; section 1.3.3 describes a message passing approach; section 1.3.4 describes the MPI and R language approach. Study these sections to get an overview of the idea of software approaches to parallelism.
Chapter 2 presents issues that slow the performance of parallel programs.
Chapter 3 discusses shared-memory parallelism. Parallel programming and parallel software are extensive topics and our intent is to give you an overview of them; a more in-depth study is provided by the following chapters.
Chapter 4 discusses MP directives and presents a variety of examples.
Chapter 5 presents GPUs (Graphics Processing Units) and the CUDA language. This chapter also discusses parallel programming issues in the context of GPUs and CUDA and illustrates them with various examples.
Chapter 7 illustrates the message-passing approach using various examples.
Chapter 8 describes MPI (Message Passage Interface), which applies to networks of workstations (NOWs). The rest of the chapter illustrates this approach with various examples.
Chapter 9 gives an overview of cloud computing and the Hadoop platform, which are interesting topics for today not just for parallel computing.
Section 10.1 in Chapter 10 explains what R is.
|9.1: Theory and Laws||Topologies||
Read section 2.6 to learn about network topologies. If a task cannot be performed by a computer with one processor, we decompose the task into subtasks, which will be allocated to multiple hardware devices, say processors or arithmetic units or memory. These multiple hardware devices need to communicate so that the original task can be done with acceptable cost and performance. The hardware devices and their interconnections form a network.
Consider another situation: suppose we have a large software system made up of a large number of subsystems, in turn composed of many software components, subcomponents, etc. Suppose we list all the lowest level subcomponent's names across the top of a sheet of paper. These will be our column headings. Also, let's list down the side of the same sheet the same subcomponent names. These will be our row headings. This forms a table or two by two matrix. Finally, suppose we put a 1 in the table whenever or wherever there is a connection between the subcomponent named in the column heading and the subcomponent named in the row heading. Let's put a 0 everywhere else. This table now represents a topology for our network of software components; it could also be done for hardware components. These components and their interconnections are part of the software architecture. Realize that the matrix could be huge: 100 by 100, 1000 by 1000, and so on. The complexity of the interconnections is a factor in the reliability and performance of the architecture.
Read section 2.7, which reviews Amdahl's law, an extension of Amdahl's law that includes communication overhead, and Gustafson's law. These laws express expected performance as the number of processors increases, or as both the size of the problem and number of processors increases.
Then, read section 2.9 to learn about load balancing. This section looks at the situation where a processor is idle and another is busy, which is referred to as a load imbalance. If the work were to be distributed differently among the processors, then the idle time might be able to be eliminated. In this case, the load balance problem is explained as a graph problem.
|9.2: Special Purpose Computing Architectures||GPU, Distributed, Grid, and Cloud Computing||
Read section 2.8 to learn about GPU computing. GPU stands for Graphics Processing Unit. These are of interest because of the increase in the amount of graphics data handled by popular laptops and desktop computers. Furthermore, it turns out that since a GPU does primarily arithmetic computations, the architecture of a GPU is applicable to other types of applications that involve arithmetic on large amounts of data.
Read section 2.10 to learn about distributed, grid, and cloud computing. These are configurations of multiple computers for increasing performance and/or decreasing cost for high volume database access, sharing of resources, or accessing remote computer resources, respectively.
Read this article, paying particular attention to the rankings of supercomputers, based on performance in running a LINPACK benchmark for computing a dense set of linear equations. Towards the end of the article, note the large number of cores in these supercomputers.
|9.3: Case Study: Special Purpose Applications of Parallel Computing||Linear Algebra in High-Performance Computing||
If you want to learn more about parallel computing, read chapters 4–8, which describe problems pertaining to special application areas, called domains.
|Course Feedback Survey||Course Feedback Survey|