Data science must deal with the various forms that data can take on when it is collected. While Python lists and dictionaries are powerful, data can often come in the form of numerical arrays or spreadsheets. This unit introduces the numpy module, which is useful for performing matrix operations on data contained within arrays. When you finish this unit, you will be able to implement a host of matrix operations relevant to data science.
Much of the syntax used for lists, such as indexing and slicing, naturally carries over to numpy arrays. Array operations in numpy are typical of what you would expect from basic linear algebra (such as matrix addition and multiplication, systems of linear equations, determinants, and matrix inverses). Additionally, methods within the matplotlib module can accept numpy arrays as input. Finally, we will also take a look at file handling using numpy, since having this skill is often useful for machine learning applications.
Completing this unit should take you approximately 6 hours.
Technically speaking, vectors and matrices are mathematical objects usually applied within the context of linear algebra. An array, on the other hand, is the computer science data structure used to represent a matrix. So, for example, a vector can be stored in a one-dimensional array. A matrix can be stored in a two-dimensional (or "double subscript") array. Much of data science depends upon using arrays (often multi-dimensional) containing numerical data. Before delving into the numpy module, use this playlist to review basic vector and matrix operations.
In contrast to many other programming languages, the basic Python language does not have an explicit array data structure. Instead, arrays can be constructed using nested lists; however, methods for basic matrix operations would then have to be formulated. The numpy module relieves the programmer of this responsibility so that they can readily perform array computations. The first step in applying this module is to learn the syntax for constructing an array.
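As a minimal sketch of the construction syntax, assuming numpy is installed and imported under the conventional alias np, the example below builds a one-dimensional and a two-dimensional array from lists. The variable names are illustrative only.

```python
import numpy as np

# A vector stored in a one-dimensional array, built from a flat list.
v = np.array([1.0, 2.0, 3.0])

# A matrix stored in a two-dimensional array, built from a nested list.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(v.ndim, v.shape)  # 1 (3,)
print(A.ndim, A.shape)  # 2 (2, 3)

# With plain nested lists, + concatenates the outer lists.
nested = [[1, 2], [3, 4]]
list_result = nested + nested

# With arrays, + adds the corresponding elements.
array_result = np.array(nested) + np.array(nested)

print(list_result)   # [[1, 2], [3, 4], [1, 2], [3, 4]]
print(array_result)  # elementwise sum of the two matrices
```

The last two lines illustrate why nested lists alone are not enough: the list version needs hand-written loops to add matrices, while the array version does not.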
The numpy module contains a multitude of methods for array processing. Use the materials in this section to gain fluency in numpy syntax for manipulating arrays. Pay close attention to how you can use relational operators to extract specific elements from an array.
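The use of relational operators to extract elements might look like the following sketch. The scores array and the thresholds are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical data: six test scores.
scores = np.array([72, 95, 58, 88, 41, 67])

# A relational operator applied to an array yields a boolean array.
mask = scores >= 70

# Indexing with the boolean array keeps only the elements where it is True.
passing = scores[mask]

# Conditions are combined with & (and) and | (or); the parentheses are required.
middle = scores[(scores > 50) & (scores < 90)]

# Summing a boolean array counts the True entries.
num_passing = int(np.sum(mask))

print(passing)      # [72 95 88]
print(middle)       # [72 58 88 67]
print(num_passing)  # 3
```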
As you approach the end of this introduction to constructing arrays in numpy, read this section to familiarize yourself with various numpy tools used throughout the course. Specifically, the shape and size attributes, along with functions such as linspace, reshape, eye, and zeros, come up often when manipulating arrays.
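A short sketch of the names listed above follows. The use of np.arange to generate sample values is an addition for illustration and is not named in the text.

```python
import numpy as np

# zeros: an array of the given shape filled with 0.0.
Z = np.zeros((2, 3))

# eye: the identity matrix of the given size.
I = np.eye(3)

# linspace: evenly spaced values, with both endpoints included by default.
x = np.linspace(0.0, 1.0, 5)

# reshape: the same six values rearranged into 2 rows and 3 columns.
grid = np.arange(6).reshape(2, 3)

# shape and size are attributes, so they are read without parentheses.
print(grid.shape)  # (2, 3)
print(grid.size)   # 6
print(x)           # [0.   0.25 0.5  0.75 1.  ]
```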
One major theme of numpy is that your intuition for indexing ordered containers such as Python lists should naturally carry over to arrays. Therefore, the syntax for techniques such as slicing remains consistent. Watch this video to practice the gymnastics of indexing and slicing when applied to numpy arrays.
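The sketch below shows list-style indexing and slicing on arrays, plus the comma-separated form used for two-dimensional arrays. It ends with one difference worth knowing: a slice of a numpy array is a view of the original data, whereas a slice of a list is a copy. The sample values are arbitrary.

```python
import numpy as np

a = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]

first_slice = a[2:5]       # same syntax as a list: [2 3 4]
every_other = a[::2]       # [0 2 4 6 8]
last_three = a[-3:]        # [7 8 9]

M = np.arange(12).reshape(3, 4)
element = M[1, 2]          # row 1, column 2 -> 6
column = M[:, 1]           # every row, column 1 -> [1 5 9]
block = M[:2, 1:3]         # rows 0-1, columns 1-2 -> [[1 2] [5 6]]

# A list slice is a copy, so changing it leaves the list untouched.
lst = list(range(10))
lst_slice = lst[2:5]
lst_slice[0] = 99

# An array slice is a view, so changing it also changes the array.
b = np.arange(10)
view = b[2:5]
view[0] = 99

print(lst[2])  # 2
print(b[2])    # 99
```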
Before continuing, use this video to summarize and reinforce everything discussed in this unit. Additionally, pay close attention to various indexing tricks and shortcuts that can help write compact and efficient code.
Now that you have mastered some basic numpy syntax, it is time to see how array representations can correspond to actual data. In other words, numbers in an array, by intentional construction of the data scientist, are meant to represent the characteristics of the data. Data can come in many different forms depending upon the application. One-dimensional data might be derived from an audio signal. Two-dimensional data might be derived from an image. Three-dimensional data might be derived from a video signal. A given array representation is clearly problem dependent, so it is important to see some examples of how this is accomplished in practice.
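To make the three cases concrete, here is a sketch using synthetic data only. The sample rate, tone frequency, image size, and frame count are all hypothetical, and the (frames, height, width) ordering for video is one common convention rather than the only one.

```python
import numpy as np

# One-dimensional: one second of a 440 Hz tone sampled 8000 times.
sample_rate = 8000
t = np.linspace(0.0, 1.0, sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440.0 * t)

# Two-dimensional: a tiny grayscale image, black with a white square.
image = np.zeros((4, 6), dtype=np.uint8)
image[1:3, 2:4] = 255

# Three-dimensional: a grayscale video as a stack of three identical frames.
video = np.stack([image, image, image])

print(audio.shape)  # (8000,)
print(image.shape)  # (4, 6)
print(video.shape)  # (3, 4, 6)
```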
No discussion of numpy would be complete without introducing how matrix computations are accomplished. For instance, it is important to understand how analogs of methods within the math module (such as cos, log, and exp) can be applied in the numpy module. Additionally, you should give the concept of broadcasting your undivided attention because it defines the rules for how arithmetic operations are applied to arrays of different shapes.
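The following sketch illustrates both ideas with arbitrary values: np.exp as the array analog of math.exp, and broadcasting of a row and a column against a 2 x 3 matrix.

```python
import math
import numpy as np

x = np.array([0.0, 1.0, 2.0])

# math.exp works on one number at a time; np.exp works on the whole array.
one_at_a_time = [math.exp(value) for value in x]
all_at_once = np.exp(x)

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])     # shape (3,)
col = np.array([[100.0],
                [200.0]])              # shape (2, 1)

# A scalar is broadcast to every element.
doubled = M * 2

# The row is broadcast down the rows: it is added to each row of M.
plus_row = M + row

# The column is broadcast across the columns: it is added to each column of M.
plus_col = M + col

print(plus_row)  # [[11. 22. 33.] [14. 25. 36.]]
print(plus_col)  # [[101. 102. 103.] [204. 205. 206.]]
```

Shapes are compared from the last dimension backward. Two dimensions are compatible when they are equal or when one of them is 1, which is why the (3,) row and the (2, 1) column both combine with the (2, 3) matrix.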
The numpy module is often used in conjunction with the matplotlib module to visualize array data. Use this video to solidify your understanding of visualization of matrix computations.
It is important to have some knowledge of file handling using numpy. Data science applications often require the capacity to save and load array data. This section will help to extend your understanding of file handling constructs beyond that provided within the basic Python language.
As data arrays get larger and larger, it becomes necessary to compress them if they are to be saved to a file. Data science and machine learning applications using numpy often deal with saving and loading compressed data in the form of files with .npz extensions. This material introduces how to load compressed numpy files.
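Loading a compressed file could look like the sketch below. So that the example runs on its own, it first writes a small .npz file to a temporary directory. The file name and the array names features and labels are hypothetical.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.npz")

    # Setup only: create a small compressed file to load.
    np.savez_compressed(
        path,
        features=np.arange(6).reshape(2, 3),
        labels=np.array([0, 1]),
    )

    # np.load on an .npz file returns a dictionary-like object.
    # The with statement closes the underlying file when done.
    with np.load(path) as data:
        names = sorted(data.files)   # names of the stored arrays
        features = data["features"]  # each array is read by name
        labels = data["labels"]

print(names)           # ['features', 'labels']
print(features.shape)  # (2, 3)
```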
This page gives an example of saving a compressed file using numpy. As with any file format, a programmer should feel comfortable reading and writing files. The reason for compressing files is that the file sizes in machine learning applications can be enormous. Anything you can do to reduce the payload can result in more efficient file processing.
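As a rough sketch of the size savings, the example below saves the same array with np.savez (no compression) and np.savez_compressed, then compares the file sizes. An array of zeros is used because it compresses extremely well; real data will usually shrink far less. File names are hypothetical.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    big = np.zeros((1000, 1000))  # about 8 MB of float64 values

    plain_path = os.path.join(tmp, "plain.npz")
    packed_path = os.path.join(tmp, "packed.npz")

    np.savez(plain_path, big=big)              # archive without compression
    np.savez_compressed(packed_path, big=big)  # archive with compression

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

    # Reading the compressed file back gives the original values.
    with np.load(packed_path) as data:
        restored = data["big"]

print(plain_size, packed_size)
print(np.array_equal(restored, big))  # True
```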
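Here is a brief comparison between .npy and .npz files. A .npy file stores a single array in the form of binary data. A .npz file is an archive that can store multiple arrays, each accessed by name. The data in a .npz file is compressed only when the file is written with np.savez_compressed; np.savez writes the same archive format without compression.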
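The difference shows up in what np.load returns, as in this sketch with hypothetical file names and small sample arrays.

```python
import os
import tempfile

import numpy as np

a = np.arange(5)
b = np.eye(2)

with tempfile.TemporaryDirectory() as tmp:
    # .npy: exactly one array per file.
    npy_path = os.path.join(tmp, "a.npy")
    np.save(npy_path, a)
    loaded_a = np.load(npy_path)  # comes back directly as an array

    # .npz: several arrays in one file, each stored under a name.
    npz_path = os.path.join(tmp, "both.npz")
    np.savez(npz_path, a=a, b=b)
    with np.load(npz_path) as archive:
        keys = sorted(archive.files)
        loaded_b = archive["b"]

print(type(loaded_a).__name__)  # ndarray
print(keys)                     # ['a', 'b']
```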
Take this assessment to see how well you understood this unit.