CS250 Study Guide

Unit 3: The numpy Module

3a. Implement instructions to create numpy arrays 

  • What is numpy?
  • How is numpy used?
  • What is the syntax for indexing an element within a numpy array?
  • What are some typical instructions applied when using numpy?

The numpy module is fundamental for applying numerical and linear algebra techniques to arrays of data. Many of the most used modules for data science and machine learning, such as scipy, scikit-learn, pandas, and statsmodels, are built on top of numpy arrays. Mathematical entities from linear algebra, such as vectors and matrices, can be represented using numpy arrays. In computer science, an array is simply a data structure for housing elements of the same data type. In the case of numpy arrays, since the intent is usually a numerical one, the data type of array elements is most often int, float, or bool (although string arrays are possible).
 
In numpy, a vector is represented using a one-dimensional array (where only one index is used to refer to array elements). A matrix is represented using a two-dimensional array with two indices, and so on for higher dimensions. The syntax for applying indices to multi-index arrays is the same as that of nested lists; however, a more convenient form is usually used to minimize the use of square brackets. For example, if a four-dimensional array is initialized as a = np.random.normal(size=(3,4,2,5)), it is equally valid to index an array element using either a[2][3][0][4] or a[2,3,0,4].
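
The two indexing forms can be compared directly. The following is a minimal sketch that reuses the array a from the text; the names x_nested, x_comma, and same_element are introduced here only for illustration.

```python
import numpy as np

# A four-dimensional array with shape (3, 4, 2, 5), as in the text above.
a = np.random.normal(size=(3, 4, 2, 5))

# Chained brackets index one axis at a time;
# the comma form indexes all four axes in a single step.
x_nested = a[2][3][0][4]
x_comma = a[2, 3, 0, 4]

# Both forms pick out the same scalar element.
same_element = x_nested == x_comma
print(a.ndim, a.shape, same_element)
```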

Basic operations in numpy include the array function for creating an array, the shape attribute for determining the shape of an array, the sum method for summing the values contained within an array, and the max and min methods for computing the maximum and minimum elements. The ones function can be used to create an array of ones; the zeros function will create an array of zeros. The eye function can be used to create an identity matrix. A sequence of numbers can be generated in an array using the arange function, and regularly spaced sequences of numbers between particular values can be generated using the linspace function. It is important to understand the difference between these two: arange is given a step size and excludes the stop value, whereas linspace is given the number of points and includes the stop value by default. Finally, you should have a basic understanding of generating an array of random integers using the randint function from the random submodule within numpy.
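
As a quick illustration of the operations listed above, here is a short sketch. The variable names and the particular values are chosen only for this example.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])   # build an array from a nested list
print(m.shape)                          # (2, 3)
print(m.sum(), m.max(), m.min())        # 21 6 1

ones = np.ones((2, 3))                  # 2x3 array filled with 1.0
zeros = np.zeros(4)                     # length-4 vector filled with 0.0
ident = np.eye(3)                       # 3x3 identity matrix

# arange takes a step and excludes the stop value.
r = np.arange(0, 10, 2)                 # [0 2 4 6 8]
# linspace takes a point count and includes the stop value by default.
s = np.linspace(0.0, 1.0, 5)            # [0.   0.25 0.5  0.75 1.  ]

# Random integers from 0 (inclusive) to 10 (exclusive) in a 2x3 array.
rand = np.random.randint(0, 10, size=(2, 3))
```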

To review, see numpy for Numerical and Scientific Computing.

 

3b. Execute instructions to index arrays using slicing 

  • What is slicing?
  • What is the syntax for slicing in numpy?
  • What are some typical shortcuts when applying slicing?

Slicing is an indexing technique for extracting a range of elements at once from an ordered container, such as a list or a tuple. The syntax for slicing extends to numpy arrays with multiple indices. Slicing a numpy vector (that is, a one-dimensional array) uses the same syntax as slicing a Python list. Three values must be specified or implied: the start index, the stop index, and the step, written b[start:stop:step], where b is a numpy vector. For a multidimensional array, this syntax can be applied independently at each index position.
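
A small sketch of the start:stop:step syntax, first on a vector and then on a two-dimensional array. The arrays and names here are illustrative and do not come from the course material.

```python
import numpy as np

b = np.arange(10)                  # [0 1 2 3 4 5 6 7 8 9]
every_other = b[1:8:2]             # start=1, stop=8 (excluded), step=2 -> [1 3 5 7]

m = np.arange(12).reshape(3, 4)    # rows: [0 1 2 3], [4 5 6 7], [8 9 10 11]
block = m[0:2, 1:3]                # rows 0-1 and columns 1-2 -> [[1 2], [5 6]]
print(every_other, block.shape)
```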

When the start is omitted, the slice begins at index 0. When the stop is omitted, the slice runs through the last element. When the step is omitted, a step of one is assumed. You should also be comfortable with using negative indices or a negative step. For example, d = c[::-1] yields all the elements of c in reverse order. This is because the negative step counts backward, which also reverses the defaults: with the start omitted, the slice begins at the last element, and with the stop omitted, it runs back through index 0. Finally, always remember that the stop itself is excluded: the last element included in a slice with a positive step has an index one less than the stop. For example, consider a 2-d array a with at least 2 rows and 6 columns. A command such as b = a[0:1,4:5] will slice out a 1×1 array that is inherently two-dimensional (that is, two indices are required to refer to the element). You should test the similarities and differences between b, b[0] and b[0][0] to internalize this example.
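
The shortcuts and the 1×1 example can be tried out as follows. This is a sketch that assumes a small 2×6 array built with arange; any array with enough rows and columns would behave the same way.

```python
import numpy as np

c = np.arange(5)                   # [0 1 2 3 4]
head = c[:3]                       # start omitted -> [0 1 2]
tail = c[3:]                       # stop omitted  -> [3 4]
last_two = c[-2:]                  # negative index counts from the end -> [3 4]
d = c[::-1]                        # negative step -> [4 3 2 1 0]

a = np.arange(12).reshape(2, 6)    # 2 rows, 6 columns
b = a[0:1, 4:5]                    # 1x1 slice that is still two-dimensional
print(b, b.shape)                  # [[4]] (1, 1)
print(b[0], b[0].shape)            # [4] (1,)
print(b[0][0])                     # 4, a single scalar element
```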

To review, see Advanced Indexing with numpy.

 

3c. Demonstrate computation and visualization using array operations 

  • What is vectorized programming?
  • What is broadcasting?
  • How does matplotlib work with numpy arrays?

The numpy module has been designed to work with vectors and matrices. Most programmers start by writing scalar code, in which each operation handles a single value at a time. For example, the random module is designed to generate scalar random numbers, and, using only Python built-in data structures, we could use a loop to fill a list to mimic a vector of random numbers. The numpy module is designed for vectorized programming, where single commands can be used to generate and operate on a vector or a matrix of data.
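
The contrast between the two styles can be sketched as follows. The vector length n and the scaling expression are arbitrary choices made for this illustration.

```python
import random
import numpy as np

n = 5

# Scalar style: call random.random() once per element and collect the results.
values_list = []
for _ in range(n):
    values_list.append(random.random())

# Vectorized style: a single call produces the whole vector.
values_vec = np.random.random(n)

# A single expression operates on every element; no explicit loop is needed.
scaled = 2.0 * values_vec + 1.0
print(len(values_list), values_vec.shape, scaled.shape)
```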

To correctly apply the vectorized methodology in numpy, it is important to understand how broadcasting is used to accomplish "element-wise" computation for arithmetic operators such as +, -, *, and /. For example, to perform * between a 3×1 matrix and a 3×5 matrix, the 3×1 matrix will be broadcast 5 times along the column dimension and will be multiplied elementwise by each column in the 3×5 matrix. Notice that element-wise multiplication using * is different from matrix multiplication in the linear algebraic sense. To accomplish matrix multiplication, either the @ operator or the dot method must be applied to the numpy arrays.
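
The 3×1 by 3×5 case described above can be checked with a short sketch. The array contents are made up for illustration, and the matrix product uses the transpose of the 3×5 matrix so that the inner dimensions agree.

```python
import numpy as np

col = np.array([[1], [2], [3]])    # shape (3, 1)
mat = np.ones((3, 5))              # shape (3, 5)

# Element-wise product: col is broadcast across the 5 columns of mat.
elementwise = col * mat            # shape (3, 5); row i is filled with col[i]

# Matrix product: (5, 3) @ (3, 1) -> (5, 1).
product = mat.T @ col              # every entry is 1 + 2 + 3 = 6
same = mat.T.dot(col)              # the dot method gives the same result
print(elementwise.shape, product.shape)
```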

When it comes to plotting data, the matplotlib module has been designed to work with numpy data. The ability of matplotlib to plot Python list data is a special case made available for convenience. The typical use case and syntax for matplotlib plotting methods are meant for numpy arrays.
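
A minimal plotting sketch, assuming matplotlib is installed. The sine curve and the axis labels are arbitrary, and the non-interactive "Agg" backend is selected only so that the example runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # assumption: no display is available
import matplotlib.pyplot as plt

x = np.linspace(0.0, 2.0 * np.pi, 100)   # 100 evenly spaced sample points
y = np.sin(x)                             # vectorized: sine of every element

fig, ax = plt.subplots()
line, = ax.plot(x, y)                     # plot accepts numpy arrays directly
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
```

To see the figure, one would typically call fig.savefig with a file name of one's choosing, or plt.show() in an interactive session with the default backend.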

To review, see numpy Arrays and Vectorized Programming, Mathematical Operations with numpy, and numpy with matplotlib.

 

3d. Explain instructions to load and save data using numpy file formats 

  • How is numpy file handling similar to basic Python file handling?
  • How is numpy file handling different from basic Python file handling?
  • What is a .npy file?
  • What is a .npz file?

In machine learning applications, it is important to understand how to save and load numpy array data. In such cases, arduous and time-consuming computations can lead to large parameter matrices that must be stored for later use. The simplest solution is to apply the loadtxt and savetxt functions, which load and save numpy data in text format. It is also possible to read and write numpy data in text format using basic Python file handling methods such as write, read, or readline. In fact, it is possible to use the numpy function loadtxt to read data from a text file that was generated using the write method.
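
Both text-based routes can be sketched as follows. The file names and the temporary directory are hypothetical choices made for this example, and the array w stands in for a parameter matrix.

```python
import os
import tempfile
import numpy as np

w = np.array([[1.5, 2.0], [3.25, 4.0]])
tmp = tempfile.mkdtemp()                      # scratch directory for the example

# Round trip using numpy's own text functions.
path1 = os.path.join(tmp, "weights.txt")      # hypothetical file name
np.savetxt(path1, w)
w_loaded = np.loadtxt(path1)

# Write with basic Python file handling, then read back with loadtxt.
path2 = os.path.join(tmp, "manual.txt")       # hypothetical file name
with open(path2, "w") as f:
    for row in w:
        f.write(" ".join(str(float(v)) for v in row) + "\n")
w_manual = np.loadtxt(path2)
```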

Text files can be large compared to their binary counterparts. Hence, the binary format is the preferred method for storing and loading numpy data. Furthermore, it is possible to compress the data to save storage space. The ability to store compressed array data and multiple arrays in one file differs from typical Python file handling. The .npy extension is used for the standard binary file format for saving a single numpy array. The .npz extension is used for the standard binary file format for saving multiple arrays to a single file. Accordingly, the numpy save function can be used to save a single array, the savez function can be used to save multiple arrays, and the savez_compressed function can be used to save multiple arrays in compressed format. The numpy load function can then be used to load either .npy or .npz formatted files; for a .npz file, it returns a dictionary-like object from which each array is retrieved by name.
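
The binary formats can be exercised with a short sketch. The file names, the temporary directory, and the keyword names first and second are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 4)
tmp = tempfile.mkdtemp()                      # scratch directory for the example

# Single array -> .npy
npy_path = os.path.join(tmp, "a.npy")
np.save(npy_path, a)
a_back = np.load(npy_path)

# Multiple arrays -> .npz (uncompressed and compressed)
npz_path = os.path.join(tmp, "both.npz")
np.savez(npz_path, first=a, second=b)
npz_small = os.path.join(tmp, "both_small.npz")
np.savez_compressed(npz_small, first=a, second=b)

# Loading a .npz file gives a dictionary-like object keyed by the saved names.
with np.load(npz_path) as data:
    names = sorted(data.files)                # ['first', 'second']
    a2 = data["first"]
    b2 = data["second"]

with np.load(npz_small) as data_small:
    b3 = data_small["second"]
```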

To review, see Storing Data in Files and ".npy" versus ".npz" Files.

 

Unit 3 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • arange
  • array
  • broadcasting
  • index
  • linspace
  • .npy
  • .npz
  • ones
  • randint
  • slicing
  • vectorized programming
  • zeros