numpy for Numerical and Scientific Computing

As you approach the end of this introduction to constructing arrays in numpy, read this section to familiarize yourself with various numpy methods used throughout the course. Specifically, the need for methods such as shape, size, linspace, reshape, eye, and zeros often arise when manipulating arrays.

Broadcasting in Python

Python deals with arrays in an interesting way, in terms of matching up dimensions of arrays for arithmetic operations. There are 3 rules:

  1. If two arrays differ in the number of dimensions, the shape of the smaller array is padded with 1s on its left side
  2. If the shape doesn't match in any dimension, the array with shape = 1 in that dimension is stretched to match the others' shape
  3. If in any dimension the sizes disagree and none of the sizes are 1, then an error is generated
A = rng.normal(0,1,(4,5))
 B = rng.normal(0,1,5)
A.shape
(4, 5)
B.shape
(5,)
A - B
array([[ 0.25410957,  1.89009891, -0.87300221, -0.17852271, -0.64735645],
       [ 0.72991872,  4.58821268, -0.03379553, -1.67352398,  1.43345867],
       [-1.51421293,  1.69993003,  1.81140727, -1.71622014,  0.52276992],
       [-1.30611819,  3.29767231,  0.91060221,  0.29490453,  1.24919619]])

B is 1-d, A is 2-d, so B's shape is made into (1,5) (added to the left). Then it is repeated into 4 rows to make its shape (4,5), then the operation is performed. This means that we subtract the first element of B from the first column of A, the second element of B from the second column of A, and so on.

You can be explicit about adding dimensions for broadcasting by using np.newaxis.

B[np.newaxis,:].shape
(1, 5)
B[:,np.newaxis].shape
(5, 1)


An example (optional, intermediate/advanced))

This can be very useful since these operations are faster than for loops. For example:

d = rng.random_sample((10,2))
 d
array([[0.9497432 , 0.22672332],
       [0.96319737, 0.61011348],
       [0.2542308 , 0.60550727],
       [0.64054935, 0.85273037],
       [0.19747218, 0.45957414],
       [0.41571736, 0.9902779 ],
       [0.33720945, 0.30637872],
       [0.15139082, 0.30126537],
       [0.72158605, 0.3560211 ],
       [0.86288412, 0.66608767]])

We want to find the Euclidean distance (the sum of squared differences) between the points defined by the rows. This should result in a 10x10 distance matrix

d.shape
(10, 2)
d[np.newaxis,:,:]
array([[[0.9497432 , 0.22672332],
        [0.96319737, 0.61011348],
        [0.2542308 , 0.60550727],
        [0.64054935, 0.85273037],
        [0.19747218, 0.45957414],
        [0.41571736, 0.9902779 ],
        [0.33720945, 0.30637872],
        [0.15139082, 0.30126537],
        [0.72158605, 0.3560211 ],
        [0.86288412, 0.66608767]]])

creates a 3-d array with the first dimension being of length 1

d[np.newaxis,:,:].shape
(1, 10, 2)
d[:, np.newaxis,:]
array([[[0.9497432 , 0.22672332]],

       [[0.96319737, 0.61011348]],

       [[0.2542308 , 0.60550727]],

       [[0.64054935, 0.85273037]],

       [[0.19747218, 0.45957414]],

       [[0.41571736, 0.9902779 ]],

       [[0.33720945, 0.30637872]],

       [[0.15139082, 0.30126537]],

       [[0.72158605, 0.3560211 ]],

       [[0.86288412, 0.66608767]]])

creates a 3-d array with the 2nd dimension being of length 1

d[:,np.newaxis,:].shape
(10, 1, 2)

Now for the trick, using broadcasting of arrays. These two arrays are incompatible without broadcasting, but with broadcasting, the right things get repeated to make things compatible

dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2)
dist_sq.shape
()
dist_sq
29.512804540441067

Whoops! we wanted a 10x10 matrix, not a scalar.

(d[:,np.newaxis,:] - d[np.newaxis,:,:]).shape
(10, 10, 2)

What we really want is the 10x10 distance matrix.

dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2, axis=2)

You can verify what is happening by creating D = d[:,np.newaxis,:]-d[np.newaxis,:,:] and then looking at D[:,:,0] and D[:,:,1]. These are the difference between each combination in the first and second columns of d, respectively. Squaring and summing along the 3rd axis then gives the sum of squared differences.

dist_sq
array([[0.        , 0.14716903, 0.62721478, 0.48748566, 0.6201312 ,
        0.8681992 , 0.38154258, 0.64292305, 0.0687736 , 0.20058553],
       [0.14716903, 0.        , 0.5026548 , 0.16296469, 0.60899716,
        0.44425934, 0.48411568, 0.75441703, 0.12293897, 0.01319586],
       [0.62721478, 0.5026548 , 0.        , 0.21036128, 0.02451802,
        0.17412634, 0.09636334, 0.1031392 , 0.28066428, 0.37412884],
       [0.48748566, 0.16296469, 0.21036128, 0.        , 0.35088921,
        0.06946875, 0.39051522, 0.54338972, 0.25328705, 0.08426824],
       [0.6201312 , 0.60899716, 0.02451802, 0.35088921, 0.        ,
        0.32927744, 0.04299534, 0.02718516, 0.28541859, 0.48542089],
       [0.8681992 , 0.44425934, 0.17412634, 0.06946875, 0.32927744,
        0.        , 0.47388157, 0.54460678, 0.49583735, 0.30505741],
       [0.38154258, 0.48411568, 0.09636334, 0.39051522, 0.04299534,
        0.47388157, 0.        , 0.03455471, 0.15020974, 0.40572438],
       [0.64292305, 0.75441703, 0.1031392 , 0.54338972, 0.02718516,
        0.54460678, 0.03455471, 0.        , 0.3281208 , 0.63931803],
       [0.0687736 , 0.12293897, 0.28066428, 0.25328705, 0.28541859,
        0.49583735, 0.15020974, 0.3281208 , 0.        , 0.11610642],
       [0.20058553, 0.01319586, 0.37412884, 0.08426824, 0.48542089,
        0.30505741, 0.40572438, 0.63931803, 0.11610642, 0.        ]])
dist_sq.shape
(10, 10)
dist_sq.diagonal()
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0