".npy" versus ".npz" Files

Here is a brief comparison of .npy and .npz files. A .npy file stores a single array as binary data. A .npz file can store multiple arrays and can optionally compress the data.

What is the advantage of saving '.npz' files instead of '.npy' in Python regarding speed, memory, and look-up?


There are two parts to the answer to your question.

I. NPY vs. NPZ
As we can already read in the documentation, the .npy format is the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk. ... The format is designed to be as simple as possible while achieving its limited goals. (source)

And .npz is simply a way to combine multiple arrays into a single file: one can use ZipFile to contain multiple ".npy" files, and the documentation recommends using the file extension ".npz" for these archives. (source)

So, .npz is just a ZipFile containing multiple ".npy" files. And this ZipFile can be either compressed (by using np.savez_compressed) or uncompressed (by using np.savez).

It's similar to a tarball in a Unix-like system: a tarball can be just an uncompressed archive containing other files, or it can be compressed by combining the archive with one of various compression programs (gzip, bzip2, etc.).
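To illustrate the point, here is a small sketch (not from the original answer) that saves two arrays and then opens the resulting .npz with Python's standard zipfile module; the array names a and b are arbitrary:

import io
import zipfile
import numpy as np

# Write two small arrays into an uncompressed .npz archive held in memory.
buf = io.BytesIO()
np.savez(buf, a=np.arange(5), b=np.ones((2, 3)))
buf.seek(0)

# The .npz is a plain ZIP archive: each array is its own ".npy" member.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['a.npy', 'b.npy']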

II. Different APIs for binary serialization
NumPy also provides different APIs to produce these binary file outputs (a short usage sketch follows the list):

  • np.save --> Save an array to a binary file in NumPy .npy format
  • np.savez --> Save several arrays into a single file in uncompressed .npz format
  • np.savez_compressed --> Save several arrays into a single file in compressed .npz format
  • np.load --> Load arrays or pickled objects from .npy, .npz or pickled files
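As a quick sketch of how these four calls are typically used (the file names here are made up for illustration):

import numpy as np

a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)

np.save('a.npy', a)                             # one array   -> .npy
np.savez('arrays.npz', a=a, b=b)                # many arrays -> uncompressed .npz
np.savez_compressed('arrays_c.npz', a=a, b=b)   # many arrays -> compressed .npz

a_loaded = np.load('a.npy')                     # returns the array directly
with np.load('arrays.npz') as data:             # returns a dict-like NpzFile
    print(data['a'], data['b'])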

If we skim NumPy's source code under the hood, we find:

def _savez(file, args, kwds, compress, allow_pickle=True, pickle_kwargs=None):
    ...
    if compress:
        # ZIP_DEFLATED: the .npy members are compressed inside the archive
        compression = zipfile.ZIP_DEFLATED
    else:
        # ZIP_STORED: the .npy members are stored uncompressed
        compression = zipfile.ZIP_STORED
    ...


def savez(file, *args, **kwds):
    _savez(file, args, kwds, False)


def savez_compressed(file, *args, **kwds):
    _savez(file, args, kwds, True)

Then back to the question:

If we only use np.savez, there is no compression on top of the .npy format; we get just a single archive file for the convenience of managing multiple related files.

If we use np.savez_compressed, then, of course, we use less space on disk, because more CPU time is spent doing the compression work (i.e., saving and loading are a bit slower).
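A rough way to see this trade-off for yourself (the sizes and timings depend entirely on the data; this zero-filled array is an extreme, highly compressible case):

import os
import numpy as np

arr = np.zeros((1000, 1000))  # 8 MB of zeros: very compressible

np.savez('uncompressed.npz', arr=arr)
np.savez_compressed('compressed.npz', arr=arr)

print(os.path.getsize('uncompressed.npz'))  # roughly the raw 8 MB
print(os.path.getsize('compressed.npz'))    # far smaller for data like this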

The main advantage of .npz files is that the arrays are lazy-loaded. That is, if you have a .npz file with 100 arrays, you can open the file without actually loading any of the data. If you request a single array, only the data for that array is loaded.
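A small sketch of that lazy behaviour (the file name and array names are made up):

import numpy as np

# Save 100 arrays under the names arr_0 .. arr_99.
np.savez('many.npz', **{f'arr_{i}': np.arange(1000) for i in range(100)})

data = np.load('many.npz')   # opens the archive; no array data is read yet
print(data.files[:5])        # just the member names
x = data['arr_42']           # only this one array is read from disk now
data.close()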

A downside to .npz files is that they can't be memory mapped (using np.load(<file>, mmap_mode='r')), so for large arrays they may not be the best choice. For data where the arrays have a common shape, I'd suggest taking a look at structured arrays. These can be memory mapped, allow accessing data with a dict-like syntax (i.e., arr['field']), and are very memory-efficient.
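For comparison, here is a sketch of memory-mapping a plain .npy file that holds a structured array (the field names and file name are made up):

import numpy as np

# A structured array: every record shares the same compound dtype.
dt = np.dtype([('x', 'f8'), ('y', 'f8'), ('label', 'i4')])
records = np.zeros(1000, dtype=dt)
np.save('records.npy', records)

# Memory-map the file: data is paged in lazily as it is accessed.
mm = np.load('records.npy', mmap_mode='r')
print(mm['x'][:10])  # dict-like field access without loading the whole array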


Source: newbedev.com, https://newbedev.com/python-npz-numpy-save-code-example
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
