Faster analysis of large datasets in Python

Have you ever run into a memory error, or found that your function takes too long to run? Here are a few tips on how to tackle these issues.

In meteorology we often have to analyse large datasets, which can be time consuming and/or lead to memory errors. While the netCDF4, numpy and pandas packages in Python provide great tools for our data analysis, there are other packages we can use that parallelize our code: joblib, xarray and dask (see the links for documentation and the references below for further reading). Parallelization means that the input data are split between the different cores of the computer and our analysis of the different pieces runs in parallel, rather than one after the other, speeding up the process. At the end the results are collected and returned to us in the same form as before, only faster. One of the basic ideas behind this parallelization is the ‘divide and conquer’ algorithm [Fig. 1] (see, e.g., Cormen et al. 2009, or Wikipedia for a brief introduction), which breaks a calculation into simpler subproblems that can be solved independently (and in parallel) before their results are merged.

Figure 1: A simple example of the ‘divide and conquer’ algorithm for sorting a list of numbers. First the list is split into simpler subproblems, which are then solved (sorted) and merged into a final sorted array.
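
To make Figure 1 concrete, here is a minimal, serial sketch of the merge-sort variant of divide and conquer in plain Python (in a parallel setting, the two recursive calls could be handled by different cores):

    def merge_sort(values):
        # divide: a single element is already sorted
        if len(values) <= 1:
            return values
        mid = len(values) // 2
        left = merge_sort(values[:mid])    # conquer each half recursively
        right = merge_sort(values[mid:])   # (independent: could run in parallel)

        # combine: merge the two sorted halves into one sorted list
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    print(merge_sort([38, 27, 43, 3, 9, 82, 10]))  # [3, 9, 10, 27, 38, 43, 82]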

The simplest module we can use is joblib. It is easily applied to for-loops (see the sketch below): e.g. an operation that needs to be executed 1000 times can be split between the 40 cores your computer has, making the calculation that much faster. Note that Python modules often include optimized routines that let us avoid for-loops entirely, which is usually the faster option.
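
As a minimal sketch (the function analyse and the choice of 4 workers are illustrative, not part of the joblib API):

    from joblib import Parallel, delayed

    def analyse(i):
        # stand-in for the work done in one iteration of the original for-loop
        return i ** 2

    # serial version:   results = [analyse(i) for i in range(1000)]
    # parallel version: the 1000 calls are shared between 4 worker processes
    results = Parallel(n_jobs=4)(delayed(analyse)(i) for i in range(1000))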

The xarray module provides tools for opening and saving netCDF-type (though not only those) datasets, which can then be analysed either as numpy arrays or as dask arrays. If we choose the dask arrays (also available via the dask module), any command we apply to the array is automatically calculated in parallel via a type of ‘divide and conquer’ algorithm (see the sketch below). Note that this on its own does not help us avoid a memory error, as the data eventually have to be loaded into memory (though looping over chunks of these xarray/dask arrays can speed up the calculation). There are often also options to run your analysis on high-memory nodes, and the larger the dataset, the more time we save through parallelization.
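
A minimal sketch of this lazy, chunked evaluation (the array size and chunk shape are arbitrary):

    import dask.array as da

    # a 100000 x 1000 random array, split into ten 10000 x 1000 chunks
    x = da.random.random((100000, 1000), chunks=(10000, 1000))

    # this only builds a task graph; nothing has been computed or loaded yet
    m = x.mean(axis=0)

    # compute() executes the graph, processing the chunks in parallel,
    # and returns an ordinary numpy array
    result = m.compute()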

In the end it really depends on how much time you are willing to spend on learning about these arrays and whether it is worth the extra effort – I had to use them as they resolved my memory issues and sped up the code. It is certainly worth keeping this option in mind!

Getting started with xarray/dask

In the terminal window:

  • Use a system with conda installed (e.g. Anaconda)
  • To start a bash shell type: bash
  • Create a new Python environment (e.g. ‘my_env’) locally, so you can install custom packages. Give it a list of packages:
    • conda create -n my_env xarray
  • Then activate the new Python environment (make sure that you are in ‘my_env’ whenever you use xarray):
    • source activate my_env
  • If you need any other packages, you can add them later (via conda install), or list them together with xarray when you create the environment:
    • conda install scipy pandas numpy dask matplotlib joblib #etc.
  • If the following environment variables are not unset, you need to unset them (check this with the command conda info -a):
    • unset PYTHONPATH PYTHONHOME LD_LIBRARY_PATH
  • In Python you can then simply import the xarray, numpy or dask modules:
    • import xarray as xr; import dask.array as da; import numpy as np; from joblib import Parallel, delayed; # etc.
  • Now you can easily open datasets [e.g. dataset = xr.open_dataset() for one file or dataset = xr.open_mfdataset() for multiple files; similarly, dataset.to_netcdf() saves to one netCDF file and xr.save_mfdataset() saves to multiple netCDF files] and manipulate them using the dask and xarray modules (see the first sketch after this list); documentation for these can be found in the links above and the references below.
  • Once you open a dataset, you can access the data either by loading it into memory (xarray data array: dataset.varname.values) and analysing it as before with the numpy package (which will not run in parallel); or through the dask array (xarray dask array: dataset.varname.data), which does not load the data into memory (it only builds the best possible path for executing the operations) until you save the data to a file, plot it, or explicitly compute it. Dask arrays can be analysed in much the same way as the well-known numpy arrays, but using the dask module instead [e.g. numpy.mean(array, axis=0) becomes dask.array.mean(dask_array, axis=0)]. Many functions exist in the xarray module as well, so you can apply them to the dataset itself rather than the array [e.g. dataset.mean(dim='time') is equivalent to the mean in dask or numpy]; the first sketch after this list shows these access patterns.
  • Caution: If you try to do too many operations on the array, the ‘divide and conquer’ task graph will become so complex that the program will not be able to manage it. It is therefore best to calculate everything step by step, using dask_array.compute() or dask_array.persist() (see the second sketch below). Another issue I find with these new array modules is that they are slow when it comes to saving the data to disk (i.e. no faster than the other modules).
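
Here is the first sketch mentioned above: a minimal end-to-end example, where the file names, the variable name t2m and the chunk size are hypothetical placeholders:

    import xarray as xr

    # open several netCDF files as one dataset; the chunks argument makes
    # the variables dask arrays instead of numpy arrays
    dataset = xr.open_mfdataset('t2m_*.nc', chunks={'time': 100})

    numpy_array = dataset.t2m.values   # loads everything into memory (numpy array)
    dask_array = dataset.t2m.data      # stays lazy (dask array), nothing loaded yet

    mean_np = numpy_array.mean(axis=0)        # computed immediately, in memory
    mean_da = dask_array.mean(axis=0)         # only a task graph so far
    mean_xr = dataset.t2m.mean(dim='time')    # the same mean, via xarray itself

    # saving (or plotting) finally triggers the parallel computation
    mean_xr.to_netcdf('t2m_mean.nc')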
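
And the second sketch: evaluating intermediate results step by step instead of building one huge task graph (the standardized-anomaly calculation is just an illustrative choice):

    # chaining everything lazily builds one large graph, e.g.:
    # anom = (dask_array - dask_array.mean(axis=0)) / dask_array.std(axis=0)

    # computing the intermediate results explicitly keeps each graph simple
    clim = dask_array.mean(axis=0).compute()    # compute() returns a numpy array
    spread = dask_array.std(axis=0).compute()

    # persist() evaluates the result but keeps it in memory as a dask array
    anom = ((dask_array - clim) / spread).persist()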

I would like to thank Shannon Mason and Peter Gabrovšek for their helpful advice and suggestions.

References

Cormen, T.H., C.E. Leiserson, R.L. Rivest, C. Stein, 2009: An introduction to algorithms. MIT press, third edition, 1312 pp.

Dask Development Team, 2016: Dask: Library for dynamic task scheduling. URL: http://dask.pydata.org

Hoyer, S., and J. Hamman, 2017: xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5, p. 10. DOI: http://doi.org/10.5334/jors.148

Hoyer, S., C. Fitzgerald, J. Hamman and others, 2016: xarray: v0.8.0. DOI: http://dx.doi.org/10.5281/zenodo.59499

Rocklin, M., 2016: Dask: Parallel computation with blocked algorithms and task scheduling. Proceedings of the 14th Python in Science Conference (K. Huff and J. Bergstra, eds.), 130–136.
