Faster analysis of large datasets in Python

Have you ever run into a memory error, or found that your code takes far too long to run? Here are a few tips on how to tackle these issues.

In meteorology we often have to analyse large datasets, which can be time consuming and/or lead to memory errors. While the netCDF4, numpy and pandas packages in Python provide great tools for data analysis, there are other packages that can parallelize our code: joblib, xarray and dask (see the links for documentation and the references below for further reading). With parallelization, the input data is split between the different cores of the computer, and the analysis of the different chunks runs in parallel rather than one piece after another, speeding up the process. At the end the results are collected and returned in the same form as before, just computed faster. One of the basic ideas behind parallelization is the ‘divide and conquer’ algorithm [Fig. 1] (see, e.g., Cormen et al. 2009, or Wikipedia for a brief introduction), which breaks a problem into subproblems that can be solved independently and then combines their results.

divide_and_conquer
Figure 1: A simple example of the ‘divide and conquer’ algorithm for sorting a list of numbers. First the list is split into simpler subproblems, which are then solved (sorted) and merged into a final sorted array. Source
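The sorting example in Fig. 1 can be sketched in a few lines of plain Python (a minimal merge sort, written out for illustration rather than taken from any of the packages above):

```python
def merge_sort(values):
    """Divide: split the list in half; conquer: sort each half; then merge."""
    if len(values) <= 1:
        return values
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Merge the two sorted halves into one sorted list
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Because the two halves are sorted independently, they could in principle be handed to different cores — that independence of subproblems is exactly what dask exploits when it splits an array into chunks.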

The simplest module we can use is joblib. It is easily applied to for-loops (see an example here): an operation that needs to be executed 1000 times can be split between, say, the 40 cores your machine has, making the calculation roughly that much faster. Note that many Python modules include optimized routines that avoid for-loops entirely, which is usually faster still.
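As a sketch of the joblib pattern (the function name `analyse` and `n_jobs=4` here are illustrative placeholders, not from any particular analysis):

```python
from joblib import Parallel, delayed

def analyse(x):
    # Stand-in for an expensive per-iteration calculation
    return x ** 2

# Split the 1000 iterations of the loop body across 4 workers;
# Parallel collects the results in the original order.
results = Parallel(n_jobs=4)(delayed(analyse)(i) for i in range(1000))
```

The pattern is a drop-in replacement for `results = [analyse(i) for i in range(1000)]`, which makes it a low-effort first step before moving to dask.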

The xarray module provides tools for opening and saving multiple netCDF-type datasets (though it is not limited to that format), which can then be analysed either as numpy arrays or as dask arrays. If we choose dask arrays (also available via the dask module), any command we apply to the array is evaluated in parallel automatically via a type of ‘divide and conquer’ algorithm. Note that this on its own does not help us avoid a memory error, as the data eventually has to be loaded into memory (though looping over these xarray/dask arrays in smaller pieces can speed up the calculation). There are often also options to run your analysis on high-memory nodes, and the larger the dataset, the more time we save through parallelization.
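The laziness of dask arrays can be seen in a small self-contained sketch (here the array is wrapped from memory purely for illustration; in practice it would come from a dataset opened with xarray):

```python
import numpy as np
import dask.array as da

# Wrap a numpy array as a dask array split into 10 chunks of 100,000 elements
data = da.from_array(np.arange(1_000_000, dtype=float), chunks=100_000)

lazy_mean = data.mean()       # nothing is computed yet: dask only builds a task graph
result = lazy_mean.compute()  # each chunk is reduced in parallel, then combined
```

Until `.compute()` is called, `lazy_mean` is just a recipe — which is why simply switching to dask arrays does not by itself reduce the memory needed once the result is finally evaluated.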

In the end it really depends on how much time you are willing to spend on learning about these arrays and whether it is worth the extra effort – I had to use them as they resolved my memory issues and sped up the code. It is certainly worth keeping this option in mind!

Getting started with xarray/dask

In the terminal window:

  • Use a system with conda installed (e.g. anaconda)
  • To start a bash shell type: bash
  • Create a new python environment (e.g. ‘my_env’) locally, so you can install custom packages. Give it a list of packages:
    • conda create -n my_env xarray
  • Then activate the new python environment (Make sure that you are in ‘my_env’ when using xarray):
    • source activate my_env
  • If you need any other packages, you can add them later (via conda install), or list them alongside xarray when you create the environment:
    • conda install scipy pandas numpy dask matplotlib joblib #etc.
  • If the following paths are not ‘unset’ then you need to unset them (check this with command: conda info -a):
    • unset PYTHONPATH PYTHONHOME LD_LIBRARY_PATH
  • In python you can then simply import xarray, numpy or dask modules:
    • import xarray as xr; import dask.array as da; import numpy as np; from joblib import Parallel, delayed; # etc.
  • Now you can easily import datasets [e.g.: dataset = xr.open_dataset() from one file or dataset = xr.open_mfdataset() from multiple files; similarly dataset.to_netcdf() to save to one netcdf file or xr.save_mfdataset() to save to multiple netcdf files] and manipulate them using dask and xarray modules – documentation for these can be found in the links above and references below.
  • Once you open a dataset, you can access the data either by loading it into memory (xarray data array: dataset.varname.values) and analysing it as before with numpy (which will not run in parallel); or through the dask array (xarray dask array: dataset.varname.data), which does not load the data into memory (it only builds the best possible path to executing the operation) until you wish to save the data to a file or plot it. Dask arrays can be analysed much like the familiar numpy arrays, but using the dask module instead [e.g. numpy.mean(array, axis=0) becomes dask.array.mean(dask_array, axis=0)]. Many functions also exist in the xarray module itself, meaning you can run them on the dataset rather than the array [e.g. dataset.mean(dim='time') is equivalent to the mean in dask or numpy].
  • Caution: if you chain too many operations on the array, the ‘divide and conquer’ task graph becomes so complex that the programme will not be able to manage it. Therefore, it is best to calculate things step-by-step, using dask_array.compute() or dask_array.persist(). Another issue I find with these array modules is that they are slow when it comes to saving the data to disk (i.e. no faster than other modules).
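The steps above can be put together in a minimal sketch. Here a small in-memory dataset stands in for one opened with xr.open_mfdataset(), and the variable and coordinate names (t2m, time, lat) are chosen purely for illustration:

```python
import numpy as np
import xarray as xr

# A small in-memory dataset standing in for one opened via xr.open_mfdataset();
# .chunk() turns its variables into dask arrays split along the time dimension.
ds = xr.Dataset(
    {"t2m": (("time", "lat"), np.arange(12.0).reshape(4, 3))},
    coords={"time": np.arange(4), "lat": [10.0, 20.0, 30.0]},
).chunk({"time": 2})

eager = ds.t2m.values                # numpy array: loaded into memory now
lazy = ds.t2m.data                   # dask array: nothing computed yet
time_mean = ds.t2m.mean(dim="time")  # still lazy
result = time_mean.compute()         # evaluate this step explicitly
```

Calling `.compute()` after each major step, as in the last line, keeps the task graph small — which is the step-by-step approach recommended above.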

I would like to thank Shannon Mason and Peter Gabrovšek for their helpful advice and suggestions.

References

Cormen, T.H., C.E. Leiserson, R.L. Rivest, and C. Stein, 2009: Introduction to Algorithms. MIT Press, third edition, 1312 pp.

Dask Development Team, 2016: Dask: Library for dynamic task scheduling. URL: http://dask.pydata.org

Hoyer, S., and J. Hamman, 2017: xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5, p. 10. DOI: http://doi.org/10.5334/jors.148

Hoyer, S., C. Fitzgerald, J. Hamman, and others, 2016: xarray: v0.8.0. DOI: http://dx.doi.org/10.5281/zenodo.59499

Rocklin, M., 2016: Dask: Parallel computation with blocked algorithms and task scheduling. Proceedings of the 14th Python in Science Conference (K. Huff and J. Bergstra, eds.), 130–136.

Managing your supervisor

Written by: h.l.gough@pgr.reading.ac.uk

You’re going to be working with your supervisor for a while. Supervisors, like projects, are all unique and have their own ways of working. Lots of us have banded together to give tips and advice on how to ‘manage your supervisor’, by which we mean make the road towards a PhD a little bit easier.

For those of you looking to start a PhD, getting the right match between you and a supervisor is key. PhDs are already stressful enough without a strained supervisor-student relationship.

Know how they work

Supervisors all work differently. Some will leave you to wander for a bit before drawing you back to the point, and others will provide a map of where you’re going. There’s no right or wrong but sometimes their methods can get frustrating when you start comparing supervisors.

Find out the best way to contact them. Some never reply to email and others are never in the office or dislike being disturbed. Figure out between you and your supervisor the best way of getting in contact.

Personal and work balance

Some supervisors are happy to talk about personal problems. Others aren’t. Again, neither option is right or wrong, but it’s something you have to be aware of.

Ask for things

A PhD is intended as a personal development training programme, not just a route to writing a thesis and publishing papers. Don’t be afraid to ask to do something different, such as Environment YES, internships, fieldwork and summer schools, to name a few.

If you don’t ask, you don’t get.

Manage expectations

Saying yes to all the work they give you is only going to lead to disappointment for them and you. Be honest with the amount of work you can do, and say when you’re having a bad week. They’ll understand. Say when you’ve got enough on your plate already.

Know how long they take to read things, otherwise you’ll end up disappointed when the feedback you expected on a certain day doesn’t arrive.

Don’t expect them to be on email 24/7. Likewise, let them know that you’re not going to be checking emails at 3am either.

Know their style and expertise

Some come across as more critical than others; some highlight the good as well as the bad. Their background may make them biased on certain topics, and knowing their expertise allows you to tailor your questions to them.

This is a lot more relevant to people with multiple supervisors, as often you can get two conflicting opinions and have no idea which one to accept. This happens, and it does teach you some diplomacy skills, but don’t go picking sides.

Get advice from other students

Chances are, other students will be supervised by your supervisor too. Ask them for hints and tips on how they work, and about pitfalls to avoid. They might even have a manual on how to deal with them! There is real camaraderie between people who share the same supervisor!

If you’re still stuck and doing a PhD at the University of Reading, the Graduate School runs an RRDP course called ‘Managing your supervisor’. Definitely worth going to.