Faster analysis of large datasets in Python

Have you ever run into a memory error or thought your function is taking too long to run? Here are a few tips on how to tackle these issues.

In meteorology we often have to analyse large datasets, which can be time-consuming and/or lead to memory errors. While the netCDF4, numpy and pandas packages in Python provide great tools for our data analysis, there are other packages we can use to parallelize our code: joblib, xarray and dask (view links for documentation and references for further reading). Parallelizing means that the input data is split between the different cores of the computer and our analysis of the different pieces of data runs in parallel, rather than one after the other, which speeds up the process. At the end the results are collected and returned to us in the same form as before, only faster. One of the basic ideas behind parallelization is the ‘divide and conquer’ algorithm [Fig. 1] (see, e.g., Cormen et al. 2009, or Wikipedia for a brief introduction), which splits a problem into smaller subproblems, solves each of them and combines the partial results into the final answer.

Figure 1: A simple example of the ‘divide and conquer’ algorithm for sorting a list of numbers. First the list is split into simpler subproblems, which are then solved (sorted) and merged into a final sorted array. Source
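
The sorting example in Figure 1 can be sketched in a few lines of plain Python. This is only a minimal illustration of the ‘divide and conquer’ idea (a merge sort), not the code that joblib, xarray or dask use internally; the function names and the list of numbers are made up for the example.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One of the two lists is exhausted; append what is left of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(values):
    """Sort a list by splitting it in two, sorting each half and merging."""
    if len(values) <= 1:
        # A list with zero or one element is already sorted.
        return list(values)
    middle = len(values) // 2
    left = merge_sort(values[:middle])    # divide: solve the first half
    right = merge_sort(values[middle:])   # divide: solve the second half
    return merge(left, right)             # conquer: combine the results


# Hypothetical input list, chosen only for illustration.
numbers = [38, 27, 43, 3, 9, 82, 10]
sorted_numbers = merge_sort(numbers)
print(sorted_numbers)
```

The two recursive calls do not depend on each other, which is what makes this kind of algorithm a natural fit for running on several cores at once.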

The simplest module we can use is joblib. It is easy to apply to for-loops (see an example here): e.g. an operation that needs to be executed 1000 times can be split between the 40 cores of your computer, so that each core only handles a fraction of the iterations. Note that Python modules often include optimized routines that let us avoid for-loops entirely, which is usually the faster option.
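
As a rough sketch of what this looks like in practice, the snippet below runs the same function 1000 times, first in an ordinary loop and then with joblib. The function slow_square is a hypothetical stand-in for an expensive calculation, and n_jobs=2 is an arbitrary choice for the example (n_jobs=-1 asks joblib to use all available cores).

```python
from joblib import Parallel, delayed


def slow_square(x):
    """Hypothetical stand-in for an expensive per-item calculation."""
    return x ** 2


inputs = range(1000)

# Serial version: the calls run one after another.
serial_results = [slow_square(x) for x in inputs]

# Parallel version: joblib distributes the calls over 2 workers and
# returns the results in the same order as the inputs.
parallel_results = Parallel(n_jobs=2)(delayed(slow_square)(x) for x in inputs)

print(parallel_results[:5])
```

For a function as cheap as this one the overhead of starting the workers will likely outweigh any gain; the approach pays off when each call takes a noticeable amount of time.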

The xarray module provides tools for opening and saving multiple datasets in netCDF (and other) formats, which can then be analysed either as numpy arrays or as dask arrays. If we choose to use dask arrays (also available via the dask module), any operation we apply to the array will be calculated in parallel automatically via a type of ‘divide and conquer’ algorithm. Note that this on its own does not help us avoid a memory error, as the data eventually has to be loaded into memory (using a for-loop on these xarray/dask arrays can potentially speed up the calculation). There are often also options to run your analysis on high-memory nodes, and the larger the dataset, the more time we save through parallelization.
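
To make the idea of a dask array more concrete, here is a minimal sketch that uses made-up data instead of a real netCDF file. The array shape and the chunk size are assumptions chosen only for illustration.

```python
import numpy as np
import dask.array as da

# Hypothetical data: 1000 time steps on a 50 x 50 grid.
numpy_array = np.arange(1000 * 50 * 50, dtype="float64").reshape(1000, 50, 50)

# Wrap it in a dask array that is split into chunks of 100 time steps.
dask_array = da.from_array(numpy_array, chunks=(100, 50, 50))

# This line only describes the calculation; nothing is computed yet.
lazy_mean = dask_array.mean(axis=0)

# compute() carries out the calculation chunk by chunk (in parallel where
# possible) and returns an ordinary numpy array.
time_mean = lazy_mean.compute()

print(time_mean.shape)
```

The result is the same as numpy_array.mean(axis=0); the difference is that dask works through the chunks separately and only combines them at the end.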

In the end it really depends on how much time you are willing to spend on learning about these arrays and whether it is worth the extra effort – I had to use them as they resolved my memory issues and sped up the code. It is certainly worth keeping this option in mind!

Getting started with xarray/dask

In the terminal window:

  • Use a system with conda installed (e.g. anaconda)
  • To start a bash shell type: bash
  • Create a new python environment (e.g. ‘my_env’) locally, so you can install custom packages. Give it a list of packages:
    • conda create -n my_env xarray
  • Then activate the new python environment (make sure that you are in ‘my_env’ when using xarray):
    • source activate my_env (on conda 4.4 and later: conda activate my_env)
  • If you need any other packages, you can add them later (via conda install), or you could list them together with xarray when you create the environment:
    • conda install scipy pandas numpy dask matplotlib joblib #etc.
  • If the following paths are not ‘unset’ then you need to unset them (check this with command: conda info -a):
    • unset PYTHONPATH PYTHONHOME LD_LIBRARY_PATH
  • In python you can then simply import xarray, numpy or dask modules:
    • import xarray as xr; import dask.array as da; import numpy as np; from joblib import Parallel, delayed; # etc.
  • Now you can easily import datasets [e.g. dataset = xr.open_dataset() from one file or dataset = xr.open_mfdataset() from multiple files; similarly dataset.to_netcdf() to save to one netCDF file or xr.save_mfdataset() to save to multiple netCDF files] and manipulate them using the dask and xarray modules. Documentation for these can be found in the links above and references below.
  • Once you open a dataset, you can access the data in two ways. You can load it into memory (xarray data array: dataset.varname.values) and analyse it further as before using the numpy package (which will not run in parallel). Alternatively, you can access it through the dask array (xarray dask array: dataset.varname.data, provided the dataset was opened in chunks, as xr.open_mfdataset() does by default), which will not load the data into memory (it will only work out a path for executing the operation) until you wish to save the data to a file or plot it. The latter can be analysed in a similar way to the well-known numpy arrays, but using the dask module instead [e.g. numpy.mean(array, axis=0) in dask becomes dask.array.mean(dask_array, axis=0)]. Many functions exist in the xarray module as well, meaning you can run them on the dataset itself rather than the array [e.g. dataset.mean(dim='time') is equivalent to the mean in dask or numpy].
  • Caution: If you try to do too many operations on the array, the ‘divide and conquer’ algorithm will become so complex that the programme will not be able to manage it. Therefore, it is best to calculate everything step-by-step, by using dask_array.compute() or dask_array.persist(). Another issue I find with these new array modules is that they are slow when it comes to saving the data to disk (i.e. not any faster than other modules).
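
The last few steps above could look roughly like the sketch below. It builds a small made-up dataset in memory rather than opening real files, so the variable name temperature, the dimension names and the chunk size are all assumptions for the sake of the example; with real data you would start from xr.open_dataset() or xr.open_mfdataset() instead.

```python
import numpy as np
import dask.array as da
import xarray as xr

# Hypothetical dataset standing in for the result of xr.open_mfdataset().
rng = np.random.default_rng(0)
dataset = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), rng.random((120, 10, 20)))},
    coords={"time": np.arange(120), "lat": np.arange(10), "lon": np.arange(20)},
)

# Splitting the dataset into chunks turns its variables into dask arrays.
chunked = dataset.chunk({"time": 30})

dask_array = chunked.temperature.data      # dask array, evaluated lazily
numpy_array = chunked.temperature.values   # numpy array, loaded into memory

# The same mean, once on the dask array and once on the dataset itself.
dask_mean = da.mean(dask_array, axis=0)
dataset_mean = chunked.mean(dim="time")

# Step-by-step calculation: persist() evaluates the intermediate result
# so that later operations start from it rather than from the raw data.
anomaly = (chunked.temperature - chunked.temperature.mean(dim="time")).persist()
anomaly_std = anomaly.std(dim="time").compute()

print(anomaly_std.shape)
```

Here persist() keeps the intermediate result as a (now evaluated) dask array, while compute() returns the final result with plain numpy data.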

I would like to thank Shannon Mason and Peter Gabrovšek for their helpful advice and suggestions.

References

Cormen, T.H., C.E. Leiserson, R.L. Rivest, and C. Stein, 2009: Introduction to Algorithms. MIT Press, third edition, 1312 pp.

Dask Development Team, 2016: Dask: Library for dynamic task scheduling. URL: http://dask.pydata.org

Hoyer, S., and J. Hamman, 2017: xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software, 5, p. 10. DOI: http://doi.org/10.5334/jors.148

Hoyer, S., C. Fitzgerald, J. Hamman and others, 2016: xarray: v0.8.0. DOI: http://dx.doi.org/10.5281/zenodo.59499

Rocklin, M., 2016: Dask: Parallel Computation with Blocked algorithms and Task Scheduling. Proceedings of the 14th Python in Science Conference (K. Huff and J. Bergstra, eds.), 130 – 136.

Evidence week, or why I chatted to politicians about evidence.

Email: a.w.bateson@pgr.reading.ac.uk

Twitter: @a_w_bateson

On a sunny Tuesday morning at 8.30 am I found myself passing through security to enter the Palace of Westminster. The home of the MPs and peers is not obvious territory for a PhD student. However, I was here as a Voice of Young Science (VoYS) volunteer for the Sense about Science Evidence Week. Sense about Science is an independent charity that aims to scrutinise the use of evidence in the public domain and to challenge misleading or misrepresented science. I have written previously here about attending one of their workshops about peer review, and also here about contributing to a campaign aiming to assess the transparency of evidence used in government policy documents.

The purpose of evidence week was to bring together MPs, peers, parliamentary services and people from all walks of life to generate a conversation about why evidence in policy-making matters. The week was held in collaboration with the House of Commons Library, Parliamentary Office of Science and Technology and House of Commons Science and Technology Committee, in partnership with SAGE Publishing. Individual events and briefings were contributed to by further organisations including the Royal Statistical Society, Alliance for Useful Evidence and UCL. Each day had a different theme to focus on including ‘questioning quality’ and ‘wicked problems’ i.e. superficially simple problems which turn out to be complex and multifaceted.

Throughout the week MPs, parliamentary staff and the public were welcomed to a stand in the Upper Waiting Hall to have conversations about why evidence is important to them. Photo credit to Sense about Science.

Throughout the parliamentary week, which lasts from Monday to Thursday, Sense about Science had a stand in the Upper Waiting Hall of Parliament. This location is right outside the committee rooms where members of the public give evidence to one of the many select committees. These are groups of MPs from multiple parties whose role it is to oversee the work of government departments and agencies, though their role in gathering evidence and scrutiny can sometimes have significance beyond just UK policy-making (for example this story documenting one committee’s role in investigating the relationship between Facebook, Cambridge Analytica and the propagation of ‘fake news’). The aim of this stand was to catch the attention of the public, parliamentary staff and MPs alike, and to engage them in conversations about the importance of evidence. Alongside the stand, a series of events and briefings were held within Parliament on the topic of evidence. Titles included ‘making informed decisions about health care’ and ‘it ain’t necessarily so… simple stories can go wrong’.

Each day brought a new set of VoYS volunteers to the campaign, both to attend to the stand and to document and help out with the various events during the week. Hence I found myself abandoning my own research for a day to contribute to Day 2 of the campaign, focusing on navigating data and statistics. I had a busy day; beyond chatting to people at the stand I also took over the VoYS Twitter account to document some of the day’s key events, attended a briefing about the 2021 census, and provided a video roundup for the day (which can be viewed here!). For conversations that we had at the stand we were asked to particularly focus on questions in line with the theme of the day including ‘if a statistic is the answer, what was the question?’ and ‘where does this data come from?’

MP for Bath, Wera Hobhouse, had a particular interest in the pollution data for her constituency and the evidence for the most effective methods to improve air quality. Photo credit to Sense about Science.

Trying to engage people at the stand proved to be challenging; the location of the stand meant people passing by were often in a rush to committee meetings. Occasionally the division bells, announcing a parliamentary vote, would also ring and a rush of MPs would flock by. This was great for trying to spot the more well-known MPs but less good for convincing them to stop to talk about data and statistics. In practice this meant I and other VoYS members had to adopt a very assertive approach in talking to people, a style that is generally not within the comfort zone of most scientists! However, this did lead to some very interesting conversations, including with a paediatric surgeon who was advocating to the health select committee for increasing the investment in research to treat tumours in children. He posed a very interesting question: given a finite amount of funding for tumour research, how much of this should be specifically directed towards improving the survival outcomes of younger patients and how much to older patients? We also asked MPs and members of the public to add any evidence questions they had to the stand. A member of the public wondered, ‘are there incentives to show what doesn’t work?’ and Layla Moran, MP for Oxford West and Abingdon, asked ‘how can politicians better understand uncertainty in data?’

Visitors to the stand, including MPs and Peers, were asked to add any burning questions they had about evidence to the stand. Photo credit to Sense about Science.

The week proved to be a success. Over 60 MPs from across parliamentary parties, including government ministers, interacted with some aspect of evidence week, accounting for around 10% of the total number of MPs. A wider audience, including parliamentary staff and members of the public, also engaged with the stand. Sense about Science highlighted two outcomes after the event: one was the opening event where members of various community groups met with over 40 MPs and peers and had the opportunity to explain why evidence was important to them, whether their interest was in beekeeping, safe standing at football matches or IVF treatment; the second was the concluding round-table event regarding what people require from evidence gathering. SAGE will publish an overview of this round-table as a discussion paper in the autumn.

On a personal level, I had a very valuable experience. Firstly, it was a great opportunity to visit somewhere as imposing and important as the Houses of Parliament and to contribute to such an exciting and innovative week. I was able to have some very interesting conversations with both MPs and members of the public. I found that generally everybody was enthusiastic about the need for increased use and transparency of evidence in policy-making. The challenge, instead, is to ensure that both policy-makers and the general public have the tools they need to collect, assess and apply evidence.