Faster analysis of large datasets in Python

Have you ever run into a memory error, or noticed that your function takes too long to run? Here are a few tips on how to tackle these issues.

In meteorology we often have to analyse large datasets, which can be time consuming and/or lead to memory errors. While the netCDF4, numpy and pandas packages in Python provide great tools for our data analysis, there are other packages we can use that parallelize our code: joblib, xarray and dask (see the links for documentation and the references below for further reading). This means that the input data are split between the different cores of the computer and our analysis of the different chunks of data runs in parallel, rather than one chunk after the other, speeding up the process. At the end the results are collected and returned to us in the same form as before, only faster. One of the basic ideas behind parallelization is the ‘divide and conquer’ algorithm [Fig. 1] (see, e.g., Cormen et al. 2009, or Wikipedia for a brief introduction), which splits a problem into simpler subproblems that can be solved independently and then combines their results.

Figure 1: A simple example of the ‘divide and conquer’ algorithm for sorting a list of numbers. First the list is split into simpler subproblems, which are then solved (sorted) and merged into a final sorted array. Source
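The sorting example in Fig. 1 can be sketched in a few lines of Python. This is an illustrative merge sort, not code from any of the packages above:

```python
# A minimal merge sort, mirroring the split-sort-merge steps in Fig. 1.
def merge_sort(values):
    if len(values) <= 1:               # base case: already sorted
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])    # divide into subproblems
    right = merge_sort(values[mid:])
    merged = []                        # conquer: merge the sorted halves
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))  # → [3, 9, 10, 27, 38, 43, 82]
```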

The simplest module we can use is joblib. This module can be easily applied to for-loops (see an example here): e.g. an operation that needs to be executed 1000 times can be split between the 40 cores your computer has, making the calculation that much faster. Note that Python modules often include optimized (vectorized) routines that let us avoid for-loops entirely, which is usually the faster option.
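A minimal sketch of such a parallel loop, assuming joblib is installed and with `slow_square` standing in for your own expensive function:

```python
from joblib import Parallel, delayed

def slow_square(x):
    # placeholder for an expensive per-item computation
    return x ** 2

# Run the function over 1000 inputs, split across 2 worker processes.
# n_jobs=-1 would use all available cores; results come back in order.
results = Parallel(n_jobs=2)(delayed(slow_square)(i) for i in range(1000))
print(results[:5])  # → [0, 1, 4, 9, 16]
```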

The xarray module provides tools for opening and saving multiple netCDF-type (though not limited to this format) datasets, which can then be analysed either as numpy arrays or dask arrays. If we choose to use dask arrays (also available via the dask module), any command we run on the array is automatically calculated in parallel via a type of ‘divide and conquer’ algorithm. Note that this on its own does not help us avoid a memory error, as the data eventually has to be loaded into memory (though a for-loop over these xarray/dask arrays can speed up the calculation). There are often also options to run your analysis on high-memory nodes, and the larger the dataset the more time we save through parallelization.
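A minimal sketch of this lazy, chunked evaluation (assuming dask is installed):

```python
import dask.array as da

# Build a large array split into chunks; each chunk can be processed
# on a different core, and nothing is actually computed yet.
x = da.ones((10000, 1000), chunks=(1000, 1000))
m = x.mean(axis=0)    # lazy: this only builds a task graph
result = m.compute()  # this triggers the parallel computation
print(result.shape)   # → (1000,)
```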

In the end it really depends on how much time you are willing to spend on learning about these arrays and whether it is worth the extra effort – I had to use them as they resolved my memory issues and sped up the code. It is certainly worth keeping this option in mind!

Getting started with xarray/dask

In the terminal window:

  • Use a system with conda installed (e.g. anaconda)
  • To start a bash shell type: bash
  • Create a new python environment (e.g. ‘my_env’) locally, so you can install custom packages. Give it a list of packages:
    • conda create -n my_env xarray
  • Then activate the new python environment (Make sure that you are in ‘my_env’ when using xarray):
    • source activate my_env
  • Any other packages you need can be added later (via conda install), or listed alongside xarray when you create the environment:
    • conda install scipy pandas numpy dask matplotlib joblib #etc.
  • Check your configuration with the command conda info -a; if any paths are set that should be ‘unset’, unset them before proceeding.
  • In python you can then simply import xarray, numpy or dask modules:
    • import xarray as xr; import dask.array as da; import numpy as np; from joblib import Parallel, delayed; # etc.
  • Now you can easily import datasets [e.g.: dataset = xr.open_dataset() from one file or dataset = xr.open_mfdataset() from multiple files; similarly dataset.to_netcdf() to save to one netcdf file or xr.save_mfdataset() to save to multiple netcdf files] and manipulate them using dask and xarray modules – documentation for these can be found in the links above and references below.
  • Once you open a dataset, you can access the data either by loading it into memory (xarray data array: dataset.varname.values) and analysing it as before using the numpy package (which will not run in parallel); or through the dask array (xarray dask array: dataset.varname.data), which will not load the data into memory (it only builds the best possible path to executing the operation) until you wish to save the data to a file or plot it. The latter can be analysed in much the same way as the well-known numpy arrays, but using the dask module instead [e.g. numpy.mean(array, axis=0) in dask becomes dask.array.mean(dask_array, axis=0)]. Many functions exist in the xarray module as well, meaning you can run them on the dataset itself rather than the array [e.g. dataset.mean(dim='time') is equivalent to the mean in dask or numpy].
  • Caution: If you try to do too many operations on the array in one go, the ‘divide and conquer’ task graph will become so complex that the programme will not be able to manage it. Therefore, it is best to calculate everything step-by-step, using dask_array.compute() or dask_array.persist(). Another issue I find with these new array modules is that they are slow when it comes to saving the data on disk (i.e. no faster than other modules).
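Putting the steps above together, here is a small self-contained sketch: it creates two toy netCDF files, opens them together lazily with open_mfdataset, computes a time mean, and saves the result. The file names and the variable name ('t2m') are made up for illustration; adjust them to your own data.

```python
import numpy as np
import xarray as xr

# Create two toy netCDF files (stand-ins for your real data files).
for i in range(2):
    ds = xr.Dataset(
        {"t2m": (("time", "lat"), np.full((3, 4), 280.0 + i))},
        coords={"time": np.arange(i * 3, i * 3 + 3), "lat": [0, 10, 20, 30]},
    )
    ds.to_netcdf(f"toy_{i}.nc")

# Open all files as one dask-backed dataset: nothing is loaded yet.
combined = xr.open_mfdataset("toy_*.nc", combine="by_coords")
clim = combined["t2m"].mean(dim="time").compute()  # evaluate step-by-step
clim.to_netcdf("t2m_time_mean.nc")
print(float(clim.sel(lat=0)))  # → 280.5
```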

I would like to thank Shannon Mason and Peter Gabrovšek for their helpful advice and suggestions.


Cormen, T.H., C.E. Leiserson, R.L. Rivest, C. Stein, 2009: An introduction to algorithms. MIT press, third edition, 1312 pp.

Dask Development Team, 2016: Dask: Library for dynamic task scheduling.

Hoyer, S. & Hamman, J., 2017: xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software, 5, p. 10.

Hoyer, S., C. Fitzgerald, J. Hamman and others, 2016: xarray: v0.8.0.

Rocklin, M., 2016: Dask: Parallel Computation with Blocked algorithms and Task Scheduling. Proceedings of the 14th Python in Science Conference (K. Huff and J. Bergstra, eds.), 130–136.

Hierarchies of Models

With thanks to Inna Polichtchouk.

General circulation models (GCMs) of varying complexity are used in the atmospheric and oceanic sciences to study different atmospheric processes and to simulate the response of the climate system to climate change and other forcings.

However, Held (2005) warned the climate community that the gap between understanding and simulating atmospheric and oceanic processes is becoming wider. He stressed the use of model hierarchies for improved understanding of the atmosphere and oceans (Fig. 1). Often at the bottom of the hierarchy lie the well-understood, idealized, one- or two-layer models. In the middle of the hierarchy lie multi-layer models, which omit certain processes such as land-ocean-atmosphere interactions or moist physics. And finally, at the top of the hierarchy lie fully coupled atmosphere-ocean general circulation models that are used for climate projections. Such model hierarchies are already well developed in other sciences (Held 2005), such as molecular biology, where studying less complex organisms (e.g. mice) tells us something about the more complex humans (through their shared evolutionary history).

Figure 1: Model hierarchy of midlatitude atmosphere (as used for studying storm tracks). The simplest models are on the left and the most complex models are on the right. Bottom panels show eddy kinetic energy (EKE, contours) and precipitation (shading) with increase in model hierarchy (left-to-right): No precipitation in a dry core model (left), zonally homogeneous EKE and precipitation in an aquaplanet model (middle), and zonally varying EKE and precipitation in the most complex model (right). Source: Shaw et al. (2016), Fig. B2.

Model hierarchies have now become an important research tool for furthering our understanding of the climate system [see, e.g., Polvani et al. (2017), Jeevanjee et al. (2017), Vallis et al. (2018)]. This approach allows us to delineate the most important processes responsible for the circulation response to climate change (e.g., the mid-latitude storm track shift, the widening of the tropical belt, etc.), to perform hypothesis testing, and to assess the robustness of results in different configurations.

In my PhD, I have extensively used the model hierarchy concept to understand mid-latitude tropospheric dynamics (Fig. 1). One-layer barotropic and two-layer quasi-geostrophic models are often used as a first step to understand large-scale dynamics and to establish the importance of barotropic and baroclinic processes (also discussed in my previous blog post). Subsequently, more realistic “dry” non-linear multi-layer models with a simple treatment of the boundary layer and radiation [the so-called “Held & Suarez” setup, first introduced in Held and Suarez (1994)] can be used to study zonally homogeneous mid-latitude dynamics without complicating the setup with physical parametrisations (e.g. moist processes) or the full range of ocean-land-ice-atmosphere interactions. For example, I have successfully used the Held & Suarez setup to test the robustness of the annular mode variability (see my previous blog post) to different model climatologies (Boljka et al., 2018). I found that the baroclinic annular mode timescale and its link to the barotropic annular mode are sensitive to the model climatology. This can have an impact on climate variability in a changing climate.

Additional complexity can be introduced to the multi-layer dry models by adding moist processes and physical parametrisations in the so-called “aquaplanet” setup [e.g. Neale and Hoskins (2000)]. The aquaplanet setup allows us to elucidate the role of moist processes and parametrisations in zonally homogeneous dynamics. For example, mid-latitude cyclones tend to be stronger in moist atmospheres.

To study effects of zonal asymmetries on the mid-latitude dynamics, localized heating or topography can be further introduced to the aquaplanet and Held & Suarez setup to force large-scale stationary waves, reproducing the south-west to north-east tilts in the Northern Hemisphere storm tracks (bottom left panel in Fig. 1). This setup has helped me elucidate the differences between the zonally homogeneous and zonally inhomogeneous atmospheres, where the planetary scale (stationary) waves and their interplay with the synoptic eddies (cyclones) become increasingly important for the mid-latitude storm track dynamics and variability on different temporal and spatial scales.

Even further complexity can be achieved by coupling atmospheric models to dynamic ocean and/or land and ice models (coupled atmosphere-ocean or atmosphere-only GCMs in Fig. 1), all of which bring the model closer to reality. However, interpreting results from such complex models is very difficult without having first studied the hierarchy of models, as too many processes are acting simultaneously in such fully coupled models. Further insights can also be gained by improving the theoretical (mathematical) understanding of atmospheric processes using a similar hierarchical approach [see e.g. Boljka and Shepherd (2018)].


Boljka, L. and T.G. Shepherd, 2018: A multiscale asymptotic theory of extratropical wave–mean flow interaction. J. Atmos. Sci., 75, 1833–1852.

Boljka, L., T.G. Shepherd, and M. Blackburn, 2018: On the coupling between barotropic and baroclinic modes of extratropical atmospheric variability. J. Atmos. Sci., 75, 1853–1871.

Held, I. M., 2005: The gap between simulation and understanding in climate modeling. Bull. Am. Meteorol. Soc., 86, 1609 – 1614.

Held, I. M. and M. J. Suarez, 1994: A proposal for the intercomparison of the dynamical cores of atmospheric general circulation models. Bull. Amer. Meteor. Soc., 75, 1825–1830.

Jeevanjee, N., Hassanzadeh, P., Hill, S., Sheshadri, A., 2017: A perspective on climate model hierarchies. JAMES, 9, 1760–1771.

Neale, R. B., and B. J. Hoskins, 2000: A standard test for AGCMs including their physical parametrizations: I: the proposal. Atmosph. Sci. Lett., 1, 101–107.

Polvani, L. M., A. C. Clement, B. Medeiros, J. J. Benedict, and I. R. Simpson (2017), When less is more: Opening the door to simpler climate models. EOS, 98.

Shaw, T. A., M. Baldwin, E. A. Barnes, R. Caballero, C. I. Garfinkel, Y-T. Hwang, C. Li, P. A. O’Gorman, G. Riviere, I R. Simpson, and A. Voigt, 2016: Storm track processes and the opposing influences of climate change. Nature Geoscience, 9, 656–664.

Vallis, G. K., Colyer, G., Geen, R., Gerber, E., Jucker, M., Maher, P., Paterson, A., Pietschnig, M., Penn, J., and Thomson, S. I., 2018: Isca, v1.0: a framework for the global modelling of the atmospheres of Earth and other planets at varying levels of complexity. Geosci. Model Dev., 11, 843-859.

Baroclinic and Barotropic Annular Modes of Variability


Modes of variability are climatological features that have global effects on regional climate and weather. They are identified through spatial structures and the time series associated with them (so-called EOF/PC analysis, which finds the patterns explaining the largest variability of a given atmospheric field). Examples of modes of variability include the El Niño Southern Oscillation, the Madden-Julian Oscillation, the North Atlantic Oscillation, the annular modes, etc. The latter are named after the “annulus” (a region bounded by two concentric circles) as they occur in the Earth’s midlatitudes (a band of atmosphere bounded by the polar and tropical regions, Fig. 1), and are the most important modes of midlatitude variability, generally representing 20-30% of the variability in a field.
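To make the EOF/PC idea concrete, here is a minimal sketch on synthetic data using numpy's singular value decomposition (the real analyses cited below work on meteorological fields and include refinements such as area weighting):

```python
import numpy as np

# Given an anomaly field X (time x space), the leading EOF is the spatial
# pattern explaining the most variance, and its PC is the associated time
# series. Synthetic data for illustration: one dominant pattern plus noise.
rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, np.pi, 50))      # a fixed spatial pattern
amplitude = rng.standard_normal(200)             # its time-varying amplitude
X = np.outer(amplitude, pattern) + 0.1 * rng.standard_normal((200, 50))
X -= X.mean(axis=0)                              # remove the time mean

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eof1 = Vt[0]                   # leading spatial pattern (EOF1)
pc1 = U[:, 0] * s[0]           # associated time series (PC1)
explained = s[0] ** 2 / np.sum(s ** 2)
print(f"EOF1 explains {explained:.0%} of the variance")
```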

Figure 1: Southern Hemisphere midlatitudes (red concentric circles) as annulus, region where annular modes have the largest impacts. Source.

There are two types of annular mode: baroclinic (based on eddy kinetic energy, a proxy for eddy activity and an indicator of storm-track intensity) and barotropic (based on zonal mean zonal wind, representing the north-south shifts of the jet stream) (Fig. 2). The latter is usually referred to as the Southern (SAM or Antarctic Oscillation) or Northern (NAM or Arctic Oscillation) Annular Mode (depending on the hemisphere), has a generally quasi-barotropic (vertically uniform) structure, and impacts temperature variations, sea-ice distribution, and storm paths in both hemispheres on timescales of about 10 days. The former is referred to as the BAM (baroclinic annular mode) and exhibits a strong vertical structure associated with strong vertical wind shear (baroclinicity); its impacts are yet to be determined (e.g. Thompson and Barnes 2014, Marshall et al. 2017). These two modes of variability are linked to the key processes of midlatitude tropospheric dynamics that are involved in the growth (baroclinic processes) and decay (barotropic processes) of midlatitude storms. The growth stage of midlatitude storms is conventionally associated with an increase in eddy kinetic energy (EKE) and the decay stage with a decrease in EKE.

Figure 2: Barotropic annular mode (right), based on zonal wind (contours), associated with eddy momentum flux (shading); Baroclinic annular mode (left), based on eddy kinetic energy (contours), associated with eddy heat flux (shading). Source: Thompson and Woodworth (2014).

However, recent observational studies (e.g. Thompson and Woodworth 2014) have suggested decoupling of baroclinic and barotropic components of atmospheric variability in the Southern Hemisphere (i.e. no correlation between the BAM and SAM) and a simpler formulation of the EKE budget that only depends on eddy heat fluxes and BAM (Thompson et al. 2017). Using cross-spectrum analysis, we empirically test the validity of the suggested relationship between EKE and heat flux at different timescales (Boljka et al. 2018). Two different relationships are identified in Fig. 3: 1) a regime where EKE and eddy heat flux relationship holds well (periods longer than 10 days; intermediate timescale); and 2) a regime where this relationship breaks down (periods shorter than 10 days; synoptic timescale). For the relationship to hold (by construction), the imaginary part of the cross-spectrum must follow the angular frequency line and the real part must be constant. This is only true at the intermediate timescales. Hence, the suggested decoupling of baroclinic and barotropic components found in Thompson and Woodworth (2014) only works at intermediate timescales. This is consistent with our theoretical model (Boljka and Shepherd 2018), which predicts decoupling under synoptic temporal and spatial averaging. At synoptic timescales, processes such as barotropic momentum fluxes (closely related to the latitudinal shifts in the jet stream) contribute to the variability in EKE. This is consistent with the dynamics of storms that occur on timescales shorter than 10 days (e.g. Simmons and Hoskins 1978). This is further discussed in Boljka et al. (2018).
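The cross-spectrum analysis used here can be sketched on synthetic time series with scipy.signal.csd (the study itself uses meteorological fields such as EKE and eddy heat flux, not this toy data):

```python
import numpy as np
from scipy.signal import csd

# Two synthetic daily time series where y lags x by 2 days, so their
# cross-spectrum has a nonzero imaginary (quadrature) part.
rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
x = np.convolve(x, np.ones(5) / 5, mode="same")   # smooth: add persistence
y = np.roll(x, 2) + 0.1 * rng.standard_normal(n)  # lagged copy plus noise

freqs, Pxy = csd(x, y, fs=1.0, nperseg=256)
# The real part (co-spectrum) measures in-phase covariability at each
# frequency; the imaginary part (quadrature spectrum) the out-of-phase part.
co, quad = Pxy.real, Pxy.imag
```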

Figure 3: Imaginary (black solid line) and Real (grey solid line) parts of cross-spectrum between EKE and eddy heat flux. Black dashed line shows the angular frequency (if the tested relationship holds, the imaginary part of cross-spectrum follows this line), the red line distinguishes between the two frequency regimes discussed in text. Source: Boljka et al. (2018).


Boljka, L., and T. G. Shepherd, 2018: A multiscale asymptotic theory of extratropical wave–mean flow interaction. J. Atmos. Sci., in press.

Boljka, L., T. G. Shepherd, and M. Blackburn, 2018: On the coupling between barotropic and baroclinic modes of extratropical atmospheric variability. J. Atmos. Sci., in review.

Marshall, G. J., D. W. J. Thompson, and M. R. van den Broeke, 2017: The signature of Southern Hemisphere atmospheric circulation patterns in Antarctic precipitation. Geophys. Res. Lett., 44, 11,580–11,589.

Simmons, A. J., and B. J. Hoskins, 1978: The life cycles of some nonlinear baroclinic waves. J. Atmos. Sci., 35, 414–432.

Thompson, D. W. J., and E. A. Barnes, 2014: Periodic variability in the large-scale Southern Hemisphere atmospheric circulation. Science, 343, 641–645.

Thompson, D. W. J., B. R. Crow, and E. A. Barnes, 2017: Intraseasonal periodicity in the Southern Hemisphere circulation on regional spatial scales. J. Atmos. Sci., 74, 865–877.

Thompson, D. W. J., and J. D. Woodworth, 2014: Barotropic and baroclinic annular variability in the Southern Hemisphere. J. Atmos. Sci., 71, 1480–1493.