Atmospheric blocking: why is it so hard to predict?

Atmospheric blocks are nearly stationary large-scale flow features that effectively block the prevailing westerly winds and redirect mobile cyclones. They are typically characterised by a synoptic-scale, quasi-stationary high pressure system in the midlatitudes that can remain over a region for several weeks. Blocking events can cause extreme weather: heat waves in summer and cold spells in winter, and the impacts associated with these events can escalate due to a block's persistence. Because of this, it is important that we can forecast blocking accurately. However, atmospheric blocking has been shown to be the cause of some of the poorest forecasts in recent years. Looking at all occasions when the ECMWF model experienced a period of very low forecast skill, Rodwell et al. (2013) found that the average flow pattern against which these forecasts verified was an easily distinguishable atmospheric blocking pattern (Figure 1). But why are blocks so hard to forecast?

Figure 1:  Average verifying 500 hPa geopotential height (Z500) field for occasions when the ECMWF model experienced very low skill. From Rodwell et al. (2013).

There are several reasons why forecasting blocking is a challenge. Firstly, there is no universally accepted definition of what constitutes a block: several different flow configurations that could be referred to as blocks are shown in Figure 2. The variety of flow patterns used to define blocking brings with it a variety of mechanisms that are dynamically important for blocks developing in a forecast (Woollings et al. 2018), so many phenomena must be well represented in a model for it to forecast all blocking events accurately. Secondly, there is no complete dynamical theory for block onset and maintenance – we do not know whether a process key to blocking dynamics is missing from the equation set solved by numerical weather prediction models and is contributing to the forecast error. Finally, many of the known mechanisms associated with block onset and maintenance are also known sources of model uncertainty. For example, diabatic processes within extratropical cyclones have been shown to contribute substantially to blocking events (Pfahl et al. 2015), and their parameterisation has been shown to affect medium-range forecasts of ridge-building events (Martínez-Alvarado et al. 2015).

Figure 2: Different flow patterns, shown using Z500 (contours), that have been defined as blocks. From Woollings et al. (2018).

We do, however, know some ways to improve the representation of blocking: increase the horizontal resolution of the model (Schiemann et al. 2017); improve the parameterisation of subgrid physical processes (Jung et al. 2010); remove underlying model biases (Scaife et al. 2010); and, in my PhD, we found that improvements to a model's dynamical core (the part of the model used to solve the governing equations) can also improve the medium-range forecast of blocking. In Figure 3, the frequency of blocking that occurred during two northern hemisphere winters is shown for the ERA-Interim reanalysis and three operational weather forecast centres (the ECMWF, Met Office (UKMO) and the Korean Meteorological Administration (KMA)). Both KMA and UKMO use the Met Office Unified Model; however, before the winter of 2014/15 the UKMO updated the model to use a new dynamical core, whilst KMA continued to use the original. This means that for the 2013/14 winter the UKMO and KMA forecasts are from the same model with the same dynamical core, whilst for the 2014/15 winter they are from the same model but with different dynamical cores. The clear improvement in the UKMO forecasts in 2014/15 can hence be attributed to the new dynamical core. For a full analysis of this improvement see Martínez-Alvarado et al. (2018).

Figure 3: The frequency of blocking during winter in the northern hemisphere in ERA-Interim (grey shading) and in seven-day forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF), the Met Office (UKMO) and the Korean Meteorological Administration (KMA). Box plots show the spread in the ensemble forecast from each centre.

In the remainder of my PhD I aim to investigate the link between errors in forecasts of blocking and the representation of upstream cyclones. I am particularly interested in whether the parameterisation of diabatic processes (a known source of model uncertainty) could be causing the downstream errors in Rossby wave amplification and blocking.

Email: j.maddison@pgr.reading.ac.uk.

References:

Rodwell, M. J., and Coauthors, 2013: Characteristics of occasional poor medium-range weather forecasts for Europe. Bulletin of the American Meteorological Society, 94 (9), 1393–1405.

Woollings, T., and Coauthors, 2018: Blocking and its response to climate change. Current Climate Change Reports, 4 (3), 287–300.

Pfahl, S., C. Schwierz, M. Croci-Maspoli, C. Grams, and H. Wernli, 2015: Importance of latent heat release in ascending air streams for atmospheric blocking. Nature Geoscience, 8 (8), 610–614.

Martínez-Alvarado, O., E. Madonna, S. Gray, and H. Joos, 2015: A route to systematic error in forecasts of Rossby waves. Quart. J. Roy. Meteor. Soc., 142, 196–210.

Martínez-Alvarado, O., and R. Plant, 2014: Parametrized diabatic processes in numerical simulations of an extratropical cyclone. Quart. J. Roy. Meteor. Soc., 140 (682), 1742–1755.

Scaife, A. A., T. Woollings, J. Knight, G. Martin, and T. Hinton, 2010: Atmospheric blocking and mean biases in climate models. Journal of Climate, 23 (23), 6143–6152.

Schiemann, R., and Coauthors, 2017: The resolution sensitivity of northern hemisphere blocking in four 25-km atmospheric global circulation models. Journal of Climate, 30 (1), 337–358.

Jung, T., and Coauthors, 2010: The ECMWF model climate: Recent progress through improved physical parametrizations. Quart. J. Roy. Meteor. Soc., 136 (650), 1145–1160.

Communicating uncertainties associated with anthropogenic climate change

Email: j.f.talib@pgr.reading.ac.uk

This week Prof. Ed Hawkins from the Department of Meteorology and NCAS-Climate gave a University of Reading public lecture discussing the science of climate change. A plethora of research was presented, all highlighting that humans are changing our climate. As scientists we can study the greenhouse effect in the laboratory, observe increasing temperatures across the majority of the planet, or simulate the impact of human actions on the Earth's climate using climate models.

Figure 1. Global-mean surface temperature in observations (solid black line), and climate model simulations with (red shading) and without (blue shading) human actions. Shown during Prof. Ed Hawkins’ University of Reading Public Lecture.

Fig. 1, presented in Ed Hawkins' lecture, shows the global-mean temperature rise associated with human activities. Two sets of climate simulations have been performed to produce this plot. The first set, shown in blue, consists of simulations driven solely by natural forcings, i.e. variations in solar radiation and volcanic eruptions. The second, shown in red, includes both the natural forcings and the forcing associated with greenhouse gas emissions from human activities. The shading indicates the spread amongst climate models, whilst the observed global-mean temperature is shown by the solid black line. From this plot it is evident that all the climate models attribute the rising temperatures over the 20th and 21st centuries to human activity: simulations without greenhouse gas emissions from human activity show a much smaller rise, if any, in global-mean temperature.

However, whilst there is much agreement amongst climate scientists and climate models that our planet is warming due to human activity, understanding the local impacts of anthropogenic climate change remains much more uncertain.

For example, my PhD research aims to understand what controls the location and intensity of the Intertropical Convergence Zone (ITCZ). The ITCZ is a discontinuous, zonal band of precipitation in the tropics that migrates meridionally over the seasonal cycle (see Fig. 2). It is associated with the wet and dry seasons over Africa, the development of the South Asian Monsoon and the life cycle of tropical cyclones. However, current climate models struggle to simulate the characteristics of the ITCZ. This, alongside other issues, means that climate models differ in the response of tropical precipitation to anthropogenic climate change.

Figure 2. Animation showing the seasonal cycle of observed monthly-mean precipitation rates between 1979 and 2014.

Figure 3 is taken from a report written by the Intergovernmental Panel on Climate Change (Climate Change 2013: The Physical Science Basis). Both maps show the projected change in Northern Hemisphere winter precipitation from climate model simulations, for 2016 to 2035 (left) and 2081 to 2100 (right), relative to 1986 to 2005, under a scenario in which minimal action is taken to limit greenhouse gas emissions (RCP8.5). Whilst the projected changes in precipitation are an interesting topic in their own right, I'd like to draw your attention to the lines and dots annotated on each map. The lines indicate where the majority of climate models agree on a small change; the map on the left shows that most climate models agree on small changes in precipitation over the majority of the globe over the next two decades. Dots, meanwhile, indicate where climate models agree on a substantial change in Northern Hemisphere winter precipitation. Where neither lines nor dots are plotted, the models do not agree on the projected change. The map on the right shows that across the tropics there are substantial areas without either marking: over the majority of Africa, South America and the Maritime Continent, models disagree on the future of precipitation under climate change.

Figure 3. Changes in Northern Hemisphere winter precipitation for 2016 to 2035 (left) and 2081 to 2100 (right), relative to 1986 to 2005, under a scenario with minimal reduction in anthropogenic greenhouse gas emissions (RCP8.5). Taken from IPCC – Climate Change 2013: The Physical Science Basis.

How should scientists present these uncertainties?

I must confess that I am nowhere near an expert in communicating uncertainties; however, I hope some of my thoughts will encourage a discussion amongst scientists and users of climate data. Here are some of the ideas I've picked up during my PhD, and my thoughts on them:

  • Climate model average – Take the average across the climate model simulations. With this method, though, you risk smoothing out large positive and negative trends, and the multi-model average is not itself a "true" projection of the changes due to anthropogenic climate change.
  • Every climate model outcome – Show the full range of climate model projections to the user. Here you risk presenting the user with too much climate data, and the user may simply trust the model outputs which suit their own agenda.
  • Storylines – This idea was first shown to me in a paper by Zappa and Shepherd (2017). You present a series of storylines, each highlighting the key processes associated with variability in the regional weather pattern of interest; each plausible change in that set of processes leads to a different climate projection. However, once again, the user of the climate data has to reach their own conclusion on which projection to take action on.
  • Probabilities with climate projections – Typically, short- and medium-range weather forecasts use probabilities to support the user. These probabilities are generated by re-running the simulations, each with slightly different initial conditions or a slight change in the model physics, and seeing what percentage of the simulations agree on an outcome. With climate model simulations, however, it is more difficult to associate probabilities with projections. How do you generate them? Climate models share similarities in how they represent the physics of our atmosphere, so you don't want the probabilities attached to each projection to be inflated simply because many models have similar set-ups. You could base the probabilities on how well each climate model simulates the past, but just because a model simulates the past well doesn't mean it will respond correctly to future forcing.

There is much more that can be said about communicating uncertainty among climate model projections – a challenge which will continue for decades. As climate scientists we can sometimes fall into the trap of concentrating on the uncertainties. We need to keep presenting the work that we are confident about, to ensure that the right action is taken to mitigate anthropogenic climate change.

International Conferences on Subseasonal to Decadal Prediction

I was recently fortunate enough to attend the International Conferences on Subseasonal to Decadal Prediction in Boulder, Colorado. This was a week-long event organised by the World Climate Research Programme (WCRP) and was a joint meeting with two conferences taking place simultaneously: the Second International Conference on Subseasonal to Seasonal Prediction (S2S) and the Second International Conference on Seasonal to Decadal Prediction (S2D). There were also joint sessions addressing common issues surrounding prediction on these timescales.

Weather and climate variations on subseasonal to seasonal (from around 2 weeks to a season) to decadal timescales can have enormous social, economic, and environmental impacts, making skillful predictions on these timescales a valuable tool for policymakers. As a result, there is an increasingly large interest within the scientific and operational forecasting communities in developing forecasts to improve our ability to predict severe weather events. On S2S timescales, these include high-impact meteorological events such as tropical cyclones, floods, droughts, and heat and cold waves. On S2D timescales, while the focus broadly remains on similar events (such as precipitation and surface temperatures), deciphering the roles of internal and externally-forced variability in forecasts also becomes important.

Attendees of the International Conferences on Subseasonal to Decadal Prediction

The conferences were attended by nearly 350 people, of which 92 were Early Career Scientists (either current PhD students or those who completed their PhD within the last 5-7 years), from 38 different countries. There were both oral and poster presentations on a wide variety of topics, including mechanisms of S2S and S2D predictability (e.g. the stratosphere and tropical-extratropical teleconnections) and current modelling issues in S2S and S2D prediction. I was fortunate to be able to give an oral presentation about some of my recently published work, in which we examine the performance of the ECMWF seasonal forecast model at representing a teleconnection mechanism which links Indian monsoon precipitation to weather and climate variations across the Northern Hemisphere. After my talk I spoke to several other people who are working on similar topics, which was very beneficial and helped give me ideas for analysis that I could carry out as part of my own research.

One of the best things about attending an international conference is the networking opportunities that it presents, both with people you already know and with potential future collaborators from other institutions. This conference was no exception, and as well as lunch and coffee breaks there was an Early Career Scientists evening meal. This gave me a chance to meet scientists from all over the world who are at a similar stage of their career to myself.

The view from the NCAR Mesa Lab

Boulder is located at the foot of the Rocky Mountains, so after the conference I took the opportunity to do some hiking on a few of the many trails that lead out from the city. I also took a trip up to NCAR’s Mesa Lab, which is located up the hillside away from the city and has spectacular views across Boulder and the high plains of Colorado, as well as a visitor centre with meteorological exhibits. It was a great experience to attend this conference and I am very grateful to NERC and the SummerTIME project for funding my travel and accommodation.

Email: j.beverley@pgr.reading.ac.uk

Modelling windstorm losses in a climate model

Extratropical cyclones cause vast amounts of damage across Europe throughout the winter seasons. The damage from these cyclones mainly comes from the associated severe winds. The most intense cyclones have gusts of over 200 kilometres per hour, resulting in substantial damage to property and forestry; for example, the Great Storm of 1987 uprooted approximately 15 million trees in one night. The average loss from these storms is over $2 billion per year (Schwierz et al. 2010), second only globally to Atlantic hurricanes in terms of insured losses from natural hazards. However, the most severe cyclones, such as Lothar (26/12/1999) and Kyrill (18/1/2007), can cause losses in excess of $10 billion (Munich Re, 2016). One property of extratropical cyclones is that they have a tendency to cluster (to arrive in groups – see example in Figure 1), and in such cases these impacts can be greatly increased. For example, Windstorm Lothar was followed just one day later by Windstorm Martin, and the two storms combined caused losses of over $15 billion. The large-scale atmospheric dynamics associated with clustering events have been discussed in a previous blog post and also in the scientific literature (Pinto et al., 2014; Priestley et al. 2017).

Figure 1. Composite visible satellite image from 11 February 2014 of 4 extratropical cyclones over the North Atlantic (circled) (NASA).

A large part of my PhD has involved investigating exactly how important the clustering of cyclones is for losses across Europe during the winter. In order to do this, I have used 918 years of high-resolution coupled climate model data from HiGEM (Shaffrey et al., 2017), which provides a huge number of winter seasons and cyclone events for analysis.

In order to understand how clustering affects losses, I first of all need to know how much loss/damage is associated with each individual cyclone. This is done using a measure called the Storm Severity Index (SSI – Leckebusch et al., 2008), a proxy for losses based on the 10-metre wind field of each cyclone. The SSI is a good proxy for windstorm loss. Firstly, the wind speed at each location is scaled by the 98th percentile of the wind speed climatology at that location, and only winds exceeding this threshold are considered. This ensures that only the most severe winds at any one point contribute, as different locations have different thresholds for what would be classed as 'damaging'. The exceedance above the 98th percentile is then raised to the power of three, because damage is a highly non-linear function of wind speed. Finally, we apply a population density weighting to our calculations. This weighting is required because a hypothetical gust of 40 m/s across London will cause considerably more damage than the same gust across far northern Scandinavia, and population density is a good approximation for the density of insured property. An example of the SSI calculated for Windstorm Lothar is shown in Figure 2.
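As an illustration, here is a minimal sketch of this kind of calculation in Python, assuming the storm footprint, the 98th-percentile climatology and the population density are already available as gridded arrays on the same grid (the exact formulation used in Leckebusch et al. (2008) and in my analysis differs in its details):

    import numpy as np

    def storm_severity_index(wind_10m, wind_98, pop_density):
        # wind_10m    : 2D array of 10-metre wind speeds for one storm footprint (m/s)
        # wind_98     : 2D array of the local 98th percentile of the wind climatology (m/s)
        # pop_density : 2D array of population density, a proxy for insured property
        # Only winds exceeding the local 98th percentile are considered damaging
        exceedance = np.maximum(wind_10m / wind_98 - 1.0, 0.0)
        # Damage is a highly non-linear function of wind speed, hence the cube
        ssi_field = pop_density * exceedance**3
        # Return the gridded field (as in Fig. 2b) and a domain-total loss proxy
        return ssi_field, ssi_field.sum()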

 

Figure 2. (a) Wind footprint of Windstorm Lothar (25-27/12/1999) – 10 metre wind speed in coloured contours (m/s). Black line is the track of Lothar with points every 6 hours (black dots). (b) The SSI field of Windstorm Lothar. All data from ERA-Interim.

 

From Figure 2b you can see how most of the damage from Windstorm Lothar was concentrated across central/northern France and also across southern Germany. This is because the winds there were most extreme relative to the local climatology. Even though the winds are strongest over the North Atlantic Ocean, the lack of insured property and the much higher climatological winter-mean wind speed there mean that we do not observe losses/damage from Windstorm Lothar in those locations.

Figure 3. The average SSI for 918 years of HiGEM data.

 

I can apply the SSI to all of the individual cyclone events in HiGEM and therefore construct a climatology of where windstorm losses occur. Figure 3 shows the average loss across all 918 years of HiGEM. You can see that the losses are concentrated in a band extending eastwards from the southern UK towards Poland. This mainly covers Great Britain, Belgium, the Netherlands, France, Germany and Denmark.

This blog post introduces my methodology for calculating and investigating the losses associated with winter-season extratropical cyclones. Priestley et al. (2018) uses this methodology to investigate the role of clustering in winter windstorm losses.

This work has been funded by the SCENARIO NERC DTP and also co-sponsored by Aon Benfield.

 

Email: m.d.k.priestley@pgr.reading.ac.uk

 

References

Leckebusch, G. C., Renggli, D., and Ulbrich, U. 2008. Development and application of an objective storm severity measure for the Northeast Atlantic region. Meteorologische Zeitschrift. https://doi.org/10.1127/0941-2948/2008/0323.

Munich Re. 2016. Loss events in Europe 1980 – 2015. 10 costliest winter storms ordered by overall losses. https://www.munichre.com/touch/naturalhazards/en/natcatservice/significant-natural-catastrophes/index.html

Pinto, J. G., Gómara, I., Masato, G., Dacre, H. F., Woollings, T., and Caballero, R. 2014. Large-scale dynamics associated with clustering of extratropical cyclones affecting Western Europe. Journal of Geophysical Research: Atmospheres. https://doi.org/10.1002/2014JD022305.

Priestley, M. D. K., Dacre, H. F., Shaffrey, L. C., Hodges, K. I., and Pinto, J. G. 2018. The role of European windstorm clustering for extreme seasonal losses as determined from a high resolution climate model, Nat. Hazards Earth Syst. Sci. Discuss., https://doi.org/10.5194/nhess-2018-165, in review.

Priestley, M. D. K., Pinto, J. G., Dacre, H. F., and Shaffrey, L. C. 2017. Rossby wave breaking, the upper level jet, and serial clustering of extratropical cyclones in western Europe. Geophysical Research Letters. https://doi.org/10.1002/2016GL071277.

Schwierz, C., Köllner-Heck, P., Zenklusen Mutter, E. et al. 2010. Modelling European winter wind storm losses in current and future climate. Climatic Change. https://doi.org/10.1007/s10584-009-9712-1.

Shaffrey, L. C., Hodson, D., Robson, J., Stevens, D., Hawkins, E., Polo, I., Stevens, I., Sutton, R. T., Lister, G., Iwi, A., et al. 2017. Decadal predictions with the HiGEM high resolution global coupled climate model: description and basic evaluation, Climate Dynamics, https://doi.org/10.1007/s00382-016-3075-x.

Advice for students starting their PhD

The Meteorology Department at Reading has just welcomed its new cohort of PhD students, so we gathered some pearls of wisdom for the years ahead:


“Start good habits from the beginning; decide how you will make notes on papers, and how you will save papers, and where you will write notes, and how you will save files. Create a spreadsheet of where your code is, and what it does, and what figure it creates. It will save you so much time.”

“Write down everything you do in note form; this helps you a) track back your progress if you take time out from research and b) makes writing your thesis a LOT easier…”

“Pick your supervisor carefully. Don’t kid yourself that they will be different as a PhD supervisor; look for someone understanding and supportive.”

“Expect the work to progress slowly at first, things will not all work out simply.”

“Don’t give up! And don’t be afraid to ask for help from other PhDs or other staff members (in addition to your supervisors).”

“Don’t compare yourself to other PhDs, and make sure to take some time off, you’re allowed a holiday!”

“Ask for help all the time.”

“Keep a diary of the work you do each day so you remember exactly what you’ve done 6 months later.”

“Don’t worry if your supervisors/people in years above seem to know everything, or can do things really easily. There hasn’t been an administrative cock-up, you’re not an impostor: everyone’s been there. Also, get into a good routine. It really helps.”

“Talk to your supervisor about what both of you expect and decide how often you want to meet at the beginning. This will make things easier.”

“Don’t compare with other students. A PhD is an individual project with its own aims and targets. Everyone will get to completion on their own journey.”

“You’ll be lost but achieving something. You can’t see it yet.”

The Many Speak Of Computer

Knowing multiple languages can be hard. As any polyglot will tell you, there are many difficulties that can come from mixing and matching languages: losing vocabulary in both, only being able to think and speak in one at a time, and having to remember to apply the correct spelling and pronunciation conventions in the correct contexts.

Humans aren’t the only ones who experience these types of multiple-language issues, however. Computers can also suffer from linguistic problems pertaining to the “programming languages” humans use to communicate with them, as well as the more hidden, arcane languages they use to speak to one another. This can cause untold frustration to their users. Dealing with seemingly arbitrary computing issues while doing science, we humans, especially if we aren’t computing experts, can get stuck in a mire with no conceivable way out.

Caught in the Matrix…

Problems with programming languages are the easiest problems of this nature to solve. Often the mistake is that the human in question lacks the necessary vocabulary, or syntax, and the problem can be solved with a quick perusal of Google or Stack Exchange to find someone with a solution. Humans are much better at communicating and expressing ideas in native human languages than computational ones. They often encounter the same problems as one another and describe them in similar ways. It is not uncommon to overhear a programmer lamenting: "But I know what I mean!" So looking for another human to act as a 'translator' can be very effective.

Otherwise, it’s a problem with the programming language itself; the language’s syntax is poorly defined, or doesn’t include provision for certain concepts or ideas to be expressed. Imagine trying to describe the taste of a lemon in a language which doesn’t possess words for ‘bitter’ or ‘sour’. At best these problems can be solved by installing some kind of library, or package, where someone else has written a work-around and you can piggy-back off of that effort. Like learning vocabulary from a new dialect. At worst you have to write these things yourself, and if you’re kind, and write good code, you will share them with the community; you’re telling the people down the pub that you’ve decided that the taste of lemons, fizzy colas, and Flanders red is “sour”.

Describe a lemon without using “bitter” or “sour”

There is, however, a more insidious and undermining class of problems, pertaining to the aforementioned arcane computer-only languages. These languages, more aptly called “machine code”, are the incomprehensible languages computers and different parts of a computer use to communicate with one another.

For the many programming languages known as "compiled languages", the computer must 'compile' the code written by a human into machine code, which it then executes to run the program. This is generally a good thing: it helps catch errors before potentially disastrous code is run on a machine, and it significantly improves performance, as computers don't need to translate code on the fly, line by line. But there is a catch.

There is no one single machine code. And unless a computer both knows the language an executable is written in, and is able to speak it, then tough tomatoes, it can’t run that code.

This is fine for code you have written and compiled yourself, but when importing code from elsewhere it can cause tough-to-diagnose problems, especially on the large computational infrastructures used in scientific computing, with many computers that might not all speak the same languages. In a discipline like meteorology, with a large legacy codebase and where use of certain libraries is assumed, not knowing how to execute pre-compiled code will leave the hopeful researcher in a rut, especially where access to the source code of a library is restricted because it is a commercial product. You know there's a dialect that has the words the computer needs to express itself, and you have a set of dictionaries, but you don't know any of the languages and they're all completely unintelligible; which dictionary do you use?

All unintelligible?

So what can you do? Attempt to find alternative codebases. Write them yourself. Often, however, we stand on the shoulders of giants, and having to do so would be prohibitive. Ask your institution’s computing team for help – but they don’t always know the answers.

There are solutions we can employ in our day to day coding practices that can help. Clear documentation when writing code, as well as maintaining clear style guides can make a world of difference in attempting to diagnose problems that are machine-related as opposed to code-related. Keeping a repository of functions and procedures for oneself, even if it is not shared with the community, can also be a boon. You can’t see that a language doesn’t have a word for a concept unless you own a dictionary. Sometimes, pulling apart the ‘black box’-like libraries and packages we acquire from the internet, or our supervisors, or other scientists, is important in verifying that code does what we expect it to.

At the end of the day, you are not expected to be an expert in machine architecture. This is one of the many reasons why it is important to be nice to your academic computing team. If you experience issues with compilers not working on your institution's computers, or executables of libraries not running, it isn't your job to fix them, and you shouldn't feel bad if it holds your project up. Read some papers, concentrate on some other work, work on your lit review if you're committed to continuing to do work. Personally, I took a holiday.

I have struggled with these problems and the solution has been to go to my PhD’s partner institution where we know the code works! Perhaps this is a sign that these problems can be extremely non-trivial, and are not to be underestimated.

Ahh well. It’s better than being a monoglot, at least.

Faster analysis of large datasets in Python

Have you ever run into a memory error or thought your function is taking too long to run? Here are a few tips on how to tackle these issues.

In meteorology we often have to analyse large datasets, which can be time consuming and/or lead to memory errors. While the netCDF4, numpy and pandas packages in Python provide great tools for our data analysis, there are other packages we can use that parallelise our code: joblib, xarray and dask (view the links for documentation and the references below for further reading). This means that the input data are split between the different cores of the computer and the analysis of the different chunks runs in parallel, rather than one after the other, speeding up the process. At the end the results are collected and returned to us in the same form as before, just computed faster. One of the basic ideas behind this parallelisation is the 'divide and conquer' algorithm [Fig. 1] (see, e.g., Cormen et al. 2009, or Wikipedia for a brief introduction), which splits a problem into simpler subproblems, solves them and then combines the results.

Figure 1: A simple example of the 'divide and conquer' algorithm for sorting a list of numbers. First the list is split into simpler subproblems, which are then solved (sorted) and merged into a final sorted array. Source
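As a toy illustration of the pattern in Figure 1 (nothing to do with the parallel libraries themselves, just the split-solve-merge idea), a plain Python merge sort divides the list, sorts the halves and merges the results:

    def merge_sort(values):
        # A list of zero or one elements is already sorted
        if len(values) <= 1:
            return values
        # Divide: sort each half separately
        mid = len(values) // 2
        left = merge_sort(values[:mid])
        right = merge_sort(values[mid:])
        # Combine: merge the two sorted halves
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    print(merge_sort([38, 27, 43, 3, 9, 82, 10]))  # [3, 9, 10, 27, 38, 43, 82]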

The simplest module we can use is joblib. This module can easily be wrapped around a for-loop (see an example here): e.g. an operation that needs to be executed 1000 times can be split between the 40 cores that your computer has, making the calculation that much faster; a short sketch of the pattern is given below. Note that Python modules often include optimised routines that let us avoid for-loops entirely, which is usually a faster option.
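As a rough sketch of that pattern (the per-item function and its workload here are invented purely for illustration):

    from joblib import Parallel, delayed
    import numpy as np

    def analyse_one_year(year):
        # Stand-in for a per-year calculation, e.g. a seasonal-mean diagnostic
        rng = np.random.default_rng(year)
        data = rng.standard_normal((90, 144, 192))  # 90 days on a coarse grid
        return data.mean()

    years = range(1979, 2015)
    # n_jobs sets how many cores to use; n_jobs=-1 would use all available cores
    results = Parallel(n_jobs=8)(delayed(analyse_one_year)(y) for y in years)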

The xarray module provides tools for opening and saving netCDF-type (though not only netCDF) datasets, including many files at once, which can then be analysed either as numpy arrays or as dask arrays. If we choose to use dask arrays (also available via the dask module), any command we apply to the array is evaluated lazily and then calculated in parallel, chunk by chunk, via a type of 'divide and conquer' algorithm. Note that this on its own does not guarantee we avoid a memory error, as the data eventually have to be loaded into memory (and sometimes looping over chunks of these xarray/dask arrays is what speeds up the calculation). There are often also options to run your analysis on high-memory nodes, and the larger the dataset, the more time we save through parallelisation.

In the end it really depends on how much time you are willing to spend on learning about these arrays and whether it is worth the extra effort – I had to use them as they resolved my memory issues and sped up the code. It is certainly worth keeping this option in mind!

Getting started with xarray/dask

In the terminal window:

  • Use a system with conda installed (e.g. anaconda)
  • To start a bash shell type: bash
  • Create a new python environment (e.g. ‘my_env’) locally, so you can install custom packages. Give it a list of packages:
    • conda create -n my_env xarray
  • Then activate the new python environment (Make sure that you are in ‘my_env’ when using xarray):
    • source activate my_env
  • If you need to install any other packages that you need, you can add them later (via conda install), or you could list them with xarray when you create the environment:
    • conda install scipy pandas numpy dask matplotlib joblib #etc.
  • If the following paths are not ‘unset’ then you need to unset them (check this with command: conda info -a):
    • unset PYTHONPATH PYTHONHOME LD_LIBRARY_PATH
  • In python you can then simply import xarray, numpy or dask modules:
    • import xarray as xr; import dask.array as da; import numpy as np; from joblib import Parallel, delayed; # etc.
  • Now you can easily import datasets [e.g.: dataset = xr.open_dataset() from one file or dataset = xr.open_mfdataset() from multiple files; similarly dataset.to_netcdf() to save to one netcdf file or xr.save_mfdataset() to save to multiple netcdf files] and manipulate them using dask and xarray modules – documentation for these can be found in the links above and references below.
  • Once you open a dataset, you can access the data either by loading it into memory (xarray data array: dataset.varname.values) and analysing it as before using the numpy package (which will not run in parallel), or through the dask array (xarray dask array: dataset.varname.data), which will not load the data into memory (it only builds the best possible path to executing the operation) until you wish to save the data to a file, plot it, or explicitly compute it. The latter can be analysed in a similar way to the well-known numpy arrays, but using the dask module instead [e.g. numpy.mean(array, axis=0) becomes dask.array.mean(dask_array, axis=0)]. Many functions exist in the xarray module as well, meaning you can run them on the dataset itself rather than on the array [e.g. dataset.mean(dim='time') is equivalent to the mean in dask or numpy]. A minimal end-to-end sketch is given after this list.
  • Caution: If you chain too many operations on the array, the 'divide and conquer' task graph can become so complex that the program struggles to manage it. Therefore, it is often best to calculate things step by step, using dask_array.compute() or dask_array.persist() at intermediate stages. Another issue I find with these new array modules is that they are slow when it comes to saving the data to disk (i.e. no faster than other modules).
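To put these steps together, here is a minimal end-to-end sketch, assuming a set of netCDF files matching era_data_*.nc that contain a variable called t2m (the file pattern and variable name are placeholders):

    import xarray as xr

    # Open many files as one dataset; chunking makes the variables dask arrays
    dataset = xr.open_mfdataset("era_data_*.nc", chunks={"time": 100})

    # Nothing is computed yet: this only builds the task graph
    monthly_mean = dataset["t2m"].groupby("time.month").mean("time")

    # Trigger the parallel computation and load the result into memory
    result = monthly_mean.compute()

    # Save the result to a new netCDF file
    result.to_netcdf("t2m_monthly_climatology.nc")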

I would like to thank Shannon Mason and Peter Gabrovšek for their helpful advice and suggestions.

References

Cormen, T.H., C.E. Leiserson, R.L. Rivest, C. Stein, 2009: An introduction to algorithms. MIT press, third edition, 1312 pp.

Dask Development Team, 2016: Dask: Library for dynamic task scheduling. URL: http://dask.pydata.org

Hoyer, S. & Hamman, J., 2017. xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software. 5, p.10. DOI: http://doi.org/10.5334/jors.148

Hoyer, S., C. Fitzgerald, J. Hamman and others, 2016: xarray: v0.8.0. DOI: http://dx.doi.org/10.5281/zenodo.59499

Rocklin, M., 2016: Dask: Parallel Computation with Blocked algorithms and Task Scheduling. Proceedings of the 14th Python in Science Conference (K. Huff and J. Bergstra, eds.), 130 – 136.