## CMIP6 Data Hackathon

What is it?

A hackathon, from the words hack (meaning exploratory programming, not the alternate meaning of breaching computer security) and marathon, is usually a sprint-like event in which programmers collaborate intensively with the goal of creating functioning software by the end of the event. From 2 to 4 June 2021, more than a hundred early-career climate scientists and enthusiasts (mostly PhDs and postdocs) from UK universities took part in a climate hackathon organised jointly by the Universities of Bristol, Exeter and Leeds, and the Met Office. The common goal was to quickly analyse certain aspects of Coupled Model Intercomparison Project phase 6 (CMIP6) data to produce cutting-edge research that could be worked into published material and presented at this year’s COP26.

Before the event, attendees signed up to their preferred project from a choice of ten. Topics ranged from how climate change will affect the migration of Arctic terns to the effects of geoengineering by stratospheric sulfate injections, and more… Senior academics from a range of disciplines and institutions led each project.

How is this virtual hackathon different to a usual hackathon?

Like many other events this year, the hackathon took place virtually, using a combination of video conferencing (Zoom) for seminars and teamwork, and discussion forums (Slack).

Brian:

Compared to the two 24-hour non-climate hackathons I previously attended, this one was spread over three days, so I managed not to disrupt my usual sleep schedule! Pair programming with one or two other team members was easy, since I shared one of my screens in Zoom breakout rooms throughout the event. What I really missed were the free meals and the plentiful snacks and drinks usually on offer at in-person hackathons to keep me energised while I programmed.

Chloe:

I’ve been to a climate campaign hackathon before, and I did a hackathon-style event to end a group project during the computer science part of my undergraduate degree; we made the board game Buccaneer in Java. But this was set out completely differently, and it was not as time-intensive as those, which was nice. I missed being in a room with the people on your project, and I’m still missing out on free food – hopefully not for too much longer. But we made good use of Zoom and Slack for communication, and of JASMIN and the version control that git offers, with individuals working on branches that were merged at the end of the hackathon.

What did we do?

Brian:

Project 2: How well do the CMIP6 models represent the tropical rainfall belt over Africa?

Using Gaussian parameters from Nikulin & Hewitson (2019) to describe the intensity, mean meridional position and width of the tropical rainfall belt (TRB), my team investigated three aspects of how CMIP6 models capture the African TRB: the model biases, the projections, and whether there is any useful forecast information in CMIP6 decadal hindcasts. These retrospective forecasts were generated under the Decadal Climate Prediction Project (DCPP), which aims to investigate the skill of CMIP models in predicting climate variations from a year to a decade ahead. Our larger group of around ten split ourselves among these three key aspects. I focused on the decadal hindcasts, comparing different decadal models at different lead times against three observational sources.

Chloe:

Project 10: Human heat stress in a warming world

Our team leader Chris had calculated the universal thermal climate index (UTCI) – a heat stress index – for a number of the CMIP6 climate models. He was looking into bias correction against the ERA5-HEAT reanalysis dataset whilst we split into smaller groups. We looked at a range of different things, from how the intensity of heat stress changed to how the UTCI compared to mortality. I ended up coding with one of my (5) PhD supervisors, Claudia Di Napoli, and we made, amongst other things, the gif below.

Would we recommend a meteorology/climate-related hackathon?

Brian:

Yes! The three days were a nice break from my own radar research work, and good training for thinking quickly and creatively about research questions other than those in my own PhD project. The experience also sharpened my coding and data exploration skills, and gave me the chance to quickly learn advanced methods in certain software packages (such as xarray and iris). I was amazed at the amount of scientific output achieved in only three short days!

Chloe:

Yes, but if it’s online, make sure you block out the time and dedicate all your focus to the hackathon. Don’t be like me. The hackathon taught me more about Python’s handling of netCDFs, but I am not yet a Python plotting convert; there are some things R is just nicer for. And I still love researching heat stress and heatwaves, so that’s good!

We hope that the CMIP hackathon runs again next year to give more people the opportunity to get involved.

## How to write a PhD thesis during a global pandemic

Completing a PhD is a momentous task at the best of times, let alone in combination with a year-long global pandemic. Every PhD researcher is different, and as such, everyone has had different circumstantial struggles throughout Covid-19. The lack of human interaction that comes with working in a vibrant academic environment such as the Meteorology Department can make working from home a real struggle. Sometimes it is difficult to find the motivation to get anything useful done; at other times you could squeeze five hours’ worth of work into one. Staying organised is key to getting it done, so the following are some of the things that helped me get to the end of my PhD thesis – and it has not been easy. If you are still out there writing and finishing up experiments: read on! Maybe you will feel a little less alone. The PhD experience can be truly isolating at the best of times, so literally being instructed to isolate from the world is not ideal. The points are numbered for convenience of structuring this post, rather than in any order of importance.

It is tempting to “disappear off the radar” when things are not going well. You could wake up on the morning of your regular weekly meeting filled with dread that you have not prepared anything for it. Your brain recoils into the depths of your skull as your body recoils back under the safety of the duvet. What are your options? Some of them might be: take a deep gulp and force yourself out of bed with the prospect of coffee before the meeting (where you steer the conversation onto the things you did manage to do); postpone the meeting because you need to finish XYZ, so a later meeting may be more productive; or ignore the meeting altogether. The first is probably the best option, but it requires mental strength where there might be none. The second is OK, but you still need to do the work. The last one is a big no. Don’t do it.

Anxiety will make you believe that ignoring the world and all responsibilities is the most comfortable option in the moment, but the consequences of acting on it could be worse. Supervisors value honesty, and they know well that it is not always possible to complete all the scheduled tasks. Of course, if this happens every week then you might need to introspectively address the reasons for this, and – again, talking with your supervisor is usually a useful thing to do. You might not want them to know your entire life story, but it is helpful for everybody involved if they are aware that you struggle with anxiety / depression / ADHD / *insert any condition here*, which could affect your capacity to complete even the simplest, daily tasks. Being on the same page and having matching expectations is key to any student – supervisor partnership.

1.  Reward yourself for the things you have already accomplished.

Whether that’s mid-week, mid-to-do-list, weekend — whenever. List all the things you have done regularly (either work- or life-related) and recognise that you are trying to survive a pandemic. And trying to complete the monstrous task of writing a PhD thesis. Those are big asks, and the only way to get through them is to break them down into smaller chunks. Putting down “Write thesis” on your to-do list is more likely to intimidate than motivate you. How about breaking it down further: “Re-create plot 4.21”, or “Consolidate supervisor comments on pages 21 – 25” — these are achievable things in a specified length of time. It also means you could tick them off more easily, hopefully resulting in feeling accomplished. Each time this happens, reward yourself in whatever way makes you feel nice. Even just giving yourself a literal pat on the shoulder could feel great – try it!

2. Break down how long specific things will take

This is most useful when you are a few weeks away from submission date. With only 5 weeks left, my colour-coded charts were full of outstanding comments; neither my ‘Conclusions’ chapter nor my Abstract had been written; plots needed re-plotting and I still did not know the title of my thesis. Naturally, I was panicking. I knew that the only way I could get through this was to set a schedule — and stick to it. At the time, there were 5 major things to do: complete a final version of each of my 5 thesis chapters. A natural split was to allow each chapter only one week for completion. If I was near to running over my self-prescribed deadline, I would prioritise only the major corrections. If still not done by the end of the allowed week: that’s it! Move on. This can be difficult for any perfectionists out there, but by this point the PhD has definitely taught me that “done” is better than perfect. I also found that some chapters took less time to finish than others, so I had time to return to the things I left not quite finished. Trust yourself, and give it your best. By all means, push through the hardest bit to the end, but remember that there (probably) does not exist a single PhD thesis without any mistakes.

There exist two groups of people: those who turn off or deactivate all social media when they need to focus on a deadline, and those who get even more absorbed by its ability to divert your attention away from the discomfort of the dreaded task at hand. Some might call it “productive procrastination”. I actually found that social media helped me a little – but only when my state of mind was such that I could resist the urge to fall down a scrolling rabbit hole. If you are on Twitter, you might find hashtags like #phdchat and accounts such as @AcademicChatter , @phdforum @phdvoice useful.

6. Join a virtual “writing room”

On the back of the last tip, I have found a virtual writing room helpful for focus. The idea is that you join an organised Zoom meeting full of other PhDs, all of whom are writing at the same time. All microphones are muted, but the chat is active so it is nice to say ‘hello!’ to someone else writing at the same time, anywhere else in the world. The meetings have scheduled breaks, with the organiser announcing when they occur. I found that because I actively chose to be up and start writing at the very early hour of 6am by attending the virtual writing room, I was not going to allow myself to procrastinate. The commitment to being up so early and being in a room full of people also doing the same thing (but virtually, obviously) meant that those were the times that I was probably the most focused. These kinds of rooms are often hosted by @PhDForum on Twitter; there could also be others. An alternative idea could be to set up a “writing meeting” with your group of peers and agree to keep chatter to a minimum (although this is not something I tried myself).

7. Don’t look at the news

Or at least, minimise your exposure to it. It is generally good to stay on top of current events, but the final stages of writing a PhD thesis are probably unlike any other time in your life: you need the space and energy to think deeply about your own work. Unfortunately, I learnt this the hard way and found there were days when I could do very little work because my brain was preoccupied with awful events happening around the world. It made me feel pathetic, routinely resulting in staying up late to try and finish whatever I had failed to finish during the day. This only eroded my wellbeing further, with shortened sleep and a constant sense of playing “catch-up”. If this sounds like you, try switching off news notifications on your phone or computer, or limit yourself to checking the news homepage once a day at a designated time.

Hopefully it goes without saying that the above are simply some things that helped me through to the end of the thesis, but everybody is different. I am no counsellor or wellbeing guru; just a recently-finished PhD! Hopefully the above points might offer a little bit of light for anyone else struggling through the storm of that final write-up. Keep your chin up and, as Dory says: just keep swimming. Good luck!

As researchers, we familiarise ourselves with many different datasets. Depending on who put the dataset together, the variable names and definitions we are familiar with from one dataset may be different in another. These differences range from subtle annoyances to large structural differences, and it’s not always immediately obvious how best to handle them.

One dataset might be on an hourly time-index, and the other daily. The grid points which tell us the geographic location of data points may be spaced at different intervals, or use entirely different co-ordinate systems!

However, most modern datasets come with hidden help in the form of metadata. This information should tell us how the data is to be used, and with the right choice of Python modules we can use the metadata to work with different datasets automatically whilst avoiding conversion headaches.

## First attempt…

Starting my PhD, my favourite (naïve, inefficient, bug-prone…) method of reading data with Python was the built-in function open() or numpy functions like genfromtxt(). These are quick to set up, and can be good enough. But as soon as we are using data with more than one field, complex coordinates and calendar indexes, or more than one dataset, this style of programming becomes unwieldy and disorderly!

>>> header = np.genfromtxt(fname, delimiter=',', dtype='str', max_rows=1)
['Year' 'Month' 'Day' 'Electricity_Demand']
>>> data = np.genfromtxt(fname, delimiter=',', skip_header=1)
>>> data
array([[2.010e+03, 1.000e+00, 1.000e+00, 0.000e+00],
[2.010e+03, 1.000e+00, 2.000e+00, 0.000e+00],
[2.010e+03, 1.000e+00, 3.000e+00, 0.000e+00],
...,
[2.015e+03, 1.200e+01, 2.900e+01, 5.850e+00],
[2.015e+03, 1.200e+01, 3.000e+01, 6.090e+00],
[2.015e+03, 1.200e+01, 3.100e+01, 6.040e+00]])


The above code reads year, month and day data from the first three columns, and Electricity_Demand from the last column.

You might be familiar with such a workflow – perhaps you have refined it down to a fine art!

In many cases this is sufficient for what we need, but making use of already available metadata can make the data more readable, and easier to operate on when it comes to complicated collocation and statistics.

Enter pandas!

## Pandas

In the previous example, we read our data into numpy arrays. Numpy arrays are very useful: they store data more efficiently than a regular Python list, they are easier to index, and they have many built-in operations, from simple addition to niche linear algebra routines.

We stored the column labels in an array called header, but this means our metadata has to be handled separately from our data. The dates are stored in three different columns alongside the data – but what if we want to perform an operation on just the data (for example, add 5 to every value)? It is technically possible, but awkward and dangerous: if the column order changes in future, our code might break! We are probably better off splitting the dates into another separate array, but that means more work to record the column headers, and an increasing number of Python variables to keep track of.
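To illustrate the fragility, here is a minimal sketch (with made-up values, not the real demand data):

```python
import numpy as np

# Columns: Year, Month, Day, Electricity_Demand
data = np.array([[2010, 1, 1, 0.0],
                 [2010, 1, 2, 0.0],
                 [2015, 12, 31, 6.04]])

# Adding 5 to "the data" means hardcoding the column index...
data[:, 3] += 5  # breaks silently if a column is ever added or reordered
```

Nothing here records that column 3 is electricity demand, or that columns 0-2 are a date; that knowledge lives only in our heads (or a separate header array).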

Using pandas, we can store all of this information in a single object, and using relevant datatypes:

>>> data = pd.read_csv(fname, parse_dates=[['Year', 'Month', 'Day']], index_col=0)
>>> data
Electricity_Demand
Year_Month_Day
2010-01-01      0.00
2010-01-02      0.00
2010-01-03      0.00
2010-01-04      0.00
2010-01-05      0.00
...              ...
2015-12-27      5.70
2015-12-28      5.65
2015-12-29      5.85
2015-12-30      6.09
2015-12-31      6.04

[2191 rows x 1 columns]

This may not immediately appear a whole lot different to what we had earlier, but notice the dates are now saved in datetime format and tied to the Electricity_Demand data. If we want to index the data, the time-index is indexed simultaneously without any further code (and without the mistakes that separate arrays invite).

Pandas also makes it really simple to perform some complicated operations. In this example, I am only dealing with one field (Electricity_Demand), but this works with 10, 100, 1000 or more columns!

• Transpose rows and columns with data.T
• Calculate quantiles with data.quantile
• Cut to between dates, eg. data.loc['2010-02-03':'2011-01-05']
• Calculate 7-day rolling mean: data.rolling(7).mean()
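The operations above can be sketched on a synthetic series (the dates and values here are invented for illustration, not the electricity data from earlier):

```python
import numpy as np
import pandas as pd

# A synthetic daily "demand" series with a datetime index
idx = pd.date_range('2010-01-01', '2011-12-31', freq='D')
data = pd.DataFrame({'Electricity_Demand': np.linspace(0, 10, len(idx))},
                    index=idx)

transposed = data.T                            # swap rows and columns
median = data.quantile(0.5)                    # 50th percentile per column
window = data.loc['2010-02-03':'2011-01-05']   # slice between two dates
smooth = data.rolling(7).mean()                # 7-day rolling mean
```

Note how the date slice uses plain date strings: pandas interprets them against the datetime index, so there is no manual conversion to row numbers.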

We can insert new columns, remove old ones, change the index, perform complex slices, and all the metadata stays stuck to our data!

Whilst pandas has many maths functions built in, if need be we can also export directly to numpy using numpy.array(data['Electricity_Demand']) or data.to_numpy().

Pandas can also simplify plotting – particularly convenient when you just want to quickly visualise data without writing import matplotlib.pyplot as plt and other boilerplate code. In this example, I plot my data alongside its 7-day rolling mean:

ax = data.loc['2010'].plot(label='Demand', ylabel='Demand (GW)')
data.loc['2010'].rolling(7).mean().plot(ax=ax, label='Demand rolling mean')
ax.legend()

Now I can visualise the anomalous values at the start of the dataset, a consistent annual trend, a diurnal cycle, and fairly consistent behaviour week to week.

## Big datasets

Pandas can read from and write to many different data formats – CSV, HTML, Excel, … but some filetypes that meteorologists like working with, such as netCDF4, aren’t built in.

xarray is an extremely versatile tool that can read many formats, including netCDF and GRIB. As well as having built-in functions to export to pandas, xarray is completely capable of handling metadata on its own, and many researchers work directly with objects such as xarray’s DataArray.

There are more xarray features than stars in the universe[citation needed], but some that I find invaluable include:

• open_mfdataset – automatically merge multiple files (e.g. for different dates or locations)
• assign_coords – replace one co-ordinate system with another
• where – replace values depending on a condition

Yes, you can do all of this with pandas or numpy. But with xarray you can pass metadata attributes as arguments: for example, we can take the latitude average with my_data.mean('latitude'). No need to work with indices and hardcoded values – xarray can do all the heavy lifting for you!
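A minimal sketch of averaging and masking by coordinate name (a made-up temperature field, not a real dataset):

```python
import numpy as np
import xarray as xr

# A made-up 2x2 temperature field with labelled coordinates
temperature = xr.DataArray(
    np.array([[10.0, 12.0],
              [14.0, 16.0]]),
    dims=('latitude', 'longitude'),
    coords={'latitude': [0.0, 45.0], 'longitude': [0.0, 90.0]},
)

zonal_mean = temperature.mean('latitude')        # average over latitude by name
masked = temperature.where(temperature > 11.0)   # values <= 11 become NaN
```

Because the reduction is named rather than positional, the same line works whether latitude is the first axis, the third, or chunked across files.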

## The EGU Experience 2021: a PhD student perspective

The European Geosciences Union (EGU) General Assembly is one of the big annual conferences for atmospheric science (and Earth sciences more generally). The two of us were fortunate to have the opportunity to attend and present our research at this year’s vEGU21 conference. As in previous years, like 2019, we’re here to give you an account of our EGU experience 😀 (so you can compare our virtual experience with the previous posts if you like 😉)

Entrance hall to virtual EGU (Source: Linda Speight)

What was vEGU21?

vEGU21 was the EGU General Assembly for 2021, held online. It took place from 19 to 30 April, through an impressive virtual conference centre and, mostly, Zoom.

Chloe – I presented on borderless heat stress in the extreme heat events session, based on a paper currently under review at Earth’s Future, where we show that heat stress in the region is growing during the month of August. The session’s invited speaker was Laura Suarez-Gutierrez, who gave a great presentation on the dynamics of increasing heat extremes with climate change across Europe. I really enjoyed learning about the latest research on extreme heat.

Max – I presented on my work using model nudging to study aerosol radiative adjustments. I presented in the session ‘Chemistry, Aerosols and Radiative Forcing in CMIP6-era models’, which was convened and hosted by Reading’s very own Bill Collins. There were many interesting presentations in this session, including presentations on the balance between climate and air quality benefits by Robert Allen and Steve Turnock; a summary of the Aerosol Chemistry Model Intercomparison Project (AerChemMIP) findings by UoR’s Gill Thornhill; and a personal favourite concerned the impacts of different emissions pathways in Africa on local and global climate, and local air pollution effects on mortality, presented by Chris Wells.

Chloe presenting: would win an award for most interesting screenshot. (Source: Maureen Wanzala)

What were your favourite aspects of the conference?

Chloe – Apart from my own session, one of my favourites was on climate services. This focused on applying meteorological and hydrological data to services, for example heat impacts on health and growing grapes and olives. I also enjoyed the panel on the climate and ecological emergency in light of COVID-19, which included Katharine Hayhoe, and the session on equality, diversity and inclusion; it was interesting how ‘listening’ to those impacted was a theme overlapping both. The weirdest, loveliest experience was my main supervisor sending me a colouring page of her face.

Max – As with any conference, it was a great opportunity to learn about the latest research in my specific field, as well as exciting developments in other fields, from machine learning applications in Earth science to observational studies of methane emissions. In particular, it’s a nice change from just reading about them in papers. Having conversations with presenters gives you the opportunity to really dive in, find out what motivated their research initially, and discuss future applications. For example, one conversation I had went from discussing an application of unsupervised machine learning to classifying profiles of Earth system model output, to learning about its potential for use in model intercomparisons.

Katharine Hayhoe in the session Climate and Ecological Emergency: can a pandemic help save us? (Source: Chloe Brimicombe)

What was your least favourite aspect?

Chloe – I did manage to do a little networking, but I’d love to experience an in-person conference where I present. I have never presented my research in real life at a conference or research group/department seminar 😱. We also miss out on a lot of free food and pens by not going to any in-person conferences, which is what research is really about 😉. Also, I find it difficult to stay focused on a conference when it’s online.

Max – For me, the structure of two-minute summaries followed by breakout Zoom rooms for each speaker had some definite drawbacks. For topics outside one’s own field, I found it difficult to learn much from many of the summaries – it’s not easy to fit something interesting for experts and non-experts into two minutes! In theory you can go and speak to presenters in their breakout rooms, but there’s something awkward about entering a Zoom breakout room with just you and the presenter, particularly when you aren’t sure exactly how well you understood their two-minute summary.

Max – Overall, I think virtual conferencing has a way to go before it can match the in-person experience. There were the classic technical issues of anything hosted remotely: the ‘I think you’re on mute’ experience, other microphone issues, and even the conference website crashing on the first day of scientific sessions (though the organisers did a swift job getting the conference back up and running). But there’s also the less obvious: it can feel quite lonely. I’ve only been to a couple of in-person conferences, but there were always some people I knew and could meet up with. It’s challenging to recreate this online, especially for early-career researchers who don’t have as many established connections, and particularly at a big conference like the EGU General Assembly. Perhaps a big social media presence can somewhat replace this, but not everyone (including myself!) is a big social media user.

On the other hand, it’s great that we can still have conferences during a global pandemic, and it is no doubt better than having none at all. Above all else, it’s also much greener and more accessible to those with less funding available for conference travel (though new accessibility challenges, such as internet quality and access, undoubtedly arise). Plus, the facility to upload various display materials for people to look back at whenever they like, regardless of time zone, is handy.

Chloe – I’d just add, as great as Twitter is and can be for promoting your research, it’s not the same as going for a good old cup of tea (or cocktail) with someone. Also, you can have the biggest brightest social media, but actually be terrible at conveying your research in person.

Summary

Overall it was interesting to take part in vEGU21, and we were both glad we went. It didn’t quite live up to the in-person experience – and there is definitely room for improvement in virtual conferencing – but it’s great we can still have these experiences, albeit online.

## Coding lessons for the newly initiated

Better coding skills and tooling enable faster, more useful results.

This post presents a collection of resources and tips that have been most useful to me in my first 18 months of coding – when I arrived at Reading, my coding ability amounted to using Excel formulas. These days, I spend a lot of time coding experiments that test how well machine learning algorithms can provide information on error growth in low-dimensional dynamical systems. This requires fairly heavy use of scikit-learn, TensorFlow and pandas. This post would have been optimally useful at the start of the year, but perhaps even the coding veterans will find something of use – or better, they can tell me about something I am yet to discover!

First Steps: a few useful references

• A byte of python. A useful and concise reference for the fundamentals.
• Python Crash Course, Eric Matthes (2019). Detailed, lots of examples, and covers a wider range of topics (including, for example, using git). There are many intro to Python books around; this one has certainly been useful to me.1 There are many good online resources for python, but it can be helpful initially to have a coherent guide in one place.

How did I do that last time?

Tip: save snippets.

There are often small bits of code that contain key tricks we use only occasionally. Sometimes it takes a bit of time reading forums or documentation to figure out these tricks, and it’s a pain to have to do the legwork again to find the trick a second or third time. There were numerous occasions when I knew I’d worked out how to do something previously, and then spent precious minutes trawling through various bits of code and coursework to find the line where I’d done it. Then I found a better solution: I started saving snippets with an online note-taking tool called Supernotes.
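For instance, one kind of snippet worth saving (this particular example is my own illustration, not from my actual collection) is a natural sort of filenames, so that 'run10' sorts after 'run2' rather than before it:

```python
import re

def natural_sort(strings):
    """Sort strings so that embedded numbers compare numerically, not lexically."""
    def key(s):
        # Split 'run10.csv' into ['run', '10', '.csv'] and compare numbers as ints
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r'(\d+)', s)]
    return sorted(strings, key=key)

natural_sort(['run10.csv', 'run2.csv', 'run1.csv'])
# → ['run1.csv', 'run2.csv', 'run10.csv']
```

It is exactly this kind of two-minute trick that is easy to forget and tedious to rediscover.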

I often find myself searching through my code snippets to remind myself of things.

Text editors, IDEs and plugins.

If you haven’t already, it might be worth trying some different options when it comes to your text editor or IDE. I’ve met many people who swear by PyCharm. Personally, I’ve been getting on well with Visual Studio Code (VS Code) for a year now.

Either way, I also recommend spending some time installing useful plugins as these can make your life easier. My recommendations for VS Code plugins are: Hungry Delete, Rainbow CSV, LaTeX Workshop, Bracket Pair Colorizer 2, Rewrap and Todo Tree

Linters & formatters

Linters and formatters check your code for syntax or style errors. I use the Black formatter, and have it set to run every time I save my file. This seems to save a lot of time, and not only with formatting: it becomes more obvious when I have used incorrect syntax or made a typo. It also makes my code easier to read and nicer to look at.
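As a hypothetical before/after sketch of the kind of change Black makes (written by hand for illustration, not actual tool output):

```python
# Before: inconsistent spacing and quoting
def scale(x,factor = 2.0):
    return {'value':x*factor}

# After Black: consistent spacing, one style everywhere
def scale_formatted(x, factor=2.0):
    return {"value": x * factor}

# Formatting never changes behaviour:
scale(3) == scale_formatted(3)  # → True
```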

Some other options for linters and formatters include autopep8, yapf and pylint.

Match code versions to results

Data needs metadata in order to be understood. Does your workflow let you understand your data? I tend to work with toy models, so my current approach is to make a new directory for each version of my experiment code, with notes on each version (usually in a markdown file). What not to do is run the code to generate results and then edit the code (except, of course, to fix a bug). At a later stage you may want to understand how your results were calculated, and you can’t if you’ve changed the code file since the data was generated (unless you are a git wizard).
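One lightweight habit in this direction (a sketch of my own; the function and field names are invented for illustration) is to write a small metadata file next to each set of results:

```python
import json
import subprocess
from pathlib import Path

def save_run_metadata(outdir, params):
    """Record the parameters (and, if available, the git commit) next to the results."""
    meta = {'params': params}
    try:
        # If we're inside a git repository, remember exactly which code produced this
        meta['commit'] = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD'], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        meta['commit'] = 'unknown'
    Path(outdir, 'metadata.json').write_text(json.dumps(meta, indent=2))
```

Months later, the JSON file answers "how were these numbers made?" even if memory fails.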

A bigger toolbox makes you a more powerful coder

Knowing about the right tool for the job can make life much easier.2 There are many excellent Python packages, and the more you explore, the more likely you’ll know of something that can help you. A good resource for the modules of the Python 3 standard library is Python Module of The Week. Some favourite packages of mine are Pandas (for processing data) and Seaborn (a wrapper on Matplotlib that enables quick and fancy plotting of data). Both are well worth the time spent learning to use them.

Some thoughts on Matplotlib

Frankly, some of the most frustrating experiences in my early days with Python were trying to plot things with Matplotlib. At times it seemed insanely tedious, and bizarrely difficult to achieve what I wanted given how capable a tool others made it seem. My tips for the uninitiated would be:

• Be a minimalist, never a perfectionist. I often managed to spend 80% of my time plotting trying to achieve one obscure change. Ask: Do I really need this bit of the plot to get my point across?
• Can you hack it, i.e. can you fix up the plot using something other than Matplotlib? For example, you might spend ages trying to tell Matplotlib to get some spacing right, when for your current purpose you could get the same result by editing the plot in word/pages in a few clicks.
• Be patient. I promise, it gets easier with time.

Object oriented programming

I’m curious to know how many of us in the meteorology department code with classes. In simple projects it is possible to do without them. That said, there’s a reason classes are fundamental to modern programming: they enable more elegant and effective problem solving, code structure and testing. As Hans Petter Langtangen states in A Primer on Scientific Programming with Python, “classes often provide better solutions to programming problems.”

What’s more, if you understand classes and object-oriented programming concepts, then understanding others’ code is much easier. For example, it can make Matplotlib’s documentation easier to follow and, in the worst case, if you have to read the Matplotlib source code to understand what is going on under the hood, it will make much more sense if you know how classes work. As with pandas, classes are worth the time buy-in!
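As a tiny sketch of why classes help (a hypothetical experiment wrapper, not code from my actual project), bundling configuration and results into one object keeps them from drifting apart:

```python
class Experiment:
    """Bundle configuration and results so they travel together."""

    def __init__(self, name, window):
        self.name = name
        self.window = window
        self.results = []

    def run(self, values):
        # Toy 'analysis': a trailing mean over the configured window
        for i in range(len(values) - self.window + 1):
            chunk = values[i:i + self.window]
            self.results.append(sum(chunk) / self.window)
        return self.results

exp = Experiment('smoke-test', window=2)
exp.run([1.0, 2.0, 3.0])  # → [1.5, 2.5]
```

Each instance carries its own name, settings and output, so running several configurations side by side needs no bookkeeping of loose variables.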

Have any suggestions or other useful resources for wannabe pythonistas? Please comment below or email me at d.ayers@pgr.reading.ac.uk.

## Support vector machine for classification of space weather

In a recent blog post, I discussed the use of the analogue ensemble (AnEn), or “similar-day” approach, to forecasting geomagnetic activity. In this post I will look at the use of support vector machines (SVMs), a machine learning approach, on the same problem and compare the performance of the SVM to the AnEn. An implementation of the SVM has been developed for this project in Python and is available at https://doi.org/10.5281/zenodo.4604485

Space weather encompasses a range of impacts on Earth caused by changes in the near-earth plasma due to variable activity at the Sun. These impacts include damage to satellites, interruptions to radio communications and damage to power grids. For this reason, it is useful to forecast the occurrence, intensity and duration of heightened geomagnetic activity, which we call geomagnetic storms.

As in the previous post, the measure of geomagnetic activity used is the aaH index, which has been developed by Mike Lockwood in Reading. The aaH index gives a global measure of geomagnetic activity at a 3-hour resolution. In this index, the minimum value is zero and larger values represent more intense geomagnetic activity.

The SVM is a commonly used classification algorithm which we implement here to classify whether a storm will occur. Given a sample of the input and the associated classification labels, the SVM will find a function that separates these input features by their class label. This is simple if the classes are linearly separable, as the function is a hyperplane. The samples lying closest to the hyperplane are called support vectors and the distance between these samples and the hyperplane is maximised.

Typically, the samples are not linearly separable, so we employ Cover’s theorem, which states that linearly inseparable classification problems are more likely to be linearly separable when cast non-linearly into a higher dimensional space. Therefore, we use a kernel trick to throw the inputs into a higher dimensional feature space, as depicted in Figure 1, to make them more separable. Further explanation is available in Towards Data Science (n.d.).
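Cover’s theorem can be illustrated with a toy example: 1-D points that no single threshold can separate become separable once lifted into 2-D with an explicit feature map. A kernel SVM performs this lifting implicitly rather than building the features by hand; the data below are made up:

```python
import numpy as np

# Toy 1-D data: class 1 for points far from zero, class 0 otherwise.
# No single threshold on x separates the two classes...
x = np.array([-2.0, -1.5, -0.2, 0.1, 0.3, 1.6, 2.1])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# ...but after lifting into 2-D with the map x -> (x, x**2),
# a horizontal line at x**2 = 1 separates them perfectly.
features = np.column_stack([x, x**2])
predicted = (features[:, 1] > 1.0).astype(int)
print(predicted)  # matches y
```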

Based on aaH values in the 24-hour training window, the SVM predicts whether the next 3 hours will be either a storm or not. By comparing this dichotomous hindcast with the observed aaH, the outcome will be one of True Positive (TP, where a storm is correctly predicted), True Negative (TN, where no storm is correctly predicted), False Positive (FP, where a storm is predicted but not observed), or False Negative (FN, where a storm is not predicted but is observed). This is shown in the form of a contingency table in Figure 2.
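The contingency-table outcomes translate directly into simple verification statistics; a minimal sketch with invented counts:

```python
# Illustrative (made-up) counts for a dichotomous storm hindcast
TP, FN = 40, 10   # observed storms: correctly predicted / missed
FP, TN = 25, 925  # observed non-storms: false alarms / correct rejections

hit_rate = TP / (TP + FN)          # fraction of storms that were predicted
false_alarm_rate = FP / (FP + TN)  # fraction of quiet intervals flagged

print(hit_rate, false_alarm_rate)
```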

For development of the SVM, the aaH data have been separated into independent training and test intervals. These intervals are chosen to be alternate years. This is longer than the auto-correlation timescale of the data (choosing, e.g., alternate 3-hourly data points would not generate independent training and test data sets) but short enough that we assume there will not be significant aliasing with solar cycle variations.

Training is an iterative process whereby a cost function is minimised. The cost function is a combination of the relative proportions of TP, TN, FP and FN. During training, the SVM attempts to classify labelled data, i.e. data belonging to a known category, in this case “storm” or “no storm”, on the basis of the previous 24 hours of aaH; each incorrect prediction is penalised through the cost function. The cost parameter determines the degree to which the SVM is penalised for a misclassification in training, which allows for noise in the data.

It is common that data with a class imbalance, that is, containing many more samples from one class than the other, cause the classifier to be biased towards the majority class. In this case, there are far more non-storm intervals than storm intervals. Following McGranaghan et al. (2018), we define the cost of mis-classifying each class separately. This is done through the weight ratio Wstorm : Wno storm. Increasing Wstorm increases the frequency at which the SVM predicts a storm, and it follows that it predicts no storm at a reduced frequency. In this work we have varied Wstorm and kept Wno storm constant at 1. A user of the SVM method for forecasting may wish to tune the class weight ratio to give an appropriate balance of false alarms and hit rate, dependent on their needs.
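As a sketch of how such class weighting looks in practice, here is a toy example using scikit-learn’s SVC (synthetic data and my own variable names, not the project’s actual code or features):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy imbalanced data standing in for 24-hour feature windows:
# class 1 ("storm") is rare and has larger feature values on average.
X0 = rng.normal(0.0, 1.0, size=(200, 8))  # "no storm" samples
X1 = rng.normal(2.0, 1.0, size=(20, 8))   # "storm" samples
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 20)

# Penalise missed storms more heavily: Wstorm : Wno storm = 5 : 1
svm = SVC(kernel="rbf", class_weight={1: 5.0, 0: 1.0})
svm.fit(X, y)
print((svm.predict(X) == y).mean())  # training accuracy, illustration only
```

Raising the weight on the storm class makes the classifier more eager to call a storm, trading false alarms for hit rate exactly as described above.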

Different forecast applications will have different tolerances for false alarms and missed events. To accommodate this, and as a further comparison of the hindcasts, we use a Cost/Loss analysis. In short, C is the economic cost associated with taking mitigating action when an event is predicted (whether or not it actually occurs) and L is the economic loss suffered due to damage if no mitigating action is taken when needed. For a deterministic method, such as the SVM, each time a storm is predicted a cost C is incurred. Each time a storm is not predicted but a storm occurs, a loss L is incurred. If no storm is predicted and no storm occurs, then no expense is incurred. By considering some time interval, the total expense can be computed by summing C and L. Further information, including the formula for computing the potential economic value (PEV), can be found in Owens and Riley (2017).
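The accounting in this paragraph can be written out directly. The counts and the C and L values below are invented, and this is only the raw expense comparison, not the PEV formula itself:

```python
# Translate the Cost/Loss accounting into code (illustrative numbers).
TP, FP, FN, TN = 40, 25, 10, 925  # made-up contingency counts
C, L = 1.0, 10.0                  # cost of mitigation, loss per missed storm

expense_forecast = (TP + FP) * C + FN * L  # act on every predicted storm
expense_always = (TP + FP + FN + TN) * C   # take mitigating action constantly
expense_never = (TP + FN) * L              # never act, absorb every storm

print(expense_forecast, expense_always, expense_never)
```

With these particular numbers, following the forecast is cheaper than either always or never acting; changing the C/L ratio changes which strategy wins, which is the point of the analysis.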

A particular forecast application will have a C/L ratio in the domain (0,1). This is because a C/L of 0 would mean it is most cost effective to take constant mitigating action and a C/L of 1 or more means that mitigating action is never cost effective. In either case, no forecast would be helpful. The power of a Cost/Loss analysis is that it allows us to evaluate our methods for the entire range of potential forecast end users without specific knowledge of the forecast application requirements. End users can then easily interpret whether our methods fit their situation.

Figure 3 shows the potential economic value (PEV) of the SVM with a range of class weights (CW), the probabilistic AnEn and 27-day recurrence. The shaded regions indicate which hindcast has the highest PEV for that Cost/Loss ratio. The probabilistic AnEn has the highest PEV for the majority of the Cost/Loss domain, although the SVM has a higher PEV for lower Cost/Loss ratios. This highlights that the ‘best’ hindcast depends on the context in which it is to be employed.

In summary, we have implemented an SVM for the classification of geomagnetic storms and compared the performance to that of the AnEn which was discussed in a previous blog post. The SVM and AnEn generally perform similarly in a Cost/Loss analysis and the best method will depend on the requirements of the end user. The code for the SVM is available at  https://doi.org/10.5281/zenodo.4604485 and the AnEn at https://doi.org/10.5281/zenodo.4604487

References

Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.

McGranaghan, R., Mannucci, A., Wilson, B., Mattman, C., & Chadwick, R. (2018). New Capabilities for Prediction of High-Latitude Ionospheric Scintillation: A Novel Approach With Machine Learning. Space Weather.

Owens, M., & Riley, P. (2017). Probabilistic Solar Wind Forecasting Using Large Ensembles of Near-Sun Conditions With a Simple One-Dimensional “Upwind” Scheme. Space Weather.

## Quantifying Arctic Storm Risk in a Changing Climate

Alec Vessey (Final Year PhD Student) – alexandervessey@pgr.reading.ac.uk
Supervisors: Kevin Hodges (UoR), Len Shaffrey (UoR), Jonny Day (ECMWF), John Wardman (AXA XL)

Arctic sea ice extent has reduced dramatically since it was first monitored by satellites in 1979 – at a rate of 60,000 km² per year (see Figure 1a). This is equivalent to losing an area of ice the size of London every 10 days. This dramatic reduction in sea ice extent has been caused by global temperatures increasing, which is a result of anthropogenic climate change. The Arctic is the region of Earth that has undergone the greatest warming in recent decades, due to the positive feedback mechanism of Arctic Amplification. Global temperatures are expected to continue to increase into the 21st century, further reducing Arctic sea ice extent.

Consequently, the Arctic Ocean has become increasingly open and navigable for ships (see Figure 1b and 1c). The Arctic Ocean provides shorter distances between ports in Europe and North America to ports in Asia than more traditional routes in the mid-latitudes that include the Suez Canal Route and the routes through the Panama Canal. There are two main shipping routes in the Arctic, the Northern Sea Route (along the coastline of Eurasia) and the Northwest Passage (through the Canadian Archipelago) (see Figure 2). For example, the distance between the Ports of Rotterdam and Tokyo can be reduced by 4,300 nautical-miles if ships travel through the Arctic (total distance: 7,000 nautical-miles) rather than using the mid-latitude route through the Suez Canal (total distance: 11,300 nautical-miles). Travelling through the Arctic could increase profits for shipping companies. Shorter journeys will require less fuel to be used between destinations and allow more time for additional shipping contracts to be pursued. It is expected that the number of ships in the Arctic will increase exponentially in the near future, when infrastructure is developed and sea ice extent reduces further.

However, as human activity in the Arctic increases, the vulnerability of valuable assets and the risk to life due to exposure to hazardous weather conditions also increases.  Hazardous weather conditions often occur during the passage of storms.  Storms cause high surface wind speeds and high ocean waves. Arctic storms have also been shown to lead to enhanced break up of sea ice, resulting in additional hazards when ice drifts towards shipping lanes. Furthermore, the Arctic environment is extremely cold, with search and rescue and other support infrastructure poorly established. Thus, the Arctic is a very challenging environment for human activity.

Over the last century, the risks of mid-latitude storms and hurricanes have been a focal-point of research in the scientific community, due to their damaging impact in densely populated areas. Population in the Arctic has only just started to increase. It was only in 2008 that sea ice had retreated far enough for both of the Arctic shipping lanes to be open simultaneously (European Space Agency, 2008). Arctic storms are less well understood than these hazards, mainly because they have not been a primary focus of research. Reductions in sea ice extent and increasing human activity mean that it is imperative to further the understanding of Arctic storms.

This is what my PhD project is all about – quantifying the risk of Arctic storms in a changing climate. My project has four main questions, which try to fill the research gaps surrounding Arctic storm risk. These questions include:

1. What are the present characteristics (frequency, spatial distribution, intensity) of Arctic storms, and what is the associated uncertainty when using different datasets and storm tracking algorithms?
1. What is the structure and development of Arctic storms, and how does this differ to that of mid-latitude storms?
1. How might Arctic storms change in a future climate in response to climate change?
1. Can the risk of Arctic storms impacting shipping activities be quantified by combining storm track data and ship track data?

Results of my first research question are summarised in a recent paper (https://link.springer.com/article/10.1007/s00382-020-05142-4 – Vessey et al. 2020). I previously wrote a blog post on The Social Metwork summarising this paper, which can be found at https://socialmetwork.blog/2020/02/21/arctic-storms-in-multiple-global-reanalysis-datasets/. This showed that there is a seasonality to Arctic storms: most winter (DJF) Arctic storms occur in the Greenland, Norwegian and Barents Sea region, whereas summer (JJA) Arctic storms generally occur over the coastline of Eurasia and the high Arctic Ocean. Despite the dramatic reductions in Arctic sea ice over the past few decades (see Figure 1), there is no trend in Arctic storm frequency. In the paper, the uncertainty in the present-climate characteristics of Arctic storms is assessed by using multiple reanalysis datasets and tracking methods. A reanalysis dataset is our best approximation of past atmospheric conditions, combining past observations with state-of-the-art numerical weather prediction models.

The deadline for my PhD project is the 30th of June 2021, so I am currently experiencing the very busy period of writing up my Thesis. Hopefully, there aren’t too many hiccups over the next few months, and perhaps I will be able to write some of my research chapters up as papers.

References:

BBC, 2016, Arctic Ocean shipping routes ‘to open for months’. https://www.bbc.com/news/science-environment-37286750. Accessed 18 March 2021.

European Space Agency, 2008: Arctic sea ice annual freeze-up underway. https://www.esa.int/Applications/Observing_the_Earth/Space_for_our_climate/Arctic_sea_ice_annual_freeze_nobr_-up_nobr_underway. Accessed 18 March 2021.

National Snow & Ice Data Center, 2021: Sea Ice Index. https://nsidc.org/data/seaice_index. Accessed 18 March 2021.

Vessey, A.F., K.I., Hodges, L.C., Shaffrey and J.J. Day, 2020: An Inter-comparison of Arctic synoptic scale storms between four global reanalysis datasets. Climate Dynamics, 54 (5), 2777-2795.

## Dialogue Concerning the Obvious and Obscurity of Scientific Programming in Python: A Lost Script

Disclaimer: The characters and events depicted in this blog post are entirely fictitious. Any similarities to names, incidents or source code are entirely coincidental.

Antonio (Professor): Welcome to the new Masters module and PhD training course, MTMX101: Introduction to your worst fears of scientific programming in Python. In this course, you will be put to the test: to spot and resolve the most irritating ‘bugs’ that will itch, sorry, gl-itch your software and intended data analysis.

Berenice (A PhD student): Before we start, would you have any tips on how to do well for this module?

Antonio: Always read the code documentation. Otherwise, there’s no reason for code developers to host their documentation on sites like rtfd.org (Read the Docs).

Cecilio (An MSc student): But… didn’t we already do an introductory course on scientific computing last term? Why this compulsory module?

Antonio: The usual expectation is for you to have completed last term’s introductory computing module, but you may also find that this course completely changes what you used to consider to be your “best practices”… In other words, it’s a bad thing that you took that module last term but also a good thing that you took it. There may be moments where you think “Why wasn’t I taught that?” I guess you’ll all understand soon enough! As for the logistics of the course, you will be assessed in the form of quiz questions such as the following:

Example #0: The Deviations in Standard Deviation

Will the following print statements produce the same numeric values?

import numpy as np
import pandas as pd

p = pd.Series(np.array([1,1,2,2,3,3,4,4]))
n = np.array([1,1,2,2,3,3,4,4])

print(p.std())
print(n.std())
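For readers checking their answer afterwards: the two calls disagree because pandas defaults to the sample standard deviation (ddof=1) while numpy defaults to the population value (ddof=0). A quick check:

```python
import numpy as np
import pandas as pd

data = [1, 1, 2, 2, 3, 3, 4, 4]
p = pd.Series(data)
n = np.array(data)

# pandas uses ddof=1 by default, numpy uses ddof=0...
print(p.std())        # ~1.195
print(n.std())        # ~1.118

# ...and the two agree once the same ddof is requested explicitly.
print(n.std(ddof=1))  # ~1.195
```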

Example #1: The Sum of All Fears

Antonio: As we all know, numpy is an important tool in many calculations and analyses of meteorological data. Summing and averaging are common operations. Let’s import numpy as np and consider the following line of code. Can anyone tell me what it does?

>>> hard_sum = np.sum(np.arange(1,100001))

Cecilio: Easy! This was taught in the introductory course… Doesn’t this line of code sum all integers from 1 to 100 000?

Antonio: Good. Without using your calculators, what is the expected value of hard_sum?

Berenice: Wouldn’t it just be 5 000 050 000?

Antonio: Right! Just as quick as Gauss. Let’s now try it on Python. Tell me what you get.

Cecilio: Why am I getting this?

>>> hard_sum = np.sum(np.arange(1,100001))
>>> print(hard_sum)
705082704


Berenice: But I’m getting the right answer instead with the same code! Would it be because I’m using a Mac computer and my MSc course mate is using a Windows system?

>>> hard_sum = np.sum(np.arange(1,100001))
>>> print(hard_sum)
5000050000

Antonio: Well, did any of you get a RuntimeError, ValueError or warning from Python, despite the bug? No? Welcome to MTMX101!

Berenice: I recall learning something about the computer’s representation of real numbers in one of my other modules. Would this be the problem?

Antonio: Yes, I like your thinking! But that still doesn’t explain why you both got different values in Python. Any deductions…? At this point, I would usually set this as a homework assignment, but since it’s your first MTMX101 lecture, here is the explanation from Section 2.2.5 of your notes.

If we consider the case of a 4-bit binary number, and start counting from 0, the maximum number we can possibly represent is 15. Adding 1 to 15 would lead to 0 being represented, as shown in Figure 1. This is called integer overflow, just like how an old car’s analogue 4-digit odometer “resets” to zero after recording 9999 km. As for the problem of running the same code and getting different results on Windows and Mac machines, a numpy integer array on Windows defaults to 32-bit integers, whereas it is 64-bit on Mac/Linux, and, as expected, the 64-bit integer has more bits and can thus represent our expected value of 5000050000. So, how do we mitigate this problem when writing future code? Simply specify the dtype argument and force Python to use 64-bit integers when needed.

>>> hard_sum = np.sum(np.arange(1,100001), dtype=np.int64)
>>> print(type(hard_sum))
<class 'numpy.int64'>
>>> print(hard_sum)
5000050000

As to why we got the spurious value of 705082704 from using 32-bit integers, I will leave it to you to understand it from the second edition of my book, Python Puzzlers!

Figure 1: Illustration of overflow in a 4-bit unsigned integer
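(For the impatient reader, a quick platform-independent check that the spurious value is nothing more mysterious than wraparound modulo 2**32:)

```python
import numpy as np

# The spurious value is the true sum wrapped modulo 2**32:
true_sum = 5000050000
print(true_sum % 2**32)  # 705082704

# The same wraparound, reproduced deliberately with a 32-bit accumulator:
wrapped = np.arange(1, 100001, dtype=np.int32).sum(dtype=np.int32)
print(wrapped)           # 705082704
```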

Example #2: An Important Pointer for Copying Things

Antonio: On to another simple example, numpy arrays! Consider the following two-dimensional array of temperatures in degree Celsius.

>>> t_degrees_original = np.array([[2,1,0,1], [-1,0,-1,-1], [-3,-5,-2,-3], [-5,-7,-6,-7]])

Antonio: Let’s say we only want the first three rows of data, and in this selection would like to set all values on the zeroth row to zero, while retaining the values in the original array. Any ideas how we could do that?

Cecilio: Hey! I learnt this last term: we do array slicing.

>>> t_degrees_slice = t_degrees_original[0:3,:]
>>> t_degrees_slice[0,:] = 0

Antonio: I did say to retain the values in the original array…

>>> print(t_degrees_original)
[[ 0  0  0  0]
[-1  0 -1 -1]
[-3 -5 -2 -3]
[-5 -7 -6 -7]]

Cecilio: Oh oops.

Berenice: Let me suggest a better solution.

>>> t_degrees_original = np.array([[2,1,0,1], [-1,0,-1,-1], [-3,-5,-2,-3], [-5,-7,-6,-7]])
>>> t_degrees_slice = t_degrees_original[[0,1,2],:]
>>> t_degrees_slice[0,:] = 0
>>> print(t_degrees_original)
[[ 2  1  0  1]
[-1  0 -1 -1]
[-3 -5 -2 -3]
[-5 -7 -6 -7]]

Antonio: Well done!

Cecilio: What? I thought the array indices 0:3 and 0,1,2 would give you the same slice of the numpy array.

Antonio: Let’s clarify this. The former method of using 0:3 is basic indexing and only copies the information on where the original array is stored (i.e. a “view” of the original array, or “shallow copy”), while the latter 0,1,2 is fancy indexing and actually makes a new, separate array with the corresponding values from the original array (i.e. a “deep copy”). This is illustrated in Figure 2, showing variables and their respective pointers for both shallow and deep copying. As you now understand, numpy is really not as easy as pie…

Figure 2: Simplified diagram showing differences in variable pointers and computer memory for shallow and deep copying

Cecilio: That was… deep.

Berenice: Is there a better way to deep copy numpy arrays rather than having to type in each index like I did e.g. [0,1,2]?

Antonio: There is definitely a better way! If we replace the first line of your code with the line below, you should be able to do a deep copy of the original array. Editing the copied array will not affect the original array.

>>> t_degrees_slice = np.copy(t_degrees_original[0:3,:])

I would suggest this method of np.copy to be your one – and preferably only one – obvious way to do a deep copy of a numpy array, since it’s the most intuitive and human-readable! But remember, deep copy only if you have to, since deep copying a whole array of values takes computation time and space! It’s now time for a 5-minute break.
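Whether a given indexing operation returned a view or a copy can be checked directly with np.shares_memory; a small sketch:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

view = a[0:3, :]           # basic slicing: a view onto a's memory
fancy = a[[0, 1, 2], :]    # fancy indexing: a new array (deep copy)
deep = np.copy(a[0:3, :])  # explicit deep copy

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, fancy))  # False
print(np.shares_memory(a, deep))   # False
```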

Cecilio: More like time for me to eat some humble (num)py.

Consider the following Python code:

short_a = "galileo galilei"
short_b = "galileo galilei"
long_a = "galileo galilei " + "linceo"
long_b = "galileo galilei " + "linceo"

print(short_a == short_b)
print(short_a is short_b)
print(long_a == long_b)
print(long_a is long_b)


Which is the correct sequence of booleans that will be printed out?
1. True, True, True, True
2. True, False, True, False
3. True, True, True, False

Antonio: In fact, they are all correct answers. It depends on whether you are running Python 3.6.0 or Python 3.8.5, and whether you ran the code in a script or in the console! Although there is much more to learn about “string interning”, the quick lesson here is to always compare the values of strings using the double equals sign (==) instead of using is.
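A sketch of the reliable half of this comparison (the result of is depends on interning, an implementation detail, so only the == comparisons are asserted here):

```python
short_a = "galileo galilei"
short_b = "galileo galilei"
long_a = "galileo galilei " + "linceo"
long_b = "galileo galilei " + "linceo"

# Value comparison is dependable across versions and run modes...
print(short_a == short_b, long_a == long_b)  # True True

# ...whereas identity depends on interning, so the output of the line
# below varies with the interpreter version and how the code is run.
print(short_a is short_b, long_a is long_b)
```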

Example #3: Array manipulation – A Sleight of Hand?

Antonio: Let’s say you are asked to calculate the centred difference of some quantity (e.g. temperature) in one dimension, $\frac{\partial T}{\partial x}$, with grid points uniformly separated by $\Delta x$ of 1 metre. What is some code that we could use to do this?

Berenice: I remember this from one of the modelling courses. We could use a for loop to calculate most elements of $\frac{\partial T}{\partial x} \approx \frac{T_{i+1} - T_{i-1}}{2\Delta x}$ then deal with the boundary conditions. The code may look something like this:

delta_x = 1.0
temp_x = np.random.rand(1000)
dtemp_dx = np.empty_like(temp_x)
for i in range(1, len(temp_x)-1):
    dtemp_dx[i] = (temp_x[i+1] - temp_x[i-1]) / (2*delta_x)

# Boundary conditions
dtemp_dx[0] = dtemp_dx[1]
dtemp_dx[-1] = dtemp_dx[-2]

Antonio: Right! How about we replace your for loop with this line?

dtemp_dx[1:-1] = (temp_x[2:] - temp_x[0:-2]) / (2*delta_x)

Cecilio: Don’t they just do the same thing?

Antonio: Yes, but would you like to have a guess which one might be the “better” way?

Berenice: In last term’s modules, we were only taught the method I proposed just now. I would have thought both methods were equally good.

Antonio: On my computer, running your version of code 10000 times takes 6.5 seconds, whereas running my version 10000 times takes 0.1 seconds.

Cecilio: That was… fast!

Berenice: Only if my research code could run with that kind of speed…

Antonio: And that is what we call vectorisation, the act of taking advantage of numpy’s optimised loop implementation instead of having to write your own!
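The two versions can be checked against each other, and timed, with a sketch along these lines (the function names are my own; timings are machine-dependent):

```python
import time
import numpy as np

delta_x = 1.0
temp_x = np.random.rand(1000)

def centred_diff_loop(t):
    """Centred difference with an explicit Python loop."""
    out = np.empty_like(t)
    for i in range(1, len(t) - 1):
        out[i] = (t[i+1] - t[i-1]) / (2 * delta_x)
    out[0], out[-1] = out[1], out[-2]
    return out

def centred_diff_vec(t):
    """Centred difference using numpy's vectorised slicing."""
    out = np.empty_like(t)
    out[1:-1] = (t[2:] - t[0:-2]) / (2 * delta_x)
    out[0], out[-1] = out[1], out[-2]
    return out

# Both give identical answers...
assert np.allclose(centred_diff_loop(temp_x), centred_diff_vec(temp_x))

# ...but the vectorised version is far faster (exact numbers vary)
for f in (centred_diff_loop, centred_diff_vec):
    start = time.perf_counter()
    for _ in range(1000):
        f(temp_x)
    print(f.__name__, time.perf_counter() - start)
```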

Cecilio: I wish we knew all this earlier on! Can you tell us more?

Antonio: Glad your interest is piqued! Anyway, that’s all the time we have today. For this week’s homework, please familiarise yourself so you know how to

import this

module or import that package. In the next few lectures, we will look at more bewildering behaviours such as the “Mesmerising Mutation”, the “Out of Scope” problem that is in the scope of this module. As we move to more advanced aspects of this course, we may even come across the “Dynamic Duck” and “Mischievous Monkey”. Bye for now!