Caleb Miller – c.s.miller@pgr.reading.ac.uk
In my project work, I regularly need to load hundreds of CSV (comma-separated values) files containing daily data from meteorological observations. For example, many of the measurements I use are made at the Reading University Atmospheric Observatory using the main datalogger, alongside data from some of my own instruments. The data for each day are spread across a number of different files.
Most of my analysis is done in Python using the Pandas library for data processing. Pandas can easily read in CSV files with a built-in function, and it is well-suited for the two-dimensional data tables which I regularly use.
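For example, a single day's file can be read with one call to the built-in reader (the file and column names here are just illustrative, not the real observatory format):

import pandas as pd

# Read one day of observations; the file and column names are only examples
df = pd.read_csv(
    "observatory_2022-06-01.csv",
    parse_dates=["timestamp"],   # convert the time column to datetimes
    index_col="timestamp",       # index the table by time
)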
However, after a year or so of working directly with CSV files, I began to run up against some of the performance limitations of doing so.
Daily CSV files may be good for organizational purposes, but they are not the most efficient way to store and access large amounts of data. In particular, once I wanted to start studying many years’ worth of data at the same time, reading in each file every time I wanted to re-run my code began to slow the development process significantly. Also, the code that I had written to locate each file and read it in for a given range of dates was somewhat clunky and inflexible.
It was time to develop a new solution for accessing the met observations more quickly and easily.
Pandas has built-in functions for reading a variety of different formats, not just plain-text CSV files. I decided to constrain my choices for data formats to those that Pandas could read natively.
In addition, I wanted to build a system that would satisfy three primary goals:
- Compatibility for long-term data storage
- High speed
- Simple programming interface
Compatibility is important: I wanted to ensure that my data would remain readable by others (and by my future self) without any specialized software that I had written. CSV is excellent for that purpose.
However, CSV is not a fast format to read. Ideally, the system I chose would store both numerical data and timestamps as floating-point values rather than encoded text, for better performance.
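To illustrate the idea (this is just the principle, not my actual storage code), a timestamp can be held as a plain floating-point number of seconds since the Unix epoch, which is much cheaper to read back than a formatted text string:

import pandas as pd

ts = pd.Timestamp("2022-06-01 12:00:00")
seconds = ts.timestamp()                       # a plain float: seconds since the Unix epoch
recovered = pd.to_datetime(seconds, unit="s")  # converts straight back to a timestamp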
Finally, I wanted to create a system that would be flexible and easy to use: ideally, something that would require only one or two lines of code to load the data from a given instrument and date range, rather than the many complicated steps that had been needed to search for and load the many files I had been using.
System Design
In the end, I settled on a system that is fairly complicated under the hood, but that gives me a very simple, reliable, and fast data stack for accessing my data.
At the base layer, all the data would be stored in the original CSV files. This is the format that most of the data comes in, and the few instruments that do not can easily be converted. CSV is a very common file format, which can be read easily by many software packages and will likely be useful far into the future, even when current software is too outdated to be run.

However, rather than accessing the CSV files directly, I periodically import them into a SQLite database file. SQLite is a widely used, open-source database library that stores an entire database in a single file rather than requiring a server, as many other popular database systems do. Its advantage over CSV files is that it is relatively fast: data from an individual table can be retrieved with a query specifying the start and end dates, which makes it very easy to load an arbitrary timeseries of data.
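As a rough sketch of how this works (the table, column, and file names below are hypothetical rather than my actual schema), a dataframe can be written into a SQLite file and an arbitrary date range queried back using pandas together with Python's built-in sqlite3 module:

import sqlite3
import pandas as pd

conn = sqlite3.connect("observations.db")   # the whole database lives in this one file

# One-off import: append a day's worth of data to a table
day = pd.DataFrame({
    "timestamp": ["2022-06-01 00:00:00", "2022-06-01 00:01:00"],
    "temperature": [15.2, 15.3],
})
day.to_sql("datalogger", conn, if_exists="append", index=False)

# Later: load an arbitrary date range with a single query
subset = pd.read_sql_query(
    "SELECT * FROM datalogger WHERE timestamp BETWEEN ? AND ?",
    conn,
    params=("2022-06-01", "2022-06-02"),
    parse_dates=["timestamp"],
)
conn.close()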
However, for loading many years’ worth of high-resolution data, even SQLite was not as fast as I wanted. Pandas can also read and write a format called pickle. “Pickling” a dataframe writes it from program memory to disk as a file, which can then be read back into a program very quickly at a later time, even for large files.
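The round trip takes just two calls (the file name here is only an example):

import pandas as pd

df = pd.DataFrame({"temperature": [15.2, 15.3]})
df.to_pickle("datalogger_cache.pkl")         # dump the in-memory dataframe to disk
df = pd.read_pickle("datalogger_cache.pkl")  # read it back later, far faster than re-parsing CSV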
In my data access library, once a request is made for a given timeseries of data, that dataframe is cached to a pickle file. If the same request is made again shortly afterwards, rather than going back to the SQLite database, the data is loaded from the pickle file. For large datasets, this can reduce the loading time from nearly a minute to just a few seconds, which is very helpful when repeatedly debugging a program using the same data! The cache files can be relatively large, however, so they are automatically cleared out when the code runs if they have not been used for several days.
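A minimal sketch of that caching logic is shown below; the function, directory name, and expiry period are hypothetical, and my real library handles more corner cases, but the idea is the same.

import os
import time
import pandas as pd

CACHE_DIR = "cache"
MAX_AGE_DAYS = 7   # assumed expiry period

def load_with_cache(dataset, start, end, query_db):
    # query_db is whatever function actually queries the SQLite database
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, f"{dataset}_{start}_{end}.pkl")

    # Clear out cache files that have not been touched for a while
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

    if os.path.exists(cache_file):
        return pd.read_pickle(cache_file)   # fast path: reuse the cached result

    df = query_db(dataset, start, end)      # slow path: go back to the database
    df.to_pickle(cache_file)
    return df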
Finally, all of this functionality is available behind a simple library, which allows for accessing a dataset from any other Python code on my machine with just two lines, as shown below.
import foglib.db as fdb
fdb.load_db("dataset_name", "start_datetime", "end_datetime")
Conclusions
I have found this system to work very well for my purposes. It required a fair amount of development work, but that investment has paid off handsomely.
By allowing me to access almost any of my data with just a few lines of code, it lets me start new analyses with far less time and code overhead, which leaves more time and energy for answering science questions. And because even large datasets load so quickly, I can debug my code rapidly without long waits while it runs.
My particular solution may not be the ideal data-loading system for everyone’s needs. However, based on my experiences working on this program, I believe that time invested in enabling access to your data at the beginning of a PhD is time very well spent.