Data Management for Data Scientists

Lessons from Software Engineering

Although the field of data science is rapidly growing in popularity, there’s nothing new about data. Even the humblest application in the earliest days of the software engineering field existed purely to represent, search and transform data. Ever since, the ways of processing and managing data have only multiplied. In this time, valuable strategies for data management have been developed and refined. So as data science takes off and large datasets become commonplace, it bears considering how data is managed in the field that, even more so than data science, lives and breathes data. In this blog post, we’ll detail five ways that you can improve how you manage data and introduce more rigour into your experiment codebases using software engineering principles and techniques.

1. Store Data in a Database

It is not uncommon to see datasets in the form of files, but this comes with the risk of data loss – e.g. accidental file deletion or accidental overwrites – or incorrect file usage – e.g. specifying the path to an incorrect file or experiencing unexpected parsing behaviour. Databases solve such problems by managing data on disk for you and providing you with an interface to search and retrieve that data. The query functionality of databases is powerful and provides control over exactly what data is loaded. And, by creating new tables, exact variations on a source dataset can be stored and selected at will. A brief introduction to using the common SQL type of database in Python can be found here. Although databases are a bit more involved than simple file management, persist and you’ll soon be able to offload some of the dataset preparation tasks to the database software.
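
As a rough sketch of the idea, the example below uses Python's built-in sqlite3 module with an in-memory database. The table and column names (documents, label, text, positive_documents) and the helper functions are hypothetical, chosen purely for illustration; a real project would point the connection at a file on disk and use its own schema.

```python
import sqlite3


def build_dataset_db(rows):
    """Create an in-memory database holding (id, label, text) rows.

    The schema is an assumption made for this example only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, label TEXT NOT NULL, text TEXT NOT NULL)"
    )
    # Parameterised inserts avoid hand-built SQL strings.
    conn.executemany(
        "INSERT INTO documents (id, label, text) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def load_texts(conn, label):
    """Load only the rows with the given label, in a fixed order."""
    cursor = conn.execute(
        "SELECT id, text FROM documents WHERE label = ? ORDER BY id", (label,)
    )
    return cursor.fetchall()


def store_positive_subset(conn):
    """Store a variation of the source dataset as its own table."""
    conn.execute(
        "CREATE TABLE positive_documents AS "
        "SELECT id, text FROM documents WHERE label = 'positive'"
    )
    conn.commit()
```

With this in place, loading a subset becomes a single call such as load_texts(conn, "positive"), and the query states exactly which elements were selected.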

2. Represent Data using Types

A common idiom for managing data in experiment codebases is to store the data in its raw form. For example, textual data may be stored directly in strings and images in byte arrays. The problem with this is that data is never alone – it always comes with metadata. Defining types to represent the elements of your dataset allows you to attach that metadata, such as the unique ID of the element and details of the origin of the data. Another advantage of this approach is that custom data types can be made resistant to accidental editing, for instance by making them immutable; consider, for example, the ease of editing a string or number in your programming language of choice, and how destructive an accidental edit of elements of your dataset could be to your experiments. Guard against this by defining types to represent your data and formalising the actions that apply to your types. A quick guide to this idiom can be found here.
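
One way this might look in Python is a frozen dataclass. The Document type, its fields (doc_id, source, text) and the lowercased method are hypothetical names invented for this sketch, not part of any particular library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Document:
    """A dataset element bundled with its metadata (hypothetical fields)."""

    doc_id: str   # unique ID of the element
    source: str   # where the data came from
    text: str     # the raw data itself

    def lowercased(self) -> "Document":
        # A formalised action: returns a new Document and leaves
        # the original untouched.
        return replace(self, text=self.text.lower())
```

Because the dataclass is frozen, an assignment such as doc.text = "oops" raises an error rather than silently altering the dataset, and every permitted change is an explicit, named method.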

3. Define Your Data Transformations

The programming languages commonly used in experiment codebases tend to sit at the more expressive end of the spectrum. However, expressiveness can come at a cost: the more expressive a language, the less likely you may be to introduce more formal structure. In severe cases, an experiment codebase becomes a series of dense list manipulations that can actually make small details, such as a specific parameter or the sequence of transforms, more difficult to identify and verify. Instead, get into the practice of defining each dataset manipulation as a small function, and replacing in-place list manipulations with a series of function calls. Functions are testable, isolate functionality from the rest of the program, and provide concise descriptors for more involved tasks; e.g. normalise_by_token_frequency and filter_by_tfidf_score. This page contains a strong argument for why it’s worth transforming most expressions into functions.
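
To make this concrete, here is a minimal sketch of a transformation pipeline built from small functions. The name normalise_by_token_frequency comes from the paragraph above, but its behaviour here (dividing each count by the total) is only one assumed interpretation; count_tokens and filter_by_minimum_score are hypothetical helpers added for the example.

```python
from collections import Counter


def count_tokens(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def normalise_by_token_frequency(counts):
    """Convert raw counts into relative frequencies (assumed meaning)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}


def filter_by_minimum_score(scores, minimum):
    """Keep only the tokens whose score is at least `minimum`."""
    return {token: score for token, score in scores.items() if score >= minimum}


def prepare_scores(tokens, minimum):
    """The whole pipeline reads as a sequence of named steps."""
    counts = count_tokens(tokens)
    frequencies = normalise_by_token_frequency(counts)
    return filter_by_minimum_score(frequencies, minimum)
```

Each step can now be tested on its own, and the threshold is a visible parameter rather than a number buried inside a list comprehension.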

4. Generalise Data Preparation

Although datasets and domains may change, most experiment codebases still require functionality for dividing datasets into training and testing subsets. The steps are generally the same: 1) select the desired elements, 2) divide them in the desired proportion, 3) shuffle the subsets, etc. This kind of functionality, which reorganises elements without manipulating them, is easy to generalise and use across your experiment codebases. More importantly, generalising the functionality forces you to separate it from the representation of the dataset elements, isolating the two aspects of the codebase. Isolated code is valuable because it is tightly focused and less complex, making it easier to test, reuse and maintain. An overview of generalisation and related concepts, and how they can benefit you, can be found here.
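
The three steps listed above could be sketched as a single generic function. The name prepare_split and its parameters (keep, train_fraction, seed) are hypothetical; the point is only that nothing in the function depends on what a dataset element actually is.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def prepare_split(
    elements: Sequence[T],
    keep: Callable[[T], bool],
    train_fraction: float,
    seed: int,
) -> Tuple[List[T], List[T]]:
    """Select, divide and shuffle elements of any type."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # 1) select the desired elements
    selected = [element for element in elements if keep(element)]
    # 2) divide them in the desired proportion
    cut = int(len(selected) * train_fraction)
    train, test = selected[:cut], selected[cut:]
    # 3) shuffle the subsets, using a seeded generator for repeatability
    rng = random.Random(seed)
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

The same function works for integers, strings or custom element types, because it only reorganises elements and never inspects them beyond the caller-supplied keep predicate.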

5. Control for Non-determinism

The careful storage, representation, manipulation and preparation of datasets can be for nothing if a source of non-determinism enters the process, such as the order of elements in the dataset becoming unpredictable or a data transformation function behaving inconsistently. Sources of non-determinism can be obvious – e.g. using a random number generator without setting the seed, or using collection functionality that iterates in no guaranteed order, such as Python’s set structure (dict has preserved insertion order since Python 3.7) – but can also be subtle – e.g. a sorting function may not be stable, or a time library may behave strangely with leap seconds. The tips on this page will help you identify and eradicate sources of unpredictability. Also, be sure to read software documentation to understand the implications of third-party software on the determinism of your code.
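
A few of these sources can be controlled with very little code. The sketch below uses only Python's standard library; the function names are hypothetical, and it covers just three cases: seeding a random number generator, fixing the order of a collection that has no guaranteed order (a set here), and breaking ties explicitly when sorting.

```python
import random


def sample_elements(elements, k, seed):
    """Draw k elements reproducibly, whatever order `elements` arrives in."""
    # Sorting first removes any dependence on set iteration order;
    # a dedicated, seeded generator avoids shared global state.
    rng = random.Random(seed)
    return rng.sample(sorted(elements), k)


def ordered_vocabulary(tokens):
    """Deduplicate tokens and return them in a fixed, sorted order."""
    return sorted(set(tokens))


def rank_by_score(scored):
    """Rank (name, score) pairs by descending score, ties broken by name."""
    # An explicit tie-breaker means the result does not depend on
    # the order of the input.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

None of this replaces reading the documentation of the libraries you depend on, but it removes several of the most common surprises.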

Data is Precious

Whether you’re a data scientist creating sophisticated models or a software engineer writing complex applications, the careful management of data is essential. The software engineering field has had a long, long time to appreciate this and continually improve data management techniques. The data science field, being much younger and historically more academic, has not… but it’s in a fantastic position to rapidly mature as it becomes more popular and commercialised. Careful data management is the first step.

Header image courtesy of Fabien Barral.

Thanks to Simon Vadja and Rudree Trivedi for proofreading and providing suggestions.