Addressing the Sad State of Data Science Libraries

The Rise of Data Science Libraries

As the utility of data science techniques becomes more broadly recognised, many data science software libraries have emerged. Certain libraries, such as Keras and Scikit-learn (referred to below as Scikit), reduce the effort required to configure and train models. At DSTIL, one of our projects involves large datasets on which we train a large number of models, which are then deployed to production to classify streams of incoming data. Thanks to Keras and Scikit, we haven’t had to produce our own implementations of the models that we depend on, such as neural networks and support vector machines. However, we have also run into many technical problems caused by these libraries that simply should not exist.


Quirks in Early Keras Versions

For neural networks, we use Keras with the TensorFlow backend. Keras has improved in the past few months. When we first started using it, at version 0.3.3, trained models were stored in two separate formats (one for structure and one for weights), and the file input/output operations occurred inside the core model class. As we did not intend to use those formats or to save models to file, we had to rewrite that functionality ourselves. Keras also represented graph and sequential architectures using different classes, each with its own implementation and interface. As a sequence is simply a special case of a graph, digging into the source code and finding two disparate classes was very concerning.


(Different) Quirks in Later Keras Versions

Version 1.0.0 of Keras was soon released and partially addressed both of these problems. The clunky, file-based save-restore functionality was replaced by methods for converting models to and from dictionaries. The object model was also modified so that the class that implements sequential architectures extends the class that implements graph architectures. However, version 1.0.0 arrived without warning and was a complete rewrite of Keras. We lost all of our knowledge of the library’s internals and also had to discard much of the code we had created to interface with the library. In many ways, the rewrite is just as confounding as version 0.3.3: the class implementing sequential architectures, for example, both extends and contains an instance of the class that implements the more general graph architecture.
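To illustrate the kind of design we had hoped to find, here is a minimal sketch in which a sequential model is a special case of a graph through inheritance alone, with no second graph instance held inside it. This is not Keras code: the Graph and Sequential classes, their methods and the layer names are all hypothetical, and plain strings stand in for layer objects.

```python
class Graph:
    """Hypothetical directed graph of named layers (not the Keras class)."""

    def __init__(self):
        self.nodes = {}   # node name -> layer object
        self.edges = []   # (source name, target name) pairs

    def add_node(self, name, layer, inputs=()):
        if name in self.nodes:
            raise ValueError(f"duplicate node name: {name}")
        for source in inputs:
            if source not in self.nodes:
                raise KeyError(f"unknown input node: {source}")
        self.nodes[name] = layer
        self.edges.extend((source, name) for source in inputs)


class Sequential(Graph):
    """A graph in which each new layer takes the previous layer as its only input."""

    def __init__(self):
        super().__init__()
        self._last = None   # name of the most recently added layer

    def add(self, layer):
        name = f"layer_{len(self.nodes)}"
        inputs = (self._last,) if self._last is not None else ()
        self.add_node(name, layer, inputs)   # reuse the graph machinery directly
        self._last = name


model = Sequential()
for layer in ("dense", "relu", "softmax"):   # strings stand in for real layers
    model.add(layer)
```

Because Sequential only adds a convenience method on top of Graph, any code written against the graph interface (traversal, serialisation, validation) would work on both without special cases.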


Scikit and the Inability to Safely Store Models

For machine learning algorithms, we use Scikit. Scikit implements a large number of models such as support vector machines, decision trees and random forests, and methods such as nearest neighbours and Gaussian naive Bayes. Using Scikit models in production is difficult because there is no safe way to save them. Currently, the recommended approach is to use Python’s pickle to save and restore models. The risks of using pickle for data transport have been well documented (e.g. [1, 2]), and although the main contributors have indicated that a safer approach is forthcoming [3], that indication is now three years old. To store Scikit models alongside Keras models, which are stored inside a database in a JSON column, we need to pickle the object and encode the result in base64, complicating our data pipeline.
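The pickle-then-base64 workaround described above can be sketched roughly as follows. This is an illustration under assumptions rather than our production code: the function names and the "format" and "payload" keys are hypothetical, and a plain dictionary stands in for a fitted Scikit estimator so that the example runs with the standard library alone.

```python
import base64
import json
import pickle


def model_to_json_field(model):
    """Pickle a model and wrap the bytes as base64 text inside a JSON document."""
    raw = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    payload = base64.b64encode(raw).decode("ascii")   # JSON cannot hold raw bytes
    return json.dumps({"format": "pickle+base64", "payload": payload})


def model_from_json_field(text):
    """Reverse model_to_json_field. Unpickling can run arbitrary code,
    so this must only be called on data from a trusted source."""
    record = json.loads(text)
    if record["format"] != "pickle+base64":
        raise ValueError(f"unsupported model format: {record['format']}")
    return pickle.loads(base64.b64decode(record["payload"]))


# Stand-in for a fitted estimator (assumption: any picklable object works here).
fitted = {"kind": "svm", "support_vectors": [[0.0, 1.0], [1.0, 0.5]], "intercept": -0.25}

stored = model_to_json_field(fitted)      # text suitable for a JSON column
restored = model_from_json_field(stored)
```

The stored value is opaque: nothing in the database can inspect or validate the model, and restoring it depends on the same class definitions being importable at load time.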


What Features Should We Expect in Data Science Libraries?

Both Keras and Scikit have enthusiastic contributors and users. I’m sure that both libraries will improve in time. However, there are some steps the respective projects can take to make the use of data science techniques in production easier:

  • Provide a flexible approach to saving and restoring models. A trained model that is confined to a file, or is only readable by certain software versions, or saved in an undesirable format, may well be useless depending on the needs of the user.
  • Consider the underlying design of the software. Simpler architectures are easier to maintain and extend, and effective abstractions can reduce code repetition and ambiguity.
  • Provide a development and release roadmap. Tell users what features are under development and what progress is being made towards the long-term objectives of the project.
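As a rough sketch of what the first point could look like in practice, a model might export itself as plain data with an explicit format version, leaving the caller free to write it to a file, a JSON column or anywhere else. Everything here is hypothetical and chosen for illustration: the LinearClassifier class, its to_dict and from_dict methods, and the FORMAT_VERSION constant are not part of Keras or Scikit.

```python
import json

FORMAT_VERSION = 1   # hypothetical schema version for the exported data


class LinearClassifier:
    """Toy binary classifier used only to illustrate plain-data export."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = float(bias)

    def predict(self, features):
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1 if score >= 0 else 0

    def to_dict(self):
        # Only built-in types, so the result is JSON-safe and storage-agnostic.
        return {
            "format_version": FORMAT_VERSION,
            "type": "LinearClassifier",
            "weights": self.weights,
            "bias": self.bias,
        }

    @classmethod
    def from_dict(cls, data):
        # Reject data written under a different schema instead of guessing.
        if data.get("format_version") != FORMAT_VERSION:
            raise ValueError(f"unsupported format version: {data.get('format_version')}")
        return cls(data["weights"], data["bias"])


model = LinearClassifier([0.5, -1.0], 0.25)
as_json = json.dumps(model.to_dict())                       # caller picks the storage
restored = LinearClassifier.from_dict(json.loads(as_json))
```

Unlike a pickle, the exported dictionary contains no executable content, can be read by other tools, and fails loudly when the format changes.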

Although Keras and Scikit are useful to data scientists who use such libraries in ad hoc setups to explore interesting problems, there is an entire segment of users for whom they do not cater well: developers who are attempting to use neural networks and machine learning algorithms in production-ready systems. The teams behind Scikit, Keras and similar libraries have a strong opportunity to meet the increasing demand for deployable models.


Resources
  1. “Pickles are for Delis, not Software”, Alex Gaynor (PyCon 2014)
  2. “Don’t Pickle Your Data”, Ben Frederickson
  3. “API design for machine learning software: experiences from the scikit-learn project”, Buitinck et al.