Flexible Classifier Creation using the Pipeline

This post is part of the series One Pipeline to Train Them All that describes a pipeline for training textual classifiers based on a plug-in architecture. Refer to Part I and Part II for previous entries in this series.

The Pipeline in a Distributed System

When run in a non-distributed manner, the advantage of the pipeline described in this series is that it provides an abstraction over explicitly defined adapter-model pairs. This is only a small advantage over writing ad hoc scripts that train a specific model using a dataset prepared by a specific adapter. In a distributed environment, however, the advantages of the pipeline are substantial: it allows model definitions specifying all manner of adapter-model pairs to be trained in parallel by a cluster of homogeneous workers, enabling us to rapidly create and test a large number of classifiers. This blog post describes the system that makes this possible, and what to consider should you attempt to create the system yourself.

Architecture of the System

The architecture of the system at a conceptual level is presented in Figure 1. The role of each component is as follows:

  • Definition generator: Manually generating model definitions as described in the previous post in this series is laborious and inefficient. This module of the system automatically generates model definitions, each with a unique label, from a ‘seed’ we refer to as a training plan (more details in the next section).
  • Training queue: This represents a message queue for model definitions, which are published to the queue by the definition generator.
  • Training workers: The pipeline in this blog series ‘resides’ in each training worker. These workers consume model definitions from the training queue and then execute the pipeline according to the definition’s specification.
  • Examples database: This database contains labeled examples of the data to be used to train an adapter-model pair. The "dataset" portion of a model definition (see the previous post in this series) defines which examples a worker will fetch from the database.
  • Models database: This database contains adapter-model pairs that have been trained and persisted by workers. Rows in this database represent the output of the system.
  • Notification queue: Workers publish a notification to a second message queue that contains the unique label of each model definition when training is completed. This queue exists so that consumers can respond to this event by, for example, transporting the completed model to a classification service.

Notice that Figure 1 represents many workers. This queue-based architecture enables the number of workers to massively scale. And due to the genericity of the pipeline, all workers execute the same code. Thus, the processing capacity of the system can be increased simply by adding more workers to the worker pool.

High level architecture of the systemFigure 1: High level architecture of the system

Making the System Even More Useful

By adding a small number of features to the system, it can be made even more flexible. This section describes some of these features.

Generating Model Definitions

Most model definitions are variations on a theme, containing various hyperparameters for a fixed adapter-model pair. We use training plans to reduce this repetition. A training plan looks like:

{    "dataset": {        "name": "SpanDataset",        "options": {            "start_datetime_str": ["2016-1-1"],            "end_datetime_str": ["2016-1-2"],            "training_ratio": [0.8]        }    },    "adapter": {        "name": "Word2VecAdapter",        "options": { "sequence_length": [50, 100] }    },    "model": {        "name": "scikit_model_svm",        "options": {            "kernel": ["rbf"],            "degree": [3,4],            "tolerance": [1e-3]        }    }}

It is identical to a model definition, except that each field corresponds to a list instead of a value. The model definitions defined by a training plan are all permutations of values from the plan’s lists. The above plan, for example, defines four definitions: a word2vec-based adapter preparing sequences of length [50,100] paired with a support vector machine with degree [3,4]. The only drawback to this approach is that care must be taken to ensure that no invalid or unwanted permutations can arise from the training plan.

Serialising Adapters and Models

Serialisation is necessary to store trained adapter-model pairs inside the models database. This can be achieved by adding serialise and deserialise methods to each adapter and model implementation. Note that it is important that these methods are on the adapter and model classes so that those classes can account for the specifics of the functionality they wrap (which may be provided by third-party libraries). Here is an example for the WordFrequencyAdapter adapter, presented in the previous post in this series, which serialises to a Python dictionary:

class WordFrequencyAdapter:      ...    @staticmethod    def from_dict(params):        """Returns the WordFrequencyAdapter defined by the given params."""        adapter = WordFrequencyAdapter(            params["sequence_length"],            params["vocabulary_size"]        )        adapter.index = params["index"]        return adapter    def to_dict(self):        """Returns a dictionary representing this WordFrequencyAdapter."""        return dict(            sequence_length = self.sequence_length,            vocabulary_size = self.vocabulary_size,            index = self.index,        )

An adapter-model pair is serialised by adding the same methods to the AdaptedModel class (also introduced in the previous post in this series):

class AdaptedModel:      ...    @staticmethod    def from_dict(structure):        """Returns the AdaptedModel represented by 'structure'."""        adapter_class = getattr(adapters, structure["adapter_type"])        model_class = getattr(models, structure["model_type"])        return AdaptedModel(            adapter_class(structure["adapter"]),            model_class(structure["model"]),        )    def to_dict(self):        """Returns a dict representing this AdaptedModel."""        return dict(            adapter_type = type(self.adapter).__name__,            model_type = type(self.model).__name__,             
adapter = self.adapter.to_dict(),            model = self.model.to_dict(),        )

Transporting Adapter-Model Pairs

Models that reach the models database are ultimately sent to classification services so that they can put the knowledge they’ve accrued from exposure to past examples to the task of classifying new examples. For this reason they must be serialised into a transportable format. For this task we use JSON, which provides the following advantages:

  • JSON is the de facto data format for RESTful services, and RESTful services provide a useful means for providing models on request.
  • PostgreSQL, our database of choice for this system, provides many utilities for storing and inspecting JSON data.
  • There is a one-to-one correspondence between JSON and the dict structure in the Python language, in which we’ve implemented the system. (Note that the JSON tricks Python module can be used for serialising Numpy arrays to JSON.)

By no means is it necessary to only serialise adapters and models to JSON, but it is a very useful data format for storing and distributing models once training is complete. However, adapters and models can become large so a space-efficient binary format may be more useful for your use case.

Things to Consider

As distributed systems go, the system presented in this blog post is relatively simple. Even so, there are many details to consider during implementation. Don’t overlook the following:

  • Data security: If your examples contain sensitive data, then transporting them to training workers constitutes a security risk. Keep examples encrypted until they enter the pipeline (i.e. keep examples encrypted at all times except during training).
  • Queue configuration: Configure your message queue broker to reduce the risk that model definitions will be lost in the system. Message persistence, message acknowledgements and publisher confirms can eliminate message loss in simple use cases such as the system described in this post.
  • Model storage: In the models database, store each adapter-model pair with its corresponding model definition so that you have a record of the process by which each adapter-model pair was created.

Optimal Flow

The aim of the system presented in this blog post, and the pipeline presented in this series, was to make the task of creating classifiers as painless as possible. Rather than create scripts, we define experiments in textual model definitions. Rather than personally manage training, we dispatch model definitions to remote worker machines. Rather than struggle within the computing resources of a single machine, we distribute training across a worker cluster. We’ve strived to make the distance between encountering a classification problem and determining the best classifier recipe to solve that problem as short as possible, which has enabled us to optimise our data science workflow. With the details provided in this blog post series, you might be able to do the same.

Header image courtesy of Maurizio Abbate.

Thanks to Rohan Liston and Andrew Simmons for proofreading and providing suggestions.