One of the first steps in data analysis is dataset preparation: the process of applying techniques that reduce the amount of noise in the data. This step is particularly relevant in natural language processing (NLP), a field which is, just like language itself, very messy. Data cleaning may involve converting all words to lowercase, stripping punctuation, removing digits and functors (articles, pronouns, particles, etc.) and many other steps.

Depending on the objectives of the analysis, the combination of transformations and filters may differ greatly. For instance, when evaluating the emotional polarity of a text, removing punctuation or the adverb ‘not’ can change the sentiment of a sentence completely, so it is a good idea to keep the text minimally changed. Another task, such as topic extraction, may greatly benefit from removing noise, which may, for example, require removing not only numbers but also their word representations. And if you think for a moment of the stereotypical data scientist, exploring the possible combinations of transformations may itself be part of the model selection process.

So it seems like a good idea to have a set of ready-to-use functions that can be easily composed [1, 2] to perform the dataset preparation task. In this post, we will look in more detail at the problem that motivates composable transformations, then show how their composition can be achieved using example code.
Consider the following two implementations of raw dataset conversion into a cleaned corpus, which can be used as an input for an NLP model. In this case, a dataset (a text) is a list of elements (or sentences), where each element is a list of words. The snippet below shows a quick and dirty Python 3 implementation of dataset cleaning. The reader is advised to read through and make sure everything makes sense.
    import string

    stopwords = set(["the", "i", "you", ...])
    PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

    corpus = [
        [token.strip().lower().translate(PUNCTUATION_TABLE)
         for token in element
         if not any(x.isdigit() for x in token) and len(token) >= 5 and token not in stopwords]
        for element in dataset]
One may think that it is not a big deal to have these eight lines of code, which may even seem handy for a fast experiment. However, this style is only suitable for quick trial-and-error code, not for code that will enter production and require long-term maintenance. Firstly, readability suffers: it is not immediately clear where exactly the inner list ends. Secondly, multiple operations are piled into a single expression: a subtle change to the ‘>=’ comparison, or even to the order of operations, may be difficult to spot and can significantly impact the result.
Let’s look at another way the same transformations can be expressed.
    transforms = [
        unpunctuate,
        normalise,
        not_stopword,
        all_chars,
        longer_than(5),
    ]

    corpus = []
    for element in dataset:
        corpus.append(process(transforms, element))
The dataset preparation step is now very clear and concise. Using this approach, one first determines which transformation operations are suitable and worthwhile to have in a project. Then each transformation is localised in its own transform function, and the functions that apply the transformation pipeline are written out. Once this is done, there is little reason to worry about accidentally introduced alterations, since these functions are set aside.
This formulation works with a little help from the ‘process’ function, which we will look at in the next section. It may seem that this formulation introduces unnecessary complexity while still arriving at the same result as the first example. However, the additional time spent on designing the composable approach is likely to pay off later by reducing the time spent searching for errors.
The technique of creating composable transformations can be broken down into several steps, described in turn below.
To be composable, the functions all need to take an input of type X and produce an output of type X, where X is the type of the elements in the dataset. Then, it is simply a matter of composing the functions together in any order into what is essentially one ‘big’ function that takes in a raw value of X and returns that value in its prepared form. Let’s look at the normalise and unpunctuate functions from our NLP example above:
    import string

    PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

    def normalise(token):
        return token.strip().lower()

    def unpunctuate(token):
        return token.translate(PUNCTUATION_TABLE)
See that they both take and return strings. The composition of these functions takes place by simply applying the second transform to the result of the first:
    def apply_transforms(transforms, token):
        for transform in transforms:
            token = transform(token)
        return token
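As a quick check of this composition, the following sketch applies both transforms to a single token. The sample token is made up for illustration, and the function definitions are repeated only so that the snippet runs on its own.

```python
import string

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def normalise(token):
    # Trim surrounding whitespace and lowercase the token.
    return token.strip().lower()

def unpunctuate(token):
    # Delete every ASCII punctuation character from the token.
    return token.translate(PUNCTUATION_TABLE)

def apply_transforms(transforms, token):
    # Feed the output of each transform into the next one.
    for transform in transforms:
        token = transform(token)
    return token

# Illustrative input, not taken from a real dataset.
result = apply_transforms([normalise, unpunctuate], "  Hello, ")
print(result)  # hello
```

Because both functions map a string to a string, swapping their order in the list gives the same result for this input.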
The normalise and unpunctuate functions are straightforward, but other transformations can actually reject the input. Consider, for example, stopword filtering, in which some words are removed from the text because they carry little or no meaning. In the example above, we captured this step using the not_stopword function:
    stopwords = set(["the", "i", "you", ...])

    def not_stopword(token):
        if token in stopwords:
            return None
        return token
See how the function returns None to signal that the word should be rejected. We need to update our apply_transforms function to handle this case, which expands its functionality from the mere composition of transformations to filtration (and early termination):
    def apply_transforms(transforms, token):
        for transform in transforms:
            token = transform(token)
            if token is None:
                return None
        return token
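The transform list in the earlier example also names all_chars, which this post does not define. A minimal sketch, assuming it is meant to reject tokens that contain digits (mirroring the digit check in the quick and dirty version), suggests that a rejecting transform needs nothing beyond the None convention. Both the body of all_chars and the recording helper spy are assumptions made for illustration.

```python
def all_chars(token):
    # Assumed behaviour: reject any token that contains a digit.
    if any(ch.isdigit() for ch in token):
        return None
    return token

def apply_transforms(transforms, token):
    for transform in transforms:
        token = transform(token)
        if token is None:
            # A transform rejected the token, so skip the remaining ones.
            return None
    return token

calls = []

def spy(token):
    # Hypothetical helper that records which tokens reach it.
    calls.append(token)
    return token

print(apply_transforms([all_chars, spy], "route66"))  # None
print(apply_transforms([all_chars, spy], "route"))    # route
print(calls)                                          # ['route']
```

The rejected token never reaches spy, which is the early termination described above.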
As in the previous example, apply_transforms composes the functions by making the output of one the input of the next. But it also checks whether that output is None and, if so, simply returns None. As such, any function in the composition is able either to transform the input it receives or to reject it, replacing useless words with the value representing nothing. The final step is to call apply_transforms with the desired transform functions on each word in the text (for our NLP example), which is performed using the process function:
    def process(transforms, element):
        f = lambda x: apply_transforms(transforms, x)
        return list(filter(None, map(f, element)))
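Putting the pieces together, the sketch below runs process over a tiny made-up dataset. The sample sentences and the shortened stopword set are assumptions for illustration. It also shows a side effect worth knowing: filter(None, ...) drops every falsy value, so a token that becomes an empty string (for example, one consisting only of punctuation) is removed along with the tokens rejected with None.

```python
import string

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)
stopwords = {"the", "i", "you"}  # shortened, illustrative stopword set

def normalise(token):
    return token.strip().lower()

def unpunctuate(token):
    return token.translate(PUNCTUATION_TABLE)

def not_stopword(token):
    return None if token in stopwords else token

def apply_transforms(transforms, token):
    for transform in transforms:
        token = transform(token)
        if token is None:
            return None
    return token

def process(transforms, element):
    f = lambda x: apply_transforms(transforms, x)
    return list(filter(None, map(f, element)))

# Made-up sentences, each already split into words.
dataset = [
    ["The", "cat", "sat", "--", "quietly."],
    ["You", "and", "I", "agree!"],
]
transforms = [unpunctuate, normalise, not_stopword]
corpus = [process(transforms, element) for element in dataset]
print(corpus)
# [['cat', 'sat', 'quietly'], ['and', 'agree']]

# Order matters: checking stopwords before normalising lets "The" through.
reordered = [not_stopword, unpunctuate, normalise]
print(process(reordered, dataset[0]))
# ['the', 'cat', 'sat', 'quietly']
```

The second call is a reminder that, although the transforms can be composed in any order, the order chosen still affects the result.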
Some transformations lend themselves to configuration. In our NLP case, for example, we have included a longer_than function that filters out words shorter than a certain length. This function is most useful when that length can be easily changed, but that is a bit tricky when the approach we have described depends on each function receiving a single value of type X (whatever X is appropriate for your context) and returning a value of the same type. This is where closures are useful. A closure is a function that is ‘closed over’ a value. It can be understood as a version of a function in which one or more of the parameters are fixed. Take the implementation of longer_than, for example:
    def longer_than(n):
        def _longer_than(token):
            if len(token) < n:
                return None
            return token
        return _longer_than
See that the outer function returns a closure, the inner function, that is closed over the value n. longer_than(5), then, returns a function that filters out words shorter than five letters, and longer_than(10) filters out words shorter than ten letters. Whenever you are implementing a transformation, consider whether it could be made more useful by making it a closure.
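To see the closure at work, the sketch below builds two filters from the same longer_than definition and applies them to made-up tokens. The variable names at_least_5 and at_least_10 are chosen for illustration only.

```python
def longer_than(n):
    def _longer_than(token):
        # n is captured from the enclosing call to longer_than.
        if len(token) < n:
            return None
        return token
    return _longer_than

# Two independent filters, each remembering its own value of n.
at_least_5 = longer_than(5)
at_least_10 = longer_than(10)

print(at_least_5("data"))         # None
print(at_least_5("corpus"))       # corpus
print(at_least_10("corpus"))      # None
print(at_least_10("transforms"))  # transforms
```

Note that, despite its name, longer_than(n) keeps tokens of exactly n characters, since only tokens with len(token) < n are rejected.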
The value of composable transformations does not end with the ability to mix and match them. By expressing transformations as succinct functions with a single purpose, your code becomes easier to read, and small functions with specific purposes are easier to test and reuse across projects. In Python, functions representing transformations can also be replaced with callable classes, which can be expanded to include serialisation functionality, allowing you to save, load and even share pre-processing ‘recipes’. While it is a little more effort than using an ad hoc implementation, this little bit of effort goes a long way. Maybe consider it as an option.
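As a rough sketch of the callable-class idea, the example below re-expresses longer_than as a class whose configuration can be written to and restored from JSON. The class name LongerThan, its to_dict method, the REGISTRY mapping and the save_recipe and load_recipe helpers are all hypothetical; they show one possible shape for a ‘recipe’ format, not an established API.

```python
import json

class LongerThan:
    """Callable transform that rejects tokens shorter than n characters."""

    def __init__(self, n):
        self.n = n

    def __call__(self, token):
        # Same contract as the closure: return the token or None.
        return None if len(token) < self.n else token

    def to_dict(self):
        # Hypothetical serialised form: class name plus constructor arguments.
        return {"name": "LongerThan", "args": {"n": self.n}}

# Hypothetical registry mapping serialised names back to classes.
REGISTRY = {"LongerThan": LongerThan}

def save_recipe(transforms):
    # Serialise a list of transforms to a JSON string.
    return json.dumps([t.to_dict() for t in transforms])

def load_recipe(text):
    # Rebuild the transforms from their serialised form.
    return [REGISTRY[d["name"]](**d["args"]) for d in json.loads(text)]

recipe = save_recipe([LongerThan(5)])
restored = load_recipe(recipe)
print(recipe)                 # [{"name": "LongerThan", "args": {"n": 5}}]
print(restored[0]("data"))    # None
print(restored[0]("corpus"))  # corpus
```

Since instances are callable with a single token, they can sit in a transform list alongside plain functions and closures.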
Header image courtesy of Radek Grzybowski.
Thanks to Shannon Pace, Elodie Thilliez, and Rhys Adams for reviewing this post and providing suggestions.