Our ability to process textual data has improved rapidly with the introduction of word-embedding models such as word2vec and GloVe, which enable symbolic text to be transformed into a more flexible numeric representation. Yet the nature of some problems is that there will never be enough domain- or context-specific textual data to train such models. One of our projects at DSTIL involves analytics on the task descriptions created during software projects. Our datasets are small and contain very specialised language – exactly the opposite of what is required to apply contemporary textual analysis approaches. However, there is an algorithm for which these constraints are ideal: k-medoids. By clustering elements based on nothing more than their pairwise distances, this algorithm can show us the main concepts in a project and how much project effort has been directed to each one. This post describes all aspects of the algorithm – how it works, how to analyse its results, and its characteristics – to prepare you for encounters with similar problems and datasets.
k-medoids is a partitioning algorithm that divides the dataset into non-overlapping clusters. Each cluster is represented by a medoid, an exemplar of the cluster’s elements. Elements are assigned to the cluster with the medoid to which they are most similar, where similarity is quantified by a metric function. The metric enables the algorithm to optimise clusters after the assignment step, ensuring that the most representative element of each cluster becomes its medoid. One interpretation of the algorithm is as follows:

1. Select k elements of the dataset as the initial medoids.
2. Assign each element to the cluster of the medoid to which it is most similar.
3. For each cluster, select as its new medoid the element with the smallest total distance to the other elements of that cluster.
4. Repeat steps 2 and 3 until the medoids no longer change.
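The alternation of assignment and update steps can be sketched in a few lines of Python. This is a minimal illustration, not the Haskell implementation the post links to; the `jaccard` token distance and all identifiers are assumptions chosen for demonstration:

```python
import random

def jaccard(a, b):
    """Illustrative metric: token-set distance, 0 (identical) to 1 (disjoint)."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def assign(elements, medoids, distance):
    """Assignment step: each element joins the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for e in elements:
        nearest = min(range(len(medoids)), key=lambda i: distance(e, medoids[i]))
        clusters[nearest].append(e)
    return clusters

def kmedoids(elements, k, distance, seed=0, max_iter=100):
    """Alternate assignment and medoid-update steps until the medoids settle."""
    random.seed(seed)
    medoids = random.sample(elements, k)
    for _ in range(max_iter):
        clusters = assign(elements, medoids, distance)
        # Update step: the element with the smallest total distance to the
        # rest of its cluster becomes that cluster's new medoid.
        new_medoids = [min(c, key=lambda m: sum(distance(m, e) for e in c))
                       for c in clusters if c]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, [c for c in assign(elements, medoids, distance) if c]
```

Applied to tokenised commit messages, `kmedoids(messages, k, jaccard)` returns the medoids – the exemplar messages – alongside their clusters.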
For the code-inclined, a Haskell implementation and a guide to implementing the algorithm yourself can be found here. To demonstrate the power of this approach, let’s apply it to the first lines of the commit messages of one of our larger DSTIL projects. This dataset has all of the problems that we have discussed: it is small (851 messages containing a total of 4035 non-stopword tokens); it contains technical, domain- and team-specific language; and, being natural language, it is inherently noisy. The k-medoids algorithm gives the breakdown of clusters by size shown in Figure 1, and tells us that the following are the five main concepts in that project:
The Wicketkeeper project involved worker machines executing deep learning models in a Docker-based setup, and development involved much prototyping and experimentation with various implementations. This is reflected in the clusters identified by the k-medoids algorithm.
Fig. 1: Size of clusters generated by the k-medoids clustering algorithm applied to the commit messages of DSTIL’s Wicketkeeper project.
Effective use of the k-medoids algorithm means selecting an appropriate number of clusters, k. k must be high enough that all of the valuable ‘groups’ of elements hidden in the dataset can be identified, but not so high that clusters contain only a few elements each, losing the power of aggregation and abstraction that clustering provides. Ideally, clusters should be cohesive, containing elements that are similar to each other, and distinct, clearly separated from every other cluster. One way of measuring this is the silhouette method, which indicates how appropriately an element has been clustered. The silhouette value s of element i is calculated as follows:
s(i) = (b(i) – a(i)) / max(a(i), b(i))
where a(i) is the mean distance of i to the other elements of its cluster and b(i) is the mean distance of i to the elements of the next most similar cluster (the cluster with the smallest such mean distance). If s(i) is positive, i is in the most appropriate cluster. If s(i) is negative, i would be better placed in the next most similar cluster. If s(i) is zero, i fits its current and next most similar clusters equally well. By averaging the silhouette values of all elements we can gauge the overall quality of a clustering configuration, and use this to search for the optimal value of k. Figure 2 presents the mean quality of configurations for various k on our dataset: the peak at k=15 indicates that 15 is the optimal number of clusters, not the k=25 assumed in the initial application.
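The silhouette formula translates directly into code. A minimal Python sketch – the function names are illustrative, not from DSTIL’s implementation, and `distance` stands for whatever metric was used for the clustering:

```python
def silhouette(element, own_cluster, other_clusters, distance):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    neighbours = [e for e in own_cluster if e is not element]
    if not neighbours:
        return 0.0  # convention: a singleton cluster scores zero
    a = sum(distance(element, e) for e in neighbours) / len(neighbours)
    # b(i): the smallest mean distance to any other cluster.
    b = min(sum(distance(element, e) for e in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)

def mean_silhouette(clusters, distance):
    """Mean silhouette over every element: a single quality score for a
    whole configuration, usable when searching for the optimal k."""
    scores = [silhouette(e, c, [o for o in clusters if o is not c], distance)
              for c in clusters for e in c]
    return sum(scores) / len(scores)

# Cohesive, distinct clusters score close to 1; a bad split scores negatively.
dist = lambda x, y: abs(x - y)
mean_silhouette([[1.0, 1.1, 1.2], [9.0, 9.1, 9.2]], dist)  # → ~0.98
mean_silhouette([[1.0, 9.0], [1.1, 9.1]], dist)            # → ~-0.49
```

Clustering for each candidate k and plotting `mean_silhouette` per configuration reproduces the kind of search shown in Figure 2.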
Fig. 2: Configuration silhouette values for various k when clustering the commit messages of DSTIL’s Wicketkeeper project using the k-medoids algorithm.
For our example dataset the k-medoids algorithm is ideal, but whether it is appropriate for your use case may depend on a number of factors. The algorithm has the following advantages, disadvantages and quirks:

- It requires only a pairwise distance function, so elements never need to be embedded in a numeric vector space – ideal for small, specialised textual datasets.
- Its medoids are real elements of the dataset, so each cluster comes with a readily interpretable exemplar.
- Because medoids are actual elements rather than computed averages, the algorithm is less sensitive to outliers than k-means.
- The number of clusters k must be chosen in advance, and poor choices must be diagnosed with techniques such as the silhouette method.
- It is computationally expensive – the classic PAM formulation costs O(k(n − k)²) per iteration – which limits it to smaller datasets.
- The result depends on the initial choice of medoids, so different runs can produce different clusterings; running the algorithm several times and keeping the best-scoring configuration is a common remedy.
If the disadvantages of k-medoids are too severe for your use case, consider the similar k-medians and medoidshift algorithms.
Like our project management example, in which a dataset of specialised language grows only slowly over time, some datasets will forever remain small and continue to be difficult to transform into more flexible representations. When faced with such problems, consider the k-medoids algorithm. As this post has demonstrated with the commit-message clustering example, the algorithm can provide a surprising degree of insight into challenging datasets. k-medoids has been our first step to taking the wild out of the wild west of project management data.
Header image courtesy of Enriko Polger.
Thanks to Nicola Pastorello and Allan Jones for proofreading and providing suggestions.