The Melbourne Datathon 2016 is a datathon (i.e. a hackathon for data scientists) organised by Data Science Melbourne. The second edition of this event started with an introductory night on the 21st of April, where a sneak peek of the dataset was distributed, and concluded at the NAB Arena on the 6th of May with the award ceremony. In between, on the 23rd of April, a full hack day was held at the amazing Telstra Gurrowa Innovation Lab in Melbourne’s CBD.
This was the very first datathon I experienced as a contestant, and what follows is a brief overview of the whole challenge.
The dataset we had to extract some juice from was provided by Seek, the job recruitment website/company, and it was built from 5 different tables linking job ads with searches and various ad visualisation controls/statistics (e.g. geographical location, number of clicks, …).
The main datathon challenge consisted of extracting interesting/cool insights from the data provided and producing a slide deck with the findings. The top 5 teams had the chance to present their results to a panel of professionals in the field.
The second part of the datathon was structured as a Kaggle competition: classify job ads as belonging to the “Hospitality&Tourism” class or not, using both the provided dataset and external sources.
Our team (“Melbourne Patos”) was composed of 6 people with a wide range of skills, from data analysis to machine learning, from UX to database management. We split into three subgroups. Alberto and Marco mostly worked on massaging and crunching the dataset, in order to provide the team with clean and easily usable data. Elisa and Niroshinie extracted correlations among the different variables and ran the data analysis, which eventually produced the results included in the final presentation. Most of this work was done with R and Python.
One of the presented outcomes was the geographical distribution of supply and demand for different categories of jobs, which could be useful to Seek in optimising job ad targeting according to the demand in a particular geographical location.
Meanwhile, Felipe and I worked on the Kaggle competition, trying to predict the job class from the text in the job ads.
In one of several approaches, we trained a word2vec model on the body text of the ads, after removing stop words and common unigrams. This word2vec representation was then used to train a two-way LSTM-based recurrent neural network that, unfortunately, took too long to train on the whole dataset without a GPU. In the few days of the competition we managed to train for up to 10 epochs, reaching a classification accuracy of around 98% on the test dataset.
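As a rough illustration of the cleaning step described above, here is a minimal sketch in plain Python. It is not the code we ran: the stop-word list, the sample ads and the `max_doc_fraction` threshold for what counts as a "common" unigram are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "as", "the", "for", "to", "of", "in", "we", "are", "is", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def clean_corpus(ads, max_doc_fraction=0.8):
    """Remove stop words, then remove unigrams found in too many ads.

    A unigram is treated as "common" when it appears in more than
    max_doc_fraction of the ads (an assumed threshold).
    """
    tokenized = [[t for t in tokenize(ad) if t not in STOP_WORDS] for ad in ads]

    # Document frequency: in how many ads does each token appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    limit = max_doc_fraction * len(ads)
    too_common = {token for token, count in doc_freq.items() if count > limit}
    return [[t for t in tokens if t not in too_common] for tokens in tokenized]


# Made-up job ads, only for demonstration.
ads = [
    "We are looking for a chef to join the team",
    "Join the team as a software engineer",
    "Barista wanted to join the cafe team",
]
cleaned = clean_corpus(ads)
for tokens in cleaned:
    print(tokens)
```

Token lists like these are the kind of input a word2vec implementation expects; the words "join" and "team" disappear here because they occur in every sample ad and so carry no information about the job class.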
Separately, we used a number of supervised ML classifiers in R, with the tf-idf of words in the job ads and title classes as predictive features. From 50k input jobs, we extracted the 277 most representative job classes that we used as inputs for the classifiers. The best of these classifiers was xgboost, which ranked us in the top 20% of the competition ladder with a Gini score (the adopted evaluation metric) of 0.966. With a bit more time, we could have added further dimensions to the training dataset (e.g. ad geographical location, salary and a larger number of words from the job ad text) to achieve a top ranking, but overall we were quite happy with the end result.
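For readers unfamiliar with the metric, here is a small sketch of how a Gini score can be computed for a binary classifier. It assumes the competition used the normalised Gini coefficient, which for a binary target equals 2 × AUC − 1; I am not reproducing the official scoring code, and the labels and scores below are invented.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    It is the probability that a randomly chosen positive example gets a
    higher score than a randomly chosen negative one (ties count as half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Normalised Gini coefficient for a binary target: 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


# Invented example: 1 = "Hospitality&Tourism", 0 = any other class.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.8, 0.1, 0.6]
print(round(gini(labels, scores), 4))
```

Under this assumption a perfect ranking scores 1 and a random one scores about 0, so a Gini of 0.966 would correspond to an AUC of 0.983. The pairwise loop is quadratic and only suitable for small examples.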
We learned quite a lot from the experience. In particular, the impressive presentation by the winners of the open-ended competition, the “Dirty Dataing” team, showed us that non-standard data science approaches are sometimes more effective in exploring the structure of the data. The three Melbourne University students on this team modeled the relations between job ads and searches as a network, with the user clicks as the links between the nodes.
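To give an idea of what such a network looks like in code, here is a minimal sketch of a graph with searches and job ads as nodes and clicks as links. The click log, the identifiers and the `related_ads` helper are hypothetical; I don't know how the winning team actually built or analysed their network.

```python
from collections import defaultdict

# Hypothetical click log: each pair means "a user ran this search and clicked this ad".
clicks = [
    ("s1", "ad1"), ("s1", "ad2"),
    ("s2", "ad2"), ("s2", "ad3"),
    ("s3", "ad3"),
]


def build_graph(clicks):
    """Build both directions of the search/ad graph as adjacency sets."""
    search_to_ads = defaultdict(set)
    ad_to_searches = defaultdict(set)
    for search_id, ad_id in clicks:
        search_to_ads[search_id].add(ad_id)
        ad_to_searches[ad_id].add(search_id)
    return search_to_ads, ad_to_searches


def related_ads(ad_id, search_to_ads, ad_to_searches):
    """Return the ads that share at least one search with ad_id.

    The value for each related ad is the number of searches they share.
    """
    shared = defaultdict(int)
    for search_id in ad_to_searches.get(ad_id, ()):
        for other in search_to_ads[search_id]:
            if other != ad_id:
                shared[other] += 1
    return dict(shared)


search_to_ads, ad_to_searches = build_graph(clicks)
neighbours = related_ads("ad2", search_to_ads, ad_to_searches)
print(neighbours)
```

Once the data is in this shape, questions such as "which ads are reached from the same searches?" become simple graph traversals rather than table joins.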
Another non-standard approach was adopted by the two Kaggle competition winners, who considered a very large parameter space for their initial input dataset. Specifically, they described each job ad according to the unigrams and bigrams in its text, its location, a number of salary statistics (e.g. average, standard deviation, different percentiles), the number of searches and impressions and, most importantly, the business field of the ad. The latter was extracted by cross-matching the competition data with the external register of Australian Business Numbers (ABNs). After reducing the dimensionality of their input to a few hundred parameters with PCA and FA, they trained their xgboost model on the most performant AWS CPU-based virtual machine for a few days. Their best Gini score was above 0.99.
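Two of the feature families listed above, n-grams and salary statistics, are easy to sketch with the Python standard library. This is only an illustration under my own assumptions: the sample text, the salary values and the choice of quartiles as "percentiles" are invented, and the ABN matching, PCA/FA and xgboost steps are not covered.

```python
import re
import statistics
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams (n=1 for unigrams, n=2 for bigrams) in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def salary_features(salaries):
    """Summarise a list of salaries with mean, standard deviation and quartiles.

    Needs at least two values; quartiles use the default method of
    statistics.quantiles (Python 3.8+).
    """
    p25, p50, p75 = statistics.quantiles(salaries, n=4)
    return {
        "mean": statistics.mean(salaries),
        "stdev": statistics.stdev(salaries),  # sample standard deviation
        "p25": p25,
        "p50": p50,
        "p75": p75,
    }


# Invented job ad text and salary values, only for demonstration.
text = "Experienced chef wanted, experienced chef preferred"
unigrams = ngram_counts(text, 1)
bigrams = ngram_counts(text, 2)
salary_stats = salary_features([50000, 60000, 70000, 80000, 90000])

print(bigrams.most_common(1))
print(salary_stats["mean"], salary_stats["p50"])
```

Concatenating counts like these over the whole vocabulary is what makes the parameter space so large, which is why a dimensionality reduction step is needed before training.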
We entered the Melbourne Datathon for the first time with the intention of getting our feet wet and learning about these kinds of competitions (especially regarding the importance of team workload distribution, effective analysis strategies and sharing practices). Incidentally, we managed to rank pretty well in the competition and we can’t wait to compete in the next edition, taking advantage of the experience we accumulated during those 10 days. We are very confident that we’ll be much tougher competition for other Melbourne data scientists at the next Datathon. See you next year, folks!