Go speak to a six-year-old child and ask if they can recognise a triangle versus a circle, or distinguish the colour blue from the colour red. Chances are, they can tell you. Now ask them how many examples of triangles, circles, blue and red objects they needed to see before they learnt that red is red and blue is blue. If they could tell you, what would that number be? Skip a few years and you’re speaking to a teenager. You show them, for example, a marathon runner, like the one shown below in Figure 1. You tell them that the numbered piece of paper the runner wears on their torso is known as a racing bib. How many bibs would they need to see before they can recognise a bib in another image? My bet is one or two. So why must we provide thousands upon thousands of training images to teach a neural network, even with current state-of-the-art machine learning algorithms? Why can’t it learn from just two or three examples?

Figure 1. A sample marathon runner wearing her racing bib.

Training is expensive, both computationally and in human effort. Beyond the time and resources needed to train a neural network, we typically need thousands of training images in the domain of image detection. Sourcing that data is the first problem, and machines cannot do it automatically (yet…). Next, we need humans to label or annotate those images: for example, what are the x and y coordinates of the bib in the image above? We rely on people to enter this input manually, using services like Amazon’s MTurk or ScaleAPI, and funding these can become expensive. So, how can we minimise this expense? What is the minimum amount of labelled data needed to learn a new feature? Transfer learning[1–3] adapts pre-existing neural networks by extending them to understand a new feature or label; using this to our advantage, the minimum amount may be less than you might think.

Let’s presume you start off with no training images. Could existing models already pick up the bib at all? If so, our task is pointless, as the job is already done for us. For example, existing models like Tiny YOLO[4] are already able to pick up basic features like humans. So if such a network can detect humans, maybe it can detect bibs already? To validate such a claim, we would need to source a sample set of random marathon photos, and the desired output would look something like the image in Figure 2. In reality, if we validate the output, performance is likely to be poor: bibs do look rather different to humans!

Figure 2. Could a bib be detected on a pre-existing neural network?

This raises the question: what are the steps to take if the existing network can only find a few bibs, or none at all? Typically, we have a workflow (represented as a state diagram in Figure 3) that comprises these five stages:

1. find some images with the feature you have in mind, and label its coordinates,
2. use augmentation to multiply training examples of the raw feature,
3. use transfer learning to update a pre-existing network,
4. validate the performance of your new network’s layer, and optionally
5. increase the training pool, annotating more images as needed, and repeat until the network has converged.

So, what happens when we try this with just one image? Obviously, we would expect poor performance, but in the area of transfer learning, benchmarking the minimum amount of training data is still an area of active research with few strong guidelines. Starting with just one image therefore gives us an indication of the minimum amount of data needed to train the network. From this, we have a benchmark to keep improving on: once adding more training data to the pipeline no longer produces any further performance increase, the network has converged. Thus, when we increment the training sample by n images, what improvement is there after training the original existing network N again to produce a new network N′? And, more importantly, is N′ fully converged?

Figure 3. State diagram of our human-in-the-loop training. (Human-required states highlighted.) When performance is consistently too low and the network is not yet converged, increment the training size pool and annotate as needed until the network has converged.
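The loop in Figure 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `evaluate` is a hypothetical stand-in for "annotate n images, augment, run transfer learning and score the resulting network N′", and the `min_gain` threshold used to decide convergence is an arbitrary choice, not a value from our experiments.

```python
from typing import Callable, List, Tuple


def grow_until_converged(
    evaluate: Callable[[int], float],
    sizes: List[int],
    min_gain: float = 0.05,
) -> Tuple[int, float]:
    """Return the last training size whose extra data still paid off.

    `evaluate(n)` is a hypothetical callable that trains on n annotated
    images and returns a performance score for the resulting network.
    """
    best_size = sizes[0]
    best_score = evaluate(best_size)
    for size in sizes[1:]:
        score = evaluate(size)
        if score - best_score < min_gain:
            # More data no longer buys a meaningful gain: treat as converged.
            return best_size, best_score
        best_size, best_score = size, score
    # Ran out of candidate sizes before the gain dropped below min_gain.
    return best_size, best_score


# Made-up scores, purely to exercise the loop.
fake_scores = {10: 0.30, 50: 0.55, 100: 0.58}
result = grow_until_converged(lambda n: fake_scores[n], [10, 50, 100])
print(result)  # (50, 0.55)
```

In practice the expensive part is hidden inside `evaluate`, which is exactly where the human-required states of Figure 3 sit.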

This is a problem we recently tackled, and it gave us further insight. Using the bib example and 803 marathon running images, we found that it takes about 55 seconds to annotate a given marathon photo using an annotation tool we developed. In total, our annotators produced 722 annotations.

We then augmented these images 50 times (with a mix of affine transformations, colour channel distortion, blurring, translations, rotations and shearing), producing 40,150 images and roughly 34,000 annotations. From this pool of annotations, we tested transfer learning on an existing neural network: Faster R-CNN (F-RCNN). We split the pool three times into smaller subsets of randomly sampled training images and then ran training on these subsets (and the original dataset) to produce four models.
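When an image is augmented geometrically, its bib annotation has to move with it. The sketch below is a minimal, assumed version of that bookkeeping for one bounding box, not the pipeline we actually ran: it rotates and scales about the image centre and then translates. The function names and the parameter ranges are hypothetical, and the pixel-level transform (and the colour and blur augmentations) are left out.

```python
import math
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def transform_box(box: Box, width: int, height: int, angle_deg: float,
                  scale: float, dx: float, dy: float) -> Optional[Box]:
    """Rotate and scale a box about the image centre, then translate it.

    Returns the axis-aligned box enclosing the transformed corners, clipped
    to the image, or None if nothing of the box remains inside the image.
    """
    cx, cy = width / 2.0, height / 2.0
    cos_a = math.cos(math.radians(angle_deg)) * scale
    sin_a = math.sin(math.radians(angle_deg)) * scale
    x0, y0, x1, y1 = box
    xs, ys = [], []
    for x, y in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]:
        xs.append(cos_a * (x - cx) - sin_a * (y - cy) + cx + dx)
        ys.append(sin_a * (x - cx) + cos_a * (y - cy) + cy + dy)
    nx0, ny0 = max(min(xs), 0.0), max(min(ys), 0.0)
    nx1, ny1 = min(max(xs), float(width)), min(max(ys), float(height))
    if nx0 >= nx1 or ny0 >= ny1:
        return None  # the annotation fell outside the augmented image
    return (nx0, ny0, nx1, ny1)


def augment_box(box: Box, width: int, height: int,
                copies: int = 50, seed: int = 0) -> List[Optional[Box]]:
    """Produce `copies` randomly transformed versions of one annotation."""
    rng = random.Random(seed)
    return [
        transform_box(
            box, width, height,
            angle_deg=rng.uniform(-15, 15),     # assumed rotation range
            scale=rng.uniform(0.8, 1.2),        # assumed scale range
            dx=rng.uniform(-0.1, 0.1) * width,  # assumed shift range
            dy=rng.uniform(-0.1, 0.1) * height,
        )
        for _ in range(copies)
    ]


boxes = augment_box((40.0, 60.0, 90.0, 100.0), width=200, height=200)
print(len(boxes), "augmented annotations from one labelled bib")
```

Each labelled image yields `copies` training examples, which is how 803 images become 803 × 50 = 40,150.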

The next stage was to test the performance of each model. We ran inference on a different annotated dataset of marathon races four times using the four different models. Each image had a set of ground truth bibs, T, and a set of estimates inferred, E. To compare how the various models performed, we developed a performance metric, p, as the following:
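The sketch below is not the definition of p used in these experiments. It illustrates one common way to score a set of estimates E against a set of ground truths T: count a ground-truth box as detected when an unused estimate overlaps it with an intersection-over-union (IoU) above a threshold, and report the detected fraction. The function names and the 0.5 threshold are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detected_fraction(truth: List[Box], estimates: List[Box],
                      threshold: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by a distinct estimate."""
    unused = list(estimates)
    hits = 0
    for t in truth:
        # Greedily pick the remaining estimate that overlaps this box most.
        best = max(unused, key=lambda e: iou(t, e), default=None)
        if best is not None and iou(t, best) >= threshold:
            hits += 1
            unused.remove(best)  # each estimate may match only one bib
    return hits / len(truth) if truth else 0.0


T = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
E = [(1.0, 1.0, 10.0, 10.0)]
print(detected_fraction(T, E))  # 0.5
```

A score of this kind ignores false positives, so it would normally be read alongside a precision-style measure.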

The results of our inference experiments are shown below.

| Model | Training Images | p    | Δp    |
|-------|-----------------|------|-------|
| A     | 1               | 0.25 | N/A   |
| B     | 100             | 0.61 | +0.36 |
| C     | 500             | 0.68 | +0.07 |
| D     | 722             | 0.74 | +0.06 |

We can see that the gains in performance for models C and D are not significant: only small increments of 0.07 and 0.06 in performance occur for an additional 400 and 222 images annotated, respectively.

Thus, at 55 seconds per image to annotate, using model D requires an extra 9.5 hours’ worth of annotation for an improvement of only 0.13 in p (13 percentage points) over model B. The trade-off here is accuracy versus prototyping efficiency: does a 0.13 improvement justify the added cost of 9.5 hours’ worth of annotation labour, given that more than half of bibs are already detected with only 100 training images (model B)? The answer would differ on a case-by-case basis. In ours, the improvement wasn’t justified, as our work (in a prototyping stage) was just to confirm whether F-RCNN would be suitable to piggy-back on.
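The arithmetic behind that trade-off is short enough to check directly. The snippet below uses only the figures reported above (55 seconds per image, and the training sizes and p values for models B and D); the helper name `annotation_hours` is ours, for illustration.

```python
def annotation_hours(extra_images: int, seconds_per_image: float = 55.0) -> float:
    """Hours of labelling needed for `extra_images` at a fixed per-image cost."""
    return extra_images * seconds_per_image / 3600.0


# (training images, p) for models B and D, from the results table.
images_b, p_b = 100, 0.61
images_d, p_d = 722, 0.74

extra_images = images_d - images_b
extra_hours = annotation_hours(extra_images)
gain = p_d - p_b

print(f"{extra_images} extra images, {extra_hours:.1f} hours, gain in p {gain:.2f}")
# prints: 622 extra images, 9.5 hours, gain in p 0.13
```

Swapping in your own per-image annotation time and scores gives a quick estimate of what each further increment of performance would cost.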

What did we learn? It’s not always necessary to train with a massive amount of data. A quarter of all bibs are detected when we augment a single image 50 times (model A). Thus, augmentation is not only powerful, but can also help you see whether a particular network is viable for transfer learning. Question the amount of data you need for training: you might be surprised at how little you could get away with.

Can we generalise this approach to other techniques? In this post, we reported our training and inference on a convolutional neural network (F-RCNN). Extensions to this experiment could apply similar experiments to Bayesian optimisation classifiers, or investigate the applicability of transfer learning to capsule networks, ultimately as an adaptive experiment design for rapid machine learning prototyping.

## References

[1] J. Baxter, “A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling,” Machine Learning, vol. 28, no. 1, pp. 7–39, 1997.

[2] R. Caruana, “Multitask Learning,” Machine Learning, 1997.

[3] S. Thrun, “Is learning the n-th thing any easier than learning the first?,” Advances in neural information processing systems, 1996.

[4] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” CVPR, 2016.

Cover image courtesy of Patrick Tomasso.

Thanks to Rodney Pilgrim and Rhys Hill for proofreading and providing suggestions.