Like many machine learning techniques, the Hidden Markov Model (or HMM) is often used as a simple black-box. However, within this black box lies a powerful modelling and predicting tool, which can perform both complex classification and clustering tasks on sequential data. In the first part of this post, we will push our understanding of the HMM a bit further. We will first review what kind of problems can be solved by a HMM, before looking at what is required to build such a model. Finally we will dive a bit deeper into this black-box by setting up our own HMM.
HMMs are a full probabilistic model which have been applied to a wide range of sequence based problems such as financial forecasting via time series [1], text prediction from a sequence of characters [2] and gene annotation in Biology [3]. The main idea behind the HMM formulation is that the evolution of a system can be described as a sequence of states which, although not directly observable, can be linked to a specific set of observable variables.
Let’s use a concrete example to illustrate this definition. Imagine that I have two possible states: I am either Hungry or Thirsty. At a restaurant, a distant observer can hardly tell if I am Hungry or Thirsty, as these states are hidden and only known to myself. However if this observer sees me drinking a Soda, or eating a Burger then, by using these two possible observations, this person can now estimate if I am more likely to be Hungry or Thirsty – see Figure 1.
Fig. 1: A sequence of observations and the possible corresponding hidden states
To apply this modelling technique to a specific system, there are a couple of requirements: (1) this system can be described with a finite number of hidden states and (2), the sequence of observed data must form a stationary Markov process.
Let’s break down the previous statement: here Markov process implies that the current state of the system only depends on its previous state and that the probability to switch between states is time invariant [4]. Now let’s say my current state is Thirsty, this state directly depends on my previous state: if I just had a salty Burger because I was Hungry, now I am likely Thirsty. Similarly, if I just drink plenty of Soda because I was Thirsty, then I am now more likely to be Hungry rather than Thirsty. It is therefore a Markov process.
Stationary means that the distribution from which the sequential data is generated does not evolve with time [5]. Using our example, it means that if during lunch at a time t=0, there were only three possible observations (i.e. three possible items on the menu: Burger, Soda and Ice-cream) then, at t=1, the menu is still the same. The distribution of observations is therefore stationary.
To define a basic HMM, three quantities need to be specified [6]:
Let assume that, initially (t=0), the probability to be Hungry is 0.4 while the probability to be Thirsty is 0.6. The initial probability vector, IN, is therefore given by:
Hungry | Thirsty |
0.4 | 0.6 |
Similarly, we can set up the transition probability matrix, TR:
at t=0 \ at t=1 | Hungry | Thirsty |
Hungry | 0.3 | 0.7 |
Thirsty | 0.8 | 0.2 |
With this transition probability matrix, we now know that the probability of becoming Hungry at a time t=1 is either 0.3 if our initial state at t=0 was Hungry or, 0.8 if we were initially Thirsty. Similarly, we now know the probability of becoming Thirsty given our two possible initial states. Finally the emission probability matrix, EM, can be expressed as:
Burger | Ice-cream | Soda | |
Hungry | 0.6 | 0.3 | 0.1 |
Thirsty | 0.05 | 0.4 | 0.55 |
This matrix provides the probability to get each specific observation given our current state; for example, if I am Thirsty, I am more likely to get a Soda with a corresponding probability of 0.55, rather than a Burger which only has a probability of 0.05. Figure 2 summarises the HMM we just defined:
Fig. 2: The HMM, a graphic probabilistic model
Given the graphical probabilistic model we just established for our lunch, there are three typical questions that can be now investigated. Let’s imagine we have a big fattening lunch leading to this sequence of observations: Burger — Ice-cream — Soda or (B-I-S). One can wonder:
a) How likely is it to start lunch with a Burger, then have an Ice-cream and then finish with a Soda? In other words, what is the likelihood to get B-I-S as a sequence of observations ? This question is solved by applying the Forward algorithm to the HMM.
b) After eating a Burger, an Ice-cream and finishing my lunch with a Soda, what is my current state? Was I more likely to have a Soda because I was Thirsty or Hungry? Here we are interested in decoding the hidden state from the sequence of observations (B-I-S), which is typically performed using the Viterbi algorithm.
c) Given the two possible states of our system, the three possible observations, and an example of observations (our big fattening lunch), which transition and emission matrices best describes the lunch we just had? Here we want to adjust all matrices to obtain the best HMM possible, which would maximise the probability of predicting the sequence of observations. This unsupervised learning task can be achieved by applying the Baum-Welch algorithm.
In the part 2 of this post, we will push our understanding of the HMM a bit further by looking at the mathematics used in the Forward and Viterbi algorithm and then discussing how to implement our example in R.
Time for a Burger.
The next post in this series can be found here.
Header image courtesy of Javier Molina.
Thanks to Stan Kowalczyk and Rhys Hill for reviewing this post.