a simple Hidden Markov Model toy
An HMM implemented from scratch for a simplified 5′ splice-site recognition problem (a decoding problem).
This is my implementation of the problem described in "What is a hidden Markov model?" by Sean Eddy (2004), published in Nature Biotechnology [1], in an attempt to understand Hidden Markov Models and the Viterbi algorithm.
Some biology:
Splicing is a process in which introns are removed from the pre-mRNA and exons are joined together to form the mature mRNA (when forming a functional product of a gene, exons=good & introns=bad: the molecular machinery gets rid of the introns and concatenates the exons as the blueprint for the exact protein sequence to be translated). The 5' splice site is the boundary between an exon and an intron at the 5' end of the intron. It is a sequence of nucleotides that signals the spliceosome (in eukaryotes) to cut the RNA at that location.
(Splicing overview image from khanacademy.org)
This 5' splice-site recognition problem is a simplified version of identifying the boundaries of exons and introns in a gene: the task is to predict the location of the 5' splice site in a sequence of nucleotides.
The computational model:
A Hidden Markov Model (HMM) is a probabilistic model that describes a sequence of observable events generated by a sequence of hidden states. It is widely used in bioinformatics for sequence analysis. In this problem, the observable events are the data we can actually access (the nucleotides in the sequence: A, C, G, T) and the hidden states are the states that generate them (exon, intron, 5' splice site).
Figure 1: "A toy HMM for 5′ splice site recognition" [1]. These parameters will be used in the implementation.
We start by defining the HMM terms for this problem:
- Hidden states: $S=\set{S_E, S_5, S_I}$, the set of hidden states, where $E$ is the exon state, $5$ is the 5' splice-site state, and $I$ is the intron state.
- Observables: $O=\set{O_A, O_C, O_G, O_T}$, the set of observable symbols, in this case the four nucleotides.
- Initial probabilities: $\pi = \set{\pi_E, \pi_5, \pi_I}=\set{1,0,0}$, the probabilities of being in each state at $time=0$. In this problem the nucleotide sequence always starts with an exon; note that the initial probabilities must sum to 1.
- Transition probabilities: $A=\set{a_{S_i \rightarrow S_j}}$, the matrix of probabilities of transitioning from state $S_i$ to state $S_j$. In this problem, the transitions between the 3 states $S_E$, $S_5$, $S_I$ form the following $3*3$ matrix:

| from \ to | $E$ | $5$ | $I$ |
|-----------|-----|-----|-----|
| $E$ | 0.9 | 0.1 | 0 |
| $5$ | 0 | 0 | 1 |
| $I$ | 0 | 0 | 1 |
Note: instead of a 0.1 probability of ending after an intron, this was edited so that an intron is followed by an intron with probability 1, and the model terminates as soon as the sequence of observables (nucleotides) finishes.
- Emission probabilities: $E=\set{e_{(o/s)}}$, the matrix of probabilities of emitting an observable $o$ from a hidden state $s$; in other words, the probability of observing a nucleotide given an exon/5'/intron hidden state. In this problem, the emission probabilities form the following $3*4$ matrix:

| state \ nt | A | C | G | T |
|------------|---|---|---|---|
| $E$ | 0.25 | 0.25 | 0.25 | 0.25 |
| $5$ | 0.05 | 0 | 0.95 | 0 |
| $I$ | 0.4 | 0.1 | 0.1 | 0.4 |
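For concreteness, here is a minimal sketch of these parameters as numpy arrays (the state order E, 5, I and the nucleotide order A, C, G, T are assumptions of this sketch; HMM.py may organize them differently):

```python
import numpy as np

# State order: E (exon), 5 (splice site), I (intron)
pi = np.array([1.0, 0.0, 0.0])  # the sequence always starts in an exon

# A[i, j] = P(state j at time t | state i at time t-1)
A = np.array([
    [0.9, 0.1, 0.0],  # E -> E or 5
    [0.0, 0.0, 1.0],  # 5 -> always followed by an intron
    [0.0, 0.0, 1.0],  # I -> stays an intron until the sequence ends
])

# E[s, o] = P(nucleotide o | state s), nucleotide order A, C, G, T
E = np.array([
    [0.25, 0.25, 0.25, 0.25],  # exon: uniform
    [0.05, 0.00, 0.95, 0.00],  # 5' splice site: almost always G
    [0.40, 0.10, 0.10, 0.40],  # intron: A/T rich
])
```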
A sequence of observables (nucleotides here) of length $T$ can be viewed as a lattice where each column represents a time step and each row represents a hidden state; each node is thus a hidden state at a given time step, and each edge a transition between hidden states. The probability at each node can be calculated as the product of the emission probability at that node and the transition probability from the previous state.
The purpose of the HMM, as in the described problem, is to predict the most likely sequence of hidden states given a sequence of observables. In other words, given a sequence of nucleotides, the HMM predicts the most likely sequence of exon, intron, and 5' splice-site states, which identifies the 5' splice site in the sequence.
This is a classic problem addressed by HMMs, called the decoding problem. To solve it, the Viterbi algorithm is typically used.
What's the Viterbi algorithm?
It's a dynamic programming algorithm that finds the most likely sequence of hidden states behind the observed sequence. It does so by calculating, at each time step, the probability of the most likely path that ends in each state.
The DP algorithm is based on the principle of optimality: the best path to a state at time $t$ must extend the best path to one of the states at time $t-1$, so it suffices to keep, for each state and each time step, only the best path found so far.
This problem is a very particular case of the decoding problem in which some transitions between states are forbidden, e.g., a 5' splice site cannot be followed by an exon. This simplifies the problem and makes it easier to solve, forming a sparse lattice.
This also affects the initiation and termination steps of the algorithm.
In any case, the algorithm's implementation in HMM.py is able to solve the general decoding problem as well as the special case of the 5' splice-site recognition problem.
Other problems solved by HMMs include the likelihood (evaluation) and learning problems. The likelihood problem is calculating the probability of observing a sequence of observables given the model parameters; it is solved with the forward (or backward) algorithm. The learning problem is estimating the model parameters from a sequence of observables, typically with the Baum-Welch (forward-backward) algorithm.
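For illustration, here is a minimal sketch of the forward algorithm for the likelihood problem, reusing the `pi`, `A`, `E` arrays defined above (`forward_likelihood` is a hypothetical name, not part of this repo):

```python
import numpy as np

def forward_likelihood(obs, pi, A, E):
    """P(observation sequence | model), summing over all hidden-state paths."""
    alpha = pi * E[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * E[:, o]  # induction: sum over previous states
    return alpha.sum()                 # termination
```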
The algorithm is implemented in HMM.py and tested in main.py. The implementation is based on the Viterbi algorithm and the HMM parameters described above, and it is tested on a simple example (the same one as in the figure): given a sequence of nucleotides, the expected output is the most likely sequence of hidden states that generated the observed sequence.
A model is defined by:
- number of hidden states
- number of observable states
- initial probabilities $\pi$
- transition probabilities $A$
- emission probabilities $E$

i.e., by $\lambda = (A, E, \pi)$, the parameters of the model.
The Viterbi algorithm is run given $\lambda$ and a sequence of observables, as per the following pseudocode:
- $A, E, \pi \leftarrow \lambda$
- $T \leftarrow$ length of $O$
- $N \leftarrow$ number of hidden states (length of $A$)
- Initialize $\delta$ and $\psi$ matrices of size $N*T$
- Initialization: $\delta_{1,i} = \pi_i * e_{(O_1/S_i)}$ for $i=1,2,...,N$
- for $t=2$ to $T$:
  - for $j=1$ to $N$:
    - $\delta_{t,j} = \max_{i=1}^{N}(\delta_{t-1,i} * a_{S_i \rightarrow S_j} * e_{(O_t/S_j)})$
    - $\psi_{t,j} = \operatorname{argmax}_{i=1}^{N}(\delta_{t-1,i} * a_{S_i \rightarrow S_j} * e_{(O_t/S_j)})$
- Backtrack to find the most likely sequence of hidden states: $q_T = \operatorname{argmax}_{i}\ \delta_{T,i}$, then $q_t = \psi_{t+1,\,q_{t+1}}$ for $t = T-1, ..., 1$
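A minimal, self-contained sketch of this pseudocode in Python/numpy follows; it implements the recurrence above but is not necessarily the exact interface of HMM.py:

```python
import numpy as np

def viterbi(obs, pi, A, E):
    """Return the most likely hidden-state path for a sequence of observations.

    obs: observation indices (e.g., A=0, C=1, G=2, T=3)
    pi, A, E: initial, transition and emission probabilities
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, j]: best path probability ending in state j
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: backpointer to the best previous state

    delta[0] = pi * E[:, obs[0]]       # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[i, j] = delta[t-1, i] * a_{i -> j}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * E[:, obs[t]]

    # Backtrack from the most likely final state
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

Decoding the example sequence from the figure with the toy parameters defined earlier should reproduce the example output shown below:

```python
nt = {c: i for i, c in enumerate("ACGT")}
seq = "CTTCATGTGAAGCAGACGTAAGTCA"
path = viterbi([nt[c] for c in seq], pi, A, E)
print("".join("E5I"[s] for s in path))  # EEEEEEEEEEEEEEEEE5IIIIIII
```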
The memoization of the $\delta$ probabilities can be laid out over this lattice, with one column per observed nucleotide and one row per hidden state. For the first four nucleotides of the example sequence (CTTC), the recurrence gives:

| | C | T | T | C |
|---|---|---|---|---|
| $E$ | 0.25 | 0.056 | 0.013 | 0.0028 |
| $5$ | 0 | 0 | 0 | 0 |
| $I$ | 0 | 0 | 0 | 0 |

(the $5$ and $I$ rows are all zero here because the 5' state cannot emit C or T, and an intron can only follow a 5' splice site).
Running the example:

```
python src/main.py
```

Output:

```
EEEEEEEEEEEEEEEEE5IIIIIII
CTTCATGTGAAGCAGACGTAAGTCA
```
P.S. It is recommended to apply a log transformation to the probabilities to avoid underflow when dealing with very small probabilities; mind that the product of probabilities is then transformed into a sum of log probabilities.
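For instance, the same recurrence in log space would look like this (a sketch; `viterbi_log` is a hypothetical name mirroring the `viterbi` sketch above):

```python
import numpy as np

def viterbi_log(obs, pi, A, E):
    """Viterbi in log space: products of probabilities become sums of logs."""
    with np.errstate(divide="ignore"):  # log(0) -> -inf, which max() handles fine
        log_pi, log_A, log_E = np.log(pi), np.log(A), np.log(E)
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_E[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # sums instead of products
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_E[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```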
[1] Eddy, S. R. (2004). What is a hidden Markov model? Nature Biotechnology, 22(10), 1315-1316.