State Space Models and the Kalman Filter

To date in our time series analysis posts we have considered linear time series models including ARMA, ARIMA as well as the GARCH model for conditional heteroskedasticity. In this article we are going to consider the theoretical basis of state space models, the primary benefit of which is that their parameters can adapt over time.

State space models are very general and it is possible to put the models we have considered to date into a state space formulation. However, in order to keep the analysis straightforward, it is often better to use the simpler representation.

The general premise of a state space model is that we have a set of states that evolve in time (such as the hedge ratio between two cointegrated pairs of equities), but our observations of these states contain statistical noise (such as market microstructure noise), and hence we are unable to ever directly observe the "true" states.

The goal of the state space model is to infer information about the states, given the observations, as new information arrives. A famous algorithm for carrying out this procedure is the Kalman Filter, which we will also discuss in this article.

The Kalman Filter is ubiquitous in engineering control problems, including guidance & navigation, spacecraft trajectory analysis and manufacturing, but it is also widely used in quantitative finance.

In engineering, for instance, a Kalman Filter will be used to estimate values of the state, which are then used to control the system under study. This introduces a feedback loop, often in real-time.

For an extremely interesting application of Kalman Filtering, one can consider the recent successful attempt of the private space firm, Space Exploration Technologies, to return and land the first-stage of their Falcon 9 rocket back at its original launch site. The first stage booster was subject to an extremely precise dynamic control problem, involving asymmetric time-varying mass (fuel sloshing) at hypersonic through to subsonic velocities:

Perhaps the most common usage of a Kalman Filter in quantitative trading is to update hedging ratios between assets in a statistical arbitrage pairs trade, but the algorithm is much more general than this and we will look at other use cases.

Generally, there are three types of inference that we are interested in when considering state space models:

Prediction - Forecasting subsequent values of the state
Filtering - Estimating the current values of the state from past and current observations
Smoothing - Estimating the past values of the state given the observations

Filtering and smoothing are similar, but not the same. Perhaps the best way to think of the difference is that with smoothing we are really wanting to understand what has happened to states in the past given our current knowledge, whereas with filtering we really want to know what is happening with the state right now.

In this article we are going to discuss the theory of the state space model and how we can use the Kalman Filter to carry out the various types of inference described above. In subsequent articles we will apply the Kalman Filter to trading situations, such as cointegrated pairs, as well as asset price prediction.

We will be making use of a Bayesian approach to the problem, as this is a natural statistical framework for allowing us to readily update our beliefs in light of new information, which is precisely the desired behaviour of the Kalman Filter.

I want to warn you that state-space models and Kalman Filters suffer from an abundance of mathematical notation, even if the conceptual ideas behind them are relatively straightforward. I will try and explain all of this notation in depth, as it can be confusing for those new to engineering control problems or state-space models in general.

Linear State-Space Model

This section follows closely the notation utilised in both Cowpertwait et al^[1] and Pole et al^[2]. I decided it wasn't particularly helpful to invent my own notation for the Kalman Filter, as I want you to be able to relate it to other research papers or texts.

Let's begin by discussing all of the elements of the linear state-space model.

Since the states of the system are time-dependent, we need to subscript them with $t$. We will use $\theta_t$ to represent a column vector of the states.

In a linear state-space model we say that these states are a linear combination of the prior state at time $t-1$ as well as system noise (random variation). In order to simplify the analysis we are going to suggest that this noise is drawn from a multivariate normal distribution, but of course, other distributions can be used.

The linear dependence of $\theta_t$ on the previous state $\theta_{t-1}$ is given by the matrix $G_t$, which can also be time-varying (hence the subscript $t$). The multivariate time-dependent noise is given by $w_t$. The relationship is summarised below in what is often called the state equation:

\begin{eqnarray} \theta_t = G_t \theta_{t-1} + w_t \end{eqnarray}

However, this is only half of the story. We also need to discuss the observations, that is, what we actually see, since the states are hidden to us by system noise.

We can denote the (time-dependent) observations by $y_t$. The observations are a linear combination of the current state and some additional random variation known as measurement noise, also drawn from a multivariate normal distribution.

If we denote the linear dependence matrix of $\theta_t$ on $y_t$ by $F_t$ (also time-dependent), and the measurement noise by $v_t$ we have the observation equation:

\begin{eqnarray} y_t = F^{T}_t \theta_t + v_t \end{eqnarray}

Where $F^{T}$ is the transpose of $F$.

In order to fully specify the model we need to provide the first state $\theta_0$, as well as the variance-covariance matrices for the system noise and measurement noise. These terms are distributed as:

\begin{eqnarray} \theta_0 &\sim& \mathcal{N} (m_0, C_0) \\ v_t &\sim& \mathcal{N} (0, V_t) \\ w_t &\sim& \mathcal{N} (0, W_t) \end{eqnarray}

Clearly that is a lot of notation to specify the model. For completeness, I'll summarise all of the terms here to help you get to grips with the model:

$\theta_t$ - The state of the model at time $t$
$y_t$ - The observation of the model at time $t$
$G_t$ - The state-transition matrix between current and prior states at time $t$ and $t-1$ respectively
$F_t$ - The observation matrix between the current observation and current state at time $t$
$w_t$ - The system noise drawn from a multivariate normal distribution
$v_t$ - The measurement noise drawn from a multivariate normal distribution
$m_0$ - The mean value of the multivariate normal distribution of the initial state, $\theta_0$
$C_0$ - The variance-covariance matrix of the multivariate normal distribution of the initial state, $\theta_0$
$W_t$ - The variance-covariance matrix for the multivariate normal distribution from which the system noise is drawn
$V_t$ - The variance-covariance matrix for the multivariate normal distribution from from which the measurement noise is drawn

Now that we've specified the linear state-space model, we need an algorithm to actually solve it. This is where the Kalman Filter comes in. We can use Bayes' Rule and conjugate priors to help us derive the algorithm.

The Kalman Filter

A Bayesian Approach

If we recall from the article on Bayesian statistics, Bayes' Rule is given by:

\begin{eqnarray} P(H | D) = P(D | H) P(H) / P(D) \end{eqnarray}

Where $H$ refers to our hypothesis or parameters and $D$ refers to our data, evidence or observations.

We want to apply the rule to the idea of updating the probability of seeing a state given all of the previous data we have and our current observation. Once again, we need to introduce more notation!

If we are at time $t$, then we can represent all of the data known about the system by the quantity $D_t$. However, our current observations are given by $y_t$. Thus we can say that $D_t = (D_{t-1}, y_t)$. That is, our current knowledge is a mixture of our previous knowledge plus our most recent observation.

Applying Bayes' Rule to this situation gives the following:

\begin{eqnarray} P(\theta_t | D_{t-1}, y_t) = \frac{P(y_t | \theta_t) P(\theta_t | D_{t-1})}{P(y_t)} \end{eqnarray}

What does this mean? It says that the posterior or updated probability of obtaining a state $\theta_t$, given our current observation $y_t$ and previous data $D_{t-t}$, is equal to the likelihood of seeing an observation $y_t$, given the current state $\theta_t$ multiplied by the prior or previous belief of the current state, given only the previous data $D_{t-1}$, normalised by the probability of seeing the observation $y_t$ regardless.

While the notation may be somewhat verbose, it is a very natural statement. It says that we can update our view on the state, $\theta_t$ in a rational manner given the fact that we have new information in the form of the current observation, $y_t$.

One of the extremely useful aspects of Bayesian inference is that if our prior and likelihood are both normally distributed, we can use the concept of conjugate priors to state that our posterior (i.e. updated view of $\theta_t$) will also be normally distributed.

We utilised the same concept, albeit with different distributional forms, in the discussion on the inference of binomial proprotions.

So how does this help us produce a Kalman Filter?

Well, let's specify the terms that we'll be using, from Bayes' Rule above. Firstly, we specify the distributional form of the prior:

\begin{eqnarray} \theta_t | D_{t-1} \sim \mathcal{N}(a_t, R_t) \end{eqnarray}

That is, the prior view of $\theta$ at time $t$, given our knowledge at time $t-1$ is distributed as a multivariate normal distribution, with mean $a_t$ and variance-covariance $R_t$. The latter two parameters will be defined below.

Now let's consider the likelihood:

\begin{eqnarray} y_t | \theta_t \sim \mathcal{N}(F^{T}_t \theta_t, V_t) \end{eqnarray}

That is, the likelihood function of the current observation $y$ at time $t$ is distributed as a multivariate normal distribution, with mean $F^{T}_t \theta_t$ and variance-covariance $V_t$. We've already outlined these terms in our list above.

Finally we have the posterior of $\theta_t$:

\begin{eqnarray} \theta_t | D_t \sim \mathcal{N}(m_t, C_t) \end{eqnarray}

That is, our posterior view of the current state $\theta$ at time $t$, given our current knowledge at time $t$ is distributed as a multivariate normal distribution with mean $m_t$ and variance-covariance $C_t$.

The Kalman Filter is what links all of these terms together for $t=1,\ldots$. We won't derive where these values actually come from, but we will simply state them. Thankfully we can use library implementations in R to carry out the "heavy lifting" for us:

\begin{eqnarray} a_t &=& G_t m_{t-1} \\ R_t &=& G_t C_{t-1} G^{T}_t + W_t \\ e_t &=& y_t - f_t \\ m_t &=& a_t + A_t e_t \\ f_t &=& F^{T}_t a_t \\ Q_t &=& F^{T}_t R_t F_t + V_t \\ A_t &=& R_t F_t Q^{-1}_t \\ C_t &=& R_t - A_t Q_t A^{T}_t \end{eqnarray}

Clearly that is a lot of notation! As I said above, we need not worry about the excessive verboseness of the Kalman Filter, as we can simply use libraries in R to calculate the algorithm for us.

So how does it all fit together? Well, $f_t$ is the predicated value of the observation at time $t$, where we make this prediction at time $t-1$. Since $e_t = y_t - f_t$, we can see easily that $e_t$ is the error associated with the forecast (the difference between $f$ and $y$).

Importantly, the posterior mean is a weighting of the prior mean and the forecast error, since $m_t = a_t + A_t e_t = G_t m_{t-1} + A_t e_t$, where $G_t$ and $A_t$ are our weighting matrices.

Now that we have an algorithmic procedure for updating our views on the observations and states, we can use it to make predictions, as well as smooth the data.

Prediction

The Bayesian approach to the Kalman Filter leads naturally to a mechanism for prediction. Since we have our posterior estimate for the state $\theta_t$, we can predict the next day's values by considering the mean value of the observation.

Let's take the expected value of the observation tomorrow, given our knowledge of the data today:

\begin{eqnarray} E[y_{t+1} | D_t] &=& E[F^{T}_{t+1} \theta_t + v_{t+1} | D_t] \\ &=& F^{T}_{t+1} E[\theta_{t+1} | D_t] \\ &=& F^{T}_{t+1} a_{t+1} \\ &=& f_{t+1} \end{eqnarray}

Where does this come from? Let's try and follow through the analysis:

Since the likelihood function for today's observation $y_t$, given today's state $\theta_t$, is normally distributed with mean $F^{T}_t \theta_t$ and variance-covariance $V_t$ (see above), we have that the expectation of tomorrow's observation $y_{t+1}$, given our data today, $D_t$, is precisely the expectation of the multivariate normal for the likelihood, namely $E[F^{T}_{t+1} \theta_t + v_{t+1} | D_t]$. Once we make this connection it simply reduces to applying rules about the expectation operator to the remaining matrices and vectors, ultimately leading us to $f_{t+1}$.

However it is not sufficient to simply calculate the mean, we must also know the variance of tomorrow's observation given today's data, otherwise we cannot truly characterise the distribution on which to draw tomorrow's prediction.

\begin{eqnarray} \operatorname{Var}[y_{t+1}| D_t] &=& \operatorname{Var} [F^{T}_{t+1} \theta_t + v_{t+1} | D_t] \\ &=& F^{T}_{t+1} \operatorname{Var}[\theta_{t+1} | D_t] F_{t+1} + V_{t+1} \\ &=& F^{T}_{t+1} R_{t+1} F_{t+1} + V_{t+1} \\ &=& Q_{t+1} \end{eqnarray}

Now that we have the expectation and variance of tomorrow's observation, given today's data, we are able to provide the general forecast for $k$ steps ahead, by fully characterising the distribution on which these predictions are drawn:

\begin{eqnarray} y_{t+k} | D_t \sim \mathcal{N}(f_{t+k | t}, Q_{t+k | t}) \end{eqnarray}

Note that I have used some odd notation here - what does it mean to have a subscript of ${t+k | t}$? Actually, it allows us to write a convenient shorthand for the following:

\begin{eqnarray} f_{t+k | t} &=& F^{T}_{t+k} G^{k-1} a_{t+1} \\ Q_{t+k | t} &=& F^{T}_{t+k} R_{t+k} F_{t+k} + V_{t+k} \\ R_{t+k | t} &=& G^{k-1} R_{t+1} (G^{k-1})^{T} + \sum^{k}_{j=2} G^{k-j} W_{t+j} (G^{k-j})^{T} \end{eqnarray}

As I've mentioned repeatedly in this article, we should not concern ourselves too much with the verboseness of the Kalman Filter and its notation, rather we should think about the overall procedure and its Bayesian underpinnings.

Thus we now have the means of predicting new values of the series. This is an alternative to the predictions produced by combining ARIMA and GARCH. In subsequent articles we will actually carry this out for some real financial data and apply it to a predictive trading model.

We will also be able to use the "filter" aspect to provide us with continually updated views on a linear hedge ratio between two cointegrated pairs of assets, such as might be found in a stastical arbitrage strategy.

References

Cowpertwait, P.S.P. and Metcalfe, A.V. (2009). Introductory Time Series with R.
Pole, A., West, M., and Harrison, J. (1994). Applied Bayesian Forecasting.

Linear State-Space Model

The Kalman Filter

A Bayesian Approach

Prediction

References

The Quantcademy

Successful Algorithmic Trading

Advanced Algorithmic Trading