Batch Linear Regression via Bayesian Estimation

In this article we are going to estimate the parameters of a univariate linear regression in a Bayesian framework, using an approach known as the batch solution, in order to introduce more of the theory underlying Kalman Filters and Particle Filters.

In previous articles we have discussed the theory of state space models and Kalman Filters as well as their application to estimating a dynamic hedging ratio between a pair of cointegrating ETFs. The articles were relatively light on theory and did not explore the much broader field of Bayesian Filtering and Smoothing, which the Kalman Filter is a part of.

In this new series of articles we are going to go into more theoretical depth in order to discuss linear and non-linear Kalman Filters as well as Sequential Monte Carlo (SMC) methods, including Particle Filters. Such techniques have many applications in quantitative finance and econometrics, as well as in the broader physical, biological and engineering sciences.

In this article we are going to begin our discussion of Bayesian Filtering by estimating the parameters of a univariate linear regression in a Bayesian framework, using a technique known as the batch solution. We have previously utilised the PyMC3 library with Markov Chain Monte Carlo (MCMC) to estimate the posterior distribution of the parameters for some simulated linear data. However, this article will provide the solution directly through an application of Bayes' Rule.

This article series is heavily influenced by, and largely follows, the concise and accessible Bayesian Filtering and Smoothing[1], by Simo Särkkä. If you would like to gain more insight into the theory this text is worth a read. While Särkkä has provided many of the code snippets in MATLAB we have written our own listings using Python and its associated scientific library ecosystem. We have also chosen to utilise the notation as provided in the book so as not to confuse readers who wish to refer to the equations therein.

Bayesian Batch Solution to Linear Regression

Our approach in this article is to utilise a simple univariate linear regression model with given parameters to simulate some data with noise and then attempt to recover the (known) parameters via the Bayesian batch solution. The linear regression model is given by:

\begin{eqnarray} y_k = {\bf H_k} \theta + \epsilon_k \end{eqnarray}

Where $\theta = (\theta_0 \;\; \theta_1)^T \in \mathbb{R}^2$ is the parameter vector containing the intercept and slope parameters, ${\bf H_k} = (1 \;\; t_k)$ is a row vector, introduced to allow more succinct matrix calculation (as well as notation) for multivariate problems, and $\epsilon_k \sim N(0, \sigma^2)$ is normally distributed 'noise' with zero mean and variance $\sigma^2$. Here $t_k$ is the $k$-th value of the independent variable. We use $t$ rather than the traditional $x$ because this value usually represents time, as we will see when we come to study recursive estimation.
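
Written out explicitly for a single observation this is just the familiar straight line with additive Gaussian noise:

\begin{eqnarray} y_k = {\bf H_k} \theta + \epsilon_k = (1 \;\; t_k) \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} + \epsilon_k = \theta_0 + \theta_1 t_k + \epsilon_k \end{eqnarray}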

Since we are operating in a Bayesian setting it is necessary to specify a prior distribution for the parameters $\theta$. In this instance we will assume they are normally distributed with a known mean vector ${\bf m_0}$ and covariance matrix ${\bf P_0}$ leading to $\theta \sim \mathcal{N}({\bf m_0}, {\bf P_0})$.

In the previous article on Bayesian Linear Regression we rewrote the linear regression using probabilistic notation. In the notation of Särkkä[1] this is given by the following:

\begin{eqnarray} p(y_k \mid \theta) &=& \mathcal{N}(y_k \mid {\bf H_k} \theta, \sigma^2) \\ p(\theta) &=& \mathcal{N}(\theta \mid {\bf m_0}, {\bf P_0}) \end{eqnarray}

The first line is the likelihood distribution of the data $y_k$ conditioned on the parameters $\theta$. The second line is the prior distribution of the parameters $\theta$. Both are normally distributed.

The goal of the Bayesian analysis is to calculate the posterior distribution of the parameters conditioned on the data, $y_k$. That is, we want to estimate the values of the intercept and slope of the univariate linear regression once we have obtained some data $(t_k, y_k)$ for $k \in {\bf K}$, with ${\bf K} = \{1, \ldots, T\}$ representing an indexing set.

As we saw in our original article introducing Bayesian Statistics, this can be achieved through the use of Bayes' Rule:

\begin{eqnarray} p(\theta \mid y_{1:T}) &=& p(y_{1:T} \mid \theta) \: p(\theta) \: / \: p(y_{1:T}) \\ &\propto& p(\theta) \prod_{k=1}^T p(y_k \mid \theta) \\ &=& \mathcal{N}(\theta \mid {\bf m_0}, {\bf P_0}) \prod_{k=1}^T \mathcal{N}(y_k \mid {\bf H_k} \theta, \sigma^2) \end{eqnarray}

Please note that the notation $y_{1:T}$ is shorthand for a set or sequence $\{y_1, \ldots, y_T\}.$

Let's break down how we achieved this result. The first line uses Bayes' Rule to obtain the posterior distribution of the parameters as the product of the likelihood and the prior, normalised by the evidence $p(y_{1:T})$ (the 'normalising factor'). On the second line we have then utilised the fact that the data are conditionally independent given $\theta$, which allows us to equate the joint likelihood of the data $p(y_{1:T} \mid \theta)$ to the product of the distributions of the individual data $y_k$.

We can make use of conjugacy in the above. Since the likelihood is Gaussian with known variance and the prior on $\theta$ is also Gaussian, the posterior distribution will itself be normally distributed. This gives us a closed-form posterior from which we can draw samples directly, rather than having to resort to a numerical method such as MCMC or quadrature.

In the notation of Särkkä[1] the posterior is thus given by:

\begin{eqnarray} p(\theta \mid y_{1:T}) = \mathcal{N}(\theta \mid {\bf m_T}, {\bf P_T}) \end{eqnarray}

Where the mean vector and covariance matrix for the posterior are themselves given by the following relations (see Särkkä[1]):

\begin{eqnarray} {\bf m_T} &=& \left[{\bf P_0^{-1}} + \frac{1}{\sigma^2} {\bf H^T} {\bf H} \right]^{-1} \left[\frac{1}{\sigma^2} {\bf H^T}{\bf y} + {\bf P_0^{-1}} {\bf m_0}\right] \\ {\bf P_T} &=& \left[{\bf P_0^{-1}} + \frac{1}{\sigma^2} {\bf H^T} {\bf H} \right]^{-1} \end{eqnarray}

Where:

\begin{eqnarray} {\bf H} = \begin{pmatrix} {\bf H_1} \\ \vdots \\ {\bf H_T} \end{pmatrix} = \begin{pmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_T \end{pmatrix} \end{eqnarray}

And:

\begin{eqnarray} {\bf y} = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix} \end{eqnarray}

The above equations provide everything we need to implement a full analytic solution using Python via NumPy arrays and a little linear algebra.

Python Implementation

Let's now discuss how to turn the above theoretical outline into working Python code that can simulate samples from the linear model and estimate the parameters based on these samples.

As always the first task is to import the necessary libraries. For this we will primarily be using NumPy for its linear algebra capabilities. We will also make use of Matplotlib and Seaborn for visualisation:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme()

The next step is to write a simple simulation function that takes in the number of samples to generate, the known parameters $\theta$ and the standard deviation of the error term, $\sigma$. We first generate a linear space to cover the independent variable $t \in [0, 1]$.

Then we use the vectorised addition and multiplication capabilities of NumPy to generate a corresponding array of linear data with random error. We achieve this by multiplying $t$ by the 'slope' $\theta_1$, adding the 'intercept' $\theta_0$, along with a sample of normally distributed error with mean zero and variance $\sigma^2$.

In order to follow the matrix notation given in Särkkä[1] we simultaneously construct ${\bf H}$ by placing a column of unity values alongside the column of $t_k$ values, implemented here with numpy.insert and a transpose. Finally we return ${\bf H}$, ${\bf t}$ and ${\bf y}$:

def simulate_linear_data(n, theta, sigma):
    """
    This function creates a set of univariate linear data with
    normally distributed error. It then converts this data into
    the H matrix format for use in subsequent calculations.

    Parameters
    ----------
    n : `integer`
        The number of samples to generate
    theta : `tuple` or `np.ndarray`
        The vector containing the known intercept and slope parameters
    sigma : `float`
        Known standard deviation of the normally distributed error

    Returns
    -------
    `tuple(np.ndarray, np.ndarray, np.ndarray)`
        The H, t and y matrices containing the simulated data
    """
    # Create the linear data with random error
    t = np.linspace(0, 1, n)
    # Note: np.random.normal expects the standard deviation (scale), not the variance
    y = theta[0] + theta[1] * t + np.random.normal(0.0, sigma, size=n)

    # Modify the data shape to correspond to the H matrix
    ones = np.ones((1, n))
    H = np.insert(ones, 1, t, axis=0).T

    return H, t, y
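
As an aside, the same ${\bf H}$ matrix can be constructed in a single step with numpy.column_stack, which some readers may find more direct. The following line is equivalent to the numpy.insert approach used above (assuming t and n are as defined inside simulate_linear_data):

# Equivalent construction of H: a column of ones next to the column of t_k values
H = np.column_stack((np.ones(n), t))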

The next function to be implemented, batch_linear_regression, calculates the mean vector ${\bf m_T}$ and the covariance matrix ${\bf P_T}$ for the normally distributed posterior. The equations for both were previously given above.

Despite the matrix algebra present in these relations NumPy actually makes these calculations relatively straightforward. Notice that the first factor in the expression for the mean vector is exactly the posterior covariance matrix ${\bf P_T}$. Hence we can calculate ${\bf P_T}$ first (since it involves a matrix inverse) and reuse the result when computing ${\bf m_T}$.

In order to calculate a matrix inverse it is necessary to utilise the numpy.linalg.inv method. We can combine this with the numpy.dot method, which carries out matrix multiplication. The matrices ${\bf H}$ and ${\bf y}$ have been defined in such a way as to allow these calculations to be written as matrix multiplications. This is why the NumPy code is relatively terse:

def batch_linear_regression(m0, P0, sigma, H, y):
    """
    This function calculates the mean vector and covariance
    matrix of the normally distributed posterior given
    the prior mean vector and covariance matrix.

    Parameters
    ----------
    m0 : `np.ndarray`
        The mean vector of the parameter prior distribution
    P0 : `np.ndarray`
        The covariance matrix of the parameter prior distribution
    sigma : `float`
        The standard deviation of the error term.
    H : `np.ndarray`
        The matrix of 1s and the t_k values.
    y : `np.ndarray`
        The vector of 'response' values y_k.
    Returns
    -------
    `tuple(np.ndarray, np.ndarray)`
        The mean vector and covariance matrix of the posterior.
    """
    # Pre-calculate the inverse of the prior covariance matrix
    # of the parameters, along with the inverse of the noise variance
    P0inv = np.linalg.inv(P0)
    invsigsq = 1.0 / (sigma**2)

    # Use NumPy linear algebra routines to calculate the
    # mean vector and covariance matrix of the normally
    # distributed posterior
    PT = np.linalg.inv(P0inv + invsigsq * np.dot(H.T, H))
    mT = np.dot(PT, (invsigsq * np.dot(H.T, y) + np.dot(P0inv, m0)))

    return mT, PT
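
One small caveat: explicitly inverting matrices can be less numerically stable than solving the corresponding linear system. For the small, well-conditioned matrices in this example it makes no practical difference, but as a sketch, the posterior mean could alternatively be obtained with numpy.linalg.solve. The function below is a hypothetical variant for illustration, not part of the listing above:

def batch_linear_regression_solve(m0, P0, sigma, H, y):
    """
    Mathematically equivalent posterior calculation that solves a
    linear system for the mean rather than applying an explicit inverse.
    """
    P0inv = np.linalg.inv(P0)
    precision = P0inv + np.dot(H.T, H) / sigma**2

    # Solve precision * mT = rhs instead of forming inv(precision) @ rhs
    rhs = np.dot(H.T, y) / sigma**2 + np.dot(P0inv, m0)
    mT = np.linalg.solve(precision, rhs)

    # The full posterior covariance still requires an explicit inverse
    PT = np.linalg.inv(precision)

    return mT, PT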

The final step is to tie the above code together with a __main__ entrypoint for running as a command line script. The first task is to fix a random seed for NumPy to ensure that this script will produce identical results when executed elsewhere. The next step is to set the parameters of the simulation. We have chosen to simulate $n=50$ samples, with a linear regression parameter vector $\theta = (1.0 \;\; 0.5)^T$ and a standard deviation of $\sigma=0.4$ for the error term.

Note that while it is convenient in this short script to set a NumPy random seed for reproducibility, in a larger production system it is not recommended to carry this out since it acts in a global fashion and could affect other modules in a larger codebase. Please see this article for guidance on what to do in a production system.

We have (arbitrarily) chosen our prior mean vector for the parameters to be ${\bf m_0} = (0.9 \;\; 0.6)$, which is close to the true $\theta = (1.0 \;\; 0.5)$. We have (arbitrarily) set the covariance matrix to be ${\bf P_0} = \text{diag}(0.5, 0.5)$. That is, the matrix with zeros everywhere except the values 0.5 along the diagonal.

We then simulate the linear data and use it to calculate the posterior mean and covariance. We are then able to calculate the 'true' linear model points, which we have denoted by real_y. We can also use the mean vector of the parameters posterior, ${\bf m_T}$, to provide us with a point estimate of the linear regression line.

Note that in the previous article on Bayesian Linear Regression we utilised the full posterior distribution to draw samples for $\theta$ and thus were able to plot multiple regression line estimates. For this article we are simply taking a single point estimate via the mean of this distribution.
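
That said, because the posterior is an explicit Gaussian, drawing such samples is straightforward if desired: numpy.random.multivariate_normal can sample parameter vectors directly from $\mathcal{N}({\bf m_T}, {\bf P_T})$. A minimal sketch (assuming mT, PT and t are available as computed in the script below) would be:

# Draw 20 parameter samples from the Gaussian posterior and plot one faint line per sample
theta_samples = np.random.multivariate_normal(mT, PT, size=20)
for th in theta_samples:
    plt.plot(t, th[0] + th[1] * t, color='#999999', alpha=0.3)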

Once we have this estimated regression line (denoted by est_y) we can use Matplotlib to create a figure similar to Figure 3.2 in Särkkä[1]. We plot the original data points as a scatter plot in blue, overlaid with the true regression line in thick light grey, with our estimate given by the thin dark grey line:

if __name__ == "__main__":
    # Fix the parameters of the linear model
    np.random.seed(123)  # Ensure this script is reproducible
    n = 50
    theta = (1.0, 0.5)
    sigma = 0.4
    
    # Specify our prior 'view' on the parameters
    m0 =  np.array([0.9, 0.6]) # Prior mean of theta
    P0 = np.array([[0.5, 0.0], [0.0, 0.5]]) # Prior covariance of theta

    # Simulate the noisy linear data and fit a linear
    # regression line using Bayesian estimation
    H, t, y = simulate_linear_data(n, theta, sigma)
    mT, PT = batch_linear_regression(m0, P0, sigma, H, y)
    
    # Calculate the true and estimated lines of best fit
    # as sample points along both lines
    real_y = theta[0] + theta[1] * t
    est_y = mT[0] + mT[1] * t
    
    # Plot all of the appropriate data using Matplotlib
    # themed with Seaborn
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))
    ax.plot(t, real_y, '-', linewidth=5, color='#dddddd', label='True Signal')
    ax.plot(t, est_y, '-', linewidth=2, color='#333333', label='Estimate')
    ax.scatter(t, y, label='Measurement')
    ax.legend(loc="upper left", facecolor='#ffffff')
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.5, 2.0)
    ax.set_xlabel('$t$')
    ax.set_ylabel('$y$')
    plt.show()

Saving this script as batch_lin_reg.py and running via python batch_lin_reg.py in a suitable virtual environment or notebook (such as Jupyter), with the libraries installed, will produce the following figure:

It can be seen that the estimated line calculated from the mean vector of the posterior distribution for the parameters $\theta$ is relatively close to the true value of $\theta$ in this instance.
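
As an informal sanity check, the posterior mean can also be compared against the ordinary least squares fit. With $n=50$ samples and a fairly diffuse prior the two estimates should be close, and they coincide in the limit of an uninformative prior (${\bf P_0^{-1}} \to 0$). Assuming the script above has already been run, a quick comparison would be:

# Compare the Bayesian posterior mean with the ordinary least squares estimate
theta_ols, _, _, _ = np.linalg.lstsq(H, y, rcond=None)
print("Posterior mean:", mT)
print("OLS estimate:  ", theta_ols)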

Next Steps

Now that we have described the batch solution we are going to continue following along with Särkkä[1] by taking a look at recursive linear regression, which is useful in a quantitative finance context where new data arrives continually, whether daily, hourly or minute-by-minute. This will be the subject of the next article.

References
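
[1] Särkkä, S. (2013) Bayesian Filtering and Smoothing, Cambridge University Press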

Full Code

# batch_lin_reg.py

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme()


def simulate_linear_data(n, theta, sigma):
    """
    This function creates a set of univariate linear data with
    normally distributed error. It then converts this data into
    the H matrix format for use in subsequent calculations.

    Parameters
    ----------
    n : `integer`
        The number of samples to generate
    theta : `tuple` or `np.ndarray`
        The vector containing the known intercept and slope parameters
    sigma : `float`
        Known standard deviation of the normally distributed error

    Returns
    -------
    `tuple(np.ndarray, np.ndarray, np.ndarray)`
        The H, t and y matrices containing the simulated data
    """
    # Create the linear data with random error
    t = np.linspace(0, 1, n)
    # Note: np.random.normal expects the standard deviation (scale), not the variance
    y = theta[0] + theta[1] * t + np.random.normal(0.0, sigma, size=n)

    # Modify the data shape to correspond to the H matrix
    ones = np.ones((1, n))
    H = np.insert(ones, 1, t, axis=0).T

    return H, t, y


def batch_linear_regression(m0, P0, sigma, H, y):
    """
    This function calculates the mean vector and covariance
    matrix of the normally distributed posterior given
    the prior mean vector and covariance matrix.

    Parameters
    ----------
    m0 : `np.ndarray`
        The mean vector of the parameter prior distribution
    P0 : `np.ndarray`
        The covariance matrix of the parameter prior distribution
    sigma : `float`
        The standard deviation of the error term.
    H : `np.ndarray`
        The matrix of 1s and the t_k values.
    y : `np.ndarray`
        The vector of 'response' values y_k.
    Returns
    -------
    `tuple(np.ndarray, np.ndarray)`
        The mean vector and covariance matrix of the posterior.
    """
    # Pre-calculate the inverse of the prior covariance matrix
    # of the parameters, along with the inverse of the noise variance
    P0inv = np.linalg.inv(P0)
    invsigsq = 1.0 / (sigma**2)

    # Use NumPy linear algebra routines to calculate the
    # mean vector and covariance matrix of the normally
    # distributed posterior
    PT = np.linalg.inv(P0inv + invsigsq * np.dot(H.T, H))
    mT = np.dot(PT, (invsigsq * np.dot(H.T, y) + np.dot(P0inv, m0)))

    return mT, PT


if __name__ == "__main__":
    # Fix the parameters of the linear model
    np.random.seed(123)  # Ensure this script is reproducible
    n = 50
    theta = (1.0, 0.5)
    sigma = 0.4
    
    # Specify our prior 'view' on the parameters
    m0 =  np.array([0.9, 0.6]) # Prior mean of theta
    P0 = np.array([[0.5, 0.0], [0.0, 0.5]]) # Prior covariance of theta

    # Simulate the noisy linear data and fit a linear
    # regression line using Bayesian estimation
    H, t, y = simulate_linear_data(n, theta, sigma)
    mT, PT = batch_linear_regression(m0, P0, sigma, H, y)
    
    # Calculate the true and estimated lines of best fit
    # as sample points along both lines
    real_y = theta[0] + theta[1] * t
    est_y = mT[0] + mT[1] * t
    
    # Plot all of the appropriate data using Matplotlib
    # themed with Seaborn
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))
    ax.plot(t, real_y, '-', linewidth=5, color='#dddddd', label='True Signal')
    ax.plot(t, est_y, '-', linewidth=2, color='#333333', label='Estimate')
    ax.scatter(t, y, label='Measurement')
    ax.legend(loc="upper left", facecolor='#ffffff')
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.5, 2.0)
    ax.set_xlabel('$t$')
    ax.set_ylabel('$y$')
    plt.show()