Beginner's Guide to Statistical Machine Learning - Part I

In previous articles I have made it clear that statistical machine learning has become an extremely important component in the overall toolkit of a quantitative trading researcher. I have also outlined a brief study guide on what to learn. However, I've not really explored the subject on a conceptual level starting from first principles.

This article is designed to give you an idea of the mathematics and formalism behind statistical learning, with follow-on articles describing exactly how it can be applied to quantitative finance problems, such as algorithmic trading strategy design.

What is Statistical Learning?

Before discussing the theoretical aspects of statistical learning it is appropriate to consider an example of a situation from quantitative finance where such techniques are applicable. Consider a quantitative fund that wishes to make long term predictions of the S&P500 stock market index. The fund has managed to collect a substantial amount of fundamental data associated with the companies that constitute the index. Fundamental data includes price-earnings ratio or book value, for instance. How should the fund go about using this data to make predictions of the index in order to create a trading tool? Statistical learning provides one such approach to this problem.

In a more quantitative sense we are attempting to model the behaviour of an outcome or response based on a set of predictors or features assuming a relationship between the two. In the above example the stock market index value is the response and the fundamental data associated with the constituent firms are the predictors.

This can be formalised by considering a response $Y$ with $p$ different features $x_1,x_2,...,x_p$. If we utilise vector notation then we can define $X = (x_1,x_2,...,x_p)$, which is a vector of length $p$. Then the model of our relationship is given by:

\begin{eqnarray} Y = f(X) + \epsilon \end{eqnarray}

Here $f$ is an unknown function of the predictors and $\epsilon$ represents an error or noise term. Importantly, $\epsilon$ is independent of the predictors and has a mean of zero. This term is included to represent information that is not captured by $f$. Returning to the stock market index example, $Y$ represents the value of the S&P500 whereas the $x_i$ components represent the values of individual fundamental factors.

The goal of statistical learning is to estimate the form of $f$ based on the observed data and to evaluate how accurate those estimates are.
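As a rough illustration of the model $Y = f(X) + \epsilon$, the following Python sketch simulates data from it. The particular choice of $f$, the noise scale and the sample size are all assumptions made purely for the example, since in practice $f$ is unknown and must be estimated.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Hypothetical "true" relationship, assumed only for this illustration.
    # In a real problem f is unknown.
    return 2.0 + 0.5 * x

n = 10_000
x = rng.uniform(0.0, 10.0, size=n)            # a single predictor
eps = rng.normal(loc=0.0, scale=1.0, size=n)  # zero-mean noise, drawn independently of x
y = f(x) + eps                                # observed response

# Even with perfect knowledge of f, the noise term remains in the response
residual = y - f(x)
print(residual.mean(), residual.std())
```

The residual mean should be close to zero and its standard deviation close to the assumed noise scale, which is the part of the response that no estimate of $f$ can explain.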

Prediction and Inference

There are two general tasks that are of interest in statistical learning - prediction and inference. Prediction refers to the situation where it is straightforward to obtain information on features/predictors but difficult (or impossible) to obtain the responses.

Prediction

Prediction is concerned with predicting a response $Y$ based on a newly observed predictor, $X$. Assuming a model relationship has been determined then it is simple to predict the response using an estimate for $f$ to produce an estimate for the response:

\begin{eqnarray} \hat{Y} = \hat{f}(X) \end{eqnarray}

The exact functional form of $f$ is often unimportant in a prediction scenario, provided that the estimated responses are close to the true responses. Different estimates of $f$ will produce estimates of $Y$ of varying accuracy. The error associated with having a poor estimate $\hat{f}$ of $f$ is called the reducible error. Note that there is always a degree of irreducible error because our original specification of the problem included the $\epsilon$ error term. This error term encapsulates the unmeasured factors that may affect the response $Y$. The approach taken is to try and minimise the reducible error with the understanding that there will always be an upper limit on accuracy imposed by the irreducible error.
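To make the split between the two error types concrete, here is a sketch of the standard decomposition of the expected squared prediction error. It assumes that both $\hat{f}$ and $X$ are held fixed, so that the only source of randomness is $\epsilon$:

\begin{eqnarray}
E\left[ (Y - \hat{Y})^2 \right] &=& E\left[ \left( f(X) + \epsilon - \hat{f}(X) \right)^2 \right] \\
&=& \left[ f(X) - \hat{f}(X) \right]^2 + 2 \left[ f(X) - \hat{f}(X) \right] E[\epsilon] + E[\epsilon^2] \\
&=& \underbrace{\left[ f(X) - \hat{f}(X) \right]^2}_{\text{reducible}} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible}}
\end{eqnarray}

The cross term vanishes because $\epsilon$ has a mean of zero, and for the same reason $E[\epsilon^2] = \text{Var}(\epsilon)$. Improving $\hat{f}$ can shrink the first term towards zero but has no effect on the second.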

Inference

Inference is concerned with the situation where there is a need to understand the relationship between $X$ and $Y$ and hence its exact form must be determined. One may wish to identify important predictors or determine the relationship between individual predictors and the response. One could also ask if the relationship is linear or non-linear. The former means the model is likely to be more interpretable but at the expense of potentially worse predictability. The latter provides models which are generally more predictive but are sometimes less interpretable. Hence a trade-off between predictability and interpretability often exists.

On QuantStart we are generally less concerned with inference models since the actual form of $f$ is not as important as its ability to make accurate predictions. Many of the trading articles on the site have been, and will continue to be, based on predictive modelling. The next section deals with how we go about constructing an estimate $\hat{f}$ for $f$.

Parametric and Non-Parametric Models

In a statistical learning situation it is often possible to construct a set of tuples of predictors and responses of the form $\{ (X_1, Y_1), (X_2, Y_2), ... , (X_N, Y_N) \}$, where $X_j$ refers to the $j$th predictor vector and not the $j$th component of a particular predictor vector (that is denoted by $x_j$). A data set of this form is known as training data since it will be used to train a particular statistical learning method on how to generate $\hat{f}$. In order to actually estimate $f$ we need to find a $\hat{f}$ that provides a reasonable approximation to a particular $Y$ under a particular predictor $X$. There are two broad categories of statistical models that allow us to achieve this. They are known as parametric and non-parametric models.

Parametric Models

The defining feature of parametric methods is that they require the specification or assumption of the form of $f$. This is a modelling decision. The first choice is whether to consider a linear or non-linear model. Let's consider the simpler case of a linear model. Such a model reduces the problem from estimating an arbitrary unknown function of $p$ variables to estimating a coefficient vector $\beta=(\beta_0, \beta_1, ... , \beta_p)$ of length $p+1$.

Why $p+1$ and not $p$? Since linear models can be affine, that is they may not pass through the origin when creating a "line of best fit", a coefficient is required to specify the "intercept". In a one-dimensional linear model (regression) setting this coefficient is often represented as $\alpha$. For our multi-dimensional linear model, where there are $p$ predictors, we instead use the notation $\beta_0$ to represent our intercept and hence there are $p+1$ components in our $\hat{\beta}$ estimate of $\beta$.

Now that we have specified a (linear) functional form of $f$ we need to train it. "Training" in this instance means finding an estimate for $\beta$ such that:

\begin{eqnarray} Y \approx \hat{\beta}^T X = \hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_p x_p \end{eqnarray}

Here the vector $X=(1,x_1,x_2,...,x_p)$ has been augmented with a leading component equal to one, so that the inner product is $(p+1)$-dimensional and includes the intercept term.

In the linear setting we can use an algorithm such as ordinary least squares (OLS) to determine the coefficients but other methods are available as well. It is far simpler to estimate $\beta$ than to fit a (potentially non-linear) $f$. However, by choosing a parametric linear approach our estimate $\hat{f}$ is unlikely to replicate the true form of $f$. This can lead to poor estimates because the model is not flexible enough.
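The OLS fit described above can be sketched in a few lines of NumPy. The "true" coefficients, the number of predictors and the noise level below are hypothetical values chosen for the illustration. The column of ones prepended to the predictors corresponds to the intercept $\beta_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 500, 3
# Hypothetical (beta_0, beta_1, ..., beta_p), assumed only for this example
beta_true = np.array([1.0, 0.5, -2.0, 3.0])

features = rng.normal(size=(n, p))
# Prepend a column of ones so that beta_0 acts as the intercept
X = np.column_stack([np.ones(n), features])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Ordinary least squares: minimise ||y - X beta||^2 over beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the response for a newly observed (augmented) predictor vector
x_new = np.array([1.0, 0.2, -0.1, 0.4])
y_new_hat = x_new @ beta_hat
print(beta_hat, y_new_hat)
```

With this much data and so little noise the estimate $\hat{\beta}$ should land close to the assumed coefficients. Real financial data would rarely be so accommodating.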

A potential remedy is to consider adding more parameters, by choosing alternate forms for $\hat{f}$. Unfortunately if the model becomes too flexible it can lead to a very dangerous situation known as overfitting, which we have discussed at length in previous articles. In essence the model follows the noise too closely and not the signal!
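A small experiment can make the danger more tangible. The sketch below assumes a linear "true" signal with added noise, then compares a degree-1 polynomial fit against a much more flexible degree-8 fit. The data-generating function, sample sizes and degrees are all arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical "true" signal, assumed linear for this illustration
    return 1.0 + 2.0 * x

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = f(x) + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training sample
x_test, y_test = make_data(1000)    # fresh data not used in fitting

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

results = {}
for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

The more flexible model will always match the training data at least as well as the simpler one, since the simpler model is a special case of it. On the held-out data it will typically, though not on every random draw, do worse, because the extra coefficients have been spent fitting noise.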

Non-Parametric Models

The alternative approach is to consider a non-parametric form of $\hat{f}$. Non-parametric models can potentially fit a wider range of possible forms for $f$ and are thus more flexible. Unfortunately non-parametric models require an extensive number of observational data points, often far more than in a parametric setting. In addition non-parametric methods are also prone to overfitting if not treated carefully, as described above.
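One simple non-parametric method is $k$-nearest neighbours regression, which makes no assumption about the form of $f$ and instead averages the responses of the training points closest to a new predictor. The sketch below is a minimal one-dimensional version. The function name `knn_predict`, the sine-shaped signal and the value of $k$ are assumptions made for the example.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Average the responses of the k nearest training points (1-D predictor)."""
    x_new = np.atleast_1d(x_new)
    # Pairwise absolute distances, shape (len(x_new), len(x_train))
    dist = np.abs(x_new[:, None] - x_train[None, :])
    # Indices of the k closest training points for each new point
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=2000)
# Hypothetical non-linear signal plus noise
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=2000)

x_new = np.array([np.pi / 2, np.pi, 3 * np.pi / 2])
y_hat = knn_predict(x_train, y_train, x_new, k=25)
print(y_hat)
```

Note that the whole training set must be kept around to make predictions, and that the estimate degrades quickly if few observations lie near the new point. Setting $k=1$ reproduces the training responses exactly, which is an extreme case of following the noise rather than the signal.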

Non-parametric models may seem like a natural choice for quantitative trading models as there is seemingly an abundance of (historical) data on which to apply the models. However, the methods are not always optimal. While the increased flexibility is attractive for modelling the non-linearities in stock market data it is very easy to overfit the data due to the poor signal/noise ratio found in financial time series.

Thus a "middle-ground" of considering models with some degree of flexibility is preferred. We will discuss such problems in later articles.

Supervised and Unsupervised Learning

A distinction is often made in statistical machine learning between supervised and unsupervised methods. On QuantStart the strategies we look at will be based almost exclusively on supervised techniques, but unsupervised techniques are certainly applicable to financial markets.

A supervised model requires that for each predictor vector $X_j$ there is an associated response $Y_j$. The "supervision" of the procedure occurs when the model for $f$ is trained or fit to this particular data. For example, when fitting a linear regression model, the OLS algorithm is used to train it, ultimately producing an estimate $\hat{\beta}$ of the vector of regression coefficients, $\beta$.

In an unsupervised model there is no corresponding response $Y_j$ for any particular predictor $X_j$. Hence there is nothing to "supervise" the training of the model. This is clearly a much more challenging environment for an algorithm to produce results as there is no form of "fitness function" with which to assess accuracy. Despite this setback, unsupervised techniques are extremely powerful. They are particularly useful in the realm of clustering.

A parametrised clustering model, when provided with a parameter specifying the number of clusters to identify, can often discern unanticipated relationships within the data that might not otherwise have been easily determined. Such models generally fall within the domain of business analytics and consumer marketing optimisation but they do have uses within finance, for instance in assessing clustering within volatility.
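As an illustration of a clustering model that takes the number of clusters as a parameter, here is a minimal sketch of $k$-means (Lloyd's algorithm) in NumPy. No responses $Y_j$ are supplied, only the predictor vectors. The synthetic two-group data and the deterministic farthest-point initialisation are simplifying assumptions for the example, not a recommendation for production use.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Minimal k-means; returns (labels, centroids)."""
    # Farthest-point initialisation: a simple deterministic choice (assumption)
    centroids = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d2.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(7)
# Two hypothetical, well-separated groups of 2-D observations
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

Here the algorithm recovers the two groups without ever being told which observation belongs where. On real data the groups are rarely this cleanly separated, and the result can depend on both the chosen number of clusters and the initialisation.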

In the next article we will consider different categories of machine learning techniques as well as how to assess the quality of a model.