Beginner's Guide to Statistical Machine Learning - Part I

In previous articles I have made it clear that statistical machine learning has become an extremely important component in the overall toolkit of a quantitative trading researcher. I have also outlined a brief study guide on what to learn. However, I've not really explored the subject on a conceptual level starting from first principles.

This article is designed to give you an idea of the mathematics and formalism behind statistical learning, with follow-on articles describing exactly how it can be applied to quantitative finance problems, such as algorithmic trading strategy design.

Before discussing the theoretical aspects of statistical learning it is appropriate to consider an example of a situation from quantitative finance where such techniques are applicable. Consider a quantitative fund that wishes to make long term predictions of the S&P500 stock market index. The fund has managed to collect a substantial amount of *fundamental data* associated with the companies that constitute the index. Fundamental data includes *price-earnings ratio* or *book value*, for instance. How should the fund go about using this data to make predictions of the index in order to create a trading tool? Statistical learning provides one such approach to this problem.

In a more quantitative sense we are attempting to model the behaviour of an *outcome* or *response* based on a set of *predictors* or *features* assuming a relationship between the two. In the above example the stock market index value is the response and the fundamental data associated with the constituent firms are the predictors.

This can be formalised by considering a response $Y$ with $p$ different features $x_1,x_2,...,x_p$. If we utilise *vector notation* then we can define $X = (x_1,x_2,...,x_p)$, which is a vector of length $p$. Then the model of our relationship is given by:

$$Y = f(X) + \epsilon$$

Where $f$ is an unknown function of the predictors and $\epsilon$ represents an *error* or *noise term*. Importantly, $\epsilon$ is not dependent on the predictors and has a mean of zero. This term is included to represent information that is not considered within $f$. Thus we can return to the stock market index example to say that $Y$ represents the value of the S&P500 whereas the $x_i$ components represent the values of individual fundamental factors.
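To make the roles of $f$, $X$ and $\epsilon$ concrete, here is a minimal sketch that simulates a data set from a response of this form. The particular `f`, the sample size and the noise scale are all invented for illustration; in a real problem $f$ is unknown and only the predictors and responses are observed.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

N, p = 500, 3  # hypothetical number of observations and of predictors


def f(X):
    """A made-up 'true' relationship; in practice f is unknown."""
    return 2.0 + 0.5 * X[:, 0] - 1.5 * X[:, 1] + np.sin(X[:, 2])


X = rng.normal(size=(N, p))          # each row is one predictor vector
eps = rng.normal(scale=0.5, size=N)  # noise with mean zero, drawn independently of X
Y = f(X) + eps                       # the response: signal plus noise

print(X.shape, Y.shape)
print("sample mean of eps:", eps.mean())
```

The noise is drawn separately from the predictors, which mirrors the assumption that $\epsilon$ does not depend on $X$.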

The goal of statistical learning is to *estimate* the form of $f$ based on the observed data and to evaluate how accurate those estimates are.

There are two general tasks that are of interest in statistical learning - *prediction* and *inference*. Prediction refers to the situation where it is straightforward to obtain information on features/predictors but difficult (or impossible) to obtain the responses.

Prediction is concerned with predicting a response $Y$ based on a *newly observed* predictor, $X$. Assuming a model relationship has been determined then it is simple to predict the response using an *estimate* for $f$ to produce an *estimate* for the response:

$$\hat{Y} = \hat{f}(X)$$

The exact functional form of $f$ is often unimportant in a prediction scenario, provided that the estimated responses are close to the true responses. Different estimates of $f$ will produce estimates of $Y$ of varying accuracy. The error associated with having a poor estimate $\hat{f}$ of $f$ is called the *reducible error*. Note that there is always a degree of *irreducible error* because our original specification of the problem included the $\epsilon$ error term. This error term encapsulates the unmeasured factors that may affect the response $Y$. The approach taken is to try and minimise the reducible error with the understanding that there will always be an upper limit on accuracy imposed by the irreducible error.
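The split between the two kinds of error can be made explicit. As a sketch, hold both the estimate $\hat{f}$ and the predictor $X$ fixed, write $\hat{Y} = \hat{f}(X)$ for the predicted response, and use the assumptions that $\epsilon$ has a mean of zero and does not depend on $X$. The expected squared prediction error then decomposes as:

$$E[(Y - \hat{Y})^2] = E[(f(X) + \epsilon - \hat{f}(X))^2] = [f(X) - \hat{f}(X)]^2 + \text{Var}(\epsilon)$$

The cross term $2[f(X) - \hat{f}(X)]E[\epsilon]$ vanishes because $E[\epsilon] = 0$, and $E[\epsilon^2] = \text{Var}(\epsilon)$ for the same reason. The first term on the right is the reducible error, which shrinks as $\hat{f}$ approaches $f$. The second term is the irreducible error, which remains however good the estimate is.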

Inference is concerned with the situation where there is a need to understand the relationship between $X$ and $Y$ and hence its exact form must be determined. One may wish to identify important predictors or determine the relationship between individual predictors and the response. One could also ask if the relationship is *linear* or *non-linear*. The former means the model is likely to be more interpretable but at the expense of potentially worse predictability. The latter provides models which are generally more predictive but are sometimes less interpretable. Hence a trade-off between *predictability* and *interpretability* often exists.

On QuantStart we are generally less concerned with inference models since the actual form of $f$ is not as important as its ability to make accurate predictions. Many of the trading articles on the site have been, and will continue to be, based on predictive modelling. The next section deals with how we go about constructing an estimate $\hat{f}$ for $f$.

In a statistical learning situation it is often possible to construct a set of tuples of predictors and responses of the form $\{ (X_1, Y_1), (X_2, Y_2), ... , (X_N, Y_N) \}$, where $X_j$ refers to the jth predictor vector and not the jth component of a particular predictor vector (that is denoted by $x_j$). A data set of this form is known as *training data* since it will be used to *train* a particular statistical learning method on how to generate $\hat{f}$. In order to actually estimate $f$ we need to find a $\hat{f}$ that provides a reasonable approximation to a particular $Y$ under a particular predictor $X$. There are two broad categories of statistical models that allow us to achieve this. They are known as *parametric* and *non-parametric* models.

The defining feature of parametric methods is that they require the *specification* or *assumption* of the form of $f$. This is a modelling decision. The first choice is whether to consider a linear or non-linear model. Let's consider the simpler case of a linear model. Such a model reduces the problem from estimating some unknown function of $p$ variables to estimating a coefficient vector $\beta=(\beta_0, \beta_1, ... , \beta_p)$ of length $p+1$.

Why $p+1$ and not $p$? Since linear models can be *affine*, that is they may not pass through the origin when creating a "line of best fit", a coefficient is required to specify the "intercept". In a one-dimensional linear model (regression) setting this coefficient is often represented as $\alpha$. For our multi-dimensional linear model, where there are $p$ predictors, we instead use the notation $\beta_0$ to represent our intercept and hence there are $p+1$ components in our $\hat{\beta}$ estimate of $\beta$.

Now that we have specified a (linear) functional form of $f$ we need to *train* it. "Training" in this instance means finding an estimate for $\beta$ such that:

$$Y \approx \hat{\beta}^T X = \hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_p x_p$$

Where the vector $X=(1,x_1,x_2,...,x_p)$ contains an additional component with unity in order to have a $p+1$-dimensional inner product.

In the linear setting we can use an algorithm such as *ordinary least squares* (OLS) to determine the coefficients but other methods are available as well. It is far simpler to estimate $\beta$ than to fit a (potentially non-linear) $f$. However, by choosing a parametric linear approach our estimate $\hat{f}$ is unlikely to replicate the true form of $f$. This can lead to poor estimates because the model is not *flexible* enough.
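As a rough illustration of training a linear model by OLS, the sketch below generates synthetic data from a known coefficient vector and then recovers an estimate $\hat{\beta}$ with NumPy's least-squares solver. The "true" coefficients, sample size and noise level are assumptions made up for the example, not values from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 200, 2
beta_true = np.array([1.0, 2.0, -3.0])  # hypothetical (beta_0, beta_1, beta_2)

features = rng.normal(size=(N, p))
# Prepend a column of ones so that beta_0 acts as the intercept
X = np.column_stack([np.ones(N), features])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least-squares solution: the beta that minimises the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat  # fitted responses from the estimated linear model
print("beta_hat:", beta_hat)
```

The column of ones plays the role of the additional unity component in $X$, so the intercept is estimated along with the other $p$ coefficients.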

A potential remedy is to consider adding more parameters, by choosing alternate forms for $\hat{f}$. Unfortunately if the model becomes too flexible it can lead to a very dangerous situation known as *overfitting*, which we have discussed at length in previous articles. In essence the model follows the noise too closely and not the signal!
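A small experiment can show what overfitting looks like in practice. The sketch below assumes a made-up linear signal with added noise, fits polynomials of degree 1 and degree 10 to a handful of training points, and compares the mean squared error on the training data with that on a separate test sample. The choice of signal, degrees and sample sizes is purely illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)


def f(x):
    """Hypothetical 'true' signal, linear in this example."""
    return 1.0 + 2.0 * x


def sample(n):
    """Draw n noisy observations of the signal on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, f(x) + rng.normal(scale=0.5, size=n)


def mse(model, x, y):
    """Mean squared error of a fitted model on the data (x, y)."""
    return float(np.mean((y - model(x)) ** 2))


x_train, y_train = sample(15)    # small training set
x_test, y_test = sample(1000)    # data not seen during fitting

results = {}
for degree in (1, 10):
    model = Polynomial.fit(x_train, y_train, deg=degree)  # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The more flexible model can never do worse on the training data, since the degree 1 fit is a special case of the degree 10 fit. Its error on unseen data is typically much larger, which is the signature of a model that has followed the noise.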

The alternative approach is to consider a non-parametric form of $\hat{f}$. Non-parametric models can potentially fit a wider range of possible forms for $f$ and are thus more flexible. Unfortunately non-parametric models need an extensive number of observations, often far more than in a parametric setting. In addition non-parametric methods are also prone to overfitting if not treated carefully, as described above.
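One simple example of a non-parametric method is k-nearest neighbours regression, which is not named in this article and is used here only as an illustration. It assumes no functional form for $f$. Instead it predicts the response at a new point by averaging the responses of the closest training points. The sketch below is a minimal one-dimensional version, with a made-up sine signal as the underlying relationship.

```python
import numpy as np


def knn_predict(x_train, y_train, x_new, k=5):
    """Average the responses of the k training points closest to each x_new."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Pairwise distances: rows are query points, columns are training points
    dist = np.abs(x_new[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours
    return y_train[nearest].mean(axis=1)


rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=300)  # hypothetical signal plus noise

x_new = np.array([0.5, np.pi / 2, 4.0])
print(knn_predict(x_train, y_train, x_new, k=10))
```

The parameter `k` controls flexibility. A small `k` tracks the training data closely and risks fitting noise, while a large `k` smooths more heavily. With few training points the neighbours of a new observation may be far away, which is one reason such methods need a lot of data.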

Non-parametric models may seem like a natural choice for quantitative trading models as there is seemingly an abundance of (historical) data on which to apply the models. However, the methods are not always optimal. While the increased flexibility is attractive for modelling the non-linearities in stock market data it is very easy to overfit the data due to the poor signal/noise ratio found in financial time series.

Thus a "middle-ground" of considering models with some degree of flexibility is preferred. We will discuss such problems in later articles.

A distinction is often made in statistical machine learning between *supervised* and *unsupervised* methods. On QuantStart the strategies we look at will be based almost exclusively on supervised techniques, but unsupervised techniques are certainly applicable to financial markets.

A supervised model requires that for each predictor vector $X_j$ there is an associated response $Y_j$. The "supervision" of the procedure occurs when the model for $f$ is *trained* or *fit* to this particular data. For example, when fitting a linear regression model, the OLS algorithm is used to train it, ultimately producing an estimate $\hat{\beta}$ to the vector of regression coefficients, $\beta$.

In an unsupervised model there is no corresponding response $Y_j$ for any particular predictor $X_j$. Hence there is nothing to "supervise" the training of the model. This is clearly a much more challenging environment for an algorithm to produce results as there is no form of "fitness function" with which to assess accuracy. Despite this setback, unsupervised techniques are extremely powerful. They are particularly useful in the realm of *clustering*.

A parametrised clustering model, when provided with a parameter specifying the number of clusters to identify, can often discern unanticipated relationships within the data that might not otherwise have been easily determined. Such models generally fall within the domain of *business analytics* and *consumer marketing optimisation* but they do have uses within finance, for instance in assessing clustering within volatility.
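To illustrate a clustering model that takes the number of clusters as a parameter, here is a minimal sketch of k-means (Lloyd's algorithm) written directly in NumPy. The article does not name a specific clustering method, so k-means is an assumption chosen for simplicity. The two synthetic groups of points are also invented for the example. Note that no responses are supplied at any stage.

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialise the centroids at k distinct observations chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


rng = np.random.default_rng(3)
# Two synthetic, well-separated groups of unlabelled 2-D points
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
points = np.vstack([group_a, group_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

The algorithm only ever sees the predictor vectors. The grouping emerges from the geometry of the data rather than from any supplied response, which is what makes the method unsupervised.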

In the next article we will consider different categories of machine learning techniques as well as how to assess the quality of a model.
