Updated for Python 3.10, January 2023

In this series of articles we are going to create a statistically robust process for forecasting financial time series. These forecasts will form the basis for a group of automated trading strategies. The first article in the series will discuss the modelling approach and a group of *classification algorithms* that will enable us to predict market direction.

Within these articles we will be making use of scikit-learn, a machine learning library for Python. Scikit-learn contains implementations of many machine learning techniques. Not only does this save us a great deal of time in implementing our own, but it minimises the risk of bugs introduced by our own code and allows additional verification against libraries written in other packages, such as R. This gives us a great deal of confidence if we need to create our own custom implementation (for reasons of execution speed, say).

## Process for Forecasting

A detailed explanation of the field of statistical machine learning is beyond this article. In order to utilise techniques such as Logistic Regression, Linear Discriminant Analysis and Quadratic Discriminant Analysis we need to outline some basic concepts.

### Supervised Learning Techniques

Supervised learning techniques involve a set of *known* tuples $(x_i, y_i)$, $i \in \{1,...,N \}$, with $x_i$ representing the predictor variables (such as lagged stock market returns or volume traded) and $y_i$ representing the associated response/observation variables (such as the stock market return today). In this situation we are interested in *prediction*. Given future predictor variables we wish to estimate the responses from these predictors. This is in opposition to *inference* where we are more interested in the *relationship* between the variables.

All of the algorithms we make use of in this article, along with many others that we will employ in the future, are from the supervised learning domain.

### Measuring Prediction Accuracy

The particular class of methods that we are interested in involves *binary classification*. That is, we will attempt to allocate the percentage return for a particular day into two buckets: "Up" or "Down". In a production forecaster we would be very concerned with the magnitude of this prediction and the deviations of the prediction from the actual value.

In such cases we can make use of the Mean-Squared Error, Mean Absolute Deviation and Root-Mean-Squared Error to provide an estimate of forecasting accuracy. The literature provides numerous other examples of forecasting accuracy measures.

In this instance we are only going to be concerned with the **hit rate**, which is simply the percentage of times that the forecaster achieved an accurate prediction (i.e. up when the day was up and vice versa). In later examples we will make use of a confusion matrix to determine the prediction performance on a class-by-class basis. In addition we will calculate the aforementioned values and incorporate them into our trading research process.

### Forecasting Factors

A forecasting methodology is only as good as the factors chosen as predictors. There are a staggering number of potential factors to choose from when forecasting stock market index returns. In this article we are going to restrict the factors to time lags of the current percentage returns. This is not because they are the best predictors, rather it is because it is straightforward to demonstrate the process of forecasting on an easily obtained dataset.

Forecasting factor choice is extremely important, if not the most important, component of the forecaster. Even simple machine learning techniques will produce good results on well-chosen factors. Note that the converse is not often the case. "Throwing an algorithm at a problem" will usually lead to poor forecasting accuracy.

For this forecaster specifically, I have chosen the first and second time lags of the percentage returns as the predictors for the current stock market direction. This is a relatively arbitrary choice and there is plenty of scope for modification, for instance by adding in additional lags or the volume of shares traded. It is generally better to have fewer predictors in a model, although there are statistical tests available which can demonstrate the predictive capability of each factor.

## Forecasting S&P500 with Logistic Regression, LDA and QDA

The S&P500 is a weighted index of the 500 largest publicly traded companies (by market capitalisation) in the US stock market. It is often considered an equities "benchmark". Many derivative products exist in order to allow speculation or hedging on the index. In particular, the S&P500 E-Mini Index Fultures Contract is an extremely liquid means of trading the index.

In this section we are going to use three *classifiers* to predict the direction of the closing price at day $N$ based solely on price information known at day $N-1$. An upward directional move means that the closing price at $N$ is higher than the price at $N-1$, while a downward move implies a closing price at $N$ lower than at $N-1$.

If we can determine the direction of movement in a manner that significantly exceeds a 50% hit rate, with low error and a good statistical significance, then we are on the road to forming a basic systematic trading strategy based on our forecasts. At this stage we're not concerned with the most up to date machine learning classification algorithms. Right now we're just introducing concepts and so we'll begin the discussion on forecasting with some elementary methods.

### Logistic Regression

The first technique we will consider is logistic regression (LR). In our case we are going to use LR to measures the relationship between a binary categorical dependent variable ("Up" or "Down") and multiple independent continuous variables (the lagged percentage returns). The model provides the *probability* that a particular (following) day will be categorised as "Up" or "Down". In this implementation we have chosen to assign each day as "Up" if the probability exceeds 0.5. We could make use of a different threshold, but for simplicity I have chosen 0.5.

LR uses the logistic formula to model the probability of obtaining an "Up" day ($Y=U$) based on the lag factors ($L_1$, $L_2$):

\begin{eqnarray} p(Y=U|L_1, L_2) = \frac{e^{\beta_0 + \beta_1 L_1 + \beta_2 L_2}}{1+e^{\beta_0 + \beta_1 L_1 + \beta_2 L_2}} \end{eqnarray}The logistic function is used because it provides a probability between $[0,1]$ for all values of $L_1$ and $L_2$, unlike linear regression where negative probabilities can be generated in the same setting.

To fit the model (i.e. estimate the $\beta_i$ coefficients) the maximum likelihood method is used. Fortunately for us, the implementation of the fitting and prediction of the LR model is handled by the scikit-learn library.

### Linear Discriminant Analysis

The next technique used is Linear Discriminant Analysis (LDA). LDA differs from LR in because in LR we model $P(Y=U|L_1,L_2)$ as a conditional distribution of the response $Y$ given the predictors $L_i$, using a logistic function. In LDA the distribution of the $L_i$ variables are modelled separately, given $Y$, and $P(Y=U|L_1,L_2)$ is obtained via Bayes' Theorem.

Essentially, LDA results from assuming that predictors are drawn from a multivariate Gaussian distribution. After calculting estimates for the parameters of this distribution, the parameters can be input into Bayes' Theorem in order to make predictions about which class an observation belongs to.

LDA assumes that all classes share the same covariance matrix.

We won't dwell on the formulae for estimating the distribution or *posterior probabilities* that are needed to make predictions, as once again scikit-learn handles this for us.

### Quadratic Discriminant Analysis

Quadratic Discriminant Analysis (QDA) is closely related to LDA. The significant difference is that each class can now possess its own covariance matrix.

QDA generally performs better when the decision boundaries are non-linear. LDA generally performs better when there are fewer training observations (i.e. when needing to reduce variance). QDA, on the other hand, performs well when the training set is large (i.e. variance is of less concern). The use of one or the other ultimately comes down to the bias-variance trade-off.

As with LR and LDA, scikit-learn takes care of the QDA implementation so we only need to provide it with training/test data for parameter estimation and prediction.

### Python Implementation

For the implementation of these forecasters we will make use of NumPy, pandas and scikit-learn. If you would like to follow along with the code and are new to Python, we highly recommend our early career researcher series which will help you install Python and create an algorithmic trading prototyping environment.

Once you have you Python environment up and running the first step is to import the relevant modules and libraries. Depending on your environment it maybe necessary for you to install the necessary library with a suitable package manager such as pip or conda. In order to follow along with this example you will need:

- Numpy v1.23
- Pandas v1.5
- Scikit-Learn v1.2

You will also need to obtain as CSV file for SPY from January 3rd 2000 and 31st December 2005. The choice of data vendor is up to you, but your CSV needs to have the format:

```
Date,Open,High,Low,Close,Adj Close,Volume
2000-01-03,148.250000,148.250000,143.875000,145.437500,95.308777,8164300
2000-01-04,143.531250,144.062500,139.640625,139.750000,91.581642,8089800
2000-01-05,139.937500,141.531250,137.250000,140.000000,91.745522,12177900
2000-01-06,139.625000,141.500000,137.750000,137.750000,90.271004,6227200
```

Below are the imports that need to be made:

```
import datetime
import datetime
import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
```

Now that the libraries are imported, we need to create a pandas DataFrame that contains the lagged percentage returns for a prior number of days (defaulting to five). `create_lagged_series`

will take a CSV file and create a lagged DataFrame across the period specified:

```
def create_dataframe(spy, start_date, lags=5):
# Obtain stock information
ts = pd.read_csv(spy)
ts = ts.set_index(pd.DatetimeIndex(ts['Date']))
# Create lagged dataframe
tslag = pd.DataFrame(index=ts.index)
tslag["Today"] = ts["Adj Close"]
tslag["Volume"] = ts["Volume"]
# Create the shifted lag series of prior trading period close values
for i in range(0, lags):
tslag[f"Lag{i+1}"] = ts["Adj Close"].shift(i+1)
# Create the returns dataframe
tsret = pd.DataFrame(index=tslag.index)
tsret["Volume"] = tslag['Volume']
tsret["Today"] = tslag["Today"].pct_change()*100.0
# If any of the values of percentage returns are zero, set them to
# a small number. This stops issues with QDA model.
tsret.loc[:, 'Today'] = tsret.apply(
lambda row: 0.0001 if abs(row['Today']) < 0.0001 else row['Today'], axis=1
)
# Create the lagged percentage returns columns
for i in range(0, lags):
tsret[f"Lag{i+1}"] = tslag[f"Lag{i+1}"].pct_change()*100.0
# Create a "Direction" column (+1 or -1) indicating an up/down day
tsret["Direction"] = np.sign(tsret["Today"])
tsret = tsret[tsret.index >= start_date]
return tsret
```

The next helper function is designed to create a percentage `hit_rate`

for each model, by eliminating duplicated code. It relies on the fact that the Logistic Regression, LDA and QDA objects have the same methods (`fit`

and `predict`

). The hit rate is output to the terminal:

```
def fit_model(name, model, X_train, y_train, X_test, pred):
"""Fits a classification model (for our purposes this is LR, LDA and QDA)
using the training data, then makes a prediction and subsequent "hit rate"
for the test data."""
# Fit and predict the model on the training, and then test, data
model.fit(X_train, y_train)
pred[name] = model.predict(X_test)
# Create a series with 1 being correct direction, 0 being wrong
# and then calculate the hit rate based on the actual direction
pred["%s_Correct" % name] = (1.0+pred[name]*pred["Actual"])/2.0
hit_rate = np.mean(pred["%s_Correct" % name])
print( "%s: %.3f" % (name, hit_rate))
```

Finally, we tie it together with a `__main__`

function. In this instance we're going to attempt to forecast the US stock market direction in 2005, using returns data from 2001 to 2004:

```
if __name__ == "__main__":
# Create a lagged series of the S&P500 US stock market index
spy = "PATH/TO/YOUR/CSV"
snpret = create_dataframe(spy, dt(2001, 1, 10), lags=5)
# Use the prior two days of returns as predictor values, with direction as the response
X = snpret[["Lag1","Lag2"]]
y = snpret["Direction"]
# The test data is split into two parts: Before and after 1st Jan 2005.
start_test = dt(2005,1,1)
# Create training and test sets
X_train = X[X.index < start_test]
X_test = X[X.index >= start_test]
y_train = y[y.index < start_test]
y_test = y[y.index >= start_test]
# Create prediction DataFrame
pred = pd.DataFrame(index=y_test.index)
pred["Actual"] = y_test
# Create and fit the three models
print("Hit Rates:")
models = [("LR", LogisticRegression()), ("LDA", LDA()), ("QDA", QDA())]
for m in models:
fit_model(m[0], m[1], X_train, y_train, X_test, pred)
```

The output of the code is as follows:

```
Hit Rates:
LR: 0.556
LDA: 0.556
QDA: 0.552
```

It can be seen that the Logistic Regression and Linear Discriminant Analyser were both able to gain a 56% hit rate. However, the Quadratic Discriminant Analyser performed slightly worse and managed to gain a 55% hit rate.

Thus, there is hope that we *may* be able to partially predict the US stock market. There are a few caveats to this forecasting methodology:

- We haven't used any form of
**cross-validation**to reduce fitting errors. A production forecaster would require such analysis to be considered robust. - The forecaster has only been trained on data between
**2001-2004**inclusive. More recent stock market data may have substantially different prediction accuracy. - We haven't actually attempted to trade off this information. In particular, how would we actually
**execute trades**? Would we use the US e-mini future? Would we make use of Market-On-Open (MOO) or Market-On-Close (MOC) orders? We would also need to consider transaction costs.

In subsequent articles we will consider these issues in greater depth.

## A Warning on Random Forecasting

In this section I want to overtly highlight the problem of statistical significance when dealing with forecasters. In addition to the forecaster outlined above I also generated a "forecasting" series based solely on the sign of random draws from a standard normal distribution. Note that in the same period it has produced a forecasting hit-rate of 53.4% and yet the method used to generate the series is essentially no different to tossing a coin! Keep this in mind whenever you carry out forecasting procedures as it can often lead to dire trading performance if not taken into account.

In the following articles we will consider more advanced supervised non-linear forecasting classifiers such as artificial neural networks (ANN) and support vector machines (SVM). With a "stable" of machine learning techniques at our disposal we will subsequently be able to make use of *ensemble methods* to produce a forecasting accuracy and robustness that can sometimes exceed those of any individual forecaster.