In this series of articles we are going to create a statistically robust process for forecasting financial time series. These forecasts will form the basis for a group of automated trading strategies. The first article in the series will discuss the modelling approach and a group of *classification algorithms* that will enable us to predict market direction.

Within these articles we will be making use of scikit-learn, a machine learning library for Python. Scikit-learn contains implementations of many machine learning techniques. Not only does this save us a great deal of time in implementing our own, but it minimises the risk of bugs introduced by our own code and allows additional verification against libraries written in other packages, such as R. This gives us a great deal of confidence if we need to create our own custom implementation (for reasons of execution speed, say).

## Process for Forecasting

A detailed explanation of the field of statistical machine learning is beyond this article. In order to utilise techniques such as Logistic Regression, Linear Discriminant Analysis and Quadratic Discriminant Analysis we need to outline some basic concepts.

### Supervised Learning Techniques

Supervised learning techniques involve a set of *known* tuples $(x_i, y_i)$, $i \in \{1,...,N \}$, with $x_i$ representing the predictor variables (such as lagged stock market returns or volume traded) and $y_i$ representing the associated response/observation variables (such as the stock market return today). In this situation we are interested in *prediction*. Given future predictor variables we wish to estimate the responses from these predictors. This is in opposition to *inference* where we are more interested in the *relationship* between the variables.

All of the algorithms we make use of in this article, along with many others that we will employ in the future, are from the supervised learning domain.

### Measuring Prediction Accuracy

The particular class of methods that we are interested in involves *binary classification*. That is, we will attempt to allocate the percentage return for a particular day into two buckets: "Up" or "Down". In a production forecaster we would be very concerned with the magnitude of this prediction and the deviations of the prediction from the actual value.

In such cases we can make use of the Mean-Squared Error, Mean Absolute Deviation and Root-Mean-Squared Error to provide an estimate of forecasting accuracy. The literature provides numerous other examples of forecasting accuracy measures.

In this instance we are only going to be concerned with the **hit rate**, which is simply the percentage of times that the forecaster achieved an accurate prediction (i.e. up when the day was up and vice versa). In later examples we will make use of a confusion matrix to determine the prediction performance on a class-by-class basis. In addition we will calculate the aforementioned values and incorporate them into our trading research process.

### Forecasting Factors

A forecasting methodology is only as good as the factors chosen as predictors. There are a staggering number of potential factors to choose from when forecasting stock market index returns. In this article we are going to restrict the factors to time lags of the current percentage returns. This is not because they are the best predictors, rather it is because it is straightforward to demonstrate the process of forecasting on an easily obtained dataset.

Forecasting factor choice is extremely important, if not the most important, component of the forecaster. Even simple machine learning techniques will produce good results on well-chosen factors. Note that the converse is not often the case. "Throwing an algorithm at a problem" will usually lead to poor forecasting accuracy.

For this forecaster specifically, I have chosen the first and second time lags of the percentage returns as the predictors for the current stock market direction. This is a relatively arbitrary choice and there is plenty of scope for modification, for instance by adding in additional lags or the volume of shares traded. It is generally better to have fewer predictors in a model, although there are statistical tests available which can demonstrate the predictive capability of each factor.

## Forecasting S&P500 with Logistic Regression, LDA and QDA

The S&P500 is a weighted index of the 500 largest publicly traded companies (by market capitalisation) in the US stock market. It is often considered an equities "benchmark". Many derivative products exist in order to allow speculation or hedging on the index. In particular, the S&P500 E-Mini Index Futures Contract is an extremely liquid means of trading the index.

In this section we are going to use three *classifiers* to predict the direction of the closing price at day $N$ based solely on price information known at day $N-1$. An upward directional move means that the closing price at $N$ is higher than the price at $N-1$, while a downward move implies a closing price at $N$ lower than at $N-1$.

If we can determine the direction of movement in a manner that significantly exceeds a 50% hit rate, with low error and a good statistical significance, then we are on the road to forming a basic systematic trading strategy based on our forecasts. At this stage we're not concerned with the most up to date machine learning classification algorithms. Right now we're just introducing concepts and so we'll begin the discussion on forecasting with some elementary methods.

### Logistic Regression

The first technique we will consider is logistic regression (LR). In our case we are going to use LR to measures the relationship between a binary categorical dependent variable ("Up" or "Down") and multiple independent continuous variables (the lagged percentage returns). The model provides the *probability* that a particular (following) day will be categorised as "Up" or "Down". In this implementation we have chosen to assign each day as "Up" if the probability exceeds 0.5. We could make use of a different threshold, but for simplicity I have chosen 0.5.

LR uses the logistic formula to model the probability of obtaining an "Up" day ($Y=U$) based on the lag factors ($L_1$, $L_2$):

\begin{eqnarray} p(Y=U|L_1, L_2) = \frac{e^{\beta_0 + \beta_1 L_1 + \beta_2 L_2}}{1+e^{\beta_0 + \beta_1 L_1 + \beta_2 L_2}} \end{eqnarray}The logistic function is used because it provides a probability between $[0,1]$ for all values of $L_1$ and $L_2$, unlike linear regression where negative probabilities can be generated in the same setting.

To fit the model (i.e. estimate the $\beta_i$ coefficients) the maximum likelihood method is used. Fortunately for us, the implementation of the fitting and prediction of the LR model is handled by the scikit-learn library.

### Linear Discriminant Analysis

The next technique used is Linear Discriminant Analysis (LDA). LDA differs from LR in because in LR we model $P(Y=U|L_1,L_2)$ as a conditional distribution of the response $Y$ given the predictors $L_i$, using a logistic function. In LDA the distribution of the $L_i$ variables are modelled separately, given $Y$, and $P(Y=U|L_1,L_2)$ is obtained via Bayes' Theorem.

Essentially, LDA results from assuming that predictors are drawn from a multivariate Gaussian distribution. After calculting estimates for the parameters of this distribution, the parameters can be input into Bayes' Theorem in order to make predictions about which class an observation belongs to.

LDA assumes that all classes share the same covariance matrix.

I won't dwell on the formulae for estimating the distribution or *posterior probabilities* that are needed to make predictions, as once again scikit-learn handles this for us.

### Quadratic Discriminant Analysis

Quadratic Discriminant Analysis (QDA) is closely related to LDA. The significant difference is that each class can now possess its own covariance matrix.

QDA generally performs better when the decision boundaries are non-linear. LDA generally performs better when there are fewer training observations (i.e. when needing to reduce variance). QDA, on the other hand, performs well when the training set is large (i.e. variance is of less concern). The use of one or the other ultimately comes down to the bias-variance trade-off.

As with LR and LDA, scikit-learn takes care of the QDA implementation so we only need to provide it with training/test data for parameter estimation and prediction.

### Python Implementation

For the implementation of these forecasters we will make use of NumPy, pandas and scikit-learn. I've previously written a tutorial on how to install these libraries. I've heavily commented the code itself so it should be easy to ascertain what is happening.

The first step is to import the relevant modules and libraries. We're going to import the `LogisticRegression`

, `LDA`

and `QDA`

classifiers for this forecaster:

import datetime import numpy as np import pandas as pd import sklearn from pandas.io.data import DataReader from sklearn.linear_model import LogisticRegression from sklearn.lda import LDA from sklearn.qda import QDA

Now that the libraries are imported, we need to create a pandas DataFrame that contains the lagged percentage returns for a prior number of days (defaulting to five). `create_lagged_series`

will take a stock symbol (as recognised by Yahoo Finance) and create a lagged DataFrame across the period specified:

def create_lagged_series(symbol, start_date, end_date, lags=5): """This creates a pandas DataFrame that stores the percentage returns of the adjusted closing value of a stock obtained from Yahoo Finance, along with a number of lagged returns from the prior trading days (lags defaults to 5 days). Trading volume, as well as the Direction from the previous day, are also included.""" # Obtain stock information from Yahoo Finance ts = DataReader(symbol, "yahoo", start_date-datetime.timedelta(days=365), end_date) # Create the new lagged DataFrame tslag = pd.DataFrame(index=ts.index) tslag["Today"] = ts["Adj Close"] tslag["Volume"] = ts["Volume"] # Create the shifted lag series of prior trading period close values for i in xrange(0,lags): tslag["Lag%s" % str(i+1)] = ts["Adj Close"].shift(i+1) # Create the returns DataFrame tsret = pd.DataFrame(index=tslag.index) tsret["Volume"] = tslag["Volume"] tsret["Today"] = tslag["Today"].pct_change()*100.0 # If any of the values of percentage returns equal zero, set them to # a small number (stops issues with QDA model in scikit-learn) for i,x in enumerate(tsret["Today"]): if (abs(x) < 0.0001): tsret["Today"][i] = 0.0001 # Create the lagged percentage returns columns for i in xrange(0,lags): tsret["Lag%s" % str(i+1)] = tslag["Lag%s" % str(i+1)].pct_change()*100.0 # Create the "Direction" column (+1 or -1) indicating an up/down day tsret["Direction"] = np.sign(tsret["Today"]) tsret = tsret[tsret.index >= start_date] return tsret

The next helper function is designed to create a percentage `hit_rate`

for each model, by eliminating duplicated code. It relies on the fact that the Logistic Regression, LDA and QDA objects have the same methods (`fit`

and `predict`

). The hit rate is output to the terminal:

def fit_model(name, model, X_train, y_train, X_test, pred): """Fits a classification model (for our purposes this is LR, LDA and QDA) using the training data, then makes a prediction and subsequent "hit rate" for the test data.""" # Fit and predict the model on the training, and then test, data model.fit(X_train, y_train) pred[name] = model.predict(X_test) # Create a series with 1 being correct direction, 0 being wrong # and then calculate the hit rate based on the actual direction pred["%s_Correct" % name] = (1.0+pred[name]*pred["Actual"])/2.0 hit_rate = np.mean(pred["%s_Correct" % name]) print "%s: %.3f" % (name, hit_rate)

Finally, we tie it together with a `__main__`

function. In this instance we're going to attempt to forecast the US stock market direction in 2005, using returns data from 2001 to 2004:

if __name__ == "__main__": # Create a lagged series of the S&P500 US stock market index snpret = create_lagged_series("^GSPC", datetime.datetime(2001,1,10), datetime.datetime(2005,12,31), lags=5) # Use the prior two days of returns as predictor values, with direction as the response X = snpret[["Lag1","Lag2"]] y = snpret["Direction"] # The test data is split into two parts: Before and after 1st Jan 2005. start_test = datetime.datetime(2005,1,1) # Create training and test sets X_train = X[X.index < start_test] X_test = X[X.index >= start_test] y_train = y[y.index < start_test] y_test = y[y.index >= start_test] # Create prediction DataFrame pred = pd.DataFrame(index=y_test.index) pred["Actual"] = y_test # Create and fit the three models print "Hit Rates:" models = [("LR", LogisticRegression()), ("LDA", LDA()), ("QDA", QDA())] for m in models: fit_model(m[0], m[1], X_train, y_train, X_test, pred)

The output of the code is as follows:

Hit Rates: LR: 0.560 LDA: 0.560 QDA: 0.599

It can be seen that the Logistic Regression and Linear Discriminant Analyser were both able to gain a 56% hit rate. However, the Quadratic Discriminant Analyser was able to improve on both of them to produce a 60% hit rate. For the particular period analysed, this is likely due to the fact that there is some non-linearity in the relationship between the lagged factors and the direction that isn't well-captured in the linear analysis.

Thus, there is hope that we *may* be able to partially predict the US stock market. There are a few caveats to this forecasting methodology:

- We haven't used any form of
**cross-validation**to reduce fitting errors. A production forecaster would require such analysis to be considered robust. - The forecaster has only been trained on data between
**2001-2004**inclusive. More recent stock market data may have substantially different prediction accuracy. - We haven't actually attempted to trade off this information. In particular, how would we actually
**execute trades**? Would we use the US e-mini future? Would we make use of Market-On-Open (MOO) or Market-On-Close (MOC) orders? We would also need to consider transaction costs.

In subsequent articles we will consider these issues in greater depth.

## A Warning on Random Forecasting

In this section I want to overtly highlight the problem of statistical significance when dealing with forecasters. In addition to the forecaster outlined above I also generated a "forecasting" series based solely on the sign of random draws from a standard normal distribution. Note that in the same period it has produced a forecasting hit-rate of 53.4% and yet the method used to generate the series is essentially no different to tossing a coin! Keep this in mind whenever you carry out forecasting procedures as it can often lead to dire trading performance if not taken into account.

In the following articles we will consider more advanced supervised non-linear forecasting classifiers such as artificial neural networks (ANN) and support vector machines (SVM). With a "stable" of machine learning techniques at our disposal we will subsequently be able to make use of *ensemble methods* to produce a forecasting accuracy and robustness that can sometimes exceed those of any individual forecaster.