In the previous article on studying to become a quant trader we touched on the importance of statistical and machine learning. Many of you contacted me in regard to the "state of the art" of such machine learning methods, and how they're applied in the quant finance world. In this article I want to outline the resources necessary to learn machine learning techniques so that you'll be better prepared for a role as a quant trader.

Statistical learning is extremely important in quant trading research. We can bring to bear the entire weight of the *scientific method* and *hypothesis testing* in order to rigorously assess the quant trading research process. For quantitative trading we are interested in testable, repeatable results that are subject to constant scrutiny. This allows easy replacement of trading strategies as and when performance degrades. Note that this is in stark contrast to the approach taken in "discretionary" trading where performance and risk are not often assessed in this manner.

## Why Should We Use The Scientific Method In Quantitative Trading?

The statistical approach to quant trading is designed to eliminate issues that surround discretionary methods. A great deal of discretionary technical trading is rife with cognitive biases, including loss aversion, confirmation bias and the bandwagon effect. Quant trading research uses alternative mathematical methods to mitigate such behaviours and thus enhance trading performance.

In order to carry out such a methodical process, quant trading researchers must maintain a continuously skeptical mindset, and any strategy ideas or hypotheses about market behaviour are subject to continual scrutiny. A strategy idea will only be put into a "production" environment after extensive statistical analysis, testing and refinement. This is necessary because the market has a rather low signal-to-noise ratio, which creates difficulties in forecasting and thus leads to a challenging trading environment.

## What Modelling Problems Do We Encounter In Quantitative Finance?

The goal of quantitative trading research is to produce algorithms and technology that can satisfy a certain investment mandate. In practice this translates into creating trading strategies (and related infrastructure) that produce consistent returns above a certain pre-determined benchmark, net of costs associated with the trading transactions, while minimising "risk". Hence there are a few levers that can be pulled to enhance the financial objectives.
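To make this objective a little more concrete, here is a minimal sketch of how "returns above a benchmark, net of costs" might be measured. The return, benchmark and cost figures are invented purely for illustration, and `net_excess_returns` and `information_ratio` are hypothetical helper names. The ratio of mean excess return to its volatility is just one common way to summarise return per unit of "risk"; a real mandate may define risk quite differently.

```python
from statistics import mean, stdev


def net_excess_returns(gross, benchmark, costs):
    """Per-period strategy return, minus trading costs, minus the benchmark return."""
    return [g - c - b for g, b, c in zip(gross, benchmark, costs)]


def information_ratio(excess):
    """Mean excess return per unit of excess-return volatility (not annualised)."""
    return mean(excess) / stdev(excess)


# Hypothetical per-period figures, for illustration only.
gross = [0.012, -0.004, 0.009, 0.003, -0.001]     # strategy returns before costs
benchmark = [0.008, -0.002, 0.005, 0.004, 0.000]  # benchmark returns
costs = [0.001] * 5                               # assumed transaction cost per period

excess = net_excess_returns(gross, benchmark, costs)
ratio = information_ratio(excess)
print(f"mean net excess return: {mean(excess):.4f}, ratio: {ratio:.2f}")
```

In this made-up example the strategy beats the benchmark before costs but not after them, which is exactly why the cost and risk "levers" matter as much as the signal itself.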

A great deal of attention is often given to the signal/alpha generator, i.e. "the strategy". However, the best funds and retail quants will also spend a significant amount of time modelling/reducing transaction costs, effectively managing risk and determining the optimal portfolio. This article is primarily aimed at the alpha generator component of the stack, but please be aware that the other components are of equal importance if successful long-term strategies are to be carried out.

We will now investigate problems encountered in signal generation and how to solve them. The following is a basic list of problem types (which clearly overlap) that are often encountered in signal generation:

- **Forecasting/Prediction** - The most common technique is direct forecasting of a financial asset price/direction based on prior prices (or fundamental factors). This usually involves detection of an underlying signal in the "noise" of the market that can be predicted and thus traded upon. It might also involve regressing against other factors (including lags in the original time series) in order to assess the future *response* against future *predictors*.
- **Clustering/Classification** - Clustering or classification techniques are methods designed to group data into certain classes. These can be binary in nature, e.g. "up" or "down", or multiply-grouped, e.g. "weak volatility", "strong volatility", "medium volatility".
- **Sentiment Analysis** - More recent innovations in natural language processing and computational speed have led to sophisticated "sentiment analysis" techniques, which are essentially classification methods designed to group data based on some underlying sentiment factors. These could be directional in nature, e.g. "bullish", "bearish", "neutral" or emotional such as "happy", "sad", "positive" or "negative". Ultimately this will lead to a trading signal of some form.
- **Big Data** - Alternative sources of data, such as consumer social media activities, often lead to terabytes (or greater) of data that requires more novel software/hardware in order to interpret. New algorithm implementations have been created in order to handle such "big data".
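The idea of regressing a series against its own lags, mentioned under forecasting above, can be sketched as follows. This is a toy illustration under stated assumptions rather than a trading strategy: the "returns" are simulated from a first-order autoregressive process with a known coefficient, and `simulate_ar1` and `fit_ar1` are hypothetical helper names. Only the Python standard library is used to keep the example self-contained.

```python
import random


def simulate_ar1(n, phi, sigma, seed):
    """Simulate x[t] = phi * x[t-1] + noise, a stand-in for a return series."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, sigma))
    return x


def fit_ar1(series):
    """Ordinary least squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    x, y = series[:-1], series[1:]  # predictor is the one-period lag
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Simulated data with an assumed lag coefficient of 0.3.
returns = simulate_ar1(n=5000, phi=0.3, sigma=0.01, seed=42)
a, b = fit_ar1(returns)

# One-step-ahead forecast, converted into a binary "up"/"down" signal.
forecast = a + b * returns[-1]
signal = "up" if forecast > 0 else "down"
print(f"estimated lag coefficient: {b:.3f}, next-period signal: {signal}")
```

The last two lines also show how a forecast turns into the binary classification described in the list. Real market data has a far weaker lag relationship than this simulation, so the estimated coefficient would be much harder to distinguish from noise.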

## Modelling Methodology

There are countless textbooks on statistical modelling, probability and machine learning. It is actually quite challenging to know where to begin. I myself have had to go through this process when transitioning from a physical modelling mindset (during my own PhD) towards a statistical approach while in industry. I described the two books I consider the "best" to get started in this field in the previous article, but to recap they are:

- An Introduction to Statistical Learning by Gareth James et al
- The Elements of Statistical Learning by Trevor Hastie et al

The first book doesn't require a great deal of mathematical sophistication. The necessary background includes typical college linear algebra, calculus and probability theory. The second book is more advanced and goes deeper into the theory. For that you should have some good grounding in probability theory, prior statistical methods and modelling.

These books will teach you about the following topics. By studying the books (and carrying out the associated "labs" in R) you will gain a solid insight into when certain algorithms are applicable.

- **Statistical Modelling and Limitations** - The books will outline what statistical learning is and isn't capable of along with the tradeoffs that are necessary when carrying out such research. The difference between *prediction* and *inference* is outlined as well as the difference between *supervised* and *unsupervised* learning. The bias-variance tradeoff is also explained in detail.
- **Linear Regression** - Linear regression (LR) is one of the simplest supervised learning techniques. It assumes a model where the predicted values are a linear function of the predictor variable(s). While this may seem simplistic compared to the remaining methods in this list, linear regression is still widely utilised in the financial industry. Being aware of LR is important in order to grasp the later methods, some of which are generalisations of LR.
- **Supervised Classification: Logistic Regression, LDA, QDA, KNN** - Supervised classification techniques such as Logistic Regression, Linear/Quadratic Discriminant Analysis and K-Nearest Neighbours are techniques for modelling qualitative classification situations, such as prediction of whether a stock index will move up or down (i.e. a binary value) in the next time period.
- **Resampling Techniques: Bootstrapping, Cross-Validation** - Resampling techniques are necessary in quantitative finance (and statistics in general) because of the dangers of model-fitting. Such techniques are used to ascertain how a model behaves over different *training sets* and how to minimise the problem of "overfitting" models.
- **Decision Tree Methods: Bagging, Random Forests** - Decision trees are a type of *graph* that is often employed in classification settings. Bagging and Random Forest techniques are *ensemble methods* making use of such trees to reduce overfitting and reduce variance in individually fitted supervised learning methods.
- **Neural Networks** - Artificial Neural Networks (ANN) are a machine learning technique often employed in a supervised manner to find non-linear relationships between predictors and responses. In the financial domain they are often used for time series prediction and forecasting.
- **Support Vector Machines** - SVMs are also classification or regression tools, which work by constructing a *hyperplane* in high or infinite dimensional spaces. The kernel trick allows non-linear classification to occur by a mapping of the original space into an inner-product space.
- **Unsupervised Methods: PCA, K-Means, Hierarchical Clustering, NNMF** - Unsupervised learning techniques are designed to find hidden structure in data, without the use of an objective or reward function to "train" on. Additionally, unsupervised techniques are often used to pre-process data.
- **Ensemble Methods** - Ensemble methods make use of multiple separate statistical learning models in order to achieve greater predictive capability than could be achieved from any of the individual models.
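As a small taste of two of the topics above, the following sketch combines linear regression with k-fold cross-validation to show how a model's error varies across different training sets. It is a minimal illustration under stated assumptions: the data is synthetic and independent (a known line plus Gaussian noise), `fit_line` and `k_fold_mse` are hypothetical helper names, and only the Python standard library is used. The books' own labs are in R and use its built-in tooling instead.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_mse(xs, ys, k, seed=0):
    """Return the held-out mean squared error for each of k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)       # shuffling assumes independent samples
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Synthetic data: y = 0.5 + 2x plus noise with standard deviation 0.1.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [0.5 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]

scores = k_fold_mse(xs, ys, k=5)
print("held-out MSE per fold:", [round(s, 4) for s in scores])
```

The spread of the per-fold errors gives a rough sense of how sensitive the fit is to the particular training set. Note that shuffling is only reasonable here because the synthetic samples are independent; with financial time series, random shuffling lets information from the future leak into the training set, so order-preserving splits are usually the safer choice.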

To become an adept quantitative trading researcher it is essential to be familiar with the process of statistical modelling. An exhaustive knowledge of machine learning techniques is of lesser importance than a deeper understanding of the modelling process itself. Make sure to always keep in mind the core ideas of modelling assumptions, the bias-variance tradeoff, algorithm applicability and cognitive biases when carrying out quantitative trading research.