Beginner's Guide to Unsupervised Learning

Beginner's Guide to Unsupervised Learning

The majority of machine learning posts to date on QuantStart have all been about supervised learning. In this post we are going to take a look at unsupervised learning, which is a far more challenging area of machine learning.

Supervised learning involves taking a number of data observations, each of which contains a feature, or predictor, vector as well as an associated output, or response. The goal of supervised learning is to try and predict the output/response from the associated features/predictors. It is "supervised" because in the training, or supervision, phase of the learning process the algorithm has access to the "ground truth" via known responses to certain inputs. It uses these to adjust its model parameters such that when it is exposed to new features it can attempt to make an estimate of the response.

In unsupervised learning we still have access to the features but we do not have an associated response. Instead we are interested solely in attributes of the features themselves. This might include whether the features form specific clusters or sub-groups in feature space. It might also include whether we are able to describe very high-dimensional data in a much lower-dimensional setting.

Unsupervised learning techniques are often motivated by the fact that it can be prohibitive in terms of time and/or money to "label" feature data, which would allow it to be analysed using supervised techniques. An additional motivation is due to the fact that images, video, natural language documents and scientific research data (such as gene expressions), once quantified, possess very high dimensionality. Such high-dimensionality requires supervised learning techniques with many degrees of freedom, potentially leading to overfitting and thus poor test performance. Unsupervised learning techniques are a partial solution to these problems.

Unfortunately, the lack of "ground truth" or "supervision" for unsupervised techniques often leads to subjective assessment of their performance. There are no widely agreed approaches for quantifying how well unsupervised algorithms have done. Performance is largely determined on a case-by-case basis using heuristic approaches. Such "judgement based" assessments might seem unscientific to quantitatively trained individuals, but unsupervised techniques have proven to be extremely useful in many research areas.

Unsupervised learning techniques are often deployed in the realms of anomaly detection, purchasing habit analysis, recommendation systems and natural language processing. In quantitative finance they find usage in de-noising datasets, portfolio/asset clustering, market regime detection and trade signal generation with natural language processing.

High Dimensional Data

Quantitative finance and algorithmic trading extend well beyond analysis of asset price time series. The increased competition from the proliferation of quantitative funds has forced new and old firms alike to consider alternative data sources. Many of these sources are inhomogeneous, non-numeric and form extremely large databases. When quantified, often by a process known as vectorisation, much of this data is extremely high-dimensional. Examples include satellite imagery, high-resolution video, corpora of text documents and sensor data.

To provide some scope as the extreme dimensionality of some datasets, consider a standard 1080p monitor, which has a resolution of $1920 \times 1080 = 2073600$ pixels. If we restrict each of these pixels to only being black or white (that is, "off" or "on"), then there are $2^{2073600}$ potential images that can be displayed. This is a vast number (try entering it into your Python terminal!). It becomes significantly worse when considering the fact that each pixel often has $2^{24}$ potential colours (three 8-bit channels for Red, Green and Blue respectively).

Hence there is a significant motivation when searching through such datasets to reduce the dimensionality to a more manageable level by trying to find lower dimensional subspaces that still capture the essence of the data signal. A key problem is that even with a huge number of samples, the "training data cannot be expected to populate the space"[3]. If $n$ is the number of samples available and $p$ is the dimensionality of the space, then we are in a situation where $p \gg n$. In essence, there are large subsets of the space where very little is known. This problem is often referred to as the Curse of Dimensionality.

A lot of unsupervised learning is thus concerned with means of reducing this dimensionality to a manageable level but still retaining the "signal" within the data. Mathematically, we are attempting to describe the key variations in the data using a lower dimensional manifold of dimension $q < p$, which is embedded within the larger $p$-dimensional space. Dimensionality reduction algorithms such as linear Principal Components Analysis (PCA) and non-linear kernel PCA have been developed for this task.

Mathematical Overview of Unsupervised Learning

In a supervised learning task we have access to a training set consisting of $n$ pairs of feature vectors, or predictors, ${\bf x}_i \in \mathbb{R}^p$ as well as associated outputs or responses, $y_i \in \mathbb{R}$. Thus our dataset consists of $n$ tuples $({\bf x}_1, y_1), \ldots, ({\bf x}_n, y_n)$. The responses, $y_i$ can be considered as "labels" for the set of features and are used to guide the supervised learning algorithm in its training phase. In order train the model we need to use a loss function between the true value of the response $y$ and its estimate from the model $\hat{y}$, given by $L(y, \hat{y})$.

The unsupervised learning setting differs in that we only have access to the predictors ${\bf x}_i$. That is, we do not have associated labelled responses $y_i$ for each data point. Thus there is no concept of training or supervision for such techniques as there is nothing for the algorithm to use for ground truth. Instead, we are interested solely in the structure of the ${\bf x}_i$s themselves.

One approach involves formulating the task probabilistically via a concept known as density estimation[2], [4].

In the supervised learning case we wish to build models of the form $p(y_i \mid {\bf x}_i, \theta)$. That is we are specifically interested in the distribution of the responses $y_i$, conditional on both the feature vectors ${\bf x}_i$ and the parameters of the model, $\theta$. This is known as conditional density estimation.

In contrast, in the unsupervised case, since we do not have access to the responses $y_i$, we are interested in probabilistic models of the form $p({\bf x}_i \mid \theta)$. That is, the distribution of the feature vectors ${\bf x}_i$ conditional on the parameters of the model, $\theta$. This is known as unconditional density estimation.

As mentioned above, one of the main challenges in unsupervised learning problems is that often $p \gg n$. That is, the dimensionality of the feature space, $p$, vastly exceeds the number of observations, $n$. This means that the space is not very well populated by the feature data and there is a strong focus on techniques for dimensionality reduction.

Unsupervised Learning Algorithms

There are two main areas of unsupervised learning that are of interest to us in quantitative finance: Dimensionality Reduction and Clustering.

Dimensionality Reduction

We have motivated the need for dimensionality reduction above. The most common mechanism in unsupervised learning for achieving this is (linear) Principal Components Analysis (PCA).

In machine learning and quantitative finance problems we often have a large set of correlated variables in a high dimensional space. PCA allows us to summarise these datasets using a reduced number of dimensions. It achieves this by carrying out an orthogonal coordinate transformation of the original space, forming a new set of linearly uncorrelated variables called principal components.

The principal components are found as the eigenvectors of the covariance matrix of the data. Each principal component is orthogonal to each other (by construction) and explains successively less of the variability of the dataset. Usually the first few principal components are able to account for a large fraction of the variability of the original set, leading to a much lower dimensional representation in this new space.

Another way to think of PCA is that it is a change of basis. The transformation produces a set of basis vectors, a subset of which are capable of spanning a linear subspace within the original space that closely follows the data grouping.

However, not all data is easily summarised by a linear subspace. In classification problems, for instance, there are many data sources which are not linearly separable. In this case it is possible to invoke the "kernel trick", as was discussed in the article on Support Vector Machines, to linearly separate a space in a much higher dimensional space and thus carry out PCA in the transformed space. This allows PCA to be applied to non-linear datasets.

In quantitative finance PCA is often used for factor analysis. An example would be looking at a large number of correlated stocks and attempting to reduce their dimensionality by looking at a smaller set of unobserved and uncorrelated latent variables. We will be considering principal components analysis and factor analysis in later articles.


Another important unsupervised learning technique is known as cluster analysis. Its goal is to assign a cluster label to elements of a feature space in order to partition them into groupings or clusters. In certain cases this can be accomplished unambiguously if subgroupings within the feature space are clearly distinct and easily separable. In other cases clusters may "overlap", making it challenging to form a distinction boundary.

The canonical algorithm for cluster analysis is K-Means Clustering. The basic idea with the procedure is to assign all $n$ elements of a feature space into $K$ separate and non-overlapping clusters.

To achieve this a simple iterative algorithm is used. All elements of the feature space are initially randomly assigned a cluster $k \in \{1, \ldots, K \}$. At this point the algorithm iterates and for each step of the iteration calculates the mean vector - the centroid - for each cluster, $k$, and then assigns each element to the cluster possessing the nearest centroid, using a Euclidean distance metric. The algorithm is iterated until the centroid locations remain fixed to within a certain pre-specified tolerance distance.

In quantitative finance clustering is commonly used to identify assets that have similar characteristics, which is useful in constructing diversified portfolios. It can also be utilised for detecting market regimes and thus potentially acting as a risk management tool. We will be studying clustering techniques for assets in future articles.

Bibliographic Note

An introduction to unsupervised learning, and its difficulties, can be found in [1]. It is accessible to those without a strong mathematical background or those coming from other areas of science. A significantly more advanced mathematical discussion, at the graduate level, can be found in [2]. The book discusses many unsupervised techniques although it is primarily about supervised methods. [3] discusses high-dimensionality and the problems it causes at a reasonable mathematical level, concentrating primarily on PCA and clustering, while [4] considers unsupervised learning through the probabilistic density estimation approach at a gentler mathematical level of rigour.