Matrix Algebra - Linear Algebra for Deep Learning (Part 2)

Matrix Algebra - Linear Algebra for Deep Learning (Part 2)

Last week I posted an article, which formed the first part in a series on Linear Algebra For Deep Learning. The response to the article was extremely positive, both in terms of feedback, article views and also more broadly on social media.

Many of you commented that there was "an appetite" for introductory mathematical content and this only confirms the results of the QuantStart 2017 Content Survey. Hence I've decided to write more introductory articles, not only continuing with Linear Algebra, but also on the topics of Calculus and Probability, which are fundamental topics for machine learning—and quantitative finance more broadly.

In the previous article we introduced the three basic entities that will be used in linear algebra, namely the scalar, vector and the matrix. We saw that they were all really specific versions of a more general entity known as a tensor.

In this article we are going to look at how to form operations between these entities. Such operations include addition and multiplication. While you will be very familiar with scalar addition and multiplication, the rules differ somewhat when dealing with more general tensor entities. This article will precisely define those operations and provide numeric examples to give you some intuition.

At this stage it is not likely to be clear why these operations will be useful in the context of deep learning. However, in the previous article I stated that linear algebra was the 'language in which machine learning was written'. If we can understand the basics of the language, we'll be in a much better position to grasp the more complex ideas that form the backbone of neural network models in later articles.

We will begin by looking at matrix addition and then consider matrix multiplication. These operations will eventually allow us to discuss a topic known as matrix inversion, which will form the basis of the next article.

What does it mean to add two matrices together? In this section we will explore such an operation and hopefully see that it is actually quite intuitive.

Matrices can be added to scalars, vectors and other matrices. Each of these operations has a precise definition. These techniques are used frequently in machine learning and deep learning so it is worth familiarising yourself with them.

Given two matrices of size $m \times n$, $\boldsymbol{A}=[a_{ij}]$ and $\boldsymbol{B}=[b_{ij}]$, it is possible to define the matrix $\boldsymbol{C}=[c_{ij}]$ as the matrix sum $\boldsymbol{C} = \boldsymbol{A} + \boldsymbol{B}$ where $c_{ij} = a_{ij} + b_{ij}$.

That is, $\boldsymbol{C}$ is constructed by element-wise summing the respective elements of $\boldsymbol{A}$ and $\boldsymbol{B}$. This operation is only defined where the two matrices have equal size, except in the case of broadcasting below. The definition implies that $\boldsymbol{C}$ also has size $m \times n$.

Matrix addition is commutative. This means that it doesn't matter which way around the matrices are added:

\begin{equation} \boldsymbol{A} + \boldsymbol{B} = \boldsymbol{B} + \boldsymbol{A} \end{equation}

It is also associative. This means that you get the same result if you add two matrices together first, and then another, as if you add another two together first and then the other:

\begin{equation} \boldsymbol{A} + (\boldsymbol{B} + \boldsymbol{C}) = (\boldsymbol{A} + \boldsymbol{B}) + \boldsymbol{C} \end{equation}

Both of these results follow from the fact that normal scalar addition is itself commutative and associative, because we're just adding the elements together.

I'm stressing commutivity and associativity for matrix addition because matrix multiplication (defined below) is certainly not commutative. We'll see why later.

#### Example

Consider two matrices $\boldsymbol{A}$ and $\boldsymbol{B}$. We can create a new matrix $\boldsymbol{C}$ through addition:

$\boldsymbol{A} + \boldsymbol{B} = \left[ \begin{array}{ccc} 1 & 4 & 17 \\ 18 & 3 & 2 \\ 5 & 19 & 1 \\ \end{array} \right] + \left[ \begin{array}{ccc} 12 & 18 & 6 \\ 4 & 3 & 33 \\ 23 & 5 & 8 \\ \end{array} \right] = \left[ \begin{array}{ccc} 13 & 22 & 23 \\ 22 & 6 & 35 \\ 28 & 24 & 9 \\ \end{array} \right] = \boldsymbol{C}$

It is clear to see that the elements of $\boldsymbol{C}$ are simply the elements added in the corresponding positions from $\boldsymbol{A}$ and $\boldsymbol{B}$.

It is possible to add a scalar value $x$ to a matrix $\boldsymbol{A}=[a_{ij}]$ to produce a new matrix $\boldsymbol{B}=[b_{ij}]$ where $b_{ij} = x + a_{ij}$. This simply means that we're adding the same scalar value to every element of the matrix. It is written as $\boldsymbol{B} = x + \boldsymbol{A}$.

Scalar-matrix addition is once again commutative and associative, because normal scalar addition is both commutative and associative.

For certain applications in machine learning it is possible to define a shorthand notation known as broadcasting. Consider $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ an $m \times n$-dimensional real-valued matrix and $\boldsymbol{x} \in \mathbb{R}^m$ an $m$-dimensional vector.

It is possible to define $\boldsymbol{B} = \boldsymbol{A} + \boldsymbol{x}$, despite the fact that the matrices $\boldsymbol{A}$ and $\boldsymbol{x}$ are unequal in size. The definition of the sum takes $b_{ij} = a_{ij} + x_j$.

That is, the elements of $\boldsymbol{x}$ are copied into each row of the matrix $\boldsymbol{A}$. Since the value of $\boldsymbol{x}$ is `broadcast' into each row the process is known as broadcasting.

## Matrix Multiplication

The rules for matrix addition are relatively simple and intuitive. However when it comes to multiplication of matrices the rules become more complex.

### Matrix Transpose

In order to define certain matrix-matrix multiplication operations such as the dot-product (discussed below) it is necessary to define the transpose of a matrix. The transpose of a matrix $\boldsymbol{A}=[a_{ij}]_{m \times n}$ of size $m \times n$ is denoted by $\boldsymbol{A}^{T}$ of size $n \times m$ and is given element-wise by the following expression:

\begin{equation} \boldsymbol{A}^{T} = [a_{ji}]_{n \times m} \end{equation}

That is, the indices $i$ and $j$ are swapped. This has the effect of reflecting the matrix along the line of diagonal elements $a_{ii}$. The operation is defined for non-square matrices, as well as vectors and scalars (which are simply $1 \times 1$ matrices). Note that a scalar is its own transpose: $x = x^T$. In addition the transpose of the transpose of a matrix is simply itself: $\boldsymbol{A}^{TT} = \boldsymbol{A}$.

#### Examples

$\boldsymbol{A} = \left[ \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{array} \right], \quad \boldsymbol{A}^T = \left[ \begin{array}{cc} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{array} \right]$ $\boldsymbol{x} = \left[ \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \end{array} \right], \quad \boldsymbol{x}^T = \left[ \begin{array}{ccc} x_{1} & x_{2} & x_3 \end{array} \right]$

Note here that $\boldsymbol{A}^{T}$ does not represent $\boldsymbol{A}$ multiplied by itself $T$ times. This is an entirely different operation. However, it will usually be clear from the context whether we mean the transpose of a matrix or repeated multiplication by itself.

### Matrix-Matrix Multiplication

We are now going to consider matrix-matrix multiplication. This is a more complex operation than matrix addition because it does not simply involve multiplying the matrices element-wise. Instead a more complex procedure is utilised, for each element, involving an entire row of one matrix and an entire column of the other.

The operation is only defined for matrices of specific sizes. The first matrix must have as many columns as the second matrix has rows, otherwise the operation is not defined.

The definition below can be a bit tricky to understand initially, so have a look at it first and then try working through the examples to see how specific numeric instances match up to the general formula.

Given a matrix $\boldsymbol{A}=[a_{ij}]_{m \times n}$ and a matrix $\boldsymbol{B}=[b_{ij}]_{n \times p}$ the matrix product $\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}=[c_{ij}]_{m \times p}$ is defined element-wise by:

\begin{equation} c_{ij} = \sum^n_{k=1} a_{ik} b_{kj} \end{equation}

That is the elements $c_{ij}$ of the matrix $\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}$ are given by summing the products of the elements of the $i$-th row of $\boldsymbol{A}$ with the elements of the $j$-th column of $\boldsymbol{B}$.

Note that matrix-matrix multiplication is not commutative. That is:

\begin{eqnarray} \boldsymbol{A}\boldsymbol{B} \neq \boldsymbol{B}\boldsymbol{A} \end{eqnarray}

#### Examples

Given the following two matrices:

$\boldsymbol{A} = \left[ \begin{array}{ccc} 2 & 5 & 1 \\ 7 & 3 & 6 \end{array} \right], \quad \boldsymbol{B} = \left[ \begin{array}{cc} 1 & 8 \\ 9 & 4 \\ 3 & 5 \end{array} \right]$

It is possible to construct the product $\boldsymbol{A}\boldsymbol{B}$ of size $2 \times 2$:

$\boldsymbol{AB} = \left[ \begin{array}{cc} 2 \cdot 1 + 5 \cdot 9 + 1 \cdot 3 & 2 \cdot 8 + 5 \cdot 4 + 1 \cdot 5 \\ 7 \cdot 1 + 3 \cdot 9 + 6 \cdot 3 & 7 \cdot 8 + 3 \cdot 4 + 6 \cdot 5 \end{array} \right] = \left[ \begin{array}{cc} 50 & 41 \\ 52 & 98 \end{array} \right]$

It is also possible to construct the product $\boldsymbol{B}\boldsymbol{A}$ of size $3 \times 3$:

$\boldsymbol{BA} = \left[ \begin{array}{ccc} 1 \cdot 2 + 8 \cdot 7 & 1 \cdot 5 + 8 \cdot 3 & 1 \cdot 1 + 8 \cdot 6 \\ 9 \cdot 2 + 4 \cdot 7 & 9 \cdot 5 + 4 \cdot 3 & 9 \cdot 1 + 4 \cdot 6 \\ 3 \cdot 2 + 5 \cdot 7 & 3 \cdot 5 + 5 \cdot 3 & 3 \cdot 1 + 5 \cdot 6 \\ \end{array} \right] = \left[ \begin{array}{ccc} 58 & 29 & 49 \\ 46 & 57 & 33 \\ 41 & 30 & 33 \end{array} \right]$

The above definition does not initially seem to be motivated in a simple way. Why is matrix-matrix multiplication defined like this? It has to do with a deeper result involving matrices representing linear map functions between two vector spaces. We need not worry about these certain deeper aspects of linear maps as they are beyond the scope of this article series.

However, we can briefly provide some intuition through the idea of composing functions together. It is common in mathematics to compose two functions together to produce a new function. That is the function $h$ can be defined as $h = f \circ g$, with the $\circ$ symbol representing function composition. This notation means that $h$ is equivalent to $g$ being carried out first, and then $f$.

If, for example $f=sin(x)$ and $g=x^2$ then $h = sin(x^2)$. Function composition is not commutative and as such $f \circ g \neq g \circ f$. Instead $g \circ f = sin^2 (x)$, which is a completely different function. This is why matrix-matrix multiplication is not commutative and also why it is defined as above.

Note also that this definition does not imply that the elements of the matrix $\boldsymbol{C}$ are defined as the element-wise multiplication of those from $\boldsymbol{A}$ and $\boldsymbol{B}$. That is, the elements in each location cannot simply be multiplied together to get the new product matrix. That is an entirely different operation known as the Hadamard Product and will be discussed below.

Since a column vector is in fact a $n \times 1$ matrix it is possible to carry out matrix-vector multiplication. If the left-multiplying matrix has size $m \times n$ then this is a valid operation that will produce another $m \times 1$ matrix (column vector) as output.

Matrix-matrix and matrix-vector multiplication are extremely common operation in the physical sciences, computational graphics and machine learning fields. As such highly optimised software libraries such as BLAS and LAPACK have been developed to allow efficient scientific computation--particularly on GPUs.

### Scalar-Matrix Multiplication

Scalar-matrix multiplication is simpler than matrix-matrix multiplication. Given a matrix $\boldsymbol{A}=[a_{ij}]_{m \times n}$ and a scalar $\lambda \in \mathbb{R}$ the scalar-matrix product $\lambda \boldsymbol{A}$ is calculated by multiplying every element of $\boldsymbol{A}$ by $\lambda$ such that $\lambda \boldsymbol{A} = [\lambda a_{ij}]_{m \times n}$.

If we take two real-valued scalars $\lambda, \mu \in \mathbb{R}$ the subsequent useful relationships follow from the definition above:

\begin{eqnarray} \lambda (\boldsymbol{A} + \boldsymbol{B}) &=& \lambda \boldsymbol{A} + \lambda \boldsymbol{B} \\ (\lambda + \mu) \boldsymbol{A} &=& \lambda \boldsymbol{A} + \mu \boldsymbol{A} \\ \lambda (\mu \boldsymbol{A}) &=& (\lambda \mu) \boldsymbol{A} \end{eqnarray}

The first result states that you will get the same answer if you add two matrices together, and then multiply them by a scalar, as if you individually multiplied each matrix by the scalar and then added them together.

The second result states that if you add two scalars together and then multiply the result by a matrix it gives the same answer as if you individually multiplied the matrix separately by each scalar and added the result.

The third result states that the order of scalar multiplication does not matter. If you multiply the matrix by one scalar, and then multiply the result by another scalar it gives the same answer as if you first multiplied both scalars together and then by the matrix.

All of these results follow from the simple rules of scalar multiplication and addition.

It is possible to define element-wise multiplication of matrices, which differs from the definition of matrix-matrix multiplication above. The Hadamard product of two matrices $\boldsymbol{A}=[a_{ij}]_{m \times n}$ and $\boldsymbol{B}=[b_{ij}]_{m \times n}$ is denoted by $\boldsymbol{A} \odot \boldsymbol{B}$ and calculated by the following expression:

\begin{equation} \boldsymbol{A} \odot \boldsymbol{B} = [a_{ij} b_{ij}]_{m \times n} \end{equation}

That is, the elements of the new matrix are simply given as the scalar multiples of each of the elements from the other matrices. Note that since scalar multiplication is commutative so is the Hadamard Product, unlike normal matrix-matrix multiplication.

When might it be necessary to use the Hadamard product? In fact such a process has wide applications, including correcting codes in satellite transmissions and cryptography, signal processing as well as lossy compression algorithms for images in JPEG format.

### Dot Product

An important special case of matrix-matrix multiplication occurs between two vectors and is known as the dot product. It has a deep relationship with geometry and is useful in all areas of the physical and computational sciences.

We have to be extremely careful here in regards to notation. I am being particularly precise about this definition, which may be unusual to some of you who have it seen it written before. In particular the dot product is really only defined as a true matrix-matrix multiplication, where one of the vectors is a row "matrix" and the other a column "matrix". However, you will often see a slight "notational abuse" where it is defined for any two vectors (row or column).

The usual notation for a dot product between two $n$-dimensional vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ is $\boldsymbol{x} \cdot \boldsymbol{y}$, which is where the name of the operation comes from. However in more advanced textbooks (particularly the popular graduate level statistics, machine learning and deep learning texts) you will see it written as $\boldsymbol{x}^T \boldsymbol{y}$, where $T$ represents the transpose of $\boldsymbol{x}$.

This is because $\boldsymbol{x}$ and $\boldsymbol{y}$ are usually considered to both be column vectors. A matrix-matrix multiplication between two column vectors is not defined. Hence, one of the vectors needs to be transposed into a row vector such that the matrix-matrix multiplication is properly defined. Hence you will see the $\boldsymbol{x}^T \boldsymbol{y}$ notation frequently in more advanced textbooks and papers. Now we can proceed with the definition!

Given two column vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ it is possible to define the dot product (or scalar product) between them by taking the transpose of one to form a product that is defined within the rules of matrix-matrix multiplication. Such a product produces a scalar value and is commutative:

\begin{equation} \boldsymbol{x}^T \boldsymbol{y} = \sum^n_{i=1} x_i y_i = \boldsymbol{y}^T \boldsymbol{x} \end{equation}

The dot product has an important geometric meaning. It is the product of the Euclidean lengths of the two vectors and the cosine of the angle between them. In subsequent articles the concept of norms will be introduced, at which point this definition will be formalised.

Another way to think about a dot product is that if we take the dot product of a vector with itself it is the square of the length of the vector:

\begin{equation} \boldsymbol{x}^T \boldsymbol{x} = \sum^n_{i=1} x_i x_i = \sum^n_{i=1} x_i^2 \end{equation}

Hence to find the (Euclidean) length of a vector you can simply take the square root of the dot product of the vector, $\sqrt{\boldsymbol{x}^T \boldsymbol{x}}$.

The dot product is a special case of a more general mathematical entity known as an inner product. In more abstract vector spaces the inner product allows intuitive concepts such as length and angle of a vector to be made rigourous. However such spaces are beyond the scope of this article series and will not be discussed further.

## Next Steps

This article has all been about operations applied to one or more matrices. We can now add and multiply matrices together. But what does this give us? How can we use it?

In the next article we are going to look at one of the most fundamental topics in linear algebra—inverting a matrix. Matrix inversion allows us to solve matrix equations, in exactly the same way that scalar algebra allows us to solve scalar equations.

This is a widespread operation in the physical and computational sciences and will be indispensible in our studies of deep learning.