Training the Perceptron with Scikit-Learn and TensorFlow

In this article we demonstrate how to train a perceptron model using the perceptron learning rule. We then provide implementations in Scikit-Learn and TensorFlow with the Keras API.

In the previous article on the topic of artificial neural networks we introduced the concept of the perceptron. We demonstrated that the perceptron was capable of classifying input data via a linear decision boundary.

However we postponed a discussion on how to calculate the parameters that govern this linear decision boundary. Determining these parameters by means of 'training' the perceptron will be the topic of this article.

We will begin by describing the training procedure. We will note its similarity to a popular optimisation approach in deep learning known as stochastic gradient descent. Then we will provide some Python code that demonstrates the training mechanism, as implemented in the Scikit-Learn library. Finally we will examine the corresponding code in the TensorFlow library and see how it differs.

Training the Perceptron

Recall from the previous article that once suitable weights and bias values were available it was straightforward to classify new input data via the inner product of the weights and input components, followed by the step activation function.
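As a refresher, here is a minimal NumPy sketch of that classification step. The function and variable names are purely illustrative and do not correspond to either of the libraries used below; the convention of returning 1 for a strictly positive activation is also just one common choice:

import numpy as np


def predict(w, b, x):
    # Compute the inner product of weights and inputs, add the bias
    # and apply a step activation function (thresholded at zero)
    activation = np.dot(w, x) + b
    return 1 if activation > 0.0 else 0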

Thus far we have neglected to describe how the weights and bias values are found prior to carrying out any classification with the perceptron. This is where a training procedure known as the perceptron learning rule comes in.

The perceptron learning rule works by accounting for the prediction error generated when the perceptron attempts to classify a particular instance of labelled input data. In particular the rule adjusts the weights (connections) in the direction that reduces this error.

As each training instance is presented to the perceptron a prediction is made. If the classification is incorrect when compared with the correct 'ground truth' label, the weights that would have led to a correct prediction are reinforced[3].

In this manner the weights are iteratively shifted as more training samples are fed into the perceptron, until a suitable set of weights is found.

Mathematically this procedure is given by the following update algorithm:

\begin{eqnarray} w_i^{n+1} = w_i^n + \nu (y - \hat{y}) x_i \end{eqnarray}

Where:

  • $w_i^{n}$ is the $i$th weight at step $n$
  • $x_i$ is the $i$th component of the current training input data instance
  • $y$ is the correct 'ground truth' classification label for this input data
  • $\hat{y}$ is the predicted classification label for this input data
  • $\nu$ is the learning rate

Let's break this formula down into separate terms in order to derive some intuition as to how it works. It states that the new weight at step $n+1$, $w_i^{n+1}$, is given by the old weight at step $n$, $w_i^{n}$, plus an additional term $\nu (y - \hat{y}) x_i$.

Since this additional term includes the difference between the predicted outcome $\hat{y}$ and the ground truth $y$, the term grows in magnitude as this difference becomes more extreme. That is, the larger the difference, the further the weights are moved from their old values. This makes sense, since if the prediction is far away from the correct labelled value it will be necessary to move the weight further in order to improve subsequent prediction accuracy.

The other factor in this term is the learning rate $\nu$. This coefficient scales the movement of the weights, so that it can either be significantly reduced or substantially amplified. A small $\nu$ means that even for a large prediction difference, the weights will not shift very much. Correspondingly, a large $\nu$ will mean a significant move of the weights even for a small predictive difference.

The learning rate is an example of a hyperparameter for the model. Determining its optimal value is also necessary. However we will delay the discussion on hyperparameter optimisation until we discuss more complex neural network architectures.

Finally the term is also multiplied by $x_i$. That is, if the $i$th component of the input itself is large, then so is the weight shift, all other factors being equal.
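To make the update rule concrete, here is a minimal sketch of the perceptron learning rule applied over a full training set, written in plain NumPy. It is intended only to illustrate the mechanism; the function name, zero initial weights and the bias handling are assumptions made for this sketch rather than details of either library implementation below:

import numpy as np


def perceptron_train(X, y, nu=1.0, epochs=10):
    # X is an (n_samples, n_features) array, y a vector of 0/1 labels
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Predict with the step activation function
            y_hat = 1 if np.dot(w, x_i) + b > 0.0 else 0
            # Shift each weight by nu * (y - y_hat) * x_i
            # (the bias is updated in the same manner, with x_i = 1)
            update = nu * (y_i - y_hat)
            w += update * x_i
            b += update
    return w, b

Note that when the prediction is correct the bracketed term $(y - \hat{y})$ is zero and the weights are left unchanged.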

We will now demonstrate this perceptron training procedure in two separate Python libraries, namely Scikit-Learn and TensorFlow.

We recently published an article on how to install TensorFlow on Ubuntu against a GPU, which will help in running the TensorFlow code below.

Code Implementation

In this section we will utilise the National Institute of Diabetes and Digestive and Kidney Diseases diabetes dataset[4] to test the classification capability of the perceptron.

The dataset contains 768 records with eight diagnostic measurements and an outcome as to whether a patient has diabetes. In the dataset all patients are female, at least 21 years of age, and of Pima heritage.

The dataset CSV file can be obtained from the Kaggle site here. Note that this file will need to be placed in the same directory as the following snippet in order to load the data correctly.

We are not going to dwell on the specifics of the dataset here. Rather, we are going to utilise it purely as a means of explaining the training algorithm. If you wish to learn more about the diagnostic measurements and how the data was obtained please see [4] for more details.

Scikit-Learn

In the subsequent perc_diabetes_sklearn.py snippet we will utilise Pandas and Scikit-Learn to load the diabetes data and fit a perceptron binary classification model.

The first task is to call the Pandas read_csv method to load the dataset CSV file into a DataFrame, then access the values attribute to convert the DataFrame into a NumPy array, suitable for value extraction in Scikit-Learn.

The features matrix X is defined as the first eight columns of this matrix (it has shape (768, 8)). The outcome vector y is the final column, consisting of 0s for no diabetes and 1s for diabetes.

The perceptron model is then initialised with a particular random seed to ensure reproducible results. The model is then trained with the perceptron learning rule via the fit method.

Finally the mean accuracy score on the same in-sample data is output.

# perc_diabetes_sklearn.py

import pandas as pd
from sklearn.linear_model import Perceptron


if __name__ == "__main__":
    # Load the Pima diabetes dataset from CSV
    # and convert into a NumPy matrix suitable for
    # extraction into X, y format needed for Scikit-Learn
    diabetes = pd.read_csv('diabetes.csv').values

    # Extract the feature columns and outcome response
    # into appropriate variables
    X = diabetes[:, 0:8]
    y = diabetes[:, 8]

    # Create and fit a perceptron model (with reproducible
    # random seed)
    model = Perceptron(random_state=1)
    model.fit(X, y)

    # Output the (in sample) mean accuracy score
    # of the classification
    print("%0.3f" % model.score(X, y))

The output is as follows:

0.531

It can be seen that the classification score is approximately 53%.

This low performance is to be expected. We are essentially trying to ask a single linear threshold unit to fit a linear decision hyperplane through complex eight-dimensional data. Such data is unlikely to present a straightforward linear decision boundary between 'no diabetes' and 'diabetes'.

TensorFlow & Keras

We will now attempt to implement the perceptron with the Keras API using the TensorFlow library. The code is slightly more complex than the Scikit-Learn version. However the added complexity in the API will prove beneficial in subsequent articles when we come to model deep neural network architectures.

Hard Sigmoid Activation Function

Prior to demonstrating and explaining the corresponding TensorFlow/Keras code for training a single perceptron it is worth highlighting that it is difficult to fully reproduce the perceptron as described in the previous article. See [6] for a detailed discussion as to why this is so.

In essence this is due to the nature of the Keras API, which is designed primarily for deep neural network architectures with differentiable activation functions that produce non-zero gradients. The activation function utilised in the original perceptron is a step function, which is not continuous (and thus not differentiable) at zero. It also leads to zero gradients everywhere else.

Since Keras utilises stochastic gradient descent as the primary optimisation procedure, it is necessary to involve non-zero gradients if the weights are to be changed when training.

To avoid this problem it is possible to replace the step function activation function with a closely-related function called a hard sigmoid. The hard sigmoid is a piecewise linear approximation to the original sigmoid function (an "s-curve"), which is differentiable everywhere except at two points. It still possesses zero gradients for certain parts of the domain but admits non-zero gradients in the middle piecewise linear section.
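As a concrete illustration, a hard sigmoid is simply a linear ramp clipped to the range [0, 1]. The exact constants vary between Keras versions (for example a slope of 0.2 with cut-off points at ±2.5), so the following NumPy sketch is indicative only:

import numpy as np


def hard_sigmoid(x):
    # Piecewise linear approximation to the sigmoid: 0 for x <= -2.5,
    # 1 for x >= 2.5 and a linear ramp 0.2 * x + 0.5 in between
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)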

It turns out that this is sufficient to produce a 'perceptron like' implementation in Keras and TensorFlow.

TensorFlow Implementation

The intent with demonstrating the corresponding TensorFlow/Keras code in this post is to begin familiarising you with the API used for deep neural networks. Many of the parameters provided to the model creation require significantly more explanation than is possible within this post. Hence we will briefly describe each parameter, but will postpone more comprehensive explanations until we discuss deep neural network architectures in subsequent posts.

In the following snippet (perc_diabetes_tensorflow.py) we utilise the same Pima diabetes dataset as was used for Scikit-Learn. It is loaded from CSV in exactly the same manner, being placed into the feature matrix X and the outcome vector y.

The difference in the two implementations begins when we define the perceptron model using the Keras API.

We first create the model using a call to Sequential. This is used to group a linear stack of neural network layers into a single model.

# Create the 'Perceptron' using the Keras API
model = Sequential()

Since we only have a single 'layer' in the perceptron this call may appear to be superfluous. However by implementing it in this manner we are demonstrating a common feature of the Keras API and providing familiarity, which can be leveraged for future deep learning models in subsequent articles.

We then utilise the add method to add a layer of nodes to the sequential model. In particular we are adding a Dense layer, which means that every node in the layer is connected to each of its inputs. Dense layers are also termed fully connected layers. We will discuss dense neural network layers at length in the subsequent article on multi-layer perceptrons.

model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform'))

The first argument 1 in the call to Dense is the dimensionality of the output. Since we are attempting to determine whether a patient has diabetes or not, this only needs a single dimension. The input_shape=(8,) keyword argument determines the number of inputs. For the diabetes dataset this is eight, one for each of the feature columns in the CSV file.

We then specify the activation function for the layer as the hard sigmoid.

The kernel_initializer keyword argument is given the 'glorot_uniform' value. Since we are training the perceptron with stochastic gradient descent (rather than the perceptron learning rule) it is necessary to initialise the weights with non-zero random values rather than setting them all to zero initially. This aspect will be discussed in depth in subsequent articles.
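For reference, Glorot (or Xavier) uniform initialisation draws each weight from a uniform distribution whose limits depend upon the number of inputs and outputs of the layer. The following NumPy sketch illustrates the idea for our single layer with eight inputs and one output; it is a simplified stand-in rather than the exact Keras implementation:

import numpy as np


def glorot_uniform(fan_in, fan_out):
    # Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)
    # with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))


# Initial weights for the perceptron layer above: eight inputs, one output
w_init = glorot_uniform(8, 1)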

We then set the loss function to utilise binary cross-entropy (see our discussion on cross-entropy here for more details), which is the standard loss function for binary classification problems.
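For reference, for a single sample with ground truth label $y \in \{0, 1\}$ and predicted probability $\hat{y}$, the binary cross-entropy loss is given by the following expression, which is then averaged over the training samples:

\begin{eqnarray} L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \end{eqnarray}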

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The optimizer keyword argument is set to 'adam'. Adam is a particular variant of stochastic gradient descent. We will not explain how Adam works in this article but for the purposes of this code snippet it can be thought of as a more computationally efficient variant of stochastic gradient descent.

We then train the model using the Adam stochastic gradient descent algorithm. We utilise the concept of mini-batches, passing in 25 training samples at once. You can read more about mini-batches here.
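As a quick sanity check on the batch arithmetic: with a validation split of 20% (discussed below), roughly $0.8 \times 768 \approx 614$ samples are used for training, giving $\lceil 614 / 25 \rceil = 25$ mini-batches per epoch. This is why each epoch in the output below reports 25 steps.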

# Train the perceptron using stochastic gradient descent
# with a validation split of 20%
model.fit(X, y, epochs=225, batch_size=25, verbose=1, validation_split=0.2)

The epochs keyword argument determines how many times we iterate over the full training set. For this example we have 225 epochs. It is necessary to iterate over the dataset multiple times in order to mitigate the problem of the weights becoming stuck in a poor local minimum. Multiple epochs provide a better chance of attaining the global minimum, or at least an improved local minimum.

In this instance we utilise 20% of the training data as a 'validation' set, which is 'held out' (that is, not trained on) and used solely for evaluating the accuracy of the predictions. We did not do this for the Scikit-Learn implementation and instead checked the accuracy in sample.

Lastly as with the Scikit-Learn implementation we output the final prediction accuracy. Here is the full snippet (slightly modified from versions presented at [5] and [6]):

# perc_diabetes_tensorflow.py

import pandas as pd
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.activations import hard_sigmoid


if __name__ == "__main__":
    # Load the Pima diabetes dataset from CSV
    # and convert into a NumPy matrix suitable for
    # extraction into X, y format needed for TensorFlow
    diabetes = pd.read_csv('diabetes.csv').values

    # Extract the feature columns and outcome response
    # into appropriate variables
    X = diabetes[:, 0:8]
    y = diabetes[:, 8]

    # Create the 'Perceptron' using the Keras API
    model = Sequential()
    model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Train the perceptron using stochastic gradient descent
    # with a validation split of 20%
    model.fit(X, y, epochs=225, batch_size=25, verbose=1, validation_split=0.2)

    # Evaluate the model accuracy
    _, accuracy = model.evaluate(X, y)
    print("%0.3f" % accuracy)

The (truncated) output will be similar to the following:

..
..
Epoch 214/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 215/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 216/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 217/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 218/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 219/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 220/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 221/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 222/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 223/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 224/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
Epoch 225/225
25/25 [==============================] - 0s 2ms/step - loss: 5.3510 - accuracy: 0.6531 - val_loss: 5.5089 - val_accuracy: 0.6429
24/24 [==============================] - 0s 793us/step - loss: 5.3827 - accuracy: 0.6510
0.651

It can be seen that the final classification score is approximately 65%.

We should view this figure with caution however. We have not fully implemented the perceptron in the same manner as was done with Scikit-Learn. Nor have we evaluated the accuracy in the same way due to the usage of a validation set.

In summary, the Scikit-Learn implementation carried out the perceptron learning rule with a step activation function. The TensorFlow/Keras implementation carried out stochastic gradient descent with a (mostly) differentiable hard sigmoid activation function. Hence the classification accuracy results will differ.

Despite these differences the intent of the above code has been to provide some insight into the separate APIs of each library. We will be utilising TensorFlow and the Keras API extensively in subsequent articles.

Next Steps

We have now implemented and trained our first neural network model in TensorFlow with the Keras API. However such a simplistic model is unlikely to produce effective prediction accuracy on more complex data, particularly that utilised within quantitative finance.

In the next article we are going to introduce the multi-layer perceptron as a first step in adding more complexity and hence potential predictive accuracy.

References