Building a Raspberry Pi Cluster for QSTrader using SLURM - Part 3

In this multi-part article series we are going to discuss how to build a distributed cluster of Raspberry Pi computers, utilising the SLURM work scheduling tool to run QSTrader systematic trading parameter variation backtests. This article will describe how to install and configure SLURM.

In the previous article of this series on creating a Raspberry Pi mini-cluster for HPC we installed Ubuntu 22.04 LTS on the cluster. We configured the Raspberry Pi cluster to utilise Network File System (NFS), which allows persistent storage to be shared and accessed by all Pis on the cluster. This allows files to be shared easily across all of the Pis, which both makes subsequent configuration more straightforward and lets the Pis persist the results of parallel jobs in a single cluster-wide location.

In this article we are going to install a tool called SLURM on the cluster. In its own words SLURM is "an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters". In essence it allows parallelised workloads, such as QSTrader backtest parameter sweeps, to be scheduled against the resources that actually exist on the cluster, queueing tasks whenever more work is submitted than can be processed at once.

This means that, for instance, if we have 12 cores available across the computational nodes of our cluster, but our parallel job requires 60 separate tasks, then the first 12 tasks will be allocated immediately, while the remaining 48 will be queued. As each running task completes, further tasks are pulled from the queue and scheduled on the cluster. This is beneficial because it allows multiple separate users to share the resources of the cluster. If you work in a small team of quant researchers, for instance, then everyone can share the cluster. It also means that work can be batched up as new tasks arise.
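
As a concrete illustration, the sketch below shows what such a parameter sweep might eventually look like as a SLURM job array, once QSTrader and Python have been installed later in this series. The backtest.py script and its --param-set flag are hypothetical placeholders; the #SBATCH directives themselves are standard SLURM.

#!/bin/bash
#SBATCH --job-name=qstrader-sweep
#SBATCH --array=0-59          # 60 independent tasks, one per parameter set
#SBATCH --ntasks=1            # each array task uses a single core
#SBATCH --cpus-per-task=1

# SLURM_ARRAY_TASK_ID takes a unique value in 0-59 for each task, which can be
# mapped onto one parameter set of the sweep (hypothetical script and flag).
python backtest.py --param-set ${SLURM_ARRAY_TASK_ID}

Submitting this script with sbatch places all 60 tasks in the queue; with 12 cores available SLURM runs 12 at a time and starts the next queued task whenever one finishes.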

SLURM is utilised within many modern HPC environments, including quantitative hedge funds that carry out backtesting research or derivatives pricing. Hence, if you are building the Raspberry Pi cluster in order to gain experience with high performance distributed computing, then becoming familiar with SLURM will be a useful career-related skill.

We are going to utilise SLURM for many of the computational quantitative finance tasks that follow, so we will now describe how to install and configure it on the Raspberry Pi cluster.

In order to configure SLURM we need to install a control daemon on the primary node, which will distribute workloads to the remaining secondary nodes. The primary node will only be utilised to manage workloads and will not carry out any parallel tasks itself. Instead it delegates this responsibility to the secondary, computational nodes.

The primary node will be where the user logs in to submit workloads, as well as to obtain diagnostics about the current cluster capacity. There is no need to log in to any of the secondary nodes in order to distribute work. This is handled by SLURM itself.
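
For reference, once SLURM is installed (as described below) the most common diagnostic commands, all run from the primary node, are:

sinfo
squeue
scontrol show nodes

Here sinfo summarises partition and node states, squeue lists queued and running jobs, and scontrol show nodes reports the resources and state of each individual node.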

Since each of the Raspberry Pi 4B computers has a four-core CPU and 4GB of RAM, the three computational Pis together provide a total of 12 cores (3x4) and 12GB (3x4GB) of distributed RAM for the cluster to schedule against.

The article will proceed by installing SLURM on the primary node. Then we will modify the default configuration of SLURM on the primary node. We will copy the configuration on the primary node to the remaining computational nodes (via the NFS share). We will then configure a tool called Munge, which SLURM uses to authenticate communication between the primary node and the computational nodes. We will then configure the computational nodes to communicate with the primary node. Finally we will check that parallel workloads can be executed from the primary/login node.

As with the previous articles in this series, this article closely follows the approach discussed in the excellent three part series written by Garrett Mills. We will also be making changes where necessary, particularly around optimising the setup for QSTrader. We highly recommend reading his articles for additional details.

Primary Node Configuration

The first task is to SSH into the primary node (as in the previous article) and edit the hosts file to ensure that all of the secondary nodes are listed. In the previous article we utilised the vim editor, but any text editor will suffice:

sudo vim /etc/hosts

Add the following if not already present:

IP_ADDR_OF_NODE02      node02
IP_ADDR_OF_NODE03      node03
IP_ADDR_OF_NODE04      node04

Remember to replace IP_ADDR_OF_NODE02 (and the subsequent entries) with the IP addresses of the various nodes determined in the previous article.

This will allow SLURM to know the location of the other Raspberry Pi nodes on the network.

The next step is to install SLURM on the primary node. Run the following command:

sudo apt install slurm-wlm -y

Now that SLURM has been installed it needs to be configured. The default configuration can be copied over from the /usr/share SLURM install directory. Run the following commands:

cd /etc/slurm-llnl
sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
sudo gzip -d slurm.conf.simple.gz
sudo mv slurm.conf.simple slurm.conf

This copies the default configuration into the SLURM configuration directory, unzips it and renames it to slurm.conf, the filename SLURM expects.

The next step is to make some modifications to this configuration specific to the use case of this cluster. Run the following (replacing vim with your favourite text editor) to begin editing the configuration file:

sudo vim /etc/slurm-llnl/slurm.conf

The first configuration change we need to make is to let SLURM know about the primary control node location on the network. Modify the following line to look like:

SlurmctldHost=rpicluster01(IP_ADDR_OF_NODE01)

Replace rpicluster01 with the name of your node chosen in the previous article and IP_ADDR_OF_NODE01 with the assigned IP address of the node.

We then need to configure how SLURM allocates its resources through customisation of the scheduler algorithm. As with the approach carried out in Garrett Mills' article, we will utilise the "consumable resources" method. This means that each of the Raspberry Pi computers will provide a consumable resource (such as CPU cores) and SLURM will allocate task workloads to these provided resources.

To set this it is necessary to modify the SelectType and SelectTypeParameters lines to the following:

SelectType=select/cons_res
SelectTypeParameters=CR_Core

It is also possible to modify the cluster name via the following line. Ensure that you replace rpicluster with your preferred cluster name:

ClusterName=rpicluster

The primary node also needs to know about each of the secondary, computational nodes. Find the example entry for the node specification, delete it and replace it with the following:

NodeName=rpicluster01 NodeAddr=IP_ADDR_OF_NODE01 CPUs=4 State=UNKNOWN
NodeName=rpicluster02 NodeAddr=IP_ADDR_OF_NODE02 CPUs=4 State=UNKNOWN
NodeName=rpicluster03 NodeAddr=IP_ADDR_OF_NODE03 CPUs=4 State=UNKNOWN
NodeName=rpicluster04 NodeAddr=IP_ADDR_OF_NODE04 CPUs=4 State=UNKNOWN

Once again, ensure that you replace rpicluster01 (and 02-04) with the appropriate node names, as well as IP_ADDR_OF_NODE01 (and 02-04) with the IP addresses of each node on the cluster.

The next task is to create a 'partition', which is a logical grouping of nodes that SLURM uses to run task workloads on. This is where the three computational/secondary nodes (02-04) are added. It is necessary to delete the example entry and replace it with the following:

PartitionName=mycluster Nodes=rpicluster[02-04] Default=YES MaxTime=INFINITE State=UP
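
Because Default=YES is set, jobs submitted without an explicit partition will land on this partition automatically. Once the cluster is up and running (see the testing section below) it can also be targeted explicitly via the --partition flag, for example:

srun --partition=mycluster --nodes=1 hostname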

It is also necessary to configure SLURM to support cgroups kernel isolation. This lets SLURM know what system resources are available for task workloads to utilise. To set this up, edit the cgroup.conf file:

sudo vim /etc/slurm-llnl/cgroup.conf

Add the following lines to this file, then save and close:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

It is also necessary to specify which devices SLURM has access to. This is achieved by editing the cgroup_allowed_devices_file.conf:

sudo vim /etc/slurm-llnl/cgroup_allowed_devices_file.conf

Add the following lines (note the inclusion of the NFS drive we previously set up):

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/sharedfs*

Note that, as Garrett mentions in his original article series, this is a permissive setup. You may wish to modify it for your own purposes depending upon where your cluster is deployed and who will be accessing it.

Now that the primary node is configured, this configuration needs to be shared amongst the computational nodes. Since we have already set up a network file share we can utilise it to distribute the configuration. Run the following commands from within /etc/slurm-llnl to copy the configuration files and the Munge key onto the shared drive, from where each node can retrieve them:

sudo cp slurm.conf cgroup.conf cgroup_allowed_devices_file.conf /sharedfs
sudo cp /etc/munge/munge.key /sharedfs
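
As a quick sanity check that the files actually landed on the shared drive, list its contents from any of the nodes:

ls -l /sharedfs

You should see slurm.conf, cgroup.conf, cgroup_allowed_devices_file.conf and munge.key in the listing.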

Then we will enable and start Munge. Run the following on the primary node:

sudo systemctl enable munge
sudo systemctl start munge

Then we will enable and start the SLURM daemon. Run the following on the primary node:

sudo systemctl enable slurmd
sudo systemctl start slurmd

Then we will enable and start the SLURM control daemon. Run the following on the primary node:

sudo systemctl enable slurmctld
sudo systemctl start slurmctld
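
To verify that all three services came up cleanly on the primary node, inspect their status; each should report active (running):

sudo systemctl status munge
sudo systemctl status slurmd
sudo systemctl status slurmctld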

It may be necessary here to reboot the primary node in order for the above to correctly take effect. If you have difficulty, then it is always worth trying a reboot.

Secondary Node Configuration

For each of the secondary nodes it is necessary to install the SLURM compute node daemon and client tools. Run the following on all secondary nodes (02-04):

sudo apt install slurmd slurm-client -y

As with the primary node it is necessary for each of the secondary nodes to know of the locations of all other nodes on the network. This can be achieved by editing the /etc/hosts files on each node:

sudo vim /etc/hosts

Add the following lines, ensuring that you exclude the current node. This example is for node 03:

IP_ADDR_OF_NODE01      node01
IP_ADDR_OF_NODE02      node02
IP_ADDR_OF_NODE04      node04

The next step is to copy the SLURM configuration (and Munge key) files that were previously copied to shared storage into each computational node's respective SLURM install location. Run the following on each of the secondary nodes:

sudo cp /sharedfs/munge.key /etc/munge/munge.key
sudo cp /sharedfs/slurm.conf /etc/slurm-llnl/slurm.conf
sudo cp /sharedfs/cgroup* /etc/slurm-llnl

Now we are going to enable and start Munge on each of the secondary nodes to ensure that the SLURM controller can connect to the secondary nodes:

sudo systemctl enable munge
sudo systemctl start munge

At this point it is necessary to reboot all of the nodes, both the primary node and all of the secondary nodes, so that the changes take full effect.

To check Munge, run the following command on each of the secondary nodes (02-04), ensuring that you replace rpicluster01 with the name of your primary node:

ssh ubuntu@rpicluster01 munge -n | unmunge

You will see output similar to the following:

ubuntu@rpicluster01's password: 
STATUS:           Success (0)
ENCODE_HOST:      rpicluster01
ENCODE_TIME:      2022-06-18 20:44:23 -0000 (...)
DECODE_TIME:      2022-06-18 20:44:23 -0000 (...)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              ..
GID:              ..
LENGTH:           0

Finally, enable and start the SLURM daemon on each of the computational nodes by running the following commands:

sudo systemctl enable slurmd
sudo systemctl start slurmd
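
If a secondary node later fails to register with the controller, checking the daemon status and logs on that node is usually the quickest diagnostic, with a mismatched slurm.conf or munge.key being the most common cause:

sudo systemctl status slurmd
sudo journalctl -u slurmd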

Testing SLURM on the Primary Node

To confirm that SLURM is working it is necessary to SSH into the primary/login node and run the following SLURM command:

sinfo

You will hopefully see output similar to the following:

PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mycluster*      up   infinite      3   idle node[02-04]
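
If a node instead reports a state of down or drain, first check that Munge and slurmd are running on that node, then return it to service manually. For example, for the node listed as node02 (substitute the name shown in your own NODELIST):

sudo scontrol update nodename=node02 state=resume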

To run our first parallel task we can ask each of the nodes to print its hostname to the terminal via the srun command:

srun --nodes=3 hostname

If successful, this will produce output similar to the following. Since the tasks can complete in any order, you will notice that running the command multiple times will likely change the order of the node hostname printouts:

node03
node02
node04
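
To confirm that SLURM is scheduling against all twelve cores rather than just the three nodes, you can also request one task per core. This should print twelve hostnames, four from each node, again in no particular order:

srun --ntasks=12 hostname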

This completes the installation and configuration of SLURM on the Raspberry Pi cluster! We are now in a position to install further libraries that allow parallel workloads for QSTrader backtesting or HPC derivatives pricing. These tasks will be the subject of subsequent articles.
