dev_array: A Useful Array Class for CUDA

In previous articles I explained how to get started with CUDA, using a simple vector addition example. This time we will code an array class that wraps the allocation, copying and freeing of device memory, the purpose of which is to speed up the coding process.

This is a very helpful class for programmers of all levels, from beginners to advanced. Also, I will make extensive use of this class in future articles, so it is worth familiarising yourself with it if you wish to follow the CUDA series.

Now I will demonstrate the different blocks that constitute the class.

Code Explanation

As usual, let's start with the include guard and the library includes, followed by the template declaration of the class:

#ifndef _DEV_ARRAY_H_
#define _DEV_ARRAY_H_

#include <stdexcept>
#include <algorithm>
#include <cuda_runtime.h>

template <class T>
class dev_array

Public Functions

This first part includes all the public functions, which are the ones that you can call from your own code (e.g. main.cpp).

We start with the constructors and the destructor. The default constructor creates an empty array by setting both pointers to zero. Notice that the sizing constructor and the destructor only call one function each, allocate() and free(), both of which are private. Private functions are only “visible” to the class itself, so code outside the class cannot allocate or release the device memory directly and the memory handling stays in one place. Also, it’s a good habit for a programmer, especially in long programs, to separate the public and the private part, as it makes the code easier to read and therefore to correct.

{
public:
    explicit dev_array()
        : start_(0),
          end_(0)
    {}

    // constructor
    explicit dev_array(size_t size)
    {
        allocate(size);
    }
    // destructor
    ~dev_array()
    {
        free();
    }

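Now you can declare a device array from your main block by giving the element type and the number of elements (float is used here only as an example type):

  dev_array<float> array_name(size);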
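Pairing allocation in the constructor with deallocation in the destructor ties the memory to the lifetime of the object. The following is a minimal host-only sketch of that idea, not the class itself: host_array and live_allocations are hypothetical names, and std::malloc and std::free stand in for cudaMalloc and cudaFree so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Count of live allocations, used only to observe the lifetime in this sketch.
static int live_allocations = 0;

// Hypothetical host-only stand-in for dev_array: std::malloc and std::free
// replace cudaMalloc and cudaFree.
template <class T>
class host_array
{
public:
    explicit host_array(std::size_t size) : start_(0), end_(0) { allocate(size); }
    ~host_array() { release(); }

    std::size_t getSize() const { return static_cast<std::size_t>(end_ - start_); }

private:
    void allocate(std::size_t size)
    {
        start_ = static_cast<T*>(std::malloc(size * sizeof(T)));
        if (start_ == 0)
        {
            end_ = 0;
            throw std::runtime_error("failed to allocate memory");
        }
        end_ = start_ + size;
        ++live_allocations;
    }

    void release()
    {
        if (start_ != 0)
        {
            std::free(start_);
            start_ = end_ = 0;
            --live_allocations;
        }
    }

    T* start_;
    T* end_;
};

// Returns the number of live allocations seen while the array is in scope.
int allocations_inside_scope()
{
    host_array<float> values(8);     // constructor allocates
    assert(values.getSize() == 8);
    return live_allocations;         // the array is still alive here
}                                    // destructor releases the memory here
```

Calling allocations_inside_scope() reports one live allocation, and once the function has returned the counter is back at zero without any explicit call to free the memory.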

resize()

    // resize the vector
    void resize(size_t size)
    {
        free();
        allocate(size);
    }

This function lets you resize an array by passing the new size as follows:

  array_name.resize(new_size);

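The call first invokes free() to deallocate the current memory on the device, and then allocate() to reserve room for the new number of elements in the device memory. Note that the old contents are not copied over, so the array holds uninitialised data after a resize.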
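Because free() runs before allocate(), a resize whose allocation fails leaves the array empty rather than holding its old data. The sketch below models that behaviour on the host under stated assumptions: resizable_buffer and max_elements are hypothetical, std::malloc and std::free replace the CUDA calls, and requests above the artificial limit are made to fail so that the error path can be exercised.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Hypothetical host-only model of resize(); not the dev_array class itself.
class resizable_buffer
{
public:
    resizable_buffer() : start_(0), end_(0) {}
    ~resizable_buffer() { release(); }

    void resize(std::size_t size)
    {
        release();       // the old contents are gone from this point on
        allocate(size);  // may throw
    }

    std::size_t getSize() const { return static_cast<std::size_t>(end_ - start_); }

private:
    static const std::size_t max_elements = 1024;  // artificial limit for the demo

    void allocate(std::size_t size)
    {
        float* p = 0;
        if (size <= max_elements)
        {
            p = static_cast<float*>(std::malloc(size * sizeof(float)));
        }
        if (p == 0)
        {
            start_ = end_ = 0;
            throw std::runtime_error("failed to allocate memory");
        }
        start_ = p;
        end_ = p + size;
    }

    void release()
    {
        if (start_ != 0)
        {
            std::free(start_);
            start_ = end_ = 0;
        }
    }

    float* start_;
    float* end_;
};

// Returns the size observed after a resize that fails.
std::size_t size_after_failed_resize()
{
    resizable_buffer buffer;
    buffer.resize(16);               // succeeds
    assert(buffer.getSize() == 16);
    try
    {
        buffer.resize(5000);         // exceeds the artificial limit and throws
    }
    catch (const std::runtime_error&)
    {
        // The 16 old elements were already released before the failure.
    }
    return buffer.getSize();         // the buffer is empty but still usable
}
```

In this model the size reported after the failed call is zero, so code that catches the exception should not assume the previous contents are still available.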

getSize()

    // get the size of the array
    size_t getSize() const
    {
        return end_ - start_;
    }

This function is quite simple: it subtracts the pointer to the first element from the pointer one past the last element, which gives back the number of elements in the array (not the number of bytes). Once again, you call it as a normal member function:

  array_name.getSize();
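The subtraction in getSize() relies on a property of pointer arithmetic: the difference between two T* values is a count of T objects, not of bytes. A small sketch of that property, using a host buffer in place of device memory; element_count and demo_element_count are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements between two pointers into the same array.
// Subtracting T* values yields a count of T objects, not bytes.
template <class T>
std::size_t element_count(const T* first, const T* one_past_last)
{
    assert(first <= one_past_last);
    return static_cast<std::size_t>(one_past_last - first);
}

// A host buffer of 4 floats stands in for the device allocation.
std::size_t demo_element_count()
{
    float buffer[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const float* start = buffer;
    const float* end = buffer + 4;   // one past the last element, like end_
    return element_count(start, end);
}
```

The buffer occupies 4 * sizeof(float) bytes, yet the pointer difference is 4, which is why getSize() can be compared directly with an element count.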

getData()

    // get data
    const T* getData() const
    {
        return start_;
    }

    T* getData()
    {
        return start_;
    }

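There is slightly more to say here than for the previous functions, but still nothing to worry about. Calling

  array_name.getData();

returns a raw pointer to the first element of the memory allocated on the device. It does not copy or return the contents: the pointer holds a device address, so it must not be dereferenced in host code. Its typical use is as an argument to a kernel launch or to a CUDA runtime call. The const overload is the one selected for const arrays, and it returns a pointer through which the data cannot be modified.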
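getData() is declared twice, once const and once non-const, and the compiler picks the overload from the constness of the object it is called on. The sketch below shows that selection with a hypothetical pointer_holder class that has the same pair of accessors; it wraps an ordinary host pointer and assumes a C++11 compiler for static_assert and decltype.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in with the same pair of accessors as dev_array.
template <class T>
class pointer_holder
{
public:
    explicit pointer_holder(T* p) : start_(p) {}

    const T* getData() const { return start_; }  // chosen for const objects
    T* getData() { return start_; }              // chosen for non-const objects

private:
    T* start_;
};

// Compile-time checks of which overload is selected.
static_assert(std::is_same<decltype(std::declval<pointer_holder<float>&>().getData()),
                           float*>::value,
              "a non-const object yields T*");
static_assert(std::is_same<decltype(std::declval<const pointer_holder<float>&>().getData()),
                           const float*>::value,
              "a const object yields const T*");

// Both overloads return the same address; only the constness differs.
bool same_address()
{
    float value = 1.5f;
    pointer_holder<float> holder(&value);
    const pointer_holder<float>& view = holder;
    assert(holder.getData() == &value);
    return holder.getData() == view.getData();
}
```

A function that only reads from an array can therefore take it by const reference and still obtain the pointer, while being unable to write through it.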

set() and get()

    // copy memory from host to device
    void set(const T* src, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(start_, src, min * sizeof(T), cudaMemcpyHostToDevice);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to device memory");
        }
    }
    // copy memory from device to host
    void get(T* dest, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(dest, start_, min * sizeof(T), cudaMemcpyDeviceToHost);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to host memory");
        }
    }

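These functions are of particular interest. First, each of them wraps a complete host-to-device or device-to-host copy in a single call, saving you time and preventing possible coding errors. The number of elements copied is the smaller of the size argument and the size of the array, so a copy never runs past the end of the device allocation. Second, they check the status returned by cudaMemcpy and throw a std::runtime_error if the copy fails.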
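Both set() and get() clamp the element count with std::min before copying. A host-only sketch of that clamp follows, with std::memcpy standing in for cudaMemcpy and a std::vector standing in for the device allocation. clamped_copy and demo_clamped_copy are hypothetical names, and unlike set(), which returns void, the helper returns the number of elements it copied so that the effect is visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only model of set(): copies at most dst.size() elements from src.
template <class T>
std::size_t clamped_copy(std::vector<T>& dst, const T* src, std::size_t size)
{
    std::size_t count = std::min(size, dst.size());  // same clamp as in set()/get()
    assert(src != 0 || count == 0);
    if (count > 0)
    {
        std::memcpy(dst.data(), src, count * sizeof(T));
    }
    return count;  // number of elements actually copied
}

// Copy 5 source elements into a 3-element destination.
std::size_t demo_clamped_copy(std::vector<int>& dst)
{
    const int src[5] = {10, 20, 30, 40, 50};
    dst.assign(3, 0);
    return clamped_copy(dst, src, 5);
}
```

Only the first three source values reach the destination here. With the class itself the truncation is silent, so it is worth checking getSize() against the host buffer length before calling set() or get().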

Private Functions

The following functions can only be invoked by the class itself which, as mentioned above, keeps the memory handling in one place. As you can see, they are the "core" functions, in the sense that they perform the basic operations and are called by the higher-level public functions.

// private functions
private:

Allocate and Free Memory

These functions are quite simple. They use the CUDA calls we saw in the previous article, cudaMalloc and cudaFree, for allocating and freeing memory on the device for our arrays. Also, as in the get and set functions, the result of the allocation is checked and an exception is thrown on failure, providing additional control. This is especially useful in longer programs, where an unchecked failed allocation would otherwise only surface later as a crash or as wrong results that are hard to trace back to the memory operation.

    // allocate memory on the device
    void allocate(size_t size)
    {
        cudaError_t result = cudaMalloc((void**)&start_, size * sizeof(T));
        if (result != cudaSuccess)
        {
            start_ = end_ = 0;
            throw std::runtime_error("failed to allocate device memory");
        }
        end_ = start_ + size;
    }

    // free memory on the device
    void free()
    {
        if (start_ != 0)
        {
            cudaFree(start_);
            start_ = end_ = 0;
        }
    }

    T* start_;
    T* end_;
};

#endif
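One caveat about the class as listed: it does not declare a copy constructor or a copy assignment operator, so the compiler generates them. Copying an array would leave two objects holding the same device pointer, and each destructor would then call cudaFree on it. A minimal sketch of one possible remedy, assuming a C++11 compiler, is to delete the copy operations. non_copyable_array is a hypothetical skeleton that shows only the ownership rules and leaves allocation out.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical skeleton with the same two pointer members as dev_array.
template <class T>
class non_copyable_array
{
public:
    non_copyable_array() : start_(nullptr), end_(nullptr) {}

    // Deleting these turns an accidental copy into a compile error
    // instead of a double free at run time.
    non_copyable_array(const non_copyable_array&) = delete;
    non_copyable_array& operator=(const non_copyable_array&) = delete;

    std::size_t getSize() const { return static_cast<std::size_t>(end_ - start_); }

private:
    T* start_;
    T* end_;
};

// Reports whether both copy construction and copy assignment are rejected.
bool copying_is_disabled()
{
    non_copyable_array<float> empty;
    assert(empty.getSize() == 0);
    return !std::is_copy_constructible<non_copyable_array<float> >::value
        && !std::is_copy_assignable<non_copyable_array<float> >::value;
}
```

With the class as written, the practical rule is to pass arrays to functions by reference or by pointer and never by value.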

In this article we have seen how to implement a limited yet powerful class for handling arrays in a CUDA program. Now you can use it simply by copying the following code into a source file with the extension ".h" (e.g. dev_array.h) and including it in your main block like any other header, typing #include "dev_array.h" in the declaration part. The header already includes cuda_runtime.h itself.

In the next article, besides using this class, I'll show you how to code a Monte Carlo simulation in CUDA, with a practical example about option pricing.

I also want to thank Dr. Massimo Guarnieri, who wrote this class and let me use it for writing this article.

Here is the full code for completeness:

#ifndef _DEV_ARRAY_H_
#define _DEV_ARRAY_H_

#include <stdexcept>
#include <algorithm>
#include <cuda_runtime.h>

template <class T>
class dev_array
{
// public functions
public:
    explicit dev_array()
        : start_(0),
          end_(0)
    {}

    // constructor
    explicit dev_array(size_t size)
    {
        allocate(size);
    }
    // destructor
    ~dev_array()
    {
        free();
    }

    // resize the vector
    void resize(size_t size)
    {
        free();
        allocate(size);
    }

    // get the size of the array
    size_t getSize() const
    {
        return end_ - start_;
    }

    // get data
    const T* getData() const
    {
        return start_;
    }

    T* getData()
    {
        return start_;
    }

    // set
    void set(const T* src, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(start_, src, min * sizeof(T), cudaMemcpyHostToDevice);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to device memory");
        }
    }
    // get
    void get(T* dest, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(dest, start_, min * sizeof(T), cudaMemcpyDeviceToHost);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to host memory");
        }
    }


// private functions
private:
    // allocate memory on the device
    void allocate(size_t size)
    {
        cudaError_t result = cudaMalloc((void**)&start_, size * sizeof(T));
        if (result != cudaSuccess)
        {
            start_ = end_ = 0;
            throw std::runtime_error("failed to allocate device memory");
        }
        end_ = start_ + size;
    }

    // free memory on the device
    void free()
    {
        if (start_ != 0)
        {
            cudaFree(start_);
            start_ = end_ = 0;
        }
    }

    T* start_;
    T* end_;
};

#endif