Intro to CUDA
Modern game engines have a lot going on. With so many subsystems competing for resources, multi-threading is a way of life. As multi-core CPUs have become cheaper, game developers have been able to take advantage of parallelism more easily. While Intel and AMD fight to bring more and more cores to the CPU, GPUs have long surpassed them in raw parallel ability. Modern GPUs contain thousands of cores, allowing tens of thousands of threads to execute code simultaneously. This presents game developers with yet another opportunity to add parallelism to their programs. In separate threads, an engine may want to search or sort a large amount of data, pre-process trees, generate a large amount of random data, process an image or perform calculations for a transformation or collision detection. Any parallel computational task can be a good candidate for offloading to the GPU. This article aims to show you one possible way of harnessing that ability in a game using NVIDIA's CUDA.
CUDA is both a parallel platform and programming model that allows code to run directly on the processing cores that make up modern GPUs. It was created by NVIDIA and is currently supported only on NVIDIA hardware. It is similar in concept to OpenCL but different in execution. Using CUDA is as simple as having a recent NVIDIA graphics card and downloading the free toolkit, available for Windows, Linux and Mac OS X from NVIDIA's developer site. While it is proprietary to NVIDIA, the programming model is easy to use, is supported by many languages such as C/C++, Java and Python, and is even seeing support on ARMv7 architectures. The CUDA programming syntax itself is based on C and so pairs well with games written in C or C++. The CUDA code you write is compiled to object code with NVIDIA's nvcc compiler and then linked with standard C code using gcc or Visual Studio to produce the final program. For simple programs, the same file can contain both your entry point and your CUDA function(s). After downloading and installing the toolkit, compiling CUDA code can be done from the command line with the nvcc compiler or through Visual Studio using the CUDA Runtime template, which makes it easy to combine standard C/C++ and CUDA code files in one project.
To demonstrate CUDA with C, we can start with a simple addition function. All samples shown in this article were compiled with the CUDA 5.5 toolkit:
__global__ void cudaAdd(int a, int b, int *c)
{
    *c = a + b;
}
This function adds two numbers and stores the result in c. The __global__ qualifier marks this function as a kernel, an entry point for the CUDA program that runs on the GPU. Now we will see an example of how to call the above function. This can be placed in the same file to create one complete program:
#include <stdio.h>

__global__ void cudaAdd(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a = 4;
    int b = 7;
    int *c;
    int answer;
    cudaMalloc((void**)&c, sizeof(int));
    cudaAdd<<<1,1>>>(a, b, c);
    cudaMemcpy(&answer, c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, answer);
    cudaFree(c);
    return 0;
}
Programs on CUDA are executed as kernels, with one kernel executing at a time. A kernel can be run by just one or by thousands of threads at the same time. Since we are retrieving a result from the GPU, we first use CUDA to allocate device memory for it. Next we launch our kernel, using the <<<blocks, threads>>> syntax to specify how many blocks and threads we want the kernel to use. The number of threads that can run in a block depends on the specific architecture of the GPU you have; on Fermi GPUs you can execute up to 1024 threads per block. For this simple example we are just executing one thread on one block. Once the kernel has stored the result in our c variable, we need to copy it back to system memory using cudaMemcpy. Finally we can display the result.
Performing a Reduce
With a simple example out of the way, we can look at a more common pattern. A reduce is a parallel operation where data that exists across many threads is combined over a series of steps until a single value is held by one thread. A common example is computing a sum where each step adds the values of two different threads. After each step, fewer threads are used until only the final thread adds the last two remaining values and holds the sum. For this sample, we will demonstrate a program that has separate threads count the number of 5's in parts of an array and then performs a reduce to get the final total. This sample can be run over any number of blocks and threads:
__global__ void countFives(int *array, int size, int *total)
{
    int index = threadIdx.x;
    int totalThreads = blockDim.x * gridDim.x;
    int totalThreadIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int sharedCounts[512]; //assumes at most 512 threads per block

    //first determine how many elements each thread must count
    int chunk = (size / totalThreads);
    if (size % totalThreads > 0)
        chunk++;
    int start = totalThreadIndex * chunk;
    int end = start + chunk;
    if (end > size)
        end = size;

    //zero the per-thread count and, from the first thread, the global total
    sharedCounts[index] = 0;
    if (totalThreadIndex == 0)
        *total = 0;

    //have each thread count its own elements and store the count in shared memory
    for (int i = start; i < end; i++)
    {
        if (array[i] == 5)
        {
            sharedCounts[index]++;
        }
    }
    __syncthreads();

    //now perform a reduce to get the sum of all counts
    //the stride tells us how far apart the elements combined at each level are
    //each pass halves the number of active threads until only the first thread holds the block's count
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        int offset = index * (stride * 2);
        if (offset + stride < blockDim.x)
        {
            sharedCounts[offset] += sharedCounts[offset + stride];
        }
        __syncthreads(); //every thread must finish this step before the next begins
    }

    //now have the first thread of each block add the block's result to the global total
    if (index == 0)
    {
        atomicAdd(total, sharedCounts[0]);
    }
}
This program has three basic steps. First we broke the array into chunks and had each thread look for 5's in its own chunk. Then we performed a simple add reduction across the threads of each block, leaving the block's result in the first element of shared memory. For the last step we used an atomic add to accumulate the per-block results into the global total; the atomic add prevents contention issues between blocks. The __syncthreads function shown here acts as a barrier: all threads in the block must reach it before any can continue. The example on the whole is inefficient, as it uses only about half the total threads for the reduction and has potential contention issues when accessing global memory, but it hopefully demonstrates the basic concept of a reduction. The following allocates memory for the array and calls the function:
const int size = 11;
int sourceArray[size] = { 1, 4, 5, 2, 5, 6, 8, 9, 5, 12, 5 };
int total;       //stores the final value we can examine
int *cudaTotal;  //device memory for the running total
int *cudaArray;  //device memory for the source numbers
cudaMalloc((void**)&cudaTotal, sizeof(int));
cudaMalloc((void**)&cudaArray, sizeof(int)*size);
//copy our source numbers to the device before calling
cudaMemcpy(cudaArray, sourceArray, sizeof(int)*size, cudaMemcpyHostToDevice);
countFives<<<2,2>>>(cudaArray, size, cudaTotal);
//copy our result from the device to the program's memory
cudaMemcpy(&total, cudaTotal, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(cudaArray);
cudaFree(cudaTotal);
CUDA Thrust
A really great library that can be used for common CUDA tasks is Thrust. Thrust is a template library for CUDA that allows STL-like syntax to increase developer productivity. The CUDA toolkit comes with a version of Thrust that can be easily used in C++ code. The following demonstrates a sum reduction and a count of fives using the same array as above:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
...
thrust::device_vector<int> thrustArray(11);
thrustArray[0] = 1; thrustArray[1] = 4; thrustArray[2] = 5; thrustArray[3] = 2;
thrustArray[4] = 5; thrustArray[5] = 6; thrustArray[6] = 8; thrustArray[7] = 9;
thrustArray[8] = 5; thrustArray[9] = 12; thrustArray[10] = 5;
//compute the sum of all elements in our array
int sum = thrust::reduce(thrustArray.begin(), thrustArray.end(), (int) 0, thrust::plus<int>());
//get a count of just the 5's in our array
int count = thrust::count(thrustArray.begin(), thrustArray.end(), 5);
printf("Array sum: %d Count of fives: %d\n", sum, count);
As you can see, the syntax is very similar to the Standard Template Library and makes it very easy to call common functions, saving you lots of coding time. It also integrates well with STL vectors. The Thrust project's documentation and example gallery show many more useful patterns.
Integrating with OpenGL
A great feature of CUDA is its built-in ability to work with OpenGL directly. This allows a CUDA program easy access to data such as textures, pixel buffers or vertex buffers to perform operations against it quickly. Here we will see how we can use CUDA to alter data in a vertex buffer in parallel. The buffer shown here will be small and simple for demonstration purposes. I won't show all of the basic OpenGL setup or program layout here, but this sample will work with code from any basic OpenGL tutorial. I placed all my OpenGL code and the main game loop in one C file, and the CUDA kernel function and a wrapper to call it in a separate file with a .cu extension.
To get started, first we need to define our simple data structures to use to create the vertex buffer:
struct vertex {
    float x;
    float y;
    float z;
};

struct VertexType
{
    vertex position;
    //texture coordinate and other information below
    ...
};
Next we want to allocate an array to use for our vertex buffer using the above structures and then generate a buffer. For this sample we will just allocate an array of four vertices to store a quad. We also need a global variable to store the ID of our vertex buffer:
GLuint vbufferId;
...
VertexType verts[4];
verts[0].position.x = -1.0f; verts[0].position.y = 1.0f; verts[0].position.z = 0.0f;
verts[1].position.x = 1.0f; verts[1].position.y = 1.0f; verts[1].position.z = 0.0f;
verts[2].position.x = 1.0f; verts[2].position.y = -1.0f; verts[2].position.z = 0.0f;
verts[3].position.x = -1.0f; verts[3].position.y = -1.0f; verts[3].position.z = 0.0f;
//fill in texture coordinates, etc
...
glGenBuffers( 1, &vbufferId );
glBindBuffer( GL_ARRAY_BUFFER, vbufferId );
glBufferData( GL_ARRAY_BUFFER, 4 * sizeof(VertexType), verts, GL_DYNAMIC_DRAW );
With a simple buffer created, we can now create a CUDA resource to store a pointer to our vertex buffer. We need another global variable to store our resource:
struct cudaGraphicsResource *cuda_vb_resource;
Then we can map it to our vertex buffer immediately after the glBufferData call above:
cudaGraphicsGLRegisterBuffer(&cuda_vb_resource, vbufferId, cudaGraphicsMapFlagsWriteDiscard);
The resource now holds a pointer to the vertex buffer we created above, which allows us to retrieve and modify the vertices using CUDA. The actual program to modify our vertices is very simple. Since we want to stretch our quad in all directions, we first get a positive or negative one by dividing each vertex's position by the absolute value of itself. Then we multiply it by the elapsed time in seconds and by our desired rate of movement of .05 units a second.
__global__ void update_vb(VertexType *verts, double timeElapsed)
{
    int i = threadIdx.x;
    float valx = verts[i].position.x / fabsf(verts[i].position.x);
    float valy = verts[i].position.y / fabsf(verts[i].position.y);
    verts[i].position.x += valx * timeElapsed * .05f;
    verts[i].position.y += valy * timeElapsed * .05f;
}
I placed this code in a file separate from the main C file with the OpenGL code and gave it an extension of .cu. Note that the program assumes each thread will act on only one vertex. It also assumes one block for simplicity, but you could easily execute this over multiple blocks if you had enough vertices. We use the index of the current thread to determine which vertex to operate on. We also use an elapsed time variable to control how much change we want in each loop. This helps keep the movement constant if frame rates vary and our elapsed-time delta is constantly changing.
The last step now is to create a function to call our CUDA kernel. We can place this function in the same .cu file. The extern "C" specifier is used so that our main C program is able to find it when compiling and linking.
extern "C" void cuda_kernel(VertexType *verts, double timeElapsed)
{
update_vb<<<1,4>>>(verts, timeElapsed);
}
All the wrapper needs to do is pass in the arguments and instruct CUDA how many blocks and threads we want to run on. In this example we tell it to run 4 threads in one block so each thread has its own vertex. With the function in place, we can call it from the main logic loop. You will want to put the above function's signature with the extern keyword in your main C file if using multiple files so it can be found when linking. This code is set to execute once per loop:
VertexType *verts;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_vb_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&verts, &num_bytes, cuda_vb_resource);
cuda_kernel(verts, timeElapsed);
cudaGraphicsUnmapResources(1, &cuda_vb_resource, 0);
The code works by getting a pointer to the vertices in the vertex buffer that is mapped to our CUDA resource. The pointer is passed to the kernel wrapper so the vertices can be modified, and the resource is then unmapped so the buffer is released back to OpenGL. This sample assumes there is some code for getting the elapsed-time delta between this and the previous loop; QueryPerformanceCounter works well for this on Windows. After clearing buffers and setting our texture, our render code looks like this:
glBindBuffer( GL_ARRAY_BUFFER, vbufferId );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glTexCoordPointer( 2, GL_FLOAT, sizeof(VertexType), (GLvoid*)offsetof( VertexType, texcoord ) );
glVertexPointer( 3, GL_FLOAT, sizeof(VertexType), (GLvoid*)offsetof( VertexType, position ) );
//now draw the array
glDrawArrays( GL_QUADS, 0, 4 );
glDisableClientState( GL_TEXTURE_COORD_ARRAY );
glDisableClientState( GL_VERTEX_ARRAY );
The last step is to free our resources:
cudaGraphicsUnregisterResource(cuda_vb_resource);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDeleteBuffers(1, &vbufferId);
And that's it. OpenGL integration is fairly straightforward when dealing with buffers. This example can be easily extended to cover textures, pixel buffers or renderbuffers as well.
Integrating with Direct3D
Similar to its integration with OpenGL, CUDA provides the ability to tie in with Direct3D 9, 10 or 11. Here I will demonstrate the Direct3D 11 version of modifying a simple vertex buffer. Just like in the OpenGL example, we will create a simple 2D quad that we can resize in a game loop. We can use the same vertex structure from the OpenGL example, which allows us to reuse the same CUDA kernel function as we did earlier:
struct cudaGraphicsResource *cuda_vb_resource;
...
D3D11_BUFFER_DESC vertexBufferDesc, indexBufferDesc;
D3D11_SUBRESOURCE_DATA vertexData, indexData;
HRESULT result;
VertexType *vertices;
unsigned long *indices;
m_vertexCount = 4;
m_indexCount = 6;
vertices = new VertexType[m_vertexCount];
if(!vertices)
{
return false;
}
indices = new unsigned long[m_indexCount];
if(!indices)
{
return false;
}
vertices[0].position.x = -1.0f; vertices[0].position.y = -1.0f; vertices[0].position.z = 0.0f;
vertices[1].position.x = -1.0f; vertices[1].position.y = 1.0f; vertices[1].position.z = 0.0f;
vertices[2].position.x = 1.0f; vertices[2].position.y = 1.0f; vertices[2].position.z = 0.0f;
vertices[3].position.x = 1.0f; vertices[3].position.y = -1.0f; vertices[3].position.z = 0.0f;
//fill in other properties
...
//fill in indices for 2 triangles
indices[0] = 0; indices[1] = 1; indices[2] = 2;
indices[3] = 0; indices[4] = 2; indices[5] = 3;
//create a dynamic vertex buffer
vertexBufferDesc.Usage = D3D11_USAGE_DYNAMIC;
vertexBufferDesc.ByteWidth = sizeof(VertexType) * m_vertexCount;
vertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
vertexBufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
vertexBufferDesc.MiscFlags = 0;
vertexBufferDesc.StructureByteStride = 0;
vertexData.pSysMem = vertices;
vertexData.SysMemPitch = 0;
vertexData.SysMemSlicePitch = 0;
result = device->CreateBuffer(&vertexBufferDesc, &vertexData, &m_vertexBuffer);
if(FAILED(result))
{
return false;
}
//now create the index buffer
indexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
indexBufferDesc.ByteWidth = sizeof(unsigned long) * m_indexCount;
indexBufferDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
indexBufferDesc.CPUAccessFlags = 0;
indexBufferDesc.MiscFlags = 0;
indexBufferDesc.StructureByteStride = 0;
indexData.pSysMem = indices;
indexData.SysMemPitch = 0;
indexData.SysMemSlicePitch = 0;
result = device->CreateBuffer(&indexBufferDesc, &indexData, &m_indexBuffer);
if(FAILED(result))
{
return false;
}
With the buffers created we can associate the resource and our vertex buffer like we did with OpenGL:
cudaGraphicsD3D11RegisterResource(&cuda_vb_resource, m_vertexBuffer, cudaGraphicsRegisterFlagsNone);
Finally our rendering code looks like this:
unsigned int stride = sizeof(VertexType);
unsigned int offset = 0;
deviceContext->IASetVertexBuffers(0, 1, &m_vertexBuffer, &stride, &offset);
deviceContext->IASetIndexBuffer(m_indexBuffer, DXGI_FORMAT_R32_UINT, 0);
deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
With that up and running, we can call the update from inside a game loop just like with OpenGL. The example I wrote used the exact same kernel and external wrapper function from the OpenGL example:
VertexType *verts;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_vb_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&verts, &num_bytes, cuda_vb_resource);
cuda_kernel(verts, elapsedTime);
cudaGraphicsUnmapResources(1, &cuda_vb_resource, 0);
Lastly we need to clean up:
cudaGraphicsUnregisterResource(cuda_vb_resource);
if(m_indexBuffer)
{
m_indexBuffer->Release();
m_indexBuffer = 0;
}
if(m_vertexBuffer)
{
m_vertexBuffer->Release();
m_vertexBuffer = 0;
}
delete [] vertices;
delete [] indices;
Now we have seen some basic examples of how to create CUDA programs and how they can directly interact with data from OpenGL or Direct3D. These examples are pretty basic but hopefully provide a springboard to more advanced concepts. The SDK is loaded with useful samples that demonstrate the power and flexibility of the toolkit.