Cyrille Favreau: When GPUs meet Service-oriented architectures – or how to build Cloud ready GPU systems

Wednesday, June 20, 2012


GPUs are cool. I like this basic statement as an introduction to this article. And since they are cool, why shouldn't we share them? I do not come from the HPC world; I have always worked on service-oriented software. I like the abstraction of calling a service without knowing where the code is being executed. If this is commonplace for CPUs, why shouldn't GPUs be part of the game? Programmers are comfortable with the idea of sharing a CPU between, or within, applications. This is done using threads, and everyone is happy to let the operating system take care of the necessary context switches. But what about GPUs?

From what I have read in many GPGPU programming articles, the large majority of applications seem to use the same basic architecture:

Send data to the GPU
Sequentially run the kernels
Read the results back.

In the context of SOA, such applications look very much like “batch” programs. They are usually started from the command line and they very often take exclusive control of external resources such as databases. Batch programs are used to process intensive work on large amounts of data. They usually run during the night, in order to prepare the data for the morning after the night before :) But GPUs are fast, very fast, and they have the power to make these long-running processes step into the real-time world. Turning batch programs into interactive services: that is the idea!

A GPU is a single resource, exclusively reserved by an application. In other words, several threads cannot access the GPU in parallel. Threads can post commands to the single queue (this will be improved with Hyper-Q on NVIDIA Kepler 2 devices); commands are then executed sequentially, and computed results are sent back to the host, still using the same single command queue.
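To make the single-queue model concrete, here is a minimal host-side sketch in plain C++. It does not call any GPU API: the `CommandQueue` class and the `runQueueDemo` function are hypothetical names invented for this illustration, and the single worker thread only stands in for the device consuming commands in order.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of one in-order command queue (no real GPU involved).
class CommandQueue {
public:
    CommandQueue() : worker_([this] { drain(); }) {}
    ~CommandQueue() { finish(); }

    // Any thread may post; commands run later, one at a time, in FIFO order.
    void post(std::function<void()> command) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(command));
        }
        wake_.notify_one();
    }

    // Runs everything already posted, then stops the worker thread.
    void finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        wake_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

private:
    void drain() {
        for (;;) {
            std::function<void()> command;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return closed_ || !pending_.empty(); });
                if (pending_.empty()) return;  // closed and nothing left to run
                command = std::move(pending_.front());
                pending_.pop();
            }
            command();  // executed outside the lock, strictly one after another
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> pending_;
    bool closed_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Three host threads post 100 increments each. The counter needs no lock of
// its own because only the queue's worker thread ever touches it.
int runQueueDemo() {
    int counter = 0;
    CommandQueue queue;
    std::vector<std::thread> posters;
    for (int t = 0; t < 3; ++t) {
        posters.emplace_back([&queue, &counter] {
            for (int i = 0; i < 100; ++i) queue.post([&counter] { ++counter; });
        });
    }
    for (auto& p : posters) p.join();
    queue.finish();  // waits until every posted command has run
    return counter;
}
```

The point of the sketch is that posting can happen from many threads at once, while execution stays sequential.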

The architecture we propose uses a singleton class to access the GPU. The server application instantiates as many threads as needed, ideally one per client, and each of them invokes methods on the singleton, which serializes calls to the GPU.
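A minimal sketch of that idea, assuming a mutex is what serializes the calls (the article does not say how the real server does it). `GpuService`, `renderFrame` and `runClients` are hypothetical names, and the body of `renderFrame` only simulates GPU work:

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the GPU-facing singleton. No real GPU API is called.
class GpuService {
public:
    // Function-local static: initialization is thread-safe since C++11.
    static GpuService& instance() {
        static GpuService service;
        return service;
    }

    GpuService(const GpuService&) = delete;
    GpuService& operator=(const GpuService&) = delete;

    // Only one client thread at a time gets past the lock.
    int renderFrame(int clientId) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++activeCalls_;
        maxActive_ = std::max(maxActive_, activeCalls_);
        // A real server would enqueue kernels and read back the frame here.
        ++framesRendered_;
        --activeCalls_;
        return clientId;  // placeholder result
    }

    int framesRendered() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return framesRendered_;
    }

    // Highest number of calls ever observed inside renderFrame at once.
    int maxConcurrentCalls() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return maxActive_;
    }

private:
    GpuService() = default;

    mutable std::mutex mutex_;
    int activeCalls_ = 0;
    int maxActive_ = 0;
    int framesRendered_ = 0;
};

// One thread per client, each requesting several frames from the singleton.
int runClients(int clientCount, int framesPerClient) {
    std::vector<std::thread> clients;
    for (int id = 0; id < clientCount; ++id) {
        clients.emplace_back([id, framesPerClient] {
            for (int f = 0; f < framesPerClient; ++f) {
                GpuService::instance().renderFrame(id);
            }
        });
    }
    for (auto& client : clients) client.join();
    return GpuService::instance().framesRendered();
}
```

Calling `runClients(4, 10)` from a `main` function spawns four client threads. However they interleave, `maxConcurrentCalls()` stays at 1, which is exactly the serialization the singleton is there to provide.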

In the example of the ray-tracer, what we want to offer is a different view of the same 3D scene to each client. The 3D scene is loaded onto the GPU when the server application starts. This process includes loading the 3D primitives as well as the textures and lights. Each client can then request a view of that 3D scene by specifying a number of camera parameters. For every client, each frame is computed on the server application and a compressed bitmap is sent back as the result of the rendering process. From a server application perspective, every request is only a set of parameters applied to a constant set of data, meaning there is no need for context switching.
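The "parameters applied to constant data" idea can be sketched as follows. All type and function names here are hypothetical and not taken from the project. The renderer is a placeholder that fills the frame with a single grey level, and compression is left out:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// Per-request input: the only thing that differs from one client to the next.
struct CameraParams {
    Vec3 eye;
    Vec3 target;
    int width;
    int height;
};

// Loaded once when the server starts and never modified afterwards.
struct Scene {
    std::vector<Vec3> sphereCenters;  // stand-in for the 3D primitives
    std::vector<Vec3> lights;
};

class RenderServer {
public:
    explicit RenderServer(Scene scene) : scene_(std::move(scene)) {}

    // const method on a const scene: a request cannot alter the shared data.
    // Placeholder renderer: one byte per pixel, grey level derived from the
    // camera distance along z. A real server would ray-trace and compress.
    std::vector<std::uint8_t> render(const CameraParams& cam) const {
        float dz = std::fabs(cam.eye.z - cam.target.z);
        auto shade = static_cast<std::uint8_t>(std::min(255.0f, dz * 10.0f));
        std::size_t pixels = static_cast<std::size_t>(cam.width) * cam.height;
        return std::vector<std::uint8_t>(pixels, shade);
    }

    std::size_t primitiveCount() const { return scene_.sphereCenters.size(); }

private:
    const Scene scene_;
};

// Small hard-coded scene used only for the illustration.
Scene makeDemoScene() {
    Scene scene;
    scene.sphereCenters = {Vec3{0.0f, 0.0f, 0.0f}, Vec3{1.0f, 0.0f, 0.0f}};
    scene.lights = {Vec3{0.0f, 5.0f, 0.0f}};
    return scene;
}
```

Two clients with different cameras get different frames from the same `RenderServer`, and the scene itself is never touched between requests.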

The protocol for accessing services from the cloud can be HTTP, IIOP, ICEP, Thrift, or any other protocol that efficiently supports data transfers. This choice usually depends on the data structures that need to transit on the wire. In our example, we use Ice, from ZeroC, because it’s fast, reliable, and allows compression on the fly.

The following diagram illustrates the technical architecture of the project:

Advantages:

GPU computing power can be delivered to any device.
The server can be implemented using a single technology (CUDA, OpenCL, DirectCompute).
Server deployments and GPU upgrades immediately benefit the clients.
Scalability is made easy using server clusters and load balancing features, usually provided by the transport layer.

Disadvantages:

Network bandwidth can be an issue, depending on the use case.
The client still has to be written in a technology supported by its hardware. Note that clients do not need support for GPU programming.

Download the client from https://github.com/cudaopencl/CloudCudaRaytracer

And give me a shout for a demo.

skype: cyrille_favreau