OpenCL Introduction

CONRAD offers Grid classes to handle 1D, 2D, and 3D image data. To speed up computation, there are also variants of Grid1D, Grid2D, and Grid3D that are compatible with OpenCL. These OpenCL grids can be used in the same way as the normal grids. If they are used in operations with other OpenCL grids, all computations are done entirely on the OpenCL device. If they are mixed with CPU grids, the data is automatically transferred from the device to the host memory. Thus, the user does not have to think about memory transfers. One drawback of this approach is that the memory always has to exist both on the host and on the OpenCL device. One advantage is that code can be moved to the OpenCL device easily: the CPU code is exactly the same as the OpenCL code, and only the underlying container is replaced. Thus, OpenCL grids are convenient, but come with a slight overhead.
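The mirrored host/device storage described above can be pictured with a small stand-in. The sketch below is not CONRAD's implementation: the class MirroredGrid and all of its methods are hypothetical, and the "device" is simulated by a second Java array. It only illustrates why device-only operations need no transfer, while a host-side read triggers one.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for a grid that keeps a host copy and a device copy.
 * NOT CONRAD code: the "device" is simulated by a plain Java array.
 */
final class MirroredGrid {
    private final float[] host;          // host (CPU) copy
    private final float[] device;        // simulated device copy
    private boolean hostStale = false;   // true when the device holds newer data
    private int deviceToHostTransfers = 0;

    MirroredGrid(float[] data) {
        host = data.clone();
        device = data.clone();           // simulated upload at construction
    }

    /** Device-side operation: touches only the device copies, no transfer. */
    void addByOnDevice(MirroredGrid other) {
        for (int i = 0; i < device.length; i++) {
            device[i] += other.device[i];
        }
        hostStale = true;
    }

    /** Host-side read: copies device data back only if the host copy is stale. */
    float[] readOnHost() {
        if (hostStale) {
            System.arraycopy(device, 0, host, 0, device.length);
            deviceToHostTransfers++;
            hostStale = false;
        }
        return host.clone();
    }

    int transfers() {
        return deviceToHostTransfers;
    }

    public static void main(String[] args) {
        MirroredGrid a = new MirroredGrid(new float[] {1f, 2f, 3f});
        MirroredGrid b = new MirroredGrid(new float[] {1f, 1f, 1f});
        a.addByOnDevice(b);
        a.addByOnDevice(b);
        // Two device-side additions, but only one transfer when the host reads.
        System.out.println(Arrays.toString(a.readOnHost()) + " transfers=" + a.transfers());
    }
}
```

Both arrays exist for the whole lifetime of the object, which mirrors the memory drawback mentioned above.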

Here we show some examples of how OpenCL grids are used, point out advantages and disadvantages, and give instructions on how to use these containers efficiently. For further details, also consult the OpenCL Design Considerations.

OpenCL Grids

This part is a simple example of OpenCL grids. It demonstrates that using OpenCL grids can improve computation speed.

First, we need to define the OpenCL Context and choose an OpenCL Device. The CLContext is used to manage objects, memory transfers, and kernel executions.

CLContext context = OpenCLUtil.getStaticContext();

Note that we use a CONRAD method from OpenCLUtil here. It creates a static reference to the current OpenCL context. When it is called for the first time, a dialog box appears that asks the user which OpenCL device to use. This device is then stored in CONRAD's registry for later use. To reset this choice, use the ReconstructionPipelineFrame that is introduced in the Installation Tutorial: go to Configuration / Registry and remove the OpenCL device entry.

Then, we select the OpenCL device with the best peak performance:

CLDevice device = context.getMaxFlopsDevice();

Next, we create a 2000*2000 Shepp-Logan phantom on the CPU:

Phantom shepp = new SheppLogan(2000);

Transfer to the OpenCL device is handled by the corresponding OpenCL grid container. It automatically copies the phantom data from CPU to OpenCL memory.

OpenCLGrid2D sheppCL = new OpenCLGrid2D(shepp, context, device);

Now we double the phantom data number times on the CPU by adding the grid to itself:

for (int i = 0; i < number; i++) {
    PointwiseOperators.addBy(shepp, shepp);
}

The corresponding OpenCL code is identical, except that it uses the OpenCL grid:

for (int i = 0; i < number; i++) {
    PointwiseOperators.addBy(sheppCL, sheppCL);
}

After that, we compare the time costs on the CPU and on the OpenCL device. We can use the following code to measure the time cost:

long startTime = System.nanoTime();
// code to be timed ...
long endTime = System.nanoTime();
long timeCost = endTime - startTime; // elapsed time in nanoseconds
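The measurement pattern above can be wrapped in a small helper. This is a minimal sketch, not part of CONRAD: the class TimingSketch and the method timeMillis are hypothetical names, and a plain float array stands in for the phantom data so that the example runs without CONRAD or an OpenCL device.

```java
import java.util.Arrays;
import java.util.Locale;

final class TimingSketch {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long end = System.nanoTime();
        return (end - start) / 1e6;   // nanoseconds to milliseconds
    }

    public static void main(String[] args) {
        final int number = 10;
        // Stand-in for the 2000*2000 phantom (assumption: plain array, not a CONRAD grid).
        final float[] data = new float[2000 * 2000];
        Arrays.fill(data, 1.0f);

        // Doubles every element number times, like adding the grid to itself.
        double ms = timeMillis(() -> {
            for (int i = 0; i < number; i++) {
                for (int j = 0; j < data.length; j++) {
                    data[j] += data[j];
                }
            }
        });
        System.out.println(String.format(Locale.ROOT,
                "CPU doubling, %d iterations: %.3f ms", number, ms));
    }
}
```

A single run like this includes JIT warm-up effects, so repeated measurements give more stable numbers.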

With 10 iterations, the computation takes 192.899 ms on the CPU and only 25.916 ms on the OpenCL device, a speed-up factor of about 7.4. This indicates that parallel computation with OpenCL on a GPU is much faster for this operation. For the experiment, we used an Nvidia GTX 480.
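The speed-up factor is simply the ratio of the two measured times. The short sketch below recomputes it from the numbers reported above; the class and method names are hypothetical and only serve this illustration.

```java
import java.util.Locale;

final class SpeedupCheck {

    /** Ratio of baseline time to accelerated time; both in the same unit. */
    static double speedup(double baselineMillis, double acceleratedMillis) {
        if (acceleratedMillis <= 0.0) {
            throw new IllegalArgumentException("accelerated time must be positive");
        }
        return baselineMillis / acceleratedMillis;
    }

    public static void main(String[] args) {
        // Times reported in the text: CPU 192.899 ms, OpenCL device 25.916 ms.
        double factor = speedup(192.899, 25.916);
        System.out.println(String.format(Locale.ROOT, "speed-up: %.1f", factor));
    }
}
```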

Code

The code of this example can be found in src.FlatPanelProject.SimpleCLGridExample.java

OpenCL Texture Memory

Fig 1: Time Cost For Different Methods

Fig 2: Time Cost Per Iteration For "Overwrite GPU Texture"

In this example, we compare different methods for copying memory. Again, we create a 2000*2000 Shepp-Logan phantom on the CPU. Then we use different methods to copy the phantom data into GPU memory.

Method 1: In each of number iterations, we create a new OpenCL grid from a CPU grid.

for (int i = 0; i < number; i++) {
    OpenCLGrid2D grid = new OpenCLGrid2D(shepp, context, device);
    grid.getDelegate().release();
}

Method 2: In each of number iterations, we create a new OpenCL grid from a previously existing OpenCL grid.

for (int i = 0; i < number; i++) {
    OpenCLGrid2D grid = sheppCL.clone();
    grid.getDelegate().release();
}

Method 3: For every iteration, we copy the phantom data from CPU memory to OpenCL memory using a linear buffer.

for (int i = 0; i < number; i++) {
    queue.putWriteBuffer(sheppCL.getDelegate().getCLBuffer(), true);
}

Method 4: First, we allocate an OpenCL texture (called an image in OpenCL terminology) and then overwrite the texture memory in each of number iterations. Note that the code below only uses buffers allocated in OpenCL memory: the data is copied from linear memory into texture memory on the same OpenCL device.

CLImage2d<FloatBuffer> image = context.createImage2d(sheppCL.getDelegate().getCLBuffer().getBuffer(), sheppCL.getSize()[0], sheppCL.getSize()[1], format);

for (int i = 0; i < number; i++) {
    queue.putCopyBufferToImage(sheppCL.getDelegate().getCLBuffer(), image).finish();
}
image.release();

Method 5: For every iteration, we allocate a new texture on the OpenCL device and copy the image data from CPU memory to OpenCL texture memory:

for (int i = 0; i < number; i++) {
    CLImage2d<FloatBuffer> image2 = context.createImage2d(sheppCL.getDelegate().getCLBuffer().getBuffer(), sheppCL.getSize()[0], sheppCL.getSize()[1], format);
    queue.putWriteImage(image2, true);
    queue.finish();
    image2.release();
}

Method 6: First, we allocate the OpenCL texture memory for the image data. Then, in every iteration, we only write the image data into this texture memory. This means that every iteration copies data from the CPU to the OpenCL device, but the OpenCL texture memory does not have to be reallocated.

CLImage2d<FloatBuffer> image2 = context.createImage2d(sheppCL.getDelegate().getCLBuffer().getBuffer(), sheppCL.getSize()[0], sheppCL.getSize()[1], format);

for (int i = 0; i < number; i++) {
    queue.putWriteImage(image2, true);
    queue.finish();
}
image2.release();

Comparing the results for different methods displayed in Figure 1, we can observe that:

Overwriting the OpenCL texture (Method 4) performs best, as it only copies data within device memory, from the linear buffer into the texture. No memory on the host has to be accessed.
Method 3, Method 5, and Method 6 have almost the same time cost. Comparing Method 3 and Method 6, we can see that copying data into OpenCL linear memory is a little faster than copying into texture memory. Method 6 is faster than Method 5 because it does not need to reallocate the OpenCL texture memory in every iteration. The difference is, however, small.
Making a new OpenCL grid from a CPU grid (Method 1) or from an OpenCL grid (Method 2) is quite time consuming. Note that these two methods operate on both the CPU and the OpenCL device: they have to allocate and initialize memory on both sides and also transfer data.
For each method, the time cost per iteration decreases slightly as the number of iterations grows and converges to a constant. To avoid random fluctuations in the time measurements, a large number of iterations is necessary. Here, we chose 10,000 repetitions (cf. Figure 2).
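The per-iteration measurement behind the last observation can be sketched as follows. This is an illustration under assumptions, not the benchmark code of this tutorial: the class PerIterationTiming and its method are hypothetical, and a host-side System.arraycopy stands in for the OpenCL copy so that the example runs without a device.

```java
import java.util.Locale;

final class PerIterationTiming {

    /** Average time per iteration in microseconds over the given number of repetitions. */
    static double microsPerIteration(Runnable task, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / 1e3 / iterations;   // nanoseconds to microseconds, then average
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): copy a 1000*1000 float array on the host.
        final float[] src = new float[1000 * 1000];
        final float[] dst = new float[src.length];
        Runnable copy = () -> System.arraycopy(src, 0, dst, 0, src.length);

        // Larger iteration counts average out fixed overheads and timer noise.
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(String.format(Locale.ROOT,
                    "%5d iterations: %.1f us per iteration", n, microsPerIteration(copy, n)));
        }
    }
}
```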

Conclusion

OpenCL grids are useful and convenient. However, one has to keep in mind that creating a new OpenCL grid also involves operations on the host computer. Thus, one should avoid calling "new" too often in this context; reusing memory is much faster.
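The two patterns compared in this conclusion, allocating in every iteration versus reusing one allocation, can be contrasted in a host-only analogy. The sketch below uses plain Java arrays instead of OpenCL grids, and all names in it are hypothetical. JVM allocation costs differ from OpenCL allocation costs, so it only illustrates the structure of the two loops, not the timings reported above.

```java
import java.util.Arrays;

final class ReuseVersusAllocate {

    /** Allocates a fresh destination in every iteration (like calling "new" each time). */
    static float[] copyWithAllocation(float[] src, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        float[] dst = null;
        for (int i = 0; i < iterations; i++) {
            dst = new float[src.length];
            System.arraycopy(src, 0, dst, 0, src.length);
        }
        return dst;
    }

    /** Allocates the destination once and overwrites it in every iteration. */
    static float[] copyWithReuse(float[] src, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        float[] dst = new float[src.length];
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, src.length);
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] src = new float[1000 * 1000];
        Arrays.fill(src, 1.0f);
        int number = 100;

        long t0 = System.nanoTime();
        copyWithAllocation(src, number);
        long t1 = System.nanoTime();
        copyWithReuse(src, number);
        long t2 = System.nanoTime();

        System.out.println("allocate each iteration: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("reuse one allocation:    " + (t2 - t1) / 1e6 + " ms");
    }
}
```

Both methods produce the same result; only the number of allocations differs.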

Code

The code of this example can be found in src.FlatPanelProject.TestofTextureCopy.java

Authors

Anja Jäger, Tilmann Hübner, Karoline Kallis, Hamidreza Moghadas, Yixing Huang, Andreas Maier
