OpenCL Introduction

CONRAD offers Grid classes to handle 1D, 2D, and 3D image data. To speed up computation, there are also variants of Grid1D, Grid2D, and Grid3D that are compatible with OpenCL. These OpenCL grids can be used in the same way as the normal grids. If they are used in operations with other OpenCL grids, all computations are done entirely on the OpenCL device. If they are mixed with CPU grids, the data is automatically transferred from the device to the host memory. Thus, the user does not have to think about memory transfers.

One drawback of this method is that the memory always has to exist both on the host and on the OpenCL device. One advantage is that the code can be executed on the OpenCL device easily. In fact, the CPU code is exactly the same as the OpenCL code; only the underlying container is replaced. Thus, OpenCL grids are convenient, but come with a slight overhead.

Here we show some examples of how OpenCL grids are operated, indicate advantages and disadvantages, and give instructions on how to use these containers efficiently. For further details, also consult the OpenCL Design Considerations.

OpenCL Grids

This part is a simple example of OpenCL grids, and we demonstrate that using OpenCL grids can improve the computation speed. First, we need to define the OpenCL context and choose an OpenCL device. The CLContext is used to manage objects, memory transfers, and kernel executions.

    CLContext context = OpenCLUtil.getStaticContext();

Note that we use a CONRAD method from OpenCLUtil here. It creates a static reference to the current OpenCL context. If it is called for the first time, a dialog box appears that asks the user which OpenCL device to use. This device is then stored in CONRAD's registry for later use. If you want to reset this, you can use the ReconstructionPipelineFrame that is introduced in the Installation Tutorial. Go to Configuration / Registry to remove the OpenCL device entry.
Then, we select the OpenCL device with the best peak performance:

    CLDevice device = context.getMaxFlopsDevice();

Next, we create a 2000*2000 Shepp-Logan phantom on the CPU:

    Phantom shepp = new SheppLogan(2000);

Transfer to the OpenCL device is handled by the corresponding OpenCL grid container. It automatically copies the phantom data from CPU memory to OpenCL memory.

    OpenCLGrid2D sheppCL = new OpenCLGrid2D(shepp, context, device);

Now we double the phantom data number times on the CPU:

    for (int i = 0; i < number; i++){

The corresponding OpenCL code is identical, but uses the OpenCL grid:

    for (int i = 0; i < number; i++){

After that, we compare the time costs on the CPU and on the OpenCL device. We can use the following pattern to measure the time cost:

    long starttime = System.nanoTime();
    //Codes...
    long endtime = System.nanoTime();
    long timecost = endtime - starttime;

In the case of 10 iterations, the computation time on the CPU is 192.899 ms and the time cost on the OpenCL device is only 25.916 ms, which indicates that, for this operation, parallel computation with OpenCL on the GPU is considerably faster. Here, we achieve a speed-up factor of about 7.4 (192.899 ms / 25.916 ms). For the experiment, we used an Nvidia GTX 480.

OpenCL Texture Memory

In this example, we compare different methods for memory copy. Again, we create a 2000*2000 Shepp-Logan phantom on the CPU. Then we use different methods to copy the phantom data into GPU memory.

Method 1: We make a new OpenCL grid from a CPU grid and repeat this for number iterations.

    for (int i = 0; i < number; i++){
Method 2: We make a new OpenCL grid from a previously existing OpenCL grid and repeat this for the same number of iterations.

    for (int i = 0; i < number; i++){
Method 3: For every iteration, we copy the phantom data from CPU memory to OpenCL memory using a linear buffer.

    for (int i = 0; i < number; i++){
Method 4: First, we allocate an OpenCL texture (called an image in OpenCL terminology) and then overwrite the texture memory for number iterations. Note that the code below uses only buffers allocated in OpenCL memory. We just copy data from OpenCL linear memory into texture memory on the same OpenCL device.

    CLImage2d<FloatBuffer> image = context.createImage2d(sheppCL.getDelegate().getCLBuffer().getBuffer(), sheppCL.getSize()[0], sheppCL.getSize()[1], format);
Method 5: For every iteration, we allocate a new texture on the OpenCL device and copy the image data from CPU memory to OpenCL texture memory:

    for (int i = 0; i < number; i++){
Method 6: First, we allocate the OpenCL texture memory for the image data. Then, in every iteration, we only write the image data into the existing OpenCL texture memory. This means that every iteration still copies data from the CPU to the OpenCL device, but the OpenCL texture memory does not need to be reallocated.

    CLImage2d<FloatBuffer> image2 = context.createImage2d(sheppCL.getDelegate().getCLBuffer().getBuffer(), sheppCL.getSize()[0], sheppCL.getSize()[1], format);
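The difference between allocating in every iteration (as in Method 5) and allocating once and only overwriting (as in Method 6) can be illustrated with a host-only sketch. This is not CONRAD or OpenCL code: it uses java.nio.FloatBuffer as a stand-in for device memory, and the class and method names (ReuseVsReallocate, copyWithReallocation, copyWithReuse, timeMillis) are hypothetical. It only shows the allocation pattern and the System.nanoTime() timing pattern; measured times depend on the machine and the JVM and say nothing about actual OpenCL performance.

```java
import java.nio.FloatBuffer;

// Host-only illustration; FloatBuffer stands in for device memory (assumption).
class ReuseVsReallocate {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static double timeMillis(Runnable task) {
        long starttime = System.nanoTime();
        task.run();
        long endtime = System.nanoTime();
        return (endtime - starttime) / 1e6;
    }

    /**
     * Allocates a fresh buffer in every iteration before copying.
     * Expects iterations >= 1; returns the buffer of the last iteration.
     */
    public static FloatBuffer copyWithReallocation(float[] source, int iterations) {
        FloatBuffer target = null;
        for (int i = 0; i < iterations; i++) {
            target = FloatBuffer.allocate(source.length); // new allocation each time
            target.put(source);                            // copy the data
            target.rewind();                               // make it readable from index 0
        }
        return target;
    }

    /** Allocates once and only overwrites the contents in every iteration. */
    public static FloatBuffer copyWithReuse(float[] source, int iterations) {
        FloatBuffer target = FloatBuffer.allocate(source.length); // single allocation
        for (int i = 0; i < iterations; i++) {
            target.clear();     // reset position and limit, keep the memory
            target.put(source); // overwrite the existing contents
            target.rewind();
        }
        return target;
    }

    public static void main(String[] args) {
        int number = 10;
        float[] image = new float[2000 * 2000]; // same size as the phantom in the text
        for (int i = 0; i < image.length; i++) {
            image[i] = i % 255;
        }

        double reallocMs = timeMillis(() -> copyWithReallocation(image, number));
        double reuseMs = timeMillis(() -> copyWithReuse(image, number));
        System.out.printf("reallocate: %.3f ms, reuse: %.3f ms%n", reallocMs, reuseMs);
    }
}
```

Both variants end with the same buffer contents; they differ only in how often memory is allocated, which is the quantity the methods above vary.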
Comparing the results for different methods displayed in Figure 1, we can observe that:
Conclusion

OpenCL grids are useful and convenient. However, one has to keep in mind that the creation of new OpenCL grids also involves operations on the host computer. Thus, one should avoid calling "new" too often in this context. Reusing memory is much faster.

Code

The code of this example can be found in src.FlatPanelProject.TestofTextureCopy.java

Authors

Anja Jäger, Tilmann Hübner, Karoline Kallis, Hamidreza Moghadas, Yixing Huang, Andreas Maier