CPU vs GPU: Blurring an image using DirectCompute

A GPU, with its SIMD (Single Instruction, Multiple Data) style architecture, offers massive parallelism across thousands of cores, compared with typically just 8 cores on a CPU (some of which may be logical rather than physical cores). GPU threads are also implemented in hardware, which makes context switching between them almost instantaneous and far cheaper than on a CPU. Give a GPU a large batch of floating point calculations and it will put CPU performance to shame.

Of course, the CPU excels at general purpose tasks. Writing CPU programs also requires next to no thought on the programmer's part about the underlying hardware. Programmers are even discouraged from hand-optimizing the binaries their compiler produces; it is accepted that the compiler “knows best” when it comes to optimizations. Cache memory is implemented in several levels, and the cache control hardware (which the programmer cannot control) tries to predict the memory usage of your code. If your code requests memory in the pattern that hardware predicts, your program will keep the CPU busy and run at an optimal level.

The grass isn’t so green on the GPU side. GPU programming requires extensive knowledge of the underlying hardware. The reward is an amazing performance boost if you choose the correct configuration, and the punishment is no gain (or even a loss) if your configuration cannot make use of the hardware. In GPU programming, programmers must actually plan their code around the architecture. You as a programmer are given full control of the cache. With three kinds of memory (local, shared and global) of varying speeds and sizes to choose from, you must decide where your program's data is kept. By profiling your code, you must make sure that the configuration you chose and the kernel you wrote (GPU programs are called kernels) make full use of the hardware's capability. This may sometimes mean that a serial algorithm must be converted to a parallel version, and/or that you must write different kernels targeting different models of hardware. Finally, there’s the obvious hassle of having to copy data back and forth between CPU and GPU memory.

If there are so many requirements to get good performance on a GPU, is it really worth it? I wanted to see how fast my NVIDIA GeForce GT 425M GPU could perform a box blur with a 9×9 blur kernel (the matrix used for blurring is also called a kernel, not to be confused with GPU programs) over a 2560×1600 image. I got the following results:

Box Blur

       Housekeeping code time    Blurring code time
CPU    n/a                       3-4 secs
GPU    About 0.8-2 secs          ~122 milliseconds

NOTE: The CPU program was built with optimizations on. The GPU blurring time is as reported by the GPU, while the other timings are rough. The thread configuration and the GPU kernels used may not be optimal.
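
The CPU code behind these timings is not shown here, so the following is only a minimal C++ sketch of what a naive single-threaded box blur could look like, not the program that was measured. The function name `boxBlur`, the single-channel 8-bit image layout and the clamp-to-edge border handling are all assumptions made for illustration. A radius of 4 gives the 9×9 window used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive box blur of a single-channel 8-bit image stored row by row.
// radius = 4 corresponds to a 9x9 blur kernel.
// Border handling is an assumption: samples outside the image are
// clamped to the nearest edge pixel.
std::vector<std::uint8_t> boxBlur(const std::vector<std::uint8_t>& src,
                                  int width, int height, int radius)
{
    assert(width > 0 && height > 0 && radius >= 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    std::vector<std::uint8_t> dst(src.size());
    const int side = 2 * radius + 1;   // 9 for radius 4
    const int window = side * side;    // 81 samples per output pixel

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                const int sy = std::min(std::max(y + dy, 0), height - 1);
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    sum += src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            // Rounded integer average over the whole window.
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>((sum + window / 2) / window);
        }
    }
    return dst;
}
```

Every output pixel depends only on the input image, never on another output pixel, which is what makes this loop nest a natural fit for one GPU thread per pixel. On a 2560×1600 image the inner loops read 81 samples for each of about 4.1 million pixels.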

The main obstacle in getting started with DirectCompute is learning how to get HLSL code (High Level Shader Language, the programming language used for GPU programming) and data onto the GPU, and how to copy the result back from the GPU into CPU memory. Using the ‘BasicCompute’ example that comes with the DirectX SDK, I wrote a neat little class called ‘ComputeShader’ that wraps this cleanly, so that the programmer can focus on writing GPU code rather than on the routine work of dealing with DirectX COM interfaces and copying data back and forth.

Download ‘ComputeShader’ class files: Click Here
Download sample program source using the class: Click Here

For new users of this class, please download the sample to see how to use the class. In summary, here’s what you need to know to use it:

1. Call class functions in this order:

  • CompileShader(<filename of HLSL code>, <Entry point function>, <No. of X threads in a block>, <No. of Y threads in a block>, <No. of Z threads in a block>)
  • RunShader(<No. of X blocks>, <No. of Y blocks>, <No. of Z blocks>, <Vector of Input data>, <Vector specifying sizes of Output data>, <Vector of Constant data>)
  • Result<Type you want to be returned>(<Index of output data item>)

2. For Input data, specify a vector of tuples, each built as ‘make_tuple(Pointer to Input data, Size of each element, Total elements)’. The order in which you push elements into the vector maps to the order of the registers on the GPU.

3. For Output data, specify a vector of tuples, each built as ‘make_tuple(Size of each element, Total elements)’.

4. For Constant data, specify a vector of tuples just as for Input data. Make sure that the constant data is 32-bit aligned, otherwise you will have problems.

5. Using an already compiled HLSL object file is not supported in the version of the code published with this article. This may change in the future.

6. Any error during compilation of the shader code is written to the Immediate Window of your IDE.

7. GPU execution time can be profiled using the ‘GetExecutionTime()’ function. Sometimes the time returned by this function cannot be trusted because the GPU's clock frequency changed while your code was executing (one possible cause is the GPU being powered down because the computer is running on batteries). The function takes an optional pointer to a bool variable, which tells you whether the value can be trusted.

8. The class assumes hardware that supports Compute Shader 5.0.

9. To control the number of threads from your source .cpp file, make sure you use the ‘NUM_OF_THREADS_X’, ‘NUM_OF_THREADS_Y’ and ‘NUM_OF_THREADS_Z’ macros in your HLSL file. See the sample.

10. You are free to use the class or sample code in any way you like. That said, I disclaim any liability arising from its use.
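
To make items 1 to 4 a little more concrete, here is a hedged C++ sketch of the bookkeeping that would happen before calling ‘RunShader’. It does not call the ‘ComputeShader’ class at all. The names `blocksFor`, `InputDesc`, `OutputDesc`, `BlurConstants`, `BlurJob` and `describeBlurJob` are hypothetical, and the tuple element types are guesses based on the descriptions above, so check the downloadable sample for the real ones. The pixel format (one packed 32-bit value per pixel) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Blocks needed so that blocks * threadsPerBlock covers all items.
// Rounds up, so the last block may be partly idle and the kernel
// should bounds-check its thread index.
constexpr unsigned blocksFor(unsigned items, unsigned threadsPerBlock)
{
    return (items + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical aliases mirroring items 2 and 3 of the list.
using InputDesc  = std::tuple<const void*, std::size_t, std::size_t>; // pointer, element size, element count
using OutputDesc = std::tuple<std::size_t, std::size_t>;              // element size, element count

// Hypothetical constant data. Every member is 32 bits wide, and the
// padding member rounds the struct up to 16 bytes, since Direct3D 11
// requires constant buffer sizes to be a multiple of 16 bytes.
struct BlurConstants {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t radius;
    std::uint32_t padding;
};
static_assert(sizeof(BlurConstants) % 16 == 0,
              "constant data should be padded to a multiple of 16 bytes");

// Everything a RunShader-style call would need, gathered in one place.
struct BlurJob {
    unsigned blocksX;
    unsigned blocksY;
    std::vector<InputDesc> inputs;
    std::vector<OutputDesc> outputs;
    std::vector<InputDesc> constants;
};

// The descriptors store pointers into `pixels` and `constants`, so both
// must outlive the returned job.
BlurJob describeBlurJob(const std::vector<std::uint32_t>& pixels,
                        const BlurConstants& constants,
                        unsigned threadsX, unsigned threadsY)
{
    assert(threadsX > 0 && threadsY > 0);
    assert(pixels.size() ==
           static_cast<std::size_t>(constants.width) * constants.height);

    BlurJob job;
    job.blocksX = blocksFor(constants.width, threadsX);
    job.blocksY = blocksFor(constants.height, threadsY);
    // The push order decides which GPU register each buffer maps to.
    job.inputs.push_back(std::make_tuple(
        static_cast<const void*>(pixels.data()),
        sizeof(std::uint32_t), pixels.size()));
    job.outputs.push_back(std::make_tuple(
        sizeof(std::uint32_t), pixels.size()));
    job.constants.push_back(std::make_tuple(
        static_cast<const void*>(&constants),
        sizeof(BlurConstants), std::size_t{1}));
    return job;
}
```

With 16×16 threads per block (an arbitrary choice for this sketch, not a tuned value), a 2560×1600 image works out to 160×100 blocks. Rounding up matters for sizes that are not a multiple of the block size: 2561 pixels across would need 161 blocks.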

