A GPU, with its SIMD (Single Instruction, Multiple Data) architecture, offers massive parallelism across thousands of cores, compared to typically just 8 cores (which may be logical rather than physical cores) in a CPU. GPU threads are also implemented in hardware, making their context switches almost instantaneous and far cheaper than on a CPU. Give a GPU large floating-point calculations and it will put CPU performance to shame.
Of course, the CPU excels at general-purpose tasks. Not to mention, writing CPU programs requires next to no thought on the programmer's part about the underlying hardware. Programmers are even discouraged from optimizing the binaries produced by their language's compiler; it is accepted that the compiler “knows best” when it comes to optimization. Cache memory is implemented in several levels, and the underlying cache control hardware (which the programmer cannot control) tries to predict your code's memory usage. If your code requests memory according to the predicted usage model, your program will keep the CPU busy and run at an optimal level.
The grass isn’t so green on the GPU side. GPU programming requires extensive knowledge of the underlying hardware. The reward is an amazing performance boost if you use the correct configuration, and the punishment is no gain (or even a loss) if your configuration cannot exploit the hardware. In GPU programming, the programmer must actually plan their code around the architecture. You as a programmer are given full control of the cache: with three memories – local, shared, and global – of varying speeds and sizes to choose from, you must decide where your program's data is kept. By profiling your code, you must make sure that the configuration you chose and the kernel you wrote (GPU programs are called kernels) make full use of the hardware's capability. This may sometimes mean that a serial algorithm must be converted to a parallel version, and/or that you must write different kernels targeting different models of hardware. Finally, there's the obvious hassle of having to copy data back and forth between CPU and GPU memories.
If there are so many requirements for getting good performance out of a GPU, is it really worth it? I wanted to see how fast my NVIDIA GeForce GT 425M GPU could perform a box blur using a 9×9 blur kernel (the matrix used for blurring is also called a kernel, not to be confused with GPU programs) over a 2560×1600 image. I got the following results:
|     | Housekeeping code time | Blurring code time |
| --- | ---------------------- | ------------------ |
| GPU | About 0.8–2 secs       | ~122 milliseconds  |
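For context, here is what that box blur computes. This is a minimal single-channel CPU sketch, not the HLSL kernel used in the measurement above: each output pixel is the average of the k×k window centered on it. On a 2560×1600 image with a 9×9 kernel, every output pixel averages 81 input pixels – roughly 332 million reads and additions – which is exactly the kind of independent, uniform arithmetic a GPU devours.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Box-blur a single-channel image: each output pixel becomes the average of
// the k x k window centered on it, with reads clamped at the image edges.
std::vector<float> boxBlur(const std::vector<float>& src, int w, int h, int k)
{
    const int r = k / 2;  // kernel radius: 4 for a 9x9 kernel
    std::vector<float> dst(src.size());
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            float sum = 0.0f;
            for (int dy = -r; dy <= r; ++dy)
            {
                for (int dx = -r; dx <= r; ++dx)
                {
                    // Clamp sample coordinates so edge pixels reuse the border.
                    const int sx = std::min(std::max(x + dx, 0), w - 1);
                    const int sy = std::min(std::max(y + dy, 0), h - 1);
                    sum += src[static_cast<std::size_t>(sy) * w + sx];
                }
            }
            dst[static_cast<std::size_t>(y) * w + x] = sum / (k * k);
        }
    }
    return dst;
}
```

Every output pixel depends only on the input, never on other output pixels, so the two outer loops can be split across as many GPU threads as there are pixels.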
The main obstacle in getting started with DirectCompute is learning how to get HLSL (High Level Shader Language, the language used for GPU programming in DirectX) code and data onto the GPU and copy results back from GPU into CPU memory. Using the ‘BasicCompute’ example that comes with the DirectX SDK, I wrote a neat little class called ‘ComputeShader’ which wraps this cleanly, so that the programmer can focus on writing GPU code rather than the trivial task of dealing with DirectX COM interfaces and copying data back and forth.
For new users of this class, please download the sample to see how to use the class. In summary, here’s what you need to know to use it:
1. Call class functions in this order:
- CompileShader(<filename of HLSL code>, <Entry point function>, <No. of X threads in a block>, <No. of Y threads in a block>, <No. of Z threads in a block>)
- RunShader(<No. of X blocks>, <No. of Y blocks>, <No. of Z blocks>, <Vector of Input data>, <Vector specifying sizes of Output data>, <Vector of Constant data>)
- Result<Type you want to be returned>(<Index of output data item>)
2. For Input data, specify a vector of tuples of the form ‘make_tuple(Pointer to input data, Size of each element, Total elements)’. The order in which you push elements into the vector maps to the order of registers on the GPU.
3. For Output data, specify a vector of tuples of the form ‘make_tuple(Size of each element, Total elements)’.
4. For Constant data, specify a vector of tuples just as in the case of Input data. Make sure that constant data is 32-bit aligned, otherwise you will have problems.
5. Using an already compiled HLSL object file is not supported in the version of the code published with this article. This may change in the future.
6. Any errors during compilation of the shader code are written to the Immediate Window of your IDE.
7. Profiling GPU execution time can be done using the ‘GetExecutionTime()’ function. Sometimes the value returned by this function cannot be trusted, because the GPU's execution frequency changed while your code was running (one cause of this may be the GPU being powered down because the computer is running on batteries). The function takes an optional pointer to a bool with which you can determine the value's trustworthiness.
8. The class assumes ComputeShader 5.0-capable hardware.
9. To control the number of threads from your source .cpp file, make sure you use the ‘NUM_OF_THREADS_X’, ‘NUM_OF_THREADS_Y’ and ‘NUM_OF_THREADS_Z’ macros in your HLSL file. See the sample.
10. You are free to use the class or sample code in any way you like. That said, I disclaim any liability arising from its use.
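Putting steps 1–4 and 9 together, the host-side setup might look like the sketch below. The tuple shapes and the CompileShader/RunShader/Result call order are the ones described above, but the file name ‘BoxBlur.hlsl’, the entry point name, and the buffer sizes are purely illustrative; the actual class calls are left as comments, since the class ships with the article's sample rather than with this sketch. Note the ceil-division when computing block counts, so that partial blocks at the right and bottom edges of the image are still dispatched.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// Number of thread blocks needed to cover 'size' elements with blocks of
// 'threadsPerBlock' threads, rounding up so edge elements are not missed.
int numBlocks(int size, int threadsPerBlock)
{
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}

// Build the argument vectors in the shapes the ComputeShader class expects.
// Sizes are illustrative: a 2560x1600 single-channel image, 16x16 thread blocks.
void setupShaderArguments()
{
    const int width = 2560, height = 1600;
    const int threadsX = 16, threadsY = 16;
    std::vector<float> pixels(static_cast<std::size_t>(width) * height);

    // Step 2 - input data: make_tuple(pointer, size of each element, total
    // elements). Push order maps to GPU register order.
    std::vector<std::tuple<void*, std::size_t, std::size_t>> input;
    input.push_back(std::make_tuple(pixels.data(), sizeof(float), pixels.size()));

    // Step 3 - output data: make_tuple(size of each element, total elements).
    std::vector<std::tuple<std::size_t, std::size_t>> output;
    output.push_back(std::make_tuple(sizeof(float), pixels.size()));

    // Step 4 - constant data: same shape as input; keep each entry 32-bit aligned.
    int dims[2] = { width, height };
    std::vector<std::tuple<void*, std::size_t, std::size_t>> constants;
    constants.push_back(std::make_tuple(static_cast<void*>(dims), sizeof(int), 2));

    const int blocksX = numBlocks(width, threadsX);   // 2560 / 16 = 160
    const int blocksY = numBlocks(height, threadsY);  // 1600 / 16 = 100

    // Step 1 - the call order from the article (hypothetical file/entry names):
    // ComputeShader cs;
    // cs.CompileShader(L"BoxBlur.hlsl", "BlurMain", threadsX, threadsY, 1);
    // cs.RunShader(blocksX, blocksY, 1, input, output, constants);
    // std::vector<float> result = cs.Result<float>(0);
    (void)blocksX; (void)blocksY;  // silence unused-variable warnings in this sketch
}
```

One thread per pixel with 16×16 blocks divides the 2560×1600 image evenly here, but the rounding-up in numBlocks matters for image sizes that are not multiples of the block dimensions.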