When you want to accelerate an image processing or computer vision application using a graphics processing unit (GPU) like Adreno™, OpenCL™ offers a convenient way to write your program, since its kernels are written in OpenCL C, a language based on the ubiquitous C language. As an open standard, OpenCL is supported by many of today’s GPUs, both discrete and mobile.
You can optimize your app for one GPU, but most of the time you lose performance when you port to the next one. That is especially true for mobile GPUs, so you need to tailor your application very carefully to get the performance you expect. Unfortunately, it’s not easy to understand or second-guess architectural details when you’re optimizing your program for each different GPU.
That’s why we’ve analyzed a variety of applications and come up with a set of guidelines for better OpenCL performance on Adreno GPUs. In this series, I’ll spell out the guidelines and steps you can follow to improve performance in your applications.
OPENCL PERFORMANCE IMPROVEMENT ON ADRENO GPU
In this post I’ll focus on memory optimization. I’m starting out with memory because most mobile applications are memory-bound rather than compute-bound, so performance optimization often boils down to restructuring how the application accesses memory.
For Adreno GPUs, I recommend the following:
- Vectorize and vectorize! We recommend a vector width of 4 (e.g., vload4).
- For maximal bandwidth, try to make each memory access 128 bits wide (for example, four 32-bit floats in one load).
- For most applications, it works better to read from an image object than from a buffer object.
- If you are working with YUV images and only processing the Y plane, see if you can map the Y plane as “RGBA” so you can read 4 pixels at once. (Of course, if you’re doing 5×5 convolution, that isn’t very convenient on its own, so try reading 8×5 pixels and producing 4 output pixels per work-item: 4 adjacent outputs with a 5-wide window span 4 + 5 − 1 = 8 input columns.)
- If you want more flexible vectorized loading and storing, or byte-addressable access, then I recommend reading from a buffer object.
- When in doubt, try reading from an image object, and writing to a buffer object.
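To make the 8×5 suggestion concrete, here is a rough host-side sketch in plain C rather than kernel code, so it can be run and checked without a GPU. It emulates what a single work-item would do: read two 4-pixel texels per row over five rows, then produce four adjacent outputs. The `texel4` type, the function name, and the row-major weight layout are all my own illustrative assumptions, not part of any Adreno or OpenCL API, and the filter is applied as a plain weighted sum with no kernel flip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one "RGBA" texel holding four neighbouring
   8-bit Y samples. */
typedef struct { uint8_t s[4]; } texel4;

/* Emulates one work-item. Reads an 8x5 window (two texels per row, five
   rows) whose top-left texel is (tx, ty), and writes four horizontally
   adjacent 5x5 weighted sums to out[0..3].
   weights holds the 25 filter taps in row-major order. */
void conv5x5_four_outputs(const texel4 *y_plane, size_t texels_per_row,
                          size_t tx, size_t ty,
                          const int weights[25], int out[4])
{
    assert(tx + 2 <= texels_per_row); /* both texels must lie in the row */

    for (size_t o = 0; o < 4; ++o)
        out[o] = 0;

    for (size_t r = 0; r < 5; ++r) {
        const texel4 *row = y_plane + (ty + r) * texels_per_row + tx;

        /* Two texel reads give the 8 input columns this row needs. */
        uint8_t window[8];
        for (size_t i = 0; i < 4; ++i) {
            window[i]     = row[0].s[i];
            window[4 + i] = row[1].s[i];
        }

        /* Output o uses window columns o .. o + 4 (at most column 7). */
        for (size_t o = 0; o < 4; ++o)
            for (size_t c = 0; c < 5; ++c)
                out[o] += weights[r * 5 + c] * window[o + c];
    }
}
```

Each row costs two texel reads for four outputs, instead of five single-pixel reads per output. In a real kernel the two reads per row would come from the image object, and the right image format for the Y plane depends on your data.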
USING LOCAL MEMORY
Developers often ask us when and how to use local memory. Adreno GPUs have dedicated local memory within each compute unit. For example, the Adreno 530 in the Snapdragon™ 820 processor has four compute units, each with its own 32KB of local memory.
Here are a few things to note:
- Local memory has lower latency than global memory, but it may come with hidden overhead. For example, using local memory often requires a local barrier to synchronize the work-items in a work-group, and the resulting synchronization latency can offset the benefit of low-latency access.
- Local memory is beneficial for storing intermediate data when you’re combining multiple stages of your algorithm into a single kernel. You’ll save DDR bandwidth, leading to lower power consumption.
- If you want to cache your data in local memory because you will be accessing it multiple times, a good rule of thumb is to make sure it’s accessed 3 times or more.
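The reuse rule of thumb can be turned into a quick estimate before you write any kernel code. The sketch below is plain host-side C, and it rests on assumptions of mine: a stencil-style kernel in which a work-group produces a `tile_w` × `tile_h` block of outputs and each output reads a `k` × `k` neighbourhood. The function and struct names are hypothetical. The threshold of 3 and the 32KB figure come from the text above; the estimate ignores barrier cost, so treat a positive answer as "worth measuring" rather than a guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t bytes;      /* local memory the cached tile would occupy */
    double avg_reuse;  /* average number of reads per cached element */
    bool   worthwhile; /* fits in local memory and is read >= 3 times */
} tile_estimate;

/* Estimates whether caching the input tile in local memory is likely to
   pay off for a k x k stencil over a tile_w x tile_h block of outputs. */
tile_estimate estimate_local_tile(size_t tile_w, size_t tile_h, size_t k,
                                  size_t elem_size, size_t local_mem_bytes)
{
    assert(tile_w > 0 && tile_h > 0 && k > 0 && elem_size > 0);

    /* The outputs plus the stencil border must be cached. */
    size_t cached = (tile_w + k - 1) * (tile_h + k - 1);
    /* Every output reads k * k inputs. */
    size_t reads = tile_w * tile_h * k * k;

    tile_estimate e;
    e.bytes = cached * elem_size;
    e.avg_reuse = (double)reads / (double)cached;
    e.worthwhile = e.bytes <= local_mem_bytes && e.avg_reuse >= 3.0;
    return e;
}
```

For example, a 16×16 tile of 4-byte floats with a 3×3 stencil caches 18 × 18 = 324 elements (1,296 bytes) and serves 2,304 reads, an average of about 7 reads per element, which comfortably clears the threshold. With k = 1 every element is read exactly once, so the cache buys nothing.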
USING CONSTANT MEMORY
Adreno GPUs have built-in RAM for storing constant data. Use it to speed up access to your constant arrays and variables.
- You can store up to 3KB of constant data in the built-in RAM.
- The compiler will attempt to promote constant variables and arrays to this RAM, but due to space limitations, some constants may not get promoted.
- To promote a constant array, the compiler needs to know its size. If the array is defined in the program scope, then you’re all set.
- If the array needs to be passed in as a kernel argument, use the following attribute, which tells the compiler its size in bytes:
__kernel void foo(__constant float *f __attribute__((max_constant_size(1024)))) { /* kernel body */ }
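Because the attribute takes a size in bytes rather than an element count, it is easy to pass the wrong number. Here is a small host-side sketch in plain C for working out the value and comparing it with the 3KB figure quoted above. The helper names are hypothetical, and fitting within 3KB does not guarantee promotion, since other constants compete for the same RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    OPENCL_FLOAT_BYTES          = 4,        /* an OpenCL float is 32 bits */
    BUILT_IN_CONSTANT_RAM_BYTES = 3 * 1024  /* 3KB figure from the text */
};

/* Value to write inside max_constant_size(...) for an array of `count`
   floats: the size in bytes, not the number of elements. */
size_t max_constant_size_for_floats(size_t count)
{
    assert(count <= SIZE_MAX / OPENCL_FLOAT_BYTES); /* guard overflow */
    return count * OPENCL_FLOAT_BYTES;
}

/* True when the array is small enough that it could fit in the built-in
   constant RAM. The compiler still decides whether to promote it. */
bool may_fit_in_constant_ram(size_t bytes)
{
    return bytes <= (size_t)BUILT_IN_CONSTANT_RAM_BYTES;
}
```

With these helpers, the 1024 in the attribute above corresponds to an array of 256 floats, which fits within the budget, while an array of 1024 floats needs 4,096 bytes and does not.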