Parallel implementation of canny edge detection algorithm using OpenMP and CUDA
Canny edge detection is a technique to extract useful structural information from different vision objects and dramatically reduce the amount of data to be processed. It has been widely applied in various computer vision systems. [1]
The process of Canny edge detection algorithm can be broken down to five different steps:
- Apply Gaussian filter to smooth the image in order to remove the noise
- Find the magnitude and direction of gradients of the image
- Apply non-max suppression: suppress edges that or not local maximum in the direction of gradient
- Apply double threshold to determine potential edges
- Track edge by hysteresis: Finalize the detection of edges by suppressing all the other edges that are weak and not connected to strong edges.
you can read more about the algorithm in this wikipedia article.
The serial implementation is available in the serial branch.
There are four variations for the OpenMP implementation that utilizes different parallelization methods:
parallel for
avaiable in themain
branch.parallel for collapse
avaiable in theparallel for collapse
branch.parallel for dynamic schedule
avaiable in theparallel for collapse
branch.taskloop
avaiable in thetaskloop
branch.
The CUDA implementation is available in the CUDA branch.
The project uses CMAKE as its build system generator. Create a folder called build
for example, in the project directory and cd
into it. Then run
build ..
and finally:
make
The resulting binary should be available under build/src
/path/to/executable /path/to/input/image /path/to/output/image low_threshold high_threshold
Example:
./src/Main ../input/1280x720.jpg ../output/1280x720.jpg 30 90
Tables and charts are available in this PDF file.
Using OpenMP and its directives, we achieved speedups up to 4. This speed was gained without much tinkering with the code. All we did was add some directives to the serial implementation.
On the specified hardware, 8 threads on average, seem to be the optimal number of threads. The number 8 corresponds to the number of CPU cores on the specified hardware.
parallel for
seems to be the way to go for the simple task of iterating over all pixels of the image. For huge input sizes however, dynamic schedule
yields better results.
Using CUDA we were able to achieved speedups up to 10. In contrast to OpenMP, we had to change the code quite a bit to be able to run it efficiently on a GPU. The small VRAM on the GPU caused some problems when trying to allocate memory for huge inputs sizes.