different image processing filters and algorithms is provided. To implement a CNN on a digital platform, we need an accurate discrete-time approximation of the CNN equation [95-97]. In this thesis, CNN implementation architectures based on GPU and FPGA are proposed. Figure 4-3 shows the abstract model of the proposed GPU-based system.
Figure 4-3: Architecture of system for processing images based on CNN
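As an illustration of the discrete-time approximation mentioned above, the following minimal sketch applies a forward-Euler step to a CNN grid. It assumes the standard normalized Chua-Yang state equation, dx/dt = -x + A*y + B*u + z, with the piecewise-linear output y = 0.5(|x + 1| - |x - 1|), 3x3 templates, and a fixed zero boundary. The function names, the step size h, and the boundary handling are illustrative assumptions, not the formulation used in this thesis.

```python
def output(x):
    """Piecewise-linear CNN output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def weighted_sum(template, grid, i, j, boundary=0.0):
    """Apply a 3x3 template to the neighbourhood of cell (i, j).

    Cells outside the grid take a fixed boundary value (an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            inside = 0 <= r < rows and 0 <= c < cols
            value = grid[r][c] if inside else boundary
            total += template[di + 1][dj + 1] * value
    return total


def euler_step(x, u, A, B, z, h):
    """One forward-Euler step: x[n+1] = x[n] + h * (-x[n] + A*y[n] + B*u + z)."""
    y = [[output(v) for v in row] for row in x]
    rows, cols = len(x), len(x[0])
    return [
        [
            x[i][j] + h * (-x[i][j]
                           + weighted_sum(A, y, i, j)
                           + weighted_sum(B, u, i, j)
                           + z)
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Usage: a single cell driven by a constant input through the control template B.
ZERO = [[0.0] * 3 for _ in range(3)]
B_CENTER = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
state = euler_step([[0.0]], [[1.0]], ZERO, B_CENTER, z=0.0, h=0.1)
```

Every cell update depends only on the previous state of its 3x3 neighbourhood, which is why a step of this kind can be computed for all cells in parallel.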
To gain more flexibility in design and accuracy in results, a software-based implementation of CNN is a good option. The only drawback is that performance degrades sharply as the CNN size increases. Therefore, we propose a parallel implementation of CNN on GPU. Instead of programming at the pixel level with the vertex engine and fragment engine, we propose an implementation on the OpenCL platform. OpenCL, a heterogeneous platform for high-performance computing on GPU and CPU devices, provides a set of APIs for executing kernels on computing devices and for communication between them. Kernels are distributed over one-, two-, or three-dimensional index spaces and follow a hierarchical abstraction model.
A GPU device provides local, global, and constant memory for computation, and each compute unit has its own local memory. OpenCL can easily manage communication among these memories and between different kernels. Figure 4-4 shows an overview of the CNN GPU design; this part is described in detail in Chapter 8.
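The hierarchical index space described above can be sketched in plain Python. The sketch emulates how OpenCL decomposes a two-dimensional index space into work-groups and work-items, using the relation global id = group id x local size + local id (zero global offset). Mapping one work-item to one CNN cell, the function name work_items, and the requirement that the global size be a multiple of the local size are assumptions made for illustration; the actual design is the one described in Chapter 8.

```python
def work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for each work-item of a 2D index space."""
    gx, gy = global_size
    lx, ly = local_size
    if gx % lx or gy % ly:
        raise ValueError("this sketch requires global size to be a multiple of local size")
    for group_y in range(gy // ly):
        for group_x in range(gx // lx):
            # Work-items of one group run on the same compute unit
            # and can share that unit's local memory.
            for local_y in range(ly):
                for local_x in range(lx):
                    global_id = (group_x * lx + local_x, group_y * ly + local_y)
                    yield (group_x, group_y), (local_x, local_y), global_id


# Usage: a 4x4 grid of cells split into 2x2 work-groups.
items = list(work_items((4, 4), (2, 2)))
```

In this sketch each global id appears exactly once, so every cell of the grid is handled by exactly one work-item.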
[Figure: block diagram with labels CNN Templates Bank/Memory, CNN on GPU, CPU, Global Memory]