What are convolutional layers and how they enable deep learning for computer vision
Unlike you and me, computers work only with binary numbers, so they cannot "see" and understand an image directly. However, we can represent images using pixels. For a grayscale image, the smaller the pixel value, the darker that pixel is. A pixel takes values between 0 (black) and 255 (white); the numbers in between form a spectrum of grays. This range of 256 = 2⁸ values fits exactly in one byte, the smallest working unit of most computers.
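To make this concrete, here is a minimal sketch (the array values are my own illustrative example, not the image from the post) of how a tiny grayscale image is just a 2D grid of one-byte numbers:

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is one pixel's intensity.
# 0 = black, 255 = white, values in between are shades of gray.
image = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [  0, 128, 128, 255],
    [ 16,  80, 192, 240],
], dtype=np.uint8)  # uint8 = one byte per pixel, exactly the 0-255 range

print(image.shape)              # (4, 4)
print(image.min(), image.max()) # 0 255
```

The `uint8` data type is the standard choice here precisely because one unsigned byte covers the 0 to 255 range with no wasted space.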
Below is an example image I created in Python and its corresponding pixel values:
Using this concept, we can develop algorithms that detect patterns in these pixels to classify images. This is exactly what a Convolutional Neural Network (CNN) does.
Most images are not grayscale; they contain color. They are usually represented in RGB, where we have three channels: red, green, and blue. Each channel is a two-dimensional grid of pixels, and the channels are stacked on top of each other. The image input is therefore three-dimensional.
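The channel stacking can be sketched in a few lines (again with made-up values purely for illustration): three 2D grids combine into one 3D array of shape height × width × channels.

```python
import numpy as np

# Three 2D channels for a tiny 2x2 color image, each with values 0-255.
red   = np.array([[255,   0], [  0,   0]], dtype=np.uint8)
green = np.array([[  0, 255], [  0,   0]], dtype=np.uint8)
blue  = np.array([[  0,   0], [255,   0]], dtype=np.uint8)

# Stacking the channels along a third axis gives the 3D image input.
image = np.stack([red, green, blue], axis=-1)

print(image.shape)  # (2, 2, 3): height x width x channels
print(image[0, 0])  # [255 0 0] -> a pure red pixel
```

Note that some libraries instead put channels first (3 × height × width); PyTorch, for example, expects that layout.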
The code used to generate the plot is available on my GitHub:
The key element of CNNs is the convolution operation. I have a full article detailing how convolution works, but I’ll give a quick recap here for completeness. If you want an in-depth understanding, I highly recommend checking out the previous post:
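As part of that recap, here is a minimal sketch of the operation itself: slide a small kernel over the image, multiply element-wise, and sum. (Strictly speaking, deep learning frameworks compute cross-correlation without flipping the kernel, which is what this sketch does; the example image and kernel values are my own.)

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2D convolution (no flip), as used in CNN layers."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the kernel with the patch, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
], dtype=float)

# A simple vertical-edge-detecting kernel.
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

result = convolve2d(image, kernel)
print(result)  # 2x2 output: a 3x3 kernel on a 4x4 image leaves 2x2 positions
```

In a CNN the kernel values are not hand-picked like this; they are the weights the network learns during training.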