Let’s See … How Computers Can See!

Yes, you read that right. Computers can actually see.
And guess what!
Not only can they see, but they can interpret what they see!
And you know what?!
You have seen this happen yourself! You have seen examples of how computers can see… but you didn’t pay attention to them!

Still not convinced? I'm sure one example is enough to make you see it! Suppose you went on a trip with your friends, took photos, and uploaded them to Facebook. Suddenly, Facebook started suggesting who you should tag in your photos!

Yeah, I know that …

There you go! How can you explain this? Well, it looks like Facebook saw your photo and recognized who was with you on your trip, and accordingly, suggested who you should tag to save you the trouble and effort of tagging people individually. So, logically speaking, Facebook was able to see your images and interpret what’s inside of them.

Fair enough! But just out of curiosity: how does Facebook do this?!

Ok, just so you know, it's not only Facebook that does this. In fact, a lot of computer programs are able to interpret images, even if their purposes are a bit different from Facebook's. This is where we start talking about what we call Computer Vision!

In simple terms, Computer Vision is a field that helps computers interpret what's inside an image and make sense of it. To do this, we mostly rely on Artificial Intelligence techniques, and more specifically on what we call Convolutional Neural Networks (CNNs), which are neural networks specifically designed to extract information from images. Throughout the rest of this blog post, we will answer the following 3 questions:

  • What is an image for a computer?
  • How can a computer interpret images?
  • How can CNNs extract information from images in order to do this interpretation?

But … First things first!

What is an image for a computer in the first place?

It is true that computers can interpret images as we do, but computers don’t see images exactly as we do!

Ok, so what I mean is that the conclusions we reach when looking at an image are similar to the conclusions a computer will derive when looking at it, even though we are seeing quite different things! When we look at an image, we see colors, shapes, objects, etc., and based on these, we interpret it. So, if we look at the image of a person and want to tell whether that person is male or female, we look for specific cues like hair, a beard, etc. Although a computer might come up with the same conclusion, do you think it saw the same things in order to interpret the image? Do you think it saw the hair or the beard explicitly, as we did?
The short answer is NO, it didn't. At least, not explicitly. In fact, a computer sees the image as a matrix of pixels! A pixel is the smallest entity of an image and is characterized by values that represent the intensity of the colors at that point. The computer sees only those values! But for us, it is these values that make us see what we see. So, in simple terms, an image for a computer is just a bunch of numbers arranged in a matrix, and the computer interprets the image based on these numbers alone.
It's fascinating, isn't it? Let's see what these values are exactly and how they represent different colors in each kind of image.

1. Binary Images

In binary images, as the name suggests, we have 2 colors only: black and white. That's what we see. But for the computer, black is represented by a 0 and white by a 1.

2. Grayscale Images

Grayscale images are images that contain different shades of gray. As you know, life is not always black and white: some things can be in between, and grayscale images are a way to represent those in-between shades. In a grayscale image, the value of each pixel is between 0 and 255, where 0 represents black and 255 represents white. Values in between represent different shades of gray.

3. Color Images

Finally, we have the images that we mostly see in our everyday life: color images. But how can we represent their colors? Well, to understand the representation, you should know that any color we see is just a combination of 3 colors, Red, Green, and Blue, mixed with different intensities. So, to represent any color, we need a specific amount of Red, a specific amount of Green, and a specific amount of Blue. The amount of Red can be represented by a number between 0 and 255, where 0 means no Red at all and 255 means the maximum amount of Red. The same applies to Green and Blue. This is why, for a computer, each pixel of a color image is represented by 3 values, each between 0 and 255, as you can see in the figure. The first value represents the amount of Red, the second the amount of Green, and the third the amount of Blue.
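To make this concrete, here is a minimal Python sketch (using NumPy and Pillow, and assuming a hypothetical local file called trip_photo.jpg) that shows what the computer actually "sees" for the three kinds of images described above: nothing but arrays of numbers.

```python
import numpy as np
from PIL import Image

# NOTE: "trip_photo.jpg" is a made-up file name; replace it with any image you have.
color = np.array(Image.open("trip_photo.jpg"))              # shape (height, width, 3), values 0-255
gray = np.array(Image.open("trip_photo.jpg").convert("L"))  # shape (height, width), values 0-255
binary = (gray > 127).astype(np.uint8)                      # shape (height, width), values 0 or 1

print(color.shape, gray.shape, binary.shape)
print(color[0, 0])   # the Red, Green, and Blue values of the top-left pixel
```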

Now that you know that an image for a computer is just a bunch of numbers, the question you are probably asking yourself is …

What can a computer do with these numbers?

As discussed previously, Computer Vision is a field that focuses on interpreting images and extracting information from them. Now let's discuss the tasks we most often tackle with Computer Vision.

1. Image Classification

The goal of image classification is to assign an image to a particular class. Image classification is useful for many applications. One of them is medical diagnosis, where we classify whether a patient has a particular disease based on one of their medical images. Chest X-rays, for instance, can help us diagnose diseases like pneumonia, COVID-19, and many others.

2. Object Detection

The goal of object detection is not only to categorize an image but also to locate objects inside that image. So what we need here is to detect objects within the image and know their exact location. This location is defined by what we call a bounding box, which is just a rectangle that surrounds the object we are looking for. Many applications use object detection techniques; the most popular are self-driving cars, which have to identify all the objects around them and their locations in order to ensure a smooth drive and avoid accidents.

3. Semantic Segmentation

The goal of semantic segmentation is also to find individual objects within an image. But instead of locating them with a bounding box, in semantic segmentation each pixel is classified according to the content it holds, and from that class we know which object each pixel belongs to. The image below clarifies this further.

Great outcomes, but how can they be achieved?

Excellent question! And you have every right to ask it! To simplify things, let's say we want to build a model that can classify images.

In order for an image to be classified, it has to pass through a kind of tunnel. This tunnel is mainly composed of 2 parts: the first part is supposed to extract features from the image, and the second part is supposed to do the actual classification based on the extracted features. This is highlighted in the figure below.
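To see that this two-part structure is not just a metaphor, here is a small sketch (assuming PyTorch and a recent torchvision are installed; VGG16 is just one classic example of such a model) showing how an existing CNN is literally split into a feature extractor and a classifier.

```python
import torch
from torchvision import models

vgg = models.vgg16(weights=None)         # a classic CNN with random weights (no download needed)
image = torch.randn(1, 3, 224, 224)      # a fake RGB image: batch of 1, 3 channels, 224x224 pixels

features = vgg.features(image)           # part 1: extract features (convolution + pooling layers)
pooled = vgg.avgpool(features)           # shrink the feature maps to a fixed size
flattened = torch.flatten(pooled, 1)     # turn the feature maps into one long vector
scores = vgg.classifier(flattened)       # part 2: classify based on the extracted features

print(features.shape)   # torch.Size([1, 512, 7, 7])  -> the extracted features
print(scores.shape)     # torch.Size([1, 1000])       -> one score per class
```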

Now how can we do this automatic feature extraction from images? This is where we use what we call “Convolutional Neural Networks”, or CNNs for short. CNNs are very useful when dealing with images. Just like normal neural networks, a CNN is made of multiple layers. The difference is that different kinds of layers can be involved in a CNN, mainly convolutional layers and pooling layers.
In simple terms, a convolutional layer extracts information, and a pooling layer compresses this information. Let's now look at how each of these layers does its job.

1. Before understanding convolutional layers, let’s understand convolutions!

In simple terms, convolution is a mathematical operation that helps extract specific patterns from an input image, ranging from simple lines or edges to complex things like eyes and faces. In order to do the convolution, we need what we call a “kernel”, which specifies, in one way or another, the kind of information we want to extract from the image. This kernel slides over the image (the input feature map), as you can see in the image below, and while sliding, it extracts information and saves it to the output feature map.

Take a minute to look at the above image carefully before continuing!
Just to give you some intuition, the process that happens here is purely mathematical: you take dot products between small matrices, and depending on the kernel you choose, you obtain a different output. If you look at the image below, you will get a better understanding of how things are done mathematically. The kernel used in this example is a 3 by 3 matrix: kernel = [0 1 2; 2 2 0; 0 1 2].
When the kernel is at a particular position, you take the dot product between the kernel values and the image values under the kernel (element-wise multiplication followed by a sum) and put the result in the output feature map. You keep sliding the kernel and repeating this operation until it has passed over all the pixels in the image.
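Here is a minimal NumPy sketch of this sliding operation (the 5 by 5 image is made up; the kernel is the one from the example above). Note that what CNNs call "convolution" is, strictly speaking, cross-correlation: the kernel is not flipped.

```python
import numpy as np

kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])
image = np.random.randint(0, 256, size=(5, 5))          # a fake 5x5 grayscale image

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2   # the output shrinks by kernel_size - 1
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]          # the image values currently under the kernel
        output[i, j] = np.sum(patch * kernel)    # element-wise product, then sum

print(output)   # the output feature map (3x3 here)
```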

Now, imagine that you change the values of the kernel, and you pass it over the same image. What would happen to the output?! It would definitely change because you would be extracting a different kind of information.
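If you want to see this for yourself, here is a tiny follow-up sketch (using SciPy's correlate2d, which performs the same sliding operation as the loop above) that applies two different, commonly used edge-detection kernels to the same made-up image and produces two different outputs.

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.randint(0, 256, size=(5, 5))   # another fake 5x5 grayscale image

vertical_edges = np.array([[-1, 0, 1],
                           [-1, 0, 1],
                           [-1, 0, 1]])          # responds strongly to vertical edges
horizontal_edges = vertical_edges.T              # responds strongly to horizontal edges

print(correlate2d(image, vertical_edges, mode="valid"))
print(correlate2d(image, horizontal_edges, mode="valid"))
```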

2. Going from a simple convolution to a convolutional layer

A convolutional layer is the essence of a CNN. A convolutional layer applies multiple convolutions to the input it receives. Each of these convolutions extracts particular information according to its kernel values. But how do we know what these values are? Well, these values are LEARNED by the network. In fact, during the training process, the network learns the optimal kernel values, the ones that give us accurate predictions. In other words, the network learns which features it should extract from the input image to base its predictions on.
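As a small illustration (a minimal PyTorch sketch with made-up sizes), here is a single convolutional layer with 8 learnable 3 by 3 kernels; the kernel values live in conv.weight and are exactly what training adjusts.

```python
import torch
import torch.nn as nn

# One convolutional layer: 1 input channel (a grayscale image), 8 kernels of size 3x3.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
print(conv.weight.shape)            # torch.Size([8, 1, 3, 3]) -> 8 learnable 3x3 kernels

image = torch.randn(1, 1, 28, 28)   # a fake 28x28 grayscale image (batch of 1)
feature_maps = conv(image)
print(feature_maps.shape)           # torch.Size([1, 8, 26, 26]) -> one feature map per kernel
```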

3. Now we’ve got information … Let’s compress it with a pooling layer

A pooling layer is a layer that compresses the information we get from our convolutional layer. It is like a filter applied to specific regions of the output feature map. This filter has a specific size; let's say the size is 2 by 2. In this case, you divide your feature map into blocks of that size (2×2 in our case), and then you apply the pooling operation to each of these blocks. The pooling operation is also mathematical and comes in 2 kinds: Max Pooling and Average Pooling. With Max Pooling, we take the maximum value of each block, and with Average Pooling, we take the average value of each block. The image below illustrates the process with a simple example.
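Here is a tiny PyTorch sketch of both kinds of pooling applied to a made-up 4 by 4 feature map, using 2 by 2 blocks.

```python
import torch
import torch.nn as nn

feature_map = torch.tensor([[[[1., 3., 2., 4.],
                              [5., 6., 1., 2.],
                              [7., 2., 8., 1.],
                              [3., 4., 2., 6.]]]])   # shape (1, 1, 4, 4): a made-up feature map

max_pool = nn.MaxPool2d(kernel_size=2)   # keep the maximum value of each 2x2 block
avg_pool = nn.AvgPool2d(kernel_size=2)   # keep the average value of each 2x2 block

print(max_pool(feature_map))   # [[6., 4.], [7., 8.]]
print(avg_pool(feature_map))   # [[3.75, 2.25], [4.00, 4.25]]
```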

4. Putting Everything together … CNN

Now that you know the building blocks of a CNN, let's see what a typical CNN looks like. A typical CNN's goal is to extract as many relevant features as possible from an image. For this reason, in a CNN, we alternate between convolutional layers and pooling layers. A convolutional layer extracts information from the image, and a pooling layer compresses this information. Then, the compressed information is fed to another convolutional layer that extracts more information, its output passes to another pooling layer for compression, and so on. We keep doing this until we reach a compact set of features of a size we choose. These features characterize the image, so we can feed them to a classifier in order to perform image classification.
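To tie it all together, here is a minimal PyTorch sketch of such an alternation (all sizes are made up): two rounds of convolution plus pooling to extract and compress features, followed by a small classifier.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # extract information
    nn.ReLU(),
    nn.MaxPool2d(2),                              # compress it
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # extract more information
    nn.ReLU(),
    nn.MaxPool2d(2),                              # compress it again
    nn.Flatten(),                                 # turn the final feature maps into a vector
    nn.Linear(32 * 8 * 8, 10),                    # the classifier: 10 classes
)

image = torch.randn(1, 3, 32, 32)   # a fake 32x32 RGB image
scores = model(image)
print(scores.shape)                 # torch.Size([1, 10]) -> one score per class
```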

Et Voilà!

As you have seen, Convolutional Neural Networks are essential for extracting features from images. In this blog post, you have learned what Computer Vision is, what tasks it typically tackles, and how Convolutional Neural Networks help achieve these tasks by extracting features from images.

Want to know how to write code that uses CNNs to perform image classification?

Stay tuned for the next blog post!