Computer vision is one of the hottest topics in artificial intelligence. Professor Fei-Fei Li defined it as “a subset of mainstream artificial intelligence that deals with the science of making computers or machines visually enabled, i.e., they can analyze and understand an image.”
Just as the human brain does, computers use the inputs they receive to identify patterns in images and make decisions based on that understanding. Nowadays, with the large amount of data available, algorithms are increasingly able to solve complex problems that were considered impossible just a few years ago.
From an engineering point of view, the purpose of computer vision is to perform tasks similar to those of the human visual system.
After reading this article, you will know what convolutional neural networks are and why they are well suited to computer vision problems. We will also go through a concrete example of an application.
Convolutional Neural Networks
The convolutional neural network, or CNN, is a neural network architecture well suited to two-dimensional image data, though it can also be applied to one-dimensional or three-dimensional data. It is composed of multiple convolutional layers, as shown in the figure below:
Let us now look at the role of a convolutional layer. As its name suggests, it performs a convolution, which is a specialized kind of linear operation. Its definition in deep learning may differ from the ones used in engineering or pure mathematics.
To illustrate the role of the convolution operation, let us consider a concrete example. Suppose we receive a continuous output from a device and want to use a weighted average to make the estimates more robust, with the most recent outputs carrying greater weight. We would use the following function s:
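The formula itself appears to be missing here; a plausible reconstruction, matching the standard convolution definition for this weighted-average setting (x is the device output, w the weighting function, and t time), is:

$$
s(t) = \int x(a)\, w(t - a)\, da = (x * w)(t)
$$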
In this specific example, w must be a valid probability density function for s to be a weighted average. However, the convolution operation is defined whenever the integral is defined. In the case of convolutional neural networks, x is a discrete variable and the input may be multi-dimensional.
In the case of a two-dimensional image, the convolution is defined as:
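The two-dimensional formula also appears to be missing; a plausible reconstruction, with I the input image and K the kernel, is the standard discrete 2D convolution:

$$
S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m,\, j - n)
$$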
In the context of machine learning, the learning algorithm learns the appropriate values of the kernel during training.
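A minimal sketch of this idea in PyTorch: a convolutional layer holds a learnable 3×3 kernel whose values the optimizer tunes. The channel counts and image size below are illustrative choices, not taken from the article.

```python
import torch
import torch.nn as nn

# A single convolutional layer with a learnable 3x3 kernel.
# padding=1 preserves the spatial size of the input.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

image = torch.rand(1, 1, 28, 28)  # (batch, channels, height, width)
output = conv(image)

print(output.shape)       # torch.Size([1, 1, 28, 28])
print(conv.weight.shape)  # torch.Size([1, 1, 3, 3]) -- the learnable kernel
```

During training, backpropagation updates `conv.weight`, so the kernel values themselves are what the network learns.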
We will work on a subset of the Caltech-UCSD Birds-200-2011 dataset. Here is a brief overview:
| Dataset | Size | Number of bird species |
| --- | --- | --- |
As you can see, the training set contains only a small number of images, which is challenging since neural networks usually contain millions of parameters to tune. The idea here is to perform transfer learning from a neural network pre-trained on the ImageNet database, where more than 14 million images have been hand-annotated to indicate which objects are pictured, and bounding boxes are provided for at least one million of them.
In fact, some images in this dataset overlap with images in ImageNet, so we need to be cautious when using pre-trained models to avoid cheating. This is dealt with in the preprocessing step (data augmentation) and the learning step (retraining of layers).
To deal with this overlap and to avoid overfitting, we apply data augmentation techniques to the training images.
- Resize: We resize all images to 224×224 in order to use the pre-trained neural networks available in PyTorch.
- Random horizontal flip: We perform a horizontal flip on the images with a probability of 0.75.
- Random vertical flip: We perform a vertical flip on the images with a probability of 0.5.
- Random affine transformation: We perform a random affine transformation using the following PyTorch parameters: degrees = [-30, 30], translate = [0.15, 0.15], scale = [0.5, 2].
- Normalize: We normalize the image tensor to zero mean and unit variance.
The best model is VGG19, which uses 3×3 convolution filters and 2×2 max-pooling layers. We added two dense layers, (25088, 2048) and (2048, 64), followed by a classifier layer.
In addition to performing data augmentation to avoid overfitting and cheating, we will freeze the first three blocks of layers and retrain the rest.
We chose to freeze the first four convolution layers for two reasons:
- Their purpose is to detect simple features such as edges; hence, we can keep the weights learned on ImageNet.
- Computation time: we only need to backpropagate the gradient through the remaining layers.
The best model according to the validation accuracy is the one obtained at the 64th epoch.