Convolutional Neural Networks for Object Classification

In this post we will look at some Convolutional Neural Networks (CNNs) that address the general task of classifying objects in an image. The architectures are presented roughly in chronological order, so scroll down to see the most recent ones.

Traditional CNNs (1998)

LeNet5, developed by Yann LeCun (Facebook FAIR), was a pioneering CNN that accelerated research on deep learning. It was used to recognize zip codes and digits. The architecture has inspired later CNNs, but it differs fundamentally from modern ones because GPUs were not available for training at the time and CPUs were much slower. For that reason, only a brief summary of the architecture is presented here.

LeNet5's architecture is shown in the figure below. It alternates convolution and pooling layers and ends with fully connected layers that act as the final classifier. The non-linear activation function was the hyperbolic tangent or the sigmoid, and average pooling was used as the pooling function. The connections between layers were sparse, which reduced the number of parameters and the computational cost. Mean squared error was used as the loss function.
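To make the alternating convolution and pooling pattern concrete, here is a minimal sketch that traces feature-map sizes through a LeNet5-style stack. The specific sizes (a 32×32 input, 5×5 convolutions with 6 and 16 feature maps, 2×2 pooling) are assumptions taken from the commonly cited LeNet5 configuration, not from this post, and the helper names are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_shapes(input_size=32):
    """Trace (name, channels, height/width) through a LeNet5-style stack."""
    shapes = [("input", 1, input_size)]
    size = input_size
    # (name, kernel, stride, output channels); assumed configuration.
    # Pooling layers keep the channel count of the preceding convolution.
    layers = [
        ("conv1", 5, 1, 6),
        ("avgpool1", 2, 2, 6),
        ("conv2", 5, 1, 16),
        ("avgpool2", 2, 2, 16),
    ]
    for name, kernel, stride, channels in layers:
        size = conv_out(size, kernel, stride)
        shapes.append((name, channels, size))
    return shapes


for name, channels, size in lenet5_shapes():
    print(f"{name:9s} {channels:3d} x {size} x {size}")

# Number of values handed to the fully connected classifier.
_, last_channels, last_size = lenet5_shapes()[-1]
print("flattened features:", last_channels * last_size * last_size)  # 400
```

Each convolution shrinks the map by `kernel - 1` pixels and each pooling step halves it, so the fully connected layers only have to deal with a few hundred values.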

Ref: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, Nov. 1998.

Deep CNNs

AlexNet (2012) competed in ILSVRC 2012 and won by a significant margin (84% top-5 accuracy versus 74% for the runner-up). It is based on LeNet, but is much deeper and wider, with a total of 60 million parameters. Because of the number of parameters, the implementation was split across multiple GPUs to satisfy the memory requirements. Its contributions to the field were:

  • the use of the rectified linear unit (ReLU) as the activation function
  • a method of stacking multiple convolutional layers before a pooling layer
  • the use of max pooling, to avoid the problems with average pooling
  • the use of GPUs to reduce training time
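The first and third points can be illustrated with a small sketch in plain Python. The feature map values are invented, and the window is a non-overlapping 2×2 one chosen for simplicity, so this is not a reproduction of AlexNet's actual pooling configuration.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return max(0.0, x)


def average(values):
    return sum(values) / len(values)


def pool2x2(feature_map, reduce_fn):
    """Apply a non-overlapping 2x2 pooling window to a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical pre-activation feature map with one strong response (8.0).
fmap = [
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 8.0, 3.0, 4.0],
    [-1.0, -2.0, 0.0, 0.0],
    [-3.0, -4.0, 0.0, 0.0],
]

activated = [[relu(v) for v in row] for row in fmap]
max_pooled = pool2x2(activated, max)
avg_pooled = pool2x2(activated, average)

print(max_pooled)  # [[8.0, 4.0], [0.0, 0.0]]
print(avg_pooled)  # [[2.0, 2.5], [0.0, 0.0]]
```

Max pooling keeps the strong response of 8.0 intact, while average pooling dilutes it to 2.0 because the other three values in the window are zero.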

Ref: A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.

Very deep CNNs

GoogLeNet, also called Inception, won ILSVRC 2014. Inspired by VGGNet, it used much smaller 3-by-3 filters in each convolutional layer instead of 5-by-5 or 7-by-7 filters. This idea makes it possible to increase the number of layers and build deeper networks. Its own contribution was to dramatically reduce the number of parameters used. This was achieved by placing cheap 1-by-1 filters, referred to as a bottleneck, before the expensive parallel blocks. To further reduce the number of parameters, global average pooling was used ahead of the classifier instead of large fully connected layers. Development of Inception has continued, and the most recent version is InceptionV4, most notably inspired by ResNet, which is described in the Residual CNNs section below.

A module in InceptionV1. The yellow 1×1 convolution blocks reduce the depth (number of channels) of the input before it is fed into the more expensive 3×3 and 5×5 convolutions.
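The saving from a 1×1 bottleneck is plain arithmetic on weight counts, sketched below. The channel counts (256 in, 64 after the bottleneck, 128 out) are hypothetical values picked for illustration. The 7×7×1024 feature map and 1000 classes in the second part are also assumptions, used to show why averaging before the classifier is cheaper than flattening.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a kernel x kernel convolution, ignoring biases."""
    return in_channels * out_channels * kernel * kernel


# Hypothetical channel counts, not taken from the GoogLeNet paper.
in_ch, mid_ch, out_ch = 256, 64, 128

# A 5x5 convolution applied directly to the full-depth input.
direct = conv_params(in_ch, out_ch, 5)

# A 1x1 bottleneck first shrinks the depth, then the 5x5 runs on fewer channels.
bottleneck = conv_params(in_ch, mid_ch, 1) + conv_params(mid_ch, out_ch, 5)

print("direct 5x5:    ", direct)      # 819200
print("with 1x1 first:", bottleneck)  # 221184
print("reduction: %.1fx" % (direct / bottleneck))

# Classifier head on an assumed 7x7x1024 feature map with 1000 classes.
height, width, channels, classes = 7, 7, 1024, 1000
fc_params = height * width * channels * classes  # flatten, then fully connected
gap_params = channels * classes                  # average each channel first

print("flatten + FC:     ", fc_params)   # 50176000
print("avg pool + linear:", gap_params)  # 1024000
```

With these numbers the bottleneck cuts the 5×5 branch to roughly a quarter of its weights, and averaging each channel before the classifier removes a factor of 49 from the final layer.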

VGGNet (2014) – The runner-up at the ILSVRC 2014 competition, dubbed VGGNet by the community, was developed by Simonyan and Zisserman. Its best-known configuration consists of 16 weight layers (13 convolutional and 3 fully connected) and is very appealing because of its uniform architecture. It follows the same overall pattern as AlexNet, but uses only 3×3 convolutions with a large number of filters. It was trained on 4 GPUs for 2–3 weeks. It is currently one of the most preferred choices in the community for extracting features from images. The weight configuration of VGGNet is publicly available and has been used as a baseline feature extractor in many other applications and challenges. However, VGGNet has 138 million parameters, which makes it demanding in terms of memory and compute.
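One way to see why small 3×3 filters are attractive is to compare a stack of them with a single larger filter that covers the same receptive field. The sketch below assumes stride-1 convolutions with the same number of channels in and out (64, a hypothetical value) and ignores biases.

```python
def stacked_params(channels, kernel, depth):
    """Weights in `depth` stacked kernel x kernel convolutions (no biases)."""
    return depth * channels * channels * kernel * kernel


def receptive_field(kernel, depth):
    """Receptive field of `depth` stacked stride-1 convolutions."""
    return 1 + depth * (kernel - 1)


channels = 64  # hypothetical channel count

# Two 3x3 layers see the same 5x5 region as one 5x5 layer.
print(receptive_field(3, 2), receptive_field(5, 1))  # 5 5
print(stacked_params(channels, 3, 2))                # 73728
print(stacked_params(channels, 5, 1))                # 102400

# Three 3x3 layers see the same 7x7 region as one 7x7 layer.
print(receptive_field(3, 3), receptive_field(7, 1))  # 7 7
print(stacked_params(channels, 3, 3))                # 110592
print(stacked_params(channels, 7, 1))                # 200704
```

The stacked version needs fewer weights for the same field of view, and it applies a non-linearity after every layer rather than only once.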

Residual CNNs

ResNet won ILSVRC 2015 and represents a revolutionary new way of building CNNs, called residual CNNs. It consists of 152 layers, an extreme increase compared to previous architectures. This was achieved by using skip connections. A skip connection lets the input signal bypass a number of layers and be added to their output. With this technique, CNNs with over 1000 layers are trainable. Why this works so well is still being researched. Empirical research has shown that ResNet operates with blocks of only moderate depth, about 20 to 30 layers, acting in parallel, rather than as a serial flow through the entire network. Thus, its behaviour can be compared to that of an ensemble of relatively shallow networks. In addition, when the output is fed back recursively, as in RNNs, it can be seen as a biologically plausible model of the ventral stream in the visual cortex.
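A skip connection can be sketched in a few lines of plain Python. This is an illustration of the idea `output = x + F(x)`, not ResNet's actual block: the transforms here are toy functions standing in for the convolutional layers, and all names are hypothetical.

```python
def residual_block(x, transform):
    """Return x + F(x): the skip connection adds the input back to F's output."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_transform(x):
    """Toy stand-in for layers that have learned to output nothing."""
    return [0.0 for _ in x]


def small_transform(x):
    """Toy stand-in for layers that learn a small correction to the input."""
    return [0.1 * xi for xi in x]


signal = [1.0, -2.0, 0.5]

# Even through 1000 stacked blocks, the input survives when F contributes zero,
# because every block reduces to the identity.
out = signal
for _ in range(1000):
    out = residual_block(out, zero_transform)
print(out == signal)  # True

# With a non-zero F, each block only has to model a residual on top of x.
corrected = residual_block(signal, small_transform)
print(corrected)
```

Because each block defaults to passing its input through, adding more blocks does not force the signal through an ever longer chain of transformations, which is one intuition for why very deep residual networks remain trainable.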

As of June 2017, residual CNNs remain the state of the art in general object classification, as no revolutionary architectures were submitted to ILSVRC 2016.


Single-crop top-1 validation accuracies for top-scoring single-model architectures. The chart shows the different architectures and their corresponding authors.


Top-1 accuracy vs. operations, size ∝ parameters. Top-1 one-crop accuracy versus the number of operations required for a single forward pass. The size of the blobs is proportional to the number of network parameters; a legend is shown in the bottom right corner, spanning from 5×10^6 to 155×10^6 parameters. Both of these figures share the same y-axis, and the grey dots highlight the centres of the blobs.



ENet has the highest accuracy per parameter of the networks compared.