WEBVTT
Kind: captions
Language: en

00:00:14.440 --> 00:00:18.500
The Convolutional Neural Network (CNN) is a fundamental neural network architecture,

00:00:18.900 --> 00:00:22.340
and it has become more and more important in modern deep learning.

00:00:22.840 --> 00:00:27.160
In this class I am going to talk about several important CNN architectures

00:00:27.160 --> 00:00:29.760
and current developments in this field.

00:00:30.300 --> 00:00:35.120
The first CNN was proposed by Yann LeCun back in the 1990s.

00:00:35.540 --> 00:00:38.400
The neural network shown here is LeNet-5,

00:00:38.800 --> 00:00:41.660
which has 2 convolutional layers, 2 subsampling layers,

00:00:41.880 --> 00:00:44.820
2 fully-connected layers and 1 output layer.
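To make the layer stack concrete, here is a small sketch that traces the feature-map sizes through LeNet-5's convolution and subsampling stages. The 5×5 kernels, 2×2 subsampling windows, and 32×32 input follow LeCun's original description; treat the helper functions as illustrative, not as anyone's official implementation.

```python
# Sketch: tracing feature-map sizes through LeNet-5's conv/subsampling stages.
# Kernel and window sizes follow the original LeNet-5 description
# (5x5 "valid" convolutions, non-overlapping 2x2 subsampling).

def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Spatial output size of non-overlapping subsampling."""
    return size // window

size = 32                     # LeNet-5 takes 32x32 inputs
size = conv_out(size, 5)      # C1: 28x28 feature maps
size = pool_out(size)         # S2: 14x14
size = conv_out(size, 5)      # C3: 10x10
size = pool_out(size)         # S4: 5x5
print(size)                   # 5 -> flattened and fed to the FC layers
```

The 5×5 maps coming out of the last subsampling stage are what the fully-connected layers consume.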

00:00:45.140 --> 00:00:49.220
Later CNN architectures are largely based on this design,

00:00:49.220 --> 00:00:51.620
with different subsampling strategies,

00:00:51.960 --> 00:00:55.180
activation functions, or neuron connections.

00:00:55.820 --> 00:00:58.400
LeCun created a handwritten digit

00:00:58.640 --> 00:01:01.180
dataset as a benchmark to evaluate

00:01:01.180 --> 00:01:03.940
the performance of his Convolutional Neural Network.

00:01:04.220 --> 00:01:06.360
The benchmark is called MNIST.

00:01:06.900 --> 00:01:10.360
MNIST has become the “Hello World” of deep learning now,

00:01:10.820 --> 00:01:18.220
almost all deep learning tutorials use MNIST as the first example to show how to build and train a CNN.

00:01:18.820 --> 00:01:23.780
Let’s see an example of recognizing digits using a CNN.

00:01:24.500 --> 00:01:28.400
Here is a figure from François Chollet’s book

00:01:28.720 --> 00:01:30.220
“Deep Learning with Python.”

00:01:30.800 --> 00:01:33.760
Mr. Chollet is the author of Keras.

00:01:34.300 --> 00:01:38.200
This example shows the filter responses to an input image.

00:01:38.480 --> 00:01:40.100
In a CNN,

00:01:40.480 --> 00:01:47.000
the filters of the first layer will capture the basic shapes of digits, like lines or corners.

00:01:47.420 --> 00:01:53.320
The filters of the second layer will capture more complicated and abstract features.

00:01:53.800 --> 00:01:54.620
In short,

00:01:54.860 --> 00:01:59.440
the filters of the first several layers extract basic image structures,

00:01:59.960 --> 00:02:04.060
while the filters of deeper layers capture high-level features.
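The idea that an early-layer filter "responds" to a basic shape can be sketched numerically: slide a hand-rolled 3×3 vertical-edge kernel over a tiny image and watch where the response is nonzero. Real CNN filters are learned from data, not hand-designed, so the kernel and image here are purely illustrative assumptions.

```python
import numpy as np

# Toy sketch of a first-layer filter response: a hand-designed 3x3
# vertical-edge detector slid over a tiny 5x5 image. Learned CNN
# filters behave similarly for simple edges, but are trained, not fixed.

def filter_response(image, kernel):
    """Valid (unpadded) cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Image with a vertical edge: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to vertical edges

resp = filter_response(image, edge_kernel)
print(resp)  # nonzero only in the columns whose window straddles the edge
```

A horizontal-edge kernel (the transpose) would stay silent on this image, which is exactly the selectivity the lecture describes.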

00:02:04.580 --> 00:02:07.380
Before talking about more CNN architectures,

00:02:07.880 --> 00:02:11.180
let me first introduce the ImageNet dataset,

00:02:11.600 --> 00:02:15.280
because ImageNet played an important role in CNN history.

00:02:16.120 --> 00:02:22.200
ImageNet is a large-scale image database created by Prof. Fei-Fei Li and her group.

00:02:22.780 --> 00:02:30.400
There are 20,000 categories and 14 million hand-annotated images in this database.

00:02:30.940 --> 00:02:33.980
Her group selected 1,000 categories from ImageNet

00:02:33.980 --> 00:02:42.260
and held the first ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2010.

00:02:42.740 --> 00:02:47.540
Participating teams evaluated their algorithms on the given dataset,

00:02:48.060 --> 00:02:52.760
and competed to achieve higher accuracy on several visual recognition tasks.

00:02:53.640 --> 00:02:57.340
ILSVRC greatly stimulated advances in visual recognition technology.

00:02:57.340 --> 00:02:59.160
In 2012,

00:02:59.720 --> 00:03:04.520
Geoffrey Hinton's students entered the challenge with a CNN,

00:03:05.420 --> 00:03:07.980
which is known as AlexNet now.

00:03:08.760 --> 00:03:11.940
AlexNet won the challenge by a large margin,

00:03:12.160 --> 00:03:14.160
and spurred the deep learning boom.

00:03:14.820 --> 00:03:19.560
Here are the error rates of the ILSVRC winners over the years.

00:03:20.320 --> 00:03:22.740
The Y-axis shows the error rate,

00:03:22.900 --> 00:03:24.360
so the lower the better.

00:03:25.000 --> 00:03:31.040
The winners of 2010 and 2011 used algorithms based on Support Vector Machines (SVMs),

00:03:31.720 --> 00:03:36.700
which are considered shallow in contrast to deep learning.

00:03:37.340 --> 00:03:41.380
We can see that AlexNet's error rate was around 38%

00:03:41.940 --> 00:03:45.720
lower than that of the previous shallow algorithms.

00:03:46.300 --> 00:03:47.640
In 2015,

00:03:48.760 --> 00:03:53.200
ResNet, proposed by Microsoft Research Asia, exceeded human-level accuracy.

00:03:53.980 --> 00:03:56.360
Here is the architecture of AlexNet,

00:03:56.720 --> 00:04:01.740
which consists of 5 convolutional layers and 3 fully-connected layers.

00:04:02.780 --> 00:04:09.520
AlexNet has 60 million parameters and 650,000 neurons.

00:04:10.360 --> 00:04:15.580
The team proposed a simple but effective method called “dropout” to prevent overfitting.

00:04:16.380 --> 00:04:23.700
Dropout forces a network to generalize better by randomly dropping neurons in specific layers during training.
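The random-dropping idea can be sketched in a few lines of NumPy. Note this shows the modern "inverted" variant, which rescales survivors at training time; AlexNet's original formulation instead scaled activations at test time. The rate and array shapes are illustrative, and frameworks like Keras provide this as a built-in layer.

```python
import numpy as np

# Minimal sketch of inverted dropout at training time (illustrative only):
# zero out a random fraction `rate` of activations, then rescale the
# survivors so the expected activation value is unchanged.

def dropout(x, rate, rng):
    """Apply inverted dropout with drop probability `rate`."""
    keep = rng.random(x.shape) >= rate   # Boolean mask of surviving units
    return x * keep / (1.0 - rate)       # rescale so E[output] == x

rng = np.random.default_rng(0)
activations = np.ones((4, 4))
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or 2.0
```

At test time nothing is dropped, so inference uses the full network with no extra scaling.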

00:04:23.900 --> 00:04:30.000
The model was trained on two GPUs because GPU memory at that time was not large enough

00:04:30.180 --> 00:04:33.700
to keep all the model parameters in memory.

