WEBVTT
Kind: captions
Language: en

00:00:14.540 --> 00:00:16.040
Deep learning basics

00:00:16.540 --> 00:00:18.880
Today we hear about deep learning everywhere.

00:00:19.200 --> 00:00:21.600
Now let’s see what deep learning is.

00:00:22.400 --> 00:00:25.160
Deep learning is based on artificial neural networks,

00:00:25.620 --> 00:00:28.880
which are inspired by biological neural networks.

00:00:29.400 --> 00:00:31.320
Here is a picture of a neuron.

00:00:31.640 --> 00:00:35.160
A neuron is a cell that carries electrical impulses.

00:00:35.600 --> 00:00:38.040
Each neuron consists of three parts:

00:00:38.560 --> 00:00:39.600
a cell body,

00:00:39.780 --> 00:00:40.740
dendrites

00:00:40.740 --> 00:00:42.420
and a single axon.

00:00:43.040 --> 00:00:50.340
Dendrites are the branches of neurons that receive signals from other neurons and pass the signals into the cell body.

00:00:50.720 --> 00:00:56.560
The axon can be over a meter long in humans and passes electrical signals to other neurons.

00:00:56.880 --> 00:01:01.760
If the signals received are strong enough and reach an action potential,

00:01:01.880 --> 00:01:03.840
the neuron will be activated.

00:01:04.080 --> 00:01:06.280
Inspired by the biological neuron,

00:01:06.480 --> 00:01:07.720
Frank Rosenblatt

00:01:08.220 --> 00:01:13.740
developed the first prototype of an artificial neuron, called the perceptron, in 1957.

00:01:14.620 --> 00:01:20.600
The perceptron uses a weighted sum to represent the dendrites and a threshold to model the action potential.

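The weighted-sum-and-threshold idea can be sketched in a few lines of Python (a minimal illustration with hand-picked weights, not Rosenblatt’s original hardware):

```python
# A minimal perceptron: weighted sum of inputs compared to a threshold.
# Weights and threshold here are chosen by hand for illustration.
def perceptron(inputs, weights, threshold):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0  # "fires" or not

# With these hand-picked weights the perceptron computes logical AND:
print(perceptron([1, 1], [0.5, 0.5], 1.0))  # -> 1
print(perceptron([1, 0], [0.5, 0.5], 1.0))  # -> 0
```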
00:01:21.180 --> 00:01:23.580
Since there were no suitable computers in those days,

00:01:23.940 --> 00:01:28.660
Dr. Rosenblatt designed a hardware device to implement the function.

00:01:29.220 --> 00:01:33.980
Although the idea of the perceptron is very similar to the neurons in today’s deep learning,

00:01:34.620 --> 00:01:39.220
Rosenblatt didn’t develop a mechanism to train multi-layer neural networks.

00:01:39.660 --> 00:01:42.700
In 1969, Marvin Minsky,

00:01:43.020 --> 00:01:45.380
founder of the MIT AI Lab,

00:01:45.640 --> 00:01:48.360
published a book called Perceptrons

00:01:48.760 --> 00:01:50.240
and concluded that

00:01:50.540 --> 00:01:52.300
neural networks are dead.

00:01:52.980 --> 00:01:58.520
He argued that the perceptron cannot learn the simple Boolean function XOR,

00:01:58.980 --> 00:02:02.040
because XOR is not linearly separable.

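As a rough illustration of this point (the function and variable names here are my own), a brute-force search over a grid of weights and thresholds finds no single-layer perceptron that reproduces XOR:

```python
# Brute-force check: no single weighted sum plus threshold can
# reproduce XOR, because XOR is not linearly separable.
import itertools

def predicts_xor(w1, w2, threshold):
    xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return all((x1 * w1 + x2 * w2 >= threshold) == bool(y)
               for (x1, x2), y in xor.items())

grid = [i / 4 for i in range(-8, 9)]  # weights/thresholds in [-2, 2]
found = any(predicts_xor(w1, w2, t)
            for w1, w2, t in itertools.product(grid, repeat=3))
print(found)  # -> False: no setting in the grid solves XOR
```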
00:02:02.680 --> 00:02:06.200
This publication caused the first AI winter,

00:02:06.400 --> 00:02:11.500
and neural network research was widely rejected by major machine learning conferences.

00:02:12.420 --> 00:02:16.500
The winter for neural networks continued for more than a decade.

00:02:17.240 --> 00:02:20.840
The hero who came to the rescue was Geoffrey Hinton,

00:02:21.240 --> 00:02:27.340
who showed that XOR can be learned by multi-layer perceptrons trained with backpropagation.

00:02:28.020 --> 00:02:32.020
Although the idea had been conceived by other researchers before,

00:02:32.440 --> 00:02:37.200
it is Hinton’s paper that clearly addressed the problems posed by Minsky.

00:02:38.240 --> 00:02:39.940
How does backpropagation work?

00:02:40.420 --> 00:02:43.700
Let me first introduce how a neural network works.

00:02:44.220 --> 00:02:45.220
As we know,

00:02:45.360 --> 00:02:48.720
there are two stages in machine learning algorithms:

00:02:49.220 --> 00:02:51.000
training and inference.

00:02:51.520 --> 00:02:52.700
For inference,

00:02:52.860 --> 00:02:57.820
we make predictions by computing the outputs layer by layer, starting from the input layer.

00:02:58.400 --> 00:03:01.240
This process is called the forward pass.

00:03:02.280 --> 00:03:07.220
The predicted output is compared with the label to calculate the error.

00:03:07.880 --> 00:03:12.620
Then we propagate the error back to the neurons and adjust the weights.

00:03:13.220 --> 00:03:16.000
This process is called the backward pass.

00:03:16.740 --> 00:03:21.040
So what is the magical math formula used for backpropagation?

00:03:21.620 --> 00:03:24.500
It turns out to be age-old calculus:

00:03:24.780 --> 00:03:25.920
the chain rule.

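A minimal sketch of the chain rule at work, for a single sigmoid neuron with squared-error loss (the function names are my own); the hand-derived gradient is checked against a numerical one:

```python
import math

# One sigmoid neuron with squared-error loss:
#   z = w*x + b,  a = sigmoid(z),  loss = (a - y)^2
# Backpropagation applies the chain rule factor by factor:
#   dloss/dw = dloss/da * da/dz * dz/dw
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    dloss_da = 2.0 * (a - y)  # derivative of (a - y)^2
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    dz_dw = x                 # derivative of w*x + b w.r.t. w
    return dloss_da * da_dz * dz_dw

# Check against a numerical gradient (central finite differences):
def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

w, b, x, y = 0.5, -0.2, 1.5, 1.0
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(abs(grad_w(w, b, x, y) - numeric) < 1e-6)  # -> True
```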
00:03:26.660 --> 00:03:27.640
Of course,

00:03:27.840 --> 00:03:32.840
running backpropagation on modern neural networks requires more machinery,

00:03:32.900 --> 00:03:35.460
like building computational graphs.

00:03:35.980 --> 00:03:36.940
Fortunately,

00:03:37.020 --> 00:03:40.220
open-source deep learning frameworks like Caffe,

00:03:40.360 --> 00:03:41.240
TensorFlow,

00:03:41.380 --> 00:03:42.200
PyTorch,

00:03:42.240 --> 00:03:43.180
CNTK

00:03:43.460 --> 00:03:45.140
will do the work for us.

00:03:45.640 --> 00:03:48.040
We don’t need to worry about the details.

00:03:48.820 --> 00:03:52.140
Gradient descent is the most widely used learning algorithm.

00:03:52.740 --> 00:03:54.900
It is a first-order iterative

00:03:55.080 --> 00:03:58.400
optimization algorithm for minimizing a function.

00:03:59.120 --> 00:04:01.780
To find a local minimum of the loss function,

00:04:02.320 --> 00:04:07.480
we can take steps proportional to the negative of the gradient at the current point.

00:04:08.080 --> 00:04:12.960
The procedure is similar to finding the deepest point in a valley.

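The walk down the valley can be sketched as a simple loop (learning rate, starting point, and the toy loss are arbitrary choices for illustration):

```python
# Gradient descent on a simple convex loss f(x) = (x - 3)^2,
# whose minimum is at x = 3.
def gradient(x):
    return 2.0 * (x - 3.0)  # derivative of (x - 3)^2

x = 0.0   # starting point
lr = 0.1  # step size (learning rate)
for _ in range(100):
    x -= lr * gradient(x)  # step along the negative gradient

print(round(x, 4))  # -> 3.0 (converged to the minimum)
```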
00:04:14.000 --> 00:04:14.960
On the other hand,

00:04:14.960 --> 00:04:21.520
the algorithm to find a local maximum of that function using positive gradient is called gradient ascent.

00:04:22.120 --> 00:04:23.740
In this example,

00:04:23.980 --> 00:04:28.840
the cost function is simple and the global minimum can be easily found.

00:04:29.540 --> 00:04:31.980
This is also called a convex function.

00:04:32.400 --> 00:04:33.100
However,

00:04:33.260 --> 00:04:35.880
in complex high-dimension vector space,

00:04:36.420 --> 00:04:39.680
gradient descent is not guaranteed to find the global minimum.

00:04:40.100 --> 00:04:41.680
The good news is that,

00:04:41.940 --> 00:04:45.160
researchers found that there are many local minima

00:04:45.260 --> 00:04:47.960
that are almost as good as the global minimum,

00:04:48.500 --> 00:04:51.860
so we do not necessarily need to search for the global minimum.

