WEBVTT
Kind: captions
Language: en

00:00:14.560 --> 00:00:17.400
Now let’s talk about some limitations of deep learning.

00:00:19.860 --> 00:00:23.680
There are several limitations in deep learning models.

00:00:24.420 --> 00:00:25.480
First of all,

00:00:25.620 --> 00:00:28.900
the models are not scale- and rotation-invariant,

00:00:29.380 --> 00:00:34.840
and can easily misclassify images when the object poses are unusual.

00:00:35.460 --> 00:00:39.460
Here are some examples from the CVPR 2019 paper

00:00:39.720 --> 00:00:41.480
“Strike (With) a Pose:

00:00:41.820 --> 00:00:46.480
Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects.”

00:00:47.060 --> 00:00:49.460
Let’s take a look at those images.

00:00:50.180 --> 00:00:51.500
In the first row,

00:00:52.000 --> 00:00:55.420
the first image is correctly classified as a school bus.

00:00:55.880 --> 00:00:56.680
However,

00:00:56.920 --> 00:01:00.240
if we rotate the bus and show only its underside,

00:01:00.800 --> 00:01:04.400
it will be misclassified as a garbage truck,

00:01:05.000 --> 00:01:06.040
a punching bag,

00:01:06.360 --> 00:01:07.900
or a snowplow.

00:01:08.740 --> 00:01:09.640
Similarly,

00:01:10.000 --> 00:01:17.220
a motor scooter in strange poses may be misclassified as a parachute or a bobsled;

00:01:18.300 --> 00:01:22.200
a fire truck may be misclassified as a school bus or a fireboat.

00:01:22.920 --> 00:01:30.140
Although many methods have been proposed to address these issues,

00:01:30.580 --> 00:01:34.720
the errors show that the models lack knowledge of our real world.

00:01:35.660 --> 00:01:37.320
To add insult to injury,

00:01:37.700 --> 00:01:43.760
the models can be intentionally fooled using carefully crafted perturbations.

00:01:44.400 --> 00:01:47.240
These techniques are called adversarial attacks.

00:01:47.820 --> 00:01:51.200
Here is an example from Ian Goodfellow’s paper.

00:01:51.880 --> 00:01:54.380
By adding a small, intentionally crafted perturbation,

00:01:54.940 --> 00:01:57.420
which is imperceptible to humans,

00:01:58.280 --> 00:02:04.500
we can make CNN models misclassify a panda as a gibbon with high confidence!

00:02:05.240 --> 00:02:07.320
This phenomenon is very robust:

00:02:07.600 --> 00:02:12.780
even printed photos of the adversarial examples can still fool the models.

00:02:13.700 --> 00:02:20.580
Adversarial attacks raise serious security issues for deep-learning-based image recognition models.

00:02:21.540 --> 00:02:22.420
For example,

00:02:22.840 --> 00:02:30.920
a hacker could subtly alter a traffic sign to fool autonomous vehicles without being detected by the police.

