WEBVTT
00:00:04.580 --> 00:00:07.880
Do you care about the data;
do you care about the future?
00:00:08.179 --> 00:00:11.812
Do you want to really understand
what is happening behind the data
00:00:11.825 --> 00:00:14.934
and even make
good predictions for the future?
00:00:16.309 --> 00:00:18.831
This is the main task of statistical learning,
00:00:18.890 --> 00:00:24.143
which can be classified into two areas:
supervised and unsupervised learning.
00:00:25.343 --> 00:00:28.644
In this lecture
we will explain both of them.
00:00:29.843 --> 00:00:35.882
Statistical learning is
a major area of statistics, aiming to reveal
00:00:35.983 --> 00:00:40.789
hidden relations between the data instances
or variables that we are measuring.
00:00:41.926 --> 00:00:47.744
A classical example of statistical learning
is regression, especially linear regression.
00:00:48.344 --> 00:00:54.644
Suppose we have a dataset of songs
– let’s say audio clips or music notations –
00:00:55.092 --> 00:01:00.191
and we represent each song
with two variables describing its complexity:
00:01:00.643 --> 00:01:04.844
unigram and bigram entropy.
We omit details about how to measure them.
00:01:05.743 --> 00:01:12.643
A scatter plot reveals a hidden relation,
in fact a linear relation, between these two variables.
00:01:12.943 --> 00:01:17.744
Therefore a natural question arises: what is
the relation between these two variables?
00:01:17.956 --> 00:01:23.744
More precisely, we see there is a linear relation
between the dots on the scatter diagram
00:01:24.044 --> 00:01:28.243
so we want to compute
the best line representing this relation.
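Computing that best line is the classic least-squares problem. A minimal sketch follows, using only the Python standard library; the unigram/bigram entropy values are made-up toy numbers, not the lecture's actual song data.

```python
# Fitting the best line y = a*x + b by ordinary least squares.
# The entropy values below are hypothetical, for illustration only.

def fit_line(xs, ys):
    """Return slope a and intercept b minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

unigram = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical unigram entropies
bigram  = [1.1, 1.9, 3.2, 3.9, 5.1]   # hypothetical bigram entropies
a, b = fit_line(unigram, bigram)
print(f"best line: y = {a:.3f}*x + {b:.3f}")
```

The fitted line then predicts the bigram entropy of a new song from its unigram entropy alone.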
00:01:29.143 --> 00:01:32.461
Another example
of statistical learning is classification.
00:01:32.550 --> 00:01:37.561
Suppose that for the song dataset
introduced above we have another variable:
00:01:37.658 --> 00:01:42.053
song popularity,
whether the song is popular or not.
00:01:42.834 --> 00:01:45.889
We can again plot a scatter diagram
with two colours:
00:01:45.902 --> 00:01:50.722
blue dots for popular songs
and red dots for unpopular songs.
00:01:51.176 --> 00:01:53.887
The scatter diagram naturally
suggests the question:
00:01:53.900 --> 00:01:58.565
can we predict popularity
from the unigram and bigram complexity?
00:01:59.032 --> 00:02:05.439
More precisely, can we compute a line
that separates the red dots well from the blue dots?
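One classic way to learn such a separating line is the perceptron rule. The sketch below uses only standard-library Python; the points and labels are made-up stand-ins for (unigram entropy, bigram entropy, popular?) triples.

```python
# Learning a separating line with the classic perceptron rule.
# The data are hypothetical and linearly separable by construction.

def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Find (w1, w2, b) so that sign(w1*x + w2*y + b) matches labels (+1/-1)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        updated = False
        for (x, y), t in zip(points, labels):
            pred = 1 if w1 * x + w2 * y + b > 0 else -1
            if pred != t:              # misclassified: nudge the line toward the point
                w1 += lr * t * x
                w2 += lr * t * y
                b  += lr * t
                updated = True
        if not updated:                # every point classified correctly: done
            break
    return w1, w2, b

points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]         # -1 = unpopular, +1 = popular
w1, w2, b = train_perceptron(points, labels)
correct = sum((1 if w1 * x + w2 * y + b > 0 else -1) == t
              for (x, y), t in zip(points, labels))
print(f"{correct}/{len(points)} points classified correctly")
```

For separable data like this the perceptron is guaranteed to converge; real song data would typically need a method tolerant of overlap, such as logistic regression.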
00:02:07.586 --> 00:02:10.265
Summing up:
in supervised learning,
00:02:10.328 --> 00:02:15.367
we have a set of variables X
that we call predictors, or features,
00:02:15.751 --> 00:02:18.928
and one response variable
– we denote it by Y.
00:02:19.229 --> 00:02:24.029
We also have measurements
of these variables on a so-called training set.
00:02:24.603 --> 00:02:27.603
The main goal
is to find a mathematical function
00:02:27.737 --> 00:02:34.528
that relates the values of the predictors X
to the response Y and fits the measurements well.
00:02:35.128 --> 00:02:40.528
As mentioned, classical examples are
regression and classification problems.
00:02:42.929 --> 00:02:46.228
Another example
of statistical learning is clustering.
00:02:47.279 --> 00:02:51.584
Suppose we consider
a dataset of all Slovenian scientists
00:02:51.723 --> 00:02:57.465
that have published at least one paper
in the period 1970-2015.
00:02:57.620 --> 00:03:03.376
We consider their collaboration
in the years 1970, 1980, 1990 and 2010
00:03:03.520 --> 00:03:08.784
and we visualise this collaboration
with the following network map.
00:03:10.284 --> 00:03:15.085
The dots represent groups of scientists
that collaborate, i.e. publish joint papers.
00:03:16.136 --> 00:03:20.162
We call such dots
clusters, or communities.
00:03:20.310 --> 00:03:27.636
Cluster or community detection
is the main task of cluster analysis.
00:03:28.762 --> 00:03:32.962
A special lecture afterwards
will be devoted to it.
00:03:33.863 --> 00:03:38.362
Once we have detected the communities,
we need an explanation for them.
00:03:38.567 --> 00:03:42.862
For example - what are the groups of scientists
that collaborate most?
00:03:43.313 --> 00:03:50.362
In our case it turns out these are scientists
from the same institute or the same scientific fields.
00:03:50.969 --> 00:03:54.793
But note that here we do not have
any ground-truth grouping available,
00:03:54.970 --> 00:03:58.945
so we cannot check
how well we detected the clusters.
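A simple way to get a feel for clustering is k-means, sketched below in standard-library Python. The 2D points are made-up toy data, not the Slovenian collaboration network from the lecture (which calls for graph-based community detection rather than k-means).

```python
# A minimal k-means clustering (Lloyd's algorithm) on toy 2D data.
import math

def kmeans(points, centers, steps=20):
    """Alternate between assigning points to centers and moving the centers."""
    groups = [[] for _ in centers]
    for _ in range(steps):
        # Assign each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: math.dist(p, centers[i]))
            groups[i].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, groups = kmeans(points, centers=[(0, 0), (1, 1)])
print(len(groups[0]), len(groups[1]))
```

Note there is no label to check the result against: deciding whether these two groups are "right" is exactly the unsupervised-learning difficulty the lecture describes.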
00:04:00.595 --> 00:04:05.450
Another example of unsupervised learning
is dimension reduction.
00:04:05.595 --> 00:04:08.298
Let us consider a cancer dataset.
00:04:08.546 --> 00:04:13.453
We have measurements for several
variables (features) describing each patient
00:04:13.594 --> 00:04:17.794
and we want to best visualise
these data in two dimensions.
00:04:18.093 --> 00:04:22.893
We can see
that the 3D diagram is not very descriptive.
00:04:23.123 --> 00:04:29.193
If we take an arbitrary pair of features, we see
that the diagram is not very descriptive either.
00:04:29.881 --> 00:04:33.318
But if we take an appropriate 2D subspace,
00:04:33.726 --> 00:04:36.968
the resulting diagram
clearly suggests two clusters,
00:04:36.981 --> 00:04:40.722
probably belonging to patients
with and without cancer.
00:04:42.375 --> 00:04:45.070
Summing up,
unsupervised learning is needed
00:04:45.083 --> 00:04:49.722
when we have a bunch of measurements
of a given list of statistical variables
00:04:50.011 --> 00:04:53.901
and we want to reveal
hidden groups of similar data instances,
00:04:53.984 --> 00:04:56.232
which is known as the clustering problem.
00:04:57.322 --> 00:05:03.684
We may also want to find a few new variables
that enable a better low-dimensional visualization
00:05:03.778 --> 00:05:10.261
or a more compact representation of the data
- in this case we talk about dimension reduction
00:05:10.330 --> 00:05:15.069
and may use, for example,
principal component analysis or factor analysis.
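The core of principal component analysis can be sketched in a few lines of standard-library Python: center the data, form the covariance matrix, and extract its leading eigenvector by power iteration. The 3D points below are toy data, not the lecture's cancer measurements, and for brevity we project onto the single strongest direction (the same deflation idea extends to a 2D projection).

```python
# Principal component analysis via power iteration on toy 3D data.
import math
import random

def power_iteration(cov, steps=200):
    """Leading eigenvector of a symmetric matrix via repeated multiplication."""
    random.seed(0)
    v = [random.random() for _ in cov]
    for _ in range(steps):
        w = [sum(cov[i][j] * v[j] for j in range(len(v)))
             for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]          # renormalise each step
    return v

def first_pc(points):
    """Center the data, build the covariance matrix, return the top component."""
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - means[k] for k in range(d)] for p in points]
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    return power_iteration(cov)

# Points spread along the (1, 1, 1) diagonal, so the first principal
# component should weight all three axes roughly equally.
pts = [(t, t + 0.1, t - 0.1) for t in range(10)]
pc = first_pc(pts)
print([round(abs(x), 2) for x in pc])
```

Projecting each point onto the top two such components gives exactly the kind of 2D picture described for the cancer data.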
00:05:17.125 --> 00:05:22.424
There is a clear distinction
between supervised and unsupervised learning.
00:05:22.899 --> 00:05:27.562
For supervised learning we know
the ground truth - at least on the training set.
00:05:27.824 --> 00:05:31.724
The ground truth is coded
in the response variable Y.
00:05:32.039 --> 00:05:35.834
Therefore different approaches
can be evaluated and compared.
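Such a comparison can be as simple as computing each method's accuracy against the known labels. A tiny sketch, with made-up predictions from two hypothetical classifiers:

```python
# Comparing two classifiers against ground-truth labels by accuracy.
# All labels and predictions below are hypothetical.
truth   = [1, 1, 0, 0, 1, 0, 1, 0]
model_a = [1, 1, 0, 1, 1, 0, 0, 0]
model_b = [1, 0, 0, 0, 1, 0, 1, 0]

def accuracy(pred, truth):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

print(accuracy(model_a, truth), accuracy(model_b, truth))
```

No analogous score exists for a clustering, because there is no Y to compare against.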
00:05:36.061 --> 00:05:39.068
For unsupervised learning
this is not the case.
00:05:39.185 --> 00:05:43.472
We have no universal measure
to compare different solutions for clustering,
00:05:43.772 --> 00:05:47.972
and these methods are therefore
more prone to subjectivity.