WEBVTT
Kind: captions
Language: en
00:00:10.510 --> 00:00:16.380
Hi! This is the one bit of Data Mining with Weka
where we're going to see a little bit of mathematics,
00:00:16.380 --> 00:00:20.280
but don't worry, I'll take you through it
gently.
00:00:20.280 --> 00:00:25.500
The OneR strategy that we've just been studying
assumes that there is one of the attributes
00:00:25.500 --> 00:00:30.940
that does all the work, that takes the responsibility
for the decision. That's a simple strategy.
00:00:30.940 --> 00:00:36.220
Another simple strategy is the opposite, to
assume all of the attributes contribute equally
00:00:36.230 --> 00:00:42.520
and independently to the decision. This is
called the "Naive Bayes" method -- I'll explain
00:00:42.520 --> 00:00:47.880
the name later on. There are two assumptions
that underlie Naive Bayes: that the attributes
00:00:47.880 --> 00:00:54.400
are equally important; and that they are statistically
independent, that is, knowing the value of
00:00:54.400 --> 00:01:00.180
one of the attributes doesn't tell you anything
about the value of any of the other attributes.
00:01:00.180 --> 00:01:05.580
This independence assumption is never actually
correct, but the method based on it often
00:01:05.590 --> 00:01:07.570
works well in practice.
00:01:07.570 --> 00:01:14.580
There's a theorem in probability called "Bayes
Theorem" after this guy Thomas Bayes from
00:01:14.580 --> 00:01:23.990
the 18th century. It's about the probability
of a hypothesis H given evidence E. In our
00:01:23.990 --> 00:01:31.110
case, the hypothesis is the class of an instance
and the evidence is the attribute values of
00:01:31.110 --> 00:01:37.240
the instance. The theorem is that Pr[H|E]
-- the probability of the class given the
00:01:37.240 --> 00:01:45.740
instance, the hypothesis given the evidence
-- is equal to Pr[E|H] times Pr[H] divided
00:01:45.740 --> 00:01:55.360
by Pr[E]. Pr[H] by itself is called the prior
probability of the hypothesis H. That's the
00:01:55.360 --> 00:02:03.660
probability of the event before any evidence
is seen. That's really the baseline probability
00:02:03.670 --> 00:02:10.369
of the event. For example, in the weather
data, I think there are 9 yes's and 5 no's,
00:02:10.369 --> 00:02:17.760
so the baseline probability of the hypothesis
"play equals yes" is 9/14 and "play equals
00:02:17.760 --> 00:02:25.499
no" is 5/14. What this equation says is how
to update the probability Pr[H] when you see
00:02:25.499 --> 00:02:32.060
some evidence, to get what's called the "a posteriori"
probability of H -- that means "after the
00:02:32.060 --> 00:02:38.260
evidence". The evidence in our case is the
attribute values of an unknown instance; that's E.
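Bayes' Theorem as just stated can be sketched numerically in a couple of lines. The function name and the numbers here are purely illustrative, not part of Weka or the weather data:

```python
# A minimal numeric sketch of Bayes' Theorem:
#   Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
# The function and example values are illustrative only.
def posterior(pr_e_given_h, pr_h, pr_e):
    """Update the prior Pr[H] into the posterior Pr[H|E]."""
    return pr_e_given_h * pr_h / pr_e

# Evidence that is more likely under H than overall raises the prior.
print(posterior(pr_e_given_h=0.6, pr_h=0.5, pr_e=0.4))  # 0.75
```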
00:02:39.680 --> 00:02:46.239
That's Bayes Theorem. Now, what makes this
method "naive"? The naive assumption is -- I've
00:02:46.239 --> 00:02:51.839
said it before -- that the evidence splits
into parts that are statistically independent.
00:02:51.840 --> 00:02:58.280
The parts of the evidence in our case are
the four different attribute values in the
00:02:58.280 --> 00:03:08.269
weather data. When you have independent events,
the probabilities multiply, so Pr[H|E] according
00:03:08.269 --> 00:03:14.189
to the top equation is the product of Pr[E|H],
times the prior probability Pr[H], divided
00:03:14.189 --> 00:03:24.529
by Pr[E]. Pr[E|H] splits up into these parts:
Pr[E1|H], the first attribute value; Pr[E2|H],
00:03:24.529 --> 00:03:27.869
the second attribute value; and so on, for
all of the attributes.
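The factorization just described, Pr[E|H] splitting into a product of per-attribute probabilities, can be written as a small sketch. The function name is illustrative, and division by Pr[E] is left to the normalization step discussed later:

```python
from math import prod

# Naive independence assumption: Pr[E|H] splits into the product
# Pr[E1|H] * Pr[E2|H] * ... * Pr[En|H], then we multiply by Pr[H].
def naive_bayes_likelihood(evidence_probs, prior):
    """Product of per-attribute probabilities times the prior Pr[H].
    Dividing by Pr[E] is deferred to a later normalization step."""
    return prod(evidence_probs) * prior

# Two hypothetical attribute values observed under a hypothesis H
# with prior probability 0.5:
print(naive_bayes_likelihood([0.5, 0.4], prior=0.5))  # 0.1
```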
00:03:27.869 --> 00:03:34.620
That's maybe a bit abstract. Let's look at
the actual weather data. On the right-hand
00:03:34.629 --> 00:03:41.310
side is the weather data. In the large table
at the top, we've taken each of the attributes.
00:03:41.310 --> 00:03:47.260
Let's start with "outlook". Under the "yes"
hypothesis and the "no" hypothesis, we've
00:03:47.260 --> 00:03:51.640
looked at how many times the outlook is "sunny".
It's sunny twice under yes and 3 times under
00:03:51.640 --> 00:03:58.980
no. That comes straight from the data in the
table. Overcast. When the outlook is overcast,
00:03:58.989 --> 00:04:05.129
it's always a "yes" instance, so there were
4 of those, and zero "no" instances. Then,
00:04:05.129 --> 00:04:10.809
rainy is 3 "yes" instances and 2 "no" instances.
Those numbers just come straight from the
00:04:10.809 --> 00:04:13.349
data table giving the instance values.
00:04:13.349 --> 00:04:18.260
Then we take those numbers and underneath
we make them into probabilities. Let's say
00:04:18.260 --> 00:04:24.780
we know the hypothesis: let's say we know
it's a "yes". Then the probability of it being
00:04:24.780 --> 00:04:30.960
"sunny" is 2/9ths, "overcast" is 4/9ths, and
"rainy" 3/9ths -- simply because when you
00:04:30.970 --> 00:04:37.880
add up 2 plus 4 plus 3 you get 9. Those are
the probabilities. If we know that the outcome
00:04:37.880 --> 00:04:45.711
is "no", the probabilities are "sunny" 3/5ths,
"overcast" 0/5ths, and "rainy" 2/5ths. That's
00:04:45.711 --> 00:04:47.150
for the "outlook" attribute.
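That counting-and-dividing step can be sketched in a few lines. The outlook values are hard-coded to match the counts quoted in the lecture (sunny 2/3, overcast 4/0, rainy 3/2 under "yes"/"no"):

```python
from collections import Counter

# "Outlook" values from the weather data, split by class, matching
# the counts in the lecture's table.
outlook_yes = ["sunny"] * 2 + ["overcast"] * 4 + ["rainy"] * 3
outlook_no = ["sunny"] * 3 + ["rainy"] * 2

def conditional_probs(values):
    """Turn raw counts into probabilities Pr[value | class]."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

print(conditional_probs(outlook_yes))  # sunny 2/9, overcast 4/9, rainy 3/9
print(conditional_probs(outlook_no))   # sunny 3/5, rainy 2/5
```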
00:04:47.150 --> 00:04:54.190
That's what we're looking for, you see, the
probability of each of these attribute values
00:04:54.190 --> 00:05:00.310
given the hypothesis H. The next attribute
is temperature, and we just do the same thing
00:05:00.310 --> 00:05:05.170
with that to get the probabilities of the
3 values -- hot, mild, and cool -- under the
00:05:05.170 --> 00:05:12.700
"yes" hypothesis or the "no" hypothesis. The
same with humidity and windy. Play, that's
00:05:12.700 --> 00:05:20.090
the prior probability -- Pr[H]. It's "yes"
9/14ths of the time, "no" 5/14ths of the time
00:05:20.090 --> 00:05:23.920
-- even if you don't know anything about the
attribute values.
00:05:23.920 --> 00:05:28.940
The equation we're looking at is this one
below, and we just need to work it out. Here's
00:05:28.940 --> 00:05:35.310
an example. Here's an unknown day, a new day.
We don't know what the value of "play" is,
00:05:35.310 --> 00:05:43.630
but we know it's sunny, cool, high, and windy.
We can just multiply up these probabilities.
00:05:43.630 --> 00:05:50.850
If we multiply for the "yes" hypothesis, we
get 2/9ths times 3/9ths times 3/9ths times
00:05:50.850 --> 00:05:58.620
3/9ths -- those are just the numbers on the
previous slide, Pr[E1|H], Pr[E2|H], Pr[E3|H],
00:05:58.620 --> 00:06:11.861
Pr[E4|H], and finally Pr[H], that is, 9/14ths.
That gives us a likelihood of 0.0053 when
00:06:11.861 --> 00:06:18.821
you multiply them. Then, for the "no" class
we do the same, to get a likelihood of 0.0206.
00:06:18.821 --> 00:06:24.560
These numbers are not probabilities. Probabilities
have to add up to 1. They are likelihoods.
00:06:24.560 --> 00:06:29.440
But we can get the probabilities from them
by using the straightforward technique of
00:06:29.440 --> 00:06:33.640
normalization. Take those likelihoods for
"yes" and "no" and we normalize them as shown
00:06:33.640 --> 00:06:40.080
below to make them add up to 1. That's how
we get the probability of "play" on a new
00:06:40.080 --> 00:06:42.040
day, with different attribute values.
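That whole worked example, from the likelihoods through the normalization, fits in a few lines. The fractions are read straight off the table in the lecture:

```python
# Likelihoods for the new day (sunny, cool, high humidity, windy),
# under each class: four conditional probabilities times the prior.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # ~0.0053
like_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

# Likelihoods don't sum to 1, so normalize them into probabilities;
# the unknown Pr[E] cancels out in the division.
total = like_yes + like_no
pr_yes = like_yes / total
pr_no = like_no / total
print(round(pr_yes, 3), round(pr_no, 3))  # 0.205 0.795
```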
00:06:42.040 --> 00:06:46.800
Just to go through that again. The evidence
is "outlook" is "sunny", "temperature" is
00:06:46.800 --> 00:06:52.520
"cool", "humidity" is "high", "windy" is "true"
-- and we don't know what "play" is. The probability
00:06:52.521 --> 00:07:00.250
of a "yes" given the evidence is the product
of those 4 probabilities -- one for outlook,
00:07:00.250 --> 00:07:07.220
temperature, humidity and windy -- times the
prior probability, which is just the baseline
00:07:07.220 --> 00:07:14.380
probability of a "yes". That product of fractions
is divided by Pr[E]. We don't know what Pr[E]
00:07:14.380 --> 00:07:20.110
is, but it doesn't matter, because we can
do the same calculation for Pr["no"|E],
00:07:20.110 --> 00:07:25.080
which gives us another equation just like
this, and then we can calculate the actual
00:07:25.080 --> 00:07:29.660
probabilities by normalizing them so that
the two probabilities add up to 1: Pr["yes"|E]
00:07:29.660 --> 00:07:33.580
plus Pr["no"|E] equals 1.
00:07:33.590 --> 00:07:40.210
It's actually quite simple when you look at
it in numbers, and it's simple when you look
00:07:40.210 --> 00:07:44.960
at it in Weka, as well. I'm going to go to
Weka here, and I'm going to open the nominal
00:07:44.960 --> 00:07:55.389
weather data, which is here. We've seen that
before, of course, many times. I'm going to
00:07:55.389 --> 00:08:03.181
go to Classify. I'm going to use the NaiveBayes
method. It's under this bayes category here.
00:08:03.181 --> 00:08:07.370
There are a lot of implementations of different
variants of Bayes; I'm just going to use the
00:08:07.370 --> 00:08:17.770
straightforward NaiveBayes method here. I'll
just run it. This is what we get: the success
00:08:17.770 --> 00:08:22.780
probability, calculated by cross-validation.
More interestingly, we get the model. The
00:08:22.780 --> 00:08:29.140
model is just like the table I showed you
before divided under the "yes" class and the
00:08:29.140 --> 00:08:34.300
"no" class. We've got the four attributes
-- outlook, temperature, humidity, and windy
00:08:34.310 --> 00:08:41.019
-- and then, for each of the attribute values,
we've got the number of times that attribute
00:08:41.019 --> 00:08:42.110
value appears.
00:08:42.110 --> 00:08:47.560
Now, there's one little but important difference
between this table and the one I showed you
00:08:47.560 --> 00:08:52.420
before. Let me go back to my slide and look
at these numbers. You can see that for outlook
00:08:52.420 --> 00:09:02.380
under "yes" on my slide I've got 2, 4, and
3, and Weka has got 3, 5, and 4. That's 1
00:09:02.380 --> 00:09:09.740
more each time, for a total of 12 instead
of a total of 9. Weka adds 1 to all of the
00:09:09.740 --> 00:09:16.200
counts. The reason it does this is to get
rid of the zeros. In the original table under
00:09:16.200 --> 00:09:22.980
outlook, under "no", the probability of overcast
given "no" is zero, and we're going to be
00:09:22.980 --> 00:09:27.240
multiplying that into things. What that would
mean in effect, if we took that zero at face
00:09:27.240 --> 00:09:35.520
value, is that the probability of the class
being "no" given any day for which the outlook
00:09:35.520 --> 00:09:40.160
was overcast would be zero. Anything multiplied
by zero is zero.
00:09:40.160 --> 00:09:45.080
These zeros in probability terms have sort
of a veto over all of the other numbers, and
00:09:45.090 --> 00:09:50.620
we don't want that. We don't want to categorically
rule out a "no" day just because it's overcast
00:09:50.620 --> 00:09:57.070
and we've never seen an overcast outlook
on a "no" day before. That's
00:09:57.070 --> 00:10:02.020
called the "zero-frequency problem," and Weka's
solution -- the most common solution -- is
00:10:02.020 --> 00:10:07.710
very simple: just add 1 to all the counts.
That's why all those numbers in the Weka table
00:10:07.710 --> 00:10:14.630
are 1 bigger than the numbers in the table
on the slide. Aside from that, it's all exactly
00:10:14.630 --> 00:10:21.190
the same. We're avoiding zero frequencies
by effectively starting all counts at 1 instead
00:10:21.190 --> 00:10:25.920
of starting them at 0, so they can't end up
at 0.
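The add-one fix can be sketched by redoing the "no"-class counts for outlook with every count started at 1. (This is a simple form of what is commonly called Laplace smoothing; the lecture doesn't use that name, so treat the label as an aside.)

```python
# Outlook counts under the "no" class from the weather data. Overcast
# never occurs, so its raw probability would be 0 and would veto
# everything it multiplies.
raw_counts = {"sunny": 3, "overcast": 0, "rainy": 2}

# Start every count at 1 instead of 0, as Weka does.
smoothed = {value: count + 1 for value, count in raw_counts.items()}
total = sum(smoothed.values())  # 8 instead of 5
probs = {value: count / total for value, count in smoothed.items()}

print(probs["overcast"])  # 0.125 -- no longer zero
```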
00:10:25.920 --> 00:10:31.210
That's the Naive Bayes method. The assumption
is that all attributes contribute equally
00:10:31.210 --> 00:10:36.320
and independently to the outcome. That works
surprisingly well, even in situations where
00:10:36.320 --> 00:10:42.040
the independence assumption is clearly violated.
Why does it work so well when the assumption
00:10:42.040 --> 00:10:47.650
is wrong? That's a good question. Basically,
classification doesn't need accurate probability
00:10:47.650 --> 00:10:54.060
estimates. We're just going to choose as the
class the outcome with the largest probability.
00:10:54.060 --> 00:10:58.810
As long as the greatest probability is assigned
to the correct class, it doesn't matter if
00:10:58.810 --> 00:11:04.750
the probability estimates aren't all that accurate.
However, if you add redundant
00:11:04.750 --> 00:11:11.000
attributes you get problems with Naive Bayes.
The extreme case of dependence is where two
00:11:11.000 --> 00:11:17.210
attributes have the same values, identical
attributes. That will cause havoc with the
00:11:17.210 --> 00:11:22.589
Naive Bayes method. However, Weka contains
methods for attribute selection to allow you
00:11:22.589 --> 00:11:29.210
to select a subset of fairly independent attributes,
after which you can safely use Naive Bayes.