WEBVTT
1
00:00:00.001 --> 00:00:06.334
This course is about collecting,
analyzing and
2
00:00:06.334 --> 00:00:12.361
reporting data that we
get from social media.
3
00:00:12.361 --> 00:00:19.102
Data collection is easy and
fast using social media.
4
00:00:19.102 --> 00:00:24.430
However, analysis requires
a specialized set of skills.
5
00:00:24.430 --> 00:00:29.161
In this class you will need
to know at least a modicum
6
00:00:29.161 --> 00:00:33.243
of statistical analysis to be successful.
7
00:00:33.243 --> 00:00:38.007
I will give you an example
of a statistical skill that
8
00:00:38.007 --> 00:00:40.398
you can use in this class.
9
00:00:40.398 --> 00:00:42.774
In case you have this type of skill,
10
00:00:42.774 --> 00:00:46.906
you should be very successful
in the course, just as you are.
11
00:00:46.906 --> 00:00:53.078
In case you do not recognize this skill
as something you've learned in the past,
12
00:00:53.078 --> 00:00:58.547
you might have to look up some
information about this type of analysis.
13
00:00:58.547 --> 00:01:02.133
Or maybe take one of our intro courses.
14
00:01:02.133 --> 00:01:06.436
Whichever the case,
however, I would like to emphasize that
15
00:01:06.436 --> 00:01:11.442
the statistical skills you need in
this class are not very advanced.
16
00:01:11.442 --> 00:01:15.700
They require that you understand one
fundamental thing.
17
00:01:15.700 --> 00:01:19.406
Which is that statistical
analysis is about stating
18
00:01:19.406 --> 00:01:22.865
the obvious with a certain
degree of certainty.
19
00:01:22.865 --> 00:01:27.264
And this applies to just about any type
of statistical analysis you want to do.
20
00:01:27.264 --> 00:01:31.355
And I'll provide an example,
right here and
21
00:01:31.355 --> 00:01:37.611
now by reviewing in the shortest
possible way the logic of T-tests.
22
00:01:37.611 --> 00:01:42.416
Now T-tests are a statistical
procedure by which you can
23
00:01:42.416 --> 00:01:45.929
tell if two groups are indeed different.
24
00:01:45.929 --> 00:01:48.486
If they are statistically different or
not.
25
00:01:48.486 --> 00:01:54.536
Let's just take this example of a tweet
that I might have posted with a cat in it.
26
00:01:54.536 --> 00:01:57.196
Cats are very popular on the Internet.
27
00:01:57.196 --> 00:01:59.843
Initially let's just say,
28
00:01:59.843 --> 00:02:04.926
all females like the tweet and
no males like the tweet.
29
00:02:04.926 --> 00:02:09.027
Now, we got the impressions,
the people looked at it and
30
00:02:09.027 --> 00:02:11.567
only the females liked the tweet.
31
00:02:11.567 --> 00:02:16.179
Now, if that is the case,
it's common sense to say that there is
32
00:02:16.179 --> 00:02:20.282
a significant difference between males and
females.
33
00:02:20.282 --> 00:02:23.943
Males are different from females. Why?
34
00:02:23.943 --> 00:02:29.895
We don't know that from the statistics, but
the statistics tell us there is a difference.
35
00:02:29.895 --> 00:02:34.030
Now, what happens if the difference
is not as large as that?
36
00:02:34.030 --> 00:02:40.539
Let's just say that we have one male
among the ten who likes the tweet.
37
00:02:40.539 --> 00:02:45.417
Is the group of males now different
from the group of females?
38
00:02:45.417 --> 00:02:48.932
The T-test, the statistical T-test,
39
00:02:48.932 --> 00:02:53.453
will be able to tell us to
what degree our certainty has
40
00:02:53.453 --> 00:02:58.185
moved now that the difference
between the groups is smaller.
41
00:02:58.185 --> 00:03:02.682
I put that number,
the statistical test number, in this cell.
42
00:03:02.682 --> 00:03:07.924
And the number is generated by a formula
which is called the T-test formula,
43
00:03:07.924 --> 00:03:11.485
which basically looks at
how spread out the values are.
44
00:03:11.485 --> 00:03:15.487
And how large the difference
in proportions is;
45
00:03:15.487 --> 00:03:18.258
it's no more, no less than that.
46
00:03:18.258 --> 00:03:23.299
And now we see that,
by adding one male to the mix,
47
00:03:23.299 --> 00:03:27.474
we can say that
our certainty has shifted.
48
00:03:27.474 --> 00:03:33.415
In the beginning, we could have said
that there's zero chance that males and
49
00:03:33.415 --> 00:03:36.173
females are not different, right?
50
00:03:36.173 --> 00:03:39.032
Now, by adding one male, we say,
51
00:03:39.032 --> 00:03:44.357
there's a very tiny 0.00000
chance that maybe males and
52
00:03:44.357 --> 00:03:48.417
females are not different
after all.
53
00:03:48.417 --> 00:03:53.656
Now, we add more males
who like the tweet.
54
00:03:53.656 --> 00:03:58.453
Let's just say that we reissue
the same tweet several times.
55
00:03:58.453 --> 00:04:01.975
And on each iteration we have
more males liking the tweet.
56
00:04:01.975 --> 00:04:06.612
Now, as you can observe here at the bottom,
the likelihood that the two
57
00:04:06.612 --> 00:04:11.438
groups are not different moves from
zero to higher and higher numbers.
58
00:04:11.438 --> 00:04:15.962
Until we can get to a situation
where the test will tell us, hey,
59
00:04:15.962 --> 00:04:21.530
you know, actually these two groups are so
similar that I actually cannot run.
60
00:04:21.530 --> 00:04:25.186
I'm a test of difference,
not a test of similarity.
61
00:04:25.186 --> 00:04:29.046
I'm breaking up,
these two groups are not that different.
62
00:04:29.046 --> 00:04:33.373
So the T-test, like most statistics,
is meant to tell us what
63
00:04:33.373 --> 00:04:38.241
are the significant differences
between groups, by basically
64
00:04:38.241 --> 00:04:43.126
looking at the characteristics
of the groups by simple counts.
65
00:04:43.126 --> 00:04:47.810
In a future video, I will talk about
a different type of analysis
66
00:04:47.810 --> 00:04:52.760
which is complementary to this,
namely correlation analysis.
67
00:04:52.760 --> 00:04:58.158
Which looks at how similar characteristics
of individuals within a group could be.
68
00:04:58.158 --> 00:05:02.188
But between T-tests and
correlations, basically you have
69
00:05:02.188 --> 00:05:06.938
all the statistical skill you need
to be very successful in this class.
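
NOTE
A minimal sketch of the cat-tweet example from this video, not the lecturer's own spreadsheet formula. It assumes likes are coded as 1 and non-likes as 0, that there are ten females who all liked the tweet (the video only states ten males), and it uses Welch's t statistic; the name welch_t is hypothetical.
```python
import math
from statistics import mean, variance
def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        # No spread in either group: unbounded if the means differ,
        # undefined (None) if the two groups are identical.
        return math.inf if diff != 0 else None
    return diff / se
females = [1] * 10  # assumption: ten females, all of whom liked the tweet
for k in (0, 1, 5, 9, 10):  # k = number of males (out of ten) who liked it
    males = [1] * k + [0] * (10 - k)
    print(k, welch_t(females, males))
```
The statistic is unbounded when no males like the tweet, shrinks as more males like it, and is undefined when the two groups are identical, which mirrors the point that this is a test of difference, not a test of similarity.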