WEBVTT
00:00:00.883 --> 00:00:08.140
So far, we have discussed a number of empirical studies that explored potential determinants of crime.
00:00:08.140 --> 00:00:15.723
Their findings usually come from some sort of regression analysis, such as fixed-effects linear regression,
00:00:15.723 --> 00:00:22.018
difference-in-differences, or regression discontinuity analysis. In this video,
00:00:22.018 --> 00:00:29.219
I want to make a few comments about the strengths and limitations of these regression analyses.
00:00:29.219 --> 00:00:36.185
The first point is that, given the availability of statistical software and high quality data these days,
00:00:36.185 --> 00:00:39.927
running a regression analysis is actually pretty easy.
00:00:39.927 --> 00:00:45.101
Suppose I want to learn about the relationship between education and crime.
00:00:45.101 --> 00:00:51.518
One thing I can do is to collect data on city level education levels and crime rates
00:00:51.518 --> 00:00:58.888
across different years and different cities and run a linear regression using some software like Stata.
00:00:58.888 --> 00:01:05.328
Unless your data contains some unusual problem, you will almost certainly obtain a regression coefficient
00:01:05.328 --> 00:01:09.987
that tells you how education and crime are correlated.
00:01:09.987 --> 00:01:14.165
In other words, your regression analysis will always give you a number
00:01:14.165 --> 00:01:18.880
that tells you how the two variables of interest are related.
00:01:18.880 --> 00:01:25.076
The second point is that, although your regression analysis will always give you the correlation
00:01:25.076 --> 00:01:32.289
between the two variables, it usually does not give you the causal relationship between them.
00:01:32.289 --> 00:01:36.681
In other words, your regression will almost always give you correlation,
00:01:36.681 --> 00:01:42.135
but correlation does not imply causation. For example,
00:01:42.135 --> 00:01:48.839
when you regress crime on education levels across different cities and obtain a negative coefficient,
00:01:48.839 --> 00:01:55.435
it simply means that cities with low education tend to have more crime
00:01:55.435 --> 00:01:57.865
than cities with high education.
00:01:57.865 --> 00:02:03.970
But you should not take this as evidence that less education causes more crime.
00:02:03.970 --> 00:02:10.960
As we discussed before, the main problem is that cities with high and low education levels may differ
00:02:10.960 --> 00:02:17.298
in many other aspects as well, and you cannot simply attribute the difference in their crime rates
00:02:17.298 --> 00:02:20.859
to the difference in their education levels.
00:02:20.859 --> 00:02:29.393
For example, cities with high education levels may have more high-paying jobs than cities with low education levels.
00:02:29.393 --> 00:02:35.364
In this case, it may be that the difference in crime rates between the two cities
00:02:35.364 --> 00:02:38.981
is not caused by the difference in their education levels,
00:02:38.981 --> 00:02:46.799
but is actually caused by the difference in the availability of high paying jobs between the two cities.
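The omitted-variable problem described here can be illustrated with a small simulation. This is a hypothetical sketch, not the lecturer's data: the numbers, variable names, and effect sizes are all assumptions, chosen so that high-paying jobs drive both education and crime while education itself has no causal effect.

```python
import numpy as np

# Hypothetical simulation: the availability of high-paying jobs raises
# education and lowers crime. Education has NO causal effect on crime here.
rng = np.random.default_rng(0)
n = 5000
jobs = rng.normal(size=n)                  # availability of high-paying jobs
education = 0.8 * jobs + rng.normal(size=n)
crime = -0.5 * jobs + rng.normal(size=n)   # crime depends only on jobs

# Regress crime on education alone (OLS via least squares with an intercept).
X = np.column_stack([np.ones(n), education])
beta = np.linalg.lstsq(X, crime, rcond=None)[0]
print(beta[1])  # negative, even though the true causal effect is zero
```

The regression dutifully returns a negative coefficient, but it reflects the omitted jobs variable, not a causal effect of education on crime.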
00:02:46.799 --> 00:02:52.344
My last point is a bit more optimistic. If we carefully design our empirical study,
00:02:52.344 --> 00:02:58.234
we may be able to recover a causal relationship from the regression analysis.
00:02:58.234 --> 00:03:04.092
For example, suppose I run a simple experiment in which I recruit many subjects
00:03:04.092 --> 00:03:07.026
and randomly divide them into two groups.
00:03:07.026 --> 00:03:12.715
I will offer the first group an opportunity to attend a high quality job training program
00:03:12.715 --> 00:03:20.690
and I will not offer anything to the second group. Then I will compare the offending rates after a few years.
00:03:20.690 --> 00:03:26.223
In this case, when I regress the offending rates on the group assignment status,
00:03:26.223 --> 00:03:33.481
the regression coefficient should reflect a causal effect of the job training program on crime.
00:03:33.481 --> 00:03:37.210
And that’s because I am pretty sure there was very little difference
00:03:37.210 --> 00:03:42.530
between the two groups in the beginning except their group assignment status,
00:03:42.530 --> 00:03:47.559
and the difference in their crime rates most likely comes from the fact that
00:03:47.559 --> 00:03:52.095
only one group could attend the training program and the other could not.
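The experiment just described can also be sketched as a simulation. Again, the effect size and sample size are assumptions for illustration: treatment is assigned at random, so regressing offending on assignment status recovers the true causal effect.

```python
import numpy as np

# Hypothetical simulation of the randomized experiment described above.
rng = np.random.default_rng(1)
n = 10000
treated = rng.integers(0, 2, size=n)       # random group assignment
# Assumed true causal effect of the training program: -0.1
offend = 0.3 - 0.1 * treated + 0.05 * rng.normal(size=n)

X = np.column_stack([np.ones(n), treated])
beta = np.linalg.lstsq(X, offend, rcond=None)[0]
print(beta[1])  # close to the assumed causal effect of -0.1
```

Because randomization makes the two groups comparable on average, the same regression machinery that only delivered a correlation before now delivers a causal effect.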
00:03:52.095 --> 00:03:58.445
Do we have to have a random experiment for the regression analysis to have a causal interpretation?
00:03:58.445 --> 00:04:04.493
The answer is no. For example, when we run a regression to compare the offending rates
00:04:04.493 --> 00:04:09.379
between individuals who are just above 18 and individuals who are just below 18,
00:04:09.379 --> 00:04:16.770
we can plausibly take the regression result as a causal effect of facing much more severe punishment
00:04:16.770 --> 00:04:19.750
on the criminal behavior of young adults.
00:04:19.750 --> 00:04:26.269
And that’s because individuals who are just above age 18 and individuals who are just below 18
00:04:26.269 --> 00:04:29.718
should be highly comparable in most aspects,
00:04:29.718 --> 00:04:37.423
except that one group is subject to a more lenient juvenile court system and the other is not.
00:04:37.423 --> 00:04:44.816
Here, the age cutoff for legal minority and majority creates variation between otherwise
00:04:44.816 --> 00:04:52.544
comparable individuals, and this is what allows our regression analysis to identify a causal effect.
00:04:52.544 --> 00:04:56.307
And we call this an identifying variation.
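The age-18 discontinuity can be sketched the same way. This is a hypothetical illustration: the assumed offending trend and the size of the jump at the cutoff are invented, and a simple comparison of narrow bands around age 18 stands in for a full regression discontinuity estimator.

```python
import numpy as np

# Hypothetical regression-discontinuity sketch around the age-18 cutoff:
# punishment severity jumps at 18, lowering offending by an assumed 0.05.
rng = np.random.default_rng(2)
n = 20000
age = rng.uniform(16, 20, size=n)
adult = (age >= 18).astype(float)
offend = 0.2 - 0.02 * (age - 18) - 0.05 * adult + 0.05 * rng.normal(size=n)

# Compare individuals just below and just above the cutoff.
band = 0.25
below = offend[(age >= 18 - band) & (age < 18)].mean()
above = offend[(age >= 18) & (age < 18 + band)].mean()
print(above - below)  # approximately the assumed jump at the cutoff
```

Individuals just on either side of 18 are nearly identical except for the legal regime they face, so the jump in offending at the cutoff is the identifying variation.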
00:04:56.307 --> 00:05:02.416
Whether we run a simple linear regression, difference-in-differences, regression discontinuity,
00:05:02.416 --> 00:05:07.928
or some other type of regression analysis, the strength of the research design often hinges
00:05:07.928 --> 00:05:11.767
upon the strength of the identifying variation used.