WEBVTT
00:00:01.017 --> 00:00:06.797
Now that we know the basic intuition of a linear regression, let’s work on a simple example.
00:00:06.797 --> 00:00:11.658
This example is from my own research project on inequality and crime.
00:00:11.658 --> 00:00:17.906
To give you a little background, many researchers thought that more inequality would cause more crime.
00:00:17.906 --> 00:00:23.543
Sociologists and criminologists had their reasons, but economists' main rationale came
00:00:23.543 --> 00:00:28.874
from the rational choice model we saw earlier: an individual will choose to commit a crime
00:00:28.874 --> 00:00:34.968
if the net gains from committing a crime are greater than the net gains from not committing a crime.
00:00:34.968 --> 00:00:41.339
Suppose you are a low-income person, which implies that your u is likely to be low.
00:00:41.339 --> 00:00:46.146
As income inequality goes up and your wealthy neighbor becomes even wealthier,
00:00:46.146 --> 00:00:51.086
you may find that you can gain more by stealing from your wealthy neighbor.
00:00:51.086 --> 00:00:58.665
In other words, the difference between u_s and u is likely to increase as inequality goes up,
00:00:58.665 --> 00:01:04.031
and you are more likely to find that the expected gains from committing a crime are greater
00:01:04.031 --> 00:01:07.001
than the expected gains from not committing a crime.
00:01:07.001 --> 00:01:14.572
Therefore, economists argued, more individuals will likely become criminals as inequality goes up.
00:01:14.572 --> 00:01:21.539
To test this prediction using actual data, we first have to collect data on inequality and crime.
00:01:21.539 --> 00:01:28.822
In my research project, I obtained data on the Gini coefficient, which is a widely used measure of inequality,
00:01:28.822 --> 00:01:36.193
and the rate of larceny in large U.S. counties in 2000. The data points look like this.
00:01:36.193 --> 00:01:41.916
What do you think the relationship between inequality and larceny is, based on this figure?
00:01:41.916 --> 00:01:48.478
I would say the relationship is positive, and when we draw the best-fit line using Microsoft Excel,
00:01:48.478 --> 00:01:52.433
the results show that the relationship is indeed positive.
00:01:52.433 --> 00:01:58.839
The best-fit line supports the prediction that more inequality leads to more larceny.
00:01:58.839 --> 00:02:05.512
Furthermore, the regression results suggest that a 0.01 increase in the Gini coefficient
00:02:05.512 --> 00:02:11.191
will lead to 63 more larcenies per 100,000 residents.
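
NOTE
Editor's sketch, not the lecturer's analysis: a minimal illustration of fitting a best-fit line and reading its slope as a change per 0.01 of Gini. The data are simulated, and the slope of 6300 per unit of Gini is an assumption chosen only to mimic the figure quoted above. The names fit_line, gini, larceny and per_step are hypothetical.
```python
import random
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
# Simulated counties (assumption): Gini between 0.35 and 0.55,
# larceny rate per 100,000 residents with random noise added.
rng = random.Random(0)
gini = [rng.uniform(0.35, 0.55) for _ in range(200)]
larceny = [500 + 6300 * g + rng.gauss(0, 100) for g in gini]
intercept, slope = fit_line(gini, larceny)
# The slope is per 1.0 of Gini, so scale it to a 0.01 step.
per_step = slope * 0.01
print(f"Larcenies per 100,000 for a 0.01 rise in Gini: {per_step:.1f}")
```
The scaling step is the point of the sketch: a regression slope is expressed per one unit of the explanatory variable, so a statement about a 0.01 change multiplies the slope by 0.01.
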
00:02:11.191 --> 00:02:16.584
This is an overly simplified version of an empirical analysis on inequality and crime,
00:02:16.584 --> 00:02:21.610
but it still gives us an idea of how to run a basic analysis.
00:02:21.610 --> 00:02:28.451
We first collect data on inequality and crime, and run a linear regression to quantify the relationship
00:02:28.451 --> 00:02:31.430
between inequality and crime.
00:02:31.430 --> 00:02:37.063
The line that best fits these data points allows us to make predictions of crime rates,
00:02:37.063 --> 00:02:43.914
based on the level of inequality. Up to this point, everything looks pretty straightforward.
00:02:43.914 --> 00:02:49.257
However, there is a big limitation in our empirical analysis so far.
00:02:49.257 --> 00:02:53.692
Linear regression can be very helpful in allowing us to make predictions,
00:02:53.692 --> 00:02:59.394
but it's not very helpful in telling us whether more inequality causes more crime.
00:02:59.394 --> 00:03:05.357
In other words, linear regression can be very helpful and very good at picking up correlation,
00:03:05.357 --> 00:03:10.319
but not very helpful in establishing causation.
00:03:10.319 --> 00:03:15.653
Suppose that in areas with high inequality, we see high crime rates.
00:03:15.653 --> 00:03:21.219
Does this mean that more inequality causes more crime? Not really, right?
00:03:21.219 --> 00:03:26.808
There may be another factor that's associated with both inequality and crime.
00:03:26.808 --> 00:03:35.331
For example, a high share of youth population may cause both inequality and crime to go up.
00:03:35.331 --> 00:03:41.609
In this case, when we regress crime rates on the level of inequality, even if the causal effect of inequality
00:03:41.609 --> 00:03:48.598
on crime is zero, we will still have a positive relationship between inequality and crime.
00:03:48.598 --> 00:03:54.630
That's because areas with many young people will have high inequality and high crime,
00:03:54.630 --> 00:03:59.861
while areas with very few young people will have low inequality and low crime.
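
NOTE
Editor's sketch under stated assumptions: a small simulation of the situation just described. The youth share drives both inequality and crime, and inequality has zero causal effect on crime by construction, yet a simple regression of crime on inequality still returns a positive slope. All numbers are made up, and the names fit_line, youth, inequality, crime and naive_slope are hypothetical.
```python
import random
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
rng = random.Random(1)
# Youth share of the population in each simulated area (assumption).
youth = [rng.uniform(0.10, 0.30) for _ in range(500)]
# Inequality rises with the youth share, plus noise.
inequality = [0.30 + 0.5 * s + rng.gauss(0, 0.02) for s in youth]
# Crime depends on the youth share only: inequality does not enter at all.
crime = [1000 + 8000 * s + rng.gauss(0, 100) for s in youth]
# Regressing crime on inequality alone still picks up a positive slope.
_, naive_slope = fit_line(inequality, crime)
print(f"Slope of crime on inequality (true causal effect is 0): {naive_slope:.0f}")
```
The positive slope here comes entirely from the shared dependence on the youth share, which is why a best-fit line alone cannot separate correlation from causation.
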
00:03:59.861 --> 00:04:04.358
To account for such possibilities, we usually include other variables
00:04:04.358 --> 00:04:10.562
that we believe may influence our outcome variable in our linear regression equation.
00:04:10.562 --> 00:04:17.742
For example, instead of trying to find a simple line that fits our data on inequality and larceny,
00:04:17.742 --> 00:04:23.739
we may want to include more variables in the equation to separate the effects of inequality on crime
00:04:23.739 --> 00:04:26.909
and the effects of other factors on crime.
00:04:26.909 --> 00:04:33.808
To intuitively describe what we are trying to achieve in this simple extension, in our last example,
00:04:33.808 --> 00:04:39.190
we were trying to compare crime rates in cities with different levels of inequality,
00:04:39.190 --> 00:04:46.000
and trying to see whether crime rates were higher in cities with high inequality or low inequality.
00:04:46.000 --> 00:04:52.209
This time, we are trying to compare crime rates in cities that have different levels of inequality
00:04:52.209 --> 00:04:58.976
but are comparable in other attributes such as the share of youth, the share of low-income population,
00:04:58.976 --> 00:05:01.725
poverty rate, and so on.
00:05:01.725 --> 00:05:08.708
If we still find that crime rates are higher in cities with high inequality, then we can be more confident
00:05:08.708 --> 00:05:15.635
that high inequality was the main driving force for the high crime rate in such cities.
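
NOTE
Editor's sketch under stated assumptions: one way to see what adding a control variable does. In simulated data where the youth share drives both inequality and crime and inequality has no effect, the code compares the simple slope with the slope obtained after holding the youth share fixed. The control is applied by partialling out (regress both crime and inequality on the youth share, then regress residual on residual), which yields the same inequality coefficient as a multiple regression that includes the youth share. All data and names are hypothetical.
```python
import random
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
def residuals(x, y):
    """Part of y that a straight line in x does not explain."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rng = random.Random(1)
youth = [rng.uniform(0.10, 0.30) for _ in range(500)]
inequality = [0.30 + 0.5 * s + rng.gauss(0, 0.02) for s in youth]
# Crime depends on the youth share only (true inequality effect is 0).
crime = [1000 + 8000 * s + rng.gauss(0, 100) for s in youth]
# Simple regression: compares areas without regard to the youth share.
_, naive_slope = fit_line(inequality, crime)
# Controlled regression: compares areas with a comparable youth share.
crime_resid = residuals(youth, crime)
ineq_resid = residuals(youth, inequality)
_, controlled_slope = fit_line(ineq_resid, crime_resid)
print(f"Without control: {naive_slope:.0f}")
print(f"Controlling for youth share: {controlled_slope:.0f}")
```
In this simulation the controlled slope should be far smaller in magnitude than the simple slope, because the comparison is now between areas with a comparable youth share.
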
00:05:15.635 --> 00:05:22.732
So is the problem solved if we just keep adding more variables to the equation? The answer is no.
00:05:22.732 --> 00:05:27.918
Sometimes data on some important characteristics that should have large impacts on crime
00:05:27.918 --> 00:05:33.815
and need to be included in the estimating equation will not be available at all.
00:05:33.815 --> 00:05:39.190
For example, many believe that how much the public trusts and cooperates with the police
00:05:39.190 --> 00:05:41.541
should have large impacts on crime.
00:05:41.541 --> 00:05:46.924
And I want to include this information in my regression equation.
00:05:46.924 --> 00:05:53.705
But reliable and accurate data on this public trust measure is usually not available.
00:05:53.705 --> 00:05:55.700
What do we do in this case?
00:05:55.700 --> 00:06:02.693
In the next video, we will see how this problem can be mitigated by using something called panel data.