WEBVTT
00:00:00.510 --> 00:00:05.539
Many of the research papers we talked about, and many others we will talk about later,
00:00:05.539 --> 00:00:10.801
use a technique called linear regression, to find out the relationship between two variables.
00:00:10.801 --> 00:00:17.544
For example, to find out whether higher wage levels reduce crime, we may collect data on wages and crime,
00:00:17.544 --> 00:00:22.584
and run linear regression to examine how the two variables are related.
00:00:22.584 --> 00:00:29.405
In this video and the following exercise, we will talk about the basic intuition behind linear regression
00:00:29.405 --> 00:00:32.821
and how we can actually run it using simple data.
00:00:32.821 --> 00:00:38.925
I usually use a computer program called Stata to run linear regression and other empirical analyses,
00:00:38.925 --> 00:00:44.065
but there are many other software packages, such as MATLAB, R, and Microsoft Excel,
00:00:44.065 --> 00:00:47.750
that allow you to run linear regression.
00:00:47.750 --> 00:00:54.661
For now, we will just run a very simple linear regression, and for this task, Excel works just fine.
00:00:54.661 --> 00:01:00.786
But please feel free to check out other online courses and textbooks on econometrics to learn more details
00:01:00.786 --> 00:01:03.273
about linear regression.
00:01:03.273 --> 00:01:10.632
I think the easiest way to motivate linear regression is to see it as an exercise in finding the best-fitting line.
00:01:10.632 --> 00:01:17.811
Here is the data on larceny and unemployment rates from the 200 largest U.S. counties in 2000.
00:01:17.811 --> 00:01:26.088
From last week, we know we can find the official crime data in the United States from the FBI UCR website.
00:01:26.088 --> 00:01:31.733
I also obtained the county-level unemployment data from the U.S. Census website.
00:01:31.733 --> 00:01:36.594
Both data sets are publicly available online.
00:01:36.594 --> 00:01:42.384
Now let’s use Excel’s scatterplot function to visualize the data.
00:01:42.384 --> 00:01:49.504
The x-axis represents the rate of unemployment and the y-axis represents the larceny rate.
00:01:49.504 --> 00:01:53.822
To go back to my main question, I want to find out the relationship
00:01:53.822 --> 00:02:02.048
between unemployment and larceny rates. From this figure, I would say that the relationship is positive.
00:02:02.048 --> 00:02:08.868
In areas with high unemployment rates, larceny rates tend to be high as well.
00:02:08.868 --> 00:02:15.287
But we want more details. We want to quantify the relationship, and be able to say things like,
00:02:15.287 --> 00:02:22.768
“When unemployment goes up by X%, larceny will go up by Y%.” How can we do this?
00:02:22.768 --> 00:02:29.029
For now, let’s assume that the true relationship between unemployment and larceny is linear.
00:02:29.029 --> 00:02:35.703
Then, we can quantify the relationship by drawing a line that fits these data points.
00:02:35.703 --> 00:02:42.189
We know that a line can be represented by a linear function: y=a+bx.
00:02:42.189 --> 00:02:48.539
In this case, we want to recover the linear function that gives us the best fit of our data
00:02:48.539 --> 00:02:51.406
on unemployment and larceny.
00:02:51.406 --> 00:02:56.502
So how do we find a line that fits our data points the best?
00:02:56.502 --> 00:03:03.547
Different people may have different ideas about how to define the best fitting line, but the most widely
00:03:03.547 --> 00:03:12.227
used way to define the best fitting line is to choose the line where the sum of squared errors is minimized.
00:03:12.227 --> 00:03:18.450
To give a concrete example, let’s look at this figure which has five dots.
00:03:18.450 --> 00:03:24.632
I drew a line to fit these five dots. But it is clear that no matter how hard I try,
00:03:24.632 --> 00:03:30.086
there is no way I can perfectly fit these five dots on a straight line.
00:03:30.086 --> 00:03:36.752
No matter how I draw the line, there will always be some error, meaning that there will be some difference
00:03:36.752 --> 00:03:43.707
between the actual value of my data point and the value predicted by the fitted line.
00:03:43.707 --> 00:03:49.283
In this figure, the five blue dots represent actual values of my data points,
00:03:49.283 --> 00:03:55.443
and the five green dots represent the values predicted by the fitted line.
00:03:55.443 --> 00:04:03.919
The differences between the actual values and the predicted values are the errors associated with my fitted line.
00:04:03.919 --> 00:04:08.668
Intuitively, if we want to have a line that fits these data points well,
00:04:08.668 --> 00:04:13.652
we want to draw the line so that the errors are as small as possible.
00:04:13.652 --> 00:04:21.548
To be more exact, for a linear function y=a+bx, we want to choose the values of a and b
00:04:21.548 --> 00:04:34.809
that will minimize the sum of squared errors: (y1-a-bx1)^2 + (y2-a-bx2)^2, and so on.
00:04:34.809 --> 00:04:42.906
Here x1 and y1 refer to the first dot, x2 and y2 the second dot, and so on.
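To make the formula concrete, here is a minimal Python sketch that computes the sum of squared errors for a candidate line. The five data points and the name sum_squared_errors are assumptions made up for illustration; they are not the dots in the figure.

```python
def sum_squared_errors(a, b, xs, ys):
    """Return the sum of squared errors for the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical five dots (illustrative values, not the data from the figure).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Two candidate lines: the smaller the result, the better the fit.
sse_line_1 = sum_squared_errors(2.2, 0.6, xs, ys)  # about 2.4
sse_line_2 = sum_squared_errors(1.0, 1.0, xs, ys)  # 4.0

print(sse_line_1, sse_line_2)
```

With these made-up points, the first line has the smaller sum of squared errors, so by this criterion it fits better than the second.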
00:04:42.906 --> 00:04:46.532
Why do we want to choose the values of a and b
00:04:46.532 --> 00:04:51.165
that minimize the sum of squared errors, and not just the sum of errors?
00:04:51.165 --> 00:04:57.370
That’s because, if we want to minimize the sum of errors, we will run into an obvious problem.
00:04:57.370 --> 00:05:03.968
In the graph we just saw, some errors were positive and some errors were negative.
00:05:03.968 --> 00:05:10.307
And if we just add up these errors, the errors with opposite signs will cancel each other out,
00:05:10.307 --> 00:05:16.316
and the sum of errors will be pretty small, although this fitted line is not really doing a great job
00:05:16.316 --> 00:05:20.080
in predicting the values of my data points.
00:05:20.080 --> 00:05:24.482
On the other hand, when we add the squares of each error term,
00:05:24.482 --> 00:05:30.865
because all squared terms will be non-negative, we won’t have to worry about such a problem.
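The cancellation problem can be seen in a small Python sketch. The five data points and the deliberately poor line below are assumptions chosen for illustration, not values from the figure.

```python
# Hypothetical five dots (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# A deliberately poor line, y = 7 - x, that slopes the wrong way.
a, b = 7.0, -1.0

# Error for each dot: actual value minus the value predicted by the line.
errors = [y - (a + b * x) for x, y in zip(xs, ys)]

sum_of_errors = sum(errors)                           # signs cancel
sum_of_squared_errors = sum(e ** 2 for e in errors)   # nothing cancels

print(errors)
print(sum_of_errors, sum_of_squared_errors)
```

Here the errors are -4, -1, 1, 1, and 3, so they add up to zero even though the line fits badly, while the sum of squared errors is 28 and reveals the poor fit.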
00:05:30.865 --> 00:05:33.206
When we have just a few data points,
00:05:33.206 --> 00:05:38.360
it is actually not that hard to do the linear regression by hand using simple calculus.
00:05:38.360 --> 00:05:41.373
But when we have hundreds or thousands of data points,
00:05:41.373 --> 00:05:45.851
it's probably better to let a computer do the computation.
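As a rough sketch of what the computer does: setting the derivatives of the sum of squared errors with respect to a and b to zero gives the standard closed-form solution, where b is the sum of (x - mean of x) times (y - mean of y) divided by the sum of (x - mean of x) squared, and a is the mean of y minus b times the mean of x. The Python below applies that formula; the function name fit_line and the five data points are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors for y = a + b*x.

    Assumes the x values are not all identical, so the denominator is nonzero.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b

# Hypothetical five dots (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)
print(a, b)  # roughly 2.2 and 0.6 for these made-up points
```

The same loop works unchanged for hundreds or thousands of points, which is why it makes sense to hand the arithmetic to a program.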
00:05:45.851 --> 00:05:51.205
After this video, we will see how we can use Microsoft Excel to do the linear regression
00:05:51.205 --> 00:05:54.388
and find the best fitting line for our data points.