Linear Regression

What is regression?

Regression refers to a collection of techniques for modeling one variable (the dependent variable, or DV) as a function of other variables (the independent variables, or IVs). Different regression techniques should be applied for different types of DVs. If the DV is a dichotomy (like living vs. dead), then the most common method is logistic regression. If the DV has multiple categories (e.g. Republican, Democrat, Independent), then the usual method is either multinomial or ordinal logistic regression. If the DV is a count (such as the number of times something happens), then there are Poisson regression and negative binomial regression. If the DV is a time to an event (such as time to death), then there is a range of techniques known as survival analysis. There are other varieties too. But the most common type of DV is one that is continuous, or nearly so, such as weight, IQ, or income.
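To make the taxonomy concrete, here is a minimal sketch, assuming the Python statsmodels library; the model classes (Logit, MNLogit, Poisson, NegativeBinomial, OLS) are real, but choose_model is a hypothetical helper invented for illustration, and y and X stand for whatever data you have.

```python
# A hypothetical helper mapping the type of DV to a statsmodels model class.
import statsmodels.api as sm

def choose_model(dv_type, y, X):
    models = {
        "dichotomous": sm.Logit,    # e.g. living vs. dead
        "categorical": sm.MNLogit,  # e.g. Republican / Democrat / Independent
        "count":       sm.Poisson,  # counts (sm.NegativeBinomial is the alternative)
        "continuous":  sm.OLS,      # weight, IQ, income, ...
    }
    return models[dv_type](y, X)    # fit later with .fit()
```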

What is linear regression?

When the DV is a continuous variable, or nearly continuous, then by far the most common regression technique is linear regression, almost always fit by ordinary least squares (OLS).

In fact, if you see the word “regression” used in a statistical context with nothing further specified, you can usually be sure that it is linear regression. However, linear regression, like many statistical techniques, is often applied inappropriately. In these notes, I will first explain what linear regression is for, then show how it works and explain when it is appropriate and when it is not. I will not cover how to actually run a regression, but will give references to works that explain it.

As noted, linear regression is a technique for modeling a continuous, or nearly continuous, DV as a function of one or more IVs. A variable is continuous when it can take on any value in a specified range; it is nearly continuous when it can take on a great many values. Thus, height is a continuous variable, because adult humans can take on any height between (say) 3 feet and 8 feet. IQ is a nearly continuous variable because, while it can’t be fractional, it can be any integer between (say) 40 and 200. The DV is the variable that you wish to explain, model, or explore. The IVs are the variables that you think will help explain the DV. Note that while the DV must be continuous or nearly so, the IVs need not be: methods exist for using IVs that are dichotomous, categorical, or continuous.

Also note that we do not expect, or even desire, our model to be exactly correct. A famous statistician (George Box) once noted that ‘all models are wrong, but some models are useful’, and Borges wrote a short story (“On Exactitude in Science”) about cartographers who drew a perfectly accurate map of an empire, but it was as big as the empire itself.
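As a sketch of mixing IV types, assuming Python with pandas and statsmodels, here is a model of a continuous DV from one continuous, one dichotomous, and one categorical IV; the data are simulated and the variable names are invented.

```python
# Simulated example: a continuous DV modeled from continuous, dichotomous,
# and categorical IVs via the statsmodels formula interface.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "years_education": rng.normal(14, 2, n),             # continuous IV
    "union_member":    rng.integers(0, 2, n),            # dichotomous IV
    "region":          rng.choice(["NE", "S", "W"], n),  # categorical IV
})
df["income"] = (20 + 2.5 * df["years_education"]
                + 5 * df["union_member"]
                + rng.normal(0, 5, n))                   # continuous DV

# C(...) expands the categorical IV into dummy variables automatically.
model = smf.ols("income ~ years_education + union_member + C(region)", data=df).fit()
print(model.params)
```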

What is simple linear regression?

The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV. Here is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on long ago.)

It is traditional to put the IV along the X axis and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
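Here is a minimal sketch of that convention, assuming Python with numpy and matplotlib; the data are simulated, not the actual Brooklyn sample.

```python
# Simulated height/weight data, plotted with the IV (weight) on the X axis
# and the DV (height) on the Y axis, per the convention described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
weight = rng.normal(150, 25, 100)                    # pounds (the IV)
height = 55 + 0.08 * weight + rng.normal(0, 2, 100)  # inches (the DV)

plt.scatter(weight, height)
plt.xlabel("weight (lb)")  # IV on X
plt.ylabel("height (in)")  # DV on Y
plt.show()
```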

There are various ways to model this relationship, and these can be represented by various lines. OLS regression assumes that the relationship is linear; that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented by an equation of the form y = a + bx. Here, y is height, x is weight, and a and b are parameters that we attempt to estimate (hence, simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them, or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points, so we need a way to say how badly a line misses them. The most common way is ordinary least squares (OLS), which uses the sum of the squared vertical distances from the points to the line.
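Here is a sketch of estimating a and b with the standard closed-form OLS formulas: b is the covariance of x and y divided by the variance of x, and a follows from the means. The simulated data from the previous sketch are regenerated so the block runs on its own.

```python
# Closed-form OLS estimates of the intercept a and slope b:
# b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
import numpy as np

rng = np.random.default_rng(1)
weight = rng.normal(150, 25, 100)
height = 55 + 0.08 * weight + rng.normal(0, 2, 100)

b = np.cov(weight, height)[0, 1] / np.var(weight, ddof=1)  # slope
a = height.mean() - b * weight.mean()                      # intercept
print(f"height = {a:.1f} + {b:.3f} * weight")
```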

This seems, at first, needlessly complex: why squared? If one simply summed the distances, some would be negative (i.e., the point is below the line) and some positive (i.e., the point is above the line), and the total could be made 0 by many different lines (any line through the point of means), so it would not pick out a unique line. By squaring the distances, all are positive. One could, instead, take the sum of the absolute values of the distances; this is, in fact, a good method. The reason it isn’t used is historical and technical: least squares has a simple closed-form solution, while the sum of absolute values must be minimized iteratively, which was impractical before computers. Another way of thinking about this is to imagine that the scatterplot is a piece of wood, with each point a nail sticking up from it. If we then took a rigid rod and tied a rubber band from each nail to the rod, the rod would settle along the least squares line, because the energy stored in a stretched rubber band grows (roughly) with the square of its stretch. In the figure, this is the black line. The red line is a nonparametric curve, which simply attempts to get close to the points without being too bumpy and without assuming any particular functional form.
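As a sketch of that technical point, assuming Python with numpy and scipy and simulated data: the OLS line drops out of a closed-form formula, while the least-absolute-deviations line has to be found by a general-purpose iterative optimizer.

```python
# Contrast the two criteria: OLS has a closed form; least absolute
# deviations (LAD) is minimized numerically with a derivative-free optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 3 + 0.5 * x + rng.normal(0, 1, 50)

# OLS: closed form, no iteration.
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
a_ols = y.mean() - b_ols * x.mean()

# LAD: sum of absolute vertical distances, minimized iteratively.
def lad_loss(params):
    a, b = params
    return np.abs(y - (a + b * x)).sum()

a_lad, b_lad = minimize(lad_loss, x0=[a_ols, b_ols], method="Nelder-Mead").x
print(f"OLS: a={a_ols:.2f}, b={b_ols:.2f}   LAD: a={a_lad:.2f}, b={b_lad:.2f}")
```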

What can go wrong in simple linear regression?

F. J. Anscombe came up with four sets of data (now known as Anscombe’s quartet), each of which he fit with simple linear regression; each has the same slope (b), the same intercept (a), the same correlation, and nearly identical means and variances. But three of them show things that can go wrong in simple linear regression. Here’s the plot. The graph in the upper left is fine. The one in the upper right shows a nonlinear relationship, which a straight line fits very badly; simple linear regression is a bad choice here. The ones on the bottom left and bottom right show the effects of outliers, or extreme points: in the bottom left, a single outlier pulls the line away from an otherwise perfectly linear pattern, and in the bottom right, a single extreme point creates the appearance of a relationship that the rest of the data do not support.
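You can verify the identical fits yourself; here is a short sketch assuming seaborn is installed (its load_dataset call fetches Anscombe’s quartet from the seaborn-data repository, so it needs network access the first time).

```python
# Fit a straight line to each of Anscombe's four datasets; all four give
# (to two decimal places) the same intercept and slope, despite looking
# completely different when plotted.
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    b, a = np.polyfit(group["x"], group["y"], 1)  # slope first, then intercept
    print(f"dataset {name}: y = {a:.2f} + {b:.2f}x")
```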

There are other potential problems, too: for example, the spread of the points around the line may not be constant, or the observations may not be independent of one another. No statistical method should be applied without knowing its assumptions and what happens when they are violated.