What is the residual sum of squares?

1 Answer
Dec 5, 2016

It's the variation remaining in the data after all the explainable sources of variation have been accounted for.

Explanation:

All data sets have what's known as a "total sum of squares" (or perhaps a "corrected total sum of squares"), which is usually denoted something like SS_"Total" or SS_T. The uncorrected version is the grand sum of all the squared data values; the corrected version subtracts a "mean"-based correction factor, which makes it equal to the sum of squared deviations of each observation from the grand mean. Either way, SS_T quantifies the total amount of variation in a data set.
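
To make that concrete, here is a minimal sketch (plain Python, made-up numbers; not part of the original answer) showing both forms of SS_T:

```python
# A minimal sketch (made-up numbers) of the total sum of squares.
data = [4.0, 7.0, 6.0, 5.0, 8.0]
n = len(data)
mean = sum(data) / n

# Uncorrected SS_T: the grand sum of all the squared data values.
ss_uncorrected = sum(y ** 2 for y in data)

# Corrected SS_T: subtract the mean-based correction factor (sum y)^2 / n,
# which is the same as summing squared deviations from the mean.
ss_corrected = ss_uncorrected - sum(data) ** 2 / n
ss_deviations = sum((y - mean) ** 2 for y in data)

print(ss_corrected, ss_deviations)  # both print 10.0
```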

Depending on the model, SS_T can be split into other sums of squares: the sources that attempt to explain where all that variation in SS_T comes from. These sources may be:

  • regression (line slopes, like how a server's tips increase with the price of a meal), denoted SS_R;
  • main effects (category averages, like how women tip more than men, female servers get more tips than male servers, etc.), denoted SS_A, SS_B, etc. (a quick sketch of this one follows the list);
  • interaction effects between two explanatory variables (like how men tip more than women if their server is female), denoted SS_(AB);
  • lack of fit (assessed using repeated observations taken at identical settings of the explanatory variables, like a customer dining at the same restaurant twice with the same server), denoted SS_"LOF";
  • and many others.
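
Here is the promised sketch of a main-effect sum of squares (plain Python, made-up tip amounts): it is the squared deviation of each category average from the grand mean, weighted by the number of observations in that category:

```python
# A minimal sketch (made-up numbers): the main-effect sum of squares
# SS_A for one categorical explanatory variable (server's sex, say).
groups = {
    "female_server": [5.0, 6.0, 7.0],  # tips received by female servers
    "male_server":   [3.0, 4.0, 5.0],  # tips received by male servers
}

all_tips = [tip for tips in groups.values() for tip in tips]
grand_mean = sum(all_tips) / len(all_tips)

# Each group mean's deviation from the grand mean, squared and weighted
# by the number of observations in that group.
ss_a = sum(
    len(tips) * (sum(tips) / len(tips) - grand_mean) ** 2
    for tips in groups.values()
)

print(ss_a)  # 6.0 -- the variation attributed to this main effect
```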

Most of the time, these explainable sources do not account for all of the total variance in the data. We certainly hope they come close, but there is almost always a little bit of variance left over that has no explainable source.

This leftover bit is called the residual sum of squares or the sum of squares due to error and is usually denoted by SS_"Error" or SS_E. It's the remaining variance in the data that can't be attributed to any of the other sources in our model.

We usually write an equation like this:

SS_T=SS_"Source 1"+SS_"Source 2"+...+SS_E

It's that last term, SS_E, that contains all the variation in the data with no explainable source. It's the sum of the squared distances between each observed data point and the point the model predicts at the same values of the explanatory variables. These distances are called the residuals, hence the term "residual sum of squares". In this way, SS_E is the key ingredient for estimating sigma^2, the variance of the errors.
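
To see the whole decomposition in action, here is a minimal sketch (plain Python, made-up meal prices and tips) for the simplest case, a regression line: it fits the line by least squares, forms the residuals, and checks that SS_T = SS_R + SS_E:

```python
# A minimal sketch (made-up data): simple linear regression, where the
# decomposition is SS_T = SS_R + SS_E.
x = [10.0, 20.0, 30.0, 40.0, 50.0]  # meal prices
y = [1.5, 2.0, 3.5, 3.0, 5.0]       # tips

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

ss_t = sum((yi - y_bar) ** 2 for yi in y)          # total
ss_r = sum((pi - y_bar) ** 2 for pi in predicted)  # regression
ss_e = sum(ri ** 2 for ri in residuals)            # residual (error)

print(ss_t, ss_r + ss_e)  # 7.5 and 7.5: the decomposition checks out
```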

Note: SS_E on its own does not estimate sigma^2; we must first divide SS_E by its degrees of freedom, df_E, to get our "mean squared error":

MS_E = (SS_E)/(df_E)
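
As a quick sketch (not a full treatment of degrees of freedom): for a model with n observations and p fitted parameters, df_E = n - p, so for the regression sketch above (n = 5, and p = 2 for the intercept and slope):

```python
# A minimal sketch: converting SS_E into MS_E. For a model with p fitted
# parameters and n observations, the error degrees of freedom are n - p.
def mean_squared_error(ss_e: float, n: int, p: int) -> float:
    """MS_E = SS_E / df_E, with df_E = n - p."""
    return ss_e / (n - p)

# SS_E = 1.10 came from the regression sketch above (n = 5 observations,
# p = 2 fitted parameters: intercept and slope).
print(mean_squared_error(1.10, n=5, p=2))  # 0.3666..., estimates sigma^2
```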

Unfortunately, explaining degrees of freedom would make this answer a lot longer, so I have left it out for the sake of keeping this response (relatively) short.