Explaining linear regression using the ordinary least squares method appears to be something of a rite of passage in data science, judging by the number of entries one can find on the web. It has much the same feel as the first “Hello World!” script in the latest programming language. (I have put a list of the entries I found on the topic below the sources for this article.)
Whenever you want to describe the relationship between one variable and another (or several others), and subsequently predict new outcomes, linear regression is one of the most important tools available.
Linear Regression Formula and some basic terminology
Graphically, the variable that we want to predict (the dependent variable), called y, is plotted against the independent variable, x, in the simplest case. We then try to predict other values using a standard formula that looks very similar to one most of us learnt in high school:

$$y = \beta_0 + \beta_1 x$$

This is, of course, the equation of a straight line. The letters may change, but the form is always the same. Back in the day, I learnt it as y = m*x + c, for example. The key unknowns here are the two so-called coefficients, denoted by the Greek letter β. As you can already guess, β0 is the y-intercept and β1 is the slope of the line. The coefficients are unknown at this point, and the task is to determine values for them.
Visualization and further nomenclature
So, let’s take a handful of data points and plot them on a standard set of axes.
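|   | x    | y    |
|---|------|------|
| 0 | 1.5  | 0.9  |
| 1 | 1.7  | 1.9  |
| 2 | 2.3  | 3.5  |
| 3 | 5.0  | 4.2  |
| 4 | 5.1  | 6.3  |
| 5 | 4.9  | 4.8  |
| 6 | 7.3  | 7.3  |
| 7 | 8.2  | 8.9  |
| 8 | 9.9  | 8.9  |
| 9 | 10.1 | 10.5 |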
which gives us the following graph:

[Figure: scatter plot of y against x for the ten data points]
Please note that the straight-line equation above only allows for one independent variable. However, the approach can easily be extended to several variables by adding one term for each additional variable. Unsurprisingly, this adjusts the formula to

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
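To make the formula a little more tangible, here is a minimal sketch in Python. The function name `predict` and all the numbers in it are made up purely for illustration; they are not fitted to the data in this article.

```python
def predict(features, intercept, coefficients):
    """Return beta_0 + beta_1*x_1 + ... + beta_p*x_p for one observation."""
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# One independent variable: a straight line with intercept 1.0 and slope 2.0
# (hypothetical values, chosen only to keep the arithmetic easy to follow).
single = predict([3.0], intercept=1.0, coefficients=[2.0])

# Two independent variables: the same formula with one extra term.
double = predict([3.0, 4.0], intercept=1.0, coefficients=[2.0, 0.5])

print(single, double)
```

The point is simply that every additional variable contributes one more "coefficient times x" term, while the intercept stays a single number.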
Determination of the coefficients using the Ordinary Least Squares (OLS) Method
Since we have a set of x and y values, our task is to find the values of the coefficients, which is where the ordinary least squares method comes into play.
Before we get into the nitty-gritty of that, we first have to acknowledge that even the most diligent measurements will not always be 100% accurate. Once we enter the realm of everyday life, that becomes even less likely. As a consequence, there will always be a certain error value, e, involved in drawing a best-fit straight line. Even in the above graph with very few points, it is impossible to draw a straight line that goes through all of them.
As a result, there is an error associated with each data point. The error, e, is also referred to as a ‘residual’. We get the residual from the difference between the measured y and the predicted y, also known as ŷ (y-hat; the hat signifies a predicted value):

$$e_i = y_i - \hat{y}_i$$
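As a rough illustration, the residuals for the ten data points from the table can be computed in a few lines of Python. The line used here (intercept 0, slope 1) is only a hand-picked guess of mine and not the fitted line, and the variable names are my own.

```python
# Data from the table earlier in the article.
x = [1.5, 1.7, 2.3, 5.0, 5.1, 4.9, 7.3, 8.2, 9.9, 10.1]
y = [0.9, 1.9, 3.5, 4.2, 6.3, 4.8, 7.3, 8.9, 8.9, 10.5]

# A hand-picked candidate line (an assumption for illustration, not fitted).
beta0_guess, beta1_guess = 0.0, 1.0

# Predicted values (y-hat) and residuals (measured minus predicted).
y_hat = [beta0_guess + beta1_guess * xi for xi in x]
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

for xi, yi, ei in zip(x, y, residuals):
    print(f"x={xi:5.1f}  y={yi:5.1f}  residual={ei:+.2f}")
```

Some residuals come out negative and some positive, depending on whether the point lies below or above the candidate line.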
Residual Sum of Squares, RSS
Why am I telling you this instead of just sticking with the far more understandable word, error? Because it makes it easier to explain the term ‘Residual’ Sum of Squares, or RSS. You see, the square of the residual is a very convenient measure of how far the measured point and the line are away from each other. Why the square? Because it stops negative and positive residuals from cancelling each other out when we add them up, and it is possible to have both negative and positive residuals (if the predicted y value is larger or smaller than the measured value, respectively).
So, since we want every squared residual to be as small as possible, we actually want the sum of all squared residuals to be as small as possible. In other words, we want the Residual Sum of Squares to be minimized. Therefore, the coefficients that give the smallest RSS also give us the best-fit y values, ŷ, and with them the equation of the best-fit trendline, which in this case corresponds to:

$$\hat{y} \approx 0.9674x + 0.3024$$
…and I leave it up to you now to identify β0 and β1.
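In mathematical terms, the RSS looks as follows:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$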
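A small Python sketch may help to show that the RSS is really just a number that depends on the coefficients we choose. The helper name `rss` and the two candidate lines are my own illustrative choices; neither line is the least squares solution.

```python
# Data from the table earlier in the article.
x = [1.5, 1.7, 2.3, 5.0, 5.1, 4.9, 7.3, 8.2, 9.9, 10.1]
y = [0.9, 1.9, 3.5, 4.2, 6.3, 4.8, 7.3, 8.9, 8.9, 10.5]


def rss(beta0, beta1):
    """Residual Sum of Squares of the line y_hat = beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Two hand-picked candidate lines (illustrative guesses, not fitted values).
rss_steep = rss(0.0, 1.0)  # y = x
rss_flat = rss(0.0, 0.5)   # y = 0.5 * x, a much flatter line

print(f"RSS for y = x:       {rss_steep:.2f}")
print(f"RSS for y = 0.5 * x: {rss_flat:.2f}")
```

The line with the smaller RSS hugs the points more closely. Ordinary least squares looks for the one pair of coefficients for which this number cannot be made any smaller.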
The exact calculations for the coefficients are outside the scope of this article. However, for those of you who are undeterred by this, I would point you to the book An Introduction to Statistical Learning by G. James et al. It is a truly outstanding book when it comes to explaining mathematical approaches.
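For the curious, here is a minimal sketch of what such a calculation can look like for a single independent variable. It assumes the standard textbook closed-form result (slope = sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations; intercept = ȳ minus slope times x̄), which this article does not derive. The function name `fit_simple_ols` is my own.

```python
# Data from the table earlier in the article.
x = [1.5, 1.7, 2.3, 5.0, 5.1, 4.9, 7.3, 8.2, 9.9, 10.1]
y = [0.9, 1.9, 3.5, 4.2, 6.3, 4.8, 7.3, 8.9, 8.9, 10.5]


def fit_simple_ols(x, y):
    """Return (beta0, beta1) minimizing the RSS for one independent variable."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx               # slope
    beta0 = y_mean - beta1 * x_mean   # intercept
    return beta0, beta1


beta0, beta1 = fit_simple_ols(x, y)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

In practice you would normally let a library do this for you, but with only ten points the arithmetic is small enough to follow by hand.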
Total Sum of Squares, TSS
Now, we need to look at something called the Total Sum of Squares, TSS. It is determined by subtracting the mean value of y, ȳ, from the ith y value, squaring the result, and summing over all values, which gives us the formula below.

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Looking closely at this formula, what you are actually measuring is the sum of the squared differences between each measured y value and the average (or mean) y value. This is referred to as the total variability in y before the regression was performed, i.e. before the line was drawn.
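As a quick sketch, the TSS for the ten y values from the table can be computed like this (the variable names are my own):

```python
# Measured y values from the table earlier in the article.
y = [0.9, 1.9, 3.5, 4.2, 6.3, 4.8, 7.3, 8.9, 8.9, 10.5]

# Mean of the measured values (y-bar).
y_mean = sum(y) / len(y)

# Total Sum of Squares: squared distance of every y value from the mean.
tss = sum((yi - y_mean) ** 2 for yi in y)

print(f"mean of y = {y_mean:.2f}, TSS = {tss:.3f}")
```

Note that x does not appear anywhere in this calculation. The TSS only describes how much the y values vary on their own.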
Now, let’s think about how we can put RSS and TSS in context. Using RSS, we essentially measure the error (or variation) that cannot be explained by the regression. If we could explain it, the line would go straight through every data point.
So what do you think you measure when subtracting RSS from TSS? Correct! It is the variation in the measurements that we can explain through the regression.
Finally, combining RSS and TSS one more time, we can measure how much of the variability in the data can be explained by all the x terms that we have.

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
So, R² actually tells you how well your fitted line maps onto the observed results. The smaller the RSS, the smaller RSS/TSS and hence the closer the R² value is to 1.
For the data above, the R² value comes to 0.9419, which means the line explains about 94% of the variability in y, and we could use this graph to make some semi-decent predictions.
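To check this number, here is a minimal end-to-end sketch in plain Python. It assumes the standard closed-form least squares solution for a single independent variable, which is not derived in this article, and all variable names are my own.

```python
# Data from the table earlier in the article.
x = [1.5, 1.7, 2.3, 5.0, 5.1, 4.9, 7.3, 8.2, 9.9, 10.1]
y = [0.9, 1.9, 3.5, 4.2, 6.3, 4.8, 7.3, 8.9, 8.9, 10.5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares coefficients for one independent variable (closed form).
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
beta1 = s_xy / s_xx
beta0 = y_mean - beta1 * x_mean

# Predictions from the fitted line.
y_hat = [beta0 + beta1 * xi for xi in x]

rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # unexplained
tss = sum((yi - y_mean) ** 2 for yi in y)                # total
explained = tss - rss                                    # explained by the line
r_squared = 1 - rss / tss

print(f"RSS = {rss:.3f}, TSS = {tss:.3f}, explained = {explained:.3f}")
print(f"R^2 = {r_squared:.4f}")
```

Rounded to four decimal places, this gives the same R² value as quoted above.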
This was a very brief and superficial look into linear regression using the ordinary least squares method. I hope it was helpful. Let me know what you think in the comments section below.
Sources
- ISLR 7th Edition – Excellent Book, simply excellent
- Introduction to Machine Learning with Python – My Bible
- Hands-On Machine Learning with Scikit-Learn & TensorFlow – Fun Book with great insights
Other explanations and summaries found on the web
- Short and Sweet Minimalistic Summary as a YouTube Video
- Data Genetics – Linear Regression
- Eli Bendersky on Linear Regression
- The Win-Vector Blog on Linear Regression
- Analytics Vidhya on Regression Techniques
- Clever Tap’s Brief Primer on Linear Regression
- Yhat Blog’s closer exploration on Linear Regression