Linear Regression, at last!
For Day 11, I waited many days to present this model. It marks the start of a new journey in this "Advent Calendar".
Until now, we mostly looked at models based on distances, neighbors, or local density. As you may know, for tabular data, decision trees, and especially ensembles of decision trees, are very performant.
But starting today, we switch to another perspective: the weighted approach.
Linear Regression is our first step into this world.
It looks simple, but it introduces the core components of modern ML: loss functions, gradients, optimization, scaling, collinearity, and the interpretation of coefficients.
Now, when I say Linear Regression, I mean Ordinary Least Squares Linear Regression. As we progress through this "Advent Calendar" and explore related models, you will see why it is important to specify this, because the name "linear regression" can be confusing.
Some people say that Linear Regression is not machine learning.
Their argument is that machine learning is a "new" field, whereas Linear Regression existed long before, so it cannot be considered ML.
That is misleading.
Linear Regression fits perfectly within machine learning because:
- it learns parameters from data,
- it minimizes a loss function,
- it makes predictions on new data.
In other words, Linear Regression is one of the oldest models, but also one of the most fundamental in machine learning.
This is the approach used in:
- Linear Regression,
- Logistic Regression,
- and, later, Neural Networks and LLMs.
For deep learning, this weighted, gradient-based approach is the one used everywhere.
And in modern LLMs, we are no longer talking about a few parameters. We are talking about billions of weights.
In this article, our Linear Regression model has exactly 2 weights.
A slope and an intercept.
That’s all.
But we have to start somewhere, right?
And here are a few questions you can keep in mind as we progress through this article, and in the ones to come.
- We will try to interpret the model. With one feature, y = ax + b, everyone knows that a is the slope and b is the intercept. But how do we interpret the coefficients when there are 10, 100 or more features?
- Why is collinearity between features such a problem for linear regression? And how can we solve this issue?
- Is scaling important for linear regression?
- Can Linear Regression overfit?
- And how are the other models of this weighted family (Logistic Regression, SVM, Neural Networks, Ridge, Lasso, and so on) all connected to the same underlying ideas?
These questions form the thread of this article and will naturally lead us toward future topics in the "Advent Calendar".
Understanding the Trend Line in Excel
Starting with a Simple Dataset
Let us begin with a very simple dataset that I generated, with one feature.
In the graph below, you can see the feature variable x on the horizontal axis and the target variable y on the vertical axis.
The goal of Linear Regression is to find two numbers, a and b, such that we can write the relationship:
y = a·x + b
Once we know a and b, this equation becomes our model.
Creating the Trend Line in Excel
In Google Sheets or Excel, you can simply add a trend line to visualize the best linear fit.
That already gives you the result of Linear Regression.
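If you prefer to follow along in code rather than in a spreadsheet, here is a minimal sketch (with made-up data standing in for the dataset above) showing that a one-line NumPy fit returns the same slope and intercept as the trend line:

```python
import numpy as np

# Made-up data standing in for the spreadsheet columns x and y
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

# A degree-1 polynomial fit is exactly the Ordinary Least Squares line
a, b = np.polyfit(x, y, deg=1)
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
```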

But the goal of this article is to compute these coefficients ourselves.
If we want to use the model to make predictions, we need to implement it directly.

Introducing Weights and the Cost Function
A Note on Weight-Based Models
This is the first time in the Advent Calendar that we introduce weights.
Models that learn weights are often called parametric discriminant models.
Why discriminant?
Because they learn a rule that directly separates or predicts, without modeling how the data was generated.
Before this chapter, we already saw models that had parameters, but they were not discriminant, they were generative.
Let us recap quickly.
- Decision Trees use splits, or rules, so there are no weights to learn. They are non-parametric models.
- k-NN is not really a model. It keeps the whole dataset and uses distances at prediction time.
However, when we move from Euclidean distance to Mahalanobis distance, something interesting happens…
LDA and QDA do estimate parameters:
- the means of each class,
- the covariance matrices,
- the priors.
These are real parameters, but they are not weights.
These models are generative because they model the density of each class, and then use it to make predictions.
So even though they are parametric, they do not belong to the weight-based family.
And as you can see, these are all classifiers, and they estimate parameters for each class.

Linear Regression is our first example of a model that learns weights to build a prediction.
This is the start of a new family in the Advent Calendar:
models that rely on weights + a loss function to make predictions.
The Cost Function
How do we obtain the parameters a and b?
Well, the optimal values for a and b are those that minimize the cost function, which is the Squared Error of the model.
So for each data point, we can calculate the Squared Error:
Squared Error = (prediction − real value)² = (a·x + b − real value)²
Then we can calculate the MSE, or Mean Squared Error.
As we can see in Excel, the trend line gives us the optimal coefficients. If you manually change these values, even slightly, the MSE will increase.
That is exactly what "optimal" means here: any other combination of a and b makes the error worse.
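As a quick sanity check outside the spreadsheet, here is a small sketch (reusing the made-up data from above) that computes the MSE for the fitted pair (a, b) and for a slightly perturbed slope:

```python
import numpy as np

def mse(a, b, x, y):
    """Mean Squared Error of the line y_hat = a*x + b on the data (x, y)."""
    y_hat = a * x + b
    return np.mean((y_hat - y) ** 2)

# Same made-up data as in the earlier sketch
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

a_opt, b_opt = np.polyfit(x, y, deg=1)
print("MSE at the optimum:           ", mse(a_opt, b_opt, x, y))
print("MSE with a slightly-off slope:", mse(a_opt + 0.05, b_opt, x, y))  # always larger
```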

The classic closed-form solution
Now that we know what the model is, and what it means to minimize the squared error, we can finally answer the key question:
How do we compute the two coefficients of Linear Regression, the slope a and the intercept b?
There are two ways to do it:
- the exact algebraic solution, known as the closed-form solution,
- or gradient descent, which we will explore just after.
If we take the definition of the MSE and differentiate it with respect to a and b, something surprising happens: everything simplifies into two very compact formulas.

These formulas only use:
- the average of x and y,
- how x varies (its variance),
- and how x and y vary together (their covariance).
So even without knowing any calculus, and with only basic spreadsheet functions, we can reproduce the exact solution used in statistics textbooks.
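In code, that closed form reads: the slope is the covariance of x and y divided by the variance of x, and the intercept places the line through the point of averages. A minimal sketch, still on the made-up data from above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

# Closed-form OLS for one feature:
#   a = Cov(x, y) / Var(x)
#   b = mean(y) - a * mean(x)
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean
print(f"a = {a:.3f}, b = {b:.3f}")  # matches the trend line / np.polyfit result
```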
How to interpret the coefficients
For one feature, the interpretation is simple and intuitive:
The slope a
It tells us how much y changes when x increases by one unit.
If the slope is 1.2, it means:
"when x goes up by 1, the model expects y to go up by about 1.2."
The intercept b
It is the predicted value of y when x = 0.
Often, x = 0 does not exist in the real context of the data, so the intercept is not always meaningful on its own.
Its role is mostly to position the line correctly so that it passes through the center of the data.
This is usually how Linear Regression is taught:
a slope, an intercept, and a straight line.
With one feature, interpretation is easy.
With two, still manageable.
But as soon as we start adding many features, it becomes harder.
Tomorrow, we will discuss interpretation further.
Today, we will do gradient descent.
Gradient Descent, Step by Step
After seeing the classic algebraic solution for Linear Regression, we can now explore the other essential tool behind modern machine learning: optimization.
The workhorse of optimization is Gradient Descent.
Understanding it on a very simple example makes the logic much clearer once we apply it to Linear Regression.
A Gentle Warm-Up: Gradient Descent on a Single Variable
Before implementing gradient descent for Linear Regression, we can first do it for a simple function: (x − 2)².
Everyone knows the minimum is at x = 2.
But let us pretend we do not know that, and let the algorithm discover it on its own.
The idea is to find the minimum of this function using the following process:
- First, we randomly choose an initial value.
- Then, at each step, we calculate the value of the derivative function df at the current x value: df(x).
- And the next value of x is obtained by subtracting the derivative multiplied by a step size: x = x − step_size·df(x).
You can adjust the two parameters of the gradient descent: the initial value of x and the step size.
Yes, it converges even with an initial value of 100, or 1000. It is quite surprising to see how well it works.

However, in some cases, gradient descent will not work. For example, if the step size is too big, the x value can explode.
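Here is a minimal Python sketch of this warm-up (the article itself does it in a spreadsheet); the initial value and step size are arbitrary choices you can play with:

```python
def df(x):
    """Derivative of f(x) = (x - 2)**2."""
    return 2 * (x - 2)

x = 100.0        # initial value, deliberately far from the minimum
step_size = 0.1  # try 1.1 instead and watch x explode rather than converge

for _ in range(50):
    x = x - step_size * df(x)

print(x)  # very close to 2
```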

Gradient Descent for Linear Regression
The principle of the gradient descent algorithm is the same for linear regression: we have to calculate the partial derivatives of the cost function with respect to the parameters a and b. Let's denote them da and db.
Squared Error = (prediction − real value)² = (a·x + b − real value)²
da = 2·(a·x + b − real value)·x
db = 2·(a·x + b − real value)

And then, we can update the coefficients.

With this tiny update, step by step, the optimal values will be found after a few iterations.
In the following graph, you can see how a and b converge towards their target values.

We can also see all the details: y hat, the residuals, and the partial derivatives.
We can fully appreciate the beauty of gradient descent, visualized in Excel.
For these two coefficients, we can observe how fast the convergence is.

Now, in practice, we have many observations, and this should be done for each data point. That is where things get crazy in Google Sheets. So, we use only 10 data points.
You will see that I first created a sheet with long formulas to calculate da and db, which contain the sum of the derivatives over all the observations. Then I created another sheet to show all the details.
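For readers who prefer code, here is a minimal sketch of the same full-batch loop on the made-up 10-point dataset used earlier (the starting point, learning rate, and number of iterations are arbitrary choices): da and db sum the per-point derivatives given above, then a and b take a small step in the opposite direction.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

a, b = 0.0, 0.0    # arbitrary starting point
step_size = 0.001  # learning rate

for _ in range(2000):
    error = a * x + b - y        # prediction minus real value, for every point
    da = np.sum(2 * error * x)   # sum of the per-point derivatives w.r.t. a
    db = np.sum(2 * error)       # sum of the per-point derivatives w.r.t. b
    a -= step_size * da
    b -= step_size * db

print(f"a = {a:.3f}, b = {b:.3f}")  # close to the closed-form solution
```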
Categorical Features in Linear Regression
Before concluding, there is one last important idea to introduce:
how a weight-based model like Linear Regression handles categorical features.
This topic is essential because it shows a fundamental difference between the models we studied earlier (like k-NN) and the weighted models we are entering now.
Why distance-based models struggle with categories
In the first part of this Advent Calendar, we used distance-based models such as k-NN, DBSCAN, and LOF.
But these models rely entirely on measuring distances between points.
For categorical features, this becomes impossible:
- a category encoded as 0 or 1 has no quantitative meaning,
- the numerical scale is arbitrary,
- Euclidean distance cannot capture category differences.
This is why k-NN cannot handle categories correctly without heavy preprocessing.
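As a tiny illustration of that arbitrariness (with hypothetical label encodings made up for this example), simply relabeling the categories changes which ones look "close":

```python
# Hypothetical label encoding: red=0, blue=1, green=2
red, blue, green = 0, 1, 2
print(abs(red - blue), abs(red - green))  # 1 2 -> "red looks closer to blue"

# Same categories, another equally valid encoding: green=0, red=1, blue=2
green, red, blue = 0, 1, 2
print(abs(red - blue), abs(red - green))  # 1 1 -> the "distances" have changed
```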
Weight-based models solve the problem differently
Linear Regression does not compare distances.
It learns weights.
To include a categorical variable in a weight-based model, we use one-hot encoding, the most common approach.
Each category becomes its own feature, and the model simply learns one weight per category.
Why this works so well
Once encoded:
- the scale problem disappears (everything is 0 or 1),
- each category receives an interpretable weight,
- the model can adjust its prediction depending on the group.
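As a short sketch (with a hypothetical color feature; pandas.get_dummies is just one common way to do the encoding):

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

# One-hot encoding: each category becomes its own 0/1 column,
# and a linear model then learns one weight per column
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
```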
A simple two-category example
When there are only two categories (0 and 1), the model becomes very simple:
- one value is used when x = 0,
- another when x = 1.
One-hot encoding is not even necessary:
the numeric encoding already works, because Linear Regression will learn the right difference between the two groups.

Gradient Descent still works
Even with categorical features, Gradient Descent works exactly as usual.
The algorithm only manipulates numbers, so the update rules for a and b are identical.
In the spreadsheet, you can see the parameters converge smoothly, just like with numerical data.
However, in this particular two-category case, we also know that a closed-form formula exists: Linear Regression essentially computes two group averages and the difference between them.
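A minimal check of that claim, with made-up values for the two groups: the fitted intercept should match the mean of group 0, and the slope should match the difference between the two group means.

```python
import numpy as np

# Made-up data: a binary feature and a numeric target
x = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([2.0, 2.5, 3.0, 5.0, 5.5, 6.0])

a, b = np.polyfit(x, y, deg=1)
print(a, b)                                  # slope and intercept
print(y[x == 1].mean() - y[x == 0].mean())   # difference of group means = slope
print(y[x == 0].mean())                      # mean of group 0 = intercept
```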

Conclusion
Linear Regression may look simple, but it introduces almost everything that modern machine learning relies on.
With just two parameters, a slope and an intercept, it teaches us:
- how to define a cost function,
- how to find the optimal parameters numerically,
- and how optimization behaves when we adjust learning rates or initial values.
The closed-form solution shows the elegance of the mathematics.
Gradient Descent shows the mechanics behind the scenes.
Together, they form the foundation of the "weighted + loss function" family that includes Logistic Regression, SVM, Neural Networks, and even today's LLMs.
New Paths Forward
You might think Linear Regression is simple, but with its foundations now clear, you can extend it, refine it, and reinterpret it through many different perspectives:
- Change the loss function
Replace squared error with logistic loss, hinge loss, or other functions, and new models appear.
- Move to classification
Linear Regression itself can separate two classes (0 and 1), but more robust versions lead to Logistic Regression and SVM. And what about multiclass classification?
- Model nonlinearity
Through polynomial features or kernels, linear models suddenly become nonlinear in the original space.
- Scale to many features
Interpretation becomes harder, regularization becomes essential, and new numerical challenges appear.
- Primal vs dual
Linear models can be written in two ways. The primal view learns the weights directly. The dual view rewrites everything using dot products between data points.
- Understand modern ML
Gradient Descent, and its variants, are the core of neural networks and large language models.
What we learned here with two parameters generalizes to billions.
Everything in this article stays within the boundaries of Linear Regression, yet it prepares the ground for a whole family of future models.
Day after day, the Advent Calendar will show how all these ideas connect.
