Calculus for Machine Learning – Linear Regression

In this blog on Calculus for Machine Learning – Linear Regression, we will apply the multivariable calculus concepts we learned in previous parts to fit functions to datasets. This will help us evaluate data statistically.

The first step in working with data is to clean it, that is, to get it into a form we can apply mathematical concepts to. This includes removing garbage values and duplicates, as well as tasks such as reducing the data's dimensionality and grouping data values. Once the data is cleaned, we can plot it and compute averages, standard deviations, and so on to get more familiar with it.

Simple Linear Regression:

Consider a simple plot of some data below,

If we know how the data was generated or how the variables are related, we can try fitting a reasonable model based on that knowledge. Otherwise, we try fitting a model based on how the plot looks. In this case, a straight line.

We will model the data as a straight line y, a function of the observations xᵢ and a vector of fitting parameters a. For a straight line, y = mx + c, the vector a contains the gradient of the line, m, and its intercept, c.

To find the optimal values of m and c, we define a residual rᵢ: the difference between the data value yᵢ and the value predicted by the line, mxᵢ + c. That is, rᵢ = yᵢ − (mxᵢ + c).

Summing the squares of the residuals gives an overall measure of the quality of the fit, called chi-squared: χ² = Σᵢ rᵢ². Squaring treats data points above and below the line equally, and penalizes points lying far from the line heavily. The best fit is the one with the lowest chi-squared value, so we will do minimization.
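As a quick sketch, chi-squared for a candidate line is only a few lines of NumPy. The data arrays here are made-up sample values, not the blog's dataset:

```python
import numpy as np

# Hypothetical sample data (x_i, y_i); any two arrays of equal length work.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def chi_squared(m, c, x, y):
    """Sum of squared residuals r_i = y_i - (m*x_i + c)."""
    r = y - (m * x + c)
    return np.sum(r ** 2)

# Evaluate the fit quality of a candidate line y = 2x.
print(chi_squared(2.0, 0.0, x, y))
```

A lower value means a better fit; minimizing this quantity over m and c is exactly the problem the rest of the section solves.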

The above contour plot shows the chi-squared values for different m and c values. In the middle, where the intercept is zero, we get the minimum value of chi-squared. The slanted contour lines show a kind of trade-off: if the value of c increases, m decreases, and vice versa.

Plotting these contour lines takes hundreds of chi-squared computations in software such as Matlab, so we want an algorithm that can find the minimum quickly.
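A minimal sketch of that contour computation in NumPy, again with made-up sample data: broadcasting evaluates chi-squared over the whole (m, c) grid in one vectorised pass, and the grid cell with the smallest value approximates the best fit:

```python
import numpy as np

# Hypothetical sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Grid of candidate gradients and intercepts.
ms = np.linspace(0.0, 4.0, 81)
cs = np.linspace(-2.0, 2.0, 81)
M, C = np.meshgrid(ms, cs)

# Broadcast: residuals have shape (len(cs), len(ms), len(x)).
R = y - (M[..., None] * x + C[..., None])
chi2 = np.sum(R ** 2, axis=-1)

# The grid point with the lowest chi-squared approximates the best fit.
i, j = np.unravel_index(np.argmin(chi2), chi2.shape)
print(M[i, j], C[i, j])
```

`chi2` is exactly the array a contour-plotting routine would draw; the point of the next paragraphs is that we can skip the grid search entirely by using calculus.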

We now know that we reach a minimum when the gradient of chi-squared is zero. So if we set the gradient of chi-squared with respect to the fitting parameters to 0, we can reach our solution.

Differentiating chi-squared with respect to m and c and setting the derivatives to zero,

∂χ²/∂m = −2 Σᵢ xᵢ (yᵢ − m xᵢ − c) = 0
∂χ²/∂c = −2 Σᵢ (yᵢ − m xᵢ − c) = 0

From the second row, we can derive

c = ȳ − m x̄

where ȳ and x̄ are the averages of the data values.

Substituting this back and doing some algebra for m,

m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
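The closed-form results for m and c translate directly into NumPy. The data arrays here are made-up sample values:

```python
import numpy as np

# Hypothetical sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares solution from setting grad(chi-squared) = 0:
#   m = sum((x - xbar) * (y - ybar)) / sum((x - xbar)**2)
#   c = ybar - m * xbar
xbar, ybar = x.mean(), y.mean()
m = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
c = ybar - m * xbar
print(m, c)  # best-fit gradient and intercept
```

No grid search needed: two averages and two sums give the exact minimum of chi-squared.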

Now, we again plot our data.

It fits amazingly well. The gradient is 215, with an uncertainty of 9.

We should always plot our data to visually analyze it.

See the graphs below. They are Anscombe’s quartet. The four graphs have different data points but the same chi-squared values, mean values, and best-fit lines.

Only in the top-left graph does a straight-line fit seem a good option.

Now, going back to our plot, we see that the intercept c depends on the gradient m.

Another way to approach the straight-line problem is to measure deviations from the centre of mass of the data, x̄: we fit y = m(x − x̄) + b, with a new constant term b.

Sigma m and sigma b are the uncertainties in the fitted gradient and constant term.

The constant term b is now the centre of mass in y, that is, ȳ, and it no longer depends on the gradient. Also, if we now plot the contours, they are no longer slanted.
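A sketch of the recentred fit with made-up sample data: b comes out as the mean of y, independent of m, and the original intercept is recovered at the end:

```python
import numpy as np

# Hypothetical sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Recentre the model as y = m*(x - xbar) + b.  Setting grad(chi-squared) = 0
# now gives b = ybar on its own, so the two parameters decouple.
xbar = x.mean()
b = y.mean()                                    # b no longer depends on m
m = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)
print(m, b)

# The original intercept is recovered as c = b - m*xbar.
print(b - m * xbar)
```

Decoupling the parameters is what removes the slant from the contour lines: the chi-squared surface becomes axis-aligned in (m, b).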

This is the essence of fitting data to a line.

Fitting a non-linear function:

Next in Calculus for Machine Learning – Linear Regression, we will look at a more complicated function, one not as simple as y = mx + c, and see how to fit it to a model. This will give us an idea of how computers solve such problems for us.

Consider the following function of a variable x. The function has parameters aₖ, where k goes from 1 to m.

The function is not linear, so we will use nonlinear least squares to fit it.

Say we want to fit parameters ak to some data.

For every data point xᵢ, we have a value yᵢ and an uncertainty σᵢ, where i goes from 1 to n.

The more uncertain we are about yi, the greater the value of sigma i.

Next, we define the goodness of fit, chi-squared: the sum over all data points i of the squared difference between yᵢ and the model evaluated at xᵢ with parameters aₖ, each difference divided by σᵢ²:

χ² = Σᵢ [yᵢ − f(xᵢ; a)]² / σᵢ²

We divide by σᵢ² to weight each of the differences. This gives uncertain data points a low weight in our sum, so they do not affect the fit much.
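As a sketch (the function and variable names here are my own, not from the blog), the weighted chi-squared and the effect of a large σᵢ on one data point:

```python
import numpy as np

# Weighted chi-squared for an arbitrary model.  Points with a large sigma_i
# contribute little to the sum, so they barely influence the fit.
def chi_squared(params, model, x, y, sigma):
    r = y - model(x, params)
    return np.sum((r / sigma) ** 2)

# Tiny illustration: same residuals, but the second point is very uncertain.
model = lambda x, p: p[0] * x
x = np.array([1.0, 2.0])
y = np.array([1.5, 4.0])
sigma_tight = np.array([0.5, 0.5])
sigma_loose = np.array([0.5, 5.0])   # second point 10x more uncertain
print(chi_squared([1.0], model, x, y, sigma_tight))
print(chi_squared([1.0], model, x, y, sigma_loose))
```

With the loose uncertainty, the second point's large residual contributes almost nothing to chi-squared, which is exactly the down-weighting described above.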

The minimum chi-squared would be when the grad of chi-squared is 0.

Instead of solving this algebraically, we take another approach: steepest descent, going downhill and updating the vector of fitting parameters a as we go.

We say that the next iterate is the current iterate minus some constant times the gradient of chi-squared: a_next = a_current − γ ∇χ², where γ is the constant.

So, we go down the gradient by an amount set by the constant, which in effect determines how fast we descend.

We keep updating the parameter vector until grad chi-squared equals 0, which gives us the minimum, or equivalently until the iterates stop changing.

So, for the gradient, we differentiate chi-squared with respect to each of the aₖ:

∂χ²/∂aₖ = −2 Σᵢ [yᵢ − f(xᵢ; a)] / σᵢ² · ∂f(xᵢ; a)/∂aₖ

Wrapping the −2 into the constant (call it γ), the update for each parameter becomes

aₖ → aₖ + γ Σᵢ [yᵢ − f(xᵢ; a)] / σᵢ² · ∂f(xᵢ; a)/∂aₖ

The final formula looks quite intimidating, but it is easy to put into practice. For our example function above, we simply differentiate it with respect to the parameters a₁ and a₂ to get the components of the gradient.
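Putting the whole steepest-descent recipe together in Python. This is a hedged sketch: the model f(x; a) = a₁·exp(a₂·x) is an assumed stand-in for the blog's example function, and the data are synthetic:

```python
import numpy as np

# Assumed model for illustration: f(x; a) = a1 * exp(a2 * x).
def f(x, a):
    return a[0] * np.exp(a[1] * x)

def grad_f(x, a):
    """Partial derivatives of f with respect to a1 and a2, stacked as rows."""
    e = np.exp(a[1] * x)
    return np.stack([e, a[0] * x * e])

# Synthetic data: true parameters (2.0, 1.5) plus noise, with an
# uncertainty sigma_i = 0.1 on every point.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 20)
sigma = np.full_like(x, 0.1)
y = f(x, np.array([2.0, 1.5])) + rng.normal(0.0, sigma)

# Steepest descent with the -2 wrapped into the constant gamma:
#   a_k <- a_k + gamma * sum_i (y_i - f(x_i; a)) / sigma_i**2 * df/da_k
a = np.array([1.0, 1.0])     # starting guess
gamma = 5e-6                 # step-size constant
for _ in range(20000):
    r = (y - f(x, a)) / sigma ** 2       # weighted residuals
    a = a + gamma * (grad_f(x, a) @ r)   # step down the gradient

print(a)  # approaches the true parameters (2.0, 1.5)
```

The step-size constant γ has to be small enough that the iteration stays stable; too large a value makes the parameters overshoot and diverge, too small a value just means more iterations.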

So, finally, we have derived the steepest-descent formula for fitting nonlinear functions. Since we minimize the sum of the squares of the residuals, this is called nonlinear least squares fitting.

There are many other methods, but this is the simplest one for minimizing the sum of the squares of the residuals when fitting a model that is nonlinear in its function and fitting parameters.