Hi! We have covered a lot in the series Calculus for Machine Learning. From functions, gradients, derivatives in a single dimension to jacobian and hessian in a multivariate environment. We also discussed linearization and approximating functions with methods like McLaurin and Taylor series. Today, in Calculus for Machine Learning – Function Optimization we will touch another important aspect of machine learning that is to optimize the parameters of the function. In other words, multivariate calculus can help us to find the maxima and minima of the function where our goal is to find a function to fit our data.
We will start with a single dimension, learn an important Newton-Raphson method, and then move onto multi-dimension using the Jacobian vector (vector of gradients). This will lead us to a gradient descent method. Lastly, we will discuss some situations with constraints and try to find their gradient descent solution using Lagrange multipliers. So let’s start Calculus for Machine Learning – Function Optimization.
Function Optimization – Newton-Raphson method:
We will start with a simple scenario. We need to find the derivative in order to solve an equation using the Newton-Raphson method. Consider the following distribution with a mean mu and a width sigma.
We want to fit our data to this distribution and we have two parameters mean and width for this purpose. So that we don’t have to move around all the data points. We will have a model with two parameters and we could do everything with it. Much simpler and faster. But how to find the best parameters for the model?
Our approach will be to find an expression that best fits the model to the data. Then we will observe how this goodness of fit changes with the fitting parameters mu and sigma. First, let’s do some more calculus.
Consider the following equation differentiating to two quadratic solutions. Meaning it will have two turning points as a maximum and a minimum as seen in the graph.
Let’s pretend that we don’t know about the equation. Or it has so many dimensions that we are unable to visualize it. Let’s say we have to find a solution where y = 0. On the graph, we can see it has only one solution. Depending on the form of the equation it could have more than one solution. We will start looking for solutions from somewhere near our initial guess. If we evaluate our solution at x = -2 we can say that function has value -2 and gradient is 10. If we extrapolate this gradient to y-intercept then the value of x at this intercept on the y-axis will be our next estimate for the solution to the equation.
We will again extrapolate it and look for the next estimate of the solution. We can write an equation for a new guess based on our previous guess.
If we make a table starting from our original guess i = 0,
You can see in just 3 iterations we have reached the solution to our equation. This is the Newton-Raphson method. Using this method we don’t have to evaluate the function at every point. Neither we have to plot graphically nor solve it algebraically. If we have a multivariable function all this would be expensive to solve analytically and even plotting it would be huge. This approach of computation where we try a solution, evaluate it, find the next guess, and then repeat the same again and again till we reach the required solution is called iteration method.
Next, quickly look at some of the cases where this approach can go wrong. For example we start with x = 0. Y evaluates to 2. When we find the gradient and extrapolate it we move away from our solution.
Another guess at x =1 evaluates y at 1. The extrapolation of gradient leads us back to x = 0 just where we started off. This traps us in a loop where we keep on switching between 0 and 1 never reach close to our required solution of x = -1.769.
Another problem arises at the turning point. At maximum and minimum. The gradient here is quite small. When we use this small value to divide in our Newton-Raphson equation the next estimate would be some non-converging crazy value.
These are the problems. Using the Newton-Raphson method we evaluate a function to find its gradient and do this iteratively. Most of the time it helps us in reaching our solution. In this way, the Newton-Raphson is a powerful method to reach a solution by just evaluating it and calculating its gradient a few times.
For imaging this whole scenario, consider yourself standing in a rocky place at night time. You can’t see the whole landscape. You can only feel the area around you through small steps. You can feel the steepness around you and then take a step in a direction that can lead you to a safe nonrocky place. Point is you don’t need to know the whole area or the whole function. You just need to know the value of the function and feel the steepness or gradient locally.
After finding the gradient of the function using the Newton-Raphson method in one-dimension, let’s move onto multi-dimensional functions next in Calculus for Machine Learning – Function Optimization.
Optimization of Multivariate Function – Gradient Descent Method
Consider the following function. The function is positive when x and y are positive and negative when y is negative.
Question for us is what is the fastest way to get to the bottom of this graph? We know how to differentiate a multivariate function (we have learned in previous parts). Let’s do it.
Next, we will write these two gradients in a vector. Grad of function f.
Grad is an awesome vector that combines linear algebra and calculus together!
If we dot product this grad vector with another vector r. It tells us how much the df/dx moves in the x-direction of the vector and how much df/dy moves in the y-direction of the vector.
Another thing to note is that we are evaluating the function at points a and b in a space. We want to know how much vector changes when moved along some unit vector in the opposite direction. We call this unit vector r hat with components c and d. Then we take the dot product of grad f and r hat called directional gradient.
Another question can be what is the maximum height that the directional gradient takes?
This means that r hat is parallel to grad f. so to get maximum of the directional gradient we want r as the normalized form of grad f.
This is equal to the size of grad f squared divided by the size of the grad f which is equal to the size of the grad f.
So the maximum value that a directional gradient takes is just the size of the grad f. This is therefore the steepest gradient we can have.
Another question is where does the gradient point? In which direction. If you are standing on a hill on a foggy day, to go down you will see down. Not around the hill or in our case the contour. So the gradient points in the direction of the steepest descent perpendicular to the contour lines.
Grad f would point up the direction of the deepest descent and – grad f would point down the direction of the steepest descent.
If we are fitting a model to our data. We will look at the minimum difference between our data points and the model fit. We can use the Newton-Raphson to go from some test point to the required solution using gradients. But in the Newton-Raphson, we look for a zero point. Here, we don’t know the minimum value of the function.
In such cases, we use the gradient descent method. We take a series of small steps to reach the minimum. Say we start at point sn. Then we take a small step to reach sn + 1. This sn + 1 is equal to sn minus some amount times grad f evaluated at previous point sn. sn+1 = sn – x (grad f)sn
So, on the graph we take a little step down the hill, re-evaluate, make another step and in this way take a series of smaller steps. There is still a possibility of falling into one of the local minima depending upon our starting step but still, it is quite a useful method generally used in numerical optimizers for real-life problems.
Lastly in Calculus for Machine Learning – Function Optimization, we will discuss constrained optimization. That is finding a maximum or minimum subject to some constraints like along a line etc.
The method used is called the Lagrange Multipliers.
Constrained Optimization – Lagrange Multipliers:
Below is a contour map of the function we were discussing above where gradients are plotted perpendicular to the contour lines.
Now, we have to find the maximum value for this function but with a constraint that it should lie on the circle. The radius of the circle is a squared and we want to find the highest value on this circle. Note not the value everywhere on the circle but just the maximum on this circle.
Lagrange, a French mathematician, noticed that when the contour lines touch the path we get maxima and minima. When contour lines are a bit small it won’t touch and if they are big they might cross multiple times like in contour map. When it just touches we get maxima and minima of the function along the path, in our case a circle.
Further, Lagrange noticed that when contour touches the path the vector perpendicular to the contour is in the same direction of the perpendicular vector of the path itself in a negative direction. This means if we find grad we can find maximum and minimum points of the function.
The function we want to find the maximum of and equation for the constraint of a circle is,
We have to solve,
Where lambda is Lagrange multiplier.
Solving the equations,
So for function solutions, we got 2 maximums and two maximum. Seeing them on the graph,
We have used our understanding of gradients to find maxima and minima of a function subject to some constraint that is on a circular path.
In this blog on Calculus for Machine Learning – Function Optimization we have dived into vector calculus by combining linear algebra and vectors. We mainly talked about function optimization and equation optimization. We started with the Newton-Raphson method where we started off with a guess solution and using gradients moved on to the next guess till we reach the required solution.
We also looked at a multivariate scenario and learned about gradient descent. The most powerful tool for optimization of a function used these days. Using this gradient descent method if we find the gradient of our initial estimate and then taking smaller steps down the hill we can reach our solution. We lastly talked about optimizing functions under some constraints. The method we discussed used the Lagrange multiplier. With this in Calculus for Machine Learning – Function Optimization we have covered quite an important aspect of machine learning that is the optimization of function parameters for better fitting of a model to the data values.