In the previous parts of Calculus for Machine Learning, we discussed differentiating univariate and multivariate functions and developed a calculus toolbox to help us navigate high-dimensional spaces. We looked at the chain rule in particular. In Calculus for Machine Learning part 4, we will see how to apply the chain rule to multivariate functions through examples.
Multivariate Chain Rule:
Previously, we talked about the total derivative. Suppose we have a multivariable function f(x, y, z) whose variables x, y, and z are themselves functions of some other variable t.
Then, to find the derivative of f w.r.t. t, we use the expression df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) + (∂f/∂z)(dz/dt),
which is the sum of the chains relating f to t through each of the three variables.
This lets us solve the problem piece by piece rather than by substitution, and computers perform such piecewise operations quickly.
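As a minimal sketch of the piecewise approach, the snippet below evaluates the sum of chains for one hypothetical choice of functions (f, x(t), y(t), and z(t) are assumptions picked for illustration, not taken from the text) and compares it with the result of substituting first.

```python
import math

# Hypothetical example: f(x, y, z) = sin(x) * exp(y * z**2)
# with x(t) = t - 1, y(t) = t**2, z(t) = 1 / t.

def total_derivative(t):
    """df/dt as the sum of the three chains through x, y and z."""
    x, y, z = t - 1.0, t**2, 1.0 / t
    e = math.exp(y * z**2)

    # Partial derivatives of f with respect to each of its variables.
    df_dx = math.cos(x) * e
    df_dy = math.sin(x) * z**2 * e
    df_dz = math.sin(x) * 2.0 * y * z * e

    # Ordinary derivatives of each variable with respect to t.
    dx_dt = 1.0
    dy_dt = 2.0 * t
    dz_dt = -1.0 / t**2

    return df_dx * dx_dt + df_dy * dy_dt + df_dz * dz_dt

def by_substitution(t):
    """Substituting first gives f(t) = e * sin(t - 1), so df/dt = e * cos(t - 1)."""
    return math.e * math.cos(t - 1.0)

t = 2.0
print(total_derivative(t), by_substitution(t))
```

For this particular choice the y and z chains cancel each other, so both routes agree up to floating-point rounding.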
We’ll start Calculus for Machine Learning part 4 by generalizing this concept, with a little simplification: we write the function as f(x), where x stands for the variables x1, x2, ..., xn.
Written in bold, x indicates that it represents a series of variables, just like an n-dimensional vector.
Now, each of these x variables is in turn a function of some variable t. We are interested in finding the derivative of f w.r.t. t.
We’ll link f to t through the sum of the chains through each of its variables.
Differentiating f w.r.t. each component of x gives a vector of n partial derivatives, ∂f/∂x.
Differentiating each component of x w.r.t. t gives a second vector of the same length, dx/dt.
To build the multivariate chain rule expression, we multiply the terms at matching positions in the two vectors and sum the products, which is exactly what a dot product does.
This is our nice and neat chain rule for multivariate functions in a generalized form. Another useful addition to our calculus toolbox!
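The equations for this generalization are not reproduced in the text, so here is a reconstruction from the description above. The subscript notation is an assumption.

```latex
f(\mathbf{x}) = f(x_1, x_2, \dots, x_n), \qquad x_i = x_i(t)

\frac{\partial f}{\partial \mathbf{x}} =
\begin{bmatrix}
  \partial f / \partial x_1 \\ \partial f / \partial x_2 \\ \vdots \\ \partial f / \partial x_n
\end{bmatrix},
\qquad
\frac{d\mathbf{x}}{dt} =
\begin{bmatrix}
  dx_1 / dt \\ dx_2 / dt \\ \vdots \\ dx_n / dt
\end{bmatrix}

\frac{df}{dt}
  = \frac{\partial f}{\partial \mathbf{x}} \cdot \frac{d\mathbf{x}}{dt}
  = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\frac{dx_i}{dt}
```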
By now you may have noticed that our vector of partial derivatives is the same as the Jacobian vector we saw in the last part, with one small change: here it is written as a column vector, whereas the Jacobian is a row vector. This means that the vector ∂f/∂x is the transpose of the Jacobian of f.
This shows how conveniently the Jacobian lets us express the multivariable chain rule.
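Next, let’s see how the chain rule works for more than two links.
Take an example in which f is a function of x, x is a function of u, and u is a function of t.
We can find df/dt by substituting each expression into the next and differentiating the result directly.
Another way is to apply a two-step chain rule: df/dt = (df/dx)(dx/du)(du/dt).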
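The functions used in the original example are not reproduced in the text. As an illustration with hypothetical functions, both routes give the same answer:

```latex
\begin{align*}
f(x) &= 5x, \qquad x(u) = 1 - u, \qquad u(t) = t^2 \\
\text{Substitution:}\quad f(t) &= 5\,(1 - t^2) = 5 - 5t^2
  \;\Rightarrow\; \frac{df}{dt} = -10t \\
\text{Chain rule:}\quad \frac{df}{dt} &= \frac{df}{dx}\,\frac{dx}{du}\,\frac{du}{dt}
  = (5)(-1)(2t) = -10t
\end{align*}
```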
This shows how the method applies to a chain of univariate functions, and it can be extended to as many intermediary functions as we like. But how does this approach apply to multivariate functions?
Let’s see. Consider the multivariable function f(x(u(t))).
Here the function f takes the vector x as its input, x is a function of the vector u, and u is a function of t, which is a scalar.
Notice that the two scalars f and t are linked via two intermediate vector-valued functions.
Going back to the same expression we used for univariate functions, we have df/dt = (∂f/∂x)(∂x/∂u)(du/dt).
We have seen before that differentiating f w.r.t. its input x gives a Jacobian row vector. Also, differentiating u w.r.t. the scalar t gives a column vector. For the middle term, we need the derivative of each of the two output variables of x w.r.t. each of the two input variables in u, which boils down to a 2×2 Jacobian matrix.
So the derivative of f w.r.t. t is the product of the Jacobian of f, the Jacobian of x, and the derivative vector of u.
The dimensions of the matrices, (1×1) = (1×2)(2×2)(2×1), show that the product is defined and returns a scalar. Just what we expected!
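The Jacobian product can be checked numerically. The sketch below uses hypothetical two-dimensional functions u(t), x(u), and f(x), chosen only for illustration. It multiplies the (1×2), (2×2), and (2×1) factors by hand and compares the result with a finite-difference estimate.

```python
import math

# Hypothetical functions chosen only for illustration:
#   u(t) = [t**2, sin(t)]
#   x(u) = [u1 * u2, u1 + u2]
#   f(x) = x1**2 + 3 * x2

def u_of(t):
    return [t**2, math.sin(t)]

def x_of(u):
    return [u[0] * u[1], u[0] + u[1]]

def f_of(x):
    return x[0]**2 + 3.0 * x[1]

def df_dt_chain(t):
    u = u_of(t)
    x = x_of(u)
    df_dx = [2.0 * x[0], 3.0]              # 1x2 Jacobian row vector of f
    dx_du = [[u[1], u[0]], [1.0, 1.0]]     # 2x2 Jacobian matrix of x
    du_dt = [2.0 * t, math.cos(t)]         # 2x1 derivative vector of u

    # (2x2)(2x1): Jacobian of x times the derivative of u.
    dx_dt = [sum(dx_du[i][j] * du_dt[j] for j in range(2)) for i in range(2)]
    # (1x2)(2x1): row vector times column vector gives a scalar.
    return sum(df_dx[i] * dx_dt[i] for i in range(2))

def df_dt_numeric(t, h=1e-6):
    """Central finite difference on the fully composed function."""
    composed = lambda s: f_of(x_of(u_of(s)))
    return (composed(t + h) - composed(t - h)) / (2.0 * h)

t = 1.5
print(df_dt_chain(t), df_dt_numeric(t))
```

The two printed values should agree to several decimal places. The finite difference is only an approximation, so an exact match is not expected.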
At this point in Calculus for Machine Learning part 4, we are starting to see how linear algebra and multivariable calculus play together.
Artificial Neural Networks and Multivariable Chain Rule:
Let’s now talk about how our understanding of the chain rule applies to artificial neural networks.
Whenever we hear the term artificial neural network, a diagram of circles joined by lines comes to mind. We say that the circles are the neurons and the lines are the connections between them.
In mathematical terms, an ANN is a function that takes some variables as input and returns some variables as output. These variables can also be vectors.
Translating the simplest such diagram, one input neuron connected to one output neuron, into a mathematical expression, we can say that a1 = sigma(w a0 + b).
Here the a terms are activities, w is a weight, b is a bias, and sigma is an activation function.
The function sigma is what associates our artificial neural network with the human brain. In the brain, neurons receive information from other neurons through chemical and electrical stimulation. When the sum of all these stimulations exceeds a certain threshold value, the neuron is activated and in turn stimulates its nearby neurons.
The hyperbolic tangent function, tanh, has this threshold property, with a range from -1 to 1. Like some other functions, tanh has a characteristic S shape called a sigmoid, which is why sigma is used to represent it.
At this point our simple neural network seems quite useless. To make it capable of doing some work, we should add more complexity, that is, introduce more neurons.
With a second input neuron, we modify the expression so that each input activity is multiplied by its own weight before the products are added to the bias.
But as we keep adding input neurons, the expression starts getting messy. To generalize it to n inputs, we write the weighted inputs as a sum over all n of them.
To simplify even further, consider the weights and inputs as vectors. The sum is then their dot product, which gives a1 = sigma(w · a0 + b).
Now our input vector can contain as many inputs as we want.
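A single neuron in this dot-product form can be sketched in a few lines. The function name `neuron` and the numbers are hypothetical, and tanh is used as the activation because it is the one discussed above.

```python
import math

def neuron(weights, inputs, bias):
    """Activity of one neuron: tanh of the weighted sum of inputs plus bias."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias  # dot product + bias
    return math.tanh(z)

# Hypothetical weights, inputs and bias.
a1 = neuron([0.5, -1.0, 2.0], [1.0, 0.0, 0.25], 0.1)
print(a1)
```

Because the weighted sum is computed with `zip`, the same function works for any number of inputs, as long as the weight and input lists have the same length.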
Applying the same logic to the outputs gives one such equation per output neuron.
The two output neurons have the same input neurons but different weight vectors and biases.
We can compact these equations further into vector form: the two outputs become the rows of a column vector, the two weight vectors become the rows of a weight matrix, and the two biases form a bias vector. This results in a1 = sigma(W a0 + b), where a1, a0, and b are now vectors and W is a matrix.
As the accompanying images show, a single-layer neural network with n inputs and m outputs is represented by the single equation we formed above, with the weights and biases at work.
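The steps from a single input to the vector form can be written out as follows. The superscript layer index and zero-based subscripts are assumed notation, since the original equations are not reproduced in the text.

```latex
\begin{align*}
a^{(1)} &= \sigma\left(w\,a^{(0)} + b\right) && \text{one input} \\
a^{(1)} &= \sigma\left(w_0 a^{(0)}_0 + w_1 a^{(0)}_1 + b\right) && \text{two inputs} \\
a^{(1)} &= \sigma\Big(\sum_{j=0}^{n-1} w_j a^{(0)}_j + b\Big) && \text{$n$ inputs} \\
a^{(1)} &= \sigma\left(\mathbf{w}\cdot\mathbf{a}^{(0)} + b\right) && \text{dot-product form} \\
a^{(1)}_i &= \sigma\left(\mathbf{w}_i\cdot\mathbf{a}^{(0)} + b_i\right) && \text{output neuron $i$} \\
\mathbf{a}^{(1)} &= \sigma\left(\mathbf{W}\,\mathbf{a}^{(0)} + \mathbf{b}\right) && \text{vector form}
\end{align*}
```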
Neural networks often have one or more layers between the input and output, which we refer to as hidden layers. They work in the same manner, with the outputs of one layer becoming the inputs of the next.
With this, we have all the linear algebra required to calculate the output of a feed-forward neural network.
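A feed-forward pass is then just the layer equation applied repeatedly. The following is a minimal sketch with hypothetical layer sizes and weight values, using plain Python lists rather than a linear algebra library.

```python
import math

def layer(W, b, a_prev):
    """One layer: a = tanh(W a_prev + b), with W given as a list of rows."""
    return [math.tanh(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)]

def feed_forward(layers, a0):
    """Pass the input through each (W, b) pair in turn."""
    a = a0
    for W, b in layers:
        a = layer(W, b, a)  # outputs of one layer become inputs of the next
    return a

# Hypothetical network: 2 inputs, 3 hidden units, 2 outputs.
hidden = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1])
output = ([[0.3, -0.1, 0.2], [0.0, 0.5, -0.4]], [0.05, -0.05])
result = feed_forward([hidden, output], [1.0, 0.5])
print(result)
```

Each row of a weight matrix holds the weight vector of one neuron in that layer, so the number of rows sets the number of outputs of the layer.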
But to make this neural network do some work, like image recognition, we have to find the right weight and bias values. Let’s now see how we can do this using our multivariate chain rule.
The multivariate chain rule helps us update the weights and biases until our neural network starts classifying inputs correctly, given some training examples.
Training a neural network, in simple terms, means using labeled data that contains input values and their corresponding output values.
For example, to train a neural network to recognize faces, the input could be image pixels and the output could be whether or not the image is a face.
The most commonly used training method is backpropagation. It starts from the output neurons and works backward through the neurons in the network.
Consider a simple neural network with 4 input units, 3 hidden-layer units, and 2 output units. We have to find the values of 18 weights and 5 biases with which the network correctly matches the inputs to their labels.
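The counts of 18 weights and 5 biases follow from the layer sizes: every unit connects to every unit in the next layer, and every non-input unit has one bias. A small sketch with a hypothetical helper `count_parameters`, assuming a fully connected network:

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    weights = sum(n_in * n_out for n_in, n_out in pairs)  # 4*3 + 3*2
    biases = sum(n_out for _, n_out in pairs)             # 3 + 2
    return weights, biases

print(count_parameters([4, 3, 2]))  # prints (18, 5)
```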
Initially, the weights and biases are set to random numbers. As expected, any input data at this point will give meaningless output.
Here we define a cost function. The cost function is simply the sum of the squares of the differences between the desired target output and the output our untrained neural network currently gives.
The relation between the cost and a single weight value can be plotted as a curve.
The cost is high for weights that are too large or too small, but at one specific point it reaches a minimum.
If, using our calculus knowledge, we can find the gradient of the cost w.r.t. the weight, we can simply start moving in the opposite direction.
In the graph, the gradient at the current weight is positive, so increasing the weight would increase the cost. We therefore make the weight smaller to improve our neural network.
This case is quite simple, though. A cost function can be quite wiggly, with multiple local minima that make navigation difficult. Moreover, here we are considering just a single weight.
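For the single-weight case, moving against the gradient can be sketched as below. The bowl-shaped cost, the starting weight, and the step size are all hypothetical, chosen so that there is a single minimum at w = 2.

```python
def cost(w):
    """Hypothetical bowl-shaped cost with its minimum at w = 2."""
    return (w - 2.0) ** 2

def gradient(w):
    """Derivative of the cost with respect to the weight."""
    return 2.0 * (w - 2.0)

def descend(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        w -= learning_rate * gradient(w)  # step in the opposite direction
    return w

w_final = descend(5.0)
print(w_final, cost(w_final))
```

Starting at w = 5 the gradient is positive, so each step makes the weight smaller, matching the reasoning above. On a wiggly cost with several local minima, the same loop could settle in whichever minimum is nearest the starting point.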
We are more interested in finding the minimum of a multi-dimensional surface.
Recalling our previous learning, in such cases we can calculate the Jacobian for our downhill movement by combining the partial derivatives of the cost function w.r.t. the relevant variables.
We can now easily write a chain rule expression for the partial derivative of the cost w.r.t. either the weight or the bias.
Introducing another term, z1, makes the expression more convenient. z1 holds the weighted activation plus the bias, which lets us differentiate whichever sigmoid function we choose separately.
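For a single neuron with one input and a squared-error cost, the two chain rule expressions can be reconstructed as follows. The layer superscripts and the target symbol y are assumed notation.

```latex
\begin{align*}
z^{(1)} &= w\,a^{(0)} + b, \qquad a^{(1)} = \sigma\left(z^{(1)}\right), \qquad
C = \left(a^{(1)} - y\right)^2 \\[4pt]
\frac{\partial C}{\partial w} &=
  \frac{\partial C}{\partial a^{(1)}}\,
  \frac{\partial a^{(1)}}{\partial z^{(1)}}\,
  \frac{\partial z^{(1)}}{\partial w}
  = 2\left(a^{(1)} - y\right)\,\sigma'\left(z^{(1)}\right)\,a^{(0)} \\[4pt]
\frac{\partial C}{\partial b} &=
  \frac{\partial C}{\partial a^{(1)}}\,
  \frac{\partial a^{(1)}}{\partial z^{(1)}}\,
  \frac{\partial z^{(1)}}{\partial b}
  = 2\left(a^{(1)} - y\right)\,\sigma'\left(z^{(1)}\right)
\end{align*}
```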
These two chain rule expressions will help us navigate the 2-dimensional weight-bias space while minimizing the cost for a simple neural network training on some examples.
Obviously, the story gets a bit more complicated as the number of neurons increases, but fundamentally we are just applying the chain rule to link the weights and biases to their effect on the cost, which eventually trains our neural network.
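To make the single-neuron case concrete, here is a minimal sketch that trains one weight and one bias on a single hypothetical training example. It assumes tanh as the activation and a squared-error cost. The starting values, learning rate, and step count are illustrative choices rather than recommendations.

```python
import math

def forward(w, b, a0):
    z1 = w * a0 + b        # weighted activation plus bias
    a1 = math.tanh(z1)     # activation function applied to z1
    return z1, a1

def cost(w, b, a0, y):
    _, a1 = forward(w, b, a0)
    return (a1 - y) ** 2

def gradients(w, b, a0, y):
    """Chain rule: dC/dw and dC/db as products of three links each."""
    _, a1 = forward(w, b, a0)
    dC_da1 = 2.0 * (a1 - y)
    da1_dz1 = 1.0 - a1 ** 2    # derivative of tanh(z1) with respect to z1
    dz1_dw = a0
    dz1_db = 1.0
    return dC_da1 * da1_dz1 * dz1_dw, dC_da1 * da1_dz1 * dz1_db

def train(a0, y, w=0.5, b=0.0, learning_rate=0.1, steps=500):
    for _ in range(steps):
        dC_dw, dC_db = gradients(w, b, a0, y)
        w -= learning_rate * dC_dw  # move against the gradient
        b -= learning_rate * dC_db
    return w, b

a0, y = 1.5, 0.5               # hypothetical input and target
w, b = train(a0, y)
print(cost(0.5, 0.0, a0, y), cost(w, b, a0, y))
```

The second printed cost should be much smaller than the first. With more neurons, the same pattern repeats with vectors and Jacobians in place of the scalars used here.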
Finally, in Calculus for Machine Learning part 4, we have started to see calculus in action. By now we have also seen that linear algebra and calculus form the basis of machine learning techniques. We’ll end part 4 on this note. See you in the next part soon. Happy machine learning!