There are multiple libraries (PyTorch, TensorFlow) that can assist you in implementing almost any neural network architecture. This article is not about solving a neural net using one of those libraries; there are already plenty of articles and videos on that. Instead, we'll see a step-by-step forward pass (forward propagation) and backward pass (backpropagation) example. We'll take a single hidden layer neural network and solve one complete cycle of forward propagation and backpropagation.
Getting to the point, we will work step by step to understand how weights are updated in neural networks. The way a neural network learns is by updating its weight parameters during the training phase. Several concepts are needed to fully understand the working mechanism of neural networks: linear algebra, probability, and calculus. I'll revisit just enough calculus for the chain rule, and set aside the linear algebra (vectors, matrices, tensors) for this article. We'll work through each and every computation, and in the end we'll update all the weights of the example neural network for one complete cycle of forward propagation and backpropagation. Let's get started.
Here’s a simple neural network on which we’ll be working.
I think the above example neural network is self-explanatory. There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output Layer. w1, w2, w3, …, w8 represent the respective weights. b1 and b2 are the biases for the Hidden Layer and Output Layer, respectively.
In this article, we'll pass in two inputs i1 and i2, perform a forward pass to compute the total error, and then a backward pass to distribute the error inside the network and update the weights accordingly.
Before getting started, let us deal with two basic concepts which should be sufficient to comprehend this article.
Peeking inside a single neuron
Inside a unit, two operations happen: (i) computation of the weighted sum and (ii) squashing of the weighted sum using an activation function. The result of the activation function becomes an input to the next layer (unless the next layer is the Output Layer). In this example, we'll be using the Sigmoid function (Logistic function) as the activation function. The Sigmoid function takes an input and squashes the value between 0 and +1. We'll discuss activation functions in later articles. What you should note for now is that inside a neural network unit, the two operations stated above happen. We can think of the input layer as applying a linear (identity) function that produces the same value as the input.
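As a quick sketch of those two operations, here is a minimal Python version of a single unit (the function and variable names are my own, not part of any library):

```python
import math

def neuron(inputs, weights, bias):
    # operation 1: weighted sum of the inputs plus the bias
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    # operation 2: squash the sum with the sigmoid (logistic) function
    return 1.0 / (1.0 + math.exp(-s))
```

For instance, with the values used later in this article (i1 = 0.1, i2 = 0.5, w1 = 0.1, w3 = 0.3, b1 = 0.25), `neuron([0.1, 0.5], [0.1, 0.3], 0.25)` gives roughly 0.60108.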
Chain Rule in Calculus
If we have y = f(u) and u = g(x) then we can write the derivative of y as:
\frac{dy}{dx} = \frac{dy}{du} * \frac{du}{dx}

The Forward Pass
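To make the rule concrete, here is a small sketch (the functions f and g are arbitrary examples of my own choosing) that checks the chain rule against a finite-difference approximation:

```python
def g(x):
    return 3 * x + 1          # u = g(x)

def f(u):
    return u ** 2             # y = f(u)

def dy_dx(x):
    # chain rule: dy/du * du/dx = (2u) * 3
    return 2 * g(x) * 3

# numerical check at x = 2.0 using a central difference
x, h = 2.0, 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
```

At x = 2.0 the chain rule gives 2 * 7 * 3 = 42, and the numerical estimate agrees to several decimal places.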
Remember that each unit of a neural network performs two operations: compute weighted sum and process the sum through an activation function. The outcome of the activation function determines if that particular unit should activate or become insignificant.
Let’s get started with the forward pass.
For h1,
sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}

sum_{h1} = 0.1*0.1+0.5*0.3+0.25 = 0.41

Now we pass this weighted sum through the logistic function (sigmoid function) so as to squash it into the range (0, 1). The logistic function is the activation function for our example neural network.
output_{h1}=\frac{1}{1+e^{-sum_{h1}}}

output_{h1}=\frac{1}{1+e^{-0.41}} = 0.60108

Similarly for h2, we perform the weighted sum operation sum_{h2} and compute the activation value output_{h2}.
sum_{h2} = i_{1}*w_{2}+i_{2}*w_{4}+b_{1} = 0.47

output_{h2} = \frac{1}{1+e^{-sum_{h2}}} = 0.61538

Now, output_{h1} and output_{h2} will be considered as inputs to the next layer.
For o1,
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2} = 1.01977

output_{o1}=\frac{1}{1+e^{-sum_{o1}}} = 0.73492

Similarly for o2,
sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2} = 1.26306

output_{o2}=\frac{1}{1+e^{-sum_{o2}}} = 0.77955

Computing the total error
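The whole forward pass above can be reproduced in a few lines of Python. This is only a sketch of this specific example; the input, weight and bias values are the ones assumed from the network diagram:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# values assumed from the example network diagram
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8
b1, b2 = 0.25, 0.35

# hidden layer: weighted sum, then sigmoid
out_h1 = sigmoid(i1 * w1 + i2 * w3 + b1)   # sigmoid(0.41)
out_h2 = sigmoid(i1 * w2 + i2 * w4 + b1)   # sigmoid(0.47)

# output layer: the hidden activations act as inputs
out_o1 = sigmoid(out_h1 * w5 + out_h2 * w6 + b2)
out_o2 = sigmoid(out_h1 * w7 + out_h2 * w8 + b2)
```

Running this reproduces the four activation values computed above (0.60108, 0.61538, 0.73492 and 0.77955, up to rounding).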
We started off supposing the expected outputs to be 0.05 and 0.95 respectively for output_{o1} and output_{o2}. Now we will compute the errors based on the outputs received until now and the expected outputs.
We’ll use the following error formula,
E_{total} = \sum \frac{1}{2}(target-output)^{2}

To compute E_{total}, we need to first find out the respective errors at o1 and o2.
E_{1} = \frac{1}{2}(target_{1}-output_{o1})^{2}

E_{1} = \frac{1}{2}(0.05-0.73492)^{2} = 0.23456

Similarly for E2,
E_{2} = \frac{1}{2}(target_{2}-output_{o2})^{2}

E_{2} = \frac{1}{2}(0.95-0.77955)^{2} = 0.01452

Therefore, E_{total} = E_{1} + E_{2} = 0.24908
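The same error computation, sketched in Python (the output values are the rounded ones from the forward pass above):

```python
# rounded forward-pass outputs and the assumed targets
out_o1, out_o2 = 0.73492, 0.77955
target1, target2 = 0.05, 0.95

# squared-error loss, halved so the derivative comes out clean
E1 = 0.5 * (target1 - out_o1) ** 2
E2 = 0.5 * (target2 - out_o2) ** 2
E_total = E1 + E2
```

This reproduces E1 = 0.23456, E2 = 0.01452 and E_total = 0.24908, up to rounding.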
The Backpropagation
The aim of backpropagation (the backward pass) is to distribute the total error back through the network so as to update the weights and thereby minimize the cost function (loss). The weights are updated in such a way that when the next forward pass uses the updated weights, the total error is reduced by a certain margin (until a minimum is reached).
For weights in the output layer (w5, w6, w7, w8)
For w5,
Let’s compute how much contribution w5 has on E_{1}. If we become clear on how w5 is updated, then it would be really easy for us to generalize the same to the rest of the weights. If we look closely at the example neural network, we can see that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, and sum_{o1} is affected by w5. It’s time to recall the Chain Rule.
\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}

Let's deal with each component of the above chain separately.
Component 1: partial derivative of Error w.r.t. Output
E_{total} = \sum \frac{1}{2}(target-output)^{2}

E_{total} = \frac{1}{2}(target_{1}-output_{o1})^{2} + \frac{1}{2}(target_{2}-output_{o2})^{2}

Therefore,
\frac{\partial E_{total}}{\partial output_{o1}} = 2*\frac{1}{2}*(target_{1}-output_{o1})*-1 = output_{o1} - target_{1}

Component 2: partial derivative of Output w.r.t. Sum
The output section of a neural network unit uses a non-linear activation function; in this example, the Logistic function. When we compute the derivative of the Logistic function, we get:
\sigma(x) = \frac{1}{1+e^{-x}}

\frac{\mathrm{d}}{\mathrm{d}x}\sigma(x) = \sigma(x)(1-\sigma(x))

Therefore, the derivative of the Logistic function is equal to the output multiplied by (1 - output).
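A quick numerical sanity check of this identity (the check point 0.41 is just sum_{h1} from our example; any x would do):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # the identity derived above: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# central-difference estimate of the derivative at x = 0.41
x, h = 0.41, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```

Both the identity and the finite-difference estimate give roughly 0.23978 at x = 0.41, a value that reappears later as the derivative of output_{h1}.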
\frac{\partial output_{o1}}{\partial sum_{o1}} = output_{o1} (1 - output_{o1})

Component 3: partial derivative of Sum w.r.t. Weight
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}

Therefore,
\frac{\partial sum_{o1}}{\partial w5} = output_{h1}

Putting them together,
\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}

\frac{\partial E_{total}}{\partial w5} = [output_{o1} - target_{1}] * [output_{o1} (1 - output_{o1})] * [output_{h1}]

\frac{\partial E_{total}}{\partial w5} = 0.68492 * 0.19480 * 0.60108 = 0.08020

The new\_w_{5} is,
new\_w_{5} = w5 - n * \frac{\partial E_{total}}{\partial w5}, where n is the learning rate.
new\_w_{5} = 0.5 - 0.6 * 0.08020 = 0.45188

We can proceed similarly for w6, w7 and w8.
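The w5 update can be checked in a few lines. This is a sketch using the rounded values from above, so the result matches the article up to rounding (all variable names are my own):

```python
out_h1, out_o1 = 0.60108, 0.73492   # rounded forward-pass values
target1 = 0.05
w5, n = 0.5, 0.6                    # n is the learning rate

# the three chain-rule components multiplied together
grad_w5 = (out_o1 - target1) * out_o1 * (1 - out_o1) * out_h1
new_w5 = w5 - n * grad_w5
```

This gives a gradient of roughly 0.08020 and an updated weight of roughly 0.4519.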
For w6,
\frac{\partial E_{total}}{\partial w6} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w6}

The first two components of this chain have already been calculated. The last component is \frac{\partial sum_{o1}}{\partial w6} = output_{h2}.
\frac{\partial E_{total}}{\partial w6} = 0.68492 * 0.19480 * 0.61538 = 0.08211

The new\_w_{6} is,
new\_w_{6} = w6 - n * \frac{\partial E_{total}}{\partial w6}

new\_w_{6} = 0.6 - 0.6 * 0.08211 = 0.55073

For w7,
\frac{\partial E_{total}}{\partial w7} = \frac{\partial E_{total}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial w7}

For the first component of the above chain, let's recall how the partial derivative of Error is computed w.r.t. Output.
\frac{\partial E_{total}}{\partial output_{o2}} = output_{o2} - target_{2}

For the second component,
\frac{\partial output_{o2}}{\partial sum_{o2}} = output_{o2} (1 - output_{o2})

For the third component,
\frac{\partial sum_{o2}}{\partial w7} = output_{h1}

Putting them together,
\frac{\partial E_{total}}{\partial w7} = [output_{o2} - target_{2}] * [output_{o2} (1 - output_{o2})] * [output_{h1}]

\frac{\partial E_{total}}{\partial w7} = -0.17044 * 0.17184 * 0.60108 = -0.01760

The new\_w_{7} is,
new\_w_{7} = w7 - n * \frac{\partial E_{total}}{\partial w7}

new\_w_{7} = 0.7 - 0.6 * (-0.01760) = 0.71056

Proceeding similarly, we get new\_w_{8} = 0.81081 (with \frac{\partial E_{total}}{\partial w8} = -0.01802).
For weights in the hidden layer (w1, w2, w3, w4)
Similar calculations are made to update the weights in the hidden layer; however, this time the chain becomes a bit longer. It does not matter how deep the neural network goes: all we need to find out is how much error a particular weight contributes to the total error of the network. For that purpose, we need the partial derivative of the Error w.r.t. that particular weight. Let's work on updating w1, and then we'll be able to generalize similar calculations to update the rest of the weights.
For w1 (with respect to E1),
For simplicity let us compute \frac{\partial E_{1}}{\partial w1} and \frac{\partial E_{2}}{\partial w1} separately, and later we can add them to compute \frac{\partial E_{total}}{\partial w1}.
\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

Let's quickly go through the above chain. We know that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, sum_{o1} is affected by output_{h1}, output_{h1} is affected by sum_{h1}, and finally sum_{h1} is affected by w1. It is quite easy to comprehend, isn't it?
For the first component of the above chain,
\frac{\partial E_{1}}{\partial output_{o1}} = output_{o1} - target_{1}

We've already computed the second component. This is one of the benefits of using the chain rule: as we go deeper into the network, the previous computations are re-usable.
For the third component,
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}

\frac{\partial sum_{o1}}{\partial output_{h1}} = w5

For the fourth component,
\frac{\partial output_{h1}}{\partial sum_{h1}} = output_{h1}*(1-output_{h1})

For the fifth component,
sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}

\frac{\partial sum_{h1}}{\partial w1} = i_{1}

Putting them all together,
\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

\frac{\partial E_{1}}{\partial w1} = 0.68492 * 0.19480 * 0.5 * 0.23978 * 0.1 = 0.00159

Similarly, for w1 (with respect to E2),
\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

For the first component of the above chain,
\frac{\partial E_{2}}{\partial output_{o2}} = output_{o2} - target_{2}

The second component is already computed.
For the third component,
sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2}

\frac{\partial sum_{o2}}{\partial output_{h1}} = w7

The fourth and fifth components have also already been computed while computing \frac{\partial E_{1}}{\partial w1}.
Putting them all together,
\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

\frac{\partial E_{2}}{\partial w1} = -0.17044 * 0.17184 * 0.7 * 0.23978 * 0.1 = -0.00049

Now we can compute \frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{1}}{\partial w1} + \frac{\partial E_{2}}{\partial w1}.
\frac{\partial E_{total}}{\partial w1} = 0.00159 + (-0.00049) = 0.00110.
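This longer chain can also be checked in code. Below is a sketch using the rounded values from the article (all variable names are my own):

```python
i1 = 0.1
out_h1 = 0.60108
out_o1, out_o2 = 0.73492, 0.77955
target1, target2 = 0.05, 0.95
w5, w7 = 0.5, 0.7

# derivative of the hidden activation: d out_h1 / d sum_h1
d_h1 = out_h1 * (1 - out_h1)

# the two five-component chains derived above
dE1_dw1 = (out_o1 - target1) * out_o1 * (1 - out_o1) * w5 * d_h1 * i1
dE2_dw1 = (out_o2 - target2) * out_o2 * (1 - out_o2) * w7 * d_h1 * i1

grad_w1 = dE1_dw1 + dE2_dw1
```

This reproduces the two partial gradients (roughly 0.00159 and -0.00049) and their sum of roughly 0.00110.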
The new\_w_{1} is,
new\_w_{1} = w1 - n * \frac{\partial E_{total}}{\partial w1}

new\_w_{1} = 0.1 - 0.6 * 0.00110 = 0.09934

Proceeding similarly, we can easily update the other weights (w2, w3 and w4).
new\_w_{2} = 0.19919

new\_w_{3} = 0.29667

new\_w_{4} = 0.39597

Once we've computed all the new weights, we replace all the old weights with them. Once the weights are updated, one backpropagation cycle is finished. Then another forward pass is done and a new total error is computed, and based on this newly computed total error the weights are updated again. This goes on until the loss converges to a minimum. In this way a neural network starts with random values for its weights and finally converges to optimum values.
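Finally, the whole cycle can be put together in one sketch. This is my own compact implementation of the example above, not a library API; the biases are kept fixed, as in the article (updating them would follow the same chain rule):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, b, i1, i2):
    # w = [w1..w8], b = [b1, b2], wired as in the example network
    h1 = sigmoid(i1 * w[0] + i2 * w[2] + b[0])
    h2 = sigmoid(i1 * w[1] + i2 * w[3] + b[0])
    o1 = sigmoid(h1 * w[4] + h2 * w[5] + b[1])
    o2 = sigmoid(h1 * w[6] + h2 * w[7] + b[1])
    return h1, h2, o1, o2

def step(w, b, i1, i2, t1, t2, n):
    """One forward pass plus one backpropagation weight update."""
    h1, h2, o1, o2 = forward(w, b, i1, i2)
    # output-layer deltas: (output - target) * sigmoid derivative
    d_o1 = (o1 - t1) * o1 * (1 - o1)
    d_o2 = (o2 - t2) * o2 * (1 - o2)
    # hidden-layer deltas collect error from both outputs
    d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * h1 * (1 - h1)
    d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * h2 * (1 - h2)
    grads = [d_h1 * i1, d_h2 * i1, d_h1 * i2, d_h2 * i2,
             d_o1 * h1, d_o1 * h2, d_o2 * h1, d_o2 * h2]
    return [wi - n * g for wi, g in zip(w, grads)]

def total_error(w, b, i1, i2, t1, t2):
    _, _, o1, o2 = forward(w, b, i1, i2)
    return 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2

w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
b = [0.25, 0.35]
before = total_error(w, b, 0.1, 0.5, 0.05, 0.95)
for _ in range(100):
    w = step(w, b, 0.1, 0.5, 0.05, 0.95, n=0.6)
after = total_error(w, b, 0.1, 0.5, 0.05, 0.95)
```

A single step from the initial weights reproduces the updates computed above (e.g. w5 becomes roughly 0.4519), and repeating the cycle keeps lowering the total error, just as described.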
I hope you found this article useful. I’ll see you in the next one.
Thanks for this straightforward explanation! Once I separately went away and learned about partial derivatives it made complete sense. Very helpful for this programmer. Kudos!
Hello Ben. Glad to know that the article was helpful.
Thank you for this excellent article, Rabindra.
I’ve stepped through multiple tutorials similar to this, but in each case there was a problem with the tutorial. Either it was incomplete, or it contained errors. Also, it seems that the values for one tutorial in particular were copied into numerous other tutorials, including one YouTube video, and in each, there were incomplete steps. This tutorial had none of those issues. It was well written, concise, and accurate. I thank you for that. I do not have a math background and most tutorials displayed endless calculus equations that I couldn’t read. What I needed was a complete step-by-step walkthrough of the actual numbers for one complete forward and one complete backward pass, and that is exactly what you provided. To underscore how much your example helped me: In other articles, where the author left it to the reader to determine the new/updated w2, w3 and w4 values, I was hopelessly lost, but by going through your article, I was able to compute those values and verify them as accurate against your results. I can’t thank you enough. I searched for days trying to find an article or video to help me grasp the concepts of a NN, and this is the only article that truly helped me.
Hello Joe. Thank you so much for the words.
Hi, Thanks a lot for excellent article.
Thanks, Amin.
Thanks for the detailed explanation. Finally got to go through this popular blog. Helped me a lot in understanding. We are at the same Uni, we should catch up sometime.
Thanks, brother. Nobody knows about the blog :p, only this post seems to be getting lots of attention. Yeah, we should catch up sometime. HAHA!!
Thanks for the article, it’s indeed the most complete one compared to a bunch of others online.
However, the network would not be working properly, as the biases are initialized and used for forward propagation but never updated… which means at any point the function would have an offset, not equal to zero but to other constants (0.25 and 0.35) for the layers, not for individual neurons.
Otherwise thanks!
Yes, the biases also need to be updated accordingly.
Thank you for the wonderful blog. While computing the weight of w5, can’t we take E1 instead of E_total.
Yes, we can. However, this applies only to the weights in the final layer. Once you come back inside the network (the other layers), the weights there have their effects on both E1 and E2. So it’s good to follow a general representation.
Thank you for this great article 🙂 ! It was really helpful, especially concerning the use of the Chain Rule !
Glad to know that you found the article useful! Thanks.
You’re the best, man. From your article, I learn a lot. I salute you
Thank you, Mumin. Glad to know that the article was helpful.
This was really helpful. Quite simplified and well explained. Thank you.
Glad that the article was of help!
Thank you so much for detailed and clear explanation! I finally understood topic.
Glad to know that, pooja.
Very nice work… very well explained, and interesting!!!
Thanks, jorge.
Good as far as it goes despite missing out the biases.
I’d like to see you do one with 2 or 3 hidden layers.
The chain rule mathematics is fine, but the summing of the error derivatives for weights and biases as you move back through the hidden layers gets a bit more complicated.
The deeper the network, the longer the chain of derivatives. The steps discussed above are generalizable.
It’s really interesting, but I’m a bit confused: how come n = 0.6 (the learning rate)?
Hi Urgesa,
I just assumed the learning rate to be 0.6. To understand in deep how the learning rate affects the training of neural networks, refer to this article: https://theneuralblog.com/gradient-descent-algorithm/
I am glad I got this post just by chance. It is really concise and simple to understand. You are indeed a teacher. Thank you for taking time out to help others through your knowledge.
Great to know that the article was helpful, Ade!
Hi Rabindra, Thanks a lot for excellent worked example.
Thank you, JAY.
Great article! Helped me finish my neural networks homework problem.
Thank you, Vasundhara.
Thank you for this article.
I think you did forward and backward propagation for one case of input. As I understand from Keras, for example, the forward pass is done for a lot of input data and then the weight update is done. Is that right?
Yes.
Hi, it looks like in the visualizations, w6 and w7 are switched, making it look like w7 is connected to h1 and not h2, which does not align with your calculations.
Hi Jordan. As per the example neural network, we are computing (in the chain) sumO1 w.r.t. w6 and sumO2 w.r.t. w7.
I understood everything in this amazing tutorial. English isn’t my first language and I easily understood everything. That’s how high your level is. Good job, man.
Glad it was helpful, Zyad.
Super article and very well explained
Hi… can we update the values of biases b1 and b2 as well? If yes, do b1 and b2 come out different for the output nodes and hidden nodes, or should the bias be the same for both nodes in one layer? Please reply.
The chain rule applies. All you need to compute is the derivative of the loss function with respect to the parameter you want to update.
Very well explained, thank you so very much.
Thank you!
please make me clear about learning rate
Please refer to this article for understanding the significance of a learning rate.
https://theneuralblog.com/gradient-descent-algorithm/
Just beautifully explained
Thank you!
This is the best article I could find in the web, and I’ve searched for just some time. Usually articles fly over the differentiation formulas for the back propagation. I think, one couldn’t get the essence of a neural network in a more concentrated and crystal clear way. Thank you so much! Please continue posting.
P.S.: … and I would also have a suggestion for another post or post completion: your example takes a single feature value as an input and outputs a single value to the hidden layer: how does it work with multiple feature inputs with a single value output for the input layer? How with a matrix input? Thanks again!
Thank you. Glad to know that you found the article useful.
I am preparing some other related articles; maybe they will start appearing next month.
Is that error formula of 0.5(|actual – calculated|)^2 an arbitrary decision? Or is that *THE* error formula that neural networks always use??
It is a loss function. There are plenty of other ones. Fundamentally, every loss function computes the difference between actual values and predicted values.
Well explained. But images are not showing. Can you just fix it?
Thanks for being so helpful for writing some code.
Thanks for this tutorial. With the recent explosion in deep learning, almost every search result I find about neural networks immediately assumes you want to build a massive network using some complicated library. This is one of the only fully worked examples (except for calculating the biases 😉 ) that shows you what happens as the network trains.
I am building a simple C# library to enable me to build some shallow networks and this has really helped me iron out the wrinkles, by being able to trace values through the network as it trains.
Thank you!
Thank you for this simple self explanatory article! It was really helpful.
Thanks a lot, mister.
Once I tracked down all of the values side by side, everything was clear at once.
Previously I usually got confused when tutorials used different loss functions, like cross-entropy loss for the final nodes and sigmoid for the activation functions. The matching terms would be cancelled out in the very first step, and that was just mentioned assuming we know the behind-the-scenes logic. Then for the subsequent preceding terms we were forced to follow a separate chain of equations, i.e. following the derivative of the activation terms.
Now your article has cleared those doubts, alongside providing a more methodical way of manually solving these problems.
Thanks again.