A step-by-step forward pass and backpropagation example

The neural network that we'll be solving in this article.

There are multiple libraries (PyTorch, TensorFlow) that can assist you in implementing almost any neural network architecture. This article is not about solving a neural net using one of those libraries; there are already plenty of articles and videos on that. Instead, we'll work through a step-by-step forward pass (forward propagation) and backward pass (backpropagation) example. We'll take a neural network with a single hidden layer and solve one complete cycle of forward propagation and backpropagation.

Getting to the point, we will work step by step to understand how weights are updated in a neural network. The way a neural network learns is by updating its weight parameters during the training phase. Fully understanding the working mechanism of neural networks requires several concepts: linear algebra, probability, and calculus. I'll revisit just enough calculus to cover the chain rule, and I'll set aside the linear algebra (vectors, matrices, tensors) for this article. We'll work through each and every computation, and in the end we'll have updated all the weights of the example neural network for one complete cycle of forward propagation and backpropagation. Let's get started.

Here’s a simple neural network on which we’ll be working.

Example Neural Network

I think the above example neural network is self-explanatory. There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output Layer. The weights w1, w2, w3, …, w8 represent the respective connection weights. b1 and b2 are the biases for the Hidden Layer and the Output Layer, respectively.

In this article, we'll pass two inputs i1 and i2, perform a forward pass to compute the total error, and then perform a backward pass to distribute the error back through the network and update the weights accordingly.

Before getting started, let us deal with two basic concepts which should be sufficient to comprehend this article.

Peeking inside a single neuron

Inside h1 (first unit of the hidden layer)

Inside a unit, two operations happen: (i) computation of the weighted sum and (ii) squashing of the weighted sum using an activation function. The result of the activation function then becomes an input to the next layer (this continues until we reach the Output Layer). In this example, we'll be using the Sigmoid function (Logistic function) as the activation function. The Sigmoid function takes an input and squashes it to a value between 0 and 1. We'll discuss activation functions in later articles; for now, what you should note is that inside a neural network unit, the two operations stated above happen. We can think of the input layer as applying an identity (linear) function that simply passes each input value through unchanged.
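To make this concrete, here is a minimal Python sketch of the two operations inside a single unit such as h1. The function names are mine, chosen only for illustration.

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) activation: squashes x into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def unit_output(inputs, weights, bias):
    """One unit: a weighted sum followed by the activation function."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# With the values used for h1 later in this article
# (i1 = 0.1, i2 = 0.5, w1 = 0.1, w3 = 0.3, b1 = 0.25):
print(unit_output([0.1, 0.5], [0.1, 0.3], 0.25))  # ~0.60108
```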

Chain Rule in Calculus

If we have y = f(u) and u = g(x), then we can write the derivative of y with respect to x as:

\frac{dy}{dx} = \frac{dy}{du} * \frac{du}{dx}
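As a quick illustration, if y = u^{2} and u = 3x, then:

\frac{dy}{dx} = \frac{dy}{du} * \frac{du}{dx} = 2u * 3 = 6(3x) = 18x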

The Forward Pass

Remember that each unit of a neural network performs two operations: compute the weighted sum and process the sum through an activation function. The outcome of the activation function determines whether that particular unit activates strongly or contributes very little to the next layer.

Let’s get started with the forward pass.

For h1,

sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}
sum_{h1} = 0.1*0.1+0.5*0.3+0.25 = 0.41

Now we pass this weighted sum through the logistic function (sigmoid function) so as to squash it into the range (0, 1). The logistic function is the activation function for our example neural network.

output_{h1} = \frac{1}{1+e^{-sum_{h1}}}
output_{h1} = \frac{1}{1+e^{-0.41}} = 0.60108

Similarly for h2, we perform the weighted sum operation sum_{h2} and compute the activation value output_{h2}.

sum_{h2} = i_{1}*w_{2}+i_{2}*w_{4}+b_{1} = 0.47
output_{h2} = \frac{1}{1+e^{-sum_{h2}}} = 0.61538

Now, output_{h1} and output_{h2} will be considered as inputs to the next layer.

For o1,

sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2} = 1.01977
output_{o1} = \frac{1}{1+e^{-sum_{o1}}} = 0.73492

Similarly for o2,

sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2} = 1.26306
output_{o2} = \frac{1}{1+e^{-sum_{o2}}} = 0.77955
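The whole forward pass can be reproduced with a few lines of Python. This is only a sketch; the concrete input, weight, and bias values (e.g. w2 = 0.2, w4 = 0.4, b2 = 0.35) are the ones implied by the calculations above and by the network figure.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Values implied by the example network and the calculations above
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8
b1, b2 = 0.25, 0.35

# Hidden layer
sum_h1 = i1 * w1 + i2 * w3 + b1   # 0.41
out_h1 = sigmoid(sum_h1)          # ~0.60108
sum_h2 = i1 * w2 + i2 * w4 + b1   # 0.47
out_h2 = sigmoid(sum_h2)          # ~0.61538

# Output layer
sum_o1 = out_h1 * w5 + out_h2 * w6 + b2   # ~1.01977
out_o1 = sigmoid(sum_o1)                  # ~0.73492
sum_o2 = out_h1 * w7 + out_h2 * w8 + b2   # ~1.26307
out_o2 = sigmoid(sum_o2)                  # ~0.77955
```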

Computing the total error

We started off supposing the expected (target) outputs to be 0.05 and 0.95 for output_{o1} and output_{o2}, respectively. Now we will compute the errors based on the outputs computed so far and these expected outputs.

We’ll use the following error formula,

E_{total} = \sum \frac{1}{2}(target-output)^{2}

To compute E_{total}, we need to first find out respective errors at o1 and o2.

E_{1} = \frac{1}{2}(target_{1}-output_{o1})^{2}
E_{1} = \frac{1}{2}(0.05-0.73492)^{2} = 0.23456

Similarly for E2,

E_{2} = \frac{1}{2}(target_{2}-output_{o2})^{2}
E_{2} = \frac{1}{2}(0.95-0.77955)^{2} = 0.01452

Therefore, E_{total} = E_{1} + E_{2} = 0.24908
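In code, continuing the forward-pass sketch above, the error computation is just:

```python
# Reuses out_o1 and out_o2 from the forward-pass sketch above
target_1, target_2 = 0.05, 0.95

E1 = 0.5 * (target_1 - out_o1) ** 2   # ~0.23456
E2 = 0.5 * (target_2 - out_o2) ** 2   # ~0.01452
E_total = E1 + E2                      # ~0.24908
```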

The Backpropagation

The aim of backpropagation (the backward pass) is to distribute the total error back through the network so that the weights can be updated in a way that minimizes the cost function (loss). The weights are updated in such a way that when the next forward pass uses the updated weights, the total error is reduced by a certain margin (until a minimum is reached).

For weights in the output layer (w5, w6, w7, w8)

For w5,

Let's compute how much w5 contributes to E_{total}. (Since E_{2} does not depend on w5, this is the same as its contribution to E_{1}.) If we become clear on how w5 is updated, it will be easy to generalize the same procedure to the rest of the weights. If we look closely at the example neural network, we can see that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, and sum_{o1} is affected by w5. It's time to recall the Chain Rule.

\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}

Let’s deal with each component of the above chain separately.

Component 1: partial derivative of Error w.r.t. Output

E_{total} = \sum \frac{1}{2}(target-output)^{2}
E_{total} = \frac{1}{2}(target_{1}-output_{o1})^{2} + \frac{1}{2}(target_{2}-output_{o2})^{2}

Therefore,

\frac{\partial E_{total}}{\partial output_{o1}} = 2*\frac{1}{2}*(target_{1}-output_{o1})*(-1) = output_{o1} - target_{1}

Component 2: partial derivative of Output w.r.t. Sum

The output side of a neural network unit uses a non-linear activation function; in this example, it is the Logistic function. Differentiating the Logistic function gives:

\sigma(x) = \frac{1}{1+e^{-x}}
\frac{\mathrm{d}}{\mathrm{d}x}\sigma(x) = \sigma(x)(1-\sigma(x))
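For completeness, this identity follows directly from the chain rule:

\frac{\mathrm{d}}{\mathrm{d}x}\sigma(x) = \frac{\mathrm{d}}{\mathrm{d}x}(1+e^{-x})^{-1} = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}} * \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x))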

Therefore, the derivative of the Logistic function is equal to the output multiplied by (1 - output).

\frac{\partial output_{o1}}{\partial sum_{o1}} = output_{o1} (1 - output_{o1})

Component 3: partial derivative of Sum w.r.t. Weight

sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}

Therefore,

\frac{\partial sum_{o1}}{\partial w5} = output_{h1}

Putting them together,

\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}
\frac{\partial E_{total}}{\partial w5} = [output_{o1} - target_{1}] * [output_{o1} (1 - output_{o1})] * [output_{h1}]
\frac{\partial E_{total}}{\partial w5} = 0.68492 * 0.19480 * 0.60108 = 0.08020

The new\_w_{5} is,

new\_w_{5} = w5 - n * \frac{\partial E_{total}}{\partial w5}, where n is the learning rate.

new\_w_{5} = 0.5 - 0.6 * 0.08020
new\_w_{5} = 0.45187
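In code, continuing from the forward-pass and error sketches above (with the learning rate n = 0.6 used in this article), the w5 update might look like this:

```python
# Three chain-rule components for dE_total/dw5, reusing values from the sketches above
d_E_d_out_o1 = out_o1 - target_1             # ~0.68493
d_out_o1_d_sum_o1 = out_o1 * (1 - out_o1)    # ~0.19481
d_sum_o1_d_w5 = out_h1                       # ~0.60108

grad_w5 = d_E_d_out_o1 * d_out_o1_d_sum_o1 * d_sum_o1_d_w5   # ~0.08020

n = 0.6                    # learning rate
new_w5 = w5 - n * grad_w5  # ~0.45188 (the article rounds intermediate values, giving 0.45187)
```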

We can proceed similarly for w6, w7 and w8.

For w6,

\frac{\partial E_{total}}{\partial w6} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w6}

The first two components of this chain have already been calculated. The last component is \frac{\partial sum_{o1}}{\partial w6} = output_{h2}.

\frac{\partial E_{total}}{\partial w6} = 0.68492 * 0.19480 * 0.61538 = 0.08211

The new\_w_{6} is,

new\_w_{6} = w6 - n * \frac{\partial E_{total}}{\partial w6}
new\_w_{6} = 0.6 - 0.6 * 0.08211
new\_w_{6} = 0.55073

For w7,

\frac{\partial E_{total}}{\partial w7} = \frac{\partial E_{total}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial w7}

For the first component of the above chain, let's recall how the partial derivative of the Error is computed w.r.t. the Output.

\frac{\partial E_{total}}{\partial output_{o2}} = output_{o2} - target_{2}

For the second component,

\frac{\partial output_{o2}}{\partial sum_{o2}} = output_{o2} (1 - output_{o2})

For the third component,

\frac{\partial sum_{o2}}{\partial w7} = output_{h1}

Putting them together,

\frac{\partial E_{total}}{\partial w7} = [output_{o2} - target_{2}] * [output_{o2} (1 - output_{o2})] * [output_{h1}]
\frac{\partial E_{total}}{\partial w7} = -0.17044 * 0.17184 * 0.60108
\frac{\partial E_{total}}{\partial w7} = -0.01760

The new\_w_{7} is,

new\_w_{7} = w7 - n * \frac{\partial E_{total}}{\partial w7}
new\_w_{7} = 0.7 - 0.6 * (-0.01760)
new\_w_{7} = 0.71056

Proceeding similarly, we get new\_w_{8} = 0.81081 (with \frac{\partial E_{total}}{\partial w8} = -0.01802).
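The pattern for all four output-layer weights is the same: an output unit's error signal (often called its delta) multiplied by the hidden activation feeding that weight. Here is a sketch, reusing the variables from the snippets above:

```python
# Error signal ("delta") of each output unit
delta_o1 = (out_o1 - target_1) * out_o1 * (1 - out_o1)
delta_o2 = (out_o2 - target_2) * out_o2 * (1 - out_o2)

# Gradient = delta of the output unit * hidden output feeding that weight
grad_w5 = delta_o1 * out_h1   # ~ 0.08020
grad_w6 = delta_o1 * out_h2   # ~ 0.08211
grad_w7 = delta_o2 * out_h1   # ~-0.01760
grad_w8 = delta_o2 * out_h2   # ~-0.01802

new_w5 = w5 - n * grad_w5     # ~0.45188
new_w6 = w6 - n * grad_w6     # ~0.55073
new_w7 = w7 - n * grad_w7     # ~0.71056
new_w8 = w8 - n * grad_w8     # ~0.81081
```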

For weights in the hidden layer (w1, w2, w3, w4)

Similar calculations are made to update the weights in the hidden layer; however, this time the chain becomes a bit longer. It does not matter how deep the neural network goes: all we need to find out is how much error a particular weight contributes to the total error of the network, and for that we need the partial derivative of the Error w.r.t. that particular weight. Let's work on updating w1, and we'll then be able to generalize similar calculations to update the rest of the weights.

For w1 (with respect to E1),

For simplicity let us compute \frac{\partial E_{1}}{\partial w1} and \frac{\partial E_{2}}{\partial w1} separately, and later we can add them to compute \frac{\partial E_{total}}{\partial w1}.

\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

Let’s quickly go through the above chain. We know that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, sum_{o1} is affected by output_{h1}, output_{h1} is affected by sum_{h1}, and finally sum_{h1} is affected by w1. It is quite easy to comprehend, isn’t it?

For the first component of the above chain,

\frac{\partial E_{1}}{\partial output_{o1}} = output_{o1} - target_{1}

We’ve already computed the second component. This is one of the benefits of using the chain rule. As we go deep into the network, the previous computations are re-usable.

For the third component,

sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}
\frac{\partial sum_{o1}}{\partial output_{h1}} = w_{5}

For the fourth component,

\frac{\partial output_{h1}}{\partial sum_{h1}} = output_{h1}*(1-output_{h1})

For the fifth component,

sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}
\frac{\partial sum_{h1}}{\partial w1} = i_{1}

Putting them all together,

\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
\frac{\partial E_{1}}{\partial w1} = 0.68492 * 0.19480 * 0.5 * 0.23978 * 0.1 = 0.00159

Similarly, for w1 (with respect to E2),

\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}

For the first component of the above chain,

\frac{\partial E_{2}}{\partial output_{o2}} = output_{o2} - target_{2}

The second component is already computed.

For the third component,

sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2}
\frac{\partial sum_{o2}}{\partial output_{h1}} = w_{7}

The fourth and fifth components have also been already computed while computing \frac{\partial E_{1}}{\partial w1}.

Putting them all together,

\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
\frac{\partial E_{2}}{\partial w1} = -0.17044 * 0.17184 * 0.7 * 0.23978 * 0.1 = -0.00049

Now we can compute \frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{1}}{\partial w1} + \frac{\partial E_{2}}{\partial w1}.

\frac{\partial E_{total}}{\partial w1} = 0.00159 + (-0.00049) = 0.00110.

The new\_w_{1} is,

new\_w_{1} = w1 - n * \frac{\partial E_{total}}{\partial w1}
new\_w_{1} = 0.1 - 0.6 * 0.00110
new\_w_{1} = 0.09933
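In code, reusing the deltas from the output-layer snippet above, the w1 update could be sketched as:

```python
# Both output units contribute to dE_total/dw1 through out_h1
d_out_h1_d_sum_h1 = out_h1 * (1 - out_h1)          # ~0.23978

dE1_dw1 = delta_o1 * w5 * d_out_h1_d_sum_h1 * i1   # ~0.00160 (article rounds to 0.00159)
dE2_dw1 = delta_o2 * w7 * d_out_h1_d_sum_h1 * i1   # ~-0.00049

grad_w1 = dE1_dw1 + dE2_dw1   # ~0.00111
new_w1 = w1 - n * grad_w1     # ~0.09933
```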

Proceeding similarly, we can easily update the other weights (w2, w3 and w4).

new\_w_{2} = 0.19919
new\_w_{3} = 0.29667
new\_w_{4} = 0.39597

Once we've computed all the new weights, we replace the old weights with them; that completes one backpropagation cycle. The next forward pass then uses the updated weights, the new total error is computed, and the weights are updated again based on it. This goes on until the loss converges to a minimum. In this way, a neural network starts with random values for its weights and gradually converges to optimal values.
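To tie everything together, here is one possible self-contained sketch of how these forward/backward cycles could be repeated in code for this particular 2-2-2 network. It keeps the biases fixed (as this article does), treats the single input pair as the whole training set, and updates every weight once per cycle.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

i1, i2 = 0.1, 0.5      # inputs
t1, t2 = 0.05, 0.95    # targets
b1, b2 = 0.25, 0.35    # biases (kept fixed here, as in the article)
n = 0.6                # learning rate
w = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4,
     "w5": 0.5, "w6": 0.6, "w7": 0.7, "w8": 0.8}

for cycle in range(1000):
    # Forward pass
    out_h1 = sigmoid(i1 * w["w1"] + i2 * w["w3"] + b1)
    out_h2 = sigmoid(i1 * w["w2"] + i2 * w["w4"] + b1)
    out_o1 = sigmoid(out_h1 * w["w5"] + out_h2 * w["w6"] + b2)
    out_o2 = sigmoid(out_h1 * w["w7"] + out_h2 * w["w8"] + b2)
    E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2

    # Backward pass: deltas for the output and hidden units
    delta_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)
    delta_o2 = (out_o2 - t2) * out_o2 * (1 - out_o2)
    delta_h1 = (delta_o1 * w["w5"] + delta_o2 * w["w7"]) * out_h1 * (1 - out_h1)
    delta_h2 = (delta_o1 * w["w6"] + delta_o2 * w["w8"]) * out_h2 * (1 - out_h2)

    grads = {"w5": delta_o1 * out_h1, "w6": delta_o1 * out_h2,
             "w7": delta_o2 * out_h1, "w8": delta_o2 * out_h2,
             "w1": delta_h1 * i1, "w3": delta_h1 * i2,
             "w2": delta_h2 * i1, "w4": delta_h2 * i2}

    # Update all weights with the gradients computed in this cycle
    for key in w:
        w[key] -= n * grads[key]

print(E_total)  # error from the last cycle: much smaller than the initial 0.24908
print(w)
```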

I hope you found this article useful. I’ll see you in the next one.


By Rabindra Lamsal

Ph.D. Candidate (Computer Science) at the University of Melbourne.

