According to Wikipedia, “Gradient descent is a first-order iterative optimization algorithm for finding local minima of a differentiable function.” Sounds like a lot, right? In this article, let’s get acquainted with the Gradient descent algorithm in the most straightforward (and ‘simplest’) way.
Before we continue with understanding the ABCs of Gradient descent (and dig into the maths), let’s see what exactly the algorithm is about: Gradient descent is an iterative method used to find the best possible set of parameters for a model, given a set of data points (a dataset in numerical form) in a vector space. The search for the optimal set of parameters starts with a random guess and eventually reaches the minimum. Once we understand the basics, all these concepts automatically start to make sense. Let’s get started.
An Example Model
In order to understand the concept of parameter optimization through Gradient descent, we consider the simplest model out there in the machine learning world: a linear regression model with a single feature.
y = b + wx + \epsilon

where y is the dependent variable (class, label) that is predicted based on x, the feature (independent variable); parameter b is the bias, which gives the average value of y when x is 0; parameter w is the weight of x, which indicates how much y increases when we increase x by 1 unit; and \epsilon is noise (or error). For example, consider the following model:
calorieIntake = 1200 + 50*number~of~food~intakes~per~day + \epsilon

In the above equation (analogous to y = b + wx + \epsilon), 1200 is the bias (b; the minimum calorie intake), 50 is the weight (w; one intake increases y by 50), and \epsilon is noise. The model says that a human body’s minimum calorie intake per day is 1200, and each additional food intake in a day increases the total calorie intake accordingly. Suppose your calorie intake is 1757; then b = 1200, number of food intakes per day = 11, and \epsilon = 7. Leaving \epsilon aside, if we are given x and we already have the parameters b and w, we can easily compute y. The value of y (the prediction) depends upon the value of x; this is the reason y is also termed the dependent variable while x is the independent variable. Using Gradient descent, given a set of data points, we aim to find the best values of the bias and the weight.
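To make the prediction concrete, here is a tiny Python sketch of the calorie model (the variable names are mine, not from any library):

```python
# A minimal sketch of the calorie model: y = b + w*x, leaving the noise aside.
b = 1200   # bias: the minimum daily calorie intake
w = 50     # weight: calories added per food intake
x = 11     # number of food intakes per day

y = b + w * x
print(y)   # 1750; the article's 1757 also includes a noise term of 7
```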
In a real-world scenario, the model will be much more complex, with multiple features (such as y = b + w_1x_1 + w_2x_2 + w_3x_3 + … + w_nx_n + \epsilon). The idea discussed in this article for single-feature linear regression generalizes accordingly to higher dimensions, as the sketch below shows. Remember that the aim of the Gradient descent algorithm is to search for the best set of parameters (b and w in the case of single-feature linear regression; b and w_1, w_2, w_3, …, w_n in the case of multivariate linear regression).
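As a sketch of that generalization (with made-up weights and feature values, assuming NumPy), the multivariate prediction is just a dot product plus the bias:

```python
import numpy as np

# Multivariate sketch: y = b + w_1*x_1 + ... + w_n*x_n, written as a dot product.
# The weights and feature values below are made up purely for illustration.
b = 1200.0
w = np.array([50.0, 30.0, -10.0])   # one weight per feature
x = np.array([11.0, 2.0, 4.0])      # one value per feature

y = b + np.dot(w, x)                # 1200 + 550 + 60 - 40 = 1770.0
```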
The 3 Steps for Gradient Descent
Intuition: The optimal values of the parameters are found at the minimum of a loss function; there is a set of parameters for which the loss function gives its minimum value. Think of the loss function as a hill and the base of the hill as its minimum, where we find the optimal values of the parameters. We need to go downhill in multiple steps (iteratively) and identify the optimal values of b and w.
Step 1: Initializing the parameters
The first step is to randomly initialize the parameters: both the bias and the weight. If we need to get to the base of the hill, we need to start somewhere, right? So the first step is random initialization, which places us at an arbitrary point on the surface of the hill.
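In code, the random start might look like this (a sketch using NumPy; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)   # a fixed seed, only so the run is reproducible

# Step 1: randomly initialize both parameters; this is our starting
# point somewhere on the surface of the hill.
b = rng.standard_normal()
w = rng.standard_normal()
```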
Step 2: Compute loss
Here, we use the first, randomly initialized set of parameters and compute the output of the model. For every x_i, we compute \widehat{y}_i. Obviously, the predictions \widehat{y}_i are going to be poor at this point. But I promise that from here we are guaranteed to go downhill in terms of errors. How? Hang on, we are almost there.
The errors can be computed simply as the difference of predicted value and true value:
error_i = \widehat{y}_i - y_i

Now we need to aggregate these errors into some mathematical form. That’s where the loss function comes into the picture. The loss, in machine learning, is an aggregate picture of the errors associated with the data points in the training set (there are variants: batch, where all data points are used for computing the loss; mini-batch, where a portion of the data points is used; and stochastic, where a single data point is used).
The Loss Function
In regression, the loss is computed as the average of all squared errors, the Mean Squared Error (MSE).
MSE = \frac{1}{n}\sum_{i=1}^{n}error_i^2 = \frac{1}{n}\sum_{i=1}^{n}(\widehat{y}_i - y_i)^2

Since we know that \widehat{y}_i was computed using the parameters b and w as b + wx_i:
MSE = \frac{1}{n}\sum_{i=1}^{n}(b+wx_i - y_i)^2

We get some value (it’s the loss) when all the variables in the above loss function are plugged in. Moving forward, we want to decrease this value to the minimum; in the upcoming steps, we are going to compute a new set of parameters (b and w) that are guaranteed to give us a new loss value lower than the one we’ve just computed.
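As a sketch, here is how the MSE could be computed with NumPy (x and y are arrays of feature values and labels; mse_loss is my name for it, not a library function):

```python
import numpy as np

def mse_loss(b, w, x, y):
    """Mean Squared Error for the single-feature model y_hat = b + w*x."""
    y_hat = b + w * x             # predictions for all data points at once
    errors = y_hat - y            # error_i = y_hat_i - y_i
    return np.mean(errors ** 2)   # average of the squared errors
```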
The Partial Derivatives
Who do you think is responsible for the errors? Yes, it’s b and w. Remember, in the first step, we randomly initialized them. Now we are going to optimize their values so that when we plug in the new values of b, w, and x_i, we are going to get a new loss that will be less than the current loss.
Why do we take partial derivatives? Since b and w are the ones responsible for the loss, we want to see what portion of the loss each of these two parameters contributes individually. Based on this contribution information, we optimize their values so as to decrease the loss. Still confused? Going forward, we’re going to play around with some basic calculus and update the parameters; everything will start to fall into place!
Partial derivative of the loss (MSE) with respect to parameter b
Here we use the following concepts: the chain rule and partial derivatives. It is not that difficult to follow; please look these up on the internet if you would like a deeper understanding of them. Note that, since \widehat{y}_i = b + wx_i, the partial derivatives of \widehat{y}_i with respect to b and w are 1 and x_i, respectively.
\frac{\partial MSE}{\partial b} = \frac{\partial MSE}{\partial \widehat{y}_i} * \frac{\partial \widehat{y}_i}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2*(b+wx_i - y_i)*1

b + wx_i can be written as \widehat{y}_i:
= \frac{2}{n} \sum_{i=1}^{n} (\widehat{y}_i - y_i)

Partial derivative of the loss (MSE) with respect to parameter w
\frac{\partial MSE}{\partial w} = \frac{\partial MSE}{\partial \widehat{y}_i} * \frac{\partial \widehat{y}_i}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (b+wx_i - y_i)*x_i = \frac{2}{n} \sum_{i=1}^{n} (\widehat{y}_i - y_i)*x_i

Step 3: Update parameters b and w
Whoa, we’re at the final step! Now we use the partial derivatives to update the parameters b and w.
new~b = b - learning~rate*\frac{\partial MSE}{\partial b}

new~w = w - learning~rate*\frac{\partial MSE}{\partial w}

I know a lot is happening here: Why the minus sign? Why not a plus sign? What’s the learning rate?
Okay. I guess you have clearly understood the need for computing the partial derivatives (the gradients). The partial derivative of the loss function with respect to a parameter measures that parameter’s contribution to the loss. So we went ahead and computed the gradients for both parameters. Now we’re at the last stage, where we adjust the current values of the parameters based on the gradients. We know that b and w are the culprits behind the loss, so we update their values based on the gradients to decrease it (we’re going downhill). I’ll come to the minus sign in a moment, but first let’s make the learning rate clear. The learning rate (a very important hyper-parameter in machine learning) simply refers to the size of the steps we take while moving down the hill. The value of this hyper-parameter plays a crucial role: a lower learning rate makes the model learn very slowly (because it takes steps of very small magnitude), while a larger learning rate introduces the risk of divergence.
Sorry for these unprofessional illustrations. But I believe they convey the concept of the learning rate as the step magnitude while moving downhill.
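Putting the gradients we derived and the update rule into code, a sketch might look like this (it assumes b, w, x, and y are already defined as in the earlier sketches; 0.01 is just a common starting value for the learning rate, not a universal choice):

```python
import numpy as np

def gradients(b, w, x, y):
    """Partial derivatives of the MSE with respect to b and w."""
    errors = (b + w * x) - y       # y_hat_i - y_i for every data point
    db = 2 * np.mean(errors)       # dMSE/db = (2/n) * sum(errors)
    dw = 2 * np.mean(errors * x)   # dMSE/dw = (2/n) * sum(errors * x_i)
    return db, dw

learning_rate = 0.01               # step size down the hill

db, dw = gradients(b, w, x, y)     # the gradients derived above
b = b - learning_rate * db         # move b against its gradient
w = w - learning_rate * dw         # move w against its gradient
```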
Why the minus sign while updating the parameters?
Consider the above function f(x), and let’s discuss why we have a minus sign while updating a parameter. If we were at the red dot then we would want to decrease the value of x to get to the minima. The slope of f(x) at the red dot is positive (gradient is positive), and therefore the minus sign does its work to decrease the current value of x. However, if we were at the green dot then we would want to increase the value of x to get to the minima. The slope of f(x) at the green dot is negative (gradient is negative); therefore, the minus sign makes the negative gradient term positive and hence we end up increasing the value of x. Therefore, either way, the minus sign helps in optimizing the parameter.
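A tiny worked example makes this concrete. Take f(x) = x^2 (my choice of function, with its minimum at x = 0), whose gradient is 2x:

```python
def f_prime(x):
    return 2 * x   # derivative of f(x) = x**2, whose minimum is at x = 0

lr = 0.1

x = 3.0                   # like the red dot: the slope here is positive
x = x - lr * f_prime(x)   # 3.0 - 0.1*6.0 = 2.4 -> x decreases toward the minimum

x = -3.0                  # like the green dot: the slope here is negative
x = x - lr * f_prime(x)   # -3.0 - 0.1*(-6.0) = -2.4 -> x increases toward the minimum
```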
The Iterations
After step 3 (updating the parameters), we again execute step 2 (computing the loss), again update the parameters, and again compute the loss. We iteratively make steps down the hill and reach the minimum (where the loss is lowest). At the minimum lies our optimal set of parameters, which we can use to compute the outputs of the model (make predictions) on unseen data points (the available data points are used for training and validation).
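Here is a minimal sketch that ties the three steps together on synthetic data (the data, seed, and epoch count are my choices for illustration):

```python
import numpy as np

# A minimal end-to-end sketch of the three steps on synthetic data.
# The true parameters are b = 1200 and w = 50, echoing the calorie example.
rng = np.random.default_rng(42)
x = rng.uniform(0, 15, size=100)                 # feature values
y = 1200 + 50 * x + rng.normal(0, 5, size=100)   # labels with a little noise

b, w = rng.standard_normal(), rng.standard_normal()   # Step 1: random init
learning_rate = 0.01

for epoch in range(10_000):
    errors = (b + w * x) - y       # Step 2: predictions and errors
    db = 2 * np.mean(errors)       # gradients from the derivations above
    dw = 2 * np.mean(errors * x)
    b -= learning_rate * db        # Step 3: update the parameters
    w -= learning_rate * dw

print(b, w)   # should end up close to 1200 and 50
```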
I hope this article was helpful. I’ll see you at the next one.
Cover Image credits: pixabay