
Gradient Descent

What is Gradient Descent?

Gradient descent is an optimization algorithm, i.e. an algorithm for finding the parameter values that give the best performance. Given a differentiable function, it finds a minimum of that function.

This is an iterative algorithm.

Gradient descent is a general algorithm that is used in many machine learning methods such as linear regression, logistic regression, and t-SNE, and it is used extensively in deep learning.

You can't do much in deep learning without prior knowledge of gradient descent.



In gradient descent you start from a random initial value and move towards the minimum by taking learning steps of an appropriate size.

In gradient descent the learning rate has to be chosen appropriately; it is important to tune it properly.

The learning rate determines the step size we take to move in the direction of the local minimum. You can also see that while we approach the minimum, the step size constantly decreases. This is not done manually: it is the effect of the slope. As we go downhill, the slope constantly decreases, and since the update to x is the product of the learning rate and the slope, the update decreases as well.

What should be the step size (learning rate α)?

We generally use a learning rate between 0.0 and 1.0.

Remember that the learning rate should be neither too high nor too low; the figure illustrates this.
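To make this concrete, here is a minimal Python sketch (for illustration only; the worked examples later in this post use R) of the update on the simple function f(x) = x^2. Its slope 2x shrinks as x approaches the minimum at 0, so the step learning_rate · slope shrinks automatically, without us ever changing the learning rate:

# Python sketch: the update shrinks as the slope shrinks
x = 2.0                 # arbitrary starting point
learning_rate = 0.1
for i in range(10):
    slope = 2 * x                    # derivative of f(x) = x^2
    update = learning_rate * slope   # the step we take this iteration
    x = x - update
    print(f"iteration {i + 1}: x = {x:.4f}, update = {update:.4f}")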



First of all we need a loss function. A very common approach is
ŷ = g(x)

f = Σ (y − ŷ)²

and we search for argmin_x f, where y are the true observations of the target variable and ŷ are the predictions. The parameters to be optimized are denoted by x. This function looks like a squared function - and yes, it is. But if we assume a nonlinear approach for the prediction of ŷ, we end up with a higher-degree function. In deep neural networks, for example, the loss is a function of the weights (x) and the (constant) input features, and this function is nonlinear.
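To make the notation concrete, here is a small Python sketch (the data values are made up purely for illustration) that evaluates the sum-of-squared-errors loss for a simple linear prediction ŷ = g(x) = x · feature, where x is the single weight to be optimized:

import numpy as np

# Toy data: one input feature and the true targets y (illustrative values)
features = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def loss(x):
    # Sum of squared errors for the prediction y_hat = g(x) = x * features
    y_hat = x * features
    return np.sum((y - y_hat) ** 2)

print(loss(1.0))   # a poor weight gives a large loss
print(loss(2.0))   # a weight close to the data-generating slope gives a small loss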

So, let's assume as an example that the loss function looks like the following polynomial:

f(x) = x^6 + 4x^5 + 2x^4 + 5x^3 + 20x^2 + x + 250
 
where f is the loss and x stands for the weights.


  • f is the loss function,
  • f' is its first derivative with respect to x, and
  • f'' is its second derivative with respect to x.
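Written out term by term, these derivatives are (they reappear as df and ddf in the code below):

f'(x) = 6x^5 + 20x^4 + 8x^3 + 15x^2 + 40x + 1
f''(x) = 30x^4 + 80x^3 + 24x^2 + 30x + 40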

We want to find the minimum. But there are two minima and one maximum. So, first we find out where the extreme values are. For this purpose we use the Newton iteration. If we place a tangent at an arbitrary point x0 of f', we can calculate where this linear tangent is zero instead. Through this procedure we iteratively approach the zero points of f' and thus obtain the extrema. The intercept of the tangent with the x-axis is denoted x1.

(x0 − x1) · f''(x0) = f'(x0)
x1 = x0 − f'(x0) / f''(x0)

When we have x1, we start the procedure again, but now from x1.

x2 = x1 − f'(x1) / f''(x1)

and so forth... To find the three points where f' crosses the x-axis, we set x0 close to the suspected zero points, which we can read off the function plot.

x0 = c(1, -1, -4)   # starting points near the suspected zeros of f'
j = 1

for (x in x0) {
  
  # 20 Newton steps: x <- x - f'(x) / f''(x)
  for (i in 1:20) {
    df  = 6 * x ^ 5 + 20 * x ^ 4 + 8 * x ^ 3 + 15 * x ^ 2 + 40 * x + 1   # f'(x)
    ddf = 30 * x ^ 4 + 80 * x ^ 3 + 24 * x ^ 2 + 30 * x + 40             # f''(x)
    x = x - df / ddf
  }
  assign(paste("x", j, sep = ""), x)   # store the result as x1, x2, x3
  j = j + 1
}

Results are:

x1 =  -0.03
x2 =  -1.45
x3 =  -2.9



Now that we know where f has its extrema, we can apply gradient descent to search for the minima, especially the global minimum. (By the way, for gradient descent itself you don't need to know where the extrema are; that is only important for this example.)

Gradient descent is an iterative approach. It starts from initial values and moves downhill. The initial values are usually set randomly, although there are other approaches.

x_{i+1} = x_i − learning_rate · f'(x_i)

where x is, for example, the weight of a connection between neurons, f'(x_i) is the partial derivative with respect to that weight, and i is the iteration index.

Why the negative learning_rate times the derivative? Well, if the derivative is positive, an increase of x goes along with an increase of f, and vice versa. So if the derivative is positive, x must be decreased, and vice versa. The animation (gd2.gif) demonstrates how it might work.
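For example, at the starting point x = 2 used below, f'(2) = 6·32 + 20·16 + 8·8 + 15·4 + 40·2 + 1 = 717 > 0, so the update −learning_rate · f'(2) pushes x to the left, i.e. downhill.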


Let's start with the initial value of:

x = 2
We set the learning rate to:
learning_rate = .001
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # gradient f'(x)
  x = x - df * learning_rate                        # gradient descent update
}

Result:

-0.03

Here it becomes clear why the initial values (weights) are so important. The minimum found corresponds to x1 and not x3. This is due to the initial value of x = 2. Let us now change it to:
x = -2

for(i in 1:2000){
  df = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1
  x = x - df*learning_rate
}


Result:

-2.9
This is the global minimum we have been searching for.
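Indeed, evaluating f at both minima confirms this: f(−0.03) ≈ 249.99, while f(−2.9) ≈ 209.19, so the minimum near x = −2.9 is the lower, global one.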
Now, we change the learning rate to:

learning_rate = .007
x = 2
for(i in 1:2000){
  df = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1
  x = x - df*learning_rate
}

Result:

-2.67

Well, let's plot the first 100 x iterations:

x = 2
checkX = NULL
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1
  x = x - df * learning_rate
  checkX = c(checkX, x)   # record x at every iteration
}
plot(checkX[1:100], type = "l", xlab = "iteration", ylab = "x")   # plot the first 100 iterations



This phenomenon is called oscillation, and it prevents the optimization from converging. It happens because the gradient f' has the same absolute value, but opposite sign, on the two opposite sides at x = -3.03 and x = -2.67:


Since in a stable two-point oscillation the two updates cancel each other out, the absolute values of the gradients on the opposite sides must be equal: if we subtract them, we should get a difference of (approximately) zero.

x = -2.67302
df1 = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

x = -3.02809
df2 = 6*x^5+20*x^4+8*x^3+15*x^2+40*x+1

abs(df1) - abs(df2)   # approximately 0

The difference is approximately 0.
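As a cross-check, here is a small Python sketch (it simply mirrors the R loop above) that runs gradient descent with learning_rate = 0.007 from x = 2 and prints the last few iterates, which have settled into the two-point cycle:

# Python sketch: reproduce the oscillation caused by the too-large learning rate
def df(x):
    # f'(x) for f(x) = x^6 + 4x^5 + 2x^4 + 5x^3 + 20x^2 + x + 250
    return 6*x**5 + 20*x**4 + 8*x**3 + 15*x**2 + 40*x + 1

x = 2.0
learning_rate = 0.007
history = []
for i in range(2000):
    x = x - learning_rate * df(x)
    history.append(x)

print(history[-4:])   # the iterates keep jumping between roughly -2.67 and -3.03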


Let's see the code. We start by fitting a plain linear regression with scikit-learn:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

# Generate a simple one-feature regression dataset
# (the random_state value was cut off in the original; 13 is assumed here)
X, y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1,
                       noise=20, random_state=13)

plt.scatter(X, y)

# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Creating the model object
lr = LinearRegression()

# Fitting the model on the training set
lr.fit(X_train, y_train)
print(lr.coef_)
print(lr.intercept_)

Output:

[28.12597332]
-2.271014426178382


# Predicting on the test set and checking the R2 score
from sklearn.metrics import r2_score

y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

Output:

0.6345158782661013
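To tie this back to the topic of the post, the same coefficient and intercept can also be approximated with gradient descent instead of the closed-form solution. The following is only a sketch (it is not part of the original code; the learning rate and the number of epochs are assumed values) of batch gradient descent on the mean squared error for this one-feature dataset, reusing X_train, y_train, and np from above. Its result should come close to lr.coef_ and lr.intercept_.

# Batch gradient descent sketch for simple linear regression (illustrative)
m, b = 0.0, 0.0          # slope and intercept, initialized at zero
learning_rate = 0.01     # assumed value
epochs = 1000            # assumed value

for _ in range(epochs):
    y_hat = m * X_train.ravel() + b
    error = y_train - y_hat
    # Gradients of the mean squared error with respect to m and b
    dm = -2 * np.mean(error * X_train.ravel())
    db = -2 * np.mean(error)
    m = m - learning_rate * dm
    b = b - learning_rate * db

print(m, b)              # should be close to lr.coef_[0] and lr.intercept_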


Conclusion

We can see that the optimization process of gradient descent depends heavily on the initial values (weights). So it is possible that the best solution you find when using a method based on gradient descent - such as a neural network - is not optimal. To prevent oscillation, one can add a momentum term or change the learning rate. But even that is no guarantee of finding the global minimum. A poor choice of the learning rate can also lead to exploding gradients. This can be counteracted by gradient clipping.
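To sketch the momentum idea mentioned above (a generic illustration of the update rule on a simple quadratic, not the polynomial example from this post; the values 0.1 and 0.9 are common but assumed choices), the update keeps a running "velocity" that accumulates past gradients and damps the back-and-forth jumps:

# Gradient descent with a momentum term (Python sketch)
x = 2.0
learning_rate = 0.1
momentum = 0.9
v = 0.0                   # velocity that accumulates past gradients
for _ in range(500):
    grad = 2 * x          # f'(x) for f(x) = x^2
    v = momentum * v - learning_rate * grad
    x = x + v
print(x)                  # ends up very close to the minimum at x = 0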




