\hat{y} = g(x)
f(x) = \sum (y - \hat{y})^2
where y denotes the true observations of the target variable and \hat{y} the predictions; the parameters to be optimized are denoted x, and we are looking for \operatorname{arg\,min}_x f(x). This function looks like a simple squared function, and for a linear model it is. But if we assume a nonlinear approach for the prediction of \hat{y}, we end up with a function of higher degree. In a deep neural network, for example, the loss is a nonlinear function of the weights (x), with the (constant) input features held fixed.
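To make this concrete, here is a minimal R sketch of such a loss. The toy data and the sigmoid-shaped prediction function g are made up purely for illustration; the point is that, with the data held fixed, the squared-error loss becomes a (nonlinear) function of the parameter x alone:

# toy data: fixed input features and true targets (made up for illustration)
features = c(0.5, 1.0, 1.5, 2.0)
y = c(1.2, 1.9, 2.1, 2.4)

# hypothetical nonlinear prediction function g, parameterized by a single weight x
g = function(x) 3 / (1 + exp(-x * features))

# squared-error loss as a function of the parameter x only
loss = function(x) sum((y - g(x))^2)

loss(0.5)   # evaluate the loss for one candidate value of the weight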
So, let's assume, as an example, that the loss function is the following polynomial:
f(x) = x^6 + 4x^5 + 2x^4 + 5x^3 + 20x^2 + x + 250
where f is the loss and x are the weights. In the following,
- f is the loss function,
- f' is its first derivative with respect to x, and
- f'' is its second derivative with respect to x.
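For reference, the two derivatives of this polynomial (they reappear in the code below) are:

f'(x) = 6x^5 + 20x^4 + 8x^3 + 15x^2 + 40x + 1
f''(x) = 30x^4 + 80x^3 + 24x^2 + 30x + 40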
We want to find the minimum of f. But there are two minima and one maximum, so first we locate the extrema. For this purpose we use the Newton iteration. If we place a tangent at an arbitrary point x_0 of the function whose zero we are looking for, we can calculate where this linear tangent crosses zero instead of the function itself. Repeating this procedure, we iteratively approach the zeros of f' and thus obtain the extrema. Call the x-intercept of the tangent x_1.
(x_0 - x_1) \cdot f'(x_0) = f(x_0)
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}
Once we have x_1, we repeat the step, now starting from x_1:
x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}
and so forth. To find the three zeros of f', we set the starting values x_0 close to the suspected zeros, which can be read off the plot of the function.
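Since we are searching for the zeros of f' (not of f), we apply the same scheme one derivative higher, i.e. with f' and f'' in place of f and f'; this is exactly what the code below does:

x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}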
x0 = c(1, -1, -4)                 # starting values near the suspected zeros of f'
j = 1
for (x in x0) {
  for (i in 1:20) {               # 20 Newton steps per starting value
    df  = 6 * x ^ 5 + 20 * x ^ 4 + 8 * x ^ 3 + 15 * x ^ 2 + 40 * x + 1  # f'(x)
    ddf = 30 * x ^ 4 + 80 * x ^ 3 + 24 * x ^ 2 + 30 * x + 40            # f''(x)
    x = x - df / ddf              # Newton update on f'
  }
  assign(paste("x", j, sep = ""), x)  # store the result as x1, x2, x3
  j = j + 1
}
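As a quick sanity check, the three stored values can be printed; the two minima lie near x ≈ -0.03 and x ≈ -2.9 (the same values that show up in the gradient-descent runs below), with the maximum in between:

c(x1, x2, x3)   # the two minima and the maximum between them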
Now that we know where f has its extrema, we can apply gradient descent. (For gradient descent itself you do not need to know where the extrema are; that is only useful for this example.) To search for the minima, and especially the global minimum, we apply gradient descent.
Gradient descent is an iterative approach. It starts from initial values and moves downhill. The initial values are usually set randomly, although there are other approaches.
x_{i+1} = x_i - \mathrm{learning\_rate} \cdot f'(x_i)
where x is, for example, the weight of a connection between neurons, f'(x_i) is the partial derivative with respect to that weight, and i is the iteration index.
Why subtract the learning_rate times the derivative? Well, if the derivative is positive, an increase of x goes along with an increase of f, and vice versa. So if the derivative is positive, x must be decreased, and vice versa. For example, at x = 2 we have f'(2) = 717 > 0, so the update pushes x to the left, downhill. The animation gd2.gif illustrates how this plays out over the iterations.
Let's start with the initial value of:
x = 2
We set the learning rate to a small value.
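The exact learning rate is not shown here, so the following loop is only a sketch with an assumed value of 0.001:

x = 2
learning_rate = 0.001   # assumed value; any comparably small rate behaves similarly
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate                        # gradient-descent update
}
x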
Result:
-0.03
Here it becomes clear why the initial values (weights) are so important. The minimum found corresponds to x1 and not to x3. This is due to the initial value of x = 2.
Let us now change it to:
x = -2
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate                        # gradient-descent update
}
Result:
-2.9
This is the global minimum we have been searching for.
Now, we change the learning rate to a larger value.
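Again, the exact rate is not given, so the following is a sketch with an assumed learning_rate of 0.007, a value chosen so that it reproduces the oscillation described below; the loop itself is unchanged:

x = 2
learning_rate = 0.007   # assumed larger value (the original rate is not shown)
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate
}
x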
Result:
-2.67
Well, let's plot the first 100 x iterations:
x = 2
checkX = NULL                      # keeps track of x after every update
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate
  checkX = c(checkX, x)
}
plot(checkX[1:100], type = "l")    # plot the first 100 iterations
This phenomenon is called oscillation, and it prevents the optimization from converging. It happens because the absolute value of the gradient f' is the same on the two opposite sides of the minimum, at x = -3.03 and x = -2.67:
If we subtract the magnitudes of the gradients from the two opposite sides, we should get a difference of zero.
x = -2.67302
df1 = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # gradient on one side of the minimum
x = -3.02809
df2 = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # gradient on the other side (opposite sign)
abs(df1) - abs(df2)                                # difference of the magnitudes
Difference is: 0
Finally, let's look at a short linear-regression example in Python with scikit-learn:
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
# generate a simple one-feature regression dataset
X, y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1,
                       noise=20, random_state=13)  # seed value assumed; the original line is truncated
plt.scatter(X, y)
from sklearn.model_selection import train_test_split
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.linear_model import LinearRegression
# create the model object
lr = LinearRegression()
# fit the model on the training set
lr.fit(X_train, y_train)
print(lr.coef_)
print(lr.intercept_)
Output:
[28.12597332]
-2.271014426178382
# predicting and checking the R2 score
y_pred = lr.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
Output:
0.6345158782661013
Conclusion
We can see that the optimization process of gradient descent is highly dependent on the initial values (weights). So, it is possible that the best solution you find when using a method based on gradient descent, such as a neural network, is not optimal. To prevent oscillation, one can add a momentum term or change the learning rate. But even that is no guarantee of finding the global minimum. A poor choice of the learning rate can also lead to exploding gradients. This circumstance can be counteracted by gradient clipping.
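To illustrate the momentum idea, here is a minimal R sketch of the classical momentum update applied to our polynomial; the learning rate and the momentum coefficient of 0.9 are assumed values, not taken from the text. The velocity term accumulates past gradients, so successive steps in opposite directions partly cancel instead of bouncing back and forth at full strength:

x = -2
v = 0                     # velocity (accumulated, decayed gradient)
learning_rate = 0.001     # assumed value
momentum = 0.9            # assumed momentum coefficient
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  v = momentum * v - learning_rate * df             # update the velocity
  x = x + v                                         # step by the velocity, not the raw gradient
}
x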