\hat{y} = g(x)
f(x) = \sum (y - \hat{y})^2
where y denotes the true observations of the target variable and \hat{y} the predictions; the parameters to be optimized are denoted x, and we are looking for \operatorname{arg\,min}_x f(x). This function looks like a simple squared function, and for a linear model it is. But if we assume a nonlinear approach for the prediction of \hat{y}, we end up with a function of higher degree. In a deep neural network, for example, the loss is a nonlinear function of the weights (x), with the (constant) input features held fixed.
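To make this concrete, here is a minimal R sketch of such a loss. The toy data and the sigmoid-shaped prediction function g are made up purely for illustration; the point is that, with the data held fixed, the squared-error loss becomes a (nonlinear) function of the parameter x alone:

# toy data: fixed input features and true targets (made up for illustration)
features = c(0.5, 1.0, 1.5, 2.0)
y = c(1.2, 1.9, 2.1, 2.4)

# hypothetical nonlinear prediction function g, parameterized by a single weight x
g = function(x) 3 / (1 + exp(-x * features))

# squared-error loss as a function of the parameter x only
loss = function(x) sum((y - g(x))^2)

loss(0.5)   # evaluate the loss for one candidate value of the weight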
So, let's assume, as an example, that the loss function is the following polynomial:
f(x) = x^6 + 4x^5 + 2x^4 + 5x^3 + 20x^2 + x + 250
where f is the loss and x are the weights. In the following,
- f is the loss function,
- f' is its first derivative with respect to x, and
- f'' is its second derivative with respect to x.
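For reference, the two derivatives of this polynomial (they reappear in the code below) are:

f'(x) = 6x^5 + 20x^4 + 8x^3 + 15x^2 + 40x + 1
f''(x) = 30x^4 + 80x^3 + 24x^2 + 30x + 40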
We want to find the minimum of f. But there are two minima and one maximum, so first we locate the extrema. For this purpose we use the Newton iteration. If we place a tangent at an arbitrary point x_0 of the function whose zero we are looking for, we can calculate where this linear tangent crosses zero instead of the function itself. Repeating this procedure, we iteratively approach the zeros of f' and thus obtain the extrema. Call the x-intercept of the tangent x_1.
(x_0 - x_1) \cdot f'(x_0) = f(x_0)
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}
Once we have x_1, we repeat the step, now starting from x_1:
x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}
and so forth. To find the three zeros of f', we set the starting values x_0 close to the suspected zeros, which can be read off the plot of the function.
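Since we are searching for the zeros of f' (not of f), we apply the same scheme one derivative higher, i.e. with f' and f'' in place of f and f'; this is exactly what the code below does:

x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}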
x0 = c(1, -1, -4)                 # starting values near the suspected zeros of f'
j = 1
for (x in x0) {
  for (i in 1:20) {               # 20 Newton steps per starting value
    df  = 6 * x ^ 5 + 20 * x ^ 4 + 8 * x ^ 3 + 15 * x ^ 2 + 40 * x + 1  # f'(x)
    ddf = 30 * x ^ 4 + 80 * x ^ 3 + 24 * x ^ 2 + 30 * x + 40            # f''(x)
    x = x - df / ddf              # Newton update on f'
  }
  assign(paste("x", j, sep = ""), x)  # store the result as x1, x2, x3
  j = j + 1
}
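As a quick sanity check, the three stored values can be printed; the two minima lie near x ≈ -0.03 and x ≈ -2.9 (the same values that show up in the gradient-descent runs below), with the maximum in between:

c(x1, x2, x3)   # the two minima and the maximum between them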
Now that we know where f has its extrema, we can apply gradient descent. (For gradient descent itself you do not need to know where the extrema are; that is only useful for this example.) To search for the minima, and especially the global minimum, we apply gradient descent.
Gradient descent is an iterative approach. It starts from initial values and moves downhill. The initial values are usually set randomly, although there are other approaches.
x_{i+1} = x_i - \mathrm{learning\_rate} \cdot f'(x_i)
where x is, for example, the weight of a connection between neurons, f'(x_i) is the partial derivative with respect to that weight, and i is the iteration index.
Why subtract the learning_rate times the derivative? Well, if the derivative is positive, an increase of x goes along with an increase of f, and vice versa. So if the derivative is positive, x must be decreased, and vice versa. For example, at x = 2 we have f'(2) = 717 > 0, so the update pushes x to the left, downhill. The animation gd2.gif illustrates how this plays out over the iterations.
Let's start with the initial value of:
x = 2
We set the learning rate to a small value.
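The exact learning rate is not shown here, so the following loop is only a sketch with an assumed value of 0.001:

x = 2
learning_rate = 0.001   # assumed value; any comparably small rate behaves similarly
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate                        # gradient-descent update
}
x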
Result:
-0.03
Here it becomes clear why the initial values (weights) are so important. The minimum found corresponds to x1 and not to x3. This is due to the initial value of x = 2.
Let us now change it to:
x = -2
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate                        # gradient-descent update
}
Result:
-2.9
This is the global minimum we have been searching for.
Now, we change the learning rate to a larger value.
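Again, the exact rate is not given, so the following is a sketch with an assumed learning_rate of 0.007, a value chosen so that it reproduces the oscillation described below; the loop itself is unchanged:

x = 2
learning_rate = 0.007   # assumed larger value (the original rate is not shown)
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate
}
x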
Result:
-2.67
Well, let's plot the first 100 x iterations:
x = 2
checkX = NULL                      # keeps track of x after every update
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  x = x - df * learning_rate
  checkX = c(checkX, x)
}
plot(checkX[1:100], type = "l")    # plot the first 100 iterations
This phenomenon is called oscillation, and it prevents the optimization from converging. It happens because the absolute value of the gradient f' is the same on the two opposite sides of the minimum, at x = -3.03 and x = -2.67:
If we subtract the magnitudes of the gradients from the two opposite sides, we should get a difference of zero.
x = -2.67302
df1 = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # gradient on one side of the minimum
x = -3.02809
df2 = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # gradient on the other side (opposite sign)
abs(df1) - abs(df2)                                # difference of the magnitudes
Difference is: 0
Finally, let's look at a short linear-regression example in Python with scikit-learn:
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
# generate a simple one-feature regression dataset
X, y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1,
                       noise=20, random_state=13)  # seed value assumed; the original line is truncated
plt.scatter(X, y)
from sklearn.model_selection import train_test_split
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.linear_model import LinearRegression
# create the model object
lr = LinearRegression()
# fit the model on the training set
lr.fit(X_train, y_train)
print(lr.coef_)
print(lr.intercept_)
Output:
[28.12597332]
-2.271014426178382
# predicting and checking the R2 score
y_pred = lr.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
Output:
0.6345158782661013
Conclusion
We can see that the optimization process of gradient descent is highly dependent on the initial values (weights). So, it is possible that the best solution you find when using a method based on gradient descent, such as a neural network, is not optimal. To prevent oscillation, one can add a momentum term or change the learning rate. But even that is no guarantee of finding the global minimum. A poor choice of the learning rate can also lead to exploding gradients. This circumstance can be counteracted by gradient clipping.
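To illustrate the momentum idea, here is a minimal R sketch of the classical momentum update applied to our polynomial; the learning rate and the momentum coefficient of 0.9 are assumed values, not taken from the text. The velocity term accumulates past gradients, so successive steps in opposite directions partly cancel instead of bouncing back and forth at full strength:

x = -2
v = 0                     # velocity (accumulated, decayed gradient)
learning_rate = 0.001     # assumed value
momentum = 0.9            # assumed momentum coefficient
for (i in 1:2000) {
  df = 6*x^5 + 20*x^4 + 8*x^3 + 15*x^2 + 40*x + 1   # f'(x)
  v = momentum * v - learning_rate * df             # update the velocity
  x = x + v                                         # step by the velocity, not the raw gradient
}
x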