What is Stochastic Gradient Descent?
There are 3 types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent
We have already discussed batch gradient descent. Batch gradient descent is simply ordinary gradient descent.
If you want to know more about batch gradient descent, see the earlier discussion of it.
In batch gradient descent we use the whole data set in each epoch (iteration) and compute a single update of the coefficients and the intercept. With only one update per epoch, convergence is slow.
In stochastic gradient descent we take a single row and update the coefficients and the intercept, then take another row and update again. With n rows, a single epoch performs n updates.
In stochastic gradient descent the rows are selected at random from the n rows, which is why it is called stochastic.
Because of these frequent updates, you reach the answer sooner, without needing extra epochs.
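To make the difference in update counts concrete, here is a minimal NumPy sketch for a linear model with squared error. The function names batch_epoch and sgd_epoch, the toy data, and the learning rate are assumptions chosen for illustration, not code from this post.

```python
import numpy as np

def batch_epoch(X, y, coef, intercept, lr):
    """One epoch of batch gradient descent: a single update from all rows."""
    error = X @ coef + intercept - y              # residual of every row
    coef = coef - lr * 2 * (X.T @ error) / len(y)
    intercept = intercept - lr * 2 * error.mean()
    return coef, intercept, 1                     # one update in the epoch

def sgd_epoch(X, y, coef, intercept, lr, rng):
    """One epoch of SGD: n updates, each from one randomly chosen row."""
    n = len(y)
    for _ in range(n):
        i = rng.integers(n)                       # pick a row at random
        error = X[i] @ coef + intercept - y[i]
        coef = coef - lr * 2 * error * X[i]
        intercept = intercept - lr * 2 * error
    return coef, intercept, n                     # n updates in the epoch

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # toy data: 100 rows, 3 columns
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0

start_coef, start_intercept = np.zeros(3), 0.0
_, _, batch_updates = batch_epoch(X, y, start_coef, start_intercept, lr=0.01)
_, _, sgd_updates = sgd_epoch(X, y, start_coef, start_intercept, lr=0.01, rng=rng)
print(batch_updates, sgd_updates)                 # 1 100
```

With 100 rows, one epoch gives batch gradient descent a single update, while stochastic gradient descent gets 100.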
Learning Schedules:
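A learning schedule varies the learning rate as training progresses, typically decreasing it from epoch to epoch so that the updates settle down as training goes on.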
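One common choice is an inverse scaling schedule, where the learning rate shrinks as the update counter t grows. The sketch below is a hand-written illustration; the function name inverse_scaling and the default values are assumptions, although scikit-learn's learning_rate="invscaling" option uses the same form, eta0 / t ** power_t.

```python
def inverse_scaling(t, eta0=0.01, power_t=0.25):
    """Learning rate at update step t (t starts at 1)."""
    return eta0 / t ** power_t

# The rate starts at eta0 and decreases as t grows.
rates = [inverse_scaling(t) for t in (1, 10, 100, 1000)]
for t, rate in zip((1, 10, 100, 1000), rates):
    print(t, round(rate, 5))
```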
Advantages:
1. It is faster, as it requires fewer epochs.
2. Its convergence is fast.
3. Your code will not crash (as you don't have to load your whole data set in one iteration)
Disadvantages:
1. It does not give a steady solution: if you run stochastic gradient descent on the same data multiple times, you will get a different result each time, because it selects rows at random.
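This run-to-run variation can be seen with scikit-learn's SGDRegressor, sketched below on made-up data. Note that SGDRegressor shuffles the rows each epoch rather than sampling them one by one; its random_state parameter controls that shuffling, so fixing it makes a run repeatable. The helper fit_coef is a name introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.5 * rng.normal(size=500)

def fit_coef(seed):
    """Fit on the same data; only the shuffling seed changes."""
    model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=seed)
    return model.fit(X, y).coef_

coef_a = fit_coef(0)
coef_b = fit_coef(1)         # different seed: slightly different coefficients
coef_a_again = fit_coef(0)   # same seed: identical coefficients

print(coef_a)
print(coef_b)
print(np.array_equal(coef_a, coef_a_again))
```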
When to Use:
1. When we have big data, that is, many rows and many columns. On big data, stochastic gradient descent will converge faster.
2. When we have a non-convex function, that is, a function for which the straight line connecting some two points on its graph passes below the graph in between.
Such a function can have local minima in addition to the global minimum.
scikit-learn has a built-in class for performing stochastic gradient descent on regression problems: sklearn.linear_model.SGDRegressor.
The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification. Below is the decision boundary of a SGDClassifier trained with the hinge loss, equivalent to a linear SVM.
As with other classifiers, SGD has to be fitted with two arrays: an array X of shape (n_samples, n_features) holding the training samples, and an array y of shape (n_samples,) holding the target values (class labels) for the training samples.
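A minimal example of fitting with these two arrays might look like the following. The tiny data set and the random_state value are made up for illustration.

```python
from sklearn.linear_model import SGDClassifier

X = [[0.0, 0.0], [1.0, 1.0]]   # shape (n_samples, n_features) = (2, 2)
y = [0, 1]                     # shape (n_samples,) = (2,)

clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, tol=1e-3,
                    random_state=0)
clf.fit(X, y)

prediction = clf.predict([[2.0, 2.0]])   # one new sample, one predicted label
print(prediction)
print(clf.coef_.shape, clf.intercept_.shape)   # (1, 2) (1,)
```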
The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:
penalty="l2": L2 norm penalty on coef_.
penalty="l1": L1 norm penalty on coef_.
penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1.
The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero. The Elastic Net [11] solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The parameter l1_ratio controls the convex combination of L1 and L2 penalty.
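The three penalties can be compared on synthetic data, as in the sketch below. The data set, alpha=0.01, and l1_ratio=0.5 are arbitrary choices for illustration; how many coefficients end up at exactly zero depends on the data and on alpha.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 3 carry information about the class.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

settings = {
    "l2": dict(penalty="l2"),
    "l1": dict(penalty="l1"),
    "elasticnet": dict(penalty="elasticnet", l1_ratio=0.5),
}

zero_counts = {}
for name, kwargs in settings.items():
    clf = SGDClassifier(alpha=0.01, max_iter=1000, tol=1e-3, random_state=0,
                        **kwargs)
    clf.fit(X, y)
    # Count coefficients that are exactly zero (sparsity of the solution).
    zero_counts[name] = int(np.sum(clf.coef_ == 0))
    print(name, "zero coefficients:", zero_counts[name])
```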
SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme. For each of the classes, a binary classifier is learned that discriminates between that and all other classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each classifier and choose the class with the highest confidence. The Figure below illustrates the OVA approach on the iris dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced by the three classifiers.
In the case of multi-class classification coef_ is a two-dimensional array of shape (n_classes, n_features) and intercept_ is a one-dimensional array of shape (n_classes,). The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes_). Note that, in principle, since they allow a probability model to be created, loss="log_loss" and loss="modified_huber" are more suitable for one-vs-all classification.
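These shapes, and the "highest confidence wins" rule, can be checked on the iris dataset (3 classes, 4 features). This is a small sketch using the default hinge loss; the scaling step and random_state are choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)

print(clf.classes_)           # [0 1 2]
print(clf.coef_.shape)        # (3, 4): one weight vector per class
print(clf.intercept_.shape)   # (3,)

# One confidence score per class; the predicted class has the highest score.
scores = clf.decision_function(X)
predicted = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(predicted, clf.predict(X)))   # True
```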
SGDClassifier supports both weighted classes and weighted instances, via the constructor parameter class_weight and the fit parameter sample_weight.
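A short sketch of both kinds of weighting on an imbalanced toy problem follows. In current scikit-learn, class_weight is passed when constructing the estimator and sample_weight is passed to fit. The data and the weight of 5.0 for the rare class are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 1.0).astype(int)   # class 1 is the rare class

# Weighted classes: "balanced" reweights classes inversely to their frequency.
clf_classes = SGDClassifier(class_weight="balanced", random_state=0)
clf_classes.fit(X, y)

# Weighted instances: each row gets its own weight at fit time.
weights = np.where(y == 1, 5.0, 1.0)
clf_samples = SGDClassifier(random_state=0)
clf_samples.fit(X, y, sample_weight=weights)

print(clf_classes.coef_, clf_samples.coef_)
```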
The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.
The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:
loss="squared_error": Ordinary least squares,
loss="huber": Huber loss for robust regression,
loss="epsilon_insensitive": linear Support Vector Regression.
Stopping criterion
The classes SGDClassifier and SGDRegressor provide two criteria to stop the algorithm when a given level of convergence is reached:
With early_stopping=True, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score (using the score method) computed on the validation set. The size of the validation set can be changed with the parameter validation_fraction.
With early_stopping=False, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the training data.
In both cases, the criterion is evaluated once per epoch, and the algorithm stops when the criterion does not improve n_iter_no_change times in a row. The improvement is evaluated with absolute tolerance tol, and the algorithm stops in any case after a maximum number of iterations max_iter.
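The stopping parameters fit together as in the sketch below, which uses a synthetic data set. The specific values (validation_fraction=0.2, n_iter_no_change=5, and so on) are picked for illustration. After fitting, the attribute n_iter_ reports how many epochs were actually run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = SGDClassifier(
    early_stopping=True,       # hold out part of the data for validation
    validation_fraction=0.2,   # 20% of the rows form the validation set
    n_iter_no_change=5,        # stop after 5 epochs without improvement
    tol=1e-3,                  # minimum improvement that counts
    max_iter=1000,             # hard upper limit on the number of epochs
    random_state=0,
)
clf.fit(X, y)
print("epochs run:", clf.n_iter_)
```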
Tips on Practical Use
Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be easily done using StandardScaler.
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # apply same transformation to test data
# Or better yet: use a pipeline!
from sklearn.pipeline import make_pipeline
est = make_pipeline(StandardScaler(), SGDClassifier())
est.fit(X_train, y_train)  # fit needs the training labels as well
est.predict(X_test)
Creating an SGDRegressor class from scratch:
def __init__(self, learning_rate=0.01, epochs=100):
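Only the constructor signature is given here. As a rough sketch of how such a class could continue, the following applies the row-by-row updates described earlier to a linear model with squared error. The class name MySGDRegressor, the fit and predict methods, the random_state argument, and the toy data are all assumptions for illustration, not the original code.

```python
import numpy as np

class MySGDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100, random_state=None):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.random_state = random_state   # assumed extra, for repeatable runs
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rng = np.random.default_rng(self.random_state)
        n_samples, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0.0
        for _ in range(self.epochs):
            for _ in range(n_samples):             # n updates per epoch
                i = rng.integers(n_samples)        # pick one row at random
                error = X[i] @ self.coef_ + self.intercept_ - y[i]
                # Gradient of the squared error for this single row.
                self.intercept_ -= self.learning_rate * 2 * error
                self.coef_ -= self.learning_rate * 2 * error * X[i]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_

# Toy check on noise-free data generated from known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 2.0

model = MySGDRegressor(learning_rate=0.01, epochs=100, random_state=0).fit(X, y)
print(model.coef_, model.intercept_)   # close to [3, -1] and 2
```

A constant learning rate is used to match the signature above; a learning schedule could be substituted for self.learning_rate inside the inner loop.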