
Stochastic Gradient Descent

What is Stochastic Gradient Descent?

There are 3 types of Gradient Descent:

1. Batch Gradient Descent

2. Stochastic Gradient Descent

3. Mini-Batch Gradient Descent 

We have already discussed batch gradient descent. Batch Gradient Descent is simply standard gradient descent applied to the full data set.

If you want to know more about batch gradient descent, check out this post:


In batch gradient descent we pass the whole data set through the model in one epoch (iteration) and then make a single update to the coefficients and the intercept, which slows down convergence.

In Stochastic Gradient Descent we take a single row, update the coefficients and the intercept, then take the next row and update again. If we have n rows, then in a single epoch we perform n updates.

In Stochastic Gradient Descent the rows are selected at random from the n available rows, which is where the name "stochastic" comes from.

Because of these frequent updates you reach the solution earlier, without requiring a large number of epochs. A rough sketch of one epoch is shown below.
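Here is a minimal sketch of a single epoch, assuming a squared-error loss and a plain update rule (the function and variable names are just for illustration, not any library's implementation):

import numpy as np

def sgd_epoch(X, y, coef, intercept, learning_rate=0.01):
    # Visit the rows in a random order and update after every single row,
    # so one epoch over n rows produces n separate updates.
    for i in np.random.permutation(X.shape[0]):
        y_hat = np.dot(X[i], coef) + intercept          # prediction from the current row
        error = y[i] - y_hat
        coef = coef + learning_rate * error * X[i]      # update the coefficients ...
        intercept = intercept + learning_rate * error   # ... and the intercept
    return coef, intercept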

Learning Schedules: 


A learning schedule is a strategy for varying the learning rate across epochs, typically decreasing it over time so that updates become smaller as the coefficients approach the minimum.
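For example, a minimal sketch of a decaying schedule (the constants t0 and t1 below are assumed values for illustration, not a standard):

def learning_schedule(t, t0=5, t1=50):
    # t counts the updates performed so far; the learning rate shrinks as t grows.
    return t0 / (t + t1)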

Advantages:

1. It is faster, as it requires fewer epochs.

2. Its convergence is fast.

3. Your code will not crash, as you don't have to load your whole data set into memory in one iteration.

Disadvantages:

1. It does not give a stable solution: if you run stochastic gradient descent on the same data multiple times you will get slightly different results, because the rows are selected at random.

When to Use:

1. When we have big data, i.e. many rows and many columns. On large data sets, Stochastic Gradient Descent converges faster.

2. When we have a non-convex function (a function is non-convex if a straight line connecting two points on its graph can cut through the function).

A non-convex function has both local and global minima, and the noise in the stochastic updates helps the algorithm escape local minima.

Scikit-learn provides in-built classes for performing stochastic gradient descent: sklearn.linear_model.SGDRegressor for regression and sklearn.linear_model.SGDClassifier for classification.

The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification. Below is the decision boundary of a SGDClassifier trained with the hinge loss, equivalent to a linear SVM.



Like other classifiers, SGD has to be fitted with two arrays: an array X of shape (n_samples, n_features) holding the training samples, and an array y of shape (n_samples,) holding the target values (class labels) for the training samples.
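For example, a minimal fit-and-predict sketch on toy data (the two training points below are made up purely for illustration):

from sklearn.linear_model import SGDClassifier

X = [[0., 0.], [1., 1.]]           # X: (n_samples, n_features)
y = [0, 1]                         # y: (n_samples,) class labels
clf = SGDClassifier(loss="hinge", penalty="l2")
clf.fit(X, y)
print(clf.predict([[2., 2.]]))     # predict the label of a new sample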

The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:

penalty="l2": L2 norm penalty on coef_.

penalty="l1": L1 norm penalty on coef_.

penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1.

The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero. The Elastic Net [11] solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The parameter l1_ratio controls the convex combination of L1 and L2 penalty.
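As a sketch, the penalty is simply a constructor argument; the l1_ratio value below is only an illustration, not a recommendation:

from sklearn.linear_model import SGDClassifier

# Elastic Net penalty: (1 - l1_ratio) * L2 + l1_ratio * L1
clf = SGDClassifier(loss="hinge", penalty="elasticnet", l1_ratio=0.15)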


SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme. For each of the K classes, a binary classifier is learned that discriminates between that and all other K-1 classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each classifier and choose the class with the highest confidence. The Figure below illustrates the OVA approach on the iris dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced by the three classifiers.



In the case of multi-class classification coef_ is a two-dimensional array of shape (n_classes, n_features) and intercept_ is a one-dimensional array of shape (n_classes,). The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes_). Note that, in principle, since they allow creating a probability model, loss="log_loss" and loss="modified_huber" are more suitable for one-vs-all classification.

SGDClassifier supports both weighted classes and weighted instances via the fit parameters class_weight and sample_weight.
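A minimal sketch of both options (the weights and the tiny data set below are made-up values purely for illustration):

from sklearn.linear_model import SGDClassifier

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]

clf = SGDClassifier(class_weight={0: 1, 1: 10})   # weight class 1 ten times more than class 0
clf.fit(X, y, sample_weight=[1., 1., 2., 2.])     # per-row weights passed to fit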

The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.

The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:


loss="squared_error": Ordinary least squares,


loss="huber": Huber loss for robust regression,


loss="epsilon_insensitive": linear Support Vector Regression.

Stopping criterion

The classes SGDClassifier and SGDRegressor provide two criteria to stop the algorithm when a given level of convergence is reached:

  • With early_stopping=True, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score (using the score method) computed on the validation set. The size of the validation set can be changed with the parameter validation_fraction.

  • With early_stopping=False, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the training data.

In both cases, the criterion is evaluated once per epoch, and the algorithm stops when the criterion does not improve n_iter_no_change times in a row. The improvement is evaluated with the absolute tolerance tol, and in any case the algorithm stops after a maximum number of iterations max_iter.
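A minimal sketch of the first (validation-based) criterion; the parameter values below are arbitrary illustrations, not recommendations:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    early_stopping=True,        # hold out part of the training data as a validation set
    validation_fraction=0.1,    # fraction of training data used for validation
    n_iter_no_change=5,         # stop after 5 epochs without improvement
    tol=1e-3,                   # minimum improvement that counts as progress
    max_iter=1000,              # hard cap on the number of epochs
)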

Tips on Practical Use

Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be easily done using StandardScaler.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

scaler = StandardScaler()
scaler.fit(X_train)                  # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)    # apply same transformation to test data

# Or better yet: use a pipeline!
from sklearn.pipeline import make_pipeline
est = make_pipeline(StandardScaler(), SGDClassifier())
est.fit(X_train, y_train)
est.predict(X_test)

Creating an SGDRegressor class from scratch:



def __init__(self, learning_rate=0.01, epochs=100):
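A complete minimal sketch of such a class might look like the following; the class name SGDRegressorScratch, the squared-error update rule, and the random row selection are assumptions that fill in the constructor fragment above, not the author's exact code:

import numpy as np

class SGDRegressorScratch:

    def __init__(self, learning_rate=0.01, epochs=100):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        n_rows, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0.0
        for _ in range(self.epochs):
            # One epoch = n_rows updates, each on a randomly chosen row.
            for _ in range(n_rows):
                i = np.random.randint(n_rows)
                y_hat = np.dot(X[i], self.coef_) + self.intercept_
                error = y[i] - y_hat
                # Gradient step for the squared error on a single row.
                self.coef_ += self.learning_rate * error * X[i]
                self.intercept_ += self.learning_rate * error
        return self

    def predict(self, X):
        return np.dot(np.asarray(X, dtype=float), self.coef_) + self.intercept_

Usage would then mirror scikit-learn: create the object, call fit(X_train, y_train), and then predict(X_test).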


