Three types of gradient descent
(This is a restoration of a previous post hosted on Wordpress. Hyperlinks might be missing and formatting might be a bit messy.)
Reference - Notes from this blog: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
The three types are: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
Batch gradient descent
During one epoch, evaluate the error for one sample at a time, but update the model only after the errors for all samples in the training set have been evaluated.
Pros:
- The calculation of prediction errors and the model update are separated, so the algorithm can use parallel-processing-based implementations.

Cons:
- Need to have all of the data in memory.
- The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.

Algo:
model.initialize()
for i in n_epochs:
    training_data.shuffle()
    X, Y = split(training_data)
    error_sum = 0
    for x, y in (X, Y):
        y_pred = model(x)              # predict one sample at a time
        error_sum += get_error(y_pred, y)
    model.update(error_sum)            # single update per epoch
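To make this concrete, here is a minimal NumPy sketch of batch gradient descent for a linear model with a squared-error loss. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values from the referenced post.

```python
# Batch gradient descent: one parameter update per pass over the whole training set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features (synthetic data)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                          # model parameters
lr = 0.1
n_epochs = 50

for epoch in range(n_epochs):
    y_pred = X @ w                       # predictions for every sample in the set
    error = y_pred - y                   # errors accumulated over the full set
    grad = X.T @ error / len(y)          # gradient of the mean of 0.5 * squared error
    w -= lr * grad                       # one update per epoch
```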
Stochastic gradient descent
During one epoch, evaluate the error for one sample at a time, then update the model immediately after each evaluation.
Pros:
- Immediate feedback on model performance and improvement rate.
- Simple to implement.
- Frequent updates, which can mean faster learning.
- The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Cons:
- Frequent updates are computationally expensive.
- Frequent updates give a noisy gradient signal, causing the parameters to jump around.
- Hard to settle on an error minimum.

Algo:
model.initialize()
for i in n_epochs:
    training_data.shuffle()
    X, Y = split(training_data)
    for x, y in (X, Y):
        y_pred = model(x)              # predict one sample
        error = get_error(y_pred, y)
        model.update(error)            # update immediately after each sample
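A minimal NumPy sketch of stochastic gradient descent for the same linear-regression setup follows; again, the data, learning rate, and epoch count are illustrative assumptions.

```python
# Stochastic gradient descent: one parameter update per training sample.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic data, as in the batch example
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.01
n_epochs = 20

for epoch in range(n_epochs):
    order = rng.permutation(len(y))      # shuffle the training data each epoch
    for i in order:
        y_pred = X[i] @ w                # predict a single sample
        error = y_pred - y[i]
        grad = error * X[i]              # gradient of 0.5 * squared error for this sample
        w -= lr * grad                   # update immediately after each sample
```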
Mini-batch gradient descent
During one epoch, split the data into batches (which adds a batch-size parameter). For each batch, evaluate the error for one sample at a time, then update the model after evaluating all samples in that batch. Repeat for each batch, and then for each epoch.
Pros:
- More robust convergence to a local minimum, compared to stochastic gradient descent.
- More frequent updates, and therefore faster learning, compared to batch gradient descent.
- Efficiency: no need to have all of the data in memory.

Cons:
- Requires configuration of an additional mini-batch size parameter.

Algo:
model.initialize()
for i in n_epochs:
    training_data.shuffle()
    batches = training_data.split(batch_size)
    for batch in batches:
        X, Y = split(batch)
        error_sum = 0
        for x, y in (X, Y):
            y_pred = model(x)          # predict one sample
            error_sum += get_error(y_pred, y)
        model.update(error_sum)        # one update per batch
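And a minimal NumPy sketch of mini-batch gradient descent for the same linear-regression setup; the batch size, learning rate, and epoch count are illustrative assumptions.

```python
# Mini-batch gradient descent: one parameter update per batch of samples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic data, as in the earlier examples
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.05
n_epochs = 30
batch_size = 16                          # the extra hyperparameter mini-batch GD adds

for epoch in range(n_epochs):
    order = rng.permutation(len(y))      # shuffle, then walk through the data in batches
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        error = Xb @ w - yb              # errors for this batch only
        grad = Xb.T @ error / len(idx)   # gradient of 0.5 * squared error, averaged over the batch
        w -= lr * grad                   # one update per batch
```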