Sophia: Second-order Clipped Stochastic Optimization
Sophia was introduced by Liu et al. in *Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training*. Sophia is a second-order optimizer that uses a light-weight Hessian estimate as a pre-conditioner, which is intended to handle the Large Language Model (LLM) loss landscape better than AdamW. The Hessian pre-conditioner is more aggressive than AdamW's, penalizing updates in sharp dimensions more strongly, which can lead to a more uniform loss decrease across parameters and faster convergence. Additionally, Sophia applies element-wise clipping to updates, which allows infrequent and stochastic updates to the Hessian estimate, reducing optimizer wall-clock time.
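A rough sketch of the per-step update this describes, in the paper's notation: $m_t$ is the gradient moving average with coefficient $\beta_1$ (`mom`), $h_t$ the Hessian moving average with coefficient $\beta_2$ (`hess_mom`), $g_t$ the gradient, and $\hat{h}_t$ the stochastic Hessian estimate. How $\rho$ (`rho`) enters the clipping is an assumption based on its description below as a maximum update size, not a line-by-line account of this library's code, and decoupled weight decay is omitted:

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \\
h_t &= \beta_2\, h_{t-k} + (1-\beta_2)\, \hat{h}_t \quad \text{(refreshed every } k \text{ optimizer steps, otherwise } h_t = h_{t-1}\text{)} \\
\theta_{t+1} &= \theta_t - \eta \operatorname{clip}\!\left(\frac{m_t}{\max(h_t,\ \varepsilon)},\ \rho\right)
\end{aligned}
$$

Here $\operatorname{clip}(z, \rho)$ clamps each coordinate of $z$ to $[-\rho, \rho]$, so no coordinate moves by more than $\eta\rho$ per step, while sharp (high-curvature) coordinates are damped by their larger entries in $h_t$.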
Sophia will not update the Hessian estimate unless the `SophiaCallback` is added to `fastai.learner.Learner`.
In addition to a fastai native implementation, `Sophia` has a fused ForEach implementation. See the Fused Optimizer documentation for more details.
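Before the reference entries below, here is a minimal usage sketch. Only `sophia`, its documented defaults, and `SophiaCallback` come from this page; the dataset, the `resnet18` model, and the use of `CrossEntropyLossFlat` as the cross-entropy loss are illustrative assumptions.

```python
from fastai.vision.all import *
# `sophia` and `SophiaCallback` are assumed to be imported from this library;
# the exact import path is not part of this page.

path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path)        # small single-target, single-label dataset

learn = Learner(
    dls, resnet18(num_classes=dls.c),           # placeholder model choice
    loss_func=CrossEntropyLossFlat(),           # a cross-entropy loss, as SophiaCallback expects
    opt_func=sophia(),                          # Sophia with the defaults documented below
    cbs=SophiaCallback(hessian_update=10),      # required, or the Hessian estimate never updates
    metrics=accuracy,
)
learn.fit_one_cycle(2, lr_max=1e-3)
```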
Sophia
`Sophia (params:Union[torch.Tensor,Iterable[torch.Tensor],MutableSequence[torch.Tensor],fastcore.foundation.L,fastcore.basics.fastuple], lr:float, mom:float=0.965, hess_mom:float=0.99, rho:float=0.4, eps:float=1e-15, wd:float=0.01, foreach:bool=False)`
A fastai Sophia optimizer with a fused ForEach implementation
|  | Type | Default | Details |
|---|---|---|---|
| params | Listified[Tensor] |  | Model parameters or parameter groups |
| lr | float |  | Default learning rate |
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer |  |  |
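Based on the signature above, the optimizer can also be constructed directly; a small sketch with a placeholder module (the import of `Sophia` itself is assumed). With `foreach=True` the fused `SophiaForEachOptimizer` is returned, otherwise the native `SophiaOptimizer`.

```python
import torch.nn as nn
# `Sophia` is assumed to be importable from this library.

model = nn.Linear(10, 2)                                     # placeholder module

opt = Sophia(model.parameters(), lr=1e-3)                    # native fastai implementation
opt_fe = Sophia(model.parameters(), lr=1e-3, foreach=True)   # fused ForEach implementation
```

Keep in mind that outside a `Learner` with `SophiaCallback`, the Hessian estimate is never refreshed.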
sophia
`sophia (mom:float=0.965, hess_mom:float=0.99, rho:float=0.4, eps:float=1e-15, wd:float=0.01, foreach:bool=False)`
Partial function for the Sophia optimizer with a fused ForEach implementation
|  | Type | Default | Details |
|---|---|---|---|
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer |  |  |
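Note that the partial omits `lr`; fastai's `Learner` supplies the learning rate when it creates the optimizer from the partial. A brief sketch of that call, with a placeholder module and `sophia`'s import assumed:

```python
import torch.nn as nn

model = nn.Linear(10, 2)                           # placeholder module
opt_func = sophia(rho=0.4, wd=0.01, foreach=True)  # no lr here
opt = opt_func(model.parameters(), lr=1e-3)        # fastai's Learner makes an equivalent call, supplying its own lr
```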
SophiaCallback
`SophiaCallback (hessian_update:int=10)`
Modifies the training loop for the Sophia Optimizer. Required for Sophia to run.
|  | Type | Default | Details |
|---|---|---|---|
| hessian_update | int | 10 | Update Sophia's Hessian estimate every `hessian_update` Optimizer steps |
`SophiaCallback` expects the loss function to be a cross-entropy loss, and only supports training with a single target and a single loss function.
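For example, a sketch that refreshes the Hessian estimate less often to reduce optimizer overhead; `dls` and `model` stand in for the earlier setup, and the right refresh frequency is workload-dependent.

```python
# Less frequent Hessian refreshes lower optimizer overhead; the existing
# estimate is reused between refreshes. A cross-entropy loss and a single
# target are still required.
learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                opt_func=sophia(),
                cbs=SophiaCallback(hessian_update=20))
```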
Hyperparameters
Hyperparameter notes from Liu et al. (a short sketch applying them follows the list):

- Sophia hyperparameters should be similar to AdamW's
- $\rho$ (`rho`) should be in [0.01, 0.1]. A larger $\rho$ means more aggressive updates
- Sophia may benefit from slightly higher weight decay and learning rate compared to AdamW
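A sketch applying these notes when porting an AdamW recipe; the specific values are illustrative assumptions, not numbers from the paper, and `dls` and `model` stand in for the earlier setup.

```python
# Illustrative starting point: keep the moving-average coefficients at their
# defaults, pick rho within the suggested [0.01, 0.1] range, and nudge weight
# decay and learning rate slightly above the AdamW baseline.
opt_func = sophia(rho=0.05, wd=0.05)
learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                opt_func=opt_func,
                cbs=SophiaCallback())
learn.fit_one_cycle(5, lr_max=2e-3)  # e.g. a bit above the AdamW lr
```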