Sophia: Second-order Clipped Stochastic Optimization

With fastai native and fused ForEach implementations

Sophia was introduced by Liu et al. in Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. Sophia is a second-order optimizer that uses a lightweight Hessian estimate as a pre-conditioner, which is intended to handle the Large Language Model (LLM) loss landscape better than AdamW. The Hessian pre-conditioner is more aggressive than AdamW's, penalizing updates more strongly in sharp dimensions than in flat ones, which can lead to a more uniform loss decrease across parameters and faster convergence. Additionally, Sophia clips updates element-wise, which allows the Hessian estimate to be updated infrequently and stochastically, reducing optimizer wall-clock time.

Important

Sophia will not update the Hessian estimate unless the SophiaCallback is added to fastai.learner.Learner.

In addition to a fastai native implementation, Sophia has a fused ForEach implementation. See the Fused Optimizer documentation for more details.
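As a rough end-to-end sketch, the snippet below trains an Imagenette classifier with the fused ForEach implementation and the required SophiaCallback. The star import path is an assumption; adjust it to wherever sophia and SophiaCallback are exported in your install.

```python
from fastai.vision.all import *
from fastxtend.vision.all import *  # assumed import path for sophia & SophiaCallback

# Small image-classification example standing in for a real workload
path = untar_data(URLs.IMAGENETTE_160)
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(160))

# SophiaCallback is required: without it the Hessian estimate is never updated
learn = Learner(dls, xresnet18(n_out=dls.c), loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, opt_func=sophia(foreach=True),
                cbs=SophiaCallback())
learn.fit_one_cycle(5, 3e-3)
```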



Sophia

 Sophia (params:Union[torch.Tensor,Iterable[torch.Tensor],MutableSequence[
         torch.Tensor],fastcore.foundation.L,fastcore.basics.fastuple],
         lr:float, mom:float=0.965, hess_mom:float=0.99, rho:float=0.4,
         eps:float=1e-15, wd:float=0.01, foreach:bool=False)

A fastai Sophia optimizer with a fused ForEach implementation

| | Type | Default | Details |
|---|---|---|---|
| params | Listified[Tensor] | | Model parameters or parameter groups |
| lr | float | | Default learning rate |
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer | | |
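Sophia follows fastai's opt_func convention: Learner supplies params and lr, so the class can be passed to Learner directly or wrapped in functools.partial to override defaults. A minimal sketch, assuming dls and model already exist (the hess_mom value is purely illustrative):

```python
from functools import partial

# dls and model as defined elsewhere; Learner fills in `params` and `lr`
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                opt_func=partial(Sophia, hess_mom=0.985, foreach=True),
                cbs=SophiaCallback())
```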


sophia

 sophia (mom:float=0.965, hess_mom:float=0.99, rho:float=0.4,
         eps:float=1e-15, wd:float=0.01, foreach:bool=False)

Partial function for the Sophia optimizer with a fused ForEach implementation

| | Type | Default | Details |
|---|---|---|---|
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer | | |
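The sketch below assumes sophia simply pre-binds these keyword arguments and returns an opt_func, making it a shorter equivalent of the partial(Sophia, ...) pattern above; the values shown are illustrative only.

```python
# dls and model as defined elsewhere
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                opt_func=sophia(hess_mom=0.985, foreach=True),
                cbs=SophiaCallback())
learn.fit_one_cycle(5, 3e-3)
```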


SophiaCallback

 SophiaCallback (hessian_update:int=10)

Modifies the training loop for the Sophia Optimizer. Required for Sophia to run.

| | Type | Default | Details |
|---|---|---|---|
| hessian_update | int | 10 | Update Sophia’s Hessian estimate every hessian_update optimizer steps |

SophiaCallback expects the loss function to be a cross-entropy loss, and only supports training with a single target and a single loss function.
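For instance, the hedged sketch below updates the Hessian estimate every 20 optimizer steps instead of the default 10, trading estimate freshness for less overhead, and uses fastai's CrossEntropyLossFlat to satisfy the cross-entropy requirement (dls and model assumed to exist):

```python
# Less frequent Hessian updates: cheaper, but the pre-conditioner is staler
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                opt_func=sophia(), cbs=SophiaCallback(hessian_update=20))
```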

Hyperparameters

Hyperparameter notes from Liu et al.:

  1. Sophia hyperparameters should be similar to AdamW's.
  2. ρ (rho) should be in [0.01, 0.1]. A larger ρ means more aggressive updates.
  3. Sophia may benefit from slightly higher weight decay and learning rate compared to AdamW.
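As an illustrative, unverified starting point based on these notes: assuming an existing AdamW baseline of lr=1e-3 and wd=1e-2, one might try slightly larger values with Sophia and pick ρ from note 2's range (dls and model assumed to exist):

```python
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                opt_func=sophia(rho=0.05, wd=2e-2),  # rho within note 2's [0.01, 0.1] range
                cbs=SophiaCallback())
learn.fit_one_cycle(5, lr_max=2e-3)  # modestly higher than the AdamW baseline lr
```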