Sophia: Second-order Clipped Stochastic Optimization
Sophia was introduced by Liu et al. in *Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training*. Sophia is a second-order optimizer that uses a light-weight Hessian estimate as a pre-conditioner, which is intended to handle the Large Language Model (LLM) loss landscape better than AdamW. The Hessian pre-conditioner is more aggressive than AdamW's, penalizing updates in sharp dimensions more strongly, which can lead to a more uniform loss decrease across parameters and faster convergence. Additionally, Sophia applies element-wise clipping to updates, which allows infrequent and stochastic updates to the Hessian estimate, reducing optimizer wall-clock time.
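A rough sketch of the per-step update this describes, in the paper's notation: $m_t$ is the gradient moving average with coefficient $\beta_1$ (`mom`), $h_t$ the Hessian moving average with coefficient $\beta_2$ (`hess_mom`), $g_t$ the gradient, and $\hat{h}_t$ the stochastic Hessian estimate. How $\rho$ (`rho`) enters the clipping is an assumption based on its description below as a maximum update size, not a line-by-line account of this library's code, and decoupled weight decay is omitted:

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \\
h_t &= \beta_2\, h_{t-k} + (1-\beta_2)\, \hat{h}_t \quad \text{(refreshed every } k \text{ optimizer steps, otherwise } h_t = h_{t-1}\text{)} \\
\theta_{t+1} &= \theta_t - \eta \operatorname{clip}\!\left(\frac{m_t}{\max(h_t,\ \varepsilon)},\ \rho\right)
\end{aligned}
$$

Here $\operatorname{clip}(z, \rho)$ clamps each coordinate of $z$ to $[-\rho, \rho]$, so no coordinate moves by more than $\eta\rho$ per step, while sharp (high-curvature) coordinates are damped by their larger entries in $h_t$.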
Sophia will not update the Hessian estimate unless the `SophiaCallback` is added to `fastai.learner.Learner`.
In addition to a fastai native implementation, `Sophia` has a fused ForEach implementation. See the Fused Optimizer documentation for more details.
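Before the reference entries below, here is a minimal usage sketch. Only `sophia`, its documented defaults, and `SophiaCallback` come from this page; the dataset, the `resnet18` model, and the use of `CrossEntropyLossFlat` as the cross-entropy loss are illustrative assumptions.

```python
from fastai.vision.all import *
# `sophia` and `SophiaCallback` are assumed to be imported from this library;
# the exact import path is not part of this page.

path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path)        # small single-target, single-label dataset

learn = Learner(
    dls, resnet18(num_classes=dls.c),           # placeholder model choice
    loss_func=CrossEntropyLossFlat(),           # a cross-entropy loss, as SophiaCallback expects
    opt_func=sophia(),                          # Sophia with the defaults documented below
    cbs=SophiaCallback(hessian_update=10),      # required, or the Hessian estimate never updates
    metrics=accuracy,
)
learn.fit_one_cycle(2, lr_max=1e-3)
```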
Sophia
`Sophia (params:Union[torch.Tensor,Iterable[torch.Tensor],MutableSequence[torch.Tensor],fastcore.foundation.L,fastcore.basics.fastuple], lr:float, mom:float=0.965, hess_mom:float=0.99, rho:float=0.4, eps:float=1e-15, wd:float=0.01, foreach:bool=False)`
A fastai Sophia optimizer with a fused ForEach implementation
|  | Type | Default | Details |
|---|---|---|---|
| params | Listified[Tensor] |  | Model parameters or parameter groups |
| lr | float |  | Default learning rate |
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer |  |  |
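Based on the signature above, the optimizer can also be constructed directly; a small sketch with a placeholder module (the import of `Sophia` itself is assumed). With `foreach=True` the fused `SophiaForEachOptimizer` is returned, otherwise the native `SophiaOptimizer`.

```python
import torch.nn as nn
# `Sophia` is assumed to be importable from this library.

model = nn.Linear(10, 2)                                     # placeholder module

opt = Sophia(model.parameters(), lr=1e-3)                    # native fastai implementation
opt_fe = Sophia(model.parameters(), lr=1e-3, foreach=True)   # fused ForEach implementation
```

Keep in mind that outside a `Learner` with `SophiaCallback`, the Hessian estimate is never refreshed.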
sophia
`sophia (mom:float=0.965, hess_mom:float=0.99, rho:float=0.4, eps:float=1e-15, wd:float=0.01, foreach:bool=False)`
Partial function for the Sophia optimizer with a fused ForEach implementation
|  | Type | Default | Details |
|---|---|---|---|
| mom | float | 0.965 | Gradient moving average (β1) coefficient |
| hess_mom | float | 0.99 | Hessian moving average (β2) coefficient |
| rho | float | 0.4 | Maximum update size, set higher for more aggressive updates |
| eps | float | 1e-15 | Added for numerical stability |
| wd | float | 0.01 | Optional weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| Returns | SophiaOptimizer \| SophiaForEachOptimizer |  |  |
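Note that the partial omits `lr`; fastai's `Learner` supplies the learning rate when it creates the optimizer from the partial. A brief sketch of that call, with a placeholder module and `sophia`'s import assumed:

```python
import torch.nn as nn

model = nn.Linear(10, 2)                           # placeholder module
opt_func = sophia(rho=0.4, wd=0.01, foreach=True)  # no lr here
opt = opt_func(model.parameters(), lr=1e-3)        # fastai's Learner makes an equivalent call, supplying its own lr
```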
SophiaCallback
`SophiaCallback (hessian_update:int=10)`
Modifies the training loop for the Sophia Optimizer. Required for Sophia to run.
|  | Type | Default | Details |
|---|---|---|---|
| hessian_update | int | 10 | Update Sophia's Hessian estimate every `hessian_update` Optimizer steps |
`SophiaCallback` expects the loss function to be a cross-entropy loss, and only supports training with a single target and a single loss function.
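For example, a sketch that refreshes the Hessian estimate less often to reduce optimizer overhead; `dls` and `model` stand in for the earlier setup, and the right refresh frequency is workload-dependent.

```python
# Less frequent Hessian refreshes lower optimizer overhead; the existing
# estimate is reused between refreshes. A cross-entropy loss and a single
# target are still required.
learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                opt_func=sophia(),
                cbs=SophiaCallback(hessian_update=20))
```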
Hyperparameters
Hyperparameter notes from Liu et al. (a short sketch applying them follows the list):

- Sophia hyperparameters should be similar to AdamW's
- $\rho$ (`rho`) should be in [0.01, 0.1]. A larger $\rho$ means more aggressive updates
- Sophia may benefit from slightly higher weight decay and learning rate compared to AdamW
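A sketch applying these notes when porting an AdamW recipe; the specific values are illustrative assumptions, not numbers from the paper, and `dls` and `model` stand in for the earlier setup.

```python
# Illustrative starting point: keep the moving-average coefficients at their
# defaults, pick rho within the suggested [0.01, 0.1] range, and nudge weight
# decay and learning rate slightly above the AdamW baseline.
opt_func = sophia(rho=0.05, wd=0.05)
learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                opt_func=opt_func,
                cbs=SophiaCallback())
learn.fit_one_cycle(5, lr_max=2e-3)  # e.g. a bit above the AdamW lr
```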