Lion: EvoLved Sign Momentum Optimizer

With fastai native and fused ForEach implementations

Lion was introduced by Chen et al. in Symbolic Discovery of Optimization Algorithms. Lion only keeps track of the gradient moving average (momentum), which reduces memory usage compared to AdamW. Lion uses two EMA factors: one (beta2) for tracking the momentum and another (beta1) for blending momentum with the current gradient in the update step. With the default hyperparameters, this gives momentum tracking up to ten times longer history while leveraging more of the current gradient for the model update. Unlike most optimizers, Lion applies an update of the same magnitude to every parameter, calculated using the sign operation.
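
As a rough sketch (not fastxtend's exact code), the per-parameter Lion update from Chen et al. looks like the following, where beta1 blends the current gradient into the update and beta2 maintains the slower momentum EMA:

```python
import torch

@torch.no_grad()
def lion_update(p, grad, exp_avg, lr, beta1=0.9, beta2=0.99, wd=0.1):
    "Illustrative per-parameter Lion update following Chen et al."
    p.mul_(1 - lr * wd)                               # decoupled (true) weight decay
    update = exp_avg.mul(beta1).add_(grad, alpha=1 - beta1)
    p.add_(update.sign_(), alpha=-lr)                 # every weight moves by ±lr
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)   # slower EMA tracks momentum
```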

In addition to a fastai native implementation, Lion has fused ForEach and bitsandbytes 8-bit implementations. See the Fused Optimizer and 8-bit Optimizer documentation for more details.


source

Lion

 Lion (params:Union[torch.Tensor,Iterable[torch.Tensor],
       MutableSequence[torch.Tensor],fastcore.foundation.L,
       fastcore.basics.fastuple], lr:float, beta1:float=0.9, beta2:float=0.99,
       wd:float=0.1, foreach:bool=False, eightbit:bool=False, **eightbitargs)

A fastai Lion optimizer with fused ForEach and 8-bit implementations

|  | Type | Default | Details |
|---|---|---|---|
| params | Listified[Tensor] |  | Model parameters or parameter groups |
| lr | float |  | Default learning rate |
| beta1 | float | 0.9 | Update gradient moving average (β1) coefficient |
| beta2 | float | 0.99 | Gradient moving average (β2) coefficient |
| wd | float | 0.1 | True weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| eightbit | bool | False | Use fused 8-bit implementation |
| eightbitargs |  |  |  |
| Returns | Optimizer \| LionForEachOptimizer \| Lion8bitOptimizer |  |  |
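
For example, Lion can be constructed directly on a model's parameters like any fastai optimizer. A minimal sketch; the import path and learning rate below are illustrative assumptions:

```python
import torch
from torch import nn
from fastxtend.optimizer.lion import Lion  # assumed import path

model = nn.Linear(10, 2)
opt = Lion(model.parameters(), lr=1e-4, wd=0.1)

# standard fastai Optimizer usage
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```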

source

lion

 lion (beta1:float=0.9, beta2:float=0.99, wd:float=0.1,
       foreach:bool=False, eightbit:bool=False, **eightbitargs)

Partial function for the Lion optimizer with fused ForEach and 8-bit implementations

|  | Type | Default | Details |
|---|---|---|---|
| beta1 | float | 0.9 | Update gradient moving average (β1) coefficient |
| beta2 | float | 0.99 | Gradient moving average (β2) coefficient |
| wd | float | 0.1 | True weight decay |
| foreach | bool | False | Use fused ForEach implementation |
| eightbit | bool | False | Use fused 8-bit implementation |
| eightbitargs |  |  |  |
| Returns | Optimizer \| LionForEachOptimizer \| Lion8bitOptimizer |  |  |
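
With a fastai Learner, the lion partial is meant to be passed as opt_func. A minimal sketch, assuming fastxtend's all-import and existing dls and model objects:

```python
from fastai.vision.all import *
from fastxtend.vision.all import *  # assumed all-import; exposes fastxtend optimizers

# dls and model are assumed to be defined elsewhere
learn = Learner(dls, model, opt_func=lion(foreach=True))
learn.fit_one_cycle(5, lr_max=1e-4)
```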

Hyperparameters

Hyperparameter notes from Chen et al:

  1. Except for language modeling, beta1 and beta2 are held at 0.9 and 0.99, respectively. When training T5, they set beta1=0.95 and beta2=0.98.
  2. Due to the larger update norm from the sign operation, the Lion learning rate is typically 10X smaller than AdamW's, with 3X smaller sometimes performing better.
  3. Since the effective weight decay is multiplied by the learning rate, weight decay should be increased by the same factor the learning rate is decreased (10X or 3X); see the worked example after this list.
  4. The optimal batch size for Lion is 4096 (vs AdamW’s 256), but Lion still performs well at a batch size of 64 and matches or exceeds AdamW on all tested batch sizes.
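
Following notes 2 and 3, a worked example of carrying AdamW hyperparameters over to Lion (the AdamW values below are purely illustrative):

```python
# Illustrative AdamW baseline hyperparameters
adamw_lr, adamw_wd = 1e-3, 1e-2

# Lion: shrink the learning rate ~10X and raise weight decay by the same
# factor, so the effective decay lr * wd stays roughly constant
lion_lr = adamw_lr / 10   # 1e-4
lion_wd = adamw_wd * 10   # 1e-1

assert abs(adamw_lr * adamw_wd - lion_lr * lion_wd) < 1e-12
```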

Training Speed

Important

The ForEach optimizer has only been tested on PyTorch 1.12+ and is not guaranteed to work on older versions.

As shown in Table 1, fastxtend’s fused ForEach Lion is 13 to 195 percent faster1 than a standard PyTorch implementation. This training speed advantage could increase in a future PyTorch release, as PyTorch doesn’t have a ForEach implementation of sign2, so the fused implementation falls back to a for loop in the middle of the Lion update step.
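
For illustration only (this is not fastxtend's implementation), a ForEach-style Lion step might look like the sketch below; note the per-tensor loop where a ForEach sign operation would otherwise go:

```python
import torch

@torch.no_grad()
def lion_foreach_step(params, grads, exp_avgs, lr, beta1=0.9, beta2=0.99, wd=0.1):
    "Illustrative ForEach-style Lion step over lists of tensors."
    torch._foreach_mul_(params, 1 - lr * wd)             # decoupled (true) weight decay
    updates = torch._foreach_mul(exp_avgs, beta1)        # u = β1*m + (1-β1)*g
    torch._foreach_add_(updates, grads, alpha=1 - beta1)
    for u in updates:                                    # no ForEach sign, so loop
        u.sign_()
    torch._foreach_add_(params, updates, alpha=-lr)      # p = p - lr*sign(u)
    torch._foreach_mul_(exp_avgs, beta2)                 # m = β2*m + (1-β2)*g
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta2)
```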

Table 1: Increase in Lion ForEach opt_step Speed vs Native Optimizer
| Model | Layers | Native Step | ForEach Step | ForEach Speedup |
|---|---|---|---|---|
| XResNet18 | 1 | 23ms | 13ms | 73% |
| XResNet50 | 1 | 50ms | 34ms | 47% |
| XSE-ResNeXt50 | 1 | 66ms | 47ms | 41% |
| XResNet101 | 1 | 76ms | 48ms | 59% |
| ConvNeXt Tiny | 2 | 118ms | 104ms | 13% |
| ConvNeXt Small | 2 | 189ms | 164ms | 16% |
| ViT Patch16 Small | 2 | 57ms | 45ms | 26% |
| DeBERTa Base | 1 | 22ms | 7.5ms | 195% |

Due to its simpler update and tracking only momentum, the native implementation of Lion is both faster than the native implementation of AdamW and uses less memory. However, since the ForEach implementation requires a for loop for the sign operation, Lion ForEach is equal to or slower than AdamW ForEach, although it should still use less memory.

Table 2: AdamW vs Lion Training Speed

(a) Native Implementation

| Model | AdamW Step | Lion Step | Speedup |
|---|---|---|---|
| XResNet18 | 26ms | 23ms | 15% |
| XResNet50 | 54ms | 50ms | 8.2% |
| XSE-ResNeXt50 | 72ms | 66ms | 8.3% |
| XResNet101 | 91ms | 76ms | 19% |
| ConvNeXt Tiny | 125ms | 118ms | 6.1% |
| ConvNeXt Small | 202ms | 189ms | 6.8% |
| ViT Patch16 Small | 63ms | 57ms | 9.4% |
| DeBERTa Base | 26ms | 22ms | 25% |

(b) Fused ForEach Implementation

| Model | AdamW Step | Lion Step | Slowdown |
|---|---|---|---|
| XResNet18 | 13ms | 13ms | 0.9% |
| XResNet50 | 33ms | 34ms | 4.6% |
| XSE-ResNeXt50 | 42ms | 47ms | 12% |
| XResNet101 | 46ms | 48ms | 4.2% |
| ConvNeXt Tiny | 102ms | 104ms | 2.5% |
| ConvNeXt Small | 161ms | 164ms | 1.6% |
| ViT Patch16 Small | 42ms | 45ms | 4.9% |
| DeBERTa Base | 7.4ms | 7.5ms | 1.5% |

Footnotes

  1. Benchmarked on a GeForce 3080 Ti using PyTorch 1.13.1, CUDA 11.7, Mixed Precision, Channels Last (except DeBERTa and ViT), and fastxtend’s Simple Profiler Callback. Results may differ on other models, hardware, and across benchmarking runs. Speedup and slowdown are calculated from the total time spent on the optimization step.↩︎

  2. Numerically equivalent approximations of sign using ForEach operators ended up using more memory and were a wash on training speed.↩︎