Fused Optimizers

Fused fastai optimizers with ForEach, TorchScript, & bitsandbytes 8-bit implementations

fastxtend’s fused optimizers are 21 to 293 percent faster, drop-in replacements for fastai native optimizers.

Like fastai optimizers, fastxtend fused optimizers support both discriminative learning rates across multiple parameter groups and per-parameter weight decay without any extra setup.

While all fastai optimizers have vertically fused TorchScript implementations, only a subset have horizontally fused ForEach¹ implementations. These optimizers (SGD, Adam, RAdam, LAMB, and Ranger) usually outperform their TorchScript counterparts on all but the tiniest models. fastxtend also has ForEach implementations of Adan, Lion, Sophia, and StableAdam.

fastxtend also adds full fastai support for bitsandbytes 8-bit optimizers². 8-bit optimizers can reduce optimizer memory usage up to 75% compared to 32-bit optimizers. A subset of optimizers are supported: SGD, Adam, LARS, LAMB, and Lion.
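The 75% figure follows from storing each optimizer state value in 8 bits instead of 32. The following is a back-of-the-envelope sketch, assuming Adam-style state (two values per parameter) and ignoring any quantization metadata the 8-bit optimizer keeps. The helper name and the parameter count are illustrative only and are not part of fastxtend or bitsandbytes.

```python
def optimizer_state_bytes(n_params: int, states_per_param: int, bits: int) -> int:
    """Approximate bytes needed to hold optimizer state, ignoring metadata."""
    return n_params * states_per_param * bits // 8

n_params = 25_000_000  # illustrative model size, not a benchmarked model

# Adam keeps two state values per parameter (gradient and squared-gradient averages)
fp32 = optimizer_state_bytes(n_params, states_per_param=2, bits=32)
int8 = optimizer_state_bytes(n_params, states_per_param=2, bits=8)

reduction = 1 - int8 / fp32
print(f"32-bit: {fp32 / 2**20:.0f} MiB, 8-bit: {int8 / 2**20:.0f} MiB, saved: {reduction:.0%}")
```

Real savings are likely to be somewhat lower than this upper bound, which is consistent with the "up to" wording above.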

Important: Only Tested on PyTorch 1.12+

ForEach and TorchScript optimizers have only been tested on PyTorch 1.12+ and are not guaranteed to work on older versions.

Note: Documentation for Each Optimizer Type

Documentation for individual optimizers is lightly adapted from the fastai optimizer documentation. Docments and type hints have been upstreamed to fastai.

For implementation details, see the ForEach, TorchScript, or 8-bit documentation.

Fused Performance

As shown in Tables 1 and 2, ForEach optimizers are 21 to 293 percent faster³ in AdamW optimizer step performance relative to fastai implementations across benchmarked models. Complex optimizers without ForEach implementations, such as QHAdam, are up to 137 percent faster using TorchScript implementations.

Table 1: Increase in AdamW `opt_step` Speed vs fastai Native Optimizer
Model	fastai Step	ForEach Step	ForEach Speedup	JIT Step	JIT Speedup
XResNet18	26ms	12ms	109%	20ms	29%
XResNet50	56ms	32ms	74%	46ms	20%
XSE-ResNeXt50	72ms	43ms	68%	61ms	18%
XResNet101	88ms	47ms	84%	68ms	30%
DeBERTa Base	27ms	6.9ms	293%	19ms	46%
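The speedup columns read as the percent increase in step speed. As an assumption about how the figures were derived, this can be sketched as baseline step time divided by fused step time, minus one. Because the step times in the table are rounded, recomputing from them lands a few points away from the reported values (roughly 117% rather than 109% for XResNet18). `speedup_percent` is an illustrative helper, not part of fastxtend.

```python
def speedup_percent(baseline_ms: float, fused_ms: float) -> float:
    """Percent increase in step speed of the fused optimizer over the baseline."""
    return (baseline_ms / fused_ms - 1) * 100

# (fastai step, ForEach step) in milliseconds, copied from Table 1
table_1 = {
    "XResNet18": (26, 12),
    "DeBERTa Base": (27, 6.9),
}

recomputed = {model: round(speedup_percent(*steps)) for model, steps in table_1.items()}
print(recomputed)
```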

This speedup persists with single or multiple parameter groups, although more groups can lead to a small decrease in optimizer step speed, as shown by DeBERTa in Table 2.

Table 2: Increase in AdamW `opt_step` Speed With Multiple Param Groups vs fastai Native Optimizer
Model	Layers	fastai Step	ForEach Step	ForEach Speedup	JIT Step	JIT Speedup
XResNet18	2	25ms	12ms	103%	19ms	30%
XResNet50	2	56ms	32ms	76%	46ms	24%
XSE-ResNeXt50	2	72ms	45ms	60%	61ms	17%
XResNet101	2	87ms	47ms	85%	67ms	29%
ConvNeXt Tiny	2	125ms	102ms	22%	115ms	9.4%
ConvNeXt Small	2	200ms	165ms	21%	181ms	10%
ViT Patch16 Small	2	62ms	38ms	62%	52ms	20%
DeBERTa Base	4	27ms	7.7ms	254%	19ms	47%

Examples

For backwards compatibility, all fastxtend optimizers return a fastai native optimizer by default. To use a fused version, set foreach=True, jit=True, or eightbit=True.

from fastai.vision.all import *
from fastxtend.vision.all import *

# Use ForEach AdamW
opt_func = adam(foreach=True)

# Or use TorchScript AdamW
opt_func = adam(jit=True)

# Or use bitsandbytes' 8-bit AdamW
opt_func = adam(eightbit=True)

Learner(..., opt_func=opt_func)

Or import fused optimizers independent of other fastxtend features.

from fastai.vision.all import *
from fastxtend.optimizer.all import *

Learner(..., opt_func=partial(Adam, foreach=True))

Note

adam(...) is a fastxtend convenience method equivalent to partial(Adam, ...). fastxtend adds lowercase convenience methods for all fastai optimizers.
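To make the equivalence concrete, here is a minimal sketch of how such a convenience method could be built on `functools.partial`. Both `Adam` and `adam` below are hypothetical stand-ins written for this illustration. They are not fastxtend's source, and the stand-in `Adam` only records its arguments instead of building an optimizer.

```python
from functools import partial

def Adam(params, lr, foreach=False, jit=False):
    """Hypothetical stand-in for an optimizer factory; only records its arguments."""
    return {"params": params, "lr": lr, "foreach": foreach, "jit": jit}

def adam(**kwargs):
    """Hypothetical lowercase convenience method: bind keyword arguments now, build later."""
    return partial(Adam, **kwargs)

opt_func = adam(foreach=True)

# A Learner would later call opt_func with the model parameters and learning rate
opt = opt_func([0.0, 1.0], lr=1e-3)

# Writing the partial out by hand yields the same result
same = partial(Adam, foreach=True)([0.0, 1.0], lr=1e-3)
```

Either form defers construction until the parameters and learning rate are known, which is why both can be passed as opt_func.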

SGD Optimizer

Stochastic gradient descent, optionally with momentum.

Optional weight decay of wd is applied, as true weight decay (decay the weights directly) if decouple_wd=True else as L2 regularization (add the decay to the gradients).

8-bit SGD only supports L2 weight decay (decouple_wd=False) and requires momentum (mom>0).
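The difference between the two weight decay modes can be sketched on a single scalar weight. This is a simplified illustration that assumes undampened momentum and a constant gradient. It is not the fused implementation, and `sgd_step` and `run` are hypothetical helpers.

```python
def sgd_step(p, grad, grad_avg, lr, mom=0.0, wd=0.0, decouple_wd=True):
    """One SGD update on a scalar weight, optionally with momentum (illustrative only)."""
    if wd != 0 and decouple_wd:
        p = p * (1 - lr * wd)          # true weight decay: shrink the weight directly
    elif wd != 0:
        grad = grad + wd * p           # L2 regularization: add the decay to the gradient
    grad_avg = mom * grad_avg + grad   # momentum buffer; equals grad when mom=0
    return p - lr * grad_avg, grad_avg

def run(decouple_wd, steps=2):
    """Apply a few steps with a constant gradient and return the final weight."""
    p, grad_avg = 1.0, 0.0
    for _ in range(steps):
        p, grad_avg = sgd_step(p, 0.5, grad_avg, lr=0.1, mom=0.9, wd=0.01,
                               decouple_wd=decouple_wd)
    return p

decoupled = run(decouple_wd=True)
l2 = run(decouple_wd=False)
print(decoupled, l2)
```

With mom=0 the two modes produce the same update in this sketch. They diverge once the L2 term is folded into the momentum buffer, which is why the results differ after the second step.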

SGD

Parameter	Type	Default	Details
params	Listified[Tensor]		Model parameters or parameter groups
lr	float		Default learning rate
mom	float	0.0	Gradient moving average (β1) coefficient
wd	float	0.0	Optional weight decay (true or L2)
decouple_wd	bool	True	Apply true weight decay (SGDW) or L2 regularization (SGD)
foreach	bool	False	Use fused ForEach implementation
jit	bool	False	Use fused TorchScript implementation
eightbit	bool	False	Use fused 8-bit implementation
eightbitargs
Returns	Optimizer | SGDForEachOptimizer | JitOptimizer | SGD8bitOptimizer

sgd

Parameter	Type	Default	Details
mom	float	0.0	Gradient moving average (β1) coefficient
wd	float	0.0	Optional weight decay (true or L2)
decouple_wd	bool	True	Apply true weight decay (SGDW) or L2 regularization (SGD)
foreach	bool	False	Use fused ForEach implementation
jit	bool	False	Use fused TorchScript implementation
eightbit	bool	False	Use fused 8-bit implementation
eightbitargs
Returns	Optimizer | SGDForEachOptimizer | JitOptimizer | SGD8bitOptimizer

RMSProp Optimizer

RMSProp

Parameter	Type	Default	Details
params	Listified[Tensor]		Model parameters or parameter groups
lr	float		Default learning rate
mom	float	0.0	Gradient moving average (β1) coefficient
sqr_mom	float	0.99	Gradient squared moving average (β2) coefficient
eps	float	1e-08	Added for numerical stability
wd	float	0.0	Optional weight decay (true or L2)
decouple_wd	bool	True	Apply true weight decay or L2 regularization. Ignored if `eightbit=True`
jit	bool	False	Use fused TorchScript implementation
eightbit	bool	False	Use fused 8-bit implementation
eightbitargs
Returns	Optimizer | JitOptimizer | RMSProp8bitOptimizer

rmsprop

Parameter	Type	Default	Details
mom	float	0.0	Gradient moving average (β1) coefficient
sqr_mom	float	0.99	Gradient squared moving average (β2) coefficient
eps	float	1e-08	Added for numerical stability
wd	float	0.0	Optional weight decay (true or L2)
decouple_wd	bool	True	Apply true weight decay (RMSPropW) or L2 regularization (RMSProp)
jit	bool	False	Use fused TorchScript implementation
eightbit	bool	False	Use fused 8-bit implementation
eightbitargs
Returns	Optimizer | JitOptimizer | RMSProp8bitOptimizer
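As a rough guide to how sqr_mom and eps enter the update, the following sketches one RMSProp step on a scalar weight. It assumes the standard RMSProp formulation with mom=0 and no weight decay. It is not the fused or 8-bit implementation, and `rmsprop_step` is a hypothetical helper.

```python
import math

def rmsprop_step(p, grad, sqr_avg, lr, sqr_mom=0.99, eps=1e-8):
    """One momentum-free RMSProp update on a scalar weight (illustrative only)."""
    # Dampened moving average of squared gradients, weighted by sqr_mom
    sqr_avg = sqr_mom * sqr_avg + (1 - sqr_mom) * grad * grad
    # eps keeps the denominator away from zero when sqr_avg is tiny
    p = p - lr * grad / (math.sqrt(sqr_avg) + eps)
    return p, sqr_avg

p, sqr_avg = rmsprop_step(p=1.0, grad=0.5, sqr_avg=0.0, lr=0.01)
print(p, sqr_avg)
```

A larger sqr_mom makes the squared-gradient average react more slowly to new gradients, which smooths the effective per-weight step size.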

Footnotes