8-Bit Optimizers
bitsandbytes 8-bit optimizers can reduce optimizer memory usage by up to 75% compared to 32-bit optimizers.
While it is possible to use bitsandbytes optimizers¹ with fastai via fastai.optimizer.OptimWrapper, this doesn’t provide compatibility with all fastai optimizer features. fastxtend adds full fastai compatibility to bitsandbytes 8-bit optimizers, including per-parameter weight decay, automatic weight decay exclusion for normalization and bias terms, and discriminative learning rate support.
While 8-bit optimizer support is defined and documented here, these optimizers are integrated into, and intended to be used via, fastxtend’s fused fastai optimizers for SGD, Adam, LARS, and LAMB, and via fastxtend’s Lion optimizer, as shown below.
To use 8-bit optimizers, install bitsandbytes on a machine with a CUDA device:
pip install bitsandbytes
then import fastxtend optimizers after importing fastai:
from fastxtend.vision.all import *
# or just import fastxtend optimizers
from fastxtend.optimizer.all import *
opt_func = adam(eightbit=True)
Learner(..., opt_func=opt_func)
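fastai features such as discriminative learning rates work the same as with fastai’s native optimizers. A minimal sketch, where dls and resnet50 stand in for your own DataLoaders and architecture:
from fastxtend.vision.all import *

# 8-bit Adam via fastxtend's fused adam, trained with discriminative learning rates:
# earlier parameter groups receive the lower end of the slice, later groups the higher end
learn = vision_learner(dls, resnet50, opt_func=adam(eightbit=True))
learn.fit_one_cycle(5, lr_max=slice(1e-5, 1e-3))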
If training NLP models, you may need to replace the PyTorch embedding layer with a bitsandbytes layer: torch.nn.Embedding(..) -> bnb.nn.Embedding(..).
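A minimal sketch of the swap, assuming bnb.nn.Embedding takes the same constructor arguments as torch.nn.Embedding (the sizes are illustrative):
import bitsandbytes as bnb
import torch.nn as nn

vocab_sz, emb_dim = 30_000, 768            # illustrative sizes
# emb = nn.Embedding(vocab_sz, emb_dim)    # standard PyTorch embedding
emb = bnb.nn.Embedding(vocab_sz, emb_dim)  # bitsandbytes replacement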
Check out the bitsandbytes readme for more details on using 8-bit optimizers.
bitsandbytes calls torch.cuda.synchronize after each optimizer step. This prevents starting the next optimizer step until the current step finishes, which may increase optimizer wallclock time. fastxtend adds sync_each_step=False as an argument to all 8-bit optimizers, disabling the per-step torch.cuda.synchronize. Set sync_each_step=True to match bitsandbytes behavior.
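For example, to restore the per-step synchronization when constructing an 8-bit optimizer directly (a sketch assuming AdamW8bitOptimizer is exported by fastxtend.optimizer.all):
from functools import partial
from fastxtend.optimizer.all import *

# 8-bit AdamW with bitsandbytes' per-step torch.cuda.synchronize re-enabled
opt_func = partial(AdamW8bitOptimizer, sync_each_step=True)
Learner(..., opt_func=opt_func)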
fastai and bitsandbytes Compatibility
EightBitFastaiAdapter
EightBitFastaiAdapter ()
Base for adding fastai optimizer functionality to EightBit Optimizers
EightBitCommon
EightBitCommon ()
Common changes to EightBit Optimizers
EightBit1StateOptimizer
EightBit1StateOptimizer (optimizer_name, params, lr=0.001, mom=0.9, sqr_mom=0.0, eps=1e-08, wd=0.0, optim_bits=8, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, max_unorm=0.0, skip_zeros=False, is_paged=False, sync_each_step=False)
Adds fastai optimizer functionality & compatibility to Optimizer1State
EightBit2StateOptimizer
EightBit2StateOptimizer (optimizer_name, params, lr=0.001, mom=0.9, sqr_mom=0.999, eps=1e-08, wd=0.0, optim_bits=8, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, max_unorm=0.0, skip_zeros=False, is_paged=False, sync_each_step=False)
Adds fastai optimizer functionality & compatibility to Optimizer2State
8-bit Optimizers
SGD8bitOptimizer
SGD8bitOptimizer (params, lr, mom, wd=0, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit SGD optimizer
RMSProp8bitOptimizer
RMSProp8bitOptimizer (params, lr=0.01, sqr_mom=0.99, eps=1e-08, wd=0, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit RMSProp optimizer
AdamW8bitOptimizer
AdamW8bitOptimizer (params, lr=0.001, mom=0.9, sqr_mom=0.99, eps=1e-08, wd=0.01, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, is_paged=False, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit AdamW optimizer
LARS8bitOptimizer
LARS8bitOptimizer (params, lr, mom=0, wd=0, args=None, min_8bit_size=4096, percentile_clipping=100, trust_coeff=0.02, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit LARS optimizer
LAMB8bitOptimizer
LAMB8bitOptimizer (params, lr=0.001, mom=0.9, sqr_mom=0.999, eps=1e-08, wd=0, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=False, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit LAMB optimizer
Lion8bitOptimizer
Lion8bitOptimizer (params, lr=0.0001, beta1=0.9, beta2=0.99, wd=0, args=None, min_8bit_size=4096, percentile_clipping=100, block_wise=True, is_paged=False, sync_each_step=False)
A fastai-compatible bitsandbytes 8-bit Lion optimizer
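These optimizers can also be passed directly to a Learner as the opt_func, with hyperparameters preset via functools.partial. A hedged sketch using the 8-bit Lion optimizer (the hyperparameter values are illustrative):
from functools import partial
from fastxtend.optimizer.all import *

# Preset the 8-bit Lion optimizer's betas and weight decay, then pass it as opt_func
opt_func = partial(Lion8bitOptimizer, beta1=0.9, beta2=0.99, wd=0.1)
Learner(..., opt_func=opt_func)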
Footnotes
Or any PyTorch-compatible optimizer.