

License apache

pytorch-optimizer is a collection of optimizers, learning rate schedulers, and loss functions for PyTorch. The algorithms are re-implemented from the original papers, with speed and memory tweaks and plug-ins. It also includes useful and practical optimization ideas.
Currently, 60 optimizers (+ bitsandbytes), 10 lr schedulers, and 13 loss functions are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are released under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them at work.


$ pip3 install pytorch-optimizer

As of pytorch-optimizer v2.12.0, you can install and import bitsandbytes optimizers. Please check the bitsandbytes requirements before installing them.

$ pip install "pytorch-optimizer[bitsandbytes]"

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# Or use the optimizer loader and pass the name of the optimizer.

from pytorch_optimizer import load_optimizer

optimizer = load_optimizer(optimizer='adamp')(model.parameters())

# If `bitsandbytes` is installed, you can also load its `8-bit` optimizers through `pytorch-optimizer`.

from pytorch_optimizer import load_optimizer

opt = load_optimizer(optimizer='bnb_adamw8bit')
optimizer = opt(model.parameters())

Also, you can load the optimizer via torch.hub.

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters and configs, use the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(

Supported Optimizers

You can check the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

| Optimizer | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | | cite |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | | cite |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | | cite |
| AdamD | Improved bias-correction in Adam | | | cite |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | | cite |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | | cite |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | | cite |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | | cite |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | | cite |
| Ranger21 | a synergistic deep learning optimizer | github | | cite |
| Lamb | Large Batch Optimization for Deep Learning | github | | cite |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | | cite |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | | cite |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | | cite |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | | cite |
| SAM | Sharpness-Aware Minimization | github | | cite |
| ASAM | Adaptive Sharpness-Aware Minimization | github | | cite |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | | cite |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | | cite |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | | cite |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | | cite |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | | cite |
| Lion | Symbolic Discovery of Optimization Algorithms | github | | cite |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | | cite |
| SM3 | Memory-Efficient Adaptive Optimization | github | | cite |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | | cite |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | | cite |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | | cite |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | | cite |
| SGDW | Decoupled Weight Decay Regularization | github | | cite |
| ASGD | Adaptive Gradient Descent without Descent | github | | cite |
| Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 | cite |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | | | cite |
| Fromage | On the distance between two neural networks and the stability of learning | github | | cite |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | | cite |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | | cite |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | | cite |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | | cite |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 | cite |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | | cite |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | | | cite |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | | cite |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | | cite |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | | cite |
| AMSGrad | On the Convergence of Adam and Beyond | | | cite |
| Lookahead | k steps forward, 1 step back | github | | cite |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | | cite |
| GC | Gradient Centralization | github | | cite |
| AGC | Adaptive Gradient Clipping | github | | cite |
| Stable WD | Understanding and Scheduling Weight Decay | github | | cite |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | | | cite |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | | | cite |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | | | cite |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | | cite |
| AdaDelta | An Adaptive Learning Rate Method | | | cite |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | | cite |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | | cite |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | | cite |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | | cite |
| PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | | cite |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | | cite |
| Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | | cite |
| CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | | cite |

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

| LR Scheduler | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | | | cite |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | | | cite |

Supported Loss Function

You can check the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

| Loss Functions | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| Label Smoothing | Rethinking the Inception Architecture for Computer Vision | | | cite |
| Focal | Focal Loss for Dense Object Detection | | | cite |
| Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | | | cite |
| LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | | cite |
| Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | | | cite |
| Bi-Tempered | Robust Bi-Tempered Logistic Loss Based on Bregman Divergences | | | cite |
| Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | | | cite |
| Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | | cite |

Useful Resources

This section collects several optimization ideas that regularize and stabilize training. Most of them are applied in the Ranger21 optimizer.

Most of the figures are taken from the Ranger21 paper.

| Adaptive Gradient Clipping | Gradient Centralization | Softplus Transformation |
| Gradient Normalization | Norm Loss | Positive-Negative Momentum |
| Linear learning rate warmup | Stable weight decay | Explore-exploit learning rate schedule |
| Lookahead | Chebyshev learning rate schedule | (Adaptive) Sharpness-Aware Minimization |
| On the Convergence of Adam and Beyond | Improved bias-correction in Adam | Adaptive Gradient Norm Correction |

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
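
As a rough illustration of the unit-wise rule, the sketch below clips the gradient of a single unit using plain Python lists instead of PyTorch tensors. The function name `clip_unit_gradient` and the defaults `clip=0.01` and `eps=1e-3` are assumptions made for this example; they are not this library's API.

```python
import math


def clip_unit_gradient(weight, grad, clip=0.01, eps=1e-3):
    """Rescale one unit's gradient when ||grad|| / ||weight|| exceeds `clip`."""
    # Floor the weight norm so zero-initialised weights can still receive updates.
    w_norm = max(math.sqrt(sum(w * w for w in weight)), eps)
    g_norm = math.sqrt(sum(g * g for g in grad))
    max_norm = clip * w_norm
    if g_norm <= max_norm:
        return list(grad)  # ratio is within bounds, leave the gradient untouched
    scale = max_norm / g_norm
    return [g * scale for g in grad]


weight = [3.0, 4.0]  # norm 5
grad = [0.6, 0.8]    # norm 1, so the ratio 0.2 is far above clip=0.01
clipped = clip_unit_gradient(weight, grad)
```

The direction of the gradient is preserved; only its norm is capped at `clip` times the weight norm of that unit.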

Gradient Centralization


Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
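
A minimal sketch of the centralization step, written with plain Python lists. It assumes each row of the gradient matrix holds the gradient for one output unit, so the mean is removed row by row; the function name `centralize_gradient` is hypothetical.

```python
def centralize_gradient(grad_matrix):
    """Subtract each row's mean so every row of the gradient has zero mean."""
    centered = []
    for row in grad_matrix:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
centered = centralize_gradient(grad)
```

The centralized gradient would then be handed to the optimizer in place of the raw one.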

Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely small values so that they remain usable.
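
The effect can be sketched in a few lines of plain Python. The sharpness value `beta=50.0` and the helper names below are assumptions chosen for illustration, not values taken from this library.

```python
import math


def softplus(x, beta=50.0):
    """Smooth version of max(x, 0): log(1 + exp(beta * x)) / beta."""
    z = beta * x
    if z > 30.0:
        # Equivalent form for large z that avoids overflow in exp().
        return x + math.log1p(math.exp(-z)) / beta
    return math.log1p(math.exp(z)) / beta


def adam_like_step(m_hat, v_hat, lr=1e-3, beta=50.0):
    """Update size when the denominator sqrt(v_hat) is passed through softplus."""
    return lr * m_hat / softplus(math.sqrt(v_hat), beta)


tiny = softplus(1e-8)   # lifted to roughly log(2) / beta
large = softplus(1.0)   # left almost unchanged
```

A denominator near zero is raised to about `log(2) / beta`, while ordinary denominators pass through nearly unchanged, so the update cannot blow up when the variance estimate is tiny.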

Gradient Normalization

Norm Loss


Positive-Negative Momentum


Linear learning rate warmup


Stable weight decay


Explore-exploit learning rate schedule



Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, which is updated and substituted for the current weights every k lookahead steps (5 by default).
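
The loop below is a sketch of that schedule for a single scalar weight, with plain SGD as the inner optimizer. The interpolation factor `alpha=0.5`, the learning rate, and the function name `lookahead_sgd` are assumptions made for the example.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=20):
    """Plain SGD wrapped with Lookahead on a single scalar weight."""
    slow = w0  # slow weights, touched only every k steps
    fast = w0  # fast weights, updated by the inner optimizer every step
    for step in range(1, steps + 1):
        fast -= lr * grad_fn(fast)         # k steps forward
        if step % k == 0:
            slow += alpha * (fast - slow)  # 1 step back: move slow weights toward fast weights
            fast = slow                    # restart the fast weights from the slow weights
    return slow, fast


# Minimise f(w) = w^2, whose gradient is 2w.
slow, fast = lookahead_sgd(lambda w: 2.0 * w, w0=1.0)
```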

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.
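
As a hedged sketch of the underlying idea, the step sizes can be taken as reciprocals of the Chebyshev nodes on an assumed curvature range [m, M]. The code compares them with the best constant step size on a quadratic. The fractal reordering of the steps, which the paper uses for numerical stability, is not implemented here, and all names and values below are illustrative.

```python
import math


def chebyshev_steps(m, M, T):
    """Return T step sizes: reciprocals of the Chebyshev nodes on [m, M]."""
    mid, half = (M + m) / 2.0, (M - m) / 2.0
    return [1.0 / (mid - half * math.cos((t + 0.5) * math.pi / T)) for t in range(T)]


def residual(lam, steps):
    """Factor by which gradient descent scales the error along curvature lam."""
    out = 1.0
    for eta in steps:
        out *= 1.0 - eta * lam
    return out


m, M, T = 1.0, 10.0, 8
steps = chebyshev_steps(m, M, T)
constant = [2.0 / (M + m)] * T  # best constant step size for curvatures in [m, M]
grid = [m + (M - m) * i / 200 for i in range(201)]
worst_cheb = max(abs(residual(lam, steps)) for lam in grid)
worst_const = max(abs(residual(lam, constant)) for lam in grid)
```

Over the same number of steps, the worst-case error factor with the Chebyshev steps is much smaller than with the constant step, which is the acceleration the schedule is built on.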

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
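
One SAM update can be sketched as two gradient evaluations: one to find a nearby high-loss point, and one at that point to drive the actual update. The code below uses plain Python lists; the neighborhood radius `rho=0.05`, the learning rate, and the toy objective are assumptions for illustration.

```python
import math


def sam_step(grad_fn, w, lr=0.05, rho=0.05):
    """One SAM update for a list of weights."""
    g = grad_fn(w)
    g_norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Ascent step: move to the approximate worst point within radius rho.
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def grad_fn(w):
    """Gradient of the toy objective f(w) = w[0]^2 + 10 * w[1]^2."""
    return [2.0 * w[0], 20.0 * w[1]]


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(grad_fn, w)
```

Each step costs two gradient evaluations, which is the usual price of SAM compared with a plain optimizer step.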

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients.
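
In AMSGrad this memory is a running maximum of the second-moment estimate. The sketch below compares the resulting denominators with Adam's for a scalar gradient sequence; bias correction is left out for brevity, and the function name and the choice `beta2=0.9` in the demo are assumptions.

```python
import math


def amsgrad_denominators(grads, beta2=0.999, eps=1e-8):
    """Return per-step denominators for Adam and AMSGrad on scalar gradients."""
    v = 0.0
    v_max = 0.0
    adam, amsgrad = [], []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v_max = max(v_max, v)  # long-term memory: the estimate is never allowed to shrink
        adam.append(math.sqrt(v) + eps)
        amsgrad.append(math.sqrt(v_max) + eps)
    return adam, amsgrad


# One large gradient followed by many small ones.
adam, amsgrad = amsgrad_denominators([10.0] + [0.1] * 200, beta2=0.9)
```

Adam's denominator forgets the large gradient and decays, which lets the effective step size grow again; the AMSGrad denominator keeps the largest value it has seen.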

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
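
The AdamD proposal is to keep the bias correction of the second moment and drop it for the first moment. The sketch below computes the size of the very first update in both variants for a scalar gradient; the function and flag names are hypothetical and the hyperparameters are the common Adam defaults.

```python
import math


def first_step_size(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                    debias_first_moment=True):
    """Magnitude of the very first Adam update for a scalar gradient."""
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    v_hat = v / (1.0 - beta2)  # second-moment correction is applied in both variants
    m_hat = m / (1.0 - beta1) if debias_first_moment else m
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))


adam_step = first_step_size(0.5)                              # close to lr
adamd_step = first_step_size(0.5, debias_first_moment=False)  # close to lr * (1 - beta1)
```

Without the first-moment correction the earliest updates are scaled down by a factor of about `1 - beta1`, which behaves like a short built-in warmup.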

Adaptive Gradient Norm Correction

AdaNorm corrects the norm of the gradient in each iteration based on the adaptive training history of the gradient norm.
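
One way to read this is sketched below, under stated assumptions: keep an exponential moving average of the gradient norm, and when the current norm falls below that average, scale the gradient up to match it. The decay `gamma=0.95` and the function name `adanorm_correct` are assumptions for this example, and the exact rule in the paper or in this library may differ in detail.

```python
import math


def adanorm_correct(grads, gamma=0.95):
    """Return norm-corrected gradients for a sequence of gradient vectors."""
    ema_norm = 0.0
    corrected = []
    for g in grads:
        g_norm = math.sqrt(sum(x * x for x in g))
        ema_norm = gamma * ema_norm + (1.0 - gamma) * g_norm
        if g_norm > 0.0 and ema_norm > g_norm:
            scale = ema_norm / g_norm  # lift an unusually small gradient to the running norm
            corrected.append([x * scale for x in g])
        else:
            corrected.append(list(g))  # gradients at or above the running norm are kept as-is
    return corrected


# Fifty gradients of norm 5, then one of norm 0.05.
history = [[3.0, 4.0]] * 50 + [[0.03, 0.04]]
corrected = adanorm_correct(history)
```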

Frequently asked questions



Please cite the original authors of the optimization algorithms; you can find the citations in the tables above. If you use this software, please cite it as below, or get the citation from the "cite this repository" button.

    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {},
    version = {2.12.0},
    year = {2021}


Hyeongchan Kim / @kozistr