pytorch-optimizer

pytorch-optimizer is a production-focused optimization toolkit for PyTorch with 100+ optimizers, 10+ learning rate schedulers, and 10+ loss functions behind a consistent API.

Use it when you want fast experimentation with modern training methods without rewriting optimizer boilerplate.

Heavily inspired by jettify/pytorch-optimizer.

Why pytorch-optimizer

Broad optimizer coverage, including many recent research variants.
Consistent loader APIs for optimizers, schedulers, and losses.
Practical features such as foreach, Lookahead, and Gradient Centralization integrations.
Tested and actively maintained codebase.
Works with optional ecosystem integrations like bitsandbytes, q-galore-torch, and torchao.

Installation

Requirements:
- Python >=3.8
- PyTorch >=1.10

pip install pytorch-optimizer

Optional integrations are not installed by default:
- bitsandbytes: https://github.com/TimDettmers/bitsandbytes?tab=readme-ov-file#tldr
- q-galore-torch: https://github.com/VITA-Group/Q-GaLore?tab=readme-ov-file#install-q-galore-optimizer
- torchao: https://github.com/pytorch/ao?tab=readme-ov-file#installation

Quick Start

1) Use an optimizer class directly

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters(), lr=1e-3)

2) Load by name

from pytorch_optimizer import load_optimizer

model = YourModel()
optimizer = load_optimizer('adamp')(model.parameters(), lr=1e-3)

3) Build with `create_optimizer()`

from pytorch_optimizer import create_optimizer

model = YourModel()
optimizer = create_optimizer(
    model,
    optimizer_name='adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)
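
The `use_lookahead` flag refers to the Lookahead technique ("k steps forward, 1 step back" in the table below). The following is a minimal, dependency-free sketch of that idea on a scalar problem. It is not the library's implementation: the function names, the plain gradient-descent inner step, and the defaults `k=5` and `alpha=0.5` are assumptions chosen for illustration.

```python
def sgd_step(w, grad, lr):
    """One plain gradient-descent step on a scalar weight."""
    return w - lr * grad


def lookahead_minimize(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=100):
    """Minimize a scalar function with a Lookahead-style wrapper around SGD.

    The fast weight takes k inner steps; the slow weight then moves a
    fraction alpha toward the fast weight, and the fast weight is reset
    to the slow weight.
    """
    slow = w0
    fast = w0
    for step in range(1, steps + 1):
        fast = sgd_step(fast, grad_fn(fast), lr)  # k steps forward
        if step % k == 0:
            slow = slow + alpha * (fast - slow)   # 1 step back (interpolate)
            fast = slow                           # restart from the slow weight
    return slow


# Illustrative objective (an assumption): f(w) = (w - 3)^2, gradient 2 * (w - 3).
w = lookahead_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With these settings `w` ends up close to the minimizer at 3.0. In real training the inner step would be any base optimizer, and the interpolation would apply to every parameter tensor.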

4) Optional: load via `torch.hub`

import torch

model = YourModel()
opt_cls = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt_cls(model.parameters(), lr=1e-3)

Discover Available Components

Optimizers

from pytorch_optimizer import get_supported_optimizers

all_optimizers = get_supported_optimizers()
adam_family = get_supported_optimizers('adam*')
selected = get_supported_optimizers(['adam*', 'ranger*'])

Learning Rate Schedulers

from pytorch_optimizer import get_supported_lr_schedulers

all_schedulers = get_supported_lr_schedulers()
cosine_like = get_supported_lr_schedulers('cosine*')

Loss Functions

from pytorch_optimizer import get_supported_loss_functions

all_losses = get_supported_loss_functions()
focal_related = get_supported_loss_functions('*focal*')

Supported Optimizers

You can list the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
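
The filters above look like shell-style wildcards. As a rough sketch of how such filtering can work, the snippet below uses the standard library's `fnmatch` on a small hand-written list of names. This is an assumption about the matching behavior, not the library's actual code; `filter_names` and the `names` list are hypothetical.

```python
from fnmatch import fnmatch


def filter_names(names, patterns=None):
    """Return the sorted names that match any shell-style wildcard pattern.

    `patterns` may be None (no filtering), a single pattern, or a list.
    """
    if patterns is None:
        return sorted(names)
    if isinstance(patterns, str):
        patterns = [patterns]
    return sorted(n for n in names if any(fnmatch(n, p) for p in patterns))


# Hypothetical registry: a few names from the outputs above, plus 'lion'
# (assumed key) as an entry that matches neither pattern.
names = ['adamp', 'adamw', 'lion', 'ranger', 'ranger21', 'adamax']

adam_family = filter_names(names, 'adam*')
selected = filter_names(names, ['adam*', 'ranger*'])
```

Accepting either a string or a list mirrors the two call styles shown above, so callers do not need to wrap a single pattern in a list.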

Optimizer	Description	Official Code	Paper(Citation)
AdaBelief	Adapting Step-sizes by the Belief in Observed Gradients	github	paper(cite)
AdaBound	Adaptive Gradient Methods with Dynamic Bound of Learning Rate	github	paper(cite)
AdaHessian	An Adaptive Second Order Optimizer for Machine Learning	github	paper(cite)
AdamD	Improved bias-correction in Adam		paper(cite)
DualAdam	Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers	github	paper(cite)
AdamP	Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights	github	paper(cite)
diffGrad	An Optimization Method for Convolutional Neural Networks	github	paper(cite)
MADGRAD	A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization	github	paper(cite)
RAdam	On the Variance of the Adaptive Learning Rate and Beyond	github	paper(cite)
Ranger	a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer	github	paper(cite)
Ranger21	a synergistic deep learning optimizer	github	paper(cite)
Lamb	Large Batch Optimization for Deep Learning	github	paper(cite)
Shampoo	Preconditioned Stochastic Tensor Optimization	github	paper(cite)
Nero	Learning by Turning: Neural Architecture Aware Optimisation	github	paper(cite)
Adan	Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models	github	paper(cite)
Adai	Disentangling the Effects of Adaptive Learning Rate and Momentum	github	paper(cite)
SAM	Sharpness-Aware Minimization	github	paper(cite)
ASAM	Adaptive Sharpness-Aware Minimization	github	paper(cite)
GSAM	Surrogate Gap Guided Sharpness-Aware Minimization	github	paper(cite)
D-Adaptation	Learning-Rate-Free Learning by D-Adaptation	github	paper(cite)
AdaFactor	Adaptive Learning Rates with Sublinear Memory Cost	github	paper(cite)
Apollo	An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization	github	paper(cite)
NovoGrad	Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks	github	paper(cite)
Lion	Symbolic Discovery of Optimization Algorithms	github	paper(cite)
Ali-G	Adaptive Learning Rates for Interpolation with Gradients	github	paper(cite)
SM3	Memory-Efficient Adaptive Optimization	github	paper(cite)
AdaNorm	Adaptive Gradient Norm Correction based Optimizer for CNNs	github	paper(cite)
RotoGrad	Gradient Homogenization in Multitask Learning	github	paper(cite)
A2Grad	Optimal Adaptive and Accelerated Stochastic Gradient Descent	github	paper(cite)
AccSGD	Accelerating Stochastic Gradient Descent For Least Squares Regression	github	paper(cite)
SGDW	Decoupled Weight Decay Regularization	github	paper(cite)
ASGD	Adaptive Gradient Descent without Descent	github	paper(cite)
Yogi	Adaptive Methods for Nonconvex Optimization		paper(cite)
SWATS	Improving Generalization Performance by Switching from Adam to SGD		paper(cite)
Fromage	On the distance between two neural networks and the stability of learning	github	paper(cite)
MSVAG	Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients	github	paper(cite)
AdaMod	An Adaptive and Momental Bound Method for Stochastic Learning	github	paper(cite)
AggMo	Aggregated Momentum: Stability Through Passive Damping	github	paper(cite)
QHAdam	Quasi-hyperbolic momentum and Adam for deep learning	github	paper(cite)
PID	A PID Controller Approach for Stochastic Optimization of Deep Networks	github	paper(cite)
Gravity	a Kinematic Approach on Optimization in Deep Learning	github	paper(cite)
AdaSmooth	An Adaptive Learning Rate Method based on Effective Ratio		paper(cite)
SRMM	Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates	github	paper(cite)
AvaGrad	Domain-independent Dominance of Adaptive Methods	github	paper(cite)
PCGrad	Gradient Surgery for Multi-Task Learning	github	paper(cite)
AMSGrad	On the Convergence of Adam and Beyond		paper(cite)
Lookahead	k steps forward, 1 step back	github	paper(cite)
PNM	Manipulating Stochastic Gradient Noise to Improve Generalization	github	paper(cite)
GC	Gradient Centralization	github	paper(cite)
AGC	Adaptive Gradient Clipping	github	paper(cite)
Stable WD	Understanding and Scheduling Weight Decay	github	paper(cite)
Softplus T	Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM		paper(cite)
Un-tuned w/u	On the adequacy of untuned warmup for adaptive optimization		paper(cite)
Norm Loss	An efficient yet effective regularization method for deep neural networks		paper(cite)
AdaShift	Decorrelation and Convergence of Adaptive Learning Rate Methods	github	paper(cite)
AdaDelta	An Adaptive Learning Rate Method		paper(cite)
Amos	An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale	github	paper(cite)
SignSGD	Compressed Optimisation for Non-Convex Problems	github	paper(cite)
Sophia	A Scalable Stochastic Second-order Optimizer for Language Model Pre-training	github	paper(cite)
Prodigy	An Expeditiously Adaptive Parameter-Free Learner	github	paper(cite)
PAdam	Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks	github	paper(cite)
LOMO	Full Parameter Fine-tuning for Large Language Models with Limited Resources	github	paper(cite)
AdaLOMO	Low-memory Optimization with Adaptive Learning Rate	github	paper(cite)
LoRARite	LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization	github	paper(cite)
Tiger	A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious	github	cite
CAME	Confidence-guided Adaptive Memory Efficient Optimization	github	paper(cite)
WSAM	Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term	github	paper(cite)
Aida	A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range	github	paper(cite)
GaLore	Memory-Efficient LLM Training by Gradient Low-Rank Projection	github	paper(cite)
Adalite	Adalite optimizer	github	paper(cite)
bSAM	SAM as an Optimal Relaxation of Bayes	github	paper(cite)
Schedule-Free	Schedule-Free Optimizers	github	paper(cite)
FAdam	Adam is a natural gradient optimizer using diagonal empirical Fisher information	github	paper(cite)
Grokfast	Accelerated Grokking by Amplifying Slow Gradients	github	paper(cite)
Kate	Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad	github	paper(cite)
FlashAdamW	FlashOptim: Optimizers for Memory-Efficient Training	github	paper(cite)
StableAdamW	Stable and low-precision training for large-scale vision-language models		paper(cite)
AdamMini	Use Fewer Learning Rates To Gain More	github	paper(cite)
TRAC	Adaptive Parameter-free Optimization	github	paper(cite)
AdamG	Towards Stability of Parameter-free Optimization		paper(cite)
AdEMAMix	Better, Faster, Older	github	paper(cite)
SOAP	Improving and Stabilizing Shampoo using Adam	github	paper(cite)
ADOPT	Modified Adam Can Converge with Any β2 with the Optimal Rate	github	paper(cite)
FTRL	Follow The Regularized Leader		paper
Cautious	Improving Training with One Line of Code	github	paper(cite)
DeMo	Decoupled Momentum Optimization	github	paper(cite)
MicroAdam	Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence	github	paper(cite)
Muon	MomentUm Orthogonalized by Newton-schulz	github	paper(cite)
LaProp	Separating Momentum and Adaptivity in Adam	github	paper(cite)
APOLLO	SGD-like Memory, AdamW-level Performance	github	paper(cite)
MARS	Unleashing the Power of Variance Reduction for Training Large Models	github	paper(cite)
SGDSaI	No More Adam: Learning Rate Scaling at Initialization is All You Need	github	paper(cite)
Grams	Gradient Descent with Adaptive Momentum Scaling		paper(cite)
OrthoGrad	Grokking at the Edge of Numerical Stability	github	paper(cite)
Adam-ATAN2	Scaling Exponents Across Parameterizations and Optimizers		paper(cite)
SPAM	Spike-Aware Adam with Momentum Reset for Stable LLM Training	github	paper(cite)
TAM	Torque-Aware Momentum		paper(cite)
FOCUS	First Order Concentrated Updating Scheme	github	paper(cite)
PSGD	Preconditioned Stochastic Gradient Descent	github	paper(cite)
EXAdam	The Power of Adaptive Cross-Moments	github	paper(cite)
GCSAM	Gradient Centralized Sharpness Aware Minimization	github	paper(cite)
LookSAM	Towards Efficient and Scalable Sharpness-Aware Minimization	github	paper(cite)
SCION	Training Deep Learning Models with Norm-Constrained LMOs	github	paper(cite)
COSMOS	SOAP with Muon	github
StableSPAM	How to Train in 4-Bit More Stably than 16-Bit Adam	github	paper
AdaGC	Improving Training Stability for Large Language Model Pretraining		paper(cite)
Simplified-Ademamix	Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants	github	paper(cite)
Fira	Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?	github	paper(cite)
RACS & Alice	Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension		paper(cite)
VSGD	Variational Stochastic Gradient Descent for Deep Neural Networks	github	paper(cite)
SNSM	Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees	github	paper(cite)
AdamC	Why Gradients Rapidly Increase Near the End of Training		paper(cite)
AdaMuon	Adaptive Muon Optimizer		paper(cite)
SPlus	A Stable Whitening Optimizer for Efficient Neural Network Training	github	paper(cite)
EmoNavi	An emotion-driven optimizer that feels loss and navigates accordingly	github
Refined Schedule-Free	Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training		paper(cite)
FriendlySAM	Friendly Sharpness-Aware Minimization	github	paper(cite)
AdaGO	AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates		paper(cite)
Conda	Column-Normalized Adam for Training Large Language Models Faster	github	paper(cite)
BCOS	Stochastic Approximation with Block Coordinate Optimal Stepsizes	github	paper(cite)
Cautious WD	Cautious Weight Decay		paper(cite)
Ano	Faster is Better in Noisy Landscape	github	paper(cite)
Spectral Sphere	Controlled LLM Training on Spectral Sphere	github	paper(cite)
ROSE	Stateless optimization through range-normalized gradient updates	github	paper(cite)
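
Some entries in the table are gradient transforms rather than standalone optimizers. As an illustration, Gradient Centralization (the GC row) removes the mean from each weight vector's gradient before the update. The sketch below shows that operation on a 2-D gradient stored as nested lists. It assumes the common PyTorch layout in which each row belongs to one output unit; `centralize_gradient` is a hypothetical helper, not the library's function.

```python
def centralize_gradient(grad):
    """Subtract each row's mean from a 2-D gradient given as a list of rows.

    Assumption: each row holds the gradient of one output unit's weights,
    so the mean is taken over every dimension except the first.
    """
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
centered = centralize_gradient(grad)
# Every row of `centered` now sums to zero.
```

A tensor version would compute the mean over all dimensions except the first and typically skips 1-D parameters such as biases.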

Supported LR Schedulers

You can list the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']

LR Scheduler	Description	Official Code	Paper(Citation)
Explore-Exploit	Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule		paper(cite)
Chebyshev	Acceleration via Fractal Learning Rate Schedules		paper(cite)
REX	Revisiting Budgeted Training with an Improved Schedule	github	paper(cite)
WSD	Warmup-Stable-Decay learning rate scheduler	github	paper(cite)
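
To make the WSD (Warmup-Stable-Decay) row concrete, here is a small sketch of the schedule's three phases as a pure function of the step index. The linear warmup, the linear decay, and all parameter names are assumptions for illustration; the library's scheduler may use a different decay curve or signature.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step for a warmup-stable-decay shape.

    Assumes decay_steps > 0. After the decay phase the rate stays at min_lr.
    """
    if step < warmup_steps:
        # Phase 1: linear warmup that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Phase 2: hold the peak learning rate.
        return max_lr
    # Phase 3: linear decay from max_lr down to min_lr, then hold.
    progress = min(1.0, (step - warmup_steps - stable_steps + 1) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


schedule = [
    wsd_lr(s, max_lr=1.0, warmup_steps=2, stable_steps=3, decay_steps=4)
    for s in range(10)
]
```

Keeping the schedule a pure function of `step` makes it easy to plot or unit-test before wiring it into a training loop.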

Supported Loss Functions

You can list the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

Loss Functions	Description	Official Code	Paper(Citation)
Label Smoothing	Rethinking the Inception Architecture for Computer Vision		paper(cite)
Focal	Focal Loss for Dense Object Detection		paper(cite)
Focal Cosine	Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble		paper(cite)
LDAM	Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss	github	paper(cite)
Jaccard (IOU)	IoU Loss for 2D/3D Object Detection		paper(cite)
Bi-Tempered	Robust Bi-Tempered Logistic Loss Based on Bregman Divergences		paper(cite)
Tversky	Tversky loss function for image segmentation using 3D fully convolutional deep networks		paper(cite)
Lovasz Hinge	A tractable surrogate for the optimization of the intersection-over-union measure in neural networks	github	paper(cite)
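
As a reference point for the Focal row, the sketch below evaluates the alpha-balanced focal loss from "Focal Loss for Dense Object Detection" for a single binary prediction, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). It is a scalar illustration only: `binary_focal_loss` is a hypothetical helper, and the library's focal loss classes may take logits, batch inputs, or different defaults.

```python
import math


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary prediction.

    p: predicted probability of the positive class, strictly between 0 and 1.
    target: 1 for a positive example, 0 for a negative one.
    """
    p_t = p if target == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.1, 1)  # confident, wrong prediction
```

The `(1 - p_t) ** gamma` factor shrinks the loss of well-classified examples, so `hard` is much larger than `easy`. Setting `gamma=0` recovers a weighted cross-entropy.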

Documentation

Stable docs: https://pytorch-optimizers.readthedocs.io/en/stable/
Latest docs: https://pytorch-optimizers.readthedocs.io/en/latest/
Optimizer API reference: https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/
LR scheduler API reference: https://pytorch-optimizers.readthedocs.io/en/latest/lr_scheduler/
Loss API reference: https://pytorch-optimizers.readthedocs.io/en/latest/loss/
FAQ: https://pytorch-optimizers.readthedocs.io/en/latest/qa/
Visualization examples: https://pytorch-optimizers.readthedocs.io/en/latest/visualization/

License Notes

Most implementations follow the MIT- or Apache 2.0-compatible licenses of their original sources. Some algorithms (for example, Fromage and Nero) are released under CC BY-NC-SA 4.0, which prohibits commercial use. Verify the license of each optimizer before production or commercial use.

Contributing and Community

Contributing guide: CONTRIBUTING
Changelog: CHANGELOG

Citation

Please cite original optimizer authors when you use specific algorithms. If you use this repository, you can use the citation metadata in CITATION or GitHub's "Cite this repository".

@software{Kim_pytorch_optimizer_optimizer_2021,
  author = {Kim, Hyeongchan},
  title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
  url = {https://github.com/kozistr/pytorch_optimizer},
  year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr
