PyTorch Optimizer Foreach Benchmark

Environment

- Device: cuda
- GPU: NVIDIA GeForce GTX 1060 6GB
- CUDA Version: 12.8
- PyTorch Version: 2.8.0+cu128

MLP Model (Batch size: 64)

Parameters: 29,375,488

| Optimizer | Foreach | Avg Time | Std | Peak Mem | Loss |
|---|---|---|---|---|---|
| AdaFactor | No | 35.421 ms | 8.907 ms | 329.58 MB | 0.000321 |
| AdaFactor | Yes | 27.109 ms | 0.247 ms | 577.85 MB | 0.003894 |
| GrokFastAdamW | No | 30.360 ms | 0.110 ms | 609.55 MB | 0.000673 |
| GrokFastAdamW | Yes | 27.240 ms | 0.159 ms | 801.66 MB | 0.000565 |
| Amos | No | 22.081 ms | 0.239 ms | 273.39 MB | 0.419085 |
| Amos | Yes | 20.077 ms | 0.217 ms | 465.53 MB | 0.369267 |
| Lion | No | 20.296 ms | 0.108 ms | 369.43 MB | 0.006918 |
| Lion | Yes | 17.292 ms | 0.096 ms | 465.48 MB | 0.006869 |
| Tiger | No | 14.876 ms | 0.092 ms | 369.43 MB | 0.629746 |
| Tiger | Yes | 13.438 ms | 0.099 ms | 465.48 MB | 0.556300 |
| Adan | No | 40.456 ms | 0.363 ms | 705.61 MB | 0.447540 |
| Adan | Yes | 40.706 ms | 6.065 ms | 1025.78 MB | 0.438211 |
| ADOPT | No | 21.104 ms | 0.085 ms | 497.49 MB | 0.005070 |
| ADOPT | Yes | 22.725 ms | 0.160 ms | 689.60 MB | 0.001950 |
| AdaBelief | No | 28.131 ms | 0.109 ms | 497.49 MB | 0.000009 |
| AdaBelief | Yes | 21.932 ms | 0.284 ms | 689.60 MB | 0.000006 |
| StableAdamW | No | 30.821 ms | 0.787 ms | 497.48 MB | 0.121176 |
| StableAdamW | Yes | 28.764 ms | 0.388 ms | 577.54 MB | 0.000000 |
| Lamb | No | 31.722 ms | 1.081 ms | 497.51 MB | 0.692506 |
| Lamb | Yes | 28.144 ms | 0.428 ms | 577.58 MB | 0.700553 |
| LARS | No | 20.467 ms | 0.231 ms | 353.93 MB | 0.977978 |
| LARS | Yes | 18.902 ms | 0.094 ms | 353.93 MB | 0.978091 |
| SignSGD | No | 14.069 ms | 0.112 ms | 369.43 MB | 0.679428 |
| SignSGD | Yes | 12.682 ms | 0.123 ms | 465.48 MB | 0.661502 |
| SGDW | No | 13.339 ms | 0.148 ms | 353.93 MB | 0.994222 |
| SGDW | Yes | 10.332 ms | 0.119 ms | 353.93 MB | 0.996546 |

Foreach vs Regular Summary (CUDA)

| Optimizer | Speedup | Time (foreach) | Time (regular) | Memory Diff | Mem Diff % |
|---|---|---|---|---|---|
| AdaFactor | 1.31x | 27.109 ms | 35.421 ms | +248.27 MB | +75.3% |
| GrokFastAdamW | 1.11x | 27.240 ms | 30.360 ms | +192.11 MB | +31.5% |
| Amos | 1.10x | 20.077 ms | 22.081 ms | +192.15 MB | +70.3% |
| Lion | 1.17x | 17.292 ms | 20.296 ms | +96.05 MB | +26.0% |
| Tiger | 1.11x | 13.438 ms | 14.876 ms | +96.06 MB | +26.0% |
| Adan | 1.01x slower | 40.706 ms | 40.456 ms | +320.17 MB | +45.4% |
| ADOPT | 1.08x slower | 22.725 ms | 21.104 ms | +192.11 MB | +38.6% |
| AdaBelief | 1.28x | 21.932 ms | 28.131 ms | +192.11 MB | +38.6% |
| StableAdamW | 1.07x | 28.764 ms | 30.821 ms | +80.06 MB | +16.1% |
| Lamb | 1.13x | 28.144 ms | 31.722 ms | +80.07 MB | +16.1% |
| LARS | 1.08x | 18.902 ms | 20.467 ms | +0.00 MB | +0.0% |
| SignSGD | 1.11x | 12.682 ms | 14.069 ms | +96.06 MB | +26.0% |
| SGDW | 1.29x | 10.332 ms | 13.339 ms | +0.00 MB | +0.0% |

Average speedup (foreach vs regular): 1.13x
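
The summary columns can be recomputed from the per-optimizer results. The sketch below is not the benchmark's own code. It assumes that speedup is regular time divided by foreach time, that "slower" entries show the inverse ratio, that memory diff is foreach peak minus regular peak, and that the average is an arithmetic mean of the per-optimizer ratios. The helper names (`speedup`, `format_speedup`, `memory_diff`, `average_speedup`) are hypothetical. Because the inputs are the rounded values from the table, a recomputed memory diff can differ from the reported one by 0.01 MB (for example Amos, Tiger, and SignSGD).

```python
from statistics import mean

# (optimizer, regular_ms, foreach_ms, regular_mb, foreach_mb)
# Values copied from the rounded figures in the results table.
RESULTS = [
    ("AdaFactor", 35.421, 27.109, 329.58, 577.85),
    ("GrokFastAdamW", 30.360, 27.240, 609.55, 801.66),
    ("Amos", 22.081, 20.077, 273.39, 465.53),
    ("Lion", 20.296, 17.292, 369.43, 465.48),
    ("Tiger", 14.876, 13.438, 369.43, 465.48),
    ("Adan", 40.456, 40.706, 705.61, 1025.78),
    ("ADOPT", 21.104, 22.725, 497.49, 689.60),
    ("AdaBelief", 28.131, 21.932, 497.49, 689.60),
    ("StableAdamW", 30.821, 28.764, 497.48, 577.54),
    ("Lamb", 31.722, 28.144, 497.51, 577.58),
    ("LARS", 20.467, 18.902, 353.93, 353.93),
    ("SignSGD", 14.069, 12.682, 369.43, 465.48),
    ("SGDW", 13.339, 10.332, 353.93, 353.93),
]


def speedup(regular_ms, foreach_ms):
    """Ratio of regular to foreach step time; above 1 means foreach is faster."""
    return regular_ms / foreach_ms


def format_speedup(ratio):
    """Render a ratio the way the summary table does (assumed convention)."""
    if ratio >= 1.0:
        return f"{ratio:.2f}x"
    return f"{1.0 / ratio:.2f}x slower"


def memory_diff(regular_mb, foreach_mb):
    """Extra peak memory used by foreach, in MB and as a percentage of regular."""
    diff = foreach_mb - regular_mb
    return diff, 100.0 * diff / regular_mb


def average_speedup(rows):
    """Arithmetic mean of per-optimizer ratios (assumed averaging method)."""
    return mean(speedup(reg, fe) for _, reg, fe, _, _ in rows)


if __name__ == "__main__":
    for name, reg_ms, fe_ms, reg_mb, fe_mb in RESULTS:
        diff, pct = memory_diff(reg_mb, fe_mb)
        print(f"{name:<14} {format_speedup(speedup(reg_ms, fe_ms)):<12} "
              f"{diff:+.2f} MB  {pct:+.1f}%")
    print(f"Average speedup: {average_speedup(RESULTS):.2f}x")
```

Under these assumptions, the two optimizers reported as slower (Adan and ADOPT) enter the average as ratios below 1 rather than being excluded.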