More than 10 optimizers (e.g. AdaFactor, StableAdamW, Lion, AdaBelief, Amos) now support foreach.
In most cases, foreach improves training speed by 1.1x to 1.5x, with a moderate increase in memory usage.
Like official PyTorch optimizers, the default value of foreach is None. When foreach=None, CUDA paths prefer the foreach implementation over the for-loop implementation.
If you need the previous for-loop behavior, set foreach=False explicitly.
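The default resolution described above can be sketched roughly as follows. This is only an illustration, not the library's actual code: the helper name resolve_foreach and its device_types argument are hypothetical, and requiring every parameter to be on CUDA (rather than handling mixed devices some other way) is an assumption.

```python
from typing import Optional, Sequence


def resolve_foreach(foreach: Optional[bool], device_types: Sequence[str]) -> bool:
    """Hypothetical helper: decide whether to take the foreach path.

    `device_types` holds one device-type string (e.g. "cuda", "cpu")
    per parameter tensor.
    """
    if foreach is not None:
        # An explicit True/False from the user always wins.
        return foreach
    # foreach=None: prefer the foreach path only when there is at least one
    # parameter and all of them are on CUDA (assumption for this sketch).
    return len(device_types) > 0 and all(d == "cuda" for d in device_types)


if __name__ == "__main__":
    print(resolve_foreach(None, ["cuda", "cuda"]))   # default on CUDA
    print(resolve_foreach(None, ["cpu"]))            # default on CPU
    print(resolve_foreach(False, ["cuda", "cuda"]))  # explicit opt-out
```

In this sketch, passing foreach=False is the only way to get the for-loop path for CUDA parameters, which matches the opt-out described above.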
Update the Emo-series optimizers. (#472, #478)
Update EmoNavi, EmoFact, and EmoLynx.
Begin deprecating EmoNeco and EmoZeal.