V3.3.1

Change Log

Support the Cautious variant for the AdaShift optimizer. (#310)
Save the state of the Lookahead optimizer too. (#310)
Implement the APOLLO optimizer (paper: "SGD-like Memory, AdamW-level Performance"). (#311, #312)
Rename the Apollo optimizer (An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization) to ApolloDQN so that its name does not clash with the new APOLLO optimizer. (#312)
Implement the MARS optimizer (paper: "Unleashing the Power of Variance Reduction for Training Large Models"). (#313, #314)
Support the Cautious variant for the MARS optimizer. (#314)
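
For readers unfamiliar with the Cautious variant mentioned above, here is a minimal sketch of the masking idea as it is commonly described: components of the proposed update whose sign disagrees with the current gradient are zeroed, and the rest are rescaled. The function name `cautious_update` and the `eps` floor are hypothetical and used only for illustration; this is plain Python, not this library's implementation or API.

```python
def cautious_update(update, grad, eps=1e-3):
    """Illustrative only: mask update components that disagree in sign with the gradient."""
    # 1.0 where update and gradient point the same way, 0.0 otherwise.
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    # Rescale by the inverse of the kept fraction (floored at eps, an assumed value)
    # so that masking does not shrink the overall step too much.
    scale = 1.0 / max(sum(mask) / len(mask), eps)
    return [u * m * scale for u, m in zip(update, mask)]


# Toy values: the second component has update < 0 but grad > 0, so it is masked out.
update = [0.5, -0.2, 0.1, -0.4]
grad = [1.0, 0.3, 0.2, -0.1]
masked = cautious_update(update, grad)
```

In this toy case three of four components survive, so each surviving component is scaled by 1 / 0.75.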

Fix bias_correction in AdamG optimizer. (#305, #308)
Fix a potential bug when loading the state of the Lookahead optimizer. (#306, #310)

Thanks to @Vectorrent