Optimizers
- class optexp.optim.SGD(lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())
Stochastic Gradient Descent.
- Parameters:
lr (float) – learning rate.
momentum (float, optional) – momentum factor. Defaults to 0.
dampening (float, optional) – dampening for momentum. Defaults to 0.
weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.
nesterov (bool, optional) – enables Nesterov momentum. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().
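The parameters above have the same names as those of torch.optim.SGD. As a minimal sketch, assuming optexp follows that update rule (this page does not state it), a single step on one scalar parameter could look like the following. The function name sgd_step and the scalar state are hypothetical and only illustrate how lr, momentum, dampening, weight_decay and nesterov interact; decay_strategy is not modelled.

```python
def sgd_step(param, grad, buf, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One SGD update on a scalar parameter (illustrative sketch only).

    Returns the updated parameter and the updated momentum buffer.
    Assumes the torch.optim.SGD convention, not confirmed for optexp.
    """
    # L2 penalty: the decay term is added to the gradient.
    g = grad + weight_decay * param
    if momentum != 0.0:
        if buf is None:
            # First step: the buffer starts as the (decayed) gradient.
            buf = g
        else:
            # Dampening scales down the new gradient's contribution.
            buf = momentum * buf + (1.0 - dampening) * g
        # Nesterov momentum looks ahead along the buffer direction.
        g = g + momentum * buf if nesterov else buf
    return param - lr * g, buf


# Two steps with a constant gradient of 0.5.
p, b = sgd_step(1.0, 0.5, None, lr=0.1, momentum=0.9)
p, b = sgd_step(p, 0.5, b, lr=0.1, momentum=0.9)
```

With momentum set to 0 the buffer is never used and the step reduces to plain gradient descent with an L2 penalty.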
- class optexp.optim.Adam(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())
Adam optimizer from [Kingma2014].
- Parameters:
lr (float) – learning rate.
beta1 (float, optional) – coefficient used for computing the exponential moving average (EMA) of the gradient. Defaults to 0.9.
beta2 (float, optional) – coefficient used for computing the EMA of the squared gradient. Defaults to 0.999.
eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.
weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.01.
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().
[Kingma2014] Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba. International Conference on Learning Representations, 2015. doi.org/10.48550/arXiv.1412.6980
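To make the roles of beta1, beta2 and eps concrete, here is a minimal sketch of one Adam step on a scalar parameter, following the algorithm in [Kingma2014] with the weight decay folded into the gradient as an L2 penalty. The function name adam_step is hypothetical, the AMSGrad variant and decay_strategy are left out, and whether optexp matches this step exactly is an assumption.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update on a scalar parameter (illustrative sketch only).

    m and v are the running EMAs of the gradient and squared gradient;
    t is the 1-based step count used for bias correction.
    """
    # L2 penalty: the decay term is added to the gradient.
    g = grad + weight_decay * param
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # eps keeps the denominator away from zero.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero-initialised state.
p, m, v = adam_step(1.0, 0.5, m=0.0, v=0.0, t=1, lr=0.1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.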
- class optexp.optim.AdamW(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())
AdamW optimizer from [Loshchilov2019].
- Parameters:
lr (float) – learning rate.
beta1 (float, optional) – coefficient used for computing EMA of gradient. Defaults to 0.9.
beta2 (float, optional) – coefficient used for computing EMA of squared gradients. Defaults to 0.999.
eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.
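weight_decay (float, optional) – weight decay coefficient, applied to the parameters directly (decoupled from the gradient) rather than as an L2 penalty, as described in [Loshchilov2019]. Defaults to 0.01.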
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().
[Loshchilov2019] Decoupled Weight Decay Regularization. Ilya Loshchilov, Frank Hutter. International Conference on Learning Representations, 2019. doi.org/10.48550/arXiv.1711.05101
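AdamW differs from Adam in how weight_decay enters the update: in [Loshchilov2019] the decay shrinks the parameter directly instead of being added to the gradient. The sketch below shows one scalar step under that assumption, applying the decay before the adaptive step as torch.optim.AdamW does. The function name adamw_step is hypothetical, and the ordering used by optexp, the AMSGrad variant and decay_strategy are not covered here.

```python
import math


def adamw_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on a scalar parameter (illustrative sketch only)."""
    # Decoupled decay: shrink the parameter itself; the gradient is untouched.
    param = param * (1.0 - lr * weight_decay)
    # The moment estimates see the raw gradient only.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments (t is 1-based).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient only the decay acts on the parameter.
p, m, v = adamw_step(1.0, 0.0, m=0.0, v=0.0, t=1, lr=0.1)
```

Because the decay bypasses the moment estimates, its strength does not depend on the gradient history, which is the motivation given in [Loshchilov2019] for decoupling it.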
- class optexp.optim.Adagrad(lr: float, weight_decay: float = 0.0, lr_decay: float = 0.0, decay_strategy: WeightDecayStrategy = DecayEverything())