Optimizers

class optexp.optim.Optimizer: Abstract base class for optimizers.

class optexp.optim.SGD(lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

Stochastic Gradient Descent.

Parameters:

lr (float) – learning rate.
momentum (float, optional) – momentum factor. Defaults to 0.
dampening (float, optional) – dampening for momentum. Defaults to 0.
weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.
nesterov (bool, optional) – enables Nesterov momentum. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().
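To make the interaction between these parameters concrete, here is a minimal sketch of a single SGD update on one scalar parameter. It assumes the parameters follow the convention used by PyTorch's SGD (L2 penalty added to the gradient, a momentum buffer scaled by 1 - dampening); this page does not confirm that optexp does the same. The function name sgd_step and the explicit buf argument are hypothetical, introduced only for illustration.

```python
def sgd_step(param, grad, buf, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One illustrative SGD update on a scalar; returns (new_param, new_buf).

    Assumption: PyTorch-style semantics, not taken from the optexp source.
    """
    # L2 penalty: the decay term is added to the gradient.
    grad = grad + weight_decay * param
    if momentum != 0.0:
        if buf is None:
            # First step: the momentum buffer starts at the gradient.
            buf = grad
        else:
            buf = momentum * buf + (1.0 - dampening) * grad
        # Nesterov momentum looks ahead along the buffer direction.
        grad = grad + momentum * buf if nesterov else buf
    return param - lr * grad, buf


# Two steps with a constant gradient of 0.5 and momentum 0.9.
p, buf = sgd_step(1.0, 0.5, None, lr=0.1, momentum=0.9)
p, buf = sgd_step(p, 0.5, buf, lr=0.1, momentum=0.9)
print(p, buf)
```

With momentum left at 0 the buffer is never created and the update reduces to param - lr * grad.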

class optexp.optim.Adam(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

Adam optimizer from [Kingma2014].

Parameters:

lr (float) – learning rate.
beta1 (float, optional) – coefficient for the exponential moving average (EMA) of the gradient. Defaults to 0.9.
beta2 (float, optional) – coefficient for the EMA of the squared gradient. Defaults to 0.999.
eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.
weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.01.
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().

[Kingma2014]

Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba. International Conference on Learning Representations, 2015. doi.org/10.48550/arXiv.1412.6980
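The roles of beta1, beta2, eps and weight_decay can be read off a single update step. The sketch below follows the algorithm in [Kingma2014] for one scalar parameter, with the L2 penalty added to the gradient as the parameter description states. The function name adam_step and the explicit state arguments m, v, t are hypothetical; the AMSGrad variant is omitted, and nothing here is taken from the optexp source.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01):
    """One illustrative Adam update on a scalar; t is the 1-based step count.

    Returns (new_param, new_m, new_v).
    """
    # L2 penalty: the decay term enters through the gradient, so it is
    # also rescaled by the adaptive denominator below.
    grad = grad + weight_decay * param
    # EMAs of the gradient and of the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised EMAs.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1)
print(p, m, v)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1, so the parameter moves by roughly lr regardless of the gradient's scale.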

class optexp.optim.AdamW(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

AdamW optimizer from [Loshchilov2019].

Parameters:

lr (float) – learning rate.
beta1 (float, optional) – coefficient for the exponential moving average (EMA) of the gradient. Defaults to 0.9.
beta2 (float, optional) – coefficient for the EMA of the squared gradient. Defaults to 0.999.
eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.
weight_decay (float, optional) – weight decay coefficient; per [Loshchilov2019], the decay is decoupled from the gradient rather than added as an L2 penalty. Defaults to 0.01.
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.
decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().

[Loshchilov2019]

Decoupled Weight Decay Regularization. Ilya Loshchilov, Frank Hutter. International Conference on Learning Representations, 2019. doi.org/10.48550/arXiv.1711.05101
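The difference from Adam lies in where the decay term enters. As a minimal sketch, assuming the decoupled form described in [Loshchilov2019] and the variant in which the decay is scaled by the learning rate (as in PyTorch's AdamW; not confirmed for optexp), a single scalar update might look like this. The function name adamw_step and its state arguments are hypothetical, and AMSGrad is omitted.

```python
import math


def adamw_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One illustrative AdamW update on a scalar; t is the 1-based step count.

    Returns (new_param, new_m, new_v).
    """
    # Decoupled decay: shrink the parameter directly. The gradient and the
    # moment estimates never see the decay term.
    param = param * (1.0 - lr * weight_decay)
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient only the decay acts on the parameter.
p, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1)
print(p)
```

Because the decay bypasses the adaptive denominator, a zero gradient shrinks the parameter by the factor 1 - lr * weight_decay only, whereas an L2 penalty routed through the gradient would be rescaled by 1 / sqrt(v_hat).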

class optexp.optim.Adagrad(lr: float, weight_decay: float = 0.0, lr_decay: float = 0.0, decay_strategy: WeightDecayStrategy = DecayEverything())
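This page gives no parameter descriptions for Adagrad. As a rough guide only, the sketch below shows what lr, weight_decay and lr_decay conventionally mean in the standard Adagrad update as implemented in PyTorch; whether optexp matches it is an assumption. The function name adagrad_step, the explicit state arguments, and the eps term (which does not appear in the signature above) are hypothetical.

```python
import math


def adagrad_step(param, grad, state_sum, step, lr, weight_decay=0.0,
                 lr_decay=0.0, eps=1e-10):
    """One illustrative Adagrad update on a scalar; step is 1-based.

    Returns (new_param, new_state_sum). Assumption: PyTorch-style semantics.
    """
    # L2 penalty added to the gradient.
    grad = grad + weight_decay * param
    # lr_decay shrinks the effective learning rate as steps accumulate.
    clr = lr / (1.0 + (step - 1) * lr_decay)
    # Running sum of squared gradients (never decays).
    state_sum = state_sum + grad * grad
    param = param - clr * grad / (math.sqrt(state_sum) + eps)
    return param, state_sum


p, s = adagrad_step(1.0, 0.5, 0.0, step=1, lr=0.1)
print(p, s)
```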

Weight decay strategies

class optexp.optim.WeightDecayStrategy: Abstract base class for weight decay strategies.

class optexp.optim.DecayEverything: Applies weight decay to all parameters.

class optexp.optim.NoDecayOnBias

Applies weight decay to all parameters except biases. A parameter is treated as a bias if its name contains “bias”.
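The name-based rule can be illustrated with a small, self-contained sketch. It partitions a list of parameter names by the substring test described above; the helper split_by_bias and the example names are hypothetical and are not part of optexp.

```python
def split_by_bias(param_names):
    """Partition parameter names into (decay, no_decay) lists.

    A name lands in no_decay when it contains the substring "bias".
    """
    decay, no_decay = [], []
    for name in param_names:
        (no_decay if "bias" in name else decay).append(name)
    return decay, no_decay


names = ["fc1.weight", "fc1.bias", "norm.weight", "norm.bias"]
decay, no_decay = split_by_bias(names)
print("decayed:", decay)
print("not decayed:", no_decay)
```

Under a pure substring test, a normalization weight such as the hypothetical "norm.weight" is still decayed, since only names containing “bias” are excluded.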