Optimizers

class optexp.optim.Optimizer

Abstract base class for optimizers.

class optexp.optim.SGD(lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

Stochastic Gradient Descent.

Parameters:
  • lr (float) – learning rate.

  • momentum (float, optional) – momentum factor. Defaults to 0.

  • dampening (float, optional) – dampening for momentum. Defaults to 0.

  • weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.

  • nesterov (bool, optional) – enables Nesterov momentum. Defaults to False.

  • decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().
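How these parameters interact can be sketched on a single scalar parameter. This is a minimal illustration, assuming the update follows the same rule as torch.optim.SGD; that is an assumption, since this page does not state the update rule. The helper name sgd_step is hypothetical and not part of optexp.

```python
def sgd_step(param, grad, buf, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One SGD update on a scalar; returns (new_param, new_momentum_buffer).

    Hypothetical helper, assuming torch.optim.SGD semantics.
    """
    # L2 penalty: weight decay is added to the gradient.
    d = grad + weight_decay * param
    if momentum != 0.0:
        if buf is None:
            # First step: the buffer starts as the (decayed) gradient.
            buf = d
        else:
            # Dampening scales down the contribution of the new gradient.
            buf = momentum * buf + (1.0 - dampening) * d
        # Nesterov looks ahead along the buffer; plain momentum uses the buffer.
        d = d + momentum * buf if nesterov else buf
    return param - lr * d, buf


# Usage: three steps with a constant gradient of 0.5.
p, buf = 1.0, None
for _ in range(3):
    p, buf = sgd_step(p, 0.5, buf, lr=0.1, momentum=0.9)
```

With momentum left at 0, the buffer is never created and dampening and nesterov have no effect.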

class optexp.optim.Adam(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

Adam optimizer from [Kingma2014].

Parameters:
  • lr (float) – learning rate.

  • beta1 (float, optional) – coefficient used for computing the exponential moving average (EMA) of the gradient. Defaults to 0.9.

  • beta2 (float, optional) – coefficient used for computing the EMA of the squared gradient. Defaults to 0.999.

  • eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.

  • weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.01.

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.

  • decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().

[Kingma2014]

Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba. International Conference on Learning Representations, 2015. doi.org/10.48550/arXiv.1412.6980
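The roles of beta1, beta2, eps and weight_decay can be sketched on a single scalar parameter. This is a minimal illustration of the update described in [Kingma2014], with the L2 penalty folded into the gradient; it is not the library's implementation. The helper name adam_step is hypothetical, and the amsgrad variant is left out.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01):
    """One Adam update on a scalar; t is the 1-based step count.

    Hypothetical helper; returns (new_param, new_m, new_v).
    """
    g = grad + weight_decay * param          # L2 penalty added to the gradient
    m = beta1 * m + (1.0 - beta1) * g        # EMA of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g    # EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero init
    v_hat = v / (1.0 - beta2 ** t)
    # eps keeps the denominator away from zero.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Usage: three steps with a constant gradient of 0.5.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, 0.5, m, v, t, lr=0.1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.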

class optexp.optim.AdamW(lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, decay_strategy: WeightDecayStrategy = DecayEverything())

AdamW optimizer from [Loshchilov2019].

Parameters:
  • lr (float) – learning rate.

  • beta1 (float, optional) – coefficient used for computing the exponential moving average (EMA) of the gradient. Defaults to 0.9.

  • beta2 (float, optional) – coefficient used for computing the EMA of the squared gradient. Defaults to 0.999.

  • eps (float, optional) – term added to the denominator to improve numerical stability. Defaults to 1e-8.

  • weight_decay (float, optional) – weight decay coefficient, applied directly to the parameters (decoupled from the gradient) rather than as an L2 penalty. Defaults to 0.01.

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. Defaults to False.

  • decay_strategy (WeightDecayStrategy, optional) – strategy for applying weight decay. Defaults to DecayEverything().

[Loshchilov2019]

Decoupled Weight Decay Regularization. Ilya Loshchilov, Frank Hutter. International Conference on Learning Representations, 2019. doi.org/10.48550/arXiv.1711.05101
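The difference from Adam lies in how weight decay enters the update. In [Loshchilov2019], the decay shrinks the parameter directly instead of being added to the gradient. The sketch below contrasts the two on a scalar parameter with a zero gradient. It is an illustration only, not the library's implementation; the helper name step and its decoupled flag are hypothetical.

```python
import math


def step(param, grad, m, v, t, lr, weight_decay, decoupled,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update on a scalar; returns (new_param, new_m, new_v).

    decoupled=True  -> AdamW-style decay applied to the parameter itself.
    decoupled=False -> L2 penalty added to the gradient (Adam-style).
    """
    if decoupled:
        param = param * (1.0 - lr * weight_decay)
    else:
        grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Zero gradient, so any movement comes from weight decay alone.
p_adamw, _, _ = step(2.0, 0.0, 0.0, 0.0, 1, lr=0.1,
                     weight_decay=0.01, decoupled=True)
p_l2, _, _ = step(2.0, 0.0, 0.0, 0.0, 1, lr=0.1,
                  weight_decay=0.01, decoupled=False)
```

Here p_adamw shrinks by the factor 1 - lr * weight_decay. In contrast, p_l2 moves by roughly lr, because the adaptive normalization rescales the penalty term like any other gradient.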

class optexp.optim.Adagrad(lr: float, weight_decay: float = 0.0, lr_decay: float = 0.0, decay_strategy: WeightDecayStrategy = DecayEverything())

Weight decay strategies

class optexp.optim.WeightDecayStrategy

Abstract base class for weight decay strategies.

class optexp.optim.DecayEverything

Applies weight decay to all parameters.

class optexp.optim.NoDecayOnBias

Applies weight decay to all parameters except biases.

Biases are identified by name: a parameter is excluded from weight decay if its name contains “bias”.
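The name-based rule can be sketched as a simple partition over parameter names. This is a minimal illustration of the substring test described above, not the library's implementation. The helper name split_by_name and the example parameter names are hypothetical.

```python
def split_by_name(names):
    """Partition parameter names into (decayed, not_decayed) lists.

    Hypothetical helper: a name is exempt from decay if it contains "bias".
    """
    decayed, not_decayed = [], []
    for name in names:
        if "bias" in name:
            not_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, not_decayed


# Example names in the style of a small network (made up for illustration).
decayed, not_decayed = split_by_name(
    ["fc1.weight", "fc1.bias", "norm.weight", "fc2.weight", "fc2.bias"]
)
```

Because the test is a substring match on the name, other parameters such as normalization weights are still decayed. A parameter whose name happens to contain "bias" is exempt even if it is not a bias term.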