# Optimizer¶

## Momentum¶

Optimizers (update equations) for the SGD method.

SGD Optimizer.

SGD is an optimization method that tries to find a neural network that minimizes its “cost/error” by iteration. In Paddle’s implementation the SGD optimizer is synchronized, which means all gradients are computed and reduced into one gradient before the optimization step is applied.

The neural network considers the learning problem of minimizing an objective function that has the form of a sum

$Q(w) = \sum_{i}^{n} Q_i(w)$

The value of the function Q is often the cost of the neural network (for example, the mean square error between prediction and label). The function Q is parametrised by w, the weights/biases of the neural network, which are what is to be learned. The index i denotes the i-th observation in the (training) data.

So, the SGD method optimizes the weights by

$w = w - \eta \nabla Q(w) = w - \eta \sum_{i}^{n} \nabla Q_i(w)$

where $$\eta$$ is the learning rate and $$n$$ is the batch size.
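For concreteness, here is a minimal NumPy sketch of this update rule. The function name, the toy quadratic cost, and the learning rate are illustrative choices, not part of Paddle’s API.

```python
import numpy as np

def sgd_update(w, grad, eta=0.1):
    """One synchronized SGD step: w <- w - eta * grad, where `grad` is the
    gradient already reduced (summed) over the mini-batch."""
    return w - eta * grad

# Toy cost Q(w) = 0.5 * ||w||^2, so the gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_update(w, grad=w)
print(w)  # converges towards the minimizer [0., 0.]
```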

## Adam¶

Adam optimizer. The update equations are:

$\begin{split}m(w, t) & = \beta_1 m(w, t-1) + (1 - \beta_1) \nabla Q_i(w) \\ v(w, t) & = \beta_2 v(w, t-1) + (1 - \beta_2)(\nabla Q_i(w))^2 \\ w & = w - \frac{\eta m(w, t)}{\sqrt{v(w,t) + \epsilon}}\end{split}$

Parameters:

- beta1 (float) – the $$\beta_1$$ in the equation.
- beta2 (float) – the $$\beta_2$$ in the equation.
- epsilon (float) – the $$\epsilon$$ in the equation; it is used to prevent division by zero.
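A minimal NumPy sketch of one Adam step as written above (no bias correction, since the equations do not include it); names and default values are illustrative, not Paddle’s API.

```python
import numpy as np

def adam_update(w, grad, m, v, eta=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step following the equations above."""
    m = beta1 * m + (1 - beta1) * grad       # m(w, t): first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # v(w, t): second-moment estimate
    w = w - eta * m / np.sqrt(v + epsilon)   # parameter update
    return w, m, v
```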

## AdaMax¶

AdaMax optimizer, a variant of Adam based on the infinity norm.

For details, please refer to Adam: A Method for Stochastic Optimization.

$\begin{split}m_t & = \beta_1 * m_{t-1} + (1-\beta_1)* \nabla Q_i(w) \\ u_t & = max(\beta_2*u_{t-1}, abs(\nabla Q_i(w))) \\ w_t & = w_{t-1} - (\eta/(1-\beta_1^t))*m_t/u_t\end{split}$

Parameters:

- beta1 (float) – the $$\beta_1$$ in the equation.
- beta2 (float) – the $$\beta_2$$ in the equation.
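A minimal NumPy sketch of one AdaMax step, assuming the time step t starts at 1 and the accumulators start at zero; names and defaults are illustrative, not Paddle’s API.

```python
import numpy as np

def adamax_update(w, grad, m, u, t, eta=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax step following the equations above (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # m_t
    u = np.maximum(beta2 * u, np.abs(grad))   # u_t: infinity-norm accumulator
    # Real implementations usually add a small constant to u to avoid
    # division by zero; the equations above omit it.
    w = w - (eta / (1 - beta1 ** t)) * m / u  # w_t
    return w, m, u
```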

## AdaGrad¶

AdaGrad optimizer. The update equations are:

$\begin{split}G &= \sum_{\tau=1}^{t} g_{\tau} g_{\tau}^T \\ w & = w - \eta diag(G)^{-\frac{1}{2}} \circ g\end{split}$
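In practice only the diagonal of G, i.e. the running sum of squared gradients, is stored. A minimal NumPy sketch, with a small epsilon added for numerical safety even though the equation omits it; names are illustrative, not Paddle’s API.

```python
import numpy as np

def adagrad_update(w, grad, g2_sum, eta=0.01, epsilon=1e-6):
    """One AdaGrad step keeping only diag(G) as `g2_sum`."""
    g2_sum = g2_sum + grad ** 2                      # diagonal of G after this step
    w = w - eta * grad / np.sqrt(g2_sum + epsilon)   # w - eta * diag(G)^(-1/2) ∘ g
    return w, g2_sum
```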

## Decayed AdaGrad¶

AdaGrad with an exponentially decaying accumulation of squared gradients. The update equations are:

$\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= \frac{1}{\sqrt{E(g_t^2) + \epsilon}}\end{split}$

Parameters:

- rho (float) – the $$\rho$$ in the equation.
- epsilon (float) – the $$\epsilon$$ in the equation.
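The equations above only define the effective learning rate. A minimal NumPy sketch, assuming the parameter step is the usual w ← w − η · learning_rate · g (names and defaults illustrative, not Paddle’s API):

```python
import numpy as np

def decayed_adagrad_update(w, grad, g2_avg, eta=0.01, rho=0.95, epsilon=1e-6):
    """One step of AdaGrad with an exponentially decaying accumulator."""
    g2_avg = rho * g2_avg + (1 - rho) * grad ** 2   # E(g_t^2)
    w = w - eta * grad / np.sqrt(g2_avg + epsilon)  # w - eta * learning_rate * g
    return w, g2_avg
```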

## AdaDelta¶

AdaDelta optimizer. The update equations are:

$\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= \sqrt{\frac{E(dx_{t-1}^2) + \epsilon}{E(g_t^2) + \epsilon}} \\ E(dx_t^2) &= \rho * E(dx_{t-1}^2) + (1-\rho) * (-g * learning\_rate)^2\end{split}$

Parameters:

- rho (float) – the $$\rho$$ in the equation.
- epsilon (float) – the $$\epsilon$$ in the equation.
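A minimal NumPy sketch of one AdaDelta step, assuming the parameter update is w ← w + dx with dx = −g · learning_rate as in the last equation (names and defaults illustrative, not Paddle’s API):

```python
import numpy as np

def adadelta_update(w, grad, g2_avg, dx2_avg, rho=0.95, epsilon=1e-6):
    """One AdaDelta step following the equations above."""
    g2_avg = rho * g2_avg + (1 - rho) * grad ** 2           # E(g_t^2)
    lr = np.sqrt((dx2_avg + epsilon) / (g2_avg + epsilon))  # per-element learning rate
    dx = -grad * lr                                          # update direction
    dx2_avg = rho * dx2_avg + (1 - rho) * dx ** 2            # E(dx_t^2)
    return w + dx, g2_avg, dx2_avg
```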

## RMSProp¶

RMSProp (Root Mean Square Propagation) optimizer. The update equations are:

$\begin{split}v(w, t) & = \rho v(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2 \\ w & = w - \frac{\eta} {\sqrt{v(w,t) + \epsilon}} \nabla Q_{i}(w)\end{split}$

Parameters:

- rho (float) – the $$\rho$$ in the equation; the forgetting factor.
- epsilon (float) – the $$\epsilon$$ in the equation.
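A minimal NumPy sketch of one RMSProp step following the equations above (names and defaults illustrative, not Paddle’s API):

```python
import numpy as np

def rmsprop_update(w, grad, v, eta=0.01, rho=0.9, epsilon=1e-6):
    """One RMSProp step following the equations above."""
    v = rho * v + (1 - rho) * grad ** 2        # v(w, t): decaying average of squared gradients
    w = w - eta * grad / np.sqrt(v + epsilon)  # parameter update
    return w, v
```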