In PyTorch, these gradients are computed automatically by the torch.autograd package.
The specific choice of activation function often has a considerable effect on the severity of the vanishing gradient problem. The following are the derivatives of some common activation functions:
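For instance, for the sigmoid, hyperbolic tangent, and ReLU activations,
$$
\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr), \qquad
\tanh'(x) = 1 - \tanh^2(x), \qquad
\operatorname{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0,\\
0 & \text{if } x < 0.
\end{cases}
$$
Since $\sigma'(x) \le 1/4$ and $\tanh'(x) \le 1$, products of these factors across many layers shrink rapidly, whereas ReLU passes the gradient through unchanged wherever its input is positive.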
There are several variants of ReLU that are designed to address the dying ReLU problem, for example, Leaky ReLU, Parametric ReLU (PReLU), and ELU, all of which allow a small, nonzero gradient to flow when the unit is not active.
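As a minimal sketch of why this matters (assuming PyTorch is available; the negative slope of 0.01 is simply the library default, used here for illustration), the gradients of ReLU and Leaky ReLU at a negative input can be compared directly via torch.autograd:
\begin{verbatim}
import torch
import torch.nn.functional as F

# A negative pre-activation: ReLU's gradient is exactly zero here,
# so a unit stuck in this regime receives no learning signal
# (the "dying ReLU" problem).
x = torch.tensor(-2.0, requires_grad=True)

F.relu(x).backward()
print(x.grad)  # tensor(0.)

x.grad = None  # clear the accumulated gradient before the second pass
F.leaky_relu(x, negative_slope=0.01).backward()
print(x.grad)  # tensor(0.0100) -- a small but nonzero gradient survives
\end{verbatim}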
The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient.
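In symbols (one standard formulation, with $g$ the gradient, $r$ the running sum of squared gradients, $\epsilon$ the global learning rate, and $\delta$ a small constant for numerical stability):
$$
r \leftarrow r + g \odot g, \qquad
\theta \leftarrow \theta - \frac{\epsilon}{\delta + \sqrt{r}} \odot g,
$$
where the square root and division are applied elementwise.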
The RMSProp algorithm modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average.
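With decay rate $\rho$, the same quantities update as (again in a standard formulation):
$$
r \leftarrow \rho\, r + (1-\rho)\, g \odot g, \qquad
\theta \leftarrow \theta - \frac{\epsilon}{\sqrt{\delta + r}} \odot g,
$$
so old squared gradients are forgotten exponentially fast rather than accumulated indefinitely, which keeps the effective learning rate from shrinking toward zero before the optimizer reaches a good region of a nonconvex objective.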