% Small choose
$$
That is, dropout is equivalent to training an ensemble of \(2^d\) thinned sub-networks, where \(d\) is the number of units in the network that can be dropped.
Typically, an input unit is retained with probability 0.8, and a hidden unit is retained with probability 0.5.
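The retention probabilities above can be sketched with ``inverted'' dropout, in which surviving activations are rescaled by \(1/p\) at training time so that no rescaling is needed at test time. This is a minimal illustrative sketch, not the only implementation; the function name \texttt{dropout} and the use of NumPy are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    and rescale the survivors by 1 / keep_prob so E[output] == x."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

x = np.ones(10_000)                 # input activations
h = dropout(x, keep_prob=0.8)       # input units kept with probability 0.8
# Because of the rescaling, the mean activation is preserved in expectation,
# so at test time the full network is used with no mask and no rescaling.
print(h.mean())
```

Each call samples a fresh mask, so successive minibatches train different sub-networks drawn from the \(2^d\) ensemble.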
There are some differences between dropout and bagging: in bagging, each model is independent and is trained to convergence on its own bootstrapped dataset, whereas under dropout the exponentially many sub-networks share parameters, most are never sampled explicitly, and those that are see at most a few minibatches.
Pros:
\begin{itemize}
  \item Computationally cheap: sampling a binary mask and rescaling adds little cost per update.
  \item An effective regularizer that combines readily with most architectures and other regularization methods.
\end{itemize}
Cons:
\begin{itemize}
  \item Training typically takes longer, and a larger network may be needed to compensate for the reduced effective capacity.
  \item Exact inference would require averaging the predictions of all \(2^d\) sub-networks; the usual weight-scaling rule is only an approximation to this average.
\end{itemize}