RE: What’s the interpretation behind learning rate warmup?
Learning rate warmup is a technique used in training deep neural networks in which the learning rate is gradually increased from a small initial value up to its target value over the first portion of training. It was introduced to help large models, especially those using batch normalization, avoid destabilizing gradient updates early in training.

The rationale: at the start of training the model parameters are randomly initialized, so the gradients can be large and noisy. A small learning rate is therefore preferred initially, to keep those early updates from exploding and causing training instability. As training progresses and the gradients become better behaved, the learning rate is raised to speed up training.

There are several warmup variants, such as constant warmup (hold a small learning rate for a fixed number of steps) and linear warmup (increase the rate linearly to its target), each with its own interpretation. The right choice depends largely on your scenario and on empirical trial. There is also no one-size-fits-all value for the warmup period: depending on your model and dataset, you may want to experiment with different lengths. In terms of "how long", it typically ranges from 0 to 10,000 training steps, with choices around 1,000 steps being popular in various papers. It is also common to decay the learning rate after warmup ends.
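To make the shape of such a schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay. All names and values (`base_lr`, `warmup_steps`, `total_steps`, 1e-3, 1,000, 10,000) are illustrative assumptions, not recommendations:

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Illustrative schedule: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear warmup: ramp from near zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay after warmup: anneal smoothly toward zero by total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The rate climbs during warmup, peaks at base_lr, then falls off.
for step in (0, 500, 999, 1000, 5000, 10000):
    print(step, lr_at(step))
```

In practice you would rarely hand-roll this; most frameworks ship equivalent schedulers, and this function only shows how the pieces fit together.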