Deep Learning Training Tricks
Intro
Here is a blog post summarizing the deep learning training tricks I have encountered recently.
AdamW Optimizer
The “W” stands for weight decay. For plain SGD, weight decay is equivalent to L2 regularization, but for Adam it is not: Adam rescales the gradient using first- and second-moment estimates, so a decay term folded into the gradient also gets rescaled. AdamW is proposed to decouple weight decay from the gradient-based Adam update, applying the decay directly to the weights instead.
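As a minimal sketch of the decoupled update (hypothetical helper, not from the paper's code), note that the `weight_decay * w` term is applied outside the moment estimates `m` and `v`:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on parameters w given gradient grad at step t (t >= 1).

    The decay term weight_decay * w is added directly to the update and
    never enters the moment estimates m and v -- this is the decoupling.
    """
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

In an L2-regularized Adam, the decay would instead be added to `grad` before computing `m` and `v`, so it would get divided by `sqrt(v_hat)` like everything else; decoupling keeps the decay strength uniform across parameters.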
Learning Rate Schedule
Warmup
The learning rate increases linearly from 0 to the target rate over the warmup iterations.
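A sketch of that schedule (hypothetical function name and parameters, assuming a fixed number of warmup steps):

```python
def linear_warmup(step, warmup_steps, base_lr):
    """Ramp the learning rate linearly from 0 to base_lr over warmup_steps,
    then hold it at base_lr."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps
```

In practice a warmup like this is usually composed with a decay schedule that takes over once `step >= warmup_steps`.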
Learning Rate Decay
Here is a paper on cosine-shaped decay (and restarts).
Restarts
The above paper also describes warm restarts: the learning rate is periodically reset to its maximum value, and the cosine decay starts over from there.
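The cosine decay and restarts together can be sketched as follows (hypothetical helper; this assumes a fixed cycle length, whereas the paper also allows cycles that grow over time):

```python
import math

def cosine_with_restarts(step, cycle_len, lr_max, lr_min=0.0):
    """Cosine annealing with warm restarts, using a fixed cycle length.

    Within each cycle the learning rate follows half a cosine from lr_max
    down to lr_min; at the start of the next cycle it jumps back to lr_max.
    """
    t = step % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```

At `step = 0` this returns `lr_max`, it decays smoothly to near `lr_min` by the end of the cycle, and at `step = cycle_len` it restarts at `lr_max`.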
This post is licensed under CC BY 4.0 by the author.