Deep Learning Training Tricks

Intro

This post summarizes the deep learning training tricks I have encountered recently.

AdamW Optimizer

The “W” stands for weight decay, which plays the same role as the weight decay in L2 regularization. AdamW decouples the weight decay step from the regular Adam update, which itself uses first- and second-moment estimates of the gradient.
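To make the decoupling concrete, here is a minimal sketch of one AdamW step for a single scalar weight (illustrative only; a real implementation such as `torch.optim.AdamW` operates on tensors and handles state per parameter):

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight (illustrative sketch)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for step t (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: shrink the weight directly,
    # instead of adding lambda * w to the gradient as plain L2 regularization would.
    w = w - lr * weight_decay * w
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

The key line is the decay applied directly to `w`: it never passes through the moment estimates, so the effective decay strength is not rescaled by the adaptive denominator.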

Learning Rate Schedule

Warmup

The learning rate increases linearly from 0 to the target rate over the warmup iterations.
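A minimal sketch of that linear ramp (function name and arguments are my own for illustration):

```python
def warmup_lr(step, warmup_steps, base_lr):
    """Linear warmup: LR rises from 0 to base_lr over warmup_steps, then holds."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

For example, with `warmup_steps=100` and `base_lr=0.1`, step 50 gives a learning rate of 0.05.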

Learning Rate Decay

Cosine-shaped decay (with restarts) was proposed in the SGDR paper (Loshchilov & Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”).
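The decay itself is just half a cosine wave from the base rate down to a floor. A minimal sketch (names are my own):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-shaped decay from base_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 this returns `base_lr`, at the midpoint the average of `base_lr` and `min_lr`, and at `total_steps` it bottoms out at `min_lr`.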

Restarts

The same paper also proposes warm restarts: periodically resetting the learning rate back to its initial value and repeating the cosine decay, which can help the optimizer escape poor local regions.
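A restart schedule with a fixed cycle length can be sketched by wrapping the step index (SGDR additionally lets the cycle length grow; this minimal version keeps it constant):

```python
import math

def cosine_restart_lr(step, cycle_steps, base_lr, min_lr=0.0):
    """Cosine decay that restarts: the LR resets to base_lr every cycle_steps."""
    pos = step % cycle_steps  # position within the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * pos / cycle_steps))
```

With `cycle_steps=100`, the rate decays from `base_lr` toward `min_lr` over 100 steps, then jumps back to `base_lr` at step 100, 200, and so on.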

This post is licensed under CC BY 4.0 by the author.