Mixed precision training
Why mixed precision Benefits: faster on modern GPUs that support half-precision (FP16, BFLOAT16) arithmetic. Check the A100 specs: FP32 peaks at 19.5 TFLOPS, while FP16 reaches 312 TFLOPS, roughly 16x...
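To get that speedup without destabilizing training, the usual recipe keeps FP32 master weights and runs the forward/backward pass in half precision with loss scaling. A minimal sketch using PyTorch's torch.cuda.amp (the model, data, and loop below are placeholders, not taken from this post):

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 gradient underflow

for _ in range(10):                           # placeholder training loop
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run forward pass in half precision where safe
        loss = nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscale gradients, then take the optimizer step
    scaler.update()                           # adjust the loss scale for the next iteration
```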
Here is a tech blog on Ring-All-Reduce. It is a very smart way to perform all-reduce in a data-parallel setup.
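As a rough illustration of the idea (a toy single-process simulation, not the code from that blog): each of N workers exchanges only 1/N of its buffer with a neighbour per step, first reduce-scattering the partial sums around the ring, then all-gathering the finished chunks:

```python
def ring_all_reduce(buffers):
    """Every worker ends up holding the element-wise sum of all workers' vectors."""
    n = len(buffers)                      # number of workers in the ring
    size = len(buffers[0])
    assert size % n == 0, "toy version: the vector must split evenly into n chunks"
    c = size // n                         # chunk length

    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s) % n to its
    # right neighbour, which adds it into its own copy. Snapshot outgoing chunks so
    # all sends within a step happen "simultaneously", as on real hardware.
    for s in range(n - 1):
        out = [(i, (i - s) % n, buffers[i][((i - s) % n) * c:((i - s) % n + 1) * c])
               for i in range(n)]
        for i, k, data in out:
            dst = (i + 1) % n
            for j in range(c):
                buffers[dst][k * c + j] += data[j]

    # Phase 2: all-gather. Same ring traffic pattern, but the receiver overwrites
    # its chunk with the fully reduced one instead of adding.
    for s in range(n - 1):
        out = [(i, (i + 1 - s) % n, buffers[i][((i + 1 - s) % n) * c:((i + 1 - s) % n + 1) * c])
               for i in range(n)]
        for i, k, data in out:
            dst = (i + 1) % n
            buffers[dst][k * c:(k + 1) * c] = data

    return buffers

# Four "workers", each holding the gradient [1, 2, 3, 4]; all end with [4, 8, 12, 16].
print(ring_all_reduce([[1, 2, 3, 4] for _ in range(4)]))
```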
Data Parallelism, Model Parallelism, Pipeline Parallelism, Tensor Parallelism. Data Parallelism: split non-overlapping batches of training data across multiple GPUs (GPU clusters if a model cannot...
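A toy sketch of the data-parallel idea (module names, shapes, and the manual gradient averaging below are illustrative stand-ins for what a framework like PyTorch DDP does for you): each replica sees a non-overlapping shard of the batch, then gradients are averaged so every replica applies the same update.

```python
import torch
from torch import nn

n_gpus = 4
batch = torch.randn(32, 16)                   # global batch (placeholder data)
targets = torch.randn(32, 1)
shards = batch.chunk(n_gpus)                  # non-overlapping shards, one per "GPU"
target_shards = targets.chunk(n_gpus)

replicas = [nn.Linear(16, 1) for _ in range(n_gpus)]
for r in replicas[1:]:                        # all replicas start from the same weights
    r.load_state_dict(replicas[0].state_dict())

# Each replica computes gradients on its own shard only.
for r, x, y in zip(replicas, shards, target_shards):
    nn.functional.mse_loss(r(x), y).backward()

# "All-reduce": average gradients across replicas so every replica sees the same gradient.
for params in zip(*(r.parameters() for r in replicas)):
    mean_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = mean_grad.clone()
```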
Intro I have been implementing a GPT model myself these days, and I found 2 interesting regularization techniques. One is Label Smoothing, mentioned in the Transformer paper. The other is Weight Tying...
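A minimal sketch of both techniques in PyTorch (the smoothing factor, dimensions, and dummy targets are illustrative, not the post's settings):

```python
import torch
from torch import nn

vocab_size, d_model = 1000, 64

# Label smoothing: move a little probability mass from the true class to all others.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Weight tying: the output projection shares its weight matrix with the token embedding,
# so there is only one (vocab_size, d_model) parameter tensor.
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight

tokens = torch.randint(0, vocab_size, (8, 16))
logits = lm_head(embedding(tokens))           # (8, 16, vocab_size)
# Dummy targets (just the input tokens themselves), purely for illustration.
loss = criterion(logits.view(-1, vocab_size), tokens.view(-1))
```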
Intro Dropout is a regularization method that randomly sets some of the elements output by a layer of neurons to zero during the training stage, but acts like an identity layer during the inferen...
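A quick sketch of that train/eval asymmetry using PyTorch's nn.Dropout (inverted dropout, so the surviving elements are rescaled at training time; the numbers are illustrative):

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the elements zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity: the input passes through unchanged
```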
Definition A heap is a tree-based data structure that satisfies the heap property. A binary heap is a binary tree with two properties: Shape Property: it must be a complete binary tree, which...
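A tiny array-backed min-heap sketch (illustrative, not necessarily the post's implementation) showing how the level-order array keeps the tree complete and how sift-up/sift-down maintain the heap property (parent <= children):

```python
class MinHeap:
    def __init__(self):
        self.a = []                               # level-order array: children of i are 2i+1, 2i+2

    def push(self, x):
        self.a.append(x)                          # appending keeps the tree complete
        i = len(self.a) - 1
        while i > 0 and self.a[(i - 1) // 2] > self.a[i]:   # sift up until parent <= child
            self.a[i], self.a[(i - 1) // 2] = self.a[(i - 1) // 2], self.a[i]
            i = (i - 1) // 2

    def pop(self):
        top, last = self.a[0], self.a.pop()       # the smallest element is always at the root
        if self.a:
            self.a[0] = last                      # move the last leaf to the root, then sift down
            i = 0
            while True:
                left, right, smallest = 2 * i + 1, 2 * i + 2, i
                if left < len(self.a) and self.a[left] < self.a[smallest]:
                    smallest = left
                if right < len(self.a) and self.a[right] < self.a[smallest]:
                    smallest = right
                if smallest == i:
                    break
                self.a[i], self.a[smallest] = self.a[smallest], self.a[i]
                i = smallest
        return top

h = MinHeap()
for v in [5, 1, 4, 2, 3]:
    h.push(v)
print([h.pop() for _ in range(5)])                # [1, 2, 3, 4, 5]
```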