Parallelisms in Large-Scale Deep Learning
- Data Parallelism
- Model Parallelism
- Pipeline Parallelism
- Tensor Parallelism
Data Parallelism
Split the training data into non-overlapping batches and distribute them across multiple GPUs (or groups of GPUs, if a single model replica cannot fit on one GPU), with each replica training on its own shard.
This speeds up training significantly.
The gradients need to be reduced and synchronized across all machines. The communication between GPUs and the computation within each GPU happen at the same time (gradient all-reduce overlaps with the backward pass).
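As a rough illustration, here is a minimal sketch of data-parallel training with PyTorch's DistributedDataParallel, assuming one process per GPU launched with torchrun; the model, data, and hyperparameters below are placeholders.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`
# so that LOCAL_RANK / WORLD_SIZE are set by the launcher.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 1024).to(device)    # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])   # full replica on every GPU
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        # Each rank would normally load its own non-overlapping shard of the
        # batch (e.g. via DistributedSampler); random data stands in for it here.
        x = torch.randn(32, 1024, device=device)
        y = torch.randn(32, 1024, device=device)

        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()   # DDP all-reduces gradients, overlapping with backward compute
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```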
Model Parallelism
Split the big model by layers across multiple GPUs.
The forward and backward passes proceed device by device, so at any moment only one GPU is doing work. There is no parallel processing.
See the Megatron-LM paper for more on model parallelism.
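A minimal sketch of this naive layer-wise split in PyTorch, assuming two GPUs are available; the model and layer sizes are placeholders.

```python
# Naive model parallelism: layers of one model live on different GPUs,
# and activations are moved between devices inside forward().
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # While cuda:1 computes, cuda:0 sits idle (and vice versa): no parallelism.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 1024))
loss = out.sum()
loss.backward()   # the backward pass likewise runs device by device
```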
Pipeline Parallelism
Under the model parallelism setup, instead of processing the entire mini-batch on one GPU and then sending it to the next GPU, split the mini-batch into micro-batches and send each micro-batch to the next GPU as soon as it is done, while processing the following micro-batch locally. This reduces the idle time of the downstream GPUs, but there is still a bubble of idle time while the pipeline fills and drains.
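A minimal sketch of micro-batching over a two-stage split, again assuming two GPUs; the micro-batch count and layer sizes are placeholders, and a real implementation (e.g. a GPipe- or PipeDream-style schedule) also interleaves the backward passes.

```python
# Pipeline-parallel forward sketch: the mini-batch is split into micro-batches,
# and stage 0 can start on the next micro-batch while stage 1 is still working
# on the previous one (CUDA kernels launch asynchronously from the host).
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

def pipelined_forward(batch, num_micro_batches=4):
    outputs = []
    stage0_out = None
    for micro in batch.chunk(num_micro_batches):
        if stage0_out is not None:
            # Stage 1 consumes the previous micro-batch...
            outputs.append(stage1(stage0_out.to("cuda:1")))
        # ...while stage 0 already starts on the current one.
        stage0_out = stage0(micro.to("cuda:0"))
    outputs.append(stage1(stage0_out.to("cuda:1")))   # drain the pipeline
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(128, 1024))
```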
Tensor Parallelism
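Split individual layers themselves across GPUs: each GPU holds a shard of a layer's weight matrix, computes a partial result, and the partial results are combined with a collective (all-gather or all-reduce). This is the intra-layer parallelism used by Megatron-LM.

A minimal single-process sketch of the idea for one linear layer, with the two "GPUs" simulated by two weight shards in the same process; the sizes are placeholders.

```python
# Tensor (intra-layer) parallelism sketch for Y = X @ W: W is split column-wise
# into two shards, each shard produces part of the output, and the parts are
# concatenated. In a real setup the shards live on different GPUs and the
# concatenation is an all-gather.
import torch

torch.manual_seed(0)
X = torch.randn(32, 1024)        # input activations (replicated on every GPU)
W = torch.randn(1024, 4096)      # full weight matrix of the layer

W0, W1 = W.chunk(2, dim=1)       # column shards, one per "GPU"
Y0 = X @ W0                      # partial output computed by GPU 0
Y1 = X @ W1                      # partial output computed by GPU 1

Y = torch.cat([Y0, Y1], dim=1)   # all-gather of the output shards
assert torch.allclose(Y, X @ W, atol=1e-5)
```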
Optimizer-level Parallelism
See the ZeRO paper, used by DeepSpeed: instead of replicating them on every data-parallel worker, the optimizer states (and optionally gradients and parameters) are sharded across the workers.
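A minimal sketch of ZeRO stage-1-style optimizer-state sharding using PyTorch's built-in ZeroRedundancyOptimizer (DeepSpeed provides the full ZeRO 1/2/3 implementation). It assumes the same torchrun, one-process-per-GPU setup as the data-parallel example above; the model and hyperparameters are placeholders.

```python
# Optimizer-state sharding: each rank keeps only its shard of the Adam moment
# buffers instead of a full copy, on top of ordinary DDP data parallelism.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[local_rank])

optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)

x = torch.randn(32, 1024, device=device)
loss = model(x).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```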