
Parallelisms in Large Scale Deep Learning

  • Data Parallelism
  • Model Parallelism
  • Pipeline Parallelism
  • Tensor Parallelism
  • Optimizer-level Parallelism

Data Parallelism

Split each batch of training data into non-overlapping shards and distribute them across multiple GPUs (or across the machines of a GPU cluster); every GPU holds a full replica of the model and trains on its own shard.

This speeds up training significantly, because the shards are processed in parallel.

After each backward pass, the gradients need to be reduced and synced across all the machines. The communication between GPUs and the computation within each GPU happen at the same time.

[Figure: data parallelism]
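Below is a minimal sketch of data parallelism using PyTorch DistributedDataParallel (DDP); the toy model, dataset, and hyperparameters are placeholders, not from this post.

```python
# A minimal data-parallelism sketch using PyTorch DistributedDataParallel (DDP).
# The toy model, dataset, and hyperparameters are placeholders, not from this post.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group("nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica of the model.
    model = torch.nn.Linear(32, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each rank a non-overlapping shard of the data.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients, overlapping with computation
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```

DDP overlaps the gradient all-reduce with the backward computation, which is what lets communication and computation happen at the same time.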

Model Parallelism

Split the big model by layers across multiple GPUs.

The forward and backward passes then have to proceed machine by machine: each GPU waits for the previous one's activations (or gradients), so there is no parallel processing for a single batch.

[Figure: model parallelism]

See the Megatron paper for more about model parallelism.
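As an illustration, here is a minimal sketch of layer-wise model parallelism in PyTorch, assuming two GPUs are available; the layer sizes and device ids are made up.

```python
# A minimal sketch of layer-wise model parallelism, assuming two GPUs.
# The layer sizes and device ids are made up for illustration.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the layers lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(64, 10).to("cuda:1")

    def forward(self, x):
        # Activations are copied from GPU 0 to GPU 1; for a single batch the
        # two GPUs work one after the other, never at the same time.
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 32))   # the loss and labels would live on cuda:1 as well
```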

Pipeline Parallelism

Under the model-parallelism setup, instead of processing the entire mini-batch on one GPU and then sending it to the next GPU, split the mini-batch into micro-batches: as soon as a micro-batch is handed off to the next GPU, the current GPU starts processing the following micro-batch. This reduces the idle time of the downstream GPUs, but a bubble of idle time remains at the start and end of each mini-batch.

[Figure: pipeline parallelism]
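Here is a minimal, forward-only sketch of the micro-batch idea, assuming two GPUs; the stage sizes, device ids, and micro-batch count are made up, and real schedulers (e.g., GPipe-style) also interleave the backward passes.

```python
# A minimal, forward-only sketch of pipeline parallelism with micro-batches,
# assuming two GPUs. Stage sizes, device ids, and the micro-batch count are made up.
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")   # pipeline stage on GPU 0
stage2 = nn.Linear(64, 10).to("cuda:1")                             # pipeline stage on GPU 1

def pipelined_forward(batch, n_micro=4):
    outputs = []
    for micro in batch.chunk(n_micro):           # split the mini-batch into micro-batches
        h = stage1(micro.to("cuda:0"))           # stage 1 processes the micro-batch
        outputs.append(stage2(h.to("cuda:1")))   # hand the activations to stage 2
    # Because CUDA kernels launch asynchronously, GPU 0 can start the next
    # micro-batch while GPU 1 is still busy with the previous one; the first and
    # last micro-batches still leave one GPU idle (the pipeline "bubble").
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(32, 32))
```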

Tensor Parallelism

Split the individual weight tensors inside a layer across multiple GPUs, so that each GPU computes only a slice of the layer's output and the slices are combined with a collective operation (this is the intra-layer scheme used in Megatron).

[Figure: tensor parallelism]
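As a concrete illustration, here is a minimal sketch of splitting one linear layer's weight matrix column-wise across two GPUs, Megatron-style; the shapes and device ids are made up.

```python
# A minimal sketch of tensor (intra-layer) parallelism, assuming two GPUs:
# one linear layer's weight matrix is split column-wise, Megatron-style.
# Shapes and device ids are made up for illustration.
import torch

x = torch.randn(8, 32)
full_weight = torch.randn(32, 64)

# Each GPU stores only half of the weight columns.
w0 = full_weight[:, :32].to("cuda:0")
w1 = full_weight[:, 32:].to("cuda:1")

# Both GPUs compute their partial outputs from the same input in parallel ...
y0 = x.to("cuda:0") @ w0
y1 = x.to("cuda:1") @ w1

# ... and the shards are concatenated (in a real setup, via an all-gather).
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
assert torch.allclose(y, x @ full_weight, rtol=1e-2, atol=1e-2)   # matches, up to float error
```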

Optimizer-level Parallelism

See the ZeRO paper, used by DeepSpeed: it partitions the optimizer states (and, in higher stages, the gradients and parameters) across the data-parallel workers, so each GPU stores only a shard and memory per GPU drops sharply.
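As one concrete example, PyTorch ships a ZeRO-stage-1-style implementation, ZeroRedundancyOptimizer, which shards only the optimizer states; the sketch below assumes a DDP setup like the one in the data-parallelism section, with a made-up model and hyperparameters.

```python
# A minimal sketch of ZeRO-style optimizer-state sharding using PyTorch's
# ZeroRedundancyOptimizer (roughly ZeRO stage 1). The model, data, and
# hyperparameters are made up; launch with torchrun as in the DDP sketch above.
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(32, 10).cuda(local_rank), device_ids=[local_rank])

# Each rank keeps only its shard of the Adam states (m and v) instead of a full
# copy, so optimizer memory shrinks roughly by the number of data-parallel ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3
)

x = torch.randn(64, 32).cuda(local_rank)
y = torch.randint(0, 10, (64,)).cuda(local_rank)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()   # each rank updates its own parameter shard, then broadcasts it
dist.destroy_process_group()
```

DeepSpeed exposes the same idea through its zero_optimization config, where the higher stages also shard the gradients and the parameters themselves.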

