
Parallelisms in Large Scale Deep Learning

  • Data Parallelism
  • Model Parallelism
  • Pipeline Parallelism
  • Tensor Parallelism
  • Optimizer-level Parallelism

Data Parallelism

Split each batch of training data into non-overlapping shards and distribute them across multiple GPUs (or across the machines of a GPU cluster); every GPU holds a full replica of the model and trains on its own shard.

This speeds up training significantly, because the shards are processed in parallel.

After each backward pass, the gradients need to be reduced and synced across all the machines. The communication between GPUs and the computation within each GPU happen at the same time.

[Figure: data parallelism]
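Below is a minimal sketch of data parallelism using PyTorch DistributedDataParallel (DDP); the toy model, dataset, and hyperparameters are placeholders, not from this post.

```python
# A minimal data-parallelism sketch using PyTorch DistributedDataParallel (DDP).
# The toy model, dataset, and hyperparameters are placeholders, not from this post.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group("nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica of the model.
    model = torch.nn.Linear(32, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each rank a non-overlapping shard of the data.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients, overlapping with computation
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```

DDP overlaps the gradient all-reduce with the backward computation, which is what lets communication and computation happen at the same time.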

Model Parallelism

Split the big model by layers across multiple GPUs.

The forward and backward passes then have to proceed machine by machine: each GPU waits for the previous one's activations (or gradients), so there is no parallel processing for a single batch.

[Figure: model parallelism]

See the Megatron paper for more about model parallelism.
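As an illustration, here is a minimal sketch of layer-wise model parallelism in PyTorch, assuming two GPUs are available; the layer sizes and device ids are made up.

```python
# A minimal sketch of layer-wise model parallelism, assuming two GPUs.
# The layer sizes and device ids are made up for illustration.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the layers lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(64, 10).to("cuda:1")

    def forward(self, x):
        # Activations are copied from GPU 0 to GPU 1; for a single batch the
        # two GPUs work one after the other, never at the same time.
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 32))   # the loss and labels would live on cuda:1 as well
```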

Pipeline Parallelism

Under the model-parallelism setup, instead of processing the entire mini-batch on one GPU and then sending it to the next GPU, split the mini-batch into micro-batches: as soon as a micro-batch is handed off to the next GPU, the current GPU starts processing the following micro-batch. This reduces the idle time of the downstream GPUs, but a bubble of idle time remains at the start and end of each mini-batch.

[Figure: pipeline parallelism]
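Here is a minimal, forward-only sketch of the micro-batch idea, assuming two GPUs; the stage sizes, device ids, and micro-batch count are made up, and real schedulers (e.g., GPipe-style) also interleave the backward passes.

```python
# A minimal, forward-only sketch of pipeline parallelism with micro-batches,
# assuming two GPUs. Stage sizes, device ids, and the micro-batch count are made up.
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")   # pipeline stage on GPU 0
stage2 = nn.Linear(64, 10).to("cuda:1")                             # pipeline stage on GPU 1

def pipelined_forward(batch, n_micro=4):
    outputs = []
    for micro in batch.chunk(n_micro):           # split the mini-batch into micro-batches
        h = stage1(micro.to("cuda:0"))           # stage 1 processes the micro-batch
        outputs.append(stage2(h.to("cuda:1")))   # hand the activations to stage 2
    # Because CUDA kernels launch asynchronously, GPU 0 can start the next
    # micro-batch while GPU 1 is still busy with the previous one; the first and
    # last micro-batches still leave one GPU idle (the pipeline "bubble").
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(32, 32))
```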

Tensor Parallelism

Split the individual weight tensors inside a layer across multiple GPUs, so that each GPU computes only a slice of the layer's output and the slices are combined with a collective operation (this is the intra-layer scheme used in Megatron).

[Figure: tensor parallelism]
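As a concrete illustration, here is a minimal sketch of splitting one linear layer's weight matrix column-wise across two GPUs, Megatron-style; the shapes and device ids are made up.

```python
# A minimal sketch of tensor (intra-layer) parallelism, assuming two GPUs:
# one linear layer's weight matrix is split column-wise, Megatron-style.
# Shapes and device ids are made up for illustration.
import torch

x = torch.randn(8, 32)
full_weight = torch.randn(32, 64)

# Each GPU stores only half of the weight columns.
w0 = full_weight[:, :32].to("cuda:0")
w1 = full_weight[:, 32:].to("cuda:1")

# Both GPUs compute their partial outputs from the same input in parallel ...
y0 = x.to("cuda:0") @ w0
y1 = x.to("cuda:1") @ w1

# ... and the shards are concatenated (in a real setup, via an all-gather).
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
assert torch.allclose(y, x @ full_weight, rtol=1e-2, atol=1e-2)   # matches, up to float error
```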

Optimizer-level Parallelism

See the ZeRO paper, used by DeepSpeed: it partitions the optimizer states (and, in higher stages, the gradients and parameters) across the data-parallel workers, so each GPU stores only a shard and memory per GPU drops sharply.
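As one concrete example, PyTorch ships a ZeRO-stage-1-style implementation, ZeroRedundancyOptimizer, which shards only the optimizer states; the sketch below assumes a DDP setup like the one in the data-parallelism section, with a made-up model and hyperparameters.

```python
# A minimal sketch of ZeRO-style optimizer-state sharding using PyTorch's
# ZeroRedundancyOptimizer (roughly ZeRO stage 1). The model, data, and
# hyperparameters are made up; launch with torchrun as in the DDP sketch above.
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(32, 10).cuda(local_rank), device_ids=[local_rank])

# Each rank keeps only its shard of the Adam states (m and v) instead of a full
# copy, so optimizer memory shrinks roughly by the number of data-parallel ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3
)

x = torch.randn(64, 32).cuda(local_rank)
y = torch.randint(0, 10, (64,)).cuda(local_rank)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()   # each rank updates its own parameter shard, then broadcasts it
dist.destroy_process_group()
```

DeepSpeed exposes the same idea through its zero_optimization config, where the higher stages also shard the gradients and the parameters themselves.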

