Single Node PyTorch to Distributed DL

This section explains how to migrate single-node PyTorch deep learning (DL) code to distributed training with Horovod on Databricks using HorovodRunner.

Databricks Runtime 5.0 ML (Beta), the minimum required runtime for HorovodRunner, includes Horovod. However, to add PyTorch support on that runtime you need to reinstall Horovod. The PyTorch Init Script notebook creates an init script named pytorch-gpu-init.sh that installs the required libraries. If you run on Databricks Runtime 5.1 ML (Beta) or above, you do not need to create the PyTorch init script or configure your cluster with it.

Before running the HorovodRunner PyTorch MNIST Example notebook you must:

  1. Prepare data for distributed training.
  2. If you are running Databricks Runtime 5.0 ML (Beta), do the following:
    1. Run the PyTorch Init Script notebook.
    2. Configure a GPU cluster with the pytorch-gpu-init.sh init script.
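The migration this section describes usually comes down to a few changes in the training function. The following is a minimal sketch, not the notebook's actual code: the function names `scaled_lr`, `train_hvd`, and `launch`, the random placeholder data, the linear model, and the linear learning-rate scaling are all assumptions made for illustration.

```python
def scaled_lr(base_lr, num_workers):
    """Scale the single-node learning rate linearly with the worker count.

    Linear scaling is a common convention, assumed here for illustration.
    """
    return base_lr * num_workers


def train_hvd(base_lr=0.01, epochs=1):
    # Imports live inside the function so they run in each worker process.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import horovod.torch as hvd
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # 1. Initialize Horovod.
    hvd.init()

    # 2. Pin each process to one GPU when GPUs are available.
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if use_cuda else "cpu")

    # Placeholder data (hypothetical); a real job loads the prepared dataset.
    features = torch.randn(512, 784)
    labels = torch.randint(0, 10, (512,))
    dataset = TensorDataset(features, labels)

    # 3. Give each worker a distinct shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(784, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(base_lr, hvd.size()))

    # 4. Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # 5. Start every worker from the same initial weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()


def launch(num_processes):
    # HorovodRunner is provided by Databricks Runtime ML; call this on a cluster.
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_processes)
    hr.run(train_hvd, base_lr=0.01, epochs=1)
```

On a cluster, calling `launch(2)` would start two Horovod processes that each run `train_hvd`. Keeping the imports inside the training function lets the function be shipped to the workers without depending on state in the driver notebook.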

HorovodRunner PyTorch MNIST Example