This section explains how to migrate single-node deep learning (DL) code written in PyTorch to distributed training code with Horovod, using HorovodRunner on Databricks.
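As a rough illustration of what such a migration usually involves, the sketch below wraps a small PyTorch training loop with the standard Horovod calls and launches it through HorovodRunner. It is not the code from the example notebook. The function names (`scaled_lr`, `train_hvd`, `run_distributed`), the random placeholder dataset, the tiny linear model, and the hyperparameter values are all assumptions made for illustration. Scaling the learning rate by the number of workers is a common Horovod convention rather than a requirement.

```python
def scaled_lr(base_lr, world_size):
    """Scale the single-node learning rate by the number of workers.

    A common convention in Horovod examples, not a hard rule.
    """
    return base_lr * world_size


def train_hvd(base_lr=0.01, batch_size=64, epochs=1):
    """Hypothetical training function executed on every Horovod worker."""
    # Import inside the function so the modules are resolved on the workers.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    import horovod.torch as hvd

    # 1. Initialize Horovod and pin each process to one GPU, if available.
    hvd.init()
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if use_cuda else "cpu")

    # 2. Shard the data so that each worker sees a distinct subset.
    #    Random tensors stand in for a real dataset here (assumption).
    dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    # 3. Wrap the optimizer so gradients are averaged across workers.
    model = nn.Linear(20, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # 4. Start every worker from the same initial weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    model.train()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()


def run_distributed(num_workers=2):
    """Launch train_hvd on the cluster; call this from a Databricks notebook."""
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_workers)
    hr.run(train_hvd, base_lr=0.01)
```

In a notebook attached to a suitably configured cluster, calling `run_distributed(2)` would start the training function on two workers. Nothing runs at import time, so the definitions can be inspected or adapted before launching.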
Databricks Runtime 5.0 ML (Beta), the minimum required runtime for HorovodRunner, includes Horovod. However, to add PyTorch support you need to reinstall Horovod. The PyTorch Init Script notebook creates an init script named pytorch-gpu-init.sh that installs the required libraries. If you run Databricks Runtime 5.1 ML (Beta) or above, you do not need to create the PyTorch init script or configure your cluster with it.
Before running the HorovodRunner PyTorch MNIST Example notebook, you must:
- Prepare distributed data loading and model checkpointing.
- Prepare data for distributed training.
- Configure your FUSE_MOUNT_LOCATION in the notebook.
- If you are running Databricks Runtime 5.0 ML (Beta), do the following:
  - Run the PyTorch Init Script notebook.
  - Configure a GPU cluster with the