Distributed Training

When possible, Azure Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine.

Note

Accelerated networking is not available on the GPU VMs supported by Azure Databricks. Therefore, we do not recommend running distributed DL training on a multi-node GPU cluster. You can use a single multi-GPU node or a multi-node CPU cluster for distributed DL training.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
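The typical changes to a single-GPU script follow a common Horovod pattern: initialize Horovod, pin each process to one GPU, scale the learning rate, wrap the optimizer, and broadcast initial state from rank 0. The sketch below illustrates this pattern for TensorFlow 2.x with the tf.keras API; `build_model()` and `build_dataset()` are placeholders for your existing code, not part of the Horovod API.

```python
# Minimal sketch: adapting a single-GPU Keras script to Horovod.
# build_model() and build_dataset() are hypothetical placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each training process to a single GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_model()        # placeholder: your existing model definition
dataset = build_dataset()    # placeholder: your existing input pipeline

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across processes.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 prints progress to avoid duplicated output.
model.fit(dataset, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```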

Azure Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.
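As a rough illustration of the HorovodRunner workflow, the sketch below wraps a training function and launches it on the cluster. It assumes the `sparkdl` package available in Databricks ML runtimes; the `train` function body is a placeholder standing in for Horovod training code like the example above.

```python
# Minimal sketch: launching distributed training with HorovodRunner.
# The body of train() is a placeholder for your Horovod training code.
from sparkdl import HorovodRunner

def train(learning_rate=0.001):
    # hvd.init(), model definition, model.fit(), etc. go here.
    ...

# np=2 launches two parallel training processes on the cluster;
# a negative np (e.g. np=-1) runs locally on the driver for debugging.
hr = HorovodRunner(np=2)
hr.run(train, learning_rate=0.001)
```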

These topics contain in-depth discussions of HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach: