Distributed deep learning (DL) involves training a deep neural network in parallel across multiple machines. When possible, Azure Databricks recommends that you train neural networks on a single machine; distributed training code is more complex than single-machine training and slower due to communication overhead. However, you should consider distributed training if your model or your data are too large to fit in memory on a single machine.
Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
Azure Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.
The following sections introduce HorovodRunner and our recommended workflow for distributed training.
In this topic:
HorovodRunner provides the ability to launch Horovod training jobs as Spark jobs. The
HorovodRunner API supports the following methods:
- Create an instance of HorovodRunner.
run(self, main, **kwargs)
- Run a Horovod training job invoking
main(**kwargs). Both the main function and the keyword arguments are serialized using cloudpickle and distributed to cluster workers.
For details, see the HorovodRunner API documentation.
The general approach to developing a distributed training program using
- Create a
HorovodRunnerinstance initialized with the number of nodes.
- Define a Horovod training method according to the methods described in Horovod usage.
- Pass the training method to the
hr = HorovodRunner(np=2) def train(): hvd.init() hr.run(train)
A typical workflow for distributed deep learning involves two main phases: data preparation and training. See the following topics for recommended approaches, options, and example notebooks.
Data preparation involves allocating shared storage for data loading and model checkpointing and preparing data for your selected training algorithms. The following topics discuss each step:
The following topics contain in-depth discussions of the two approaches that Azure Databricks supports for doing DDL training, and example notebooks demonstrating each approach: