Distributed Deep Learning

Distributed deep learning (DL) involves training and inferencing a deep neural network in parallel across multiple machines. When possible, Azure Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine.

Distributed training

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.

Azure Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.

The following sections introduce HorovodRunner and our recommended workflow for distributed training.

Distributed training with HorovodRunner

HorovodRunner provides the ability to launch Horovod training jobs as Spark jobs. The HorovodRunner API supports the following methods:

init(self, np)
Create an instance of HorovodRunner.
run(self, main, **kwargs)
Run a Horovod training job invoking main(**kwargs). Both the main function and the keyword arguments are serialized using cloudpickle and distributed to cluster workers.

For details, see the HorovodRunner API documentation.

The general approach to developing a distributed training program using HorovodRunner is:

  1. Create a HorovodRunner instance initialized with the number of nodes.
  2. Define a Horovod training method according to the methods described in Horovod usage.
  3. Pass the training method to the HorovodRunner instance.

For example:

hr = HorovodRunner(np=2)

def train():


Distributed training workflow

A typical workflow for distributed deep learning training involves two main phases: data preparation and training. The following topics contain recommended approaches, options, and example notebooks.

Data preparation

Data preparation involves allocating shared storage for data loading and model checkpointing and preparing data for your selected training algorithms. These topics discuss each step:

Distributed training

These topics contain in-depth discussions of the two approaches that Azure Databricks supports for doing DDL training, and example notebooks demonstrating each approach:

Distributed inference

After training is completed, trained networks are deployed for inference. Azure Databricks recommends loading data into a Spark DataFrame, applying the deep learning model in Pandas UDFs, and writing predictions out using Spark. The following topics provide an introduction to doing distributed DL model inference on Azure Databricks.

The first section gives a high-level overview of the workflow to do model inference. The next section provides detailed examples of this workflow using TensorFlow, Keras, and PyTorch. The final section provides some tips for debugging and performance tuning model inference.