Deep Learning Guide

Azure Databricks provides an environment that makes it easy to build, train, and deploy deep learning (DL) models at scale. Many deep learning libraries are available in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. For deep learning libraries not included in Databricks Runtime ML, you can either install them as Databricks libraries or use init scripts to install them on clusters when the clusters are created.

Graphics processing units (GPUs) can accelerate deep learning tasks. For information about creating GPU-enabled Azure Databricks clusters, see GPU-enabled Clusters. Databricks Runtime includes pre-installed GPU hardware drivers and NVIDIA libraries such as CUDA.

A typical DL workflow has three phases: data preparation, training, and inference. This section gives guidelines for deep learning on Azure Databricks.

Data Preparation

Data preparation involves two steps: allocating shared storage for data loading and model checkpointing, and preparing data for your selected training algorithms. These topics discuss each step:

Single Node Training

Azure Databricks supports several popular deep learning libraries. Databricks Runtime ML contains many of these libraries, including TensorFlow, PyTorch, Keras, and XGBoost. We provide instructions and accompanying example notebooks to help you get started with training on a single node.

Distributed Training

When possible, Azure Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine.
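As a rough way to judge whether data is "too large to fit in memory", you can estimate its in-memory size before choosing a cluster. The sketch below is a back-of-the-envelope calculation only: the helper names, the float32 assumption, and the 50% usable-memory headroom are illustrative assumptions, not Azure Databricks guidance.

```python
GIB = 1024 ** 3


def dataset_bytes(num_rows, values_per_row, bytes_per_value=4):
    """Estimate the size of a dense numeric dataset (float32 by default)."""
    return num_rows * values_per_row * bytes_per_value


def fits_on_single_machine(required_bytes, machine_memory_bytes, usable_fraction=0.5):
    """Return True if the data fits in the assumed usable share of memory.

    usable_fraction is an assumed headroom factor that leaves room for the
    model, activations, and framework overhead.
    """
    return required_bytes <= machine_memory_bytes * usable_fraction


# Hypothetical example: one million 224x224 RGB images stored as float32.
images = dataset_bytes(1_000_000, 224 * 224 * 3)
print(f"{images / GIB:.1f} GiB needed")
print("fits on a 256 GiB node:", fits_on_single_machine(images, 256 * GIB))
```

In this hypothetical case the dataset needs roughly 561 GiB, so it does not fit in memory on a 256 GiB node. Streaming the data from storage in batches can still keep training on a single machine, so an estimate like this is one input to the decision rather than a rule.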

Note

Accelerated networking is not available on the GPU VMs supported by Azure Databricks. Therefore, we do not recommend running distributed DL training on a multi-node GPU cluster. You can use a single multi-GPU node or a multi-node CPU cluster for distributed DL training.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.

Azure Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.
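The following is a minimal sketch of what migrating a single-machine Keras program to HorovodRunner can look like. It assumes a Databricks Runtime ML cluster where `sparkdl`, `horovod`, and TensorFlow 2.x are installed. The toy model, the random data, and the helper names (`scaled_learning_rate`, `train_hvd`, `launch`) are illustrative assumptions, not part of any Databricks API.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Scale the learning rate with the worker count (a common Horovod convention)."""
    return base_lr * num_workers


def train_hvd(base_lr=0.001, epochs=1):
    """Training function that HorovodRunner executes once per worker process."""
    # Imports live inside the function so that each worker process loads them itself.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to one GPU, if GPUs are present.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Toy data for illustration; a real job would shard its data by hvd.rank().
    x = np.random.rand(1024, 10).astype("float32")
    y = x.sum(axis=1, keepdims=True)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])

    # Wrap the optimizer so that gradients are averaged across workers.
    opt = tf.keras.optimizers.SGD(scaled_learning_rate(base_lr, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss="mse")

    model.fit(
        x, y,
        batch_size=32,
        epochs=epochs,
        # Start every worker from the same initial weights.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,
    )


def launch(num_processes=2):
    """Run train_hvd on num_processes worker processes (requires Databricks)."""
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_processes)
    hr.run(train_hvd, base_lr=0.001, epochs=1)
```

Compared with a single-machine program, the changes are confined to `hvd.init()`, the GPU pinning, the wrapped optimizer, and the broadcast callback; the model definition and the `fit` call stay the same.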

These topics contain in-depth discussions of HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach:

Model Inference

After training is completed, trained networks are deployed for inference. Azure Databricks recommends loading data into a Spark DataFrame, applying the deep learning model in Pandas UDFs, and writing predictions out using Spark. The following topics provide an introduction to model inference on Azure Databricks.

The first section gives a high-level overview of the model inference workflow. The next section provides detailed examples of this workflow using TensorFlow, Keras, and PyTorch. The final section provides some tips for debugging and tuning model inference.
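The recommended pattern (load into a DataFrame, score in a Pandas UDF, write with Spark) can be sketched as follows. This is a minimal illustration under stated assumptions: `load_model` is a hypothetical stand-in for loading a real trained network, the input is assumed to be Parquet with an array column named `features`, and the paths are placeholders supplied by the caller.

```python
def load_model():
    """Hypothetical stand-in for loading a trained network from storage."""
    weights = [0.5, -0.25, 2.0]

    def model(rows):
        # One dot product per row; a real model would run a forward pass here.
        return [float(sum(w * x for w, x in zip(weights, row))) for row in rows]

    return model


def predict_batch(features):
    """Score one batch; `features` is a pandas Series of feature arrays."""
    import pandas as pd  # imported here so it is loaded on the Spark workers

    # Loading the model per batch keeps the sketch simple; caching it per
    # worker is a common optimization.
    model = load_model()
    return pd.Series(model(features.tolist()))


def score(spark, input_path, output_path):
    """Load data, apply the model in a Pandas UDF, and write predictions."""
    from pyspark.sql.functions import col, pandas_udf

    predict_udf = pandas_udf(predict_batch, returnType="double")

    df = spark.read.parquet(input_path)  # assumed input format
    scored = df.withColumn("prediction", predict_udf(col("features")))
    scored.write.mode("overwrite").parquet(output_path)
```

Because a Pandas UDF receives a whole batch of rows at once, the model can score the batch in one call rather than being invoked once per row.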