Distributed Deep Learning

Distributed deep learning involves training a deep neural network in parallel across multiple machines. A typical workflow has three components that run concurrently: model training, model evaluation (on a held-out validation set), and monitoring.

When possible, we recommend training neural networks on a single machine; distributed training code is more complex than single-machine code, and inter-machine communication adds overhead to every training step. However, you should consider distributed training if your model or your data set is too large to fit in memory on a single machine.

For more information about distributed training, see the guides below, which explain how to run TensorFlow- and Keras-based distributed deep learning workflows on Azure Databricks using the following frameworks:

  • HorovodEstimator: Supports TensorFlow workloads
  • TensorFlowOnSpark: Supports multi-machine TensorFlow workloads
  • dist-keras: Supports multi-machine Keras workloads

We recommend HorovodEstimator for TensorFlow workloads due to its ease of use in multi-GPU contexts.
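
HorovodEstimator exposes Horovod's ring-allreduce training through a Spark MLlib-style estimator interface. As a rough, minimal sketch of the data-parallel pattern it builds on, the following standalone example uses the open-source Horovod Keras API directly; it is not the HorovodEstimator API itself, and the toy model, synthetic data, and hyperparameters are placeholders for a real workload.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; one process is launched per GPU/worker.
hvd.init()

# Pin each local process to a single GPU (no-op on CPU-only machines).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy model and synthetic data stand in for a real workload.
x = np.random.rand(1024, 20).astype("float32")
y = np.random.randint(0, 2, size=(1024,))
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Scale the learning rate by the worker count, then wrap the optimizer so
# gradients are averaged across workers via ring-allreduce at each step.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=opt, metrics=["accuracy"])

callbacks = [
    # Broadcast rank 0's initial weights so every worker starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Only rank 0 writes checkpoints, so workers don't clobber each other's files.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("checkpoint-{epoch}.h5"))

model.fit(x, y, batch_size=64, epochs=3, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with Horovod's launcher (for example, `horovodrun -np 4 python train.py`), each process trains on its own portion of the data while Horovod averages gradients across all workers after every step.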

Distributed deep learning guides