Machine Learning

This is the main machine learning (ML) guide. It provides an overview of ML capabilities in Azure Databricks and Apache Spark.

See Deep Learning for deep learning libraries and integrations, and see GraphFrames and GraphX for GraphFrames and other graph analytics libraries.

MLflow

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions:

  • Tracking experiments to record and compare parameters and results (MLflow Tracking).
  • Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects).
  • Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).

MLflow is in Beta. For information about MLflow, see the MLflow documentation.

Important

When running on Azure Databricks with Databricks Runtime 5.0 and above, you must specify a URI to an MLflow tracking server using mlflow.set_tracking_uri.
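As a minimal sketch of the requirement above, the following helper picks a tracking URI and passes it to mlflow.set_tracking_uri. The fallback URI http://localhost:5000 and the helper names resolve_tracking_uri and configure_tracking are assumptions made for illustration; MLFLOW_TRACKING_URI is the environment variable that MLflow itself reads.

```python
import os

# Hypothetical fallback; replace with the URI of your own tracking server.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(default=DEFAULT_TRACKING_URI):
    """Return the URI from MLFLOW_TRACKING_URI if it is set, else the default."""
    return os.environ.get("MLFLOW_TRACKING_URI") or default


def configure_tracking(uri=None):
    """Point the MLflow client at a tracking server and return the URI used."""
    # Imported here so resolve_tracking_uri works without MLflow installed.
    import mlflow

    uri = uri or resolve_tracking_uri()
    mlflow.set_tracking_uri(uri)
    return uri
```

Calling configure_tracking() once at the top of a notebook, before any run is started, should be enough for subsequent logging calls to reach that server.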

The following topics provide an introduction to using MLflow on Azure Databricks. The first topic provides an MLflow Quick Start on Azure Databricks, which shows how to train ElasticNet models on a diabetes dataset and log the training parameters, metrics, and trained model to an MLflow tracking server. The second topic shows how to fit a neural network on MNIST handwritten digit recognition data using PyTorch, log results to an MLflow tracking server, and view the results in the MLflow UI and TensorBoard.
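To give a feel for the kind of workflow the Quick Start covers, here is a rough sketch that trains an ElasticNet model on the diabetes dataset bundled with scikit-learn and logs the parameters, metrics, and model to MLflow. It is not the Quick Start notebook itself: the hyperparameter values, the train/test split, and the function names train_elasticnet and log_run are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_elasticnet(alpha=0.05, l1_ratio=0.5, seed=42):
    """Fit ElasticNet on the diabetes dataset; return model, params, metrics."""
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=seed)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    params = {"alpha": alpha, "l1_ratio": l1_ratio}
    metrics = {
        "rmse": mean_squared_error(y_test, predictions) ** 0.5,
        "r2": r2_score(y_test, predictions),
    }
    return model, params, metrics


def log_run(model, params, metrics):
    """Record one training run on the configured MLflow tracking server."""
    # Imported here so the training function can be used without MLflow.
    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        for name, value in params.items():
            mlflow.log_param(name, value)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
        # "model" is the artifact location of the model within the run.
        mlflow.sklearn.log_model(model, "model")
```

A typical call is log_run(*train_elasticnet(alpha=0.1)), repeated with different alpha and l1_ratio values so that the runs can be compared in the MLflow UI.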

Databricks Runtime ML

To provide a ready-to-go environment for machine learning and data science, Azure Databricks has developed Databricks Runtime ML, a machine learning runtime that contains multiple popular libraries, including TensorFlow, Keras, and XGBoost. It also supports distributed TensorFlow training using Horovod. Databricks Runtime ML frees you from having to install and configure these libraries on your Spark cluster yourself.

Apache Spark MLlib

Apache Spark MLlib is the scalable machine learning library of Apache Spark. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. Spark MLlib integrates with other Spark components such as Spark SQL, Spark Streaming, and DataFrames, and it is installed in the Azure Databricks runtime.
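The DataFrame integration can be illustrated with a small sketch that assembles feature columns and fits a logistic regression in a single MLlib pipeline. The toy rows, the column names, and the function name fit_example_pipeline are hypothetical; the sketch assumes a running SparkSession is passed in, as is the case for the predefined spark variable in a notebook.

```python
# Toy rows: (label, feature_a, feature_b). Hypothetical data for illustration.
FEATURE_COLUMNS = ["feature_a", "feature_b"]
TRAINING_ROWS = [
    (0.0, 0.1, 1.2),
    (0.0, 0.3, 0.9),
    (1.0, 2.1, 0.2),
    (1.0, 1.8, 0.1),
]


def fit_example_pipeline(spark):
    """Fit a two-stage MLlib pipeline on a small DataFrame.

    Returns the fitted PipelineModel and the training DataFrame.
    """
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    df = spark.createDataFrame(TRAINING_ROWS, ["label"] + FEATURE_COLUMNS)
    # Stage 1: combine the numeric columns into a single vector column.
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    # Stage 2: fit a classifier on the assembled vector.
    classifier = LogisticRegression(
        featuresCol="features", labelCol="label", maxIter=10
    )
    model = Pipeline(stages=[assembler, classifier]).fit(df)
    return model, df
```

After model, df = fit_example_pipeline(spark), calling model.transform(df) returns a DataFrame with the original columns plus the prediction columns, which can then be queried like any other DataFrame.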

Azure Databricks recommends the following Apache Spark MLlib guides:

For using MLlib in R, refer to the R machine learning documentation.

The following topics and notebooks demonstrate how to use various Spark MLlib features in Azure Databricks.

For Azure Databricks support for visualizing machine learning algorithms, see Machine learning visualizations.

ML Model Export

After you build and test ML models, the next step is to put the trained models into production. A typical productionization workflow in Azure Databricks involves three steps:

  1. Fit an ML model using Apache Spark MLlib.
  2. Export the model.
  3. Import the model into an external system.

There are two ways to export and import models and full ML pipelines from Apache Spark: MLeap and Databricks ML Model Export.

Azure Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.

Azure Databricks also supports Databricks ML Model Export for exporting models and ML pipelines. The exported models and pipelines can be imported into other platforms, both Spark and non-Spark, for scoring and prediction.