Machine Learning
This topic provides an overview of machine learning capabilities in Azure Databricks.
Databricks Runtime for Machine Learning
Azure Databricks provides Databricks Runtime for Machine Learning (Databricks Runtime ML), a machine learning runtime that contains multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also supports distributed training using Horovod. Databricks Runtime ML provides a ready-to-go environment for machine learning and data science, freeing you from having to install and configure these libraries on your cluster.
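To confirm which of these libraries a particular cluster actually provides, a notebook cell along the following lines can help. This is a minimal sketch that uses only the Python standard library. The import names (tensorflow, torch, keras, xgboost, horovod) are the usual module names for these projects and are an assumption here, as is the helper name available_libraries.

```python
import importlib.util

# Libraries this page names as part of Databricks Runtime ML.
# The module names on the right are assumed, not taken from this page.
ML_LIBRARIES = {
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Keras": "keras",
    "XGBoost": "xgboost",
    "Horovod": "horovod",
}


def available_libraries(modules=ML_LIBRARIES):
    """Map each display name to True if its module can be found, else False."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


# Report the status of each library without importing it.
for name, present in available_libraries().items():
    print(f"{name}: {'available' if present else 'missing'}")
```

Because find_spec only locates a module and does not import it, the check stays fast even for large frameworks.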
Apache Spark MLlib
Apache Spark MLlib is the Apache Spark machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. Azure Databricks recommends the following Apache Spark MLlib guides:
For using MLlib from R, refer to the R machine learning documentation.
For Azure Databricks support for visualizing machine learning algorithms, see Machine learning visualizations.
The following topics and notebooks demonstrate how to use various Spark MLlib features in Azure Databricks.
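As a rough illustration of the classification utilities mentioned above, the sketch below fits a two-stage MLlib pipeline (feature assembly followed by logistic regression) on a few hand-made rows. The function name train_demo_classifier, the column names, and the toy data are all hypothetical. The function expects an existing SparkSession, such as the spark variable that a notebook typically predefines.

```python
def train_demo_classifier(spark):
    """Fit a tiny MLlib pipeline on toy data; return (fitted model, training DataFrame)."""
    # Imported inside the function so that pyspark is only required when it is called.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Hypothetical toy data: two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (0.1, 1.2, 0.0), (2.2, 0.9, 1.0)],
        ["f1", "f2", "label"],
    )

    # Stage 1 packs the feature columns into one vector column.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    # Stage 2 is the classifier that reads that vector column.
    lr = LogisticRegression(maxIter=10, featuresCol="features", labelCol="label")

    model = Pipeline(stages=[assembler, lr]).fit(train)
    return model, train


# Example use in a notebook where `spark` already exists:
#   model, train = train_demo_classifier(spark)
#   model.transform(train).select("label", "prediction").show()
```

Wrapping the stages in a Pipeline keeps feature preparation and the estimator together, so the same transformations are applied at training and at prediction time.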
Exporting and Importing ML Models
After you develop ML models, the next step is to put the trained models into production. A typical productionization workflow in Azure Databricks involves the following steps:
- Export a trained model.
- Import the model into an external system.
Azure Databricks supports two methods to export and import models and full ML pipelines from Apache Spark: MLeap and Databricks ML Model Export.
MLeap, which Azure Databricks recommends, is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.
You can also use Databricks ML Model Export to export models and ML pipelines. You can import the exported models and pipelines into other platforms, both Spark and non-Spark, for scoring and prediction.
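The export step with MLeap can be sketched as follows, assuming the MLeap Python package (mleap) and its Spark support are installed on the cluster. The helper names bundle_uri_for, export_with_mleap, and import_with_mleap and the default path /tmp/model.zip are hypothetical. The serializeToBundle and deserializeFromBundle calls are the methods that MLeap's PySpark support adds to Spark ML objects. Reloading the bundle in Spark, as shown here, is only a round-trip check; an external system would normally load the bundle with the MLeap runtime instead.

```python
def bundle_uri_for(path):
    """Build the jar:file: URI that MLeap uses for a zipped bundle (hypothetical helper)."""
    if not path.startswith("/") or not path.endswith(".zip"):
        raise ValueError("expected an absolute path ending in .zip")
    return "jar:file:" + path


def export_with_mleap(model, sample_df, bundle_path="/tmp/model.zip"):
    """Serialize a fitted Spark ML pipeline model to an MLeap bundle; return its URI."""
    # These imports attach serializeToBundle to Spark ML transformers (assumes mleap is installed).
    import mleap.pyspark  # noqa: F401
    from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

    bundle_uri = bundle_uri_for(bundle_path)
    # MLeap needs a transformed DataFrame to record the pipeline's output schema.
    model.serializeToBundle(bundle_uri, model.transform(sample_df))
    return bundle_uri


def import_with_mleap(bundle_uri):
    """Load an MLeap bundle back into Spark as a PipelineModel (round-trip check)."""
    import mleap.pyspark  # noqa: F401
    from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401
    from pyspark.ml import PipelineModel

    return PipelineModel.deserializeFromBundle(bundle_uri)
```

The path check is a convenience of this sketch rather than an MLeap requirement; it exists only to catch relative paths early.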
Third-Party Libraries
This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Azure Databricks.
Advanced Topics
For guides on advanced topics in machine learning, see:
- MLflow Guide for how to manage the machine learning lifecycle.
- Deep Learning Guide for deep learning libraries and workflows.
- Graph Analysis Guide for GraphFrames and other graph analytics libraries.