MLflow Projects: Run on Azure Databricks

Note

This topic describes features that are in Private Preview.

An MLflow Project is a format for packaging data science code in a reusable and reproducible way. The MLflow Projects component includes an API and command-line tools for running projects, which also integrate with the Tracking component to automatically record the parameters and git commit of your source code for reproducibility. This topic describes how to run an MLflow project remotely on Azure Databricks clusters using the MLflow CLI, which makes it easy to vertically scale your data science code.

To get started with MLflow projects, check out the MLflow App Library, which contains a repository of ready-to-run projects aimed at making it easy to include ML functionality into your code.

Run an MLflow project

To run an MLflow project on an Azure Databricks cluster in the default workspace, use the command:

mlflow run <uri> -m databricks --cluster-spec <json-cluster-spec>

<uri> is a Git repository URI for an MLflow project and <json-cluster-spec> is a JSON document containing a cluster specification.

An example cluster specification is:

{
  "spark_version": "5.0.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}
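Saved to a file, the specification above can be passed to the CLI with --cluster-spec. A minimal sketch, in which the file name and the JSON sanity check are illustrative:

```shell
# Write the cluster specification to a file (the name cluster-spec.json is illustrative).
cat > cluster-spec.json <<'EOF'
{
  "spark_version": "5.0.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}
EOF

# Optional sanity check: confirm the file is valid JSON before passing it to mlflow run.
python3 -c "import json; json.load(open('cluster-spec.json'))" && echo "cluster-spec.json is valid"
```

You would then reference this file in the run command, for example: mlflow run <uri> -m databricks --cluster-spec cluster-spec.json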

Important

If you are using Databricks Runtime 4.3 or lower, you must specify the following spark_conf in your cluster specification:

{
  "spark_version": "5.0.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2",
  "spark_conf": {"spark.databricks.chauffeur.shellCommandTask.enabled": "true"}
}

You can pass Git credentials using the git-username and git-password arguments or the MLFLOW_GIT_USERNAME and MLFLOW_GIT_PASSWORD environment variables.
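For private repositories, either approach can be scripted as follows; the username and token values are placeholders, not working credentials:

```shell
# Option 1: pass credentials as CLI arguments (shown commented out; <uri> is a placeholder).
# mlflow run <uri> -m databricks --cluster-spec cluster-spec.json \
#   --git-username my-user --git-password my-token

# Option 2: set environment variables that the MLflow CLI reads when fetching the project.
export MLFLOW_GIT_USERNAME=my-user
export MLFLOW_GIT_PASSWORD=my-token
```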

To run against an Azure Databricks cluster in a non-default workspace, specify databricks://<profile>, where <profile> is a Databricks CLI profile, in the MLFLOW_TRACKING_URI environment variable.
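As a sketch, assuming a CLI profile named my-workspace has already been configured:

```shell
# Configure a named Databricks CLI profile once (interactive; shown commented out).
# databricks configure --token --profile my-workspace

# Point MLflow at that workspace's tracking server before running the project.
export MLFLOW_TRACKING_URI=databricks://my-workspace
```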

Note

The API for running projects, mlflow.start_run(), accepts a source_name argument. This argument is used if you run a project from a file, but is ignored if you run from an Azure Databricks notebook or using the CLI command mlflow run.

Example

This example shows how to run the MLflow tutorial project on an Azure Databricks cluster, view the job run output, and view the run in the MLflow UI.

Run the MLflow tutorial project

The following command runs the MLflow tutorial project, which trains a model on a wine quality dataset, and records the training parameters and metrics in MLflow experiment 49 on the workspace defined in the CLI profile mlflow:

export MLFLOW_TRACKING_URI=databricks://mlflow
mlflow run git@github.com:mlflow/mlflow.git#examples/sklearn_elasticnet_wine -P alpha=0.1 --experiment-id 49 -m databricks -c cluster-spec.json
=== Fetching project from git@github.com:mlflow/mlflow.git#examples/sklearn_elasticnet_wine into /var/folders/kc/l20y4txd5w3_xrdhw6cnz1080000gp/T/tmp6_rk_mme ===
=== Uploading project to DBFS path /dbfs/mlflow-experiments/49/projects-code/db7ec766f11c6d1fcdb7bf64e7429b4a355712e1a14b5039bc06717539334b1b.tar.gz ===
=== Finished uploading project to /dbfs/mlflow-experiments/49/projects-code/db7ec766f11c6d1fcdb7bf64e7429b4a355712e1a14b5039bc06717539334b1b.tar.gz ===
=== Running entry point main of project git@github.com:mlflow/mlflow.git#examples/sklearn_elasticnet_wine on Databricks ===
=== Launched MLflow run as Databricks job run with ID 2372743. Getting run status page URL... ===
=== Check the run's status at https://<databricks-instance>#job/11641/run/1 ===

View the Azure Databricks job run

The Azure Databricks job run output at https://<databricks-instance>#job/11641/run/1 is:

[Image: Azure Databricks job run output (mlflow-project-run-db.png)]

View the experiment in the MLflow UI

To view the experiment in the MLflow UI, go to https://<databricks-instance>/mlflow/#/experiments/49. The output from running the job is:

[Image: Experiment in the MLflow UI (mlflow-project-run-mlflow.png)]

Display MLflow run information

To display details for an MLflow run, click the link in the Date column.

[Image: MLflow run details page (mlflow-run.png)]

You can navigate back to the Azure Databricks job run page by clicking the Logs link in the Job Output field.