Reproducible Runs with MLflow Projects

An MLflow Project is a format for packaging data science code in a reusable and reproducible way. The MLflow Projects component includes an API and command-line tools for running projects, which also integrate with the Tracking component to automatically record the parameters and Git commit of your source code for reproducibility. This topic describes the MLflow Project format and how to run an MLflow Project remotely on Azure Databricks clusters using the MLflow CLI, which makes it easy to vertically scale your data science code.

MLflow project format

Any local directory or Git repository can be treated as an MLflow project. The following conventions define a project:

  • The project’s name is the name of the directory.
  • The Conda environment is specified in conda.yaml, if present. If no conda.yaml file is present, MLflow uses a Conda environment containing only Python (specifically, the latest Python available to Conda) when running the project.
  • Any .py or .sh file in the project can be an entry point, with no parameters explicitly declared. When you run such a command with a set of parameters, MLflow passes each parameter on the command line using --key <value> syntax, as shown in the sketch after this list.
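
For illustration, a .py entry point can pick up these parameters with standard argument parsing. The following minimal sketch of a train.py assumes two hypothetical parameters, alpha and l1_ratio (they are not part of any MLflow convention); running the project with -P alpha=0.5 would invoke the script with --alpha 0.5 appended to its command line:

import argparse

import mlflow

parser = argparse.ArgumentParser()
# MLflow passes each parameter as --key <value> on the script's command line
parser.add_argument("--alpha", type=float, default=0.5)      # hypothetical parameter
parser.add_argument("--l1_ratio", type=float, default=0.1)   # hypothetical parameter
args = parser.parse_args()

with mlflow.start_run():
    # Log the received parameters so the run is reproducible
    mlflow.log_param("alpha", args.alpha)
    mlflow.log_param("l1_ratio", args.l1_ratio)
    # ... train the model and log metrics here ...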

You specify more options by adding an MLproject file, which is a text file in YAML syntax. An example MLproject file looks like this:

name: My Project

conda_env: my_env.yaml

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

Run an MLflow project

To run an MLflow project on an Azure Databricks cluster in the default workspace, use the command:

mlflow run <uri> -m databricks --cluster-spec <json-cluster-spec>

where <uri> is a Git repository URI or a path to a folder containing an MLflow project, and <json-cluster-spec> is a JSON document containing a cluster specification. The Git URI should be of the form https://github.com/<repo>#<project-folder>.

An example cluster specification is:

{
  "spark_version": "5.2.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}
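
You can embed the specification directly in the command or save it to a file and pass the file path to --cluster-spec instead, as the example later in this topic does with cluster-spec.json:

mlflow run <uri> -m databricks --cluster-spec cluster-spec.json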

Important

If you are using Databricks Runtime 4.3 or lower, you must specify the following spark_conf in your cluster specification:

{
  "spark_version": "4.3.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2",
  "spark_conf": {"spark.databricks.chauffeur.shellCommandTask.enabled": "true"}
}

Example

This example shows how to create an experiment, run the MLflow tutorial project on an Azure Databricks cluster, view the job run output, and view the run in the experiment.

Prerequisites

The Databricks CLI authentication mechanism is required to run jobs on an Azure Databricks cluster. Install and configure the Databricks CLI.
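
For example, token-based authentication can be set up with the following command, which prompts you for your workspace URL and a personal access token:

databricks configure --token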

Step 1: Create an experiment

  1. In the Workspace, select Create > Experiment.

  2. In the Name field, enter Tutorial.

  3. Click Create. Note the Experiment ID. In this example, it is 14622565.

    [Image: the experiment page, showing the Experiment ID]

Step 2: Run the MLflow tutorial project

The following steps set up the MLFLOW_TRACKING_URI environment variable and run the project, recording the training parameters, metrics, and the trained model to the experiment noted in the preceding step:

  1. Set the MLFLOW_TRACKING_URI environment variable to point to the Azure Databricks workspace.

    export MLFLOW_TRACKING_URI=databricks
    
  2. Run the MLflow tutorial project, which trains a wine quality model. Save the example cluster specification shown earlier as cluster-spec.json in your working directory, and replace <experiment-id> with the Experiment ID you noted in the preceding step.

    mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -m databricks -c cluster-spec.json --experiment-id <experiment-id>
    
    === Fetching project from https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine into /var/folders/kc/l20y4txd5w3_xrdhw6cnz1080000gp/T/tmpbct_5g8u ===
    === Uploading project to DBFS path /dbfs/mlflow-experiments/<experiment-id>/projects-code/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===
    === Finished uploading project to /dbfs/mlflow-experiments/<experiment-id>/projects-code/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===
    === Running entry point main of project https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
    === Launched MLflow run as Databricks job run with ID 8651121. Getting run status page URL... ===
    === Check the run's status at https://<databricks-instance>#job/<job-id>/run/1 ===
    
  3. Copy the URL https://<databricks-instance>#job/<job-id>/run/1 in the last line of the MLflow run output.
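
The MLflow CLI also provides a runs list command that you can use to verify the run was recorded against the experiment:

mlflow runs list --experiment-id <experiment-id>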

Step 3: View the Azure Databricks job run

  1. Open the URL you copied in the preceding step in a browser to view the Azure Databricks job run output:

    [Image: the Azure Databricks job run output]

Step 4: View the experiment and MLflow run details

  1. Navigate to the experiment in your Azure Databricks workspace.

    [Image: the experiment in the workspace]
  2. Click the experiment.

    [Image: the experiment runs table]
  3. To display run details, click a link in the Date column.

    [Image: the MLflow run details page]

You can navigate back to the Azure Databricks job run page by clicking the Logs link in the Job Output field.

Resources

For some example MLflow projects, see the MLflow App Library, a repository of ready-to-run projects designed to make it easy to add ML functionality to your code.