Overview of Databricks Runtime with Conda

Beta

This is a Beta release. It is intended for experimental use cases and not for production workloads.

Databricks Runtime with Conda is an Azure Databricks runtime based on Conda environments instead of Python virtual environments (virtualenvs). Databricks Runtime with Conda provides an updated and optimized list of default packages and a flexible Python environment for advanced users who require maximum control over packages and environments.

What is Conda?

Conda is an open source package and environment management system. As a package manager, you use Conda to install Python packages from your desired channels (or repositories); Databricks Runtime with Conda uses the Anaconda repository. As an environment manager, you use Conda to easily create, save, load, and switch between Python environments. Conda environments are compatible with PyPI packages.

What is in Databricks Runtime with Conda?

Databricks Runtime with Conda is available with two installed Conda environments: databricks-standard and databricks-minimal.

  • The databricks-standard environment includes updated versions of many popular Python packages. It is intended as a drop-in replacement for existing notebooks that run on Databricks Runtime and is the default environment for Databricks Runtime with Conda.
  • The databricks-minimal environment contains only the minimum set of packages required for PySpark and Databricks Python notebook functionality. It is ideal if you want to customize the runtime with your own Python packages.

The packages included in each environment are listed in the Databricks Runtime Release Notes.

Manage environments

One of the key advantages of the Conda package management system is its first-class support for environments.

Root environments

Databricks Runtime with Conda is available with two default installed Conda environments: databricks-standard and databricks-minimal. We refer to these as root environments.

Select a root environment

When you launch a cluster running Databricks Runtime with Conda from the cluster UI, you can choose which of the two environments to activate by setting the DATABRICKS_ROOT_CONDA_ENV environment variable on the cluster. Acceptable values are databricks-standard (the default) and databricks-minimal.

You can also launch clusters using the REST API. Here is an example request that launches a cluster with the databricks-minimal environment.

{
  "cluster_name": "my-cluster",
  "spark_version": "5.4.x-conda-scala2.11",
  "node_type_id": "Standard_D3_v2",
  "spark_env_vars": {
    "DATABRICKS_ROOT_CONDA_ENV": "databricks-minimal"
  },
  "num_workers": 10
}
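For reference, the same request can be submitted programmatically. The following is a minimal sketch using the Clusters API from Python; the workspace URL and personal access token are placeholders you would replace with your own values.

import requests

# Placeholders: replace with your workspace URL and a personal access token.
DOMAIN = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "my-cluster",
    "spark_version": "5.4.x-conda-scala2.11",
    "node_type_id": "Standard_D3_v2",
    "spark_env_vars": {"DATABRICKS_ROOT_CONDA_ENV": "databricks-minimal"},
    "num_workers": 10,
}

# Submit the cluster specification to the Clusters API.
response = requests.post(
    DOMAIN + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # Contains the new cluster_id on success.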

Important

  • You cannot create and activate new environments inside Databricks notebooks. Every notebook operates in a unique environment that is cloned from the root environment.
  • You cannot switch environments using notebook shell commands.

Environment activation

Each Azure Databricks notebook clones the root environment and activates the new environment before executing the first command, as illustrated after the list below. This offers several benefits:

  • All package management activity inside the notebook is isolated from other notebooks.
  • You can use conda and pip commands without having to worry about the location of the root environment.
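For illustration, you can confirm from a notebook which cloned environment is active. The exact interpreter path and environment name are runtime details and may differ across versions.

import os
import sys

# The Python interpreter belongs to the notebook's cloned Conda environment.
print(sys.executable)

# CONDA_DEFAULT_ENV is a standard Conda variable naming the active environment.
print(os.environ.get("CONDA_DEFAULT_ENV"))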

Manage Python libraries

Similar to standard Databricks Runtime versions, Databricks Runtime with Conda supports three library modes: Workspace, cluster-installed, and notebook-scoped. This section reviews these options in the context of Conda Python packages.

Workspace and cluster-scoped libraries

You can install all supported Python library formats (whl, wheelhouse.zip, egg, and PyPI) on clusters running Databricks Runtime with Conda. The Databricks library manager uses the pip command provided by Conda to install packages in the root Conda environment. The packages are accessible to all notebooks and jobs attached to the cluster.

Note

If a notebook is attached to a cluster before a Workspace library is attached, you must detach and reattach the notebook to the cluster to use the new library.
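Besides the library UI, cluster-scoped PyPI libraries can also be installed programmatically. The following sketch uses the Libraries API; the workspace URL, token, cluster ID, and package pin are placeholders.

import requests

# Placeholders: replace with your workspace URL, token, and target cluster ID.
DOMAIN = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Request installation of a PyPI package on the cluster. The library manager
# installs it into the root Conda environment with pip, as described above.
response = requests.post(
    DOMAIN + "/api/2.0/libraries/install",
    headers={"Authorization": "Bearer " + TOKEN},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [{"pypi": {"package": "simplejson==3.16.0"}}],
    },
)
response.raise_for_status()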

Notebook-scoped libraries

You can install Python libraries and create an environment scoped to a notebook session using Databricks library utilities. The libraries are installed in the notebook’s Conda environment and are accessible only within that notebook.

Note

  • Cluster-scoped libraries installed before the notebook session starts are accessible in the notebook session. However, cluster-scoped libraries installed after the notebook session starts are not accessible in the notebook; to use a newly installed cluster-scoped library, detach and reattach the notebook.
  • Notebook-scoped libraries are enabled by default. You can turn them off by setting spark.databricks.libraryIsolation.enabled to false.

An example of using library utilities is:

dbutils.library.installPyPI("tensorflow", "1.13")
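If your runtime version also provides the list and restart library utilities, a brief sketch of inspecting and refreshing the notebook-scoped environment is:

# List the libraries added to this notebook session so far.
dbutils.library.list()

# Restart the Python process so that newly installed packages can be imported.
dbutils.library.restartPython()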

Requirements files

A requirements file is a library type available for notebook-scoped libraries on Databricks Runtime with Conda. A requirements file contains a list of packages to be installed using pip. You can install a requirements file as a notebook-scoped library in the same way you install whl and egg libraries. The name of the file must end with requirements.txt. An example of using a requirements file is:

dbutils.library.install("dbfs:/path/to/file/a_requirements.txt")

See Requirements File Format for more information on requirements.txt files.
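For example, you could generate a small requirements file on DBFS from a notebook and then install it. The path and package pins below are illustrative only.

# Write an illustrative requirements file to DBFS. Each line is a standard
# pip requirement specifier.
dbutils.fs.put(
    "dbfs:/path/to/file/a_requirements.txt",
    "matplotlib==3.0.3\nsimplejson==3.16.0\n",
    True,  # overwrite if the file already exists
)

# Install every package listed in the file as a notebook-scoped library.
dbutils.library.install("dbfs:/path/to/file/a_requirements.txt")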

Use conda and pip commands

Every Python notebook (or Python cell) that is attached to a Databricks Runtime with Conda runs in an activated Conda environment. Therefore you can use conda and pip commands to list and install packages. Any modifications to the current environment using this method are restricted to the notebook and the driver. The changes are reset when you detach and reattach the notebook.

%sh
conda env list
%sh
conda install matplotlib -y

Tip

When you run shell commands inside notebooks using %sh, you cannot respond to interactive prompts. To avoid blocking, pass the -y (--yes) flag to conda and pip commands.

Use Conda inside cluster initialization scripts

When you install Conda packages using cluster initialization scripts, you can assume that your script runs inside the activated root Conda environment, either databricks-minimal or databricks-standard. Any packages installed with conda or pip commands in init scripts are exposed to all notebooks attached to the cluster. As an example, the following notebook code snippet generates a script that installs fast.ai packages on all the cluster workers.

dbutils.fs.put("dbfs:/home/myScripts/fast.ai", "conda install -c pytorch -c fastai fastai -y", True)

Limitations

The following features are not supported on Databricks Runtime with Conda:

  • GPU instances. If you are looking for a Conda-based runtime that supports GPUs, consider Databricks Runtime ML.
  • Python 2.7
  • Creating and activating a separate environment before a cluster running Databricks Runtime with Conda starts.
  • Data-centric security features, including Table ACLs and Credential Passthrough.

Note

Databricks Runtime with Conda is an experimental and evolving feature. We are actively working to improve it and resolve all the limitations. We recommend using the latest released version of Databricks Runtime with Conda.

FAQ

When should I use Databricks Container Service (DCS) versus when should I use Conda to customize my Python environment?
If your desired customizations are restricted to Python packages, you can start with the databricks-minimal Conda environment and customize it based on your needs. However, if you need JVM, native, or R customizations, DCS is a better choice. In addition, if you need to install many packages and cluster startup time becomes a bottleneck, DCS can help you launch clusters faster.
When should I use init scripts vs. cluster-installed libraries?
Whenever possible, we recommend using cluster-scoped libraries to install Python libraries that are needed by all users and notebooks of a cluster; they give you more flexibility and visibility than init scripts. Use cluster initialization scripts when cluster-scoped or notebook-scoped libraries are not sufficient, for example to install native packages that are not supported by the Databricks library manager.