Authenticate to Azure Data Lake Storage with your Azure Active Directory Credentials

You can authenticate automatically to Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure Data Lake Storage credential passthrough, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.

Requirements

An Azure Data Lake Storage Gen1 or Gen2 storage account. Azure Data Lake Storage Gen2 storage accounts must use the hierarchical namespace to work with Azure Data Lake Storage credential passthrough. See Create an Azure Data Lake Storage Gen2 account and initialize a filesystem.

Security requirements

Important

You cannot authenticate to Azure Data Lake Storage with your Azure Active Directory credentials if you are behind a firewall that has not been configured to allow traffic to Azure Active Directory. Azure Firewall blocks traffic to Azure Active Directory by default. To allow access, configure the AzureActiveDirectory service tag. For more information, see Azure Firewall service tags.

Cluster requirements

  • Databricks Runtime 5.1 or above for Azure Data Lake Storage Gen1.
  • Databricks Runtime 5.3 or above for Azure Data Lake Storage Gen2.
  • Databricks Runtime 5.5 or above for Standard clusters (Public Preview). Earlier releases support only High concurrency clusters. High concurrency clusters can be shared by multiple users and support only Python, SQL, and R. Standard clusters are limited to a single user and support Python, SQL, and Scala.
  • Databricks Runtime 6.0 or above for R support on Standard clusters.
  • Azure Data Lake Storage credentials cannot have been set for the cluster (by providing your service principal credentials, for example).
  • Clusters enabled for credential passthrough do not support jobs, and they do not support Table Access Control.

Enable Azure Data Lake Storage credential passthrough for a high-concurrency cluster

High concurrency clusters can be shared by multiple users. They support only Python, SQL, and R.

  1. When you create a cluster, set the Cluster Mode to High Concurrency.
  2. Choose a Databricks Runtime version according to the Azure Data Lake Storage type:
    • Azure Data Lake Storage Gen1: Databricks Runtime 5.1 or above.
    • Azure Data Lake Storage Gen2: Databricks Runtime 5.3 or above.
  3. Under Advanced Options, select Enable credential passthrough and only allow Python and SQL commands.
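The checkbox described above is the documented way to enable passthrough in the UI. If you provision clusters programmatically, the following Python sketch shows one possible approach using the Clusters API. The workspace URL, token, node type, runtime version string, and in particular the Spark configuration keys spark.databricks.cluster.profile, spark.databricks.repl.allowedLanguages, and spark.databricks.passthrough.enabled are assumptions for illustration only and are not taken from this page; verify them against your Databricks release before relying on them.

import requests

# Hypothetical sketch: create a High Concurrency cluster with credential passthrough
# enabled through the Clusters API. All values below are placeholders or assumptions.
workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "passthrough-demo",
    "spark_version": "5.5.x-scala2.11",   # any runtime meeting the requirements above
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {
        # Assumed API equivalents of High Concurrency mode and the
        # "Enable credential passthrough" checkbox; confirm before use.
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "python,sql",
        "spark.databricks.passthrough.enabled": "true",
    },
}

response = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # contains the new cluster_id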

Enable Azure Data Lake Storage credential passthrough for a standard cluster

Preview

Support for Azure Data Lake Storage credential passthrough on standard clusters is in Public Preview.

Standard clusters with credential passthrough are supported on Databricks Runtime 5.5 and above and are limited to a single user. They support Python, SQL, and Scala; on Databricks Runtime 6.0 and above, they also support SparkR.

You must assign a user at cluster creation, but the cluster can be edited by a user with Can Manage permissions at any time to replace the original user.

Important

The user assigned to the cluster must have at least Can Attach To permission in order to run commands on the cluster. Admins and the cluster creator have Can Manage permission, but cannot run commands on the cluster unless they are the designated cluster user.

  1. When you create a cluster, set the Cluster Mode to Standard.
  2. Choose Databricks Runtime 5.5 or above.
  3. Under Advanced Options, select Enable credential passthrough for user-level access and select the user name from the Single User Access drop-down.

Read and write Azure Data Lake Storage using credential passthrough

Azure Data Lake Storage credential passthrough supports only Azure Data Lake Storage Gen1 and Gen2. Access data directly in Azure Data Lake Storage Gen1 using an adl:// path and in Azure Data Lake Storage Gen2 using an abfss:// path. For example:

Azure Data Lake Storage Gen1

spark.read.csv("adl://<myadlsfolder>.azuredatalakestore.net/MyData.csv").collect()

Azure Data Lake Storage Gen2

spark.read.csv("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.windows.net/MyData.csv").collect()

Mount Azure Data Lake Storage to DBFS using credential passthrough

Note

  • Azure Data Lake Storage Gen1 mounts are supported in Databricks Runtime 5.1 and above.
  • Azure Data Lake Storage Gen2 mounts are supported in Databricks Runtime 5.4 and above.

You can mount an Azure Data Lake Storage account or a folder inside it to Databricks File System. The mount is a pointer to a data lake store, so the data is never synced locally.

When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point will be visible to other users, but the only users that will have read and write access are those who:

  • Have access to the underlying Azure Data Lake Storage storage account
  • Are using a cluster enabled for Azure Data Lake Storage credential passthrough

To mount an Azure Data Lake Storage account or a folder inside it, use the Python commands described in Mount Azure Data Lake Storage Gen1 resource or folder or Mount Azure Data Lake Storage Gen2 filesystem, replacing the configs with:

Azure Data Lake Storage Gen1

configs = {
  "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
  "fs.adl.oauth2.access.token.custom.provider": spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
}

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix, with two caveats. First, even though keys set with the old prefix are correctly propagated, calling spark.conf.get with a key that uses the new prefix fails unless that key has been set explicitly. Second, any error message that references an Azure Data Lake Storage configuration key always uses the new prefix. For Databricks Runtime versions below 6.0, you must always use the old prefix.
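For example, a minimal Gen1 mount sketch using the configs defined above (the account name, folder, and mount point are placeholders, not values from this page):

# Mount an Azure Data Lake Storage Gen1 folder with credential passthrough.
dbutils.fs.mount(
  source = "adl://<my-adls-account>.azuredatalakestore.net/<folder-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)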

Azure Data Lake Storage Gen2

configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":   spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
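Similarly, a minimal Gen2 mount sketch using these configs (the file system name, storage account name, and mount point are placeholders):

# Mount an Azure Data Lake Storage Gen2 filesystem with credential passthrough.
dbutils.fs.mount(
  source = "abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)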

Warning

Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem is restricted to users who have access to the underlying Azure Data Lake Storage account.

Security

It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated from each other and will not be able to read or use each other’s credentials.

Supported features

Feature | Minimum Databricks Runtime Version | Notes
Python and SQL | 5.1 |
Azure Data Lake Storage Gen1 | 5.1 |
%run | 5.1 |
DBFS | 5.3 | Credentials are passed through only if the DBFS path resolves to a location in Azure Data Lake Storage Gen1 or Gen2. For DBFS paths that resolve to other storage systems, use a different method to specify your credentials.
Azure Data Lake Storage Gen2 | 5.3 |
Delta caching | 5.4 |
PySpark ML API | 5.4 | The following ML classes are not supported:
  • org/apache/spark/ml/classification/RandomForestClassifier
  • org/apache/spark/ml/clustering/BisectingKMeans
  • org/apache/spark/ml/clustering/GaussianMixture
  • org/apache/spark/ml/clustering/KMeans
  • org/apache/spark/ml/clustering/LDA
  • org/apache/spark/ml/evaluation/ClusteringEvaluator
  • org/apache/spark/ml/feature/HashingTF
  • org/apache/spark/ml/feature/OneHotEncoder
  • org/apache/spark/ml/feature/StopWordsRemover
  • org/apache/spark/ml/feature/VectorIndexer
  • org/apache/spark/ml/feature/VectorSizeHint
  • org/apache/spark/ml/regression/IsotonicRegression
  • org/apache/spark/ml/regression/RandomForestRegressor
  • org/apache/spark/ml/util/DatasetUtils
Broadcast variables | 5.5 | Within PySpark, there is a limit on the size of the Python UDFs you can construct, since large UDFs are sent as broadcast variables.
Notebook-scoped libraries | 5.5 |
Scala | 5.5 |
SparkR | 6.0 |
Notebook workflows | 6.1 |
PySpark ML API | 6.1 | All PySpark ML classes are supported.
Ganglia UI | 6.1 |

Known limitations

The following features are not supported with Azure Data Lake Storage credential passthrough:

  • %fs (use the equivalent dbutils.fs command instead; see the example after this list).
  • Jobs.
  • The REST API.
  • Connecting to your cluster using JDBC/ODBC.
  • Table access control. The powers granted by Azure Data Lake Storage credential passthrough could be used to bypass the fine-grained permissions of Table ACLs, while the extra restrictions of Table ACLs will constrain some of the power you get from Azure Data Lake Storage credential passthrough. In particular:
    • If you have Azure AD permission to access the data files that underlie a particular table you will have full permissions on that table via the RDD API, regardless of the restrictions placed on them via Table ACLs.
    • You will be constrained by Table ACL permissions only when using the DataFrame API. If you try to read files directly with the DataFrame API, you will see warnings that you do not have the SELECT permission on any file, even though you could read those files directly via the RDD API.
    • You will be unable to read from tables backed by filesystems other than Azure Data Lake Storage, even if you have Table ACL permission to read the tables.
  • The following methods on SparkContext (sc) and SparkSession (spark) objects:
    • Deprecated methods.
    • Methods such as addFile() and addJar() that would allow non-admin users to call Scala code.
    • Any method that accesses a filesystem other than Azure Data Lake Storage Gen1 or Gen2 (to access other filesystems on a cluster with Azure Data Lake Storage credential passthrough enabled, use a different method to specify your credentials and see the section on trusted filesystems under Troubleshooting).
    • The old Hadoop APIs (hadoopFile() and hadoopRDD()).
    • Streaming APIs, since the passed-through credentials would expire while the stream was still running.
  • The FUSE mount (/dbfs).
  • Azure Data Factory.
  • Databricks Connect.
  • MLflow.
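For the %fs limitation noted above, the equivalent dbutils.fs call works on passthrough clusters; a minimal sketch (the path is a placeholder):

# Instead of: %fs ls abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.windows.net/
display(dbutils.fs.ls("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.windows.net/"))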

Example notebooks

The following notebooks demonstrate Azure Data Lake Storage credential passthrough for Azure Data Lake Storage Gen1 and Gen2.

Azure Data Lake Storage Gen1 passthrough notebook

Azure Data Lake Storage Gen2 passthrough notebook

Troubleshooting

py4j.security.Py4JSecurityException: … is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials.
org.apache.spark.api.python.PythonSecurityException: Path … uses an untrusted filesystem

This exception is thrown when you have tried to access a filesystem that the Azure Data Lake Storage credential passthrough cluster does not know to be safe. Using an untrusted filesystem might allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all filesystems that we are not confident are being used safely.

To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to be a comma-separated list of the class names that are trusted implementations of org.apache.hadoop.fs.FileSystem.
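For example, a minimal sketch of a cluster Spark config entry (the class names are placeholders for whatever trusted org.apache.hadoop.fs.FileSystem implementations you want to allow; they are not values taken from this page):

spark.databricks.pyspark.trustedFilesystems <fully.qualified.FileSystemImplementation1>,<fully.qualified.FileSystemImplementation2>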