Accessing Azure Data Lake Storage Automatically with your Azure Active Directory Credentials

You can authenticate automatically to Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks.

When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.

Requirements

Enable Azure Data Lake Storage credential passthrough for a cluster

To enable Azure Data Lake Storage credential passthrough for a cluster:

  1. When you create the cluster, set the Cluster Mode to High Concurrency.
  2. For Azure Data Lake Storage Gen1, use Databricks Runtime 5.1 or above. For Azure Data Lake Storage Gen2, use Databricks Runtime 5.3 or above.
  3. Under Advanced Options, select Enable credential passthrough and only allow Python and SQL commands.

Important

Azure Data Lake Storage credential passthrough will not work if Azure Data Lake Storage credentials have also been set for the cluster (by providing your service principal credentials, for example).
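
If you create clusters programmatically, the same options can be expressed as Spark configuration in a Clusters API call. The sketch below is a hedged illustration, not a definitive recipe: it assumes the Clusters API 2.0 create endpoint and the spark.databricks.cluster.profile, spark.databricks.passthrough.enabled, and spark.databricks.repl.allowedLanguages keys that correspond to the UI options above; the workspace URL, token, node type, and runtime version strings are placeholders, so verify the exact values for your workspace and Databricks Runtime version.

import requests

# Hedged sketch: create a High Concurrency cluster with credential passthrough
# enabled via the Clusters API 2.0. All values in angle brackets are placeholders.
resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "passthrough-cluster",
        "spark_version": "5.3.x-scala2.11",  # Databricks Runtime 5.3 or above for ADLS Gen2
        "node_type_id": "<node-type-id>",
        "num_workers": 2,
        "spark_conf": {
            # High Concurrency cluster mode
            "spark.databricks.cluster.profile": "serverless",
            # Enable credential passthrough and allow only Python and SQL commands
            "spark.databricks.passthrough.enabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql",
        },
        # Note: no storage account keys or service principal credentials are set here;
        # doing so would prevent credential passthrough from working (see the Important note above).
    },
)
resp.raise_for_status()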

Read and write your data

Make sure your data is stored entirely in Azure Data Lake Storage. Azure Data Lake Storage credential passthrough does not support filesystems other than Azure Data Lake Storage Gen1 and Gen2.

When you use Azure Data Lake Storage credential passthrough, you can access your data directly using an adl:// path (for ADLS Gen1) or an abfss:// path (for ADLS Gen2). For example:

Azure Data Lake Storage Gen1

spark.read.csv("adl://<myadlsfolder>.azuredatalakestore.net/MyData.csv").collect()

Azure Data Lake Storage Gen2

spark.read.csv("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.windows.net/MyData.csv").collect()

You can also use DBFS for Azure Data Lake Storage Gen1 (see below).

Mount your data with DBFS

Note

DBFS mounts are not supported on Azure Data Lake Storage Gen2.

You can mount an Azure Data Lake Storage Gen1 account or a folder inside it through the Databricks File System (DBFS). The mount is a pointer to the data lake store, so the data is never synced locally.

When you mount your data with Azure Data Lake Storage credential passthrough, any read or write to the mount point automatically uses your Azure AD credentials. The mount point is visible to other users, but only users who have access to the underlying Azure Data Lake Storage account have read and write access, and only when they use a cluster enabled for credential passthrough.

To mount an Azure Data Lake Storage Gen1 account or a folder inside it, use the Python commands described in Mount an Azure Data Lake Storage Gen1 account, replacing the configs with the following:

configs = {
  "dfs.adls.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
  "dfs.adls.oauth2.access.token.custom.provider": spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
}
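
Putting it together, the mount call typically looks like the following. This is a hedged sketch of the dbutils.fs.mount pattern described in Mount an Azure Data Lake Storage Gen1 account; <your-directory-name> and <mount-name> are placeholders for your own values.

# Mount the ADLS Gen1 account (or a folder inside it) at /mnt/<mount-name>,
# using the configs dictionary defined above.
dbutils.fs.mount(
  source = "adl://<your-directory-name>.azuredatalakestore.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Once mounted, reads and writes through the mount point automatically use your Azure AD credentials, for example:

spark.read.csv("/mnt/<mount-name>/MyData.csv").collect()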

Warning

Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem is restricted to users who have access to the underlying Azure Data Lake Storage account.

Security

It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated from each other and will not be able to read or use each other’s credentials.

Known limitations

For security, Azure Data Lake Storage credential passthrough:

  • Supports only Python and SQL.
  • Should not be used with clusters that are enabled for table access control (table ACLs).
  • Does not support filesystems other than Azure Data Lake Storage Gen1 and Gen2. This includes reading from DBFS paths, even if those paths eventually resolve to Azure Data Lake Storage. Use the adl:// (Gen1) or abfss:// (Gen2) path directly instead.
  • Does not support some Spark methods. In general, Azure Data Lake Storage credential passthrough supports all methods on the SparkContext (sc) and SparkSession (spark) objects, with the following exceptions:
    • Deprecated methods.
    • Methods such as addFile() and addJar() that would allow non-admin users to call arbitrary Scala code.
    • Any method that accesses a filesystem, when applied to a filesystem other than Azure Data Lake Storage Gen1 or Gen2.
    • The old Hadoop APIs (hadoopFile() and hadoopRDD()), since we cannot guarantee that these APIs access only Azure Data Lake Storage Gen1 or Gen2.

Troubleshooting

py4j.security.Py4JSecurityException: ... is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials.
org.apache.spark.api.python.PythonSecurityException: Path ... uses an untrusted filesystem

This exception is thrown when you have tried to access a filesystem that the Azure Data Lake Storage credential passthrough cluster does not know to be safe. Using an untrusted filesystem might allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all filesystems that we are not confident are being used safely.

To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to a comma-separated list of class names that are trusted implementations of org.apache.hadoop.fs.FileSystem.

To get the list of trusted file systems on a cluster enabled for credential passthrough, run this command:

spark.conf.get("spark.databricks.pyspark.trustedFilesystems")
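
To add a filesystem to that list, append its FileSystem implementation class to the existing value in the cluster’s Spark config. The entry below is only an illustration: the class name shown (the stock Hadoop driver for Azure Blob Storage) is an assumption, and the implementations bundled with your Databricks Runtime may use different (for example, shaded) names, so base the value on the output of the command above.

spark.databricks.pyspark.trustedFilesystems <existing-trusted-filesystems>,org.apache.hadoop.fs.azure.NativeAzureFileSystem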

Note

In Databricks Runtime 5.1, Azure Blob Storage is not trusted by default; to use it, you must manually configure it as a trusted filesystem.

In Databricks Runtime 5.2 and above, Azure Blob Storage is trusted by default and requires no configuration.