Azure Data Lake Storage Gen1

Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store) is an enterprise-wide, hyper-scale repository for big data analytic workloads. It lets you capture data of any size, type, and ingestion speed in a single place for operational and exploratory analytics, and it is specifically designed and tuned for performance in data analytics scenarios.

Note

Microsoft has released its next-generation data lake store, Azure Data Lake Storage Gen2, in Preview.

Azure Databricks also supports the following Azure data sources: Azure Blob Storage, Azure Cosmos DB, and Azure SQL Data Warehouse.

Access Azure Data Lake Storage Gen1 automatically with your Azure Active Directory credentials (Preview)

You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks.

When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.

Note

The ability to access Azure Data Lake Storage Gen1 using your Azure Active Directory credentials is in Public Preview.

Requirements

Enable Azure Data Lake Storage credential passthrough for a cluster

To enable Azure Data Lake Storage credential passthrough for a cluster:

  1. When you create the cluster, set the Cluster Mode to High Concurrency.
  2. Use Databricks Runtime 5.1 or above.
  3. Select Enable credential passthrough and only allow Python and SQL commands.

Important

Azure Data Lake Storage credential passthrough will not work if Azure Data Lake Storage Gen1 credentials have also been set for the cluster (for example, by providing service principal credentials as described in Mount Azure Data Lake Storage Gen1 with DBFS and Access Azure Data Lake Storage Gen1 directly using Spark APIs).

Read and write your data

Make sure your data is stored entirely in Azure Data Lake Storage Gen1. Azure Data Lake Storage credential passthrough does not support filesystems other than Azure Data Lake Storage Gen1.

When you use Azure Data Lake Storage credential passthrough, you must access your data directly using an adl:// path. For example:

spark.read.csv("adl://myadlsfolder.azuredatalakestore.net/MyData.csv").collect()
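Writes use the same direct adl:// path. As a minimal sketch, assuming df is a DataFrame you have already created and that the account and output folder names are placeholders:

# Credential passthrough also applies to writes; your Azure AD identity is used for the write
df.write.parquet("adl://myadlsfolder.azuredatalakestore.net/MyOutput")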

Security

It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. Users are isolated from one another and cannot read or use one another’s credentials.

Known limitations

For security, Azure Data Lake Storage credential passthrough:

  • Supports only Python and SQL.
  • Does not support filesystems other than Azure Data Lake Storage Gen1. This includes reading from DBFS paths, even if those paths eventually resolve to Azure Data Lake Storage Gen1. Use the adl:// path directly instead.
  • Does not support some Spark methods. In general, Azure Data Lake Storage credential passthrough supports all methods on the SparkContext (sc) and SparkSession (spark) objects, with the following exceptions:
    • Deprecated methods.
    • Methods such as addFile() and addJar() that would allow non-admin users to call arbitrary Scala code.
    • Any method that accesses a filesystem, when applied to a filesystem other than Azure Data Lake Storage Gen1.
    • The old Hadoop APIs (hadoopFile() and hadoopRDD()), since it is hard to guarantee that these APIs only access Azure Data Lake Storage Gen1.

Azure Data Lake Storage credential passthrough does not support dbutils.fs commands.

Troubleshooting

py4j.security.Py4JSecurityException: ... is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials.
org.apache.spark.api.python.PythonSecurityException: Path ... uses an untrusted filesystem

This exception is thrown when you have tried to access a filesystem that is not known by the Azure Data Lake Storage credential passthrough cluster to be safe. Using an untrusted filesystem might allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all filesystems that we are not confident are being used safely.

To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to be a comma-separated list of the class names that are trusted implementations of org.apache.hadoop.fs.FileSystem.

For example, to trust Azure Data Lake Storage Gen1 and Azure Blob Storage, use this Spark conf:

spark.databricks.pyspark.trustedFilesystems com.databricks.adl.AdlFileSystem,org.apache.hadoop.fs.azure.NativeAzureFileSystem

Mount Azure Data Lake Storage Gen1 with DBFS

You can mount an Azure Data Lake Storage Gen1 account or a folder inside it through Databricks File System (DBFS). The mount is a pointer to the Data Lake Storage Gen1 location, so the data is never synced locally.

Important

  • You should create a mount point only if you want all users in the Azure Databricks workspace to have access to the mounted Azure Data Lake Storage Gen1 account. The service client that you use to access the Azure Data Lake Storage Gen1 account should be granted access only to that Azure Data Lake Storage Gen1 account; it should not be granted access to other resources in Azure.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use, as shown below.
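    For example, run the following in a notebook attached to that other running cluster (a minimal sketch; the call takes no arguments):

    # Refresh this cluster's view of DBFS mounts so the newly created mount point becomes visible
    dbutils.fs.refreshMounts()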

DBFS uses the credential you provide when you create the mount point to access the mounted Azure Data Lake Storage Gen1 account.

Requirements

  • For OAuth 2.0 access, you must have a service principal. If you do not already have service credentials, follow the instructions in Create service principal with portal. If you do not know your-directory-id (also referred to as tenant ID in Azure AD), follow the instructions in Get tenant ID. To use credentials safely in Azure Databricks, we recommend following the Secrets user guide.
  • You can mount an Azure Data Lake Storage Gen1 account using Databricks Runtime 4.0 or above. Once a Data Lake Storage Gen1 account is mounted, you can use Runtime 3.4 or above to access the mount point.

Mount an Azure Data Lake Storage Gen1 account

To mount an Azure Data Lake Storage Gen1 account or a folder inside it, use the following command:

Scala
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "<your-service-client-id>",
  "dfs.adls.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"),
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")

// Optionally, you can add <your-directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
Python
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "<your-service-client-id>",
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}

# Optionally, you can add <your-directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

where

  • <mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account or a folder inside it (specified in source) will be mounted in DBFS.
  • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your service credential that has been stored as a secret in a secret scope.

Access files in the mounted Azure Data Lake Storage Gen1 account as if they were local files, for example:

Scala
val df = spark.read.text("/mnt/<mount-name>/....")
val df = spark.read.text("dbfs:/mnt/<mount-name>/....")
Python
df = spark.read.text("/mnt/%s/...." % <mount-name>)
df = spark.read.text("dbfs:/mnt/<mount-name>/....")

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
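To check which mount points exist before or after unmounting, you can list them; dbutils.fs.mounts() returns each DBFS mount point together with the source it points to:

# List all DBFS mount points and the storage locations they point to
display(dbutils.fs.mounts())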

Access Azure Data Lake Storage Gen1 directly using Spark APIs

This section explains how to access Azure Data Lake Storage Gen1 using the Spark DataFrame and RDD APIs.

Requirements

  • For OAuth 2.0 access, you must have a service principal. If you do not already have service credentials, follow the instructions in Create service principal with portal. If you do not know your-directory-id (also referred to as tenant ID in Azure AD), follow the instructions in Get tenant ID. To use credentials safely in Azure Databricks, we recommend following the Secrets user guide.

Access Azure Data Lake Storage Gen1 using the DataFrame API

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<your-service-client-id>")
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")

where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your service credential that has been stored as a secret in a secret scope.
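If you are not sure which secret scopes and keys exist in your workspace, you can list them from a notebook; secret values themselves are never returned in clear text. (The scope name below is a placeholder.)

# List the secret scopes in the workspace, then the keys stored in one scope
dbutils.secrets.listScopes()
dbutils.secrets.list("<scope-name>")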

After providing credentials, you can read from Azure Data Lake Storage Gen1 using Spark and Databricks APIs:

val df = spark.read.parquet("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")

dbutils.fs.ls("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")

Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access both to the Data Lake Storage Gen1 resource and to the directories that you want to read from.
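Writing works the same way, provided the service principal has also been granted write access to the target directory. A minimal sketch, where the output directory is a placeholder and df is a DataFrame you have already created:

# Write a DataFrame to a directory the service principal can write to
df.write.mode("overwrite").parquet("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-output-directory>")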

Access Azure Data Lake Storage Gen1 using the RDD API

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Storage Gen1, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used for your RDD jobs:

    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id <your-service-client-id>
    spark.hadoop.dfs.adls.oauth2.credential <your-service-credentials>
    spark.hadoop.dfs.adls.oauth2.refresh.url "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "<your-service-client-id>")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.credential", "<your-service-credentials>")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    

Warning

These credentials are available to all users who access the cluster.
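With either option in place, RDD reads against adl:// paths pick up the OAuth credentials from the Hadoop configuration. A minimal sketch (the file name data.txt is a placeholder):

# sc is the SparkContext provided in Databricks notebooks; the Hadoop configuration supplies the credentials
rdd = sc.textFile("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>/data.txt")
rdd.count()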

Access Azure Data Lake Storage Gen1 through metastore

To access adl:// locations specified in the metastore, you must specify Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:

spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your-service-client-id>
spark.hadoop.dfs.adls.oauth2.credential <your-service-credentials>
spark.hadoop.dfs.adls.oauth2.refresh.url "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"

Warning

These credentials are available to all users who access the cluster.