Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 (also known as ADLS Gen2) is a next-generation data lake solution for big data analytics. Azure Data Lake Storage Gen2 builds the capabilities of Azure Data Lake Storage Gen1 (file system semantics, file-level security, and scale) into Azure Blob Storage, with its low-cost tiered storage, high availability, and disaster recovery features.

Note

The Azure Data Lake Storage Gen2 connector is supported in Databricks Runtime 5.2 and above, with full support for Delta Lake in Databricks Runtime 5.5 and above.

There are four ways of accessing Azure Data Lake Storage Gen2:

  1. Pass your Azure Active Directory credentials, also known as credential passthrough.
  2. Mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0.
  3. Use a service principal directly.
  4. Use the Azure Data Lake Storage Gen2 storage account access key directly.

This topic explains how to access Azure Data Lake Storage Gen2 using the Azure Blob File System (ABFS) driver built into Databricks Runtime. It covers all the ways you can access Azure Data Lake Storage Gen2, frequently asked questions, and known issues.

Create an Azure Data Lake Storage Gen2 account and initialize a filesystem

If you want to use Azure Data Lake Storage credential passthrough or mount the Azure Data Lake Storage Gen2 filesystem, perform this step.

If you have not created an Azure Data Lake Storage Gen2 account and initialized a filesystem, do the following:

  1. Create your Azure Data Lake Storage Gen2 storage account, enabling the hierarchical namespace, which provides improved filesystem performance, POSIX ACLs, and filesystem semantics that are familiar to analytics engines and frameworks.

    Important

    • When the hierarchical namespace is enabled for an Azure Data Lake Storage Gen2 account, you do not need to create any Blob containers through the Azure Portal.
    • When the hierarchical namespace is enabled, Azure Blob Storage APIs are not available. See this Known issue description. For example, you cannot use the wasb or wasbs scheme to access the blob.core.windows.net endpoint.
    • If you enable the hierarchical namespace there is no interoperability of data or operations between Azure Blob Storage and Azure Data Lake Storage Gen2 REST APIs.
  2. Initialize a filesystem before you can access it. If you haven’t already initialized it from within the Azure portal, enter the following in the first cell of the notebook (with your account values):

    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    dbutils.fs.ls("abfss://<file_system>@<storage-account-name>.dfs.core.windows.net/")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
    

    You need to run this only once per filesystem, not each time you run the notebook or attach to a new cluster.

Access automatically with your Azure Active Directory credentials

You can configure your Azure Databricks cluster to let you authenticate automatically to Azure Data Lake Storage Gen2 using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure Data Lake Storage credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen2 without requiring you to configure service principal credentials for access to storage.

Note

Azure Data Lake Storage credential passthrough for Azure Data Lake Storage Gen2 requires Databricks Runtime 5.3 and above.

For complete setup and usage instructions, see Authenticate to Azure Data Lake Storage with your Azure Active Directory Credentials.
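
Once passthrough is enabled on the cluster, you can read data without configuring any service principal or account key credentials in the notebook. The following is a minimal Python sketch; the filesystem, storage account, and directory placeholders are illustrative:

# No credential configuration is needed here; the cluster passes your Azure AD
# identity through to Azure Data Lake Storage Gen2 and authorizes the request
# with your own permissions on the account.
df = spark.read.csv("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")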

Create and grant permissions to service principal

If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:

  1. Create an Azure AD application and service principal that can access resources. Note the following properties:
    • client-id: An ID that uniquely identifies the application.
    • directory-id: An ID that uniquely identifies the Azure AD instance.
    • service-credential: A string that the application uses to prove its identity.
  2. Register the service principal, granting the correct role assignment, such as Storage Blob Data Contributor, on the Azure Data Lake Storage Gen2 account.

Mount an Azure Data Lake Storage Gen2 account using a service principal and OAuth 2.0

You can mount an Azure Data Lake Storage Gen2 account to DBFS, authenticating using a service principal and OAuth 2.0. The mount is a pointer to data lake storage, so the data is never synced locally.

Note

Accessing Azure Data Lake Storage Gen2 requires Databricks Runtime 5.2 or above. If you have been using Azure Data Lake Storage Gen2 with Databricks Runtime 4.2 to 5.1, you can continue to do so, but support is limited and we strongly recommend that you upgrade your clusters.

Important

  • All users in the Azure Databricks workspace have access to the mounted Azure Data Lake Storage Gen2 account. The service principal that you use to access the Azure Data Lake Storage Gen2 account should be granted access only to that Azure Data Lake Storage Gen2 account; it should not be granted access to other resources in Azure.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use, as shown in the example below.
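
    For example, run the following in a notebook attached to the other running cluster (a minimal Python sketch):

    # Refresh this cluster's view of DBFS mounts so that mount points
    # created on other clusters become visible here.
    dbutils.fs.refreshMounts()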

Configure storage account key

// With Databricks Runtime 5.1 and above, you can also use OAuth 2.0 to authenticate the filesystem initialization.
spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))

where <storage-account-name> is the name of your storage account and dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your storage account access key that has been stored as a secret in a secret scope.

Mount Azure Data Lake Storage Gen2 filesystem

  1. To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, use the following command:

    Scala
    val configs = Map(
      "fs.azure.account.auth.type" -> "OAuth",
      "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id" -> "<client-id>",
      "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
      "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
    
    // Optionally, you can add <directory-name> to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = configs)
    
    Python
    configs = {"fs.azure.account.auth.type": "OAuth",
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
               "fs.azure.account.oauth2.client.id": "<client-id>",
               "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
               "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
    
    # Optionally, you can add <directory-name> to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
      mount_point = "/mnt/<mount-name>",
      extra_configs = configs)
    

    where

    • <mount-name> is a DBFS path representing where the Azure Data Lake Storage Gen2 filesystem or a folder inside it (specified in source) will be mounted in DBFS.
    • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.
  2. Access files in your Azure Data Lake Storage Gen2 filesystem as if they were files in DBFS; for example:

    Scala
    val df = spark.read.text("/mnt/<mount-name>/....")
    val df = spark.read.text("dbfs:/mnt/<mount-name>/....")
    
    Python
    df = spark.read.text("/mnt/<mount-name>/....")
    df = spark.read.text("dbfs:/mnt/<mount-name>/....")
    

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

Access directly with service principal and OAuth 2.0

You can access an Azure Data Lake Storage Gen2 storage account directly (as opposed to mounting it to DBFS) using OAuth 2.0 with a service principal. You can directly access any Azure Data Lake Storage Gen2 storage account that the service principal has permissions on, and you can configure credentials for multiple storage accounts and service principals in the same Spark session.

Set credentials

The way you set credentials depends on which API you plan to use when accessing Azure Data Lake Storage Gen2: DataFrame, Dataset, or RDD.

DataFrame or Dataset API

If you are using Spark DataFrame or Dataset APIs, we recommend that you set your account credentials in your notebook’s session configs:

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

where <storage-account-name> is the name of your storage account and dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.
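
Because these configuration keys include the storage account name, you can set credentials for several storage accounts (each with its own service principal) in the same Spark session. The following Python sketch repeats the configuration for a hypothetical second account; <storage-account-name-2>, <client-id-2>, <directory-id-2>, and <key-name-for-service-credential-2> are illustrative placeholders:

# Credentials for a second storage account, authorized by a different service principal.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name-2>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name-2>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name-2>.dfs.core.windows.net", "<client-id-2>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name-2>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-2>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name-2>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id-2>/oauth2/token")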

RDD API

If you are using the RDD API to access Azure Data Lake Storage Gen2 you cannot access Hadoop configuration options set using spark.conf.set(...). Therefore you must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs:

    fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
    fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <client-id>
    fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
    fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<client-id>")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

    where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.
    

Warning

These credentials are available to all users who access the cluster.

Once your credentials are set up, you can use standard Spark and Databricks APIs to read from the storage account. For example:

val df = spark.read.parquet("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Access directly using the storage account access key

You can access an Azure Data Lake Storage Gen2 storage account using the storage account access key.

Set credentials

The way you set credentials depends on which API you plan to use when accessing Azure Data Lake Storage Gen2: DataFrame, Dataset, or RDD.

DataFrame or Dataset API

If you are using Spark DataFrame or Dataset APIs, we recommend that you set your account credentials in your notebook’s session configs:

spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
  dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))

where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your storage account access key that has been stored as a secret in a secret scope.

RDD API

If you are using the RDD API to access Azure Data Lake Storage Gen2 you cannot access Hadoop configuration options set using spark.conf.set(...). Therefore you must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net <storage-account-access-key>
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>")
      )
    

    where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your storage account access key that has been stored as a secret in a secret scope.

Warning

These credentials are available to all users who access the cluster.

Once your credentials are set up, you can use standard Spark and Databricks APIs to read from the storage account. For example:

val df = spark.read.parquet("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Example notebook

The following notebook demonstrates accessing Azure Data Lake Storage Gen2 directly and with a mount.

Frequently asked questions (FAQ)

Does ABFS support Shared Access Signature (SAS) token authentication?
ABFS does not support SAS token authentication, but the Azure Data Lake Storage Gen2 service itself does support SAS keys.
Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
Yes. However, we recommend that you use the abfss scheme, which uses SSL encrypted access, wherever possible.
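
For example, the same directory can be addressed with either scheme (a minimal Python sketch with illustrative placeholders):

# Recommended: abfss uses SSL-encrypted access.
dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
# Also supported, but traffic is not necessarily encrypted.
dbutils.fs.ls("abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")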
When I accessed an Azure Data Lake Storage Gen2 account with the hierarchical namespace enabled, I experienced a java.io.FileNotFoundException error, and the error message includes FilesystemNotFound.

If the error message includes the following information, it is because your command is trying to access a Blob Storage container created through the Azure Portal:

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When a hierarchical namespace is enabled, you do not need to create containers through the Azure Portal. If you see this issue, delete the Blob container through the Azure Portal. After a few minutes, you will be able to access the container. Alternatively, you can change your abfss URI to use a different container, as long as that container was not created through the Azure Portal.

When I use Databricks Runtime 4.2, 4.3, or 5.0, Azure Data Lake Storage Gen2 fails to list a directory that has lots of files.

With these runtimes, Azure Data Lake Storage Gen2 has a known issue that causes it to fail to list a directory that has more than 5000 files or sub-directories. The error message should look like:

java.io.IOException: GET https://....dfs.core.windows.net/...?resource=filesystem&maxResults=5000&directory=...&timeout=90&recursive=false
StatusCode=403
StatusDescription=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
ErrorCode=AuthenticationFailed
ErrorMessage=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:...
Time:...
I observe the error This request is not authorized to perform this operation using this permission when I try to mount an Azure Data Lake Storage Gen2 filesystem.
This error occurs if the service principal you are using for Azure Data Lake Storage Gen2 is not granted the appropriate role assignment. See Create and grant permissions to service principal.