Azure Blob Storage

Note

Azure Databricks also supports the following Azure data sources: Azure Data Lake Store, Azure Cosmos DB, and Azure SQL Data Warehouse.

Accessing Azure Blob Storage Directly

Data can be read from Azure Blob Storage using the Hadoop FileSystem interface. Data in public storage accounts can be read without any additional configuration.
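
For example, a dataset in a public container can be read with nothing more than its wasbs:// URI. This is a minimal sketch, assuming a public container; the bracketed names are placeholders for your own container, account, and path:

// Public container: no account access key or SAS needs to be configured.
val publicDf = spark.read.csv("wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")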

To read data from a private storage account, you must set an account access key or a Shared Access Signature (SAS) in your notebook:

  • Setting up an account access key:

    spark.conf.set(
      "fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{YOUR STORAGE ACCOUNT ACCESS KEY}")
    
  • Setting up a SAS for a given container:

    spark.conf.set(
      "fs.azure.sas.{YOUR CONTAINER NAME}.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{COMPLETE QUERY STRING OF YOUR SAS FOR THE CONTAINER}")
    

Once an account access key or a SAS is set up in your notebook, you can use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")
dbutils.fs.ls("wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")

Note

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that, while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Blob Storage, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster.

    You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to tell Spark to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net {YOUR STORAGE ACCOUNT ACCESS KEY}
    
    # Using a SAS token
    spark.hadoop.fs.azure.sas.{YOUR CONTAINER NAME}.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net {COMPLETE QUERY STRING OF YOUR SAS FOR THE CONTAINER}
    
  • Scala users can also set the credentials directly on spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{YOUR STORAGE ACCOUNT ACCESS KEY}"
    )
    
    // Using a SAS token
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.sas.{YOUR CONTAINER NAME}.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{COMPLETE QUERY STRING OF YOUR SAS FOR THE CONTAINER}"
    )
    

Warning

In either case, the credentials you set here are available to all users who access the cluster.
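
For example, with one of the credential configurations from the note above in place, the RDD API can resolve wasbs:// paths. A minimal sketch (the container, account, and directory names are placeholders):

// Credentials were set at cluster creation or on spark.sparkContext.hadoopConfiguration,
// so the RDD API can read directly from Blob Storage.
val lines = spark.sparkContext.textFile(
  "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")
println(lines.count())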

Mounting Azure Blob Storage Containers with DBFS

In addition to accessing Azure Blob Storage directly, you can also mount a Blob Storage container, or a folder inside a container, through the Databricks File System (DBFS). The mount point gives all users in the same workspace access to the mounted container or folder. DBFS uses the credential that you provide when you create the mount point to access the mounted Blob Storage container. If a container is mounted using a storage account access key, DBFS uses temporary SAS tokens derived from that key when it accesses the mount point.

Warning

You should only create a mount point if you want all users in the Databricks workspace to have access to the mounted Blob Storage container.

Note

You can mount Blob Storage containers using Databricks Runtime 4.0 or higher. Once a Blob Storage container is mounted, you can access the mount point using Databricks Runtime 3.4 or higher.

Note

Once a mount point is created through a cluster, users of that cluster can immediately access it. To use the mount point on another running cluster, users must run dbutils.fs.refreshMounts() on that cluster to make the newly created mount point available, as in the sketch below.
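
A minimal sketch of what to run on the other cluster, assuming the mount point was already created elsewhere:

// Refresh this cluster's view of DBFS mount points so the new mount becomes visible.
dbutils.fs.refreshMounts()
// List the mount points this cluster currently knows about.
display(dbutils.fs.mounts())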

To mount a Blob Storage container or a folder inside a container, you can use the following command:

  • Scala version

    dbutils.fs.mount(
      source = "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}",
      mountPoint = "{mountPointPath}",
      extraConfigs = Map("{confKey}" -> "{confValue}"))
    
  • Python version

    dbutils.fs.mount(
      source = "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}",
      mount_point = "{mountPointPath}",
      extra_configs = {"{confKey}": "{confValue}"})
    

where

  • {mountPointPath} is a DBFS path representing where the Blob Storage container or a folder inside the container (specified in source) will be mounted in DBFS. Note that this path must be under /mnt.
  • {confKey} and {confValue} specify the credentials used to access the mount point. {confKey} can be either fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net or fs.azure.sas.{YOUR CONTAINER NAME}.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net.
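
For example, the following minimal Scala sketch mounts a container using a storage account access key and then reads through the mount point. The mount path /mnt/my-container and all bracketed names are placeholders for your own values:

// Hypothetical names; replace the placeholders with your own container, account, and key.
dbutils.fs.mount(
  source = "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
  mountPoint = "/mnt/my-container",  // placeholder mount path; must be under /mnt
  extraConfigs = Map(
    "fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net" -> "{YOUR STORAGE ACCOUNT ACCESS KEY}"))

// Once mounted, the container is reachable through the DBFS path.
val mountedDf = spark.read.parquet("/mnt/my-container/{YOUR DIRECTORY NAME}")
dbutils.fs.ls("/mnt/my-container")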

Unmounting

To unmount a mount point, use the following command:

dbutils.fs.unmount("{mountPointPath}")