Error when reading data from ADLS Gen1 with sparklyr

Problem

When you use a cluster with Azure AD Credential Passthrough enabled, commands you run on that cluster can read and write data in Azure Data Lake Storage (ADLS) Gen1 without requiring you to configure service principal credentials for access to storage.

For example, you can access data directly:

spark.read.csv("adl://myadlsfolder.azuredatalakestore.net/MyData.csv").collect()

However, when you try to read the same data with sparklyr:

spark_read_csv(sc, name = "air", path = "adl://myadlsfolder.azuredatalakestore.net/MyData.csv")

It fails with the error:

com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token

Cause

The spark_read_csv() function in sparklyr cannot obtain the Azure AD passthrough token, so it cannot authenticate to ADLS Gen1 and read the data.

Solution

A workaround is to use an Azure application ID, application key, and directory (tenant) ID to mount the ADLS Gen1 location in DBFS:

# Get credentials and ADLS URI from Azure
applicationId = "<application-id>"
applicationKey = "<application-key>"
directoryId = "<directory-id>"
adlURI = "<adl-uri>"
assert adlURI.startswith("adl:"), "Verify the adlURI variable is set and starts with adl:"
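To avoid keeping the application key in plain text in the notebook, you can store it in a Databricks secret scope and retrieve it with dbutils.secrets.get. A minimal sketch, assuming a secret scope named adls-gen1 and a key named application-key (both hypothetical names):

# Hypothetical: fetch the application key from a secret scope
# ("adls-gen1" and "application-key" are placeholder names)
applicationKey = dbutils.secrets.get(scope = "adls-gen1", key = "application-key")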

# Mount ADLS location to DBFS
dbfsMountPoint = "<mount-point-location>"
dbutils.fs.mount(
  mount_point = dbfsMountPoint,
  source = adlURI,
  extra_configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": applicationId,
    "dfs.adls.oauth2.credential": applicationKey,
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/{}/oauth2/token".format(directoryId)
  })
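After the mount command completes, you can confirm that the mount works by listing its contents; an invalid URI or bad credentials surface here as an exception. For example:

# Sanity check: list files under the new mount point;
# this fails if the mount or the credentials are invalid
display(dbutils.fs.ls(dbfsMountPoint))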

Then, in your R code, read data using the mount point:

%r
# Install and load sparklyr
install.packages("sparklyr")
library(sparklyr)

# Create a sparklyr connection
sc <- spark_connect(method = "databricks")

%r
# Read data using the mount point
myData <- spark_read_csv(sc, name = "air", path = "dbfs:/<mount-point-location>/MyData.csv")
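When you no longer need the mount, or if you rotate the application key, you can remove it from a Python cell with dbutils.fs.unmount:

# Remove the DBFS mount once it is no longer needed
dbutils.fs.unmount(dbfsMountPoint)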