sparklyr

Azure Databricks supports sparklyr in notebooks and jobs.

Requirements

Azure Databricks supports sparklyr 0.5.5 and above with Apache Spark 2.2 and above and Scala 2.11. This guide is based on Apache Spark 2.2.

Install sparklyr

Note

  • Databricks Runtime 5.3 and above installs the latest stable version of sparklyr, so you can skip the installation step on these runtimes.
  • Some sparklyr dependencies are installed as source packages and require the latest version of the Rcpp package. Update this package before installing sparklyr.

You can install sparklyr from CRAN or GitHub.

  • Install the latest version of sparklyr from CRAN.

    # Install latest version of Rcpp
    install.packages("Rcpp")
    
    # Install sparklyr. This can take a few minutes because it installs more than 10 dependencies.
    install.packages("sparklyr")
    
    # Load sparklyr package.
    library(sparklyr)
    
  • Install the latest development version of sparklyr from GitHub.

    # Install latest version of Rcpp
    install.packages("Rcpp")
    
    # Use devtools to install sparklyr from GitHub
    devtools::install_github("rstudio/sparklyr")
    
    # Load sparklyr package.
    library(sparklyr)
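
On runtimes that already include sparklyr, the installation can be skipped. A minimal sketch of guarding the install so that it runs only when the package is missing, using base R functions only; the variable name has_sparklyr is illustrative and not part of any API:

```r
# Check whether sparklyr is already available on this cluster.
has_sparklyr <- requireNamespace("sparklyr", quietly = TRUE)

# Install from CRAN only when the package is missing.
if (!has_sparklyr) {
  install.packages("Rcpp")
  install.packages("sparklyr")
}

# Load the package and report the installed version.
library(sparklyr)
print(packageVersion("sparklyr"))
```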
    

Connect sparklyr to Azure Databricks clusters

To establish a sparklyr connection, use "databricks" as the connection method in spark_connect(). No additional parameters are needed, and you do not need to call spark_install(), because Spark is already installed on an Azure Databricks cluster.

# Create a sparklyr connection
sc <- spark_connect(method = "databricks")

Progress bars and Spark UI with sparklyr

If you assign the sparklyr connection object to a variable named sc as in the above example, you will see Spark progress bars in the notebook after each command that triggers Spark jobs. In addition, you can click the link next to the progress bar to view the Spark UI associated with the given Spark job.

Use sparklyr

After you install sparklyr and establish the connection, all other sparklyr APIs work as they normally do. See the example notebook below for some examples.

sparklyr is usually used along with other tidyverse packages such as dplyr. Most of these packages are pre-installed on Databricks for your convenience. You can simply import them and start using the API.
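
As an illustration of using sparklyr with dplyr, the sketch below copies R's built-in mtcars data frame to Spark and summarizes it with dplyr verbs. It assumes it runs in a notebook attached to a cluster; the table name "mtcars_spark" and the variable names are illustrative.

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster that the notebook is attached to.
sc <- spark_connect(method = "databricks")

# Copy a local R data frame to Spark; "mtcars_spark" is an illustrative table name.
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and evaluated lazily on the cluster.
mpg_by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl)

# collect() runs the query and returns the result as a local data frame.
result <- collect(mpg_by_cyl)
print(result)
```

Nothing is computed on the cluster until collect() (or another action such as printing the table) is called, which is when the Spark progress bar appears.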

Use sparklyr and SparkR together

SparkR and sparklyr can be used together in a single notebook or job. You can import SparkR along with sparklyr and use its functionality. In Azure Databricks notebooks, the SparkR connection is pre-configured.

Some of the functions in SparkR mask a number of functions in dplyr:

> library(SparkR)
The following objects are masked from ‘package:dplyr’:

arrange, between, coalesce, collect, contains, count, cume_dist,
dense_rank, desc, distinct, explain, filter, first, group_by,
intersect, lag, last, lead, mutate, n, n_distinct, ntile,
percent_rank, rename, row_number, sample_frac, select, sql,
summarize, union

If you import SparkR after you import dplyr, you can still reference the masked dplyr functions by their fully qualified names, for example, dplyr::arrange(). Similarly, if you import dplyr after SparkR, the functions in SparkR are masked by dplyr, and you can reference them as SparkR::arrange() and so on.
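
A minimal sketch of the fully qualified form, assuming both packages are installed and using R's built-in mtcars data frame as local sample data; the variable name sorted is illustrative:

```r
library(dplyr)
library(SparkR)  # loaded second, so SparkR masks dplyr functions such as arrange()

# A bare arrange() now resolves to SparkR; qualify the call to reach dplyr.
sorted <- dplyr::arrange(mtcars, mpg)
print(head(sorted, 3))
```

Qualifying each call keeps both packages attached, at the cost of more verbose code.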

Alternatively, you can detach one of the two packages when you do not need it.

detach("package:dplyr")

Use sparklyr in spark-submit jobs

You can run scripts that use sparklyr on Azure Databricks as spark-submit jobs, with minor code modifications. Some of the instructions above do not apply to spark-submit jobs. In particular, you must provide the Spark master URL to spark_connect(). For an example, refer to Create and run a spark-submit job for R scripts.
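
The following is a rough sketch of the connection step in such a script, not the documented example. It assumes, purely for illustration, that the Spark master URL is passed to the script as its first command-line argument; how the URL actually reaches the script depends on how the job is configured.

```r
library(sparklyr)

# Assumption: the Spark master URL is passed as the first script argument.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) {
  stop("Pass the Spark master URL as the first argument")
}
master_url <- args[[1]]

# Connect with an explicit master URL instead of method = "databricks".
sc <- spark_connect(master = master_url)

# ... sparklyr code goes here ...

# Close the connection when the script finishes.
spark_disconnect(sc)
```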

Unsupported features

Azure Databricks does not support sparklyr methods that require a local browser, such as spark_web() and spark_log(). However, because the Spark UI is built into Azure Databricks, you can still inspect Spark jobs and logs there. See Cluster driver and worker logs.