Cluster Node Initialization Scripts

An init script is a shell script that runs during startup for each cluster node before the Spark driver or worker JVM starts. Some examples of tasks performed by init scripts include:

  • Install packages and libraries not included in the Databricks runtime
  • Modify the JVM system classpath in special cases
  • Set system properties and environment variables used by the JVM
  • Modify Spark configuration parameters

Important

To install Python packages, use the Azure Databricks pip binary located at /databricks/python/bin/pip so that the packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <packagename>.
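
As an illustration of the pip recommendation, the following sketch builds the text of an init script that installs Python packages with /databricks/python/bin/pip. The helper name build_pip_init_script and the package simplejson are hypothetical and chosen only for the example; the resulting string is what you would pass to dbutils.fs.put.

```python
import shlex

# Path of the Databricks pip binary, as given in this article.
DATABRICKS_PIP = "/databricks/python/bin/pip"


def build_pip_init_script(packages):
    """Return init script text that installs each package with the Databricks pip.

    Hypothetical helper for illustration; it only builds a string and does not
    touch DBFS.
    """
    if not packages:
        raise ValueError("at least one package name is required")
    # The shebang must be the first line of the script.
    lines = ["#!/bin/bash"]
    for name in packages:
        # shlex.quote guards against shell metacharacters in package names.
        lines.append("%s install %s" % (DATABRICKS_PIP, shlex.quote(name)))
    return "\n".join(lines) + "\n"


# "simplejson" is only an example package name.
script_text = build_pip_init_script(["simplejson"])
print(script_text)
```

In a notebook, script_text could then be written to the init scripts directory with dbutils.fs.put, as shown in the procedures later in this article.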

Init scripts apply to both manually created clusters and clusters created by jobs. You create the script once, and it runs every time the cluster starts.

Types of init scripts

Azure Databricks supports two kinds of init scripts: global and cluster-specific. Both kinds are created and managed in the Databricks File System (DBFS).

Global init scripts

Global init scripts run on every cluster at startup. Global init scripts are stored in dbfs:/databricks/init/.

Cluster-specific init scripts

Cluster-specific init scripts apply to a single cluster, identified by the cluster’s name. They reside in a subdirectory of the init scripts directory that has the same name as the cluster. For example, to specify init scripts for the cluster named PostgreSQL, create the directory dbfs:/databricks/init/PostgreSQL and put all shell scripts that should run on cluster PostgreSQL in that directory.

Init script output

Databricks saves all init script output to a file in DBFS named as follows: dbfs:/databricks/init/output/<cluster-name>/<date-timestamp>/<script-name>_<node-ip>.log. For example, if a cluster PostgreSQL has two Spark nodes with IPs 10.0.0.1 and 10.0.0.2, and the init script directory has a script called installpostgres.sh, there will be two output files at the following paths:

  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.2.log
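
The naming pattern can be sketched as a small function, which is handy when you want to locate the log for a particular node. The function name init_output_log_path is hypothetical; the path layout and the sample values are taken from the example above.

```python
def init_output_log_path(cluster_name, timestamp, script_name, node_ip):
    """Build the DBFS path of an init script log, following the pattern
    dbfs:/databricks/init/output/<cluster-name>/<date-timestamp>/<script-name>_<node-ip>.log
    """
    return "dbfs:/databricks/init/output/%s/%s/%s_%s.log" % (
        cluster_name, timestamp, script_name, node_ip)


# One log file is written per node, so build one path per node IP.
node_ips = ["10.0.0.1", "10.0.0.2"]
log_paths = [
    init_output_log_path("PostgreSQL", "2016-01-01_12-00-00", "installpostgres.sh", ip)
    for ip in node_ips
]
for path in log_paths:
    print(path)
```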

Note

  • Any change to an init script requires a cluster restart.
  • Avoid spaces in cluster names, since the names are used in the script and output paths.
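
Because the cluster name becomes part of a DBFS path, it can help to derive the script directory in one place and check the name there. The sketch below is one possible approach; the helper init_script_dir is hypothetical, and rejecting whitespace with an error is a design choice made for this example rather than a rule enforced by Azure Databricks.

```python
INIT_ROOT = "dbfs:/databricks/init"


def init_script_dir(cluster_name=None):
    """Return the init script directory.

    With no cluster name, return the global directory; otherwise return the
    cluster-specific subdirectory named after the cluster.
    """
    if cluster_name is None:
        return INIT_ROOT + "/"
    if not cluster_name:
        raise ValueError("cluster name must not be empty")
    # Assumption for this sketch: treat any whitespace in the name as an error.
    if any(ch.isspace() for ch in cluster_name):
        raise ValueError("cluster name contains whitespace: %r" % cluster_name)
    return "%s/%s/" % (INIT_ROOT, cluster_name)


print(init_script_dir())
print(init_script_dir("PostgreSQL"))
```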

Create a global init script

Warning

Global init scripts run on every cluster at cluster startup. Be careful about what you place in these init scripts.

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the list of existing global init scripts.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Create a script that simply appends to a file.

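    # The shebang follows the opening quotes directly so that it is the first line of the script.
    # The final argument, True, overwrites an existing file with the same name.
    dbutils.fs.put("dbfs:/databricks/init/my-echo.sh", """#!/bin/bash
    
    echo "hello" >> /hello.txt
    """, True)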
    
  4. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    

Every time a cluster launches, each node runs this script and appends a line to /hello.txt.

Create a cluster-specific init script

This section creates an init script for a cluster named PostgreSQL that installs the PostgreSQL JDBC driver on that cluster. To make the commands reusable for other clusters, store the cluster name in a variable named clusterName.

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the contents of the init scripts directory, which holds global init scripts and the cluster-specific subdirectories.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Configure a cluster name variable.

    clusterName = "PostgreSQL"
    
  4. Create a directory named PostgreSQL in DBFS.

    dbutils.fs.mkdirs("dbfs:/databricks/init/%s/"%clusterName)
    
  5. Create the script.

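    # Build the path from clusterName so the script lands in the directory created in step 4.
    # The shebang follows the opening quotes directly so that it is the first line of the script.
    dbutils.fs.put("dbfs:/databricks/init/%s/postgresql-install.sh" % clusterName, """#!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    """, True)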
    
  6. Check that the cluster-specific init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/%s/postgresql-install.sh"%clusterName))
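
The directory and script creation steps above can be folded into one function so that the same code works for any cluster name. This is a minimal sketch: publish_cluster_init_script is a hypothetical helper, and its fs argument is assumed to behave like dbutils.fs (only mkdirs and put are called), which also lets you exercise the function outside a notebook with a stand-in object.

```python
def publish_cluster_init_script(fs, cluster_name, script_name, body):
    """Write an init script into the cluster-specific directory.

    fs is assumed to expose mkdirs(path) and put(path, contents, overwrite)
    in the same way as dbutils.fs. Returns the DBFS path of the script.
    """
    script_dir = "dbfs:/databricks/init/%s/" % cluster_name
    # Create the cluster-specific directory if it does not exist yet.
    fs.mkdirs(script_dir)
    script_path = script_dir + script_name
    # True overwrites a previous version of the script.
    fs.put(script_path, body, True)
    return script_path


# In a notebook, the call would look like this (not executed here):
# publish_cluster_init_script(dbutils.fs, "PostgreSQL",
#                             "postgresql-install.sh", script_body)
```

Remember that a changed script takes effect only after the cluster restarts.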
    

Delete an init script

To remove an init script, delete the script file. You can do this in a notebook, with the DBFS API, or with the DBFS CLI. If a global init script is preventing new clusters from starting, use the API or CLI to move or delete the script.

    dbutils.fs.rm("dbfs:/databricks/init/my-echo.sh")
    dbutils.fs.rm("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh")
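
When several scripts need to go at once, a small wrapper can report which deletions succeeded. The sketch below assumes that fs behaves like dbutils.fs and that rm returns a boolean result, as dbutils.fs.rm does; the function name delete_init_scripts is hypothetical.

```python
def delete_init_scripts(fs, paths):
    """Delete each path and return a dict mapping path -> whether rm reported success.

    fs is assumed to expose rm(path) like dbutils.fs.
    """
    results = {}
    for path in paths:
        results[path] = bool(fs.rm(path))
    return results


# In a notebook, the call would look like this (not executed here):
# delete_init_scripts(dbutils.fs, [
#     "dbfs:/databricks/init/my-echo.sh",
#     "dbfs:/databricks/init/PostgreSQL/postgresql-install.sh",
# ])
```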