Problem: Spark Job Fails with Driver is temporarily unavailable

Problem

A Databricks notebook returns the following error:

Driver is temporarily unavailable

This issue can occur intermittently or persistently.

A related error message is:

Lost connection to cluster. The notebook may have been detached.

Cause

One common cause of this error is memory pressure on the driver. When this happens, the driver either crashes with an out-of-memory (OOM) error and is restarted, or becomes unresponsive because of frequent full garbage collections. The memory pressure can result from any of the following:

  • The driver instance type is not optimal for the load executed on the driver.
  • There are memory-intensive operations executed on the driver.
  • There are many notebooks or jobs running in parallel on the same cluster.

Solution

The solution varies from case to case. In the absence of specific details, the easiest way to resolve the issue is to increase the driver memory. To do so, select a larger driver node type on the cluster edit page in your Azure Databricks workspace.

Other points to consider:

  • Avoid memory-intensive operations on the driver, such as:

    • The collect() operator, which brings the entire result set to the driver.
    • Converting a large DataFrame to pandas (for example, with toPandas()).

    If these operations are essential, ensure that enough driver memory is available.

  • Avoid running batch jobs on a shared interactive cluster.

  • Distribute workloads across separate clusters. No matter how large a cluster is, it has a single Spark driver, and the driver's work cannot be spread across the cluster's worker nodes.
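
The advice above about collect() and pandas conversion can be sketched in PySpark as follows. This is a minimal illustration, not a definitive fix: the helper names (preview_rows, iter_rows_bounded, to_pandas_bounded) and the row limits are hypothetical and should be tuned to the driver memory you actually have. The helpers rely only on the standard DataFrame methods limit(), collect(), toLocalIterator(), and toPandas().

```python
from itertools import islice


def preview_rows(df, max_rows=1000):
    """Bring at most max_rows rows to the driver instead of the full result.

    `df` is assumed to be a Spark DataFrame; limit() caps the row count
    before collect() materializes the rows on the driver.
    """
    return df.limit(max_rows).collect()


def iter_rows_bounded(df, max_rows):
    """Fetch rows through toLocalIterator() and stop after max_rows.

    toLocalIterator() pulls data to the driver one partition at a time,
    so the driver does not have to hold the whole result set at once.
    """
    return list(islice(df.toLocalIterator(), max_rows))


def to_pandas_bounded(df, max_rows=10000):
    """Convert only a bounded slice of the DataFrame to pandas."""
    return df.limit(max_rows).toPandas()


# Usage in a notebook, where `df` is an existing Spark DataFrame:
#   rows = preview_rows(df, max_rows=100)
#   pdf = to_pandas_bounded(df, max_rows=5000)
```

If the full result is needed, it is usually safer to write it to storage from the executors (for example, with df.write) than to gather it on the driver.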