Problem: Cluster Failed to Launch

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs.

Cluster timeout

Error messages:

Driver failed to start in time
INTERNAL_ERROR: The Spark driver failed to start within 300 seconds
Cluster failed to be healthy within 200 seconds

Cause

The cluster can fail to launch if it has a connection to an external Hive metastore and it tries to download all the Hive metastore libraries from a Maven repository. A cluster downloads almost 200 JAR files, including dependencies. If the Azure Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, cluster launch fails. This can occur when downloading the JARs takes too long.

Solution

Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options with the External Hive Metastore.
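For example, after copying the metastore JARs to DBFS, the cluster's Spark configuration can point at that local path instead of downloading from Maven at launch. The DBFS path and metastore version below are placeholders; use your own path and the version that matches your metastore:

```
# Spark config sketch (values are placeholders):
# point the metastore at pre-staged jars instead of "maven"
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jars/*
```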

Global or cluster-specific init scripts

Error message:

The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts

Cause

Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker machine, instructing it to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits an issue and doesn't respond (due to a transient networking issue, for example), the 1-hour timeout can be hit, causing the cluster setup job to fail.

Solution

Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts, Azure Databricks does not use synchronous blocking of RPCs to fetch init script execution status.
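A cluster-scoped init script is just a shell script stored where the cluster can read it and referenced in the cluster configuration. As a minimal sketch (the task and log path below are illustrative assumptions, not part of any Databricks API):

```shell
#!/bin/bash
# Hypothetical cluster-scoped init script: runs once on every node at
# cluster start-up. Keep init scripts short so cluster launch stays fast.
set -euo pipefail

# Example task: record that this node completed initialization
# (the log path is illustrative).
echo "node $(hostname) initialized at $(date)" >> /tmp/node-init.log
```

Attach the script in the cluster's configuration; because it is scoped to one cluster, its execution status is not fetched via the synchronous blocking RPCs described above.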

Too many libraries installed in cluster UI

Error message:

Library installation timed out after 1800 seconds. Libraries that are not yet installed:

Cause

This is usually an intermittent issue caused by network problems.

Solution

Usually you can fix this problem by re-running the job or restarting the cluster.

The library installer is configured to time out after 3 minutes. A timeout can occur while fetching and installing JARs due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS location and install them from there.
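One way to pre-stage a library is to download it from Maven Central once, copy it to DBFS, and install it from the DBFS path. The helper below is a hypothetical sketch (not a Databricks API) that builds the standard Maven Central artifact URL from coordinates:

```python
def maven_central_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the standard Maven Central download URL for a jar artifact."""
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )

# Example: the coordinates org.postgresql:postgresql:42.7.3
url = maven_central_url("org.postgresql", "postgresql", "42.7.3")
print(url)
# https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.3/postgresql-42.7.3.jar
```

In a notebook you could then fetch the jar (for example with urllib.request.urlretrieve), copy it into DBFS with dbutils.fs.cp, and install the library from that DBFS path instead of from Maven.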

Cloud provider limit

Error message:

Cluster terminated. Reason: Cloud Provider Limit

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in Unexpected Cluster Termination.

Cloud provider shutdown

Error message:

Cluster terminated. Reason: Cloud Provider Shutdown

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in Unexpected Cluster Termination.

Instances unreachable

Error message:

Cluster terminated. Reason: Instances Unreachable

An unexpected error was encountered while setting up the cluster. Please retry and contact Azure Databricks if the problem persists. Internal error message: Timeout while placing node

Cause

This error is usually returned by the cloud provider. It typically occurs when you have an Azure Databricks workspace deployed to your own virtual network (VNet), as opposed to the default VNet created when you launch a new Azure Databricks workspace. If the VNet where the workspace is deployed is already peered or has an ExpressRoute connection to on-premises resources, the VNet cannot make an SSH connection to the cluster node when Azure Databricks attempts to create a cluster.

Solution

Add a user-defined route (UDR) to give the Azure Databricks control plane SSH access to the cluster instances, Blob storage instances, and artifact resources. This custom UDR allows outbound connections and does not interfere with cluster creation. For detailed UDR instructions, see Step 3: Create user-defined routes and associate them with your Azure Databricks virtual network subnets. For more VNet-related troubleshooting information, see Troubleshooting.
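As a rough sketch of the UDR step using the Azure CLI (all resource names and the address prefix are placeholders; the authoritative per-region control plane addresses are in the linked instructions):

```shell
# Hypothetical example: create a route table with one route and attach it
# to a Databricks subnet. Names and the address prefix are placeholders.
az network route-table create \
  --resource-group my-rg --name databricks-udr

az network route-table route create \
  --resource-group my-rg --route-table-name databricks-udr \
  --name to-control-plane \
  --address-prefix <control-plane-ip>/32 \
  --next-hop-type Internet

az network vnet subnet update \
  --resource-group my-rg --vnet-name my-vnet --name public-subnet \
  --route-table databricks-udr
```

Repeat the subnet association for each subnet used by the workspace, and add one route per control plane address listed in the documentation for your region.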