Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Azure Databricks and others are initiated by the cloud provider. This topic describes termination reasons and steps for remediation.
To defend against API abuses, ensure quality of service, and prevent you from
accidentally creating too many large clusters, Azure Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and
resizing. The throttling uses the
token bucket algorithm
to limit the total number of nodes that anyone can launch over a defined
interval across your Azure Databricks deployment, while allowing burst requests of
certain sizes. Requests coming from both the web UI and the APIs are subject to
rate limiting. When cluster requests exceed rate limits, the limit-exceeding request
fails with a
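As a rough illustration of how token bucket throttling behaves, the following is a minimal sketch and not the Azure Databricks implementation. The class name, the capacity, and the refill rate are hypothetical values chosen for the example; the actual limits are not stated here. Each requested node costs one token, the bucket refills at a steady rate, and a full bucket allows a burst.

```python
class TokenBucket:
    """Illustrative token bucket; capacity and refill rate are assumed values."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity                  # largest burst the bucket allows
        self.refill_per_second = refill_per_second
        self.tokens = capacity                    # start full so an initial burst succeeds
        self.last = now                           # time of the last refill, in seconds

    def try_acquire(self, nodes: int, now: float) -> bool:
        """Return True if a request for `nodes` nodes fits within the limit."""
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if nodes <= self.tokens:
            self.tokens -= nodes
            return True
        return False                              # the limit-exceeding request is rejected


bucket = TokenBucket(capacity=100, refill_per_second=1.0)
print(bucket.try_acquire(80, now=0.0))   # True: burst fits in the full bucket
print(bucket.try_acquire(30, now=0.0))   # False: only 20 tokens remain
print(bucket.try_acquire(30, now=10.0))  # True: 10 seconds refilled 10 tokens
```

Time is passed in explicitly so the behavior is easy to follow; a real limiter would read a monotonic clock instead.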
If you hit the limit for your legitimate workflow, Databricks recommends that you do the following:
- Retry your request a few minutes later.
- Spread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all of your jobs to run at an hourly boundary, try distributing them at different intervals within the hour.
- Consider using clusters with a larger node type and a smaller number of nodes.
- Use autoscaling clusters.
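The first two recommendations above can be sketched in code. This is a minimal example under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever error your client raises when a request is throttled, `request` is any callable that submits your cluster request, and the delay values are illustrative rather than recommended settings.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises when throttled."""


def retry_with_backoff(request, max_attempts=5, base_delay=60.0, max_delay=600.0,
                       sleep=time.sleep, rng=random.random):
    """Call `request`, waiting longer after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff, capped at max_delay, with jitter so that
            # many clients do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + 0.5 * rng()))


def spread_minutes(job_count, window_minutes=60):
    """Return evenly spaced start offsets (in minutes) within the window."""
    return [i * window_minutes // job_count for i in range(job_count)]


print(spread_minutes(4))  # [0, 15, 30, 45]
```

With `spread_minutes(4)`, four hourly jobs would start at minutes 0, 15, 30, and 45 instead of all starting on the hour.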
For other Azure Databricks-initiated termination reasons, see TerminationCode.
This section lists common cloud provider-related termination reasons and remediation steps.
This termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and message from the API are propagated to help you troubleshoot the issue.
- You have reached a quota limit, usually the number of cores, that your subscription can launch. Request a limit increase in the Azure portal. See Azure subscription and service limits, quotas, and constraints.
- You have reached the limit of public IPs that you can have running. Request a limit increase in the Azure portal.
- The resource SKU you have selected (such as VM size) is not available for the location you have selected. To resolve, see Resolve errors for SKU not available.
- Your subscription was disabled. Follow the steps in Why is my Azure subscription disabled and how do I reactivate it? to reactivate your subscription.
- This can occur if someone cancels your Azure Databricks workspace in the Azure portal and you try to create a cluster at the same time. The cluster fails because the resource group is being deleted.
- Your subscription is hitting the Azure Resource Manager request limit (see Throttling Resource Manager requests). A typical cause is that another system (outside Azure Databricks) is making a lot of API calls to Azure. Contact Azure support to identify this system and then reduce the number of API calls.
Azure Databricks was able to launch the cluster, but lost the connection to the instance hosting the Spark driver.
This is caused by the driver virtual machine going down or by a networking issue.