Metrics

Metrics help you monitor the performance of Azure Databricks clusters. Ganglia metrics are available by default.

You can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure.

You can also install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.

Ganglia metrics

To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. GPU metrics are available in the Ganglia UI for GPU-enabled clusters running Databricks Runtime 4.1 and above.

../../_images/metrics-tab.png

To view live metrics, click the Ganglia UI link.

To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time.

Note

Ganglia metrics are not supported on clusters with table access control or credential passthrough enabled.

Configure metrics collection

By default, Azure Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES environment variable using an init script or in the spark_env_vars field in the Cluster Create API.

Azure Monitor

You can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure. For complete instructions, see Monitoring Azure Databricks.

Note

If you have deployed the Azure Databricks workspace in your own virtual network and you have configured network security groups (NSG) to deny all outbound traffic that is not required by Azure Databricks, then you must configure an additional outbound rule for the “AzureMonitor” service tag.

Datadog metrics

../../_images/datadog-metrics.png

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.

To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.