Optimizing Performance with Caching

The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed.

The Delta cache supports reading Parquet files in DBFS, HDFS, Azure Blob Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2 (on Databricks Runtime 5.1 and above). It does not support other storage formats such as CSV, JSON, and ORC.
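
As a minimal sketch (the path below is hypothetical), no explicit call is needed: the first read of a Parquet dataset fetches the files from remote storage and populates the cache, and later reads of the same files are served from the workers' local SSDs:

# First read: the Parquet files are fetched from remote storage and added to the Delta cache.
spark.read.parquet("/mnt/data/events").count()

# Later reads of the same files are served from the nodes' local SSDs.
spark.read.parquet("/mnt/data/events").count()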

Note

The Delta cache works for all Parquet files and is not limited to Delta Lake format files.

Delta and Apache Spark caching comparison

There are two types of caching available in Azure Databricks:

  • Delta caching
  • Apache Spark caching

You can use Delta caching and Spark caching at the same time. This section outlines the key differences between them so that you can choose the best tool for your workflow.

  • Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the results of any subquery and data stored in formats other than Parquet (such as CSV, JSON, and ORC).
  • Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
  • Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache, as shown in the sketch after this list.
  • Disk vs memory-based: The Delta cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
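
The difference in control is easiest to see side by side. In the sketch below (the path and column names are hypothetical), Delta caching happens as a side effect of reading remote Parquet files, while Spark caching has to be requested for each DataFrame:

# Delta cache: nothing to call; reading remote Parquet files populates the cache automatically.
events = spark.read.parquet("/mnt/data/events")

# Spark cache: explicitly mark a DataFrame (an arbitrary subquery result) to be kept in memory.
daily_clicks = events.where("eventType = 'click'").groupBy("date").count()
daily_clicks.cache()   # registers the DataFrame with the memory-based Spark cache
daily_clicks.count()   # materializes the cached result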

Delta cache consistency

The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data.

As of Databricks Runtime 5.5, the Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale entries are automatically invalidated and evicted from the cache.

Warning

When working with the Delta cache on Databricks Runtime 5.4 or below, you must not overwrite the data files. That is, you must not replace a file with a file that has the same name but different content. If the cached data files are overwritten, queries accessing these files can produce invalid results or read errors. Spark uses unique file names and never alters the data files; however, data files can be overwritten by shell commands or external programs. If you need to overwrite the data files, disable the Delta cache.

Use Delta caching

To use Delta caching, choose the Ls series worker type when you configure your cluster.

../_images/dbio-cache-cluster-creation-azure.png

With Databricks Runtime 4.3 and above, the cache is enabled by default. With Databricks Runtime 4.0 to 4.2, enable the cache by setting spark.databricks.io.cache.enabled to true. The Delta cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes.
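
For example, on Databricks Runtime 4.0 to 4.2 the setting can be added to the cluster's Spark configuration when you create the cluster:

spark.databricks.io.cache.enabled true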

For configuration options, see Configure the Delta cache.

Cache a subset of the data

To explicitly select a subset of data to be cached, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don’t need to use this command for the Delta cache to work correctly (the data is cached automatically when it is first accessed), but it can be helpful when you require consistent query performance.
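
For example, to warm the cache with the columns a report will scan (the table and column names here are hypothetical):

CACHE SELECT date, eventType, userId FROM events_db.events WHERE date >= '2019-01-01'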

See Cache for examples and more details.

Monitor the Delta cache

You can check the current state of the Delta cache on each of the executors in the Storage tab in the Spark UI.

../_images/dbio-cache-spark-ui-storage-tab.png

When a node reaches 100% disk usage, the cache manager discards the least recently used cache entries to make space for new data.

Configure the Delta cache

Tip

Azure Databricks recommends that you choose cache-accelerated worker instance types for your clusters. These instance types are automatically configured for optimal Delta cache performance.

Configure disk usage

To configure how the Delta cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:

  • spark.databricks.io.cache.maxDiskUsage - disk space per node reserved for cached data in bytes
  • spark.databricks.io.cache.maxMetaDataCache - disk space per node reserved for cached metadata in bytes
  • spark.databricks.io.cache.compression.enabled - whether the cached data is stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
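
Once the cluster is running, you can read these values back from a notebook to confirm that the settings took effect (a small sketch using the standard Spark configuration API):

spark.conf.get("spark.databricks.io.cache.maxDiskUsage")
spark.conf.get("spark.databricks.io.cache.compression.enabled")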

Enable the Delta cache

To enable or disable the Delta cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not drop the data that is already in local storage. Instead, it prevents queries from adding new data to the cache and from reading data already in the cache.
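
For example, you might disable the cache before a one-off scan of data you don't expect to read again, and re-enable it afterwards (a sketch; the path is hypothetical):

# Keep a one-off scan out of the cache, then re-enable caching for normal workloads.
spark.conf.set("spark.databricks.io.cache.enabled", "false")
spark.read.parquet("/mnt/archive/2015").count()
spark.conf.set("spark.databricks.io.cache.enabled", "true")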