Optimizing Performance with File Management

To improve query speed, Delta Lake on Azure Databricks can optimize the layout of data stored in cloud storage. Delta Lake supports two layout algorithms: bin-packing and Z-Ordering.

This topic describes how to run the optimization commands, explains how the two layout algorithms work, and includes an example notebook that demonstrates the benefits of optimization. It also describes how to clean up stale table snapshots. The FAQ explains why optimization is not automatic and recommends how often to run the optimize commands.

For reference information on Delta Lake on Azure Databricks SQL optimization commands, see SQL Guide.

Compaction (bin-packing)

Delta Lake on Azure Databricks can improve the speed of read queries from a table by coalescing small files into larger ones. You trigger compaction by running the OPTIMIZE command:

OPTIMIZE delta.`/data/events`



If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using WHERE:

OPTIMIZE events WHERE date >= '2017-01-01'


  • Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second instance has no effect.
  • Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.
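The bin-packing idea can be sketched in a few lines of Python. This is an illustrative greedy sketch, not Delta Lake's actual implementation; the 1 GB target output size is an assumption made for the example.

```python
# Illustrative sketch of bin-packing small files toward a target size.
# NOT Delta Lake's implementation; the 1 GB target is an assumption
# used here only for illustration.

TARGET_BYTES = 1024 ** 3  # assumed ~1 GB target output file size

def plan_compaction(file_sizes):
    """Greedily group files into bins of at most TARGET_BYTES each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > TARGET_BYTES:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Example: twenty-five 100 MB files collapse into three ~1 GB groups.
sizes = [100 * 1024 ** 2] * 25
plan = plan_compaction(sizes)
```

Running the sketch twice over already-compacted output would produce the same grouping, which mirrors the idempotence property described above.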

Readers of Delta Lake tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log. OPTIMIZE makes no data-related changes to the table, so a read before and after an OPTIMIZE returns the same results. Running OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source.

Data skipping

Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake on Azure Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
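Conceptually, data skipping works like the following sketch. The per-file statistics and the file names here are hypothetical; Delta Lake stores real statistics in the transaction log and applies them automatically at query time.

```python
# Illustrative sketch of min/max data skipping. File names and stats
# are hypothetical; this is not Delta Lake's internal format.

# Hypothetical per-file column statistics: (min_date, max_date)
file_stats = {
    "part-0001.parquet": ("2017-01-01", "2017-03-31"),
    "part-0002.parquet": ("2017-04-01", "2017-06-30"),
    "part-0003.parquet": ("2017-07-01", "2017-09-30"),
}

def files_to_scan(stats, lower_bound):
    """Keep only files whose max value can satisfy `date >= lower_bound`."""
    return [f for f, (_, max_val) in stats.items() if max_val >= lower_bound]

# A query filtering on date >= '2017-05-01' never reads part-0001.
scanned = files_to_scan(file_stats, "2017-05-01")
```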

For an example of the benefits of Delta Lake on Azure Databricks data skipping and Z-Ordering, see the notebooks in the next section. By default, Delta Lake collects statistics on the first 32 columns defined in your table schema. You can change this value using the table property dataSkippingNumIndexedCols. Collecting statistics on more columns adds overhead as you write files.

Collecting statistics on long strings is an expensive operation. To avoid it, consider tuning the table property dataSkippingNumIndexedCols so that long string columns fall outside the indexed range, or move those columns later in the schema using ALTER TABLE CHANGE COLUMN. Stats collection considers the columns in a flattened schema; that is, each nested column still counts as a single column.
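The flattened-schema counting can be illustrated with a short sketch. The schema below is hypothetical, and the 32-column default comes from the text above; this is not how Delta Lake represents schemas internally.

```python
# Illustrative sketch: counting columns in a flattened schema. Each leaf
# of a nested struct counts toward the indexed-column limit as one
# column. The schema below is hypothetical.

DEFAULT_INDEXED_COLS = 32  # default for dataSkippingNumIndexedCols

def flatten(schema, prefix=""):
    """Yield dotted leaf-column names from a nested schema dict."""
    for name, typ in schema.items():
        full = f"{prefix}{name}"
        if isinstance(typ, dict):          # nested struct
            yield from flatten(typ, full + ".")
        else:
            yield full

schema = {
    "date": "date",
    "eventType": "string",
    "payload": {"userId": "long", "notes": "string"},  # nested struct
}

leaf_columns = list(flatten(schema))
indexed = leaf_columns[:DEFAULT_INDEXED_COLS]  # stats collected on these
```

Because columns are considered in schema order, moving an expensive long-string column toward the end of the schema can push it past the indexed range.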

You can read more on this topic in the blog post: Processing Petabytes of Data in Seconds with Databricks Delta.

Z-Ordering (multi-dimensional clustering)

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Azure Databricks data-skipping algorithms to dramatically reduce the amount of data that needs to be read. To Z-Order data, you specify the columns to order on in the ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column. Z-Ordering on columns that do not have statistics collected is ineffective and wastes resources, because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema or by increasing the number of columns to collect statistics on. See the section Data skipping for more details.
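The idea behind Z-Ordering can be illustrated with bit interleaving, the classic construction of a Z-order (Morton) curve. This is a conceptual sketch of the technique the name refers to, not the exact algorithm that OPTIMIZE ... ZORDER BY runs:

```python
# Conceptual sketch of a Z-order (Morton) curve over two columns.
# Sorting rows by the interleaved key colocates rows that are close in
# BOTH dimensions, which is what makes multi-column data skipping work.
# Not the exact algorithm Databricks uses.

def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z

# Rows as (col_a, col_b); sorting by z_value clusters 2-D neighbors.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))
```

After the sort, the first four rows form the 2x2 block of smallest values in both columns, so a file cut from that prefix has tight min/max ranges on both columns at once. This also shows why effectiveness drops as columns are added: each extra column dilutes how many bits of each column contribute to the ordering.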


  • Prior to Databricks Runtime 5.0, Z-Ordering was not an idempotent operation; it simply rewrote all data matching the given filter every time the command was run. Starting with Databricks Runtime 5.0, Z-Ordering is idempotent, just like bin-packing.

  • Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.

    For example, if you ZORDER BY date and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that the OPTIMIZE job’s task durations will be skewed, as well as the resulting file sizes. This is, however, only a problem for the OPTIMIZE command itself; it should not have any negative impact on subsequent queries.

For an example of the benefits of optimization, see the following notebooks:

Garbage collection

To ensure that concurrent readers can continue reading a stale snapshot of a table, Delta Lake on Azure Databricks leaves deleted files on DBFS for a period of time. To save on storage costs you should occasionally clean up these invalid files using the VACUUM command:

VACUUM delta.`/data/events` -- vacuum files in a path-based table


VACUUM events -- vacuum files for a table defined in the metastore


We recommend that you use a 0-4 worker autoscaling cluster for VACUUM.

Also see Vacuum.

Test the garbage collection

You can specify DRY RUN to test the garbage collection and return a list of files to be deleted:

VACUUM events DRY RUN


Configure the retention threshold

The VACUUM command removes any files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all stale files older than 8 days, you can execute the following SQL command:

VACUUM events RETAIN 192 HOURS



Azure Databricks does not recommend that you set a retention interval shorter than 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed.
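The retention rule can be sketched as follows. The file names, ages, and live set are hypothetical; the sketch only illustrates the two conditions a file must meet before VACUUM may remove it.

```python
# Illustrative sketch of VACUUM's retention rule: a file is removable
# only if it is no longer referenced by the latest table version AND it
# is older than the retention threshold. Names and ages are hypothetical.

RETENTION_HOURS = 7 * 24  # default threshold: 7 days

def files_to_vacuum(all_files, live_files, ages_hours,
                    retention_hours=RETENTION_HOURS):
    """Return files safe to delete under the retention threshold."""
    return [
        f for f in all_files
        if f not in live_files and ages_hours[f] > retention_hours
    ]

all_files = ["a.parquet", "b.parquet", "c.parquet"]
live_files = {"a.parquet"}  # still referenced by the latest version
ages = {"a.parquet": 500, "b.parquet": 200, "c.parquet": 24}

doomed = files_to_vacuum(all_files, live_files, ages)
```

Note how c.parquet survives even though it is stale: at 24 hours old it may still be read by a concurrent reader of an older snapshot, which is exactly why short retention intervals are risky.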

Delta Lake on Azure Databricks has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
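For example, the safety check could be disabled by adding the following line to a cluster's Spark config. Only do this if you are certain that no operation on the table runs longer than the retention interval you plan to specify:

```
spark.databricks.delta.retentionDurationCheck.enabled false
```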

Improving performance for interactive queries

At the beginning of each query, Delta Lake tables auto-update to the latest version of the table. You can observe this process in notebooks when the command status reports: Updating the Delta table's state. However, when you run historical analysis on a table, you may not need up-to-the-last-minute data, especially for tables that ingest streaming data frequently. In these cases, queries can run on stale snapshots of your Delta Lake table, which lowers the latency of getting results.

You can configure how stale your table state can be by setting the Spark session configuration spark.databricks.delta.stalenessLimit with a time string, for example 1h, 15m, or 1d for 1 hour, 15 minutes, and 1 day respectively. This configuration is session specific, so it does not affect how other users access this table from different notebooks, jobs, or BI tools. In addition, this setting does not prevent your table from updating; it only prevents a query from having to wait for the table to update. The update still occurs in the background and shares resources fairly across the cluster. If the staleness limit is exceeded, the query blocks on the table-state update.
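The time strings above follow common duration shorthand; the staleness decision can be sketched like this. This is an illustrative model, not Spark's actual configuration parser or Delta's update scheduler:

```python
# Illustrative sketch: parsing duration shorthand like "1h", "15m", "1d"
# and deciding whether a query must wait for a table-state update.
# Not Spark's actual parser or Delta's scheduling logic.

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(s):
    """Convert strings like '15m' into seconds."""
    value, unit = int(s[:-1]), s[-1]
    return value * UNIT_SECONDS[unit]

def query_must_wait(seconds_since_update, staleness_limit):
    """A query blocks on the state update only past the staleness limit."""
    return seconds_since_update > parse_duration(staleness_limit)
```

With a limit of "1h", a query issued 30 minutes after the last state update proceeds on the stale snapshot while the update continues in the background; a query issued 2 hours after must wait.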

Frequently asked questions (FAQ)

Why isn’t OPTIMIZE automatic?

The OPTIMIZE operation starts up many Spark jobs in order to optimize the file sizing via compaction (and optionally perform Z-Ordering). Since much of what OPTIMIZE does is compact small files, you must first accumulate many small files before this operation has an effect. Therefore, the OPTIMIZE operation is not run automatically.

Moreover, running OPTIMIZE, especially with ZORDER, is an expensive operation in time and resources. If Databricks ran OPTIMIZE automatically or waited to write out data in batches, it would remove the ability to run low-latency Delta Lake streams (where a Delta Lake table is the source). Many customers have Delta Lake tables that are never optimized because they only stream data from these tables, obviating the query benefits that OPTIMIZE would provide.

Lastly, Delta Lake automatically collects statistics about the files that are written to the table (whether through an OPTIMIZE operation or not). This means that reads from Delta Lake tables leverage this information whether or not the table or a partition has had the OPTIMIZE operation run on it.

How often should I run OPTIMIZE?

When you choose how often to run OPTIMIZE, there is a trade-off between performance and cost. You should run OPTIMIZE more often if you want better end-user query performance (necessarily at a higher cost because of resource usage). You should run it less often if you want to optimize cost.

We recommend that you start by running OPTIMIZE on a daily basis, and then adjust the frequency from there.