Introduction to Databricks Delta

Databricks Delta delivers a powerful transactional storage layer by harnessing Apache Spark and Databricks DBFS. The core abstraction of Delta is an optimized Spark table that:

  • Stores data as Parquet files in DBFS.
  • Maintains a transaction log that efficiently tracks changes to the table.

You read and write data stored in the Delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories (see the example following the list below). With the addition of the transaction log and other enhancements, Delta offers significant benefits:

ACID transactions
  • Multiple writers can simultaneously modify a dataset and see consistent views. For qualifications, see Multi-cluster writes.
  • Writers can modify a dataset without interfering with jobs reading the dataset.
Fast read access
  • Automatic file management organizes data into large files that can be read efficiently.
  • Statistics speed up reads by 10-100x, and data skipping avoids reading irrelevant information.
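
As an example, here is a minimal sketch of writing and reading a Delta table with the batch DataFrame API; the events DataFrame and the path /delta/events are purely illustrative:

  # A small example DataFrame (in a Databricks notebook, `spark` is predefined).
  events = spark.range(100).withColumnRenamed("id", "event_id")

  # Write it to DBFS in the Delta format (the path is illustrative).
  events.write.format("delta").save("/delta/events")

  # Read it back with the same batch API used for Parquet files or Hive tables.
  df = spark.read.format("delta").load("/delta/events")
  df.show()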

Requirements

Delta requires Databricks Runtime 4.1 or above. If you created a Delta table using a Databricks Runtime lower than 4.1, you must upgrade the table version. For details, see Table Versioning.

Frequently asked questions (FAQ)

How do Delta tables compare to Hive SerDe tables?

Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta manages on your behalf and that you should never specify manually (a sketch of creating a Delta table without them follows the list):

  • ROW FORMAT
  • SERDE
  • OUTPUTFORMAT and INPUTFORMAT
  • COMPRESSION
  • STORED AS
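
For example, a minimal sketch of creating a Delta table: you declare only the format and location, and Delta manages the storage details. The table name and path are illustrative, and the sketch assumes Delta data already exists at that location so the schema is read from the transaction log:

  # No ROW FORMAT, SERDE, STORED AS, or compression clauses are specified.
  spark.sql("""
    CREATE TABLE events
    USING DELTA
    LOCATION '/delta/events'
  """)
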
Does Delta support multi-table transactions?

Delta does not support multi-table transactions or foreign keys. Delta supports transactions at the table level.
Does Delta support writes or reads using the Spark Streaming DStream API?

Delta does not support the DStream API. We recommend Structured Streaming.
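
As a minimal Structured Streaming sketch, assuming a source Delta table at /delta/events; the output path and checkpoint location are purely illustrative:

  # Read new data incrementally from a Delta table.
  stream = spark.readStream.format("delta").load("/delta/events")

  # Continuously append the stream to another Delta table, tracking progress in a checkpoint.
  (stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/events_copy/_checkpoints")
    .start("/delta/events_copy"))
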
What DDL and DML features does Delta not support?

  • Unsupported DDL features:
    • ANALYZE TABLE PARTITION
    • ALTER TABLE [ADD|DROP] PARTITION
    • ALTER TABLE SET LOCATION
    • ALTER TABLE RECOVER PARTITIONS
    • ALTER TABLE SET SERDEPROPERTIES
    • CREATE TABLE LIKE
    • INSERT OVERWRITE DIRECTORY
    • LOAD DATA
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] with static partitions.
    • Bucketing.
    • Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail; see the example after this list.
    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE.
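
As an illustration of the last point, here is a minimal sketch of the supported way to read a Delta table, letting Delta supply the schema from its transaction log; the path is hypothetical:

  path = "/delta/events"

  # Supported: the schema comes from the Delta transaction log.
  df = spark.read.format("delta").load(path)

  # Not supported: supplying an explicit schema fails.
  # spark.read.format("delta").schema(df.schema).load(path)
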
What does it mean that Delta supports multi-cluster writes?

It means that Delta uses locking to ensure that queries writing to a table from multiple clusters at the same time do not corrupt the table. However, it does not mean that conflicting writes (for example, an update and a delete of the same record) will both succeed. Instead, one of the writes fails atomically and the error tells you to retry the operation; a retry sketch follows.
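
A minimal retry sketch, assuming an append from one of several clusters that may hit a write conflict; the function name is illustrative, and the broad except clause stands in for the runtime-specific conflict exception:

  from time import sleep

  def append_with_retry(df, path, max_attempts=3):
      """Append to a Delta table, retrying when a concurrent write conflict fails an attempt."""
      for attempt in range(1, max_attempts + 1):
          try:
              df.write.format("delta").mode("append").save(path)
              return
          except Exception:
              # The exact conflict exception class depends on the runtime version,
              # so this sketch catches broadly and re-raises after the final attempt.
              if attempt == max_attempts:
                  raise
              sleep(2 ** attempt)  # simple exponential backoff before retrying
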
Can I access Delta tables outside of Databricks Runtime?

There are two cases to consider: external writes and external reads.

  • External writes: Delta maintains additional metadata in the form of a transaction log to enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction log is updated correctly and the proper validations are performed, writes must go through Databricks Runtime.

  • External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. However, since other tools do not support Delta’s transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the partial results of failed transactions.

    In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM with a retention of zero hours to clean up any stale Parquet files that are not currently part of the table (see the example at the end of this answer). This operation puts the Parquet files in DBFS into a consistent state so that they can be read by external tools.

    However, Delta relies on stale snapshots for the following functionality, which breaks when you run VACUUM with zero retention:

    • Snapshot isolation for readers - Long-running jobs continue to read a consistent snapshot from the moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention shorter than the duration of these jobs can cause them to fail with a FileNotFoundException.
    • Streaming from Delta tables - Streams read from the original files written into a table in order to ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can remove these files before the stream has had time to process them, causing it to fail.
    For these reasons, we recommend using the above technique only on static data sets that must be read by external tools.
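
    As a sketch, assuming a Delta table registered as events (a hypothetical name) with no active writers and no streams or long-running readers that depend on older snapshots, the cleanup can be issued through the Python API:

      # Remove all stale files immediately; appropriate only for a static table.
      spark.sql("VACUUM events RETAIN 0 HOURS")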