Frequently Asked Questions (FAQ)

What is Databricks Delta?

Databricks Delta is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards and compatible with Spark APIs, Delta employs co-designed compute and storage to deliver high data reliability and fast query performance across big data use cases, from batch and streaming ingest and fast interactive queries to machine learning, through:

  • ACID transactions
  • Schema enforcement
  • Upserts
  • Data versioning
  • Compaction
  • Caching
  • Z-ordering
  • Data skipping
How is Databricks Delta related to Apache Spark?
Databricks Delta is a compute layer and an associated table format that sit on top of Apache Spark. Together, the format and the compute layer simplify building big data pipelines and increase their overall efficiency.
What format does Databricks Delta use to store data?
Delta uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned data files, Delta stores a transaction log that tracks all commits made to the table or blob store directory, which is how it provides ACID transactions. For details, see Concurrency Control and Isolation Levels.
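As an illustration, you can inspect the layout yourself from a Databricks notebook; a minimal sketch (the path /mnt/delta/events is only an example) would show the Parquet data files alongside a _delta_log directory of JSON commit files:

    # Hedged sketch: list an example Delta table directory from a Databricks
    # notebook. dbutils and display are notebook utilities; the path is assumed.
    display(dbutils.fs.ls("/mnt/delta/events"))              # Parquet data files plus _delta_log/
    display(dbutils.fs.ls("/mnt/delta/events/_delta_log"))   # JSON commit files, one per transaction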
How can I read and write data with Delta?
You can use your favorite Apache Spark APIs to read and write data with Delta. See Read a table and Write to a table.
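For example, a minimal PySpark sketch, assuming df is an existing DataFrame and "events" is an example table name:

    # Write a DataFrame as a Delta table, then read it back with standard Spark APIs.
    df.write.format("delta").saveAsTable("events")   # write to a Delta table
    events = spark.table("events")                   # read the table back as a DataFrame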
Where does Delta store the data?
When writing data, you can specify the location in your cloud storage. Databricks stores the data in that location in Parquet format.
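For example, a sketch that writes to an explicit location (the path /mnt/delta/events is only an assumption for illustration):

    # Write Delta data to a specific cloud storage location, then read it back by path.
    df.write.format("delta").mode("append").save("/mnt/delta/events")
    events = spark.read.format("delta").load("/mnt/delta/events")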
Can I stream data directly into Delta tables?
Yes, you can use Structured Streaming to directly write data into Delta tables. See Stream data into Delta tables.
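For example, a minimal Structured Streaming sketch, where streaming_df, the checkpoint location, and the table path are all assumptions:

    # Continuously append a streaming DataFrame to a Delta table by path.
    (streaming_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/mnt/delta/events/_checkpoints/ingest")
        .start("/mnt/delta/events"))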
Can I stream data from Delta tables?
Yes. See Stream data from Delta tables.
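For example, a sketch that treats a Delta table as a streaming source (the path is only an example):

    # New data appended to the Delta table is picked up incrementally by the stream.
    events_stream = spark.readStream.format("delta").load("/mnt/delta/events")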
Does Delta support writes or reads using the Spark Streaming DStream API?
Delta does not support the DStream API. We recommend Structured Streaming.
When I use Delta, will I be able to port my code to other Spark platforms easily?
Yes. When you use Delta, you are using open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port it, simply change the delta format to the parquet format wherever it appears in your code.
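For example, porting a path-based write is a one-word change (the paths below are assumptions):

    # On Databricks with Delta:
    df.write.format("delta").save("/mnt/delta/events")

    # On another Spark platform, the same write using Parquet:
    df.write.format("parquet").save("/mnt/parquet/events")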
How do Delta tables compare to Hive SerDe tables?

Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta manages on your behalf that you should never specify manually:

  • ROW FORMAT
  • SERDE
  • OUTPUTFORMAT and INPUTFORMAT
  • COMPRESSION
  • STORED AS
Does Delta support multi-table transactions?
Delta does not support multi-table transactions or foreign keys. Delta supports transactions at the table level.
What DDL and DML features does Delta not support?
  • Unsupported DDL features:
    • ANALYZE TABLE PARTITION
    • ALTER TABLE [ADD|DROP] PARTITION
    • ALTER TABLE SET LOCATION
    • ALTER TABLE RECOVER PARTITIONS
    • ALTER TABLE SET SERDEPROPERTIES
    • CREATE TABLE LIKE
    • INSERT OVERWRITE DIRECTORY
    • LOAD DATA
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] with static partitions.
    • Bucketing.
    • Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail.
    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE.
How can I change the type of a column?
Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change column type.
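A hedged sketch of such a rewrite in PySpark; the table name, column name, and use of the overwriteSchema write option are assumptions based on common Delta usage:

    from pyspark.sql.functions import col

    # Read the table, cast the column, and overwrite the table while allowing
    # the schema to change.
    (spark.table("events")
        .withColumn("event_id", col("event_id").cast("bigint"))
        .write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable("events"))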
What does it mean that Delta supports multi-cluster writes?
It means that Delta uses locking to ensure that queries writing to a table from multiple clusters at the same time won’t corrupt the table. However, it does not mean that conflicting writes (for example, an update and a delete of the same record) will both succeed. Instead, one of the writes will fail atomically and the error will tell you to retry the operation.
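A minimal retry sketch, assuming a hypothetical apply_update function that performs the conflicting write; the concrete conflict exception class depends on your runtime, so this sketch catches broadly:

    import time

    def write_with_retry(apply_update, max_attempts=3):
        # Retry the whole operation when a concurrent-write conflict makes it fail.
        for attempt in range(1, max_attempts + 1):
            try:
                apply_update()               # e.g. an UPDATE, DELETE, or MERGE on the Delta table
                return
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(2 ** attempt)     # back off, then retry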
Can I access Delta tables outside of Databricks Runtime?

There are two cases to consider: external writes and external reads.

  • External writes: Delta maintains additional metadata in the form of a transaction log to enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction log is updated correctly and the proper validations are performed, writes must go through Databricks Runtime.

  • External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. However, since other tools do not support Delta’s transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the partial results of failed transactions.

    In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM with a retention of zero hours to clean up any stale Parquet files that are not currently part of the table, as sketched at the end of this section. This operation puts the Parquet files present in DBFS into a consistent state so that they can be read by external tools.

    However, Delta relies on stale snapshots for the following functionality, which breaks when you run VACUUM with zero retention:

    • Snapshot isolation for readers - Long-running jobs will continue to read a consistent snapshot from the moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention shorter than the length of these jobs can cause them to fail with a FileNotFoundException.
    • Streaming from Delta tables - Streams read from the original files written into a table in order to ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can remove these files before the stream has had time to process them, causing it to fail.

    For these reasons, we recommend using this technique only on static data sets that must be read by external tools.
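    As a sketch, the zero-retention cleanup described above might look like this; "events" is an example table name, and note that some runtimes guard very short retention periods behind a safety check:

      # Remove stale Parquet files so external tools see a consistent set of files.
      # Only run this on static data that no active job is writing to.
      spark.sql("VACUUM events RETAIN 0 HOURS")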