When you port existing workloads to Delta Lake, you should be aware of the following simplifications and differences compared with the data sources provided by Apache Spark and Apache Hive.
Delta Lake handles the following operations automatically, which you should never perform manually:
- Delta Lake tables always return the most up-to-date information, so there is no need to manually call
REFRESH TABLEafter changes.
- Add and remove partitions
- Delta Lake automatically tracks the set of partitions present in a table and updates the list as data is added or removed. As a result, there is no need to run
ALTER TABLE [ADD|DROP] PARTITIONor
- Load a single partition
- As an optimization, you may sometimes directly load the partition of data you are interested in. For example,
spark.read.parquet("/data/date=2017-01-01"). This is unnecessary with Delta Lake, since it can quickly scan the list of files to find the list of relevant ones. If you are interested in a single partition, specify it using a
WHEREclause. For example,
spark.read.parquet("/data").where("date = '2017-01-01'").
When you port an existing application to Delta Lake, you should avoid the following operations, which bypass the transaction log:
- Manually modify data
- Delta Lake uses the transaction log to atomically commit changes to the table. Because the log is the source of truth, files that are written out but not added to the transaction log are not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the transaction log.
Instead of manually modifying files stored in a Delta Lake table, always use the DML commands that are described in this guide.
- External readers
- The data stored in Delta Lake is encoded as Parquet files. However, accessing these files using an external reader is not safe.
You’ll see duplicates and uncommitted data and the read may fail when someone runs
Because the files are encoded in an open format, you always have the option to move the files outside Delta. At that point, you can run
VACUUM RETAIN 0 and delete the transaction log. This leaves the table’s files in a consistent state that can be read by the external reader of your choice.
Suppose you have Parquet data stored in the directory
/data-pipeline. You can always read into DataFrame and save as Delta Lake table. This approach copies data and lets Spark manage the table.
Alternatively you can convert to Delta Lake which is faster but results in an unmanaged table.
CONVERT TO DELTA parquet.`/data-pipeline`
For details, see Convert To Delta (Delta Lake).
Read the data into a DataFrame and save it to a new directory in
data = spark.read.parquet("/data-pipeline") data.write.format("delta").save("/delta/data-pipeline/")
Create a Delta Lake table that refers to the files in the Delta Lake directory:
spark.sql("CREATE TABLE events USING DELTA LOCATION '/delta/data-pipeline/'")