Delta Lake on Azure Databricks supports several statements to facilitate updating Delta Lake tables.
UPDATE statement allows you to apply expressions to change the value of columns when a row matches a predicate. For example, you can use
UPDATE to fix a spelling mistake in the
UPDATE events SET eventType = 'click' WHERE eventType = 'clck'
Similar to delete, update operations automatically make use of the partitioning of the table when possible.
MERGE INTO statement allows you to merge a set of updates and insertions into an existing dataset. For example, the following statement takes a stream of updates and merges it into the
events table. When there is already an event present with the same
eventId, Delta Lake updates the data column using the given expression. When there is no matching event, Delta Lake adds a new row.
Here’s a worked example:
MERGE INTO events USING updates ON events.eventId = updates.eventId WHEN MATCHED THEN UPDATE SET events.data = updates.data WHEN NOT MATCHED THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
You must specify a value for every column in your table when you perform an
INSERT (for example, when there is no matching row in the existing dataset). However, you do not need to update all values.
You should add as much information to the
ON condition in
MERGE INTO to both reduce the amount of work and reduce the chances of transaction conflicts. For example, suppose you have a table that is partitioned by
date and you use
MERGE to update information for the last day country by country. If you’re updating
country='USA', then you can write a
MERGE statement such as:
MERGE INTO target_table USING source ON target_table.user_id = source.user_id AND target_table.date = current_date() AND country = 'USA' WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
This lets you break a very large
MERGE operation into smaller chunks and run them all in parallel to get better performance or meet SLAs.
Delta Lake tables allow you to remove data that matches a predicate. For instance, to delete all events from before 2017, you can run the following DML:
DELETE FROM events WHERE date < '2017-01-01'
Delete operations automatically make use of the partitioning of the table when possible. This optimization means that it will be significantly faster to delete data based on partition predicates.