Transactional Writes to Cloud Storage with DBIO

The Databricks DBIO package provides transactional writes to cloud storage for Spark jobs. This addresses a number of performance and correctness issues that arise when Spark is used in a cloud-native setting, for example, writing directly to storage services such as Azure Blob Storage.

Note

When DBIO transactional commit is enabled, metadata files starting with _started_<id> and _committed_<id> accompany the data files created by Spark jobs. Generally you shouldn’t alter these files directly; instead, clean them up with the VACUUM command.
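
For illustration, here is a minimal sketch of what this looks like, assuming a Databricks notebook (where spark and dbutils are predefined) and a hypothetical output path. It writes a small DataFrame and then lists the output directory, where the _started_<id> and _committed_<id> markers appear alongside the part-* data files.

Scala
// Sketch only: write a DataFrame, then inspect the DBIO commit markers.
// The output path is a hypothetical placeholder; point it at your own storage.
val outputPath = "/mnt/my-output/events"

spark.range(0, 1000).toDF("id")
  .write.mode("overwrite").parquet(outputPath)

// Expect _started_<id>, _committed_<id>, and part-* data files in the listing.
dbutils.fs.ls(outputPath).foreach(f => println(f.name))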

Clean up uncommitted files

Use the VACUUM command to remove uncommitted files left over from Spark jobs. VACUUM normally runs automatically after a Spark job completes, but you can also run it manually if a job is aborted.

For example, VACUUM ... RETAIN 1 HOUR removes uncommitted files older than one hour.
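
If a write is aborted before the automatic cleanup runs, one option is to trigger VACUUM yourself. The sketch below assumes a Databricks notebook and a hypothetical output path.

Scala
// Sketch only: manually vacuum an output path when a write job fails.
// The output path is a hypothetical placeholder.
val outputPath = "/mnt/my-output/events"

try {
  spark.range(0, 1000).toDF("id")
    .write.mode("overwrite").parquet(outputPath)
} catch {
  case e: Exception =>
    // Remove uncommitted files older than one hour; never use a shorter horizon.
    spark.sql(s"VACUUM '$outputPath' RETAIN 1 HOURS")
    throw e
}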

Important

  • Avoid vacuuming with a retention horizon of less than one hour. Doing so can cause data inconsistency.
  • You cannot run VACUUM directly against cloud storage. To vacuum cloud storage, mount it to DBFS and run VACUUM on the mounted directory (see the sketch at the end of this section).

Also see Vacuum.

SQL
%sql
-- recursively vacuum an output path
VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]

-- vacuum all partitions of a catalog table
VACUUM tableName [RETAIN <N> HOURS]
Scala
// recursively vacuum an output path
spark.sql("VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]")

// vacuum all partitions of a catalog table
spark.sql("VACUUM tableName [RETAIN <N> HOURS]")