Databricks File System - DBFS

The Databricks File System (DBFS) is a distributed file system installed on Databricks Runtime clusters.

DBFS is a layer over Azure Blob Storage and is accessible from both Python and Scala. In addition:

  • Files in DBFS persist to Azure Blob Storage, so you won’t lose data even after you terminate the clusters.
  • dbutils makes it easy for you to use DBFS and is automatically available (no import necessary) in every Databricks notebook.

You can access DBFS through Databricks Utilities (dbutils) on a Spark cluster, or through the DBFS Command Line Interface on your local computer.
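
For example, in a notebook cell you can list the contents of the DBFS root with dbutils, with no import required (a minimal sketch; the output depends on your workspace):

# List the DBFS root from a notebook cell; dbutils is pre-loaded
display(dbutils.fs.ls("/"))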

DBFS Command Line Interface

The DBFS command line interface leverages the DBFS API to expose an easy-to-use command line interface to DBFS. Using this client, interacting with DBFS is as easy as running:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

For more information about the DBFS command line interface, see Databricks CLI.
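
If you have not set up the CLI yet, a typical installation and authentication flow looks like the following (a sketch assuming pip and a personal access token; your workspace URL and token will differ):

# Install the Databricks CLI, which provides the dbfs command
pip install databricks-cli
# Configure authentication; you will be prompted for your host and token
databricks configure --token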

Saving Files to DBFS with dbutils

Read and write files to DBFS as if it were a local filesystem.

# Create a directory
dbutils.fs.mkdirs("/foobar/")
# Write a small text file
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
# Preview the contents of the file
dbutils.fs.head("/foobar/baz.txt")
# Remove the file
dbutils.fs.rm("/foobar/baz.txt")
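
Note that dbutils.fs.put does not replace an existing file unless you pass the overwrite flag as the third argument (a minimal sketch reusing the path above):

# Overwrite an existing file by passing the overwrite flag
dbutils.fs.put("/foobar/baz.txt", "Hello again, World!", True)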

Use Spark to write to DBFS

# Python
sc.parallelize(range(0, 100)).saveAsTextFile("/tmp/foo.txt")
// Scala
sc.parallelize(0 until 100).saveAsTextFile("/tmp/bar.txt")
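
To read the data back with Spark, point the same APIs at the DBFS path (a minimal sketch; saveAsTextFile writes a directory of part files, which textFile reads back as a whole):

# Python: read the records written above and inspect the first few
sc.textFile("/tmp/foo.txt").take(5)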

Use either no prefix or the dbfs:/ prefix to access a DBFS path.

display(dbutils.fs.ls("dbfs:/foobar"))

Use the file:/ prefix to access the local disk.

dbutils.fs.ls("file:/foobar")

Filesystem cells provide a shorthand for accessing the dbutils filesystem module. Most dbutils.fs commands are available via the %fs magic command as well.

%fs rm -r foobar
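
For example, listing a directory with the magic command:

%fs ls /foobar

is equivalent to calling dbutils.fs.ls("/foobar") in a Python cell (here /foobar is the same hypothetical path used earlier).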

Using Local File I/O APIs

You can use local file I/O APIs to read and write DBFS paths. Databricks configures each node with a FUSE mount at /dbfs that allows processes to read from and write to the underlying distributed storage layer.

# Python
# Write a file to DBFS using Python file I/O APIs
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file back
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

// Scala
import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

Warning

Local file I/O APIs only support files smaller than 2 GB. You might see corrupted files if you use local file I/O APIs to read or write files larger than 2 GB. Use the DBFS command line interface, dbutils.fs, or the Hadoop FileSystem APIs to access large files instead.
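
For example, to move a large file between the driver's local disk and DBFS without hitting the 2 GB limit, you can use dbutils.fs.cp rather than local file objects (a sketch with a hypothetical file name):

# Copy a large local file into DBFS using dbutils instead of local file I/O
dbutils.fs.cp("file:/tmp/large_dataset.csv", "dbfs:/tmp/large_dataset.csv")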

Getting Help

Use the dbutils.fs.help() command anytime to access the help menu for DBFS.
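
You can also ask for help on a single command by passing its name (a short sketch):

# Show the full help menu for the DBFS utilities
dbutils.fs.help()
# Show help for one command, e.g. cp
dbutils.fs.help("cp")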