Access Data

This topic describes how to import data into Azure Databricks, load data using the Apache Spark API, and edit and delete data using Databricks File System commands.

Import data

If you have small files on your local machine that you want to analyze with Azure Databricks, you can upload them to the Databricks File System (DBFS). For simple exploration scenarios, you can:

  • Drop files into or browse to files in the Import & Explore Data box on the landing page:

  • Upload the files in the Create table UI.

For production environments, however, we recommend that you access the Databricks File System using the Databricks CLI or the Databricks REST API. You can also import data directly into your notebooks from a wide variety of data sources.
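
For example, a minimal sketch that copies a local file into DBFS with the Databricks CLI (this assumes the CLI is installed and configured; the local filename is a placeholder):

Bash
databricks fs cp state_income.csv dbfs:/FileStore/tables/state_income.csv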

Load data

You can read your data into Spark directly. For example, if you upload a CSV file, you can read it with any of the following examples.

Tip

For easier access, we recommend that you create a table. See Databases and Tables for more information. An example of loading a table by name follows the code samples below.

Scala
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_income-9f7c5.csv")
Python
sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')
R
sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header="true", inferSchema = "true")
Scala RDD
val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
Python RDD
rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
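
If you create a table, as recommended in the tip above, you can load the data by table name instead of by file path. A minimal sketch, assuming the table is named state_income:

Python
# Load a previously created table by name (the table name is an assumption)
sparkDF = spark.table("state_income")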

If the data is small enough to fit in the driver's memory, you can also load it directly onto the driver node. For example:

Python
import pandas as pd

pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')
R
df <- read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)

Download data to driver

You can use %sh wget <url>/<filename> to download data to the Spark driver node.

Note

The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.
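
Files downloaded this way live on the driver's local disk, so they are lost when the cluster terminates. A minimal sketch that copies a downloaded file into DBFS so that it persists (the filename is a placeholder):

Python
# Copy the downloaded file from the driver's local disk into DBFS
# (replace state_income.csv with the actual filename)
dbutils.fs.cp("file:/databricks/driver/state_income.csv", "dbfs:/FileStore/tables/state_income.csv")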

Edit data

You cannot edit data directly within Azure Databricks, but you can overwrite a data file using Databricks File System commands.
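
For example, a minimal sketch that overwrites a small file in place with dbutils.fs.put (the replacement contents shown here are hypothetical):

Python
# Overwrite the file; the third argument allows overwriting an existing file
dbutils.fs.put("/FileStore/tables/state_income-9f7c5.csv", "state,income\nWA,64000", True)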

Delete data

To delete data, use the following Databricks Utilities command; the second argument specifies whether to delete recursively (in Python, pass True instead of true):

Scala
dbutils.fs.rm("dbfs:/FileStore/tables/state_income-9f7c5.csv", true)

Warning

Deleted data cannot be recovered.