This topic describes how to import data into Azure Databricks, load data using the Apache Spark API, and edit and delete data using Databricks File System commands.
If you have small files on your local machine that you want to analyze with Azure Databricks, you can easily upload them to Databricks File System. For simple exploration scenarios you can:
- Drop files into, or browse to files in, the Import & Explore Data box on the landing page.
- Upload the files in the Create table UI.
For production environments, however, we recommend that you access Databricks File System using the Databricks CLI or the Databricks REST API. You can also use a wide variety of data sources to import data directly in your notebooks.
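As an illustration of the REST route, a small file can be pushed to DBFS through the API's put endpoint. The sketch below is an assumption-laden example, not official client code: the workspace URL and access token are placeholders, and the single-call put endpoint only accepts files up to about 1 MB (larger uploads go through the create/add-block/close endpoints).

```python
# Sketch: upload a small local file to DBFS with the REST API's put endpoint.
# Placeholders: host (workspace URL) and token (personal access token).
import base64
import json
import urllib.request

def encode_contents(data: bytes) -> str:
    # The put endpoint expects the file contents base64-encoded.
    return base64.b64encode(data).decode("ascii")

def dbfs_put(host: str, token: str, local_path: str, dbfs_path: str) -> int:
    # Single-call put handles files up to roughly 1 MB; larger files
    # should go through the create / add-block / close endpoints.
    with open(local_path, "rb") as f:
        body = {
            "path": dbfs_path,
            "contents": encode_contents(f.read()),
            "overwrite": True,
        }
    req = urllib.request.Request(
        f"{host}/api/2.0/dbfs/put",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example call (placeholders, not runnable outside a real workspace):
# dbfs_put("https://<workspace-url>", "<personal-access-token>",
#          "state_income.csv", "/FileStore/tables/state_income.csv")
```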
You can read your data into Spark directly. For example, if you upload a CSV file, you can read your data using one of these examples.
For easier access, we recommend that you create a table. See Databases and Tables for more information.
- Scala
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_income-9f7c5.csv")
- Python
sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')
- R
sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header = "true", inferSchema = "true")
- Scala RDD
val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
- Python RDD
rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
If the data volume is small enough, you can also load this data directly onto the driver node. For example:
- Python
import pandas as pd
pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')
- R
df <- read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
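"Small enough" depends on how much memory the driver node has; one defensive pattern is to check the file size before a single-node read. This is a generic sketch, and the 100 MB threshold is an arbitrary assumption rather than any Databricks limit:

```python
# Sketch: guard a driver-side read with a file-size check.
# The 100 MB threshold is an arbitrary assumption, not a platform limit.
import os

MAX_DRIVER_FILE_BYTES = 100 * 1024 * 1024  # tune to your driver's memory

def small_enough_for_driver(path: str, limit: int = MAX_DRIVER_FILE_BYTES) -> bool:
    """Return True if the file at `path` is at or under the size limit."""
    return os.path.getsize(path) <= limit

# Usage on Databricks (path taken from the examples above):
# if small_enough_for_driver("/dbfs/FileStore/tables/state_income-9f7c5.csv"):
#     import pandas as pd
#     pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv")
```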
You can use %sh wget <url>/<filename> to download data to the Spark driver node. The cell output prints Saving to: '<filename>', but the file is actually saved to the local file system of the driver node, not to DBFS.
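To make a file downloaded this way visible in DBFS, one option is to copy it from the driver's local disk to the /dbfs FUSE mount with an ordinary file copy. A minimal sketch, assuming the mount is available (both paths are placeholders):

```python
# Sketch: copy a driver-local file to a DBFS-backed path.
# On Databricks, DBFS is exposed at /dbfs, so a plain file copy works;
# the destination path is an assumption for illustration.
import shutil

def copy_to_dbfs(local_path: str, dbfs_mounted_path: str) -> str:
    """Copy a driver-local file to a destination path (e.g. under /dbfs)."""
    return shutil.copy(local_path, dbfs_mounted_path)

# Usage on Databricks (placeholder filename left as in the text above):
# copy_to_dbfs("<path-on-driver>/<filename>", "/dbfs/FileStore/tables/<filename>")
```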
You cannot edit data directly within Azure Databricks, but you can overwrite a data file using Databricks File System commands.
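A common overwrite pattern is to write the new version to a temporary file and then swap it in. This is a generic sketch, not Databricks-specific code; os.replace is atomic on a local POSIX file system, though behavior on the DBFS FUSE mount may differ:

```python
# Sketch: overwrite a data file by writing a replacement and swapping it in.
# Paths are illustrative; on Databricks the target could live under /dbfs.
import os

def overwrite_file(path: str, new_contents: str) -> None:
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(new_contents)
    # Atomic on local POSIX file systems; semantics on network/FUSE
    # mounts such as /dbfs may differ.
    os.replace(tmp_path, path)
```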
To delete data, use the Databricks Utilities command dbutils.fs.rm. For example, in a notebook cell:
dbutils.fs.rm("/FileStore/tables/state_income-9f7c5.csv")
To remove a directory and its contents, pass the recursive flag as the second argument.
Deleted data cannot be recovered.