CSV Files

Reading CSV files in Apache Spark is straightforward: specify the path to the file along with any reader options you need. This example uses the diamonds dataset, which is available as an Azure Databricks Dataset.
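For example, in Scala (the /databricks-datasets path is where this dataset is typically found):

val diamonds = spark.read.format("csv")
  .option("header", "true")       // the first line of the file holds the column names
  .option("inferSchema", "true")  // let Spark infer each column's type from the data
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")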

Tip

Sometimes you may find yourself with a variety of CSV files in one folder. You can read an entire directory of CSV files by specifying the directory as the file location as opposed to individual files.
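For instance, with a hypothetical folder of CSV files that share the same layout:

val allReports = spark.read.format("csv")
  .option("header", "true")
  .load("/mnt/raw/reports/")  // hypothetical directory; every CSV file inside it is loaded into one DataFrame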

Read file in any language

This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python, and SQL.
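In Scala, using the diamonds DataFrame read above, those steps look roughly like this (R, Python, and SQL follow the same pattern):

diamonds.show(5)        // print a few sample rows; in a Databricks notebook, display(diamonds) renders a table
diamonds.printSchema()  // print the schema that was inferred for the file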

Specify schema

When the schema of the CSV file is known upfront, you can specify the desired schema to the CSV reader with the schema option.
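A sketch in Scala, assuming the standard column layout of the ggplot2 diamonds CSV (its leading unnamed index column is read as _c0):

import org.apache.spark.sql.types._

val diamondsSchema = StructType(Array(
  StructField("_c0", IntegerType, nullable = true),  // unnamed row-index column in the source file
  StructField("carat", DoubleType, nullable = true),
  StructField("cut", StringType, nullable = true),
  StructField("color", StringType, nullable = true),
  StructField("clarity", StringType, nullable = true),
  StructField("depth", DoubleType, nullable = true),
  StructField("table", DoubleType, nullable = true),
  StructField("price", IntegerType, nullable = true),
  StructField("x", DoubleType, nullable = true),
  StructField("y", DoubleType, nullable = true),
  StructField("z", DoubleType, nullable = true)
))

val diamondsWithSchema = spark.read.format("csv")
  .option("header", "true")
  .schema(diamondsSchema)  // skip schema inference and use the declared types
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")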

Verify correctness of the data

When reading CSV files with a user-specified schema, it is possible that the actual data in the files does not match that schema. For example, a field containing the name of a city will not parse as an integer. The consequences depend on the mode that the parser runs in:

  • PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
  • DROPMALFORMED: drops lines that contain fields that could not be parsed
  • FAILFAST: aborts the reading if any malformed data is found

To set the mode, use the mode option.

val diamonds_permissive = spark.read.format("csv").option("mode", "PERMISSIVE")
  .schema(diamondsSchema).load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

In PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, you can add a _corrupt_record column to the schema.
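A sketch in Scala that extends the diamondsSchema defined above; with a correct schema there may be no malformed rows at all, but the same pattern applies whenever the schema and the data disagree. The column name _corrupt_record is Spark's default name for the corrupt record column.

import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions.col

// Extend the schema with the special string column that receives the raw text of any row that fails to parse.
val schemaWithCorruptColumn = diamondsSchema.add("_corrupt_record", StringType)

val diamonds_corrupt_rows = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schemaWithCorruptColumn)
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
  .cache()  // caching first avoids a known caveat when a query touches only _corrupt_record

// Malformed rows carry the original line in _corrupt_record; well-formed rows have null there.
diamonds_corrupt_rows.filter(col("_corrupt_record").isNotNull).show(truncate = false)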

Pitfalls of reading a subset of columns

The behavior of the CSV parser depends on the set of columns that are read. If the user-specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. The notebook below presents the most common pitfalls.
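A hedged sketch of the issue, using a hypothetical file /tmp/cities.csv with columns city, country, and population, read with a schema that deliberately declares country as an integer:

import org.apache.spark.sql.types._

val wrongSchema = new StructType()
  .add("city", StringType)
  .add("country", IntegerType)    // wrong on purpose: the data actually holds country names
  .add("population", IntegerType)

val cities = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(wrongSchema)
  .load("/tmp/cities.csv")        // hypothetical path

// The parser only materializes the columns a query needs, so these two counts can disagree:
cities.select("city").count()     // "city" parses fine, so rows tend to survive
cities.select("country").count()  // "country" fails to parse, so rows are dropped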