Incompatible Schema in Some Files

Problem

The Spark job fails with an exception like the following while reading Parquet files:

Error in SQL statement: SparkException: Job aborted due to stage failure:
Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0
(TID 868031, 10.111.245.219, executor 31):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
    at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

Cause

In this instance, the java.lang.UnsupportedOperationException is thrown because one or more Parquet files in the folder were written with a schema that is incompatible with the others. In the stack trace above, Spark expects a column of longs but encounters a file whose values are dictionary-encoded as doubles, so decodeToLong fails.
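
This typically happens when separate jobs append to the same folder using different column types. As a minimal sketch of how the mismatch arises (the path /tmp/mixed_schema_parquet and the column names are hypothetical), the following Scala snippet appends one batch where the value column is a long and a second batch where it is a double:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/mixed_schema_parquet"  // hypothetical demo path

// The first batch stores "value" as a long...
Seq((1L, 10L), (2L, 20L)).toDF("id", "value")
  .write.mode("append").parquet(path)

// ...a later batch appends files where "value" is a double.
Seq((3L, 0.5), (4L, 1.5)).toDF("id", "value")
  .write.mode("append").parquet(path)

// A plain read picks the schema from one file's footer; if that schema says
// long, decoding the double-encoded files can fail as in the error above.
spark.read.parquet(path).show()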

Solution

Find the Parquet files with the mismatched schema and rewrite them with the correct one. To identify those files, try reading the dataset with schema merging enabled:

spark.read.option("mergeSchema", "true").parquet(path)

or

spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.read.parquet(path)

If the folder does contain Parquet files with incompatible schemas, the snippets above fail with an error that names the file whose schema could not be merged.
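
Once an offending file is identified, you can inspect each file's schema individually and rewrite the mismatched files with a cast. The sketch below assumes the hypothetical paths and value column from the example above, and uses LongType to stand in for whatever type the rest of the dataset uses:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().getOrCreate()
val datasetPath = "/tmp/mixed_schema_parquet"  // hypothetical path

// Print each file's schema so mismatched files stand out.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(datasetPath))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
  .foreach { f =>
    println(s"$f -> ${spark.read.parquet(f).schema.simpleString}")
  }

// Rewrite an offending file: read it with its actual schema, cast the
// conflicting column to the type the rest of the dataset uses, and write
// the corrected rows to a staging folder.
val badFile = s"$datasetPath/part-00000-bad.snappy.parquet"  // hypothetical name
spark.read.parquet(badFile)
  .withColumn("value", col("value").cast(LongType))
  .write.mode("overwrite").parquet(s"$datasetPath-fixed")

After verifying the staging output, remove the bad file from the original folder, move the corrected file in, and re-run the original read.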