Prevent Duplicated Columns when Joining Two DataFrames

If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how perform a join so that you don’t have duplicated columns.

Join on columns

If you join on columns, you get duplicated columns.

Scala
%scala

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")

val df = left.join(right, left.col("name") === right.col("name"))
Python
%python

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]
left = spark.createDataFrame(llist, ['name','date','duration'])
right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])

df = left.join(right, left.name == right.name)

Solution

Specify the join column as an array type or string.

Scala
%scala

val df = left.join(right, Seq("name"))
%scala

val df = left.join(right, "name")
Python
%python
df = left.join(right, ["name"])
%python
df = left.join(right, "name")
R

First register the DataFrames as tables.

%python

left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
%r
library(SparkR)
sparkR.session()
left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")

The above code results in duplicate columns. The following code does not.

%r
head(drop(join(left, right, left$name == right$name), left$name))