Spark - How to avoid duplicate columns after join?

Extending upon use case given here:
How to avoid duplicate columns after join?
I have two dataframes with hundreds of columns. Here are some samples showing the join columns:
df1.columns
// Array(ts, id, X1, X2, ...)
and
df2.columns
// Array(ts, id, X1, Y2, ...)
After I do:
val df_combined = df1.join(df2, df1("X1") === df2("X1") && df1("X2") === df2("Y2"))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, X1, Y2). X1 is duplicated.
I can't use the join(right: Dataset[_], usingColumns: Seq[String]) API, because that API requires the join columns to exist in both dataframes, which is not the case here (X2 vs. Y2). The only options I see are to rename a column and drop it later, or to alias the second dataframe and drop the column from it afterwards.
Isn't there a simpler API to achieve this, e.g. one that automatically drops one of the join columns in an equality join?
It is, but I'd want to avoid manually selecting all fields after the join. There are too many.
– nir
Jul 2 at 23:51
Okay, how many duplicate columns do you have?
– naveen marri
Jul 2 at 23:53
1 Answer
As you noted, the best way to avoid duplicate columns is to pass a Seq[String] to join. However, since the columns have different names in the two dataframes, there are only two options:
1. Rename the Y2 column to X2 and perform the join as df1.join(df2, Seq("X1", "X2")). If you want to keep both the Y2 and X2 columns afterwards, simply copy X2 to a new column Y2.
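Option 1 as a complete snippet (a sketch, assuming the df1 and df2 from the question and an active SparkSession):

```scala
import org.apache.spark.sql.functions.col

// Rename df2's Y2 to X2 so both join columns share a name,
// then join on the common column names to avoid duplicates.
val joined = df1.join(df2.withColumnRenamed("Y2", "X2"), Seq("X1", "X2"))

// Optionally restore a Y2 column by copying X2 back.
val result = joined.withColumn("Y2", col("X2"))
```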
2. Perform the join as before and drop the unwanted duplicated column(s) afterwards:
df1.join(df2, df1.col("X1") === df2.col("X1") && df1.col("X2") === df2.col("Y2"))
  .drop(df1.col("X1"))
Unfortunately, there is currently no automatic way to achieve this.
When joining dataframes, it's best to make sure they do not share column names (with the exception of the columns used in the join), for example the ts and id columns above. If there are a lot of columns it can be hard to rename them all manually; the code below does it automatically:
import org.apache.spark.sql.functions.col

val prefix = "df1_"
val cols = df1.columns.map(c => col(c).as(s"$prefix$c"))
df1.select(cols: _*)
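Putting it together, a sketch under the same assumptions (the prefixAll helper is hypothetical, and df1/df2 are the dataframes from the question): prefix every column in both dataframes so nothing clashes, then join on the prefixed names; the result has no ambiguous columns at all.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: prefix every column of a dataframe.
def prefixAll(df: DataFrame, prefix: String): DataFrame =
  df.select(df.columns.map(c => col(c).as(s"$prefix$c")): _*)

val left  = prefixAll(df1, "df1_")
val right = prefixAll(df2, "df2_")

// No column name appears twice, so nothing needs to be dropped afterwards.
val joined = left.join(
  right,
  col("df1_X1") === col("df2_X1") && col("df1_X2") === col("df2_Y2"))
```

The trade-off is that downstream code must use the prefixed names, but it makes the provenance of every column explicit.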
Is using spark sql an option for you?
– naveen marri
Jul 2 at 23:45