Spark - How to avoid duplicate columns after join?

Extending upon use case given here:
How to avoid duplicate columns after join?
I have two dataframes with hundreds of columns. Here are some samples showing the join columns:
df1.columns
// Array(ts, id, X1, X2, ...)
and
df2.columns
// Array(ts, id, X1, Y2, ...)
After I do:
val df_combined = df1.join(df2, df1("X1") === df2("X1") && df1("X2") === df2("Y2"))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, X1, Y2). X1 is duplicated.
I can't use the join(right: Dataset[_], usingColumns: Seq[String]) API, because that API requires the join columns to exist in both dataframes, which is not the case here (X2 vs. Y2). The only options I see are to rename a column and drop it later, or to alias the second dataframe and drop the column from it afterwards.
Isn't there a simpler API to achieve this, e.g. one that automatically drops one of the join columns in an equality join?
It is, but I'd want to avoid manually selecting all fields after the join. There are too many.
– nir
Jul 2 at 23:51
Okay, how many duplicate columns do you have?
– naveen marri
Jul 2 at 23:53
1 Answer
As you noted, the best way to avoid duplicate columns is to pass a Seq[String] to join. However, since the columns have different names in the two dataframes, there are only two options:
1. Rename the Y2 column to X2 and perform the join as df1.join(df2, Seq("X1", "X2")). If you want to keep both the Y2 and X2 columns afterwards, simply copy X2 to a new column Y2.
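Option 1 as a complete snippet (a sketch, assuming the df1 and df2 from the question and an active SparkSession):

```scala
import org.apache.spark.sql.functions.col

// Rename df2's Y2 to X2 so both join columns share a name,
// then join on the common column names to avoid duplicates.
val joined = df1.join(df2.withColumnRenamed("Y2", "X2"), Seq("X1", "X2"))

// Optionally restore a Y2 column by copying X2 back.
val result = joined.withColumn("Y2", col("X2"))
```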
2. Perform the join as before and drop the unwanted duplicated column(s) afterwards:
df1.join(df2, df1.col("X1") === df2.col("X1") && df1.col("X2") === df2.col("Y2"))
  .drop(df1.col("X1"))
Unfortunately, there is currently no automatic way to achieve this.
When joining dataframes, it's best to make sure they do not share column names (with the exception of the columns used in the join), for example the ts and id columns above. If there are a lot of columns it can be hard to rename them all manually; the code below does it automatically:
import org.apache.spark.sql.functions.col

val prefix = "df1_"
val cols = df1.columns.map(c => col(c).as(s"$prefix$c"))
df1.select(cols: _*)
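Putting it together, a sketch under the same assumptions (the prefixAll helper is hypothetical, and df1/df2 are the dataframes from the question): prefix every column in both dataframes so nothing clashes, then join on the prefixed names; the result has no ambiguous columns at all.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: prefix every column of a dataframe.
def prefixAll(df: DataFrame, prefix: String): DataFrame =
  df.select(df.columns.map(c => col(c).as(s"$prefix$c")): _*)

val left  = prefixAll(df1, "df1_")
val right = prefixAll(df2, "df2_")

// No column name appears twice, so nothing needs to be dropped afterwards.
val joined = left.join(
  right,
  col("df1_X1") === col("df2_X1") && col("df1_X2") === col("df2_Y2"))
```

The trade-off is that downstream code must use the prefixed names, but it makes the provenance of every column explicit.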
Is using spark sql an option for you?
– naveen marri
Jul 2 at 23:45