Spark - How to avoid duplicate columns after join?

Multi tool use
Multi tool use


Spark - How to avoid duplicate columns after join?



Extending upon use case given here:
How to avoid duplicate columns after join?



I have two dataframes with the 100s of columns. Following are some samples with join columns:


df1.columns
// Array(ts, id, X1, X2, ...)



and


df2.columns
// Array(ts, id, X1, Y2, ...)



After I do:


val df_combined = df1.join(df2, df1.X1===df2.X1 and df1.X2==df2.Y2)



I end up with the following columns: Array(ts, id, X1, X2, ts, id, X1, Y2). X1 is duplicated.


Array(ts, id, X1, X2, ts, id, X1, Y2)


X1



I can't use join(right: Dataset[_], usingColumns: Seq[String]) api as to use this api all columns must be there in both dataframe which is not the case here (X2 and Y2). Only option I see is to rename a column and drop column later or to alias dataframe and drop column later from 2nd dataframe.
Isn't there a simple api to achieve this? E.g. automatically drop one of the join column in case of equality join.


join(right: Dataset[_], usingColumns: Seq[String])


X2


Y2





Is using spark sql an option for you?
– naveen marri
Jul 2 at 23:45





It is. but I'd want to avoid manually selecting all fields from join. There are too many.
– nir
Jul 2 at 23:51





Okay, how many duplicate columns, do you have?
– naveen marri
Jul 2 at 23:53




1 Answer
1



As you noted, the best way to avoid duplicate columns is using a Seq[String] as input to the join. However, since the columns have different names in the dataframes there are only two options:


Seq[String]


join



Rename the Y2 column to X2 and perform the join as df1.join(df2, Seq("X1", "X2")). If you want to keep both the Y2 and X2 column afterwards, simply copy X2 to a new column Y2.


Y2


X2


join


df1.join(df2, Seq("X1", "X2"))


Y2


X2


X2


Y2



Perform the join as before and drop the unwanted duplicated column(s) afterwards:


join


drop


df1.join(df2, df1.col("X1") === df2.col("X1") and df1.col("X2") == df2.col("Y2"))
.drop(df1.col("X1"))



Unforunatly, currently there is no automatic way to achieve this.



When joining dataframes, it's better to make sure they do not have the same column names (with the exception of the columns used in the join). For example, the ts and id columns above. If there is a lot of columns it can be hard to rename them all manually. To do it automatically, the below code can be used:


join


ts


id


val prefix "df1_"
val cols = df1.columns.map(c => col(c).as(s"$prefix$c"))
df1.select(cols:_*)






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

yrdsFbSP,DfvfJc jIZh9r,NuG F Ra8DGmbiwUXiOK6w LANsmop8xeFvf2xFFA1,E5uf,Lm
dGa7r0qLB,d2bb8uDBGS05qtb tgxJANFyAYTAYKdwnOVoV,N sr3 bREg2GY3dHIMu6DItC,CVRV4O C3rUhc3ek9 mKN oV

Popular posts from this blog

PHP contact form sending but not receiving emails

Do graphics cards have individual ID by which single devices can be distinguished?

Create weekly swift ios local notifications