How to flat map a dataframe with unequal lists?


How to flat map a dataframe with unequal lists?



I have a data frame as follows-


a | b | c
[1] | [3,4,5,6] | [7,8,9,10]



i need the output as


a | b | c
1 3 7
1 4 8
1 5 9
1 6 10



Currently i am getting output as below upon using the following statement-


cols=['a','b','c']
df.rdd.flatMap(lambda x: itertools.izip_longest(*[x[c] for c in cols])).toDF(cols)

a | b | c
1 3 7
null 4 8
null 5 9
null 6 10



The number of columns are not fixed. If the solution is a generalized it would be helpful.




1 Answer
1




One option is to use itertools.repeat to repeat each array whose length is less than length of the longest array.


itertools.repeat


from itertools import izip_longest, repeat, chain

cols = df.columns
def zip_with_repeat(row, cols):
M = max(len(row[c]) for c in cols)
return izip_longest(
*[list(chain(*repeat(row[c], (M - len(row[c])) + 1)))[:M] for c in cols]
)

df.rdd.flatMap(lambda row: zip_with_repeat(row, cols)).toDF(cols).show()
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| 1| 3| 7|
#| 1| 4| 8|
#| 1| 5| 9|
#| 1| 6| 10|
#+---+---+---+



For illustrative purposes, suppose you instead had the following DataFrame:


#+--------+------------+-------------+
#| a| b| c|
#+--------+------------+-------------+
#| [1]|[3, 4, 5, 6]|[7, 8, 9, 10]|
#|[10, 20]|[30, 40, 50]| [70, 80, 90]|
#+--------+------------+-------------+



The code would produce:


#+---+---+---+
#| a| b| c|
#+---+---+---+
#| 1| 3| 7|
#| 1| 4| 8|
#| 1| 5| 9|
#| 1| 6| 10|
#| 10| 30| 70|
#| 20| 40| 80|
#| 10| 50| 90|
#+---+---+---+



Notice that the 10 repeated once to fill in the array in column a to the appropriate length.


10


a





Thanks so much @pault you are a life saver
– Visualisation App
Jul 2 at 16:38






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

api-platform.com Unable to generate an IRI for the item of type

How to set up datasource with Spring for HikariCP?

Display dokan vendor name on Woocommerce single product pages