How to flat map a dataframe with unequal lists?
How to flat map a dataframe with unequal lists?
I have a data frame as follows-
a | b | c
[1] | [3,4,5,6] | [7,8,9,10]
i need the output as
a | b | c
1 3 7
1 4 8
1 5 9
1 6 10
Currently i am getting output as below upon using the following statement-
cols=['a','b','c']
df.rdd.flatMap(lambda x: itertools.izip_longest(*[x[c] for c in cols])).toDF(cols)
a | b | c
1 3 7
null 4 8
null 5 9
null 6 10
The number of columns are not fixed. If the solution is a generalized it would be helpful.
1 Answer
1
One option is to use itertools.repeat
to repeat each array whose length is less than length of the longest array.
itertools.repeat
from itertools import izip_longest, repeat, chain
cols = df.columns
def zip_with_repeat(row, cols):
M = max(len(row[c]) for c in cols)
return izip_longest(
*[list(chain(*repeat(row[c], (M - len(row[c])) + 1)))[:M] for c in cols]
)
df.rdd.flatMap(lambda row: zip_with_repeat(row, cols)).toDF(cols).show()
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| 1| 3| 7|
#| 1| 4| 8|
#| 1| 5| 9|
#| 1| 6| 10|
#+---+---+---+
For illustrative purposes, suppose you instead had the following DataFrame:
#+--------+------------+-------------+
#| a| b| c|
#+--------+------------+-------------+
#| [1]|[3, 4, 5, 6]|[7, 8, 9, 10]|
#|[10, 20]|[30, 40, 50]| [70, 80, 90]|
#+--------+------------+-------------+
The code would produce:
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| 1| 3| 7|
#| 1| 4| 8|
#| 1| 5| 9|
#| 1| 6| 10|
#| 10| 30| 70|
#| 20| 40| 80|
#| 10| 50| 90|
#+---+---+---+
Notice that the 10
repeated once to fill in the array in column a
to the appropriate length.
10
a
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Thanks so much @pault you are a life saver
– Visualisation App
Jul 2 at 16:38