How to get dict of first two indexes for multi index data frame


How to get dict of first two indexes for multi index data frame



I have a data frame that looks like the following



enter image description here



I was wondering if there exist a fastest way to create a python dict in pandas that would hold data like following


table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}



Here the keys are users ids and the values are uniques list of dates.



This can be done early in core python but was wondering if there is a pandas or numpy based method to compute this fast. I needed a fast solution that scales well when this data frame grows bigger.



Edit 1: Performances



Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})

table = levels.drop_duplicates()
.groupby(0)[1].apply(list)
.to_dict()

print(table)



Time Taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()



Time Taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


a = {k: list(pd.unique(list(zip(*g))[1]))
for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)



Time Taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()



Edit 2: Benchmarking again



Wrong Result


idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels
mapping = defaultdict(list)
for i, j in zip(l1, l2):
mapping[i].append(j)



Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


a = {k: list(set(list(zip(*g))[1]))
for k, g in groupby(res.index.values.tolist(), itemgetter(0))}





Is possible add timings for pure python solutions?
– jezrael
Jul 3 at 6:42





I think Brad Solomon solutions and first of mine, I am really curious if faster or not. thank you.
– jezrael
Jul 3 at 6:44


Brad Solomon





@jezrael Updated in the question.
– Mayukh Sarkar
Jul 3 at 6:51





Thank you very much, +1
– jezrael
Jul 3 at 6:52




3 Answers
3



Here's one solution using drop_duplicates + groupby.


drop_duplicates


groupby


levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})

table = levels.drop_duplicates()
.groupby(0)[1].apply(list)
.to_dict()

print(table)

{1: [2, 3], 2: [8, 9]}



Setup


df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5],
[2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]],
columns=['col1', 'col2', 'col3', 'col4'])

df = df.set_index(['col1', 'col2', 'col3'])

print(df)

col4
col1 col2 col3
1 2 0 3
1 4
3 1 5
2 8 1 3
1 4
9 2 5



Data from Jz


pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict()
Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}



If you just need list , you can add apply(list) PS : Personally do not think this step is needed


apply(list)


pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict()
Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}





Good catch :) +1
– RafaelC
Jul 2 at 14:05





@RafaelC thank you man
– Wen
Jul 2 at 14:07






@jezrael fixed ;-)
– Wen
Jul 2 at 14:11





Only there is problem with ordering with sets, so not sure if good solution.
– jezrael
Jul 2 at 14:12





@jezrael Yes ordering was not important.
– Mayukh Sarkar
Jul 3 at 7:06



I think if need better performance, use itertools.groupby with unique for return lists in same ordering as original data. If order is not important use set:


itertools.groupby


unique


set


df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}).set_index(['F','B', 'A'])

print (df)
C D E
F B A
a 4 a 7 1 5
5 b 8 3 3
4 c 9 5 6
b 5 d 4 7 9
e 2 1 2
4 f 3 0 4

from itertools import groupby
from operator import itemgetter

a = {k: list(set(list(zip(*g))[1]))
for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)
{'a': [4, 5], 'b': [5, 4]}



Another pandas solution:


d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict()
print (d)
{'a': [4, 5], 'b': [5, 4]}





set was giving far better result than pd.unique. You should change it back
– Mayukh Sarkar
Jul 3 at 6:59


set


pd.unique





@MayukhSarkar - Thank you.
– jezrael
Jul 3 at 7:00






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

api-platform.com Unable to generate an IRI for the item of type

How to set up datasource with Spring for HikariCP?

PHP contact form sending but not receiving emails