How to get dict of first two indexes for multi index data frame

I have a data frame that looks like the following

enter image description here

I was wondering if there exist a fastest way to create a python dict in pandas that would hold data like following

table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}

Here the keys are users ids and the values are uniques list of dates.

This can be done early in core python but was wondering if there is a pandas or numpy based method to compute this fast. I needed a fast solution that scales well when this data frame grows bigger.

Edit 1: Performances

Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)}) table = levels.drop_duplicates() .groupby(0)[1].apply(list) .to_dict() print(table)

Time Taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()

Time Taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

a = {k: list(pd.unique(list(zip(*g))[1])) for k, g in groupby(df.index.values.tolist(), itemgetter(0))} print (a)

Time Taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()

Edit 2: Benchmarking again

Wrong Result

idx = df.index.droplevel(-1).drop_duplicates() l1, l2 = idx.levels mapping = defaultdict(list) for i, j in zip(l1, l2): mapping[i].append(j)

Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = {k: list(set(list(zip(*g))[1])) for k, g in groupby(res.index.values.tolist(), itemgetter(0))}

Is possible add timings for pure python solutions?
– jezrael
Jul 3 at 6:42

I think Brad Solomon solutions and first of mine, I am really curious if faster or not. thank you.
– jezrael
Jul 3 at 6:44

Brad Solomon

@jezrael Updated in the question.
– Mayukh Sarkar
Jul 3 at 6:51

Thank you very much, +1
– jezrael
Jul 3 at 6:52

3 Answers
3

Here's one solution using drop_duplicates + groupby.

drop_duplicates

groupby

levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)}) table = levels.drop_duplicates() .groupby(0)[1].apply(list) .to_dict() print(table) {1: [2, 3], 2: [8, 9]}

Setup

df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5], [2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]], columns=['col1', 'col2', 'col3', 'col4']) df = df.set_index(['col1', 'col2', 'col3']) print(df) col4 col1 col2 col3 1 2 0 3 1 4 3 1 5 2 8 1 3 1 4 9 2 5

Data from Jz

pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict() Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}

If you just need list , you can add apply(list) PS : Personally do not think this step is needed

apply(list)

pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict() Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}

Good catch :) +1
– RafaelC
Jul 2 at 14:05

@RafaelC thank you man
– Wen
Jul 2 at 14:07

@jezrael fixed ;-)
– Wen
Jul 2 at 14:11

Only there is problem with ordering with sets, so not sure if good solution.
– jezrael
Jul 2 at 14:12

@jezrael Yes ordering was not important.
– Mayukh Sarkar
Jul 3 at 7:06

I think if need better performance, use itertools.groupby with unique for return lists in same ordering as original data. If order is not important use set:

itertools.groupby

unique

set

df = pd.DataFrame({'A':list('abcdef'), 'B':[4,5,4,5,5,4], 'C':[7,8,9,4,2,3], 'D':[1,3,5,7,1,0], 'E':[5,3,6,9,2,4], 'F':list('aaabbb')}).set_index(['F','B', 'A']) print (df) C D E F B A a 4 a 7 1 5 5 b 8 3 3 4 c 9 5 6 b 5 d 4 7 9 e 2 1 2 4 f 3 0 4 from itertools import groupby from operator import itemgetter a = {k: list(set(list(zip(*g))[1])) for k, g in groupby(df.index.values.tolist(), itemgetter(0))} print (a) {'a': [4, 5], 'b': [5, 4]}

Another pandas solution:

d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict() print (d) {'a': [4, 5], 'b': [5, 4]}

set was giving far better result than pd.unique. You should change it back
– Mayukh Sarkar
Jul 3 at 6:59

set

pd.unique

@MayukhSarkar - Thank you.
– jezrael
Jul 3 at 7:00

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Fjhtyj