How to get dict of first two indexes for multi index data frame
How to get dict of first two indexes for multi index data frame
I have a data frame that looks like the following
I was wondering if there exist a fastest way to create a python dict in pandas that would hold data like following
table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}
Here the keys are users ids and the values are uniques list of dates.
This can be done early in core python but was wondering if there is a pandas or numpy based method to compute this fast. I needed a fast solution that scales well when this data frame grows bigger.
Edit 1: Performances
Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()
.groupby(0)[1].apply(list)
.to_dict()
print(table)
Time Taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()
Time Taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a = {k: list(pd.unique(list(zip(*g))[1]))
for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)
Time Taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()
Edit 2: Benchmarking again
Wrong Result
idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels
mapping = defaultdict(list)
for i, j in zip(l1, l2):
mapping[i].append(j)
Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = {k: list(set(list(zip(*g))[1]))
for k, g in groupby(res.index.values.tolist(), itemgetter(0))}
I think
Brad Solomon
solutions and first of mine, I am really curious if faster or not. thank you.– jezrael
Jul 3 at 6:44
Brad Solomon
@jezrael Updated in the question.
– Mayukh Sarkar
Jul 3 at 6:51
Thank you very much, +1
– jezrael
Jul 3 at 6:52
3 Answers
3
Here's one solution using drop_duplicates
+ groupby
.
drop_duplicates
groupby
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()
.groupby(0)[1].apply(list)
.to_dict()
print(table)
{1: [2, 3], 2: [8, 9]}
Setup
df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5],
[2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]],
columns=['col1', 'col2', 'col3', 'col4'])
df = df.set_index(['col1', 'col2', 'col3'])
print(df)
col4
col1 col2 col3
1 2 0 3
1 4
3 1 5
2 8 1 3
1 4
9 2 5
Data from Jz
pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict()
Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}
If you just need list , you can add apply(list)
PS : Personally do not think this step is needed
apply(list)
pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict()
Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}
Good catch :) +1
– RafaelC
Jul 2 at 14:05
@RafaelC thank you man
– Wen
Jul 2 at 14:07
@jezrael fixed ;-)
– Wen
Jul 2 at 14:11
Only there is problem with ordering with sets, so not sure if good solution.
– jezrael
Jul 2 at 14:12
@jezrael Yes ordering was not important.
– Mayukh Sarkar
Jul 3 at 7:06
I think if need better performance, use itertools.groupby
with unique
for return lists in same ordering as original data. If order is not important use set
:
itertools.groupby
unique
set
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}).set_index(['F','B', 'A'])
print (df)
C D E
F B A
a 4 a 7 1 5
5 b 8 3 3
4 c 9 5 6
b 5 d 4 7 9
e 2 1 2
4 f 3 0 4
from itertools import groupby
from operator import itemgetter
a = {k: list(set(list(zip(*g))[1]))
for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)
{'a': [4, 5], 'b': [5, 4]}
Another pandas solution:
d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict()
print (d)
{'a': [4, 5], 'b': [5, 4]}
set
was giving far better result than pd.unique
. You should change it back– Mayukh Sarkar
Jul 3 at 6:59
set
pd.unique
@MayukhSarkar - Thank you.
– jezrael
Jul 3 at 7:00
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Is possible add timings for pure python solutions?
– jezrael
Jul 3 at 6:42