Sort list of lists with groupby and count
I want to sort a list of lists. My function needs to return the day that has the fewest activities of a certain type and, if there is a tie, the day with the fewest overall activities. Below is a working solution, but it feels fairly unpythonic because it converts to a dictionary and back to a list. I am looking for a faster way to write this.
print get_day(mylist, 'Activity C') should yield Day 1
print get_day(mylist, 'Activity A') should yield Day 2
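def get_day(l, activity):
    # Group the activities by day
    d = {}
    for x in l:
        if x[0] not in d:
            d[x[0]] = []
        d[x[0]].append(x[1])
    # For each day: [count of the target activity, total activity count]
    d = {k: [v.count(activity), len(v)] for k, v in d.items()}
    l = [[k, v[0], v[1]] for k, v in d.items()]
    # Sort by target-activity count first, then by total count
    l = sorted(l, key=lambda x: (x[1], x[2]))
    return l[0][0]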
mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'], ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
['Day 2', 'Activity D']]
get_day(mylist, 'Activity C')
Day 1
@RafaelC Day 1 has no activity of type C. The function first sorts by the per-day count of that activity type, then by the count of total activities.
– user2242044
2 days ago
What should we do when both the target-activity counts and the overall activity counts are equal?
– Azat Ibrakov
yesterday
@AzatIbrakov in that case it doesn't matter, so one of the two can be arbitrarily picked
– user2242044
yesterday
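To illustrate the tie case discussed in the comments above: Python's built-in `min` returns the first minimal item it encounters, so with a tuple key of (target count, overall count) a full tie is resolved by iteration order. This is only a minimal sketch; the `counts` dictionary and its numbers are made up for illustration and are not taken from the question's data.

```python
# Hypothetical per-day counts: day -> (target activity count, overall count).
# These numbers are illustrative only.
counts = {
    'Day 1': (1, 3),
    'Day 2': (1, 3),  # full tie with 'Day 1'
    'Day 3': (1, 4),  # same target count, more activities overall
    'Day 4': (2, 2),  # fewer activities overall, but a higher target count
}

# Tuples compare element by element, so the target count is compared first
# and the overall count only matters when the target counts are equal.
best = min(counts, key=counts.get)

# min() returns the first minimal item it encounters, so on a full tie
# the winner depends on iteration order.
print(best)
```

On Python 3.7 and later, where dictionaries preserve insertion order, this picks 'Day 1'. Since either tied day is acceptable here, no extra tie-breaker is needed.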
2 Answers
First, we can write a utility for collecting pairs by their first coordinate:
from collections import defaultdict


def collect(items):
    result = defaultdict(list)
    for key, value in items:
        result[key].append(value)
    return result
After that, our get_day function can be written like this:
from collections import Counter
from itertools import imap  # Python 2 only; on Python 3 use the built-in map


def get_day(days_activities, target_activity):
    activities_by_days = collect(days_activities)
    days_by_activities = collect(imap(reversed, days_activities))
    days_target_activity_counter = Counter(days_by_activities[target_activity])

    def to_target_and_overall_activities_counts(day):
        return (days_target_activity_counter[day],
                # if there is a tie
                len(activities_by_days[day]))

    return min(activities_by_days,
               key=to_target_and_overall_activities_counts)
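The code above imports `itertools.imap`, which exists only on Python 2. As a rough sketch of the same idea on Python 3, assuming the built-in `map` is an acceptable substitute (it is lazy there, like `imap`), the two functions could look like this. The sample data is the list from the question.

```python
from collections import Counter, defaultdict


def collect(items):
    # Group the second element of each pair under the first element.
    result = defaultdict(list)
    for key, value in items:
        result[key].append(value)
    return result


def get_day(days_activities, target_activity):
    activities_by_days = collect(days_activities)
    # Built-in map replaces itertools.imap on Python 3.
    days_by_activities = collect(map(reversed, days_activities))
    days_target_activity_counter = Counter(days_by_activities[target_activity])

    def to_target_and_overall_activities_counts(day):
        # Target-activity count first; overall count breaks ties.
        return (days_target_activity_counter[day],
                len(activities_by_days[day]))

    return min(activities_by_days,
               key=to_target_and_overall_activities_counts)


mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]
print(get_day(mylist, 'Activity C'))
print(get_day(mylist, 'Activity A'))
```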
Test
# 'Day 1' has fewest overall activities (3 < 4)
>>> mylist = [['Day 1', 'Activity A'],
...           ['Day 1', 'Activity A'],
...           ['Day 2', 'Activity A'],
...           ['Day 2', 'Activity C'],
...           ['Day 1', 'Activity D'],
...           ['Day 2', 'Activity D'],
...           ['Day 2', 'Activity E']]
>>> get_day(mylist, 'Activity C')
'Day 1'
>>> get_day(mylist, 'Activity A')
'Day 2'
>>> get_day(mylist, 'Activity D')
'Day 1'
I can't guarantee speed here without knowing more about the expected input size and use case, but I think this code is more pythonic.
from collections import defaultdict, Counter


def get_day_pythonic(lst, activity):
    if not lst:
        return
    # Count of activities by day
    day_act_counts = Counter([d for (d, a) in lst])
    # Activity counts per day
    act_counter = defaultdict(Counter)
    for (d, a) in lst:
        act_counter[a][d] += 1
    # NOTE: if planning to call this multiple times, should precompute day_act_counts and act_counter.
    # Here we sort first by lowest count of activity, then total activity counts, and then day name.
    return sorted([(act_counter[activity][d], day_act_counts[d], d) for d in day_act_counts])[0][-1]
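The NOTE in the code above suggests precomputing `day_act_counts` and `act_counter` when the function is called many times. One possible way to split that out is sketched below. The helper names `build_counts` and `get_day_from_counts` are hypothetical, and the sketch uses `min` instead of sorting the full list, which selects the same element.

```python
from collections import Counter, defaultdict


def build_counts(lst):
    """Hypothetical helper: compute both count tables once."""
    # Total number of activities per day.
    day_act_counts = Counter(d for d, a in lst)
    # act_counter[activity][day] -> occurrences of that activity on that day.
    act_counter = defaultdict(Counter)
    for d, a in lst:
        act_counter[a][d] += 1
    return day_act_counts, act_counter


def get_day_from_counts(day_act_counts, act_counter, activity):
    """Hypothetical helper: answer one query from the precomputed tables."""
    if not day_act_counts:
        return None
    # .get avoids inserting an empty Counter for an unseen activity.
    per_day = act_counter.get(activity, Counter())
    # Order by target-activity count, then total count, then day name.
    return min(day_act_counts,
               key=lambda d: (per_day[d], day_act_counts[d], d))


mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]
day_act_counts, act_counter = build_counts(mylist)
print(get_day_from_counts(day_act_counts, act_counter, 'Activity C'))
print(get_day_from_counts(day_act_counts, act_counter, 'Activity A'))
```

With this split, the two tables are built in one pass over the data, and each later query only iterates over the distinct days.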
EDIT: Faster implementation
def get_day(lst, activity):
    if not lst:
        return
    # Count of all activities by day
    day_act_counts = {}
    # Count of interested activity by day
    act_counter = {}
    for (d, a) in lst:
        day_act_counts[d] = day_act_counts.get(d, 0) + 1
        if a != activity:  # don't need exact count for other activities
            continue
        act_counter[d] = act_counter.get(d, 0) + 1
    # Here we take the min first by lowest count of activity, then total activity counts, and then day name.
    return min((act_counter.get(d, 0), day_act_counts[d], d) for d in day_act_counts)[-1]
This is a very nice, clean approach, but unfortunately a bit slow. Using timeit with the data provided, this method is about twice as slow (even precompiled).
– user2242044
yesterday
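A timing comparison like the one mentioned in this comment could be set up roughly as follows. This is a sketch rather than the commenter's actual benchmark: the function names `get_day_dict` and `get_day_counter` and the repetition count are assumptions, and the measured times will vary by machine and Python version.

```python
import timeit
from collections import Counter, defaultdict


def get_day_dict(l, activity):
    # The question's approach: group by day, then sort by the two counts.
    d = {}
    for day, act in l:
        d.setdefault(day, []).append(act)
    rows = [[k, v.count(activity), len(v)] for k, v in d.items()]
    return sorted(rows, key=lambda x: (x[1], x[2]))[0][0]


def get_day_counter(lst, activity):
    # The Counter-based approach from the answer.
    if not lst:
        return None
    day_act_counts = Counter(d for d, a in lst)
    act_counter = defaultdict(Counter)
    for d, a in lst:
        act_counter[a][d] += 1
    return sorted((act_counter[activity][d], day_act_counts[d], d)
                  for d in day_act_counts)[0][-1]


mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]

for func in (get_day_dict, get_day_counter):
    # number=10000 is an arbitrary choice for this sketch.
    seconds = timeit.timeit(lambda: func(mylist, 'Activity C'), number=10000)
    print('%s: %.4f s' % (func.__name__, seconds))
```

On an input this small, fixed overhead such as building `Counter` objects is likely to dominate, so the ranking may change on larger inputs.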
Why should get_day(mylist, 'Activity C') yield Day 1?
– RafaelC
2 days ago