Sort list of lists with groupby and count





I want to sort a list of lists. My function needs to return the day that has the fewest activities of a certain type and, if there is a tie, the day with the fewest overall activities. Below is a working solution, but it feels fairly unpythonic because it converts to a dictionary and back to a list, and I am looking for a faster way to write this.



print get_day(mylist, 'Activity C') should yield Day 1





print get_day(mylist, 'Activity A') should yield Day 2




def get_day(l, activity):
    d = {}

    # Group activities by day.
    for x in l:
        if x[0] not in d.keys():
            d[x[0]] = []
        d[x[0]].append(x[1])

    # Per day: [count of the target activity, total number of activities].
    d = {k: [v.count(activity), len(v)] for k, v in d.items()}

    l = [[k, v[0], v[1]] for k, v in d.items()]

    # Sort by target activity count, then by total count.
    l = sorted(l, key=lambda x: (x[1], x[2]))
    return l[0][0]


mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'], ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]





Why should get_day(mylist, 'Activity C') yield Day 1?
– RafaelC
2 days ago








@RafaelC Day 1 has no activity of type C. The function first sorts by the per-day count of that activity type, then by the count of total activities.
– user2242044
2 days ago





What should we do when both the target activity count and the overall activity count are equal?
– Azat Ibrakov
yesterday






@AzatIbrakov In that case it doesn't matter, so either of the two can be picked arbitrarily.
– user2242044
yesterday




2 Answers



First, we can write a utility for collecting pairs by their first coordinate:


from collections import defaultdict


def collect(items):
    result = defaultdict(list)
    for key, value in items:
        result[key].append(value)
    return result
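To make the shape of the result concrete, here is a minimal sketch that runs collect on the sample data from the question. It assumes Python 3, so the built-in map stands in for imap; the variable names simply mirror the ones used in the function that follows.

```python
from collections import defaultdict


def collect(items):
    """Group the second element of each pair under its first element."""
    result = defaultdict(list)
    for key, value in items:
        result[key].append(value)
    return result


# Sample data from the question.
mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]

# Day -> list of activities on that day.
activities_by_days = collect(mylist)
# Reversing each pair groups days under activities instead.
days_by_activities = collect(map(reversed, mylist))

print(dict(activities_by_days))
print(dict(days_by_activities))
```

Each day maps to the list of its activities, and after reversing the pairs each activity maps to the list of days on which it occurred (with repeats).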



After that, our get_day function can be written as follows:


from collections import Counter
from itertools import imap  # Python 2 only; on Python 3 use the built-in map


def get_day(days_activities, target_activity):
    activities_by_days = collect(days_activities)
    days_by_activities = collect(imap(reversed, days_activities))
    days_target_activity_counter = Counter(days_by_activities[target_activity])

    def to_target_and_overall_activities_counts(day):
        return (days_target_activity_counter[day],
                # if there is a tie
                len(activities_by_days[day]))

    return min(activities_by_days,
               key=to_target_and_overall_activities_counts)



Test


# 'Day 1' has fewest overall activities (3 < 4)
>>> mylist = [['Day 1', 'Activity A'],
...           ['Day 1', 'Activity A'],
...           ['Day 2', 'Activity A'],
...           ['Day 2', 'Activity C'],
...           ['Day 1', 'Activity D'],
...           ['Day 2', 'Activity D'],
...           ['Day 2', 'Activity E']]
>>> get_day(mylist, 'Activity C')
'Day 1'
>>> get_day(mylist, 'Activity A')
'Day 2'
>>> get_day(mylist, 'Activity D')
'Day 1'
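Since itertools.imap only exists in Python 2, here is a sketch of the same idea for Python 3. The name get_day_py3 and the shortened helper name sort_key are my own, not part of the answer; the logic is otherwise unchanged, and it assumes days_activities is a list (it is iterated twice).

```python
from collections import Counter, defaultdict


def collect(items):
    """Group the second element of each pair under its first element."""
    result = defaultdict(list)
    for key, value in items:
        result[key].append(value)
    return result


def get_day_py3(days_activities, target_activity):
    """Hypothetical Python 3 variant: built-in map replaces itertools.imap."""
    activities_by_days = collect(days_activities)
    days_by_activities = collect(map(reversed, days_activities))
    target_counter = Counter(days_by_activities[target_activity])

    def sort_key(day):
        # Target activity count first, overall count as the tie-breaker.
        return target_counter[day], len(activities_by_days[day])

    return min(activities_by_days, key=sort_key)


mylist = [['Day 1', 'Activity A'],
          ['Day 1', 'Activity A'],
          ['Day 2', 'Activity A'],
          ['Day 2', 'Activity C'],
          ['Day 1', 'Activity D'],
          ['Day 2', 'Activity D'],
          ['Day 2', 'Activity E']]

print(get_day_py3(mylist, 'Activity C'))
print(get_day_py3(mylist, 'Activity A'))
print(get_day_py3(mylist, 'Activity D'))
```

Using min with a key function avoids building and sorting an intermediate list, which is the main difference from the code in the question.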



I can't guarantee speed here without knowing more about the expected input size and use case, but I think this code is more Pythonic.


from collections import defaultdict, Counter

def get_day_pythonic(lst, activity):
    if not lst:
        return
    # Total count of activities by day
    day_act_counts = Counter([d for (d, a) in lst])
    # Count of each activity by day: act_counter[activity][day]
    act_counter = defaultdict(Counter)
    for (d, a) in lst:
        act_counter[a][d] += 1
    # NOTE: if planning to call this multiple times, should precompute day_act_counts and act_counter.
    # Here we sort first by lowest count of activity, then total activity counts, and then day name.
    return sorted([(act_counter[activity][d], day_act_counts[d], d) for d in day_act_counts])[0][-1]
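The NOTE about precomputing could look something like the sketch below. The helper names build_counts and get_day_from_counts are hypothetical, chosen only for illustration; the idea is to build both counters once and reuse them across queries.

```python
from collections import Counter, defaultdict


def build_counts(lst):
    """Precompute total activities per day and per-activity counts per day."""
    day_act_counts = Counter(d for d, a in lst)
    act_counter = defaultdict(Counter)
    for d, a in lst:
        act_counter[a][d] += 1
    return day_act_counts, act_counter


def get_day_from_counts(day_act_counts, act_counter, activity):
    """Answer one query from the precomputed counters."""
    if not day_act_counts:
        return None
    # .get avoids inserting an empty Counter for activities never seen.
    counts = act_counter.get(activity, Counter())
    # Order by target activity count, then total count, then day name.
    return min((counts[d], day_act_counts[d], d) for d in day_act_counts)[-1]


# Sample data from the question.
mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]

day_act_counts, act_counter = build_counts(mylist)
print(get_day_from_counts(day_act_counts, act_counter, 'Activity C'))
print(get_day_from_counts(day_act_counts, act_counter, 'Activity A'))
```

After the one-time pass over the data, each query only iterates over the distinct days rather than over every record.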



EDIT: Faster implementation


def get_day(lst, activity):
    if not lst:
        return
    # Count of all activities by day
    day_act_counts = {}
    # Count of interested activity by day
    act_counter = {}
    for (d, a) in lst:
        day_act_counts[d] = day_act_counts.get(d, 0) + 1
        if a != activity:  # don't need exact count for other activities
            continue
        act_counter[d] = act_counter.get(d, 0) + 1
    # Here we take the min first by lowest count of activity, then total activity counts, and then day name.
    return min((act_counter.get(d, 0), day_act_counts[d], d) for d in day_act_counts)[-1]
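To check the speed difference on your own data, a rough timeit harness along these lines may help. The names get_day_sorted and get_day_min are just labels I use here for the two versions above, and the call count of 10,000 is arbitrary; absolute numbers will vary by machine and Python version.

```python
import timeit
from collections import Counter, defaultdict


def get_day_sorted(lst, activity):
    """Counter-based version: counts every activity, then sorts."""
    if not lst:
        return None
    day_act_counts = Counter(d for d, a in lst)
    act_counter = defaultdict(Counter)
    for d, a in lst:
        act_counter[a][d] += 1
    return sorted((act_counter[activity][d], day_act_counts[d], d)
                  for d in day_act_counts)[0][-1]


def get_day_min(lst, activity):
    """Plain-dict version: counts only the target activity, then takes min."""
    if not lst:
        return None
    day_act_counts = {}
    act_counter = {}
    for d, a in lst:
        day_act_counts[d] = day_act_counts.get(d, 0) + 1
        if a == activity:
            act_counter[d] = act_counter.get(d, 0) + 1
    return min((act_counter.get(d, 0), day_act_counts[d], d)
               for d in day_act_counts)[-1]


# Sample data from the question.
mylist = [['Day 1', 'Activity A'], ['Day 2', 'Activity A'],
          ['Day 1', 'Activity A'], ['Day 2', 'Activity C'],
          ['Day 2', 'Activity D']]

# Time 10,000 calls of each version on the same input.
timings = {
    f.__name__: timeit.timeit(lambda f=f: f(mylist, 'Activity C'), number=10000)
    for f in (get_day_sorted, get_day_min)
}
for name, seconds in sorted(timings.items()):
    print('{}: {:.4f}s'.format(name, seconds))
```

With only five records the fixed overhead of each call dominates, so it is worth repeating the measurement on input closer to the real size before drawing conclusions.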





This is a very nice, clean approach, but unfortunately a bit slow. Using timeit with the data provided, this method is about twice as slow (even precompiled).
– user2242044
yesterday



