Faster alternative to iterrows

I know that this topic has been addressed a thousand times, but I can't figure out a solution.
I'm trying to count how often each list (each row of df1.list1) occurs in a column of lists (df2.list2). All lists consist of unique values only. list1 has about 300,000 rows and list2 about 30,000 rows.
I've got working code, but it's terribly slow (because I'm using iterrows). I also tried itertuples(), but it gave me an error ("too many values to unpack (expected 2)"). I found a similar question online: Pandas counting occurrence of list contained in column of lists. In that case, the person considers only the occurrence of a single list within a column of lists. However, I can't work out how to compare each row of df1.list1 against df2.list2.
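For what it's worth, the itertuples() error most likely comes from the unpacking: itertuples() yields a single namedtuple per row rather than an (index, row) pair, so two-name unpacking fails once the frame has more than one column. A minimal sketch, assuming a simple one-column frame:

```python
import pandas as pd

df = pd.DataFrame({'list1': [["a", "b"], ["a", "c"]]})

# itertuples() yields one namedtuple per row; access fields by name
# (t.Index, t.list1) instead of unpacking into (index, row).
rows = [(t.Index, t.list1) for t in df.itertuples()]
print(rows)  # [(0, ['a', 'b']), (1, ['a', 'c'])]
```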
This is what my lists look like (simplified):
df1.list1
0 ["a", "b"]
1 ["a", "c"]
2 ["a", "d"]
3 ["b", "c"]
4 ["b", "d"]
5 ["c", "d"]
df2.list2
0 ["a", "b" ,"c", "d"]
1 ["a", "b"]
2 ["b", "c"]
3 ["c", "d"]
4 ["b", "c"]
What I would like to come up with:
df1
list1 occurrence
0 ["a", "b"] 2
1 ["a", "c"] 1
2 ["a", "d"] 1
3 ["b", "c"] 3
4 ["b", "d"] 1
5 ["c", "d"] 2
That's what I've got so far:
for index, row in df_combinations.iterrows():
    df1.at[index, "occurrence"] = df2["list2"].apply(
        lambda x: all(i in x for i in row["list1"])).sum()
Any suggestions how I can speed things up? Thanks in advance!
I constructed list1; list2 was already given. I could sort them if it's easier to handle.
– Hannah
Jul 3 at 8:31
With sorted lists you should be able to do this much, much faster. The question is whether sorting the lists is going to cost you more time than it saves in the faster comparison.
– Jack Aidley
Jul 3 at 8:58
1 Answer
This should be much faster:
import pandas as pd

df = pd.DataFrame({'list1': [["a","b"],
                             ["a","c"],
                             ["a","d"],
                             ["b","c"],
                             ["b","d"],
                             ["c","d"]] * 100})
df2 = pd.DataFrame({'list2': [["a","b","c","d"],
                              ["a","b"],
                              ["b","c"],
                              ["c","d"],
                              ["b","c"]] * 100})

list2 = df2['list2'].map(set).tolist()
df['occurrence'] = df['list1'].apply(set).apply(
    lambda x: len([i for i in list2 if x.issubset(i)]))
Using your approach:
%timeit for index, row in df.iterrows(): df.at[index, "occurrence"] = df2["list2"].apply(lambda x: all(i in x for i in row['list1'])).sum()
1 loop, best of 3: 3.98 s per loop
Using mine:
%timeit list2 = df2['list2'].map(set).tolist(); df['occurrence'] = df['list1'].apply(set).apply(lambda x: len([i for i in list2 if x.issubset(i)]))
10 loops, best of 3: 29.7 ms per loop
Notice that I've increased the size of both frames by a factor of 100.
EDIT
Skipping the set conversion and testing membership directly on the plain lists can be faster still when the rows are short:
list2 = df2['list2'].tolist()
df['occurrence'] = df['list1'].apply(
    lambda x: sum(all(i in row for i in x) for row in list2))
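If every row of list1 has the same small length (the sample rows are all 2-element combinations, so k = 2 below is an assumption based on the sample data), a sketch that pre-counts every k-element subset of the list2 rows with collections.Counter turns each lookup into a single dictionary access:

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'list1': [["a","b"], ["a","c"], ["a","d"],
                             ["b","c"], ["b","d"], ["c","d"]]})
df2 = pd.DataFrame({'list2': [["a","b","c","d"], ["a","b"],
                              ["b","c"], ["c","d"], ["b","c"]]})

k = 2  # assumed fixed length of every row in list1
# Count every k-element subset occurring in any row of list2 once, up front.
counts = Counter(frozenset(c)
                 for row in df2['list2']
                 for c in combinations(row, k))
# Each row of list1 then costs a single dictionary lookup.
df['occurrence'] = df['list1'].apply(lambda x: counts[frozenset(x)])
print(df['occurrence'].tolist())  # [2, 1, 1, 3, 1, 2]
```

The precomputation is linear in the total number of k-subsets of the list2 rows, so this pays off when list1 is large relative to list2, as it is here (300,000 vs. 30,000 rows).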
Where are your lists coming from? Can you sort them during construction?
– Jack Aidley
Jul 3 at 8:24