Reshaping into binary variables using pandas python

Multi tool use
Multi tool use


Reshaping into binary variables using pandas python



I am still new to Python pandas' pivot_table and im trying to reshape the data to have a binary indicator if a value is in a certain observation. I have follow some previous codes and got some encouraging results, however instead of 1 and zeros as Is my ideal result I get a sum. Please see a small sample data set below


ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1



The results I am aiming for is:


ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1



My code atm get the following result:


ID A B C E
1 2 1 0 0
2 0 0 2 0
3 0 0 0 1



Should I remove the duplicates first??



The code I'm using atm is below;


df_pivot = df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)




3 Answers
3



Try like this:


df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)



Or this:


df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)



Whichever suits you best.



You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.


aggfunc='any'


int


lambda


df_pivot = df.pivot_table(index='ID', columns='SKILL',
aggfunc='any', fill_value=0).astype(int)

print(df_pivot)

NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1



The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.


aggfunc=len


int





It worked, tks. R would not flip my table because is too big.
– Ian_De_Oliveira
Jul 4 at 0:24



You can use get_dummies with set_index for indicator columns and then get max values per index:


get_dummies


set_index


max


df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)



For better performance remove duplicates by drop_duplicates and reshape by set_index with unstack:


drop_duplicates


set_index


unstack


df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)



Solution with pivot, but then is necessary replace NaNs to 0:


pivot


NaN


0


df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)



If want use your solution, just remove duplicates, but better is unstack, beacuse data are never aggregated, because not duplicated pairs ID with SKILL:


unstack


ID


SKILL


df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
columns='SKILL',
values='NUM',
aggfunc=len,
fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1





The unstack is really fast tks.. How do I keep the ID as a variable not as a index on this case ?
– Ian_De_Oliveira
Jul 4 at 1:50





@Ian_De_Oliveira - Last need df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
– jezrael
Jul 4 at 5:11


df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)





Tks a lot , the unstack function worked really well.. Know I need to learn how to work with big data because my computer did not like the idea to flip 55000 columns , 2500 did ok but 55000 atm only SAS is been able to.
– Ian_De_Oliveira
Jul 5 at 23:25





@Ian_De_Oliveira - It depends of RAM in your PC, but for working best with large data in pandas check this
– jezrael
Jul 6 at 3:52






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

rjp GOXWrjrju,1YwyzkQNft6jd4b3psjA 3tdVMc
c1HmRufV sWAsHYrVva4,PxEhwABGjF pJyHa,kHYt

Popular posts from this blog

PHP contact form sending but not receiving emails

Do graphics cards have individual ID by which single devices can be distinguished?

Create weekly swift ios local notifications