Reshaping into binary variables using pandas python

Multi tool use
Reshaping into binary variables using pandas python
I am still new to Python pandas' pivot_table and im trying to reshape the data to have a binary indicator if a value is in a certain observation. I have follow some previous codes and got some encouraging results, however instead of 1 and zeros as Is my ideal result I get a sum. Please see a small sample data set below
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The results I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
My code atm get the following result:
ID A B C E
1 2 1 0 0
2 0 0 2 0
3 0 0 0 1
Should I remove the duplicates first??
The code I'm using atm is below;
df_pivot = df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)
3 Answers
3
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any'
and convert to int
as a separate step. This avoids having to use a lambda
/ custom function, and may be more efficient.
aggfunc='any'
int
lambda
df_pivot = df.pivot_table(index='ID', columns='SKILL',
aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The same would work with aggfunc=len
+ conversion to int
, except this is likely to be more expensive.
aggfunc=len
int
You can use get_dummies
with set_index
for indicator columns and then get max
values per index:
get_dummies
set_index
max
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
For better performance remove duplicates by drop_duplicates
and reshape by set_index
with unstack
:
drop_duplicates
set_index
unstack
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
Solution with pivot
, but then is necessary replace NaN
s to 0
:
pivot
NaN
0
df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)
If want use your solution, just remove duplicates, but better is unstack
, beacuse data are never aggregated, because not duplicated pairs ID
with SKILL
:
unstack
ID
SKILL
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
columns='SKILL',
values='NUM',
aggfunc=len,
fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The unstack is really fast tks.. How do I keep the ID as a variable not as a index on this case ?
– Ian_De_Oliveira
Jul 4 at 1:50
@Ian_De_Oliveira - Last need
df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
– jezrael
Jul 4 at 5:11
df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
Tks a lot , the unstack function worked really well.. Know I need to learn how to work with big data because my computer did not like the idea to flip 55000 columns , 2500 did ok but 55000 atm only SAS is been able to.
– Ian_De_Oliveira
Jul 5 at 23:25
@Ian_De_Oliveira - It depends of RAM in your PC, but for working best with large data in pandas check this
– jezrael
Jul 6 at 3:52
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
It worked, tks. R would not flip my table because is too big.
– Ian_De_Oliveira
Jul 4 at 0:24