How to pivot a dataframe

Multi tool use
Multi tool use


How to pivot a dataframe



I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....



... But I'm going to give it a go.



The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)



Look a few examples from my google search


pd.DataFrame.pivot



So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


pivot



Setup



You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with where which column names go where to get the results you're looking for.


import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70



Why do I get ValueError: Index contains duplicate entries, cannot reshape


ValueError: Index contains duplicate entries, cannot reshape



How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?


df


col


row


val0


col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24



How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?


df


col


row


val0


0


col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24



Can I get something other than mean, like maybe sum?


mean


sum


col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24



Can I do more that one aggregation at a time?


sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24



Can I aggregate over multiple value columns?


val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46



Can Subdivide by multiple columns?


item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00



Or


item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00



Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?


col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1




1 Answer
1



We start by answering the first question:



Why do I get ValueError: Index contains duplicate entries, cannot reshape


ValueError: Index contains duplicate entries, cannot reshape



This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:


columns


index


pd.DataFrame.pivot


row


col


df.duplicated(['row', 'col']).any()

True



So when I pivot using


pivot


df.pivot(index='row', columns='col', values='val0')



I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:


df.set_index(['row', 'col'])['val0'].unstack()



Here is a list of idioms we can use to pivot


pd.DataFrame.groupby


pd.DataFrame.unstack


unstack


pd.DataFrame.pivot_table


groupby


pd.DataFrame.set_index


pd.DataFrame.unstack


groupby


unstack


pd.DataFrame.pivot


set_index


index


columns


values


pivot_table


pd.crosstab


pivot_table


pd.factorize


np.bincount


pd.get_dummies


pd.DataFrame.dot



Examples



What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.


pd.DataFrame.pivot_table



How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?


df


col


row


val0


0



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


fill_value


0


fill_value



aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.


aggfunc='mean'


df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')

col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)



pd.crosstab


pd.crosstab


pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)



Can I get something other than mean, like maybe sum?


mean


sum



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')

col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)



pd.crosstab


pd.crosstab


pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)



Can I do more that one aggregation at a time?



Notice that for pivot_table and cross_tab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.


pivot_table


cross_tab


groupby.agg


groupby.agg



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])

size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)



pd.crosstab


pd.crosstab


pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')



Can I aggregate over multiple value columns?



pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely


pd.DataFrame.pivot_table


values=['val0', 'val1']


df.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')

val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)



Can Subdivide by multiple columns?



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')

item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)



Can Subdivide by multiple columns?



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')

item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)



pd.DataFrame.set_index because the set of keys are unique for both rows and columns


pd.DataFrame.set_index


df.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)



Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?



pd.DataFrame.pivot_table


pd.DataFrame.pivot_table


df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1



pd.DataFrame.groupby


pd.DataFrame.groupby


df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)



pd.cross_tab


pd.cross_tab


pd.crosstab(df['row'], df['col'])



pd.factorize + np.bincount


pd.factorize


np.bincount


# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)

col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1



pd.get_dummies


pd.get_dummies


pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1





Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04





Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

gFpZnVeA2
Z,ijVglXDNcjh7QK qkCod9 s LSko,Q5sXH3GCcXREzTd JaQevwY68TwwufyCwxCjlTlxRa7h WYu

Popular posts from this blog

PHP contact form sending but not receiving emails

Do graphics cards have individual ID by which single devices can be distinguished?

Create weekly swift ios local notifications