python - Aggregating over multiple categories with sub results -
this seems simple enough problem, couldn’t find on it.
 have dataframe:
df = pd.dataframe({'x':[1,3,1,3,1,5,1,5],                    'cat1':['a','a','b','b','a','a','b','b'],                   'cat2':['a','a','a','a','b','b','b','b']})      cat1 cat2 x 0         1 1         3 2   b       1 3   b       3 4      b    1 5      b    5 6   b    b    1 7   b    b    5 where cat1 , cat 2 categories, , want aggregate x on them using aggregate function (sum, weighted average or defined). obvious solution, like
res = df.groupby(['cat1','cat2']).sum() results in
           x cat1 cat2           4      b     6 b        4      b     6 but want this:
    cat1    cat2    x 0               4 1         b       6 2             10 3   b             4 4   b       b       6 5   b           10 6             8 7       b       12 8           20 one way go this:
res2 = df.groupby(['cat1','cat2']).sum().reset_index() res1a = df.groupby(['cat1']).sum().reset_index() res1b = df.groupby(['cat2']).sum().reset_index() res1a['cat2'] = 'any' res1b['cat1'] = 'any' res = res1a.append(res1b).append(res2).set_index(['cat1','cat2']) but since actual task involves 10 categories, creating separate groupbys various category subsets , appending not convenient. hope there better way go
you can use unstack reshape, add new column , new row any sum , last reshape stack:
#add `['x']` series output res = df.groupby(['cat1','cat2'])['x'].sum() print (res) cat1  cat2           4       b       6 b           4       b       6 name: x, dtype: int64  = res.unstack() a['any'] = a.sum(axis=1) a.loc['any'] = a.sum() print (a) cat2    b  cat1                 4   6   10 b     4   6   10   8  12   20  df1 = a.stack() print (df1) cat1  cat2            4       b        6           10 b            4       b        6           10          8       b       12           20 dtype: int64 edit:
res2 = df.groupby(['cat1','cat2']).sum() res1a = df.groupby(['cat1']).sum() res1b = df.groupby(['cat2']).sum() res1a = res1a.set_index(pd.series(['any'] * len(res1a), name='cat2'), append=true) res1b=res1b.set_index(pd.series(['any'] * len(res1b), name='cat1'), append=true).swaplevel() res1ab=pd.dataframe(res1a.sum().values,                     index=pd.multiindex.from_arrays([['any'],['any']]),                    columns=['x'])  res2 = pd.concat([res2, res1a, res1b, res1ab]) print (res2)             x cat1 cat2             4      b      6 b         4      b      6      10 b      10       8      b     12        20 
Comments
Post a Comment