Need help is specific Crosstab! Python Pandas -

June 15, 2011

this link close query have in mind. python pandas groupby aggregate on multiple columns

main topics covering question are: 1. python | 2. pandas ds | 3. group | 4. aggregate function | 5.efficiency

current pandas ds have column names -> unique_identifier | classification | products | values

so data has identifier repeating each product. need information 1 row each identifier , columns count(records) sum(values) each identifier - classification combination.

i did try groupby(['unique_identifier','classficiation']) couldn't figure out new column generation part size , sum.

also since data crazy big, looking high level of efficiency. novice in pandas , love gurus. thanks, m

sample data looks this:

index   identifier  classification  product value   1          123          x          abc    10   2          123          x          bcd    20   3          123          y          cde    30   4          123          y          def    40   5          123          z          efg    50   6          234          z          ghi    60   7          234          z          abc    70   8          234          x          bcd    80   9          345          x          cde    90   10         345          y          def    10

expected output -

                 sum          size   identifier  x   y   z    x   y   z      123     30  70   50   2   2   1      234     80  nan 130   1  nan  2      345     90  10  nan   1   1  nan

i figured out few steps me view needed. gurus can cross-check , verify same. step1: setup groups using groupby(['identifier','classification'])

step2: aggregate function sum , count using 'values' temp['values'].agg(['sum','size']).reset_index()

/* reset index used see better ordered records rather referring older index */

step3 (optional): if have column (lets values2) u need repeat step 2 column , merge 2 datasets on identifier , classification

step4: use pivot(index='identifier',columns='classification')

step5: save dataset in required format. used csv.

output looks this--

                 sum          size   identifier  x   y   z    x   y   z      123     30  70   50   2   2   1      234     80  nan 130   1  nan  2      345     90  10  nan   1   1  nan

Search This Blog

LP

Need help is specific Crosstab! Python Pandas -

Comments

Post a Comment

Popular posts from this blog

PHP and MySQL WP -

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

go - golang pprof for c library code -