Need help is specific Crosstab! Python Pandas -
this link close query have in mind. python pandas groupby aggregate on multiple columns
main topics covering question are: 1. python | 2. pandas ds | 3. group | 4. aggregate function | 5.efficiency
current pandas ds have column names -> unique_identifier | classification | products | values
so data has identifier repeating each product. need information 1 row each identifier , columns count(records) sum(values) each identifier - classification combination.
i did try groupby(['unique_identifier','classficiation']) couldn't figure out new column generation part size , sum.
also since data crazy big, looking high level of efficiency. novice in pandas , love gurus. thanks, m
sample data looks this:
index identifier classification product value 1 123 x abc 10 2 123 x bcd 20 3 123 y cde 30 4 123 y def 40 5 123 z efg 50 6 234 z ghi 60 7 234 z abc 70 8 234 x bcd 80 9 345 x cde 90 10 345 y def 10
expected output -
sum size identifier x y z x y z 123 30 70 50 2 2 1 234 80 nan 130 1 nan 2 345 90 10 nan 1 1 nan
i figured out few steps me view needed. gurus can cross-check , verify same. step1: setup groups using groupby(['identifier','classification'])
step2: aggregate function sum , count using 'values' temp['values'].agg(['sum','size']).reset_index()
/* reset index used see better ordered records rather referring older index */
step3 (optional): if have column (lets values2) u need repeat step 2 column , merge 2 datasets on identifier , classification
step4: use pivot(index='identifier',columns='classification')
step5: save dataset in required format. used csv.
output looks this--
sum size identifier x y z x y z 123 30 70 50 2 2 1 234 80 nan 130 1 nan 2 345 90 10 nan 1 1 nan
Comments
Post a Comment