Melt and Merge on Substring - Python & Pandas -
i have data has data
id name model_# ms bp1 cd1 sf1 sa1 rq1 bp2 cd2 sf2 sa2 rq2 ... 1 john 23984 1 23 234 124 25 252 252 62 194 234 234 ... 2 john 23984 2 234 234 242 62 262 622 262 622 26 262 ...
for hundreds of models 10 ms , variables counting 21.
i have used pd.melt doing analysis @ bp1:bp21 or whatever. have need create melt @ bp1 values along rq 1 values.
i looking create this:
id model_# ms variable_x value_x variable_y value_y 0 113 77515 1 bp1 23 rq1 252 1 113 77515 1 bp2 252 rq2 262 2 113 77515 1 bp3 26 rq3 311
right best have been able is:
id model_# ms variable_x value_x variable_y value_y 0 113 77515 1 bp1 23 rq1 252 1 113 77515 1 bp1 23 rq2 262 2 113 77515 1 bp1 23 rq3 311 3 113 77515 1 bp1 23 rq4 246
via:
df = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=bp) df1 = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=rq) df2 = pd.merge(df,df1, on=['id', 'mod_req', 'ms'])
is there easy way merge on substring such bp1 connect rq1 , forth? mean taking melted dataframe looks @ bp1:bp21 , other melted dataframe rq1:rq21 , merging based on substring values( bp1 rq1, not bp1 rq2)
solution
set index...
use clever column groupby
...
clever function apply
...
d1 = df.set_index(['id', 'name', 'model_#', 'ms']) def melt_(df): id_vars = df.index.names return df.reset_index().melt(id_vars=id_vars).set_index(id_vars) d2 = d1.groupby(d1.columns.str.extract('(\d+)', expand=false), axis=1).apply(melt_) d2.columns = d2.columns.swaplevel(0, 1).map('_'.join) d2.reset_index() id name model_# ms variable_bp value_bp variable_cd value_cd variable_rq value_rq variable_sa value_sa variable_sf value_sf 0 1 john 23984 1 bp1 23 cd1 234 rq1 252 sa1 25 sf1 124 1 2 john 23984 2 bp1 234 cd1 234 rq1 262 sa1 62 sf1 242 2 1 john 23984 1 bp2 252 cd2 62 rq2 234 sa2 234 sf2 194 3 2 john 23984 2 bp2 622 cd2 262 rq2 262 sa2 26 sf2 622
overly functionalized
e = lambda d, n: dict(zip(n, d.dtypes)) = lambda d, n: pd.dataframe(d.values, d.index, n).astype(e(d, n)) h = lambda d: i(d, d.columns.map(fmt)).reset_index() m = lambda d: d.reset_index().melt(cols).set_index(cols) fmt = '{0[1]}_{0[0]}'.format cols = ['id', 'name', 'model_#', 'ms'] d1 = df.set_index(cols) g = d1.columns.str.extract('(\d+)', expand=false) d1.groupby(g, axis=1).apply(m).pipe(h) id name model_# ms variable_bp value_bp variable_cd value_cd variable_rq value_rq variable_sa value_sa variable_sf value_sf 0 1 john 23984 1 bp1 23 cd1 234 rq1 252 sa1 25 sf1 124 1 2 john 23984 2 bp1 234 cd1 234 rq1 262 sa1 62 sf1 242 2 1 john 23984 1 bp2 252 cd2 62 rq2 234 sa2 234 sf2 194 3 2 john 23984 2 bp2 622 cd2 262 rq2 262 sa2 26 sf2 622
old answer
this far pretty , i'm not sure want.
d1 = df.set_index(['id', 'name', 'model_#', 'ms']) cidx = pd.multiindex.from_tuples( d1.columns.to_series().str.extract('(\d+)(\d+)', expand=false).values.tolist(), names=[none, 'variable'] ) d1.columns = cidx d2 = d1.sort_index(axis=1).stack() variables = pd.dataframe( (d2.columns + d2.index.get_level_values('variable')[:, none]).tolist(), d2.index, d2.columns ) d3 = pd.concat( [variables, d2], axis=1, keys=['variable', 'value'] ).reset_index('variable', drop=true).sort_index(axis=1, level=1, sort_remaining=false) d3.columns = d3.columns.map('_'.join) d3.reset_index() id name model_# ms variable_bp value_bp variable_cd value_cd variable_rq value_rq variable_sa value_sa variable_sf value_sf 0 1 john 23984 1 bp1 23 cd1 234 rq1 252 sa1 25 sf1 124 1 1 john 23984 1 bp2 252 cd2 62 rq2 234 sa2 234 sf2 194 2 2 john 23984 2 bp1 234 cd1 234 rq1 262 sa1 62 sf1 242 3 2 john 23984 2 bp2 622 cd2 262 rq2 262 sa2 26 sf2 622
Comments
Post a Comment