Melt and Merge on Substring - Python & Pandas -


i have data has data

id      name    model_#   ms   bp1   cd1    sf1    sa1   rq1   bp2   cd2   sf2   sa2   rq2 ...  1       john    23984     1    23    234    124     25   252   252    62   194    234   234 ...  2       john    23984     2    234   234    242     62   262   622   262   622    26    262 ...  

for hundreds of models 10 ms , variables counting 21.

i have used pd.melt doing analysis @ bp1:bp21 or whatever. have need create melt @ bp1 values along rq 1 values.

i looking create this:

              id  model_#  ms  variable_x  value_x variable_y  value_y 0            113    77515   1        bp1     23        rq1      252 1            113    77515   1        bp2     252       rq2      262 2            113    77515   1        bp3     26        rq3      311 

right best have been able is:

              id  model_#  ms variable_x  value_x variable_y  value_y 0            113    77515   1        bp1     23        rq1      252 1            113    77515   1        bp1     23        rq2      262 2            113    77515   1        bp1     23        rq3      311 3            113    77515   1        bp1     23        rq4      246 

via:

df = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=bp) df1 = pd.melt(dat, id_vars=['id', 'mod_req', 'ms'], value_vars=rq) df2 = pd.merge(df,df1, on=['id', 'mod_req', 'ms']) 

is there easy way merge on substring such bp1 connect rq1 , forth? mean taking melted dataframe looks @ bp1:bp21 , other melted dataframe rq1:rq21 , merging based on substring values( bp1 rq1, not bp1 rq2)

solution

set index...
use clever column groupby...
clever function apply...

d1 = df.set_index(['id', 'name', 'model_#', 'ms'])  def melt_(df):     id_vars = df.index.names     return df.reset_index().melt(id_vars=id_vars).set_index(id_vars)   d2 = d1.groupby(d1.columns.str.extract('(\d+)', expand=false), axis=1).apply(melt_)  d2.columns = d2.columns.swaplevel(0, 1).map('_'.join) d2.reset_index()     id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf 0   1  john    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124 1   2  john    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242 2   1  john    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194 3   2  john    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622 

overly functionalized

e = lambda d, n: dict(zip(n, d.dtypes)) = lambda d, n: pd.dataframe(d.values, d.index, n).astype(e(d, n)) h = lambda d: i(d, d.columns.map(fmt)).reset_index() m = lambda d: d.reset_index().melt(cols).set_index(cols) fmt = '{0[1]}_{0[0]}'.format  cols = ['id', 'name', 'model_#', 'ms']  d1 = df.set_index(cols) g = d1.columns.str.extract('(\d+)', expand=false) d1.groupby(g, axis=1).apply(m).pipe(h)     id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf 0   1  john    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124 1   2  john    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242 2   1  john    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194 3   2  john    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622 

old answer

this far pretty , i'm not sure want.

d1 = df.set_index(['id', 'name', 'model_#', 'ms'])  cidx = pd.multiindex.from_tuples(     d1.columns.to_series().str.extract('(\d+)(\d+)', expand=false).values.tolist(),     names=[none, 'variable'] )  d1.columns = cidx  d2 = d1.sort_index(axis=1).stack()  variables = pd.dataframe(     (d2.columns + d2.index.get_level_values('variable')[:, none]).tolist(),     d2.index, d2.columns )  d3 = pd.concat(     [variables, d2], axis=1, keys=['variable', 'value'] ).reset_index('variable', drop=true).sort_index(axis=1, level=1, sort_remaining=false)  d3.columns = d3.columns.map('_'.join)  d3.reset_index()     id  name  model_#  ms variable_bp  value_bp variable_cd  value_cd variable_rq  value_rq variable_sa  value_sa variable_sf  value_sf 0   1  john    23984   1         bp1        23         cd1       234         rq1       252         sa1        25         sf1       124 1   1  john    23984   1         bp2       252         cd2        62         rq2       234         sa2       234         sf2       194 2   2  john    23984   2         bp1       234         cd1       234         rq1       262         sa1        62         sf1       242 3   2  john    23984   2         bp2       622         cd2       262         rq2       262         sa2        26         sf2       622 

Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -