python - JSON object inside Pandas DataFrame


I have a JSON object inside a pandas DataFrame column that I want to pull apart and put into other columns. In the DataFrame, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components: the first being the keys in the dictionaries, and the second being a substring of the key value in the dictionary.

This code works okay, but it is slow when running on a big DataFrame. Can anyone offer a faster (and simpler) way to do this? Also, feel free to pick holes in what I have done if you see something that is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.

# import libraries
import pandas as pd
from IPython.display import display  # used to display df's nicely in jupyter notebook
import json

# set display options
pd.set_option('max_colwidth', 150)

# create example dataframe
print("original df:")
df = pd.DataFrame.from_dict({'cola': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
 'colb': {0: '[{"key":"keyvalue=1","vala":"8","valb":"18"},{"key":"keyvalue=2","vala":"9","valb":"19"}]',
  1: '[{"key":"keyvalue=2","vala":"28","valb":"38"},{"key":"keyvalue=3","vala":"29","valb":"39"}]',
  2: '[{"key":"keyvalue=4","vala":"48","valc":"58"}]',
  3: '[]',
  4: None}})
display(df)

# create temporary dataframe to append results to, record by record
dftemp = pd.DataFrame()

# step through rows in dataframe
for i in range(df.shape[0]):
    # check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i, df.columns.get_loc("colb")]) and len(df.iloc[i, df.columns.get_loc("colb")]) > 2:
        # convert json structure to dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i, df.columns.get_loc("colb")])
        # last bit of string (after last =) is used as key for column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # set new key as index
        y = x.set_index('key')
        # stack rows up via a multi-level column index
        y = y.stack().to_frame().T
        # flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # give single record the same index number as parent dataframe (for merge to work)
        y.index = [df.index[i]]
        # append dataframe on sequentially for each row as we go through the loop
        dftemp = dftemp.append(y)

# merge new dataframe onto original one with all columns, index matching original dataframe
df = pd.merge(df, dftemp, how='left', left_index=True, right_index=True)

print("processed df:")
display(df)

First, a general piece of advice about pandas: if you find yourself iterating over the rows of a DataFrame, you are probably doing it wrong.

With that in mind, we can re-write your current procedure using the pandas apply method (this will speed it up to begin with, as it means far fewer index lookups on df):

def do_the_thing(row):
    # check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # convert json structure to dataframe, one cell at a time in the relevant column
        x = pd.read_json(row)
        # last bit of string (after last =) is used as key for column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # set new key as index
        y = x.set_index('key')
        # stack rows up via a multi-level column index
        y = y.stack().to_frame().T
        # flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]

        # we don't need to re-index:
        # give single record the same index number as parent dataframe (for merge to work)
        # y.index = [df.index[i]]

        # we don't need to append to a temp df
        return y.iloc[0]
    else:
        return pd.Series()

df2 = df.merge(df.colb.apply(do_the_thing), how='left', left_index=True, right_index=True)

Notice that this returns the same result as before; I haven't changed the logic. The apply method sorts out the indexes, so we can just merge, which is fine.
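To see what the helper produces for a single cell, you can call it directly on one JSON string. This is a minimal self-contained sketch of the same helper; the only change is wrapping the string in StringIO, since newer pandas versions prefer a file-like object for read_json:

```python
import io
import pandas as pd

def do_the_thing(row):
    # skip nulls and empty arrays ('[]' has length 2)
    if pd.notnull(row) and len(row) > 2:
        x = pd.read_json(io.StringIO(row))
        # keep only the suffix after the last '=' as the key
        x['key'] = x['key'].apply(lambda s: s.split("=")[-1])
        # stack into one row with a multi-level column index, then flatten it
        y = x.set_index('key').stack().to_frame().T
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        return y.iloc[0]
    return pd.Series(dtype=object)

cell = '[{"key":"keyvalue=1","vala":"8","valb":"18"}]'
out = do_the_thing(cell)
print(list(out.index))  # labels combine the value name and the key suffix
```

Each value column gets a label like vala_1, i.e. the original column name joined with the key suffix, which is exactly what the flattened multi-level index produces.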

I believe this answers your question in terms of speeding things up and being a bit more idiomatic.

I think you should consider, however, what you actually want from this data structure, and how you might better structure what you're doing.

Given that colb can be of arbitrary length, you will end up with a DataFrame with an arbitrary number of columns. When you come to access these values, for whatever purpose, this will cause you pain, whatever that purpose is.

Are all the entries in colb important? Could you get away with keeping only the first one? Do you need to know the index of a particular vala value?

These are questions you should ask yourself, and then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
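As one possible alternative, here is a sketch of a "long" (tidy) layout, where each dictionary in colb becomes its own row instead of its own set of columns. The column set then stays fixed no matter how many entries a cell holds. The names records and long_df are just illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'cola': [123, 234, 345, 456, 567],
    'colb': [
        '[{"key":"keyvalue=1","vala":"8","valb":"18"},{"key":"keyvalue=2","vala":"9","valb":"19"}]',
        '[{"key":"keyvalue=2","vala":"28","valb":"38"},{"key":"keyvalue=3","vala":"29","valb":"39"}]',
        '[{"key":"keyvalue=4","vala":"48","valc":"58"}]',
        '[]',
        None,
    ],
})

# parse each JSON string into a list of dicts (nulls become empty lists)
records = df['colb'].apply(lambda s: json.loads(s) if pd.notnull(s) else [])

# one row per dict, keeping cola alongside; empty lists simply drop out
long_df = (
    df[['cola']]
    .join(records.rename('record'))
    .explode('record')
    .dropna(subset=['record'])
    .reset_index(drop=True)
)

# expand each dict into a fixed set of columns, and trim the key to its suffix
long_df = long_df[['cola']].join(pd.json_normalize(long_df['record'].tolist()))
long_df['key'] = long_df['key'].str.split('=').str[-1]
```

In this shape, a question like "which rows have a given key" is a single boolean filter over one column, rather than a scan across an unknown set of generated columns.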

