python - How to replace values in a column in a fast and memory-efficient manner


I have a DataFrame containing 114 million records with 2 columns named session_id and artifact_id. Both columns are categorical. I want to replace the values in column artifact_id with values from a dictionary, in which each value in artifact_id is mapped to a new value. artifact_id is an int column, and its values are to be replaced with int values. There are 322k unique values to be replaced.

Here's a sample dataset:

session_id    artifact_id
a             234
a             123
b             123
b             678

The contents of the dictionary are as follows:

{'234':'1','123':'2','678':'3'} 

I want the final dataset to look like this:

session_id    artifact_id
a             1
a             2
b             2
b             3

I thought the following statement would replace these values:

sessions['artifact_id'].replace(artifactid2num, inplace=True)

artifactid2num is the name of the dictionary. This statement gives me an out-of-memory error, so I thought of breaking the process into pieces to avoid the MemoryError, using the following code:

count = 0
for idx in xrange(0, len(sessions), 50000):
    count = count + 1
    print(count)
    if (idx + 50000) > len(sessions):
        sessions[idx:(len(sessions)-1)]['artifact_id'].replace(artifactid2num, inplace=True)
    else:
        sessions[idx:(idx+50000)]['artifact_id'].replace(artifactid2num, inplace=True)

The above code has run so far without errors, but it has been running for 10+ hours and hasn't finished yet.

More info: the original DataFrame with 114 million records fits in memory and takes 4.2 GB. The moment I run the above code with iterations, memory occupancy increases to 20 GB. I'm only working on 50,000 records at a time and replacing values from the dictionary. Why does memory usage increase so drastically?

Is there a way to make this code faster? Or is there another way I can achieve the same result?

Any help is appreciated.

You can try:

d = {'123': '2', '234': '1', '678': '3'}
df['artifact_id'] = df.artifact_id.astype(str).map(d)
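For reference, here is a minimal runnable sketch of that idea on the sample data from the question. It makes a few assumptions that are not in the original post: the dictionary uses int keys and values (the question describes artifact_id as an int column), the session labels are placeholders, and the second option, renaming the categories instead of mapping every row, is an alternative worth trying rather than something the answer above proposes. It has not been timed on 114 million rows.

```python
import pandas as pd

# Sample data modeled on the question; both columns are categorical.
# The session labels are placeholders for illustration.
sessions = pd.DataFrame({
    "session_id": ["a", "a", "b", "b"],
    "artifact_id": [234, 123, 123, 678],
}).astype("category")

# Assumption: int keys and int values, since artifact_id is described
# as an int column (the dictionary shown in the question uses strings).
artifactid2num = {234: 1, 123: 2, 678: 3}

# Option 1: map every row through the dictionary, as suggested above.
# This materializes a full int column before mapping.
mapped = sessions["artifact_id"].astype(int).map(artifactid2num)

# Option 2: rename only the categories. The per-row codes are left
# untouched, so the work scales with the number of unique values
# rather than the number of rows. The new values must be unique.
renamed = sessions["artifact_id"].cat.rename_categories(artifactid2num)

print(mapped.tolist())
print(renamed.tolist())
```

Option 2 only applies when the column is already categorical and every old value maps to a distinct new value; otherwise pandas raises an error and Option 1 is the fallback.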
