python - How to replace values in a column in a fast and memory-efficient manner
I have a dataframe containing 114 million records and 2 columns named session_id and artifact_id. Both columns are categorical. I want to replace the values in the column artifact_id with values from a dictionary, where each value in artifact_id is mapped to a new value. artifact_id is an int column, and its values are to be replaced with int values. There are 322k unique values to be replaced.
Here's a sample dataset:
session_id  artifact_id
a           234
a           123
b           123
b           678
The contents of the dictionary are as follows:
{'234':'1','123':'2','678':'3'}
I want the final dataset to look like this:
session_id  artifact_id
a           1
a           2
b           2
b           3
I thought of using the following statement to replace these values:
sessions['artifact_id'].replace(artifactid2num, inplace=True)
artifactid2num is the name of the dictionary. This statement gives me an out-of-memory error, so I thought of breaking the process into pieces to avoid the MemoryError, using the following code:
count = 0
for idx in xrange(0, len(sessions), 50000):
    count = count + 1
    print(count)
    if (idx+50000) > len(sessions):
        sessions[idx:(len(sessions)-1)]['artifact_id'].replace(artifactid2num, inplace=True)
    else:
        sessions[idx:(idx+50000)]['artifact_id'].replace(artifactid2num, inplace=True)
The above code runs so far without errors, but it has been running for 10+ hours and hasn't finished yet.
More info: the original dataframe with 114 million records fits in memory and takes 4.2 GB. The moment I run the above code in iterations, memory occupancy increases to 20 GB. I'm only working on 50000 records at a time and replacing values from the dictionary. Why does memory usage increase so drastically?
Is there a way to make this code faster? Or is there another way I can achieve the same result?
Any help is appreciated.
You can try:
d = {'123': '2', '234': '1', '678': '3'}
df['artifact_id'] = df.artifact_id.astype(str).map(d)
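To make the suggestion above concrete, here is a minimal sketch on the sample data from the question. A few things in it are assumptions rather than part of the original post: the session labels "a" and "b", the dictionary values being ints instead of strings (since the question asks for an int result), and the second variant using Series.cat.rename_categories, which is not from the answer. The second variant only applies if the column really is categorical; it rewrites the unique categories rather than every row, so it may be cheaper on a large frame, but that is untested at the 114 million row scale.

```python
import pandas as pd

# Sample data from the question; the session labels "a" and "b" are illustrative.
sessions = pd.DataFrame({
    "session_id": ["a", "a", "b", "b"],
    "artifact_id": ["234", "123", "123", "678"],
}).astype("category")

# Old artifact ids mapped to new ids (ints here, which is an assumption;
# the dictionary in the question holds strings).
artifactid2num = {"234": 1, "123": 2, "678": 3}

# Option 1: Series.map does one dictionary lookup per row.
# Values missing from the dictionary would become NaN.
mapped = sessions["artifact_id"].astype(str).map(artifactid2num)

# Option 2 (assumes a categorical column): rename the categories,
# which touches only the unique values, not every row.
renamed = sessions["artifact_id"].cat.rename_categories(artifactid2num)

print(mapped.tolist())
print(renamed.tolist())
```

Either result can be assigned back with sessions['artifact_id'] = mapped (or renamed). Note that with map, any id missing from the dictionary turns into NaN, while rename_categories leaves unmapped categories unchanged.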