python - Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df -


i'd take dataset bunch of different unique individuals, each multiple entries, , assign each individual unique id of entries. here's example of df:

      firstname lastname  id 0     tom       jones     1 1     tom       jones     1 2     david     smith     1 3     alex      thompson  1 4     alex      thompson  1 

so, want entries tom jones have id=1, entries david smith have id=2, entries alex thompson have id=3, , on.

so have 1 solution, dead simple python loop iterating 2 values (one id, 1 index) , assigning individual id based on whether match previous individual:

x = 1 = 1  while < len(df_test):     if (df_test.lastname[i] == df_test.lastname[i-1]) &      (df_test.firstname[i] == df_test.firstname[i-1]):         df_test.loc[i, 'id'] = x         = i+1     else:         x = x+1         df_test.loc[i, 'id'] = x         = i+1 

the problem i'm running dataframe has 9 million entries, loop have taken huge amount of time run. can think of more efficient way this? i've been looking @ groupby , multiindexing potential solutions, haven't quite found right solution yet. thanks!

you join last name , first name, convert category, , codes.

of course, multiple people same name have same id.

df = df.assign(id=(df['lastname'] + '_' + df['firstname']).astype('category').cat.codes) >>> df   firstname  lastname  id 0       tom     jones   0 1       tom     jones   0 2     david     smith   1 3      alex  thompson   2 4      alex  thompson   2 

Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -