python - Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df -
i'd take dataset bunch of different unique individuals, each multiple entries, , assign each individual unique id of entries. here's example of df:
firstname lastname id 0 tom jones 1 1 tom jones 1 2 david smith 1 3 alex thompson 1 4 alex thompson 1
so, want entries tom jones have id=1, entries david smith have id=2, entries alex thompson have id=3, , on.
so have 1 solution, dead simple python loop iterating 2 values (one id, 1 index) , assigning individual id based on whether match previous individual:
x = 1 = 1 while < len(df_test): if (df_test.lastname[i] == df_test.lastname[i-1]) & (df_test.firstname[i] == df_test.firstname[i-1]): df_test.loc[i, 'id'] = x = i+1 else: x = x+1 df_test.loc[i, 'id'] = x = i+1
the problem i'm running dataframe has 9 million entries, loop have taken huge amount of time run. can think of more efficient way this? i've been looking @ groupby , multiindexing potential solutions, haven't quite found right solution yet. thanks!
you join last name , first name, convert category, , codes.
of course, multiple people same name have same id
.
df = df.assign(id=(df['lastname'] + '_' + df['firstname']).astype('category').cat.codes) >>> df firstname lastname id 0 tom jones 0 1 tom jones 0 2 david smith 1 3 alex thompson 2 4 alex thompson 2
Comments
Post a Comment