pandas - How can I change my index vector into a sparse feature vector that can be used in sklearn?


I am building a news recommendation system and need to build a table of users and the news they have read. The raw data looks like this:

001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]

The first column is the user ID and the second column is a list of news IDs. Each news ID is a column index: for example, after the transformation, [12,456,157] in the first row means that this user has read the 12th, 456th, and 157th news items (in the sparse vector, columns 12, 456, and 157 are 1, while all other columns are 0). I want to convert this data into a sparse vector format that can be used as the input to the KMeans or DBSCAN algorithms in sklearn. How can I do that?

One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO format and then cast it to CSR format.

from scipy.sparse import coo_matrix

input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]

number_movies = 1000  # number of columns; must be greater than the largest index in the data
number_users = len(input_data)  # number of users in the model

# You'll probably want a way to look up the row index for a given user id.
user_row_map = {}
user_row_index = 0

# Structures for the COO format: row indices, column indices, values
I, J, data = [], [], []
for user, movies in input_data:

    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index += 1

    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1)  # number of times the user watched the movie

# Create the matrix in COO format, then cast it to CSR, which is easier to use
feature_matrix = coo_matrix((data, (I, J)), shape=(number_users, number_movies)).tocsr()
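To connect this back to the question, here is a minimal sketch of passing such a CSR matrix to sklearn's clustering estimators. The number of clusters, the DBSCAN `eps` and `min_samples` values, and the choice of cosine distance are all illustrative assumptions, not values from the original post. The name `number_news` is also my own. Tune these on real data.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import DBSCAN, KMeans

# Toy data from the question: (user id, indices of the news items read).
input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]
number_news = 1000  # assumed column count; must exceed the largest index

# Build the COO triplets: one row per user, one column per news index.
rows, cols = [], []
for row, (_, news_ids) in enumerate(input_data):
    rows.extend([row] * len(news_ids))
    cols.extend(news_ids)
values = np.ones(len(rows), dtype=np.float64)

feature_matrix = coo_matrix(
    (values, (rows, cols)), shape=(len(input_data), number_news)
).tocsr()

# Both estimators accept scipy sparse matrices directly.
# n_clusters, eps, and min_samples below are placeholder values.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feature_matrix)
dbscan_labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(feature_matrix)
```

Each result is an array with one cluster label per user, in the same row order as `input_data`. With real data, a cosine metric is often a reasonable starting point for binary "has read" vectors, but that is a judgment call rather than a rule.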
