pandas - How can I change my index vector into a sparse feature vector that can be used in sklearn?
I am building a news recommendation system and need to build a table of users and the news they have read. The raw data looks like this:
    001436800277225 [12,456,157]
    009092130698762 [248]
    010003000431538 [361,521,83]
    010156461231357 [173,67,244]
    010216216021063 [203,97]
    010720006581483 [86]
    011199797794333 [142,12,86,411,201]
    011337201765123 [123,41]
    011414545455156 [62,45,621,435]
    011425002581540 [341,214,286]
The first column is the userid and the second column is the newsid. The newsid is an index column. For example, after the transformation, [12,456,157] in the first row means that this user has read the 12th, 456th, and 157th news items (in the sparse vector, the 12th, 456th, and 157th columns are 1, while all other columns are 0). I want to change this data into a sparse vector format that can be used as the input to the KMeans or DBSCAN algorithms of sklearn. How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO format and then cast it to CSR format.
    from scipy.sparse import coo_matrix

    input_data = [
        ("001436800277225", [12,456,157]),
        ("009092130698762", [248]),
        ("010003000431538", [361,521,83]),
        ("010156461231357", [173,67,244])
    ]

    number_movies = 1000  # must be larger than the maximum movie index in the data
    number_users = len(input_data)  # number of users in the model

    # you'll want to have a way to look up the row index given a user id
    user_row_map = {}
    user_row_index = 0

    # structures for the COO format
    i, j, data = [], [], []

    for user, movies in input_data:
        if user not in user_row_map:
            user_row_map[user] = user_row_index
            user_row_index += 1
        for movie in movies:
            i.append(user_row_map[user])
            j.append(movie)
            data.append(1)  # number of times the user watched the movie

    # create the matrix in COO format, then cast it to CSR, which is easier to use
    feature_matrix = coo_matrix((data, (i, j)), shape=(number_users, number_movies)).tocsr()
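To connect this back to the question, here is a minimal sketch of passing such a CSR matrix to the sklearn clusterers. It rebuilds a small matrix so that it runs on its own. The names `n_items`, `X`, `kmeans_labels`, and `dbscan_labels` are made up for illustration, and the values of `n_clusters`, `eps`, and `min_samples`, as well as the cosine metric, are untuned assumptions rather than recommendations.

```python
from scipy.sparse import coo_matrix
from sklearn.cluster import DBSCAN, KMeans

# Toy data in the same shape as the question: (userid, [newsid, ...]).
input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]
n_items = 1000  # assumed upper bound; must exceed the largest news index

# Collect (row, column, value) triplets for the COO constructor.
rows, cols, vals = [], [], []
for row, (_, items) in enumerate(input_data):
    for item in items:
        rows.append(row)
        cols.append(item)
        vals.append(1.0)

# One row per user, one column per news index, stored as CSR.
X = coo_matrix((vals, (rows, cols)), shape=(len(input_data), n_items)).tocsr()

# KMeans accepts CSR input directly; n_clusters=2 is an arbitrary choice.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN also accepts sparse input; eps and min_samples are illustrative only.
dbscan_labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)
```

Both calls return one cluster label per user row, in the same order as `input_data`. DBSCAN marks points it treats as noise with the label -1, so on a tiny, non-overlapping sample like this one the labels are not meaningful; the sketch only shows that the sparse matrix can be passed in without densifying it.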