python - Computing Jaccard Similarity between DataFrame Columns with Different Lengths


I have a DataFrame with user_ids as columns and the IDs of movies they've liked as row values. Here's a snippet:

           15       30       50        93      100     113     1008    1028
    0  3346.0  42779.0   1816.0  191319.0    138.0   183.0    171.0   283.0
    1  1543.0      NaN    169.0    5319.0  34899.0   188.0  42782.0  1183.0
    2  5942.0      NaN  30438.0  195514.0    169.0   172.0    187.0  5329.0
    3  3249.0      NaN  32361.0     225.0     87.0   547.0   6710.0   283.0
    4   794.0      NaN    187.0  195734.0   6297.0  8423.0   1289.0   222.0

I'm trying to calculate the Jaccard similarity between each pair of columns (i.e. between each pair of users, using the movies they've liked). Python gives the following error when I try to use jaccard_similarity_score from sklearn:

    ValueError: continuous is not supported

Ideally, the result would be a matrix whose rows and columns are the user_ids and whose values are the similarity scores for each pair.

How can I go about computing Jaccard similarities between these columns? I've tried using a list of dictionaries with user IDs as keys and lists of movies as values, but it takes forever to compute.

Since sklearn.metrics.jaccard_similarity_score expects two input vectors of equal length, you could try the following, partially adapted from this similar question.

    import itertools

    import numpy as np
    import pandas as pd


    # Compute the Jaccard similarity index between two sets
    def compute_jaccard(user1_vals, user2_vals):
        intersection = user1_vals.intersection(user2_vals)
        union = user1_vals.union(user2_vals)
        jaccard = len(intersection) / float(len(union))
        return jaccard


    # Small test DataFrame: one column per user, padded with NaN
    users = ['user1', 'user2', 'user3']
    df = pd.DataFrame(
        np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
        columns=users)
    sim_df = pd.DataFrame(columns=users, index=users)

    # Iterate through each pair of columns and compute the metric,
    # dropping the NaN padding before building each user's set
    for u1, u2 in itertools.combinations(df.columns, 2):
        sim_df.loc[u1, u2] = compute_jaccard(set(df[u1].dropna()),
                                             set(df[u2].dropna()))

    print(sim_df)

This returns the following (upper triangular) half of the similarity matrix; the diagonal, left as NaN here, would of course be all ones.

          user1 user2     user3
    user1   NaN  0.25  0.333333
    user2   NaN   NaN       0.5
    user3   NaN   NaN       NaN
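If the pairwise loop is too slow for many users, or you want the full symmetric matrix rather than just the upper triangle, one possible alternative is to build a binary user-by-movie indicator matrix and get every intersection size from a single matrix product. This is only a sketch, not part of the answer above: the function name jaccard_matrix is made up for illustration, it assumes the dense indicator matrix fits in memory, and it arbitrarily defines the similarity of two users with no liked movies as 0.

```python
import numpy as np
import pandas as pd


def jaccard_matrix(df):
    """Return a symmetric user-by-user Jaccard similarity DataFrame.

    Hypothetical helper: assumes one column per user, with movie IDs
    as values and NaN as padding.
    """
    # Long format: one row per (user, movie) pair, NaN padding dropped.
    long = df.melt(var_name="user", value_name="movie").dropna()
    # Binary indicator matrix: rows are users, columns are movies.
    indicator = pd.crosstab(long["user"], long["movie"]).clip(upper=1)
    # Restore the original user order; users with no movies get zero rows.
    indicator = indicator.reindex(index=df.columns, fill_value=0)
    m = indicator.to_numpy(dtype=float)

    intersection = m @ m.T          # |A & B| for every pair of users
    sizes = m.sum(axis=1)           # |A| for every user
    union = sizes[:, None] + sizes[None, :] - intersection

    # Avoid 0/0 when both users have no movies (assumed to score 0 here).
    sim = np.divide(intersection, union,
                    out=np.zeros_like(union), where=union > 0)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)
```

Called as jaccard_matrix(df) on the small test DataFrame above, this should give the same off-diagonal values as the loop (0.25, 0.333333 and 0.5), mirrored below the diagonal, with ones on the diagonal. With a very large number of distinct movies the dense indicator matrix may get too big, in which case a sparse matrix would probably be the better fit.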
