python - Computing Jaccard Similarity between DataFrame Columns with Different Lengths

February 15, 2013

I have a DataFrame with user_ids as columns and the ids of the movies they've liked as row values. Here's a snippet:

           15       30       50        93      100     113     1008    1028
    0  3346.0  42779.0   1816.0  191319.0    138.0   183.0    171.0   283.0
    1  1543.0      NaN    169.0    5319.0  34899.0   188.0  42782.0  1183.0
    2  5942.0      NaN  30438.0  195514.0    169.0   172.0    187.0  5329.0
    3  3249.0      NaN  32361.0     225.0     87.0   547.0   6710.0   283.0
    4   794.0      NaN    187.0  195734.0   6297.0  8423.0   1289.0   222.0

I'm trying to calculate the Jaccard similarity between each pair of columns (i.e. between each pair of users, using the movies they've liked). Python gives the following error when I try to use jaccard_similarity_score from sklearn:

ValueError: continuous is not supported

Ideally, the result would be a matrix whose rows and columns are the user_ids and whose values are the similarity scores for each pair.

How can I go about computing the Jaccard similarities between these columns? I've tried using a list of dictionaries, with user ids as keys and lists of movies as values, but it takes forever to compute.

Since sklearn.metrics.jaccard_similarity_score expects 2 input vectors of equal length, you could try something like the following, partially adapted from this similar question.

    import itertools

    import numpy as np
    import pandas as pd

    # Compute the Jaccard similarity index between two sets
    def compute_jaccard(user1_vals, user2_vals):
        intersection = user1_vals.intersection(user2_vals)
        union = user1_vals.union(user2_vals)
        jaccard = len(intersection) / float(len(union))
        return jaccard

    # Small test DataFrame; NaN pads the users who liked fewer movies
    users = ['user1', 'user2', 'user3']
    df = pd.DataFrame(
        np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
        columns=users)
    sim_df = pd.DataFrame(columns=users, index=users)

    # Iterate through each pair of columns and compute the metric
    for col_pair in itertools.combinations(df.columns, 2):
        u1 = col_pair[0]
        u2 = col_pair[1]
        sim_df.loc[col_pair] = compute_jaccard(set(df[u1].dropna()), set(df[u2].dropna()))

    print(sim_df)

This returns the following (upper triangular) half of the similarity matrix; the diagonal would, of course, be all ones.

           user1 user2     user3
    user1    NaN  0.25  0.333333
    user2    NaN   NaN       0.5
    user3    NaN   NaN       NaN
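
If the full symmetric matrix is wanted rather than just the upper triangle, one possible extension is sketched below. It assumes Python 3, and the names jaccard_matrix, liked and full_sim are made up for this illustration. Each column is converted to a set only once, which may also help with speed on wider frames. Scoring two users with no likes at all as 0.0, and putting 1.0 on the diagonal even for an empty column, are assumptions rather than anything the question specifies.

```python
import itertools

import numpy as np
import pandas as pd


def jaccard_matrix(df):
    """Return a symmetric DataFrame of pairwise Jaccard similarities between columns."""
    # Build each user's set of liked items once, ignoring the NaN padding.
    liked = {col: set(df[col].dropna()) for col in df.columns}

    # Start from the identity matrix so the diagonal is already 1.0.
    sim = pd.DataFrame(np.eye(len(df.columns)), index=df.columns, columns=df.columns)

    for u1, u2 in itertools.combinations(df.columns, 2):
        union = liked[u1] | liked[u2]
        # Assumption: if neither user liked anything, score the pair as 0.0.
        score = len(liked[u1] & liked[u2]) / len(union) if union else 0.0
        # Write the score to both halves so the matrix is symmetric.
        sim.loc[u1, u2] = score
        sim.loc[u2, u1] = score
    return sim


# Same small test DataFrame as in the answer above.
users = ["user1", "user2", "user3"]
df = pd.DataFrame(
    np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
    columns=users,
)
full_sim = jaccard_matrix(df)
print(full_sim)
```

The off-diagonal values match the upper-triangular result above (0.25, 0.333333 and 0.5), mirrored below the diagonal.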
