python - Computing Jaccard Similarity between DataFrame Columns with Different Lengths
I have a DataFrame with user_ids as the columns and the IDs of movies they've liked as the row values. Here's a snippet:
       15       30       50        93      100     113     1008    1028
0  3346.0  42779.0   1816.0  191319.0    138.0   183.0    171.0   283.0
1  1543.0      NaN    169.0    5319.0  34899.0   188.0  42782.0  1183.0
2  5942.0      NaN  30438.0  195514.0    169.0   172.0    187.0  5329.0
3  3249.0      NaN  32361.0     225.0     87.0   547.0   6710.0   283.0
4   794.0      NaN    187.0  195734.0   6297.0  8423.0   1289.0   222.0
I'm trying to calculate the Jaccard similarity between each column (i.e. between each user, using the movies they've liked). Python gives the following error when I try to use jaccard_similarity_score from sklearn:
ValueError: continuous is not supported
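(For reference, the error is presumably triggered by a call along these lines: jaccard_similarity_score is a classification metric, and equal-length float vectors containing NaNs get classified as 'continuous' targets, which it rejects. The vectors below are made-up stand-ins for two user columns.)

    import numpy as np
    from sklearn.metrics import jaccard_similarity_score

    # Hypothetical minimal reproduction: two equal-length float vectors
    # with NaNs, shaped like the user columns above.
    u1 = np.array([3346.0, 1543.0, np.nan])
    u2 = np.array([42779.0, np.nan, np.nan])
    jaccard_similarity_score(u1, u2)
    # ValueError: continuous is not supported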
Ideally, as a result, I'd like a matrix with the user_ids as its rows and columns and the similarity scores as its values.
How can I go about computing the Jaccard similarities between these columns? I've tried using a list of dictionaries with user ids as keys and lists of movies as values, but it takes forever to compute.
Since sklearn.metrics.jaccard_similarity_score expects two input vectors of equal length, try the following, partially adapted from this similar question.
    import itertools
    import numpy as np
    import pandas as pd

    # Compute the Jaccard similarity index between two sets
    def compute_jaccard(user1_vals, user2_vals):
        intersection = user1_vals.intersection(user2_vals)
        union = user1_vals.union(user2_vals)
        jaccard = len(intersection) / float(len(union))
        return jaccard

    # Small test DataFrame
    users = ['user1', 'user2', 'user3']
    df = pd.DataFrame(
        np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
        columns=users)

    sim_df = pd.DataFrame(columns=users, index=users)

    # Iterate through the column pairs and compute the metric,
    # dropping NaNs and treating each column as a set of movie IDs
    for col_pair in itertools.combinations(df.columns, 2):
        u1 = col_pair[0]
        u2 = col_pair[1]
        sim_df.loc[col_pair] = compute_jaccard(set(df[u1].dropna()),
                                               set(df[u2].dropna()))

    print(sim_df)
This returns the following (upper triangular) half of the similarity matrix; the diagonal would of course be all ones.
          user1  user2     user3
    user1   NaN   0.25  0.333333
    user2   NaN    NaN       0.5
    user3   NaN    NaN       NaN
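If you need the full symmetric matrix rather than just the upper triangle, a minimal follow-up sketch (assuming compute_jaccard and df from above, and that every user has liked at least one movie) could look like this. Precomputing each user's set once, rather than rebuilding it for every pair, also helps when there are many users; user_sets and full_df are illustrative names:

    import itertools
    import numpy as np
    import pandas as pd

    # Precompute each user's set of liked movies once, instead of
    # rebuilding it for every pair of columns.
    user_sets = {u: set(df[u].dropna()) for u in df.columns}

    n = len(df.columns)
    full = np.ones((n, n))  # diagonal stays 1.0: J(u, u) = 1 by definition

    # Fill both triangles from each pairwise score.
    for (i, u1), (j, u2) in itertools.combinations(enumerate(df.columns), 2):
        full[i, j] = full[j, i] = compute_jaccard(user_sets[u1], user_sets[u2])

    full_df = pd.DataFrame(full, index=df.columns, columns=df.columns)
    print(full_df)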