algorithm - Efficient approximation of most frequent items across multiple large item-sets
I would like to know whether there is an efficient algorithm that can approximate the top-k most frequent items across many large item-sets.
For example, suppose we have 99 sets:
s1 = (i1, i2, i3, i5, i7, i9, ... , i230231)
s2 = (i2, i4, i7, i11, i17, i21, ... , i230231)
s3 = (i1, i5, i9, i12, i13, i55, ... , i334234)
...
s99 = (i2, i3, i4, i6, i22, i32, ... , i96574)
I want to find the frequent items, e.g. i2, i1, i3, i7.
An underlying assumption is that the items I care about (the ones I want included in the top-k) have at least 10% frequency across the sets; that is, if there are 100 sets, an item I care about should appear in at least 10 of them. I hope this assumption can make the problem easier.
A straightforward solution is to build a dictionary with items as keys and frequencies as values, then iterate through each item-set and increase the frequency count of each item. The problem is that this costs running time linear in the total size of the sets, ∑|s_i|, and the item-sets are large.
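For concreteness, here is a minimal sketch of that exact-counting baseline in Python, assuming the item-sets fit in memory and the items are hashable; the function name `exact_top_k` is just illustrative:

```python
from collections import Counter

def exact_top_k(item_sets, k):
    """Exact baseline: count every occurrence, then take the k most common.

    Runs in time proportional to sum(|s_i|), which is exactly the cost
    the question is trying to avoid for very large item-sets.
    """
    counts = Counter()
    for s in item_sets:
        counts.update(s)          # one increment per item per set
    return counts.most_common(k)  # list of (item, frequency) pairs
```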
I tried to find an approximation by randomly sampling pairs of sets and computing the intersection of each pair, based on the idea that if an item is frequent enough, it should show up across multiple intersections. This turned out not to be faster (if not slower), as shown in the sketch below.
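For reference, a rough sketch of that pair-sampling idea, assuming each s_i is a Python set; the parameter `num_pairs` and the use of raw vote counts are arbitrary choices of mine, not part of the original attempt:

```python
import random
from collections import Counter

def sampled_pair_intersections(item_sets, k, num_pairs=50, seed=0):
    """Heuristic sketch: intersect randomly chosen pairs of sets and count
    how often each item survives an intersection.

    Items that are frequent across the whole collection should appear in
    many of the sampled intersections. Note that each intersection still
    costs time proportional to the smaller of the two sets, which is why
    this is not necessarily faster than exact counting.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_pairs):
        a, b = rng.sample(item_sets, 2)  # pick two distinct sets at random
        votes.update(a & b)              # items present in both sets get a vote
    return [item for item, _ in votes.most_common(k)]
```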
I would greatly appreciate any ideas, and I am keen to read related research as well.