python 3.x - How to compare columns in a pandas DataFrame
I have a pandas DataFrame that looks like this, with "word" as the column header for all the columns:
    word  word  word    word
0    nap   nap   nap     cat
1    cat   cat   cat  flower
2  peace  kick  kick      go
3  phone   fin   fin     nap
How can I return the words that appear in all 4 columns?
Expected output:
  word
0  nap
1  cat
- Use apply(set) to turn each column into a set of words
- Use set.intersection to find the words that are in every column's set
- Turn the result into a list, and then into a Series
    pd.Series(list(set.intersection(*df.apply(set))))

    0    cat
    1    nap
    dtype: object
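For anyone who wants to try the steps above end to end, here is a minimal sketch on the data from the question. It builds the per-column sets by position with iloc instead of apply(set), which is my own substitution to keep the duplicate "word" labels out of the way; the names column_sets and common are just illustrative.

```python
import pandas as pd

# Sample data from the question; every column shares the header "word".
df = pd.DataFrame(
    [
        ["nap", "nap", "nap", "cat"],
        ["cat", "cat", "cat", "flower"],
        ["peace", "kick", "kick", "go"],
        ["phone", "fin", "fin", "nap"],
    ],
    columns=["word"] * 4,
)

# Step 1: one set of words per column, selected by position (assumption:
# positional access is used here because the labels are not unique).
column_sets = [set(df.iloc[:, k]) for k in range(df.shape[1])]

# Step 2: keep only the words that are present in every column's set.
shared = set.intersection(*column_sets)

# Step 3: turn the set into a list, then a Series. Sorting is optional and
# only there to make the order repeatable, since sets are unordered.
common = pd.Series(sorted(shared), name="word")
print(common)
```

Because sets have no order, the unsorted version may list the words in either order.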
We can accomplish the same task with some Python functional magic and get a performance benefit.
    pd.Series(list(
        set.intersection(*map(set, map(lambda c: df[c].values.tolist(), df)))
    ))

    0    cat
    1    nap
    dtype: object
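One thing to watch with the map version: df[c] selects by label, so it appears to need unique column labels. With the repeated "word" header from the question, a label lookup returns all matching columns as a DataFrame rather than a single Series. The sketch below shows that, and then relabels the columns by position before running the same expression. The relabeling step and the names dup and uniq are my own assumptions, not part of the answer above.

```python
import pandas as pd

rows = [
    ["nap", "nap", "nap", "cat"],
    ["cat", "cat", "cat", "flower"],
    ["peace", "kick", "kick", "go"],
    ["phone", "fin", "fin", "nap"],
]
dup = pd.DataFrame(rows, columns=["word"] * 4)

# With repeated labels, selecting "word" returns every matching column at once.
selected = dup["word"]
print(type(selected).__name__, selected.shape)

# Relabel the columns 0..3 so that each label picks out exactly one column.
uniq = dup.copy()
uniq.columns = list(range(uniq.shape[1]))

# Same map-based expression as above, now run against unique labels.
common = set.intersection(*map(set, map(lambda c: uniq[c].values.tolist(), uniq)))
print(sorted(common))
```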
Timing

Code below:

    from timeit import timeit

    import numpy as np
    import pandas as pd

    # `words` is a list of words that is not defined in this post.

    pir1 = lambda d: pd.Series(list(set.intersection(*d.apply(set))))
    pir2 = lambda d: pd.Series(list(set.intersection(*map(set, map(lambda c: d[c].values.tolist(), d)))))
    # I took some liberties with @Anton vBR's solution.
    vbr = lambda d: pd.Series((lambda x: x.index[x.values == len(d.columns)])(pd.value_counts(d.values.ravel())))

    # One row per input size, one column per function being timed.
    results = pd.DataFrame(
        index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000, 30000]),
        columns='pir1 pir2 vbr'.split()
    )

    for i in results.index:
        # Four columns of i words each, sampled without replacement from the first 2*i words.
        d = pd.concat(dict(enumerate(
            [pd.Series(np.random.choice(words[:i*2], i, False)) for _ in range(4)]
        )), axis=1)
        for j in results.columns:
            stmt = '{}(d)'.format(j)
            setp = 'from __main__ import d, {}'.format(j)
            # DataFrame.set_value was removed in pandas 1.0; .at does the same job.
            results.at[i, j] = timeit(stmt, setp, number=100)

    results.plot(loglog=True)
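The vbr function in the timing code takes a different route: it flattens the whole frame, counts how often each word occurs, and keeps the words whose count equals the number of columns. As far as I can tell this only matches the set intersection when no word repeats within a column, which holds for the timing data because it is sampled without replacement. A small sketch of the idea and of that caveat, using hypothetical column names and a hypothetical helper common_by_count:

```python
import pandas as pd


def common_by_count(frame):
    """Words whose total count across the frame equals the number of columns."""
    counts = pd.Series(frame.to_numpy().ravel()).value_counts()
    return sorted(counts.index[counts.to_numpy() == frame.shape[1]])


def common_by_sets(frame):
    """Words present in every column, via set intersection by position."""
    sets = [set(frame.iloc[:, k]) for k in range(frame.shape[1])]
    return sorted(set.intersection(*sets))


# Data from the question, with made-up unique labels a..d.
df = pd.DataFrame({
    "a": ["nap", "cat", "peace", "phone"],
    "b": ["nap", "cat", "kick", "fin"],
    "c": ["nap", "cat", "kick", "fin"],
    "d": ["cat", "flower", "go", "nap"],
})
print(common_by_count(df), common_by_sets(df))

# Caveat: "x" repeats inside the first column, so its total count is 3, not 2,
# and the counting approach misses it even though it is in both columns.
repeated = pd.DataFrame({"a": ["x", "x"], "b": ["x", "y"]})
print(common_by_count(repeated), common_by_sets(repeated))
```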