python - Complex filtering in dask DataFrame
I'm used to doing "complex" filtering on pandas DataFrame objects:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"])
    data2 = data[np.sqrt((data.x - 200)**2 + (data.y - 200)**2) < 1]

This produces no problems.
But with dask DataFrames I get:

    import dask.dataframe

    ddata = dask.dataframe.from_pandas(data, 8)
    ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]
    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    <ipython-input-13-c2acf73dddf6> in <module>()
    ----> 1 ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]

    ~/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in __getitem__(self, key)
       2115             return new_dd_object(merge(self.dask, key.dask, dsk), name,
       2116                                  self, self.divisions)
    -> 2117         raise NotImplementedError(key)
       2118
       2119     def __setitem__(self, key, value):

    NotImplementedError: 0       False

Meanwhile, a simpler operation:

    ddata2 = ddata[ddata.x < 200]

works fine.
I think the issue is that after the "complex" math (i.e. np.sqrt) the result is no longer a lazy dask DataFrame.
Is there a way around this? Do I have to create a new column that I can filter on, or is there a better way?

If you replace np.sqrt with da.sqrt, it works fine.

    import dask.array as da

You may notice that np.sqrt of a dask Series produces a numpy array, so that step in the computation is not lazy and forces a concrete result. Use the dask equivalent function to maintain laziness and keep everything dask-compliant.