python - Installing pandas 0.20.3 on Google Cloud Dataflow takes a very long time -
when using apache beam python sdk 2.0.0 on google cloud dataflow, takes forever (about 8 minutes) install pandas 0.20.3. install hangs on message running setup.py bdist_wheel pandas: still running...
. on machine, however, installing same version of pandas doesn't take 30 seconds (even after clearing pip cache). installing pandas takes third of cost of running pipeline right now. ideas on why takes time?
dataflow sdk stages dependencies in source form because client architecture not match vms used dataflow workers. cause pandas installed sources , compiled on vms taking long time.
it possible solve using --extra_package
flag , pointing whl
file. pandas can use corresponding whl
file (py27, x86_64) pypi page of pandas.
Comments
Post a Comment