GPU util is 0 when running a TensorFlow training job, and context switches are very high
tensorflow: 1.2.0, GPU: Titan X (Pascal), driver: 370.28
I run a distributed TensorFlow job training an image classification model, but I see no GPU usage (in fact, GPU util is 0 for MNIST and other training jobs too).
There are many poll system calls when I strace the training process (polling fds for /dev/nvidia0):
poll([{fd=8, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}], 10, 100 <unfinished ...>
futex(0x2d1eca4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3340677, {1502763800, 428734182}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
vmstat shows a high context-switch rate, millions of context switches per second.
Has anyone seen this before?
I've had the same problem before, and it was because the GPU was not set up to be used -- I was actually running TensorFlow on the CPU while thinking it ran on the GPU. If that's your case too, the following should help.
1) Check with nvidia-smi: besides GPU util being 0%, is GPU memory util also 0%? Is there no process listed under Processes at all?
-- If so, the GPU is not used at all and TensorFlow must be running on the CPU (you can also use top to check CPU usage; if it is above 100%, that's additional proof the program is parallelized across CPU cores).
In that case, you should check whether you've installed the GPU version of TensorFlow. You will find two different sets of installation instructions on www.tensorflow.org, for the CPU and GPU versions respectively. The CPU version of TensorFlow can never run on a GPU.
On top of the above, some machine environments require you to specify explicitly which GPU device you want to use. Use a command like the following to check:
CUDA_VISIBLE_DEVICES=0 python rnn_mnist.py
(Note that the value after = must be in the right format; e.g. CUDA_VISIBLE_DEVICES=[0] is invalid, yet no warning is printed and the program silently runs on the CPU instead.)
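To make the silent-fallback pitfall concrete, here is a small sanity check (a hypothetical helper, not part of TensorFlow or CUDA) for the common integer-index form of CUDA_VISIBLE_DEVICES; the driver expects a plain comma-separated list of device indices, so a bracketed value like "[0]" matches no devices and CUDA quietly sees no GPU:

```python
import re

def cuda_visible_devices_ok(value):
    """Rough check that `value` looks like a valid CUDA_VISIBLE_DEVICES
    string in the common integer-index form, e.g. "0" or "0,1".
    Anything else (e.g. "[0]") makes CUDA see no devices, and the
    program silently falls back to the CPU.
    """
    return re.fullmatch(r"\d+(,\d+)*", value) is not None

print(cuda_visible_devices_ok("0"))    # True
print(cuda_visible_devices_ok("0,1"))  # True
print(cuda_visible_devices_ok("[0]"))  # False -- the invalid form above
```

(The real variable also accepts GPU UUIDs, so treat this only as a check for the usual index syntax.)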
2) If that's not the case and Python really is running on the GPU, yet its util is 0%: one possibility is that fetching the data costs most of the time, happens on the CPU, and the GPU just sits waiting for data, so its util averages near 0%.
-- A possible reason is that batch_size is set too small; try 128 or 1024.
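The starvation pattern in 2) is a plain producer-consumer problem: if the CPU-side loader can't keep a batch ready, the GPU-side consumer blocks. The sketch below illustrates the idea with only the Python standard library (it is not TensorFlow's input-pipeline API; in TF 1.x you would use input queues such as tf.train.batch for the same effect):

```python
import queue
import threading
import time

def loader(q, n_batches, load_time=0.01):
    """Producer: simulates CPU-side reading and preprocessing of batches."""
    for i in range(n_batches):
        time.sleep(load_time)  # pretend to read and decode one batch
        q.put(i)
    q.put(None)                # sentinel: no more data

def train(q):
    """Consumer: simulates the GPU training loop; it idles (util 0%)
    whenever the queue is empty and the loader can't keep up."""
    seen = []
    while True:
        batch = q.get()        # blocks here when starved for data
        if batch is None:
            break
        seen.append(batch)
    return seen

# A bounded queue lets the loader run a few batches ahead of training,
# so the consumer rarely has to wait.
q = queue.Queue(maxsize=4)
t = threading.Thread(target=loader, args=(q, 8))
t.start()
batches = train(q)
t.join()
print(batches)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Larger batches help for the same reason: each `q.get()` then carries more work per unit of loading overhead, so the consumer spends less of its time blocked.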