GPU utilization is 0 when running a TensorFlow training job, and the context-switch rate is very high


TensorFlow: 1.2.0, GPU: Titan X (Pascal), driver: 370.28

I am running distributed TensorFlow to train an image classification model and see no GPU usage (in fact, GPU utilization is 0 for MNIST and other training jobs too).

There are many poll system calls when I strace the training process (polling the fd for /dev/nvidia0):

 poll([{fd=8, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}], 10, 100 <unfinished ...>
 futex(0x2d1eca4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3340677, {1502763800, 428734182}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

vmstat shows a very high context-switch rate: millions of context switches per second.

Has anyone seen this before?

I've had the same problem before. It was because the GPU was not actually being used: TensorFlow ran on the CPU while I thought it was running on the GPU. If the GPU is set up correctly, this should not happen.

1) Check with nvidia-smi: besides GPU utilization being 0%, is GPU memory utilization also 0%? Is there no process listed under Processes at all?

-- If so, the GPU is not being used at all and TensorFlow must be running on the CPU (you can use top to check CPU usage; if it is above 100%, that is additional evidence that the program is running in parallel on the CPU).
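The nvidia-smi check above can also be scripted. This is only a rough sketch: it assumes nvidia-smi is on the PATH and that the driver reports both fields as plain numbers (some setups print something like "[N/A]" instead, which this code does not handle). The function names are made up for illustration.

```python
import subprocess


def parse_gpu_stats(csv_text):
    """Parse 'utilization.gpu, memory.used' rows into (util %, MiB) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def query_gpu_stats():
    """Run nvidia-smi and return one (utilization %, memory used MiB) per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,  # decode bytes to text on Python 2 and 3
    )
    return parse_gpu_stats(out)
```

Calling query_gpu_stats() on the training machine while the job is running should show non-trivial memory use on at least one GPU if TensorFlow has really claimed it.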

In that case, you should check whether you have installed the GPU version of TensorFlow. You can find two different sets of installation instructions on www.tensorflow.org, for CPU and GPU respectively. The CPU version of TensorFlow will never run on a GPU.

In addition to the above, some machine environments require you to specify explicitly which GPU device you want to use. Use a command like the following to check:
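One way to see whether the installed TensorFlow build can use a GPU at all is to ask it which devices it sees. The sketch below uses device_lib.list_local_devices() from tensorflow.python.client, which is available in TensorFlow 1.x; the two helper names are hypothetical. Note that listing devices initializes the GPUs, so it may reserve GPU memory for the duration of the process.

```python
def gpu_device_names(devices):
    """Return the names of the devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def tensorflow_gpu_names():
    """List the GPUs TensorFlow can see; an empty list suggests a CPU-only setup."""
    # Imported inside the function so gpu_device_names() works without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

If tensorflow_gpu_names() returns an empty list, either the CPU-only package is installed or the CUDA libraries and driver are not being picked up.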

CUDA_VISIBLE_DEVICES=0 python rnn_mnist.py

(Note: the value after = must be in the right format. For example, CUDA_VISIBLE_DEVICES=[0] is invalid, but no warning is shown and the program runs on the CPU instead.)
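Because a malformed value fails silently, it can help to sanity-check the variable before starting the job. This is a minimal sketch that only accepts comma-separated integer indices such as 0 or 0,1. CUDA also accepts GPU UUIDs, and an empty value hides all GPUs; neither case is treated as valid here. Both function names are invented for this example.

```python
import re

# Assumption: only comma-separated integer indices count as well formed here.
_DEVICE_LIST = re.compile(r"\d+(?:,\d+)*\Z")


def is_valid_device_list(value):
    """True for values like '0' or '0,1'; False for '[0]', '' or '0, 1'."""
    return _DEVICE_LIST.match(value) is not None


def describe_setting(environ):
    """Summarize CUDA_VISIBLE_DEVICES from a mapping such as os.environ."""
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible"
    if is_valid_device_list(value):
        return "ok: " + value
    return "suspicious: %r" % value
```

For example, passing os.environ to describe_setting() at the top of the training script makes a bad value visible in the logs instead of silently falling back to the CPU.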

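2) If that is not the case, and Python is running on the GPU but its utilization is still 0%, it is possible that data fetching, which happens on the CPU, takes most of the time. The GPU then spends its time waiting for data, so its average utilization is about 0%.

-- A possible reason is that batch_size is set too small; try 128 or 1024.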
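To test the data-fetching theory, you can time the two halves of the training loop separately. The sketch below assumes a loop where each batch is fetched in Python and then passed to a synchronous training call (for example a sess.run with a feed_dict); fetch_batch and train_step are placeholders for your own functions. It does not apply directly if the input pipeline runs inside the graph through queues.

```python
import time


def time_split(fetch_batch, train_step, num_steps):
    """Return (seconds spent fetching, seconds spent training) over num_steps."""
    fetch_seconds = 0.0
    train_seconds = 0.0
    for _ in range(num_steps):
        start = time.time()
        batch = fetch_batch()   # CPU-side data loading
        middle = time.time()
        train_step(batch)       # assumed to block until the step has finished
        end = time.time()
        fetch_seconds += middle - start
        train_seconds += end - middle
    return fetch_seconds, train_seconds
```

If the fetch time dominates, the GPU is being starved, and a larger batch_size or a faster input path is worth trying.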

