c++ - Incomplete output from printf() called on device
For the purpose of testing the printf() call on the device, I wrote a simple program which copies an array of moderate size to the device and prints the values of the device array to the screen. Although the array is correctly copied to the device, the printf() function does not work correctly: the first several hundred numbers are lost. The array size in the code is 4096. Is this a bug, or am I not using the function properly? Thanks in advance.
Edit: my GPU is a GeForce GTX 550 Ti, compute capability 2.1.
My code:

#include <stdio.h>
#include <stdlib.h>

#define N 4096

__global__ void printcell(float *d_array, int n) {
    int k = 0;
    printf("\n=========== data of d_array on device==============\n");
    for (k = 0; k < n; k++) {
        printf("%f ", d_array[k]);
        if ((k + 1) % 6 == 0) printf("\n");
    }
    printf("\n\ntotally %d elements has been printed", k);
}

int main() {
    int i = 0;
    float array[N] = {0}, rarray[N] = {0};
    float *d_array;
    for (i = 0; i < N; i++) array[i] = i;
    cudaMalloc((void**)&d_array, N * sizeof(float));
    cudaMemcpy(d_array, array, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    printcell<<<1, 1>>>(d_array, N);  // print the device array from a kernel
    cudaDeviceSynchronize();
    /* copy the device array back to the host to see if it was correctly copied */
    cudaMemcpy(rarray, d_array, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("\n\n");
    for (i = 0; i < N; i++) {
        printf("%f ", rarray[i]);
        if ((i + 1) % 6 == 0) printf("\n");
    }
}
printf from the device has a limited queue. It's intended for small-scale, debug-style output, not large-scale output.
Referring to the programming guide:
"The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten."
Your in-kernel printf output overran the buffer, so the first printed elements were lost (overwritten) before the buffer was dumped into the standard I/O queue.
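If you really do want this much output from a kernel, the host-side API mentioned in that quote lets you enlarge the buffer before the launch. The following is only a sketch: it uses cudaDeviceGetLimit() and cudaDeviceSetLimit() with cudaLimitPrintfFifoSize from the runtime API, and the 8 MB figure is an arbitrary value I picked for illustration, not a recommendation. I have not tested it on a GTX 550 Ti.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4096

// Prints n floats from device memory, six per line, from a single thread.
__global__ void printcell(const float *d_array, int n) {
    for (int k = 0; k < n; k++) {
        printf("%f ", d_array[k]);
        if ((k + 1) % 6 == 0) printf("\n");
    }
    printf("\n\n%d elements printed\n", n);
}

int main() {
    static float array[N];
    for (int i = 0; i < N; i++) array[i] = (float)i;

    float *d_array = NULL;
    cudaMalloc((void **)&d_array, N * sizeof(float));
    cudaMemcpy(d_array, array, N * sizeof(float), cudaMemcpyHostToDevice);

    // Report the current size of the device printf buffer.
    size_t fifo_size = 0;
    cudaDeviceGetLimit(&fifo_size, cudaLimitPrintfFifoSize);
    fprintf(stderr, "printf buffer before: %zu bytes\n", fifo_size);

    // Enlarge the buffer BEFORE launching the kernel.
    // 8 MB is an assumed, illustrative value; size it to your output.
    cudaError_t err =
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, (size_t)8 * 1024 * 1024);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printcell<<<1, 1>>>(d_array, N);
    // The buffer is flushed to stdout at synchronization points like this one.
    cudaDeviceSynchronize();

    cudaFree(d_array);
    return 0;
}
```

Even with a larger buffer, for anything beyond debugging it is usually simpler to copy the data back with cudaMemcpy and print it on the host, as the second half of your program already does.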