cuda - How does warp divergence manifest in SASS? -
when different threads in warp execute divergent code, divergent branches serialized, , inactive warps "disabled."
if divergent paths contain small number of instructions, such branch predication used, it's pretty clear "disabled" means (threads turned on/off predicate), , it's visible in sass dump.
if divergent execution paths contain larger numbers of instructions (exact number dependent on some compiler heuristics) branch instructions inserted potentially skip 1 execution path or other. makes sense: if 1 long branch seldom taken, or not taken threads in warp, it's advantageous allow warp skip instructions (rather being forced execute both paths in cases predication).
my question is: how inactive threads "disabled" in case of divergence branches? slide on page 2, lower left of this presentation seems indicate branches taken based on condition , threads not participate switched off via predicates attached instructions @ branch targets. however, not behavior observe in sass.
here's minimal compilable sample:
#include <stdio.h> __global__ void nonpredicated( int* a, int iter ) { if( a[threadidx.x] == 0 ) // make number of divergent instructions unknown @ // compile time compiler forced create branches for( int = 0; < iter; i++ ) { a[threadidx.x] += 5; a[threadidx.x] *= 5; } else for( int = 0; < iter; i++ ) { a[threadidx.x] += 2; a[threadidx.x] *= 2; } } int main(){}
here's sass dump showing branch instructions predicated, code @ branch targets not predicated. threads did not take branch switched off implicitly during execution of branch targets, in way not directly visible in sass? see terminology "active mask" alluded in various cuda documents, i'm wondering how manifests in sass, if separate mechanism predication.
additionally, pre-volta architectures, program counter shared per-warp, idea of predicated branch instruction confusing me. why attach per-thread predicate instruction might change (the program counter) shared threads in warp?
code sm_20 function : _z13nonpredicatedpii .headerflags @"ef_cuda_sm20 ef_cuda_ptx_sm(ef_cuda_sm20)" /*0000*/ mov r1, c[0x1][0x100]; /* 0x2800440400005de4 */ /*0008*/ s2r r0, sr_tid.x; /* 0x2c00000084001c04 */ /*0010*/ mov32i r3, 0x4; /* 0x180000001000dde2 */ /*0018*/ imad.u32.u32 r2.cc, r0, r3, c[0x0][0x20]; /* 0x2007800080009c03 */ /*0020*/ imad.u32.u32.hi.x r3, r0, r3, c[0x0][0x24]; /* 0x208680009000dc43 */ /*0028*/ ld.e r0, [r2]; /* 0x8400000000201c85 */ /*0030*/ isetp.eq.and p0, pt, r0, rz, pt; /* 0x190e0000fc01dc23 */ /*0038*/ @p0 bra 0xd0; /* 0x40000002400001e7 */ /*0040*/ mov r4, c[0x0][0x28]; /* 0x28004000a0011de4 */ /*0048*/ isetp.lt.and p0, pt, r4, 0x1, pt; /* 0x188ec0000441dc23 */ /*0050*/ mov r4, rz; /* 0x28000000fc011de4 */ /*0058*/ @p0 exit; /* 0x80000000000001e7 */ /*0060*/ nop; /* 0x4000000000001de4 */ /*0068*/ nop; /* 0x4000000000001de4 */ /*0070*/ nop; /* 0x4000000000001de4 */ /*0078*/ nop; /* 0x4000000000001de4 */ /*0080*/ iadd r4, r4, 0x1; /* 0x4800c00004411c03 */ /*0088*/ iadd r0, r0, 0x2; /* 0x4800c00008001c03 */ /*0090*/ isetp.lt.and p0, pt, r4, c[0x0][0x28], pt; /* 0x188e4000a041dc23 */ /*0098*/ shl r0, r0, 0x1; /* 0x6000c00004001c03 */ /*00a0*/ @p0 bra 0x80; /* 0x4003ffff600001e7 */ /*00a8*/ st.e [r2], r0; /* 0x9400000000201c85 */ /*00b0*/ bra 0x128; /* 0x40000001c0001de7 */ /*00b8*/ nop; /* 0x4000000000001de4 */ /*00c0*/ nop; /* 0x4000000000001de4 */ /*00c8*/ nop; /* 0x4000000000001de4 */ /*00d0*/ mov r0, c[0x0][0x28]; /* 0x28004000a0001de4 */ /*00d8*/ mov r4, rz; /* 0x28000000fc011de4 */ /*00e0*/ isetp.lt.and p0, pt, r0, 0x1, pt; /* 0x188ec0000401dc23 */ /*00e8*/ mov r0, rz; /* 0x28000000fc001de4 */ /*00f0*/ @p0 exit; /* 0x80000000000001e7 */ /*00f8*/ mov32i r5, 0x19; /* 0x1800000064015de2 */ /*0100*/ iadd r0, r0, 0x1; /* 0x4800c00004001c03 */ /*0108*/ imad r4, r4, 0x5, r5; /* 0x200ac00014411ca3 */ /*0110*/ isetp.lt.and p0, pt, r0, c[0x0][0x28], pt; /* 0x188e4000a001dc23 */ /*0118*/ @p0 bra 0x100; /* 0x4003ffff800001e7 */ /*0120*/ st.e [r2], r4; /* 0x9400000000211c85 */ /*0128*/ exit; /* 0x8000000000001de7 */ .....................................
are threads did not take branch switched off implicitly during execution of branch targets, in way not directly visible in sass?
yes.
there warp execution or "active" mask separate formal concept of predication defined in ptx isa manual.
predicated execution may allow instructions executed (or not) particular thread on instruction-by-instruction basis. compiler may emit predicated instructions enact conditional jump or branch.
however gpu maintains warp active mask. when machine observes thread execution within warp has diverged (for example @ point of predicated branch, or perhaps predicated instruction), set active mask accordingly. process isn't "visible" @ sass level. afaik low level execution process diverged warp (not via predication) isn't specified, questions around how long warp stays diverged , exact mechanism re-synchronization aren't specified, , afaik can affected compiler choices, on architectures. this 1 recent discussion (note particularly remarks @njuffa).
why attach per-thread predicate instruction might change (the program counter) shared threads in warp?
this how perform conditional jump or branch. since execution lock-step, if going execute particular instruction (regardless of mask status or predication status) pc had better point instruction. however, gpu can perform instruction replay handle different cases, needed @ execution time.
a few other notes:
- a mention of "active mask" here:
the scheduler dispatches 32 lanes of warp execution units active mask. non-active threads execute through pipe.
- some nvidia tools allow inspection of active mask.
Comments
Post a Comment