PTX does not have a notion of unreachable, which results in emitted basic
blocks having an edge to the next block:
    block1:
      call @does_not_return();
      // unreachable
    block2:
      // ptxas will create a CFG edge from block1 to block2
This may result in significant changes to the control flow graph, e.g., when
LLVM moves unreachable blocks to the end of the function. That's a problem
in the context of divergent control flow, as ptxas uses the CFG to determine
divergent regions, and some instructions must not be executed divergently.
For example, bar.sync is not allowed to be executed divergently on Pascal
or earlier. If we start with the following:
    entry:
      // start of divergent region
      @%p0 bra cont;
      @%p1 bra unlikely;
      ...
      bra.uni cont;
    unlikely:
      ...
      // unreachable
    cont:
      // end of divergent region
      bar.sync 0;
      bra.uni exit;
    exit:
      ret;
it is transformed by the branch-folder and block-placement passes to:
    entry:
      // start of divergent region
      @%p0 bra cont;
      @%p1 bra unlikely;
      ...
      bra.uni cont;
    cont:
      bar.sync 0;
      bra.uni exit;
    unlikely:
      ...
      // unreachable
    exit:
      // end of divergent region
      ret;
After moving the unlikely block to the end of the function, it has an edge
to the exit block, which widens the divergent region and makes the bar.sync
instruction execute divergently. That causes wrong computations, which we have
been running into for years with Julia code (which emits trap + unreachable
sequences all over the place).
To work around this, add an exit instruction before every unreachable,
as ptxas understands that exit terminates the CFG. Note that trap is not
equivalent, and only future versions of ptxas will model it like exit.
Another alternative would be to emit a branch to the block itself, but emitting
exit seems like a cleaner solution to represent unreachable to me.
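To make the effect of the workaround concrete, here is a minimal sketch of a
toy CFG builder. It is not ptxas's actual algorithm; it only assumes, as
described above, that a block without a terminating instruction gets an edge
to the next block in layout order, that exit and ret end control flow, and
that trap does not. The function names, the string encoding of instructions,
and the omission of predicates are all made up for illustration.

```python
# Toy model only: assumes a block falls through to the next block in layout
# order unless it ends in an unconditional branch, `exit`, or `ret`.
# `trap` is deliberately NOT treated as a terminator.
TERMINATORS = {"ret", "exit"}


def successors(blocks):
    """Map each label to its sorted successor labels for a linear layout."""
    labels = [label for label, _ in blocks]
    edges = {}
    for i, (label, insts) in enumerate(blocks):
        succs = set()
        falls_through = True
        for inst in insts:
            op, *args = inst.split()
            if op == "bra":        # predicated branch: target plus fall-through
                succs.add(args[0])
            elif op == "bra.uni":  # unconditional branch: target only
                succs.add(args[0])
                falls_through = False
                break
            elif op in TERMINATORS:
                falls_through = False
                break
        if falls_through and i + 1 < len(blocks):
            succs.add(labels[i + 1])
        edges[label] = sorted(succs)
    return edges


def placed_layout(unreachable_tail):
    """Layout after block placement, with `unlikely` moved before `exit`."""
    return [
        ("entry", ["bra cont", "bra unlikely", "bra.uni cont"]),
        ("cont", ["bar.sync 0", "bra.uni exit"]),
        ("unlikely", unreachable_tail),
        ("exit", ["ret"]),
    ]


before = successors(placed_layout(["trap"]))         # unreachable emits nothing
after = successors(placed_layout(["trap", "exit"]))  # exit added before unreachable

print(before["unlikely"])  # ['exit']
print(after["unlikely"])   # []
```

In this model the unlikely block gains an edge to the exit block only when
nothing terminates it, which is the edge that widens the divergent region.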
Also note that this may not be sufficient: the block with unreachable control
flow may be branched to from different divergent regions, e.g. after block
merging, in which case ptxas could still reconstruct a CFG where divergent
regions are merged (I haven't confirmed this, but also haven't encountered
this pattern in the wild yet):
    entry:
      // start of divergent region 1
      @%p0 bra cont1;
      @%p1 bra unlikely;
      bra.uni cont1;
    cont1:
      // intended end of divergent region 1
      bar.sync 0;
      // start of divergent region 2
      @%p2 bra cont2;
      @%p3 bra unlikely;
      bra.uni cont2;
    cont2:
      // intended end of divergent region 2
      bra.uni exit;
    unlikely:
      ...
      exit;
    exit:
      // possible end of merged divergent region?
I originally tried to avoid the above by cloning paths towards unreachable and
splitting the outgoing edges, but that quickly became too complicated. I propose
we go with the simple solution first, also because modern GPUs with more flexible
hardware thread schedulers don't even suffer from this issue.
Finally, although I expect this to fix most of
https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
below). This is likely due to related bugs in ptxas which have been fixed
since, as I have filed several reproducers with NVIDIA over the past couple of
years. I'm not inclined to look into fixing those issues over here, and will
instead recommend that our users upgrade to CUDA 11.5+ when using these GPUs.
Also see: