diff --git a/llvm/docs/DynamicInstances.rst b/llvm/docs/DynamicInstances.rst new file mode 100644 --- /dev/null +++ b/llvm/docs/DynamicInstances.rst @@ -0,0 +1,431 @@ +==================================================== +Dynamic Instances and Convergent Operation Semantics +==================================================== + +.. contents:: + :local: + :depth: 4 + +Overview +======== + +This document defines dynamic instances and the semantics of related intrinsics +and the ``convergent`` attribute. Convergent operations communicate with other +threads that execute the same operation. Typical examples of convergent +operations are control barriers, which pause execution until all threads in some +set of threads have reached the same barrier, and thread group reductions, which +reduce an input value across some set of threads, e.g. summing up values +computed by each thread. + +Convergent operations are commonly available in SPMD programming environments +such as GPUs, where a large number of threads execute the same program +concurrently. Implementations commonly map threads of execution onto lanes of +a hardware SIMD unit, and the intent of convergent operations is often to +communicate among those threads that are mapped to the same SIMD vector. +However, the semantics defined in this document are independent of such +implementation details. + +The defining characteristic that distinguishes convergent operations from other +inter-thread communication is that the set of threads among which communication +occurs is implicitly defined by control flow. For example, in the following GPU +compute kernel, communication occurs precisely among those threads of an +implementation-defined execution scope (such as workgroup or subgroup) for +which ``condition`` is true: + +.. code-block:: text + + void example_kernel() { + ... + if (condition) + convergent_operation(); + ... 
+ } + +One can think of the communicating sets of threads in terms of *divergence* +and *reconvergence*: sets of threads split at branch instructions if threads +follow different paths through the control flow graph (divergence) and may +later merge when they reach the same static point in the program +(reconvergence). + +In structured programming languages, there is usually an intuitive and +unambiguous way of determining when threads reconverge. However, this is not +the case in unstructured control flow representations as used by LLVM. The +language of *dynamic instances* of instructions is used to formalize divergence +and reconvergence and provide a global, static view of when reconvergence +occurs. + + +Waves/Warps and an Illustrating Example of Dynamic Instances +============================================================ + +Consider a simple loop: + +.. code-block:: text + + void example_loop() { + A: + br label %B + + B: + ... + br i1 %cond, label %B, label %C + + C: + ... + } + +We can illustrate all possible executions of this code by drawing a graph of +dynamic instances (of basic blocks) that arises from "unrolling" the CFG: + +.. code-block:: text + + A -> B0 -> B1 -> B2 -> B3 -> ... + \ \ \ \ + -------------------- ... -> C + + // or, for example: + + A -> B0 -> B1 -> B2 -> B3 -> ... + \ \ \ \ + C0 C1 C2 C3 + +The difference between the two graphs is whether threads "reconverge" after +the loop: if `C` contains a subgroup operation (in the GPU language sense), +will it communicate among all threads of the subgroup (usually: wave or warp) +or only among those that left the loop in the same iteration? + +Note that other graphs are possible as well, where threads reconverge partially. + +As a first approximation, you can think of the dynamic instances as all +executions of an instruction or basic block by the SIMD unit on which a +wave/warp is running. 
If the threads reconverge in `C`, then `C` is only
+executed once for each wave/warp, corresponding to only one dynamic instance.
+
+However, note that this analogy is not precise. Lockstep execution is not
+enforced in the LLVM IR semantics. This falls out naturally from the wave/warp
+model of execution when considering workgroup scope. Even if `C` is considered
+to be reconverged at the workgroup scope, different waves/warps of the same
+workgroup will generally execute instructions in `C` at different points in
+time, except for explicit barrier instructions.
+
+
+Motivating Examples of Convergent Operations
+============================================
+
+The following stylized pixel shader samples a texture at a given set of
+coordinates. Texture sampling requires screen-space derivatives of the
+coordinates to determine the level of detail (mipmap level) of the sample,
+which is commonly approximated by taking the difference with respect to
+neighboring pixels -- which are computed by different threads in the same
+wave or warp:
+
+.. code-block:: text
+
+  void example_shader() {
+    ...
+    color = textureSample(texture, coordinates);
+    if (condition) {
+      use_color();
+    }
+    ...
+  }
+
+From a purely single-threaded perspective, sinking the `textureSample` into
+the if-statement appears legal. However, if the condition is false for some
+neighboring pixels, then their corresponding threads will not execute together
+in the wave/warp, making it impossible to take the difference of coordinates
+as an approximation of the screen-space derivative (in practice, the outcome
+will be an undefined value).
+
+That is, the `textureSample` operation fits our definition of a convergent
+operation:
+
+ * It communicates with a set of threads that implicitly depends on control flow.
+
+ * Correctness depends on this set of threads.
+
+The following example shows that merging common code of branches can be
+incorrect in the face of convergent operations:
+
+.. 
code-block:: text
+
+  void example_kernel() {
+    delta = ...
+    if (delta > 0) {
+      total_gains = nonuniformSubgroupSum(delta);
+      ...
+    } else {
+      total_losses = nonuniformSubgroupSum(delta);
+      ...
+    }
+  }
+
+The `nonuniformSubgroupSum` computing the `total_gains` will be executed by
+the subset of threads with positive `delta` in a warp, and so will sum up all
+the `delta`s of those threads; and similarly for the `nonuniformSubgroupSum`
+that computes the `total_losses`.
+
+If we were to move and merge the `nonuniformSubgroupSum` above the if-statement,
+it would sum up the `delta` across *all* threads.
+
+Consider a simple example of jump threading:
+
+.. code-block:: text
+
+  void example_original() {
+  entry:
+    ...
+    br i1 %cond1, label %then1, label %mid
+
+  then1:
+    ...
+    %cond2 = ...
+    br label %mid
+
+  mid:
+    %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
+    br i1 %flag, label %then2, label %end
+
+  then2:
+    ...
+    call void @nonuniformSubgroupControlBarrier()
+    ...
+    br label %end
+
+  end:
+  }
+
+  void example_jumpthreaded() {
+  entry:
+    br i1 %cond1, label %then1, label %then2
+
+  then1:
+    ...
+    %cond2 = ...
+    br i1 %cond2, label %then2, label %end
+
+  then2:
+    ...
+    call void @nonuniformSubgroupControlBarrier()
+    ...
+    br label %end
+
+  end:
+  }
+
+Is the control barrier guaranteed to synchronize among the same set of threads
+in both cases? Different implementations in the literature may give different
+answers to this question:
+
+* In an implementation that reconverges at post-dominators, threads reconverge
+  at `%mid` in the first version, so that all threads (within a wave/warp) that
+  execute the control barrier do so together. In the second version, threads
+  that reach the control barrier via different paths synchronize separately.
+
+* An implementation that sorts basic blocks topologically and ensures maximal
+  reconvergence for each basic block would behave the same way in both
+  versions. 
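+
+In the language of dynamic instances introduced above, the two behaviors
+correspond to different graphs. The following sketch is illustrative only,
+not normative; it assumes a single wave whose threads split at `%cond1`:
+
+.. code-block:: text
+
+  // Reconvergence at the post-dominator %mid in example_original:
+  // all threads share one dynamic instance of the barrier in then2.
+
+  entry -> then1 -> mid -> then2 -> end
+       \__________/
+
+  // In example_jumpthreaded, %mid no longer exists; an implementation
+  // may create separate dynamic instances of then2 for threads arriving
+  // directly from entry and for threads arriving via then1:
+
+  entry -> then1 -> then2  -> end
+       \
+        -> then2' --------> end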
+
+Therefore, we need to be able to specify explicitly which threads communicate
+in a convergent operation.
+
+
+
+Definitions and Rules
+=====================
+
+1. Every instruction in an LLVM IR function has zero or more dynamic instances.
+
+2. There is exactly one dynamic instance of the function entry point.
+
+3. The dynamic instances form a directed acyclic (potentially infinite) graph.
+
+4. If there is a walk *W* (in this graph) from a dynamic instance *A* to a
+   dynamic instance *B*, then there is a corresponding walk *w* in the
+   function's control flow graph in the following sense: the walks have the
+   same length, and the i-th dynamic instance in *W* is a dynamic instance of
+   the i-th instruction in *w*.
+
+   We say that *w* is the *projection* of *W*, and *W* is a *lifting* of *w*.
+
+5. Given a dynamic instance *A*, every walk through the function's control flow
+   graph starting at the instruction corresponding to *A* has a unique lifting
+   that starts in *A*. That is, two threads that follow the same control flow
+   path will not diverge.
+
+   For the purpose of this rule, edges in the control flow graph are
+   significant. For example, if both labels of a branch instruction point to the
+   same basic block, threads executing the same dynamic instance of the branch
+   instruction with different truth values of the condition may diverge.
+
+   Note: This rule implies that when threads diverge inside of a function call,
+   they will reconverge when returning to the calling function.
+
+6. When executing a dynamic instance *A* of a convergent operation, the set of
+   communicating threads is the subset of threads that execute *A* out of some
+   implementation-defined set of threads.
+
+   Examples of this implementation-defined set of threads are the OpenCL
+   workgroup of the executing thread or a SPIR-V subgroup. 
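+
+As an illustration of the edge-significance caveat in rule 5 (a sketch, not
+part of the rules themselves), consider a conditional branch whose two labels
+name the same basic block:
+
+.. code-block:: text
+
+  ; Both targets are %bb, but the true and false edges are distinct CFG
+  ; edges. Threads executing the same dynamic instance of this branch with
+  ; different values of %cond may therefore diverge into different dynamic
+  ; instances of %bb.
+  br i1 %cond, label %bb, label %bb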
+
+These rules intentionally do not fully define the set (and connectivity) of
+dynamic instances: the only constraint on when reconvergence may happen is
+that the graph must be acyclic. In particular, reconvergence may happen in the
+middle of a basic block. Later in the document, we describe intrinsics that can
+be used to constrain where reconvergence may or may not occur.
+
+7. When the graph of dynamic instances is not fully defined, it is considered
+   to be (partially) implementation-defined. Program transforms are free to add
+   additional constraints to the graph connectivity in this case.
+
+
+Correctness of Program Transforms
+=================================
+
+(This section is informative.)
+
+As implied by the rules in the previous section, program transforms are correct
+with regard to convergent operations if they preserve their semantics, which
+means that the set of communicating threads must remain the same.
+
+If the connectivity of the graph of dynamic instances is fully defined, this
+means that if two possible executions of the original program do/do not execute
+the same dynamic instance of a given convergent operation, then they must/must
+not do so in the transformed program.
+Since the connectivity is in general *not* fully defined, the following more
+complex rule applies: if the transformed program allows a graph of dynamic
+instances such that two possible executions *E1* and *E2* of a program do/do not
+execute the same dynamic instance of a convergent operation, then the original
+program must allow a graph such that *E1* and *E2* do/do not execute the same
+dynamic instance at the corresponding point of their execution.
+
+In general, program transforms with a single-threaded focus are correct if they
+do not move convergent operations in the control flow graph, i.e. if they do
+not sink or hoist convergent operations across a branch. This applies even to
+program transforms that change the control flow graph. 
+
+For example, consider unrolling a simple loop that in itself does not contain
+convergent operations. This loop causes divergence based on, and only based on,
+the trip count. This remains true even after (partial) unrolling. Therefore,
+unrolling the loop does not change the possible graphs of dynamic instances in
+any way that could affect convergent operations. The same argument applies to
+other loop transforms and to transforms such as replacing a switch statement by
+a binary tree of conditional branches.
+
+On the other hand, a loop that contains convergent operations and has an
+unknown trip count cannot be unrolled in general, since the remainder loop
+changes the possible graph structures for convergent operations in the loop
+body.
+
+A major exception to the rule of thumb that program transforms are
+conservatively correct if they "do not touch" convergent operations is the
+class of program transforms that create control flow "from thin air". Examples
+of this are a transform that specializes parts of a function based on testing
+a value against a constant, or a transform that converts ``select``
+instructions into branches. Both kinds of transforms introduce additional
+divergence without guaranteeing reconvergence, which may change the behavior
+of convergent operations that occur much later in the program. (Note, however,
+that if-conversion is perfectly legal as long as the converted basic blocks
+themselves do not contain any convergent operations, because it only reduces
+the set of possible graphs of dynamic instances as far as convergent operations
+are concerned.)
+
+
+Memory Model Non-Interaction
+============================
+
+The fact that an operation is convergent has no effect on how it should be
+treated for memory model purposes. In particular, an operation that is
+``convergent`` and ``readnone`` does not introduce additional ordering
+constraints as far as the memory model is concerned. 
There is no implied
+barrier, neither in the memory barrier sense nor in the control barrier
+sense of synchronizing the execution of threads. Threads executing the same
+dynamic instance do not necessarily do so at the same time.
+
+
+Other Interactions
+==================
+
+1. `convergent` vs. `speculatable`. A function can be both `convergent` and
+   `speculatable`, indicating that the function does not have undefined
+   behavior and has no effects besides calculating its result -- but is still
+   affected by the set of threads executing this function. This typically
+   prevents speculation of calls to the function unless the constraint imposed
+   by `convergent` is further relaxed by some other means.
+
+
+Intrinsics
+==========
+
+This section defines a set of intrinsics that can be used together to restrict
+the possible graphs of dynamic instances by ensuring reconvergence or
+non-reconvergence at certain points of the program.
+
+.. _llvm.convergence.anchor:
+
+``llvm.convergence.anchor``
+------------------------------------
+
+.. code-block:: text
+
+  token @llvm.convergence.anchor() convergent readnone
+
+
+This intrinsic is a marker that acts as an "anchor". The produced token can
+be consumed by *join* or *point* intrinsics in order to restrict the dynamic
+instances DAG structure.
+
+The *region* of an anchor is the subset of its dominance region from which
+corresponding join or point intrinsics are reachable (without leaving
+the dominance region).
+
+Anchor regions must be well-nested: if the region of anchor *a1* contains
+a join or point anchored by *a2*, then *a1* must dominate *a2*.
+
+
+.. _llvm.convergence.join:
+
+``llvm.convergence.join``
+------------------------------------
+
+.. code-block:: text
+
+  void @llvm.convergence.join(token %anchor) convergent readnone
+
+The anchor token must refer to a call to the anchor intrinsic, which we
+simply call the anchor.
+
+This intrinsic ensures reconvergence in the following sense. 
Let *b* be a call +to this intrinsic that can be reached from the function entry (called a +"join"), and let *A* be a dynamic instance of the anchor *a*. Then there is a +dynamic instance *B* of *b* such that for every walk in the graph of dynamic +instances that begins at *A* and encounters a dynamic instance of *b* before +encountering another dynamic instance of *a*, *B* is that dynamic instance of +*b*. + +The program is undefined if the control flow graph has a cycle that contains +a *join* but not its corresponding *anchor*. + + +.. _llvm.convergence.point: + +``llvm.convergence.point`` +----------------------------------- + +.. code-block:: text + + void @llvm.convergence.point(token %anchor) convergent readnone + +The anchor token must be produced by a call to the anchor intrinsic, which we +simply call the anchor. + +This intrinsic ensures non-reconvergence. It is named after the points at the +end of the tines of a fork. Non-reconvergence is ensured in the following +sense. Let *b* be a call to this intrinsic that can be reached from the +function entry (called a "point"), let *a* be the anchor, and let *B* be a +dynamic instance of *b*. Then there is a dynamic instance *A* of *a* such that +on every walk in the graph of dynamic instances starting at the function entry +point and reaching *B*, *A* is the most recently encountered dynamic instance +of *a*. + +The program is undefined if the control flow graph has a cycle that contains +a *point* but not its corresponding *anchor*. diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst --- a/llvm/docs/LangRef.rst +++ b/llvm/docs/LangRef.rst @@ -1401,21 +1401,17 @@ function call are also considered to be cold; and, thus, given low weight. ``convergent`` - In some parallel execution models, there exist operations that cannot be - made control-dependent on any additional values. We call such operations - ``convergent``, and mark them with this attribute. 
- - The ``convergent`` attribute may appear on functions or call/invoke - instructions. When it appears on a function, it indicates that calls to - this function should not be made control-dependent on additional values. - For example, the intrinsic ``llvm.nvvm.barrier0`` is ``convergent``, so - calls to this intrinsic cannot be made control-dependent on additional - values. + In some parallel execution models, there exist operations that communicate + with a set of threads that is implicitly defined by control flow. We call + such operations ``convergent`` and mark them with this attribute. + + The ``convergent`` attribute may appear on call/invoke instructions to + indicate that the instruction is a convergent operation, or on functions + to indicate that calls to this function are convergent operations. - When it appears on a call/invoke, the ``convergent`` attribute indicates - that we should treat the call as though we're calling a convergent - function. This is particularly useful on indirect calls; without this we - may treat such calls as though the target is non-convergent. + The presence of this attribute indicates that certain program transforms + involving control flow are forbidden. For a detailed description, see the + `Dynamic Instances `_ document. The optimizer may remove the ``convergent`` attribute on functions when it can prove that the function does not execute any convergent operations. @@ -1518,10 +1514,15 @@ (synchronize) with another thread through memory or other well-defined means. Synchronization is considered possible in the presence of `atomic` accesses that enforce an order, thus not "unordered" and "monotonic", `volatile` accesses, - as well as `convergent` function calls. Note that through `convergent` function calls - non-memory communication, e.g., cross-lane operations, are possible and are also - considered synchronization. However `convergent` does not contradict `nosync`. 
- If an annotated function does ever synchronize with another thread, + as well as `convergent` function calls. + + Note that `convergent` operations can involve communication that is + considered to be not through memory (e.g. through cross-lane operations) + and does not necessarily imply an ordering between threads for the purposes + of the memory model. Therefore, an operation can be both `convergent` and + `nosync`. + + If a `nosync` function does ever synchronize with another thread, the behavior is undefined. ``nounwind`` This function attribute indicates that the function never raises an @@ -14438,6 +14439,13 @@ .. _dbg_intrinsics: +Convergence Intrinsics +---------------------- + +The LLVM convergence intrinsics for controlling the semantics of convergent +operations, which all start with the ``llvm.convergence.`` prefix, are +described in the `Dynamic Instances `_ document. + Debugger Intrinsics ------------------- diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst --- a/llvm/docs/Reference.rst +++ b/llvm/docs/Reference.rst @@ -17,6 +17,7 @@ CommandGuide/index Coroutines DependenceGraphs/index + DynamicInstances ExceptionHandling Extensions FaultMaps @@ -128,6 +129,9 @@ :doc:`GlobalISel/index` This describes the prototype instruction selection replacement, GlobalISel. +:doc:`DynamicInstances` + Description of dynamic instances semantics and convergent operations. 
+ ===================== Testing and Debugging ===================== diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td --- a/llvm/include/llvm/IR/Intrinsics.td +++ b/llvm/include/llvm/IR/Intrinsics.td @@ -1279,6 +1279,15 @@ [IntrNoMem, ImmArg<1>, ImmArg<2>]>; +//===------- Convergence Intrinsics ---------------------------------------===// + +def int_convergence_anchor : Intrinsic<[llvm_token_ty], [], + [IntrNoMem, IntrConvergent]>; +def int_convergence_join : Intrinsic<[], [llvm_token_ty], + [IntrNoMem, IntrConvergent]>; +def int_convergence_point : Intrinsic<[], [llvm_token_ty], + [IntrNoMem, IntrConvergent]>; + //===----------------------------------------------------------------------===// // Target-specific intrinsics //===----------------------------------------------------------------------===// diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp --- a/llvm/lib/Analysis/ValueTracking.cpp +++ b/llvm/lib/Analysis/ValueTracking.cpp @@ -3946,6 +3946,12 @@ auto *CI = cast(Inst); const Function *Callee = CI->getCalledFunction(); + // The called function depends on the set of threads executing it, which + // could change if the call is moved to a different location in control + // flow. + if (CI->isConvergent()) + return false; + // The called function could have undefined behavior or side-effects, even // if marked readnone nounwind. return Callee && Callee->isSpeculatable(); diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp --- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp +++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp @@ -1094,13 +1094,9 @@ // is unsafe -- it adds a control-flow dependency to the convergent // operation. Therefore restrict remainder loop (try unrollig without). // - // TODO: This is quite conservative. In practice, convergent_op() - // is likely to be called unconditionally in the loop. 
In this - // case, the program would be ill-formed (on most architectures) - // unless n were the same on all threads in a thread group. - // Assuming n is the same on all threads, any kind of unrolling is - // safe. But currently llvm's notion of convergence isn't powerful - // enough to express this. + // TODO: This is somewhat conservative, as we could allow unrolling if the + // trip count is uniform across the threads in a thread group, which it + // should be if convergent_op() is a barrier. if (Convergent) UP.AllowRemainder = false; diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp --- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp +++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp @@ -1263,6 +1263,13 @@ isa(I1)) return false; + // Cannot hoist convergent calls since that could change the set of threads + // with which they communicate. + if (const auto *C = dyn_cast(I1)) { + if (C->isConvergent()) + return false; + } + BasicBlock *BIParent = BI->getParent(); bool Changed = false; @@ -1455,13 +1462,19 @@ I->getType()->isTokenTy()) return false; - // Conservatively return false if I is an inline-asm instruction. Sinking - // and merging inline-asm instructions can potentially create arguments - // that cannot satisfy the inline-asm constraints. - if (const auto *C = dyn_cast(I)) + if (const auto *C = dyn_cast(I)) { + // Conservatively return false if I is an inline-asm instruction. Sinking + // and merging inline-asm instructions can potentially create arguments + // that cannot satisfy the inline-asm constraints. if (C->isInlineAsm()) return false; + // Do not sink convergent calls, as sinking may affect the set of threads + // with which the convergent operation communicates. + if (C->isConvergent()) + return false; + } + // Each instruction must have zero or one use. 
if (HasUse && !I->hasOneUse()) return false; diff --git a/llvm/test/Transforms/JumpThreading/basic.ll b/llvm/test/Transforms/JumpThreading/basic.ll --- a/llvm/test/Transforms/JumpThreading/basic.ll +++ b/llvm/test/Transforms/JumpThreading/basic.ll @@ -605,6 +605,40 @@ ; CHECK: } } +; CHECK-LABEL: @convergence_intrinsics +; CHECK: entry: +; CHECK: if1.then: +; CHECK: br label %if1.end +; CHECK: if1.end: +; CHECK: if2.then: +; CHECK: if2.end +define i32 @convergence_intrinsics(i1 %cc1) { +entry: + %tok = call token @llvm.convergence.anchor() + br i1 %cc1, label %if1.then, label %if1.end + +if1.then: + %v = call i32 @f1() + %cc.v = icmp ne i32 %v, 0 + br label %if1.end + +if1.end: + %cc2 = phi i1 [ false, %entry ], [ %cc.v, %if1.then ] + call void @llvm.convergence.join(token %tok) + br i1 %cc2, label %if2.then, label %if2.end + +if2.then: + %ballot1 = call i32 @ballot(i1 true) + br label %if2.end + +if2.end: + %r = phi i32 [ 0, %if1.end ], [ %ballot1, %if2.then ] + ret i32 %r +} + +declare token @llvm.convergence.anchor() convergent readnone +declare void @llvm.convergence.join(token) convergent readnone +declare i32 @ballot(i1) convergent readnone ; CHECK: attributes [[$NOD]] = { noduplicate } ; CHECK: attributes [[$CON]] = { convergent } diff --git a/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll b/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll --- a/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll +++ b/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll @@ -3,6 +3,7 @@ ; Checks that the SimplifyCFG pass won't duplicate a call to a function marked ; convergent. 
; +; CHECK-LABEL: @check ; CHECK: call void @barrier ; CHECK-NOT: call void @barrier define void @check(i1 %cond, i32* %out) { @@ -25,4 +26,46 @@ ret void } +; CHECK-LABEL: @dont_merge_convergent +; CHECK: if.then: +; CHECK: %ballot.then = call i32 @ballot +; CHECK: if.else: +; CHECK: %ballot.else = call i32 @ballot +define i32 @dont_merge_convergent(i1 %cond1, i1 %cond2) { +entry: + br i1 %cond1, label %if.then, label %if.else + +if.then: + %ballot.then = call i32 @ballot(i1 %cond2) + br label %end + +if.else: + %ballot.else = call i32 @ballot(i1 %cond2) + br label %end + +end: + %ballot = phi i32 [ %ballot.then, %if.then ], [ %ballot.else, %if.else ] + ret i32 %ballot +} + +; CHECK-LABEL: @dont_speculate_convergent +; CHECK: entry: +; CHECK: then: +; CHECK: call i32 @speculatable_convergent +; CHECK: end: +define i32 @dont_speculate_convergent(i1 %cond1, i32 %in) { +entry: + br i1 %cond1, label %then, label %end + +then: + %v = call i32 @speculatable_convergent(i32 %in) + br label %end + +end: + %r = phi i32 [ 0, %entry ], [ %v, %then ] + ret i32 %r +} + +declare i32 @ballot(i1 %arg) convergent readnone +declare i32 @speculatable_convergent(i32) convergent readnone speculatable declare void @barrier() convergent