diff --git a/llvm/docs/DynamicInstances.rst b/llvm/docs/DynamicInstances.rst new file mode 100644 --- /dev/null +++ b/llvm/docs/DynamicInstances.rst @@ -0,0 +1,244 @@ +==================================================== +Dynamic Instances and Convergent Operation Semantics +==================================================== + +.. contents:: + :local: + :depth: 4 + +Overview +======== + +This document defines dynamic instances and the semantics of related intrinsics +and the ``convergent`` attribute. Convergent operations communicate with other +threads that execute the same operation. Typical examples of convergent +operations are control barriers, which pause execution until all threads in some +set of threads have reached the same barrier, and thread group reductions, which +reduce an input value across some set of threads, e.g. summing up values +computed by each thread. + +Convergent operations are commonly available in SPMD programming environments +such as GPUs, where a large number of threads execute the same program +concurrently. Implementations commonly map threads of execution onto lanes of +a hardware SIMD unit, and the intent of convergent operations is often to +communicate among those threads that are mapped to the same SIMD vector. +However, the semantics defined in this document are independent of such +implementation details. + +The defining characteristic that distinguishes convergent operations from other +inter-thread communication is that the set of threads among which communication +occurs is implicitly defined by control flow. For example, in the following GPU +compute kernel, communication occurs precisely among those threads of an +implementation-defined execution scope (such as workgroup or subgroup) for +which ``condition`` is true: + +.. code-block:: text + + void example_kernel() { + ... + if (condition) + convergent_operation(); + ... + } + +One can think of the communicating sets of threads in terms of *divergence* +and *reconvergence*: sets of threads split at branch instructions if threads +follow different paths through the control flow graph (divergence) and may +later merge when they reach the same static point in the program +(reconvergence). + +In structured programming languages, there is usually an intuitive and +unambiguous way of determining when threads reconverge. However, this is not +the case in unstructured control flow representations as used by LLVM. The +language of *dynamic instances* of instructions is used to formalize divergence +and reconvergence and provide a global, static view of when reconvergence +occurs. + + +Definitions and Rules +===================== + +1. Every instruction in an LLVM IR function has zero or more dynamic instances. + +2. There is exactly one dynamic instance of the function entry point. + +3. The dynamic instances form a directed acyclic (potentially infinite) graph. + +4. If there is a walk *W* (in this graph) from a dynamic instance *A* to a + dynamic instance *B*, then there is a corresponding walk *w* in the + function's control flow graph in the following sense: the walks have the + same length, and the i-th dynamic instance in *W* is a dynamic instance of + the i-th instruction in *w*. + + We say that *w* is the *projection* of *W*, and *W* is a *lifting* of *w*. + +5. Given a dynamic instance *A*, every walk through the function's control flow + graph starting at the instruction corresponding to *A* has a unique lifting + that starts in *A*. 
+   That is, two threads that follow the same control flow
+   path will not diverge.
+
+   For the purpose of this rule, edges in the control flow graph are
+   significant. For example, if both labels of a branch instruction point to
+   the same basic block, threads executing the same dynamic instance of the
+   branch instruction with different truth values of the condition may
+   diverge.
+
+   Note: This rule implies that when threads diverge inside a function call,
+   they will reconverge when returning to the calling function.
+
+6. When executing a dynamic instance *A* of a convergent operation, the set of
+   communicating threads is the subset of threads that execute *A* out of some
+   implementation-defined set of threads.
+
+   Examples of this implementation-defined set of threads are the OpenCL
+   workgroup of the executing thread or a SPIR-V subgroup.
+
+These rules intentionally do not fully define the set (and connectivity) of
+dynamic instances: the only constraint on when reconvergence may happen is
+that the graph must be acyclic. In particular, reconvergence may happen in the
+middle of a basic block. Later in the document, we describe intrinsics that can
+be used to constrain where reconvergence may or may not occur.
+
+7. When the graph of dynamic instances is not fully defined, it is considered
+   to be (partially) implementation-defined. Program transforms are free to add
+   additional constraints to the graph connectivity in this case.
+
+
+Correctness of Program Transforms
+=================================
+
+(This section is informative.)
+
+As implied by the rules in the previous section, program transforms are correct
+with regard to convergent operations if they preserve the semantics of those
+operations, which means that the set of communicating threads must remain the
+same.
+
+If the connectivity of the graph of dynamic instances is fully defined, this
+means that if two possible executions of the original program do/do not execute
+the same dynamic instance of a given convergent operation, then they must/must
+not do so in the transformed program.
+Since the connectivity is in general *not* fully defined, the following more
+complex rule applies: if the transformed program allows a graph of dynamic
+instances such that two possible executions *E1* and *E2* of a program do/do not
+execute the same dynamic instance of a convergent operation, then the original
+program must allow a graph such that *E1* and *E2* do/do not execute the same
+dynamic instance at the corresponding point of their execution.
+
+In general, program transforms with a single-threaded focus are correct if they
+do not move convergent operations in the control flow graph, i.e. if they do
+not sink or hoist convergent operations across a branch. This applies even to
+program transforms that change the control flow graph.
+
+For example, consider unrolling a simple loop that in itself does not contain
+convergent operations. This loop causes divergence based on, and only based on,
+the trip count. This remains true even after (partial) unrolling. Therefore,
+unrolling the loop does not change the possible graphs of dynamic instances in
+any way that could affect convergent operations. The same argument applies to
+other loop transforms and to transforms such as replacing a switch statement by
+a binary tree of conditional branches.
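+
+As an illustration of the unrolling argument, consider the following sketch.
+It is not part of the formal definition, and ``@subgroup_sum`` is merely a
+placeholder for some target-provided convergent reduction, not an intrinsic
+defined by LLVM:
+
+.. code-block:: llvm
+
+   declare i32 @subgroup_sum(i32) convergent readnone
+
+   define i32 @sum_after_loop(i32 %n, i32 %v) {
+   entry:
+     br label %loop
+
+   loop:
+     ; The loop body contains no convergent operations; divergence here
+     ; depends only on the per-thread trip count %n.
+     %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+     %i.next = add i32 %i, 1
+     %cc = icmp slt i32 %i.next, %n
+     br i1 %cc, label %loop, label %exit
+
+   exit:
+     ; Unrolling the loop, with or without a remainder, does not change the
+     ; possible graphs of dynamic instances for this call, so it does not
+     ; affect the set of threads that communicate here.
+     %sum = call i32 @subgroup_sum(i32 %v)
+     ret i32 %sum
+   }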
+
+On the other hand, a loop that contains convergent operations and has unknown
+trip count cannot be unrolled in general, since the remainder loop changes the
+possible graph structures for convergent operations in the loop body.
+
+A major exception to the rule of thumb that program transforms are
+conservatively correct if they "do not touch" convergent operations is the
+class of transforms that create control flow "from thin air". Examples of this
+are a transform that specializes parts of a function based on testing a value
+against a constant, or a transform that converts ``select`` instructions into
+branches. Both kinds of transforms introduce additional divergence without
+guaranteeing reconvergence, which may change the behavior of convergent
+operations that occur much later in the program. (Note, however, that
+if-conversion is perfectly legal as long as the converted basic blocks
+themselves do not contain any convergent operations, because it only reduces
+the set of possible graphs of dynamic instances as far as convergent operations
+are concerned.)
+
+
+Memory Model Non-Interaction
+============================
+
+The fact that an operation is convergent has no effect on how it should be
+treated for memory model purposes. In particular, an operation that is
+``convergent`` and ``readnone`` does not introduce additional ordering
+constraints as far as the memory model is concerned. There is no implied
+barrier, either in the memory barrier sense or in the control barrier sense of
+synchronizing the execution of threads. Threads executing the same dynamic
+instance do not necessarily do so at the same time.
+
+
+Intrinsics
+==========
+
+This section defines a set of intrinsics that can be used together to restrict
+the possible graphs of dynamic instances by ensuring reconvergence or
+non-reconvergence at certain points of the program.
+
+.. _llvm.convergence.anchor:
+
+``llvm.convergence.anchor``
+------------------------------------
+
+.. code-block:: text
+
+  token @llvm.convergence.anchor() convergent readnone
+
+This intrinsic is a marker that acts as an "anchor". The produced token can
+be consumed by *join* or *point* intrinsics in order to restrict the structure
+of the DAG of dynamic instances.
+
+The *region* of an anchor is the subset of the anchor's dominance region from
+which corresponding join or point intrinsics are reachable (without leaving
+the dominance region).
+
+Anchor regions must be well-nested: if the region of anchor *a1* contains
+a join or point anchored by *a2*, then *a1* must dominate *a2*.
+
+
+.. _llvm.convergence.join:
+
+``llvm.convergence.join``
+------------------------------------
+
+.. code-block:: text
+
+  void @llvm.convergence.join(token %anchor) convergent readnone
+
+The anchor token must be produced by a call to the anchor intrinsic, which we
+simply call the anchor.
+
+This intrinsic ensures reconvergence in the following sense. Let *b* be a call
+to this intrinsic that can be reached from the function entry (called a
+"join"), and let *A* be a dynamic instance of the anchor *a*. Then there is a
+dynamic instance *B* of *b* such that for every walk in the graph of dynamic
+instances that begins at *A* and encounters a dynamic instance of *b* before
+encountering another dynamic instance of *a*, *B* is that dynamic instance of
+*b*.
+
+The program is undefined if the control flow graph has a cycle that contains
+a *join* but not its corresponding *anchor*.
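+
+As an illustrative (non-normative) sketch, the following IR uses an anchor
+together with a join to ensure reconvergence after a divergent branch. The
+``@ballot`` callee stands in for an arbitrary target-specific convergent
+operation and is not an LLVM intrinsic:
+
+.. code-block:: llvm
+
+   declare token @llvm.convergence.anchor() convergent readnone
+   declare void @llvm.convergence.join(token) convergent readnone
+   declare i32 @ballot(i1) convergent readnone
+
+   define i32 @reconverge(i1 %cc) {
+   entry:
+     %tok = call token @llvm.convergence.anchor()
+     br i1 %cc, label %then, label %end
+
+   then:
+     br label %end
+
+   end:
+     ; Threads that executed the same dynamic instance of the anchor execute
+     ; the same dynamic instance of this join, whether or not they went
+     ; through %then, so the ballot below communicates among all of them.
+     call void @llvm.convergence.join(token %tok)
+     %b = call i32 @ballot(i1 %cc)
+     ret i32 %b
+   }
+
+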
+.. _llvm.convergence.point:
+
+``llvm.convergence.point``
+-----------------------------------
+
+.. code-block:: text
+
+  void @llvm.convergence.point(token %anchor) convergent readnone
+
+The anchor token must be produced by a call to the anchor intrinsic, which we
+simply call the anchor.
+
+This intrinsic ensures non-reconvergence. It is named after the points at the
+end of the tines of a fork. Non-reconvergence is ensured in the following
+sense. Let *b* be a call to this intrinsic that can be reached from the
+function entry (called a "point"), let *a* be the anchor, and let *B* be a
+dynamic instance of *b*. Then there is a dynamic instance *A* of *a* such that
+on every walk in the graph of dynamic instances starting at the function entry
+point and reaching *B*, *A* is the most recently encountered dynamic instance
+of *a*.
+
+The program is undefined if the control flow graph has a cycle that contains
+a *point* but not its corresponding *anchor*.
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -1390,21 +1390,17 @@
     function call are also considered to be cold; and, thus, given low
     weight.
 ``convergent``
-    In some parallel execution models, there exist operations that cannot be
-    made control-dependent on any additional values. We call such operations
-    ``convergent``, and mark them with this attribute.
-
-    The ``convergent`` attribute may appear on functions or call/invoke
-    instructions. When it appears on a function, it indicates that calls to
-    this function should not be made control-dependent on additional values.
-    For example, the intrinsic ``llvm.nvvm.barrier0`` is ``convergent``, so
-    calls to this intrinsic cannot be made control-dependent on additional
-    values.
+    In some parallel execution models, there exist operations that communicate
+    with a set of threads that is implicitly defined by control flow. We call
+    such operations ``convergent`` and mark them with this attribute.
+
+    The ``convergent`` attribute may appear on call/invoke instructions to
+    indicate that the instruction is a convergent operation, or on functions
+    to indicate that calls to this function are convergent operations.
 
-    When it appears on a call/invoke, the ``convergent`` attribute indicates
-    that we should treat the call as though we're calling a convergent
-    function. This is particularly useful on indirect calls; without this we
-    may treat such calls as though the target is non-convergent.
+    The presence of this attribute indicates that certain program transforms
+    involving control flow are forbidden. For a detailed description, see the
+    `Dynamic Instances <DynamicInstances.html>`_ document.
 
     The optimizer may remove the ``convergent`` attribute on functions when it
     can prove that the function does not execute any convergent operations.
@@ -1507,10 +1503,15 @@
     (synchronize) with another thread through memory or other well-defined means.
     Synchronization is considered possible in the presence of `atomic` accesses
    that enforce an order, thus not "unordered" and "monotonic", `volatile` accesses,
-    as well as `convergent` function calls. Note that through `convergent` function calls
-    non-memory communication, e.g., cross-lane operations, are possible and are also
-    considered synchronization. However `convergent` does not contradict `nosync`.
-    If an annotated function does ever synchronize with another thread,
+    as well as `convergent` function calls.
+
+    Note that `convergent` operations can involve communication that does not
+    go through memory (e.g. cross-lane operations) and that does not
+    necessarily imply an ordering between threads for the purposes of the
+    memory model. Therefore, an operation can be both `convergent` and
+    `nosync`.
+
+    If a `nosync` function does ever synchronize with another thread,
     the behavior is undefined.
 ``nounwind``
     This function attribute indicates that the function never raises an
@@ -14419,6 +14420,13 @@
 
+Convergence Intrinsics
+----------------------
+
+The LLVM convergence intrinsics, which all start with the ``llvm.convergence.``
+prefix, control the semantics of convergent operations. They are described in
+the `Dynamic Instances <DynamicInstances.html>`_ document.
+
 .. _dbg_intrinsics:
 
 Debugger Intrinsics
 -------------------
 
diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst
--- a/llvm/docs/Reference.rst
+++ b/llvm/docs/Reference.rst
@@ -17,6 +17,7 @@
    CommandGuide/index
    Coroutines
    DependenceGraphs/index
+   DynamicInstances
    ExceptionHandling
    Extensions
    FaultMaps
@@ -128,6 +129,9 @@
 :doc:`GlobalISel`
   This describes the prototype instruction selection replacement, GlobalISel.
 
+:doc:`DynamicInstances`
+  Description of the semantics of dynamic instances and convergent operations.
+
 =====================
 Testing and Debugging
 =====================
 
@@ -208,4 +212,4 @@
   LLVM support for coroutines.
 
 :doc:`YamlIO`
-  A reference guide for using LLVM's YAML I/O library.
\ No newline at end of file
+  A reference guide for using LLVM's YAML I/O library.
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1279,6 +1279,15 @@
                              [IntrNoMem, ImmArg<1>, ImmArg<2>]>;
 
+//===------- Convergence Intrinsics ---------------------------------------===//
+
+def int_convergence_anchor : Intrinsic<[llvm_token_ty], [],
+                                       [IntrNoMem, IntrConvergent]>;
+def int_convergence_join : Intrinsic<[], [llvm_token_ty],
+                                     [IntrNoMem, IntrConvergent]>;
+def int_convergence_point : Intrinsic<[], [llvm_token_ty],
+                                      [IntrNoMem, IntrConvergent]>;
+
 //===----------------------------------------------------------------------===//
 // Target-specific intrinsics
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -1094,13 +1094,9 @@
   // is unsafe -- it adds a control-flow dependency to the convergent
   // operation. Therefore restrict remainder loop (try unrollig without).
   //
-  // TODO: This is quite conservative. In practice, convergent_op()
-  // is likely to be called unconditionally in the loop. In this
-  // case, the program would be ill-formed (on most architectures)
-  // unless n were the same on all threads in a thread group.
-  // Assuming n is the same on all threads, any kind of unrolling is
-  // safe. But currently llvm's notion of convergence isn't powerful
-  // enough to express this.
+  // TODO: This is somewhat conservative, as we could allow unrolling if the
+  // trip count is uniform across the threads in a thread group, which it
+  // should be if convergent_op() is a barrier.
   if (Convergent)
     UP.AllowRemainder = false;
 
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -1263,6 +1263,13 @@
       isa(I1))
     return false;
 
+  // Cannot hoist convergent calls since that could change the set of threads
+  // with which they communicate.
+  if (const auto *C = dyn_cast<CallBase>(I1)) {
+    if (C->isConvergent())
+      return false;
+  }
+
   BasicBlock *BIParent = BI->getParent();
 
   bool Changed = false;
@@ -1455,13 +1462,19 @@
       I->getType()->isTokenTy())
     return false;
 
-  // Conservatively return false if I is an inline-asm instruction. Sinking
-  // and merging inline-asm instructions can potentially create arguments
-  // that cannot satisfy the inline-asm constraints.
-  if (const auto *C = dyn_cast<CallBase>(I))
+  if (const auto *C = dyn_cast<CallBase>(I)) {
+    // Conservatively return false if I is an inline-asm instruction. Sinking
+    // and merging inline-asm instructions can potentially create arguments
+    // that cannot satisfy the inline-asm constraints.
     if (C->isInlineAsm())
       return false;
 
+    // Do not sink convergent calls, as sinking may affect the set of threads
+    // with which the convergent operation communicates.
+    if (C->isConvergent())
+      return false;
+  }
+
   // Each instruction must have zero or one use.
   if (HasUse && !I->hasOneUse())
     return false;
diff --git a/llvm/test/Transforms/JumpThreading/basic.ll b/llvm/test/Transforms/JumpThreading/basic.ll
--- a/llvm/test/Transforms/JumpThreading/basic.ll
+++ b/llvm/test/Transforms/JumpThreading/basic.ll
@@ -605,6 +605,40 @@
 ; CHECK: }
 }
 
+; CHECK-LABEL: @convergence_intrinsics
+; CHECK: entry:
+; CHECK: if1.then:
+; CHECK: br label %if1.end
+; CHECK: if1.end:
+; CHECK: if2.then:
+; CHECK: if2.end
+define i32 @convergence_intrinsics(i1 %cc1) {
+entry:
+  %tok = call token @llvm.convergence.anchor()
+  br i1 %cc1, label %if1.then, label %if1.end
+
+if1.then:
+  %v = call i32 @f1()
+  %cc.v = icmp ne i32 %v, 0
+  br label %if1.end
+
+if1.end:
+  %cc2 = phi i1 [ false, %entry ], [ %cc.v, %if1.then ]
+  call void @llvm.convergence.join(token %tok)
+  br i1 %cc2, label %if2.then, label %if2.end
+
+if2.then:
+  %ballot1 = call i32 @ballot(i1 true)
+  br label %if2.end
+
+if2.end:
+  %r = phi i32 [ 0, %if1.end ], [ %ballot1, %if2.then ]
+  ret i32 %r
+}
+
+declare token @llvm.convergence.anchor() convergent readnone
+declare void @llvm.convergence.join(token) convergent readnone
+declare i32 @ballot(i1) convergent readnone
 
 ; CHECK: attributes [[$NOD]] = { noduplicate }
 ; CHECK: attributes [[$CON]] = { convergent }
diff --git a/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll b/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll
--- a/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll
+++ b/llvm/test/Transforms/SimplifyCFG/attr-convergent.ll
@@ -3,6 +3,7 @@
 ; Checks that the SimplifyCFG pass won't duplicate a call to a function marked
 ; convergent.
; +; CHECK-LABEL: @check ; CHECK: call void @barrier ; CHECK-NOT: call void @barrier define void @check(i1 %cond, i32* %out) { @@ -25,4 +26,27 @@ ret void } +; CHECK-LABEL: @dont_merge_convergent +; CHECK: if.then: +; CHECK: %ballot.then = call i32 @ballot +; CHECK: if.else: +; CHECK: %ballot.else = call i32 @ballot +define i32 @dont_merge_convergent(i1 %cond1, i1 %cond2) { +entry: + br i1 %cond1, label %if.then, label %if.else + +if.then: + %ballot.then = call i32 @ballot(i1 %cond2) + br label %end + +if.else: + %ballot.else = call i32 @ballot(i1 %cond2) + br label %end + +end: + %ballot = phi i32 [ %ballot.then, %if.then ], [ %ballot.else, %if.else ] + ret i32 %ballot +} + +declare i32 @ballot(i1 %arg) convergent readnone declare void @barrier() convergent