This is an archive of the discontinued LLVM Phabricator instance.

Differential D85603

IR: Add convergence control operand bundle and intrinsics
AbandonedPublic

Authored by sameerds on Aug 9 2020, 7:22 AM.

Details

Reviewers

jdoerfert
simoll
tra
jlebar
resistor
nhaehnle

Summary

See ConvergentOperations.rst for the details.

This replaces the proposal from https://reviews.llvm.org/D68994

This patch adds the operand bundle and intrinsics themselves, as well as
the LangRef documentation describing the semantics of controlled
convergent operations. Follow-up patches will adjust existing passes to
comply with those changes, as well as provide new functionality on top
of this mechanism.

Change-Id: I045c6bc864c4dc5fb0a23b0279e30fac06c5b974

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

There are a very large number of changes, so older changes are hidden.

Harbormaster completed remote builds in B67733: Diff 284397.Aug 10 2020, 9:01 AM

jdoerfert added inline comments.Aug 10 2020, 9:26 AM

llvm/docs/ConvergentOperations.rst
281	The "heart" and the increment step are fairly vague. Maybe talk about something tangible, e.g., the target of a backedge?

efriedma added inline comments.Aug 10 2020, 12:09 PM

llvm/docs/ConvergentOperations.rst
366	The part that's sort of unclear is that calls coming from outside of LLVM IR may or may not be part of the same dynamic instance. Obviously we can't define that here, but I think we should explicitly note it as something that's implementation-defined.

sameerds added inline comments.Aug 11 2020, 1:25 AM

llvm/docs/ConvergentOperations.rst
203–204	I think the notion of dynamic instances applies to all instructions. Continuing with #3 below, it seems to me that different threads can execute the same dynamic instance of any instruction. It's just that this notion is not very interesting in the case of non-communicating instructions. The ones that communicate need to be marked convergent, so that the effect of transformations on them is limited.
402	So this defines a proper nesting of convergence regions? An informative note would be helpful.
465–471	Which part of the formal semantics shows that this is a valid translation? Rule for the execution of dynamic instances seems to be useful to only specify which threads execute the convergent operations. But what relates them to the original loop? Is it because the set of dynamic instances produced by the second version has a one-to-one mapping with the set of dynamic instances produced by the first version?
517	I think this intends to say "block in the loop body other than the loop header", but the wording chosen is a little difficult to parse on a first read.
523–525	+1 To me, the whole point of this new concept is to capture control dependency so that we don't have to go look at branch conditions again. But allowing such a transformation reintroduces the need to go check the control dependency to understand which threads are really executing this instance.
548–552	This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.

sameerds added inline comments.Aug 11 2020, 1:42 AM

llvm/docs/ConvergentOperations.rst
561	What forbids the convergent operations from being hoisted? Isn't that the whole point of this new framework? In particular, what would the total_gains/total_losses example look like with appropriate use of convergence tokens?

simoll added inline comments.Aug 11 2020, 3:11 AM

llvm/docs/ConvergentOperations.rst
203–204	I'm more concerned about the implications this constraint may have for transformations like branch fusion. The memory model is pretty permissive and allows fusion of memory accesses regardless. @nhaehnle Do you care about non-memory side effects, like exceptions? Do these follow the same weak semantics as the memory model?

With this change, I've edited the documents in a way where I hope all
comments have been addressed.

Augmented the initial motivating examples with versions that show how the control intrinsics are added
Added an additional motivating example to illustrate what the "anchor" is for
Changed the order in which control intrinsics are described, to hopefully make it even more obvious that the "less surprising" one is the entry intrinsic
Add informational notes to the "Dynamic Instances and Convergence Tokens" section, and clean up some minor things in the Formal Rules
A bunch of other random changes all over the place

Herald added a subscriber: jfb. Aug 11 2020, 8:07 AM

nhaehnle added inline comments.Aug 11 2020, 8:07 AM

llvm/docs/ConvergentOperations.rst
203–204	I'm not entirely sure what you mean by the question. There isn't supposed to be any interaction between exceptions and what's being described here. There aren't any relevant constraints expressed on the dynamic instances of non-convergent operations in the first place, and for convergent operations I'd think of them as happening in two steps: there's a cross-thread communication, and afterwards each thread individually decides whether it throws an exception in its context. This can obviously take the exchanged data into account, to the point where you could model an operation as exchanging bits between threads to indicate whether an exception should be thrown in each thread -- so you could have an operation that throws an exception based on a value in another thread, as long as that other thread executes the same dynamic instance. Similarly, you could have UB in thread A based on an argument value in thread B as long as A and B execute the same dynamic instance. I'm going to add an informational note to the end of this section that dynamic instances of non-convergent instructions don't matter.
213–215	The logical split between the two sections is that this section has the basic definitions, while the "Formal Rules" section has the rules about how the convergence control intrinsics place additional constraints on how dynamic instances can be formed.
> If the token represents the dynamic instance exactly then this would also limit the freedom llvm.experimental.convergence.anchor() has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.
I'm not sure I understand the argument. What exactly do you mean by dynamic invocation here? Each time a thread executes the same anchor call site, it will receive a different token value, corresponding to a different dynamic instance. That may or may not be the same dynamic instance as received by other threads. So even if control flow is entirely uniform, an implementation would be free to produce a different thread partitioning each time the anchor is executed. That is on purpose: if you want more predictable thread partitionings, use a combination of `entry` and `loop` intrinsics as required.
281	When it comes to defining rules that are applicable to completely general IR, the loop intrinsic call site feels more tangible than the notion of backedge. For example, backedges don't really work as a concept when you have irreducible control flow. The loop intrinsic call site also really doesn't have to be in the header block of a natural loop -- it could be inside of an if-statement in the loop, for example, which has interesting consequences but can still be defined (and can actually be useful: someone pointed me at a recent paper by Damani et al., "Speculative Reconvergence for Improved SIMT Efficiency", which proposes a certain "unnatural" way of controlling convergence in some kinds of loops for performance; the same kind of effect can be achieved by placing the loop heart inside of an if-statement).
291–294	The intention is that the IR-based rules still apply regardless of whether the caller is in the same module or not. I'm not sure if this needs to be spelled out more clearly. And yes, for other cases we should be able to think of it as a property of the calling convention.
340–344	No, this is explicitly not sufficient. You can have:
    %tok = call token @llvm.experimental.convergence.anchor()
    br i1 %cc, label %then, label %next
  then:
    call void @convergent_op() [ "convergencectrl"(token %tok) ]
    br label %next
  next:
362–366	I think this comment may have moved to a confusing location relative to the document. `entry` and `anchor` are inherently different. I'm going to add a note about looking at language specs etc.
388–389	No, the rule excludes code such as:
    %a = call token @llvm.experimental.convergence.anchor()
    %b = call token @llvm.experimental.convergence.anchor()
    call void @convergent_op() [ "convergencectrl"(token %a) ]
    call void @convergent_op() [ "convergencectrl"(token %b) ]
The convergence region of `%b` contains a use of `%a` but not its definition. I'm going to add a note about nesting.
405	I agree with @t-tye's explanation here. The choice here reflects the choice made e.g. in the Vulkan memory model: the only "convergent" operation (not the term used in Vulkan...) which interacts with the memory model is OpControlBarrier, so it's good to be able to treat these two kinds of communication orthogonally.
447	> It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites).
Yes, that is the point of `llvm.experimental.convergence.anchor`. And yes, if there was clear "chain of custody" as you call it from outside of the loop, then this unrolling with remainder would be incorrect.
465–471	The first version doesn't have a unique set of dynamic instances in the first place, because `anchor` is by design implementation-defined. So the possible universes of dynamic instances in the transformed/unrolled version only need to be a subset of those possible for the original. In a sense, the loop unroll with remainder picks a subset by saying: from now on, if you have two threads with e.g. iteration counts 3 and 4, then they will never communicate during the 3rd iteration. In the original program, they may or may not have communicated during the 3rd iteration -- up to the implementation, and in this case, the implementation decided to do a form of loop unrolling which implicitly ends up making a choice.
471	I hope this has been answered in the context of your other comments?
508	> Is that still grammatically correct?
The parse of the sentence is: Loops in which ((a loop intrinsic outside of the loop header) uses a token defined outside of the loop). That is, "a loop intrinsic outside of the loop header" is the subject of the sentence in the outer parentheses.
517	Going to try an improvement :)
523–525	I mean, `anchor` is implementation-defined, so you can't make a totally solid statement anyway. You could only make solid relative statements if the token produced by the anchor was also used by some other convergent operations, and if those are outside of the if-statement, the sinking wouldn't be allowed anymore anyway...
548–552	CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.
561	I'm going to add that example.
576–579	Should be answered elsewhere.
605–606	The pixel example would use `entry` instead of `anchor`. I'm going to add that example.
615–616	Should be answered elsewhere.
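
A minimal sketch, not taken from the patch, of how the nesting rule from the reply to 388–389 could be checked for straight-line code. It assumes a simplified model in which the convergence region of a token runs from its definition through its last use; the helper names `region` and `nesting_violations` are hypothetical and do not correspond to anything in the LLVM verifier.

```python
from typing import List, Tuple

# Each instruction is ("def", token) or ("use", token); straight-line code only.
Instr = Tuple[str, str]


def region(code: List[Instr], token: str) -> range:
    """Indices from the definition of `token` through its last use (assumed model)."""
    d = code.index(("def", token))
    uses = [i for i, ins in enumerate(code) if ins == ("use", token)]
    last = max(uses) if uses else d
    return range(d, last + 1)


def nesting_violations(code: List[Instr]) -> List[Tuple[str, str]]:
    """Pairs (t1, t2): the region of t1 contains a use of t2 but not t2's definition."""
    tokens = [t for kind, t in code if kind == "def"]
    bad = []
    for t1 in tokens:
        r = region(code, t1)
        for i in r:
            kind, t2 = code[i]
            if kind == "use" and t2 != t1 and code.index(("def", t2)) not in r:
                bad.append((t1, t2))
    return bad


# The rejected shape: both anchors are defined first, then %a and %b are used in turn.
rejected = [("def", "a"), ("def", "b"), ("use", "a"), ("use", "b")]
# A properly nested variant: %b is defined and used entirely inside the region of %a.
nested = [("def", "a"), ("def", "b"), ("use", "b"), ("use", "a")]

print(nesting_violations(rejected))
print(nesting_violations(nested))
```

Under this model the first sequence reports the pair `("b", "a")`, matching the statement that the region of `%b` contains a use of `%a` but not its definition, while the nested variant reports nothing.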

Harbormaster completed remote builds in B67909: Diff 284735.Aug 11 2020, 8:18 AM

sameerds added inline comments.Aug 11 2020, 9:49 PM

llvm/docs/ConvergentOperations.rst
53–56	I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete, that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:
In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
548–552	Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication is outside the scope of this document, where the mask argument is used to relate the dynamic instances. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.

sameerds added inline comments.Aug 12 2020, 12:03 AM

llvm/docs/ConvergentOperations.rst
281	It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence?
552–554	So the heart is not a property of the loop itself in LLVM IR. It is a place chosen by the frontend based on semantics external to LLVM IR, in a way that allows the frontend to express constraints about convergence in the loop.
571	Just like the loop intrinsic, this intrinsic occurs in a place chosen by the frontend based on semantics outside of LLVM IR, and used by the frontend to express constraints elsewhere in the IR.
612–613	The older comments about this seem to have floated away. At risk of repeating the discussion, what is n capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?
647–650	Just like the n property of the loop intrinsic, I think an informational note explaining this will be helpful.
654–657	This is not a rule; it's just a definition.
659–661	Since a convergence region is defined for a token, this text needs to bring out the fact that two different tokens are being talked about at this point. Something like: "If the convergence region for token T1 contains a use of another token T2, then it must also contain the definition of T2."
750–755	So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?
760	Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().

nhaehnle added inline comments.Aug 12 2020, 12:48 PM

llvm/docs/ConvergentOperations.rst
53–56	> In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case. This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
What ambiguities do you have in mind? If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
281	No. The jump-threaded code could also come out of C(++) code with `goto`s, so this doesn't really work.
548–552	I don't understand what you're trying to get at here. The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow, it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances. All legal program transforms subject to those constraints are then legal. There is no need to label them as `convergent`. If you can think of a counter-example, I'd be curious to see it.
552–554	Yes.
571	I'd rephrase it slightly by saying that the place is chosen by the frontend in a way that preserves the semantics of the original language into LLVM IR. But I suspect that we're ultimately thinking of the same thing.
612–613	It's really just a loop iteration counter. Every time a thread executes the `loop` intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label. Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the `loop` intrinsic. This means that if you have a natural loop where the `loop` intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere). Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
654–657	Fair enough. I'm going to split this up into rules about cycles and rules about convergence regions.
659–661	It's needed from a formal point of view, but it does seem to trip people up, so I'm going to implement your suggestion :)
750–755	Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.
760	Correct. I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.
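
The evaluation order described in the reply to 53–56 (determine the set of threads at the entry, then visit the blocks of an acyclic function in topological order) can be sketched as follows. This is an illustrative model only: the CFG encoding, the `choose` callback and the function name `communication_sets` are assumptions made for the example and are not part of the patch.

```python
from graphlib import TopologicalSorter


def communication_sets(cfg, entry, threads, choose):
    """cfg: block -> list of successors (must be acyclic).
    choose(thread, block) -> the successor that the thread takes out of `block`.
    Returns block -> set of threads that reach the block."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    preds = {b: set() for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)

    reach = {b: set() for b in cfg}
    reach[entry] = set(threads)
    for b in TopologicalSorter(preds).static_order():
        if not cfg[b]:
            continue  # exit block: nothing to propagate
        for t in reach[b]:
            reach[choose(t, b)].add(t)
    return reach


# entry branches on a per-thread condition; "then" falls through to "next".
cfg = {"entry": ["then", "next"], "then": ["next"], "next": []}


def choose(thread, block):
    if block == "entry":
        return "then" if thread % 2 == 0 else "next"
    return "next"


reach = communication_sets(cfg, "entry", range(4), choose)
print(reach)
```

In this model the communication set at each block is simply the subset of the entry threads that reach it: the even-numbered threads in `then`, and all four threads again in `next`.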

sameerds added inline comments.Aug 12 2020, 9:59 PM

llvm/docs/ConvergentOperations.rst
53–56	Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but general acyclic control flow needs special attention in specific cases, starting with jump threading.
281	But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert those intrinsics just in case optimizations changed the graph? If yes, is this something that LLVM must prescribe for every frontend as part of this document?
548–552	I am trying to understand whether there are constructs in Clang-supported high-level languages that cannot be addressed by these intrinsics. And if such constructs do exist, then whether that gates the adoption of this enhancement in LLVM. But I see your point now. The sync() builtins in CUDA are no longer dependent on convergence. The decision to hoist or sink them is based entirely on other things like data dependences (and maybe just that).
612–613	> Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.
This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?
> Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.
750–755	The n that I used is the virtual loop count that is described in the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?
760	I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.
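
The intuitive set-of-threads explanation for 750–755 can be made concrete with a small simulation. This is a sketch under stated assumptions, not the formal argument requested above: threads are assumed to communicate only when they execute the same static instruction with the same iteration label, the unroll factor is 2, and the names `original_key`, `unrolled_key` and `partition` are hypothetical.

```python
from collections import defaultdict


def original_key(n, i):
    """Original loop: one static copy of the body, labelled by iteration."""
    return ("body", i)


def unrolled_key(n, i, factor=2):
    """Unrolled loop with remainder: which static copy runs original iteration i
    for a thread with trip count n, and on which trip of that loop."""
    main_iters = (n // factor) * factor
    if i < main_iters:
        return (f"main.{i % factor}", i // factor)
    return ("remainder", i - main_iters)


def partition(trip_counts, i, key):
    """Group the threads that execute original iteration i by the given key."""
    groups = defaultdict(set)
    for thread, n in trip_counts.items():
        if i < n:
            groups[key(n, i)].add(thread)
    return sorted(sorted(g) for g in groups.values())


# Two threads with trip counts 3 and 4, as in the earlier reply to 465–471.
trip_counts = {"A": 3, "B": 4}

print(partition(trip_counts, 2, original_key))
print(partition(trip_counts, 2, unrolled_key))
```

For the third iteration (index 2), the original loop keeps A and B together, whereas the unrolled version sends A through the remainder loop and B through the main loop, so the set is split. That split is exactly the partitioning that makes the transform invalid when the token comes from outside the loop.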

Actually submit all the changes that I thought I had submitted
two days ago.

Also:

add rationale for the static rule on cycles
expand the discussion of program transform correctness for loop unrolling and split the section since it's getting quite large

llvm/docs/ConvergentOperations.rst
53–56	I hesitate to write anything like that, because then you get into the problem of defining what "structured" means -- there are multiple definitions in the literature. My argument would be that purely acyclic control flow -- whether structured or not -- contains sufficient information about convergence to define semantics consistently, without assistance, and avoiding spooky action at a distance. That you still need some assistance to make actual guarantees is really down to composability. For example, you can have a fully acyclic function called from inside a cycle, and then the question is what happens at inlining. One can explore an alternative scheme where you don't have to insert anything into the acyclic function in this case and it's the job of the inlining transform to fix things up, and I have done some exploring in this direction. There are at least two downsides:
The burden on generic program transforms becomes larger.
There is no longer any way for the programmer to express the distinction between functions (or sub-sections of code) that care about the set of threads with which they're executed vs. those that don't (like the `@reserveSpaceInBuffer` example I added), and that closes the door on certain performance optimizations and becomes problematic if you want to start thinking about independent forward progress.
281	It needs to insert the control intrinsics if it wants to have any guarantees. There aren't a lot of useful guarantees we can make today without this, so that's fine. I don't want to say that frontends absolutely must insert the control intrinsics just yet, that's why uncontrolled convergent operations are allowed but deprecated. Frontends for languages with convergent operations that don't change will remain in the world of "things tend to work as expected a lot of the time, but stuff can break in surprising ways at the least convenient moment" that they are already in today. If they run the ConvergenceControlHeuristic pass just after IR generation, the times where things break will likely be somewhat reduced, but probably not eliminated entirely. It's difficult to make a definitive claim because there's obviously also the question of which guarantees the high-level language is supposed to give to the developer. For a HLL that just doesn't want to give any guarantees, not inserting control intrinsics is fine from the POV of language spec correctness, although you're likely to run into corner cases where the language behavior clashes with developers' intuitive expectations.
612–613	> Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic. This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?
Well... what even is a loop iteration? :) For the purpose of convergence, the loop heart defines what the iterations are, so it is reached on every iteration by definition. (But there may well be cycles in the CFG that don't contain a loop intrinsic, and that's fine.) More likely your real question is whether in a natural loop, the loop intrinsic must be reached once per execution of the loop header (or traversal of a back edge) -- the answer is no. Part of the rationale here (and also an unfortunately inherent source of potential confusion) is that for defining convergence, and more generally for implementing whole-program vectorization of the style we effectively do in AMDGPU, leaning only on natural loops doesn't work, at least in part because of the possibility of irreducible control flow. This is why all the actual algorithms I'm building on this rely on the Havlak-inspired CycleInfo of D83094, and all the rules in this document are expressed in terms of cycles (in the sense of circular walks in the CFG) instead of natural loops.
> My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.
I would have liked to have that property but couldn't make it work without imposing static rules that would be much harder to understand and follow. The point about unrolling is mentioned in the later examples section where I talk through a bunch of example loops and whether they can be unrolled or not.
750–755	The formal explanation is ultimately that the set of communicating threads is changed, but I agree that it could be helpful to spell out how that comes about via the rules on dynamic instances, so I'm going to do that.
760	I'm adding a "rationale" section specifically to explain those static rules about cycles.
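
The iteration labelling discussed under 612–613 (every thread starts at 0 and increments its counter each time it reaches the `loop` intrinsic) can be illustrated with a small model. This sketch assumes that the first execution of the heart is labelled 0 and that the heart sits in a conditional block of a natural loop; the function names are hypothetical.

```python
def heart_labels(reaches_heart):
    """reaches_heart: one bool per trip through the loop header.
    Returns (header_trip, label) for every execution of the loop heart."""
    labels = []
    counter = 0  # every thread starts at 0
    for trip, reached in enumerate(reaches_heart):
        if reached:
            labels.append((trip, counter))
            counter += 1  # incremented only when the heart is reached
    return labels


def header_trip_for_label(reaches_heart, label):
    """Header trip on which a thread executes the heart with the given label."""
    for trip, lab in heart_labels(reaches_heart):
        if lab == label:
            return trip
    return None


# Thread A reaches the heart on every header trip; thread B skips it on the second.
thread_a = [True, True, True]
thread_b = [True, False, True]

print(heart_labels(thread_a))
print(heart_labels(thread_b))
```

With these inputs, label 1 corresponds to the second header trip for A but to the third header trip for B, which is the sense in which iterations are "counted in a way that seems funny" when the heart is not in the header.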

Harbormaster completed remote builds in B68219: Diff 285267.Aug 13 2020, 12:01 AM

Typos and yet slightly more detail.

Harbormaster completed remote builds in B68221: Diff 285272.Aug 13 2020, 12:33 AM

sameerds added inline comments.Aug 13 2020, 2:57 AM

llvm/docs/ConvergentOperations.rst
175–176	But this use of the intrinsics does not add any new constraints, right? This specific optimization is already sufficiently constrained by control dependence.
744–745	The exhausted reader just begs to see the corrected version at this point! :)
808–810	Following the structure of previous examples, it would be good to have a demonstration of how this can result in misinterpreted convergence. That would explain why this example should be illegal. This paragraph directly applies the rules to show how the example is recognized as illegal.

nhaehnle added inline comments.Aug 13 2020, 4:01 AM

llvm/docs/ConvergentOperations.rst
175–176	It doesn't add any constraints for existing generic transforms in LLVM that I'm aware of, but there's still a bit of non-trivial content to it at least in theory. Whether it matters in practice depends on the backend. E.g., it doesn't matter for AMDGPU, but modern versions of CUDA say that some sort of divergence can basically happen at any point in the program. If you wanted to take code that uses the convergent operations and translate it to CUDA builtins, the control intrinsics make a difference. In that case, you'd want the uniform threadmask to replace the entry intrinsic. If it was an anchor somewhere instead, you'd want to replace the anchor by `__activemask()` and then use its return value. In both cases, you'd possibly modify the mask somehow to account for additional control dependencies between the anchor and its use. This "modify the mask somehow" hides a lot of complexity, but thinking about it quite a bit I believe it's a similar amount of complexity to what we have in the AMDGPU backend to make things work, probably less because more of the burden is shouldered by hardware in the end. Plus there's the composability aspect of it if we're talking about functions that aren't kernel entry points and might be inlined.
744–745	The exhausted author is taking a note and will get around to it soon ;)
808–810	Isn't it just the same as in the example directly above? You'd expand C / E to a longer sequence of what happens in those inner loops, but the essential difficulty is the same.

sameerds added inline comments.Aug 13 2020, 9:40 AM

llvm/docs/ConvergentOperations.rst
808–810	Maybe it is the same. See earlier note about exhausted reader. :) Maybe it's just me, but the concepts in this document are quite slippery, and well-rounded examples that restate the obvious can go a long way in gaining confidence.

Add more language about loops

Harbormaster completed remote builds in B68442: Diff 285707.Aug 14 2020, 11:40 AM

tighten the static rules about cycles: there was a gap in the exact phrasing if two loop heart intrinsics in a cycle use _different_ convergence tokens
add verifier checks and corresponding tests for the static rules

Harbormaster completed remote builds in B69031: Diff 286816.Aug 20 2020, 7:42 AM

clang-format fixes

Harbormaster completed remote builds in B69032: Diff 286817.Aug 20 2020, 7:43 AM

simoll added inline comments.Aug 28 2020, 5:19 AM

llvm/docs/ConvergentOperations.rst
340–344	You mean control flow could make threads diverge? But those threads won't even reach the convergent instruction, and among those that do, only the ones that have the same runtime token value will execute it as a pack.

nhaehnle added inline comments.Sep 7 2020, 7:49 AM

llvm/docs/ConvergentOperations.rst
340–344	Ah, I misread your earlier comment. Yes, though there's a question of whether the different threads actually see the same value, or whether they see different values that happen to refer to the same dynamic instance of the defining instruction. One may want to think of the token value as a handle to some control structure that refers to a dynamic instance and also holds a loop counter for the loop heart intrinsic. I don't think it really matters much either way.
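
One way to picture the exchange on 340–344 is to treat the token as a handle, as suggested above, and to group the threads that actually reach the convergent operation by that handle. The sketch below is only a mental model: the `Token` fields, the hand-picked partitioning produced by the anchor, and the function name `packs` are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    instance: int          # identifies a dynamic instance of the defining intrinsic
    loop_counter: int = 0  # only meaningful for tokens produced by a loop heart


def packs(tokens, cc):
    """tokens: thread -> Token received from the anchor.
    cc: thread -> bool, whether the thread takes the branch to the convergent op.
    Returns the sets of threads that execute the op together."""
    groups = defaultdict(set)
    for thread, tok in tokens.items():
        if cc[thread]:  # threads that skip the branch never execute the op
            groups[tok].add(thread)
    return sorted(sorted(g) for g in groups.values())


# Assumed partitioning: threads 0-2 share one dynamic instance of the anchor, thread 3 has another.
tokens = {0: Token(0), 1: Token(0), 2: Token(0), 3: Token(1)}
cc = {0: True, 1: False, 2: True, 3: True}

print(packs(tokens, cc))
```

Thread 1 holds the same token as threads 0 and 2 but never reaches the operation, so holding the same token is not by itself enough to execute together; thread 3 reaches the operation but with a different token, so it forms its own pack.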

I've only read up to Formal Rules so later sections might change things but I figure it's potentially useful to see a reader's thoughts mid-read. I'm pretty sure I've misunderstood the anchor intrinsic based on what I've read of the doc and comments so far.

llvm/docs/ConvergentOperations.rst
141–144	I think this is a little misleading, IIUC and assuming that the sets of communicating threads are quads as mentioned above then `%condition` doesn't need to be uniform across all the threads referenced by `%entry`. The only use is inside the `then:` block so I would expect that communicating threads for which `%condition` is uniformly false don't need to be considered as their result will not be used by any thread that enters `then:`. As you're trying to leave methods out, it's probably best left at `... with additional knowledge, that it doesn't change the result`.
The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: How would such a transform know what to prove and when to try?
For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the instructions/intrinsics convergence than `convergencectrl` declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if `@textureSample` were a non-intrinsic function with the same properties you describe I don't think it would be possible to know any better than what `convergencectrl` declares, preventing the analysis the sinking transform would depend on.
It's arguably out of scope for this doc but do you foresee convergence tokens and the `convergent` attribute becoming finer grained in future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?

dsanders added inline comments.Sep 8 2020, 10:04 PM

llvm/docs/ConvergentOperations.rst
28–30	This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models are stretching the definition a bit. I don't think it's worth bogging down the main text for that nitpick, but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.
212–214	Should we also mention that it's valid when %cc is non-uniform so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking %delta using %cc first). In other words, it's not the actual communication we need to keep consistent but the effects (and side-effects) of that communication.
248–252	I feel like there's something I'm missing here. This sounds like:

  if (condition1) {
    %token = anchor()
    if (condition2) {
      ...
    }
    sum() convergencectrl(%token)
  }

can be rewritten to:

  if (condition1) {
    if (condition2) {
      %token = anchor()
      ...
      sum() convergencectrl(%token)
    }
  }

which made sense at first given statements like `we don't care which threads go together`, but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:

  if (condition1) {
    %token = entry()
    if (condition2) {
      ...
    }
    sum() convergencectrl(%token)
  }

and would prevent sinking into or hoisting out of either if-statement, but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case? Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as its name would suggest. However, I haven't seen anything say it's immobile, and things like `we don't care which threads go together` and `the code does not care about the exact set of threads with which it is executed` give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move then there'd be nothing stopping me deleting it either, as I could always invent a `if(false) { ... }` to sink it all into.
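
The masking idea in the comment on lines 212–214 can be sketched with a toy model. This is only an illustration under stated assumptions: `subgroup_sum`, the per-thread `delta` values, and the `cc` flags are all hypothetical stand-ins, not anything from the patch. The point is that widening the set of communicating threads is harmless when the extra threads contribute 0.

```python
def subgroup_sum(values, participants):
    """Sum the values contributed by the participating threads."""
    return sum(values[t] for t in participants)

delta = [3, 5, 7, 11]            # hypothetical per-thread %delta
cc = [True, False, True, True]   # hypothetical per-thread %cc, non-uniform

# Intended communication: only threads where cc holds take part.
intended = subgroup_sum(delta, [t for t in range(4) if cc[t]])

# Wider communication: every thread takes part, but delta is masked by cc
# first, so threads where cc is false contribute 0.
masked = [d if c else 0 for d, c in zip(delta, cc)]
widened = subgroup_sum(masked, range(4))

print(intended, widened)
```

Both sums agree in this model, which is the sense in which the effect of the communication, rather than the communication itself, is what has to stay consistent.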

nhaehnle added inline comments.Sep 15 2020, 6:08 AM

llvm/docs/ConvergentOperations.rst
28–30	That's a fair point. The way I'm thinking about this is that there may be communication with a `void` payload, but ultimately this can be bikeshed to death.
141–144	I can clean up the text. As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If `%condition` is uniform (in a suitably defined sense), then sinking the `@textureSample` is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved. This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of `%condition` at the `quad` scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.
248–252	That transform is allowed (assuming that sinking the user of the result of the `sum()` is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of `sum` could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :) If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use `entry` instead of `anchor`. `anchor` is still useful, as long as you have multiple things anchored to it. It will then ensure that they are relatively consistent to each other.
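
A toy model of the anchor behaviour described in the reply above, assuming a hypothetical `group_sum` that stands in for `sum()` and an explicit `partition` that stands in for whatever grouping the implementation picks at the anchor. Neither name comes from the patch. The sketch shows that the implementation may group all threads or isolate each one, and that two operations tied to the same grouping remain consistent with each other.

```python
def group_sum(values, partition):
    """Give every thread the sum of `values` over the group it belongs to."""
    result = {}
    for group in partition:
        total = sum(values[t] for t in group)
        for t in group:
            result[t] = total
    return result

def group_mean(values, partition):
    """Two sums that share one grouping, like two uses of one anchor token."""
    totals = group_sum(values, partition)
    counts = group_sum({t: 1 for t in values}, partition)
    return {t: totals[t] / counts[t] for t in values}

values = {0: 1, 1: 2, 2: 4, 3: 8}
together = [[0, 1, 2, 3]]          # one grouping the implementation may pick
isolated = [[0], [1], [2], [3]]    # another valid choice: every thread alone

sum_together = group_sum(values, together)
sum_isolated = group_sum(values, isolated)
mean_together = group_mean(values, together)
mean_isolated = group_mean(values, isolated)
```

With `isolated`, each thread's sum is just its own value, matching the remark that the result of `sum` could be replaced by the value being summed. The mean is still well defined under either grouping because both sums use the same partition.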

sameerds added inline comments.Sep 16 2020, 4:43 AM

llvm/docs/ConvergentOperations.rst
248–252	If I understand this right, then even `entry` does not capture anything specific ... it is merely a place holder for the `anchor` at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The `entry` at the start of a kernel is almost the same as an `anchor`, except the definition of threads that see the same dynamic instance is coming from the language above rather than the implementation below. The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler. Is that correct?

sameerds added inline comments.Sep 16 2020, 4:51 AM

llvm/docs/ConvergentOperations.rst
248–252	Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.

nhaehnle added inline comments.Sep 23 2020, 9:26 AM

llvm/docs/ConvergentOperations.rst
248–252	> The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.
Probably? I'm not sure I agree with the exact wording. In a compute kernel, the `entry` intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). `loop` intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on. So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can only enforce preservation of groupings. I hope that's what you meant, just with different words? :)
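
The "preserve that set of threads modulo divergent exits" wording can be made concrete with a small model. This is a sketch under assumptions: `loop_thread_sets` and the per-thread `trip_count` table are hypothetical, and the model ignores everything except which threads are still iterating.

```python
def loop_thread_sets(entry_threads, trip_count):
    """Return, per iteration, the threads that execute that iteration together."""
    active = sorted(entry_threads)
    iteration = 0
    sets = []
    while active:
        # Threads whose trip count is exhausted leave the loop (divergent exit).
        active = [t for t in active if trip_count[t] > iteration]
        if active:
            sets.append(list(active))
        iteration += 1
    return sets

trip_count = {0: 3, 1: 1, 2: 3, 3: 2}   # hypothetical per-thread trip counts
sets = loop_thread_sets({0, 1, 2, 3}, trip_count)
print(sets)

# Each iteration's set is a subset of the previous one: groups only shrink,
# the model never creates a grouping that was not there at loop entry.
shrinking = all(set(b) <= set(a) for a, b in zip(sets, sets[1:]))
```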

simoll mentioned this in D84413: [DA][SDA] SyncDependenceAnalysis re-write.Sep 28 2020, 5:20 AM

asbirlea added reviewers: tra, jlebar.Oct 7 2020, 1:28 PM

ping

nhaehnle mentioned this in D89826: [FunctionAttrs][NPM] Fix handling of convergent.Oct 22 2020, 12:04 PM

ping^2

asbirlea added a reviewer: resistor.Oct 28 2020, 2:32 PM

Herald added a subscriber: dexonsmith.Oct 28 2020, 2:32 PM

Hi. :) A few people pinged me asking for my feedback here, since I touched the convergent attr way back in the day, for CUDA.

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

I strongly agree that convergent as-is has problems. Fixing them is clearly complicated, and it seems like a lot of work has gone into this proposal.

I have been out of it for too long to feel comfortable signing off on whether this proposal fixes the problems with convergent. The proposal seems reasonable to me, but as we saw with e.g. undef/poison, these things can be extremely subtle.

I'm also not comfortable speaking to whether this representation will be ergonomic in the relevant LLVM passes.

What I'm more comfortable speaking to is:

Is the proposal clear to me?

I think the proposal is clear, modulo my few comments (relative to the length of the patch, anyway). This kind of writing is really tricky, I admire that I could mostly understand it. I thought the extensive examples were really helpful.

Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergence.loop() intrinsic at the top, functions always have convergence.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

Hope this helps.

llvm/docs/ConvergentOperations.rst
28–30	CUDA `__syncthreads()` is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information. It's just a barrier. Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.
> The way I'm thinking about this is that there may be communication with a void payload,
If you count "communicate nil" as communication, then perhaps the operation is not in fact communication but rather is "communication or synchronization"? Perhaps:
A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow.
89	Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample. I am imagining that it involves something like a `__shfl`, but that's because I already understand GPUs. Your audience is bigger than that.
221	Nit: Clarify that this example isn't using the proposed convergence intrinsics? Perhaps: "Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document."
250	Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?
313	This paragraph really clarifies for me what's going on. +1
348	...wait, there are such things as convergent functions? This is the first I'm hearing about it in the doc! So far it seemed there were only convergent calls. What's a convergent function? :)
499	Do you plan to check this in the verifier (insofar as possible, I understand that it's not possible to check this for cross-TU calls)?
507	This one is a local property -- could we say that this makes the program ill-formed, instead of UB?
511	Again, could we say this makes the program ill-formed? (At least the entry-block check, I'm not sure what a convergence region is, yet.)
595	Have we formally defined what a "controlled" convergent operation is? Do you mean a `call` to a `convergent` function with a `"convergencectrl"` operand bundle? (Say that?)
955	In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:
- Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
- Partial unrolling by a trip multiple (so there's no "tail")
- Full unrolling, which eliminates the loop
I think you're saying that only the first kind of unrolling is tricky.
982–983	It would help me if we could elaborate with half a sentence what the behavior change might be.
988–989	Do you mean that this kind of unrolling is forbidden? But if you're going to forbid all unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower. Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?
999	One thing I don't get from this example is what I should do as a frontend to LLVM. That is, when should I do this form, and when should I put a new anchor inside a loop? It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor. That is, I have to behave as though the compiler could transform

  anchor()
  foo();

into, effectively

  if (threadIdx.x % 2 == 0) {
    anchor()
    convergent_fn();
  } else {
    anchor();
    convergent_fn();
  }

Something like this? OK, so I always have to use the convergence.loop() form. But then this is saying I can never unroll. ITYM that with convergence.loop(), I can never partially unroll with a "tail", which makes a lot of sense? But would help me if we were explicit about that.
1033	`counter > 1`?
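
To illustrate why partial unrolling with a "tail" is the tricky case raised in the comment on line 955, here is a toy model. It assumes a hypothetical loop whose body contains one convergent operation, and hypothetical per-thread trip counts; `static_site` and `sites_per_iteration` are made-up helper names. The model only records which static copy of the body executes each original iteration for each thread.

```python
def static_site(i, n, factor):
    """Static copy of the body that runs original iteration i for a thread
    with n trips, after unrolling by `factor` with a remainder ("tail") loop."""
    main_trips = (n // factor) * factor
    if i < main_trips:
        return f"unrolled[{i % factor}]"
    return "tail"

def sites_per_iteration(trip_counts, factor):
    """Map each original iteration to {thread: static copy executing it}."""
    result = {}
    for i in range(max(trip_counts.values())):
        result[i] = {t: static_site(i, n, factor)
                     for t, n in trip_counts.items() if i < n}
    return result

# No known trip multiple: thread A runs 3 iterations, thread B runs 4.
with_tail = sites_per_iteration({"A": 3, "B": 4}, 2)

# Trip counts that are multiples of the unroll factor: the tail never runs.
by_multiple = sites_per_iteration({"A": 2, "B": 4}, 2)
```

In `with_tail`, original iteration 2 is executed by A in the tail and by B in the first unrolled copy. These are different static instructions, so a convergent operation that both threads executed together in the original loop would no longer be executed together. In `by_multiple`, no thread ever reaches the tail, so that mismatch does not arise in this model.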

arsenm mentioned this in D90361: Prevent LICM and machineLICM from hoisting convergent operations.Oct 29 2020, 8:04 AM

Address the comments from @jlebar that I indicated I'd address,
except for changes affecting the Verifier -- I'll do those later.

Harbormaster completed remote builds in B77017: Diff 301840.Oct 30 2020, 2:41 AM

In D85603#2361168, @jlebar wrote:

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

Thanks, and don't worry. A lot of the old comments don't make sense anymore because the document was changed and Phabricator shows them in nonsensical places unfortunately.

[snip]

Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergence.loop() intrinsic at the top, functions always have convergence.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding of what this should look like (I'm going to caveat this with saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

I have trouble answering this as well due to the lack of proper specification from Nvidia, and I'm not set up to run this kind of experiment.

From a theory point of view, because those newer versions of sync operations take that explicit arg, we shouldn't consider them to be convergent according to what's being defined here. Only the __activemask() builtin probably still needs to be considered convergent (also in light of https://bugs.llvm.org/show_bug.cgi?id=47210).

The result of your experiment seems to contradict the theory. Having worked on this part of our compiler for a while now, I think it's entirely possible that the result of your experiment is simply a bug somewhere along the compiler stack, but of course I can't say for certain. If it's not supposed to be a bug, then to me this means there's something subtle missing in the way the new sync operations are described. Either way, some clarification would be good.

llvm/docs/ConvergentOperations.rst
28–30	Your suggestion looks good to me, going to apply it.
89	`textureSample` is actually a built-in function of graphics languages. I'm going to add a clause to try to clarify that. I assume all GPUs have dedicated circuitry for it. I specifically wanted to mention `textureSample` in the document at least once because it (and some close analogs) are often forgotten in discussions of convergent even by graphics people like myself. Obviously the document should also be accessible to folks from the GPU compute world, which is why I tried to give a succinct explanation of the relevant facts about `textureSample` in the paragraph above. Later in the document there are also examples using shuffles, though with the Khronos-y spelling of `subgroupShuffle` instead of the CUDA-y `__shfl`. The choice of spelling is partly because that's just the world I'm personally working in most of my time, but also partly because I'd prefer using terms from common industry standards. I understand that CUDA is a bit of a de facto "standard", so if you think it's necessary to convert at least one example to CUDA spelling, we can do that -- just not this one here in particular, because it's specifically meant to be a graphics shader example.
221	Thanks, going to make this change.
250	Added ellipsis.
348	Uhh... technically true. How about adding something like the following somewhere: In LLVM IR, function calls are the only instructions that can involve convergent operations. A call itself (i.e., the act of jumping to the callee, setting up a stack frame, etc.) is not a convergent operation. However, if the callee uses the `llvm.experimental.convergence.entry` intrinsic, then we think of the entire execution of the callee as a convergent operation from the perspective of the calling function. Such callees must be marked with the `convergent` attribute, and for brevity we say that they are "convergent functions". If the callee isn't known at the call site (i.e., an indirect function call), then the `call` instruction itself must have the `convergent` attribute. The only reason why a function F would need to use the `llvm.experimental.convergence.entry` intrinsic is if F in turn uses some other convergent operation, i.e., a call to a convergent function. Chains of such calls are expected to eventually end with the use of a (target-specific) intrinsic that is `convergent`.
499	Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...
507	Yes, that's a good idea.
511	The entry-block check should be straightforward.
595	Yes, the section "Dynamic Instances and Convergence Tokens" already says this:
> The convergence control intrinsics described in this document and convergent operations that have a `convergencectrl` operand bundle are considered controlled convergent operations.
I'm going to add an anchor there since the doc is pretty long :)
955	Yes, that's correct, and I'm going to add essentially your three bullets at the top.
982–983	I gave it a try. It ended up being a full sentence though ;)
988–989	Yes, this kind of unrolling. This is already forbidden for uncontrolled convergent operations today. If you want to dig a little deeper, I would appreciate if you could also add your review to D85605. That's a follow-up change for (1) correctness of loop unrolling with regards to the `loop` intrinsics rules and (2) relaxing some of the constraints that exist today where possible when all convergent ops are controlled (by an anchor in the loop).
999	> ITYM that with convergence.loop(), I can never partially unroll with a "tail", which makes a lot of sense?
Yes, that's correct. Hopefully clearer with the addition at the top of the section.
> It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.
Right. The anchor essentially allows you to achieve the same thing as `__activemask` in CUDA, but in a more structured way that doesn't run into problems when you have two sides of an if/else both executing a sync operation with the same thread mask.
1033	Thanks, changing to `counter >= 2` because that's what I had in a similar example above.

sstefan1 added a subscriber: sstefan1.Oct 31 2020, 6:51 AM

Man, phab doesn't make this easy, does it?

One tip, shift+A hides all inline comments, making the patch easier to read. One problem, though: It hides all the comments! :)

Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...

I see. You may not be able to check this. My preference, which it seems like you share, is just, inasmuch as we _can_ mark something as ill-formed and check for it, that seems preferable to UB.

How about adding something like the following somewhere:

In LLVM IR, function calls are the only instructions that can involve convergent operations. A call itself (i.e., the act of jumping to the callee, setting up a stack frame, etc.) is not a convergent operation. [...]

Yes, that is very clear to me.

Thank you for making those changes.

I am satisfied that this can be implemented in a frontend (and anyway, you have the patch). I've pinged some folks at nvidia asking for them to have a look wrt sm70, and I actually already got a reply, so I am hopeful we might hear from them. I don't want to keep you in limbo indefinitely, so I've asked if they might be able to provide a timeline.

Stay tuned, I guess.

In D85603#2361168, @jlebar wrote:

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I believe what is described here about convergent, as best I can understand it, is the semantics of syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical syncthreads().

I believe with sm70 the re-convergence has different semantics, due to the fact that we have a forward progress guarantee in a warp. In pre-sm70 the following could deadlock:

volatile int flag = 0;

if (cond) { // thread dependent conditional
  while (flag == 0) ; // spin-lock
} else
  flag++;
// re-convergence point

Now it works as expected.

The following also works (doesn't deadlock):

volatile int flag = 0;

if (cond) { // thread dependent conditional
  while (flag == 0) ; // spin-lock
}
// re-convergence point
flag++;
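
The first spin-lock example above can be mimicked with a toy scheduler. This is a sketch under strong assumptions and not a description of real hardware: `run_serialized` models a scheduler that finishes one divergent path before starting the other, `run_interleaved` models one that gives every path a turn, and a step `budget` stands in for "never terminates". All names are hypothetical.

```python
def spinner(shared):
    """The `cond` path: spin until another thread sets the flag."""
    while shared["flag"] == 0:
        yield  # one spin iteration

def setter(shared):
    """The `else` path: set the flag once."""
    shared["flag"] += 1
    yield

def run_serialized(order, budget=1000):
    """Run each divergent path to completion before starting the next."""
    shared = {"flag": 0}
    steps = 0
    for make in order:
        for _ in make(shared):
            steps += 1
            if steps >= budget:
                return "no progress"
    return "done"

def run_interleaved(order, budget=1000):
    """Round-robin between paths, so every path eventually gets a turn."""
    shared = {"flag": 0}
    running = [make(shared) for make in order]
    steps = 0
    while running:
        for g in list(running):
            try:
                next(g)
            except StopIteration:
                running.remove(g)
            steps += 1
            if steps >= budget:
                return "no progress"
    return "done"

print(run_serialized([spinner, setter]))
print(run_interleaved([spinner, setter]))
```

In this model the serialized scheduler only finishes if it happens to run the setter path first, while the interleaved scheduler finishes regardless of order.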

In D85603#2370240, @vgrover99 wrote:

I believe what is described here about convergent, as best I can understand it, is the semantics of syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical syncthreads().

Textual alignment is a good way to examine this spec with respect to CUDA. The notion of dynamic instances is textually aligned according to basic rule 2:

Executions of different instructions always occur in different dynamic instances. For this and other rules in this document, instructions of the same type at different points in the program are considered to be different instructions.

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

In general, the formal rules work correctly for this too: hoisting and sinking are disallowed without additional information. So the proposal is compatible with the new CUDA semantics for these builtins. These builtins do need convergence control: sinking such a call across a condition should be forbidden by default, since we can no longer guarantee that every thread in the mask still makes a matching call. Of course, specific optimizations that can recompute the mask can override this restriction.

foo = __shfl_sync(mask, ...);
if (condition) {
  // cannot sink foo here if condition is divergent
  sole_use_of_foo();
}

In D85603#2370634, @sameerds wrote:

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

Yes, in CUDA and in PTX __syncwarp is an unaligned primitive.

In D85603#2361168, @jlebar wrote:
Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the section on hoisting and sinking. Consider the following example copied from the spec:

define void @example(...) convergent {
  %entry = call token @llvm.experimental.convergence.entry()
  %data = ...
  %id = ...
  if (condition) {
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
    ...
  }
}

Here, hoisting subgroupShuffle() is generally disallowed because it depends on the identity of active threads. A CUDA builtin with a mask argument similarly identifies specific threads that must be active at the set of textually unaligned calls that synchronize with each other. So any change in the control flow surrounding those calls is generally disallowed without more information. The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

PiotrFusik removed a subscriber: PiotrFusik.Nov 6 2020, 2:07 AM

In D85603#2378229, @sameerds wrote:

[snip]

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the section on hoisting and sinking. Consider the following example copied from the spec:
define void @example(...) convergent {
  %entry = call token @llvm.experimental.convergence.entry()
  %data = ...
  %id = ...
  if (condition) {
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
    ...
  }
}
Here, hoisting subgroupShuffle() is generally disallowed because it depends on the identity of active threads. A CUDA builtin with a mask argument similarly identifies specific threads that must be active at the set of textually unaligned calls that synchronize with each other. So any change in the control flow surrounding those calls is generally disallowed without more information. The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

Yes, that makes sense to me, although it also makes sense to reflect on this a bit more.

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

What's interesting is that the latter (__shfl_sync etc.) have similar constraints on how they can and can't be moved, but for different reasons, and the constraints are different. For example:

mask = ...;
if (blah) {
  y = __shfl_sync(a, b, mask);
  ...
} else {
  y = __shfl_sync(a, b, mask);
  ...
}

The __shfl_sync has an explicit thread mask and can be hoisted. However, a subgroupShuffle with implicit thread mask cannot be hoisted. So here, __shfl_sync allows more freedom.
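A toy model may make the hoisting difference concrete. The sketch below assumes a 4-lane warp, a shuffle in which lane i reads from lane src[i], and an implicit shuffle that yields an undefined value (None here) when the source lane is not part of the same dynamic instance. The helpers implicit_shuffle and explicit_shuffle are hypothetical names, not part of any real API, and actual sm70+ behaviour may well differ from this simplification.

```python
# Toy model only: lanes are integers, "undefined" is modelled as None.

def implicit_shuffle(data, src, active):
    """Implicit thread set: only lanes in the same dynamic instance communicate."""
    return {lane: data[src[lane]] if src[lane] in active else None
            for lane in sorted(active)}


def explicit_shuffle(data, src, mask, executing):
    """Explicit thread set: every lane in `mask` participates, whichever call site it runs."""
    return {lane: data[src[lane]] if src[lane] in mask else None
            for lane in sorted(executing & mask)}


data = [10, 11, 12, 13]            # per-lane value of `a`
src = [1, 0, 3, 2]                 # lane i reads from lane src[i]
blah = [True, False, True, False]  # divergent branch condition

all_lanes = set(range(4))
then_lanes = {lane for lane in all_lanes if blah[lane]}
else_lanes = all_lanes - then_lanes

# Shuffle left inside the two branches.
branch_implicit = {**implicit_shuffle(data, src, then_lanes),
                   **implicit_shuffle(data, src, else_lanes)}
branch_explicit = {**explicit_shuffle(data, src, all_lanes, then_lanes),
                   **explicit_shuffle(data, src, all_lanes, else_lanes)}

# Shuffle hoisted above the branch, so all lanes execute a single call.
hoisted_implicit = implicit_shuffle(data, src, all_lanes)
hoisted_explicit = explicit_shuffle(data, src, all_lanes, all_lanes)

print(branch_implicit == hoisted_implicit)  # does hoisting preserve the implicit result?
print(branch_explicit == hoisted_explicit)  # does hoisting preserve the explicit result?
```

In this model the implicit variant changes its result when hoisted because the set of communicating lanes grows, while the explicit variant is unaffected because the mask, not the control flow, fixes the participants.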

Conversely:

__shfl_sync(a, b, mask); // result unused

This cannot be dead-code eliminated, because it might be communicating with threads executing a different part of the program. By contrast, a subgroupShuffle with implicit thread mask whose result is unused can be dead-code-eliminated. So here, subgroupShuffle allows more freedom.

By similar logic, subgroup ops with implicit thread mask in the same basic blocks can be re-ordered wrt each other, but this is not true for explicit thread mask (not-textually-aligned) subgroup ops.

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

In D85603#2379258, @nhaehnle wrote:

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

That's an excellent way to "lift" sm70+ operations into the semantics being handled by this proposal. The bottom line seems to be that the proposed formalism achieves the following:

- Dynamic instances capture the thread sets that are implicitly determined by the control flow graph.
  1. This covers both kinds of operations, with and without explicit thread sets as arguments.
  2. No assumptions are made about thread grouping in the underlying hardware.
- There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.
- Generic optimizations are safe as long as they preserve the mapping of threads to dynamic instances (basic rule 4).
  1. The mapping is usually altered by changes to surrounding control flow, and hence such changes are forbidden in general.
  2. This does not preclude more informed optimizations that are aware of their impact on the set of threads at a dynamic instance.
  3. Optimizations can also benefit from attributes that indicate how this set of threads is allowed to change.
- None of the above has been established beyond all doubt, but the current understanding is sufficient to justify the "experimental" tag.

Is that a reasonable summary at this point?

The bottom line seems to be that the proposed formalism achieves the following: <snip>

I agree, fwiw.

What do you need at this point to move forward?

Few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require convergencectrl and __shfl_sync would not, correct?

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straight-forward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)
Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function
Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block
For each call to a convergent function (intrinsic or other-wise), attach convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?
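The four rules above can be sketched on a tiny structured program. This is only an illustration under stated assumptions: the Call, Loop and Function classes and insert_convergence_control are hypothetical, the output is pseudo-IR text rather than valid LLVM IR, and unstructured control flow (goto, irreducible loops) is deliberately out of scope.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Call:
    callee: str
    convergent: bool = False


@dataclass
class Loop:
    body: list


@dataclass
class Function:
    name: str
    convergent: bool
    body: list


def insert_convergence_control(fn):
    """Emit pseudo-IR lines with tokens attached according to nesting."""
    ids = count()
    lines = []
    # Rules 1 and 2: entry for convergent functions, anchor otherwise.
    kind = "entry" if fn.convergent else "anchor"
    top = f"%tok{next(ids)}"
    lines.append(f"{top} = call token @llvm.experimental.convergence.{kind}()")

    def walk(stmts, token, depth):
        pad = "  " * depth
        for stmt in stmts:
            if isinstance(stmt, Loop):
                # Rule 3: one loop intrinsic per loop header, tied to the parent token.
                loop_tok = f"%tok{next(ids)}"
                lines.append(f"{pad}loop:")
                lines.append(
                    f"{pad}  {loop_tok} = call token "
                    f"@llvm.experimental.convergence.loop() "
                    f'[ "convergencectrl"(token {token}) ]'
                )
                walk(stmt.body, loop_tok, depth + 1)
            elif stmt.convergent:
                # Rule 4: convergent calls use the closest enclosing token.
                lines.append(
                    f'{pad}call @{stmt.callee}() [ "convergencectrl"(token {token}) ]'
                )
            else:
                lines.append(f"{pad}call @{stmt.callee}()")

    walk(fn.body, top, 0)
    return lines


kernel = Function(
    "kernel",
    convergent=True,
    body=[Call("subgroupOp", True), Loop([Call("subgroupOp", True), Call("helper")])],
)
lines = insert_convergence_control(kernel)
print("\n".join(lines))
```

The call outside the loop is tied to the function-level token, and the call inside the loop is tied to the loop token, which is what "closest in terms of nesting" means in this sketch.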

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to? The llvm.experimental.convergence.loop calls do not dominate the subgroupOp calls, so the llvm.experimental.convergence.anchor call is the only choice. But this breaks the first static rule of cycles in the proposal: we have a use of a convergence token inside a cycle but the token is not defined inside the loop. Do we just add a call to llvm.experimental.convergence.anchor to both tail1 and tail2, using them as the convergence token for the respective subgroupOp?

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications through-out the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

critson added a subscriber: critson.Jan 12 2021, 9:35 PM

In D85603#2422804, @jholewinski wrote:

Few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require convergencectrl and __shfl_sync would not, correct?

The answer seems to depend on whether it is correct to say the following about __shfl_sync

The only constraint on control flow transformations around __shfl_sync is identical to the current definition of the convergent intrinsic.
Other optimizations such as DCE and reordering, that do not involve changes in the control flow, should be modelled using other constructs like inaccessiblememonly

If this is true, then yes, __shfl_sync doesn't need the new convergence control bundles.

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straight-forward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)

Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function

Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block

For each call to a convergent function (intrinsic or other-wise), attach convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Right. These heuristics are actually proposed in a related patch under review:
https://reviews.llvm.org/D85609

In particular, the above pass is expected to do the right thing when working with cross-thread operations that are "textually aligned" (for example, SPIRV, OpenCL, HIP, and CUDA before sm70).

Things are less clear when you consider odd looping structures, for example:
[snip]
Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to?

I believe the question needs to be turned around: what do you want subgroupOp to anchor to? In general, it should be impossible to infer the correct operand bundles from arbitrary LLVM IR, else we could just express everything as an analysis instead of having these explicit markers. The intention of these markers is for a frontend to be able to produce constraints that cannot be expressed using just the structure of the program.

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications through-out the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

These intrinsics are expected to be very useful for new techniques being worked out in the AMDGPU backend. The following reviews start off the changes required in the optimizer:
https://reviews.llvm.org/D85604
https://reviews.llvm.org/D85605
https://reviews.llvm.org/D85606

Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding of what this should look like (I'm going to caveat this with saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

In D85603#2676631, @Anastasia wrote:

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow more general "non-uniform" versions that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.)

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:
+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I believe this is meant to say that the formalism does not forbid putting the loop intrinsic in a non-header block, but that is not expected in most known cases. It is not an optimization choice that every flow must make.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

In D85603#2678742, @sameerds wrote:

In D85603#2676631, @Anastasia wrote:

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow a more general "non-uniform" version that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

We do have new functionality in OpenCL that requires supporting convergent operations in non-uniform CF too:
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#_extended_subgroup_functions
https://llvm.org/PR46199

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.

Sorry for not being clear - I was talking about two separate threads here: (1) generalizing the convergent attribute to non-uniform CF, which is addressed by this patch, and (2) inverting the convergent attribute, which is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have logic that adds convergent to every single function because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.
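The "mark everything convergent, then let the optimiser undo it" scheme described above can be sketched as a fixed-point computation over a call graph. This is a minimal illustration, not the actual clang or LLVM implementation: the function functions_that_stay_convergent, the call graph, and the treatment of unknown external callees as conservatively convergent are all assumptions made for the example.

```python
def functions_that_stay_convergent(call_graph, convergent_ops):
    """Return the defined functions that cannot drop `convergent`.

    call_graph maps each defined function to the set of its callees.
    A callee that is neither defined nor a known convergent operation is an
    unknown external and is conservatively treated as convergent.
    """
    must_stay = set()
    changed = True
    while changed:  # iterate until no new function is added
        changed = False
        for fn, callees in call_graph.items():
            if fn in must_stay:
                continue
            for callee in callees:
                unknown_external = (callee not in call_graph
                                    and callee not in convergent_ops)
                if callee in convergent_ops or callee in must_stay or unknown_external:
                    must_stay.add(fn)
                    changed = True
                    break
    return must_stay


# Hypothetical program: the frontend initially marks all four functions convergent.
call_graph = {
    "kernel": {"reduce", "scale"},
    "reduce": {"subgroupShuffle"},
    "scale": {"square"},
    "square": set(),
}
stay = functions_that_stay_convergent(call_graph, {"subgroupShuffle"})
droppable = set(call_graph) - stay
print(sorted(stay), sorted(droppable))
```

Here only the functions that transitively reach the convergent operation keep the attribute; inverting the default, as D69498 proposes, would let the frontend skip the initial blanket annotation altogether.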

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:
+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.
I believe this is meant to say that the formalism does not forbid putting the loop intrinsic in a non-header block, but that is not expected in most known cases. It is not an optimization choice that every flow must make.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

Thanks, I was referring to this review indeed. Perhaps it is easier if I spawn a separate discussion there to see whether and how we can apply the same logic to the frontend and also how it combines with the conservative approach of generating convergent attribute everywhere that we have right now.

In D85603#2679362, @Anastasia wrote:

Sorry for not being clear - I was talking about two separate threads here (1) generalizing convergent attribute to non-uniform CF that is addressed by this patch and (2) inverting convergent attribute that is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have a logic that adds convergent to every single function because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.

I see now. Thanks! Besides goal (1), the other goal for this new formalism is to clarify the meaning of "convergence" in a way that allows more freedom to the optimizer. Language specs typically define convergence with operational semantics, such as:

SPIRV: "different invocations of an entry point execute the same dynamic instances of an instruction when they follow the same control-flow path"
OpenCL: "all work-items in the work-group must enter the conditional if any work-item in the work-group enters the conditional statement"

The proposed formalism lifts this into a declarative semantics which is easier for the compiler to reason with. This allows optimizations like jump threading, where the transformed program has ambiguous operational semantics (see the example in the actual spec). The presence of convergence control tokens makes sure that the "point of convergence" is well-defined even if the transformed control flow is ambiguous.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

To be honest, I was not aware of this other effort, and even after you pointed it out, I wasn't paying attention to the words that I was reading. It seems like the current spec has so far focussed on demonstrating the soundness of the formalism. But I think it is possible to cover (2), which is to make the default setting conservative. This will need a bit of a rewording. In particular, this definition from the spec:

The convergence control intrinsics described in this document and convergent
operations that have a ``convergencectrl`` operand bundle are considered
*controlled* convergent operations.

Other convergent operations are *uncontrolled*.

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.
A more informed frontend or a suitable transformation can replace this conservative token with one of the following:
1. A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.
2. A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.
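As a rough illustration of the inverted default, the sketch below treats a missing bundle as an implicit reference to the caller's entry token. Everything here is hypothetical: calls are modelled as plain dictionaries, and "none" stands in for the placeholder constant token suggested above, which does not exist in the proposal as written.

```python
# Hypothetical names throughout; this models the proposed rewording only.
ENTRY_TOKEN = "%entry"  # token from @llvm.experimental.convergence.entry in the caller
NONE_TOKEN = "none"     # placeholder constant meaning "no convergence constraint"


def effective_token(call):
    """Token that controls a call once the default is made conservative."""
    return call.get("convergencectrl", ENTRY_TOKEN)


def is_unconstrained(call):
    """True only if the call was explicitly freed from convergence constraints."""
    return effective_token(call) == NONE_TOKEN


calls = [
    {"callee": "unknown_helper"},                          # no bundle: conservative default
    {"callee": "subgroupOp", "convergencectrl": "%loop"},  # refined by a frontend or pass
    {"callee": "sqrtf", "convergencectrl": NONE_TOKEN},    # known not to be convergent
]
tokens = [effective_token(c) for c in calls]
print(tokens)
```

Under this reading, adding a bundle can only relax or refine the default, which mirrors the direction of the noconvergent attribute in D69498.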

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of the llvm.experimental.convergence.anchor, this intrinsic cannot be inferred by the frontend. It represents a new ability to represent optimization opportunities like the one demonstrated in the "opportunistic convergence" example. The intrinsic says that the call that uses this token doesn't depend on any specific set of threads, but merely marks the threads that do reach it. This is most useful when multiple calls agree on the same set of threads. Identifying such sets of operations will need help from the user (or more realistically, a library writer). Something like the following might work, where the actual value of group doesn't really matter beyond relating the various calls to each other.

auto group = non_uniform_group_active_workitems();
op1(group);
if (C)
   op2(group);
op3(group);

Hi @jholewinski, sorry for missing your comment earlier. It's been a while! I still need to work through the rest of the comments here, but there's a pretty crucial point here that seems to have been missed:

In D85603#2422804, @jholewinski wrote:

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Regardless of the question about subgroupOp, this example is not valid IR: it breaks the static rule that "Every cycle in the CFG that contains two different uses of a convergence token T must also contain the definition of T." Specifically, there are two uses of entry_token, in head1 and head2, and a cycle head1 -> tail1 -> head2 -> head2b -> tail1 -> head1 that goes through both of them without going through the definition of entry_token.

Roughly speaking, an irreducible loop can contain at most one loop intrinsic that refers to a token from outside the irreducible loop.
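The static rule quoted above can be checked mechanically on the example CFG. The sketch below is a simplified checker under stated assumptions: it works at basic-block granularity, it treats a "cycle" as any closed walk (as the path given above does, since it visits tail1 twice), and the names reachable and violates_cycle_rule are hypothetical rather than taken from the Verifier changes in this patch.

```python
def reachable(cfg, start, removed):
    """Blocks reachable from `start` via one or more edges, never entering `removed`."""
    seen = set()
    stack = [s for s in cfg.get(start, ()) if s != removed]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(s for s in cfg.get(block, ()) if s != removed)
    return seen


def violates_cycle_rule(cfg, def_block, use_blocks):
    """True if a closed walk visits two uses of a token without visiting its definition.

    `use_blocks` lists the block of each use; two uses in one block appear twice.
    """
    uses = [u for u in use_blocks if u != def_block]
    for i, first in enumerate(uses):
        from_first = reachable(cfg, first, def_block)
        for second in uses[i + 1:]:
            if second in from_first and first in reachable(cfg, second, def_block):
                return True
    return False


# CFG of the example under discussion.
cfg = {
    "entry": ["head1", "head2"],
    "head1": ["tail1", "tail2"],
    "head2": ["exit", "head2b"],
    "head2b": ["tail2", "tail1"],
    "tail1": ["head1", "head2"],
    "tail2": ["head2", "head1"],
    "exit": [],
}
# entry_token is defined in entry and used by the loop intrinsics in head1 and head2.
invalid = violates_cycle_rule(cfg, "entry", ["head1", "head2"])

# A single natural loop whose header defines a token used twice in the body.
simple_loop = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
valid = not violates_cycle_rule(simple_loop, "head", ["body", "body"])
print(invalid, valid)
```

The example is flagged because head1 and head2 can reach each other without passing through entry, whereas in the natural loop every path from the body back to itself goes through the defining header.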

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This is a good point. Generally, HLLs need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

To be honest, I was not aware of this other effort, and even after you pointed it out, I wasn't paying attention to the words that I was reading. It seems like the current spec has so far focussed on demonstrating the soundness of the formalism. But I think it is possible to cover (2), which is to make the default setting conservative. This will need a bit of a rewording. In particular, this definition from the spec:
The convergence control intrinsics described in this document and convergent
operations that have a ``convergencectrl`` operand bundle are considered
*controlled* convergent operations.

Other convergent operations are *uncontrolled*.
This needs to be inverted in the spirit of D69498. I would propose the following tweak:

By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.

A more informed frontend or a suitable transformation can replace this conservative token with one of the following:

A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.

A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Sounds good. If that would be acceptable to the wider community it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively but in the Compute scenarios we occasionally have very large functions that when inlined result in huge binaries and longer compilation time. And we also have extern functions too that we have no information of during the compilation. So this doesn't seem like a route we can safely take at least not for all languages.

If we invert the convergent logic then we can add a noconvergent attribute or even a pragma directive for application developers to indicate which code doesn't contain cross-thread operations and can be optimized more aggressively.

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of the llvm.experimental.convergence.anchor, this intrinsic cannot be inferred by the frontend. It represents a new ability to represent optimization opportunities like the one demonstrated in the "opportunistic convergence" example. The intrinsic says that the call that uses this token doesn't depend on any specific set of threads, but merely marks the threads that do reach it. This is most useful when multiple calls agree on the same set of threads. Identifying such sets of operations will need help from the user (or more realistically, a library writer). Something like the following might work, where the actual value of group doesn't really matter beyond relating the various calls to each other.
auto group = non_uniform_group_active_workitems();
op1(group);
if (C)
   op2(group);
op3(group);

Ok, this makes sense. Thanks for clarifications.

In D85603#2685900, @nhaehnle wrote:

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This is a good point. Generally, HLL need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

I see. Technically this sounds feasible to add, i.e. we could insert a custom AST visitor to detect the pattern after the AST is parsed, or perhaps the detection can be done during the parsing itself. The only question is how many such patterns exist considering the variety of HL language constructs, and how this will impact the parsing time, etc. I would say prototyping this could be a good starting point. However, what happens when such patterns are detected? Do we generate IR slightly differently?

In D85603#2691257, @Anastasia wrote:

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.

A more informed frontend or a suitable transformation can replace this conservative token with one of the following:

A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.

A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Sounds good. If that would be acceptable to the wider community it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

From what I understand, there was a fair bit of agreement in D69498 about the need to make the default safer. The real question is should we fold that idea into this proposal?

There's one mistake in what I outlined above. The first point about the default token is expressly forbidden by the static rule on cycles: if an intrinsic other than .loop inside a cycle uses a token, then the definition must also be in the same cycle. But I think this can be fixed by simply saying that the dynamic instance of a call without an explicit operand bundle is "undetermined". Any optimization must then back off, the whole point being that an optimization is safe if it preserves dynamic instances, and it is impossible to preserve an undetermined dynamic instance.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively but in the Compute scenarios we occasionally have very large functions that when inlined result in huge binaries and longer compilation time. And we also have extern functions too that we have no information of during the compilation. So this doesn't seem like a route we can safely take at least not for all languages.

Or in other words, the "proper" definition must cover the whole of LLVM IR, and not introduce any assumptions like the absence of function calls.

If we invert the convergent logic then we can add a noconvergent attribute or even a pragma directive for application developers to indicate which code doesn't contain cross-thread operations and can be optimized more aggressively.

Dynamic instances provide complete information at the callsite. But having an attribute on a function declaration (especially extern) is useful because it removes the need to analyse the function body itself.

arsenm mentioned this in D69498: IR: Invert convergent attribute handling. Apr 22 2021, 6:52 PM

JonChesterfield added a subscriber: JonChesterfield. Apr 23 2021, 8:30 AM

sameerds added a child revision: D104504: RFC: Update token semantics with default convergent attribute. Jun 17 2021, 7:47 PM

sameerds mentioned this in D106859: [Sink] allow sinking convergent operations across uniform branches. Aug 1 2021, 4:14 AM

kpet added a subscriber: kpet. Nov 5 2021, 4:07 AM

simoll mentioned this in D114146: [DA][NFC] Update publication - add remarks. Nov 19 2021, 2:11 AM

bader added a subscriber: bader. Jan 17 2022, 12:58 AM

sameerds mentioned this in D130746: RFC: Uniformity Analysis for Irreducible Control Flow. Nov 20 2022, 4:56 PM

Peter9606 added a subscriber: Peter9606. Dec 30 2022, 1:09 AM

Herald added a project: Restricted Project. Dec 30 2022, 1:09 AM

sameerds mentioned this in D147116: [RFC] Introduce convergence control intrinsics. Mar 28 2023, 11:58 PM

tianshilei1992 added a subscriber: tianshilei1992. Jun 20 2023, 9:00 AM

sameerds mentioned this in rGda61c865e734: [RFC] Introduce convergence control intrinsics. Jul 12 2023, 12:02 AM

Superseded by D147116

sameerds abandoned this revision. Aug 21 2023, 11:48 PM

Revision Contents

Path                                               Size
llvm/docs/ConvergentOperations.rst                 1324 lines
llvm/docs/LangRef.rst                              55 lines
llvm/docs/Reference.rst                            4 lines
llvm/include/llvm/IR/Intrinsics.td                 9 lines
llvm/include/llvm/IR/LLVMContext.h                 13 lines
llvm/lib/IR/LLVMContext.cpp                        5 lines
llvm/lib/IR/Verifier.cpp                           123 lines
llvm/test/Bitcode/operand-bundles-bc-analyzer.ll   1 line
llvm/test/Verifier/convergencectrl-invalid.ll      81 lines
llvm/test/Verifier/convergencectrl-valid.ll        119 lines

Diff 301840

llvm/docs/ConvergentOperations.rst

This file was added.

				==============================
				Convergent Operation Semantics
				==============================

				.. contents::
				   :local:
				   :depth: 4

				Overview
				========

				Some parallel execution environments execute threads in groups that allow
				efficient communication within each group. Notably this is the case for
				whole-program vectorization environments such as GPUs, where threads are mapped
				to lanes of a SIMD vector. Efficient communication among threads is possible in
				this case by simply exchanging data between the lanes of a vector. However, the
				semantics defined in this document are independent of such implementation
				details.

				When control flow diverges, i.e. threads of the same group follow different
				paths through the CFG, not all threads of the group may be available to
				participate in this communication. This is the defining characteristic that
				distinguishes convergent operations from other inter-thread communication and
				which requires the use of the ``convergent`` attribute to indicate
				additional constraints on program transforms:

				A convergent operation involves inter-thread communication or synchronization
				that occurs outside of the memory model, where the set of threads which
				participate in communication is implicitly affected by control flow.

				t-tye: What would be an example where control flow affects without implicitly defining the set of threads?
				nhaehnle: Control flow alone is not enough to define the set of threads, because the initial set of threads is always defined in an environment-specific way, e.g. by how a kernel launch groups threads into waves and workgroups. I'm going to remove the "implicitly defined" part in the hope that that avoids confusion.
				dsanders: This is rather nit-picky but there are some convergent operations where inter-thread communication isn't happening depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses) which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening but the second and third models are stretching the definition a bit. I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.
				nhaehnle: That's a fair point. The way I'm thinking about this is that there may be communication with a `void` payload, but ultimately this can be bikeshed to death.
				jlebar: CUDA `__syncthreads()` is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information. It's just a barrier. Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.
				> The way I'm thinking about this is that there may be communication with a void payload,
				If you count "communicate nil" as communication, then perhaps the operation is not in fact communication but rather is "communication or synchronization"? Perhaps: "A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow."
				nhaehnle: Your suggestion looks good to me, going to apply it.
				For example, in the following GPU compute kernel, communication during the
				convergent operation is expected to occur precisely among those threads of an
				implementation-defined execution scope (such as workgroup or subgroup) for
				which ``condition`` is true:

				.. code-block:: c++

				  void example_kernel() {
				    ...
				    if (condition)
				      convergent_operation();
				    ...
				  }

				In structured programming languages, there is often an intuitive and
				unambiguous way of determining the threads that are expected to communicate.
				However, this is not always the case even in structured programming languages,
				and the intuition breaks down entirely in unstructured control flow. This
				document describes the formal semantics in LLVM, i.e. how to determine the set
				of communicating threads for convergent operations.

				The definitions in this document leave many details open, such as how groups of
				threads are formed in the first place. It focuses on the questions that are
				relevant for deciding the correctness of generic program transforms and
				convergence-related analyses such as divergence analysis.

				sameerds (Author): I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:
				1. In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
				2. In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
				3. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
				This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
				nhaehnle:
				> 3. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
				> This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
				What ambiguities do you have in mind? If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
				sameerds (Author): Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but general acyclic control flow needs special attention in specific cases, starting with jump threading.
				nhaehnle: I hesitate to write anything like that, because then you get into the problem of defining what "structured" means -- there are multiple definitions in the literature. My argument would be that purely acyclic control flow -- whether structured or not -- contains sufficient information about convergence to define semantics consistently, without assistance, and avoiding spooky action at a distance. That you still need some assistance to make actual guarantees is really down to composability. For example, you can have a fully acyclic function called from inside a cycle, and then what happens at inlining? One can explore an alternative scheme where you don't have to insert anything into the acyclic function in this case and it's the job of the inlining transform to fix things up, and I have done some exploring in this direction. There are at least two downsides:
				1. The burden on generic program transforms becomes larger.
				2. There is no longer any way for the programmer to express the distinction between functions (or sub-sections of code) that care about the set of threads with which they're executed vs. those that don't (like the `@reserveSpaceInBuffer` example I added), and that closes the door on certain performance optimizations and becomes problematic if you want to start thinking about independent forward progress.
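
The evaluation nhaehnle describes for a fully acyclic function can be made concrete with a small host-side simulation. This is a sketch under stated assumptions: blocks are already listed in topological order with block 0 as the entry, and each thread's branch choice is given by a callback. `Block`, `communicationSets`, `diamondExample` and the thread numbering are hypothetical illustrations, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

using ThreadSet = std::set<int>;

// Hypothetical block model for an acyclic CFG.
struct Block {
  std::vector<std::size_t> succs;                  // successor block indices
  std::function<std::size_t(int thread)> choose;   // index into succs taken by a thread
};

// For each block, compute the subset of the entry threads that reach it.
// Blocks are visited in index order, which is assumed to be topological.
std::vector<ThreadSet> communicationSets(const std::vector<Block> &blocks,
                                         const ThreadSet &entryThreads) {
  std::vector<ThreadSet> reach(blocks.size());
  if (blocks.empty())
    return reach;
  reach[0] = entryThreads;
  for (std::size_t b = 0; b < blocks.size(); ++b) {
    const Block &blk = blocks[b];
    if (blk.succs.empty())
      continue; // exit block: nothing to propagate
    for (int t : reach[b]) {
      std::size_t target = blk.succs[blk.choose(t)];
      assert(target > b && "blocks must be in topological order");
      reach[target].insert(t);
    }
  }
  return reach;
}

// Diamond: entry(0) branches to then(1) or else(2); both fall through to end(3).
// Even-numbered threads take the `then` side.
std::vector<ThreadSet> diamondExample() {
  std::vector<Block> blocks(4);
  blocks[0] = {{1, 2}, [](int t) -> std::size_t { return t % 2 == 0 ? 0 : 1; }};
  blocks[1] = {{3}, [](int) -> std::size_t { return 0; }};
  blocks[2] = {{3}, [](int) -> std::size_t { return 0; }};
  return communicationSets(blocks, {0, 1, 2, 3});
}
```

In the diamond, the set at `then` is the even threads, the set at `else` is the odd threads, and the full entry set is recovered at `end`. The sketch deliberately says nothing about cycles or about what happens when such a function is inlined into a cycle, which is where the thread above locates the need for explicit tokens.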
				Note: It is common among practitioners to think about convergent operations in
				terms of divergence and reconvergence: sets of threads split at branch
				instructions if threads follow different paths through the control flow graph
				(divergence) and may later merge when they reach the same static point in the
				program (reconvergence). This operational point of view is often convenient for
				backend implementations and it can sometimes be useful for guiding intuition.
				However, the semantics defined here are declarative, since experience has shown
				that the operational view is too restrictive in practice for general compiler
				transforms. It is up to each implementation to operationalize those declarative
				semantics in a way that makes sense for the underlying hardware, which varies
				wildly.


				Motivating Examples of Convergent Operations
				============================================

				(This section is informative.)

				Texture sampling in a pixel shader
				----------------------------------

				The following stylized pixel shader samples a texture at a given set of
				coordinates, using the builtin function `textureSample`. Texture sampling
				requires screen-space derivatives of the coordinates to determine the level of
				detail (mipmap level) of the sample. They are commonly approximated by taking
				the difference between neighboring pixels, which are computed by different
				threads in the same group:

				.. code-block:: c++

				  void example_shader() {
				    ...
				    color = textureSample(texture, coordinates);
				jlebar: Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample. I am imagining that it involves something like a `__shfl`, but that's because I already understand GPUs. Your audience is bigger than that.
				nhaehnle: `textureSample` is actually a built-in function of graphics languages. I'm going to add a clause to try to clarify that. I assume all GPUs have dedicated circuitry for it. I specifically wanted to mention `textureSample` in the document at least once because it (and some close analogs) are often forgotten in discussions of convergent even by graphics people like myself. Obviously the document should also be accessible to folks from the GPU compute world, which is why I tried to give a succinct explanation of the relevant facts about `textureSample` in the paragraph above. Later in the document there are also examples using shuffles, though with the Khronos-y spelling of `subgroupShuffle` instead of the CUDA-y `__shfl`. The choice of spelling is partly because that's just the world I'm personally working in most of my time, but also partly because I'd prefer using terms from common industry standards. I understand that CUDA is a bit of a de facto "standard", so if you think it's necessary to convert at least one example to CUDA spelling, we can do that -- just not this one here in particular, because it's specifically meant to be a graphics shader example.
				    if (condition) {
				      use(color);
				    }
				    ...
				  }

				From a purely single-threaded perspective, sinking the `textureSample` into
				the if-statement appears legal. However, if the condition is false for some
				neighboring pixels, then their corresponding threads will not execute together
				in the group, making it impossible to take the difference of coordinates as an
				approximation of the screen-space derivative. In practice, the outcome will be
				an undefined value.

				That is, the `textureSample` operation fits our definition of a convergent
				operation:

				1. It communicates with a set of threads that implicitly depends on control
				flow.

				2. Correctness depends on this set of threads.

				The compiler frontend can emit IR that expresses the convergence constraints as
				follows:

				.. code-block:: llvm

				  define void @example_shader() convergent {
				    %entry = call token @llvm.experimental.convergence.entry()
				    ...
				    %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
				    br i1 %condition, label %then, label %end

				  then:
				    call void @use(T %color)
				    br label %end

				  end:
				  }

				The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
				intrinsic is itself ``convergent``, and we expect it to communicate at least
				among all threads of the same "quad" -- a group of 2x2 pixels that are
				evaluated together for the purpose of approximating screen-space derivatives.
				This fact is not part of the generic LLVM IR semantics: it would have to be
				defined somewhere else, for example as part of target-specific ABI definitions
				and/or in reference to some relevant API specs.

				Since the ``@textureSample`` call then uses the token produced by the entry
				intrinsic in its ``convergencectrl`` bundle, and has no additional control
				dependencies, it must communicate among the same set of threads. This indicates
				to generic program transforms that sinking the ``@textureSample`` call is
				forbidden. (A program transform can still sink the call if it can prove somehow,
				e.g. by leaning on target-specific callbacks that can analyze the program with
				additional knowledge, that ``%condition`` is always uniform across the threads
				referenced by the convergence token ``%entry``.)
				dsanders: I think this is a little misleading. IIUC, and assuming that the sets of communicating threads are quads as mentioned above, `%condition` doesn't need to be uniform across all the threads referenced by `%entry`. The only use is inside the `then:` block, so I would expect that communicating threads for which `%condition` is uniformly false don't need to be considered, as their result will not be used by any thread that enters `then:`. As you're trying to leave methods out, it's probably best left at `... with additional knowledge, that it doesn't change the result`.
				The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform, would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: how would such a transform know what to prove and when to try?
				For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the convergence of the instructions/intrinsics than `convergencectrl` declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if `@textureSample` were a non-intrinsic function with the same properties you describe, I don't think it would be possible to know any better than what `convergencectrl` declares, preventing the analysis the sinking transform would depend on.
				It's arguably out of scope for this doc, but do you foresee convergence tokens and the `convergent` attribute becoming finer grained in future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?
				nhaehnle: I can clean up the text. As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If `%condition` is uniform (in a suitably defined sense), then sinking the `@textureSample` is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved. This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of `%condition` at the `quad` scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.

				.. _convergence_example_reductions:

				Reductions inside divergent control flow
				----------------------------------------

				The following example shows that merging common code of branches can be
				incorrect in the face of convergent operations:

				.. code-block:: c++

				  void example_kernel() {
				    delta = ...
				    if (delta > 0) {
				      total_gains = subgroupAdd(delta);
				      ...
				    } else {
				      total_losses = subgroupAdd(delta);
				      ...
				    }
				  }

				The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
				subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
				up all the ``delta`` values of those threads; and similarly for the
				``subgroupAdd`` that computes the ``total_losses``.

				If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
				would sum up the ``delta`` across all threads instead.
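
To make the difference concrete, the following sketch simulates a single subgroup on the host. It assumes a `subgroupAdd` that sums its argument over exactly the threads that execute it together. `simulatedSubgroupAdd`, `Totals`, `compareHoisting` and the sample `delta` values are hypothetical stand-ins, not a real API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum `delta` over the participating threads only; this stands in for a
// subgroupAdd executed by that set of threads.
int simulatedSubgroupAdd(const std::vector<int> &delta,
                         const std::vector<bool> &participates) {
  assert(delta.size() == participates.size());
  int sum = 0;
  for (std::size_t i = 0; i < delta.size(); ++i)
    if (participates[i])
      sum += delta[i];
  return sum;
}

struct Totals {
  int gains;   // result seen in the `delta > 0` branch
  int losses;  // result seen in the else branch
  int hoisted; // result if the call were hoisted above the branch
};

Totals compareHoisting(const std::vector<int> &delta) {
  std::vector<bool> positive(delta.size());
  std::vector<bool> nonPositive(delta.size());
  std::vector<bool> all(delta.size(), true);
  for (std::size_t i = 0; i < delta.size(); ++i) {
    positive[i] = delta[i] > 0;
    nonPositive[i] = !positive[i];
  }
  return {simulatedSubgroupAdd(delta, positive),
          simulatedSubgroupAdd(delta, nonPositive),
          simulatedSubgroupAdd(delta, all)};
}
```

For `delta` values `{3, -1, 4, -2}` the two branch-local sums are 7 and -3, while the hoisted call would return 4 to every thread, so neither `total_gains` nor `total_losses` would be preserved.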

				The compiler frontend can emit IR that expresses the convergence constraints
				as follows:
				sameerds (Author): But this use of the intrinsics does not add any new constraints, right? This specific optimization is already sufficiently constrained by control dependence.
				nhaehnle: It doesn't add any constraints for existing generic transforms in LLVM that I'm aware of, but there's still a bit of non-trivial content to it at least in theory. Whether it matters in practice depends on the backend. E.g., it doesn't matter for AMDGPU, but modern versions of CUDA say that some sort of divergence can basically happen at any point in the program. If you wanted to take code that uses the convergent operations and translate it to CUDA builtins, the control intrinsics make a difference. In that case, you'd want the uniform threadmask to replace the entry intrinsic. If it was an anchor somewhere instead, you'd want to replace the anchor by `__activemask()` and then use its return value. In both cases, you'd possibly modify the mask somehow to account for additional control dependencies between the anchor and its use. This "modify the mask somehow" hides a lot of complexity, but thinking about it quite a bit I believe it's a similar amount of complexity to what we have in the AMDGPU backend to make things work, probably less because more of the burden is shouldered by hardware in the end. Plus there's the composability aspect of it if we're talking about functions that aren't kernel entry points and might be inlined.

				.. code-block:: llvm

				define void @example_kernel() convergent {
				%entry = call token @llvm.experimental.convergence.entry()
				%delta = ...
				%cc = icmp sgt i32 %delta, 0
				t-tye [Not Done]: Should wave be used here? Above the concept of SIMD is used so would SIMD instruction be a better term to use?
				nhaehnle [Done]: The term "subgroup" is used in the example code, which strongly hints at GLSL / SPIR-V / Vulkan terminology.
				br i1 %cc, label %then, label %else

				then:
				t-tye [Not Done]: Clarify why the second version is different? Perhaps: "In the second version, threads reconverge at `end`, causing threads that reach the control barrier via different paths to synchronize separately."
				nhaehnle [Done]: Going to add: "the first (and only) post-dominator is `end`, so threads do not reconverge before then"
				%total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
				...
				br label %end

				else:
				%total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
				t-tye [Done]: .. _dynamic_instances_and_convergence_tokens:
				...
				br label %end

				end:
				...
				}

				The entry intrinsic behaves like in the previous example: assuming that
				t-tye [Not Done]: The notion of static instruction has not been defined. Above it simply uses the term LLVM IR instruction. Suggest either using that term here, or defining static instruction above.
				nhaehnle [Not Done]: I'm going to try to rephrase this a bit more explicitly.
				``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
				terminology), we expect it to communicate among all threads within the
				"subgroup". This typically maps to a SIMD vector on GPU hardware.

				simoll [Not Done]: I suppose this only refers to convergent instructions but it isn't clear to me from the wording: Does this constraint apply to all IR instructions or only those that are convergent? (Only 4. explicitly mentions convergent operations)
				sameerds (Author) [Not Done]: I think the notion of dynamic instances applies to all instructions. Continuing with #3 below, it seems to me that different threads can execute the same dynamic instance of any instruction. It's just that this notion is not very interesting in the case of non-communicating instructions. The ones that communicate need to be marked convergent, so that the effect of transformations on them is limited.
				simoll [Not Done]: I'm more concerned about the implications this constraint may have for transformations like branch fusion. The memory model is pretty permissive and allows fusion of memory accesses regardless. @nhaehnle Do you care about non-memory side effects, like exceptions? Do these follow the same weak semantics as the memory model?
				nhaehnle [Not Done]: I'm not entirely sure what you mean by the question. There isn't supposed to be any interaction between exceptions and what's being described here. There aren't any relevant constraints expressed on the dynamic instances of non-convergent operations in the first place, and for convergent operations I'd think of them as happening in two steps: there's a cross-thread communication, and afterwards each thread individually decides whether it throws an exception in its context. This can obviously take the exchanged data into account, to the point where you could model an operation as exchanging bits between threads to indicate whether an exception should be thrown in each thread -- so you could have an operation that throws an exception based on a value in another thread, as long as that other thread executes the same dynamic instance. Similarly, you could have UB in thread A based on an argument value in thread B as long as A and B execute the same dynamic instance. I'm going to add an informational note to the end of this section that dynamic instances of non-convergent instructions don't matter.
				The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
				but they also have an additional control dependency. According to the rules
				defined in this document, they only communicate among the subset of threads
				t-tye [Not Done]: This is an important concept to understand. Does more need to be said about the "may" part?
				nhaehnle [Not Done]: In a sense, that's what the rest of the document is about, so... hopefully not here? :)
				that actually end up executing the respective (static) call site.

				Hoisting them would remove the control dependency and cause them to communicate
				among the full set of threads that the entry intrinsic communicated with.
				Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
				among the relevant set of threads: in that case, the ``@subgroupAdd`` already
				communicates among the full set of threads in the original program.
				dsanders [Not Done]: Should we also mention that it's valid when %cc is non-uniform so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking %delta using %cc first). In other words, it's not the actual communication we need to keep consistent but the effects (and side-effects) of that communication.
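				As a rough sketch of the masking idea raised by dsanders (it is not part of
				the patch, and the review leaves open whether a generic transform may do this
				on its own): if ``@subgroupAdd`` has no observable effect other than returning
				the sum, both calls can sit above the branch as long as threads on the other
				side contribute 0. The ``i32 %delta`` parameter, the function name, the token
				name ``%tok`` and the ``%gain``/``%loss`` values are assumptions made only for
				this illustration.

				.. code-block:: llvm

				  declare token @llvm.experimental.convergence.entry()
				  declare i32 @subgroupAdd(i32) convergent

				  ; Hypothetical variant of @example_kernel with the sums hoisted.
				  define void @example_kernel_masked(i32 %delta) convergent {
				  entry:
				    %tok = call token @llvm.experimental.convergence.entry()
				    %cc = icmp sgt i32 %delta, 0
				    ; Threads that would have taken the other branch contribute 0, so the
				    ; sum over the full set equals the sum over the original subset.
				    %gain = select i1 %cc, i32 %delta, i32 0
				    %loss = select i1 %cc, i32 0, i32 %delta
				    %total_gains = call i32 @subgroupAdd(i32 %gain) [ "convergencectrl"(token %tok) ]
				    %total_losses = call i32 @subgroupAdd(i32 %loss) [ "convergencectrl"(token %tok) ]
				    br i1 %cc, label %then, label %else

				  then:
				    ; ... uses of %total_gains
				    br label %end

				  else:
				    ; ... uses of %total_losses
				    br label %end

				  end:
				    ret void
				  }

				Both calls now communicate among the full set of threads of the entry
				intrinsic, which differs from the original program; only the computed values
				are intended to match.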

				simoll [Not Done]: This is actually super important and should probably go into the formal semantics: the token value represents the dynamic instance of the producing instruction. If the token represents the dynamic instance exactly then this would also limit the freedom `llvm.experimental.convergence.anchor()` has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.
				nhaehnle [Not Done]: The logical split between the two sections is that this section has the basic definitions, while the "Formal Rules" section has the rules about how the convergence control intrinsics place additional constraints on how dynamic instances can be formed. Regarding "If the token represents the dynamic instance exactly then this would also limit the freedom llvm.experimental.convergence.anchor() has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.": I'm not sure I understand the argument. What exactly do you mean by dynamic invocation here? Each time a thread executes the same anchor call site, it will receive a different token value, corresponding to a different dynamic instance. That may or may not be the same dynamic instance as received by other threads. So even if control flow is entirely uniform, an implementation would be free to produce a different thread partitioning each time the anchor is executed. That is on purpose: if you want more predictable thread partitionings, use a combination of `entry` and `loop` intrinsics as required.

				Unstructured control flow
				-------------------------

				Consider an example of how jump threading removes structure in a way that can
				make semantics non-obvious without the convergence intrinsics described in this
				jlebar [Done]: Nit: Clarify that this example isn't using the proposed convergence intrinsics? Perhaps: "Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document."
				nhaehnle [Done]: Thanks, going to make this change.
				document:

				.. code-block:: llvm

				t-tye [Not Done]: What other convergent operations exist that are not defined in this document? It seems it would be good to enumerate them or provide a reference on where to find more about them.
				nhaehnle [Done]: Well, they'd be deprecated so I really don't want to talk too much about it... I'm going to rearrange this to hopefully make that clearer.
				void example_original() {
				entry:
				...
				br i1 %cond1, label %then1, label %mid

				then1:
				...
				%cond2 = ...
				br label %mid

				mid:
				%flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
				br i1 %flag, label %then2, label %end

				then2:
				...
				call void @subgroupControlBarrier()
				...
				br label %end

				t-tye [Done]: anchor
				end:
				}
				t-tye [Done]: See questions below. I had been assuming that the set of threads would be well defined by the source language and not be an implementation defined concept. I was thinking this is present to model source language semantics, not different target implementation approaches. I feel I am missing something.
				simoll [Done]: +1

				void example_jumpthreaded() {
				entry:
				jlebar [Done]: Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?
				nhaehnle [Done]: Added ellipsis.
				...
				br i1 %cond1, label %then1, label %then2
				t-tye [Not Done]: So where should it be defined what the set of threads should be? It seems it is not a target dependent concept as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of the various constructs? For example, an OpenCL kernel, a CUDA device function, or a C/C++ `main` function. Presumably later text will define how the set of threads is passed between a call site and a called function? "happen to be active at the same time" does not seem the right sentiment. The programming language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole which may reasonably also be considered to "happen to be active at the same time". So maybe this needs to admit that the language prescribes the set? Then a reference to the language specific page that defines this in a "ClangRef" document?
				nhaehnle [Not Done]: For what you have in mind, you want to be looking at the `entry` intrinsic instead of the `anchor` intrinsic. The `entry` intrinsic is used to form a relation with the group of converged threads at function entry, which for the kernel entry point would be the entire wave/workgroup/subgroup. For a called function, it would be the set of threads indicated by the `convergencectrl` operand bundle at the call site. The `anchor` is there for us to explicitly be able to express: we don't care which threads go together; all we care about is that the operations that refer to the same anchor are executed with the same set of threads (subject to control flow and all the other rules).
				dsanders [Not Done]: I feel like there's something I'm missing here. This sounds like:
				  if (condition1) {
				    %token = anchor()
				    if (condition2) {
				      ...
				    }
				    sum() convergencectrl(%token)
				  }
				can be rewritten to:
				  if (condition1) {
				    if (condition2) {
				      %token = anchor()
				      ...
				      sum() convergencectrl(%token)
				    }
				  }
				which made sense at first given statements like `we don't care which threads go together`, but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:
				  if (condition1) {
				    %token = entry()
				    if (condition2) {
				      ...
				    }
				    sum() convergencectrl(%token)
				  }
				and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case? Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as its name would suggest. However I haven't seen anything say it's immobile and things like `we don't care which threads go together` and `the code does not care about the exact set of threads with which it is executed` give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move then there'd be nothing stopping me deleting it either as I could always invent a `if(false) { ... }` to sink it all into.
				nhaehnle [Done]: That transform is allowed (assuming that sinking the user of the result of the `sum()` is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of `sum` could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :) If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use `entry` instead of `anchor`. `anchor` is still useful, as long as you have multiple things anchored to it. It will then ensure that they are relatively consistent to each other.
				sameerds (Author) [Not Done]: If I understand this right, then even `entry` does not capture anything specific ... it is merely a place holder for the `anchor` at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The `entry` at the start of a kernel is almost the same as an `anchor`, except the definition of threads that see the same dynamic instance is coming from the language above rather than the implementation below. The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler. Is that correct?
				sameerds (Author) [Not Done]: Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.
				nhaehnle [Done]: Regarding "The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.": Probably? I'm not sure I agree with the exact wording. In a compute kernel, the `entry` intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). `loop` intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on. So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can only enforce preservation of groupings. I hope that's what you meant, just with different words? :)

				then1:
				...
				%cond2 = ...
				br i1 %cond2, label %then2, label %end

				then2:
				...
				call void @subgroupControlBarrier()
				...
				br label %end

				t-tye [Done]: heart
				end:
				}

				Is the control barrier guaranteed to synchronize among the same set of threads
				in both cases? Different implementations in the literature may give different
				answers to this question:

				* In an implementation that reconverges at post-dominators, threads reconverge
				at ``mid`` in the first version, so that all threads (within a subgroup/wave)
				that execute the control barrier do so together. In the second version,
				threads that reach the control barrier via different paths synchronize
				separately: the first (and only) post-dominator is ``end``, so threads do not
				reconverge before then.

				* An implementation that sorts basic blocks topologically and ensures maximal
				reconvergence for each basic block would behave the same way in both
				versions.
				jdoerfert [Not Done]: The "heart" and the increment step are fairly vague. Maybe talk about something tangible, e.g., the target of a backedge?
				nhaehnle [Done]: When it comes to defining rules that are applicable to completely general IR, the loop intrinsic call site feels more tangible than the notion of backedge. For example, backedges don't really work as a concept when you have irreducible control flow. The loop intrinsic call site also really doesn't have to be in the header block of a natural loop -- it could be inside of an if-statement in the loop, for example, which has interesting consequences but can still be defined (and can actually be useful: someone pointed me at a recent paper by Damani et al - Speculative Reconvergence for Improved SIMT Efficiency, which proposes a certain "unnatural" way of controlling convergence in some kinds of loop for performance; the same kind of effect can be achieved by placing the loop heart inside of an if-statement).
				sameerds (Author) [Not Done]: It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence?
				nhaehnle [Done]: No. The jump-threaded code could also come out of C(++) code with `goto`s, so this doesn't really work.
				sameerds (Author) [Not Done]: But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert those intrinsics just in case optimizations changed the graph? If yes, is this something that LLVM must prescribe for every frontend as part of this document?
				nhaehnle [Done]: It needs to insert the control intrinsics if it wants to have any guarantees. There aren't a lot of useful guarantees we can make today without this, so that's fine. I don't want to say that frontends absolutely must insert the control intrinsics just yet, that's why uncontrolled convergent operations are allowed but deprecated. Frontends for languages with convergent operations that don't change will remain in the world of "things tend to work as expected a lot of the time, but stuff can break in surprising ways at the least convenient moment" that they are already in today. If they run the ConvergenceControlHeuristic pass just after IR generation, the times where things break will likely be somewhat reduced, but probably not eliminated entirely. It's difficult to make a definitive claim because there's obviously also the question of which guarantees the high-level language is supposed to give to the developer. For a HLL that just doesn't want to give any guarantees, not inserting control intrinsics is fine from the POV of language spec correctness, although you're likely to run into corner cases where the language behavior clashes with developers' intuitive expectations.

				We generally take the stance that reconvergence in acyclic control flow must
				be maximal. The compiler frontend could augment the original code as follows:

				.. code-block:: llvm
				t-tye [Not Done]: Would another way to achieve this be that the LLVM IR function itself have a convergencectrl bundle? This reflects that a function is "passed in" the set of threads.
				nhaehnle [Done]: I'm not sure what exactly the point is here. That's just how operand bundles work in LLVM: you add them to a call site, so the caller of the function that calls `entry` has to put the operand bundle there. So the operand bundle is not part of the function type, although there is the correlation that if a function is `convergent`, you have to call it with a `convergencectrl` bundle or else there's undefined behavior as stated in the very next paragraph. At least for our purposes, having it not be part of the function type is perfectly fine.

				define void @example_original() convergent {
				entry:
				%entry = call token @llvm.experimental.convergence.entry()
				...
				br i1 %cond1, label %then1, label %mid

				then1:
				t-tye [Not Done]: As mentioned above, is this dependent on the language semantics?
				nhaehnle [Done]: Kind of, though I would expect all this dependence to be captured by the target triple or some other environment factors, e.g. the calling convention used by a kernel entry point. That's why this particular document doesn't need to say anything here, at least formally.
				simoll [Not Done]: Not sure whether the expectation of uniformity makes sense here: there could be a caller with a non-uniform convergence token in a different module. This may only become apparent when everything is linked together. Would this be a property of the calling convention of the kernel function (i.e. if it's a GPU kernel we know that the entry token is all-uniform)?
				nhaehnle [Done]: The intention is that the IR-based rules still apply regardless of whether the caller is in the same module or not. I'm not sure if this needs to be spelled out more clearly. And yes, for other cases we should be able to think of it as a property of the calling convention.
				...
				%cond2 = ...
				br label %mid
				arsenm [Not Done]: Should this just be a verifier error? Why make it undefined?
				t-tye [Not Done]: The formal model needs to state the legality. It would in addition be good to have the verifier enforce the requirement.

				mid:
				%flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
				arsenm [Not Done]: This should also just be a verifier check?
				arsenm [Not Done]: Is it legal for this to be called multiple times in the same function?
				t-tye [Not Done]: I would assume it can be called multiple times in the same function provided it is not in another convergence region. If the token were an operand (or result of?) of the LLVM IR function then it seems this would become simpler, as it would simply reference that value. What is "another convergence region"? These tokens are deliberately not lexical scopes so they can describe unstructured control flow. So what is a "region" in this sense? Is it that only one token should be being used per dynamic "region" instance? That `llvm.experimental.convergence.loop` is effectively partitioning the parent token into the loop iteration instances and it is not meaningful to use the parent token inside one of those loop instances? Basically within one post-dominator region only one token should be used? Maybe this is all explained in the formal section. In any case, the term "another convergence region" needs defining. Since it cannot be outside the function's entry block, how can it be in another region anyway? When is `llvm.experimental.convergence.entry` as opposed to `llvm.experimental.convergence.anchor` used? Seems they are both conceptually doing the same thing. When would `llvm.experimental.convergence.anchor` be used, since the start of the program is typically also a function and `llvm.experimental.convergence.entry` could simply be capturing that "outside LLVM" token?
				efriedma [Not Done]: Could we get away without the "outside of a function's entry block" restriction? It seems sort of inconvenient that transforming a select to an if-then-else requires scanning the entire basic block. I guess we have to do that scan anyway, though, given the way alloca is defined, so maybe not a big deal.
				nhaehnle [Done]: @arsenm, regarding "Is it legal for this to be called multiple times in the same function?": Yes, subject to the constraints listed here. @t-tye, regarding "What is "another convergence region"? These tokens are deliberately not lexical scopes so they can describe unstructured control flow. So what is a "region" in this sense? [...]": This is defined later in the document. I'm going to add a proper link. @efriedma, regarding "Could we get away without the "outside of a function's entry block" restriction? It seems sort of inconvenient that transforming a select to an if-then-else requires scanning the entire basic block. I guess we have to do that scan anyway, though, given the way alloca is defined, so maybe not a big deal.": Right, the "only in the entry block" rule came about specifically by analogy with `alloca`s. In an early version, I only had "must not appear in a cycle", because that's all you need for the definition of convergence rules to work out. However, function inlining then becomes more complicated because the entire inline function would have to be scanned for `entry` intrinsics. With the restriction to the entry block, we can just piggyback on the existing handling of `alloca`s. The same should apply for select-to-if/else. So it's largely a pragmatic choice.
				br i1 %flag, label %then2, label %end

				then2:
				...
				call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
				...
				br label %end

				end:
				}

				If S is the set of threads that the entry intrinsic communicated with, then
				the ``@subgroupControlBarrier`` call communicates with the subset of S that
				jlebar [Not Done]: This paragraph really clarifies for me what's going on. +1
				actually reaches the call site. This set of threads doesn't change after
				jump-threading, so the answer to the question posed above remains the same.
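				For illustration, a sketch of the jump-threaded function carrying the same
				convergence control might look as follows. It is not taken from the patch:
				passing ``%cond1`` and ``%cond2`` as parameters and eliding the ``...``
				bodies are simplifications, and the token is named ``%tok`` here only so that
				it does not share a name with the ``entry`` block label.

				.. code-block:: llvm

				  declare token @llvm.experimental.convergence.entry()
				  declare void @subgroupControlBarrier() convergent

				  define void @example_jumpthreaded(i1 %cond1, i1 %cond2) convergent {
				  entry:
				    %tok = call token @llvm.experimental.convergence.entry()
				    br i1 %cond1, label %then1, label %then2

				  then1:
				    br i1 %cond2, label %then2, label %end

				  then2:
				    ; Still tied to the entry token: the barrier communicates with every
				    ; thread of S that reaches this call site, whichever edge it came by.
				    call void @subgroupControlBarrier() [ "convergencectrl"(token %tok) ]
				    br label %end

				  end:
				    ret void
				  }

				The call site and its token are unchanged by jump threading; only the edges
				leading to ``then2`` differ.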


				Opportunistic convergent operations
				-----------------------------------

				Some programs have local regions of code that contain a sequence of convergent
				operations where the code does not care about the exact set of threads with
				which it is executed, but only that the set of threads is the same for all the
				operations within the sequence. (If a subset of the convergent operations in
				the sequence have additional, non-uniform control dependencies, then this is
				not possible. However, the code may still require that the sets of threads are
				logically consistent with the conditions of those control dependencies.)
				In this case,
				:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
				can be used to express the desired semantics.

				The following example function could be part of a hypothetical "append buffer"
				implementation, where threads conditionally write fixed-sized records
				contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
				returns the index into the buffer at which the calling thread should store its
				data.

				This could be achieved by using a simple atomic operation in every thread to
				bump an allocation counter.
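
For comparison, a minimal sketch of that simple approach could look as follows. The function name ``@reserveSpaceInBufferSimple`` is hypothetical, and the declaration of ``@bufferAllocationCount`` as an ``i32`` global is an assumption based on how the counter is used in this document:

.. code-block:: llvm

    ; Assumed declaration of the global allocation counter.
    @bufferAllocationCount = global i32 0

    ; Hypothetical name; every thread performs its own atomic operation.
    define i32 @reserveSpaceInBufferSimple() {
    entry:
      ; atomicrmw returns the value held before the addition, which is the
      ; index of the record reserved for the calling thread.
      %offset = atomicrmw add i32* @bufferAllocationCount, i32 1 monotonic
      ret i32 %offset
    }

This version contains no convergent operations at all, so it places no constraints on program transforms, at the cost of one atomic operation per thread.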
t-tye [Done]: Should the rules start by defining what "static controlled convergent operation" means as that term is used in the following rule? There is a definition above for "controlled convergent operation". The "static" part seems undefined as mentioned above. A "uncontrolled convergent operation" also needs to be defined as it is used in the last rule.

				However, the following implementation can be more performant on some hardware,
				because it uses only a single atomic operation for an entire group of threads.
				To do this, it first determines the total size of the group, which will be the
				operand to the atomic operation, and then later broadcasts the result of the
simoll [Not Done]: Should suffice to say that the two threads will execute the same instance if they see the same token value. Above you stated that the token value represents the dynamic instance of the defining instruction.
nhaehnle [Done]: No, this is explicitly not sufficient. You can have:
```
%tok = call token @llvm.experimental.convergence.anchor()
br i1 %cc, label %then, label %next

then:
call void @convergent_op() [ "convergencectrl"(token %tok) ]
br label %next

next:
```
simoll [Not Done]: You mean control could deviate threads? But those threads won't even reach the convergent instruction, and only among those that do, those that have the same runtime token value will execute it as a pack.
nhaehnle [Done]: Ah, I misread your earlier comment. Yes, though there's a question of whether the different threads actually see the same value, or whether they see different values that happen to refer to the same dynamic instance of the defining instruction. One may want to think of the token value as a handle to some control structure that refers to a dynamic instance and also holds a loop counter for the loop heart intrinsic. I don't think it really matters much either way.
				atomic operation to all threads of the group, so that each thread can compute
				its individual position in the buffer:

				.. code-block:: llvm
jlebar [Not Done]: ...wait, there are such things as convergent functions? This is the first I'm hearing about it in the doc! So far it seemed there were only convergent calls. What's a convergent function? :)
nhaehnle [Done]: Uhh... technically true. How about adding something like the following somewhere:
> In LLVM IR, function calls are the only instructions that can involve convergent operations. A call itself (i.e., the act of jumping to the callee, setting up a stack frame, etc.) is not a convergent operation. However, if the callee uses the `llvm.experimental.convergence.entry` intrinsic, then we think of the entire execution of the callee as a convergent operation from the perspective of the calling function. Such callees must be marked with the `convergent` attribute, and for brevity we say that they are "convergent functions". If the callee isn't known at the call site (i.e., an indirect function call), then the `call` instruction itself must have the `convergent` attribute.
> The only reason for why a function F would need to use the `llvm.experimental.convergence.entry` intrinsic is if F in turn uses some other convergent operation, i.e., a call to a convergent function. Chains of such calls are expected to eventually end with the use of a (target-specific) intrinsic that is `convergent`.

				define i32 @reserveSpaceInBuffer() { ; NOTE: _not_ a convergent function!
				entry:
				%anchor = call token @llvm.experimental.convergence.anchor()
efriedma [Not Done]: Say you have a loop with a non-uniform trip count; does this mean the threads are allowed to communicate for the iterations that both threads execute?
t-tye [Done]:
```
if:
1. They obtained the ``convergencectrl`` token operand value from the same dynamic instance of the defining instruction, and
2. There is an n such that both threads execute U for the n'th time with that same token operand value.
```
nhaehnle [Done]:
> Say you have a loop with a non-uniform trip count; does this mean the threads are allowed to communicate for the iterations that both threads execute?
Yes -- allowed to, and must. (I.e., this prevents unrolling with remainder, as is written later.)

				%ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
				%numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
				%numThreads = trunc i64 %numThreads.p to i32

				%absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
				%absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
				%mask.p = shl i64 1, %absoluteThreadIdx.ext
				%mask = sub i64 %mask.p, 1

				%maskedBallot = and i64 %ballot, %mask
				%relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
				%relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32

efriedma [Not Done]: I'd prefer to define the "call stack" abstraction in a way that doesn't assume the whole world is LLVM IR.
nhaehnle [Not Done]: @efriedma
> I'd prefer to define the "call stack" abstraction in a way that doesn't assume the whole world is LLVM IR.
It's already explicitly written in a way that doesn't assume the whole world is LLVM IR. The rule only makes a statement about what happens when the function is called from LLVM IR, and leaves open what happens if the function is called through some other mechanism. I don't see what else we can do here.
efriedma [Not Done]: The part that's sort of unclear is that calls coming from outside of LLVM IR may or may not be part of the same dynamic instance. Obviously we can't define that here, but I think we should explicitly note it as something that's implementation-defined.
t-tye [Not Done]: See comments above. Would it be possible to unify this with the definition of `llvm.experimental.convergence.anchor`? That also needs defining here. Seems this rule could be left as is without the "If the function is executed for some reason outside of the scope of LLVM IR, e.g. because it is a kernel entry function, then this rule does not apply. On the other hand," part. And a new rule needs to be added to specify what the dynamic instance is for when F is not invoked by a `call`, `invoke`, or `callbr` instruction. That rule would reference the language semantics that defines how threads are partitioned into dynamic instances. For OpenCL that is based on the subgroup language definition, etc.
nhaehnle [Done]: I think this comment may have moved to a confusing location relative to the document. `entry` and `anchor` are inherently different. I'm going to add a note about looking at language specs etc.
				%isFirstThread = icmp eq i32 %relativeThreadIdx, 0
				br i1 %isFirstThread, label %then, label %end

efriedma [Not Done]: "uncontrolled divergent operation", meaning a convergent operation without a token? Can you just say that's outside the scope of this document earlier, where you say it's deprecated?
nhaehnle [Done]: Yes, going to make essentially that change.
				then:
				%baseOffset.1 = atomicrmw add i32* @bufferAllocationCount, i32 %numThreads monotonic
				br label %end

				end:
				%baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
				%baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
				%offset = add i32 %baseOffset, %relativeThreadIdx
				ret i32 %offset
				}

				The key here is that the function really doesn't care which set of threads it
t-tye [Not Done]: Is this intended to say:
```
2. Every cycle in the CFG that contains two or more static uses of a convergence token T by :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>` must also contain the definition of T.
```
Or could T be different for each use? Suggest a similar change to the previous rule to make it clearer.
nhaehnle [Done]: Seems reasonable, will do.
is being called with. It takes whatever set of threads it can get. What the
				implementation of the function cares about is that the initial
				``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
t-tye [Not Done]: "the minimal region in which T is live and used" Should this be clarified that it is the minimal live region (in the same way that phi nodes can be minimally created). Another interpretation of "live" allows the value to be live outside the dominance region. The "(i.e. ...)" is not really a "namely". It is actually part of the definition of what "the region in which T is live" means unless the above change (or similar) is made. Does "dominance region" need defining? T may be in many nested dominance regions, I assume here it means the minimal one? The set of blocks that are dominated by the immediate dominator of the block containing T? Then what is the subset of the dominance region being defined by "convergence region"? How can the use of T happen outside the dominance region? Wouldn't that imply a phi? But above it was stated tokens cannot be used in a phi. Does the subset respect the blocks that the use must pass through to reach the block containing the use? Or is the definition only blocks that are dominated by the block containing the definition of T that also use T or are on a path from the definition to the use of T? Again, how can there be blocks on a path between the definition of T and a use of T that are not dominated by the block containing the definition of T given that phi nodes are not allowed to specify a token? Maybe more explanation is needed?
nhaehnle [Not Done]: I'm going to try to rephrase that.
				executed the anchor together -- executes with the same set of threads as the
				final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
				far as convergence is concerned.

				The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
arsenm [Not Done]: Is there a verifier implemented for these rules?
simoll [Not Done]: Isn't 4. implied by the fact that this is SSA and the convergence region consists of all blocks that are dominated by the definition?
nhaehnle [Done]: No, the rule excludes code such as:
```
%a = call token @llvm.experimental.convergence.anchor()
%b = call token @llvm.experimental.convergence.anchor()
call void @convergent_op() [ "convergencectrl"(token %a) ]
call void @convergent_op() [ "convergencectrl"(token %b) ]
```
The convergence region of `%b` contains a use of `%a` but not its definition. I'm going to add a note about nesting.
				are free to move call sites of the function as they see fit. This can change
				the behavior in practice, by changing the sets of threads that are grouped
				together for the atomic operation. This can be visible in the output of the
				program, since the order in which outputs appear in the buffer is changed.
				However, this does not break the overall contract that ``@reserveSpaceInBuffer``
t-tye [Not Done]: Is this the legacy definition of convergence that is now deprecated? Would it be good to clarify that? Perhaps the legacy rules should be in a separate section so they do not get muddled with the new rules, and can be deleted once the deprecated support is removed.
nhaehnle [Done]: Since two people suggested this, I'm going to move it.
				has with its caller -- which makes sense: the order of outputs is
				non-deterministic anyway because of the atomic operation that is involved.

				If the function is inlined, the use of the anchor intrinsic similarly indicates
				that certain transforms which are usually forbidden by the presence of
				convergent operations are in fact allowed, as long as they don't break up the
				region of code that is controlled by the anchor.

sameerds (Author) [Done]: So this defines a proper nesting of convergence regions? An informative note would be helpful.

				.. _dynamic_instances_and_convergence_tokens:

efriedma [Not Done]: It's a bit of an exaggeration to say it has no effect on the memory model. Consider the thread group reduction example: there's implicitly some bit of "memory" used to communicate. (For the definition of readnone, "memory" is anything used to store/communicate state.) Whether that bit of memory is the same for two instructions depends on whether they correspond to the same dynamic instance. Of course, if you don't use any attributes, we'll conservatively assume that the memory accessed by an intrinsic depends on the current thread ID or something like that, so this is really only interesting if you're using readonly/readnone/etc.
t-tye [Not Done]: It does seem that traditionally the cross lane operations are not considered as using "memory" (in the sense of the language memory model) to do their communication. It is true that an implementation may use memory/storage to do this, but that is outside the memory behavior being defined by the language memory model. One could argue that execution barriers are also communication and so may use storage/memory in their implementation, yet languages seem to choose to not include that in the memory model. Although those languages may allow memory model semantics to be optionally specified in addition to the execution barrier semantics. What is attractive about this formalism is it is clearly defining semantics for both cross thread execution communication, distinct from cross thread language memory model communication. The SIMD/SIMT languages [often informally] appear to have this distinction and this allows LLVM IR to model that set of semantics accurately.
nhaehnle [Done]: I agree with @t-tye's explanation here. The choice here reflects the choice made e.g. in the Vulkan memory model: the only "convergent" operation (not the term used in Vulkan...) which interacts with the memory model is OpControlBarrier, so it's good to be able to treat these two kinds of communication orthogonally.
				Dynamic Instances and Convergence Tokens
				========================================

efriedma [Not Done]: You don't really define "same time" anywhere. That's probably outside the scope of this document anyway, but not sure referring to it here adds anything.
t-tye [Not Done]: I think there is value in mentioning this, but it should be an "informational note". The insight having this present is that it is the responsibility of the implementation to implement the "as if" semantics. This is comparable to the way the memory model is presenting an "as if" set of rules that the physical hardware may not in fact be literally doing. The point being that these rules can be implemented on systems that do not have physical SIMD/SIMT hardware. In such systems the dynamic instruction instances may not be executed at the same time, and other means are used to ensure the communication happens correctly (perhaps staging buffers). This is even true on SIMD/SIMT hardware if the set of threads is larger than the SIMD/SIMT instruction size as it is for example if the subgroup size requires multiple waves/warps and scratchpad memory is used.
nhaehnle [Done]: Right, that was exactly the intention here: make it plain as day to people that the requirement is only "as if"-semantics, not literal lock-step execution. I'm going to prefix this with "Informational note"
				Every execution of an LLVM IR instruction occurs in a dynamic instance of
				the instruction. Dynamic instances are the formal objects by which we talk
				about communicating threads in convergent operations. They satisfy:

				1. Different executions of the same instruction by a single thread
				give rise to different dynamic instances of that instruction.

				2. Executions of different instructions always occur in different dynamic
				instances. For this and other rules in this document, instructions of the
				same type at different points in the program are considered to be different
				instructions.

				3. Executions of the same instruction by different threads may occur in
				the same dynamic instance.

				4. When executing a convergent operation, the set of threads that execute the
				same dynamic instance is the set of threads that communicate with each other
				for that operation.

				Convergence tokens are values of ``token`` type, i.e. they cannot be used in
				``phi`` or ``select`` instructions. A convergence token value represents the
				dynamic instance of the instruction that produced it.

				Convergent operations typically have a ``convergencectrl`` operand bundle with
				a convergence token operand to define the set of communicating threads relative
				to some anchor. The details are described in the
				:ref:`Formal Rules <convergence_formal_rules>` section.

				.. _controlled_convergent_operation:

				The convergence control intrinsics described in this document and convergent
				operations that have a ``convergencectrl`` operand bundle are considered
				controlled convergent operations.

				Other convergent operations are uncontrolled. Their use is deprecated.
				Program transforms are correct for uncontrolled convergent operations if they
				do not make such operations control-dependent on additional values. The
				remainder of this document is only concerned with controlled convergent
				operations.
arsenm [Done]: This and a lot of the later examples use "call tok" instead of the proper "call token"
t-tye [Not Done]: This seems to be the motivation for why llvm.experimental.convergence.anchor is wanted rather than a token flowing into the enclosing function. Or could this transformation also be done if it used a token obtained from llvm.experimental.convergence.entry outside the loop? Why would this example not use llvm.experimental.convergence.loop since each loop iteration could involve a different dynamic instance? Or is that the point, this is explicitly saying all the threads that entered the loop must participate, and transformation cannot change this. But wouldn't using llvm.experimental.convergence.loop also enforce that in this case? It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites). If one did do that could there be clear transformations to determine when this transformation is legal?
nhaehnle [Done]:
> It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites).
Yes, that is the point of `llvm.experimental.convergence.anchor`. And yes, if there was clear "chain of custody" as you call it from outside of the loop, then this unrolling with remainder would be incorrect.

				Informational notes:

				1. The rules above define dynamic instances for all LLVM IR instructions,
				whether convergent or not. However, the dynamic instances for non-convergent
				instructions are entirely irrelevant. The only way that dynamic instances
				can have an effect on the execution of a program is via rule 4 about the
cross-thread communication in convergent operations.

				2. The text defines convergence token values as representing a dynamic
				instance, but you could almost think of them as representing a set of
				threads instead -- specifically, the set S of threads that executed the
				dynamic instance, i.e. that executed the defining instruction D "together".

				Intuitively, when a convergence token value T is used by a
				``convergencectrl`` bundle on an instruction I, then the set of threads that
				communicates in I is a subset of the set S represented by the token value.
				Specifically, it is the subset of threads that ends up executing I while
				using the token value.

				This by itself wouldn't quite work as a definition: what if I is executed
				multiple times by the same threads? Which execution of I in thread 1
				communicates with which execution of I in thread 2? Leaning on the notion of
				dynamic instances gives a robust answer to this question as long as D and I
t-tye [Not Done]: This confuses me. Shouldn't these intrinsics have well defined semantics so that source languages can map their semantics on to them? How is that possible if the intrinsics do not have well defined meaning? Their implementation would still be target/implementation defined.
nhaehnle [Done]: I hope this has been answered in the context of your other comments?
sameerds (Author) [Not Done]: Which part of the formal semantics shows that this is a valid translation? Rule for the execution of dynamic instances seems to be useful to only specify which threads execute the convergent operations. But what relates them to the original loop? Is it because the set of dynamic instances produced by the second version has a one-to-one mapping with the set of dynamic instances produced by the first version?
nhaehnle [Done]: The first version doesn't have a unique set of dynamic instances in the first place, because `anchor` is by design implementation-defined. So the possible universes of dynamic instances in the transformed/unrolled version only needs to be a subset. In a sense, the loop unroll with remainder picks a subset by saying: from now on, if you have two threads with e.g. iteration counts 3 and 4, then they will never communicate during the 3rd iteration. In the original program, they may or may not have communicated during the 3rd iteration -- up to the implementation, and in this case, the implementation decided to do a form of loop unrolling which implicitly ends up making a choice.
				are at the same loop (or cycle) nesting level.

				The case where D and I are at different loop nesting levels is forbidden by
				the static validity rules spelled out in the
				:ref:`Formal Rules <convergence_formal_rules>` section -- handling that case
				is the purpose of
				:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`.
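
As an illustrative sketch of the intuition in note 2 (not part of the formal rules), consider a case where D and I are at the same nesting level. The names ``@example`` and ``@convergent_op`` are hypothetical, and the declarations are assumptions added to keep the snippet self-contained:

.. code-block:: llvm

    declare token @llvm.experimental.convergence.anchor()
    ; Hypothetical convergent operation, for illustration only.
    declare void @convergent_op() convergent

    define void @example(i1 %cc) {
    entry:
      ; D: the threads that execute this dynamic instance together form
      ; the set S represented by %tok.
      %tok = call token @llvm.experimental.convergence.anchor()
      br i1 %cc, label %then, label %next

    then:
      ; I: only the threads of S that reach this call communicate here.
      call void @convergent_op() [ "convergencectrl"(token %tok) ]
      br label %next

    next:
      ret void
    }

Since neither D nor I is inside a cycle, each thread executes I at most once per execution of D, so the question of which executions of I correspond to each other does not arise.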


				Convergence Control Intrinsics
				==============================

				This section describes target-independent intrinsics that can be used to
				produce convergence tokens.

				.. _llvm.experimental.convergence.entry:

				``llvm.experimental.convergence.entry``
				----------------------------------------

				.. code-block:: llvm

				token @llvm.experimental.convergence.entry() convergent readnone

				This intrinsic is used to tie the dynamic instances inside of a function to
				those in the caller. Informally, one can think of it as returning the
				convergence token value that was used in the ``convergencectrl`` operand bundle
				when the current function was called. The formal definition based on dynamic
jlebar [Not Done]: Do you plan to check this in the verifier (insofar as possible, I understand that it's not possible to check this for cross-TU calls).
nhaehnle [Done]: Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...
				instances is given :ref:`later <convergence_formal_rules_entry>`.

				Behavior is undefined if the containing function was called from IR without
				a ``convergencectrl`` bundle.

				The expectation is that for program "main" functions whose caller is not
				visible to LLVM, such as kernel entry functions, the implementation returns a
				convergence token that represents uniform control flow, i.e. that is guaranteed
jlebar [Not Done]: This one is a local property -- could we say that this makes the program ill-formed, instead of UB?
nhaehnle [Done]: Yes, that's a good idea.
				to refer to all threads within a (target- or environment-dependent) group.
t-tye [Not Done]: header,
t-tye [Not Done]: loop,
nhaehnle [Done]: Is that still grammatically correct? The parse of the sentence is
> Loops in which ((a loop intrinsic outside of the loop header) uses a token defined outside of the loop)
That is, "a loop intrinsic outside of the loop header" is the subject of the sentence in the outer parentheses.

				Behavior is undefined if this intrinsic appears in a function that isn't
				``convergent``.
jlebar [Not Done]: Again, could we say this makes the program ill-formed? (At least the entry-block check, I'm not sure what a convergence region is, yet.)
nhaehnle [Done]: The entry-block check should be straightforward.

				Behavior is undefined if this intrinsic appears inside of another
				:ref:`convergence region <convergence_region>` or outside of a function's entry
				block.

				Function inlining substitutes this intrinsic with the token from the operand
sameerds (Author) [Not Done]: I think this intends to say "block in the loop body other than the loop header", but the wording chosen is a little difficult to parse on a first read.
nhaehnle [Done]: Going to try an improvement :)
				bundle. For example:
arsenm [Done]: @ in wrong place for the call

				.. code-block:: c++

				// Before inlining:

				void callee() convergent {
				%tok = call token @llvm.experimental.convergence.entry()
t-tye: This also confuses me. If anchor is supposed to denote the current set of threads in the current dynamic instance, then it seems undefined IR to use it in the conditional when all those threads cannot be performing the dynamic operation instance. I feel I am missing a fundamental aspect of the formal model.
sameerds: +1 To me, the whole point of this new concept is to capture control dependency so that we don't have to go look at branch conditions again. But allowing such a transformation reintroduces the need to go check the control dependency to understand which threads are really executing this instance.
nhaehnle: I mean, `anchor` is implementation-defined, so you can't make a totally solid statement anyway. You could only make solid relative statements if the token produced by the anchor was also used by some other convergent operations, and if those are outside of the if-statement, the sinking wouldn't be allowed anymore anyway...
				convergent_operation(...) [ "convergencectrl"(token %tok) ]
				}

				void main() {
				%outer = call token @llvm.experimental.convergence.anchor()
				for (...) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
arsenm: %tok defined in both branches looks like broken SSA to me
t-tye: Which would mean a phi which is not allowed. But again this is changing what set of threads %tok is denoting so I feel I am not understanding what a convergent token is fundamentally denoting. My thinking had been that the convergent tokens were a way that the high level language mapping to LLVM IR can communicate the language mandated convergence rules. But these examples seem to dispel that notion and make it a target dependent concept unrelated to the source language.
nhaehnle: You're overanalyzing this. It's just a weird mash-up of C-like if-statements with LLVM IR-like notation that made me not think about the potential ways this could be interpreted. I'm going to rename the variables to disambiguate.
				callee() [ "convergencectrl"(token %inner) ]
				}
				}

				// After inlining:

				void main() {
				%outer = call token @llvm.experimental.convergence.anchor()
				for (...) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				convergent_operation(...) [ "convergencectrl"(token %inner) ]
				}
				}
arsenm: Indentation


				.. _llvm.experimental.convergence.loop:

				``llvm.experimental.convergence.loop``
				--------------------------------------

t-tye: So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with? My understanding was that the tokens were intended to be explicit in denoting the involved threads to avoid needing to crawl the LLVM IR to determine the control dependence. And were intended to be explicit in preventing control dependence changes. But these examples seem to contradict that understanding. So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?
sameerds: This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
nhaehnle: CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.
sameerds: Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication is outside the scope of this document, where the mask argument is used to relate the dynamic instances. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.
nhaehnle: I don't understand what you're trying to get at here. The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow, it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances. All legal program transforms subject to those constraints are then legal. There is no need to label them as `convergent`. If you can think of a counter-example, I'd be curious to see it.
sameerds: I am trying to understand whether there are constructs in Clang-supported high-level languages that cannot be addressed by these intrinsics. And if such constructs do exist, then whether that gates the adoption of this enhancement in LLVM. But I see your point now. The sync() builtins in CUDA are no longer dependent on convergence. The decision to hoist or sink them is based entirely on other things like data dependences (and maybe just that).
				.. code-block:: llvm

sameerds: So the heart is not a property of the loop itself in LLVM IR. It is a place chosen by the frontend based on semantics external to LLVM IR, in a way that allows the frontend to express constraints about convergence in the loop.
nhaehnle: Yes.
				token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone

				This intrinsic defines the heart of a loop, i.e. the place where an imaginary
				loop counter is incremented for the purpose of determining convergence
				semantics.

				The convergence control token operand is usually defined outside of the loop,
sameerds: What forbids the convergent operations from being hoisted? Isn't that the whole point of this new framework? In particular, what would the total_gains/total_losses example look like with appropriate use of convergence tokens?
nhaehnle: I'm going to add that example.
				but this is not a requirement for the validity of a program (the resulting
				behavior is quite different, though).

				The resulting convergence token can be used outside of the loop; see the
				:ref:`Formal Rules <convergence_formal_rules>` section for details.


				.. _llvm.experimental.convergence.anchor:

				``llvm.experimental.convergence.anchor``
sameerds: Just like the loop intrinsic, this intrinsic occurs in a place chosen by the frontend based on semantics outside of LLVM IR, and used by the frontend to express constraints elsewhere in the IR.
nhaehnle: I'd rephrase it slightly by saying that the place is chosen by the frontend in a way that preserves the semantics of the original language into LLVM IR. But I suspect that we're ultimately thinking of the same thing.
				----------------------------------------

				.. code-block:: llvm

				token @llvm.experimental.convergence.anchor() convergent readnone

				This intrinsic is a marker that acts as an anchor producing an initial
				convergence token that is independent from any "outer scope". The set of
t-tye: I think this is the part that I am struggling with. It feels like llvm.experimental.convergence.anchor is allowed to partition the threads in an arbitrary way. So how does that square with the language mandating how the threads must be partitioned?
nhaehnle: Should be answered elsewhere.
				threads executing the same dynamic instance of this intrinsic is
				implementation-defined.

				The expectation is that all threads within a group that "happen to be active at
				the same time" will execute the same dynamic instance, so that programs can
				detect the maximal set of threads that can communicate efficiently within
				some local region of the program.


				.. _convergence_formal_rules:

				Formal Rules
				============

				The convergence control intrinsics described in the previous section place
				additional constraints on the execution of dynamic instances, which should be
jlebar: Have we formally defined what a "controlled" convergent operation is? Do you mean a `call` to a `convergent` function with a `"convergencectrl"` operand bundle? (Say that?)
nhaehnle: Yes, the section "Dynamic Instances and Convergence Tokens" already says this:
> The convergence control intrinsics described in this document and convergent operations that have a `convergencectrl` operand bundle are considered controlled convergent operations.
I'm going to add an anchor there since the doc is pretty long :)
				understood on top of the
				:ref:`basic rules about dynamic instances <dynamic_instances_and_convergence_tokens>`:

				1. Let U be a :ref:`controlled <controlled_convergent_operation>` convergent
				operation other than the convergence control intrinsics. Let D be the
				instruction that defines the convergence token used by U. Two threads
				executing U execute the same dynamic instance of U if and only if they
				obtained the token value from the same dynamic instance of D.

				(Informational note: As mentioned in the
				:ref:`basic rules about dynamic instances <dynamic_instances_and_convergence_tokens>`,
t-tye: This seems to contradict the pixel example at the beginning. Or is this transformation allowed if it can be proven that pure.convergent.operation does not rely on the result from the threads that would not execute the condition to true? How could that be done?
nhaehnle: The pixel example would use `entry` instead of `anchor`. I'm going to add that example.
				the requirement here is that U is the same point in the program and not just
				the same type of instruction. In particular, this rule does not apply when
				the same ``convergent`` function is called from different call sites.)

				2. Two threads executing the same call U of
				:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
				execute the same dynamic instance of U if and only if:
sameerds: The older comments about this seem to have floated away. At risk of repeating the discussion, what is n capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?
nhaehnle: It's really just a loop iteration counter. Every time a thread executes the `loop` intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label. Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the `loop` intrinsic. This means that if you have a natural loop where the `loop` intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere). Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
sameerds:
> Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.
This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?
> Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.
nhaehnle:
> > Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.
> This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?
Well... what even is a loop iteration? :) For the purpose of convergence, the loop heart defines what the iterations are, so it is reached on every iteration by definition. (But there may well be cycles in the CFG that don't contain a loop intrinsic, and that's fine.) More likely your real question is whether in a natural loop, the loop intrinsic must be reached once per execution of the loop header (or traversal of a back edge) -- the answer is no. Part of the rationale here (and also an unfortunately inherent source of potential confusion) is that for defining convergence, and more generally for implementing whole-program vectorization of the style we effectively do in AMDGPU, leaning only on natural loops doesn't work, at least in part because of the possibility of irreducible control flow. This is why all the actual algorithms I'm building on this rely on the Havlak-inspired CycleInfo of D83094, and all the rules in this document are expressed in terms of cycles (in the sense of circular walks in the CFG) instead of natural loops.
> My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.
I would have liked to have that property but couldn't make it work without imposing static rules that would be much harder to understand and follow. The point about unrolling is mentioned in the later examples section where I talk through a bunch of example loops and whether they can be unrolled or not.

				1. They obtained the ``convergencectrl`` token operand value from the same
				dynamic instance of the defining instruction, and
t-tye: Again still not clear how llvm.experimental.convergence.anchor can be allowed to be implementation defined. Or is this saying that when the set of threads is defined by the language, llvm.experimental.convergence.entry must be used? Maybe the graphics languages are looser in their execution model to allow arbitrary implementation of some aspects and that is what llvm.experimental.convergence.anchor is modeling? But it cannot be used for compute languages that have [debatably] stronger rules?
nhaehnle: Should be answered elsewhere.
				2. There is an n such that both threads execute U for the n'th time
				with that same token operand value.

				.. _convergence_formal_rules_entry:

				3. If two threads execute the same call U of
				:ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`,
				and at least one of them executes the function F containing U because it was
				called by a ``call``, ``invoke``, or ``callbr`` instruction, then they
				execute the same dynamic instance of U if and only if both threads execute F
				because it was called by the same dynamic instance of a ``call``, ``invoke``,
				or ``callbr`` instruction.

				Informational notes:

				1. If a thread executes the function due to a call from IR, then the
				thread cannot "spontaneously converge" with threads that execute the
				function for some other reason.

				2. The behavior of ``llvm.experimental.convergence.entry`` in functions
				that are called from outside the scope of LLVM, e.g. kernel entry
				point functions, is expected to be defined elsewhere, e.g. in reference
				to the relevant language or API (e.g. OpenCL, Vulkan) specifications.

For the purpose of the following rules, a cycle is a closed walk in the CFG,
i.e. a directed sequence of nodes and edges in the CFG whose start and end
points are the same.

				The following static rules about cycles must be satisfied by valid programs:

				1. Every cycle in the CFG that contains a use of a convergence token T other
				than a use by
				:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
				must also contain the definition of T.
sameerds: Just like the n property of the loop intrinsic, I think an informational note explaining this will be helpful.

				2. Every cycle in the CFG that contains two different uses of a convergence
				token T must also contain the definition of T.

				3. Every cycle in the CFG that contains uses of two different convergence tokens
				T1 and T2 must also contain the definition of at least one of them.

sameerds: This is not a rule; it's just a definition.
nhaehnle: Fair enough. I'm going to split this up into rules about cycles and rules about convergence regions.
Taken together, these rules imply that for every cycle C, there can be at most
one convergence token T which is used in C but defined outside of it, and that
T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
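
As an illustration only, the first static rule can be phrased as a reachability question at basic-block granularity: a cycle that contains a use but not the definition exists exactly when the use block can reach itself without passing through the defining block. The Python sketch below is not part of the patch. The function names, the dictionary encoding of the CFG, and the block names ``entry``, ``header``, ``body`` and ``exit`` are assumptions made for this example, and uses by the loop intrinsic (which the rule exempts) are simply not passed to the check.

```python
from collections import deque


def reachable_avoiding(cfg, start, banned):
    """Blocks reachable from `start` by one or more edges without entering `banned`."""
    seen = set()
    work = deque(s for s in cfg.get(start, ()) if s != banned)
    while work:
        block = work.popleft()
        if block in seen:
            continue
        seen.add(block)
        work.extend(s for s in cfg.get(block, ()) if s != banned)
    return seen


def violates_first_cycle_rule(cfg, def_block, use_block):
    """True if some cycle contains the (non-loop) use but not the definition."""
    if use_block == def_block:
        # Any cycle through this block contains both the use and the definition.
        return False
    return use_block in reachable_avoiding(cfg, use_block, def_block)


# Hypothetical CFG of a simple loop: entry -> header -> body -> header, header -> exit.
cfg = {"entry": ["header"], "header": ["body", "exit"], "body": ["header"], "exit": []}

# Token defined by an anchor in `entry`, used by a convergent op in `body`.
bad = violates_first_cycle_rule(cfg, "entry", "body")
# Token defined by a loop heart in `header`, used by a convergent op in `body`.
good = violates_first_cycle_rule(cfg, "header", "body")
```

Under these assumptions, ``bad`` is true because the cycle ``header -> body -> header`` never visits ``entry``, while ``good`` is false because every cycle through ``body`` must pass through ``header``.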

sameerds: Since a convergence region is defined for a token, this text needs to bring out the fact that two different tokens are being talked about at this point. Something like: "If the convergence region for token T1 contains a use of another token T2, then it must also contain the definition of T2."
nhaehnle: It's needed from a formal point of view, but it does seem to trip people up, so I'm going to implement your suggestion :)
				.. _convergence_region:

				The convergence region of a convergence token T is the minimal region in
				which T is live and used, i.e., the set of program points dominated by the
				definition D of T from which a use of T can be reached by a walk in the CFG
				that is fully dominated by D.

				The following static rule about convergence regions must be satisfied by
				valid programs:

				1. If a convergence region R for a token T1 contains a use of a convergence
				token T2, then R must also contain the definition of T2. (In other words,
				convergence regions must be reasonably nested.)


				Memory Model Non-Interaction
				============================

				The fact that an operation is convergent has no effect on how it is treated for
				memory model purposes. In particular, an operation that is ``convergent`` and
				``readnone`` does not introduce additional ordering constraints as far as the
				memory model is concerned. There is no implied barrier, neither in the memory
				barrier sense nor in the control barrier sense of synchronizing the execution
				of threads.

				Informational note: Threads that execute the same dynamic instance do not
				necessarily do so at the same time.


				Other Interactions
				==================

				``convergent`` vs. ``speculatable``. A function can be both ``convergent`` and
				``speculatable``, indicating that the function does not have undefined
				behavior and has no effects besides calculating its result, but is still
				affected by the set of threads executing this function. This typically
				prevents speculation of calls to the function unless the constraint imposed
				by ``convergent`` is further relaxed by some other means.


				Rationales
				==========

				(This section is informative.)

				Static rules about cycles
				-------------------------

				Consider a loop with (incorrect!) convergence control as in the following
				pseudocode:

				.. code-block:: llvm

				; WARNING: Example of incorrect convergence control!

				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				...
				call void @convergent.op() [ "convergencectrl"(token %anchor) ]
				...
				}

				This code is forbidden by the first static rule about cycles.

				A first formal argument why we have to do this is that the dynamic rule for
				deciding whether two threads execute the same dynamic instances of
				``@convergent.op`` leads to a logical contradiction in this code.
				Assume two threads execute the same dynamic instance of the anchor
				followed by two iterations of the loop. Thread 1 executes dynamic instances
				I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
				Using all the rules, we can deduce:

				1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.

				2. ``I1 == J1`` by the first dynamic rule about controlled convergent
				operations: both threads execute the same static instruction while using
				a convergence token value produced by the same dynamic instance of an
				instruction (the anchor).

				3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.

				The fact that one may be intuitively tempted to think of ``I1`` and ``J2``
				as being executed in different loop iterations is completely irrelevant for
				the formal argument. There is no mechanism in LLVM IR semantics for
sameerds: The exhausted reader just begs to see the //corrected// version at this point! :)
nhaehnle: The exhausted author is taking a note and will get around to it soon ;)
				forming associations between loop iterations in different threads, except
				for the rules defined in this document -- and the rules in this document
				require a loop heart intrinsic for talking about loop iterations.

				4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
				contradiction.
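
The deduction above can be replayed mechanically. The following Python sketch (an illustration, not part of the patch) treats the equalities from steps 2 and 3 as merges in a small union-find structure and then checks the inequalities from step 1. The helper names ``find`` and ``union`` and the string labels for the dynamic instances are assumptions made for this example.

```python
def find(parent, x):
    """Return the representative of x's equivalence class (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the equivalence classes of a and b."""
    parent[find(parent, a)] = find(parent, b)


instances = ["I1", "I2", "J1", "J2"]
parent = {x: x for x in instances}

# Steps 2 and 3: every pairing across the two threads is forced to be equal,
# because all four executions use the token from the same anchor instance.
for a, b in [("I1", "J1"), ("I1", "J2"), ("I2", "J1"), ("I2", "J2")]:
    union(parent, a, b)

# Step 1: instances executed by the same thread must be distinct.
must_differ = [("I1", "I2"), ("J1", "J2")]
contradictions = [(a, b) for a, b in must_differ
                  if find(parent, a) == find(parent, b)]
```

Both required inequalities end up in ``contradictions``, which corresponds to step 4 of the argument.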

				This problem goes away by inserting a loop heart intrinsic as follows, which
				establishes a relationship between loop iterations across threads.

sameerds: So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?
nhaehnle: Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.
sameerds: The n that I used is the virtual loop count that is described in the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?
nhaehnle: The formal explanation is ultimately that the set of communicating threads is changed, but I agree that it could be helpful to spell out how that comes about via the rules on dynamic instances, so I'm going to do that.
				.. code-block:: llvm

				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
sameerds: Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().
nhaehnle: Correct. I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.
sameerds: I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.
nhaehnle: I'm adding a "rationale" section specifically to explain those static rules about cycles.
				...
				call void @convergent.op() [ "convergencectrl"(token %loop) ]
				...
				}

				In the same scenario of two threads executing the same dynamic instance of the
				anchor and then two iterations of the loop, the dynamic rule about loop heart
				intrinsics implies that both threads execute the same dynamic instance L1 of
				the loop heart intrinsic in their respective first iterations and the same
				dynamic instance L2 in their respective second iterations of the loop.

				This then implies that they execute the same dynamic instance ``I1 == J1`` of
				the ``@convergent.op`` in their first iterations and the same dynamic instance
				``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
				so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
				see different values of the ``%loop`` token, referring to different dynamic
				instances of the loop intrinsic.
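
The same conclusion can be reproduced with a toy model. The Python sketch below (an illustration, not part of the patch) represents each dynamic instance as a tuple: a loop heart instance is identified by the anchor instance it received and by the iteration count n, and a ``@convergent.op`` instance is identified by the loop heart instance whose token it uses. The function ``run_thread`` and the tuple encoding are assumptions made for this example.

```python
def run_thread(anchor_instance, iterations):
    """Return the dynamic instances of @convergent.op executed by one thread."""
    ops = []
    for n in range(iterations):
        # Dynamic rule for the loop heart: same token operand and same n
        # give the same dynamic instance of the loop intrinsic.
        loop_instance = ("loop", anchor_instance, n)
        # Dynamic rule for controlled convergent operations: the instance
        # is determined by the dynamic instance that defined the token.
        ops.append(("convergent.op", loop_instance))
    return ops


# Both threads executed the same dynamic instance of the anchor.
anchor = ("anchor", 0)
I1, I2 = run_thread(anchor, 2)  # thread 1
J1, J2 = run_thread(anchor, 2)  # thread 2

same_iteration_converges = (I1 == J1) and (I2 == J2)
cross_iteration_differs = (I1 != J2) and (I2 != J1)
```

In this model, ``I1 == J1`` and ``I2 == J2`` hold while ``I1 != J2`` and ``I2 != J1``, so the contradiction from the earlier example no longer arises.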

				One may ask whether we could change the dynamic rule instead of adding the
				static rule about cycles. That is impractical due to deeper difficulties.
				Consider the following loop, again with incorrect convergence control:

				.. code-block:: llvm

				; WARNING: Example of incorrect convergence control!

				; (A)
				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				; (B)
				if (condition1) {
				; (C)
				call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
				}
				; (D)
				if (condition2) {
				; (E)
				call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
				}
				; (F)
				}
				; (G)

				Assume two threads execute the same dynamic instance of the anchor followed
				by this sequence of basic blocks:

				.. code-block:: text

				Thread 1: A B C D F B D E F G
				Thread 2: A B D E F B C D F G
				sameerds (Author): Following the structure of previous examples, it would be good to have a demonstration of how this can result in misinterpreted convergence. That would explain why this example should be illegal. This paragraph directly applies the rules to show how the example is recognized as illegal.
				nhaehnle: Isn't it just the same as in the example directly above? You'd expand C / E to a longer sequence of what happens in those inner loops, but the essential difficulty is the same.
				sameerds (Author): Maybe it is the same. See earlier note about exhausted reader. :) Maybe it's just me, but the concepts in this document are quite slippery, and well-rounded examples that restate the obvious can go a long way in gaining confidence.

				That is, both threads execute two iterations of the loop, but they execute
				the different convergent operations in different iterations. Without forming a
				relation between loop iterations across the threads, there is no reasonable way
				of defining which dynamic instances of the convergent operations should be the
				same across the threads, if any.

				Again, this can be addressed by adding a loop heart intrinsic, most naturally
				as:

				.. code-block:: llvm

				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
				if (condition1) {
				call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
				}
				if (condition2) {
				call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
				}
				}

				Let ``%loop(i;j)`` be the dynamic instance of the ``j``-th execution of the loop
				heart intrinsic by thread ``i``, and analogously let ``@op.k(i)`` be the dynamic
				instance of the execution of ``@convergent.op.k`` by thread ``i``.
				Then we have:

				1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
				about loop heart intrinsics.

				2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
				different executions by the same thread happen in different dynamic
				instances.

				3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the value of the ``%loop``
				convergence token referring to ``%loop(1;1)`` and ``@op.1(2)`` uses the value
				referring to ``%loop(2;2) == %loop(1;2)``, which is different from
				``%loop(1;1)``.

				4. Similarly, ``@op.2(1) != @op.2(2)``.
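
The four steps above can be replayed mechanically. The following Python sketch is only an illustration under stated assumptions, not part of the patch: it assumes the loop heart sits at the top of the loop body (block B in the earlier labelling), and it models a dynamic instance as a plain tuple. The names ``TRACES`` and ``op_instances`` are hypothetical.

```python
# Block traces of the two threads, taken from the example in the text.
TRACES = {
    1: "A B C D F B D E F G".split(),
    2: "A B D E F B C D F G".split(),
}


def op_instances(trace):
    """Map each convergent op to the dynamic instance it executes.

    Assumption: the j-th visit to block B executes the loop heart instance
    ("loop", j), which is shared by every thread that reaches iteration j.
    An op's dynamic instance is identified by the op name together with the
    loop heart instance its token refers to.
    """
    iteration = 0
    result = {}
    for block in trace:
        if block == "B":
            iteration += 1
        elif block == "C":
            result["op.1"] = ("op.1", ("loop", iteration))
        elif block == "E":
            result["op.2"] = ("op.2", ("loop", iteration))
    return result


inst = {tid: op_instances(trace) for tid, trace in TRACES.items()}

# Both ops end up in different dynamic instances across the two threads.
print(inst[1]["op.1"] != inst[2]["op.1"])
print(inst[1]["op.2"] != inst[2]["op.2"])
```

Thread 1 executes ``@convergent.op.1`` in its first iteration and thread 2 in its second, so the tokens they see refer to different loop heart instances, which is what steps 3 and 4 state.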

				However, loop heart intrinsics could be inserted differently, at the cost
				of also inserting a free-standing anchor:

				.. code-block:: llvm

				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				if (condition1) {
				%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
				call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
				}
				if (condition2) {
				%free = call token @llvm.experimental.convergence.anchor()
				call void @convergent.op.2() [ "convergencectrl"(token %free) ]
				}
				}

				This leads to the "unnatural counting of loop iterations" that is also mentioned
				elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
				loop heart intrinsic by thread ``i`` (each thread executes it only once), and
				let ``@op.k(i)`` be as before. Then:

				1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
				intrinsics.

				2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
				referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``, so they refer to the
				same dynamic instance.

				3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
				use of the ``%free`` anchor intrinsic.

				In practice, they almost certainly have to be different dynamic instances.
				Consider that if an implementation strictly follows the order of
				instructions given in the program, the executions of the threads can be
				"aligned" as follows:

				.. code-block:: text

				Thread 1: A B         C D F B D E F G
				Thread 2: A B D E F B C D F G

				So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
				can be no communication between the threads, which means they execute
				different dynamic instances.

				That said, it is conceivable that there aren't actually any data or other
				dependencies that would enforce this execution order. In that case, a highly
				out-of-order implementation could potentially allow communication. That's
				why the rules defined in this document are silent about whether
				``@op.2(1) == @op.2(2)`` or not.

				This type of convergence control seems relatively unlikely to appear in real
				programs. Its possibility is simply a logical consequence of the model.

				An equivalent issue arises if the convergent operations are replaced by nested
				loops with loop heart intrinsics that directly refer to ``%anchor``, hence
				the variants of the static rules about cycles that apply to them:

				.. code-block:: llvm

				; WARNING: Example of incorrect convergence control!

				%anchor = call token @llvm.experimental.convergence.anchor()
				for (;;) {
				if (condition1) {
				for (;;) {
				%loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
				}
				}
				if (condition2) {
				for (;;) {
				%loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
				}
				}
				}

				There is a cycle (closed walk in the CFG) that goes through both loop heart
				intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
				so this code is invalid.
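
The cycle argument can be checked with a small graph search. The Python sketch below is a hypothetical illustration of this one rule as phrased here, not the Verifier's implementation. The block names and the CFG edges are assumptions made for the example (in particular, the inner loops are given exit edges), and ``closed_walk_avoiding_def`` is an invented helper.

```python
def reachable(cfg, start, removed):
    """Blocks reachable from `start` via one or more edges, never entering `removed`."""
    seen = set()
    stack = [s for s in cfg.get(start, ()) if s != removed]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(s for s in cfg.get(node, ()) if s != removed)
    return seen


def closed_walk_avoiding_def(cfg, use_a, use_b, definition):
    """True if some closed walk visits both uses without passing through the definition."""
    return (use_b in reachable(cfg, use_a, definition)
            and use_a in reachable(cfg, use_b, definition))


# Assumed CFG for the nested-loop example: `loop1` and `loop2` are the blocks
# holding the two loop heart intrinsics, `anchor` holds the token definition.
cfg = {
    "anchor": ["outer"],
    "outer": ["loop1", "check2"],
    "loop1": ["loop1", "check2"],
    "check2": ["loop2", "latch"],
    "loop2": ["loop2", "latch"],
    "latch": ["outer"],
}

invalid = closed_walk_avoiding_def(cfg, "loop1", "loop2", "anchor")
print(invalid)
```

Because a closed walk may repeat blocks, mutual reachability in the graph with the definition block removed is enough to decide the question. If the token were defined in ``outer`` instead, every such walk would pass through the definition and the same query would return ``False``.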


				Examples for the Correctness of Program Transforms
				==================================================

				(This section is informative.)

				As implied by the rules in the previous sections, program transforms are correct
				with respect to convergent operations if they preserve or refine their
				semantics. This means that the set of communicating threads in the transformed
				program must have been possible in the original program.

				Program transforms with a single-threaded focus are generally conservatively
				correct if they do not sink or hoist convergent operations across a branch.
				This applies even to program transforms that change the control flow graph.

				For example, unrolling a loop that does not contain convergent operations
				cannot break any of the guarantees required for convergent operations outside
				of the loop.


				Loop unrolling examples
				-----------------------
				jlebar: In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:
				  - Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
				  - Partial unrolling by a trip multiple (so there's no "tail")
				  - Full unrolling, which eliminates the loop
				I think you're saying that only the first kind of unrolling is tricky.
				nhaehnle: Yes, that's correct, and I'm going to add essentially your three bullets at the top.

				We consider three kinds of loop unrolling here:

				* Partial unrolling with no known trip multiple, so a "tail" is required to
				collect the remaining elements.
				* Partial unrolling by a trip multiple, so no "tail" is required.
				* Full unrolling, which eliminates the loop.

				The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
				used. We illustrate the reasoning with some examples.

				First, an arbitrary loop that contains convergent operations can be unrolled
				in all of these ways, even with a "tail", if all convergent operations refer back
				to an anchor inside the loop. For example (in pseudo-code):

				.. code-block:: llvm

				while (counter > 0) {
				%tok = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				counter--;
				}

				This can be unrolled to:

				.. code-block:: llvm

				while (counter >= 2) {
				jlebar: It would help me if we could elaborate with half a sentence what the behavior change might be.
				nhaehnle: I gave it a try. It ended up being a full sentence though ;)
				%tok = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				%tok = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				counter -= 2;
				}
				jlebar: Do you mean that this kind of unrolling is forbidden? But if you're going to forbid all unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower. Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?
				nhaehnle: Yes, this kind of unrolling. This is already forbidden for uncontrolled convergent operations today. If you want to dig a little deeper, I would appreciate it if you could also add your review to D85605. That's a follow-up change for (1) correctness of loop unrolling with regards to the `loop` intrinsics rules and (2) relaxing some of the constraints that exist today where possible when all convergent ops are controlled (by an anchor in the loop).
				while (counter > 0) {
				%tok = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				counter--;
				}

				This is likely to change the behavior of the convergent operation if there
				are threads whose initial counter value is not a multiple of 2. In particular,
				all threads with an odd trip count are now likely to execute the convergent
				operation in their respective final iterations together because the
				jlebar: One thing I don't get from this example is what I should do as a frontend to LLVM. That is, when should I do this form, and when should I put a new anchor inside a loop? It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor. That is, I have to behave as though the compiler could transform
				    anchor() foo();
				into, effectively
				    if (threadIdx.x % 2 == 0) { anchor() convergent_fn(); } else { anchor(); convergent_fn(); }
				Something like this? OK, so I always have to use the convergence.loop() form. But then this is saying I can never unroll. ITYM that with convergence.loop(), I can never partially unroll with a "tail", which makes a lot of sense? But would help me if we were explicit about that.
				nhaehnle:
				> ITYM that with convergence.loop(), I can never partially unroll with a "tail", which makes a lot of sense?
				Yes, that's correct. Hopefully clearer with the addition at the top of the section.
				> It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.
				Right. The anchor essentially allows you to achieve the same thing as `__activemask` in CUDA, but in a more structured way that doesn't run into problems when you have two sides of an if/else both executing a sync operation with the same thread mask.
				underlying implementation is likely to try to group as many threads together
				as possible for the execution of the "tail".

				This change is allowed because the anchor intrinsic has implementation-defined
				convergence behavior and the loop unrolling transform is considered to be part
				of the implementation. Another way of reasoning is that while the likely
				behavior of the code has changed, the guarantees about its behavior have
				remained the same.

				If the loop contains uncontrolled convergent operations, this kind of unrolling
				is forbidden.

				Unrolling a loop with convergent operations that refer to tokens produced
				outside the loop is forbidden when a "tail" or "remainder" would have to
				be introduced. Consider:

				.. code-block:: llvm

				; (A)
				%outer = call token @llvm.experimental.convergence.anchor()
				while (counter > 0) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				; (B)
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				counter--;
				}
				; (C)

				To understand why unrolling is forbidden, consider two threads that execute
				the same dynamic instance of the anchor and then proceed with 3 and 4 loop
				iterations, respectively:

				.. code-block:: text

				jlebar: `counter > 1`?
				nhaehnle: Thanks, changing to `counter >= 2` because that's what I had in a similar example above.
				Thread 1: A B B B C
				Thread 2: A B B B B C

				By the dynamic rule on loop heart intrinsics, these threads execute the same
				dynamic instances of the loop intrinsic for the first 3 iterations, and then
				thread 2 executes another dynamic instance by itself.

				By the dynamic rule on general convergent operations, the threads execute
				the same dynamic instance of the ``@convergent.operation`` in the first 3
				iterations (that is, the dynamic instance executed by thread 1 in iteration
				n is the same as that executed by thread 2 in iteration n, for n = 1,2,3;
				the dynamic instance executed in iteration 1 is different from that in
				iteration 2, etc.).

				Now assume that the loop is unrolled by a factor of 2, which requires a
				remainder as follows:

				.. code-block:: llvm

				; (A)
				%outer = call token @llvm.experimental.convergence.anchor()
				while (counter >= 2) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				; (B)
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				counter -= 2;
				}
				; (C)
				if (counter > 0) {
				%remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				; (D)
				call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
				}
				; (E)

				First of all, note some interesting problems surrounding the loop intrinsic:

				1. It is not duplicated inside the unrolled loop. This is to comply with
				the static validity rules in the :ref:`Formal Rules <convergence_formal_rules>`
				section.

				2. It is unclear whether the loop intrinsic ought to be duplicated in the
				remainder, or whether the final ``@convergent.operation`` in D should just
				refer to either ``%inner`` (which is possible in SSA form) or directly to
				``%outer``. The decision made here is arbitrary and doesn't change the
				argument that follows. Ultimately, it simply doesn't matter because the
				transform is incorrect either way.

				The threads now execute the following sequences of blocks:

				.. code-block:: text

				Thread 1: A B C D E
				Thread 2: A B B C D E

				Analogous to the argument above, they execute the same dynamic instance of the
				``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
				of the unrolled loop, which corresponds to the first 2 iterations of the
				original loop.

				However, they execute different static calls to ``@convergent.operation`` for
				the 3rd iteration of the original loop. In thread 1, that iteration corresponds
				to the call in the remainder, while in thread 2 it corresponds to the first
				call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
				different dynamic instances, which means that the set of communicating threads
				for the 3rd iteration of the original loop is different. This is why the
				unrolling is incorrect.
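
The mismatch can be made concrete with a small model. The Python sketch below is a hedged illustration rather than anything from the patch: it identifies a dynamic instance by the static call site plus the loop intrinsic instance whose token the call uses, and the names ``original``, ``unrolled``, ``B.first``, ``B.second`` and ``D`` are labels chosen for this example.

```python
def original(trips):
    """Original loop: iteration j runs the single call site with token %inner(j)."""
    return {j: ("B", ("inner", j)) for j in range(1, trips + 1)}


def unrolled(trips):
    """Loop unrolled by 2 with a remainder, keyed by ORIGINAL iteration number."""
    inst = {}
    j = 1          # original iteration being emulated
    k = 0          # iteration count of the unrolled loop
    remaining = trips
    while remaining >= 2:
        k += 1
        inst[j] = ("B.first", ("inner", k))
        inst[j + 1] = ("B.second", ("inner", k))
        j += 2
        remaining -= 2
    if remaining > 0:
        # The tail uses its own call site and its own loop intrinsic.
        inst[j] = ("D", ("remainder", 1))
    return inst


# Threads with 3 and 4 trips, as in the text.
same_before = original(3)[3] == original(4)[3]
same_after = unrolled(3)[3] == unrolled(4)[3]
print(same_before, same_after)
```

In this model the two threads share a dynamic instance for the third original iteration before the transform but not after it, while the first two iterations still match. That changed set of communicating threads is what makes the unrolling incorrect.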

				On the other hand, unrolling without a "tail" is allowed. For example, assuming
				that the trip count is known to be a multiple of 2, we can unroll the loop
				as follows:

				.. code-block:: llvm

				%outer = call token @llvm.experimental.convergence.anchor()
				while (counter > 0) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				counter -= 2;
				}

				Note again that the loop intrinsic is not duplicated.

				The
				:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
				intrinsic is typically expected to appear in the header of a natural loop.
				However, it can also appear in non-header blocks of a loop. In that case, the
				loop generally cannot be unrolled.


				Hoisting and sinking
				--------------------

				In general, hoisting and sinking of convergent operations is forbidden. This is
				because moving the operation to a different point in control flow generally
				changes the set of threads that reach the operation and therefore, by the first
				dynamic rule in the :ref:`Formal Rules <convergence_formal_rules>` section,
				the set of threads that execute the same dynamic instance of the operation.
				By definition, this changes the set of threads that participate in the
				communication of the convergent operation, which will typically change its
				result.

				There are a number of exceptions, though most of them require additional
				knowledge.

				For example, hoisting and sinking across uniform conditional branches -- i.e.,
				conditional branches where within every possible relevant set of threads, all
				threads will always take the same direction -- is generally allowed. See the
				end of the
				:ref:`example of reductions inside control flow <convergence_example_reductions>`
				for a brief discussion.

				Some convergent operations can be hoisted but not sunk, or vice versa. A simple
				example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
				operand of the thread identified by ``id``, where thread IDs are fixed and
				assigned to each thread at launch. The result is undefined (or perhaps there is
				UB, depending on the language and environment) if thread ``id`` is not in the
				communicating set of threads. So hoisting is allowed in the following
				pseudo-code example:

				.. code-block:: llvm

				define void @example(...) convergent {
				%entry = call token @llvm.experimental.convergence.entry()
				%data = ...
				%id = ...
				if (condition) {
				%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
				...
				} else {
				%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
				...
				}
				}

				After hoisting the calls to ``@subgroupShuffle``, the communicating set of
				threads is the union of the two sets of threads in the original program, so
				``%id`` can only go "out of range" after hoisting if it did so in the original
				program.
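
To see why the union of the two sets is harmless, consider a toy model of the shuffle. This Python sketch is an assumption-laden illustration: ``subgroup_shuffle``, the ``UNDEFINED`` sentinel, and the thread data are all invented here, and an out-of-range ``id`` is modelled as an undefined result rather than UB.

```python
UNDEFINED = object()  # stand-in for an undefined shuffle result


def subgroup_shuffle(data, ids, participants):
    """Per-thread result when `participants` execute the shuffle together."""
    return {
        t: data[ids[t]] if ids[t] in participants else UNDEFINED
        for t in participants
    }


data = {0: 10, 1: 11, 2: 12, 3: 13}
taken, not_taken = {0, 1}, {2, 3}   # threads on each side of the branch

# Case 1: every thread reads from a thread on its own side of the branch.
ids = {0: 1, 1: 0, 2: 3, 3: 2}
before = {**subgroup_shuffle(data, ids, taken),
          **subgroup_shuffle(data, ids, not_taken)}
after = subgroup_shuffle(data, ids, taken | not_taken)

# Case 2: thread 0 reads from thread 2, which is on the other side.
ids2 = {0: 2, 1: 0, 2: 3, 3: 2}
before2 = {**subgroup_shuffle(data, ids2, taken),
           **subgroup_shuffle(data, ids2, not_taken)}
after2 = subgroup_shuffle(data, ids2, taken | not_taken)

print(before == after, before2[0] is UNDEFINED, after2[0])
```

In the first case hoisting leaves every result unchanged. In the second case the original program already had an out-of-range read, and hoisting replaces the undefined result with a defined one, which is a refinement.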

				However, speculative execution of ``@subgroupShuffle`` in the following program
				may be forbidden:

				.. code-block:: llvm

				define void @example(...) convergent {
				%entry = call token @llvm.experimental.convergence.entry()
				%data = ...
				%id = ...
				if (condition) {
				%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
				...
				}
				}

				There is no guarantee about the value of ``%id`` in the threads where
				``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
				``%id`` is outside of the set of communicating threads, then speculating and
				hoisting ``@subgroupShuffle`` might introduce UB.

				On the other hand, if ``@subgroupShuffle`` is defined such that it merely
				produces an undefined value or poison as result when ``%id`` is "out of range",
				then speculating is okay.

				Even though
				:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
				is marked as ``convergent``, it can be sunk in some cases. For example, in
				pseudo-code:

				.. code-block:: llvm

				%tok = call token @llvm.experimental.convergence.anchor()
				if (condition) {
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				}

				Assuming that ``%tok`` is only used inside the conditional block, the anchor can
				be sunk. The rationale is two-fold. First, the anchor has implementation-defined
				behavior, and the sinking is part of the implementation. Second, already in the
				original program, the set of threads that communicates in the
				``@convergent.operation`` is automatically restricted to the threads for which
				``condition`` is true.

				Anchors can be hoisted in acyclic control flow. For example:

				.. code-block:: llvm

				if (condition) {
				%tok1 = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
				} else {
				%tok2 = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
				}

				The anchors can be hoisted, resulting in:

				.. code-block:: llvm

				%tok = call token @llvm.experimental.convergence.anchor()
				if (condition) {
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				} else {
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				}

				The behavior is unchanged, since each of the static convergent operations only
				ever communicates with threads that have the same ``condition`` value.
				By contrast, hoisting the convergent operations themselves is forbidden.

				Hoisting and sinking anchors out of and into loops is forbidden. For example:

				.. code-block:: llvm

				for (;;) {
				%tok = call token @llvm.experimental.convergence.anchor()
				call void @convergent.operation() [ "convergencectrl"(token %tok) ]
				}

				Hoisting the anchor would make the program invalid according to the static
				validity rules. Conversely:

				.. code-block:: llvm

				%outer = call token @llvm.experimental.convergence.anchor()
				while (counter > 0) {
				%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
				call void @convergent.operation() [ "convergencectrl"(token %inner) ]
				counter--;
				}

				The program would stay valid if the anchor was sunk into the loop, but its
				behavior could end up being different. If the anchor is inside the loop, then
				each loop iteration has a new dynamic instance of the anchor, and the set of
				threads participating in those dynamic instances of the anchor could be
				different in arbitrary implementation-defined ways. Via the dynamic rules
				about dynamic instances of convergent operations, this then implies that the
				set of threads executing ``@convergent.operation`` could be different in
				each loop iteration in arbitrary implementation-defined ways.

				Convergent operations can be sunk together with their anchor. Again in
				pseudo-code:

				.. code-block:: llvm

				%tok = call token @llvm.experimental.convergence.anchor()
				%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				if (condition) {
				use(%a, %b)
				}

				Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
				block, all can be sunk together:

				.. code-block:: llvm

				if (condition) {
				%tok = call token @llvm.experimental.convergence.anchor()
				%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				use(%a, %b)
				}

				The rationale is that the anchor intrinsic has implementation-defined behavior,
				and the sinking transform is considered to be part of the implementation:
				the sinking will restrict the set of communicating threads to those for which
				``condition`` is true, but that could have happened in the original program
				anyway for some arbitrary other reason.

				However, sinking only the convergent operation producing ``%b`` would be
				incorrect. That would allow the (remainder of the) implementation to let
				threads for which ``condition`` is false participate in the same dynamic
				instance of the anchor and therefore in the calculation of ``%a``, and so the
				set of threads communicating for the calculations of ``%a`` and ``%b`` could be
				different, which the original program doesn't allow.

				Note that the entry intrinsic behaves differently. Sinking the convergent
				operations is forbidden in the following snippet:

				.. code-block:: llvm

				%tok = call token @llvm.experimental.convergence.entry()
				%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
				if (condition) {
				use(%a, %b)
				}

llvm/docs/LangRef.rst


...
``builtin``
direct calls to functions that are declared with the ``nobuiltin``		direct calls to functions that are declared with the ``nobuiltin``
attribute.		attribute.
``cold``		``cold``
This attribute indicates that this function is rarely called. When		This attribute indicates that this function is rarely called. When
computing edge weights, basic blocks post-dominated by a cold		computing edge weights, basic blocks post-dominated by a cold
function call are also considered to be cold; and, thus, given low		function call are also considered to be cold; and, thus, given low
weight.		weight.
``convergent``		``convergent``
In some parallel execution models, there exist operations that cannot be		Some parallel execution environments execute threads in groups that allow
made control-dependent on any additional values. We call such operations		efficient communication within the group, among a subset of threads that
``convergent``, and mark them with this attribute.		is implicitly defined by control flow. We call such operations
		``convergent`` and mark them with this attribute.
The ``convergent`` attribute may appear on functions or call/invoke
instructions. When it appears on a function, it indicates that calls to		The ``convergent`` attribute may appear on call/invoke instructions to
this function should not be made control-dependent on additional values.		indicate that the instruction is a convergent operation, or on functions
For example, the intrinsic ``llvm.nvvm.barrier0`` is ``convergent``, so		to indicate that calls to this function are convergent operations.
calls to this intrinsic cannot be made control-dependent on additional
values.		The presence of this attribute indicates that certain program transforms
		involving control flow are forbidden. For a detailed description, see the
When it appears on a call/invoke, the ``convergent`` attribute indicates		:doc:`ConvergentOperations` document.
		t-tye: Use the Sphinx document reference: :doc:`ConvergentOperations`
that we should treat the call as though we're calling a convergent
function. This is particularly useful on indirect calls; without this we
may treat such calls as though the target is non-convergent.

The optimizer may remove the ``convergent`` attribute on functions when it		The optimizer may remove the ``convergent`` attribute on functions when it
can prove that the function does not execute any convergent operations.		can prove that the function does not execute
		``llvm.experimental.convergence.entry`` or uncontrolled convergent
		operations (see :ref:`dynamic_instances_and_convergence_tokens`).
		t-tye: Add reference to the definition of the term? (see :ref:`dynamic_instances_and_convergence_tokens`)
Similarly, the optimizer may remove ``convergent`` on calls/invokes when it		Similarly, the optimizer may remove ``convergent`` on calls/invokes when it
can prove that the call/invoke cannot call a convergent function.		can prove that the call/invoke cannot call a convergent function.
``inaccessiblememonly``		``inaccessiblememonly``
This attribute indicates that the function may only access memory that		This attribute indicates that the function may only access memory that
is not accessible by the module being compiled. This is a weaker form		is not accessible by the module being compiled. This is a weaker form
of ``readnone``. If the function reads or writes other memory, the		of ``readnone``. If the function reads or writes other memory, the
behavior is undefined.		behavior is undefined.
``inaccessiblemem_or_argmemonly``		``inaccessiblemem_or_argmemonly``
...
``willreturn``
Annotated functions may still raise an exception, i.a., ``nounwind`` is not implied.		Annotated functions may still raise an exception, i.a., ``nounwind`` is not implied.
If an invocation of an annotated function does not return control back		If an invocation of an annotated function does not return control back
to a point in the call stack, the behavior is undefined.		to a point in the call stack, the behavior is undefined.
``nosync``		``nosync``
This function attribute indicates that the function does not communicate		This function attribute indicates that the function does not communicate
(synchronize) with another thread through memory or other well-defined means.		(synchronize) with another thread through memory or other well-defined means.
Synchronization is considered possible in the presence of `atomic` accesses		Synchronization is considered possible in the presence of `atomic` accesses
that enforce an order, thus not "unordered" and "monotonic", `volatile` accesses,		that enforce an order, thus not "unordered" and "monotonic", `volatile` accesses,
as well as `convergent` function calls. Note that through `convergent` function calls		as well as `convergent` function calls.
non-memory communication, e.g., cross-lane operations, are possible and are also
considered synchronization. However `convergent` does not contradict `nosync`.		Note that `convergent` operations can involve communication that is
If an annotated function does ever synchronize with another thread,		considered to be not through memory and does not necessarily imply an
		arsenm (inline comment, not done): Missing space in notnecessarily
		ordering between threads for the purposes of the memory model. Therefore,
		an operation can be both `convergent` and `nosync`.

		If a `nosync` function does ever synchronize with another thread,
the behavior is undefined.		the behavior is undefined.
``nounwind``		``nounwind``
This function attribute indicates that the function never raises an		This function attribute indicates that the function never raises an
exception. If the function does raise an exception, its runtime		exception. If the function does raise an exception, its runtime
behavior is undefined. However, functions marked nounwind may still		behavior is undefined. However, functions marked nounwind may still
trap or generate asynchronous exceptions. Exception handling schemes		trap or generate asynchronous exceptions. Exception handling schemes
that are recognized by LLVM to handle asynchronous exceptions, such		that are recognized by LLVM to handle asynchronous exceptions, such
as SEH, will still provide their implementation defined semantics.		as SEH, will still provide their implementation defined semantics.
▲ Show 20 Lines • Show All 666 Lines • ▼ Show 20 Lines
A "gc-live" operand bundle is only valid on a :ref:`gc.statepoint <gc_statepoint>`		A "gc-live" operand bundle is only valid on a :ref:`gc.statepoint <gc_statepoint>`
intrinsic. The operand bundle must contain every pointer to a garbage collected		intrinsic. The operand bundle must contain every pointer to a garbage collected
object which potentially needs to be updated by the garbage collector.		object which potentially needs to be updated by the garbage collector.

When lowered, any relocated value will be recorded in the corresponding		When lowered, any relocated value will be recorded in the corresponding
:ref:`stackmap entry <statepoint-stackmap-format>`. See the intrinsic description		:ref:`stackmap entry <statepoint-stackmap-format>`. See the intrinsic description
for further details.		for further details.

		Convergence Control Operand Bundles
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		A "convergencectrl" operand bundle is only valid on a ``convergent`` operation.
		When present, the operand bundle must contain exactly one value of token type.
		See the :doc:`ConvergentOperations` document for details.
		t-tye (inline comment, done): :doc:`ConvergentOperations`

.. _moduleasm:		.. _moduleasm:

Module-Level Inline Assembly		Module-Level Inline Assembly
----------------------------		----------------------------

Modules may contain "module-level inline asm" blocks, which corresponds		Modules may contain "module-level inline asm" blocks, which corresponds
to the GCC "file scope inline asm" blocks. These blocks are internally		to the GCC "file scope inline asm" blocks. These blocks are internally
concatenated by LLVM and treated as a single unit, but may be separated		concatenated by LLVM and treated as a single unit, but may be separated
▲ Show 20 Lines • Show All 13,790 Lines • ▼ Show 20 Lines
Examples:		Examples:
"""""""""		"""""""""

.. code-block:: llvm		.. code-block:: llvm

%a = load i16, i16* @x, align 2		%a = load i16, i16* @x, align 2
%res = call float @llvm.convert.from.fp16(i16 %a)		%res = call float @llvm.convert.from.fp16(i16 %a)

		Convergence Intrinsics
		----------------------

		The LLVM convergence intrinsics for controlling the semantics of ``convergent``
		operations, which all start with the ``llvm.experimental.convergence.``
		prefix, are described in the :doc:`ConvergentOperations` document.
		t-tye (inline comment, done): :doc:`ConvergentOperations`

.. _dbg_intrinsics:		.. _dbg_intrinsics:

Debugger Intrinsics		Debugger Intrinsics
-------------------		-------------------

The LLVM debugger intrinsics (which all start with ``llvm.dbg.``		The LLVM debugger intrinsics (which all start with ``llvm.dbg.``
prefix), are described in the `LLVM Source Level		prefix), are described in the `LLVM Source Level
Debugging <SourceLevelDebugging.html#format-common-intrinsics>`_		Debugging <SourceLevelDebugging.html#format-common-intrinsics>`_
▲ Show 20 Lines • Show All 4,658 Lines • Show Last 20 Lines

llvm/docs/Reference.rst

Show All 9 Lines	.. toctree::
:hidden:		:hidden:

Atomics		Atomics
BitCodeFormat		BitCodeFormat
BlockFrequencyTerminology		BlockFrequencyTerminology
BranchWeightMetadata		BranchWeightMetadata
Bugpoint		Bugpoint
CommandGuide/index		CommandGuide/index
		ConvergentOperations
Coroutines		Coroutines
DependenceGraphs/index		DependenceGraphs/index
ExceptionHandling		ExceptionHandling
Extensions		Extensions
FaultMaps		FaultMaps
FuzzingLLVM		FuzzingLLVM
GarbageCollection		GarbageCollection
GetElementPtr		GetElementPtr
▲ Show 20 Lines • Show All 99 Lines • ▼ Show 20 Lines

:doc:`Machine IR (MIR) Format Reference Manual <MIRLangRef>`		:doc:`Machine IR (MIR) Format Reference Manual <MIRLangRef>`
A reference manual for the MIR serialization format, which is used to test		A reference manual for the MIR serialization format, which is used to test
LLVM's code generation passes.		LLVM's code generation passes.

:doc:`GlobalISel/index`		:doc:`GlobalISel/index`
This describes the prototype instruction selection replacement, GlobalISel.		This describes the prototype instruction selection replacement, GlobalISel.

		:doc:`ConvergentOperations`
		Description of ``convergent`` operation semantics and related intrinsics.

=====================		=====================
Testing and Debugging		Testing and Debugging
=====================		=====================

:doc:`LLVM Testing Infrastructure Guide <TestingGuide>`		:doc:`LLVM Testing Infrastructure Guide <TestingGuide>`
A reference manual for using the LLVM testing infrastructure.		A reference manual for using the LLVM testing infrastructure.

:doc:`TestSuiteGuide`		:doc:`TestSuiteGuide`
▲ Show 20 Lines • Show All 77 Lines • Show Last 20 Lines

llvm/include/llvm/IR/Intrinsics.td

Show First 20 Lines • Show All 1,550 Lines • ▼ Show 20 Lines	def int_preserve_struct_access_index : Intrinsic<[llvm_anyptr_ty],
llvm_i32_ty],		llvm_i32_ty],
[IntrNoMem,		[IntrNoMem,
ImmArg<ArgIndex<1>>,		ImmArg<ArgIndex<1>>,
ImmArg<ArgIndex<2>>]>;		ImmArg<ArgIndex<2>>]>;

//===---------- Intrinsics to query properties of scalable vectors --------===//		//===---------- Intrinsics to query properties of scalable vectors --------===//
def int_vscale : Intrinsic<[llvm_anyint_ty], [], [IntrNoMem]>;		def int_vscale : Intrinsic<[llvm_anyint_ty], [], [IntrNoMem]>;

//===----------------------------------------------------------------------===//		//===------- Convergence Intrinsics ---------------------------------------===//

		def int_experimental_convergence_entry
		: Intrinsic<[llvm_token_ty], [], [IntrNoMem, IntrConvergent]>;
		def int_experimental_convergence_anchor
		: Intrinsic<[llvm_token_ty], [], [IntrNoMem, IntrConvergent]>;
		def int_experimental_convergence_loop
		: Intrinsic<[llvm_token_ty], [], [IntrNoMem, IntrConvergent]>;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Target-specific intrinsics		// Target-specific intrinsics
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

include "llvm/IR/IntrinsicsPowerPC.td"		include "llvm/IR/IntrinsicsPowerPC.td"
include "llvm/IR/IntrinsicsX86.td"		include "llvm/IR/IntrinsicsX86.td"
include "llvm/IR/IntrinsicsARM.td"		include "llvm/IR/IntrinsicsARM.td"
Show All 10 Lines

llvm/include/llvm/IR/LLVMContext.h

//===- llvm/LLVMContext.h - Class for managing "global" state ---- C++ --===//		//===- llvm/LLVMContext.h - Class for managing "global" state ---- C++ --===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	#undef LLVM_FIXED_MD_KIND
};		};

/// Known operand bundle tag IDs, which always have the same value. All		/// Known operand bundle tag IDs, which always have the same value. All
/// operand bundle tags that LLVM has special knowledge of are listed here.		/// operand bundle tags that LLVM has special knowledge of are listed here.
/// Additionally, this scheme allows LLVM to efficiently check for specific		/// Additionally, this scheme allows LLVM to efficiently check for specific
/// operand bundle tags without comparing strings. Keep this in sync with		/// operand bundle tags without comparing strings. Keep this in sync with
/// LLVMContext::LLVMContext().		/// LLVMContext::LLVMContext().
enum : unsigned {		enum : unsigned {
OB_deopt = 0, // "deopt"		OB_deopt = 0, // "deopt"
OB_funclet = 1, // "funclet"		OB_funclet = 1, // "funclet"
OB_gc_transition = 2, // "gc-transition"		OB_gc_transition = 2, // "gc-transition"
OB_cfguardtarget = 3, // "cfguardtarget"		OB_cfguardtarget = 3, // "cfguardtarget"
OB_preallocated = 4, // "preallocated"		OB_preallocated = 4, // "preallocated"
OB_gc_live = 5, // "gc-live"		OB_gc_live = 5, // "gc-live"
		OB_convergencectrl = 6, // "convergencectrl"
};		};

/// getMDKindID - Return a unique non-zero ID for the specified metadata kind.		/// getMDKindID - Return a unique non-zero ID for the specified metadata kind.
/// This ID is uniqued across modules in the current LLVMContext.		/// This ID is uniqued across modules in the current LLVMContext.
unsigned getMDKindID(StringRef Name) const;		unsigned getMDKindID(StringRef Name) const;

/// getMDKindNames - Populate client supplied SmallVector with the name for		/// getMDKindNames - Populate client supplied SmallVector with the name for
/// custom metadata IDs registered in this LLVMContext.		/// custom metadata IDs registered in this LLVMContext.
▲ Show 20 Lines • Show All 241 Lines • Show Last 20 Lines

llvm/lib/IR/LLVMContext.cpp

//===-- LLVMContext.cpp - Implement LLVMContext ---------------------------===//		//===-- LLVMContext.cpp - Implement LLVMContext ---------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
▲ Show 20 Lines • Show All 64 Lines • ▼ Show 20 Lines	assert(PreallocatedEntry->second == LLVMContext::OB_preallocated &&
"preallocated operand bundle id drifted!");		"preallocated operand bundle id drifted!");
(void)PreallocatedEntry;		(void)PreallocatedEntry;

auto *GCLiveEntry = pImpl->getOrInsertBundleTag("gc-live");		auto *GCLiveEntry = pImpl->getOrInsertBundleTag("gc-live");
assert(GCLiveEntry->second == LLVMContext::OB_gc_live &&		assert(GCLiveEntry->second == LLVMContext::OB_gc_live &&
"gc-transition operand bundle id drifted!");		"gc-transition operand bundle id drifted!");
(void)GCLiveEntry;		(void)GCLiveEntry;

		auto *ConvergenceCtrlEntry = pImpl->getOrInsertBundleTag("convergencectrl");
		assert(ConvergenceCtrlEntry->second == LLVMContext::OB_convergencectrl &&
		"convergencectrl operand bundle id drifted!");
		(void)ConvergenceCtrlEntry;

SyncScope::ID SingleThreadSSID =		SyncScope::ID SingleThreadSSID =
pImpl->getOrInsertSyncScopeID("singlethread");		pImpl->getOrInsertSyncScopeID("singlethread");
assert(SingleThreadSSID == SyncScope::SingleThread &&		assert(SingleThreadSSID == SyncScope::SingleThread &&
"singlethread synchronization scope ID drifted!");		"singlethread synchronization scope ID drifted!");
(void)SingleThreadSSID;		(void)SingleThreadSSID;

SyncScope::ID SystemSSID =		SyncScope::ID SystemSSID =
pImpl->getOrInsertSyncScopeID("");		pImpl->getOrInsertSyncScopeID("");
▲ Show 20 Lines • Show All 270 Lines • Show Last 20 Lines

llvm/lib/IR/Verifier.cpp

//===-- Verifier.cpp - Implement the Module Verifier -----------------------==//		//===-- Verifier.cpp - Implement the Module Verifier -----------------------==//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
Show All 36 Lines

#include "llvm/IR/Verifier.h"		#include "llvm/IR/Verifier.h"
#include "llvm/ADT/APFloat.h"		#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"		#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
		#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/StringMap.h"		#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/ADT/ilist.h"		#include "llvm/ADT/ilist.h"
		#include "llvm/Analysis/CycleInfo.h"
#include "llvm/BinaryFormat/Dwarf.h"		#include "llvm/BinaryFormat/Dwarf.h"
#include "llvm/IR/Argument.h"		#include "llvm/IR/Argument.h"
#include "llvm/IR/Attributes.h"		#include "llvm/IR/Attributes.h"
#include "llvm/IR/BasicBlock.h"		#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/CallingConv.h"		#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Comdat.h"		#include "llvm/IR/Comdat.h"
#include "llvm/IR/Constant.h"		#include "llvm/IR/Constant.h"
▲ Show 20 Lines • Show All 207 Lines • ▼ Show 20 Lines	class Verifier : public InstVisitor<Verifier>, VerifierSupport {

/// Whether we've seen a call to @llvm.localescape in this function		/// Whether we've seen a call to @llvm.localescape in this function
/// already.		/// already.
bool SawFrameEscape;		bool SawFrameEscape;

/// Whether the current function has a DISubprogram attached to it.		/// Whether the current function has a DISubprogram attached to it.
bool HasDebugInfo = false;		bool HasDebugInfo = false;

		/// Whether the current function has convergencectrl operand bundles.
		bool HasConvergenceControl = false;

/// Whether source was present on the first DIFile encountered in each CU.		/// Whether source was present on the first DIFile encountered in each CU.
DenseMap<const DICompileUnit *, bool> HasSourceDebugInfo;		DenseMap<const DICompileUnit *, bool> HasSourceDebugInfo;

/// Stores the count of how many objects were passed to llvm.localescape for a		/// Stores the count of how many objects were passed to llvm.localescape for a
/// given function and the largest index passed to llvm.localrecover.		/// given function and the largest index passed to llvm.localrecover.
DenseMap<Function *, std::pair<unsigned, unsigned>> FrameEscapeInfo;		DenseMap<Function *, std::pair<unsigned, unsigned>> FrameEscapeInfo;

// Maps catchswitches and cleanuppads that unwind to siblings to the		// Maps catchswitches and cleanuppads that unwind to siblings to the
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	for (const BasicBlock &BB : F) {
}		}
return false;		return false;
}		}

Broken = false;		Broken = false;
// FIXME: We strip const here because the inst visitor strips const.		// FIXME: We strip const here because the inst visitor strips const.
visit(const_cast<Function &>(F));		visit(const_cast<Function &>(F));
verifySiblingFuncletUnwinds();		verifySiblingFuncletUnwinds();
		if (HasConvergenceControl)
		verifyConvergenceControl(const_cast<Function &>(F));
InstsInThisBlock.clear();		InstsInThisBlock.clear();
DebugFnArgs.clear();		DebugFnArgs.clear();
LandingPadResultTy = nullptr;		LandingPadResultTy = nullptr;
SawFrameEscape = false;		SawFrameEscape = false;
SiblingFuncletInfo.clear();		SiblingFuncletInfo.clear();
		HasConvergenceControl = false;

return !Broken;		return !Broken;
}		}

/// Verify the module that this instance of \c Verifier was initialized with.		/// Verify the module that this instance of \c Verifier was initialized with.
bool verify() {		bool verify() {
Broken = false;		Broken = false;

▲ Show 20 Lines • Show All 143 Lines • ▼ Show 20 Lines	void verifyFunctionAttrs(FunctionType *FT, AttributeList Attrs,
const Value *V, bool IsIntrinsic);		const Value *V, bool IsIntrinsic);
void verifyFunctionMetadata(ArrayRef<std::pair<unsigned, MDNode *>> MDs);		void verifyFunctionMetadata(ArrayRef<std::pair<unsigned, MDNode *>> MDs);

void visitConstantExprsRecursively(const Constant *EntryC);		void visitConstantExprsRecursively(const Constant *EntryC);
void visitConstantExpr(const ConstantExpr *CE);		void visitConstantExpr(const ConstantExpr *CE);
void verifyStatepoint(const CallBase &Call);		void verifyStatepoint(const CallBase &Call);
void verifyFrameRecoverIndices();		void verifyFrameRecoverIndices();
void verifySiblingFuncletUnwinds();		void verifySiblingFuncletUnwinds();
		void verifyConvergenceControl(Function &F);

void verifyFragmentExpression(const DbgVariableIntrinsic &I);		void verifyFragmentExpression(const DbgVariableIntrinsic &I);
template <typename ValueOrMetadata>		template <typename ValueOrMetadata>
void verifyFragmentExpression(const DIVariable &V,		void verifyFragmentExpression(const DIVariable &V,
DIExpression::FragmentInfo Fragment,		DIExpression::FragmentInfo Fragment,
ValueOrMetadata *Desc);		ValueOrMetadata *Desc);
void verifyFnArgs(const DbgVariableIntrinsic &I);		void verifyFnArgs(const DbgVariableIntrinsic &I);
void verifyNotEntryValue(const DbgVariableIntrinsic &I);		void verifyNotEntryValue(const DbgVariableIntrinsic &I);
▲ Show 20 Lines • Show All 1,705 Lines • ▼ Show 20 Lines	do {
Active.insert(PredPad);		Active.insert(PredPad);
} while (true);		} while (true);
// Each node only has one successor, so we've walked all the active		// Each node only has one successor, so we've walked all the active
// nodes' successors.		// nodes' successors.
Active.clear();		Active.clear();
}		}
}		}

		void Verifier::verifyConvergenceControl(Function &F) {
		BasicBlock *Entry = const_cast<BasicBlock *>(&F.getEntryBlock());
		DenseMap<BasicBlock *, SmallVector<CallBase *, 8>> LiveTokenMap;
		DenseMap<Cycle *, CallBase *> CycleHearts;

		// Just like the DominatorTree, compute the CycleInfo locally so that we
		// can run the verifier outside of a pass manager and we don't rely on
		// potentially out-dated analysis results.
		CycleInfo CI;
		CI.compute(Entry);

		ReversePostOrderTraversal<Function *> RPOT(&F);
		for (BasicBlock *BB : RPOT) {
		SmallVector<CallBase *, 8> LiveTokens;
		auto LTIt = LiveTokenMap.find(BB);
		if (LTIt != LiveTokenMap.end()) {
		LiveTokens = std::move(LTIt->second);
		LiveTokenMap.erase(LTIt);
		}

		Cycle *BBCycle = CI.getCycle(BB);

		for (Instruction &I : *BB) {
		CallBase *CB = dyn_cast<CallBase>(&I);
		if (!CB)
		continue;

		auto Bundle = CB->getOperandBundle(LLVMContext::OB_convergencectrl);
		if (Bundle) {
		Assert(Bundle->Inputs.size() == 1 &&
		Bundle->Inputs[0]->getType()->isTokenTy(),
		"the 'convergencectrl' bundle requires exactly one token use",
		CB);

		Value *Token = Bundle->Inputs[0].get();
		auto *Def = dyn_cast<CallBase>(Token);
		Assert(Def != nullptr,
		"convergence control tokens can only be produced by call "
		"instructions",
		Token);

		Assert(llvm::is_contained(LiveTokens, Token),
		"convergence region is not well-nested", Token, CB);

		while (LiveTokens.back() != Token)
		LiveTokens.pop_back();

		// Check static rules about cycles.
		BasicBlock *DefBB = Def->getParent();
		if (DefBB != BB) {
		Cycle *DefCycle = CI.getCycle(DefBB);
		if (!CI.contains(BBCycle, DefCycle)) {
		auto *II = dyn_cast<IntrinsicInst>(CB);
		Assert(II && II->getIntrinsicID() ==
		Intrinsic::experimental_convergence_loop,
		"convergence token use by an instruction other than "
		"llvm.experimental.convergence.loop in a cycle that does "
		"not contain the token's definition",
		CB);

		// Now check the rule for loop heart intrinsics.
		Cycle *UpperBound = CI.findSmallestCommonCycle(DefCycle, BBCycle);
		for (Cycle *C = BBCycle; C != UpperBound; C = C->getParent()) {
		Assert(!CycleHearts.count(C),
		"two static convergence token uses in a cycle that does "
		"not contain either token's definition",
		CB, CycleHearts[C]);
		CycleHearts[C] = CB;
		}
		}
		}
		}

		if (CB->getType()->isTokenTy())
		LiveTokens.push_back(CB);
		}

		// Propagate token liveness
		for (BasicBlock *Succ : llvm::successors(BB)) {
		DomTreeNode *SuccNode = DT.getNode(Succ);
		LTIt = LiveTokenMap.find(Succ);
		if (LTIt == LiveTokenMap.end()) {
		// We're the first predecessor: all tokens which dominate the
		// successor are live for now.
		LTIt = LiveTokenMap.try_emplace(Succ).first;
		for (CallBase *LiveToken : LiveTokens) {
		if (!DT.dominates(DT.getNode(LiveToken->getParent()), SuccNode))
		break;
		LTIt->second.push_back(LiveToken);
		}
		} else {
		// Compute the intersection of live tokens.
		auto It = llvm::partition(LTIt->second, [&LiveTokens](CallBase *Token) {
		return llvm::is_contained(LiveTokens, Token);
		});
		LTIt->second.erase(It, LTIt->second.end());
		}
		}
		}
		}

// visitFunction - Verify that a function is ok.		// visitFunction - Verify that a function is ok.
//		//
void Verifier::visitFunction(const Function &F) {		void Verifier::visitFunction(const Function &F) {
visitGlobalValue(F);		visitGlobalValue(F);

// Check function arguments.		// Check function arguments.
FunctionType *FT = F.getFunctionType();		FunctionType *FT = F.getFunctionType();
unsigned NumArgs = F.arg_size();		unsigned NumArgs = F.arg_size();
▲ Show 20 Lines • Show All 930 Lines • ▼ Show 20 Lines	void Verifier::visitCallBase(CallBase &Call) {
// do so causes assertion failures when the inliner sets up inline scope info.		// do so causes assertion failures when the inliner sets up inline scope info.
if (Call.getFunction()->getSubprogram() && Call.getCalledFunction() &&		if (Call.getFunction()->getSubprogram() && Call.getCalledFunction() &&
Call.getCalledFunction()->getSubprogram())		Call.getCalledFunction()->getSubprogram())
AssertDI(Call.getDebugLoc(),		AssertDI(Call.getDebugLoc(),
"inlinable function call in a function with "		"inlinable function call in a function with "
"debug info must have a !dbg location",		"debug info must have a !dbg location",
Call);		Call);

		if (Call.getOperandBundle(LLVMContext::OB_convergencectrl))
		HasConvergenceControl = true;

visitInstruction(Call);		visitInstruction(Call);
}		}

/// Two types are "congruent" if they are identical, or if they are both pointer		/// Two types are "congruent" if they are identical, or if they are both pointer
/// types with different pointee types and the same address space.		/// types with different pointee types and the same address space.
static bool isTypeCongruent(Type *L, Type *R) {		static bool isTypeCongruent(Type *L, Type *R) {
if (L == R)		if (L == R)
return true;		return true;
▲ Show 20 Lines • Show All 1,867 Lines • ▼ Show 20 Lines	Assert(ResultTy->getNumElements() ==
"Result of a matrix operation does not fit in the returned vector!");		"Result of a matrix operation does not fit in the returned vector!");

if (Stride)		if (Stride)
Assert(Stride->getZExtValue() >= NumRows->getZExtValue(),		Assert(Stride->getZExtValue() >= NumRows->getZExtValue(),
"Stride must be greater or equal than the number of rows!", IF);		"Stride must be greater or equal than the number of rows!", IF);

break;		break;
}		}
		case Intrinsic::experimental_convergence_entry:
		case Intrinsic::experimental_convergence_anchor:
		Assert(!Call.getOperandBundle(LLVMContext::OB_convergencectrl),
		"entry or anchor intrinsic must not have a convergencectrl bundle",
		&Call);
		break;
		case Intrinsic::experimental_convergence_loop:
		Assert(Call.getOperandBundle(LLVMContext::OB_convergencectrl),
		"loop heart intrinsic must have a convergencectrl bundle", &Call);
		break;
};		};
}		}

/// Carefully grab the subprogram from a local scope.		/// Carefully grab the subprogram from a local scope.
///		///
/// This carefully grabs the subprogram from a local scope, avoiding the		/// This carefully grabs the subprogram from a local scope, avoiding the
/// built-in assertions that would typically fire.		/// built-in assertions that would typically fire.
static DISubprogram *getSubprogram(Metadata *LocalScope) {		static DISubprogram *getSubprogram(Metadata *LocalScope) {
▲ Show 20 Lines • Show All 806 Lines • Show Last 20 Lines
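
Reader's note (not part of the patch): the `LiveTokens` bookkeeping in `verifyConvergenceControl` above can be modeled outside LLVM. The following is a minimal standalone sketch under simplifying assumptions: tokens are plain strings, only a single basic block is modeled, and a use never produces a new token. The names `Event` and `firstBadUse` are hypothetical and do not exist in LLVM.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model of one instruction: either it defines a token or it
// uses one through a "convergencectrl" bundle.
struct Event {
  bool IsDef;
  std::string Token;
};

// Returns the index of the first use that is not well-nested, or -1 if
// every use in the straight-line sequence passes the check.
int firstBadUse(const std::vector<Event> &Events) {
  std::vector<std::string> Live; // stack of live tokens, innermost last
  for (int I = 0; I < static_cast<int>(Events.size()); ++I) {
    const Event &E = Events[I];
    if (E.IsDef) {
      Live.push_back(E.Token);
      continue;
    }
    // A use is only allowed while its token is still on the stack.
    if (std::find(Live.begin(), Live.end(), E.Token) == Live.end())
      return I;
    // Using a token ends every region that was opened after it.
    while (Live.back() != E.Token)
      Live.pop_back();
  }
  return -1;
}
```

With the sequence "define t1, define t2, use t1, use t2", the use of t1 pops t2 off the stack, so the later use of t2 is reported at index 3. This is the shape that `@region_nesting1` in convergencectrl-invalid.ll is written to trigger.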

llvm/test/Bitcode/operand-bundles-bc-analyzer.ll

	; RUN: llvm-as < %s | llvm-bcanalyzer -dump -disable-histogram | FileCheck %s			; RUN: llvm-as < %s | llvm-bcanalyzer -dump -disable-histogram | FileCheck %s

	; CHECK: <OPERAND_BUNDLE_TAGS_BLOCK			; CHECK: <OPERAND_BUNDLE_TAGS_BLOCK
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: <OPERAND_BUNDLE_TAG			; CHECK-NEXT: <OPERAND_BUNDLE_TAG
				; CHECK-NEXT: <OPERAND_BUNDLE_TAG
	; CHECK-NEXT: </OPERAND_BUNDLE_TAGS_BLOCK			; CHECK-NEXT: </OPERAND_BUNDLE_TAGS_BLOCK

	; CHECK: <FUNCTION_BLOCK			; CHECK: <FUNCTION_BLOCK
	; CHECK: <OPERAND_BUNDLE			; CHECK: <OPERAND_BUNDLE
	; CHECK: <OPERAND_BUNDLE			; CHECK: <OPERAND_BUNDLE
	; CHECK-NOT: <OPERAND_BUNDLE			; CHECK-NOT: <OPERAND_BUNDLE
	; CHECK: </FUNCTION_BLOCK			; CHECK: </FUNCTION_BLOCK

	Show All 11 Lines

llvm/test/Verifier/convergencectrl-invalid.ll

This file was added.

				; RUN: not opt -S %s -verify 2>&1 | FileCheck %s

				; CHECK: convergence region is not well-nested
				; CHECK: %t1_tok2
				define void @region_nesting1() {
				%t1_tok1 = call token @llvm.experimental.convergence.anchor()
				%t1_tok2 = call token @llvm.experimental.convergence.anchor()
				call void @f() [ "convergencectrl"(token %t1_tok1) ]
				call void @f() [ "convergencectrl"(token %t1_tok2) ]
				ret void
				}

				; CHECK: convergence region is not well-nested
				; CHECK: %t2_tok2
				define void @region_nesting2() {
				A:
				%t2_tok1 = call token @llvm.experimental.convergence.anchor()
				%t2_tok2 = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %B, label %C

				B:
				call void @f() [ "convergencectrl"(token %t2_tok1) ]
				br label %C

				C:
				call void @f() [ "convergencectrl"(token %t2_tok2) ]
				ret void
				}

				; CHECK: convergence token use by an instruction other than llvm.experimental.convergence.loop in a cycle that does not contain the token's definition
				; CHECK: %t3_tok1
				define void @use_in_cycle() {
				A:
				%t3_tok1 = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				call void @f() [ "convergencectrl"(token %t3_tok1) ]
				br label %B
				}

				; CHECK: two static convergence token uses in a cycle that does not contain either token's definition
				; CHECK: %t4_tok1
				; CHECK: %t4_tok2
				define void @multiple_hearts() {
				A:
				%t4_tok1 = call token @llvm.experimental.convergence.anchor()
				%t4_tok2 = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				%h2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t4_tok2) ]
				%h1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t4_tok1) ]
				br label %B
				}

				; CHECK: two static convergence token uses in a cycle that does not contain either token's definition
				; CHECK: %t5_tok1
				; CHECK: %t5_tok1
				define void @multiple_hearts_nested() {
				A:
				%t5_tok1 = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				br i1 undef, label %C, label %D

				C:
				%h1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t5_tok1) ]
				br i1 undef, label %C, label %B

				D:
				%h2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t5_tok1) ]
				br i1 undef, label %D, label %B
				}

				declare void @f() convergent

				declare token @llvm.experimental.convergence.entry()
				declare token @llvm.experimental.convergence.anchor()
				declare token @llvm.experimental.convergence.loop()
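
Reader's note (not part of the patch): `@region_nesting2` above fails because of how the verifier merges live tokens at a control-flow join. The sketch below models only that merge step, as a standalone approximation. `TokenStack`, `intersectLive` and `isLive` are hypothetical names, tokens are plain strings, and the dominance filter applied for the first predecessor is omitted. The patch uses `llvm::partition` for the intersection; this sketch uses erase/remove instead, which keeps the surviving tokens in their original order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using TokenStack = std::vector<std::string>;

// True if Tok is currently live on the given stack.
bool isLive(const TokenStack &Live, const std::string &Tok) {
  return std::find(Live.begin(), Live.end(), Tok) != Live.end();
}

// Merge the live set arriving from another predecessor into the set
// already recorded for a block: only tokens live on both paths survive.
void intersectLive(TokenStack &Existing, const TokenStack &Incoming) {
  Existing.erase(std::remove_if(Existing.begin(), Existing.end(),
                                [&Incoming](const std::string &Tok) {
                                  return !isLive(Incoming, Tok);
                                }),
                 Existing.end());
}
```

In `@region_nesting2`, block C is reached from A with both tokens live and from B with only `%t2_tok1` live, because the use of `%t2_tok1` in B ends the region of `%t2_tok2`. After the intersection only `%t2_tok1` remains, so the use of `%t2_tok2` in C is rejected.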

llvm/test/Verifier/convergencectrl-valid.ll

This file was added.

				; RUN: opt -S %s -verify

				define void @region_nesting1() {
				A:
				%tok1 = call token @llvm.experimental.convergence.anchor()
				%tok2 = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				br i1 undef, label %C, label %D

				C:
				call void @f() [ "convergencectrl"(token %tok1) ]
				ret void

				D:
				call void @f() [ "convergencectrl"(token %tok2) ]
				ret void
				}

				; Mirror image of @region_nesting1
				define void @region_nesting2() {
				A:
				%tok1 = call token @llvm.experimental.convergence.anchor()
				%tok2 = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				br i1 undef, label %C, label %D

				C:
				call void @f() [ "convergencectrl"(token %tok2) ]
				ret void

				D:
				call void @f() [ "convergencectrl"(token %tok1) ]
				ret void
				}

				define void @loop_nesting() {
				A:
				%a = call token @llvm.experimental.convergence.anchor()
				br label %B

				B:
				%b = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %C, label %D

				C:
				%c = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %b) ]
				call void @f() [ "convergencectrl"(token %c) ]
				br label %B

				D:
				call void @f() [ "convergencectrl"(token %b) ]
				br i1 undef, label %B, label %E

				E:
				ret void
				}

				define void @irreducible1() {
				A:
				%a = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %B, label %C

				B:
				%b = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %C, label %D

				C:
				%c = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %a) ]
				br i1 undef, label %B, label %E

				D:
				call void @f() [ "convergencectrl"(token %b) ]
				br i1 undef, label %B, label %F

				E:
				call void @f() [ "convergencectrl"(token %c) ]
				br i1 undef, label %C, label %F

				F:
				call void @f() [ "convergencectrl"(token %a) ]
				ret void
				}

				; Mirror image of @irreducible1
				define void @irreducible2() {
				A:
				%a = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %B, label %C

				B:
				%b = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %a) ]
				br i1 undef, label %C, label %D

				C:
				%c = call token @llvm.experimental.convergence.anchor()
				br i1 undef, label %B, label %E

				D:
				call void @f() [ "convergencectrl"(token %b) ]
				br i1 undef, label %B, label %F

				E:
				call void @f() [ "convergencectrl"(token %c) ]
				br i1 undef, label %C, label %F

				F:
				call void @f() [ "convergencectrl"(token %a) ]
				ret void
				}

				declare void @f() convergent

				declare token @llvm.experimental.convergence.entry()
				declare token @llvm.experimental.convergence.anchor()
				declare token @llvm.experimental.convergence.loop()