This is an archive of the discontinued LLVM Phabricator instance.

IR: Add convergence control operand bundle and intrinsics
Abandoned · Public

Authored by sameerds on Aug 9 2020, 7:22 AM.

Details

Summary

See ConvergentOperations.rst for the details.

This replaces the proposal from https://reviews.llvm.org/D68994

This patch adds the operand bundle and intrinsics themselves, as well as
the LangRef documentation describing the semantics of controlled
convergent operations. Follow-up patches will adjust existing passes to
comply with those changes, as well as provide new functionality on top
of this mechanism.

Change-Id: I045c6bc864c4dc5fb0a23b0279e30fac06c5b974

Event Timeline

There are a very large number of changes, so older changes are hidden.
jdoerfert added inline comments. Aug 10 2020, 9:26 AM
llvm/docs/ConvergentOperations.rst
281

The "heart" and the increment step are fairly vague. Maybe talk about something tangible, e.g., the target of a backedge?

efriedma added inline comments. Aug 10 2020, 12:09 PM
llvm/docs/ConvergentOperations.rst
366

The part that's sort of unclear is that calls coming from outside of LLVM IR may or may not be part of the same dynamic instance. Obviously we can't define that here, but I think we should explicitly note it as something that's implementation-defined.

sameerds added inline comments. Aug 11 2020, 1:25 AM
llvm/docs/ConvergentOperations.rst
203–204

I think the notion of dynamic instances applies to all instructions. Continuing with #3 below, it seems to me that different threads can execute the same dynamic instance of any instruction. It's just that this notion is not very interesting in the case of non-communicating instructions. The ones that communicate need to be marked convergent, so that the effect of transformations on them is limited.

402

So this defines a proper nesting of convergence regions? An informative note would be helpful.

465–471

Which part of the formal semantics shows that this is a valid translation? The rule for the execution of dynamic instances seems to be useful only for specifying which threads execute the convergent operations. But what relates them to the original loop? Is it because the set of dynamic instances produced by the second version has a one-to-one mapping with the set of dynamic instances produced by the first version?

517

I think this intends to say "block in the loop body other than the loop header", but the wording chosen is a little difficult to parse on a first read.

523–525

+1

To me, the whole point of this new concept is to capture control dependency so that we don't have to go look at branch conditions again. But allowing such a transformation reintroduces the need to go check the control dependency to understand which threads are really executing this instance.

548–552

This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
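A minimal sketch of the two placements being contrasted (illustrative IR; @convergent_op and %cc are placeholders, not names from the patch):

  ; anchor outside the branch: both legs communicate within subsets of one
  ; implementation-defined set of threads picked before the branch
  %tok = call token @llvm.experimental.convergence.anchor()
  br i1 %cc, label %then, label %else

  ; anchor inside a leg: the leg independently picks a fresh,
  ; implementation-defined set of threads
then:
  %t1 = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %t1) ]
  br label %merge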

sameerds added inline comments. Aug 11 2020, 1:42 AM
llvm/docs/ConvergentOperations.rst
561

What forbids the convergent operations from being hoisted? Isn't that the whole point of this new framework? In particular, what would the total_gains/total_losses example look like with appropriate use of convergence tokens?

simoll added inline comments. Aug 11 2020, 3:11 AM
llvm/docs/ConvergentOperations.rst
203–204

I'm more concerned about the implications this constraint may have for transformations like branch fusion.
The memory model is pretty permissive and allows fusion of memory accesses regardless.
@nhaehnle Do you care about non-memory side effects, like exceptions? Do these follow the same weak semantics as the memory model?

nhaehnle updated this revision to Diff 284735. Aug 11 2020, 8:07 AM
nhaehnle marked 11 inline comments as done.

With this change, I've edited the documents in a way where I hope all
comments have been addressed.

  • Augmented the initial motivating examples with versions that show how the control intrinsics are added
  • Added an additional motivating example to illustrate what the "anchor" is for
  • Changed the order in which control intrinsics are described, to hopefully make it even more obvious that the "less surprising" one is the entry intrinsic
  • Added informational notes to the "Dynamic Instances and Convergence Tokens" section, and cleaned up some minor things in the Formal Rules
  • A bunch of other random changes all over the place
nhaehnle added inline comments. Aug 11 2020, 8:07 AM
llvm/docs/ConvergentOperations.rst
203–204

I'm not entirely sure what you mean by the question. There isn't supposed to be any interaction between exceptions and what's being described here. There aren't any relevant constraints expressed on the dynamic instances of non-convergent operations in the first place, and for convergent operations I'd think of them as happening in two steps: there's a cross-thread communication, and afterwards each thread individually decides whether it throws an exception in its context.

This can obviously take the exchanged data into account, to the point where you could model an operation as exchanging bits between threads to indicate whether an exception should be thrown in each thread -- so you could have an operation that throws an exception based on a value in another thread, as long as that other thread executes the same dynamic instance. Similarly, you could have UB in thread A based on an argument value in thread B as long as A and B execute the same dynamic instance.

I'm going to add an informational note to the end of this section that dynamic instances of non-convergent instructions don't matter.

213–215

The logical split between the two sections is that this section has the basic definitions, while the "Formal Rules" section has the rules about how the convergence control intrinsics place additional constraints on how dynamic instances can be formed.

If the token represents the dynamic instance exactly then this would also limit the freedom llvm.experimental.convergence.anchor() has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.

I'm not sure I understand the argument. What exactly do you mean by dynamic invocation here?

Each time a thread executes the same anchor call site, it will receive a different token value, corresponding to a different dynamic instance. That may or may not be the same dynamic instance as received by other threads. So even if control flow is entirely uniform, an implementation would be free to produce a different thread partitioning each time the anchor is executed. That is on purpose: if you want more predictable thread partitionings, use a combination of entry and loop intrinsics as required.
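A minimal sketch of that more predictable combination (illustrative IR; @convergent_op and %cc are placeholders):

  %e = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  ; threads with the same dynamic instance of entry and the same iteration
  ; count execute the same dynamic instance of loop()
  %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %e) ]
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br i1 %cc, label %loop, label %exit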

281

When it comes to defining rules that are applicable to completely general IR, the loop intrinsic call site feels *more* tangible than the notion of a backedge. For example, backedges don't really work as a concept when you have irreducible control flow.

The loop intrinsic call site also really doesn't have to be in the header block of a natural loop -- it could be inside of an if-statement in the loop, for example, which has interesting consequences but can still be defined (and can actually be useful: someone pointed me at a recent paper by Damani et al., "Speculative Reconvergence for Improved SIMT Efficiency", which proposes a certain "unnatural" way of controlling convergence in some kinds of loops for performance; the same kind of effect can be achieved by placing the loop heart inside of an if-statement).

291–294

The intention is that the IR-based rules still apply regardless of whether the caller is in the same module or not. I'm not sure if this needs to be spelled out more clearly.

And yes, for other cases we should be able to think of it as a property of the calling convention.

340–344

No, this is explicitly not sufficient. You can have:

  %tok = call token @llvm.experimental.convergence.anchor()
  br i1 %cc, label %then, label %next

then:
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %next

next:
362–366

I think this comment may have moved to a confusing location relative to the document.

entry and anchor are inherently different.

I'm going to add a note about looking at language specs etc.

388–389

No, the rule excludes code such as:

%a = call token @llvm.experimental.convergence.anchor()
%b = call token @llvm.experimental.convergence.anchor()
call void @convergent_op() [ "convergencectrl"(token %a) ]
call void @convergent_op() [ "convergencectrl"(token %b) ]

The convergence region of %b contains a use of %a but not its definition.

I'm going to add a note about nesting.
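For contrast, a shape that satisfies the nesting rule, because the use of %a comes before the definition of %b (a sketch, assuming the two convergent operations may be reordered):

%a = call token @llvm.experimental.convergence.anchor()
call void @convergent_op() [ "convergencectrl"(token %a) ]
%b = call token @llvm.experimental.convergence.anchor()
call void @convergent_op() [ "convergencectrl"(token %b) ]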

405

I agree with @t-tye's explanation here. The choice here reflects the choice made e.g. in the Vulkan memory model: the only "convergent" operation (not the term used in Vulkan...) which interacts with the memory model is OpControlBarrier, so it's good to be able to treat these two kinds of communication orthogonally.

447

It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites).

Yes, that is the point of llvm.experimental.convergence.anchor.

And yes, if there was clear "chain of custody" as you call it from outside of the loop, then this unrolling with remainder would be incorrect.

465–471

The first version doesn't have a unique set of dynamic instances in the first place, because anchor is by design implementation-defined.

So the possible universes of dynamic instances in the transformed/unrolled version only need to be a subset. In a sense, the loop unroll with remainder picks a subset by saying: from now on, if you have two threads with e.g. iteration counts 3 and 4, then they will never communicate during the 3rd iteration.

In the original program, they may or may not have communicated during the 3rd iteration -- up to the implementation, and in this case, the implementation decided to do a form of loop unrolling which implicitly ends up making a choice.

471

I hope this has been answered in the context of your other comments?

508

Is that still grammatically correct? The parse of the sentence is

Loops in which ((a loop intrinsic outside of the loop header) uses a token defined outside of the loop)

That is, "a loop intrinsic outside of the loop header" is the subject of the sentence in the outer parentheses.

517

Going to try an improvement :)

523–525

I mean, anchor is implementation-defined, so you can't make a totally solid statement anyway. You could only make solid *relative* statements if the token produced by the anchor was also used by some other convergent operations, and if those are outside of the if-statement, the sinking wouldn't be allowed anymore anyway...

548–552

CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.

561

I'm going to add that example.

576–579

Should be answered elsewhere.

605–606

The pixel example would use entry instead of anchor. I'm going to add that example.

615–616

Should be answered elsewhere.

sameerds added inline comments. Aug 11 2020, 9:49 PM
llvm/docs/ConvergentOperations.rst
53–56

I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete, that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:

  1. In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
  2. In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
  3. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.

This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?

548–552

Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication is outside the scope of this document, where the mask argument is used to relate the dynamic instances. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.

sameerds added inline comments. Aug 12 2020, 12:03 AM
llvm/docs/ConvergentOperations.rst
281

It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence?

552–554

So the heart is not a property of the loop itself in LLVM IR. It is a place chosen by the frontend based on semantics external to LLVM IR, in a way that allows the frontend to express constraints about convergence in the loop.

571

Just like the loop intrinsic, this intrinsic occurs in a place chosen by the frontend based on semantics outside of LLVM IR, and used by the frontend to express constraints elsewhere in the IR.

612–613

The older comments about this seem to have floated away. At the risk of repeating the discussion, what is *n* capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?

647–650

Just like the *n* property of the loop intrinsic, I think an informational note explaining this will be helpful.

654–657

This is not a rule; it's just a definition.

659–661

Since a convergence region is defined for a token, this text needs to bring out the fact that two different tokens are being talked about at this point. Something like: "If the convergence region for token T1 contains a use of another token T2, then it must also contain the definition of T2."

750–755

So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?

760

Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().

nhaehnle added inline comments. Aug 12 2020, 12:48 PM
llvm/docs/ConvergentOperations.rst
53–56
  1. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.

This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?

What ambiguities do you have in mind?

If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.

281

No. The jump-threaded code could also come out of C(++) code with gotos, so this doesn't really work.

548–552

I don't understand what you're trying to get at here.

The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow; it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances.

All legal program transforms subject to those constraints are then legal. There is no need to label them as convergent. If you can think of a counter-example, I'd be curious to see it.

552–554

Yes.

571

I'd rephrase it slightly by saying that the place is chosen by the frontend in a way that preserves the semantics of the original language into LLVM IR. But I suspect that we're ultimately thinking of the same thing.

612–613

It's really just a loop iteration counter. Every time a thread executes the loop intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label.

Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic. This means that if you have a natural loop where the loop intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere).
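A sketch of that "funny" counting, with the loop heart in a conditional block (illustrative IR; %outer is a hypothetical token defined outside the cycle):

header:
  br i1 %cc, label %heart, label %latch

heart:
  ; the iteration label only advances on traversals that reach this block
  %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %latch

latch:
  br i1 %cc2, label %header, label %exit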

Unrolling will actually not duplicate the loop intrinsic, but only keep the copy that corresponds to the first unrolled iteration.

654–657

Fair enough. I'm going to split this up into rules about cycles and rules about convergence regions.

659–661

It's needed from a formal point of view, but it does seem to trip people up, so I'm going to implement your suggestion :)

750–755

Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.

760

Correct.

I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.

sameerds added inline comments. Aug 12 2020, 9:59 PM
llvm/docs/ConvergentOperations.rst
53–56

Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but general acyclic control flow needs special attention in specific cases, starting with jump threading.

281

But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert those intrinsics just in case optimizations changed the graph? If yes, is this something that LLVM must prescribe for every frontend as part of this document?

548–552

I am trying to understand whether there are constructs in Clang-supported high-level languages that cannot be addressed by these intrinsics, and if such constructs do exist, whether they gate the adoption of this enhancement in LLVM. But I see your point now. The sync() builtins in CUDA are no longer dependent on convergence. The decision to hoist or sink them is based entirely on other things like data dependences (and maybe just that).

612–613

Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.

This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

Unrolling will actually not duplicate the loop intrinsic, but only keep the copy that corresponds to the first unrolled iteration.

This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.

750–755

The n that I used is the virtual loop count that is described in the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?

760

I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.

nhaehnle updated this revision to Diff 285267. Aug 13 2020, 12:00 AM

Actually submit all the changes that I thought I had submitted
two days ago.

Also:

  • add rationale for the static rule on cycles
  • expand the discussion of program transform correctness for loop unrolling and split the section since it's getting quite large
llvm/docs/ConvergentOperations.rst
53–56

I hesitate to write anything like that, because then you get into the problem of defining what "structured" means -- there are multiple definitions in the literature.

My argument would be that purely acyclic control flow -- whether structured or not -- contains sufficient information about convergence to define semantics consistently, without assistance, and avoiding spooky action at a distance.

That you still need some assistance to make actual *guarantees* is really down to composability. For example, you can have a fully acyclic function called from inside a cycle, and then the question is what happens at inlining. One can explore an alternative scheme where you don't have to insert anything into the acyclic function in this case and it's the job of the inlining transform to fix things up, and I have done some exploring in this direction. There are at least two downsides:

  1. The burden on generic program transforms becomes larger.
  2. There is no longer any way for the programmer to express the distinction between functions (or sub-sections of code) that care about the set of threads with which they're executed vs. those that don't (like the @reserveSpaceInBuffer example I added), and that closes the door on certain performance optimizations and becomes problematic if you want to start thinking about independent forward progress.
281

It needs to insert the control intrinsics if it wants to have any guarantees. There aren't a lot of useful guarantees we can make today without this, so that's fine.

I don't want to say that frontends absolutely must insert the control intrinsics just yet; that's why uncontrolled convergent operations are allowed but deprecated. Frontends for languages with convergent operations that don't change will remain in the world of "things tend to work as expected a lot of the time, but stuff can break in surprising ways at the least convenient moment" that they are already in today. If they run the ConvergenceControlHeuristic pass just after IR generation, the times where things break will likely be somewhat reduced, but probably not eliminated entirely. It's difficult to make a definitive claim because there's obviously also the question of which guarantees the high-level language is supposed to give to the developer. For an HLL that just doesn't want to give any guarantees, not inserting control intrinsics is fine from the POV of language spec correctness, although you're likely to run into corner cases where the language behavior clashes with developers' intuitive expectations.

612–613
Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.

This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

Well... what even is a loop iteration? :)

For the purpose of convergence, the loop heart defines what the iterations are, so it is reached on every iteration *by definition*. (But there may well be cycles in the CFG that don't contain a loop intrinsic, and that's fine.)

More likely your real question is whether in a natural loop, the loop intrinsic must be reached once per execution of the loop header (or traversal of a back edge) -- the answer is no.

Part of the rationale here (and also an unfortunately inherent source of potential confusion) is that for defining convergence, and more generally for implementing whole-program vectorization of the style we effectively do in AMDGPU, leaning only on natural loops doesn't work, at least in part because of the possibility of irreducible control flow. This is why all the actual algorithms I'm building on this rely on the Havlak-inspired CycleInfo of D83094, and all the rules in this document are expressed in terms of cycles (in the sense of circular walks in the CFG) instead of natural loops.

My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.

I would have liked to have that property but couldn't make it work without imposing static rules that would be much harder to understand and follow. The point about unrolling is mentioned in the later examples section where I talk through a bunch of example loops and whether they can be unrolled or not.

750–755

The formal explanation is ultimately that the set of communicating threads is changed, but I agree that it could be helpful to spell out how that comes about via the rules on dynamic instances, so I'm going to do that.

760

I'm adding a "rationale" section specifically to explain those static rules about cycles.

nhaehnle updated this revision to Diff 285272. Aug 13 2020, 12:32 AM

Typos and yet slightly more detail.

sameerds added inline comments. Aug 13 2020, 2:57 AM
llvm/docs/ConvergentOperations.rst
175–176

But this use of the intrinsics does not add any new constraints, right? This specific optimization is already sufficiently constrained by control dependence.

744–745

The exhausted reader just begs to see the corrected version at this point! :)

808–810

Following the structure of previous examples, it would be good to have a demonstration of how this can result in misinterpreted convergence. That would explain why this example should be illegal. This paragraph directly applies the rules to show how the example is recognized as illegal.

nhaehnle added inline comments. Aug 13 2020, 4:01 AM
llvm/docs/ConvergentOperations.rst
175–176

It doesn't add any constraints for existing generic transforms in LLVM that I'm aware of, but there's still a bit of non-trivial content to it at least in theory. Whether it matters in practice depends on the backend.

E.g., it doesn't matter for AMDGPU, but modern versions of CUDA say that some sort of divergence can basically happen at any point in the program. If you wanted to take code that uses the convergent operations and translate it to CUDA builtins, the control intrinsics make a difference. In that case, you'd want the uniform threadmask to replace the entry intrinsic. If it was an anchor somewhere instead, you'd want to replace the anchor by __activemask() and then use its return value. In both cases, you'd possibly modify the mask somehow to account for additional control dependencies between the anchor and its use. This "modify the mask somehow" hides a lot of complexity, but thinking about it quite a bit I believe it's a similar amount of complexity to what we have in the AMDGPU backend to make things work, probably less because more of the burden is shouldered by hardware in the end.

Plus there's the composability aspect of it if we're talking about functions that aren't kernel entry points and might be inlined.

744–745

The exhausted author is taking a note and will get around to it soon ;)

808–810

Isn't it just the same as in the example directly above? You'd expand C / E to a longer sequence of what happens in those inner loops, but the essential difficulty is the same.

sameerds added inline comments. Aug 13 2020, 9:40 AM
llvm/docs/ConvergentOperations.rst
808–810

Maybe it is the same. See earlier note about exhausted reader. :) Maybe it's just me, but the concepts in this document are quite slippery, and well-rounded examples that restate the obvious can go a long way in gaining confidence.

nhaehnle updated this revision to Diff 285707. Aug 14 2020, 11:35 AM

Add more language about loops

nhaehnle updated this revision to Diff 286816. Aug 20 2020, 7:42 AM
  • tighten the static rules about cycles: there was a gap in the exact phrasing if two loop heart intrinsics in a cycle use _different_ convergence tokens
  • add verifier checks and corresponding tests for the static rules
nhaehnle updated this revision to Diff 286817. Aug 20 2020, 7:43 AM

clang-format fixes

simoll added inline comments. Aug 28 2020, 5:19 AM
llvm/docs/ConvergentOperations.rst
340–344

You mean control flow could divert threads? But those threads won't even reach the convergent instruction, and among those that do, only those that have the same runtime token value will execute it as a pack.

nhaehnle added inline comments. Sep 7 2020, 7:49 AM
llvm/docs/ConvergentOperations.rst
340–344

Ah, I misread your earlier comment. Yes, though there's a question of whether the different threads actually see the same value, or whether they see different values that happen to refer to the same dynamic instance of the defining instruction. One may want to think of the token value as a handle to some control structure that refers to a dynamic instance and also holds a loop counter for the loop heart intrinsic. I don't think it really matters much either way.

I've only read up to Formal Rules, so later sections might change things, but I figure it's potentially useful to see a reader's thoughts mid-read. I'm pretty sure I've misunderstood the anchor intrinsic based on what I've read of the doc and comments so far.

llvm/docs/ConvergentOperations.rst
141–144

I think this is a little misleading. IIUC, and assuming that the sets of communicating threads are quads as mentioned above, %condition doesn't need to be uniform across all the threads referenced by %entry. The only use is inside the then: block, so I would expect that communicating threads for which %condition is uniformly false don't need to be considered, as their result will not be used by any thread that enters then:. As you're trying to leave methods out, it's probably best left at "... with additional knowledge, that it doesn't change the result".

The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: How would such a transform know what to prove and when to try?

For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the instructions/intrinsics convergence than convergencectrl declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if @textureSample were a non-intrinsic function with the same properties you describe I don't think it would be possible to know any better than what convergencectrl declares, preventing the analysis the sinking transform would depend on. It's arguably out of scope for this doc but do you foresee convergence tokens and the convergent attribute becoming finer grained in future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?

dsanders added inline comments. Sep 8 2020, 10:04 PM
llvm/docs/ConvergentOperations.rst
28–30

This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models are stretching the definition a bit.

I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.

212–214

Should we also mention that it's valid when %cc is non-uniform so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking %delta using %cc first). In other words, it's not the actual communication we need to keep consistent but the effects (and side-effects) of that communication.
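A sketch of that masking idea, reusing %cc and %delta from the example under discussion (@subgroupAdd and %tok are illustrative stand-ins for the convergent reduction and its convergence token):

  ; threads for which %cc is false contribute a neutral 0 to the sum
  %masked = select i1 %cc, i32 %delta, i32 0
  %sum = call i32 @subgroupAdd(i32 %masked) [ "convergencectrl"(token %tok) ]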

248–252

I feel like there's something I'm missing here. This sounds like:

if (condition1) {
  %token = anchor()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

can be rewritten to:

if (condition1) {
  if (condition2) {
    %token = anchor()
     ...
    sum() convergencectrl(%token)
  }
}

which made sense at first given statements like "we don't care which threads go together", but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:

if (condition1) {
  %token = entry()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case?

Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as its name would suggest. However, I haven't seen anything say it's immobile, and things like "we don't care which threads go together" and "the code does not care about the exact set of threads with which it is executed" give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move, then there'd be nothing stopping me deleting it either, as I could always invent an if(false) { ... } to sink it all into.

nhaehnle added inline comments. Sep 15 2020, 6:08 AM
llvm/docs/ConvergentOperations.rst
28–30

That's a fair point. The way I'm thinking about this is that there may be communication with a void payload, but ultimately this can be bikeshed to death.

141–144

I can clean up the text.

As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If %condition is uniform (in a suitably defined sense), then sinking the @textureSample is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved.

This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of %condition at the quad scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.

248–252

That transform is allowed (assuming that sinking the user of the result of the sum() is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of sum could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :)

If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use entry instead of anchor.

anchor is still useful, as long as you have multiple things anchored to it. It will then ensure that they are consistent relative to each other.
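A sketch of that use (illustrative IR; @subgroupShuffle and @subgroupAdd stand in for arbitrary convergent operations):

  %tok = call token @llvm.experimental.convergence.anchor()
  ; whatever implementation-defined set of threads the anchor picks, both
  ; operations communicate within that same set because they share %tok
  %v = call i32 @subgroupShuffle(i32 %x, i32 %lane) [ "convergencectrl"(token %tok) ]
  %r = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %tok) ]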

sameerds added inline comments. Sep 16 2020, 4:43 AM
llvm/docs/ConvergentOperations.rst
248–252

If I understand this right, then even entry does not capture anything specific ... it is merely a placeholder for the anchor at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The entry at the start of a kernel is almost the same as an anchor, except the definition of threads that see the same dynamic instance is coming from the language above rather than the implementation below.

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Is that correct?

sameerds added inline comments. Sep 16 2020, 4:51 AM
llvm/docs/ConvergentOperations.rst
248–252

Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.

nhaehnle added inline comments. Sep 23 2020, 9:26 AM
llvm/docs/ConvergentOperations.rst
248–252

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Probably? I'm not sure I agree with the exact wording. In a compute kernel, the entry intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). loop intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on.

So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can only enforce preservation of groupings.

I hope that's what you meant, just with different words? :)

Hi. :) A few people pinged me asking for my feedback here, since I touched the convergent attr way back in the day, for CUDA.

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

I strongly agree that convergent as-is has problems. Fixing them is clearly complicated, and it seems like a lot of work has gone into this proposal.

I have been out of it for too long to feel comfortable signing off on whether this proposal fixes the problems with convergent. The proposal seems reasonable to me, but as we saw with e.g. undef/poison, these things can be extremely subtle.

I'm also not comfortable speaking to whether this representation will be ergonomic in the relevant LLVM passes.

What I'm more comfortable speaking to is:

  • Is the proposal clear to me?

I think the proposal is clear, modulo my few comments (relative to the length of the patch, anyway). This kind of writing is really tricky; I admire that I could mostly understand it. I thought the extensive examples were really helpful.

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you & co. aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

Hope this helps.

llvm/docs/ConvergentOperations.rst
28–30

CUDA __syncthreads() is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information. It's just a barrier.

Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.

The way I'm thinking about this is that there may be communication with a void payload,

If you count "communicate nil" as communication, then perhaps the operation is not in fact communication but rather is "communication or synchronization"? Perhaps:

A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow.

89

Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample. I am imagining that it involves something like a __shfl, but that's because I already understand GPUs. Your audience is bigger than that.

221

Nit: Clarify that this example isn't using the proposed convergence intrinsics? Perhaps

Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document.

250

Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?

313

This paragraph really clarifies for me what's going on. +1

348

...wait, there are such things as convergent functions? This is the first I'm hearing about it in the doc! So far it seemed there were only convergent *calls*. What's a convergent function? :)

499

Do you plan to check this in the verifier (insofar as possible; I understand that it's not possible to check this for cross-TU calls)?

507

This one is a local property -- could we say that this makes the program ill-formed, instead of UB?

511

Again, could we say this makes the program ill-formed? (At least the entry-block check, I'm not sure what a convergence region is, yet.)

595

Have we formally defined what a "controlled" convergent operation is? Do you mean a call to a convergent function with a "convergencectrl" operand bundle? (Say that?)

955

In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:

  • Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
  • Partial unrolling by a trip multiple (so there's no "tail")
  • Full unrolling, which eliminates the loop

I think you're saying that only the first kind of unrolling is tricky.

982–983

It would help me if we could elaborate with half a sentence on what the behavior change might be.

988–989

Do you mean that this *kind of* unrolling is forbidden?

But if you're going to forbid *all* unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower. Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?

999

One thing I don't get from this example is what I should do as a *frontend* to LLVM. That is, when should I do this form, and when should I put a new anchor inside a loop?

It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor. That is, I have to behave as though the compiler could transform

anchor();
convergent_fn();

into, effectively

if (threadIdx.x % 2 == 0) {
  anchor();
  convergent_fn();
} else {
  anchor();
  convergent_fn();
}

Something like this?

OK, so I always have to use the convergence.loop() form. But then this is saying I can never unroll.

ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense? But it would help me if we were explicit about that.

1033

counter > 1?

nhaehnle updated this revision to Diff 301840. Oct 30 2020, 2:39 AM
nhaehnle marked 3 inline comments as done.

Address the comments from @jlebar that I indicated I'd address,
except for changes affecting the Verifier -- I'll do those later.

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

Thanks, and don't worry. A lot of the old comments don't make sense anymore because the document was changed and Phabricator shows them in nonsensical places unfortunately.

[snip]

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding of what this should look like (I'm going to caveat this by saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

I have trouble answering this as well due to the lack of proper specification from Nvidia, and I'm not set up to run this kind of experiment.

From a theory point of view, because those newer versions of sync operations take that explicit arg, we shouldn't consider them to be convergent according to what's being defined here. Only the __activemask() builtin probably still needs to be considered convergent (also in light of https://bugs.llvm.org/show_bug.cgi?id=47210).

The result of your experiment seems to contradict the theory. Having worked on this part of our compiler for a while now, I think it's entirely possible that the result of your experiment is simply a bug somewhere along the compiler stack, but of course I can't say for certain. If it's not supposed to be a bug, then to me this means there's something subtle missing in the way the new sync operations are described. Either way, some clarification would be good.

llvm/docs/ConvergentOperations.rst
28–30

Your suggestion looks good to me, going to apply it.

89

textureSample is actually a built-in function of graphics languages. I'm going to add a clause to try to clarify that. I assume all GPUs have dedicated circuitry for it. I specifically wanted to mention textureSample in the document at least once because it (and some close analogs) are often forgotten in discussions of convergent even by graphics people like myself.

Obviously the document should also be accessible to folks from the GPU compute world, which is why I tried to give a succinct explanation of the relevant facts about textureSample in the paragraph above.

Later in the document there are also examples using shuffles, though with the Khronos-y spelling of subgroupShuffle instead of the CUDA-y __shfl. The choice of spelling is partly because that's just the world I'm personally working in most of my time, but also partly because I'd prefer using terms from common industry standards. I understand that CUDA is a bit of a de facto "standard", so if you think it's necessary to convert at least one example to CUDA spelling, we can do that -- just not this one here in particular, because it's specifically meant to be a graphics shader example.

221

Thanks, going to make this change.

250

Added ellipsis.

348

Uhh... technically true. How about adding something like the following somewhere:

In LLVM IR, function calls are the only instructions that can involve convergent
operations. A call itself (i.e., the act of jumping to the callee, setting up a
stack frame, etc.) is not a convergent operation. However, if the callee uses
the `llvm.experimental.convergence.entry` intrinsic, then we think of the
entire execution of the callee as a convergent operation from the perspective of
the calling function. Such callees must be marked with the `convergent`
attribute, and for brevity we say that they are "convergent functions". If the
callee isn't known at the call site (i.e., an indirect function call), then the
`call` instruction itself must have the `convergent` attribute.

The only reason why a function F would need to use the
`llvm.experimental.convergence.entry` intrinsic is if F in turn uses some
other convergent operation, i.e., a call to a convergent function. Chains of
such calls are expected to eventually end with the use of a (target-specific)
intrinsic that is `convergent`.
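A minimal sketch of the shape described above (illustrative IR; @convergent_op is a placeholder):

define void @f() convergent {
  %tok = call token @llvm.experimental.convergence.entry()
  ; the set of threads executing this call is inherited from f's call site
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  ret void
}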

499

Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...

507

Yes, that's a good idea.

511

The entry-block check should be straightforward.

595

Yes, the section "Dynamic Instances and Convergence Tokens" already says this:

The convergence control intrinsics described in this document and convergent
operations that have a `convergencectrl` operand bundle are considered
*controlled* convergent operations.

I'm going to add an anchor there since the doc is pretty long :)

955

Yes, that's correct, and I'm going to add essentially your three bullets at the top.

982–983

I gave it a try. It ended up being a full sentence though ;)

988–989

Yes, this kind of unrolling. This is already forbidden for uncontrolled convergent operations today. If you want to dig a little deeper, I would appreciate if you could also add your review to D85605. That's a follow-up change for (1) correctness of loop unrolling with regards to the loop intrinsics rules and (2) relaxing some of the constraints that exist today where possible when all convergent ops are controlled (by an anchor in the loop).

999

ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense?

Yes, that's correct. Hopefully clearer with the addition at the top of the section.

It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.

Right. The anchor essentially allows you to achieve the same thing as __activemask in CUDA, but in a more structured way that doesn't run into problems when you have two sides of an if/else both executing a sync operation with the same thread mask.

1033

Thanks, changing to counter >= 2 because that's what I had in a similar example above.

jlebar added a comment (edited). Nov 1 2020, 9:49 AM

Man, phab doesn't make this easy, does it?

One tip, shift+A hides all inline comments, making the patch easier to read. One problem, though: It hides all the comments! :)

Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...

I see. You may not be able to check this. My preference, which it seems like you share, is just, inasmuch as we _can_ mark something as ill-formed and check for it, that seems preferable to UB.

How about adding something like the following somewhere:

In LLVM IR, function calls are the only instructions that can involve convergent operations. A call itself (i.e., the act of jumping to the callee, setting up a stack frame, etc.) is not a convergent operation. [...]

Yes, that is very clear to me.

Thank you for making those changes.

I am satisfied that this can be implemented in a frontend (and anyway, you have the patch). I've pinged some folks at nvidia asking them to have a look wrt sm70, and I actually already got a reply, so I am hopeful we might hear from them. I don't want to keep you in limbo indefinitely, so I've asked if they might be able to provide a timeline.

Stay tuned, I guess.

vgrover99 added a subscriber: vgrover99. Edited Nov 2 2020, 7:12 PM

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I believe what is described here about convergent, as best I can understand it, is the semantics of __syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. __syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical __syncthreads().

I believe with sm70 the re-convergence has different semantics, due to the fact that we have a forward-progress guarantee within a warp. In pre-sm70 the following could deadlock:

volatile int flag = 0;
if (cond) {            // thread-dependent conditional
  while (flag == 0) ;  // spin-lock
} else {
  flag++;
}
// re-convergence point

With sm70+, it works as expected.

The following also works (doesn't deadlock):

volatile int flag = 0;
if (cond) {            // thread-dependent conditional
  while (flag == 0) ;  // spin-lock
}
// re-convergence point
flag++;

I believe what is described here about convergent, as best I can understand it, is the semantics of __syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. __syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical __syncthreads().

Textual alignment is a good way to examine this spec with respect to CUDA. The notion of dynamic instances is textually aligned according to basic rule 2:

  2. Executions of different instructions always occur in different dynamic instances. For this and other rules in this document, instructions of the same type at different points in the program are considered to be different instructions.

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

In general, the formal rules work correctly for this too: hoisting and sinking are disallowed without additional information. So the proposal is compatible with the new CUDA semantics for these builtins. These builtins do need convergence control: sinking such a call across a condition should be forbidden by default, since we can no longer guarantee that every thread in the mask still makes a matching call. Of course, specific optimizations that can recompute the mask can override this restriction.

foo = __shfl_sync(mask, ...);
if (condition) {
  // cannot sink foo here if condition is divergent
  sole_use_of_foo();
}

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

Yes, in CUDA and in PTX __syncwarp is an unaligned primitive.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the section on hoisting and sinking. Consider the following example copied from the spec:

define void @example(...) convergent {
  %entry = call token @llvm.experimental.convergence.entry()
  %data = ...
  %id = ...
  if (condition) {
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
    ...
  }
}

Here, hoisting subgroupShuffle() is generally disallowed because it depends on the identity of active threads. A CUDA builtin with a mask argument similarly identifies specific threads that must be active at the set of textually unaligned calls that synchronize with each other. So any change in the control flow surrounding those calls is generally disallowed without more information. The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

nhaehnle added a comment. Edited Nov 6 2020, 8:44 AM
To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. [snip] The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

Yes, that makes sense to me, although it also makes sense to reflect on this a bit more.

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

What's interesting is that the latter (__shfl_sync etc.) have similar constraints on how they can and can't be moved, but for different reasons, and the constraints are different. For example:

mask = ...;
if (blah) {
  y = __shfl_sync(mask, a, b);
  ...
} else {
  y = __shfl_sync(mask, a, b);
  ...
}

The __shfl_sync has an explicit thread mask and can be hoisted. However, a subgroupShuffle with implicit thread mask cannot be hoisted. So here, __shfl_sync allows more freedom.

Conversely:

__shfl_sync(mask, a, b); // result unused

This cannot be dead-code eliminated, because it might be communicating with threads executing a different part of the program. By contrast, a subgroupShuffle with implicit thread mask whose result is unused can be dead-code-eliminated. So here, subgroupShuffle allows more freedom.

By similar logic, subgroup ops with implicit thread mask in the same basic block can be re-ordered wrt each other, but this is not true for explicit thread mask (not-textually-aligned) subgroup ops.

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.
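
As a rough sketch of that modeling (the signatures are illustrative; the attribute spellings are the existing LLVM ones):

; Implicit thread set: which threads participate matters, but the result is a
; pure function of the arguments and that set, so an unused result is dead code.
declare i32 @subgroupShuffle(i32, i32) convergent readnone

; Explicit thread set: communicates through state the optimizer cannot see, so
; it must not be DCE'd or reordered, but it needs no convergence control.
declare i32 @__shfl_sync(i32, i32, i32) inaccessiblememonly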

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

That's an excellent way to "lift" sm70+ operations into the semantics being handled by this proposal. The bottom line seems to be that the proposed formalism achieves the following:

  1. Dynamic instances capture the thread sets that are implicitly determined by control flow graph.
    1. This covers both kinds of operations, with and without explicit thread sets as arguments.
    2. No assumptions are made about thread grouping in the underlying hardware.
  2. There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.
  3. Generic optimizations are safe as long as they preserve the mapping of threads to dynamic instances (basic rule 4).
    1. The mapping is usually altered by changes to surrounding control flow, and hence such changes are forbidden in general.
    2. This does not preclude more informed optimizations that are aware of their impact on the set of threads at a dynamic instance.
    3. Optimizations can also benefit from attributes that indicate how this set of threads is allowed to change.
  4. None of the above has been established beyond all doubt, but the current understanding is sufficient to justify the "experimental" tag.

Is that a reasonable summary at this point?

jlebar added a comment. Nov 9 2020, 4:49 PM

The bottom line seems to be that the proposed formalism achieves the following: <snip>

I agree, fwiw.

What do you need at this point to move forward?

A few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require a convergencectrl bundle and __shfl_sync would not, correct?

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straightforward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

  1. Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)
  2. Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function
  3. Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block
  4. For each call to a convergent function (intrinsic or otherwise), attach a convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to? The llvm.experimental.convergence.loop calls do not dominate the subgroupOp calls, so the llvm.experimental.convergence.anchor call is the only choice. But this breaks the first static rule for cycles in the proposal: we have a use of a convergence token inside a cycle but the token is not defined inside that cycle. Do we just add a call to llvm.experimental.convergence.anchor to both tail1 and tail2, using them as the convergence token for the respective subgroupOp?

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications throughout the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

A few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require a convergencectrl bundle and __shfl_sync would not, correct?

The answer seems to depend on whether it is correct to say the following about __shfl_sync:

  1. The only constraint on control flow transformations around __shfl_sync is identical to the current definition of the convergent attribute.
  2. Other optimizations, such as DCE and reordering, that do not involve changes in control flow should be modelled using other constructs like inaccessiblememonly.

If this is true, then yes, __shfl_sync doesn't need the new convergence control bundles.

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straightforward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

  1. Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)
  2. Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function
  3. Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block
  4. For each call to a convergent function (intrinsic or otherwise), attach a convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Right. These heuristics are actually proposed in a related patch under review:
https://reviews.llvm.org/D85609

In particular, the above pass is expected to do the right thing when working with cross-thread operations that are "textually aligned" (for example, SPIRV, OpenCL, HIP, and CUDA before sm70).
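
For a simple structured loop, applying those rules would yield roughly the following (a hand-written sketch, not actual output of D85609; @subgroupBarrier is a placeholder for some convergent operation):

define void @kernel(i32 %n) convergent {
entry:
  %entry.tok = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:                                                 ; natural loop header
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %loop.tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %entry.tok) ]
  call void @subgroupBarrier() [ "convergencectrl"(token %loop.tok) ]
  %i.next = add i32 %i, 1
  %cont = icmp slt i32 %i.next, %n
  br i1 %cont, label %loop, label %exit

exit:
  ret void
}

declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @subgroupBarrier() convergent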

Things are less clear when you consider odd looping structures, for example:
[snip]
Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to?

I believe the question needs to be turned around: what do you want subgroupOp to anchor to? In general, it should be impossible to infer the correct operand bundles from arbitrary LLVM IR, else we could just express everything as an analysis instead of having these explicit markers. The intention of these markers is for a frontend to be able to produce constraints that cannot be expressed using just the structure of the program.

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications throughout the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

These intrinsics are expected to be very useful for new techniques being worked out in the AMDGPU backend. The following reviews start off the changes required in the optimizer:
https://reviews.llvm.org/D85604
https://reviews.llvm.org/D85605
https://reviews.llvm.org/D85606

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergence.loop() intrinsic at the top, functions always have convergence.entry() at the top, and that's it? If you & co. aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding what this should look like (I'm going to caveat this with saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state, and perhaps we can attempt to refine this topic holistically, for example by also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to application code, since that would require educating application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in the AST, as they make parsing more complicated.

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state, and perhaps we can attempt to refine this topic holistically, for example by also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow a more general "non-uniform" version that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.)

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I believe this is meant to say that the formalism does not forbid putting the loop intrinsic in a non-header block, but that is not expected in most known cases. It is not an optimization choice that every compilation flow must make.

I understand this is not in the scope of this work. [snip] What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. [snip]

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages [snip] Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow a more general "non-uniform" version that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

We do have new functionality in OpenCL that requires supporting convergent operations in non-uniform CF too:
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#_extended_subgroup_functions
https://llvm.org/PR46199

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.)

Sorry for not being clear - I was talking about two separate threads here: (1) generalizing the convergent attribute to non-uniform CF, which is addressed by this patch, and (2) inverting the convergent attribute, which is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have logic that adds convergent to every single function, because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations, but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

Thanks, I was referring to this review indeed. Perhaps it is easier if I spawn a separate discussion there to see whether and how we can apply the same logic to the frontend and also how it combines with the conservative approach of generating convergent attribute everywhere that we have right now.

Sorry for not being clear - I was talking about two separate threads here: (1) generalizing the convergent attribute to non-uniform CF, which is addressed by this patch, and (2) inverting the convergent attribute, which is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have logic that adds convergent to every single function, because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations, but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.

I see now. Thanks! Besides goal (1), the other goal for this new formalism is to clarify the meaning of "convergence" in a way that allows more freedom to the optimizer. Language specs typically define convergence with operational semantics, such as:

  1. SPIRV: "different invocations of an entry point execute the same dynamic instances of an instruction when they follow the same control-flow path"
  2. OpenCL: "all work-items in the work-group must enter the conditional if any work-item in the work-group enters the conditional statement"

The proposed formalism lifts this into a declarative semantics which is easier for the compiler to reason with. This allows optimizations like jump threading, where the transformed program has ambiguous operational semantics (see the example in the actual spec). The presence of convergence control tokens makes sure that the "point of convergence" is well-defined even if the transformed control flow is ambiguous.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

To be honest, I was not aware of this other effort, and even after you pointed it out, I wasn't paying attention to the words that I was reading. It seems like the current spec has so far focussed on demonstrating the soundness of the formalism. But I think it is possible to cover (2), which is to make the default setting conservative. This will need a bit of a rewording. In particular, this definition from the spec:

The convergence control intrinsics described in this document and convergent
operations that have a ``convergencectrl`` operand bundle are considered
*controlled* convergent operations.

Other convergent operations are *uncontrolled*.

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

  1. By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.
  2. A more informed frontend or a suitable transformation can replace this conservative token with one of the following:
    1. A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.
    2. A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.
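
Purely as a hypothetical illustration (neither the implicit default nor a "none" spelling exists in this patch or in LLVM today), the two relaxations might read:

call void @f()
; no bundle: behaves as if "convergencectrl"(token %entry) were attached

call void @g() [ "convergencectrl"(token none) ]
; hypothetical 'none' token: fully relaxed, akin to the proposed noconvergent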

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of llvm.experimental.convergence.anchor: this intrinsic cannot be inferred by the frontend. It provides a new way to represent optimization opportunities like the one demonstrated in the "opportunistic convergence" example. The intrinsic says that the call that uses this token doesn't depend on any specific set of threads, but merely marks the threads that do reach it. This is most useful when multiple calls agree on the same set of threads. Identifying such sets of operations will need help from the user (or more realistically, a library writer). Something like the following might work, where the actual value of group doesn't really matter beyond relating the various calls to each other.

auto group = non_uniform_group_active_workitems();
op1(group);
if (C)
   op2(group);
op3(group);

Hi @jholewinski, sorry for missing your comment earlier. It's been a while! I still need to work through the rest of the comments, but there's a pretty crucial point here that seems to have been missed:

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Regardless of the question about subgroupOp, this example is not valid IR: it breaks the static rule that "Every cycle in the CFG that contains two different uses of a convergence token T must also contain the definition of T." Specifically, there are two uses of entry_token, in head1 and head2, and a cycle head1 -> tail1 -> head2 -> head2b -> tail2 -> head1 that goes through both of them without going through the definition of entry_token.

Roughly speaking, an irreducible loop can contain at most one loop intrinsic that refers to a token from outside the irreducible loop.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This is a good point. Generally, HLLs need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

To be honest, I was not aware of this other effort. [snip] This needs to be inverted in the spirit of D69498. I would propose the following tweak: [snip] Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Sounds good. If that would be acceptable to the wider community, it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively. But in the Compute scenarios we occasionally have very large functions that, when inlined, result in huge binaries and longer compilation time. And we also have extern functions that we have no information about during compilation. So this doesn't seem like a route we can safely take, at least not for all languages.

If we invert the convergent logic then we can add a noconvergent attribute or even a pragma directive for the application developers to indicate what code doesn't contain cross-thread operations and can be optimized more aggressively.

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of llvm.experimental.convergence.anchor: this intrinsic cannot be inferred by the frontend. [snip]

Ok, this makes sense. Thanks for the clarifications.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498). [snip]

This is a good point. Generally, HLLs need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

I see. Technically this sounds feasible to add, i.e., we could insert a custom AST visitor to detect the pattern after the AST is parsed, or perhaps the detection can be done during the parsing itself. The only question is how many such patterns exist considering the variety of HL language constructs, and how this will impact the parsing time, etc. I would say prototyping this could be a good starting point. However, what happens when such patterns are detected? Do we generate IR slightly differently?

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

  1. By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.

[snip]

Sounds good. If that would be acceptable to the wider community, it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

From what I understand, there was a fair bit of agreement in D69498 about the need to make the default safer. The real question is: should we fold that idea into this proposal?

There's one mistake in what I outlined above. The first point, about the default token, is expressly forbidden by the static rule on cycles: if an intrinsic other than .loop inside a cycle uses a token, then the definition must also be in the same cycle. But I think this can be fixed by simply saying that the dynamic instance of a call without an explicit operand bundle is "undetermined". Any optimization must then back off, the whole point being that an optimization is safe if it preserves dynamic instances, and it is impossible to preserve an undetermined dynamic instance.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively. But in the Compute scenarios we occasionally have very large functions that, when inlined, result in huge binaries and longer compilation time. And we also have extern functions that we have no information about during compilation. So this doesn't seem like a route we can safely take, at least not for all languages.

Or in other words, the "proper" definition must cover the whole of LLVM IR, and not introduce any assumptions like the absence of function calls.

If we invert the convergent logic then we can add a noconvergent attribute or even a pragma directive for the application developers to indicate what code doesn't contain cross-thread operations and can be optimized more aggressively.

Dynamic instances provide complete information at the callsite. But having an attribute on a function declaration (especially extern) is useful because it removes the need to analyse the function body itself.

kpet added a subscriber: kpet. Nov 5 2021, 4:07 AM
bader added a subscriber: bader. Jan 17 2022, 12:58 AM
Herald added a project: Restricted Project. · View Herald Transcript · Dec 30 2022, 1:09 AM
sameerds commandeered this revision. Aug 21 2023, 11:48 PM
sameerds added a reviewer: nhaehnle.

Superseded by D147116

sameerds abandoned this revision. Aug 21 2023, 11:48 PM