
IR: Add convergence control operand bundle and intrinsics
Needs Review · Public

Authored by nhaehnle on Aug 9 2020, 7:22 AM.

Details

Summary

See ConvergentOperations.rst for the details.

This replaces the proposal from https://reviews.llvm.org/D68994

This patch adds the operand bundle and intrinsics themselves, as well as
the LangRef documentation describing the semantics of controlled
convergent operations. Follow-up patches will adjust existing passes to
comply with those changes, as well as provide new functionality on top
of this mechanism.

Change-Id: I045c6bc864c4dc5fb0a23b0279e30fac06c5b974


Event Timeline

nhaehnle added inline comments.Aug 11 2020, 8:07 AM
llvm/docs/ConvergentOperations.rst
213–215

The logical split between the two sections is that this section has the basic definitions, while the "Formal Rules" section has the rules about how the convergence control intrinsics place additional constraints on how dynamic instances can be formed.

If the token represented the dynamic instance exactly, then this would also limit the freedom llvm.experimental.convergence.anchor() has. For example, it would rule out thread partitioning, because then no token-producing instruction could return different token values per dynamic invocation.

I'm not sure I understand the argument. What exactly do you mean by dynamic invocation here?

Each time a thread executes the same anchor call site, it will receive a different token value, corresponding to a different dynamic instance. That may or may not be the same dynamic instance as received by other threads. So even if control flow is entirely uniform, an implementation would be free to produce a different thread partitioning each time the anchor is executed. That is on purpose: if you want more predictable thread partitionings, use a combination of entry and loop intrinsics as required.
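To illustrate, here is a minimal sketch using the intrinsics as defined in this patch (@convergent_op and the function itself are hypothetical): entry pins its token to the set of threads that entered the function together, and loop then defines one dynamic instance per iteration relative to that set, which is what makes the partitioning predictable:

declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @convergent_op() convergent

define void @predictable(i32 %n) convergent {
entry:
  ; ties the token to the threads that entered the function together
  %entry.tok = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; one dynamic instance per iteration, anchored to the threads from entry
  %loop.tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %entry.tok) ]
  call void @convergent_op() [ "convergencectrl"(token %loop.tok) ]
  %i.next = add i32 %i, 1
  %cc = icmp ult i32 %i.next, %n
  br i1 %cc, label %loop, label %exit

exit:
  ret void
}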

281

When it comes to defining rules that are applicable to completely general IR, the loop intrinsic call site feels *more* tangible than the notion of a backedge. For example, backedges don't really work as a concept when you have irreducible control flow.

The loop intrinsic call site also really doesn't have to be in the header block of a natural loop -- it could be inside an if-statement in the loop, for example, which has interesting consequences but can still be defined (and can actually be useful: someone pointed me at a recent paper by Damani et al., Speculative Reconvergence for Improved SIMT Efficiency, which proposes a certain "unnatural" way of controlling convergence in some kinds of loops for performance; the same kind of effect can be achieved by placing the loop heart inside an if-statement).

291–294

The intention is that the IR-based rules still apply regardless of whether the caller is in the same module or not. I'm not sure if this needs to be spelled out more clearly.

And yes, for other cases we should be able to think of it as a property of the calling convention.

340–344

No, this is explicitly not sufficient. You can have:

  %tok = call token @llvm.experimental.convergence.anchor()
  br i1 %cc, label %then, label %next

then:
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %next

next:
362–366

I think this comment may have moved to a confusing location relative to the document.

entry and anchor are inherently different.

I'm going to add a note about looking at language specs etc.

388–389

No, the rule excludes code such as:

  %a = call token @llvm.experimental.convergence.anchor()
  %b = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %a) ]
  call void @convergent_op() [ "convergencectrl"(token %b) ]

The convergence region of %b contains a use of %a but not its definition.

I'm going to add a note about nesting.

405

I agree with @t-tye's explanation here. The choice here reflects the choice made e.g. in the Vulkan memory model: the only "convergent" operation (not the term used in Vulkan...) which interacts with the memory model is OpControlBarrier, so it's good to be able to treat these two kinds of communication orthogonally.

447

It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites).

Yes, that is the point of llvm.experimental.convergence.anchor.

And yes, if there were a clear "chain of custody", as you call it, from outside of the loop, then this unrolling with remainder would be incorrect.
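For concreteness, a sketch of such a "chain of custody" (a fragment, not a complete function; @convergent_op is hypothetical) -- here unrolling with a remainder loop would be incorrect, because the loop heart ties each iteration's communication set to the threads that executed the anchor together:

  %outer = call token @llvm.experimental.convergence.anchor()
  br label %loop

loop:
  ; each iteration communicates among the threads from the anchor
  %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br i1 %cc, label %loop, label %exit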

465–471

The first version doesn't have a unique set of dynamic instances in the first place, because anchor is by design implementation-defined.

So the set of possible universes of dynamic instances in the transformed/unrolled version only needs to be a subset of the original one. In a sense, the loop unroll with remainder picks a subset by saying: from now on, if you have two threads with e.g. iteration counts 3 and 4, then they will never communicate during the 3rd iteration.

In the original program, they may or may not have communicated during the 3rd iteration -- up to the implementation, and in this case, the implementation decided to do a form of loop unrolling which implicitly ends up making a choice.

471

I hope this has been answered in the context of your other comments?

508

Is that still grammatically correct? The parse of the sentence is

Loops in which ((a loop intrinsic outside of the loop header) uses a token defined outside of the loop)

That is, "a loop intrinsic outside of the loop header" is the subject of the sentence in the outer parentheses.

517

Going to try an improvement :)

523–525

I mean, anchor is implementation-defined, so you can't make a totally solid statement anyway. You could only make solid *relative* statements if the token produced by the anchor was also used by some other convergent operations, and if those are outside of the if-statement, the sinking wouldn't be allowed anymore anyway...

548–552

CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.

561

I'm going to add that example.

576–579

Should be answered elsewhere.

605–606

The pixel example would use entry instead of anchor. I'm going to add that example.

615–616

Should be answered elsewhere.

sameerds added inline comments.Aug 11 2020, 9:49 PM
llvm/docs/ConvergentOperations.rst
53–56

I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete, that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:

  1. In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
  2. In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
  3. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.

This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?

548–552

Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication, where the mask argument is used to relate the dynamic instances, is outside the scope of this document. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.

sameerds added inline comments.Aug 12 2020, 12:03 AM
llvm/docs/ConvergentOperations.rst
281

It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence?

552–554

So the heart is not a property of the loop itself in LLVM IR. It is a place chosen by the frontend based on semantics external to LLVM IR, in a way that allows the frontend to express constraints about convergence in the loop.

571

Just like the loop intrinsic, this intrinsic occurs in a place chosen by the frontend based on semantics outside of LLVM IR, and used by the frontend to express constraints elsewhere in the IR.

612–613

The older comments about this seem to have floated away. At risk of repeating the discussion, what is *n* capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?

647–650

Just like the *n* property of the loop intrinsic, I think an informational note explaining this will be helpful.

654–657

This is not a rule; it's just a definition.

659–661

Since a convergence region is defined for a token, this text needs to bring out the fact that two different tokens are being talked about at this point. Something like: "If the convergence region for token T1 contains a use of another token T2, then it must also contain the definition of T2."

750–755

So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?

760

Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().

nhaehnle added inline comments.Aug 12 2020, 12:48 PM
llvm/docs/ConvergentOperations.rst
53–56
  1. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.

This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?

What ambiguities do you have in mind?

If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
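A tiny acyclic sketch of that reading (fragment; @convergent_op is hypothetical, and the entry token is included to satisfy the static rules): the set of threads communicating in %then is exactly the subset of the function's threads for which %cc is true.

entry:
  %tok = call token @llvm.experimental.convergence.entry()
  br i1 %cc, label %then, label %next

then:
  ; communicates among exactly the threads that reach %then
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %next

next: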

281

No. The jump-threaded code could also come out of C(++) code with gotos, so this doesn't really work.

548–552

I don't understand what you're trying to get at here.

The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow, it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances.

All legal program transforms subject to those constraints are then legal. There is no need to label them as convergent. If you can think of a counter-example, I'd be curious to see it.
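To illustrate the modeling difference in IR terms, a sketch (hypothetical declaration; the real builtins are source-level CUDA): such a primitive would be declared as a plain side-effecting call, with no convergent attribute and no token operand, so ordinary side-effect rules are all that constrain its motion:

  ; the set of communicating threads is fully specified by %mask,
  ; not by control flow, so no convergence token is needed
  declare i32 @ballot_sync(i32 %mask, i1 %pred)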

552–554

Yes.

571

I'd rephrase it slightly by saying that the place is chosen by the frontend in a way that preserves the semantics of the original language into LLVM IR. But I suspect that we're ultimately thinking of the same thing.

612–613

It's really just a loop iteration counter. Every time a thread executes the loop intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label.

Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic. This means that if you have a natural loop where the loop intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere).
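A sketch of such a loop (fragment; %outer is some token defined outside the cycle, @convergent_op hypothetical) -- the heart sits behind a condition, so a thread's iteration count only advances when %cond is true, which is the "funny" counting described above:

loop:
  br i1 %cond, label %heart, label %latch

heart:
  %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %latch

latch:
  br i1 %cc, label %loop, label %exit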

Unrolling will actually not duplicate the loop intrinsic, but only keep the copy that corresponds to the first unrolled iteration.

654–657

Fair enough. I'm going to split this up into rules about cycles and rules about convergence regions.

659–661

It's needed from a formal point of view, but it does seem to trip people up, so I'm going to implement your suggestion :)

750–755

Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.

760

Correct.

I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.

sameerds added inline comments.Aug 12 2020, 9:59 PM
llvm/docs/ConvergentOperations.rst
53–56

Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but general acyclic control flow needs special attention in specific cases, starting with jump threading.

281

But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert those intrinsics just in case optimizations changed the graph? If yes, is this something that LLVM must prescribe for every frontend as part of this document?

548–552

I am trying to understand whether there are constructs in Clang-supported high-level languages that cannot be addressed by these intrinsics, and if such constructs do exist, whether they gate the adoption of this enhancement in LLVM. But I see your point now. The sync() builtins in CUDA are no longer dependent on convergence. The decision to hoist or sink them is based entirely on other things like data dependences (and maybe just that).

612–613

Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.

This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

Unrolling will actually not duplicate the loop intrinsic, but only keep the copy that corresponds to the first unrolled iteration.

This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.

750–755

The n that I used is the virtual loop count that is described in the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?

760

I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.

nhaehnle updated this revision to Diff 285267.Aug 13 2020, 12:00 AM

Actually submit all the changes that I thought I had submitted
two days ago.

Also:

  • add rationale for the static rule on cycles
  • expand the discussion of program transform correctness for loop unrolling and split the section since it's getting quite large
llvm/docs/ConvergentOperations.rst
53–56

I hesitate to write anything like that, because then you get into the problem of defining what "structured" means -- there are multiple definitions in the literature.

My argument would be that purely acyclic control flow -- whether structured or not -- contains sufficient information about convergence to define semantics consistently, without assistance, and avoiding spooky action at a distance.

That you still need some assistance to make actual *guarantees* is really down to composability. For example, you can have a fully acyclic function called from inside a cycle -- and then what happens at inlining? One can explore an alternative scheme where you don't have to insert anything into the acyclic function in this case and it's the job of the inlining transform to fix things up, and I have done some exploring in this direction. There are at least two downsides:

  1. The burden on generic program transforms becomes larger.
  2. There is no longer any way for the programmer to express the distinction between functions (or sub-sections of code) that care about the set of threads with which they're executed vs. those that don't (like the @reserveSpaceInBuffer example I added), and that closes the door on certain performance optimizations and becomes problematic if you want to start thinking about independent forward progress.
281

It needs to insert the control intrinsics if it wants to have any guarantees. There aren't a lot of useful guarantees we can make today without this, so that's fine.

I don't want to say that frontends absolutely must insert the control intrinsics just yet; that's why uncontrolled convergent operations are allowed but deprecated. Frontends that don't change, for languages with convergent operations, will remain in the world of "things tend to work as expected a lot of the time, but stuff can break in surprising ways at the least convenient moment" that they are already in today. If they run the ConvergenceControlHeuristic pass just after IR generation, the times where things break will likely be somewhat reduced, but probably not eliminated entirely. It's difficult to make a definitive claim because there's obviously also the question of which guarantees the high-level language is supposed to give to the developer. For an HLL that just doesn't want to give any guarantees, not inserting control intrinsics is fine from the POV of language spec correctness, although you're likely to run into corner cases where the language behavior clashes with developers' intuitive expectations.

612–613
Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.

This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

Well... what even is a loop iteration? :)

For the purpose of convergence, the loop heart defines what the iterations are, so it is reached on every iteration *by definition*. (But there may well be cycles in the CFG that don't contain a loop intrinsic, and that's fine.)

More likely your real question is whether in a natural loop, the loop intrinsic must be reached once per execution of the loop header (or traversal of a back edge) -- the answer is no.

Part of the rationale here (and also an unfortunately inherent source of potential confusion) is that for defining convergence, and more generally for implementing whole-program vectorization of the style we effectively do in AMDGPU, leaning only on natural loops doesn't work, at least in part because of the possibility of irreducible control flow. This is why all the actual algorithms I'm building on this rely on the Havlak-inspired CycleInfo of D83094, and all the rules in this document are expressed in terms of cycles (in the sense of circular walks in the CFG) instead of natural loops.

My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.

I would have liked to have that property but couldn't make it work without imposing static rules that would be much harder to understand and follow. The point about unrolling is mentioned in the later examples section where I talk through a bunch of example loops and whether they can be unrolled or not.

750–755

The formal explanation is ultimately that the set of communicating threads is changed, but I agree that it could be helpful to spell out how that comes about via the rules on dynamic instances, so I'm going to do that.

760

I'm adding a "rationale" section specifically to explain those static rules about cycles.

nhaehnle updated this revision to Diff 285272.Aug 13 2020, 12:32 AM

Typos and yet slightly more detail.

sameerds added inline comments.Aug 13 2020, 2:57 AM
llvm/docs/ConvergentOperations.rst
175–176

But this use of the intrinsics does not add any new constraints, right? This specific optimization is already sufficiently constrained by control dependence.

744–745

The exhausted reader just begs to see the corrected version at this point! :)

808–810

Following the structure of previous examples, it would be good to have a demonstration of how this can result in misinterpreted convergence. That would explain why this example should be illegal. This paragraph directly applies the rules to show how the example is recognized as illegal.

nhaehnle added inline comments.Aug 13 2020, 4:01 AM
llvm/docs/ConvergentOperations.rst
175–176

It doesn't add any constraints for existing generic transforms in LLVM that I'm aware of, but there's still a bit of non-trivial content to it at least in theory. Whether it matters in practice depends on the backend.

E.g., it doesn't matter for AMDGPU, but modern versions of CUDA say that some sort of divergence can basically happen at any point in the program. If you wanted to take code that uses the convergent operations and translate it to CUDA builtins, the control intrinsics make a difference. In that case, you'd want the uniform threadmask to replace the entry intrinsic. If it was an anchor somewhere instead, you'd want to replace the anchor by __activemask() and then use its return value. In both cases, you'd possibly modify the mask somehow to account for additional control dependencies between the anchor and its use. This "modify the mask somehow" hides a lot of complexity, but thinking about it quite a bit I believe it's a similar amount of complexity to what we have in the AMDGPU backend to make things work, probably less because more of the burden is shouldered by hardware in the end.

Plus there's the composability aspect of it if we're talking about functions that aren't kernel entry points and might be inlined.

744–745

The exhausted author is taking a note and will get around to it soon ;)

808–810

Isn't it just the same as in the example directly above? You'd expand C / E to a longer sequence of what happens in those inner loops, but the essential difficulty is the same.

sameerds added inline comments.Aug 13 2020, 9:40 AM
llvm/docs/ConvergentOperations.rst
808–810

Maybe it is the same. See earlier note about exhausted reader. :) Maybe it's just me, but the concepts in this document are quite slippery, and well-rounded examples that restate the obvious can go a long way in gaining confidence.

nhaehnle updated this revision to Diff 285707.Aug 14 2020, 11:35 AM

Add more language about loops

nhaehnle updated this revision to Diff 286816.Aug 20 2020, 7:42 AM
  • tighten the static rules about cycles: there was a gap in the exact phrasing if two loop heart intrinsics in a cycle use _different_ convergence tokens
  • add verifier checks and corresponding tests for the static rules
nhaehnle updated this revision to Diff 286817.Aug 20 2020, 7:43 AM

clang-format fixes

simoll added inline comments.Aug 28 2020, 5:19 AM
llvm/docs/ConvergentOperations.rst
340–344

You mean control flow could make threads diverge? But those threads won't even reach the convergent instruction, and among those that do, only those that have the same runtime token value will execute it as a pack.

nhaehnle added inline comments.Sep 7 2020, 7:49 AM
llvm/docs/ConvergentOperations.rst
340–344

Ah, I misread your earlier comment. Yes, though there's a question of whether the different threads actually see the same value, or whether they see different values that happen to refer to the same dynamic instance of the defining instruction. One may want to think of the token value as a handle to some control structure that refers to a dynamic instance and also holds a loop counter for the loop heart intrinsic. I don't think it really matters much either way.

I've only read up to Formal Rules, so later sections might change things, but I figure it's potentially useful to see a reader's thoughts mid-read. I'm pretty sure I've misunderstood the anchor intrinsic based on what I've read of the doc and comments so far.

llvm/docs/ConvergentOperations.rst
27–29

This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models are stretching the definition a bit.

I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.

140–143

I think this is a little misleading. IIUC, and assuming that the sets of communicating threads are quads as mentioned above, %condition doesn't need to be uniform across all the threads referenced by %entry. The only use is inside the then: block, so I would expect that communicating threads for which %condition is uniformly false don't need to be considered, as their result will not be used by any thread that enters then:. As you're trying to leave methods out, it's probably best left at ... with additional knowledge, that it doesn't change the result

The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform, would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: how would such a transform know what to prove and when to try?

For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the instructions'/intrinsics' convergence than convergencectrl declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if @textureSample were a non-intrinsic function with the same properties you describe, I don't think it would be possible to know any better than what convergencectrl declares, preventing the analysis the sinking transform would depend on. It's arguably out of scope for this doc, but do you foresee convergence tokens and the convergent attribute becoming finer-grained in the future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?

211–213

Should we also mention that it's valid when %cc is non-uniform so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking %delta using %cc first). In other words, it's not the actual communication we need to keep consistent but the effects (and side-effects) of that communication.

248–252

I feel like there's something I'm missing here. This sounds like:

if (condition1) {
  %token = anchor()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

can be rewritten to:

if (condition1) {
  if (condition2) {
    %token = anchor()
     ...
    sum() convergencectrl(%token)
  }
}

which made sense at first given statements like "we don't care which threads go together", but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:

if (condition1) {
  %token = entry()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case?

Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as its name would suggest. However, I haven't seen anything say it's immobile, and things like "we don't care which threads go together" and "the code does not care about the exact set of threads with which it is executed" give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move, then there'd be nothing stopping me from deleting it either, as I could always invent an if (false) { ... } to sink it all into.

nhaehnle added inline comments.Sep 15 2020, 6:08 AM
llvm/docs/ConvergentOperations.rst
27–29

That's a fair point. The way I'm thinking about this is that there may be communication with a void payload, but ultimately this can be bikeshed to death.

140–143

I can clean up the text.

As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If %condition is uniform (in a suitably defined sense), then sinking the @textureSample is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved.

This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of %condition at the quad scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.

248–252

That transform is allowed (assuming that sinking the user of the result of the sum() is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of sum could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :)

If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use entry instead of anchor.

anchor is still useful, as long as you have multiple things anchored to it. It will then ensure that they are relatively consistent to each other.
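A sketch of that last point (fragment; @subgroup_sum is hypothetical): both calls are anchored to the same token, so whatever set of threads the implementation picks for the anchor, both sums are executed by that same set and are consistent with each other:

  %tok = call token @llvm.experimental.convergence.anchor()
  %x = call i32 @subgroup_sum(i32 %a) [ "convergencectrl"(token %tok) ]
  %y = call i32 @subgroup_sum(i32 %b) [ "convergencectrl"(token %tok) ]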

sameerds added inline comments.Sep 16 2020, 4:43 AM
llvm/docs/ConvergentOperations.rst
248–252

If I understand this right, then even entry does not capture anything specific ... it is merely a placeholder for the anchor at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The entry at the start of a kernel is almost the same as an anchor, except the definition of threads that see the same dynamic instance is coming from the language above rather than the implementation below.

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Is that correct?

sameerds added inline comments.Sep 16 2020, 4:51 AM
llvm/docs/ConvergentOperations.rst
248–252

Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.

nhaehnle added inline comments.Sep 23 2020, 9:26 AM
llvm/docs/ConvergentOperations.rst
248–252

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Probably? I'm not sure I agree with the exact wording. In a compute kernel, the entry intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). loop intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on.

So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can only enforce preservation of groupings.

I hope that's what you meant, just with different words? :)