Page MenuHomePhabricator

IR: Add convergence control operand bundle and intrinsics
Needs ReviewPublic

Authored by nhaehnle on Aug 9 2020, 7:22 AM.

Details

Summary

See ConvergentOperations.rst for the details.

This replaces the proposal from https://reviews.llvm.org/D68994

This patch adds the operand bundle and intrinsics themselves, as well as
the LangRef documentation describing the semantics of controlled
convergent operations. Follow-up patches will adjust existing passes to
comply with those changes, as well as provide new functionality on top
of this mechanism.

Change-Id: I045c6bc864c4dc5fb0a23b0279e30fac06c5b974

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

clang-format fixes

simoll added inline comments.Aug 28 2020, 5:19 AM
llvm/docs/ConvergentOperations.rst
340–344

You mean control could deviate threads? But those threads won't even reach the convergent instruction and only among those that do those that have the same runtime token value will execute it as a pack.

nhaehnle added inline comments.Sep 7 2020, 7:49 AM
llvm/docs/ConvergentOperations.rst
340–344

Ah, I misread your earlier comment. Yes, though there's a question of whether the different threads actually see the same value, or whether they see different values that happen to refer to the same dynamic instance of the defining instruction. One may want to think of the token value as a handle to some control structure that refers to a dynamic instance and also holds a loop counter for the loop heart intrinsic. I don't think it really matters much either way.

I've only read up to Formal Rules so later sections might change things but I figure it's potentially useful to see a readers thoughts mid-read. I'm pretty sure I've misunderstood the anchor intrinsic based on what I've read of the doc and comments so far.

llvm/docs/ConvergentOperations.rst
28–30

This is rather nit-picky but there's some convergent operations where inter-thread communication isn't happening depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses) which fits the definition here, but it could also be modelled threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening but the second and third models are stretching the definition a bit

I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.

141–144

I think this is a little misleading, IIUC and assuming that the sets of communicating threads are quads as mentioned above then %condition doesn't need to be uniform across all the threads referenced by %entry. The only use is inside the then: block so I would expect that communicating threads for which %condition is uniformly false don't need to be considered as their result will not be used by any thread that enters then:. As you're trying to leave methods out, it's probably best left at ... with additional knowledge, that it doesn't change the result

The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: How would such a transform know what to prove and when to try?

For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the instructions/intrinsics convergence than convergencectrl declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if @textureSample were a non-intrinsic function with the same properties you describe I don't think it would be possible to know any better than what convergencectrl declares, preventing the analysis the sinking transform would depend on. It's arguably out of scope for this doc but do you foresee convergence tokens and the convergent attribute becoming finer grained in future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?

212–214

Should we also mention that it's valid when %cc is non-uniform so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking %delta using %cc first). In other words, it's not the actual communication we need to keep consistent but the effects (and side-effects) of that communication.

248–252

I feel like there's something I'm missing here. This sounds like:

if (condition1) {
  %token = anchor()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

can be rewritten to:

if (condition1) {
  if (condition2) {
    %token = anchor()
     ...
    sum() convergencectrl(%token)
  }
}

which made sense at first given statements like we don't care which threads go together, but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:

if (condition1) {
  %token = entry()
  if (condition2) {
     ...
  }
  sum() convergencectrl(%token)
}

and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case?

Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as it's name would suggest. However I haven't seen anything say it's immobile and things like we don't care which threads go together and the code does not care about the exact set of threads with which it is executed give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move then there'd be nothing stopping me deleting it either as I could always invent a if(false) { ... } to sink it all into.

nhaehnle added inline comments.Sep 15 2020, 6:08 AM
llvm/docs/ConvergentOperations.rst
28–30

That's a fair point. The way I'm thinking about this is that there may be communication with a void payload, but ultimately this can be bikeshed to death.

141–144

I can clean up the text.

As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If %condition is uniform (in a suitably defined sense), then sinking the @textureSample is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved.

This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of %condition at the quad scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.

248–252

That transform is allowed (assuming that sinking the user of the result of the sum() is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of sum could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :)

If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use entry instead of anchor.

anchor is still useful, as long as you have multiple things anchored to it. It will then ensure that they are relatively consistent to each other.

sameerds added inline comments.Sep 16 2020, 4:43 AM
llvm/docs/ConvergentOperations.rst
248–252

If I understand this right, then even entry does not capture anything specific ... it is merely a place holder for the anchor at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The entry at the start of a kernel is almost the same as an anchor, except the definition of threads that see the same dynamic instance is coming from the language above rather than the implementation below.

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Is that correct?

sameerds added inline comments.Sep 16 2020, 4:51 AM
llvm/docs/ConvergentOperations.rst
248–252

Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.

nhaehnle added inline comments.Sep 23 2020, 9:26 AM
llvm/docs/ConvergentOperations.rst
248–252

The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Probably? I'm not sure I agree with the exact wording. In a compute kernel, the entry intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). loop intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on.

So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can only enforce preservation of groupings.

I hope that's what you meant, just with different words? :)

Hi. :) A few people pinged me asking for my feedback here, since I touched the convergent attr way back in the day, for CUDA.

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

I strongly agree that convergent as-is has problems. Fixing them is clearly complicated, and it seems like a lot of work has gone into this proposal.

I have been out of it for too long to feel comfortable signing off on whether this proposal fixes the problems with convergent. The proposal seems reasonable to me, but as we saw with e.g. undef/poison, these things can be extremely subtle.

I'm also not comfortable speaking to whether this representation will be ergonomic in the relevant LLVM passes.

What I'm more comfortable speaking to is:

  • Is the proposal clear to me?

I think the proposal is clear, modulo my few comments (relative to the length of the patch, anyway). This kind of writing is really tricky, I admire that I could mostly understand it. I thought the extensive examples were really helpful.

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

Hope this helps.

llvm/docs/ConvergentOperations.rst
28–30

CUDA __syncthreads() is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information. It's just a barrier.

Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.

The way I'm thinking about this is that there may be communication with a void payload,

If you count "communicate nil" as communication, then perhaps the operation is not in fact communication but rather is "communication or synchronization"? Perhaps:

A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow.

89

Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample. I am imagining that it involves something like a __shfl, but that's because I already understand GPUs. Your audience is bigger than that.

221

Nit: Clarify that this example isn't using the proposed convergence intrinsics? Perhaps

Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document.

250

Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?

313

This paragraph really clarifies for me what's going on. +1

348

...wait, there are such things as convergent functions? This is the first I'm hearing about it in the doc! So far it seemed there were only convergent *calls*. What's a convergent function? :)

499

Do you plan to check this in the verifier (insofar as possible, I understand that it's not possible to check this for cross-TU calls).

507

This one is a local property -- could we say that this makes the program ill-formed, instead of UB?

511

Again, could we say this makes the program ill-formed? (At least the entry-block check, I'm not sure what a convergence region is, yet.)

595

Have we formally defined what a "controlled" convergent operation is? Do you mean a call to a convergent function with a "convergencectrl" operand bundle? (Say that?)

955

In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:

  • Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
  • Partial unrolling by a trip multiple (so there's no "tail")
  • Full unrolling, which eliminates the loop

I think you're saying that only the first kind of unrolling is tricky.

982–983

It would help me if we could we elaborate with half a sentence what the behavior change might be.

988–989

Do you mean that this *kind of* unrolling is forbidden?

But if you're going to forbid *all* unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower. Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?

999

One thing I don't get from this example is what I should do as a *frontend* to LLVM. That is, when should I do this form, and when should I put a new anchor inside a loop?

It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor. That is, I have to behave as though the compiler could transform

anchor()
foo();

into, effectively

if (threadIdx.x % 2 == 0) {
  anchor()
  convergent_fn();
} else {
  anchor();
  convergent_fn();
}

Something like this?

OK, so I always have to use the convergence.loop() form. But then this is saying I can never unroll.

ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense? But would help me if we were explicit about that.

1033

counter > 1?

nhaehnle updated this revision to Diff 301840.Oct 30 2020, 2:39 AM
nhaehnle marked 3 inline comments as done.

Address the comments from @jlebar that I indicate I'd address,
except for changes affecting the Verifier -- I'll do those later.

I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context. It's a lot. Sorry that I'm probably bringing up things that have already been discussed.

Thanks, and don't worry. A lot of the old comments don't make sense anymore because the document was changed and Phabricator shows them in nonsensical places unfortunately.

[snip]

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding what this should look like (I'm going to caveat this with saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

I have trouble answering this as well due to the lack of proper specification from Nvidia, and I'm not set up to run this kind of experiment.

From a theory point of view, because those newer versions of sync operations take that explicit arg, we shouldn't consider them to be convergent according to what's being defined here. Only the __activemask() builtin probably still needs to be considered convergent (also in light of https://bugs.llvm.org/show_bug.cgi?id=47210).

The result of your experiment seems to contradict the theory. Having worked on this part of our compiler for a while now, I think it's entirely possible that the result of your experiment is simply a bug somewhere along the compiler stack, but of course I can't say for certain. If it's not supposed to be a bug, then to me this means there's something subtle missing in the way the new sync operations are described. Either way, some clarification would be good.

llvm/docs/ConvergentOperations.rst
28–30

Your suggestion looks good to me, going to apply it.

89

textureSample is actually a built-in function of graphics languages. I'm going to add a clause to try to clarify that. I assume all GPUs have dedicated circuitry for it. I specifically wanted to mention textureSample in the document at least once because it (and some close analogs) are often forgotten in discussions of convergent even by graphics people like myself.

Obviously the document should also be accessible to folks from the GPU compute world, which is why I tried to give a succinct explanation of the relevant facts about textureSample in the paragraph above.

Later in the document there are also examples using shuffles, though with the Khronos-y spelling of subgroupShuffle instead of the CUDA-y __shfl. The choice of spelling is partly because that's just the world I'm personally working in most of my time, but also partly because I'd prefer using terms from common industry standards. I understand that CUDA is a bit of a de facto "standard", so if you think it's necessary to convert at least one example to CUDA spelling, we can do that -- just not this one here in particular, because it's specifically meant to be a graphics shader example.

221

Thanks, going to make this change.

250

Added ellipsis.

348

Uhh... technically true. How about adding something like the following somewhere:

In LLVM IR, function calls are the only instructions that can involve convergent
operations. A call itself (i.e., the act of jumping to the callee, setting up a
stack frame, etc.) is not a convergent operation. However, if the callee uses
the `llvm.experimental.convergence.entry` intrinsic, then we think of the
entire execution of the callee as a convergent operation from the perspective of
the calling function. Such callees must be marked with the `convergent`
attribute, and for brevity we say that they are "convergent functions". If the
callee isn't known at the call site (i.e., an indirect function call), then the
`call instruction itself must have the convergent` attribute.

The only reason for why a function F would need to use the
`llvm.experimental.convergence.entry` intrinsic is if F in turn uses some
other convergent operation, i.e., a call to a convergent function. Chains of
such calls are expected to eventually end with the use of a (target-specific)
intrinsic that is `convergent`.

499

Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...

507

Yes, that's a good idea.

511

The entry-block check should be straightforward.

595

Yes, the section "Dynamic Instances and Convergence Tokens" already says this:

The convergence control intrinsics described in this document and convergent
operations that have a `convergencectrl` operand bundle are considered
*controlled* convergent operations.

I'm going to add an anchor there since the doc is pretty long :)

955

Yes, that's correct, and I'm going to add essentially your three bullets at the top.

982–983

I gave it a try. It ended up being a full sentence though ;)

988–989

Yes, this kind of unrolling. This is already forbidden for uncontrolled convergent operations today. If you want to dig a little deeper, I would appreciate if you could also add your review to D85605. That's a follow-up change for (1) correctness of loop unrolling with regards to the loop intrinsics rules and (2) relaxing some of the constraints that exist today where possible when all convergent ops are controlled (by an anchor in the loop).

999

ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense?

Yes, that's correct. Hopefully clearer with the addition at the top of the section.

It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.

Right. The anchor essentially allows you to achieve the same thing as __activemask in CUDA, but in a more structured way that doesn't run into problems when you have two sides of an if/else both executing a sync operation with the same thread mask.

1033

Thanks, changing to counter >= 2 because that's what I had in a similar example above.

jlebar added a comment.EditedNov 1 2020, 9:49 AM

Man, phab doesn't make this easy, does it?

One tip, shift+A hides all inline comments, making the patch easier to read. One problem, though: It hides all the comments! :)

Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that you can link two well-formed modules together and end up with an ill-formed one? If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...

I see. You may not be able to check this. My preference, which it seems like you share, is just, inasmuch as we _can_ mark something as ill-formed and check for it, that seems preferable to UB.

How about adding something like the following somewhere:

In LLVM IR, function calls are the only instructions that can involve convergent operations. A call itself (i.e., the act of jumping to the callee, setting up a stack frame, etc.) is not a convergent operation. [...]

Yes, that is very clear to me.

Thank you for making those changes.

I am satisfied that this can be implemented in a frontend (and anyway, you have the patch). I've pinged some folks at nvidia asking for them to have a look wrt sm70, and I actually already got a reply, so I am hopeful we might hear from them. I don't want to keep you in limbo indefinitely, so I've asked if they might be able to provide a timeline.

Stay tuned, I guess.

vgrover99 added a subscriber: vgrover99.EditedNov 2 2020, 7:12 PM

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I believe what is described here about convergent, as best I can understand it, is the semantics of syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical syncthreads().

I believe with sm70 the re-convergence has different semantics, due to the fact that we have forward progress guarantee in a warp. In pre-sm70 the following could deadlock

volatile int flag = 0;

if (cond) { // thread dependent conditional

while (flag == 0) ; // spin-lock

} else

flag++;

// re-convergence point

now it works as expected

The following also works (doesn't deadlock)

volatile int flag = 0;

if (cond) { // thread dependent conditional

while (flag == 0) ; // spin-lock

}
// re-convergence point
flag++;

I believe what is described here about convergent, as best I can understand it, is the semantics of syncthreads in CUDA. This semantics is the same for <sm70 and sm70+. Not clear whether what is described here is a "textually aligned" semantics or unaligned. syncthreads is aligned, meaning that all threads in the threadblock must wait on the same lexical syncthreads().

Textual alignment is a good way to examine this spec with respect to CUDA. The notion of dynamic instances is textually aligned according to basic rule 2:

  1. Executions of different instructions always occur in different dynamic instances. For this and other rules in this document, instructions of the same type at different points in the program are considered to be different instructions.

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

In general, the formal rules work correctly for this too: hoisting and sinking is disallowed without additional information. So the proposal is compatible with the new CUDA semantics for these builtins. These builtins do need convergence control: sinking such a call across a condition should be forbidden by default, since we can no longer guarantee that every thread in the mask still makes a matching call. Of course, specific optimizations that can recompute the mask can over-ride this restriction.

foo = __shfl_sync(mask, ...);
if (condition) {
  // cannot sink foo here if condition is divergent
  sole_use_of_foo();
}

This correctly covers __syncthreads(), and is a bit conservative about builtins that take a mask like __syncwarp(). In @jlebar's example, each call to __syncwarp() is a separate dynamic instance, although CUDA actually treats them as a single synchronization point.

Yes, in CUDA and in PTX __syncwarp is an unaligned primitive.

  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the section on hoisting and sinking. Consider the following example copied from the spec:

define void @example(...) convergent {
  %entry = call token @llvm.experimental.convergence.entry()
  %data = ...
  %id = ...
  if (condition) {
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
    ...
  }
}

Here, hoisting subgroupShuffle() is generally disallowed because it depends on the identity of active threads. A CUDA builtin with a mask argument similarly identifies specific threads that must be active at the set of textually unaligned calls that synchronize with each other. So any change in the control flow surrounding those calls is generally disallowed without more information. The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

nhaehnle added a comment.EditedNov 6 2020, 8:44 AM
  • Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples. But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is. At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

// Warning, does not seem to work on sm75
if (cond)
  __syncwarp(FULL_MASK);
else
  __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make cond have different values within the warp. So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics. I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal. Then we should also double-check that clang can in fact generate the relevant LLVM IR.

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the section on hoisting and sinking. Consider the following example copied from the spec:

define void @example(...) convergent {
  %entry = call token @llvm.experimental.convergence.entry()
  %data = ...
  %id = ...
  if (condition) {
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
    ...
  }
}

Here, hoisting subgroupShuffle() is generally disallowed because it depends on the identity of active threads. A CUDA builtin with a mask argument similarly identifies specific threads that must be active at the set of textually unaligned calls that synchronize with each other. So any change in the control flow surrounding those calls is generally disallowed without more information. The new representation doesn't seem to restrict a more informed optimizer that can predict how the threads evolve.

Yes, that makes sense to me, although it also makes sense to reflect on this a bit more.

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

What's interesting is that the latter (__shfl_syncetc.) have similar constraints on how they can and can't be moved, but for different reasons, and the constraints are different. For example:

mask = ...;
if (blah) {
  y = __shfl_sync(a, b, mask);
  ...
} else {
  y = __shfl_sync(a, b, mask);
  ...
}

The __shfl_sync has an explicit thread mask and can be hoisted. However, a subgroupShuffle with implicit thread mask cannot be hoisted. So here, __shfl_sync allows more freedom.

Conversely:

__shfl_sync(a, b, mask); // result unused

This cannot be dead-code eliminated, because it might be communicating with threads executing a different part of the program. By contrast, a subgroupShuffle with implicit thread mask whose result is unused can be dead-code-eliminated. So here, subgroupShuffle allows more freedom.

By similar logic, subgroup ops with implicit thread mask in the same basic blocks can be re-ordered wrt each other, but this is not true for explicit thread mask (not-textually-aligned) subgroup ops.

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

Roughly speaking, there are subgroup ops with "implicit" thread set (what Vinod calls textually aligned, and what this proposal mostly focuses on, because they require the most additional explanation) and subgroup ops with "explicit" thread set (sm70+).

That's an excellent way to "lift" sm70+ operations into the semantics being handled by this proposal. The bottomline seems to be that the proposed formalism achieves the following:

  1. Dynamic instances capture the thread sets that are implicitly determined by control flow graph.
    1. This covers both kinds of operations, with and without explicit thread sets as arguments.
    2. No assumptions are made about thread grouping in the underlying hardware.
  2. There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.
  3. Generic optimizations are safe as long as they preserve the mapping of threads to dynamic instances (basic rule 4).
    1. The mapping is usually altered by changes to surrounding control flow, and hence such changes are forbidden in general.
    2. This does not preclude more informed optimizations that are aware of their impact on the set of threads at a dynamic instance.
    3. Optimizations can also benefit from attributes that indicate how this set of threads is allowed to change.
  4. None of the above has been established beyond all doubt, but the current understanding is sufficient to justify the "experimental" tag.

Is that a reasonable summary at this point?

jlebar added a comment.Nov 9 2020, 4:49 PM

The bottomline seems to be that the proposed formalism achieves the following: <snip>

I agree, fwiw.

What do you need at this point to move forward?

Few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require convergentctrl and __shfl_sync would not, correct?

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straight-forward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

  1. Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)
  2. Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function
  3. Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block
  4. For each call to a convergent function (intrinsic or other-wise), attach convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to? The llvm.experimental.convergence.loop calls do not dominate the subgroupOp calls, so the llvm.experimental.convergence.anchor call is the only choice. But this breaks the first rule static rule of cycles in the proposal: we have a use of a convergence token inside a cycle but the token is not defined inside the loop. Do we just add a call to llvm.experimental.convergence.anchor to both tail1 and tail2, using them as the convergence token for the respective subgroupOp?

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications through-out the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

Few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

I believe that with this proposal, we can model this with the attributes we have by saying that subgroupShuffle is convergent readnone, while __shfl_sync is inaccessiblememonly.

subgroupShuffle would require convergentctrl and __shfl_sync would not, correct?

The answer seems to depend on whether it is correct to say the following about __shfl_sync

  1. The only constraint on control flow transformations around __shfl_sync is identical to the current definition of the convergent intrinsic.
  2. Other optimizations such as DCE and reordering, that do not involve changes in the control flow, should be modelled using other constructs like inaccessiblememonly

If this is true, then yes, __shfl_sync doesn't need the new convergence control bundles.

There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straight-forward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

  1. Add call to llvm.experimental.convergence.entry to the beginning of every convergent function (generally assume the program entry-point is convergent)
  2. Add call to llvm.experimental.convergence.anchor to the beginning of every non-convergent function
  3. Add call to llvm.experimental.convergence.loop to the beginning of every natural loop header block
  4. For each call to a convergent function (intrinsic or other-wise), attach convergencectrl bundle pointing to the closest call to entry/anchor/loop, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Right. These heuristics are actually proposed in a related patch under review:
https://reviews.llvm.org/D85609

In particular, the above pass is expected to do the right thing when working with cross-thread operations that are "textually aligned" (for example, SPIRV, OpenCL, HIP, and CUDA before sm70).

Things are less clear when you consider odd looping structures, for example:
[snip]
Ignoring the new intrinsics, the subgroupOp calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the subgroupOp calls anchor to?

I believe the question needs to be turned around: what do you want subgroupOp to anchor to? In general, it should be impossible to infer the correct operand bundles from arbitrary LLVM IR, else we could just express everything as an analysis instead of having these explicit markers. The intention of these markers is for a frontend to be able to produce constraints that cannot be expressed using just the structure of the program.

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications through-out the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

These intrinsics are expected to be very useful for new techniques being worked out in the AMDGPU backend. The following reviews start off the changes required in the optimizer:
https://reviews.llvm.org/D85604
https://reviews.llvm.org/D85605
https://reviews.llvm.org/D85606

  • Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this. Is it as simple as saying, loops always have a convergent.loop() intrinsic at the top, functions always have convergent.entry() at the top, and that's it? If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only really know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding what this should look like (I'm going to caveat this with saying that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs, there's a pretty natural answer that boils down pretty much to what you wrote. Of course it generally breaks down in the face of goto, for example. I have a follow-on patch, D85609, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert them directly during IR generation, but this at least partially depends on the high-level language semantics question.

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow a more general "non-uniform" version that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I believe this is meant to say that the formalism does not forbid putting the loop intrinsic in a non-header block, but that is not expected in most known cases. It is not an optimization choice that every flow must make.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

Regarding the HLL and frontend side, I believe this could be represented fairly similarly in different C/C++-based languages - considering that we already follow the same implementation for existing convergent semantics at least between CUDA and OpenCL. However, it isn't yet in its optimal state and perhaps we can attempt to refine this topic holistically for example also addressing the following rework that removes the need to make everything convergent: https://reviews.llvm.org/D69498. Otherwise, we will likely have to generate the convergent intrinsics absolutely everywhere, which is not ideal!

As far as I could skim through the specs, OpenCL requires that all threads in a workgroup or subgroup encounter a convergent operation. On the other hand, SPIRV and CUDA allow a more general "non-uniform" version that are executed by "currently active" threads. This proposal is general enough to cover both cases (independent of the newer CUDA primitives that take explicit masks).

We do have a new functionality in OpenCL that requires supporting convergent operations in non-uniform CF too:
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#_extended_subgroup_functions
https://llvm.org/PR46199

Also, this new proposal supersedes https://reviews.llvm.org/D69498. In fact it presents a generalization that can even entirely eliminate the need for the convergent attribute. (See Nicolai's older comment about keeping convergent for now ... it can be used by frontends who elect to keep the current broken-but-well-known formalism it represents.

Sorry for not being clear - I was talking about two separate threads here (1) generalizing convergent attribute to non-uniform CF that is addressed by this patch and (2) inverting convergent attribute that is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have a logic that adds convergent to every single function because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

Looking at the wording in some parts of your convergent semantics definition there might be options resulting in some tradeoff between tooling complexity and optimization opportunities:

+The
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic is typically expected to appear in the header of a natural loop.
+However, it can also appear in non-header blocks of a loop. In that case, the
+loop can generally not be unrolled.

I believe this is meant to say that the formalism does not forbid putting the loop intrinsic in a non-header block, but that is not expected in most known cases. It is not an optimization choice that every flow must make.

I understand this is not in the scope of this work. And I think it is perfectly reasonable to provide experimental support that could help with further evaluation and productization too. However, it would be good to make some preliminary assessment for the frontend support rather soon. What I think could speed up the progress on the frontend/HLL is some sort of description about the conditions where the new intrinsics have to be inserted. My understanding is that the plan is not to expose them to the application code that would require educating the application developers about all the low-level details? Looking at your transformation pass in https://reviews.llvm.org/D69498 it seems that adding those automatically should somehow be possible and you already have some rules defined where and how those can be added? But there are certain things that can be done in IR that are very constrained in AST as it makes Parsing more complicated.

This other review request is likely to demonstrate what you are asking for:
https://reviews.llvm.org/D85609

Thanks, I was referring to this review indeed. Perhaps it is easier if I spawn a separate discussion there to see whether and how we can apply the same logic to the frontend and also how it combines with the conservative approach of generating convergent attribute everywhere that we have right now.

Sorry for not being clear - I was talking about two separate threads here (1) generalizing convergent attribute to non-uniform CF that is addressed by this patch and (2) inverting convergent attribute that is addressed in https://reviews.llvm.org/D69498. Just to provide more details regarding (2) - right now in clang we have a logic that adds convergent to every single function because when we parse the function we don't know whether it will call any function in a call tree that would use convergent operations. Therefore we need to be conservative to prevent incorrect optimizations but this is not ideal for multiple reasons. The optimiser can undo all or some of those convergent decorations if it can prove they are not needed. And for the uniform CF convergent operations this was the only "broken" functionality to my memory.

I see now. Thanks! Besides goal (1), the other goal for this new formalism is to clarify the meaning of "convergence" in a way that allows more freedom to the optimizer. Language specs typically define convergence with operational semantics, such as:

  1. SPIRV: "different invocations of an entry point execute the same dynamic instances of an instruction when they follow the same control-flow path"
  2. OpenCL: "all work-items in the work-group must enter the conditional if any work-item in the work-group enters the conditional statement"

The proposed formalism lifts this into a declarative semantics which is easier for the compiler to reason with. This allows optimizations like jump threading, where the transformed program has ambiguous operational semantics (see the example in the actual spec). The presence of convergence control tokens makes sure that the "point of convergence" is well-defined even if the transformed control flow is ambiguous.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

To be honest, I was not aware of this other effort, and even after you pointed it out, I wasn't paying attention to the words that I was reading. It seems like the current spec has so far focussed on demonstrating the soundness of the formalism. But I think it is possible to cover (2), which is to make the default setting conservative. This will need a bit of a rewording. In particular, this definition from the spec:

The convergence control intrinsics described in this document and convergent
operations that have a ``convergencectrl`` operand bundle are considered
*controlled* convergent operations.

Other convergent operations are *uncontrolled*.

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

  1. By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.
  2. A more informed frontend or a suitable transformation can replace this conservative token with one of the following:
    1. A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.
    2. A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of the llvm.experimental.convergence.anchor, this intrinsic cannot be inferred by the frontend. It represents a new ability to represent optimization opportunities like the one demonstrated in the "opportunistic convergence" example. The intrinsic says that the call that uses this token doesn't depend on any specific set of threads, but merely marks the threads that do reach it. This is most useful when multiple calls agree on the same set of threads. Identifying such sets of operations will need help from the user (or more realistically, a library writer). Something like the following might work, where the actual value of group doesn't really matter beyond relating the various calls to each other.

auto group = non_uniform_group_active_workitems();
op1(group);
if (C)
   op2(group);
op3(group);

Hi @jholewinski, sorry for missing your comment earlier. It's been a while! I still need to work through the rest of the comments here, but there's a pretty crucial point here that seems to have been missed:

Things are less clear when you consider odd looping structures, for example:

entry:
    entry_token = llvm.experimental.convergence.anchor();
    if (cond1) goto head1;
    else goto head2;

head1:
    head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    cond2 = ...;
    if cond2 goto tail1;
    else goto tail2;

head2:
    head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
    break_cond = ...
    if break_cond goto exit;
    else goto head2b;

head2b:
    cond3 = ...;
    if cond3 goto tail2;
    else goto tail1;

tail1:
    cond4 = subgroupOp(...);      // What does this anchor to?
    if cond4 goto head1;
    else goto head2;

tail2:
    cond5 = subgroupOp(...);      // What does this anchor to?
    if cond5 goto head2;
    else goto head1;

exit:
    ...

Regardless of the question about subgroupOp, this example is not valid IR: it breaks the static rule that "Every cycle in the CFG that contains two different uses of a convergence token T must also contain the definition of T."Specifically, there are two uses of entry_token, in head1 and head2, and a cycle head1 -> tail1 -> head2 -> head2b -> tail1 -> head1 that goes through both of them without going through the definition of entry_token.

Roughly speaking, an irreducible loop can contain at most one loop intrinsic that refers to a token from outside the irreducible loop.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This is a good point. Generally, HLL need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

To be honest, I was not aware of this other effort, and even after you pointed it out, I wasn't paying attention to the words that I was reading. It seems like the current spec has so far focussed on demonstrating the soundness of the formalism. But I think it is possible to cover (2), which is to make the default setting conservative. This will need a bit of a rewording. In particular, this definition from the spec:

The convergence control intrinsics described in this document and convergent
operations that have a ``convergencectrl`` operand bundle are considered
*controlled* convergent operations.

Other convergent operations are *uncontrolled*.

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

  1. By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.
  2. A more informed frontend or a suitable transformation can replace this conservative token with one of the following:
    1. A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.
    2. A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Sounds good. If that would be acceptable to the wider community it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively but in the Compute scenarios we occasionally have very large functions that when inlined result in huge binaries and longer compilation time. And we also have extern functions too that we have no information of during the compilation. So this doesn't seem like a route we can safely take at least not for all languages.

If we invert the convergent logic then we can add nocovergent attribute or even a pragma directive for the application developers to indicate what code doesn't contain cross-threads operations and can be optimized more aggressively.

Also, answering one of your comments in the other review (D85609#inline-943432) about the relevance of the llvm.experimental.convergence.anchor, this intrinsic cannot be inferred by the frontend. It represents a new ability to represent optimization opportunities like the one demonstrated in the "opportunistic convergence" example. The intrinsic says that the call that uses this token doesn't depend on any specific set of threads, but merely marks the threads that do reach it. This is most useful when multiple calls agree on the same set of threads. Identifying such sets of operations will need help from the user (or more realistically, a library writer). Something like the following might work, where the actual value of group doesn't really matter beyond relating the various calls to each other.

auto group = non_uniform_group_active_workitems();
op1(group);
if (C)
   op2(group);
op3(group);

Ok, this makes sense. Thanks for clarifications.

To address this there was an attempt to invert the behavior of convergent attribute in this patch (https://reviews.llvm.org/D69498) then the frontend wouldn't need to generate the attribute everywhere and the optimizer wouldn't need to undo what frontend does. The change in this review doesn't address (2) as far as I can see - it seems it only generalized old convergent semantics to cover the cases with non-uniform CF. I am not clear yet about the details of how and what frontend should generate in IR for this new logic but it looks more complex than before. And if we have to stick to the conservative approach of assuming everything is convergent as it is now this might complicate and slow down the parsing. So I am just checking whether addressing (2) is still feasible with the new approach or it is not a direction we can/should go?

This is a good point. Generally, HLL need to be more conscious about what they actually expect convergent operations to do :) I tend to be optimistic: I mentioned on D85609 a proposal I presented in the context of Khronos. The important point from there is that every statement of the HLL would be (possibly implicitly) annotated with its "canonical convergence token" using very simple rules. This only really falls flat if you have goto jumping into the middle of a loop (or Duff's device etc.). I don't know how efficiently e.g. the Clang frontend can decide whether such constructs exist or not.

I see. Technically this sounds feasible to add i.e. we could insert a custom AST visitor to detect the pattern after the AST is parsed or perhaps the detection can be done during the parsing itself. The only question is how many of such patterns exist considering the variety HL language constructs and how this will impact the parsing time, etc. I would say prototyping this could be a good starting point. However what happens when such pattern are detected? Do we generate IR slightly differently?

This needs to be inverted in the spirit of D69498. I would propose the following tweak:

  1. By default, every call has an implicit convergencectrl bundle with a token returned by the @llvm.experimental.convergence.entry intrinsic from the entry block of the caller. This default is the most conservative setting within the semantics defined here.
  2. A more informed frontend or a suitable transformation can replace this conservative token with one of the following:
    1. A token returned by any of the other intrinsics, which provides more specific information about convergence at this callsite.
    2. A predefined constant token (say none), which indicates complete freedom. This would be equivalent to the noconvergent attribute proposed in D69498.

Such a rewording would invert how we approach the spec. Instead of a representation that explicitly talks about special intrinsics that "need" convergence, the new semantics applies to all function calls. The redefined default is conservative instead of free, and the presence of the bundles relaxes the default instead of adding constraints.

Sounds good. If that would be acceptable to the wider community it might help to simplify the frontend design and improve the user interface and the coherence of the interfaces within the compiler stack too.

From what I understand, there was a fair bit of agreement in D69498 about the need to make the default safer. The real question is should we fold that idea into this proposal?

There's one mistake in what I outlined above. The first point about default token is expressly forbidden by the static rule on cycles: if an intrinsic other than .loop inside a cycle uses a token, then the definition must also be in the same cycle. But I think this can be fixed by simply saying that the dynamic instance of a call without an explicit operand bundle is "undertermined". Any optimization must then back off, the whole point being that an optimization is safe if it preserves dynamic instances, and it is impossible to preserve an undetermined dynamic instance.

FYI, if we forced early inlining in the LLVM stack, the frontend would not need to mark every function as convergent conservatively but in the Compute scenarios we occasionally have very large functions that when inlined result in huge binaries and longer compilation time. And we also have extern functions too that we have no information of during the compilation. So this doesn't seem like a route we can safely take at least not for all languages.

Or in other words, the "proper" definition must cover the whole of LLVM IR, and not introduce any assumptions like the absence of function calls.

If we invert the convergent logic then we can add nocovergent attribute or even a pragma directive for the application developers to indicate what code doesn't contain cross-threads operations and can be optimized more aggressively.

Dynamic instances provide complete information at the callsite. But having an attribute on a function declaration (especially extern) is useful because it removes the need to analyse the function body itself.

kpet added a subscriber: kpet.Nov 5 2021, 4:07 AM
bader added a subscriber: bader.Jan 17 2022, 12:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 30 2022, 1:09 AM