This is an archive of the discontinued LLVM Phabricator instance.

IR: Invert convergent attribute handling
Needs Review · Public

Authored by arsenm on Oct 27 2019, 7:26 PM.

Details

Summary

While we're planning to fix the convergent definition in D68994, I thought
this would be an opportunity to fix another design flaw in the convergent
attribute.

Invert the direction of the convergent attribute by removing it and
adding the noconvergent attribute. The convergent attribute is
currently odd, as adding it is required for correctness. There is a
burden on frontends in environments that care about convergent
operations to add the attribute just in case it is needed. This has
proven to be somewhat problematic, and different frontend projects
have consistently run into problems related to this. OpenMP and SYCL,
for example, are currently not assuming convergent by default.

The backwards definition also did not provide a way to directly
specify non-convergent functions. There is no way to spell a
negative attribute, so when convergent had to be assumed, there was
no way to opt out. By inverting this, user attribute usage becomes
more flexible.

Assume the old convergent semantics by default, so an undecorated
function is correct with respect to convergence. For compatibility
purposes, the old attribute can simply be ignored. The new
noconvergent attribute produces the old default behavior.

Infer the noconvergent attribute, which is similar to nounwind.
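To make the inversion concrete, here is a small IR sketch (function names are illustrative; it assumes the noconvergent spelling proposed here, and the AMDGPU barrier intrinsic keeps its positive IntrConvergent marking as described below):

    declare void @llvm.amdgcn.s.barrier() convergent  ; intrinsics stay positive

    ; Undecorated IR is now conservatively correct, even though nothing
    ; here is annotated:
    define void @kernel() {
      call void @llvm.amdgcn.s.barrier()
      ret void
    }

    ; noconvergent is a pure optimization hint, analogous to nounwind:
    define float @helper(float %x) noconvergent {
      %y = fmul float %x, %x
      ret float %y
    }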

The main risk of this change is negative tests in control flow
passes. They may no longer trigger due to not having noconvergent, and
not for what they truly intended to test. I've speculatively added
noconvergent to JumpThreading tests and a handful of others, but other
passes may be relevant.

For now this is just trying to change the IR attribute handling. The
MachineInstr flag and intrinsic decoration are still in the positive
direction. IntrConvergent should probably be inverted, but I think
having the MI flag be positive is probably fine.

The source language attribute for convergent is now a no-op. It could
arguably try to enable convergence in a non-SPMD language, but using
this would probably be very error prone. A user-facing noconvergent
attribute is probably the next step.

Diff Detail

Event Timeline

arsenm created this revision. · Oct 27 2019, 7:26 PM
Herald added a project: Restricted Project. · Oct 27 2019, 7:27 PM

I think this is the kind of thing to document on this page: https://llvm.org/docs/Frontend/PerformanceTips.html

arsenm updated this revision to Diff 226614. · Oct 27 2019, 9:36 PM

Add section to frontend performance tips documentation

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

A note on spelling: the no prefix seems to be used largely with verbs; it's weird to use it here with an adjective, especially since nonconvergent is just a letter away.

clang/lib/CodeGen/CGObjCMac.cpp
4259

I don't think GPU thread convergence is a concept that could ever really be applied to a program containing ObjC exceptions, so while this seems correct, it also seems pointless.

clang/lib/CodeGen/CGStmt.cpp
1949

The comment here seems to no longer match the code.

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

Sorry, I mis-clicked and sent this review while I was still editing it.

I agree that this is a theoretically better representation for targets that care about convergence, since it biases compilation towards the more conservative assumption. On the other hand, since the default language mode is apparently to assume non-convergence of user functions, and since the vast majority of targets do not include convergence as a meaningful concept, you're also just gratuitously breaking a ton of existing test cases, as well as adding compile time for manipulating this attribute. Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

arsenm marked an inline comment as done. · Oct 28 2019, 10:42 AM

A note on spelling: the no prefix seems to be used largely with verbs; it's weird to use it here with an adjective, especially since nonconvergent is just a letter away.

I don't think it reads any differently to me than, say, nofree -> no free calls; noconvergent -> no convergent operations. nonconvergent sounds odder to me.

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

Sorry, I mis-clicked and sent this review while I was still editing it.

I agree that this is a theoretically better representation for targets that care about convergence, since it biases compilation towards the more conservative assumption. On the other hand, since the default language mode is apparently to assume non-convergence of user functions, and since the vast majority of targets do not include convergence as a meaningful concept, you're also just gratuitously breaking a ton of existing test cases, as well as adding compile time for manipulating this attribute. Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

That's the default language mode only for the non-parallel languages. The set of languages not assuming non-convergence does need to grow (SYCL and OpenMP device code are both currently broken).

I don't think there are other attributes with weird connections to some global construct. Part of the point of attributes is that they are function-level context. I don't really understand the compile-time concern; the cost of the attribute is one bit in a bitset. We have quite a few attributes already that will be inferred for a large percentage of functions anyway. The case where it's really important for the frontend to add the attribute is external declarations and other opaque cases; others will be inferred anyway. You could make the same argument about nounwind, since most languages don't have exceptions.

clang/lib/CodeGen/CGObjCMac.cpp
4259

It doesn't apply, so this needs to be explicitly marked as not having it. This code path bypasses the place that emits the language's assumed default attributes, so the explicit marking is needed. A handful of tests do fail without this because control flow changes stop happening.

Before we get lost in review details let's make sure we agree that this is the right path. I mean, the transition from convergent to noconvergent (or a similar spelling).
I do support this by the way.

[...] Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

I see what you mean, but I think it would be "a hack" to teach the lookup methods, e.g., isNoConvergent, to also look into the datalayout (or any other place). Instead, front-ends should emit noconvergent as one of their default attributes if that is the behavior they want. This should be essentially free after they update their code. Neither idea will fix the "old IR" problem anyway, but we could make the deduction add noconvergent to all functions that are not convergent if convergent was found in the module.
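As a rough sketch of what emitting noconvergent as a frontend default could look like in the generated IR (the attribute group contents are illustrative):

    define void @user_function() #0 {
      ret void
    }

    ; noconvergent simply joins the attribute set the frontend already
    ; attaches to every function it emits:
    attributes #0 = { noconvergent nounwind }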

The kind of tracking that you're doing of convergent operations is one example of a basically limitless set of "taint" analyses, of which half a dozen already exist in LLVM. I object to the idea that an arbitrary number of unrelated tests need to be changed every time somebody invents a new taint analysis. I also object to the idea that frontends have a responsibility to proactively compensate for every new weird target-specific representation change just to get the optimizer to perform standard transformations that fully respect the documented semantics of LLVM IR. The basic problem at the root of convergence analysis — that code layout is essentially part of GPU program semantics because of its impact on GPU thread groups — is in fact a very special property of GPU targets that is not in any way unreasonable to ask those targets to be explicit about.

If you can come up with a way to make this change that doesn't require changes to non-GPU frontends or tests, I agree that treating functions as convergent by default makes sense for GPU targets.

@rjmccall Thanks for the quick response! I tried to describe my view below. To avoid confusion on my side, could you maybe describe what you think when/which attribute should be created, derived, and used during the optimization pipeline?

The kind of tracking that you're doing of convergent operations is one example of a basically limitless set of "taint" analyses, of which half a dozen already exist in LLVM.

Yes, there are a lot, and to make things worse I added another one to rule them all. So at least in the middle-term there should be only one "(non)convergent" analysis.

I object to the idea that an arbitrary number of unrelated tests need to be changed every time somebody invents a new taint analysis.

Maybe the problem is that I don't understand why the effect on the tests is so bad, given that it is the same as for any new default attribute (= very minimal and mechanical).
Or maybe I have less of an issue with these changes because I expect other attributes to be added as defaults soon, and they will have a similar effect on the tests.
(To mention one with a made-up name: forward_progress_guarantee, for functions that carry this property, instead of pretending all do throughout the middle-end and hoping the front-ends deal with it accordingly.)

I also object to the idea that frontends have a responsibility to proactively compensate for every new weird target-specific representation change just to get the optimizer to perform standard transformations that fully respect the documented semantics of LLVM IR.

I generally agree, and this has to be announced properly, but I don't think it will require "much change" (there are default attribute sets already I suppose).
It is also, arguably, in the interest of many/all frontends to be able to produce correct (GPU) parallel code by minimizing the chance of accidental errors.
Finally, we basically have the situation you object to already, every time we fix an "error" in the IR semantics which allowed a transformation in the general case (forward progress comes to mind again, dereferenceable_globally as well, D61652, ...).

If you can come up with a way to make this change that doesn't require changes to non-GPU frontends or tests, I agree that treating functions as convergent by default makes sense for GPU targets.

Do you object to noconvergent as an alternative to convergent?
If not, the one "problem" that pops up with having it not as a default attribute is that the middle-end would then need to check the target in addition to the noconvergent attribute, right?

I think the question should be what the IR policy is for properties that are required for correctness and not necessarily what most users will use. There's a general trend towards functions being correct by default, and attributes adding optimization possibilities. convergent (and the broken noduplicate) are exceptional by inverting this. I think strictfp is also wrong in this sense, and should possibly also be fixed.

One thing to probably note is that it's not only a "target-specific" issue but a language-specific issue as well (IMHO). OpenCL, CUDA, and SYCL are all languages (to name a few) that have a concept of "convergence", and it's not related only to the fact that they mostly run on a GPU, but to their programming model as well, with respect to accessing textures and SIMD-wide communication and decision making (ballot).

Considering that Clang has support for these languages it kind of makes sense to me that the frontend is taking care of implementing the supported languages correctly by applying the proper attributes.

Now, it's debatable whether "convergent" is enough to cover all the cases we need ... and CUDA for example is running away from "ballot()" and such and moving to versions that have an explicit mask of the threads involved in the transaction for that reason. Even the new version with the "dynamic instances" of the mentioned proposal might not be enough ... but that's a discussion for the other RFC ...

As you know, I am very much in favor of this change, and really anything that re-establishes the rule that code is treated maximally conservatively when there are no attributes or metadata.

There is one slight concern here because what we arguably need in SIMT targets is two attributes, which I'm going to call allowconvergence and allowdivergence for now. Convergent operations are operations that communicate with some set of other threads that is implicitly affected by where the operation is in control flow. If we have as a baseline that this set of threads must not be changed at all by transforms, then there are two directions in which this can be relaxed:

  • It may be correct to transform code in a way that may cause more threads to communicate. This applies e.g. to OpenCL uniform barriers and could be expressed as allowconvergence.
  • It may be correct to transform code in a way that may cause fewer threads to communicate. This applies to readanylane-style operations (which are implicit e.g. in some of our buffer and image intrinsics) and could be expressed as allowdivergence.

Though non-SIMT targets may not want to add two attributes everywhere. Perhaps a viable path forward is to take this change towards noconvergent now if we can still add allowconvergence and allowdivergence later; noconvergent would then simply be the union of those two attributes in the end.
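A sketch of how the two proposed relaxations could be spelled (neither attribute exists; they are written as string attributes here purely so the example parses):

    ; An OpenCL uniform barrier tolerates *more* threads joining the
    ; communication:
    define void @uniform_barrier() convergent "allowconvergence" {
      ret void
    }

    ; ...while a readanylane-style operation tolerates *fewer*:
    define i32 @read_any_lane(i32 %v) convergent "allowdivergence" {
      ret i32 %v
    }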

piotr added a subscriber: piotr. · Oct 29 2019, 3:51 AM
simoll added a subscriber: simoll. · Oct 29 2019, 9:00 AM

One thing to probably note is that it's not only a "target-specific" issue but a language-specific issue as well (IMHO). OpenCL, CUDA, and SYCL are all languages (to name a few) that have a concept of "convergence", and it's not related only to the fact that they mostly run on a GPU, but to their programming model as well, with respect to accessing textures and SIMD-wide communication and decision making (ballot).

Considering that Clang has support for these languages it kind of makes sense to me that the frontend is taking care of implementing the supported languages correctly by applying the proper attributes.

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

I agree that making this "transparent" to whoever is not interested in programming models with SIMD-wide execution is an open issue (not only for this patch, but for the convergent idea as a whole: writing an LLVM pass that maintains convergence is now an extra burden on the pass developer that wasn't there before, and one that is probably completely orthogonal to the interests of a pass writer targeting C/C++ or other non-parallel languages). I opened some discussions on the other related RFC (on extending convergent to also prevent hoisting) about how we are going to avoid burdening the LLVM community while maintaining the invariants we want with respect to this concept.
I have no idea what the impact of the convergent attribute is in Clang (with the exception of adding the flag itself), but I know a lot of the heavy lifting is in LLVM itself.

That said, I just want to point out that some of these languages run on CPUs as well (like OpenCL), and they share the same properties with respect to instructions that read execution masks of neighboring threads and making sure threads stay converged when executing them. It's certainly unfortunate that LLVM has some deficiencies in supporting these concepts and that the solutions proposed so far are not burden-free for the rest of the community. :-/

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

The frontend has to put convergent on everything that is not known to be non-convergent right now. That is, the front-end and all function-generating constructs need to know about convergent to preserve correctness. Dropping convergent is also not sound. Frontends in the past put convergent only on barriers or similar calls but not on all functions that might transitively call barriers, leading to subtle bugs.
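A minimal IR sketch of that failure mode (the wrapper function is hypothetical; the barrier intrinsic is AMDGPU's):

    declare void @llvm.amdgcn.s.barrier() convergent

    ; Bug: this wrapper transitively reaches a barrier, but the frontend
    ; did not mark it convergent.
    define void @sync_threads() {
      call void @llvm.amdgcn.s.barrier()
      ret void
    }

    ; In another translation unit only the unmarked declaration is
    ; visible, so nothing stops a pass from sinking or duplicating calls
    ; to it across divergent branches:
    ;   declare void @sync_threads()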

arsenm added a comment. · Edited · Oct 29 2019, 12:54 PM

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

Basically no frontend has gotten this right, including clang and non-clang frontends. There's a gradual discovery process that it's necessary in the first place, and then a long tail of contexts where it's discovered it's missing. As I mentioned before, SYCL and OpenMP are both still broken in this regard. For example in clang, convergent wasn't initially added to all functions, and then it was discovered that transforms on indirect callers violated it. Then later it was discovered this was broken for inline asm. Then some specific intrinsic call sites were a problem. Then there are special cases that don't necessarily go through the standard user-function codegen paths and so don't get the default attributes. And there are odd places where IR compiled into the user library didn't have convergent added. The discovery of convergent violations is painful; more painful than discovering that control flow didn't optimize the way you expected. It requires eternal vigilance and all developers to be aware of it.

The short version is that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult-to-debug bugs when calls are broken by default.

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

The frontend has to put convergent on everything that is not known to be non-convergent right now. That is, the front-end and all function-generating constructs need to know about convergent to preserve correctness.

Right, but that does not seem like a problem to me so far: that a GPU frontend must add a GPU-specific attribute to every function seems like a reasonable constraint to me.

Frontends in the past put convergent only on barriers or similar calls but not on all functions that might transitively call barriers, leading to subtle bugs.

Right, but trying to fix frontend bugs at the LLVM level does not seem necessarily right: it looks like a big hammer.

Basically no frontend has gotten this right, including clang and non-clang frontends. There's a gradual discovery process that it's necessary in the first place, and then a long tail of contexts where it's discovered it's missing. As I mentioned before, SYCL and OpenMP are both still broken in this regard. For example in clang, convergent wasn't initially added to all functions, and then it was discovered that transforms on indirect callers violated it.

The rule seems fairly straightforward for the frontend: the attributes must always be added everywhere if you're using SIMT.
Originally it likely wasn't thought through for convergent, but at this point this seems behind us.

Then later it was discovered this was broken for inline asm.

Can you clarify what you mean here and how this change would fix it?

Then some specific intrinsic call sites were a problem.

Same, I'm not sure I understand this one.

Then there are special cases that don't necessarily go through the standard user-function codegen paths and so don't get the default attributes.

And there are odd places where IR compiled into the user library didn't have convergent added.

Bug in the user library? Again I'm not sure why it justifies a change in LLVM.

The short version is that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult-to-debug bugs when calls are broken by default.

The fact that most developers aren't aware of convergence means a lot of potential bugs in LLVM, because passes and transformations won't honor the convergent semantics, but the default for the attribute won't change that.
As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Requiring the presence of an attribute for correctness is a bad thing. OpenMP was missing this annotation in a number of places and is probably still missing it elsewhere. I wouldn't bet on handwritten bitcode libraries getting it right everywhere either. An optimisation pass accidentally dropping the attribute seems a plausible failure mode as well.

Strongly in favour of replacing convergent with no{n}convergent in the IR.

Not as convinced it should be inserted by the front end. The attribute is needed before any CFG rewrites and as far as I know they all occur downstream of the front end. That suggests an IR pass that walks over all functions, intrinsic calls, inline asm and so forth and marks them as appropriate. Standard C++/x86 and similar don't need to run the pass, OpenCL/x86 probably does. I'd suggest running it manually across handwritten bitcode as a sanity check as well.

Of course, if a front end targeting GPUs wants to change control flow (Julia may do this, MLIR does, I sincerely hope Clang doesn't) and use this attribute to control the process, then that front end picks up the responsibility for inserting the attribute where it wants it.

The short version is that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult-to-debug bugs when calls are broken by default.

The fact that most developers aren't aware of convergence means a lot of potential bugs in LLVM, because passes and transformations won't honor the convergent semantics, but the default for the attribute won't change that.

While you are right that changing the default will not make all bug sources disappear, it will make *a lot of bug sources disappear*. Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past. There is a logical gap between this and concluding that the only solution is to change the default. For instance, someone mentioned a pass that could be inserted at the very beginning of the pipeline by any GPU compiler to add the convergent attribute everywhere.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be what @rjmccall has) is that SIMT/GPU is a marginal use-case and "soundness" already exists for most LLVM use-cases.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past. There is a logical gap between this and concluding that the only solution is to change the default. For instance, someone mentioned a pass that could be inserted at the very beginning of the pipeline by any GPU compiler to add the convergent attribute everywhere.

Avoiding human error is fundamental to good design. If you have to have an additional IR pass, then there's already a window in which the IR is unsound and a transform could legally break it.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be what @rjmccall has) is that SIMT/GPU is a marginal use-case and "soundness" already exists for most LLVM use-cases.

I think this is a concerning argument. Declaring that GPUs are a "marginal" use case, and that LLVM only follows its design principles where they benefit C++ x86/ARM users, is an issue. In that case LLVM is no longer trying to be the compiler infrastructure for everyone that it aims to be. Soundness for most doesn't sound like a good design. I've proposed attributes in the past that were shot down for not following a correct-by-default design, and to me it seems like a problem if this principle isn't going to be universally followed. It's additionally concerning, given that most GPU vendors use LLVM in some fashion, if they're going to be declared a second-class use case.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap. Making the default restrictive but sound will prevent various kinds of errors we have seen in the past.
Assuming we could somehow "not repeat the errors in the future" is in my opinion the unrealistic view.

For instance, someone mentioned a pass that could be inserted at the very beginning of the pipeline by any GPU compiler to add the convergent attribute everywhere.

I talked to @hfinkel earlier about this, and his idea was to use a pass (or an IRBuilder mode, or something similar) to do the opposite: opt into a "subset" of LLVM-IR behaviors.
The idea is that if you want just a low-level C for CPUs, you run this pass or enable this mode and you get the appropriate LLVM-IR. For this patch that would mean
you get noconvergent everywhere, but in the future you could get other attributes/changes as well.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be what @rjmccall has) is that SIMT/GPU is a marginal use-case and "soundness" already exists for most LLVM use-cases.

I strongly oppose the idea of marginalizing GPU, or any accelerator, targets. Not only does this split the community by making it hostile towards those who have to work around whatever is deemed "the main use-case"; it is also, arguably, a position of the past for many of us. Convenience for people who do not care about accelerators is a goal we should keep in mind for sure, but it should not stand in the way of reasonable and sound support for accelerators.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap.

Bluntly contradicting what I wrote, without apparently trying to understand my point or to explain your point with respect to what I'm trying to explain, is not gonna get me to agree with you magically. I'm not sure what you expect here, other than shutting down the discussion and leaving us in a frustrating disagreement and lack of understanding.

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

As I've said before, I absolutely view this polarity flip as good for GPU compilers, and I don't mind Clang needing to take some patches to update how we operate when targeting the GPU and to update some GPU-specific tests. I do not think we should view GPU targets as "marginal". But I do think that GPUs have different base semantics from CPUs (under normal, non-vector-transformed rules) and that it's not appropriate to unconditionally make those weaker GPU semantics the base semantics of LLVM IR.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

I'm not sure what other evidence could exist for how something is handled in practice besides the historical record. The current design requires a higher awareness at all times to avoid subtly breaking code, which I think is fundamentally more error prone.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

Recently, as mentioned a few times before, OpenMP target offloading gets/got it wrong long after convergence was rolled out.
r238264 introduced convergent in 2015; r315094 fixed it in part for OpenCL in 2017 (with the idea to remove noduplicate, which was also broken!), a year after OpenCL support was started.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap.

Bluntly contradicting what I wrote, without apparently trying to understand my point or to explain your point with respect to what I'm trying to explain, is not gonna get me to agree with you magically. I'm not sure what you expect here, other than shutting down the discussion and leaving us in a frustrating disagreement and lack of understanding.

You removed my explanation that directly follows my statement. Please do not use my quotes out of context like this to blame me.

To reiterate what actually followed my above quote: if we have "by default unsound behavior", we will by design have cases we miss, e.g., anything new not added with this special case in mind. So changing the default to the "by construction sound" case is, in my opinion, the only way to prevent that. That being said, I already agreed that this might not be necessary for every target, e.g., based on the data layout. However, I do try to bring forward reasons why a "split in semantics" on this level will continuously hurt us as a community and as maintainers of the IR.

tra added a comment. · Oct 30 2019, 4:44 PM

Would I be too far off the mark to summarize the situation this way?

  • The current default for unattributed functions is unsound for GPU back-ends, but is fine for most other back-ends.
  • It's easy to unintentionally end up using/mis-optimizing unattributed functions on back-ends that care about convergence.
  • Making functions convergent by default everywhere is correct, but is overly pessimistic for back-ends that do not need it. Such back-ends will need to add explicit attributes in order to maintain the current level of optimization. In a way it's the other side of the current situation -- the default is not good for some back-ends and we have to add attributes everywhere to make things work. Flipping the default improves correctness-by-default, but places a logistical burden on back- and front-ends that may not otherwise care about convergence.

Perhaps we can deal with that by providing a way to specify a per-module default for the assumed convergence of functions, and then checking in the back-ends (only those that care about convergence) that the default convergence is explicitly set (and, perhaps, set to something specific?) via function/module attributes or the CLI.

This way, the unintentional use of vanilla IR w/o attributes with the NVPTX back-end will produce an error complaining that the default convergence is not set and we don't know if the IR is still sound. If necessary, the user can set the appropriate convergence wholesale via the CLI or a module attribute. The burden on platforms that don't care about convergence will be limited to setting the default and applying attributes to entities that do not match the default assumption (there may be none of those).

One pitfall of this approach is that we may run optimizations based on wrong assumptions and end up miscompiling things before we know what those assumptions should be (e.g., opt vs. llc). Perhaps we can document the default assumption to be nonconvergent and always set a module attribute indicating the assumption used. This assumption will percolate through the optimization pipeline. If it's the wrong one, we would be able to reject it / warn about it in the passes for the back-ends that have a different preference.
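For illustration, such a recorded assumption could look like a module flag (the flag name is made up; nothing like it exists today):

    ; Module produced and optimized under the default nonconvergent
    ; assumption; a convergence-aware backend could reject or warn on a
    ; mismatch.
    !llvm.module.flags = !{!0}
    !0 = !{i32 1, !"assumed-convergence", !"nonconvergent"}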

Would something like that be a reasonable compromise?

In D69498#1728039, @tra wrote:

Perhaps we can deal with that by providing a way to specify per-module default for the assumed convergence of the functions and then checking in the back-end (only those that do care about convergence) that the default convergence is explicitly set (and, perhaps, set to something specific?) via function/module attributes or CLI.

This way the unintentional use of vanilla IR w/o attributes with NVPTX back-end will produce an error complaining that default convergence is not set and we don't know if the IR is still sound. If necessary, the user can set appropriate convergence wholesale via CLI or module attribute. The burden on platforms that don't care about convergence will be limited to setting the default and applying attributes on entities that do not match the default assumption (there may be none of those).

Convergent can be correctly stripped in many cases (and noconvergent inferred in the same cases), so erroring if a function isn't convergent won't really work.

As far as optimization inhibition is concerned, noconvergent will be inferred for all functions that don't call convergent intrinsics (i.e., the state of the world for all functions on all CPU targets). The frontend needing to do something for optimization only comes up in relation to calls to external declarations, which won't be inferred in non-LTO contexts (plus inline asm and indirect calls, the other opaque call types). IMO the mild inconvenience of needing to add another attribute to optimize external call declarations is not unreasonable. This is already the situation for nounwind in languages that don't have exceptions. I don't see what real benefit there is to inventing some new inconsistent, target-specific IR construct to avoid frontends needing to add an attribute that behaves nearly identically to nounwind.
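To illustrate the division of labor (a sketch; the callee is a made-up external math function):

    ; Opaque callee: inference cannot see a body, so the frontend
    ; annotates the declaration, exactly as it already does for nounwind.
    declare float @expf(float) noconvergent nounwind

    define float @scale(float %x) {
      ; The body contains no convergent operations, so noconvergent is
      ; inferred for @scale itself; the frontend does nothing.
      %r = call float @expf(float %x)
      ret float %r
    }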

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

I agree that making this "transparent" to whoever is not interested in programming models with SIMD-wide execution is an open issue (not only for this patch, but for the convergent idea as a whole: writing an LLVM pass that maintains convergence is now an extra burden on the pass developer that wasn't there before, and one that is probably completely orthogonal to the interests of a pass writer targeting C/C++ or other non-parallel languages). I opened some discussions on the other related RFC (on extending convergent to also prevent hoisting) about how we are going to avoid burdening the LLVM community while maintaining the invariants we want with respect to this concept.
I have no idea what the impact of the convergent attribute is in Clang (with the exception of adding the flag itself), but I know a lot of the heavy lifting is in LLVM itself.

That said, I just want to point out that some of these languages run on CPUs as well (like OpenCL), and they share the same properties with respect to instructions that read execution masks of neighboring threads and making sure threads stay converged when executing them. It's certainly unfortunate that LLVM has some deficiencies in supporting these concepts and that the solutions proposed so far are not burden-free for the rest of the community. :-/

Aside from functionality issues with inter-thread interference, there is another very related aspect: optimizing for SIMT execution. This topic hasn't been discussed much yet, but optimizing for single-threaded execution doesn't always yield the optimal result when the code runs in SIMT style. For example, divergent code regions can significantly slow down the execution of multiple threads even if they have a small instruction count. I think the biggest gap in LLVM is the absence of a way to represent the SIMT concept in general, i.e., by adding special annotations to functions that run in SIMT mode or by adding an annotation to the whole module. This could be useful not only for implementing convergent correctly but also for setting up a separate optimization pass pipeline for GPU-like targets, allowing us to exclude harmful transformations and potentially add more middle-end passes targeting SIMT. Currently there is no solution to this in the middle end, and therefore each backend ends up adding the same logic during lowering instead. However, adding the concept of SIMT seems like a large architectural change and might not be something we can do quickly. Potentially we can identify some small steps towards this goal that can be done now, to at least support convergent correctly for SIMT targets without much effect on the non-SIMT community.

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag), but not using it in the transformations/analyses, instead using it only when creating a Function, by always setting the convergent attribute when the flag is present? This would require always having a Module passed to the Function constructor, though (the C API already does; we would just need to update the C++ users).

So creating a Function in a Module with this flag would set the convergent attribute on creation (it can be unset explicitly).

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag), but not using it in the transformations/analyses, instead using it only when creating a Function, by always setting the convergent attribute when the flag is present? This would require always having a Module passed to the Function constructor, though (the C API already does; we would just need to update the C++ users).

So creating a Function in a Module with this flag would set the convergent attribute on creation (it can be unset explicitly).

That works for me.

arsenm added a comment. · Nov 6 2019, 9:45 AM

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag), but not using it in the transformations/analyses, instead using it only when creating a Function, by always setting the convergent attribute when the flag is present? This would require always having a Module passed to the Function constructor, though (the C API already does; we would just need to update the C++ users).

So creating a Function in a Module with this flag would set the convergent attribute on creation (it can be unset explicitly).

This sounds almost too low-level. The bitcode loading of a function shouldn't be adding attributes to non-intrinsics; otherwise it's changing the meaning of the program and not preserving the input.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag), but not using it in the transformations/analyses, instead using it only when creating a Function, by always setting the convergent attribute when the flag is present? This would require always having a Module passed to the Function constructor, though (the C API already does; we would just need to update the C++ users).

So creating a Function in a Module with this flag would set the convergent attribute on creation (it can be unset explicitly).

This sounds almost too low-level. The bitcode loading of a function shouldn't be adding attributes to non-intrinsics; otherwise it's changing the meaning of the program and not preserving the input.

You're right, but I thought loading a function from bitcode would 1) create the function (it may get a default convergent attribute added) and 2) set the exact attributes (which would override the default).

Talked with Matt at the last social; they thought I was strongly opposed to this. I am not, and I told them I would clarify here:

I don't buy the argument that "frontend projects have consistently run into problems related to this"; I think this is not a good justification for this change. But this change seems like it can be motivated without this argument.
I have also been trying to help brainstorm alternatives, just to see if we could get something that would fit what Matt is looking to achieve while getting past @rjmccall's concerns, not because I am against this change.

JonChesterfield added a subscriber: sameerds. · Edited · Apr 19 2021, 4:15 AM

I've managed to write a bug in some freestanding C++ compiled for AMDGPU by forgetting the convergent attribute in strategic places. Took a while to debug that. OpenMP does now set the convergent attribute correctly, I think. I wonder if anyone has audited all the ROCm device code to verify it is correctly annotated.

The trailing comments above are 'that works for me' and 'not strongly opposed'. @sameerds, is this something you're looking at picking up? The patch has probably rotted a bit.

The way I see it, the notion of convergence is relevant only to a certain class of targets (usually represented by GPUs) and it only affects certain optimizations. Then why not have only these optimizations check TTI to see if convergence matters? TTI.hasBranchDivergence() seems like a sufficient proxy for this information.

  1. convergent becomes the default in LLVM IR, but it does not affect optimizations on non-GPU targets.
  2. This is not a reinterpretation of the same IR on different targets. The notional execution model of LLVM IR will say that all function calls are convergent. Targets that only care about one thread at a time represent the degenerate case where all executions are convergent anyway.

This recasts the whole question to be one about combining optimizations with target-specific information. The only changes required are in transforms that check CallInst::isConvergent(). These should now also check TTI, possibly adding a dependency on the TTI analysis where it didn't exist earlier.

foad added a subscriber: foad. · Apr 21 2021, 1:48 AM

The way I see it, the notion of convergence is relevant only to a certain class of targets (usually represented by GPUs) and it only affects certain optimizations. Then why not have only these optimizations check TTI to see if convergence matters? TTI.hasBranchDivergence() seems like a sufficient proxy for this information.

  1. convergent becomes the default in LLVM IR, but it does not affect optimizations on non-GPU targets.
  2. This is not a reinterpretation of the same IR on different targets. The notional execution model of LLVM IR will say that all function calls are convergent. Targets that only care about one thread at a time represent the degenerate case where all executions are convergent anyway.

This recasts the whole question to be one about combining optimizations with target-specific information. The only changes required are in transforms that check CallInst::isConvergent(). These should now also check TTI, possibly adding a dependency on the TTI analysis where it didn't exist earlier.

@sameerds I agree with your conclusions but I would describe the situation a little differently. As I understand it, the optimizations that check isConvergent really only care about moving convergent calls past control flow that might be divergent. !hasBranchDivergence is a promise that there are no possible sources of divergence for a target, so you can run a divergence analysis if you like but it will just tell you that everything is uniform, so all control flow is uniform, so it's OK to move isConvergent calls around.

In practice the optimizations that check isConvergent don't seem to use divergence analysis, they just pessimistically assume that any control flow might be divergent (if hasBranchDivergence). But they could and perhaps should use divergence analysis, and then it would all just fall out in the wash with no need for an explicit hasBranchDivergence test.

This recasts the whole question to be one about combining optimizations with target-specific information. The only changes required are in transforms that check CallInst::isConvergent(). These should now also check TTI, possibly adding a dependency on the TTI analysis where it didn't exist earlier.

@sameerds I agree with your conclusions but I would describe the situation a little differently. As I understand it, the optimizations that check isConvergent really only care about moving convergent calls past control flow that might be divergent. !hasBranchDivergence is a promise that there are no possible sources of divergence for a target, so you can run a divergence analysis if you like but it will just tell you that everything is uniform, so all control flow is uniform, so it's OK to move isConvergent calls around.

In practice the optimizations that check isConvergent don't seem to use divergence analysis, they just pessimistically assume that any control flow might be divergent (if hasBranchDivergence). But they could and perhaps should use divergence analysis, and then it would all just fall out in the wash with no need for an explicit hasBranchDivergence test.

Sure it is formally the same thing. But in practice, the main issue for everyone is the effect on compile time for targets that don't care about convergence/divergence. For such targets, running even the divergence analysis is an unnecessary cost. Any checks for uniform control flow will still be hidden under TTI.hasBranchDivergence() for such targets. We see that happening already in places where divergence analysis is constructed optionally.

foad added a comment. · Apr 21 2021, 4:56 AM

But in practice, the main issue for everyone is the effect on compile time for targets that don't care about convergence/divergence. For such targets, running even the divergence analysis is an unnecessary cost.

LegacyDivergenceAnalysis::runOnFunction bails out immediately if !hasBranchDivergence.

I realize now that what @foad says above puts the idea in a clearer context. Essentially, any check for isConvergent() isn't happening in a vacuum. It is relevant only in the presence of divergent control flow, which in turn is relevant only when the target has divergence. Any standalone check for isConvergent() is merely making a pessimistic assumption that all the control flow incident on it is divergent. This has two consequences:

  1. The meaning of the convergent attribute has always been target-dependent.
  2. Dependence on TTI is not a real cost at all. We may eventually update every use of isConvergent() to depend on a check for divergence. The check for TTI is only the first step towards that.

I would propose refining the definition of the noconvergent attribute as follows:

noconvergent:

Some targets with a parallel execution model provide cross-thread operations whose outcome is affected by the presence of divergent control flow. We call such operations convergent. Optimizations that change control flow may affect the correctness of a program that uses convergent operations. In the presence of divergent control flow, such optimizations conservatively treat every call/invoke instruction as convergent by default. The noconvergent attribute relaxes this constraint as follows:

  • The noconvergent attribute can be added to a call/invoke to indicate that it is not affected by changes to the control flow that reaches it.
  • The noconvergent attribute can be added to a function to indicate that it does not execute any convergent operations. A call/invoke automatically inherits the noconvergent attribute if it is set on the callee.
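To make the second bullet concrete, here is a minimal sketch of how such an attribute could be inferred bottom-up, in the spirit of the existing nounwind deduction; the "noconvergent" string spelling and the helper name are hypothetical, since no such attribute exists in tree:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/InstrTypes.h"

  using namespace llvm;

  // Hypothetical inference: a function is noconvergent if every call site
  // it contains is already known noconvergent (CallBase::hasFnAttr also
  // consults the callee, which gives the inheritance from the second
  // bullet). Returns true if the attribute was added.
  static bool inferNoConvergent(Function &F) {
    if (F.hasFnAttribute("noconvergent"))
      return false; // Already annotated; nothing to do.
    for (Instruction &I : instructions(F))
      if (auto *CB = dyn_cast<CallBase>(&I))
        if (!CB->hasFnAttr("noconvergent"))
          return false; // A possibly-convergent call blocks the inference.
    F.addFnAttr("noconvergent");
    return true;
  }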

The core IR semantics must *not* depend on target interpretation. convergent has never been target dependent, and was defined in an abstract way. Only certain targets will really care, but we cannot directly define it to mean what we want it to mean for particular targets.

My bad. I should have said "the implementation of convergent" is target dependent. But even that is not precise enough. It is more correct to say that the convergent attribute can be implemented inside LLVM by depending on TTI so that it only impacts targets that have divergence. This dependence is not new; it's merely missing because the current uses of convergent pessimistically assume divergent control flow on all targets.

The important point is that frontends have to do nothing special for targets that don't have divergence, because the concepts of convergence and divergence have a trivial effect on optimizations. I had already observed in a previous comment that this new definition does not amount to reinterpreting the same program differently on different targets. It is perfectly okay to say that all calls are convergent on a CPU, because all branches are trivially uniform on the CPU (this part is inspired by @foad's observation), and hence the default convergent nature of calls has no effect on optimization for CPUs. If we were to rename isConvergent() to hasConvergenceConstraints() then we can see that it trivially returns false for any target that has no divergence.

Note that my proposed definition for noconvergent makes no mention of the target. The first sentence starts with "some targets", but that is just for helpful context. The concept of convergent calls is defined purely in terms of divergent control flow. This one sentence very nicely ties everything up:

In the presence of divergent control flow, such optimizations conservatively treat every call/invoke instruction as convergent by default.

You're still saying the same thing. This needs to be defined generically. Frontends don't *have* to do anything; they'll just get the assumed convergent behavior by default. Either frontends could always add noconvergent to avoid any possible optimization hit, or we could have a pass add noconvergent depending on the target. I don't want to change the interpretation of core attributes based on the target.

Like I said earlier, this is not a reinterpretation based on the target. I think we are not talking about the same "convergent" here. There is currently an attribute in LLVM with the spelling "c-o-n-v-e-r-g-e-n-t". It does not do what its name says. It has a prescriptive definition that says "thou shalt not add control dependences to this function". This is not what we actually need, because it fails in two ways:

  1. It is actually okay to add control dependences to a convergent function as long as the new branches are uniform.
  2. Sometimes we hack a transform to avoid doing things that cannot be described as "adding a new control dependence", and still talk about the function having the above-named attribute.

There is another important bug that obfuscates the whole discussion: most places that use "isConvergent()" should in fact first check if it really matters to the surrounding control flow. There is too much loaded onto that one attribute, without sufficient context. The definition of "noconvergent" that I am proposing starts out by first fixing the definition of convergent itself. This definition is independent of target, and only talks about the properties of the control flow that reach the callsite. To repeat, this does not reinterpret the IR in a target-defined way. Like I said, in the new definition, all function calls are convergent even on CPUs, so I don't see where the target comes in. If you still insist on talking about interpretation of core attributes, please start by deconstructing the definition that I propose so I can see what I am missing.

What matters isn't "convergent" in itself; it is how it restricts transformations: if you know that all control flow is always uniform, are there still restrictions on any transformation in the presence of convergent instructions?

If not, then the "target approach" seems nice: as @foad and @sameerds noted, we just need a guarantee on these targets about the uniformity of branches; it does not require changing the convergent semantics on a per-target basis, I think.

Yes, the current concept of convergent is broken. I think this whole recent burst of conversation has been too focused on the current convergent situation and this patch in particular. I think we need to conceptually rebase this into the world that D85603 creates. The expected convergence behavior is not only a property of the target, but also of the source language. The frontend ultimately has to express the intended semantics of these cross-lane regions, and convergence tokens give it that ability.

My point about not making this target dependent is that we shouldn't be trying to shoehorn in TTI checks around convergent operations just to save frontends from the inconvenience of adding another attribute to function declarations. We can interprocedurally infer noconvergent like any other attribute most of the time. The convergent calls and their tokens should be interpreted the same even if the target CPU doesn't really have cross-lane behavior.

Actually that is exactly what this current conversation is doing. My definition of noconvergent first rebases the definition of convergent into what D85603 does. My wording is no different from the one you see here: https://reviews.llvm.org/D85603#change-WRNH9XUSSoR8

The expected convergence behavior is not only a property of the target, but also of the source language. The frontend ultimately has to express the intended semantics of these cross-lane regions, and convergence tokens give it that ability.

That is true, and it is covered by the simple fact that we both agree that convergent needs to be the default unless specified by the frontend using noconvergent.

My point about not making this target dependent is that we shouldn't be trying to shoehorn in TTI checks around convergent operations just to save frontends from the inconvenience of adding another attribute to function declarations. We can interprocedurally infer noconvergent like any other attribute most of the time. The convergent calls and their tokens should be interpreted the same even if the target CPU doesn't really have cross-lane behavior.

This is not an attempt to shoehorn in TTI checks for mere convenience. You missed the part about how the current use of CallInst::isConvergent() is broken. This is where the wording in D85603 inherits a major fault from the existing definition of convergent. It is wrong to say that optimizations are simply forbidden by the mere knowledge/assumption that some call is convergent. Those optimizations are forbidden only if divergent control flow is involved. The definition of convergent needs to refrain from being prescriptive about what the compiler can and cannot do. See the definition of nosync for comparison.

The convergent property merely says that the call is special, and then it is up to each optimization to decide what happens. That decision is based on whether the optimization affects divergent control flow incident on the call, and whether that effect is relevant to a convergent function. Now that's where the following reasoning comes into play:

  1. Every use of isConvergent() needs to be tempered with knowledge of the control flow incident on it.
  2. The only way to check for divergent control flow is to call on divergence analysis.
  3. When divergence analysis is requested on a target with no divergence, the result is nullptr, which is treated as equivalent to answering "uniform" for every query. This is not hypothetical; it is already happening whenever a transform does if (!DA) { ... }.
  4. Pending an attempt to actually call DA at each of these places, a correct simplification is to assume uniform control flow when the target does not have divergence, and assume divergent control flow otherwise.

That's all there is to this. Like I keep saying, this is not a reinterpretation, nor is it a hack. From my first comment on the thread, it is an attempt to recast the whole question to be one about combining optimizations with target-specific information. Now before you balk at yet another use of the phrase "target-specific", please note that DA is totally target-specific. The "sources of divergence" must be identified by the target. So this is a perfectly legitimate place to say "target-specific". It's making the optimization "target-specific", not the attribute.
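Under that reasoning, a transform's guard would query DA when it exists and fall back to "uniform" when it does not. A sketch, with illustrative names (unsafeToHoistPast is hypothetical, not an existing LLVM API):

  #include "llvm/Analysis/LegacyDivergenceAnalysis.h"
  #include "llvm/IR/InstrTypes.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Hypothetical guard: hoisting a call past a branch is only unsafe if
  // the call is convergent and the branch it would cross may be divergent.
  // A null DA means the target has no divergence, so every branch is
  // trivially uniform and the convergent property imposes no constraint.
  static bool unsafeToHoistPast(const CallBase &CB, const BranchInst &Br,
                                const LegacyDivergenceAnalysis *DA) {
    if (!CB.isConvergent())
      return false;
    if (!DA || !Br.isConditional())
      return false;
    return DA->isDivergent(Br.getCondition());
  }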

foad added a comment. Apr 23 2021, 1:12 AM

I would propose refining the definition of the noconvergent attribute as follows:

noconvergent:

Some targets with a parallel execution model provide cross-thread operations whose outcome is affected by the presence of divergent control flow. We call such operations convergent. Optimizations that change control flow may affect the correctness of a program that uses convergent operations. In the presence of divergent control flow, such optimizations conservatively treat every call/invoke instruction as convergent by default. The noconvergent attribute relaxes this constraint as follows:

  • The noconvergent attribute can be added to a call/invoke to indicate that it is not affected by changes to the control flow that reaches it.
  • The noconvergent attribute can be added to a function to indicate that it does not execute any convergent operations. A call/invoke automatically inherits the noconvergent attribute if it is set on the callee.

I don't have much to add to the conversation except to point out that this definition defines noconvergent in terms of divergent control flow, but the LangRef itself doesn't define what divergent control flow is, which makes it an incomplete spec. (Perhaps I'm just restating @arsenm's objections.) This seems unsatisfactory to me but I have no idea what to do about it. I agree with @sameerds that the current definition of convergent is too restrictive because in practice we really do want to be able to move convergent calls past uniform control flow.

nhaehnle added a comment. Edited Apr 23 2021, 9:34 AM

That is one of the things that D85603 addresses. I suspect Sameer was assuming the language from there in the background.

I agree with Matt that it would be best to avoid too much TTI dependence. The existing situation with the intrinsics is already a bit of a strange one. I wonder if it is possible to move to a world where isSourceOfDivergence need not be target-dependent. (This requires e.g. new attributes on function arguments as well as new information about address spaces in the data layout, plus some new attributes to define intrinsic data flow. Likely beyond the scope of this patch.)

In general, yes, it is great to avoid dependence on TTI. But this particular instance is actually a dependence on DA. The fact that DA depends on TTI is a separate problem. But I think that's a natural fallout of the fact that LLVM is not (and should not be) in the business of defining an execution model for multiple threads. Every optimization that affects control flow needs to be able to reason about an execution model that LLVM itself does not define. So what we end up with is an approximate model in the form of information gleaned from the frontend as well as the target. Neither source is "more correct" or "less ideal" than the other.

JonChesterfield added a comment. Edited Apr 30 2021, 2:47 AM

I wonder if it is necessary for the exec mask to be implicit state, managed by a convergent/divergent abstraction.

We could reify it as an argument passed to kernels (where all bits are set), adjusted by phi nodes at entry to basic blocks. Various intrinsics would take and return that reified value: __ballot, comparison ops, possibly load/store.

At that point all the awkward control flow constraints are raised to data flow, and I think (but haven't really chased this into the dark corners) that solves the problems I know about. Notably, a uniform branch would be one where the exec mask value was unchanged, so the associated phi is constant folded.

Makes a mess of the human readable IR, but possibly at a lower overall complexity than continuing to handle divergence as control flow.

edit: said reified mask, if integer, would also be the value that post-Volta sync intrinsics take, where the developer is expected to compute the mask across branches and pass it around everywhere. Clang could therefore provide builtins that lower to those sync intrinsics with an always-correct mask automatically passed. That would make Volta much easier to program.

That is essentially what the current divergence analysis (Karrenberg and Hack) does anyway. Except that since we don't know how many threads are running concurrently, the mask is a symbolic value that can either be "divergent" or "uniform". If I am looking at this the right way, you seem to be saying two separate things: the notion of divergence/uniformity is what solves the problem at hand, and separately, the notion of divergence can be turned into explicit dataflow.
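In other words, the "symbolic mask" is just a two-point lattice propagated through the program instead of a concrete bitmask. A minimal sketch of that abstraction (illustrative only, not the actual analysis code):

  // Two-point lattice standing in for a concrete execution mask: we cannot
  // know how many threads run together, only whether they might disagree.
  enum class Mask { Uniform, Divergent };

  // Lattice join used where control-flow paths merge: if either incoming
  // value may be divergent, the merged value must be treated as divergent.
  static Mask join(Mask A, Mask B) {
    return (A == Mask::Divergent || B == Mask::Divergent) ? Mask::Divergent
                                                          : Mask::Uniform;
  }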

Additionally, the explicit mask (if we could agree on its type) would make it possible to have "dynamic uniformity". Currently we can only prove static uniformity.