
IR: Invert convergent attribute handling
Needs Review · Public

Authored by arsenm on Oct 27 2019, 7:26 PM.

Details

Summary

While we're planning to fix the convergent definition in D68994, I thought
this would be an opportunity to fix another design flaw in the convergent
attribute.

Invert the direction of the convergent attribute by removing it and
adding the noconvergent attribute. The convergent attribute is
currently odd, in that adding it is required for correctness. There is a
burden on frontends in environments that care about convergent
operations to add the attribute just in case it is needed. This has
proven to be somewhat problematic, and different frontend projects
have consistently run into problems related to this. OpenMP and SYCL
for example are currently not assuming convergent by default.

The backwards definition also did not provide a way to directly
specify non-convergent functions in some cases. There is no way to
specify a negative attribute, so when convergent needs to be assumed,
there was no way to opt-out. By inverting this, user attribute usage
is more flexible.

Assume the old convergent semantics by default, so an undecorated
function is correct with respect to convergence. For compatibility
purposes, the old attribute can simply be ignored. The new
noconvergent attribute produces the old default behavior.
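As a sketch of the inversion (the noconvergent spelling is what this patch proposes; today's IR only has convergent):

```llvm
; Today: correctness requires the frontend to mark every function and
; call site that might transitively reach a cross-thread operation.
declare void @llvm.amdgcn.s.barrier() convergent

define void @kernel() convergent {
  call void @llvm.amdgcn.s.barrier() convergent
  ret void
}

; Proposed: an undecorated function is conservatively assumed
; convergent; noconvergent is the explicit opt-out that restores
; today's default control-flow freedoms.
define float @helper(float %x) noconvergent {
  %y = fmul float %x, %x
  ret float %y
}
```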

Infer the noconvergent attribute, similar to how nounwind is inferred.

The main risk of this change is negative tests in control flow
passes. They may no longer trigger due to not having noconvergent, and
not for what they truly intended to test. I've speculatively added
noconvergent to JumpThreading tests and a handful of others, but other
passes may be relevant.
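For instance, a negative test along these lines (a hypothetical reduction, not taken from the actual test suite) now needs the attribute to keep testing what it intended:

```llvm
; Without noconvergent on @callee, the transform under test is blocked
; by the new conservative default, so a CHECK-NOT on the transformed
; pattern would pass for the wrong reason. Adding the attribute
; restores the original intent of the test.
declare void @callee() noconvergent

define void @test(i1 %c) {
entry:
  br i1 %c, label %thread, label %exit
thread:
  call void @callee()
  br label %exit
exit:
  ret void
}
```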

For now this is just trying to change the IR attribute handling. The
MachineInstr flag and intrinsic decoration are still in the positive
direction. IntrConvergent should probably be inverted, but I think
having the MI flag be positive is probably fine.

The source language attribute for convergent is now a no-op. It could
arguably try to enable convergence in a non-SPMD language, but using
this would probably be very error prone. A user facing noconvergent
attribute is probably the next step.

Diff Detail

Event Timeline

arsenm created this revision.Oct 27 2019, 7:26 PM
Herald added a project: Restricted Project. Oct 27 2019, 7:27 PM

I think this is the kind of thing to document on this page: https://llvm.org/docs/Frontend/PerformanceTips.html

arsenm updated this revision to Diff 226614.Oct 27 2019, 9:36 PM

Add section to frontend performance tips documentation

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

A note on spelling: the no prefix seems to be used largely with verbs; it's weird to use it here with an adjective, especially since nonconvergent is just a letter away.

clang/lib/CodeGen/CGObjCMac.cpp
4259

I don't think GPU thread convergence is a concept that could ever really be applied to a program containing ObjC exceptions, so while this seems correct, it also seems pointless.

clang/lib/CodeGen/CGStmt.cpp
1949

The comment here seems to no longer match the code.

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

Sorry, I mis-clicked and sent this review while I was still editing it.

I agree that this is a theoretically better representation for targets that care about convergence, since it biases compilation towards the more conservative assumption. On the other hand, since the default language mode is apparently to assume non-convergence of user functions, and since the vast majority of targets do not include convergence as a meaningful concept, you're also just gratuitously breaking a ton of existing test cases, as well as adding compile time for manipulating this attribute. Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

arsenm marked an inline comment as done.Oct 28 2019, 10:42 AM

A note on spelling: the no prefix seems to be used largely with verbs; it's weird to use it here with an adjective, especially since nonconvergent is just a letter away.

I don't think it reads any differently to me than, say, nofree -> "no free calls"; noconvergent -> "no convergent operations". nonconvergent sounds odder to me.

This certainly seems like a more tractable representation, although I suppose it'd be a thorn in the side of (IR-level) outlining.

Sorry, I mis-clicked and sent this review while I was still editing it.

I agree that this is a theoretically better representation for targets that care about convergence, since it biases compilation towards the more conservative assumption. On the other hand, since the default language mode is apparently to assume non-convergence of user functions, and since the vast majority of targets do not include convergence as a meaningful concept, you're also just gratuitously breaking a ton of existing test cases, as well as adding compile time for manipulating this attribute. Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

The default language mode for the non-parallel languages. The set of languages not setting this does need to grow (SYCL and OpenMP device code are both currently broken).

I don't think there are other attributes with weird connections to some global construct. Part of the point of attributes is that they are function level context. I don't really understand the compile time concern; the cost of the attribute is one bit in a bitset. We have quite a few attributes already that will be inferred for a large percentage of functions anyway. The case that's really important to add in frontend is external declarations and other opaque cases, others will be inferred anyway. You could make the same argument with nounwind, since most languages don't have exceptions.

clang/lib/CodeGen/CGObjCMac.cpp
4259

It doesn't apply, so this needs to be explicitly marked as not having it. This bypasses the place that emits the language's assumed default attributes, so it's needed here. A handful of tests fail without this because control-flow changes stop happening.

Before we get lost in review details, let's make sure we agree that this is the right path, i.e., the transition from convergent to noconvergent (or a similar spelling).
I do support this by the way.

[...] Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

I see what you mean but I think it would be "a hack" to teach the lookup methods, e.g., isNoConvergent, to also look into the datalayout (or any other place). Instead, front-ends should emit noconvergent as one of their default arguments if that is the behavior they want. This should be essentially free after they update their code. Neither idea will fix the "old IR" problem anyway but we could make the deduction add noconvergent to all functions that are not convergent if convergent was found in the module.

The kind of tracking that you're doing of convergent operations is one example of a basically limitless set of "taint" analyses, of which half a dozen already exist in LLVM. I object to the idea that an arbitrary number of unrelated tests need to be changed every time somebody invents a new taint analysis. I also object to the idea that frontends have a responsibility to proactively compensate for every new weird target-specific representation change just to get the optimizer to perform standard transformations that fully respect the documented semantics of LLVM IR. The basic problem at the root of convergence analysis — that code layout is essentially part of GPU program semantics because of its impact on GPU thread groups — is in fact a very special property of GPU targets that is not in any way unreasonable to ask those targets to be explicit about.

If you can come up with a way to make this change that doesn't require changes to non-GPU frontends or tests, I agree that treating functions as convergent by default makes sense for GPU targets.

@rjmccall Thanks for the quick response! I tried to describe my view below. To avoid confusion on my side, could you maybe describe what you think when/which attribute should be created, derived, and used during the optimization pipeline?

The kind of tracking that you're doing of convergent operations is one example of a basically limitless set of "taint" analyses, of which half a dozen already exist in LLVM.

Yes, there are a lot, and to make things worse I added another one to rule them all. So at least in the middle-term there should be only one "(non)convergent" analysis.

I object to the idea that an arbitrary number of unrelated tests need to be changed every time somebody invents a new taint analysis.

Maybe the problem is that I don't understand why the effect on the tests is so bad, given that it is the same effect any new default attribute has (= very minimal and mechanical).
Or maybe I have less of an issue with these changes because I expect other attributes to be added as default soon and they will have a similar effect on the tests.
(To mention one with a made up name: forward_process_guarantee for functions that carry this property instead of pretending all do throughout the middle-end and hoping the front-ends deal with it accordingly.)

I also object to the idea that frontends have a responsibility to proactively compensate for every new weird target-specific representation change just to get the optimizer to perform standard transformations that fully respect the documented semantics of LLVM IR.

I generally agree, and this has to be announced properly, but I don't think it will require "much change" (there are default attribute sets already, I suppose).
It is also, arguably, in the interest of many/all frontends to be able to produce correct (GPU) parallel code by minimizing the chance of accidental errors.
Finally, we basically have that situation you object to already, every time we fix an "error" in the IR semantics which allowed a transformation in the general case (forward process comes to mind again, dereferenceable_globally as well D61652, ...).

If you can come up with a way to make this change that doesn't require changes to non-GPU frontends or tests, I agree that treating functions as convergent by default makes sense for GPU targets.

Do you object to noconvergent as an alternative to convergent?
If not, the one "problem" that pops up with having it not as a default attribute is that the middle-end would then need to check the target in addition to the noconvergent attribute, right?

I think the question should be what the IR policy is for properties that are required for correctness and not necessarily what most users will use. There's a general trend towards functions being correct by default, and attributes adding optimization possibilities. convergent (and the broken noduplicate) are exceptional by inverting this. I think strictfp is also wrong in this sense, and should possibly also be fixed.

One thing to probably note is that it's not only a "target specific" issue, but a language-specific issue as well (IMHO). OpenCL, CUDA, and SYCL are all languages (to name a few) that have a concept of "convergence", and it's not related only to the fact that they mostly run on a GPU, but to their programming model as well, with respect to accessing textures and SIMD-wide communication and decision making (ballot).

Considering that Clang has support for these languages it kind of makes sense to me that the frontend is taking care of implementing the supported languages correctly by applying the proper attributes.

Now, it's debatable whether "convergent" is enough to cover all the cases we need ... and CUDA for example is running away from "ballot()" and such, moving to versions that take an explicit mask of the threads involved in the transaction for that reason. Even the new version with the "dynamic instances" of the mentioned proposal might not be enough ... but that's a discussion for the other RFC ...

As you know, I am very much in favor of this change, and really anything that re-establishes the rule that code is treated maximally conservatively when there are no attributes or metadata.

There is one slight concern here because what we arguably need in SIMT targets is two attributes, which I'm going to call allowconvergence and allowdivergence for now. Convergent operations are operations that communicate with some set of other threads that is implicitly affected by where the operation is in control flow. If we have as a baseline that this set of threads must not be changed at all by transforms, then there are two directions in which this can be relaxed:

  • It may be correct to transform code in a way that may cause more threads to communicate. This applies e.g. to OpenCL uniform barriers and could be expressed as allowconvergence.
  • It may be correct to transform code in a way that may cause fewer threads to communicate. This applies to readanylane-style operations (which are implicit e.g. in some of our buffer and image intrinsics) and could be expressed as allowdivergence.

Though non-SIMT targets may not want to add two attributes everywhere. Perhaps a viable path forward is to take this change towards noconvergent now if we can still add allowconvergence and allowdivergence later; noconvergent would then simply be the union of those two attributes in the end.
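In IR terms, the relationship might look like this (allowconvergence and allowdivergence are placeholder spellings from this comment, not existing attributes):

```llvm
; More threads may communicate than in the original program, e.g. an
; OpenCL uniform barrier that may be merged across branches:
declare void @uniform.barrier() allowconvergence

; Fewer threads may communicate, e.g. a readanylane-style operation
; that may be sunk into divergent control flow:
declare i32 @readanylane(i32) allowdivergence

; noconvergent would then be the union of the two relaxations: the set
; of communicating threads may change freely in either direction.
declare void @ordinary() noconvergent
```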

piotr added a subscriber: piotr.Oct 29 2019, 3:51 AM
simoll added a subscriber: simoll.Oct 29 2019, 9:00 AM

One thing to probably note is that it's not only a "target specific" issue, but a language-specific issue as well (IMHO). OpenCL, CUDA, and SYCL are all languages (to name a few) that have a concept of "convergence", and it's not related only to the fact that they mostly run on a GPU, but to their programming model as well, with respect to accessing textures and SIMD-wide communication and decision making (ballot).

Considering that Clang has support for these languages it kind of makes sense to me that the frontend is taking care of implementing the supported languages correctly by applying the proper attributes.

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

I agree that making this concept "transparent" to anyone not interested in programming models with SIMD-wide execution is an open issue (not only for this patch, but for the convergent idea as a whole: writing an LLVM pass that maintains convergence is now an extra burden on the pass developer that wasn't there before, and is probably completely orthogonal to the interests of a pass writer targeting C/C++ or other non-parallel languages). There are some discussions going on in the other related RFC (extending the concept of convergent to prevent hoisting as well) about how we are going to avoid burdening the LLVM community while maintaining the invariants we want.
I have no idea what the impact of the convergent attribute is in Clang (with the exception of adding the flag itself), but I know a lot of the heavy lifting is in LLVM itself.

That said, I just want to point out that some of these languages run on CPUs as well (like OpenCL), and they share the same properties with respect to instructions that read execution masks of neighboring threads and making sure threads stay converged when executing them. It's certainly unfortunate that LLVM has some deficiencies in supporting these concepts and that the solutions proposed so far are not burden-free for the rest of the community. :-/

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

The frontend has to put convergent on everything that is not known to be non-convergent right now. That is, the front-end and all function-generating constructs need to know about convergent to preserve correctness. Dropping convergent is also not sound. Frontends in the past put convergent only on barriers or similar calls, but not on all functions that might transitively call barriers, leading to subtle bugs.
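A minimal sketch of that failure mode, assuming an AMDGPU-style barrier intrinsic:

```llvm
declare void @llvm.amdgcn.s.barrier() convergent

; BUG: @sync transitively reaches the barrier but the frontend forgot
; to mark it convergent, so under current rules the optimizer may
; legally sink, hoist, or duplicate calls to it across divergent
; control flow.
define void @sync() {
  call void @llvm.amdgcn.s.barrier()
  ret void
}

define void @user(i1 %divergent) {
  br i1 %divergent, label %then, label %exit
then:
  call void @sync()   ; may be moved as if it were ordinary code
  br label %exit
exit:
  ret void
}
```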

arsenm added a comment.EditedOct 29 2019, 12:54 PM

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

Basically no frontend has gotten this right, including clang and non-clang frontends. There's a gradual discovery process that it's necessary in the first place, and then a long tail of contexts where it's discovered it's missing. As I mentioned before, SYCL and OpenMP are both still broken in this regard. For example, in clang, convergent wasn't initially added to all functions, and then it was discovered that transforms on indirect callers violated it. Then later it was discovered this was broken for inline asm. Then some specific intrinsic call sites were a problem. Then there are special cases that don't necessarily go through the standard user-function codegen paths and don't get the default attributes. There were odd places where IR compiled into the user library didn't have convergent added. The discovery of convergent violations is painful; more painful than discovering that control flow didn't optimize the way you expected. It requires eternal vigilance and all developers to be aware of it.
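One concrete instance of the inline-asm case mentioned above (a sketch; the convergent bit has to be present on the call site itself):

```llvm
define void @kernel() convergent {
  ; The call-site attribute is what transforms check for inline asm;
  ; an asm call emitted without convergent, even inside a convergent
  ; function, is open to illegal sinking across divergent branches.
  call void asm sideeffect "s_barrier", ""() convergent
  ret void
}
```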

The short version is the fact that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult to debug bugs when calls are broken by default.

Maybe we can start by looking into the motivation for this patch:

There is a burden on frontends in environments that care about convergent operations to add the attribute just in case it is needed. This has proven to be somewhat problematic, and different frontend projects have consistently run into problems related to this.

Can you clarify what kind of problems? Do you have concrete examples?

The frontend has to put convergent on everything that is not known to be non-convergent right now. That is, the front-end and all function-generating constructs need to know about convergent to preserve correctness.

Right, but that does not seem like a problem to me so far: that a GPU frontend must add a GPU-specific attribute on every function seems like a reasonable constraint to me.

Frontends in the past put convergent only on barriers or similar calls but not on all functions that might transitively call barriers, leading to subtle bugs.

Right, but trying to fix frontend bugs at the LLVM level does not necessarily seem right: it looks like a big hammer.

Basically no frontend has gotten this right, including clang and non-clang frontends. There's a gradual discovery process that it's necessary in the first place, and then a long tail of contexts where it's discovered it's missing. As I mentioned before, SYCL and OpenMP are both still broken in this regard. For example in clang, convergent wasn't initially added to all functions, and then it was discovered that transforms on indirect callers violated it.

The rule seems fairly straightforward for the frontend: the attributes must be always added everywhere if you're using SIMT.
Originally it likely wasn't thought through for convergent, but at this point this seems behind us.

Then later it was discovered this was broken for inline asm.

Can you clarify what you mean here and how this change would fix it?

Then some specific intrinsic call sites were a problem.

Same, I'm not sure I understand this one.

Then there are special cases that don't necessarily go through the standard user function codegen paths which don't get the default attributes.

Some odd places where IR is compiled into the user library that didn't have convergent added.

Bug in the user library? Again I'm not sure why it justifies a change in LLVM.

The short version is the fact that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult to debug bugs when calls are broken by default.

The fact that most developers aren't aware of convergence means a lot of potential bugs in LLVM because passes and transformations won't honor the convergent semantics, but the default for the attribute won't change that.
As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Requiring the presence of an attribute for correctness is a bad thing. OpenMP was missing this annotation in a number of places and is probably still missing it elsewhere. I wouldn't bet on handwritten bitcode libraries getting it right everywhere either. An optimisation pass accidentally dropping the attribute seems a plausible failure mode as well.

Strongly in favour of replacing convergent with no{n}convergent in the IR.

Not as convinced it should be inserted by the front end. The attribute is needed before any CFG rewrites and as far as I know they all occur downstream of the front end. That suggests an IR pass that walks over all functions, intrinsic calls, inline asm and so forth and marks them as appropriate. Standard C++/x86 and similar don't need to run the pass, OpenCL/x86 probably does. I'd suggest running it manually across handwritten bitcode as a sanity check as well.

Of course, if a front end targeting gpus wants to change control flow (julia may do this, mlir does, I sincerely hope clang doesn't) and use this attribute to control the process, then that front end picks up the responsibility for inserting the attribute where it wants it.

The short version is the fact that most developers aren't aware of and don't understand the subtleties of convergence, and making sure the front-end does something in all contexts requires a lot of diligence. It's very easy to introduce these difficult to debug bugs when calls are broken by default.

The fact that most developers aren't aware of convergence means a lot of potential bugs in LLVM because passes and transformations won't honor the convergent semantics, but the default for the attribute won't change that.

While you are right, a default non-convergent behavior will not make all bug sources disappear, it will make *a lot of bug sources disappear*. Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past. There is a logical gap between this and concluding that the only solution is to change the default. For instance, someone mentioned a pass that could be inserted at the very beginning of the pipeline by any GPU compiler and add the convergent attribute everywhere.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be what @rjmccall has) is that SIMT/GPU is a marginal use-case and "soundness" is already existing for most LLVM use-cases.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past. There is a logical gap between this and concluding that the only solution is to change the default. For instance, someone mentioned a pass that could be inserted at the very beginning of the pipeline by any GPU compiler and add the convergent attribute everywhere.

Avoiding human error is fundamental to good design. If you have to have an additional IR pass, then there's already a phase where the IR is broken and something could legally break the IR.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be what @rjmccall has) is that SIMT/GPU is a marginal use-case and "soundness" is already existing for most LLVM use-cases.

I think this is a concerning argument. Declaring that GPUs are a "marginal" use case, and that LLVM only follows its design principles where they benefit C++ x86/ARM users, is an issue. In that case LLVM is no longer the compiler infrastructure for everyone that it tries to be. Soundness for most doesn't sound like a good design. I've proposed attributes in the past that were shot down for not following a correct-by-default design, and to me it seems like a problem if this principle isn't going to be universally followed. It's additionally concerning since most GPU toolchains use LLVM in some fashion, if they're going to be declared a second-class use case.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: it seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap. Making the default restrictive but sound will prevent various kinds of errors we have seen in the past.
Assuming we could somehow "not repeat the errors in the future" is in my opinion the unrealistic view.

For instance someone mentioned a pass that could be inserted as the very beginning of the pipeline by any GPU compiler and add the convergent attribute everywhere.

I talked to @hfinkel earlier about this, and his idea was to use a pass (or IRBuilder mode, or something) to do the opposite: opt into a "subset" of LLVM-IR behaviors.
The idea is that if you want just a low-level C for CPUs, you run this pass or enable this mode and you get the appropriate LLVM-IR. For this patch that would mean
you get noconvergent everywhere, but in the future you could get other attributes/changes as well.

As for the frontend, it seems to me that this is just about structuring the code path for function emission to define the right set of default attributes. It isn't clear to me why a refactoring of the frontend isn't a better course of action here.

Whatever we do, there will be consequences for current and future front-ends. At the end of the day, the underlying question we need to answer here is: Do we want to have "optimization with opt-in soundness" or "soundness with opt-in optimizations", and I would choose the latter any time.

This view seems appropriate to me only if you consider a GPU (or SIMT) compiler alone. Another view (which seems to be the one @rjmccall holds) is that SIMT/GPU is a marginal use case and "soundness" already exists for most LLVM use cases.

I'm strongly opposing the idea to marginalize GPU, or any accelerator, targets. This is not only splitting the community by making it hostile towards some who have to work around whatever is deemed "the main use-case", it is also arguably a position of the past for many of us. Convenience for people that do not care about accelerators is a goal we should have in mind for sure, but it should not oppose a reasonable and sound support of accelerators.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: this seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap.

Bluntly contradicting what I wrote, without apparently trying to understand my point or to explain your point with respect to what I'm trying to explain, is not going to get me to agree with you magically. I'm not sure what you expect here other than shutting down the discussion and leaving us in a frustrating state of disagreement and lack of understanding.

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

As I've said before, I absolutely view this polarity flip as good for GPU compilers, and I don't mind Clang needing to take some patches to update how we operate when targeting the GPU and to update some GPU-specific tests. I do not think we should view GPU targets as "marginal". But I do think that GPUs have different base semantics from CPUs (under normal, non-vector-transformed rules) and that it's not appropriate to unconditionally make those weaker GPU semantics the base semantics of LLVM IR.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: this seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

I'm not sure what other evidence could exist for how something is handled in practice besides the historical record. The current design requires a higher awareness at all times to avoid subtly breaking code, which I think is fundamentally more error-prone.

Let me quote @arsenm here because this is so important: "Basically no frontend has gotten this right, including clang and non-clang frontends." For me, that statement alone is reason to change the status quo.

I am not convinced by this: this seems overly simplistic. The fact that convergent wasn't rolled out properly in frontends can also be seen as an artifact of the past.

Evidence suggests this problem will resurface over and over again; calling it an artifact of the past is dangerous and not helpful.

Can you provide the "evidence"? So far the only thing I have seen is historical and closely tied to how convergence was rolled out AFAICT.

Recently, as mentioned a few times before, OpenMP target offloading gets/got it wrong long after convergent was rolled out.
r238264 introduced convergent in 2015; r315094 fixed it in part for OpenCL in 2017 (with the idea of removing noduplicate, which was also broken!), a year after OpenCL support was started.

There is a logical gap between this and concluding that the only solution is to change the default.

There is no gap.

Bluntly contradicting what I wrote, without apparently trying to understand my point or to explain your point with respect to what I'm trying to explain, is not going to get me to agree with you magically. I'm not sure what you expect here other than shutting down the discussion and leaving us in a frustrating state of disagreement and lack of understanding.

You removed my explanation that directly follows my statement. Please do not use my quotes out of context like this to blame me.

To reiterate what actually followed my above quote: If we have a "by default unsound behavior" we will by design have cases we miss, e.g., anything new not added with this special case in mind. So changing the default to the "by construction sound" case, is in my opinion the only way to prevent that. That being said, I already agreed that this might not be necessary for every target, e.g., based on the data layout. However, I do try to bring forward reasons why a "split in semantics" on this level will hurt us as a community and maintainers of the IR continuously.

tra added a comment.Oct 30 2019, 4:44 PM

Would I be too far off the mark to summarize the situation this way?

  • the current default for unattributed functions is unsound for GPU back-ends, but is fine for most other back-ends.
  • it's easy to unintentionally end up using/mis-optimizing unattributed functions for back-ends that care about convergence.
  • making functions convergent by default everywhere is correct, but is overly pessimistic for back-ends that do not need it. Such back-ends will need to add explicit attributes in order to maintain the current level of optimization. In a way it's the other side of the current situation -- the default is not good for some back-ends and we have to add attributes everywhere to make things work. Flipping the default improves correctness-by-default, but places a logistical burden on back- and front-ends that may not otherwise care about convergence.

Perhaps we can deal with that by providing a way to specify per-module default for the assumed convergence of the functions and then checking in the back-end (only those that do care about convergence) that the default convergence is explicitly set (and, perhaps, set to something specific?) via function/module attributes or CLI.

This way the unintentional use of vanilla IR w/o attributes with NVPTX back-end will produce an error complaining that default convergence is not set and we don't know if the IR is still sound. If necessary, the user can set appropriate convergence wholesale via CLI or module attribute. The burden on platforms that don't care about convergence will be limited to setting the default and applying attributes on entities that do not match the default assumption (there may be none of those).

One pitfall of this approach is that we may run optimizations based on wrong assumptions and end up miscompiling things before we know what those assumptions will be (e.g. opt vs llc). Perhaps we can document the default assumption to be nonconvergent and always set a module attribute indicating the assumption used. This assumption will percolate through the optimization pipeline. If it's the wrong one, we would be able to reject / warn about it in the passes for the back-ends that have a different preference.
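A toy model of this scheme might look as follows (self-contained C++; the flag name and error wording are assumptions, not an existing LLVM interface): the module records the convergence default its pipeline assumed, and a convergence-sensitive backend rejects IR whose recorded assumption is missing or wrong.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Toy model of the proposed per-module convergence default (hypothetical).
struct ToyModule {
  // e.g. "convergent" or "noconvergent"; unset for vanilla IR.
  std::optional<std::string> assumedConvergenceDefault;
};

// What a backend that cares about convergence (e.g. NVPTX) could check
// before accepting the module.
void verifyConvergenceAssumption(const ToyModule &mod) {
  if (!mod.assumedConvergenceDefault)
    throw std::runtime_error("default convergence not set; IR may be unsound");
  if (*mod.assumedConvergenceDefault != "convergent")
    throw std::runtime_error("IR was optimized under a noconvergent default");
}
```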

Would something like that be a reasonable compromise?

In D69498#1728039, @tra wrote:

Perhaps we can deal with that by providing a way to specify per-module default for the assumed convergence of the functions and then checking in the back-end (only those that do care about convergence) that the default convergence is explicitly set (and, perhaps, set to something specific?) via function/module attributes or CLI.

This way the unintentional use of vanilla IR w/o attributes with NVPTX back-end will produce an error complaining that default convergence is not set and we don't know if the IR is still sound. If necessary, the user can set appropriate convergence wholesale via CLI or module attribute. The burden on platforms that don't care about convergence will be limited to setting the default and applying attributes on entities that do not match the default assumption (there may be none of those).

Convergent can be correctly stripped in many cases (and noconvergent inferred in the same cases), so erroring if a function isn't convergent won't really work

As far as optimization inhibition is concerned, noconvergent will be inferred for all functions that don't call convergent intrinsics (i.e. the state of the world for all functions on all CPU targets). The frontend needing to do something for optimization comes up in relation to calling external declarations, which won't be inferred in non-LTO contexts (plus inline asm and indirect calls, the other opaque call types). IMO the mild inconvenience of needing to add another attribute to optimize external call declarations is not unreasonable. This is already the situation for nounwind with languages that don't have exceptions. I don't see what real benefit there is to inventing some new, inconsistent, target-specific IR construct to avoid frontends needing to add an attribute that behaves nearly identically to nounwind.
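The inference described above can be illustrated with a small self-contained model (hypothetical names, not LLVM's actual FunctionAttrs logic): a function may be marked noconvergent only if it performs no convergent operation, makes no opaque call (external declaration, indirect call, inline asm), and only calls functions that are themselves noconvergent.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of bottom-up noconvergent inference (hypothetical names).
struct Fn {
  bool hasConvergentOp = false; // directly uses a convergent operation
  bool hasOpaqueCall = false;   // external decl, indirect call, inline asm
  std::vector<std::string> callees;
};

// Returns the set of functions that may safely be marked noconvergent.
std::set<std::string> inferNoConvergent(const std::map<std::string, Fn> &m) {
  std::set<std::string> no;
  // Optimistically assume everything qualifies, then iterate to a fixed
  // point, dropping functions that reach a convergent or opaque call.
  // (The real inference would run bottom-up over call-graph SCCs.)
  for (auto &kv : m)
    if (!kv.second.hasConvergentOp && !kv.second.hasOpaqueCall)
      no.insert(kv.first);
  bool changed = true;
  while (changed) {
    changed = false;
    for (auto &kv : m) {
      if (!no.count(kv.first))
        continue;
      for (auto &callee : kv.second.callees)
        if (!no.count(callee)) {
          no.erase(kv.first);
          changed = true;
          break;
        }
    }
  }
  return no;
}
```

On CPU-style code with no convergent operations, everything except functions touching opaque calls ends up noconvergent, which mirrors the nounwind situation for languages without exceptions.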

It absolutely makes sense for Clang as a GPU-programming frontend to set attributes appropriately when targeting the GPU. I'm objecting to making "convergence" and related "code layout is semantics" properties a universal part of the IR semantics that literally every frontend has to know about in order to get reasonable behavior from LLVM. I know that doing so makes sense to GPU backend developers because you mostly work exclusively on GPU toolchains, but AFAIK there are half a dozen GPU frontends and they're all forks of Clang, whereas there are dozens of LLVM frontends out there that don't care about targeting the GPU and quite possibly never will. (And even if they do target GPUs, they often will not care about exposing thread groups; many programming environments are just interested in taking advantage of the GPU's parallel-computation power and have no interest in inter-thread interference.)

John.

I agree that making this "transparent" to anyone not interested in programming models with SIMD-wide execution is an open issue (not only for this patch, but for the convergent idea as a whole: writing an LLVM pass that maintains convergence is now an extra burden on the pass developer, one that wasn't there before and that is probably completely orthogonal to the interests of a pass writer targeting C/C++ or other non-parallel languages). I opened some discussions on the other related RFC, about extending the concept of convergent to avoid hoisting as well, regarding how we are going to avoid burdening the LLVM community while maintaining the invariants we want with respect to this concept.
I have no idea what the impact of the convergent attribute is in Clang (beyond adding the flag itself), but a lot of the heavy lifting, I know, is in LLVM itself.

That said, I just want to point out that some of these languages run on CPUs as well (like OpenCL), and there they share the same properties with respect to instructions that read execution masks of neighboring threads and to making sure threads stay converged when executing them. It's certainly unfortunate that LLVM has some deficiencies in supporting these concepts and that the solutions proposed so far are not burden-free for the rest of the community. :-/

Aside from functionality issues with inter-thread interference, there is another, very related aspect: optimizing for SIMT execution. This topic hasn't been discussed much yet, but optimizing for single-threaded execution doesn't always yield the optimal result when the code runs in SIMT style. For example, divergent code regions can significantly slow down the execution of multiple threads even if they have a small instruction count.

I think the biggest gap in LLVM is the absence of a way to represent the SIMT concept in general, i.e. by adding special annotations to functions that run in SIMT mode, or an annotation on the whole module. This could be useful not only to implement convergent correctly, but also to set up a separate optimization pass pipeline for GPU-like targets, allowing us to exclude harmful transformations and potentially add more middle-end passes targeting SIMT. Currently there is no solution for this in the middle end, and therefore each backend ends up adding the same logic during lowering instead.

However, adding the concept of SIMT seems like a large architectural change and might not be something we can do quickly. Potentially we can identify some small steps towards this goal that can be done now, to at least support convergent correctly for SIMT targets without affecting the non-SIMT community much.

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag) but not use it in the transformations/analyses, instead use it only when creating Function by always setting the convergent attribute when the flag is present? This would require to always have a Module passed to the Function constructor though (the C API already does, we would just need to update the C++ users).

So creating a Function in a Module with this flag would have the convergent attribute set on creation (can be unset explicitly).
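A sketch of this variation (toy types, not the real llvm::Function constructor): function creation consults a module flag to seed the attribute set, and the frontend can still remove the attribute explicitly afterwards.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-ins for Module and Function (hypothetical, illustrative only).
struct ToyModuleFlag {
  bool convergentByDefault = false; // e.g. derived from the datalayout
};

struct ToyFunction {
  std::set<std::string> attrs;
  // Creating a function in a module with the flag seeds 'convergent'.
  explicit ToyFunction(const ToyModuleFlag &m) {
    if (m.convergentByDefault)
      attrs.insert("convergent"); // can be unset explicitly later
  }
};
```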

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag) but not use it in the transformations/analyses, instead use it only when creating Function by always setting the convergent attribute when the flag is present? This would require to always have a Module passed to the Function constructor though (the C API already does, we would just need to update the C++ users).

So creating a Function in a Module with this flag would have the convergent attribute set on creation (can be unset explicitly).

That works for me.

arsenm added a comment.Nov 6 2019, 9:45 AM

Perhaps there should be a global metadata, or something in the increasingly-misnamed "data layout" string, which says that convergence is meaningful, and we should only add the attribute in appropriately-annotated modules?

Just wanted to resurface these alternatives from John. Given that some targets want a fundamentally different default from what most frontends expect, I think we ought to find some way to encode the difference.

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag) but not use it in the transformations/analyses, instead use it only when creating Function by always setting the convergent attribute when the flag is present? This would require to always have a Module passed to the Function constructor though (the C API already does, we would just need to update the C++ users).

So creating a Function in a Module with this flag would have the convergent attribute set on creation (can be unset explicitly).

This sounds almost too low level. The bitcode loading of a function shouldn’t be adding attributes to non-intrinsics, otherwise it’s changing the meaning of the program and not preserving the input

Just thought about a slight variation on this: what about adding a flag on the datalayout (or a module flag) but not use it in the transformations/analyses, instead use it only when creating Function by always setting the convergent attribute when the flag is present? This would require to always have a Module passed to the Function constructor though (the C API already does, we would just need to update the C++ users).

So creating a Function in a Module with this flag would have the convergent attribute set on creation (can be unset explicitly).

This sounds almost too low level. The bitcode loading of a function shouldn’t be adding attributes to non-intrinsics, otherwise it’s changing the meaning of the program and not preserving the input

You're right, but I thought loading a function from bitcode would be 1) create the function (it may get a default convergent attribute added) and 2) set the exact attributes (which would override the default)

Talked with Matt at the last social; they thought I was strongly opposed to this. I am not, and I told them I would clarify here:

I don't buy the argument that "frontend projects have consistently run into problems related to this"; I don't think it is a good justification for this change. But the change seems like it can be motivated without this argument.
I have also been trying to help brainstorm alternatives, just to see if we could find something that fits what Matt is looking to achieve while getting past @rjmccall's concerns, not because I am against this change.