This is an archive of the discontinued LLVM Phabricator instance.

Allow convergent attribute for function arguments
Needs RevisionPublic

Authored by nhaehnle on Nov 7 2016, 2:29 AM.

Details

Summary

While convergent functions are functions where the fact that they are called
must be uniform across multiple threads in a SIMT/SPMD-type execution
model, convergent function arguments are arguments whose values must be
uniform across multiple threads.

The problem that this is intended to address is that (for AMDGPU, but also
for general GLSL semantics):

%v1 = texelFetch(%sampler, %coord0)
%v2 = texelFetch(%sampler, %coord1)
%v = select i1 %cond, vType %v1, %v2

is logically equivalent to and could benefit from being transformed to:

%coord = select i1 %cond, cType %coord0, %coord1
%v = texelFetch(%sampler, %coord)

On the other hand,

%v1 = texelFetch(%sampler0, %coord)
%v2 = texelFetch(%sampler1, %coord)
%v = select i1 %cond, vType %v1, %v2

must not be transformed to

%s = select i1 %cond, sType %sampler0, %sampler1
%v = texelFetch(%s, %coord)

because of uniformity restrictions on the first argument of texelFetch.

While InstCombine does not actually perform these transforms today,
SimplifyCFG does tail sinking that amounts to the same thing, and there are
shaders in the wild that are mis-compiled because of it.
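To make the miscompile concrete, here is a hypothetical Python sketch (not part of the patch; all names are illustrative) that models a two-thread lock-step wave. `texel_fetch` stands in for an operation whose first argument must be uniform across the active threads, so sinking the select from the result into the sampler operand changes behavior:

```python
# Toy lock-step model of a 2-thread wave. "texel_fetch" stands in for a
# convergent-argument operation: it is only defined when every active
# thread passes the same sampler (all names here are illustrative).

def texel_fetch(samplers, coords):
    # The sampler operand must be uniform across the active threads.
    assert len(set(samplers)) == 1, "non-uniform sampler: undefined behavior"
    return [(samplers[0], c) for c in coords]

# Original program: two calls, each with a uniform sampler, then a select.
def original(cond, coord):
    v1 = texel_fetch(["s0", "s0"], coord)
    v2 = texel_fetch(["s1", "s1"], coord)
    return [a if c else b for c, a, b in zip(cond, v1, v2)]

# After (illegal) sinking: the select now feeds the sampler operand.
def transformed(cond, coord):
    s = ["s0" if c else "s1" for c in cond]
    return texel_fetch(s, coord)  # per-thread samplers differ, so this asserts

print(original([True, False], [7, 7]))  # each call site stays uniform
try:
    transformed([True, False], [7, 7])
except AssertionError as e:
    print("miscompile:", e)
```

In this model the original program keeps each call site uniform, while the transformed program passes per-thread samplers to a single call site, which the model rejects; this mirrors the tail-sinking bug described above.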

In other words, this patch is really a bug fix, but it tries to fix the bug
without unnecessary performance regression and keep the door open for future
optimization improvements.

This is very much an RFC, and feedback is very much appreciated, but I'd
personally be happy to push the patch as-is (minus the part of the
select-call.ll test that merely illustrates a potential future improvement,
and minus the incomplete formalization).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97988

Event Timeline

nhaehnle updated this revision to Diff 77022.Nov 7 2016, 2:29 AM
nhaehnle retitled this revision from to Allow convergent attribute for function arguments.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
jlebar edited edge metadata.Nov 7 2016, 1:08 PM

I looked at just the langref and intrinsics.td.

docs/LangRef.rst
1154

Nit, I think you want "whereas" or "while". Sort of like "although" but a less strong sense of contrast.

1168

This is going to be the paragraph to wordsmith.

I think the approach of saying "A correct transformation has the following property: For any two runs of the program r1 and r2, if r1 and r2 were compatible before the transformation, they must be compatible after the transformation" makes some sense as a way to import SPMD semantics into the langref.

But I am not sure about the definition of 'compatible'. For one thing, this doesn't work if the transformation adds or removes a callsite with convergent parameters. But you're sometimes allowed to do this. For example, you could unroll a loop which contains a convergent-parameter call.

This is really tricky...

include/llvm/IR/Intrinsics.td
54

This definition actually seems fairly clear to me, if we s/control-//. Is there a reason we don't want to use this as our canonical definition?

mehdi_amini edited edge metadata.Nov 7 2016, 3:04 PM

This seems to import the "SIMT/SPMD/..." model into LangRef, and the constraint here seems a bit scary since I suspect it may impact significantly the optimizer.
(adding @chandlerc as reviewer as well)

jlebar added a comment.Nov 7 2016, 3:06 PM

This seems to import the "SIMT/SPMD/..." model into LangRef, and the constraint here seems a bit scary since I suspect it may impact significantly the optimizer.

Right, see previous discussion on llvm-dev. We're trying to craft something that doesn't import SIMT/SPMD into the langref while still allowing us to model the constraints the hardware puts on these functions.

The existing convergent function attr also has nontrivial impacts on the optimizer.

Thank you for taking a look. I've replied to some points inline.

As for the overall impact on passes, (a) the impact is local in the sense that you only have to care when you touch a call-site which calls a function with this attribute, and (b) I suspect the places where you do need to consider the attribute in e.g. loop transformations are quite similar to those for the existing convergent function attribute.

docs/LangRef.rst
1154

Makes sense.

1168

So, the intention is for the language to allow the addition of callsites in loop unrolling.

Compatible means that at every call site, the sequences of actual convergent function arguments are equal or one is a subsequence of the other.

If you add a callsite, you have to make sure that this property holds. In the case of loop unrolling, I think the constraint ends up being similar to what happens for convergent functions, i.e. as long as you know that the loop has a multiple of N iterations, you can unroll the loop N times. The effect is that the sequence of actual function arguments at the call site in the loop from run r1 on the original program is split into N sequences of actual function arguments, one for each of the N call sites in the loop-unrolled program.

If the sequences in r1 and r2 were equal originally, then they're trivially equal afterwards. If the sequence in r1 was a subsequence of the sequence in r2, then this must mean that the loop took fewer iterations in r1 than in r2. Then it follows that the sequences in r1 are still subsequences of the corresponding sequences in r2.

(I think this approach is fine for the proof that unrolling exactly N times satisfies the condition. For the reverse proof that loop transformations satisfying the condition actually do the right thing with respect to the real SIMT model, I'm cheating slightly because there could be an outer loop so that there is not necessarily a clear implication from subsequence -> fewer iterations, but then e.g. use the freedom to choose r1 and r2 in such a way that the outer loop is run through only once.)
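The subsequence condition sketched above can be stated as executable pseudocode. The following Python sketch (my own illustrative names, not code from the patch) checks compatibility of two runs and models the unroll-by-N argument:

```python
# Sketch of the "compatible runs" condition described above, with
# illustrative names (not code from the patch). A run is modeled as a
# dict mapping each call site to the sequence of convergent argument
# values observed there over the whole run.

def is_subsequence(a, b):
    """True if a occurs in order (not necessarily contiguously) in b."""
    it = iter(b)
    return all(x in it for x in a)

def compatible(run1, run2):
    """Per call site, one run's sequence must be a subsequence of the other's."""
    for cs in set(run1) | set(run2):
        s1, s2 = run1.get(cs, ()), run2.get(cs, ())
        if not (is_subsequence(s1, s2) or is_subsequence(s2, s1)):
            return False
    return True

# One call site in a loop; r1 runs 4 iterations, r2 runs 6.
r1 = {"loop.call": (0, 1, 2, 3)}
r2 = {"loop.call": (0, 1, 2, 3, 4, 5)}
assert compatible(r1, r2)

# Unrolling by 2 splits the sequence between two call sites; because both
# trip counts are multiples of 2, compatibility is preserved.
r1u = {"loop.call.0": (0, 2), "loop.call.1": (1, 3)}
r2u = {"loop.call.0": (0, 2, 4), "loop.call.1": (1, 3, 5)}
assert compatible(r1u, r2u)
```

This is only a model of the definition, not of any LLVM analysis; it exists to make the subsequence reasoning in the preceding paragraphs mechanically checkable.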

include/llvm/IR/Intrinsics.td
54

I'd be happy with it. When I posted something like this on the mailing list initially, Mehdi asked for clarification :)

mehdi_amini added inline comments.Nov 8 2016, 9:36 AM
include/llvm/IR/Intrinsics.td
54

My concern on the mailing list was that the original definition was based on "uniformity".

Anyway, what about this example:

if (cond) {
  Tmp1 = ...
  Foo(Tmp1 [convergent])
} else {
  Tmp2 = ...
  Foo(Tmp2 [convergent])
}

I think your definition does not prevent:

if (cond) {
  Tmp1 = ...
} else {
  Tmp2 = ...
}

Tmp3 = phi [Tmp1, Tmp2]

if (cond) {
  Foo(Tmp3 [convergent])
} else {
  Foo(Tmp3 [convergent])
}

However it breaks uniformity.

nhaehnle added inline comments.Nov 8 2016, 9:56 AM
include/llvm/IR/Intrinsics.td
54

That transform is allowed. It will cause some inefficiencies in the code generation, but the semantics are indeed okay.

At each call site of Foo, the argument is "dynamically uniform" using the language of the GLSL specification, i.e. all threads that actually go through the call site will see the same value (of course assuming that this was true for the original program). So maybe it's really just a matter of finding the right wording.

mehdi_amini added inline comments.Nov 8 2016, 9:57 AM
include/llvm/IR/Intrinsics.td
54

OK let's just do one extra transformation:

if (cond) {
  Tmp1 = ...
} else {
  Tmp2 = ...
}

Tmp3 = phi [Tmp1, Tmp2]
Foo(Tmp3 [convergent])
jlebar added inline comments.Nov 8 2016, 10:00 AM
docs/LangRef.rst
1168

Oh, I see what you mean now! Okay. My problem was I was trying to compare call sites before and after the transformation, but you only compare calls made within one piece of code (i.e. either before compared to before or after compared to after).

Now that I've slept on it (and understood my mistake here), I think that reasoning about how programs run is going to be pretty complicated -- to the point that I am still not 100% sure this is correct. If we can agree on a data-flow formulation, I'd probably find that a lot easier to reason about myself. It sounds like Mehdi doesn't have as much of a problem with it as you might have thought?

nhaehnle added inline comments.Nov 9 2016, 5:14 AM
docs/LangRef.rst
1168

Right, the comparison is within one piece of code.

See the last example that Mehdi brought up as an example for why a pure data-flow formulation may be challenging.

include/llvm/IR/Intrinsics.td
54

That transformation is incorrect, and this RFC patch already disables it (in SimplifyCFG/tail sinking). A simpler example would be just:

if (cond) {
  Tmp1 = Foo(v [convergent])
} else {
  Tmp2 = Foo(v [convergent])
}
Tmp3 = phi [Tmp1, Tmp2]

to

Tmp3 = Foo(v [convergent])

This transform is incorrect in general because cond could be something like v == 0, where v is an int that can only take values 0 or 1.

Formally, the constraint formulation with runs r1 and r2 forbids this transform because you can choose r1 = { cond = true, v = A }, r2 = { cond = false, v = B }. These runs are compatible in the original program, but incompatible in the transformed program. (This is basically what happens in the linked bug report.)

The transform would be correct if the branch condition were proved to be "uncorrelated" with the convergent argument. This can sometimes be done with target-specific knowledge. This target-specific knowledge effectively restricts which pairs of runs need to be considered in the runs-based constraint formulation.
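The runs-based argument above can be checked mechanically. In this hypothetical Python sketch (illustrative names only, not from the patch), a run records, per call site, the sequence of values passed to convergent parameters; the two call sites of the original program stay compatible across the two runs, while the merged call site does not:

```python
# Hypothetical check of the two runs discussed above (A != B). A run maps
# each call site of Foo to the sequence of convergent argument values
# passed at that site.

def is_subsequence(a, b):
    it = iter(b)
    return all(x in it for x in a)

def compatible(run1, run2):
    return all(
        is_subsequence(run1.get(cs, ()), run2.get(cs, ()))
        or is_subsequence(run2.get(cs, ()), run1.get(cs, ()))
        for cs in set(run1) | set(run2)
    )

A, B = "A", "B"

# Original program: Foo has a call site in each branch arm.
r1 = {"then.call": (A,), "else.call": ()}  # cond = true,  v = A
r2 = {"then.call": (), "else.call": (B,)}  # cond = false, v = B
assert compatible(r1, r2)   # the empty sequence is a subsequence of anything

# Transformed program: a single merged call site.
t1 = {"merged.call": (A,)}
t2 = {"merged.call": (B,)}
assert not compatible(t1, t2)  # neither (A,) nor (B,) contains the other
```

The negative result for the transformed program is exactly why the constraint formulation forbids this merge.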

mehdi_amini added inline comments.Nov 9 2016, 8:04 AM
include/llvm/IR/Intrinsics.td
54

I'm addressing the definition proposed here: " The values to this argument are convergent, and calls to the intrinsic may not be modified in a way that makes the argument value control-dependent on additional values."

I don't see how either of *these* two definitions forbids the transformation I mentioned.

Or the one in LangRef: "When a function argument has the `convergent` attribute, a call to the function may only be transformed in a way that does not add additional sources of divergence to the argument."

It uses the concept of "divergence" without defining it, so this one is even more iffy, and in any case not clear.

hfinkel added a subscriber: hfinkel.Nov 9 2016, 8:11 AM
hfinkel added inline comments.
include/llvm/IR/Intrinsics.td
54

It is using the concept of "divergence" without defining it, so this one is more iffy, but still not clear.

Yea, we normally talk about this in terms of adding control dependencies, but we'll need to carefully define what we mean here.

nhaehnle added inline comments.Nov 10 2016, 4:13 AM
include/llvm/IR/Intrinsics.td
54

Okay, so how about in this location in Intrinsics.td we remove the "control" from "control-dependent" and refer explicitly to the IR attribute? Then at least it's clear that the proper definition is located in one specific place. Maybe as short as:

// Convergent - The values to this argument are convergent. This corresponds
// to the IR function argument attribute of the same name.

Then we can focus on the language in LangRef - any feedback on that?

mehdi_amini added inline comments.Nov 10 2016, 8:47 AM
include/llvm/IR/Intrinsics.td
54

The quote I posted just above about using "source of divergence" comes from LangRef.

hfinkel added inline comments.Nov 10 2016, 9:39 AM
include/llvm/IR/Intrinsics.td
54

No, because there are important differences. Both refer to control dependencies, but on a function it means control dependencies on the execution of the function, whereas on the value it means control dependencies on the value being passed into the function, right?

mehdi_amini added inline comments.Nov 14 2016, 10:07 PM
include/llvm/IR/Intrinsics.td
54

@hfinkel: were you replying to me? If so it looks like we're talking past each other.

hfinkel added inline comments.Nov 14 2016, 11:00 PM
include/llvm/IR/Intrinsics.td
54

@mehdi_amini: no, I was responding to @nhaehnle

curan added a subscriber: curan.Nov 16 2016, 8:51 AM

Tested-by: Kai Wasserbäch <kai@dev.carbon-project.org>

Fixes fdo#97988 for me (together with the Mesa part attached to that bug report).

nhaehnle updated this revision to Diff 78270.Nov 16 2016, 3:04 PM
nhaehnle edited edge metadata.

@mehdi_amini: I changed Intrinsics.td like I suggested, so that only LangRef
should matter.

In LangRef, I removed the 'source of divergence' part and instead made the
definition of compatibility with two runs as discussed in the sub-thread
with @jlebar more explicit. I don't know if that's more to your taste, at
least I couldn't find anything clearly wrong with it yet.

@hfinkel: I'm not sure what you were replying to exactly. The current
version does not talk about control-dependencies anymore at all, because
that language does not make it obvious what should happen with Mehdi's
example, i.e. you're not allowed to eliminate the branches in:

if (cond) {
  Tmp1 = Foo(v [convergent])
} else {
  Tmp2 = Foo(v [convergent])
}
Tmp3 = phi [Tmp1, Tmp2]

"Control-dependencies on a value" doesn't really capture that. Unless you
understand that the control-dependencies of a value depend on which basic
block you're reading the value in. But that's not how people would
generally understand it, right?

mehdi_amini added inline comments.Nov 16 2016, 3:45 PM
docs/LangRef.rst
1147

uniform isn't defined in the LangRef.

mehdi_amini added inline comments.Nov 16 2016, 3:45 PM
docs/LangRef.rst
1155

(Ignore this, phab bug)

1161

It is not clear to me what the r1/S1 and r2/S2 are that prevent CSE-ing:

if (cond) {
  Tmp1 = Foo(v [convergent])
} else {
  Tmp2 = Foo(v [convergent])
}
Tmp3 = phi [Tmp1, Tmp2]

to:

Tmp3 = Foo(v[convergent])

But your definition references a "call site" as if there were a one-to-one mapping before/after a transformation.

nhaehnle added inline comments.Nov 17 2016, 1:57 AM
docs/LangRef.rst
1147

Right, this is also used in the paragraph below. This whole paragraph isn't intended to be normative, but to provide context that allows one to understand the actual condition. Perhaps this could be phrased better. What about:

"In some parallel execution models, there exist operations that are executed simultaneously for multiple threads, and one or more arguments of the operation must have the same value across all simultaneously executing threads."

1161

The definition of "compatible" and everything inside it is completely independent of transformations. It talks about two runs of the same program.

Your example is forbidden because in the original program, the runs r1 = { cond = true, v = V0 } and r2 = { cond = false, v = V1 } (with V0 != V1) are compatible (the first call-site of Foo has the sequences r1(CS) = ( V0 ) and r2(CS) = ( ), i.e. r2(CS) is a subsequence of r1(CS); similarly for the second call-site), but in the transformed program they're not (the single call-site of Foo has the sequences r1(CS) = ( V0 ) and r2(CS) = ( V1 ), and neither is a subsequence of the other).

I do think that the language I used is pretty clear on that, actually, but since both you and @jlebar seem to have had the same difficulty, I'd like to understand that better. Perhaps it would help to swap those two paragraphs around as follows?

Program transformations must ensure that every pair of runs that is compatible with respect to convergent function attributes in the original program must also be compatible in the transformed program. Compatibility is defined as follows:

Two runs r1 and r2 of a program are said to be compatible (wrt convergent function attributes) if ...

Maybe that helps the reader to parse the definition of compatibility because it first establishes a context where compatibility is about runs within a program rather than runs across multiple (transformed) versions of a program.

nhaehnle updated this revision to Diff 78348.Nov 17 2016, 4:54 AM
nhaehnle edited edge metadata.

Add missing handling of the convergent attribute in GVNHoist and
LoopUnroll.

Also update LangRef with the suggested changes to phrasing.

hfinkel added inline comments.Nov 17 2016, 6:51 AM
docs/LangRef.rst
1163

Please write out "with respect to." Also, you mean convergent function-parameter attributes, not function attributes.

The runs of the program terminology I find bothersome, in part because that's not what we're talking about. We're talking about the behavior of different threads within the same program. I think it would be better to say, "Two executions of the calling function, e1 and e2, are compatible..." Then we can say that transformations must preserve compatibility w.r.t. convergent function-parameter attributes for all potential executions of the calling function. We might even further specify that this is for potential simultaneous executions.

Finally, I'd really like it if you would include the example we've been discussing in the reference as a non-normative example. I did not really understand the definition until I read @nhaehnle's explanation above.

nhaehnle added inline comments.Nov 18 2016, 11:18 AM
docs/LangRef.rst
1163

I'm only one person :)

I'm making the editing changes and adding the example.

s/run/execution/ is fine with me, and I'm adding that as an alternative below, but whenever I try to formulate some text that talks about executing a calling function rather than executing a function, I feel confused about the calling vs. the callee function.

Perhaps you can clarify:

The runs of the program terminology I find bothersome, in part because that's not what we're talking about. We're talking about the behavior of different threads within the same program.

What's the difference between threads and executions in this context?

nhaehnle updated this revision to Diff 78559.Nov 18 2016, 11:19 AM
nhaehnle edited edge metadata.

Editing changes and s/run/execution/ alternative.

hfinkel added inline comments.Nov 18 2016, 2:52 PM
docs/LangRef.rst
1163

but whenever I try to formulate some text that talks about executing a calling function rather than executing a function, I feel confused about the calling vs. the callee function.

Feel free to establish some terminology early in the definition for later use. We're talking about a function-parameter attribute. That attribute must appear on the function parameters of some call site. That call site exists in some function (the caller a.k.a. the calling function) and calls some other function (the callee function).

What's the difference between threads and executions in this context?

As I think about it, a thread is some entity that executes things (e.g. functions). Threads execute functions, and each time a thread executes some function, that constitutes an execution of that function.

jlebar added inline comments.Nov 20 2016, 12:17 PM
docs/LangRef.rst
1153

While the `convergent` attribute on functions can be thought of as "whether and when each call site to this function is reached must be uniform across multiple threads",

I know this isn't meant to be normative, but I'm a little concerned by this language. Specifically, our semantics say that the compiler is not allowed to *introduce* new sources of divergence to convergent calls, but it is also not allowed to *assume* that convergent calls are uniform.

I presume the same is true for convergent args -- the compiler cannot introduce divergence into the arg values, but it also cannot assume anything about the arg values as a consequence of their being marked convergent.

I know rewriting this to be precise might defeat the purpose of your summary here. But could we add some weasel words so that people can't point to this as justification that divergent calls / args marked with convergent are LLVM UB?

1159

Nit, suggest putting "compatible" in quotes so it's clear it's a term we're about to define.

hfinkel added inline comments.Nov 20 2016, 12:37 PM
docs/LangRef.rst
1153

This is a good point. We might want to say something like: there is some implementation-defined set of simultaneously-executing threads for which the restriction holds. Thus we won't be able to make any target-independent assumptions about the implied behavior.

mehdi_amini added inline comments.Nov 20 2016, 12:40 PM
docs/LangRef.rst
1153

I don't see why we should mention anything about threads or concurrency in the LangRef.
Also, I'm not thrilled by this paragraph either (to begin with, the sentence about the convergent attribute on functions isn't clear to me).

nhaehnle added inline comments.Nov 21 2016, 7:43 AM
docs/LangRef.rst
1153

Hmm. I wonder what the general opinion is on whether this paragraph is still helpful at all. The way I see it, it can serve two purposes:

  1. Explain the attribute in an intuitive way to someone with an SIMT/SPMD background.
  2. Explain why the same term is used here for a function-parameter attribute and elsewhere as a function attribute.

But if it just adds confusion for others, maybe it's best to remove it rather than discuss it to death? After all, the first purpose is kind of fulfilled already by the paragraph before that...

If the paragraph is kept around, then I agree with something like:

... whether and when each call site to this function is reached must be uniform across a target-dependent set of simultaneously-executing threads ...

1159

Sure, no problem.

1163

I see. I played around with this some more, and here's an example I'd like to hear your thoughts on:

define void @foo(i32 %a) {
  ...
  call void @bar(i32 %a)
  ...
}

define void @bar(i32 %a) {
  call void @baz(i32 convergent %a)
}

This code has a potential problem because at least the parameter of bar should arguably have been labelled convergent.

I'd say that inlining shouldn't have to worry about the convergent attribute at all. However, if the definition we're discussing here takes a function-centric view, then we cannot prove in a target-independent way that inlining bar is correct (because foo gets a new callsite with a convergent function argument, where previously there were no such callsites, i.e. every pair of runs of foo was compatible). If the definition takes a program-centric view (i.e., executions that start at foo implicitly also consider the sequence of arguments passed into baz inside bar), then inlining bar is clearly correct.

That would be an argument in favour of staying with the language about "executions of a program" rather than "executions of a function".

nhaehnle added inline comments.Nov 21 2016, 8:15 AM
docs/LangRef.rst
1163

Hmm, looking at other transforms of similar cases, I think we have to say that cases like this are undefined behaviour.

Curious, I think the same applies to the convergent attribute for functions: if a function F contains calls to a convergent function G, but is not itself marked convergent, then calls to F should be undefined behaviour, right? LangRef is currently somewhat silent on this particular point.

hfinkel added inline comments.Nov 21 2016, 8:20 AM
docs/LangRef.rst
1163

The execution of a function semantically includes the execution of any functions called, so I still feel this is the better terminology. We're talking about preventing transformations that invalidate individual program executions; comparing different program executions is irrelevant, but each single program execution can contain multiple simultaneous executions of the functions in question. I think that the inlining would be correct regardless. Nothing that the inliner does will change the convergence properties of function arguments, as far as I can tell.

However, you also raise a second important point. If the first argument of @bar is not marked as convergent, then other optimizations around calls to @bar can ruin the convergent property regardless. Thus, it too needs to be marked.

hfinkel added inline comments.Nov 21 2016, 8:23 AM
docs/LangRef.rst
1163

Regarding the convergent function attribute, yes. This is why Clang in CUDA mode marks all functions as convergent and then we remove the attribute when we prove it is not needed (in lib/Transforms/IPO/FunctionAttrs.cpp).

nhaehnle updated this revision to Diff 79063.Nov 23 2016, 6:15 AM
nhaehnle edited edge metadata.
  • Clarify that the set of threads is target-dependent (I'm still open to eliminating that paragraph outright if people find it confusing)
  • Talk about executions of functions
  • No mention of concurrency or simultaneous execution (except for the motivational explanation in the first paragraph), per @mehdi_amini
  • Add an explicit paragraph about calling functions containing call-sites with convergent function parameters

On the last point, this is fairly restrictive right now. E.g. one could
ask what happens if the called function F passes a value that was loaded
from memory to a convergent parameter. At least for AMDGPU purposes it's
fine to leave that undefined, and probably will remain so for a long time.

When/If it does become an issue, I suspect that solving this will require a
new function attribute (convergentmem?). However, I'd rather leave this
open for now.

hfinkel added inline comments.Dec 2 2016, 4:43 PM
docs/LangRef.rst
1147

I think that would be better. I still find the current text in this paragraph and the next confusing because understanding them hinges on understanding the term uniform which we never define.

1164

So, to clarify: do we need to ensure compatibility between all call sites with a convergent argument, all call sites with a convergent argument to the same function, or all call sites with a convergent argument subject to some other condition?

nhaehnle added inline comments.Dec 7 2016, 6:59 AM
docs/LangRef.rst
1147

Okay, that's what the current version says.

1164

I hope we're not talking past each other, but for the definition of compatibility, each call site is considered separately.

Compatibility is a property of a pair of executions; it is satisfied if a certain property is satisfied at each call site to a function with convergent arguments.

When one proves that a transformation preserves compatibility, one would look at each call-site in the transformed code, and prove the property based on the fact that it was satisfied at some call-site in the original code, but the definition of compatibility itself is never concerned with the interaction or relationship of multiple call sites to each other.

hfinkel added inline comments.Dec 7 2016, 7:15 AM
docs/LangRef.rst
1147

Yes, but it still says uniform in the paragraph below. Please either define that term or don't use it.

1164

but for the definition of compatibility, each call site is considered separately.

That's what I thought, thanks! I just wanted to make sure.

jlebar accepted this revision.Dec 7 2016, 2:05 PM
jlebar edited edge metadata.

Since we seem to be getting close to agreement on the langref wording, I went ahead and looked at the rest of the code. Just one minor comment; otherwise it looks good to me. I guess I have a broader question: did you audit to make sure that these are the only places where we need to handle convergent args specifically? Do you need a reviewer to check that?

lib/Transforms/Utils/SimplifyCFG.cpp
1443

I don't understand the "because" clause here; can we rephrase?

This revision is now accepted and ready to land.Dec 7 2016, 2:05 PM
mehdi_amini requested changes to this revision.Dec 7 2016, 2:19 PM
mehdi_amini edited edge metadata.

I don't believe we're there yet on the wording.

This revision now requires changes to proceed.Dec 7 2016, 2:19 PM

"Divergence" is used in multiple places in comments in the code; that isn't great, as it is based on SIMT divergence. And even worse, it is *data* divergence, while we usually consider control flow when talking about divergence on a GPU. I don't like the fact that we are slowly spreading this everywhere.

I also think we should have someone else take a fresh look at the LangRef change as well. The wording seems "correct" to me, but it is not necessarily easy to understand, or to form a mental model from, for an engineer working on optimizations on the LLVM IR. @sanjoy or @chandlerc maybe?

docs/LangRef.rst
1149

I don't like this paragraph, and especially the use of "threads", which in general does not imply the lock-step model. It seems like a loose definition that makes little sense to anyone not versed in SIMT and GPU execution models.

1154

I believe: s/function argument/function parameter/ ( http://llvm.org/docs/LangRef.html#parameter-attributes )
(Check the other places as well)

1156

This paragraph isn't providing much either; it refers to a definition of the `convergent` attribute on functions that is not present elsewhere in the document, and it is still using "uniform".

1163

What is the current status on this? How do we propagate/infer this attribute?

Thank you for taking a look.

I mostly agree about the use of 'divergence'; I think the places where it's necessary can be filled in using the language of executions. However, there's definitely precedent for talking about divergence of data in LLVM. The DivergenceAnalysis pass explicitly determines divergence vs. uniformity of data. The purpose is ultimately to decide whether control flow is uniform or divergent, but to decide that, it obviously has to look at the divergence of data as well, because branch conditions are data.

docs/LangRef.rst
1149

"threads" is literally the term used by AMD for the thing we're talking about: threads executing in lock-step within a single "wave". It would be counter-productive to deviate from that.

And yes, "thread" doesn't always imply lock-step, but the paragraph does explicitly talk about "operations that are executed simultaneously for multiple threads". Maybe this would help drive the point home even more:

In some parallel execution models, there exist operations that are executed by a single hardware execution unit on behalf of multiple threads simultaneously and one or more parameters of the operation ...

... but I was previously encouraged to stay away from talking about hardware in the LangRef, so I have to say that this feels increasingly like bike-shedding to me.

I'd actually be happy to use language that admits that at the end of the day, this is all about doing something useful with hardware, because I think it helps understand the intuition behind the definition. But the back-and-forth needs to converge at some point (in the limit sense, not in the attribute sense ;-)).

1154

Ok, will do.

1156

Ok, let's just kill it. Of all the fuss about this definition, the one thing that this paragraph was really meant to address hasn't ever come up, so... (although, perhaps it hasn't come up precisely because the paragraph served its purpose :))

1163

See the paragraph starting at line 1198 ("Calling a function F that contains a call-site CS..."). Also, to quote myself about what that paragraph has to say:

this is fairly restrictive right now. E.g. one could
ask what happens if the called function F passes a value that was loaded
from memory to a convergent parameter. At least for AMDGPU purposes it's
fine to leave that undefined, and probably will remain so for a long time.

We currently have no need to actually either infer or remove the attribute, but while the paragraph doesn't explicitly state it, it contains all the definitional power to work with and without a flow that does what is done for the convergent function attribute in some compilation flows: add it everywhere in the beginning, then remove it in places where it can be proved to be unneeded.

lib/Transforms/Utils/SimplifyCFG.cpp
1443

Yes, let's rephrase using the language of executions:

// We cannot sink calls if any parameter is convergent: before sinking, two
// executions that go through different predecessor blocks are compatible with
// respect to convergent parameters at the considered call-sites, even if the
// concrete values passed to the called function differ. After sinking, they
// are no longer compatible, because different concrete values are passed to
// the called function at the same call-site.

Or the strategically better variant:

// We cannot sink calls if any parameter is convergent.
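To make the sinking hazard concrete, here is a minimal Python simulation (a sketch only; the four-lane lockstep model and all names are hypothetical simplifications, and inactive-lane masking during divergent branches is elided):

```python
# Toy SIMT model: four lanes of one wavefront execute in lockstep.
LANES = [0, 1, 2, 3]

def texel_fetch(samplers, coords):
    # Convergent first parameter: every lane must pass the same sampler.
    assert len(set(samplers)) == 1, "non-uniform convergent argument"
    return [s * 100 + c for s, c in zip(samplers, coords)]

cond = [True, True, False, False]   # divergent branch condition
sampler0, sampler1 = 7, 8

# Before sinking: each call-site receives a wavefront-uniform sampler.
v1 = texel_fetch([sampler0] * 4, LANES)   # call in the "then" block
v2 = texel_fetch([sampler1] * 4, LANES)   # call in the "else" block
v = [a if c else b for c, a, b in zip(cond, v1, v2)]
print(v)  # each lane keeps the result of its own path

# After sinking: the select feeds a non-uniform sampler into one call-site.
sunk_samplers = [sampler0 if c else sampler1 for c in cond]
sink_is_valid = True
try:
    texel_fetch(sunk_samplers, LANES)
except AssertionError:
    sink_is_valid = False
print(sink_is_valid)  # False: the sunk call breaks the convergent contract
```

Sinking the coordinate argument in the same way would be fine, because the coordinate parameter is not convergent; only the sampler argument carries the uniformity requirement.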

Thank you for taking a look.

I mostly agree about the use of 'divergence'; I think the places where it's necessary can be filled in using the language of executions. However, there's definitely precedent for talking about divergence of data in LLVM. The DivergenceAnalysis pass explicitly determines divergence vs. uniformity of data. The purpose is ultimately to decide whether control flow is uniform or divergent, but to decide that, it obviously has to look at the divergence of data as well, because branch conditions are data.

We need to clarify, there are two parts in my comments:

  1. "divergence" is not common in LLVM. I stand on this point: the DivergenceAnalysis is a GPU-specific pass and is hardly part of the core semantics of the IR.
  2. "divergence" as it is usually encountered is about control flow and not data flow.

About 2, since you're pointing at the DivergenceAnalysis pass, it mentions "This file implements divergence analysis which determines whether a branch in a GPU program is divergent." My understanding of divergence in this context has always been control-flow divergence, which does not "break" SSA.

I mean that unless I missed something here, the usual notion of divergence does not prevent the folding we took as an example in LangRef for this patch:

if (cond) {
     Tmp1 = Foo(v [convergent])
   } else {
     Tmp2 = Foo(v [convergent])
   }
   Tmp3 = phi [Tmp1, Tmp2]

into

Tmp3 = Foo(v [convergent])

I'm looking at the comment "sinking may add divergence even when the argument is the same IR value in all blocks": here there is no control-flow divergence, and I don't believe a GPU programmer would say that the transformation above "could create divergence".

(Of course I understand the comment, and it makes perfect sense to me, but I have been following this patch and it is hard to judge, which is why I asked for @chandlerc or @sanjoy to provide another look at this).

docs/LangRef.rst
1149

The fact that AMD is using "threads" is irrelevant to me: we're not writing an AMDGPU documentation here. This is why the use of threads alone seems wrong and misleading to me in the context of this document. (My point wouldn't stand in a GPU-centric document, but the GPU model is not part of the LangRef right now)

Otherwise, I would prefer that we spell it clearly, something along these lines:

GPUs (and other similar hardware) are frequently based on a SIMT (or SPMD) model, where a single hardware execution unit executes identical operations simultaneously on behalf of multiple threads of execution. Some of these operations require that one or more arguments of the operation must have the same value across all simultaneously executing threads. Such arguments are called ``convergent``.

We need to clarify, there are two parts in my comments:

  1. "divergence" is not common in LLVM. I stand on this point: the DivergenceAnalysis is a GPU-specific pass and is hardly part of the core semantics of the IR.

Fair enough.

  2. "divergence" as it is usually encountered is about control flow and not data flow.

About 2, since you're pointing at the DivergenceAnalysis pass, it mentions "This file implements divergence analysis which determines whether a branch in a GPU program is divergent." My understanding of divergence in this context has always been control-flow divergence, which does not "break" SSA.

I mean that unless I missed something here, the usual notion of divergence does not prevent the folding we took as an example in LangRef for this patch:

if (cond) {
     Tmp1 = Foo(v [convergent])
   } else {
     Tmp2 = Foo(v [convergent])
   }
   Tmp3 = phi [Tmp1, Tmp2]

into

Tmp3 = Foo(v [convergent])

I'm looking at the comment "sinking may add divergence even when the argument is the same IR value in all blocks": here there is no control-flow divergence, and I don't believe a GPU programmer would say that the transformation above "could create divergence".

(Of course I understand the comment, and it makes perfect sense to me, but I have been following this patch and it is hard to judge, which is why I asked for @chandlerc or @sanjoy to provide another look at this).

Within the scope of LLVM, I agree with you. In the actual context of GPU programming, at least the GLSL spec doesn't really use "divergent" anyway and instead talks about uniform vs. non-uniform. Rules involving (dynamically) uniform data and uniform control flow are actually roughly equally frequent. It's just that when you write a compiler for it, the control flow part is by far the bigger issue. The data part mostly takes care of itself from the compiler perspective, especially when optimizations aren't too aggressive, which is why we managed to miss the issue that this patch addresses for so long.

That said, I agree that "create divergence" is unfortunate phrasing either way. If one wanted to stick to "divergence", it would be better to say that it exposes divergence or something along those lines.

docs/LangRef.rst
1149

My point was that inventing a new term to use instead of 'threads' is counter-productive, but it seems we agree on that anyway. I agree that something needs to be added for the purpose of LangRef, and I originally thought that talking about "simultaneously executing" would be sufficient, but clearly it isn't. I'd be happy to use your wording.

sanjoy edited edge metadata.Dec 9 2016, 4:35 PM

Few comments inline

docs/LangRef.rst
1147

(Based on what I've heard about GPUs) Should you emphasize that they're executed in "lockstep"? Or is that not a correct statement?

1158

I think you need to be specific and explicit about how you're relating "executions" between the pre-transform and the post-transform functions. For instance if the pre-transform function is:

void f(int* val dereferenceable(8)) {
  int k = *val
}

and the post-transform program is:

void f(int* val dereferenceable(8)) {
  int k0 = *(long *)val;
  int k = k0 & 0xff..ff;
}

an execution of the pre-transform program where k = 400 may not make sense in the context of the post-transform program, since you need to produce an 8-byte value for k0.

1175

What if Foo is a function pointer that we loaded from some place (and thus we don't have a declaration that we can inspect to know for sure if the argument is marked convergent)? This rule would imply we can't do this sinking transform in that case.

1199

This is a little tricky -- what if the source program is:

void f(int convergent i);
void g(int convergent j, int /* normal */ k) {
  f(j - j + k);
}

Can we simplify the call site to call f(k)? If so, did we introduce UB in functions that call g?

mehdi_amini added inline comments.Dec 9 2016, 4:51 PM
docs/LangRef.rst
1199

Isn't your source program already UB? You have a call site with an argument that is data-dependent on a non-convergent parameter, k.

The kind of annoying and SSA-breaking limitation of convergent would be:

void f(int convergent i);
void g(int convergent j, int /* normal */ k) {
  if (j == k)
     f(j);
}

You can't replace f(j); with f(k); IIUC.

sanjoy added inline comments.Dec 9 2016, 5:11 PM
docs/LangRef.rst
1199

You're right, I misread "only data-dependencies" as "there must be at least one convergent data dependency".

The other interesting corner case here (assuming I did not misread a second time around) is around PHI nodes and selects -- if we have:

void @other_convergent(int convergent x);
void @f(int convergent k, int /* normal */ l) {
entry:
  br cond label left, label merge

left:
  br label merge

merge:
  val = PHI [ k, entry ], [ l, left ]
  call @other_convergent(val);
}

Is calling the function above well defined if cond is always false? What happens if we convert the phi to a select? What happens when we convert the select to arithmetic?

Thank you for taking a look!

docs/LangRef.rst
1147

It is a correct statement; whether it needs to be emphasized is anyone's guess ;-) I didn't think so, but it seems that this is a bit of a point of contention, so probably something like it is required. Did you see @mehdi_amini 's alternative wording for this paragraph? Copying it here:

GPUs (and other similar hardware) are frequently based on a SIMT (or SPMD) model, where a single hardware execution unit executes identical operations simultaneously on behalf of multiple threads of execution. Some of these operations require that one or more arguments of the operation must have the same value across all simultaneously executing threads. Such arguments are called `convergent`.

I'd be fine with it, and I think it's fairly clear.

1158

As far as I'd understand it, the initial state of memory is part of what defines an execution. Since the parameter is dereferenceable(8), that includes 8 bytes of memory even in the original program.

1175

In that case, we may not have a way to specify that the parameter is convergent. This may be a gap in this patch (I haven't actually checked), but not a gap that needs to be filled immediately.

This is similar to the convergent attribute of functions: if you're calling through a function pointer, you may want to add the convergent attribute to the call-site. As long as that attribute isn't present, you can assume that the called function is not convergent. The same holds for the function parameter attribute.

1199

@mehdi_amini: This example is interesting, because the transform breaks the condition of the definition, but it is correct in the target semantics. This is very unfortunate, and I don't have an immediate answer for it.

@sanjoy's second example is similar. My first intuition would actually have been to say that in all variations, calling the function is undefined. But that's not great if in the future you want to map the semantics of higher-level languages directly to LLVM, including function calls, because there the example is perfectly fine.

Anyway, @mehdi_amini's example is the bigger problem. Forcing GVN and others to keep track of the convergent-ness is clearly a bad idea. Thinking out loud, maybe the job could be done by a function attribute that indicates that any parameter may have to be treated as convergent.

curan added a comment.Jan 3 2017, 7:33 AM

Is there any progress on this? It'd be really nice if this could land so I could enjoy HW acceleration for videos with mpv without the need to constantly build LLVM and carry an additional patch.

Is there any progress on this? It'd be really nice if this could land so I could enjoy HW acceleration for videos with mpv without the need to constantly build LLVM and carry an additional patch.

I'm fairly confident it is not gonna ship in LLVM 4.0 as we branch tomorrow.

dark_sylinc added a comment.EditedFeb 12 2017, 12:38 PM

Ok!

So I landed here after a referral. We were literally hit by this bug in LLVM (https://bugs.freedesktop.org/show_bug.cgi?id=99780)

If anyone needs something tangible to know what a miscompiled program looks like due to a missing convergent argument, this is what the bug looks like:
http://imgur.com/43tIuI6
http://imgur.com/iBhxoBj

From reading this thread I see there are a few problems with the terms, some confusions, as well as with understanding why this is needed.

So I'm going to chime in with an explanation (feel free to add it to the code) in the hope of speeding up the acceptance/enhancement of this patch:

GCN executes a 'wavefront'. Each wavefront consists of 64 'waves' (aka threads). In NVIDIA's lingo, wavefront = warp and wave = thread; and 1 warp consists of 32 threads (though the actual numbers can vary across architectures and are not relevant).

If a video card needs to process 128 pixels, it will likely do so using at least 2 wavefronts (2 wavefronts x 64 waves = 128).
OpenGL uses the term "uniform" to refer to parameters that will be constant for all wavefronts within the same dispatch.
The problem being addressed here is not uniformity in OpenGL terms, but rather convergence. A convergent parameter must be the same for all waves within the same wavefront, but it is OK if it differs across different wavefronts.
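A hedged sketch of that distinction (the helper name and wavefront partitioning are illustrative, not real driver code): a value can be non-uniform across the whole dispatch yet still convergent, as long as it is uniform within every wavefront.

```python
WAVE_SIZE = 64  # GCN wavefront size; an NVIDIA warp would use 32

def is_convergent(per_thread_values, wave_size=WAVE_SIZE):
    """True if the per-thread values are uniform within each wavefront."""
    return all(len(set(per_thread_values[i:i + wave_size])) == 1
               for i in range(0, len(per_thread_values), wave_size))

# 128 pixels -> 2 wavefronts; same value per wavefront, different across them.
per_thread = [10] * 64 + [20] * 64
print(is_convergent(per_thread))   # True: convergent
print(len(set(per_thread)) == 1)   # False: not uniform across the dispatch

# A value that differs inside a single wavefront is divergent there.
print(is_convergent([10] * 63 + [11] + [20] * 64))  # False
```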

The graph illustrates the idea much better than words:

Terms get messier if we add DirectX terminology, because Microsoft uses the term "const" to refer to what OpenGL understands as "uniform". In fact, DirectX treats uniforms as a synonym of convergence.
I'd suggest you stick to OpenGL terms for consistency with your surrounding ecosystem.

DX12's HLSL (via Shader Model 5.1) allows programmers to explicitly tag variables that require convergence as such (by default DX12 assumes variables do not need to enforce convergence on indexing, but it requires it on branches), as this can lead to performance optimizations if we promise in advance that our reads won't diverge.
One such example is rendering sprites via textured quads (1 quad = 2 triangles): Every sprite may have its own texture (and therefore index a different texture). However we promise that each triangle will read the same texture. Due to rasterization ordering rules, different triangles won't be put together in the same wavefront therefore giving us the ability to lift away this restriction and have the compiler produce faster code.

Unfortunately GLSL doesn't have this modifier (yet?), but it also assumes variables do not need convergence, just like DX12. The problem, however, is that LLVM is compiling code incorrectly:

As per both OpenGL & DirectX rules, the following does not need to have convergence enforced at compiler level:

in int myIndex;
sampler2D myTextures[2];

texture( myTextures[myIndex], uv ); //If myIndex diverges, it's programmer's fault.

However the following DOES require to enforce convergence at compiler level:

in int myIndex;
sampler2D myTextures[2];

if( myIndex == 0 )
    texture( myTextures[0], uv );
else
    texture( myTextures[1], uv );

The problem is that, in the latter case, LLVM is currently "optimizing" the code to execute like the former to save one image_sample instruction, which is wrong.

Right now what I see wrong about the patch's terminology is that it marks variables that require the compiler to enforce convergence as "convergent". That's exactly the opposite! They should be tagged as "divergent". If a variable is divergent, then the compiler must enforce convergence. If a variable is already convergent, then it can perform aggressive optimizations. You don't enforce convergence on something that is already convergent!
It's like calling a C variable "const" because it can mutate. I think this contradiction is what's causing a buzzing noise for some of the reviewers.

I propose instead:

  1. Reserve keyword "convergent_branch". If a variable is tagged as such, it means additional optimizations can be done (that could be dangerous if the programmer isn't careful). In C terms, tagging a variable "convergent_branch" is analogous to tagging a C pointer with __restrict in terms of programmer responsibility.
  2. Implement keyword "divergent_branch" (what the current patch refers to as "convergent"). If a variable is tagged divergent_branch, it means the compiler must go to great lengths to ensure correct output even if branches diverge.
  3. Reserve the keywords "divergent_index" & "convergent_index" for future use.
  4. Tagging a variable as both convergent_branch and divergent_branch at the same time would be illegal and contradictory.
  5. Tagging a variable as both convergent_index and divergent_index at the same time would be illegal and contradictory.
  6. It is legal for a variable to be both divergent_index and convergent_branch at the same time, or both convergent_index and divergent_branch.
  7. Full level paranoia would be to tag a variable as both divergent_index & divergent_branch
  8. Maximum optimization can be obtained by tagging a variable as both convergent_index & convergent_branch

About terms:
What does a variable that is marked as divergent mean?
It means the compiler must ensure results are correct even if execution diverges within the same wavefront. We say it is divergent because the compiler must enforce convergence at the code level.
What does a variable that is marked as convergent mean?
It means the compiler is free to optimize code in a way that could produce improper output if the programmer doesn't guarantee convergence via external means (e.g. by the data fed to the program, by exploiting rasterization rules, etc.)

I hope this explanation and proposal (to change the name of the keyword currently being used) sheds light on all of this, and accelerates the inclusion of the final patch.
I'm not a language lawyer, I'm not an LLVM expert either. I'm just a graphics guy who was pissed enough to care.

Cheers
Matias

I've given it a little more thought during coffee break.

divergent_branch, divergent_index, convergent_branch & convergent_index aren't enough.
At least one more modifier is needed: cannot_diverge (or similar name).

By default most variables would be convergent_branch convergent_index; while texture variables would be divergent_branch convergent_index. These defaults could be changed on demand (i.e. to accommodate GPU arch needs).

Consider the following examples:

Example 1:

float4 texture( cannot_diverge_branch sampler2D texSampler, float2 uv ); //Internally implemented by LLVM to map to a GPU instruction(s).

divergent_branch convergent_index sampler2D myTexture0;
divergent_branch convergent_index sampler2D myTexture1;
convergent_branch convergent_index int condValue;
convergent_branch convergent_index float2 uv;

void main()
{
    if( condValue > 1 )
        outColour = texture( myTexture0, uv );
    else
        outColour = texture( myTexture1, uv );
}

Here in this case, myTexture0 & myTexture1 are both marked as divergent_branch. Because we enter a branch and we are calling texture() which has cannot_diverge on its argument, the following should happen:

  1. LLVM checks that an argument with divergent_branch was passed to a function that requires this argument to not diverge.
  2. The branch can only be flattened if it performs 2 texture samples and then a branchless select. It cannot do the branchless select first and then perform 1 sample.
  3. myTexture0 can not be factored out of the branch to do branchless selection with myTexture1.
  4. myTexture1 can not be factored out of the branch to do branchless selection with myTexture0.

If myTexture0 & myTexture1 were both to be tagged as convergent_branch instead, then there would be no problem.

Example 2:

divergent_branch convergent_index sampler2D myTexture0;
divergent_branch convergent_index sampler2D myTexture1;
convergent_branch convergent_index int condValue;
convergent_branch convergent_index float2 uv;

float4 doSomething( sampler2D texSampler, float2 uv )
{
    return uv.xyxy;
}

void main()
{
    if( condValue > 1 )
        outColour = doSomething( myTexture0, uv );
    else
        outColour = doSomething( myTexture1, uv );
}

Here, 'doSomething' was called. However, its argument does not have cannot_diverge; therefore the branch can still be flattened in any way. myTexture0 is not used, but if it could be used in any form (arithmetic operations??? not possible, but just for the sake of the example), it could be moved by the optimizer outside the branch for branchless select against myTexture1.

Example 3:

divergent_branch convergent_index sampler2D myTexture0;
divergent_branch convergent_index sampler2D myTexture1;
convergent_branch convergent_index int condValue;
convergent_branch convergent_index float2 uv;

float4 doSomething( sampler2D texSampler, float2 uv )
{
    return texture( texSampler, uv.xy );
}

void main()
{
    if( condValue > 1 )
        outColour = doSomething( myTexture0, uv );
    else
        outColour = doSomething( myTexture1, uv );
}

Here, doSomething() does not have the 'cannot_diverge' attribute, but it ends up calling texture() which does. Therefore the same outcome as in Example 1 happens.

Example 4:

divergent_branch convergent_index sampler2D myTexture0[4];
divergent_branch convergent_index sampler2D myTexture1;
convergent_branch convergent_index int condValue;
convergent_branch convergent_index int index;
convergent_branch convergent_index float2 uv;

float4 doSomething( sampler2D texSampler, float2 uv )
{
    return texture( texSampler, uv.xy );
}

void main()
{
    if( condValue > 1 )
        outColour = doSomething( myTexture0[index], uv );
    else
        outColour = doSomething( myTexture1, uv );
}

Here myTexture0 was indexed. Since the sampler is marked as convergent_index, the check of cannot_diverge_branch && divergent_branch is ignored. This means myTexture0 can be factored out of the branch, the branch can be flattened in any way, etc.

Example 5:

divergent_branch convergent_index sampler2D myTexture0[4];
divergent_branch convergent_index sampler2D myTexture1;
convergent_branch convergent_index int condValue;
convergent_branch convergent_index int index;
convergent_branch convergent_index float2 uv;

float4 doSomething( sampler2D texSampler, float2 uv )
{
    return texture( texSampler, uv.xy );
}

void main()
{
    if( condValue > 1 )
        outColour = doSomething( myTexture0[divergent_index(index)], uv );
    else
        outColour = doSomething( myTexture1, uv );
}

This is like Example 4; but the index caused the variable texSampler = myTexture0[divergent_index(index)] to become divergent_index as well.
This means the compiler now behaves like in Example 1.

In summary, the final logic inside LLVM would be something like this:

if( function.argument.cannot_diverge_branch &&
    (variable.divergent_branch || variable.divergent_index) &&
    !variable.was_converge_indexed /*set to true if it was read from an array without divergent_index(index)*/ )
{
    beCarefulWithFlattenning( variable );
    cannotMoveOutOfBranch( variable );
}

I don't know how easy/hard it would be to implement all of this; but it is clear the number of use cases cannot be handled easily with just one attribute if one intends to compile the code to produce correct output AND optimize as much as possible.
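The decision sketched in the pseudocode above could be spelled out like this (a sketch only; all attribute names are the hypothetical ones proposed in this comment, not existing LLVM attributes):

```python
from dataclasses import dataclass

@dataclass
class ParamAttr:
    cannot_diverge_branch: bool = False

@dataclass
class ValueAttr:
    divergent_branch: bool = False
    divergent_index: bool = False
    # True if it was read from an array without divergent_index(index)
    was_convergent_indexed: bool = False

def must_restrict(param: ParamAttr, value: ValueAttr) -> bool:
    """Mirror of the pseudocode: when the optimizer must be careful with
    flattening and must not move the value out of its branch."""
    return (param.cannot_diverge_branch
            and (value.divergent_branch or value.divergent_index)
            and not value.was_convergent_indexed)

# Example 1: divergent sampler passed to texture() -> restricted.
print(must_restrict(ParamAttr(cannot_diverge_branch=True),
                    ValueAttr(divergent_branch=True)))          # True
# Example 4: same sampler reached through convergent indexing -> unrestricted.
print(must_restrict(ParamAttr(cannot_diverge_branch=True),
                    ValueAttr(divergent_branch=True,
                              was_convergent_indexed=True)))    # False
```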

Of course at the time being implementing divergent_branch & cannot_diverge is the most urgent matter. The rest of the keywords could be left reserved.

May I point out this bug does not only produce graphical artifacts. It can also cause GPU hangs:

int nCount = 0;
if( someCond )
    nCount = texture( myTex0, uv );
else
    nCount = texture( myTex1, uv );
    
for( int i=0; i<nCount; ++i )
{
    ...
}

What happens if nCount is read from the wrong path and ends up being a very large value? That's right: a GPU hang. (If someone's lost here: a GPU hang is when the GPU stops responding to graphics commands like updating the display. It can be caused by HW malfunction, a crash, or being stuck in a very long computation.)
Considering there are a lot of GPU hang tickets out there in the radeonsi bugtracker, I wouldn't be surprised if one of those hangs is actually caused by this. I cannot stress enough that this should be a priority bug.

Currently, the problem is that right now LLVM in the latter case is "optimizing" the code to execute like the former code to save one image_sample instruction, which is wrong.

I'd like to clarify here that the existing bug is not really in the LLVM optimizer, but rather in the fact that graphics frontends are not emitting these calls with the right set of conservative attributes. The immediate fix would be for them to change this and not let the optimizer mess with these operations. (And not to put pressure on LLVM to adopt this patch as a correctness fix; it is not one, or only incidentally.)
(Of course being conservative means losing some optimization possibilities in some cases, and the purpose of this patch is to help get these back with finer modeling of the constraint).

Right now what I see wrong about the patch' terminology is that it marks variables requiring to enforce convergence at compiler level as "convergent". That's exactly the opposite! It should be tagged as "divergent". If a variable is divergent, then the compiler must enforce convergence. If a variable is already convergent, then it can perform aggressive optimizations. You don't enforce convergence on something that is already convergent!

I'm not sure I follow. We're maybe not talking about the same thing. We're not constraining variables here, but rather "operations", if I get it correctly (going back to this every two months isn't making it easy).
IIUC, taking your examples:

int myIndex;
 sampler2D myTextures[2];
 texture( myTextures[myIndex], uv ); //If myIndex diverges, it's programmer's fault.

The information we're trying to carry right now with this patch, is not about myTextures[myIndex], but about the fact that a call to texture needs to preserve the convergence on its first argument.

We are *not* carrying any information supplied by the programmer about the convergence property of a variable. We can just assume that in the source program the first argument of a call to texture is convergent, and we need to maintain it.

I propose instead:
[snip]

None of it seems to fit LLVM IR but looks rather like a language specification to me.
As a starting point, LLVM IR doesn't have any concept of "variables". We have SSA values and memory (which can be stack allocations or not).

It also seems to me that you're trying to bring in the source-level programmer annotations of convergence on source-program variables in order to enable more optimizations. There may be some optimizations to do around this, but that seems orthogonal to what this patch is doing, which is, as far as I understand, modeling the correctness constraint associated with a call to texture (i.e. expressing the property that needs to be maintained on the first argument).

I've given it a little more thought during coffee break.

divergent_branch, divergent_index, convergent_branch & convergent_index aren't enough.
At least one more modifier is needed: cannot_diverge (or similar name).

You just figured out what this patch is about :)

The only thing the current patch is doing is exactly adding this knowledge to LLVM, you wrote

float4 texture( cannot_diverge_branch sampler2D texSampler, float2 uv ); //Internally implemented by LLVM to map to a GPU instruction(s).

This patch is proposing:

float4 texture(convergent sampler2D texSampler, float2 uv ); //Internally implemented by LLVM to map to a GPU instruction(s).

I don't think we really need to care about tagging variables right now, there might be some more optimizations possible, but that should be addressed separately.

I cannot stress enough this should be a priority bug.

As I mentioned previously, I believe it is very easy to fix, no need to change LLVM: marking the texture operation as having unknown side effects should be enough.

Hi!

As I said I'm not an LLVM expert so if someone thinks I'm confusing IR with source level aspects then I won't get in the way.

You just figured out what this patch is about :)

The only thing the current patch is doing is exactly adding this knowledge to LLVM, you wrote

float4 texture( cannot_diverge_branch sampler2D texSampler, float2 uv ); //Internally implemented by LLVM to map to a GPU instruction(s).

This patch is proposing:

float4 texture(convergent sampler2D texSampler, float2 uv ); //Internally implemented by LLVM to map to a GPU instruction(s).

I don't think we really need to care about tagging variables right now, there might be some more optimizations possible, but that should be addressed separately.

Then we agree and we understand each other. What I was saying is to add even more information (information which may come from source-level annotations, may be deduced automatically, or may depend on the GPU arch) in order to perform aggressive optimizations; if you believe that should be addressed separately, I won't object.

As I mentioned previously, I believe it is very easy to fix, no need to change LLVM: marking the texture operation as having unknown side effects should be enough.

I will pass this on to the radeonsi ticket.

Hi there,

This patch should also fix https://bugs.freedesktop.org/show_bug.cgi?id=99484 and the minor rendering issue (Dirt Rally) I tried to fix with https://reviews.llvm.org/D30609.

Although my patch works, it's not the correct solution and it's also too aggressive. This one definitely makes more sense.

Please read above my comment about this bug and how it can be fixed without changing LLVM.

Please read above my comment about this bug and how it can be fixed without changing LLVM.

While true, making texture calls have unknown side effects prevents the compiler from doing any reordering at all... on an architecture that heavily relies on latency hiding (both horizontal, i.e. wave occupancy, and vertical, i.e. instruction reordering).

Btw. I've been running Mesa radeonsi with this patch on top of LLVM 5.0 for my daily work and leisure time for almost a month now, and haven't encountered any issue.

Please read above my comment about this bug and how it can be fixed without changing LLVM.

While true, making texture calls have unknown side effects prevents the compiler from doing any reordering at all... on an architecture that heavily relies on latency hiding (both horizontal, i.e. wave occupancy, and vertical, i.e. instruction reordering).

Right now, frontends are given a choice between performance and correctness...
My point is that one can't knowingly choose performance over correctness, and then complain about an LLVM bug that needs to be fixed. The bug is in the shader compiler in the first place!

I perfectly understand how having a solution to enable these optimizations is very important for shader compilers.
It is likely that if I was in charge of such a shader compiler, I'd try to explore how to express my intrinsics differently such that I can get "most" optimizations (i.e. not having to say to LLVM that this call can do anything).

Example to think about: can tokens help? Could the intrinsic be modeled by taking a pointer to a memory location instead of an SSA value? (There are finer-grained ways to express the effect on a pointer argument, and it wouldn't break SSA.)

Btw. I've been running Mesa radeonsi with this patch on top of LLVM 5.0 for my daily work and leisure time for almost a month now, and haven't encountered any issue.

Great. But be aware that anecdotal data is unlikely to expedite a review here. What is needed is someone to spend the time to find a solution to, and nail down, the issues discussed in the LangRef comments above.

Please read above my comment about this bug and how it can be fixed without changing LLVM.

While true, making texture calls have unknown side effects prevents the compiler from doing any reordering at all... on an architecture that heavily relies on latency hiding (both horizontal, i.e. wave occupancy, and vertical, i.e. instruction reordering).

Right now, frontends are given a choice between performance and correctness...
My point is that one can't knowingly choose performance over correctness, and then complain about an LLVM bug that needs to be fixed. The bug is in the shader compiler in the first place!

I missed the rationale for this statement, but this does not seem obviously correct to me. Even marking the intrinsics as having unknown side effects does not seem strong enough to ensure the semantics required here - we don't really have a strong definition for "unknown side effects", AFAIK, but I'd not have thought that even unknown side effects on Foo would prevent the folding of:

if (cond) {
  Tmp1 = Foo(v [convergent])
} else {
  Tmp2 = Foo(v [convergent])
}
Tmp3 = phi [Tmp1, Tmp2]

and so, no, I don't think we currently have a solution for this of any kind (even a conservative one) - at least in theory. The closest thing I can think of might be the sideeffect marker on inline-asm statements: That might prevent the folding if we allow inline asm statements to define labels visible to other inline asm statements?

I perfectly understand how having a solution to enable these optimizations is very important for shader compilers.
It is likely that if I were in charge of such a shader compiler, I'd try to explore how to express my intrinsics differently so that I can get "most" optimizations (i.e. not having to tell LLVM that this call can do anything).


You're totally correct. I don't see any attribute that would model the convergent effect.
The issue stays the same: the intrinsics are "lying" by using SSA values with unmodeled constraints.

Example to think about: can token help? Could the intrinsic be modeled as taking a pointer to a memory location instead of an SSA value? (There are finer-grained ways to express the effect on a pointer argument, and it wouldn't break SSA.)

Let me expand on this aspect instead, what about something like:

void foo(int v, bool cond) {
  if (cond) {
    token1 = statepoint()
    bar(v, token1); // v needs to be convergent
  } else {
    token2 = statepoint()
    bar(v, token2);  // v needs to be convergent
  }
}

This is trying to capture the value of v at a given point using the token to bind the call to bar to this particular value of v.

sanjoy added a comment.Mar 7 2017, 9:25 AM


Is it okay to fold that to:

void foo(int v, bool cond) {
  if (cond) {
    // token1 = statepoint()
    // bar(v, token1); // v needs to be convergent
  } else {
    // token2 = statepoint()
    // bar(v, token2);  // v needs to be convergent
  }
  token_merged = statepoint()
  bar(v, token_merged); // v needs to be convergent
}


Yes, that is ok.

Examples that are ok to fold:

//Example 1: Safe to fold.
void foo(int v, bool cond) {
  if (cond) {
    // token1 = statepoint()
    // bar(v, token1); // v needs to be convergent
  } else {
    // token2 = statepoint()
    // bar(v, token2);  // v needs to be convergent
  }
  token_merged = statepoint()
  bar(v, token_merged); // v needs to be convergent
}

//Example 2: Unsafe to fold, but the language allows it, thus it is the programmer's
//responsibility to ensure correctness through the data fed to the machine.
void foo(int v, bool cond) {
  token = statepoint()
  if (cond) {
    //idx0 = statepoint();
  } else {
    //idx1 = statepoint();
  }
  idx3 = phi[idx0, idx1];
  bar(v[idx3], token); // v needs to be convergent
}

//Example 3: Safe to fold, proven at compile time that v1 == v2
void foo(int v, bool cond) {
  token = statepoint()
  if (cond) {
    // v1 = statepoint()
    // bar(v1, token); // v needs to be convergent
  } else {
    // v2 = statepoint()
    // bar(v2, token);  // v needs to be convergent
  }
  //proven at compile time that v1 == v2 or that cond is always true / always false
  bar(v1, token); // v needs to be convergent
}

//Example 4: Safe to fold, proven at compile time that idx0 == idx1
void foo(int v, bool cond) {
  token = statepoint()
  if (cond) {
    //idx0 = statepoint();
  } else {
    //idx1 = statepoint();
  }
  //proven at compile time that idx0 == idx1 or that cond is always true / always false
  bar(v[idx1], token); // v needs to be convergent
}

Examples that are NOT ok to fold:

//Example 5: Unsafe to fold, must not be folded.
void foo(int v, bool cond) {
  token = statepoint()
  if (cond) {
    // v1 = statepoint()
    // bar(v1, token); // v needs to be convergent
  } else {
    // v2 = statepoint()
    // bar(v2, token);  // v needs to be convergent
  }
  v_merged = phi[v1, v2]
  bar(v_merged, token); // v needs to be convergent
}

//Example 6: (HLSL only) Unsafe to fold and programmer explicitly requested not to fold through language semantics (NonUniformResourceIndex).
//This is identical to Example 2, except the programmer explicitly asked not to fold through a special keyword or intrinsic.
//As far as I can see support for differentiating between these two scenarios is beyond the scope of this patch.
void foo(int v, bool cond) {
  token = statepoint()
  if (cond) {
    //idx0 = statepoint();
  } else {
    //idx1 = statepoint();
  }
  idx3 = nonconvergent_phi[idx0, idx1];
  bar(v[idx3], token); // v needs to be convergent
}

//Example 1: Safe to fold.
void foo(int v, bool cond) {

if (cond) {
  // token1 = statepoint()
  // bar(v, token1); // v needs to be convergent
} else {
  // token2 = statepoint()
  // bar(v, token2);  // v needs to be convergent
}
token_merged = statepoint()
bar(v, token_merged); // v needs to be convergent

}

My understanding was that this is *not* OK to fold: we don't know that two different threads have the same value for v. Can you elaborate? It is very possible that I'm misunderstanding something here.

I meant that T1 would run with { v1 = 0; cond = true; } and T2 with { v1 = 1; cond = false; }


My mistake was not being explicit enough in the examples. My apologies. GPUs differentiate between dispatch, threadgroup and thread. A dispatch is made of threadgroups, and a threadgroup is made of threads.
Uniform values are read-only and have the same value for the whole dispatch, while convergent values have the same value for the whole threadgroup.
I assumed v came from a uniform. In this example, if v is uniform, it is safe to fold, as both T0 and T1 are guaranteed to have the same value in v.
If v is neither uniform nor convergent, then you're right, it's not safe to fold.

In the example of selecting between v0 and v1: if the condition comes from a uniform (or convergent) value, it's safe to fold. If the condition comes from neither, it's not safe.

//OK to fold. Notice v is uniform; v never changes within the threadgroup.
//"token" can be different because it's not required to be convergent.
void foo( uniform sampler v, bool cond ) {
  if (cond) {
    // token1 = statepoint()
    // bar(v, token1); // v needs to be convergent
  } else {
    // token2 = statepoint()
    // bar(v, token2);  // v needs to be convergent
  }
  token_merged = statepoint()
  bar(v, token_merged); // v is guaranteed to be the same for all threads within the threadgroup.
}

//Unsafe code. Can be folded. This code is unsafe no matter what we do. v may be
//different between threads in the same threadgroup, but all threads will follow
//the same path (cond is true for all threads within the threadgroup, or false for all threads
//within the threadgroup).
void foo( sampler v, uniform bool cond ) {
  if (cond) {
    // token1 = statepoint()
    // bar(v, token1);
  } else {
    // token2 = statepoint()
    // bar(v, token2);
  }
  token_merged = statepoint()
  bar(v, token_merged); // v may be different in the same threadgroup, but there's nothing compiler can do.
}

//Unsafe to fold. Must not be folded. v and cond are neither uniform nor convergent.
void foo( sampler v, bool cond ) {
  if (cond) {
    // token1 = statepoint()
    // bar(v, token1);
  } else {
    // token2 = statepoint()
    // bar(v, token2);
  }
  token_merged = statepoint()
  bar(v, token_merged);
}

//OK to fold. Notice v0, v1, and cond are all uniform.
void foo( uniform sampler v0, uniform sampler v1, uniform bool cond ) {
  token = statepoint();
  if (cond) {
    //bar( v0, token );
  } else {
    //bar( v1, token );
  }
  v2 = phi[ v0, v1 ];
  bar( v2, token );
}

//Unsafe to fold. Must not be folded. cond is not uniform, thus threads
//within the same threadgroup may follow different paths.
void foo( uniform sampler v0, uniform sampler v1, bool cond ) {
  token = statepoint();
  if (cond) {
    //bar( v0, token );
  } else {
    //bar( v1, token );
  }
  v2 = phi[ v0, v1 ];
  bar( v2, token );
}

Of course things aren't that simple. If I write:

uniform sampler mySampler[2];
uniform bool cond;

void foo( sampler v0, sampler v1, bool cond );

void main()
{
    foo( mySampler[0], mySampler[1], cond );
}

Then the LLVM optimizer needs to see that all arguments to foo are uniform; then it's safe to fold foo (once foo is inlined). If instead I were to write:

uniform sampler mySampler[2];
uniform sampler anotherSampler;

void foo( sampler v0, sampler v1, bool cond );

void main()
{
    bool cond = anotherSampler.sample( uv = 0 ); //source of non-convergence.
    foo( mySampler[0], mySampler[1], cond );
}

Now foo cannot be folded because cond is not convergent.

If I were to write:

uniform sampler mySampler[2];
uniform sampler anotherSampler;
uniform bool someCond;

void foo( sampler v0, sampler v1, bool cond );

void main()
{
    foo( mySampler[0], mySampler[1], someCond ); // inline expansion can be folded
    bool anotherCond = anotherSampler.sample( uv = 0 ); //source of non-convergence.
    foo( mySampler[0], mySampler[1], anotherCond ); // inline expansion cannot be folded
}

then the first foo() call can be folded, the second one cannot.

It gets even funnier:

uniform sampler mySampler[2];
uniform sampler anotherSampler;

void foo( sampler v0, sampler v1, bool cond );

void main()
{
    bool cond = anotherSampler.sample( uv = 0 ); //source of non-convergence.
    int idx0 = anotherSampler.sample( uv = 1 ); //source of non-convergence.
    // mySampler[idx0] is no longer uniform nor convergent, and cond isn't convergent either. Cannot fold.
    foo( mySampler[idx0], mySampler[1], cond );
}

In other words the optimizer has to track whether the convergence property has been lost.

t-tye added a subscriber: t-tye.Feb 15 2018, 4:20 PM
arsenm resigned from this revision.Apr 5 2020, 8:04 AM
sanjoy resigned from this revision.Jan 29 2022, 5:27 PM