This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Redefine `convergent` in terms of dynamic instances
AbandonedPublic

Authored by nhaehnle on Oct 15 2019, 8:48 AM.

Details

Reviewers
jdoerfert
Summary

GPU-oriented programming languages have some operations with constraints
that cannot currently be expressed properly in LLVM IR. For example:

uvec4 result;
if (cc) {
  result = ballot(true);
} else {
  result = ballot(true);
}

Even though both sides of the branch are identical, it is incorrect to
replace the if-statement with a single ballot call. This is because
ballot communicates with other threads, and the set of those threads
depends on where ballot is with respect to control flow.
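To see why, consider a small Python model of ballot() (the thread model and helper here are assumptions for illustration only; conceptually, ballot returns a mask of the threads that reached the call together):

```python
# Toy model: a "subgroup" of 4 threads; ballot() returns a bitmask of
# the threads that arrive at that exact point in control flow together.
def ballot(active_threads):
    mask = 0
    for t in active_threads:
        mask |= 1 << t
    return mask

threads = [0, 1, 2, 3]
cc = {0: True, 1: False, 2: True, 3: False}  # a divergent condition

# Original program: each side of the branch contains its own ballot
# call, so each thread only sees the threads on the same side.
then_side = ballot([t for t in threads if cc[t]])       # threads 0 and 2
else_side = ballot([t for t in threads if not cc[t]])   # threads 1 and 3

# Incorrectly merged program: a single hoisted ballot call sees all
# threads, so the transform changes the observable result.
merged = ballot(threads)

assert then_side == 0b0101
assert else_side == 0b1010
assert merged == 0b1111
```

The assertions show that hoisting the call out of the branch changes the value every thread observes, even though both call sites are textually identical.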

In the past, we have tried to fix this up somewhat by putting the
convergent attribute on functions. However, this approach has some
weaknesses. First, the restrictions imposed by convergent are not
actually strong enough for some cases such as the example above. Second,
the definition of convergent relies on the notion of
control-dependencies, which have action at a distance that makes it
difficult to satisfy. For example, the jump threading pass currently
does not honor the convergent attribute correctly in cases
such as:

bool flag = false;
if (cc1) {
  ...
  if (cc2)
    flag = true;
}
if (flag) {
  result = ballot(true);
}

Since the convergent ballot operation is at a distance from the part
of the code inspected by the jump threading pass, the pass will decide
to transform the code in an incorrect way.

This patch proposes to fix these and related problems by putting the
convergent attribute and the underlying notions of divergence and
reconvergence on a solid formal basis. At the same time, the impact
on generic transforms is small by design: a new set of intrinsics is
introduced that can be used to control reconvergence without being
prone to action at a distance. Frontends for GPU-oriented programming
languages are expected to insert these intrinsics, so that passes such
as jump threading will be "correct by default".

In the jump threading example above, a frontend would be expected to
insert intrinsics as follows:

bool flag = false;
token tok = @llvm.convergence.anchor();
if (cc1) {
  ...
  if (cc2)
    flag = true;
}
@llvm.convergence.join(tok);
if (flag) {
  result = ballot(true);
}

The convergence intrinsics indicate that threads are expected to
reconverge before the second if-statement, which affects the behavior
of the ballot call. The join intrinsic call guards against incorrect
jump threading.

The intention of this RFC is to gauge the interest of the LLVM community
and whether this direction can be accepted going forward. Frontend and
backend parts are required for a complete solution, though the frontend
parts are language-specific and therefore not part of LLVM itself.

Additional Notes:

  • Function inlining really needs to add convergence intrinsics when the caller is convergent and the callee contains control flow

Event Timeline

nhaehnle created this revision.Oct 15 2019, 8:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 15 2019, 8:48 AM
nhaehnle added subscribers: arsenm, alex-t, tpr and 7 others.

Thanks for posting this. Please send a plain email explaining the proposal to llvm-dev - RFCs regarding new intrinsics and other IR extensions should go there to reach a wide audience.

Thanks for posting this. Please send a plain email explaining the proposal to llvm-dev - RFCs regarding new intrinsics and other IR extensions should go there to reach a wide audience.

RFC is here: http://lists.llvm.org/pipermail/llvm-dev/2019-October/135929.html

chapeiro removed a subscriber: chapeiro.
chapeiro added a subscriber: chapeiro.

Are you planning on adding codegen support in a separate patch?

None of these tests use llvm.convergence.point.

Can you also add tests for unroll with a convergent operation in the presence of the new intrinsics?

llvm/docs/DynamicInstances.rst
230

I don't fully understand what the 'point' of this one is. What is the point of ensuring non-reconvergence? Which real example does this correspond to?

Nicola added a subscriber: Nicola.Oct 15 2019, 10:53 AM
piotr added a subscriber: piotr.Oct 16 2019, 6:50 AM
nhaehnle marked an inline comment as done.Oct 16 2019, 8:01 AM

Are you planning on adding codegen support in a separate patch?

Yes, though it'll be a while before it's ready.

llvm/docs/DynamicInstances.rst
230

It allows us to mark regions of code that are semantically part of a loop for the purposes of convergence even when they're not part of the loop as far as loop analysis is concerned. This applies to high-level code such as the following:

int divergent_key = ...;
int v = ...;
int sum;
for (;;) {
  tok = @llvm.convergence.anchor()
  int uniform_key = readfirstlane(divergent_key);
  if (uniform_key == divergent_key) {
    sum = subgroup_reduce_add(v);
    @llvm.convergence.point(tok)
    break;
  }
}

The point indicates that no reconvergence should happen for the execution of the reduce operation (or in a sense, that the reduce operation should not be moved out of the loop).
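To make the partitioned execution concrete, here is a small Python model of the loop above (the 4-thread subgroup and the readfirstlane/subgroup_reduce_add helpers are illustrative assumptions, not real implementations):

```python
# Toy model of the key-partitioned ("waterfall") loop: in each iteration,
# the first active thread's key is broadcast, the matching threads reduce
# together and exit, and the rest loop again.
def readfirstlane(active, key):
    return key[active[0]]

def subgroup_reduce_add(active, v):
    return sum(v[t] for t in active)

key = {0: 7, 1: 3, 2: 7, 3: 3}          # divergent keys
v   = {0: 1, 1: 10, 2: 100, 3: 1000}    # per-thread values
sums = {}

active = [0, 1, 2, 3]
while active:
    uniform_key = readfirstlane(active, key)
    group = [t for t in active if key[t] == uniform_key]
    # Only the threads whose key matches execute the reduce together;
    # the convergence point pins the reduce inside the loop, so each
    # key-partition reduces separately.
    total = subgroup_reduce_add(group, v)
    for t in group:
        sums[t] = total
    active = [t for t in active if t not in group]

assert sums == {0: 101, 2: 101, 1: 1010, 3: 1010}
```

Threads 0 and 2 (key 7) reduce to 101 in the first iteration; threads 1 and 3 (key 3) reduce to 1010 in the second. A reduce hoisted out of the loop would instead combine all four values, which is exactly what the point forbids.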

Points are in some sense redundant, as the code above could be equivalently rewritten as:

int divergent_key = ...;
int v = ...;
int sum;
for (;;) {
  tok = @llvm.convergence.anchor()
  int uniform_key = readfirstlane(divergent_key);
  if (uniform_key == divergent_key) {
    sum = subgroup_reduce_add(v);
  }
  @llvm.convergence.join(tok)
  if (uniform_key == divergent_key)
    break;
}

That reformulation loses some information about control flow, however; in particular, it leads to more conservative live ranges.

foad added a subscriber: foad.Oct 21 2019, 2:18 AM

What is the expectation for lowering a loop like the one you mentioned above?

int divergent_key = ...;
int v = ...;
int sum;

for (;;) {
  tok = @llvm.convergence.anchor()
  int uniform_key = readfirstlane(divergent_key);
  if (uniform_key == divergent_key) {
    sum = subgroup_reduce_add(v);
    @llvm.convergence.point(tok)
    break;
  }
}

In particular, the usual expectation when lowering LLVM IR is that the block containing the "break" is an exit block and, as such, is typically moved outside of the loop (to where the threads have reconverged). Is the expectation that LLVM.CONVERGENCE.POINT is lowered in the backends to a pseudo instruction similar to DBG_VALUE (and then ignored by regalloc and the like) until the final control-flow block ordering is decided, so that the "unconverged" part of the exit block ends up being executed inside the loop instead of outside?

llvm/docs/DynamicInstances.rst
62

While I like the idea of formalizing the whole "Dynamic Instances" concept, I wonder whether it could also be explained in a more intuitive way, so that readers approach the formal definitions already with an intuition of what they mean.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1474

Is this the only change in SimplifyCFG we would have to do?

What about FoldBranchToCommonDest()? It seems to be hoisting instructions into the predecessors, but it only checks for speculatability (through isSafeToSpeculativelyExecute()) and seems to ignore the convergent attribute.

In general, how thoroughly has the new behavior been vetted across LLVM passes with respect to hoisting or changes in the thread execution mask?

nhaehnle marked 2 inline comments as done.Oct 29 2019, 3:46 AM

What is the expectation for lowering a loop like the one you mentioned above?

int divergent_key = ...;
int v = ...;
int sum;

for (;;) {
  tok = @llvm.convergence.anchor()
  int uniform_key = readfirstlane(divergent_key);
  if (uniform_key == divergent_key) {
    sum = subgroup_reduce_add(v);
    @llvm.convergence.point(tok)
    break;
  }
}

In particular, the usual expectation when lowering LLVM IR is that the block containing the "break" is an exit block and, as such, is typically moved outside of the loop (to where the threads have reconverged). Is the expectation that LLVM.CONVERGENCE.POINT is lowered in the backends to a pseudo instruction similar to DBG_VALUE (and then ignored by regalloc and the like) until the final control-flow block ordering is decided, so that the "unconverged" part of the exit block ends up being executed inside the loop instead of outside?

Something along those lines, yes. The details would obviously depend on how the backend handles divergence and reconvergence. For example, in AMDGPU we fairly early on convert into a form where divergent values are computed using vector instructions, i.e. mostly explicitly SIMD instructions that use an EXEC physical register. So in our case, the intrinsic would affect this conversion of the code into SIMD form.
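The SIMD form mentioned here can be sketched with a toy exec-mask model (the 4-lane wave and the specific mask values are assumptions for illustration; real AMDGPU lowering operates on the EXEC register with wave-wide masks and compiler-generated save/restore code):

```python
# Toy model of exec-mask lowering for "if (cond) { A } else { B }" on a
# 4-lane wave. Bit i of a mask means lane i is active; vector
# instructions only take effect in lanes whose bit is set.
ALL_LANES = 0b1111
cond      = 0b0101   # lanes 0 and 2 take the "then" side (illustrative)

saved = ALL_LANES                       # save the mask on entering the branch

exec_mask = saved & cond                # "then" side: lanes 0 and 2 active
then_lanes = exec_mask                  # ... vector instructions for A run here

exec_mask = saved & ~cond & ALL_LANES   # "else" side: lanes 1 and 3 active
else_lanes = exec_mask                  # ... vector instructions for B run here

exec_mask = saved                       # join point: restoring the mask is
                                        # what reconvergence means in this form

assert then_lanes == 0b0101
assert else_lanes == 0b1010
assert exec_mask == ALL_LANES
```

In this form, the convergence intrinsics constrain where the compiler may place the mask save/restore, i.e. where reconvergence happens.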

llvm/docs/DynamicInstances.rst
62

I don't think we can get away without a formal description, but I'm not against having some more illustrative examples. Do you have anything in mind?

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1474

I've fixed the places that I was aware of based on bugs we've encountered over the years. I wouldn't be surprised if there were more places like you say...

simoll added inline comments.Oct 29 2019, 9:11 AM
llvm/docs/DynamicInstances.rst
230

Couldn't you align the conceptual convergence point with the loop structure by adding a @llvm.always.true intrinsic like so:

int divergent_key = ...;
int v = ...;
int sum;
for (;;) {
  int uniform_key = readfirstlane(divergent_key);
  if (uniform_key == divergent_key) {
    sum = subgroup_reduce_add(v);
    %we.know.its.true = call i1 @llvm.always.true()
    br i1 %we.know.its.true, label %loopExitBlock, label %nextBlockInTheLoop
  }
}

That way you do not need the intrinsic, and the DA/SDA already assume that threads only synchronize at (LoopInfo) loop exits.

Thanks for the answers! It was about time somebody tackled this problem in LLVM :)

A couple extra questions I have:

  1. Changing the convergent attribute from being sink prevention only to both sink and hoist prevention (which I believe should impact CSE as well, even if I didn't see any change in that regard) would have some performance impact. Considering there's another proposal from Matt to make convergent the default, how do we see this impacting performance? Should we have two (or more?) attributes instead? For example, quad-scoped texture operations don't really care about being hoisted or CSE'd, but they do care about being sunk.
  2. llvm.convergence.join is marked IntrNoMem. It's not clear to me how it is going to survive instcombine, considering that it does not produce any value, so nothing is in place to keep it alive. What's the mechanism that keeps it around?
  3. Is there a plan for how we are going to identify all the optimizations that need to be updated, and in particular, how are we going to make sure new optimizations (probably written with single-threaded execution in mind) do not break our invariants? (Basically, how do we make everybody aware this exists and get them to think about it?) In particular, there could be classes of optimizations that operate purely on the CFG and would now have to look "inside the blocks" before transforming, to make sure there is no convergent instruction or call in the blocks they are considering.
llvm/docs/DynamicInstances.rst
62

Yep, my suggestion was about adding a preamble that explains in more "earthly" terms what the concept is before jumping into the formal definition; certainly not replacing the formal definition. I really like the fact that you added it in there!
Maybe commonly known concepts like warps, threads in a warp, and divergence could be used to that effect. If I understood correctly, a "dynamic instance" is the execution of an instruction by a certain group of threads, and the same instruction can be executed by different groups of threads at different points in time. (Please correct me if I misunderstood.) If it's something like that, it could be placed, more intuitively, before the formal definition.

nhaehnle marked 2 inline comments as done.Oct 30 2019, 6:53 AM

Thank you for your detailed comments!

  1. Changing the convergent attribute from being sink prevention only to both sink and hoist prevention (which I believe should impact CSE as well, even if I didn't see any change in that regard) would have some performance impact. Considering there's another proposal from Matt to make convergent the default, how do we see this impacting performance? Should we have two (or more?) attributes instead? For example, quad-scoped texture operations don't really care about being hoisted or CSE'd, but they do care about being sunk.

That's a good point. Something like allowconvergence and allowdivergence?

  2. llvm.convergence.join is marked IntrNoMem. It's not clear to me how it is going to survive instcombine, considering that it does not produce any value, so nothing is in place to keep it alive. What's the mechanism that keeps it around?

You're right. The immediate answer to your question is that I missed marking the intrinsics as nounwind. Once I do that, they are indeed removed like you say. Any ideas on that?

  3. Is there a plan for how we are going to identify all the optimizations that need to be updated, and in particular, how are we going to make sure new optimizations (probably written with single-threaded execution in mind) do not break our invariants? (Basically, how do we make everybody aware this exists and get them to think about it?) In particular, there could be classes of optimizations that operate purely on the CFG and would now have to look "inside the blocks" before transforming, to make sure there is no convergent instruction or call in the blocks they are considering.

I don't have a good answer for that, unfortunately. In general, education about convergent is the only answer that comes to mind for the first half. For the second half, I did go to some lengths to think about whether transforms need to worry about "spooky action at a distance", i.e. whether they need to scan code for convergent operations that they would not otherwise have to scan, and I believe this is largely not a problem (though I cannot prove it). This is summarized in the "Correctness" section of the new document.

llvm/docs/DynamicInstances.rst
230

I would like to avoid having to insert fake edges in the CFG: it might trip up generic transforms.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1474

Giving this some more thought, it does raise an interesting question: is there a legitimate use for a function that is both speculatable and convergent?

It seems initially strange, given that speculation changes the set of threads calling the function, which goes counter to the whole point of convergent. But then, as you mentioned in your later comment and I've mentioned in D69498, we'll want to relax convergent. Something like a subgroup shuffle could perhaps be speculatable convergent allowconvergence, indicating that it can be hoisted but not sunk.

I'm going to add some comments to that effect.

nhaehnle updated this revision to Diff 227086.Oct 30 2019, 6:54 AM
  • add some more expository text and examples
  • handle the case of speculatable+convergent
kpet added a subscriber: kpet.May 21 2020, 9:13 AM

What is the status of this change?

I'm introducing extended subgroup functions in OpenCL: https://reviews.llvm.org/D79781#inline-735663 and currently marking them with the convergent attribute doesn't prevent invalid optimizations.

ychen added a subscriber: ychen.May 28 2020, 9:52 AM

What is the status of this change?

I still want to pursue a proper modeling of convergent, one that would express that the invalid optimizations you're likely to see are forbidden. However, the model described here has some issues, and I hope to introduce a slight variation that addresses those issues.

As a matter of pragmatism, I wonder whether we shouldn't just already disable those problematic optimizations today...

However, the model described here has some issues, and I hope to introduce a slight variation that addresses those issues.

Can you expand on the issues that are present here?

What is the status of this change?

I still want to pursue a proper modeling of convergent, one that would express that the invalid optimizations you're likely to see are forbidden. However, the model described here has some issues, and I hope to introduce a slight variation that addresses those issues.

As a matter of pragmatism, I wonder whether we shouldn't just already disable those problematic optimizations today...

Do you mean disable them for the convergent functions?

Can you expand on the issues that are present here?

The intrinsics are fine; the formal modeling using dynamic instances isn't, because it prevents desirable optimizations. For example:

%tok1 = @llvm.convergence.anchor()
if (divergent_condition) {
   ...
}
v1 = ballot(..., %tok1)

%tok2 = @llvm.convergence.anchor()
v2 = ballot(..., %tok2)

The ballot for v1 enforces reconvergence of the dynamic instances after the divergent branch, which means the ballot for v2 also executes fully converged (relative to whatever dynamic instances we had when entering this snippet).

If v1 is unused, we would like to DCE the ballot. However, after doing this there is no longer anything that formally guarantees reconvergence before the second anchor, which means the ballot for v2 is now allowed to execute in a way that's partitioned according to the divergent condition.
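The hazard can be sketched with a small Python model (the ballot() helper over explicit thread sets and the 4-thread subgroup are illustrative assumptions, not part of the proposal):

```python
# Toy model of the DCE hazard: ballot() returns a bitmask of the
# threads that execute the call together.
def ballot(active):
    mask = 0
    for t in active:
        mask |= 1 << t
    return mask

threads = [0, 1, 2, 3]
divergent = {0: True, 1: False, 2: True, 3: False}  # divergent_condition

# With the v1 ballot present, its use of %tok1 forces reconvergence
# after the divergent branch, so both ballots see all four threads.
v1 = ballot(threads)
v2_converged = ballot(threads)

# If v1 is dead-code-eliminated, nothing formally ties the second anchor
# to the reconverged thread set anymore, so an implementation may legally
# execute the v2 ballot with threads still partitioned by the condition.
v2_partitioned = ballot([t for t in threads if divergent[t]])

assert v2_converged == 0b1111
assert v2_partitioned == 0b0101  # a different, yet formally allowed, answer
```

The DCE is desirable, but under the dynamic-instances model it changes the set of legal results for v2, which is exactly the issue described above.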

As a matter of pragmatism, I wonder whether we shouldn't just already disable those problematic optimizations today...

Do you mean disable them for the convergent functions?

Yes.

Now that I'm actually editing this text, I realize that this is an older version than I thought. I've given up on "join" and "point" as being too noisy in the IR, in favor of something that more explicitly tracks loop structure, which is the real point of concern, at least for what we care about.

nhaehnle abandoned this revision.Aug 9 2020, 7:47 AM

I am abandoning this change in favor of D85603, which is the next and hopefully final iteration on this.

Herald added a project: Restricted Project. · View Herald TranscriptDec 30 2022, 1:10 AM