This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Introduce convergence control intrinsics
ClosedPublic

Authored by sameerds on Mar 28 2023, 11:58 PM.

Details

Summary

This is a reboot of the original design and implementation by
Nicolai Haehnle <nicolai.haehnle@amd.com>:
https://reviews.llvm.org/D85603

This change also obsoletes an earlier attempt at restarting the work on
convergence tokens:
https://reviews.llvm.org/D104504

Changes relative to D85603:

  1. Clean up the definition of a "convergent operation", a convergent call and convergent function.
  2. Clean up the relationship between dynamic instances, sets of threads and convergence tokens.
  3. Redistribute the formal rules into the definitions of the convergence intrinsics.
  4. Expand on the semantics of entering a function from outside LLVM, and the environment-defined outcome of the entry intrinsic.
  5. Replace the term "cycle" with "closed path". The static rules are defined in terms of closed paths, and then a relation is established with cycles.
  6. Specify that if a function contains a controlled convergent operation, then all convergent operations in that function must be controlled.
  7. Describe an optional procedure to infer tokens for uncontrolled convergent operations.
  8. Introduce controlled maximal convergence-before and controlled m-converged property as an update to the original properties in UniformityAnalysis.
  9. Additional constraint that a cycle heart can only occur in the header of a reducible cycle (natural loop).

Diff Detail

Event Timeline

sameerds created this revision.Mar 28 2023, 11:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2023, 11:58 PM
sameerds requested review of this revision.Mar 28 2023, 11:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2023, 11:58 PM
sameerds added a subscriber: Restricted Project.Apr 5 2023, 12:27 AM
jsilvanus added inline comments.Apr 14 2023, 5:44 AM
llvm/docs/ConvergentOperations.rst
575

Maybe mention that n is an integer, so

There is an integer *n* such that [..]

726–728

Maybe "does not satisfy" -> "violates"?
In the current form, there is a ambiguity in whether all properties are violated, or just a single one.
I assume a single one is intended?

830–831

I think this property should be communicated more prominently.

Per my understanding, loops in reducible control flow have unique headers, which give rise to a "natural" convergence (implicit maximal convergence?) by counting executions of the header, and considering those converged if the counter agrees.
For irreducible control flow, there are no unique headers, instead there is an ambiguity caused by the dependency on a choice of a cycle hierarchy (and their headers).

Explicit convergence control intrinsincs eliminate this ambiguity by allowing to explicitly define loop hearts. But for reducible control flow, if loop hearts are not placed at loop headers, then the notion of convergence may be different. I believe this is what these lines refer to. For example, in the example in llvm/docs/convergence-heart.png, removing the edge from D to R makes the CFG reducible, but iteration counts at R might be different from iteration counts at H, due to the shortcut from H to L.

Is that intentional, a "neutral" side-effect of the model, or a side-effect of the model that we would want to eliminate but cannot easily?
In any case, I think we should discuss this more explicitly. Also, an example somewhere suggests to construct explicit control intrinsics by putting hearts into loop headers, maybe we can mention there that this ensures that the two notions of convergence agree, because the imaginary counters do.

nhaehnle added inline comments.Apr 17 2023, 8:30 AM
llvm/docs/ConvergentOperations.rst
830–831

I'd say it's somewhere between intentional and a neutral side-effect. It certainly allows for some interesting experiments.

By the way, the whole loop intrinsic business isn't only about eliminating ambiguity for irreducible loops. Consider:

do {
  do {
    a();
  } while (conda);
  b();
} while (condb);

// vs.
do {
  a();
  if (conda)
    continue;
  b();
} while (condb);

Assuming that conda implies condb, these two loops are semantically identical from a single-threaded perspective and could easily result in identical CFGs. But the "intuitively expected" convergence behavior is very different.

Convergence control allows us to explicitly encode in the IR which of the two intuitive behaviors are expected. (And for the first version of the loop, this would result in two nested loops in the CFG that cannot be collapsed into a single one because the loop intrinsics are "in the way".)

jsilvanus added inline comments.Apr 20 2023, 4:17 AM
llvm/docs/ConvergentOperations.rst
830–831

Thanks for the background and the example, that definitely helps.

I still feel this should be stated more explicitly, possibly quite at the beginning when introducing these new concepts?

We motivate the use of convergent operations in great detail, but motivate only very briefly why we need to control it explicitly.
An example where naive loop-header-based convergence is not intended would be helpful for that.
Also, generalizing this example, the fact that explicit control intrinsics allow to change the structure of nested loops while preserving convergence semantics.

jsilvanus added inline comments.Apr 20 2023, 4:24 AM
llvm/docs/ConvergentOperations.rst
830–831

To expand on the above, it probably suffices to add a loop-based example (the one above?) to the list of examples motivating intrinsics, and mention early on that deviating from natural loop-header based convergence is possible and intended, referencing the above new example.

nhaehnle added inline comments.Apr 25 2023, 8:11 AM
llvm/docs/ConvergentOperations.rst
830–831

Ironically, there's now been some discussion that we may be able to simplify this by only allowing loop hearts in natural loop headers.

They're still required (the example above with the two loops vs. single loop with continue still stands). But yeah, perhaps that could be added to the document.

sameerds updated this revision to Diff 528817.Jun 6 2023, 5:55 AM
  • rebase
  • cycle heart is allowed only in the header of a natural loop
  • address review comments
sameerds marked 7 inline comments as done.
sameerds added inline comments.
llvm/docs/ConvergentOperations.rst
231

New section to really bring out the benefit of explicit convergence control. Something that both @jdoerfert and @jsilvanus had asked about, at different points of time.

726–728

Reworded to remove the confusion. The intention is "one or more".

830–831

The spec is now updated to allow a cycle heart only in the header of a natural loop. Thus the notion of an iteration under convergence remains unchanged from the intuitive notion.

Explicit cycle hearts would have allowed the user to specify how "all threads" (for a suitable definition of "all") converge inside an irreducible cycle. But the usefulness of this is rare enough that we can discount it for now.

sameerds updated this revision to Diff 529541.Jun 8 2023, 2:38 AM
sameerds marked 2 inline comments as done.
  • add missing checks in verifier, along with tests
arsenm added inline comments.Jun 8 2023, 1:35 PM
llvm/docs/ConvergentOperations.rst
141

Block is missing a terminator. Also should have a token use?

391

Use opaque pointers

llvm/include/llvm/IR/Intrinsics.td
2531

Needs a lot more intrinsic properties. Should really be DefaultAttrsIntrinsic +Convergent+NoMem

llvm/lib/IR/Verifier.cpp
41

Typo ConvertentOperations

llvm/test/Analysis/UniformityAnalysis/AMDGPU/join-at-loop-heart.ll
9

Why was this deleted?

llvm/test/Verifier/convergencectrl-invalid.ll
172

Avoid undef

220

Need some tests with invoke and demonstrate the exception issues

llvm/test/Verifier/convergencectrl-valid.ll
1 ↗(On Diff #529541)

These kinds of tests should go in test/Assembler and roundtrip through llvm-as/llvm-dis

Also should get a bitcode compatibility test

arsenm added inline comments.Jun 8 2023, 1:39 PM
llvm/lib/IR/Verifier.cpp
2572

Don't reconstruct each iteration?

2587

This is a pretty long and indented block, move to helper function?

2642

Don't need llvm::

arsenm added a comment.Jun 8 2023, 1:40 PM

Add mention to release notes

sameerds updated this revision to Diff 531598.Jun 14 2023, 8:35 PM
  • release notes
  • added more tests
  • default attributes for intrinsics
sameerds marked 9 inline comments as done.Jun 14 2023, 8:43 PM
sameerds added inline comments.
llvm/docs/ConvergentOperations.rst
141

Tokens are used only on convergent operations. A token doesn't need to be kept alive beyond the last convergent op that uses it.

llvm/lib/IR/Verifier.cpp
2572

Reconstruction on each iteration is not something I've thought about much. But the programmer's manual only mentions this for std::vector and not SmallVector. Moved the declaration out of the loop anyway.

llvm/test/Analysis/UniformityAnalysis/AMDGPU/join-at-loop-heart.ll
9

This test was incorrectly added with the change that introduced uniformity analysis. The current semantics disallow a heart anywhere other than a loop header, so the example in this test is now invalid.

llvm/test/Verifier/convergencectrl-invalid.ll
220

Well it turns out that EH landing pads occur in blocks where we can't have a call to the entry() or loop() intrinsics. There are some rules about the predecessor blocks, which prevent landing pads in the entry block or in a loop header. So there is nothing to test here.

Note to self: If there is no conflict, might as well remove the comments from Verifier.cpp.

FIXME: A loop intrinsic is required to be the first non-PHI only if it is a true heart (in a loop header). Verifier should not complain if it occurs in any other block.

I'm wondering if there's a better way to represent loops. At the core, the key notion for loop convergence is that all threads converge for every iteration. So in general, a while loop is something like the following:

start:
@llvm.experimental.convergence.loop()
if (!cond) goto end
body;
goto start;
end:

For most cases, making this work doesn't require llvm.experimental.convergence.loop to return a token; the mere existence of a convergent call that executes on every thread every iteration forces the necessary structure on the code. The problem is that after a "break" or "return" inside, the loop doesn't execute the llvm.experimental.convergence.loop call; to solve this, you make llvm.experimental.convergence.loop return a token, and impose a bunch of rules on the placement of the intrinsic and the usage of the token, so the control flow can be reconstructed.

But the fact that we don't execute the llvm.experimental.convergence.loop after a "break" isn't fundamental to the definition of "break", it's just an artifact of the way you're choosing to lower "break". You could instead define a hidden condition:

bool didbreak = false;
start:
if (!cond) goto end
body;
@llvm.experimental.convergence.loop()
if (didbreak) goto end
goto start;
end:

And then define "break" to be equivalent to "didbreak = true; continue;". "return" requires generating a bit more code after the loop, but works similarly.

clang already has the infrastructure necessary to make this work; it's exactly the same kind of control flow you need for C++ destructors with "break"/"return".

The advantage of this formulation is that it makes the invariants for llvm.experimental.convergence.loop a lot simpler. Actually, you don't need any special rules at all: the fact that the intrinsic is convergent dictates the result you want. The disadvantage is that the scalar control flow is a bit more complicated.

(I'm really not an expert on convergence, so I might be missing something. But I didn't see any other discussion along these lines.)

sameerds marked 3 inline comments as done.Jun 16 2023, 5:33 AM
start:
@llvm.experimental.convergence.loop()
if (!cond) goto end
body;
goto start;
end:

For most cases, making this work doesn't require llvm.experimental.convergence.loop to return a token; the mere existence of a convergent call that executes on every thread every iteration forces the necessary structure on the code. The problem is that after a "break" or "return" inside, the loop doesn't execute the llvm.experimental.convergence.loop call; to solve this, you make llvm.experimental.convergence.loop return a token, and impose a bunch of rules on the placement of the intrinsic and the usage of the token, so the control flow can be reconstructed.

The problem is not that exited threads don't execute the call to llvm.experimental.convergence.loop. We actually want to allow that, and then identify subsets of threads that executed the loop the same number of times and then broke out. The key part is this missing convergent op in the example:

start:
  %inner = @llvm.experimental.convergence.loop() [token %outer]
  if (!cond) {
    convergent_op() [%inner]
    goto end
  }
  body;
  goto start;
end:

The tokens allow us to identify the subsets of threads that will execute convergent_op() "together", on their way out of the loop along the break statement. The token %outer define the set S of threads that entered the loop together, and the token %inner now identifies subsets of S that exited "together". The explicit use of %inner is specifying which threads should communicate at convergent_op(). If the call had used %outer as an argument, it would have meant that the communication at convergent_op() is "outside" the loop, and all threads that entered the loop should execute it together.

The important fact is that convergent_op() is itself outside the CFG loop, although lexically it looks like it is inside the loop. This distinction is even greater when the we replace the "start ... goto start" with a proper high-level loop statement.

I think what I'm describing didn't really get across... probably the convergence-related terminology I'm using isn't quite right.

The result of the change I'm suggesting is that for a loop like while (true) { if (g()) { convergent_op(); break; } }, convergent_op() actually stays inside the CFG loop. If all operations lexically inside the loop are also inside the CFG loop, we don't need tokens to figure out which operations are lexically inside the loop.

The result of the change I'm suggesting is that for a loop like while (true) { if (g()) { convergent_op(); break; } }, convergent_op() actually stays inside the CFG loop. If all operations lexically inside the loop are also inside the CFG loop, we don't need tokens to figure out which operations are lexically inside the loop.

Yeah, this is how we have always looked at convergence. The new tokens actually try to move away from that picture. A number of different angles to view this from:

  1. It's not useful to always think in terms of "all threads". The tokens returned by the new intrinsics help further specify "which set of threads" converges at a given operation.
  2. The implicit convergence derived from control dependences is kinda sufficient to work with "all threads". It is an approximation that allows a single-thread view to do safe things around convergent operations. But it is not sufficient to clearly specify what it is the actual relation between the CFG and the convergence of multiple threads.
  3. For example, in the same loop or CFG region, etc, one convergent op might be interested in a local convergence captured by the anchor intrinsic, while another might be interested in the threads captured by the loop intrinsic. Now that loop intrinsic might itself have a token argument returned by an anchor intrinsic outside the loop. The subset relationship of all these threads is captured by the constraint on convergence regions.
  4. Until code generation, it is sufficient to just record the relationship between sets of convergent threads. The usual transforms only have to follow the simple static rules about loop hearts and convergence regions to ensure correctness.
  5. The transformation that you are thinking of is actually performed by the backend, where it will "pull" the convergent ops on the exit edges into the loop, and introduce proper mask manipulation to make sure that the right set of threads in a wave/warp executes it.
  6. Until then, we do not actually want to pull that convergent op into the loop body. That will produce unnecessary constraints on transforms working with the loop. The convergent op is most definitely on the exit of the loop. And it's useful to keep it there.
  7. The next step in these patches is to introduce an analysis (D85608) that captures "extended cycles" like the one we are discussing here. This will be used by other analyses and transforms that are "convergence aware" to reason about these extended cycles. This does not require the frontend or any other entity to modify the cycle structure, and no new rules are imposed on the LLVM IR. One example is an enhancement to UniformityAnalysis, where it will recognize some cases of "temporal divergence" that are actually uniform because they are on the exit path of this example.

Until then, we do not actually want to pull that convergent op into the loop body. That will produce unnecessary constraints on transforms working with the loop. The convergent op is most definitely on the exit of the loop. And it's useful to keep it there.

This is not obvious to me. For example, pushing the convergent op out of the loop makes fully unrolling a lot harder: you need to compute a region containing all the uses of the loop token outside the loop, sink any uses of other convergence tokens out of that region, pull all the relevant basic blocks into the loop, then unroll. (I guess in general, full unroll is actually impossible? As far as I can tell, none of the rules guarantee that "sink any uses of other convergence tokens out of that region" is a legal transform. Or am I missing some rule that actually ensures that?)

Or consider loop strength reduction: my first impression is that you want LSR to treat the operands of convergent operations that use a loop's token as if they're inside the loop, because the computation of those operands ends up inside the loop in the final assembly.

I guess pushing the blocks out of the loop makes it simpler to sink non-convergent operations out of the loop?

As far as I can tell, none of the rules guarantee that "sink any uses of other convergence tokens out of that region" is a legal transform. Or am I missing some rule that actually ensures that?

Looking again, I guess this is the "convergence region" rule. It doesn't look like the verifier enforces this rule at the moment, but I assume that's planned?

Spent a bit more time thinking about this... the whole "explicit convergence" thing is starting to make more sense. At a conceptual level, I can see why the notion of barriers is sort of incompatible with what you want to do with tokens. And if you don't have a barrier, then there's nothing actually keeping anything inside the loop, so the notion of an extended loop region is the natural result.

That said, I still think it would be very inconvenient to adapt certain transforms, like unrolling and LSR, to actually handle extended loop regions effectively. It might make sense to have a utility to transform an extended loop into a plain loop.

nhaehnle accepted this revision.Jun 19 2023, 9:29 AM

I think this is in pretty good shape. You may want to give it a bit more time in case more discussion shows up, but it's good for me.

This revision is now accepted and ready to land.Jun 19 2023, 9:29 AM

Spent a bit more time thinking about this... the whole "explicit convergence" thing is starting to make more sense. At a conceptual level, I can see why the notion of barriers is sort of incompatible with what you want to do with tokens. And if you don't have a barrier, then there's nothing actually keeping anything inside the loop, so the notion of an extended loop region is the natural result.

That said, I still think it would be very inconvenient to adapt certain transforms, like unrolling and LSR, to actually handle extended loop regions effectively. It might make sense to have a utility to transform an extended loop into a plain loop.

Thanks Eli, for all the attention you are giving here! It's really important to see that people are understanding the spec enough to look beyond it at the implications!

About unrolling, please do see D85605 for an initial attempt at fixing those passes. It only unrolls in the trivial cases. If a token defined inside a loop is used outside, then there is no attempt at reconstructing the extended region. Instead, we rely on the super-conservative existing behaviour that unrolling is disabled if a token type is used outside the block where it is defined. Clearly, there is a long way to go from there.

Notably, the stack of reviews starting D85603 is the original attempt at introducing convergence tokens. The new stack of reviews attempt is mostly a rebase with some simplifications as we discovered them.

Good to know that the rule about nesting convergence regions works out. It is indeed enforced by the verifier in the lambda "checkBundle", where it tracks a stack of "live tokens" and ensures that these intervals are well-nested.

sameerds updated this revision to Diff 532784.Jun 19 2023, 10:00 PM
  • rebase
  • removed the verifier comment about conflict with EH landing pads
sameerds edited the summary of this revision. (Show Details)Jun 19 2023, 10:05 PM

I think this is in pretty good shape. You may want to give it a bit more time in case more discussion shows up, but it's good for me.

Thanks. I'll wait until Monday for further comments.

sameerds marked an inline comment as done.Jun 19 2023, 10:19 PM
sameerds added inline comments.
llvm/test/Verifier/convergencectrl-invalid.ll
220

Simplified the comments in the verifier, but did not relax the check for non-heart loop calls. For now, the verifier allows a call to loop only at the start of a block. That's not necessary if the call is not a loop heart, but relaxing this check is not very important yet. We can revisit if we have a real use-case where a non-heart loop call needs to be in the middle of a block.

tra resigned from this revision.Jun 20 2023, 10:00 AM

My understanding of nuances here is not sufficient for a meaningful review. @nhaehnle's LGTM works for me.

sameerds updated this revision to Diff 539394.Jul 11 2023, 11:11 PM
sameerds edited the summary of this revision. (Show Details)
  • rebase
  • preparing to commit
This revision was landed with ongoing or failed builds.Jul 12 2023, 12:02 AM
This revision was automatically updated to reflect the committed changes.