This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstCombine] convert rotate with guard branch into funnel shift (PR34924)
ClosedPublic

Authored by spatel on Dec 12 2018, 8:47 AM.

Details

Summary

Now that we have funnel shift intrinsics, it should be safe to convert this form of rotate to a funnel shift. In the worst case (a target that doesn't have rotate instructions), we will expand this into a branchless sequence of ALU ops (neg/and/and/lshr/shl/or) in the backend, so it's still very likely to be a perf improvement over the original code.
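For illustration, the worst-case branchless expansion can be sketched in C (a hypothetical helper; names are illustrative, and the exact op order is the backend's choice):

```c
#include <stdint.h>

/* Branch-free 32-bit rotate-left: the neg/and/and/shl/lshr/or sequence a
   target without a rotate instruction would end up with. Masking both shift
   amounts keeps every shift in range, so n == 0 needs no special case. */
uint32_t rotl32_expanded(uint32_t x, uint32_t n) {
    uint32_t lo = n & 31;          /* and: mask the left-shift amount */
    uint32_t hi = (0u - n) & 31;   /* neg + and: complementary amount */
    return (x << lo) | (x >> hi);  /* shl, lshr, or */
}
```

Note that at n == 0 both masked amounts are 0, so the result is x | x == x, which matches the rotate semantics without a branch or select.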

The motivating source code pattern for this is shown in:
https://bugs.llvm.org/show_bug.cgi?id=34924
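For reference, the guarded-rotate idiom in question looks roughly like this in C (a paraphrase of the pattern, not the exact code from the bug report):

```c
#include <stdint.h>

/* Guarded rotate-left: the branch on n == 0 avoids the undefined 32-bit
   shift in "x >> (32 - n)", which is what makes the pattern safe source
   code but leaves a branch in the IR. */
uint32_t rotl32_guarded(uint32_t x, uint32_t n) {
    if (n == 0)
        return x;
    return (x << n) | (x >> (32 - n));
}
```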

Background:
I looked at several different options before deciding where to try this - instcombine, simplifycfg, CGP - because it doesn't fit cleanly anywhere AFAIK.

The backend (CGP, SDAG, GlobalISel?) is too late for what we're trying to accomplish. We want to have the IR converted before we reach things like vectorization, because the simplified code can make a loop much simpler to transform.

Technically, this could be included in instcombine, but it's a large pattern match that includes control-flow, so it just felt wrong to stuff into there (although I have a draft of that patch). Similarly, this could be part of simplifycfg, but all of this pattern matching is a stretch.

So we're left with our relatively new dumping ground for homeless transforms: aggressive-instcombine. This only runs at -O3, but that seems like a reasonable limitation given that source code has many options to avoid this pattern (including the recently added clang intrinsics for rotates).

I'm including a PhaseOrdering test because we require the teamwork of 3 passes (aggressive-instcombine, instcombine, simplifycfg) to get this into the minimal IR form that we want.

And that showed a surprise - the new pass manager fails to reduce this. That's because the new PM includes this:

// Speculative execution if the target has divergent branches; otherwise nop.
FPM.addPass(SpeculativeExecutionPass());

...in the default simplification pipeline. That pass hoists all of the instructions into the entry block, which allows simplifycfg to turn this sequence into straight-line code with a select. That then allows instcombine to recognize this sequence as a rotate and remove the select. But instcombine does not yet canonicalize the 6-op ALU sequence to funnel shift. I'm planning to add that, but this really looks like a bug in the new pass manager, because hoisting a sequence of ops isn't supposed to happen for all targets - from SpeculativeExecution.h:

// This pass hoists instructions to enable speculative execution on
// targets where branches are expensive. This is aimed at GPUs.

Diff Detail

Event Timeline

spatel created this revision. Dec 12 2018, 8:47 AM
nikic added inline comments. Dec 12 2018, 10:43 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

Rather than matching the rotate pattern here, would it be possible to leave the conversion to fsh to InstCombine (as you say, this is planned anyway), and only do the removal of the zero check here? Or is general canonicalization in InstCombine still blocked on some optimization issues?

110

Do we possibly need to check that the rotate only has a single use, to avoid both converting the phi into fsh and leaving behind the original rotate?

124

Is icmp ne %x, 0 already canonicalized to icmp eq %x, 0 beforehand?

spatel marked 3 inline comments as done. Dec 12 2018, 11:57 AM
spatel added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

There's no clear limit on the power of aggressive-instcombine, but I'm assuming that we should follow the same rules as regular instcombine. I.e., you should be able to transfer anything in here to instcombine if you don't care about compile-time.

So we defer CFG changes to other passes (simplifycfg in particular) and just focus on logical transforms. This one blurs the lines of course, but I don't think we should be trying to alter the branch rather than the phi here.

The only remaining optimization issue that I'm aware of for funnel shift canonicalization is addressed by D55563.

110

Sure, I can add that check, although strictly speaking we don't need it: we can never create more instructions than we started with here (we're only replacing the phi).

124

Yes, instcombine prefers 'eq' to 'ne', so we're relying on that. It's possible that we don't get that transform though - fresh example of that:
https://bugs.llvm.org/show_bug.cgi?id=39968
...so like the 'sle' negative test that's already included here, we are not catching every possible pattern in this patch. But until there's evidence that we need it, I don't think it's worth chasing.

nikic added inline comments. Dec 12 2018, 12:15 PM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

I'm not suggesting to alter the branches here, but rather to split this transform into a two step process. First InstCombine converts the (x << y) | (x >> (bw - y)) into fshl(x, x, y) (which is legal independently of the null check for rotates, but not general funnel shifts) and then here we only handle the y == 0 ? x : fshl(x, x, y) pattern, rather than the shift-based variant.

However, after reading https://bugs.llvm.org/show_bug.cgi?id=34924, maybe this would not even be necessary? In one of the comments at the end you mention that if the rotate is represented as fsh, then simplifycfg already converts this pattern into a select. Together with the fsh+select simplification you already added in instsimplify, wouldn't that mean that a combination of instcombine+simplifycfg+instsimplify will already take care of the whole pattern (if we canonicalize the shl/lshr/or sequence to fsh in instcombine)?

So the process would go like this:

define i32 @rotl(i32 %a, i32 %b) {
entry:
  %cmp = icmp eq i32 %b, 0
  br i1 %cmp, label %end, label %rotbb

rotbb:
  %sub = sub i32 32, %b
  %shr = lshr i32 %a, %sub
  %shl = shl i32 %a, %b
  %or = or i32 %shr, %shl
  br label %end

end:
  %cond = phi i32 [ %or, %rotbb ], [ %a, %entry ]
  ret i32 %cond
}
; -instcombine
define i32 @rotl(i32 %a, i32 %b) {
entry:
  %cmp = icmp eq i32 %b, 0
  br i1 %cmp, label %end, label %rotbb

rotbb:
  %or = call i32 @llvm.fshl.i32(i32 %a, i32 %a, i32 %b)
  br label %end

end:
  %cond = phi i32 [ %or, %rotbb ], [ %a, %entry ]
  ret i32 %cond
}
; -simplifycfg
define i32 @rotl(i32 %a, i32 %b) {
entry:
  %cmp = icmp eq i32 %b, 0
  %or = call i32 @llvm.fshl.i32(i32 %a, i32 %a, i32 %b)
  %cond = select i1 %cmp, i32 %a, i32 %or
  ret i32 %cond
}
; -instsimplify
define i32 @rotl(i32 %a, i32 %b) {
entry:
  %or = call i32 @llvm.fshl.i32(i32 %a, i32 %a, i32 %b)
  ret i32 %or
}
fabiang added inline comments. Dec 12 2018, 12:36 PM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

I agree that it feels a bit redundant to have the pattern replicated twice, once for branches here and once for selects in instsimplify (in what was D54552).

If the contents of rotbb get reliably converted into funnel shifts by instcombine, and simplifycfg turns it into a select, the existing code seems sufficient.

A counterargument may be that the more passes need to collaborate to handle this pattern as intended, the bigger the chances are of something going wrong in the middle of the process.

spatel marked an inline comment as done. Dec 12 2018, 1:27 PM
spatel added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

Ok - this wasn't clear in the bug comments, so let me try to explain it better here. We can't turn this alone:

%sub = sub i32 32, %b
%shr = lshr i32 %a, %sub
%shl = shl i32 %a, %b
%or = or i32 %shr, %shl

...into fshl/fshr. The reason is that -- for a target that does not have a rotate instruction -- this will get expanded into a sequence with more ops and almost certainly worse perf. And there's no way for the backend to reverse that, so that's forbidden AFAIK. If the code already has some attempt to guard against UB (by masking/selecting/branching), then we can turn it into funnel shift because we can (almost) guarantee that our worst-case expansion is still equal or better than the original code.
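To illustrate the masking flavor of the guard mentioned above, here is a C sketch (illustrative name) where both shift amounts are masked, so the y == 0 case is fully defined and the sequence is safe to treat as a rotate:

```c
#include <stdint.h>

/* Masked 32-bit rotate-left. The unguarded form "(x << n) | (x >> (32 - n))"
   is undefined at n == 0 (a 32-bit shift); masking both amounts makes n == 0
   well defined, which is the property that licenses the funnel-shift rewrite. */
uint32_t rotl32_masked(uint32_t x, uint32_t n) {
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}
```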

spatel updated this revision to Diff 177919. Dec 12 2018, 2:18 PM

Patch updated:

  1. Add hasOneUse() checks and test, so we're not converting to funnel shift if there's potential for perf regression.
  2. I went ahead and committed the PhaseOrdering test since it shows a bug independent of this patch (rL348980).
nikic added inline comments. Dec 12 2018, 2:30 PM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

Okay, that makes sense. For targets with rotate it's a trivial win, but for those without we can only perform the transform here because we're trading off a branch for two extra ands.

In that case, I'm wondering if this transform should be restricted to power-of-two bitwidths. Otherwise this might end up introducing urems rather than simple masking (and looking at the current DAG code, we'd actually even add a select on zero, even though that doesn't seem necessary for rotates). Not that it really matters in the end, as I'm having a hard time imagining how one would end up with a non-power-of-two rotate.
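As a C sketch of that trade-off (function names are illustrative; C has no non-power-of-two integer type, so the i24 case is modeled with a plain remainder):

```c
#include <stdint.h>

/* Reducing a shift amount modulo the bit width: for a power-of-two width
   it's a single AND with a constant mask; for a non-power-of-two width
   (e.g. a hypothetical i24) it needs a real unsigned remainder (urem). */
uint32_t reduce_pow2(uint32_t n) { return n & (32 - 1); } /* bw = 32 */
uint32_t reduce_npot(uint32_t n) { return n % 24; }       /* bw = 24 */
```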

spatel marked an inline comment as done. Dec 12 2018, 3:13 PM
spatel added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
95

Right - it's the targets that don't rotate that make this hard to handle for generic combining. If we had some kind of cost-model driven instcombine that ran before things like vectorization, we could be more flexible.

Yes, we can restrict this to power-of-2 types. And yes, I don't know how we'd ever see that kind of op unless someone has a very strange target. :)

spatel updated this revision to Diff 177959. Dec 12 2018, 3:45 PM

Patch updated:
Added check and test for non-power-of-2 type.

nikic added inline comments. Dec 16 2018, 12:47 PM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
148

It looks like this is not the right way to replace a phi. For this test case:

define i32 @rotl(i32 %a, i32 %b) {
entry:
  %cmp = icmp eq i32 %b, 0
  br i1 %cmp, label %end, label %rotbb

rotbb:
  %sub = sub i32 32, %b
  %shr = lshr i32 %a, %sub
  %shl = shl i32 %a, %b
  %or = or i32 %shr, %shl
  br label %end

end:
  %cond = phi i32 [ %or, %rotbb ], [ %a, %entry ]
  %other = phi i32 [ 1, %rotbb ], [ 2, %entry ]
  %res = or i32 %cond, %other
  ret i32 %res
}

I get:

PHI nodes not grouped at top of basic block!
  %other = phi i32 [ 1, %rotbb ], [ 2, %entry ]
label %end
in function rotl
LLVM ERROR: Broken function found, compilation aborted!

Some possible alternatives from git grep "IRBuilder.*Phi":

lib/Transforms/Instrumentation/GCOVProfiling.cpp:          IRBuilder<> BuilderForPhi(&*BB.begin());
lib/Transforms/Scalar/IndVarSimplify.cpp:        IRBuilder<> Builder(&*WidePhi->getParent()->getFirstInsertionPt());
lib/Transforms/Utils/BypassSlowDivision.cpp:  IRBuilder<> Builder(PhiBB, PhiBB->begin());
spatel marked 2 inline comments as done. Dec 17 2018, 4:24 AM
spatel added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
148

Yes, good catch. I'll add a regression test and fix that.

spatel updated this revision to Diff 178451. Dec 17 2018, 4:49 AM
spatel marked an inline comment as done.

Patch updated:
Set builder insertion point to valid location for non-phi replacement instruction.

nikic accepted this revision. Dec 17 2018, 5:28 AM

LGTM. As a future extension, it would be nice to also handle non-rotate funnel shifts.

One note: We're not checking whether RotBB is reachable *only* from GuardBB, so it may be that it can also be executed with a shift amount of zero. However, in this case the result will be undef due to the oversized shift, and we can assume that it coincides with the value provided by this transformation, so there shouldn't be a problem.
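For context, a reference C model of llvm.fshl semantics for i32 (the shift amount is taken modulo the bit width, so the intrinsic is fully defined for every input; a rotate is fshl(x, x, n); the name is illustrative):

```c
#include <stdint.h>

/* Reference model of llvm.fshl.i32: concatenate a:b (a in the high half),
   shift left by n modulo 32, and keep the high 32 bits. */
uint32_t fshl32(uint32_t a, uint32_t b, uint32_t n) {
    n &= 31;
    if (n == 0)
        return a;                 /* avoid the undefined b >> 32 in C */
    return (a << n) | (b >> (32 - n));
}
```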

This revision is now accepted and ready to land.Dec 17 2018, 5:28 AM
This revision was automatically updated to reflect the committed changes.