This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Better legalization for FSHL and FSHR
ClosedPublic

Authored by foad on Mar 31 2020, 9:59 AM.

Details

Summary

In SelectionDAGBuilder always translate the fshl and fshr intrinsics to
FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and
ORs. Improve the legalization of FSHL and FSHR to avoid code quality
regressions.

Diff Detail

Event Timeline

foad created this revision.Mar 31 2020, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2020, 9:59 AM
foad added a comment.Mar 31 2020, 10:02 AM

This is a work in progress but I'd appreciate any feedback.

  • Does this seem like the right thing to do in general?
  • How should I implement integer expansion for funnel shifts?
  • Any advice on fixing any of the code quality regressions in the tests?

Thanks, I was just starting to do this

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
1122–1123

Should this maintain the special case for power of 2 (the GlobalISel version won't have the power of 2 urem combines for a while)

1126

Braces

llvm/test/CodeGen/X86/vector-fshr-128.ll
2662–2668

Bad looking regression?

This is a work in progress but I'd appreciate any feedback.

  • Does this seem like the right thing to do in general?

I think SelectionDAGBuilder expanding this is nonsense, so yes. The expansion is also duplicated in two places, which is also bad.

  • How should I implement integer expansion for funnel shifts?

This is already done?

  • Any advice on fixing any of the code quality regressions in the tests?
foad added a comment.Mar 31 2020, 12:15 PM
  • How should I implement integer expansion for funnel shifts?

This is already done?

I mean they're not handled in DAGTypeLegalizer::ExpandIntegerResult. This is what causes some X86 tests to crash.

foad marked 2 inline comments as done.Apr 1 2020, 1:26 AM
foad added inline comments.
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
1122–1123

You mean generate AND explicitly instead of UREM? I don't think there's any need to do that. Something else in SelectionDAG cleans it up.

foad updated this revision to Diff 254244.Apr 1 2020, 10:21 AM
foad marked an inline comment as done.

Translate fshl/fshr intrinsics to ROTL/ROTR up front whenever possible.
Implement integer promotion for ROTL/ROTR in addition to FSHL/FSHR.
Improve expandROT.
This removes most of the major code quality regressions.
Some tests still crash because I haven't implemented integer expansion.

@foad Any progress on fixing the crashes?

foad added a comment.Apr 15 2020, 10:07 AM

@foad Any progress on fixing the crashes?

Short answer: no not yet.

I haven't thought of any clever way to implement ExpandIntegerResult for funnel shifts, so I just want to lower them and run ExpandIntegerResult on the lowered code instead. But I'm not sure how to implement that.

@foad Any progress on fixing the crashes?

Short answer: no not yet.

I haven't thought of any clever way to implement ExpandIntegerResult for funnel shifts, so I just want to lower them and run ExpandIntegerResult on the lowered code instead. But I'm not sure how to implement that.

I believe this should be possible by calling ReplaceValueWith on the expandFunnelShift result (and returning an empty SDValue).

@foad D77316 might not be needed for funnel shift lowering now that D80489 has landed

foad updated this revision to Diff 266175.May 26 2020, 5:34 AM

Rebase.

foad marked 3 inline comments as done.May 26 2020, 5:39 AM

@foad D77316 might not be needed for funnel shift lowering now that D80489 has landed

True, but we get a slight PowerPC regression instead.

llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

This is a slight regression. We need one more instruction overall because of the two clrlwi instructions to mask both normal and the inverted shift amounts.

129–131

Ditto.

158–159

Ditto.

RKSimon added a subscriber: hfinkel.

IIRC powerpc scalar shift ops can use shift amounts upto the bitwidth (not bitwidth-1) so there might be custom funnel shifts that they can do with that - @hfinkel @nemanjai any thoughts?

foad updated this revision to Diff 266888.May 28 2020, 8:35 AM

Implement ExpandIntegerResult for funnel shifts and rotates.
Simpler PromoteIntegerResult for rotates.

IIRC powerpc scalar shift ops can use shift amounts upto the bitwidth (not bitwidth-1) so there might be custom funnel shifts that they can do with that - @hfinkel @nemanjai any thoughts?

The reg+reg shifts consider enough bits of the register that contains the shift amount to shift by bitwidth * 2 - 1 bits. For example (i64:x << y) == 0 for 64 <= y < 128. The immediate forms do not have this property (i.e. there is no way to encode (i64:x << 64) since the immediate field is 6 bits.
But even for the reg+reg versions, I can't think of how we can utilize this.

foad updated this revision to Diff 267228.May 29 2020, 7:11 AM

Improve expansion of FSHL/FSHR by non-zero constant amount

foad added a comment.May 29 2020, 7:43 AM

IIRC powerpc scalar shift ops can use shift amounts upto the bitwidth (not bitwidth-1) so there might be custom funnel shifts that they can do with that - @hfinkel @nemanjai any thoughts?

The reg+reg shifts consider enough bits of the register that contains the shift amount to shift by bitwidth * 2 - 1 bits. For example (i64:x << y) == 0 for 64 <= y < 128. The immediate forms do not have this property (i.e. there is no way to encode (i64:x << 64) since the immediate field is 6 bits.
But even for the reg+reg versions, I can't think of how we can utilize this.

That's perfect. It means that for e.q. the fshl_i32 test case, the original code is fine and we don't even need the iseleq to handle the shift-by-zero case:

; CHECK-NEXT:    andi. 5, 5, 31
; CHECK-NEXT:    subfic 6, 5, 32
; CHECK-NEXT:    slw 5, 3, 5
; CHECK-NEXT:    srw 4, 4, 6
; CHECK-NEXT:    or 4, 5, 4
; CHECK-NEXT:    iseleq 3, 3, 4 # not needed!

Unfortunately I think we'd need a PowerPC-specific lowering to implement this. (It doesn't matter about the immediate forms because funnel shifts by immediate-0 can be folded away.)

foad marked an inline comment as done.May 29 2020, 7:47 AM
foad added inline comments.
llvm/test/CodeGen/X86/vector-fshl-rot-128.ll
1670–1676

Can anyone suggest what might have caused this regression? We get two presumably redundant movsd instructions:

	psrlq	$50, %xmm1
	movsd	%xmm1, %xmm1            # xmm1 = xmm1[0,1]
	psllq	$14, %xmm0
	movsd	%xmm0, %xmm0            # xmm0 = xmm0[0,1]
	orpd	%xmm1, %xmm0
foad added a comment.May 29 2020, 7:51 AM

This is functionally complete now and could be committed if we're willing to accept the remaining regressions.

foad added a comment.Jun 15 2020, 5:36 AM

ping?

I agree, ping!

spatel added inline comments.Jun 16 2020, 5:25 AM
llvm/test/CodeGen/X86/vector-fshl-rot-128.ll
1670–1676

I looked at the debug output for this case.
It only shows up for x86-32 because we legalize the shift amount constant there with:

t4: v2i64 = BUILD_VECTOR Constant:i64<14>, Constant:i64<14>

-->

  t15: v4i32 = BUILD_VECTOR Constant:i32<14>, undef:i32, Constant:i32<14>, undef:i32
t13: v2i64 = bitcast t15

And that then leads to a big mess when legalizing the 'rotl' node.

We convert an x86 vector-shift-by-variable node into an x86-vector-shift-by-constant node very late which alters an x86 MOVSD node to have the same operands.

But we've already processed that node in the last combining stage, so it survives to isel. We could probably zap that in DAGToDAG pre-processing.

This seems like an unlikely pattern/problem, and there are no existing tests that show it, so I don't think we need to hold this patch up for this regression.

RKSimon added inline comments.Jun 17 2020, 7:01 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6038 ↗(On Diff #267228)

I'd recommend C->getAPIntValue().urem(BW) instead - the fuzzer will catch this at some point........

6043 ↗(On Diff #267228)

C->getAPIntValue().urem(BW)

6053 ↗(On Diff #267228)

You might be able to use ISD::matchUnaryPredicate to do some of this - so you just need to provide a predicate lambda

RKSimon added inline comments.Jun 18 2020, 1:37 AM
llvm/test/CodeGen/AMDGPU/fshl.ll
25

IIRC amdgpu supports fshr but not fshl - could we do more in the expansion to handle this - invert the shift amount and use the legal opcode?

arsenm added inline comments.Jun 18 2020, 4:51 AM
llvm/test/CodeGen/AMDGPU/fshl.ll
25

I think this is what was here before; the alignbit is gone

I was also wondering if there was a tricky way to use the 32 bit version in the 64 bit implementation

foad marked 4 inline comments as done.Jun 18 2020, 7:40 AM
foad added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6053 ↗(On Diff #267228)

Nice tip, thanks!

llvm/test/CodeGen/AMDGPU/fshl.ll
25

If we had a test like fshl_i32 but with the inputs in vgprs then I guess it would show a regression here.

foad updated this revision to Diff 271716.Jun 18 2020, 7:42 AM
foad marked an inline comment as done.Jun 18 2020, 7:42 AM

Rebase. Improve isNonZeroModBitWidth.

RKSimon added inline comments.Jun 25 2020, 1:24 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6303

@foad We've lost this functionality - this needs adding to the new expansion code or to DAGCombiner

foad marked an inline comment as done.Jun 25 2020, 4:13 AM
foad added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6303

TargetLowering::expandROT does this: "// If a rotate in the other direction is supported, use it." Are there tests showing that it's not working for some reason?

RKSimon added inline comments.Jun 25 2020, 4:59 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6303

Yes the AMDGPU/fshl.ll tests look like they've regressed because of this

reverse ping - any luck on the amdgpu regressions?

foad updated this revision to Diff 277695.Jul 14 2020, 12:58 AM
foad marked an inline comment as done.

Expand FSHL as FSHR or vice versa if the target only supports one.

foad added a comment.Jul 14 2020, 1:00 AM

reverse ping - any luck on the amdgpu regressions?

Yes should be fixed now.

Looks good to me.

RKSimon added inline comments.Jul 15 2020, 1:31 AM
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

Any luck with this?

foad marked an inline comment as done.Jul 15 2020, 2:16 AM
foad added inline comments.
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

No. I think it would need a target-specific lowering to take advantage of the fact that shift by 32 is well defined on PowerPC. Can I leave this for a later patch, and/or could a PowerPC expert volunteer to help with it?

RKSimon added inline comments.Jul 16 2020, 1:10 AM
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

So we're not just missing a micro optimization we can add to the powerpc backend as part of this patch? @hfinkel Any suggestions?

foad marked an inline comment as done.Jul 16 2020, 1:59 AM
foad added inline comments.
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

I can't think of a simple micro optimization that would help. For PowerPC it would be better to expand fshl(X, Y, Z) as X << (Z & 31) | Y >> (32 - (Z & 31)) (where the right shift supports shifting by 32) which would be better than either the "before" or the "after" code shown in this diff.

RKSimon added inline comments.Jul 16 2020, 3:05 AM
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

Are you able to add the PPC custom lowering to this patch?

foad marked an inline comment as done.Jul 16 2020, 6:40 AM
foad added inline comments.
llvm/test/CodeGen/PowerPC/funnel-shift.ll
22–24

I'd prefer to do it as a separate patch first: D83948.

foad updated this revision to Diff 283168.Aug 5 2020, 2:17 AM

Rebase now D83948 has landed.

This fixes the PowerPC regression, but there's a new regression in some newly added RISCV test cases.

foad updated this revision to Diff 283190.Aug 5 2020, 3:58 AM

RISCV updates to avoid code quality regressions from changes in the way funnel shifts are lowered.

foad added a subscriber: lewis-revill.

RISCV updates to avoid code quality regressions from changes in the way funnel shifts are lowered.

@asb @lewis-revill could you take a look at my RISCV changes please? Code quality seems to be as good or better in all lit test cases, I just had to change some RISCV patterns to match the new lowering.

@asb Any thoughts on the RISCV changes? This is the last blocker on the patch.

arsenm added inline comments.Aug 13 2020, 7:14 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
148 ↗(On Diff #283190)

dyn_cast rather than isa and cast?

foad updated this revision to Diff 285979.Aug 17 2020, 5:33 AM

Rebase and address a review comment.

foad added inline comments.Aug 17 2020, 5:35 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
148 ↗(On Diff #283190)

I refactored it a little to use isa and getConstantOperandVal, which seems to be the prevailing style in this file.

@asb any objections to me approving this?

RKSimon accepted this revision.Aug 21 2020, 2:00 AM

LGTM

This revision is now accepted and ready to land.Aug 21 2020, 2:00 AM
asb added a comment.Aug 21 2020, 2:27 AM

No objections accepting from the RISC-V side. I don't have a set of execution tests for the fshl/fshr changes to validate them, but I didn't see anything obviously wrong. The changes primarily affect the (still experimental) bitmanip extension also, so the bar for review is somewhat lower. The change does seem to improve srliw lowering in a few places in the GCC torture suite as well (in a way that is both correct and beneficial).

This revision was landed with ongoing or failed builds.Aug 21 2020, 2:33 AM
This revision was automatically updated to reflect the committed changes.
bjope added a subscriber: bjope.Aug 21 2020, 5:10 AM
bjope added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6159 ↗(On Diff #286977)

Our OOT target got legal types that aren't pow-of-2. So this kind of broke some funnel-shift tests that happened to work earlier.

I realize that it might be hard to verify the behavior for non-pow-of-2 types upstream, so not sure if I'm welcome to do patches upstream that fixes the problem (although there might be other OOT targets in the same situation).

BTW: I haven't figured out exactly what part of the code below that only works for pow-of-2 types yet. I see some checks for isPowerOf2_32(BW) further down in the code, so maybe the assert isn't needed at all?

foad added inline comments.Aug 21 2020, 5:29 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6159 ↗(On Diff #286977)

I believe that only the conversion fshl(X, Y, Z) <--> fshr(X, Y, -Z) depends on bitwidth being a power of two. So pushing the assertion inside the "if" might be a quick fix, assuming your target has equally good support for shifting in both directions?

The correct conversion for non-power-of-two would be something like: fshl(X, Y, Z) -> fshr(X, Y, BW - (Z % BW)) and vice versa. Personally I'd be happy to have this code upstream, though I don't know if there would be any way of testing it.

bjope added inline comments.Aug 21 2020, 5:33 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6159 ↗(On Diff #286977)

Yes, I realize this is related to the new code below.
So we could just add it as a condition in the if-statement related to swapping between FSHL<->FSHR.

Now it is a bit weird to have the assert here, guarding all the code below, when the general case below handle non-pow-of-2 BW explicitly.

foad added inline comments.Aug 21 2020, 5:37 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6159 ↗(On Diff #286977)

So we could just add it as a condition in the if-statement related to swapping between FSHL<->FSHR.

Good idea, maybe with a comment explaining how to implement properly.

Now it is a bit weird to have the assert here, guarding all the code below, when the general case below handle non-pow-of-2 BW explicitly.

Agreed. Sorry about that!

bjope added inline comments.Aug 21 2020, 5:43 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6159 ↗(On Diff #286977)

No worries.
Patch seem good in general. I saw some lowering improvements for important cases (and only some minor regression, at least in instruction count, for weird types like i37 but that isn't really important for my backend).

bjope added inline comments.Aug 21 2020, 10:31 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6163 ↗(On Diff #286977)

Is this transform even correct?

Assume we got fshl 3, 2, 0, then the result would be 3. But this code could turn it into fshr 3, 2, (0-0), then the result would become 2.

bjope added inline comments.Aug 21 2020, 1:58 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6163 ↗(On Diff #286977)

Maybe one can limit the condition to

if (!isOperationLegalOrCustom(Node->getOpcode(), VT) &&
      isOperationLegalOrCustom(RevOpcode, VT) &&
      isPowerOf2_32(BW) && isNonZeroModBitWidth(Z, BW)) {

Although, isNonZeroModBitWidth currently allow undef values. And I wonder if it is allowed to turn fshl X, Y, undef into fshr X, Y, (0-UNDEF) , considering that we for example know that fshl -1, 0, undef can't be zero, but fshr -1, 0, undef can be zero. So one probably needs to restrict the condition further by adding a isNonZeroNonUndefModBitWidth helper and use that here.

foad added inline comments.Aug 22 2020, 1:11 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6163 ↗(On Diff #286977)

You're right. I'll take a proper look on Monday. Of course this is just an optimization so it's safe to disable it completely if it's causing problems. I think I added it to fix some code quality regressions in the AMDGPU tests.

xbolva00 added inline comments.
llvm/test/CodeGen/X86/vector-fshl-512.ll
674

Is this better sequence? Looks longer.

bjope added inline comments.Aug 24 2020, 12:31 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6163 ↗(On Diff #286977)

Proposed workaround/fixup here: https://reviews.llvm.org/D86430

foad added inline comments.Aug 24 2020, 1:16 AM
llvm/test/CodeGen/X86/vector-fshl-512.ll
674

I agree it's longer but my knowledge of AVX-512 is limited. It's using the expansion of fshl: X << (Z % BW) | Y >> 1 >> (BW - 1 - (Z % BW))
but it looks like the "Y >> 1 >> expr" part is generating messy code. Compared to the old code:

vpsrlw %xmm4, %ymm5, %ymm5
vpsrlw %xmm4, %ymm1, %ymm1

I would expect to see just a couple more instructions here:

vpsrlw $1, %ymm5, %ymm5
vpsrlw %xmm4, %ymm5, %ymm5
vpsrlw $1, %ymm1, %ymm1
vpsrlw %xmm4, %ymm1, %ymm1

which would be offset by not needing the vpcmpeqw/vpternlogq at the end.

The fuzzer has found a failing case here: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25150

define i33 @fshr_multi_use(i33 %a) {
bb:
  %b = tail call i33 @llvm.fshr.i33(i33 %a, i33 %a, i33 1)
  %e = and i33 %b, 31
  ret i33 %e
}
foad added a comment.Aug 24 2020, 5:35 AM

The fuzzer has found a failing case here: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25150

define i33 @fshr_multi_use(i33 %a) {
bb:
  %b = tail call i33 @llvm.fshr.i33(i33 %a, i33 %a, i33 1)
  %e = and i33 %b, 31
  ret i33 %e
}

D86449.

dmajor added a subscriber: dmajor.Aug 24 2020, 8:59 AM

We're seeing a break in the Firefox build after this change:

WidenVectorResult #0: t29: v2i32 = rotl t25, t28, /builds/worker/checkouts/gecko/gfx/angle/checkout/src/image_util/loadimage.cpp:640

Do not know how to widen the result of this operator!

The reproducers are available from the upper-left pane of https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313707074&repo=try&lineNumber=14721 if you click "Show Job Info".

Will this be covered by one of the existing followups?

We're seeing a break in the Firefox build after this change:

WidenVectorResult #0: t29: v2i32 = rotl t25, t28, /builds/worker/checkouts/gecko/gfx/angle/checkout/src/image_util/loadimage.cpp:640

Do not know how to widen the result of this operator!

The reproducers are available from the upper-left pane of https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313707074&repo=try&lineNumber=14721 if you click "Show Job Info".

Will this be covered by one of the existing followups?

We're also seeing this in our internal testing. I was about to put together a patch for it if its not already being worked on.

foad added a comment.Aug 24 2020, 9:55 AM

We're seeing a break in the Firefox build after this change:

WidenVectorResult #0: t29: v2i32 = rotl t25, t28, /builds/worker/checkouts/gecko/gfx/angle/checkout/src/image_util/loadimage.cpp:640

Do not know how to widen the result of this operator!

The reproducers are available from the upper-left pane of https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313707074&repo=try&lineNumber=14721 if you click "Show Job Info".

Will this be covered by one of the existing followups?

We're also seeing this in our internal testing. I was about to put together a patch for it if its not already being worked on.

I'm not working on it yet as I don't have a small test case yet. I imagine the fix is just to add cases for ROTL and ROTR at LegalizeVectorTypes.cpp:2866.

We're seeing a break in the Firefox build after this change:

WidenVectorResult #0: t29: v2i32 = rotl t25, t28, /builds/worker/checkouts/gecko/gfx/angle/checkout/src/image_util/loadimage.cpp:640

Do not know how to widen the result of this operator!

The reproducers are available from the upper-left pane of https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313707074&repo=try&lineNumber=14721 if you click "Show Job Info".

Will this be covered by one of the existing followups?

We're also seeing this in our internal testing. I was about to put together a patch for it if its not already being worked on.

I'm not working on it yet as I don't have a small test case yet. I imagine the fix is just to add cases for ROTL and ROTR at LegalizeVectorTypes.cpp:2866.

Should be fixed after 43465a43755498e11b14ceb46e278bd127b3b3d7

ro added a subscriber: ro.Aug 25 2020, 9:05 AM

This patch broke sparcv9-sun-solaris2.11 RelWithDebInfo build: cf. Bug 47303.

foad added a comment.Aug 26 2020, 1:58 AM
In D77152#2236467, @ro wrote:

This patch broke sparcv9-sun-solaris2.11 RelWithDebInfo build: cf. Bug 47303.

D86601.

llvm/test/CodeGen/X86/funnel-shift.ll