In SelectionDAGBuilder always translate the fshl and fshr intrinsics to
FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and
ORs. Improve the legalization of FSHL and FSHR to avoid code quality
regressions.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This is a work in progress but I'd appreciate any feedback.
- Does this seem like the right thing to do in general?
- How should I implement integer expansion for funnel shifts?
- Any advice on fixing any of the code quality regressions in the tests?
Thanks, I was just starting to do this
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp | ||
---|---|---|
1120–1121 | Should this maintain the special case for power of 2 (the GlobalISel version won't have the power of 2 urem combines for a while) | |
1124 | Braces | |
llvm/test/CodeGen/X86/vector-fshr-128.ll | ||
2682–2688 | Bad looking regression? |
I think SelectionDAGBuilder expanding this is nonsense, so yes. The expansion is also duplicated in two places, which is also bad.
- How should I implement integer expansion for funnel shifts?
This is already done?
- Any advice on fixing any of the code quality regressions in the tests?
I mean they're not handled in DAGTypeLegalizer::ExpandIntegerResult. This is what causes some X86 tests to crash.
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp | ||
---|---|---|
1120–1121 | You mean generate AND explicitly instead of UREM? I don't think there's any need to do that. Something else in SelectionDAG cleans it up. |
Translate fshl/fshr intrinsics to ROTL/ROTR up front whenever possible.
Implement integer promotion for ROTL/ROTR in addition to FSHL/FSHR.
Improve expandROT.
This removes most of the major code quality regressions.
Some tests still crash because I haven't implemented integer expansion.
Short answer: no not yet.
I haven't thought of any clever way to implement ExpandIntegerResult for funnel shifts, so I just want to lower them and run ExpandIntegerResult on the lowered code instead. But I'm not sure how to implement that.
I believe this should be possible by calling ReplaceValueWith on the expandFunnelShift result (and returning an empty SDValue).
Implement ExpandIntegerResult for funnel shifts and rotates.
Simpler PromoteIntegerResult for rotates.
The reg+reg shifts consider enough bits of the register that contains the shift amount to shift by bitwidth * 2 - 1 bits. For example (i64:x << y) == 0 for 64 <= y < 128. The immediate forms do not have this property (i.e. there is no way to encode (i64:x << 64) since the immediate field is 6 bits.
But even for the reg+reg versions, I can't think of how we can utilize this.
That's perfect. It means that for e.q. the fshl_i32 test case, the original code is fine and we don't even need the iseleq to handle the shift-by-zero case:
; CHECK-NEXT: andi. 5, 5, 31 ; CHECK-NEXT: subfic 6, 5, 32 ; CHECK-NEXT: slw 5, 3, 5 ; CHECK-NEXT: srw 4, 4, 6 ; CHECK-NEXT: or 4, 5, 4 ; CHECK-NEXT: iseleq 3, 3, 4 # not needed!
Unfortunately I think we'd need a PowerPC-specific lowering to implement this. (It doesn't matter about the immediate forms because funnel shifts by immediate-0 can be folded away.)
llvm/test/CodeGen/X86/vector-fshl-rot-128.ll | ||
---|---|---|
1704–1710 | Can anyone suggest what might have caused this regression? We get two presumably redundant movsd instructions: psrlq $50, %xmm1 movsd %xmm1, %xmm1 # xmm1 = xmm1[0,1] psllq $14, %xmm0 movsd %xmm0, %xmm0 # xmm0 = xmm0[0,1] orpd %xmm1, %xmm0 |
This is functionally complete now and could be committed if we're willing to accept the remaining regressions.
llvm/test/CodeGen/X86/vector-fshl-rot-128.ll | ||
---|---|---|
1704–1710 | I looked at the debug output for this case. t4: v2i64 = BUILD_VECTOR Constant:i64<14>, Constant:i64<14> --> t15: v4i32 = BUILD_VECTOR Constant:i32<14>, undef:i32, Constant:i32<14>, undef:i32 t13: v2i64 = bitcast t15 And that then leads to a big mess when legalizing the 'rotl' node. We convert an x86 vector-shift-by-variable node into an x86-vector-shift-by-constant node very late which alters an x86 MOVSD node to have the same operands. But we've already processed that node in the last combining stage, so it survives to isel. We could probably zap that in DAGToDAG pre-processing. This seems like an unlikely pattern/problem, and there are no existing tests that show it, so I don't think we need to hold this patch up for this regression. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6038 | I'd recommend C->getAPIntValue().urem(BW) instead - the fuzzer will catch this at some point........ | |
6043 | C->getAPIntValue().urem(BW) | |
6053 | You might be able to use ISD::matchUnaryPredicate to do some of this - so you just need to provide a predicate lambda |
llvm/test/CodeGen/AMDGPU/fshl.ll | ||
---|---|---|
25 | IIRC amdgpu supports fshr but not fshl - could we do more in the expansion to handle this - invert the shift amount and use the legal opcode? |
llvm/test/CodeGen/AMDGPU/fshl.ll | ||
---|---|---|
25 | I think this is what was here before; the alignbit is gone I was also wondering if there was a tricky way to use the 32 bit version in the 64 bit implementation |
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
---|---|---|
6272 | TargetLowering::expandROT does this: "// If a rotate in the other direction is supported, use it." Are there tests showing that it's not working for some reason? |
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
---|---|---|
6272 | Yes the AMDGPU/fshl.ll tests look like they've regressed because of this |
llvm/test/CodeGen/PowerPC/funnel-shift.ll | ||
---|---|---|
22–24 | Any luck with this? |
llvm/test/CodeGen/PowerPC/funnel-shift.ll | ||
---|---|---|
22–24 | No. I think it would need a target-specific lowering to take advantage of the fact that shift by 32 is well defined on PowerPC. Can I leave this for a later patch, and/or could a PowerPC expert volunteer to help with it? |
llvm/test/CodeGen/PowerPC/funnel-shift.ll | ||
---|---|---|
22–24 | I can't think of a simple micro optimization that would help. For PowerPC it would be better to expand fshl(X, Y, Z) as X << (Z & 31) | Y >> (32 - (Z & 31)) (where the right shift supports shifting by 32) which would be better than either the "before" or the "after" code shown in this diff. |
llvm/test/CodeGen/PowerPC/funnel-shift.ll | ||
---|---|---|
22–24 | Are you able to add the PPC custom lowering to this patch? |
Rebase now D83948 has landed.
This fixes the PowerPC regression, but there's a new regression in some newly added RISCV test cases.
RISCV updates to avoid code quality regressions from changes in the way funnel shifts are lowered.
@asb @lewis-revill could you take a look at my RISCV changes please? Code quality seems to be as good or better in all lit test cases, I just had to change some RISCV patterns to match the new lowering.
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp | ||
---|---|---|
148 ↗ | (On Diff #283190) | dyn_cast rather than isa and cast? |
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp | ||
---|---|---|
148 ↗ | (On Diff #283190) | I refactored it a little to use isa and getConstantOperandVal, which seems to be the prevailing style in this file. |
No objections accepting from the RISC-V side. I don't have a set of execution tests for the fshl/fshr changes to validate them, but I didn't see anything obviously wrong. The changes primarily affect the (still experimental) bitmanip extension also, so the bar for review is somewhat lower. The change does seem to improve srliw lowering in a few places in the GCC torture suite as well (in a way that is both correct and beneficial).
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6057 | Our OOT target got legal types that aren't pow-of-2. So this kind of broke some funnel-shift tests that happened to work earlier. I realize that it might be hard to verify the behavior for non-pow-of-2 types upstream, so not sure if I'm welcome to do patches upstream that fixes the problem (although there might be other OOT targets in the same situation). BTW: I haven't figured out exactly what part of the code below that only works for pow-of-2 types yet. I see some checks for isPowerOf2_32(BW) further down in the code, so maybe the assert isn't needed at all? |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6057 | I believe that only the conversion fshl(X, Y, Z) <--> fshr(X, Y, -Z) depends on bitwidth being a power of two. So pushing the assertion inside the "if" might be a quick fix, assuming your target has equally good support for shifting in both directions? The correct conversion for non-power-of-two would be something like: fshl(X, Y, Z) -> fshr(X, Y, BW - (Z % BW)) and vice versa. Personally I'd be happy to have this code upstream, though I don't know if there would be any way of testing it. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6057 | Yes, I realize this is related to the new code below. Now it is a bit weird to have the assert here, guarding all the code below, when the general case below handle non-pow-of-2 BW explicitly. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6057 |
Good idea, maybe with a comment explaining how to implement properly.
Agreed. Sorry about that! |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6057 | No worries. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6061 | Is this transform even correct? Assume we got fshl 3, 2, 0, then the result would be 3. But this code could turn it into fshr 3, 2, (0-0), then the result would become 2. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6061 | Maybe one can limit the condition to if (!isOperationLegalOrCustom(Node->getOpcode(), VT) && isOperationLegalOrCustom(RevOpcode, VT) && isPowerOf2_32(BW) && isNonZeroModBitWidth(Z, BW)) { Although, isNonZeroModBitWidth currently allow undef values. And I wonder if it is allowed to turn fshl X, Y, undef into fshr X, Y, (0-UNDEF) , considering that we for example know that fshl -1, 0, undef can't be zero, but fshr -1, 0, undef can be zero. So one probably needs to restrict the condition further by adding a isNonZeroNonUndefModBitWidth helper and use that here. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6061 | You're right. I'll take a proper look on Monday. Of course this is just an optimization so it's safe to disable it completely if it's causing problems. I think I added it to fix some code quality regressions in the AMDGPU tests. |
llvm/test/CodeGen/X86/vector-fshl-512.ll | ||
---|---|---|
673 | Is this better sequence? Looks longer. |
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
---|---|---|
6061 | Proposed workaround/fixup here: https://reviews.llvm.org/D86430 |
llvm/test/CodeGen/X86/vector-fshl-512.ll | ||
---|---|---|
673 | I agree it's longer but my knowledge of AVX-512 is limited. It's using the expansion of fshl: X << (Z % BW) | Y >> 1 >> (BW - 1 - (Z % BW)) vpsrlw %xmm4, %ymm5, %ymm5 vpsrlw %xmm4, %ymm1, %ymm1 I would expect to see just a couple more instructions here: vpsrlw $1, %ymm5, %ymm5 vpsrlw %xmm4, %ymm5, %ymm5 vpsrlw $1, %ymm1, %ymm1 vpsrlw %xmm4, %ymm1, %ymm1 which would be offset by not needing the vpcmpeqw/vpternlogq at the end. |
The fuzzer has found a failing case here: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25150
define i33 @fshr_multi_use(i33 %a) { bb: %b = tail call i33 @llvm.fshr.i33(i33 %a, i33 %a, i33 1) %e = and i33 %b, 31 ret i33 %e }
We're seeing a break in the Firefox build after this change:
WidenVectorResult #0: t29: v2i32 = rotl t25, t28, /builds/worker/checkouts/gecko/gfx/angle/checkout/src/image_util/loadimage.cpp:640 Do not know how to widen the result of this operator!
The reproducers are available from the upper-left pane of https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313707074&repo=try&lineNumber=14721 if you click "Show Job Info".
Will this be covered by one of the existing followups?
We're also seeing this in our internal testing. I was about to put together a patch for it if its not already being worked on.
I'm not working on it yet as I don't have a small test case yet. I imagine the fix is just to add cases for ROTL and ROTR at LegalizeVectorTypes.cpp:2866.
Should this maintain the special case for power of 2 (the GlobalISel version won't have the power of 2 urem combines for a while)