
[DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP)
Needs Review · Public

Authored by RKSimon on Apr 9 2020, 7:20 AM.

Details

Summary

This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.
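For illustration, a minimal hypothetical IR sketch (not taken from the patch's tests) of the kind of pattern this affects: the shift result has a second use, so it can't be simplified in place, but the truncating use only demands the low bits and can now look through the shift.

define i8 @srl_multiuse(i32 %x, ptr %p) {
  %s = lshr i32 %x, 8
  store i32 %s, ptr %p        ; second use keeps %s alive
  %lo = trunc i32 %s to i8    ; this use only demands the low 8 bits of %s
  ret i8 %lo
}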

This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.

There are a number of regressions that I'm still investigating, notably:

- Illegal load/store codegen isn't great.
- X86 ends up splitting a funnel shift from another shift/lea.

There are a few cases where we end up with extra register moves, which I think we can accept in exchange for the increased ILP.

Diff Detail

Unit Tests: Failed

Time | Test
60,150 ms | x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
330 ms | x64 debian > BOLT.runtime/X86::user-func-reorder.c

Event Timeline

There are a very large number of changes, so older changes are hidden.
craig.topper added inline comments.Jan 26 2021, 12:43 PM
llvm/test/CodeGen/RISCV/rv64Zbp.ll
1433 ↗(On Diff #319268)

Running the tests through instcombine also breaks GORCI matching.

craig.topper added inline comments.Jan 26 2021, 12:47 PM
llvm/test/CodeGen/RISCV/rv64Zbp.ll
1433 ↗(On Diff #319268)

It's also worth noting that the failing tests repeat the same gorc pattern twice, which is redundant. The test was trying to check that we could detect the redundancy. I guess this patch may have seen through some of the redundancy?
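For reference, a hedged sketch of that shape (hypothetical, not the exact rv64Zbp test): the same gorc2 step applied twice, where the second application is a no-op.

define i32 @gorc2_twice(i32 %a) {
  ; gorc2 step: x | ((x << 2) & 0xCCCCCCCC) | ((x >> 2) & 0x33333333)
  %shl1 = shl i32 %a, 2
  %ml1 = and i32 %shl1, -858993460
  %shr1 = lshr i32 %a, 2
  %mr1 = and i32 %shr1, 858993459
  %t1 = or i32 %ml1, %a
  %g1 = or i32 %t1, %mr1
  ; second, redundant application of the identical step
  %shl2 = shl i32 %g1, 2
  %ml2 = and i32 %shl2, -858993460
  %shr2 = lshr i32 %g1, 2
  %mr2 = and i32 %shr2, 858993459
  %t2 = or i32 %ml2, %g1
  %g2 = or i32 %t2, %mr2
  ret i32 %g2
}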

RKSimon planned changes to this revision.Jun 3 2021, 4:29 AM
RKSimon updated this revision to Diff 361510.Jul 25 2021, 8:29 AM

rebase (still needs work)

RKSimon planned changes to this revision.Jul 25 2021, 8:29 AM

I've raised https://bugs.llvm.org/show_bug.cgi?id=51209 about the poor quality of the gorc2 pattern matching and the gorc2, gorc2 -> gorc2 tests.

@RKSimon are there other problems with this patch than just the GORCI matching?

> @RKSimon are there other problems with this patch than just the GORCI matching?

The GORCI matching is the main one.

There are also some minor issues with MatchRotate - we should be allowed to match rotate/funnel-by-constant patterns pre-legalization (see ARM/ror.ll), as those can be re-expanded later without any harm done, before we see through the pattern and lose it. Although now that we match this quite well in InstCombine, I'm not sure this is as likely to happen.
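As a hedged illustration of the shape involved (hypothetical, not the actual ARM/ror.ll test): a rotate-right by a constant expressed as a shift pair, which is lost if the multi-use fold sees through the lshr before MatchRotate fires.

define i32 @ror_by_4(i32 %x) {
  %lo = lshr i32 %x, 4
  %hi = shl i32 %x, 28     ; 32 - 4
  %r = or i32 %hi, %lo     ; rotr(%x, 4)
  ret i32 %r
}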

lenary removed a subscriber: lenary.Nov 2 2021, 6:05 AM
RKSimon updated this revision to Diff 391281.Dec 2 2021, 5:10 AM

rebase - squashed a few more regressions...

RKSimon planned changes to this revision.Dec 10 2021, 2:19 AM
RKSimon planned changes to this revision.Jan 23 2022, 11:39 AM
Herald added a project: Restricted Project. Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.May 8 2022, 6:10 AM

Waiting for D124839 to land

RKSimon updated this revision to Diff 429424.May 14 2022, 2:16 AM
RKSimon retitled this revision from [DAG] Enable ISD::SHL/SRL SimplifyMultipleUseDemandedBits handling (WIP) to [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP).
RKSimon edited the summary of this revision. (Show Details)

Rebased after D124839 to just handle ISD::SRL shifts

RKSimon added inline comments.May 14 2022, 2:22 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
147

@arsenm @foad At EuroLLVM, Matt suggested that maybe we should increase the tolerance to 2 uses of a large immediate before pulling out the constant?

llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

I'm going to take a look at this, but I'm really not familiar with the UXTB matching code, so any pointers would be appreciated.

arsenm added inline comments.May 16 2022, 6:17 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
147

s_mov_b32 K + 2 * v_and_b32_e32 = 16 bytes, 12 cycles
2 * (v_and_b32_e32 K) = 16 bytes, 8 cycles, which is clearly better.

3 * (v_and_b32_e32 K) = 24 bytes, 12 cycles

So 2 uses of a constant seems plainly better for VOP1/VOP2 ops. Above that it becomes a code size vs. latency tradeoff.
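As a rough IR-level picture of the case being costed (a hypothetical function, not from trunc-combine.ll): one large immediate with two VALU AND uses, where inlining the constant twice beats materializing it with s_mov_b32.

define i32 @two_uses_of_k(i32 %a, i32 %b) {
  %x = and i32 %a, 305419896   ; K = 0x12345678, too wide for an inline constant
  %y = and i32 %b, 305419896   ; second use of the same K
  %r = xor i32 %x, %y
  ret i32 %r
}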

arsenm added inline comments.May 16 2022, 6:23 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
147

This decision is also generally made by SIFoldOperands. Probably need to fix it there and not in the DAG.

foad added inline comments.May 16 2022, 6:35 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
147

I'm strongly in favour of never pulling out the constant (or rather, always folding into the instruction) and I have patches to that effect starting with D114643, which I'm hoping to get back to pretty soon.

RKSimon updated this revision to Diff 431345.May 23 2022, 5:40 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

foad added a comment.May 23 2022, 5:54 AM

AMDGPU changes LGTM.

RKSimon added inline comments.
llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

instcombine optimises this as well:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

which has the same problem:

_test10:
@ %bb.0:
        mov     r1, #248
        mov     r2, #7
        orr     r1, r1, #16252928
        orr     r2, r2, #458752
        and     r1, r1, r0, lsr #7
        and     r0, r2, r0, lsr #12
        orr     r0, r0, r1
        bx      lr
RKSimon added inline comments.May 26 2022, 10:38 AM
llvm/test/CodeGen/Thumb2/thumb2-uxtb.ll
185 ↗(On Diff #431345)

same problem - instcombine will have already optimized this to:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

It feels like I'm avoiding the issue - but should I update the arm/thumb2 UXTB16 tests to match what the middle-end will have generated?

dmgreen added inline comments.May 27 2022, 6:37 AM
llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

I was taking a look. The test is super old now, so old that it had signed types when it was originally added.

I was surprised to see that an "and 0x70007" is being recognised via an "and 0xff00ff" tablegen pattern - it goes into SelectionDAGISel::CheckAndMask, which checks that the other mask bits are already 0.

I think that is what this is trying to test - that a smaller and-mask still matches the UXTB16. Is it possible to change it to something that still captures that, without relying on the multi-use fold of %tmp2 not happening?

Maybe something like this?

define i32 @test10(i32 %p0) {
  %p = and i32 %p0, 3
  %a = shl i32 65537, %p
  %b = lshr i32 %a, 1
  %tmp7 = and i32 %b, 458759
  ret i32 %tmp7
}
RKSimon added inline comments.May 30 2022, 1:59 PM
llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

Thanks for the hint - I'll give it a try

RKSimon updated this revision to Diff 433345.Jun 1 2022, 3:23 AM

rebase with alternative uxtb16 tests

RKSimon added inline comments.Jun 1 2022, 3:25 AM
llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

Thanks @dmgreen - those still match fine. Should I pre-commit these new tests and possibly alter the existing test10 variants to the instcombine-optimized IR to show they already fail to match?

dmgreen added inline comments.Jun 1 2022, 7:03 AM
llvm/test/CodeGen/ARM/uxtb.ll
112 ↗(On Diff #429424)

That sounds good to me.

RKSimon updated this revision to Diff 433398.Jun 1 2022, 7:42 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

RKSimon added inline comments.
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
139–140

@jonpa @uweigand These tests are proving very fragile depending on the order of ands/shifts - do you think SystemZ should prefer masking leading/trailing bits with shift pairs over shift+and / and+shift? We have TLI::shouldFoldConstantShiftPairToMask to handle that.
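To make the two forms concrete, a hedged sketch (hypothetical, not from store_nonbytesized_vecs.ll) of clearing the low 3 bits either way; the TLI hook picks between them.

define i64 @clear_low3_shiftpair(i64 %x) {
  %s = lshr i64 %x, 3
  %r = shl i64 %s, 3       ; shift-pair form
  ret i64 %r
}

define i64 @clear_low3_mask(i64 %x) {
  %r = and i64 %x, -8      ; equivalent and-mask form
  ret i64 %r
}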

uweigand added inline comments.Fri, Jun 10, 5:23 AM
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
139–140

Well, this specific test only loads and then stores a <3 x i31> vector unmodified, so ideally, however the masking is done, it should be optimized away as unnecessary in either case. That's what currently happens; I'm not sure why this is changing with this PR.
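(For concreteness, the shape of that test is roughly the following hedged sketch - a plain <3 x i31> round trip where all masking should fold away.)

define void @roundtrip(ptr %src, ptr %dst) {
  %v = load <3 x i31>, ptr %src
  store <3 x i31> %v, ptr %dst
  ret void
}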

In general, I think using an and-mask would be preferable over a shift pair on SystemZ.

RKSimon planned changes to this revision.Fri, Jun 10, 5:32 AM

Thanks @uweigand I'll take another look at this soon

RKSimon updated this revision to Diff 438291.Mon, Jun 20, 1:36 AM

rebase after D125836

RKSimon planned changes to this revision.Mon, Jun 20, 1:36 AM