
[DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits
ClosedPublic

Authored by RKSimon on Apr 9 2020, 7:20 AM.

Details

Summary

This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.

This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.

There are a few cases where we end up with extra register moves, which I think we can accept in exchange for the increased ILP.
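
For readers unfamiliar with the mechanism, the change is roughly the following shape inside TargetLowering::SimplifyDemandedBits (a paraphrase from memory rather than the verbatim diff; the exact guards and placement are in the actual patch):

case ISD::SRL: {
  // ...existing demanded-bits handling for the shift...

  // Even if Op0 has other uses, ask SimplifyMultipleUseDemandedBits for a
  // simpler value that agrees on the bits we actually demand, and rebuild
  // the shift on top of it (the original Op0 stays live for its other uses).
  if (SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
          Op0, InDemandedMask, DemandedElts, TLO.DAG, Depth + 1)) {
    SDValue NewOp = TLO.DAG.getNode(ISD::SRL, dl, VT, DemandedOp0, Op1);
    return TLO.CombineTo(Op, NewOp);
  }
  break;
}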

Diff Detail

Unit Tests: Failed

Time | Test
60,140 ms | x64 debian > LLVM.CodeGen/NVPTX::wmma.py
Script: -- : 'RUN: at line 5'; "/usr/bin/python3.9" /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/NVPTX/wmma.py --ptx=60 --gpu-arch=70 > /var/lib/buildkite-agent/builds/llvm-project/build/test/CodeGen/NVPTX/Output/wmma.py.tmp-ptx60-sm_70.ll

Event Timeline

Herald added a project: Restricted Project. Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.May 8 2022, 6:10 AM

Waiting for D124839 to land

RKSimon updated this revision to Diff 429424.May 14 2022, 2:16 AM
RKSimon retitled this revision from [DAG] Enable ISD::SHL/SRL SimplifyMultipleUseDemandedBits handling (WIP) to [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP).
RKSimon edited the summary of this revision. (Show Details)

Rebased after D124839 to just handle ISD::SRL shifts

RKSimon added inline comments.May 14 2022, 2:22 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

@arsenm @foad At EuroLLVM Matt suggested that maybe we should increase the tolerance to 2 uses of the large immediates before pulling out the constant?

llvm/test/CodeGen/ARM/uxtb.ll
112

I'm going to take a look at this, but I'm really not familiar with the UXTB matching code, so any pointers would be appreciated.

arsenm added inline comments.May 16 2022, 6:17 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

s_mov_b32 K + 2 * v_and_b32_e32 = 16 bytes, 12 cycles
2 * (v_and_b32_e32 K) = 16 bytes, 8 cycles, which is clearly better.

3 * (v_and_b32_e32 K) = 24 bytes, 12 cycles

So 2 uses of a constant seems plainly better for VOP1/VOP2 ops. Above that, it becomes a code size vs. latency tradeoff.

arsenm added inline comments.May 16 2022, 6:23 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

This decision is also generally made by SIFoldOperands. Probably need to fix it there and not in the DAG

foad added inline comments.May 16 2022, 6:35 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

I'm strongly in favour of never pulling out the constant (or rather, always folding into the instruction) and I have patches to that effect starting with D114643, which I'm hoping to get back to pretty soon.

RKSimon updated this revision to Diff 431345.May 23 2022, 5:40 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

foad added a comment.May 23 2022, 5:54 AM

AMDGPU changes LGTM.

RKSimon added inline comments.
llvm/test/CodeGen/ARM/uxtb.ll
112

instcombine optimises this as well:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

which has the same problem:

_test10:
@ %bb.0:
        mov     r1, #248
        mov     r2, #7
        orr     r1, r1, #16252928
        orr     r2, r2, #458752
        and     r1, r1, r0, lsr #7
        and     r0, r2, r0, lsr #12
        orr     r0, r0, r1
        bx      lr
RKSimon added inline comments.May 26 2022, 10:38 AM
llvm/test/CodeGen/Thumb2/thumb2-uxtb.ll
185

same problem - instcombine will have already optimized this to:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

It feels like I'm avoiding the issue - but should I update the arm/thumb2 UXTB16 tests to match what the middle-end will have generated?

dmgreen added inline comments.May 27 2022, 6:37 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

I was taking a look. The test is super old now, so old that it had signed types when it was originally added.

I was surprised to see that and 0x70007 is being recognised via an and 0xff00ff tablegen pattern - it goes into SelectionDAGISel::CheckAndMask which checks that the other mask bits are already 0.

I think that is what this is trying to test - that a smaller and mask still matches the UXTB16. Is it possible to change it to something that still captures that, without relying on the multi-use fold of the %tmp2 not happening?

Maybe something like this?

%p = and i32 %p0, 3
%a = shl i32 65537, %p
%b = lshr i32 %a, 1
%tmp7 = and i32 %b, 458759
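
(For reference, the check dmgreen describes lives in SelectionDAGISel::CheckAndMask; the sketch below is a simplified, self-contained paraphrase of the idea, not the verbatim LLVM code, which works on APInt/KnownBits and SDValues.)

#include <cstdint>

// Simplified paraphrase: a tablegen pattern mask such as 0xff00ff may still
// match a smaller constant such as 0x70007, provided every bit the smaller
// mask "drops" is already known to be zero in the input value.
bool andMaskStillMatches(uint64_t ActualMask, uint64_t DesiredMask,
                         uint64_t KnownZeroOfInput) {
  if (ActualMask == DesiredMask)
    return true;
  // The actual mask must not keep any bit the pattern mask would clear.
  if (ActualMask & ~DesiredMask)
    return false;
  // Bits the pattern demands but the actual mask drops must already be zero.
  uint64_t NeededMask = DesiredMask & ~ActualMask;
  return (NeededMask & ~KnownZeroOfInput) == 0;
}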
RKSimon added inline comments.May 30 2022, 1:59 PM
llvm/test/CodeGen/ARM/uxtb.ll
112

Thanks for the hint - I'll give it a try

RKSimon updated this revision to Diff 433345.Jun 1 2022, 3:23 AM

rebase with alternative uxtb16 tests

RKSimon added inline comments.Jun 1 2022, 3:25 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

Thanks @dmgreen - those still match fine. Should I pre-commit these new tests and possibly alter the existing test10 variants with the -instcombine optimized IR to show they already fail to match?

dmgreen added inline comments.Jun 1 2022, 7:03 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

That sounds good to me.

RKSimon updated this revision to Diff 433398.Jun 1 2022, 7:42 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

RKSimon added inline comments.
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
129–139

@jonpa @uweigand These tests are proving very fragile depending on the order of ands/shifts - do you think SystemZ should prefer masking leading/trailing bits with shift pairs over shift+and / and+shift? We have TLI::shouldFoldConstantShiftPairToMask to handle that.
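
(For reference, the hook in question is a virtual TargetLowering query whose base implementation simply returns true; the hypothetical SystemZ override below is purely illustrative - no such override exists in-tree as far as this review shows.)

// Hypothetical override in SystemZISelLowering, shown only to illustrate the
// TLI hook referred to above. Returning false keeps (srl (shl X, C1), C2) as
// a shift pair instead of folding it into a shift + and-mask.
bool SystemZTargetLowering::shouldFoldConstantShiftPairToMask(
    const SDNode *N, CombineLevel Level) const {
  return false;
}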

uweigand added inline comments.Jun 10 2022, 5:23 AM
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
129–139

Well, this specific test only loads and then stores a 3 x i31 vector unmodified, so ideally, however the masking is done, it should be optimized away as unnecessary in either case. That's what currently happens; I'm not sure why this is changing with this PR.

In general, I think using an and-mask would be preferable over a shift pair on SystemZ.

RKSimon planned changes to this revision.Jun 10 2022, 5:32 AM

Thanks @uweigand I'll take another look at this soon

RKSimon updated this revision to Diff 438291.Jun 20 2022, 1:36 AM

rebase after D125836

RKSimon planned changes to this revision.Jun 20 2022, 1:36 AM
RKSimon updated this revision to Diff 443924.Jul 12 2022, 5:36 AM
RKSimon edited the summary of this revision. (Show Details)

rebase and prefer SimplifyDemandedBits over GetDemandedBits for trunc stores

RKSimon updated this revision to Diff 444009.Jul 12 2022, 10:42 AM

Added (or (and X, C1), (and (or X, Y), C2)) -> (or (and X, C1|C2), (and Y, C2)) fold to try to reduce the SystemZ regression
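
(For what it's worth, that fold is plain bitwise distribution, so it holds for arbitrary X, Y and constant masks C1, C2:)

  (X & C1) | ((X | Y) & C2)
    = (X & C1) | (X & C2) | (Y & C2)    ; distribute & over |
    = (X & (C1 | C2)) | (Y & C2)        ; merge the two masks on X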

spatel added inline comments.Jul 12 2022, 1:25 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

This could be a preliminary patch. I don't think we'd get that in IR either (even without extra uses):
https://alive2.llvm.org/ce/z/g61VRe

spatel added inline comments.Jul 12 2022, 1:50 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

If I'm reading the SystemZ debug spew correctly, we should have gotten this transform to fire twice, so it would do this:
https://alive2.llvm.org/ce/z/tUsepa
...but we miss it because we don't revisit the last 'or' node? Is that what D127115 would solve?

RKSimon added inline comments.Sun, Jul 17, 11:21 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

I've confirmed that D127115 solves the SystemZ fun3 regression but not fun2

IMO the fun2 regression probably shouldn't block the patch from being merged. I've looked into the sequences, and actually neither of them is even close to optimal.

Looking at the semantics, we have 8 x i32 inputs, which need to be truncated to i31, concatenated, and then stored, occupying 31 bytes of memory. Memory is written via three 8-byte stores, followed by a 4-byte, a 2-byte, and a 1-byte store, which does look optimal to me. However, the computation of the 64-bit values to be stored is not.

The first of these should be the value

(A << 33) | ((B << 2) & 0x1fffffffc) | ((C >> 29) & 3)

where A, B, and C are the first three i32 inputs.

However, the computation being performed is more like

((A << 25) | ((B >> 6) & 0x01ffffff)) << 8
| ((B << 58) | ((C & 0x7fffffff) << 27)) >> 56

which gets the correct result, but in about double the number of instructions or cycles that should be required.

While the variant with this PR is even slightly worse than the variant before, that's probably not really relevant given the fact both sequences are rather inefficient. Ideally, we could fix this to get (close to) an optimal sequence, but that would be a different issue. (I'm not even sure yet whether the current inefficiency is due to the middle end or the back end.)
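
(To make the intended layout concrete, here is a minimal C++ restatement of the ideal packing of the first 64-bit word; packFirstWord and its parameter names are made up for this sketch and are purely illustrative.)

#include <cstdint>

// A, B, C are the first three i32 inputs; only their low 31 bits are stored.
// Bits 63..33 hold A, bits 32..2 hold B, bits 1..0 hold the top two bits of C,
// matching (A << 33) | ((B << 2) & 0x1fffffffc) | ((C >> 29) & 3) above.
uint64_t packFirstWord(uint32_t A, uint32_t B, uint32_t C) {
  uint64_t a = A & 0x7fffffffu;
  uint64_t b = B & 0x7fffffffu;
  uint64_t c = C & 0x7fffffffu;
  return (a << 33) | (b << 2) | (c >> 29);
}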

Thanks - I have a lot of individual DAG / SimplifyDemanded patches in progress atm, plus we're now getting closer to completing D127115.

A few patches still have minor regressions that I'm addressing, but for this one in particular I've been wondering how much of a real-world issue illegal-type copies like this actually are. If we were further away from the 15.x branch I'd ask to get this in and make sure we address it once all the patches are in, but given how close we are I'm going to wait for now.

RKSimon updated this revision to Diff 447979.Wed, Jul 27, 3:19 AM
RKSimon retitled this revision from [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP) to [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits.
RKSimon edited the summary of this revision. (Show Details)

I think I've covered all the remaining regressions now - D129765 has cleaned up a number of annoying cases - including the SystemZ v3i31 copy test!

Thanks! SystemZ changes LGTM now as discussed above.

I think this patch is good to go now - any more comments?

foad added a comment.Thu, Jul 28, 3:43 AM

AMDGPU changes still LGTM.

spatel accepted this revision.Thu, Jul 28, 5:56 AM

x86 diffs LGTM

llvm/test/CodeGen/X86/ins_subreg_coalesce-1.ll
8–10

Not sure if this test still models some situation that we care about, but you could put a TODO note on it (don't need to copy to %ecx?).

This revision is now accepted and ready to land.Thu, Jul 28, 5:56 AM
This revision was landed with ongoing or failed builds.Thu, Jul 28, 6:11 AM
This revision was automatically updated to reflect the committed changes.