[TargetLowering] Optimize expanded SRL/SHL fed into SETCC ne/eq 0
Needs Review · Public

Authored by fzhinkin on Oct 11 2021, 3:12 AM.

Details

Summary

During legalization, SHL/SRL nodes may be expanded into an expression
that rotates the low/high part of the original input and then ORs the
rotated part with the other part. If the result of this operation is
compared for equality/inequality with zero, the rotation can be
removed, as it does not affect the comparison result.

Bug report: https://bugs.llvm.org/show_bug.cgi?id=50197
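
To illustrate the idea, here is a minimal standalone sketch in plain C++ (not the patch itself; the function names are made up) for an i64 value legalized into two i32 halves and shifted left by 0 < C < 32:

#include <cstdint>

// Expansion of ((Hi:Lo) << C) != 0 compares the OR of both result halves.
bool expandedShlNeZero(uint32_t Lo, uint32_t Hi, unsigned C) {
  uint32_t ResLo = Lo << C;
  uint32_t ResHi = (Hi << C) | (Lo >> (32 - C));
  return (ResLo | ResHi) != 0;
}

// (Lo << C) | (Lo >> (32 - C)) is rotl(Lo, C), which is zero exactly when
// Lo is zero, so the rotation can be dropped from the comparison.
bool optimizedShlNeZero(uint32_t Lo, uint32_t Hi, unsigned C) {
  return (Lo | (Hi << C)) != 0;
}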

Diff Detail

Event Timeline

fzhinkin created this revision.Oct 11 2021, 3:12 AM
fzhinkin requested review of this revision.Oct 11 2021, 3:12 AM
Herald added a project: Restricted Project.Oct 11 2021, 3:12 AM
fzhinkin retitled this revision from [TargetLowering] Optimize expanded SRL/SHL feeded into SETCC ne/eq 0 to [TargetLowering] Optimize expanded SRL/SHL fed into SETCC ne/eq 0.Oct 11 2021, 3:15 AM
fzhinkin added reviewers: spatel, lebedev.ri, RKSimon.
craig.topper added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3411

Please use SDValue. LLVM is pretty conservative about the use of auto. https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable
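
For instance (a hypothetical one-liner in the style the standard asks for, not the patch's actual code):

SDValue Op = N0.getOperand(0); // rather than: auto Op = N0.getOperand(0);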

llvm/test/CodeGen/ARM/arm-icmp-shift-opt.ll
118 ↗(On Diff #378602)

"does not match" -> "do not match" since constants is plural.

fzhinkin updated this revision to Diff 378795.Oct 11 2021, 2:17 PM

Fixed typos, cleaned up the code.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3411

Thanks for pointing that out; fixed.

please can you pre-commit these new tests to trunk with current codegen and then rebase to show the diff?

llvm/test/CodeGen/AArch64/arm64-icmp-shift-opt.ll
1 ↗(On Diff #378795)

rename icmp-shift-opt.ll?

llvm/test/CodeGen/ARM/arm-icmp-shift-opt.ll
1 ↗(On Diff #378795)

rename icmp-shift-opt.ll?

llvm/test/CodeGen/X86/icmp-shift-opt.ll
8

add nounwind to reduce cfi noise

fzhinkin updated this revision to Diff 381050.Oct 20 2021, 12:09 PM

Renamed test files, added nounwind attribute.

fzhinkin updated this revision to Diff 381086.Oct 20 2021, 1:38 PM

Fixed typos in ARM tests.

I don't have permission to commit changes, so I'd appreciate it if you could help with pre-committing the tests. Here's the patch adding the new tests, with checks generated against current trunk:

Done (sorry for the delay) - please can you rebase?

fzhinkin updated this revision to Diff 381553.Oct 22 2021, 8:12 AM

Rebased to current trunk

Thank you! Rebase done.

Also, thanks for adding i686 to the X86 test; it revealed that my optimization does not work when i128 is legalized to i32. I'll check whether that case can be easily supported.

Handling the trees generated during legalization of i128/i256/etc. to i32 is relatively simple, but in the i686 case some of these expressions are folded into funnel shifts before SimplifySetCC is applied to the setcc.
I see two options here (but I'm far from being an expert, so correct me if I'm wrong and there are simpler alternatives):

  1. support various instructions (funnel shifts, rotations, bit/byte swaps) in TargetLowering::optimizeSetCCOfExpandedShift in addition to SHL/SRL;
  2. support only SHL/SRL in TargetLowering::optimizeSetCCOfExpandedShift and apply it in DAGTypeLegalizer::IntegerExpandSetCCOperands right after setcc's operands expansion.

Personally, I'm leaning towards the second option, as it should be less fragile and easier to maintain.

Makes sense to try (2) first - although I expect at least partial support for (1) might end up being required - you are handling a pattern that is almost a funnel shift much of the time.

fzhinkin updated this revision to Diff 383852.Mon, Nov 1, 12:12 PM

Reimplemented expanded shift matching to handle funnel shifts.

Unfortunately, (2) doesn't work well, because nodes created during shift expansion may have several uses until type legalization finishes, so I gave up on it.

Instead, I added support for funnel shifts in TargetLowering::optimizeSetCCOfExpandedShift (I did not add rotations or bit/byte swaps, because such nodes should not be created while combining an expanded shift).

While the optimization now works fine for i686, there is an issue with AArch64: shifts expanded from types wider than i128 won't be optimized (see @opt_setcc_shl_ne_zero_i256), because on AArch64 funnel-shift-like patterns are combined into AArch64ISD::EXTR instead of FSHL/FSHR. I attempted to fix this by implementing (2), but the solution was fragile and didn't work in some cases.
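
For reference, the equivalence at play, as a small sketch with a made-up helper name: the expanded high word of a 64-bit shl matches the semantics of ISD::FSHL / llvm.fshl on 32-bit words, which is why the matcher also has to recognize funnel shift nodes.

#include <cstdint>

// fshl concatenates Hi:Lo and returns the top word after shifting left by C;
// for 0 < C < 32 this is exactly the expanded high word (Hi << C) | (Lo >> (32 - C)).
uint32_t fshl32(uint32_t Hi, uint32_t Lo, unsigned C) {
  C %= 32;
  return C == 0 ? Hi : (Hi << C) | (Lo >> (32 - C));
}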

I'm hoping D112443 will help with this

D112443 has been committed - please can you see if it helps?

Unfortunately it didn't affect AArch64ISD::EXTR combining.

X86 changes look good. Are the AArch64 changes actually regressions, or are they just changes?

Thanks!
There is no regression for AArch64: a better code sequence is generated for legalized i128 shifts, but there is no improvement for wider types (like i256).

A change that only triggers for some (common?) cases and causes no regressions is a better step forward
than a change that indiscriminately triggers on everything and causes widespread regressions, I would think :)

I agree. I don't think i256 and wider types are something you frequently see in real code.

RKSimon added inline comments.Wed, Nov 3, 3:14 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3399

Lots of comment duplication - move the description to the header, and keep the code snippet here?

3440

const APInt &C

3476

(style) Break apart if-else chains when they return.

3486

const APInt &CVal =

3497

Why was i4096 of particular interest?

We have a default depth limit for this kind of thing: SelectionDAG::MaxRecursionDepth = 6. Also, we normally increment a depth up to that value instead of decrementing a height.
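
The usual shape of such a walk, as a sketch with a hypothetical function name (not the patch's code):

// Depth counts up from 0; give up once SelectionDAG::MaxRecursionDepth is
// reached instead of recursing arbitrarily deep into the OR tree.
static bool isOrTreeOfParts(SDValue V, unsigned Depth) {
  if (Depth >= SelectionDAG::MaxRecursionDepth)
    return false;
  if (V.getOpcode() != ISD::OR)
    return true; // reached a leaf part
  return isOrTreeOfParts(V.getOperand(0), Depth + 1) &&
         isOrTreeOfParts(V.getOperand(1), Depth + 1);
}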

3508

(style) Don't use auto

3517

Can't you avoid pushing results by just OR'ing the results once?

SDValue Reduction = Results[0];
for (size_t I = 1, E = Results.size(); I < E; ++I)
  Reduction = DAG.getNode(ISD::OR, DL, N0.getValueType(), Reduction, Results[I]);
fzhinkin added inline comments.Wed, Nov 3, 4:26 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3497

Nothing really interesting about i4096; I just mentioned it to show that a max height of 8 is large enough for any type someone would use in actual code.

I'll change 8 to SelectionDAG::MaxRecursionDepth and switch from decrementing to incrementing. Thanks!

3517

I'm pushing the ORs back onto the results list to generate a balanced tree (and shorten the critical path). It pays off, at least for ARM:

; llc -O3 -mtriple=armv7 test.ll
define i1 @opt_setcc_shl_ne_zero_i128(i128 %a) nounwind {
   %shl = shl i128 %a, 17
   %cmp = icmp ne i128 %shl, 0
   ret i1 %cmp
}

Code generated by the current implementation:

opt_setcc_shl_ne_zero_i128:
@ %bb.0:
	orr	r2, r2, r3, lsl #17
	orr	r0, r1, r0
	orrs	r0, r0, r2
	movwne	r0, #1
	bx	lr

Code generated by an implementation that ORs in place:

opt_setcc_shl_ne_zero_i128:
@ %bb.0:
	orr	r0, r1, r0
	orr	r0, r0, r2
	orr	r0, r0, r3, lsl #17
	cmp	r0, #0
	movwne	r0, #1
	bx	lr
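
A sketch of that balanced reduction (variable names hypothetical, mirroring the snippet above): each new OR is appended back to the worklist, so values are combined pairwise and the final tree has logarithmic depth.

SmallVector<SDValue, 8> Worklist(Results.begin(), Results.end());
// Combine adjacent pairs; each freshly built OR lands at the end of the
// worklist and is itself combined in a later iteration.
for (unsigned I = 0; I + 1 < Worklist.size(); I += 2)
  Worklist.push_back(DAG.getNode(ISD::OR, DL, N0.getValueType(),
                                 Worklist[I], Worklist[I + 1]));
SDValue Reduction = Worklist.back();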
RKSimon added inline comments.Wed, Nov 3, 5:27 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3517

OK - maybe mention that in the comment?

fzhinkin added inline comments.Wed, Nov 3, 7:31 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3517

Updated the comment and fixed all other issues you've mentioned earlier.

RKSimon added inline comments.Wed, Nov 3, 11:54 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3410

Use isNullOrNullSplat?

fzhinkin updated this revision to Diff 384556.Wed, Nov 3, 12:28 PM

Simplified expression in assertion.

RKSimon added inline comments.Sun, Nov 7, 5:25 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

pre-commit?

fzhinkin added inline comments.Mon, Nov 8, 1:06 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

I'd appreciate it if you could help me with committing it:

fzhinkin added inline comments.Mon, Nov 8, 6:55 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

thanks!

RKSimon added inline comments.Mon, Nov 8, 8:06 AM
llvm/test/CodeGen/ARM/icmp-shift-opt.ll
150

sorry - I didn't add this one - I'll commit this shortly.

RKSimon added inline comments.Mon, Nov 8, 8:13 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4153

This is doing something pretty similar; it's more limited to the 'concat' pattern, but handles the -1 'allbits' case as well.

fzhinkin added inline comments.Mon, Nov 8, 8:47 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4153

Are you suggesting extracting and reusing the code common to both optimizations, or supporting -1 in optimizeSetCCOfExpandedShift?

llvm/test/CodeGen/ARM/icmp-shift-opt.ll
150

Thank you.

fzhinkin added inline comments.Wed, Nov 17, 12:32 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4153

It does not make sense to support -1 in optimizeSetCCOfExpandedShift, because a shifted value (the method supports only logical shifts) is never equal to -1.

To me, the "concat" optimization and optimizeSetCCOfExpandedShift don't seem that similar: the "concat" optimization merges the OR arms in place, whereas optimizeSetCCOfExpandedShift eliminates shift pairs that are scattered across the tree, and the whole tree has to be scanned before the final OR reduction is applied.

RKSimon added inline comments.Mon, Nov 22, 3:30 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4153

OK - what about the CmpZero case - is that redundant dead code now?

fzhinkin added inline comments.Mon, Nov 22, 10:11 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4153

The CmpZero case is redundant only if Y's type was lowered to a pair of values of a narrower type (e.g. X and Y are i64 and the compilation target is armv7).

However, the concat optimization still works for the following cases, which are not supported in optimizeSetCCOfExpandedShift:

  • X and Y whose type is legal (X and Y are i64, x86_64 is the target);
  • X and Y are vectors.

I did not support vector types in optimizeSetCCOfExpandedShift because it seems impossible to get a DAG of the appropriate shape during vector legalization.