This is an archive of the discontinued LLVM Phabricator instance.

[TargetLowering] Optimize expanded SRL/SHL fed into SETCC ne/eq 0
Abandoned · Public

Authored by fzhinkin on Oct 11 2021, 3:12 AM.

Details

Summary

During legalization, SHL/SRL nodes may be expanded into an expression
that rotates the low/high part of the original input and then ORs the
rotated part with the other part. If the result of this operation is
compared for equality/inequality with zero, the rotation can be removed,
as it does not affect the comparison result.

Bug report: https://llvm.org/PR50197
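
For illustration, a hedged example of the pattern (constructed here, not taken from the patch): on a 32-bit target the i64 shift below is expanded over the lo/hi halves of %a as lo' = lo << 7 and hi' = (hi << 7) | (lo >> 25), and the compare becomes (lo' | hi') != 0. Inside that OR, (lo << 7) | (lo >> 25) is a rotate of lo, and a rotate is nonzero exactly when its input is, so the reduction simplifies to (lo | (hi << 7)) != 0.

define i1 @expanded_shl_setcc(i64 %a) {
  %shl = shl i64 %a, 7
  %cmp = icmp ne i64 %shl, 0
  ret i1 %cmp
}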

Diff Detail

Event Timeline

fzhinkin created this revision. Oct 11 2021, 3:12 AM
fzhinkin requested review of this revision. Oct 11 2021, 3:12 AM
Herald added a project: Restricted Project. Oct 11 2021, 3:12 AM
fzhinkin retitled this revision from [TargetLowering] Optimize expanded SRL/SHL feeded into SETCC ne/eq 0 to [TargetLowering] Optimize expanded SRL/SHL fed into SETCC ne/eq 0. Oct 11 2021, 3:15 AM
fzhinkin added reviewers: spatel, lebedev.ri, RKSimon.
craig.topper added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3411

Please use SDValue. LLVM is pretty conservative about the use of auto. https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

llvm/test/CodeGen/ARM/arm-icmp-shift-opt.ll
118 ↗(On Diff #378602)

"does not match" -> "do not match" since constants is plural.

fzhinkin updated this revision to Diff 378795. Oct 11 2021, 2:17 PM

Fixed typos, cleaned up the code.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3411

Thanks for pointing it out, fixed.

please can you pre-commit these new tests to trunk with current codegen and then rebase to show the diff?

llvm/test/CodeGen/AArch64/arm64-icmp-shift-opt.ll
1 ↗(On Diff #378795)

rename icmp-shift-opt.ll?

llvm/test/CodeGen/ARM/arm-icmp-shift-opt.ll
1 ↗(On Diff #378795)

rename icmp-shift-opt.ll?

llvm/test/CodeGen/X86/icmp-shift-opt.ll
8

add nounwind to reduce cfi noise

fzhinkin updated this revision to Diff 381050. Oct 20 2021, 12:09 PM

Renamed test files, added nounwind attribute.

fzhinkin updated this revision to Diff 381086. Oct 20 2021, 1:38 PM

Fixed typos in ARM tests.

please can you pre-commit these new tests to trunk with current codegen and then rebase to show the diff?

I don't have permission to commit changes, so I'd appreciate it if you could help me with it. Here's the patch adding the new tests, with checks generated for current trunk:

Done (sorry for the delay) - please can you rebase?

fzhinkin updated this revision to Diff 381553. Oct 22 2021, 8:12 AM

Rebased to current trunk

Thank you! Rebase done.

Also, thanks for adding i686 to the X86 test; it revealed that my optimization does not work when we're legalizing i128 to i32. I'll check whether that case can be easily supported.

Handling the trees generated during legalization of i128/i256/etc. to i32 is relatively simple, but in the case of i686 some of these expressions are folded into funnel shifts before SimplifySetCC is applied to the setcc.
I see two options here (but I'm far from being an expert, so correct me if I'm wrong and there are simpler alternatives):

  1. support various instructions (funnel shifts, rotations, bit/byte swaps) in TargetLowering::optimizeSetCCOfExpandedShift in addition to SHL/SRL;
  2. support only SHL/SRL in TargetLowering::optimizeSetCCOfExpandedShift and apply it in DAGTypeLegalizer::IntegerExpandSetCCOperands right after the setcc's operands are expanded.

Personally I'm leaning towards the second option, as it should be less fragile and easier to maintain.

Makes sense to try (2) first - although I expect at least partial support for (1) might end up being required - you are handling a pattern that is almost a funnel shift much of the time.

fzhinkin updated this revision to Diff 383852. Nov 1 2021, 12:12 PM

Reimplemented expanded shift matching to handle funnel shifts.

Unfortunately (2) doesn't work well, because nodes created during shift expansion may have several uses until type legalization is finished, so I gave up on it.

Instead I added support for funnel shifts in TargetLowering::optimizeSetCCOfExpandedShift (I did not support rotations and bit/byte swaps because such nodes should not be created while combining the expanded shift).

While the optimization now works fine for i686, there is an issue with AArch64: shifts expanded from types wider than i128 won't be optimized (see @opt_setcc_shl_ne_zero_i256) because on AArch64 funnel-shift-like patterns are combined into AArch64ISD::EXTR instead of FSHL/FSHR. I attempted to fix it by implementing (2), but the solution was fragile and didn't work in some cases.

I'm hoping D112443 will help with this

D112443 has been committed - please can you see if it helps?

Unfortunately it didn't affect AArch64ISD::EXTR combining.

X86 changes look good. Are the AArch64 changes actually regressions, or are they just changes?

Thanks!
There is no regression for AArch64: a better code sequence is generated for legalized i128 shifts, but there is no improvement for wider types (like i256).

A change that only triggers for some (common?) cases and causes no regressions is a better step forward than a change that indiscriminately triggers on everything and causes widespread regressions, I would think :)

I agree. I don't think that i256 and wider types are something you see frequently in real code.

RKSimon added inline comments. Nov 3 2021, 3:14 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3399

Lots of comment duplication - move the description to the header, and keep the code snippet here?

3440

const APInt &C

3476

(style) Break apart if-else chains when they return.

3486

const APInt &CVal =

3497

Why was i4096 of particular interest?

We have a default depth limit we use for this kind of thing: SelectionDAG::MaxRecursionDepth = 6. Also, we normally increment a depth up to that value instead of decrementing a height.
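
A minimal sketch of that convention (hypothetical helper name and signature, not code from this patch):

static SDValue matchExpandedShift(SDValue V, SelectionDAG &DAG,
                                  unsigned Depth) {
  // Stop once the shared limit is reached; recursive calls pass Depth + 1
  // rather than counting a remaining height down to zero.
  if (Depth >= SelectionDAG::MaxRecursionDepth)
    return SDValue();
  // ... inspect V's opcode and recurse on its operands with Depth + 1 ...
  return SDValue();
}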

3508

(style) Don't use auto

3517

Can't you avoid pushing results by just OR'ing the results once?

SDValue Reduction = Results[0];
for (size_t I = 1, E = Results.size(); I < E; ++I)
  Reduction = DAG.getNode(ISD::OR, DL, N0.getValueType(), Reduction, Results[I]);
fzhinkin added inline comments. Nov 3 2021, 4:26 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3497

Nothing really interesting about i4096; I just mentioned it to illustrate that a max height of 8 is large enough for any type someone will use in actual code.

I'll change 8 to SelectionDAG::MaxRecursionDepth and switch from decrementing to incrementing. Thanks!

3517

I'm pushing ORs back to the results list to generate a balanced tree (and shorten the critical path). And it pays off, at least for ARM:

; llc -O3 -mtriple=armv7 test.ll
define i1 @opt_setcc_shl_ne_zero_i128(i128 %a) nounwind {
   %shl = shl i128 %a, 17
   %cmp = icmp ne i128 %shl, 0
   ret i1 %cmp
}

Code generated using current implementation:

opt_setcc_shl_ne_zero_i128:
@ %bb.0:
	orr	r2, r2, r3, lsl #17
	orr	r0, r1, r0
	orrs	r0, r0, r2
	movwne	r0, #1
	bx	lr

Code generated using implementation OR'ing in place:

opt_setcc_shl_ne_zero_i128:
@ %bb.0:
	orr	r0, r1, r0
	orr	r0, r0, r2
	orr	r0, r0, r3, lsl #17
	cmp	r0, #0
	movwne	r0, #1
	bx	lr
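
For reference, a hedged sketch of that balanced reduction (illustrative names; DAG, DL, and the Results list are assumed from the surrounding code, and this is not the patch's actual implementation):

// Combine two leading entries and append the new OR so that it is itself
// combined in a later round: the tree depth is ceil(log2(N)) instead of
// the N - 1 of a left-to-right fold.
EVT VT = Results.front().getValueType();
SmallVector<SDValue, 8> Queue(Results.begin(), Results.end());
unsigned I = 0;
while (Queue.size() - I > 1) {
  SDValue LHS = Queue[I++];
  SDValue RHS = Queue[I++];
  Queue.push_back(DAG.getNode(ISD::OR, DL, VT, LHS, RHS));
}
SDValue Reduced = Queue[I];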
RKSimon added inline comments. Nov 3 2021, 5:27 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3517

OK - maybe mention that in the comment?

fzhinkin added inline comments. Nov 3 2021, 7:31 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3517

Updated the comment and fixed all other issues you've mentioned earlier.

RKSimon added inline comments. Nov 3 2021, 11:54 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3410

Use isNullOrNullSplat?

fzhinkin updated this revision to Diff 384556. Nov 3 2021, 12:28 PM

Simplified expression in assertion.

RKSimon added inline comments. Nov 7 2021, 5:25 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

pre-commit?

fzhinkin added inline comments. Nov 8 2021, 1:06 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

I'd appreciate it if you could help me with committing it:

fzhinkin added inline comments. Nov 8 2021, 6:55 AM
llvm/test/CodeGen/AArch64/icmp-shift-opt.ll
151

thanks!

RKSimon added inline comments. Nov 8 2021, 8:06 AM
llvm/test/CodeGen/ARM/icmp-shift-opt.ll
150

sorry - I didn't add this one - I'll commit this shortly.

RKSimon added inline comments. Nov 8 2021, 8:13 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

This is doing something pretty similar; it's more limited to the 'concat' pattern, but handles the -1 'allbits' case as well.

fzhinkin added inline comments. Nov 8 2021, 8:47 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

Are you suggesting extracting and reusing code common to both optimizations, or supporting -1 in optimizeSetCCOfExpandedShift?

llvm/test/CodeGen/ARM/icmp-shift-opt.ll
150

Thank you.

fzhinkin added inline comments. Nov 17 2021, 12:32 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

It does not make sense to support -1 in optimizeSetCCOfExpandedShift because a shifted value (the method supports only logical shifts) is never equal to -1.

To me it seems that the "concat" optimization and optimizeSetCCOfExpandedShift are not so similar. The "concat" opt merges OR arms in place, but optimizeSetCCOfExpandedShift eliminates shift pairs that are scattered across the tree, and the whole tree has to be scanned before the final OR reduction is applied.

RKSimon added inline comments. Nov 22 2021, 3:30 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

OK - what about the CmpZero case - is that redundant dead code now?

fzhinkin added inline comments. Nov 22 2021, 10:11 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

The CmpZero case is redundant only if Y's type was lowered to a pair of values of a narrower type (X and Y are i64 and the compilation target is armv7).

However, the concat optimization still works for the following cases, which are not supported in optimizeSetCCOfExpandedShift:

  • X and Y whose type was already legal (X and Y are i64, x86_64 is the target);
  • X and Y are vectors.

I did not support vector types in optimizeSetCCOfExpandedShift because it seems impossible to get a DAG with the appropriate shape during vector legalization.

fzhinkin added inline comments. Dec 7 2021, 11:11 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

MergeConcat's CmpZero case will continue to work, and there is no easy way to exclude cases that could be handled by both optimizeSetCCOfExpandedShift and MergeConcat. So I think it makes sense to apply optimizeSetCCOfExpandedShift after MergeConcat, and only when type legalization is complete.

fzhinkin updated this revision to Diff 399363. Jan 12 2022, 9:50 AM

Reordered MergeConcat and optimizeSetCCOfExpandedShift optimizations.

fzhinkin added inline comments. Jan 17 2022, 9:53 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4155

So I think it makes sense to apply optimizeSetCCOfExpandedShift after MergeConcat, and only when type legalization is complete.

optimizeSetCCOfExpandedShift cannot be disabled during the type legalization phase, because some cases would not be handled then.

RKSimon edited the summary of this revision. Feb 7 2022, 5:11 AM
RKSimon added inline comments. Feb 7 2022, 5:16 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3456

You might be better off pulling this lambda out as a static helper function instead of relying on recursive lambda calls?

fzhinkin added inline comments. Feb 7 2022, 11:08 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3456

Yes, thanks for the suggestion. I think it'll be better to move both lambdas, as well as the result reduction, into a helper class.

fzhinkin updated this revision to Diff 406921. Feb 8 2022, 12:04 PM

Moved most of the optimization's code into a helper class.

RKSimon added inline comments. Feb 16 2022, 9:48 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3413

don't pass SDValue by reference - it's just a pointer + int - it should pass by value very cheaply

3469

don't pass SDValue by reference

fzhinkin updated this revision to Diff 409381. Feb 16 2022, 1:18 PM

Pass SDValue by value.

I didn't step through everything here, so I may not be seeing the entire problem.

There are 2 or more relatively simple folds that are missing both in IR and DAG, and adding those might solve this more generally and more easily than the proposed patch.

Here are examples in Alive2:
https://alive2.llvm.org/ce/z/KNQuYm
https://alive2.llvm.org/ce/z/LKLpo3

So it might be possible to solve the motivating bug without starting from icmp/setcc -- it's really a problem of combining shifts and funnel shifts in a way that is better for analysis/codegen.

We can show a potential codegen improvement for x86 with a minimal example:

% cat fsh.ll          
declare i32 @llvm.fshl.i32 (i32, i32, i32)
declare i32 @llvm.fshr.i32 (i32, i32, i32)

define i32 @src(i32 %x, i32 %y) {
  %y5 = shl i32 %y, 5
  %fun3 = call i32 @llvm.fshr.i32(i32 %x, i32 %y, i32 27)
  %or2 = or i32 %fun3, %y5
  ret i32 %or2
}


define i32 @tgt(i32 %x, i32 %y) {
  %x5 = shl i32 %x, 5
  %rot3 = call i32 @llvm.fshl.i32(i32 %y, i32 %y, i32 5)
  %or2 = or i32 %rot3, %x5
  ret i32 %or2
}
% llc -o - fsh.ll
_src:
	shldl	$5, %esi, %edi  ; avoid shld if possible because it is slow on some targets
	movl	%esi, %eax
	shll	$5, %eax
	orl	%edi, %eax
	ret

_tgt:
	movl	%esi, %eax
	shll	$5, %edi
	roll	$5, %eax
	orl	%edi, %eax
	retq

If I'm seeing the patterns correctly, there's at least one more possibility.
If a target has a legal+fast funnel shift, we could eliminate the shift completely from this sequence instead of forming a rotate:
https://alive2.llvm.org/ce/z/PpGpy3
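
A hedged reconstruction of what that could look like for the @src/@tgt pair above (one reading of the suggestion; the Alive2 page itself is not reproduced here): ORing the operands first lets a single funnel shift cover the whole expression, so the standalone shl disappears, since (x << 5) | (y >> 27) | (y << 5) == ((x | y) << 5) | (y >> 27).

declare i32 @llvm.fshl.i32(i32, i32, i32)

define i32 @no_shift(i32 %x, i32 %y) {
  %xy = or i32 %x, %y
  %r = call i32 @llvm.fshl.i32(i32 %xy, i32 %y, i32 5)
  ret i32 %r
}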

The transformation from this change is not only about recombining shifts but mostly about removing shifts that rotate the bits of a value compared with zero. In that particular case such shifts can be safely eliminated, but in any other case it's not possible, as it would change the result.

By combining shifts differently (by replacing shift + funnel shift with rotate + shift, for example) we can improve performance on some platforms, but the case with icmp eq/ne 0 would still have to be handled separately, as it allows a transformation that is illegal otherwise.
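
A minimal Alive2-style illustration of that special case (an illustrative example, not one from the review): a rotate is zero exactly when its input is zero, so under icmp eq/ne 0 the rotate can be dropped, while a comparison against any other constant cannot be rewritten this way.

declare i32 @llvm.fshl.i32(i32, i32, i32)

define i1 @src(i32 %x) {
  %rot = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 5) ; rotl(x, 5)
  %cmp = icmp eq i32 %rot, 0
  ret i1 %cmp
}

define i1 @tgt(i32 %x) {
  %cmp = icmp eq i32 %x, 0
  ret i1 %cmp
}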

On the other hand, it's just a special case where we're using the identity instead of a rotation.

I agree that a general solution for shift combining could be developed and used instead of the transformation from this change.

@spatel I remembered why I didn't add a transformation to DAGCombiner and decided to perform the specific transformation in TargetLowering::SimplifySetCC:

  • to remove unnecessary shifts we have to know that the legalized shift was an input of setcc eq/ne 0, and expanded shifts are not yet combined when DAGCombiner is visiting the setcc;
  • and it's actually better to apply the optimization before shift combining, because for some targets the shifts could be combined into a target-specific node instead of a funnel shift (for example, on AArch64 shifts implementing a funnel shift will be combined into an EXTR node instead of FSHL/FSHR).

As a generalization I was thinking of extracting the rotation-matching part of this optimization into a separate method that would actually combine shifts into rotations, calling it from both TargetLowering::SimplifySetCC and DAGCombiner::MatchRotate, and then eliminating the rotations in TargetLowering::SimplifySetCC.
But it seems like or-shift expression trees deep enough for such an optimization to apply should be quite rare. In fact, I can hardly imagine cases other than setcc eq/ne 0.

I'm not convinced yet that we need to include the setcc to handle the motivating bug.

I noticed that we're missing an even more basic logic+shift transform (we miss this in IR too):
https://alive2.llvm.org/ce/z/RZjxTE

I think that transform should work with any shift opcode and any bitwise logic opcode (as long as they are matching opcodes within any one transform).
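
A hedged sketch of that shape (an illustrative pair; the linked Alive2 page is not reproduced here) with shl as the shift and or as the logic op; the same pattern should hold for shl/lshr/ashr paired with and/or/xor:

define i32 @src(i32 %x, i32 %y) {
  %sx = shl i32 %x, 5
  %sy = shl i32 %y, 5
  %r = or i32 %sx, %sy
  ret i32 %r
}

define i32 @tgt(i32 %x, i32 %y) {
  %xy = or i32 %x, %y
  %r = shl i32 %xy, 5
  ret i32 %r
}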

Would this patch still get the optimal result for the transformed code?

Now that we're reducing the pattern more with acb96ffd149db447, I see what you mean. We were missing a transform that already exists in IR: 69684b84c61c.

We still need 1-2 other patches to fix the motivating example, but I think we can get there with some small peepholes.

@fzhinkin @spatel Is this patch still relevant?

Herald added a project: Restricted Project. May 1 2022, 8:46 AM

@spatel fixed almost all cases (huge thanks for that), so this patch is no longer relevant.

fzhinkin abandoned this revision. May 2 2022, 1:58 AM