This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Disable generation of fshl/fshr for rotates
ClosedPublic

Authored by pmatos on May 16 2023, 7:30 AM.

Details

Summary

This commits overrides the default lowering of rotates to fshl/fshr.
It lowers the rotate straight to a target dependent rotate intrinsic.

Diff Detail

Event Timeline

pmatos created this revision.May 16 2023, 7:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2023, 7:30 AM
pmatos requested review of this revision.May 16 2023, 7:30 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 16 2023, 7:30 AM

Proposal to fix https://github.com/llvm/llvm-project/issues/62703

This not ready to land yet but it should open a discussion on if this is a good fix for this issue.
If we want to go this route we need: support for 64bits, tests.

A quick summary of why this is needed. The function :

inline u32 bswap(u32 x) {
    return __builtin_rotateleft32((x & 0xFF00FF00), 8) 
        | __builtin_rotateright32((x & 0x00FF00FF), 8);
  }

generates really poor code in WebAssembly. A rotate should be generated but instead we get quite a large amount of code just to set up constants and perform a shift. The issue is that the rotateleft is lowered to a fshl(%and, %and, 8). During instcombine, the second argument is simplified since %and is the result of a bitwise operation. However, this results in the wasm backend not being able to generate the rotate since the first and second arguments to the fshl are not any longer the same. The generated code is:

bswap:                                  # @bswap
	.functype	bswap (i32) -> (i32)
# %bb.0:                                # %entry
	local.get	0
	i32.const	24
	i32.shl
	local.get	0
	i32.const	65280
	i32.and
	i32.const	8
	i32.shl
	i32.or
	local.get	0
	i32.const	8
	i32.shr_u
	i32.const	65280
	i32.and
	local.get	0
	i32.const	24
	i32.shr_u
	i32.or
	i32.or
                                        # fallthrough-return
	end_function

With the fix this becomes:

bswap:                                  # @bswap
	.functype	bswap (i32) -> (i32)
# %bb.0:                                # %entry
	local.get	0
	i32.const	16711935
	i32.and
	i32.const	8
	i32.rotr
	local.get	0
	i32.const	-16711936
	i32.and
	i32.const	8
	i32.rotl
	i32.or
                                        # fallthrough-return
	end_function

The alternative way to implement something like this would be to block the instcombine on fshl/fshr instructions for the wasm backend, however adding target dependent stuff to instcombine feels even more icky than this patch. I welcome suggestions on how to improve the patch, since the hack in emitRotate is also not great, and I have not seen other places lowering a generic builtin into a target specific intrinsic.

Adding Sanjay as a reviewer since he implemented the clang emitRotate which we patched.

FWIW X86 seems to do something similar elsewhere in this file (https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGBuiltin.cpp#L985-L986), although it doesn't seem common otherwise. I think I'd be OK with this approach (and it does seem better than trying to mess with instcombine or a new TTI hook or something)

FWIW X86 seems to do something similar elsewhere in this file (https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGBuiltin.cpp#L985-L986), although it doesn't seem common otherwise. I think I'd be OK with this approach (and it does seem better than trying to mess with instcombine or a new TTI hook or something)

Oh - I missed that X86 bit. That gives me some confidence that this approach makes sense. I will work on a new patch with 64bit support and tests.

pmatos updated this revision to Diff 523149.May 17 2023, 1:15 PM

Generate rotates for 64bits as well. Add tests.

Ready for review.

nikic requested changes to this revision.May 17 2023, 2:37 PM
nikic added a subscriber: nikic.

This doesn't looks like a wasm specific problem. You get essentially the same issue on any target that has a rotate instruction but no funnel shift instruction. Here are just a couple examples: https://godbolt.org/z/8v6nfaax9

I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.

Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).

This revision now requires changes to proceed.May 17 2023, 2:37 PM

This doesn't looks like a wasm specific problem. You get essentially the same issue on any target that has a rotate instruction but no funnel shift instruction. Here are just a couple examples: https://godbolt.org/z/8v6nfaax9

Yes, I am indeed aware this is not specific to wasm. What's specific to wasm afaiu is that the code generated is much worse when expanding fshl. That's what I mentioned in the bug discussion here: https://github.com/llvm/llvm-project/issues/62703#issuecomment-1548474310

I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.

Preventing the simplification means adding target specific code in instcombine which seems even worse than adding it here given as @dschuff
pointed out, there's precedent with x86.

Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).

I just added the intrinsics, so those optimizations were not added yet.

Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).

Another thing wrt this optimization regard is that Wasm is slightly different from other targets in that it's not natively executed but instead executed through a VM or passed through binaryen for further optimization. I just tested this and emcc which passes the resulting file through binaryen performs this optimization. I imagine V8 won't have problems with this code either, therefore the missing optimization in llvm is not problematic for the target.

Preventing the simplification means adding target specific code in instcombine which seems even worse than adding it here given as @dschuff
pointed out, there's precedent with x86.

How harmful is it to avoid breaking rotate patterns even if the target doesn't support rotate?

nikic added a comment.May 18 2023, 1:15 AM

This doesn't looks like a wasm specific problem. You get essentially the same issue on any target that has a rotate instruction but no funnel shift instruction. Here are just a couple examples: https://godbolt.org/z/8v6nfaax9

Yes, I am indeed aware this is not specific to wasm. What's specific to wasm afaiu is that the code generated is much worse when expanding fshl. That's what I mentioned in the bug discussion here: https://github.com/llvm/llvm-project/issues/62703#issuecomment-1548474310

I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.

Preventing the simplification means adding target specific code in instcombine which seems even worse than adding it here given as @dschuff
pointed out, there's precedent with x86.

I'm not suggesting to add any target specific code to instcombine. I think there are actually quite a few different ways this could be solved. See https://llvm.godbolt.org/z/f55K7K17W for three possible representations of the same rotate pattern.

  1. Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same. Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.
  1. Undo the transform in the backend. It is bidrectional (https://alive2.llvm.org/ce/z/Chb85F), so this is possible. This would need an extra legalization/combiner pattern (depending on where we form ROTs). Advantage of undoing the pattern is the usual one: Works if it was in the undesirable form in the first place. (E.g. this could happen if the rotate did not use a builtin but was implemented as x << 8 | x >> 24, which is probably much more widespread than the builtin. Though I checked, and we don't form a funnel shift for it in this case right now.)
  1. Move the and from the fsh arguments to the result. This is the rot3 pattern. This seems to produce the best codegen on average, because it can use uxtb16 on ARM. Moving the and from args to return is a bit unusual for unary ops, but if we see this as moving two ands on both fsh arguments (which happen to be the same) to one on the result, that would be a pretty standard transform.

I think all of those options are viable, and couldn't say for certain which one is best. I think any of them would be better than making clang emit a special intrinsic just for wasm though.

@nikic Thank you for the thorough suggestions above. I will have to look at this closer next week and will work on an alternative solution.

Preventing the simplification means adding target specific code in instcombine which seems even worse than adding it here given as @dschuff
pointed out, there's precedent with x86.

How harmful is it to avoid breaking rotate patterns even if the target doesn't support rotate?

Hi Craig, I thought initially your question was for Nikita but it's apparently for me. I am sorry but I am not sure I understand your question. Could you please rephrase?

  1. Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same. Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.

I am surprised this option is viable for example. This was my initial thought to avoid the rotate, but I assumed adding something like :

if (!getTarget().getTriple().isWasm()) {
  APInt DemandedMaskLHS(DemandedMask.lshr(ShiftAmt));
  APInt DemandedMaskRHS(DemandedMask.shl(BitWidth - ShiftAmt));
  if (SimplifyDemandedBits(I, 0, DemandedMaskLHS, LHSKnown, Depth + 1) ||
      SimplifyDemandedBits(I, 1, DemandedMaskRHS, RHSKnown, Depth + 1))
    return I;
}

would not be well received. Also, I cannot find precedent for doing this.

pmatos added a comment.EditedMay 24 2023, 6:29 AM
  1. Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same. Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.

OK, I just re-read your comment above and I am starting to assume that what you mean is skipping the optimization for all targets if the funnel shift is a rotate (i.e. same first two operands). Is this correct?

nikic added a comment.May 24 2023, 6:31 AM
  1. Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same. Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.

OK, I just re-read your comment above and I am starting to assume that what you mean is skipping the optimization for all targets if the funnel shift is a rotate (i.e. same first two operands). Is this correct?

That's right.

Preventing the simplification means adding target specific code in instcombine which seems even worse than adding it here given as @dschuff
pointed out, there's precedent with x86.

How harmful is it to avoid breaking rotate patterns even if the target doesn't support rotate?

Hi Craig, I thought initially your question was for Nikita but it's apparently for me. I am sorry but I am not sure I understand your question. Could you please rephrase?

My question was equivalent to @nikic 's option 1.

pmatos updated this revision to Diff 526044.May 26 2023, 6:22 AM

Update the patch by removing target specific changes in CGBuiltin.
Leave fshl/fshr unchanged for rotates. This actually fixes a todo in fshl/r test.

@nikic What do you think of the current patch?

nikic added inline comments.May 26 2023, 6:36 AM
llvm/test/Transforms/InstCombine/fsh.ll
664

We still want to simplify this case. Could possibly be done by checking whether all demanded bits are zero for one of the operands in the rotate case.

pmatos added inline comments.May 26 2023, 7:00 AM
llvm/test/Transforms/InstCombine/fsh.ll
664

Ah, yes, right. That should be just a simple shift right. Will see how to still allow that change. Thanks.

pmatos added inline comments.May 29 2023, 1:55 AM
llvm/test/Transforms/InstCombine/fsh.ll
664

I am still looking into the best way to handle this case. The issue is that we only know if the demanded bits are zero when analyzing the uses of the value. This is done in SimplifyDemandedBits which in turn calls SimplifyDemandedUseBits, but we cannot call these functions to obtain demanded bits because they'll change the instruction straightaway. I was seeing if there was a way to do the checks inside the block of code already changed, but I don't think that'll be possible. I might have to add a check to SimplifyDemandedUseBits to only simplify in this specific case we want.

nikic added inline comments.May 29 2023, 12:37 PM
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
915–935

You should be able to do something along these lines.

pmatos updated this revision to Diff 526556.May 30 2023, 2:51 AM

Implement optimization when demanded bits are known, skip otherwise for rotates.

@nikic Things look much better now. Thanks for your help with the changes in InstCombine. What do you think?

pmatos marked an inline comment as done.May 30 2023, 2:51 AM
pmatos updated this revision to Diff 526559.May 30 2023, 2:55 AM

Fix up some spacing issues.

pmatos marked an inline comment as done.May 30 2023, 2:57 AM
pmatos retitled this revision from [WebAssembly] Disable generation of fshl/fshr for rotates to Disable generation of fshl/fshr for rotates.
pmatos updated this revision to Diff 526577.May 30 2023, 4:38 AM

Apply clang-format.

nikic retitled this revision from Disable generation of fshl/fshr for rotates to [InstCombine] Disable generation of fshl/fshr for rotates.May 31 2023, 4:01 AM
nikic added a comment.May 31 2023, 4:05 AM

Can you please drop all wasm related tests and instead add an InstCombine test for the fsh+and pattern?

It would also be good to have a test where we can fold one side to a constant, but that constant is not zero. We should then consider whether that is profitable or not. (In that case we can't reduce to a simple shift and will reduce to a shift and or with constant instead -- is that better or worse than a rotate?)

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
933

Calling SimplifyDemandedBits() for the case of known demanded bits is a bit odd, I'd probably write something like this instead:

KnownBits LHSKnown = computeKnownBits(I->getOperand(0), Depth + 1, I);
if (DemandedMaskLHS.isSubsetOf(LHSKnown.Zero | LHSKnown.One)) {
  replaceOperand(I, 0, Constant::getIntegerValue(VTy, LHSKnown.One);
  return &I;
}
KnownBits RHSKnown = computeKnownBits(I->getOperand(1), Depth + 1, I);
if (DemandedMaskRHS.isSubsetOf(LHSKnown.Zero | RHSKnown.One)) {
  replaceOperand(I, 1, Constant::getIntegerValue(VTy, RHSKnown.One);
  return &I;
}
pmatos updated this revision to Diff 527326.Jun 1 2023, 2:08 AM

Simplify code according to @nikic suggestion. Add tests.

pmatos marked an inline comment as done.Jun 1 2023, 2:09 AM
pmatos added a comment.Jun 1 2023, 2:14 AM

Can you please drop all wasm related tests and instead add an InstCombine test for the fsh+and pattern?

It would also be good to have a test where we can fold one side to a constant, but that constant is not zero. We should then consider whether that is profitable or not. (In that case we can't reduce to a simple shift and will reduce to a shift and or with constant instead -- is that better or worse than a rotate?)

I have added the tests. I looked into the output of WebAssembly and it looks good. Even in the case of the generation of a non-zero const, Wasm still managed to generate a rotate which is generally more profitable, since the runtime can be then the one choosing how to implement that depending on the hardware. In the general case, I am not sure what the right answer is tbh.

nikic accepted this revision.Jun 1 2023, 6:15 AM

LGTM, let's give it a try. The patch description needs an update though.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
920

Add some more explanation here, something like:
// Avoid converting rotate into funnel shift. Only simplify if one operand is constant.

This revision is now accepted and ready to land.Jun 1 2023, 6:15 AM
This revision was automatically updated to reflect the committed changes.
pmatos added a comment.Jun 1 2023, 6:32 AM

Landed, thanks for your patience. Lets hope it sticks.