This commit overrides the default lowering of rotates to fshl/fshr and instead lowers them straight to a target-dependent rotate intrinsic.
Proposal to fix https://github.com/llvm/llvm-project/issues/62703
This is not ready to land yet, but it should open a discussion on whether this is a good fix for the issue. If we want to go this route, we still need 64-bit support and tests.
A quick summary of why this is needed. Consider the function:

```c
inline u32 bswap(u32 x) {
  return __builtin_rotateleft32((x & 0xFF00FF00), 8) |
         __builtin_rotateright32((x & 0x00FF00FF), 8);
}
```
This generates really poor code in WebAssembly. A rotate should be generated, but instead we get a large amount of code just to set up constants and perform shifts. The issue is that the rotateleft is lowered to fshl(%and, %and, 8). During instcombine, the second argument is simplified, since %and is the result of a bitwise operation. However, this leaves the wasm backend unable to generate the rotate, because the first and second arguments of the fshl are no longer the same. The generated code is:
```
bswap:                                  # @bswap
        .functype bswap (i32) -> (i32)
# %bb.0:                                # %entry
        local.get 0
        i32.const 24
        i32.shl
        local.get 0
        i32.const 65280
        i32.and
        i32.const 8
        i32.shl
        i32.or
        local.get 0
        i32.const 8
        i32.shr_u
        i32.const 65280
        i32.and
        local.get 0
        i32.const 24
        i32.shr_u
        i32.or
        i32.or
                                        # fallthrough-return
        end_function
```
With the fix this becomes:
```
bswap:                                  # @bswap
        .functype bswap (i32) -> (i32)
# %bb.0:                                # %entry
        local.get 0
        i32.const 16711935
        i32.and
        i32.const 8
        i32.rotr
        local.get 0
        i32.const -16711936
        i32.and
        i32.const 8
        i32.rotl
        i32.or
                                        # fallthrough-return
        end_function
```
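For context on where the fshl comes from: clang's emitRotate implements the rotate builtins as funnel shifts with the input duplicated, so the backend can only match a rotate while both value operands stay identical. A simplified, self-contained sketch of that lowering (the helper name and signature here are illustrative, not the exact CGBuiltin.cpp code):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"

// rotl(x, n) == fshl(x, x, n) and rotr(x, n) == fshr(x, x, n).
// Once instcombine rewrites one of the duplicated value operands based
// on demanded bits, the backend can no longer recognize the rotate.
static llvm::Value *emitRotateSketch(llvm::IRBuilder<> &Builder,
                                     llvm::Value *Src, llvm::Value *Amt,
                                     bool IsRotateRight) {
  llvm::Intrinsic::ID IID =
      IsRotateRight ? llvm::Intrinsic::fshr : llvm::Intrinsic::fshl;
  llvm::Function *F = llvm::Intrinsic::getDeclaration(
      Builder.GetInsertBlock()->getModule(), IID, Src->getType());
  return Builder.CreateCall(F, {Src, Src, Amt});
}
```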
The alternative would be to block the instcombine on fshl/fshr instructions for the wasm backend, but adding target-dependent logic to instcombine feels even more icky than this patch. I welcome suggestions on how to improve the patch, since the hack in emitRotate is also not great, and I have not seen other places that lower a generic builtin into a target-specific intrinsic.
Adding Sanjay as a reviewer since he implemented the clang emitRotate which we patched.
FWIW, X86 seems to do something similar elsewhere in this file (https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGBuiltin.cpp#L985-L986), although it doesn't seem common otherwise. I think I'd be OK with this approach (and it does seem better than trying to mess with instcombine or a new TTI hook or something).
Oh, I missed that X86 bit. That gives me some confidence that this approach makes sense. I will work on a new patch with 64-bit support and tests.
This doesn't look like a wasm-specific problem. You get essentially the same issue on any target that has a rotate instruction but no funnel-shift instruction. Here are just a couple of examples: https://godbolt.org/z/8v6nfaax9
I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.
Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).
Yes, I am indeed aware this is not specific to wasm. What's specific to wasm afaiu is that the code generated is much worse when expanding fshl. That's what I mentioned in the bug discussion here: https://github.com/llvm/llvm-project/issues/62703#issuecomment-1548474310
> I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.
Preventing the simplification means adding target-specific code to instcombine, which seems even worse than adding it here, given that, as @dschuff pointed out, there's precedent with x86.
> Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).
I only just added the intrinsics, so those optimizations have not been added yet.
Another thing regarding this optimization: Wasm is slightly different from other targets in that it is not executed natively but instead run through a VM or passed through binaryen for further optimization. I just tested this, and emcc, which passes the resulting file through binaryen, performs this optimization. I imagine V8 won't have problems with this code either, so the missing optimization in LLVM is not problematic for this target.
> Preventing the simplification means adding target-specific code to instcombine, which seems even worse than adding it here, given that, as @dschuff pointed out, there's precedent with x86.
How harmful is it to avoid breaking rotate patterns even if the target doesn't support rotate?
I'm not suggesting adding any target-specific code to instcombine. I think there are actually quite a few different ways this could be solved. See https://llvm.godbolt.org/z/f55K7K17W for three possible representations of the same rotate pattern:
- Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same (see the sketch after this list). Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.
- Undo the transform in the backend. It is bidirectional (https://alive2.llvm.org/ce/z/Chb85F), so this is possible. This would need an extra legalization/combiner pattern (depending on where we form ROTs). The advantage of undoing the pattern is the usual one: it works if the code was in the undesirable form in the first place. (E.g. this could happen if the rotate did not use a builtin but was implemented as x << 8 | x >> 24, which is probably much more widespread than the builtin. Though I checked, and we don't currently form a funnel shift for it in this case.)
- Move the and from the fsh arguments to the result. This is the rot3 pattern. This seems to produce the best codegen on average, because it can use uxtb16 on ARM. Moving the and from args to return is a bit unusual for unary ops, but if we see this as moving two ands on both fsh arguments (which happen to be the same) to one on the result, that would be a pretty standard transform.
I think all of those options are viable, and couldn't say for certain which one is best. I think any of them would be better than making clang emit a special intrinsic just for wasm though.
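For concreteness, the first option amounts to guarding the existing demanded-bits simplification of fshl/fshr with a same-operand check. A minimal sketch, reusing the variable names of the surrounding InstCombineSimplifyDemanded.cpp code (the same names appear in the snippet quoted later in this thread):

```cpp
// Sketch of option 1: don't shrink the operands of a funnel shift that
// is really a rotate (both value operands identical), so backends can
// still match it to a hardware rotate instruction.
if (I->getOperand(0) != I->getOperand(1)) {
  APInt DemandedMaskLHS(DemandedMask.lshr(ShiftAmt));
  APInt DemandedMaskRHS(DemandedMask.shl(BitWidth - ShiftAmt));
  if (SimplifyDemandedBits(I, 0, DemandedMaskLHS, LHSKnown, Depth + 1) ||
      SimplifyDemandedBits(I, 1, DemandedMaskRHS, RHSKnown, Depth + 1))
    return I;
}
```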
@nikic Thank you for the thorough suggestions above. I will have to look at this closer next week and will work on an alternative solution.
Hi Craig, I initially thought your question was for Nikita, but it's apparently for me. I am sorry, but I am not sure I understand it. Could you please rephrase?
I am surprised this option is viable, for example. It was my initial thought for avoiding the broken rotate, but I assumed that adding something like:
```cpp
if (!getTarget().getTriple().isWasm()) {
  APInt DemandedMaskLHS(DemandedMask.lshr(ShiftAmt));
  APInt DemandedMaskRHS(DemandedMask.shl(BitWidth - ShiftAmt));
  if (SimplifyDemandedBits(I, 0, DemandedMaskLHS, LHSKnown, Depth + 1) ||
      SimplifyDemandedBits(I, 1, DemandedMaskRHS, RHSKnown, Depth + 1))
    return I;
}
```
would not be well received. Also, I cannot find precedent for doing this.
OK, I just re-read your comment above, and I now assume that what you mean is skipping the optimization for all targets when the funnel shift is a rotate (i.e., the first two operands are the same). Is this correct?
Update the patch by removing the target-specific changes in CGBuiltin: leave fshl/fshr unchanged for rotates. This actually fixes a TODO in the fshl/fshr tests.
@nikic What do you think of the current patch?
llvm/test/Transforms/InstCombine/fsh.ll:664
We still want to simplify this case. Could possibly be done by checking whether all demanded bits are zero for one of the operands in the rotate case.
llvm/test/Transforms/InstCombine/fsh.ll:664
Ah, yes, right. That should be just a simple shift right. Will see how to still allow that change. Thanks.
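To make this concrete, here is a hypothetical source-level example of the kind of case being discussed (not the actual test at line 664): when the mask zeroes out everything that one half of the funnel shift would contribute, the rotate folds to a single logical shift.

```cpp
// Hypothetical illustration: the mask keeps only the top byte, so the
// (a << 8) half of fshl(a, a, 8) is known to be all zeros, and
//   rotl(x & 0xFF000000, 8) == (x >> 24),
// i.e. the rotate simplifies to a plain logical shift right.
unsigned rot_to_shift(unsigned x) {
  return __builtin_rotateleft32(x & 0xFF000000u, 8);
}
```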
llvm/test/Transforms/InstCombine/fsh.ll:664
I am still looking into the best way to handle this case. The issue is that we only know whether the demanded bits are zero when analyzing the uses of the value. This is done in SimplifyDemandedBits, which in turn calls SimplifyDemandedUseBits, but we cannot call these functions just to obtain the demanded bits, because they will change the instruction straight away. I was looking for a way to do the checks inside the block of code already changed, but I don't think that will be possible. I might have to add a check to SimplifyDemandedUseBits to only simplify in the specific case we want.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:915–935
You should be able to do something along these lines.
Implement the optimization when the demanded bits are known; otherwise, skip it for rotates.
@nikic Things look much better now. Thanks for your help with the changes in InstCombine. What do you think?
Can you please drop all wasm-related tests and instead add an InstCombine test for the fsh+and pattern?
It would also be good to have a test where we can fold one side to a constant, but that constant is not zero. We should then consider whether that is profitable or not. (In that case we can't reduce to a simple shift and will instead reduce to a shift plus an or with a constant -- is that better or worse than a rotate?)
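A hypothetical example of that non-zero-constant case (not one of the tests in the patch): if the bits that get rotated around are known ones rather than known zeros, one fsh operand folds to a non-zero constant, and the rotate can be rewritten as a shift plus an or with that constant.

```cpp
// Hypothetical illustration: the top byte of (x | 0xFF000000) is all
// ones, so the byte rotated into the low bits is the constant 0xFF and
//   rotl(x | 0xFF000000, 8) == (x << 8) | 0xFF,
// i.e. a shift and an or-with-constant instead of a plain shift.
unsigned rot_known_ones(unsigned x) {
  return __builtin_rotateleft32(x | 0xFF000000u, 8);
}
```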
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:933
Calling SimplifyDemandedBits() for the case of known demanded bits is a bit odd; I'd probably write something like this instead:

```cpp
KnownBits LHSKnown = computeKnownBits(I->getOperand(0), Depth + 1, I);
if (DemandedMaskLHS.isSubsetOf(LHSKnown.Zero | LHSKnown.One)) {
  replaceOperand(I, 0, Constant::getIntegerValue(VTy, LHSKnown.One));
  return &I;
}
KnownBits RHSKnown = computeKnownBits(I->getOperand(1), Depth + 1, I);
if (DemandedMaskRHS.isSubsetOf(RHSKnown.Zero | RHSKnown.One)) {
  replaceOperand(I, 1, Constant::getIntegerValue(VTy, RHSKnown.One));
  return &I;
}
```
I have added the tests. I looked at the WebAssembly output and it looks good. Even in the case where a non-zero constant is generated, Wasm still manages to produce a rotate, which is generally more profitable, since the runtime can then choose how to implement it depending on the hardware. In the general case, I am not sure what the right answer is, to be honest.
LGTM, let's give it a try. The patch description needs an update though.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:920
Add some more explanation here, something like:
// Avoid converting rotate into funnel shift. Only simplify if one operand is constant.