This commit overrides the default lowering of rotates to fshl/fshr and instead lowers them straight to a target-dependent rotate intrinsic.
Proposal to fix https://github.com/llvm/llvm-project/issues/62703
This is not ready to land yet, but it should open a discussion on whether this is a good fix for the issue. If we want to go this route, we still need 64-bit support and tests.
A quick summary of why this is needed. Consider the function:

```c
inline u32 bswap(u32 x) {
  return __builtin_rotateleft32((x & 0xFF00FF00), 8) |
         __builtin_rotateright32((x & 0x00FF00FF), 8);
}
```
This generates really poor code in WebAssembly. A rotate should be generated, but instead we get a large amount of code just to set up constants and perform shifts. The issue is that the rotateleft is lowered to fshl(%and, %and, 8). During instcombine, the second argument is simplified, since %and is the result of a bitwise operation. However, this leaves the wasm backend unable to generate the rotate, because the first and second arguments of the fshl are no longer the same. The generated code is:
```
bswap:                                  # @bswap
        .functype bswap (i32) -> (i32)
# %bb.0:                                # %entry
        local.get 0
        i32.const 24
        i32.shl
        local.get 0
        i32.const 65280
        i32.and
        i32.const 8
        i32.shl
        i32.or
        local.get 0
        i32.const 8
        i32.shr_u
        i32.const 65280
        i32.and
        local.get 0
        i32.const 24
        i32.shr_u
        i32.or
        i32.or
                                        # fallthrough-return
        end_function
```
With the fix this becomes:
```
bswap:                                  # @bswap
        .functype bswap (i32) -> (i32)
# %bb.0:                                # %entry
        local.get 0
        i32.const 16711935
        i32.and
        i32.const 8
        i32.rotr
        local.get 0
        i32.const -16711936
        i32.and
        i32.const 8
        i32.rotl
        i32.or
                                        # fallthrough-return
        end_function
```
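For context on where the fshl comes from: clang's emitRotate implements the rotate builtins as funnel shifts with the input duplicated, so the backend can only match a rotate while both value operands stay identical. A simplified, self-contained sketch of that lowering (the helper name and signature here are illustrative, not the exact CGBuiltin.cpp code):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"

// rotl(x, n) == fshl(x, x, n) and rotr(x, n) == fshr(x, x, n).
// Once instcombine rewrites one of the duplicated value operands based
// on demanded bits, the backend can no longer recognize the rotate.
static llvm::Value *emitRotateSketch(llvm::IRBuilder<> &Builder,
                                     llvm::Value *Src, llvm::Value *Amt,
                                     bool IsRotateRight) {
  llvm::Intrinsic::ID IID =
      IsRotateRight ? llvm::Intrinsic::fshr : llvm::Intrinsic::fshl;
  llvm::Function *F = llvm::Intrinsic::getDeclaration(
      Builder.GetInsertBlock()->getModule(), IID, Src->getType());
  return Builder.CreateCall(F, {Src, Src, Amt});
}
```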
The alternative would be to block the instcombine on fshl/fshr instructions for the wasm backend, but adding target-dependent logic to instcombine feels even more icky than this patch. I welcome suggestions on how to improve the patch, since the hack in emitRotate is also not great, and I have not seen other places that lower a generic builtin into a target-specific intrinsic.
Adding Sanjay as a reviewer since he implemented the clang emitRotate which we patched.
FWIW, X86 seems to do something similar elsewhere in this file (https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGBuiltin.cpp#L985-L986), although it doesn't seem common otherwise. I think I'd be OK with this approach (and it does seem better than trying to mess with instcombine or a new TTI hook or something).
Oh, I missed that X86 bit. That gives me some confidence that this approach makes sense. I will work on a new patch with 64-bit support and tests.
This doesn't look like a wasm-specific problem. You get essentially the same issue on any target that has a rotate instruction but no funnel-shift instruction. Here are just a couple of examples: https://godbolt.org/z/8v6nfaax9
I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.
Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).
Yes, I am indeed aware this is not specific to wasm. What's specific to wasm afaiu is that the code generated is much worse when expanding fshl. That's what I mentioned in the bug discussion here: https://github.com/llvm/llvm-project/issues/62703#issuecomment-1548474310
> I believe this needs to be either solved by preventing demanded bits simplifications that break a rotate pattern (though I'm not sure if that would break any other optimizations we care about) or by adding a special case for this in the backend when lowering FSH to ROT.
Preventing the simplification means adding target-specific code to instcombine, which seems even worse than adding it here, given that, as @dschuff pointed out, there's precedent with x86.
> Lowering to a rotate intrinsic only "solves" this for wasm and will at the same time make these rotates completely opaque to optimization -- heck, it looks like we don't even support constant folding for these intrinsics (https://llvm.godbolt.org/z/hMWG16b9W).
I only just added the intrinsics, so those optimizations have not been added yet.
Another thing regarding this optimization: Wasm is slightly different from other targets in that it is not executed natively but instead run through a VM or passed through binaryen for further optimization. I just tested this, and emcc, which passes the resulting file through binaryen, performs this optimization. I imagine V8 won't have problems with this code either, so the missing optimization in LLVM is not problematic for this target.
> Preventing the simplification means adding target-specific code to instcombine, which seems even worse than adding it here, given that, as @dschuff pointed out, there's precedent with x86.
How harmful is it to avoid breaking rotate patterns even if the target doesn't support rotate?
I'm not suggesting adding any target-specific code to instcombine. I think there are actually quite a few different ways this could be solved. See https://llvm.godbolt.org/z/f55K7K17W for three possible representations of the same rotate pattern:
- Say that we prefer preserving rotates over "simplifying" funnel shifts (ending up with the rot2 pattern). Basically by skipping the optimization at https://github.com/llvm/llvm-project/blob/7f54b38e28b3b66195de672848f2b5366d0d51e3/llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L927-L931 if both fsh operands are the same (see the sketch after this list). Assuming this doesn't cause test regressions, I think this would be acceptable to do. From a backend perspective, even for targets that have a native funnel shift (aarch64, x86), the difference between the rot1/rot2 patterns looks pretty neutral.
- Undo the transform in the backend. It is bidirectional (https://alive2.llvm.org/ce/z/Chb85F), so this is possible. This would need an extra legalization/combiner pattern (depending on where we form ROTs). The advantage of undoing the pattern is the usual one: it works if the code was in the undesirable form in the first place. (E.g. this could happen if the rotate did not use a builtin but was implemented as x << 8 | x >> 24, which is probably much more widespread than the builtin. Though I checked, and we don't currently form a funnel shift for it in this case.)
- Move the and from the fsh arguments to the result. This is the rot3 pattern. This seems to produce the best codegen on average, because it can use uxtb16 on ARM. Moving the and from args to return is a bit unusual for unary ops, but if we see this as moving two ands on both fsh arguments (which happen to be the same) to one on the result, that would be a pretty standard transform.
I think all of those options are viable, and couldn't say for certain which one is best. I think any of them would be better than making clang emit a special intrinsic just for wasm though.
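For concreteness, the first option amounts to guarding the existing demanded-bits simplification of fshl/fshr with a same-operand check. A minimal sketch, reusing the variable names of the surrounding InstCombineSimplifyDemanded.cpp code (the same names appear in the snippet quoted later in this thread):

```cpp
// Sketch of option 1: don't shrink the operands of a funnel shift that
// is really a rotate (both value operands identical), so backends can
// still match it to a hardware rotate instruction.
if (I->getOperand(0) != I->getOperand(1)) {
  APInt DemandedMaskLHS(DemandedMask.lshr(ShiftAmt));
  APInt DemandedMaskRHS(DemandedMask.shl(BitWidth - ShiftAmt));
  if (SimplifyDemandedBits(I, 0, DemandedMaskLHS, LHSKnown, Depth + 1) ||
      SimplifyDemandedBits(I, 1, DemandedMaskRHS, RHSKnown, Depth + 1))
    return I;
}
```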
@nikic Thank you for the thorough suggestions above. I will have to look at this closer next week and will work on an alternative solution.
Hi Craig, I initially thought your question was for Nikita, but it's apparently for me. I am sorry, but I am not sure I understand it. Could you please rephrase?
I am surprised this option is viable, for example. It was my initial thought for avoiding the broken rotate, but I assumed that adding something like:
```cpp
if (!getTarget().getTriple().isWasm()) {
  APInt DemandedMaskLHS(DemandedMask.lshr(ShiftAmt));
  APInt DemandedMaskRHS(DemandedMask.shl(BitWidth - ShiftAmt));
  if (SimplifyDemandedBits(I, 0, DemandedMaskLHS, LHSKnown, Depth + 1) ||
      SimplifyDemandedBits(I, 1, DemandedMaskRHS, RHSKnown, Depth + 1))
    return I;
}
```
would not be well received. Also, I cannot find precedent for doing this.
OK, I just re-read your comment above, and I now assume that what you mean is skipping the optimization for all targets when the funnel shift is a rotate (i.e., the first two operands are the same). Is this correct?
Update the patch by removing the target-specific changes in CGBuiltin: leave fshl/fshr unchanged for rotates. This actually fixes a TODO in the fshl/fshr tests.
@nikic What do you think of the current patch?
llvm/test/Transforms/InstCombine/fsh.ll:664
We still want to simplify this case. Could possibly be done by checking whether all demanded bits are zero for one of the operands in the rotate case.
llvm/test/Transforms/InstCombine/fsh.ll:664
Ah, yes, right. That should be just a simple shift right. Will see how to still allow that change. Thanks.
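To make this concrete, here is a hypothetical source-level example of the kind of case being discussed (not the actual test at line 664): when the mask zeroes out everything that one half of the funnel shift would contribute, the rotate folds to a single logical shift.

```cpp
// Hypothetical illustration: the mask keeps only the top byte, so the
// (a << 8) half of fshl(a, a, 8) is known to be all zeros, and
//   rotl(x & 0xFF000000, 8) == (x >> 24),
// i.e. the rotate simplifies to a plain logical shift right.
unsigned rot_to_shift(unsigned x) {
  return __builtin_rotateleft32(x & 0xFF000000u, 8);
}
```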
llvm/test/Transforms/InstCombine/fsh.ll:664
I am still looking into the best way to handle this case. The issue is that we only know whether the demanded bits are zero when analyzing the uses of the value. This is done in SimplifyDemandedBits, which in turn calls SimplifyDemandedUseBits, but we cannot call these functions just to obtain the demanded bits, because they will change the instruction straight away. I was looking for a way to do the checks inside the block of code already changed, but I don't think that will be possible. I might have to add a check to SimplifyDemandedUseBits to only simplify in the specific case we want.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:915–935
You should be able to do something along these lines.
Implement the optimization when the demanded bits are known; otherwise, skip it for rotates.
@nikic Things look much better now. Thanks for your help with the changes in InstCombine. What do you think?
Can you please drop all wasm-related tests and instead add an InstCombine test for the fsh+and pattern?
It would also be good to have a test where we can fold one side to a constant, but that constant is not zero. We should then consider whether that is profitable or not. (In that case we can't reduce to a simple shift and will instead reduce to a shift plus an or with a constant -- is that better or worse than a rotate?)
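A hypothetical example of that non-zero-constant case (not one of the tests in the patch): if the bits that get rotated around are known ones rather than known zeros, one fsh operand folds to a non-zero constant, and the rotate can be rewritten as a shift plus an or with that constant.

```cpp
// Hypothetical illustration: the top byte of (x | 0xFF000000) is all
// ones, so the byte rotated into the low bits is the constant 0xFF and
//   rotl(x | 0xFF000000, 8) == (x << 8) | 0xFF,
// i.e. a shift and an or-with-constant instead of a plain shift.
unsigned rot_known_ones(unsigned x) {
  return __builtin_rotateleft32(x | 0xFF000000u, 8);
}
```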
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:933
Calling SimplifyDemandedBits() for the case of known demanded bits is a bit odd; I'd probably write something like this instead:

```cpp
KnownBits LHSKnown = computeKnownBits(I->getOperand(0), Depth + 1, I);
if (DemandedMaskLHS.isSubsetOf(LHSKnown.Zero | LHSKnown.One)) {
  replaceOperand(I, 0, Constant::getIntegerValue(VTy, LHSKnown.One));
  return &I;
}
KnownBits RHSKnown = computeKnownBits(I->getOperand(1), Depth + 1, I);
if (DemandedMaskRHS.isSubsetOf(RHSKnown.Zero | RHSKnown.One)) {
  replaceOperand(I, 1, Constant::getIntegerValue(VTy, RHSKnown.One));
  return &I;
}
```
I have added the tests. I looked at the WebAssembly output and it looks good. Even in the case where a non-zero constant is generated, Wasm still manages to produce a rotate, which is generally more profitable, since the runtime can then choose how to implement it depending on the hardware. In the general case, I am not sure what the right answer is, to be honest.
LGTM, let's give it a try. The patch description needs an update though.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp:920
Add some more explanation here, something like:
// Avoid converting rotate into funnel shift. Only simplify if one operand is constant.