[AVX512/AVX][Intrinsics] Fix Variable Bit Shift Right Arithmetic intrinsic lowering.
Intel SRA intrinsic behavior are different from LLVM ashr instruction.
LLVM ashr instruction : If op2 is (statically or dynamically) equal to or larger than the number of bits in op1, the result is undefined.
Intel SRA: If the unsigned integer value specified in the respective data element of the second source operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination data element are filled with the corresponding sign bit of the source element.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
This looks very similar in aim to D19675 (although we barely have any support for AVX512 intrinsics in InstCombine)
I just think we're better off handling simplification of intrinsics as early as possible - in this case in InstCombiner::visitCallInst instead of waiting until lowering.
Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.
In this case constant folding done before intrinsic simplification only in order to match intrinsic behavior so general ISD::SRA can be used.
I agree. But today we are going towards generation IR from clang whenever it is possible. As far as this specific intrinsic (SAR), LLVM-IR specification and Intel intrinsic specification are different for out-of-range constants. You can generate a constant if the both arguments are constants, but replacing intrinsic with generic IR instruction is incorrect in this case.
Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.
We should generate an optimal code, if we can. So, we, probably, should fold constants whenever it is possible.
IMO, once we are visiting SAR intrinsic in LowerINTRINSIC_WO_CHAIN, we should try to fold constants there. And it is actual for all intrinsics where IR and architecture specification does not match.
Are there many examples of where these SAR intrinsics can only be folded at lowering? And is it just SAR or are SLR/SHL shifts likely as well? What prevented them from being folded earlier in instcombine? I'm happier now with the idea of adding it to LowerINTRINSIC_WO_CHAIN - I just want to know it will be useful.
I think we need this patch for correctness in case we got this intrinsic with const index in back-end ( for example if InstCombiner doesn't run , -O0 compilation) .
Except this isn't a correctness issue, its an optimization no? The code will run fine at -O0 or higher as it will lower to the vpsrav intrinsic which supports the out-of-range shift value and will give the correct result.
If this was in combineINTRINSIC_WO_CHAIN (which we recently removed as it wasn't being used....) I think we could accept this but only if we have a use case that InstCombine isn't catching.
No, I belive it is correctness issue. vpsrav intrinsic lowering with constant out-of-range shift value is incorrect.
For example
define <4 x i32> @test_x86_avx2_psrav_d_fold(<4 x i32> %a0, <4 x i32> %a1) { %res = call <4 x i32> @llvm.x86.avx2.psrav.d(<4 x i32> <i32 2, i32 9, i32 -12, i32 23>, <4 x i32> <i32 1, i32 18, i32 35, i32 52>) ret <4 x i32> %res } declare <4 x i32> @llvm.x86.avx2.psrav.d(<4 x i32>, <4 x i32>) nounwind readnone
Without patch we got incorrect result:
movl $1, %eax movd %eax, %xmm0 retq
With the patch:
.LCPI0_0: .long 1 # 0x1 .long 0 # 0x0 .long 4294967295 # 0xffffffff .long 0 # 0x0 movaps .LCPI0_0(%rip), %xmm0 # xmm0 = [1,0,4294967295,0] retq
No, I belive it is correctness issue. vpsrav intrinsic lowering with constant out-of-range shift value is incorrect.
Thanks for the example - the issue you describe seems to be that the x86 variable shift intrinsics are being mapped to ISD::SRA - I think this is the problem you are encountering?
lib/Target/X86/X86IntrinsicsInfo.h | ||
---|---|---|
326 ↗ | (On Diff #59341) | These variable shift intrinsics should NOT use the ISD::SRA opcode but need a X86ISD::VSRAV opcode instead - similarly for the logical left/right intrinsics. |
Thanks Igor - do you think the logical shift equivalents need to be changed as well? They also have defined behaviour for out of range shifts.
I am not sure . The out of range shifts behaviour undefined but implemented as "undefined results" to be always 0.
APInt APInt::lshr(unsigned shiftAmt)
APInt LLVM_ATTRIBUTE_UNUSED_RESULT shl(unsigned shiftAmt)
I will add a few tests to insure that this behavior doesn't change.