This is an archive of the discontinued LLVM Phabricator instance.

[AVX512/AVX][Intrinsics] Fix Variable Bit Shift Right Arithmetic intrinsic lowering.
ClosedPublic

Authored by igorb on Jun 2 2016, 12:37 AM.

Details

Summary

[AVX512/AVX][Intrinsics] Fix Variable Bit Shift Right Arithmetic intrinsic lowering.
Intel SRA intrinsic behavior are different from LLVM ashr instruction.
LLVM ashr instruction : If op2 is (statically or dynamically) equal to or larger than the number of bits in op1, the result is undefined.
Intel SRA: If the unsigned integer value specified in the respective data element of the second source operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination data element are filled with the corresponding sign bit of the source element.

Diff Detail

Repository
rL LLVM

Event Timeline

igorb updated this revision to Diff 59341.Jun 2 2016, 12:37 AM
igorb retitled this revision from to [AVX512/AVX][Intrinsics] Fix Variable Bit Shift Right Arithmetic intrinsic lowering..
igorb updated this object.
igorb added reviewers: delena, AsafBadouh.
igorb set the repository for this revision to rL LLVM.
igorb added a subscriber: llvm-commits.
RKSimon added a subscriber: RKSimon.Jun 2 2016, 1:52 AM

This looks very similar in aim to D19675 (although we barely have any support for AVX512 intrinsics in InstCombine)

delena edited edge metadata.Jun 2 2016, 6:55 AM

This looks very similar in aim to D19675 (although we barely have any support for AVX512 intrinsics in InstCombine)

Simon,

Do you see any problem with the current patch?

I just think we're better off handling simplification of intrinsics as early as possible - in this case in InstCombiner::visitCallInst instead of waiting until lowering.

Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.

igorb added a comment.Jun 2 2016, 1:11 PM

I just think we're better off handling simplification of intrinsics as early as possible - in this case in InstCombiner::visitCallInst instead of waiting until lowering.

Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.

In this case constant folding done before intrinsic simplification only in order to match intrinsic behavior so general ISD::SRA can be used.

I just think we're better off handling simplification of intrinsics as early as possible - in this case in InstCombiner::visitCallInst instead of waiting until lowering.

I agree. But today we are going towards generation IR from clang whenever it is possible. As far as this specific intrinsic (SAR), LLVM-IR specification and Intel intrinsic specification are different for out-of-range constants. You can generate a constant if the both arguments are constants, but replacing intrinsic with generic IR instruction is incorrect in this case.

Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.

We should generate an optimal code, if we can. So, we, probably, should fold constants whenever it is possible.

IMO, once we are visiting SAR intrinsic in LowerINTRINSIC_WO_CHAIN, we should try to fold constants there. And it is actual for all intrinsics where IR and architecture specification does not match.

I just think we're better off handling simplification of intrinsics as early as possible - in this case in InstCombiner::visitCallInst instead of waiting until lowering.

I agree. But today we are going towards generation IR from clang whenever it is possible. As far as this specific intrinsic (SAR), LLVM-IR specification and Intel intrinsic specification are different for out-of-range constants. You can generate a constant if the both arguments are constants, but replacing intrinsic with generic IR instruction is incorrect in this case.

Also, should we be adding constant folding (or other optimizations) to LowerINTRINSIC_WO_CHAIN ? I understood that is for cleanup + canonicalization only.

We should generate an optimal code, if we can. So, we, probably, should fold constants whenever it is possible.

IMO, once we are visiting SAR intrinsic in LowerINTRINSIC_WO_CHAIN, we should try to fold constants there. And it is actual for all intrinsics where IR and architecture specification does not match.

Are there many examples of where these SAR intrinsics can only be folded at lowering? And is it just SAR or are SLR/SHL shifts likely as well? What prevented them from being folded earlier in instcombine? I'm happier now with the idea of adding it to LowerINTRINSIC_WO_CHAIN - I just want to know it will be useful.

igorb added a comment.Jun 9 2016, 2:16 AM

I think we need this patch for correctness in case we got this intrinsic with const index in back-end ( for example if InstCombiner doesn't run , -O0 compilation) .

RKSimon edited edge metadata.Jun 9 2016, 10:02 AM

I think we need this patch for correctness in case we got this intrinsic with const index in back-end ( for example if InstCombiner doesn't run , -O0 compilation) .

Except this isn't a correctness issue, its an optimization no? The code will run fine at -O0 or higher as it will lower to the vpsrav intrinsic which supports the out-of-range shift value and will give the correct result.

If this was in combineINTRINSIC_WO_CHAIN (which we recently removed as it wasn't being used....) I think we could accept this but only if we have a use case that InstCombine isn't catching.

igorb added a comment.Jun 13 2016, 3:55 AM

Except this isn't a correctness issue, its an optimization no? The code will run fine at -O0 or higher as it will lower to the vpsrav intrinsic which supports the out-of-range shift value and will give the correct result.

No, I belive it is correctness issue. vpsrav intrinsic lowering with constant out-of-range shift value is incorrect.

For example

define <4 x i32> @test_x86_avx2_psrav_d_fold(<4 x i32> %a0, <4 x i32> %a1) {
  %res = call <4 x i32> @llvm.x86.avx2.psrav.d(<4 x i32> <i32 2, i32 9, i32 -12, i32 23>, <4 x i32> <i32 1, i32 18, i32 35, i32 52>)
  ret <4 x i32> %res
}
declare <4 x i32> @llvm.x86.avx2.psrav.d(<4 x i32>, <4 x i32>) nounwind readnone

Without patch we got incorrect result:

movl    $1, %eax
movd    %eax, %xmm0
retq

With the patch:

 .LCPI0_0:
.long   1                       # 0x1
.long   0                       # 0x0
.long   4294967295              # 0xffffffff
.long   0                       # 0x0

movaps  .LCPI0_0(%rip), %xmm0   # xmm0 = [1,0,4294967295,0]
retq

No, I belive it is correctness issue. vpsrav intrinsic lowering with constant out-of-range shift value is incorrect.

Thanks for the example - the issue you describe seems to be that the x86 variable shift intrinsics are being mapped to ISD::SRA - I think this is the problem you are encountering?

lib/Target/X86/X86IntrinsicsInfo.h
326 ↗(On Diff #59341)

These variable shift intrinsics should NOT use the ISD::SRA opcode but need a X86ISD::VSRAV opcode instead - similarly for the logical left/right intrinsics.

igorb updated this revision to Diff 60685.Jun 14 2016, 7:38 AM
igorb updated this object.
igorb edited edge metadata.

Update patch according to comments.
Thanks for review.

delena accepted this revision.Jun 15 2016, 11:14 PM
delena edited edge metadata.

LGTM

This revision is now accepted and ready to land.Jun 15 2016, 11:14 PM

Thanks Igor - do you think the logical shift equivalents need to be changed as well? They also have defined behaviour for out of range shifts.

igorb added a comment.Jun 19 2016, 6:44 AM

Thanks Igor - do you think the logical shift equivalents need to be changed as well? They also have defined behaviour for out of range shifts.

I am not sure . The out of range shifts behaviour undefined but implemented as "undefined results" to be always 0.
APInt APInt::lshr(unsigned shiftAmt)
APInt LLVM_ATTRIBUTE_UNUSED_RESULT shl(unsigned shiftAmt)

I will add a few tests to insure that this behavior doesn't change.

This revision was automatically updated to reflect the committed changes.