This should help in lowering the following four intrinsics:
_mm256_cvtepi32_epi8
_mm256_cvtepi64_epi16
_mm256_cvtepi64_epi8
_mm512_cvtepi64_epi8
Details
Diff Detail
Event Timeline
Can you precommit the test cases?
Does this handle the more intuitive case of shuffling in 0 elements without the bitcast to <2 x i64>?
What about this. This is the most obvious representation for this.
define <8 x i16> @trunc_v4i64_to_v4i16_return_v8i16(<4 x i64> %vec) nounwind { %truncated = trunc <4 x i64> %vec to <4 x i16> %result = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> ret <8 x i16> %result }
Added some more tests.
We have four test functions for vpmovdb, and four test functions for vpmovdw.
None of these were detected before this patch.
This indirect approach is still not detected, but I think it shouldn't be too hard:
// truncate to v8i16 in the first operation, but this really is a vpmovdb %truncated = trunc <8 x i32> %vec to <8 x i16> %bc = bitcast <8 x i16> %truncated to <16 x i8> %result = shufflevector <16 x i8> %bc, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
I haven't tried but you might be able to do
builtin_shufflevector(builtin_convertvector((v4di)A, v4qi), (v4qi){0, 0, 0, 0}, 0, 1, 2, 3, 4, 5, 6);
What about _mm512_cvtepi64_epi8?
There's also __m256_cvtepi64_epi8 which produces a 32 bit result. And all the ones that start as 128 bits.
Commit the tests with the current codegen so the patch diff shows the improvement?
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9413 | clang format all of this | |
9417 | Cleanup the for-loop condition to avoid so much evaluation and size_t casts: | |
9435 | Those for loop braces can be removed? |
I think some more work needs to be done here, to also match the masked versions of these, e.g.
_mm256_mask_cvtepi64_epi16 -> vpmovdb %ymm0, %xmm0, {k1}
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9429 | Why can't you permit UNDEFs in the shuffle mask? |
test/CodeGen/X86/shuffle-vs-trunc-512.ll | ||
---|---|---|
949 | This is an AVX512F instruction we're trying to handle here and we are failing to remove the shuffle with AVX512F. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9429 | I do permit undef here. The condition of the branch is what I don't permit. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9423 | If you tweak isSequentialOrUndefInRange to take an increment argument (default = 1) then you could remove this loop. | |
9429 | Then why don't you use isUndefOrZeroOrInRange()? if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), TruncatedVectorStart, TruncatedVectorStart + Size)) return false; |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9429 | That would check for the wrong elements, wouldn't it? OtherVectorStart = SwappedOps ? 0 : Size if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), OtherVectorStart, OtherVectorStart + Size)) return false; |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9429 | Then I would rather make a new function: if (isAnyInRange(Mask.slice(Split, Size), TruncatedVectorStart, TruncatedVectorStart + Size)) return false; |
Taking over at @GBuella's request. The patterns that are currently implemented will be finished, but I don't have much hope for the masked versions. Since the mask is only good for the lower elements and the upped elements must be zeroed out, lowering the masked versions of these intrinsics would require not simple selects (see PR34877), but patterns like
define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_1(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind { %truncated = trunc <4 x i64> %vec to <4 x i16> %dst = bitcast <2 x i64> %dest to <8 x i16> %dst_select = shufflevector <8 x i16> %dst, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %mask = trunc i8 %k to i4 %mask_vec = bitcast i4 %mask to <4 x i1> %res = select <4 x i1> %mask_vec, <4 x i16> %truncated, <4 x i16> %dst_select %result = shufflevector <4 x i16> %res, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> ret <8 x i16> %result }
or
define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_2(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind { %truncated = trunc <4 x i64> %vec to <4 x i16> %res = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> %mask = xor i8 %k, -1 %mask1 = and i8 %mask, 15 %mask_vec = bitcast i8 %mask1 to <8 x i1> %dst = bitcast <2 x i64> %dest to <8 x i16> %result = select <8 x i1> %mask_vec, <8 x i16> %dst, <8 x i16> %res ret <8 x i16> %result }
and I feel that both of these are too complex.
Changed isSequentialOrUndefInRange to take an increment argument and added isAnyInRange.
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9406 ↗ | (On Diff #149730) | We've usually called these functions something like 'matchVectorShuffleAsVPMOV' or 'matchVectorShuffleAsVTRUNC' |
9407 ↗ | (On Diff #149730) | int Delta |
9417 ↗ | (On Diff #149730) | I still don't get why you don't want to just use isUndefOrInRange with the 'safe' vector range. |
9473 ↗ | (On Diff #149730) | Save yourself some typing ;-) SDValue Src = V1.getOperand(0).getOperand(0); MVT SrcVT = Src.getSimpleValueType(); |
9483 ↗ | (On Diff #149730) | Why is it just 16i8 and not 32i8 as well for _mm512_cvtepi16_epi8 ? |
9487 ↗ | (On Diff #149730) | Couldn't this be at the top for an early-out? |
9492 ↗ | (On Diff #149730) | Is this right? I'd expect it to check for the ! case. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9407 ↗ | (On Diff #149730) | Or Stride |
9483 ↗ | (On Diff #149730) | This part is only about truncations, where the result must be filled with extra zeros, due to the (narrower tan 128bits) result being in an xmm register. The check here is about _mm_cvtepi16_epi8 (which requires avx512vl & avx512bw). It truncates from v8i16 -> v8i8, but the vpmovwb instruction actually sets a whole xmm register, so the actual result is going to be v16i8, with other 8 bytes set to zero. Ok, perhaps these details should be explained in comments around here. |
9492 ↗ | (On Diff #149730) | Ye, originally it was: if (!maskContainsSequenceForVPMOV(Mask, SwappedOps, 2) && !maskContainsSequenceForVPMOV(Mask, SwappedOps, 4)) return SDValue(); The new form of this conjuction doesn't seem to make much sense at first sight, what happened? |
Fixed the error in the final check (it was from badly undone edits around there). Moved the early-exit check. Expanded the comment on the AVX512BW check for clarity. Some names changed per comments.
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9417 ↗ | (On Diff #149730) | Due to SwappedOps, the 'safe' range may be before or after the 'unsafe' one, so in this case catching unsafe values is tidier. To use isUndefOrInRange we'd first have to define whether we accept elements in [Size, 2*Size) or in [0, Size). If you keep insisting, I may do that, but at the moment I don't see much benefit in that. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9481 ↗ | (On Diff #150717) | SrcVT.is512BitVector() |
9483 ↗ | (On Diff #149730) | Shouldn't it handle this case? https://godbolt.org/g/Yxw7nE |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9483 ↗ | (On Diff #149730) | Probably that could also be implemented here, we just didn't think about it so far. But if it is not a lot of extra work, the shufflevector equivalents of those convertvector ones could be detected here. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9483 ↗ | (On Diff #149730) | The example doesn't contain explicit truncations, so it isn't handled by this particular function anyway and there's no need to check for features for it here. Correct me if I'm wrong, but I don't think that the aim of this patch is to catch every possible VPMOV pattern, just as many as convenient for subsequent intrinsic lowering, and in the cases of 128-bit-or-wider outputs, where there's no need to zero out the upper parts of xmm registers, a pattern more complex than (in this case) %res = trunc <32 x i16> %v to <32 x i8> is just a needless complication. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
4867 ↗ | (On Diff #150895) | Fix comment - (Low, Low+Step*Size] ? |
clang format all of this