Page MenuHomePhabricator

[x86] Lower some trunc + shuffle patterns to vpmov[q|d][b|w]
ClosedPublic

Authored by mike.dvoretsky on May 16 2018, 8:40 AM.

Details

Summary

This should help in lowering the following four intrinsics:
_mm256_cvtepi32_epi8
_mm256_cvtepi64_epi16
_mm256_cvtepi64_epi8
_mm512_cvtepi64_epi8

Diff Detail

Repository
rL LLVM

Event Timeline

GBuella created this revision.May 16 2018, 8:40 AM

Can you precommit the test cases?

Does this handle the more intuitive case of shuffling in 0 elements without the bitcast to <2 x i64>?

GBuella updated this revision to Diff 147154.May 16 2018, 12:28 PM

Added some more tests.

What about this. This is the most obvious representation for this.

define <8 x i16> @trunc_v4i64_to_v4i16_return_v8i16(<4 x i64> %vec) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %result = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %result
}

What about this. This is the most obvious representation for this.

define <8 x i16> @trunc_v4i64_to_v4i16_return_v8i16(<4 x i64> %vec) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %result = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %result
}

Oh, nice, I'm adding this one as well. BTW, how would you write that in C?

GBuella updated this revision to Diff 147163.May 16 2018, 1:06 PM

Added some more tests.
We have four test functions for vpmovdb, and four test functions for vpmovdw.
None of these were detected before this patch.

This indirect approach is still not detected, but I think it shouldn't be too hard:

// truncate to v8i16 in the first operation, but this really is a vpmovdb
%truncated = trunc <8 x i32> %vec to <8 x i16>
%bc = bitcast <8 x i16> %truncated to <16 x i8>
%result = shufflevector <16 x i8> %bc, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
GBuella updated this revision to Diff 147170.May 16 2018, 1:35 PM

Fixed the some of the tests, all cases are detected now.

If you think the tests are enough, I can precommit them, and then rebase this patch.

I haven't tried but you might be able to do

builtin_shufflevector(builtin_convertvector((v4di)A, v4qi), (v4qi){0, 0, 0, 0}, 0, 1, 2, 3, 4, 5, 6);

What about _mm512_cvtepi64_epi8?

There's also __m256_cvtepi64_epi8 which produces a 32 bit result. And all the ones that start as 128 bits.

GBuella updated this revision to Diff 147281.May 17 2018, 3:11 AM
GBuella edited the summary of this revision. (Show Details)
GBuella updated this revision to Diff 147282.May 17 2018, 3:29 AM
GBuella retitled this revision from [x86] Lower some trunc + shuffle patterns to vpmovd[b|w] to [x86] Lower some trunc + shuffle patterns to vpmov[q|d][b|w].

Commit the tests with the current codegen so the patch diff shows the improvement?

lib/Target/X86/X86ISelLowering.cpp
9391 ↗(On Diff #147282)

clang format all of this

9395 ↗(On Diff #147282)

Cleanup the for-loop condition to avoid so much evaluation and size_t casts:
http://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations

9413 ↗(On Diff #147282)

Those for loop braces can be removed?

GBuella updated this revision to Diff 147975.May 22 2018, 2:55 AM

Performed some code formatting.
I also committed the tests.

RKSimon added inline comments.May 22 2018, 12:07 PM
lib/Target/X86/X86ISelLowering.cpp
9414 ↗(On Diff #147975)

Instead of repeating for SwappedOps - can't you just use start/end values? Else take a local copy of the shuffle mask and use ShuffleVectorSDNode::commuteMask?

9418 ↗(On Diff #147975)

Can you use the shuffle mask helpers for any/all of these - isUndefOrInRange etc?

GBuella updated this revision to Diff 148178.May 23 2018, 3:04 AM

Refactored the maskContainsSequenceForVPMOV function.

GBuella marked 5 inline comments as done.May 23 2018, 3:06 AM

I think some more work needs to be done here, to also match the masked versions of these, e.g.
_mm256_mask_cvtepi64_epi16 -> vpmovdb %ymm0, %xmm0, {k1}

RKSimon added inline comments.May 23 2018, 10:42 AM
lib/Target/X86/X86ISelLowering.cpp
9429 ↗(On Diff #148178)

Why can't you permit UNDEFs in the shuffle mask?

craig.topper added inline comments.May 23 2018, 11:45 AM
test/CodeGen/X86/shuffle-vs-trunc-512.ll
949 ↗(On Diff #148178)

This is an AVX512F instruction we're trying to handle here and we are failing to remove the shuffle with AVX512F.

GBuella added inline comments.May 23 2018, 9:17 PM
lib/Target/X86/X86ISelLowering.cpp
9429 ↗(On Diff #148178)

I do permit undef here. The condition of the branch is what I don't permit.

GBuella updated this revision to Diff 148337.May 23 2018, 11:11 PM

Fixed the ISA feature checking conditions.

GBuella marked an inline comment as done.May 23 2018, 11:12 PM
RKSimon added inline comments.May 24 2018, 6:39 AM
lib/Target/X86/X86ISelLowering.cpp
9418 ↗(On Diff #147975)

If you tweak isSequentialOrUndefInRange to take an increment argument (default = 1) then you could remove this loop.

9429 ↗(On Diff #148178)

Then why don't you use isUndefOrZeroOrInRange()?

if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), TruncatedVectorStart, TruncatedVectorStart + Size))
  return false;
GBuella added inline comments.May 24 2018, 8:15 AM
lib/Target/X86/X86ISelLowering.cpp
9429 ↗(On Diff #148178)

That would check for the wrong elements, wouldn't it?
Maybe

OtherVectorStart = SwappedOps ? 0 : Size
if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), OtherVectorStart, OtherVectorStart + Size))
  return false;
GBuella added inline comments.May 24 2018, 8:24 AM
lib/Target/X86/X86ISelLowering.cpp
9429 ↗(On Diff #148178)

Then I would rather make a new function:

if (isAnyInRange(Mask.slice(Split, Size), TruncatedVectorStart, TruncatedVectorStart + Size))
  return false;
mike.dvoretsky commandeered this revision.Jun 4 2018, 4:47 AM
mike.dvoretsky edited reviewers, added: GBuella; removed: mike.dvoretsky.

Taking over at @GBuella's request. The patterns that are currently implemented will be finished, but I don't have much hope for the masked versions. Since the mask is only good for the lower elements and the upped elements must be zeroed out, lowering the masked versions of these intrinsics would require not simple selects (see PR34877), but patterns like

define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_1(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %dst = bitcast <2 x i64> %dest to <8 x i16>
  %dst_select = shufflevector <8 x i16> %dst, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %mask = trunc i8 %k to i4
  %mask_vec = bitcast i4 %mask to <4 x i1>
  %res = select <4 x i1> %mask_vec, <4 x i16> %truncated, <4 x i16> %dst_select
  %result = shufflevector <4 x i16> %res, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %result
}

or

define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_2(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %res = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %mask = xor i8 %k, -1
  %mask1 = and i8 %mask, 15
  %mask_vec = bitcast i8 %mask1 to <8 x i1>
  %dst = bitcast <2 x i64> %dest to <8 x i16>
  %result = select <8 x i1> %mask_vec, <8 x i16> %dst, <8 x i16> %res
  ret <8 x i16> %result
}

and I feel that both of these are too complex.

Changed isSequentialOrUndefInRange to take an increment argument and added isAnyInRange.

RKSimon added inline comments.Jun 11 2018, 3:11 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9406 ↗(On Diff #149730)

We've usually called these functions something like 'matchVectorShuffleAsVPMOV' or 'matchVectorShuffleAsVTRUNC'

9407 ↗(On Diff #149730)

int Delta

9417 ↗(On Diff #149730)

I still don't get why you don't want to just use isUndefOrInRange with the 'safe' vector range.

9473 ↗(On Diff #149730)

Save yourself some typing ;-)

SDValue Src = V1.getOperand(0).getOperand(0);
MVT SrcVT = Src.getSimpleValueType();
9483 ↗(On Diff #149730)

Why is it just 16i8 and not 32i8 as well for _mm512_cvtepi16_epi8 ?

9487 ↗(On Diff #149730)

Couldn't this be at the top for an early-out?

9492 ↗(On Diff #149730)

Is this right? I'd expect it to check for the ! case.

GBuella added inline comments.Jun 11 2018, 3:43 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9407 ↗(On Diff #149730)

Or Stride

9483 ↗(On Diff #149730)

This part is only about truncations, where the result must be filled with extra zeros, due to the (narrower tan 128bits) result being in an xmm register.
The _mm512_cvtepi16_epi8 one truncates from a 512bit vector into a 256bit vector, that is already recognized without this patch.

The check here is about _mm_cvtepi16_epi8 (which requires avx512vl & avx512bw). It truncates from v8i16 -> v8i8, but the vpmovwb instruction actually sets a whole xmm register, so the actual result is going to be v16i8, with other 8 bytes set to zero.

Ok, perhaps these details should be explained in comments around here.

9492 ↗(On Diff #149730)

Ye, originally it was:

if (!maskContainsSequenceForVPMOV(Mask, SwappedOps, 2) &&
    !maskContainsSequenceForVPMOV(Mask, SwappedOps, 4))
  return SDValue();

The new form of this conjuction doesn't seem to make much sense at first sight, what happened?

mike.dvoretsky marked 9 inline comments as done.

Fixed the error in the final check (it was from badly undone edits around there). Moved the early-exit check. Expanded the comment on the AVX512BW check for clarity. Some names changed per comments.

mike.dvoretsky added inline comments.Jun 11 2018, 5:42 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9417 ↗(On Diff #149730)

Due to SwappedOps, the 'safe' range may be before or after the 'unsafe' one, so in this case catching unsafe values is tidier. To use isUndefOrInRange we'd first have to define whether we accept elements in [Size, 2*Size) or in [0, Size). If you keep insisting, I may do that, but at the moment I don't see much benefit in that.

RKSimon added inline comments.Jun 11 2018, 5:55 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9481 ↗(On Diff #150717)

SrcVT.is512BitVector()

9483 ↗(On Diff #149730)

Shouldn't it handle this case? https://godbolt.org/g/Yxw7nE

GBuella added inline comments.Jun 11 2018, 6:05 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9483 ↗(On Diff #149730)

Probably that could also be implemented here, we just didn't think about it so far.
There is/was a patch for those using builtin_convertvector
https://reviews.llvm.org/D46742
This patch was originally intended to handle these cases, which can't be don with
builtin_convertvector.

But if it is not a lot of extra work, the shufflevector equivalents of those convertvector ones could be detected here.

mike.dvoretsky added inline comments.Jun 11 2018, 6:19 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
9483 ↗(On Diff #149730)

The example doesn't contain explicit truncations, so it isn't handled by this particular function anyway and there's no need to check for features for it here.

Correct me if I'm wrong, but I don't think that the aim of this patch is to catch every possible VPMOV pattern, just as many as convenient for subsequent intrinsic lowering, and in the cases of 128-bit-or-wider outputs, where there's no need to zero out the upper parts of xmm registers, a pattern more complex than (in this case)

%res = trunc <32 x i16> %v to <32 x i8>

is just a needless complication.

mike.dvoretsky marked an inline comment as done.Jun 12 2018, 1:09 AM
RKSimon added inline comments.Jun 19 2018, 8:44 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
4867 ↗(On Diff #150895)

Fix comment - (Low, Low+Step*Size] ?

mike.dvoretsky marked an inline comment as done.
RKSimon accepted this revision.Jun 21 2018, 4:57 AM

LGTM

This revision is now accepted and ready to land.Jun 21 2018, 4:57 AM
This revision was automatically updated to reflect the committed changes.