Page MenuHomePhabricator

[X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
ClosedPublic

Authored by m_zuckerman on Jun 25 2017, 12:38 AM.

Details

Summary

This patch expands the support of lowerInterleavedStore to 32x8i stride 4.

LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=4 VF=32) and we plan to include more patterns in the future. To reach our goal of "more patterns". We include two mask creators. The first function creates shuffle's mask equivalent to unpacklo/unpackhi instructions. The other creator creates mask equivalent to a concat of two half vectors(high/low).

The patch goal is to optimize the following sequence:
At the end of the computation, we have ymm2, ymm0, ymm12 and ymm3 holding
each 32 chars:

c0, c1, , c31
m0, m1, , m31
y0, y1, , y31
k0, k1, ., k31

And these need to be transposed/interleaved and stored like so:

c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon edited edge metadata.Jun 25 2017, 2:15 AM

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

lib/Target/X86/X86InterleavedAccess.cpp
367 ↗(On Diff #103862)

Indenting / clang-format

381 ↗(On Diff #103862)

If you pull out the Type::getInt16Ty(Shuffles[0]->getContext()) you should be able to tidy all this up

test/CodeGen/X86/x86-interleaved-access.ll
143 ↗(On Diff #103862)

Add this test to trunk with current codegen so this patch shows the diff.

test/Transforms/InterleavedAccess/X86/interleavedStore.ll
1 ↗(On Diff #103862)

Add this file to trunk with current codegen so this patch shows the diff.

2 ↗(On Diff #103862)

You just need -mattr=+avx2 - it implies -mattr=+avx

m_zuckerman marked 4 inline comments as done.
m_zuckerman marked an inline comment as done.

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

In general, we plan to generalize the problem and this is why we added the two mask functions.

dorit edited edge metadata.Jun 27 2017, 3:18 AM

Just a few minor comments. Looks good to me otherwise, but should give @RKSimon a chance to see if he's happy with your responses (and maybe also another day to give @Farhana/@DavidKreitzer a chance to comment).

lib/Target/X86/X86InterleavedAccess.cpp
105 ↗(On Diff #103953)

Should this comment be updated?

180 ↗(On Diff #103953)

80-column overflow...

(and while you're fixing this, maybe use the same capitalization in naming of arguments (i.e. - either "vec1" or "VEC1" form but not both)).

419 ↗(On Diff #103953)

Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

m_zuckerman marked 2 inline comments as done and an inline comment as not done.

Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

Thanks , I tried it and the result was good for AVX.
You can see in the test interleavedStore.all that we are getting less instructions compared to the old code.

In the new patch I added a test checks the AVX sequence and new if verifies that VF32 is with store instructions.

zvi added a subscriber: zvi.Jul 4 2017, 11:19 PM
zvi added a comment.Jul 4 2017, 11:22 PM

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

In D34601#799242, @zvi wrote:

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

No problem

Farhana added inline comments.Jul 5 2017, 5:07 PM
lib/Target/X86/X86InterleavedAccess.cpp
170 ↗(On Diff #104990)

Why not use the createUnpackShuffleMask(..) defined in X86ISelLowering.cpp? I would recommend you to declare a template function of this in X86ISelLowering.h which will allow both int and int32_t mask and get rid of this from X86InterleavedAccess.cpp.

186 ↗(On Diff #104990)

The comment does not match the coding logic. Something is off here. May be it would be NumElement+NumElement/2?

205 ↗(On Diff #104990)

I would think you could write a more generate transpose function, a function of 8 elements which would scale into 16 and 32. Why is it written only for 32 VLen?

384 ↗(On Diff #104990)

Value *VecInst = SI->getPointerOperand()

384 ↗(On Diff #104990)

What is the reason for generating for stores instead of 1 wide store? I would think codegen is smart enough to generate expected 4 optimized stores here. Can you please keep it consistent with the other case which means do the concatenation of the transposed vectors and generate 1 wide store.

StoreInst *SI = cast<StoreInst>(Inst);
Value *BasePtr = SI->getPointerOperand();

case 32: {

transposeChar_32x4(DecomposedVectors, TransposedVectors);
// VecInst contains the Ptr argument.
Value *VecInst = SI->getPointerOperand();
Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
// From <128xi8>* to <64xi16>*
Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
BasePtr = Builder.CreateBitCast(VecInst, VecTran);

And move the following two statements out of the switch.

//   3. Concatenate the contiguous-vectors back into a wide vector.
Value *WideVec = concatenateVectors(Builder, TransposedVectors);

//   4. Generate a store instruction for wide-vec.
Builder.CreateAlignedStore(WideVec, BasePtr, // replace this with 
                           SI->getAlignment());
389 ↗(On Diff #104990)

++i

DavidKreitzer edited edge metadata.Jul 6 2017, 1:50 PM

Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.

lib/Target/X86/X86InterleavedAccess.cpp
73 ↗(On Diff #104990)

"transpose" is a poor name here. "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4.

"transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.

74 ↗(On Diff #104990)

Trasposed --> Transposed. Can you fix this at line 72 as well?

195 ↗(On Diff #104990)

Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.

260 ↗(On Diff #104990)

I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors".

Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this:

<a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1>

There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by the CodeGen, ultimately causing the desired unpckwd instructions to be generated.

384 ↗(On Diff #104990)

+1

426 ↗(On Diff #104990)

This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with 105-107.

test/CodeGen/X86/x86-interleaved-access.ll
199 ↗(On Diff #104990)

The codegen improvements here look great!

m_zuckerman marked 9 inline comments as done.
m_zuckerman added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
205 ↗(On Diff #104990)

You are right and we plan to do so in the next patch.

m_zuckerman marked an inline comment as done.Jul 12 2017, 6:15 AM

Thanks for the changes, Michael, and thanks for following up on the perf issue in https://bugs.llvm.org/show_bug.cgi?id=33740.

I have a few additional comments, but this generally looks good. The simplification of the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence is great!

lib/Target/X86/X86ISelLowering.h
1440 ↗(On Diff #106158)

Fix the indenting here.

lib/Target/X86/X86InterleavedAccess.cpp
73 ↗(On Diff #104990)

I see that you changed this to "deinterleave8bit_32x4" rather than "interleave8bit_32x4". Can you please explain why? This routine is taking 4 input vectors and merging their elements like this:

v0[0], v1[0], v2[0], v3[0], v0[1], v1[1], v2[1], v3[1], ...

Wouldn't you call that interleaving?

113 ↗(On Diff #106158)

The old wording here was better, and you have a typo in "oration".

198 ↗(On Diff #106158)

Would it be possible to simplify the implementation of this routine by just calling createUnpackShuffleMask with an i16 type plus calling scaleShuffleMask with a scale of 2? (That would require you to move scaleShuffleMask into X86ISelLowering.h like you did with createUnpackShuffleMask.)

m_zuckerman added inline comments.Jul 12 2017, 9:50 AM
lib/Target/X86/X86InterleavedAccess.cpp
73 ↗(On Diff #104990)

You are right, I swapped the terminology.

m_zuckerman marked 3 inline comments as done.

Hi Michael, I have one small additional comment. Otherwise, this looks good.

lib/Target/X86/X86ISelLowering.h
1457 ↗(On Diff #106407)

Did you want to have T default to int as you did for createUnpackShuffleMask? That will avoid the extra changes in X86ISelLowering.cpp.

m_zuckerman marked an inline comment as done.Jul 16 2017, 1:56 AM

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This the error I am getting when I try to define the function with a default parameter.
void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file. We can define an overload function in the header file that will explicit set the function signature.

The function in the header will look like:

template <typename t1, unsigned n>
void scaleshufflemask(int scale, typename smallvector<t1, n> mask,
  typename smallvectorimpl<t1> &scaledmask) {
  scaleshufflemask(scale, makearrayref(mask), scaledmask);

This function describe another signature that one of the call in the CPP file uses.

The other option is as I proposed on this review.

What you prefer?

RKSimon added inline comments.Jul 19 2017, 2:44 AM
lib/Target/X86/X86ISelLowering.h
1457 ↗(On Diff #106407)

+1

lib/Target/X86/X86InterleavedAccess.cpp
175 ↗(On Diff #106407)

shuffle

188 ↗(On Diff #106407)

I think you could make this much simpler to understand:

int NumHalfElements = NumElement / 2;
int Offset = Low ? 0 : NumHalfElements;
for (int i = 0; i < NumHalfElements; ++i)
  Mask.push_back(i + Offset);
for (int i = 0; i < NumHalfElements; ++i)
  Mask.push_back(i + Offset + NumElements);

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This the error I am getting when I try to define the function with a default parameter.
void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file. We can define an overload function in the header file that will explicit set the function signature.

The function in the header will look like:

template <typename t1, unsigned n>
void scaleshufflemask(int scale, typename smallvector<t1, n> mask,
  typename smallvectorimpl<t1> &scaledmask) {
  scaleshufflemask(scale, makearrayref(mask), scaledmask);

This function describe another signature that one of the call in the CPP file uses.

The other option is as I proposed on this review.

What you prefer?

I am not a fan of either of those options. How about getting rid of the template argument altogether and just use "int"? That is what ShuffleVectorInstruction::getShuffleMask uses, so it seems like it should be good enough here too. If you take this approach, it would be good to modify transpose_4x4 to use "int" instead of "uint32_t" for consistency.

m_zuckerman marked an inline comment as done.
m_zuckerman marked an inline comment as done.

build.CreateShuffleVector accepts only uint32_t. I created a static CreateShuffleVector that do reinterpret_cast the int to uint. This solves our dependency on the template.

DavidKreitzer added inline comments.Jul 21 2017, 11:05 AM
lib/Target/X86/X86InterleavedAccess.cpp
188 ↗(On Diff #107663)

Ugh! I didn't realize that you were going to have to add this method to do the ArrayRef conversion. Sorry, I think I gave you bad advice. I think this is a worse evil than the explicit template arguments in the previous patch.

Unless someone else has a better idea, my recommendation is to go back to your previous solution and accept the explicit template arguments as an unfortunate consequence of the inconsistent ShuffleVector interfaces.

I returned to the previous patch with small modification as @RKSimon asked.

I returned to the previous patch with small modification as @RKSimon asked.

Thanks, Michael. This LGTM pending @RKSimon's re-review. But can you please update your sources so that we can see how you merged with https://reviews.llvm.org/D35638, since that will cause conflicts?

DavidKreitzer accepted this revision.Jul 24 2017, 12:27 PM

Thanks, Michael. LGTM, again pending @RKSimon's re-review.

This revision is now accepted and ready to land.Jul 24 2017, 12:27 PM
RKSimon accepted this revision.Jul 25 2017, 5:53 AM

LGTM - if possible moving createUnpackShuffleMask and scaleShuffleMask should be done as NFC pre-commit.

This revision was automatically updated to reflect the committed changes.