This is an archive of the discontinued LLVM Phabricator instance.

[X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess (VF8 stride 4).

Authored by m_zuckerman on Jul 30 2017, 5:20 AM.



This patch expands the support of lowerInterleavedStore to 8x8i stride 4.

LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=4 VF=8) and we plan to include more patterns in the future.

The patch goal is to optimize the following sequence:
At the end of the computation, we have xmm2, xmm0, xmm12 and xmm3 holding
each 8 chars:

c0, c1, , c7
m0, m1, , m7
y0, y1, , y7
k0, k1, ., k7

And these need to be transposed/interleaved and stored like so:

c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....

Diff Detail

Event Timeline

m_zuckerman created this revision.Jul 30 2017, 5:20 AM
m_zuckerman edited the summary of this revision. (Show Details)Jul 30 2017, 5:37 AM
RKSimon added inline comments.Jul 31 2017, 9:37 AM


Farhana added inline comments.Jul 31 2017, 11:10 AM

Can you please add some comments about what this function does?

m_zuckerman marked 2 inline comments as done.
Farhana added inline comments.Aug 11 2017, 7:24 AM

Please remove the extra space after "&&".


Please remove the unknown character.


Is there anything special about this unpackMask? Why do we need a separate unpackmask generator? Why aren't you extending the createUnpackShuffleMask()? If createLowUnpackMask is unpacking in a different way than createUnpackShuffleMask() that distinction should be documented and clearly noted why we need a separate unpackMask-creator.

Farhana added inline comments.Aug 11 2017, 11:47 AM

Do we also need to check DL.getTypeSizeInBits(ShuffleVecTy) == 256 here?

m_zuckerman marked 2 inline comments as done.
m_zuckerman added inline comments.

Yes since in default mode, the compose function creates 4 xmm where only half of the wide is been used. I want that high bits will be left undef so I am checking 4*xmm = 256 in the next step will fill the undef part.


The regular unpack function doesn't create * 2XN * as we need in this case and creates only half from one vector and another half from the second vector. In mine function, we use all the vector and the not only half. For the backend, the result is the same but the way they create is different.

zvi edited edge metadata.Sep 11 2017, 12:22 PM

If this patch is still in review please rebase it.

zvi added inline comments.Sep 12 2017, 5:06 AM

'unsigned' looks like a more suitable type, That will also allow 'i' in the for loop below to be of 'unsigned' type, and that is better because it is being pushed into Mask, a container of type 'uint32_t'.


The comment doesn't address the special case v8i8. Can you please explain why the exception is needed?




Can you please commit these minor changes: leaner computation of VT and Half, variable renaming in a couple of lines below as an NFC commit and rebase this patch?


I think it would be better to create a function that handles the v8i8 case instead of having these if's + early exit. It will also allow reomving the if from scaleVectorType to the point where it may not be needed.

m_zuckerman marked 3 inline comments as done.
zvi accepted this revision.Sep 25 2017, 1:58 AM

LGTM, thanks!

This revision is now accepted and ready to land.Sep 25 2017, 1:58 AM
This revision was automatically updated to reflect the committed changes.