Page MenuHomePhabricator

[X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess (VF{8|16|32} stride 3)..

Authored by m_zuckerman on Aug 24 2017, 12:45 PM.



This patch expands the support of lowerInterleavedStore to {8|16|32}x8i stride 3.

LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=3 VF={8|16|32}) . This patch is part two of two patches and it covers the store (interlevaed) side.

The patch goal is to optimize the following sequence:
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7

a0 b0 c0 a1 b1 c1 a2 b2
c2 a3 b3 c3 a4 b4 c4 a5
b5 c5 a6 b6 c6 a7 b7 c7

Diff Detail

Event Timeline

zvi edited edge metadata.Aug 28 2017, 2:34 AM

Some of the comments i wrote in D39690, apply here, so will wait for the comments to be addressed in both patches.


Please check the formatting

m_zuckerman updated this revision to Diff 114513.
m_zuckerman marked an inline comment as done.
zvi added a comment.Sep 24 2017, 5:11 AM

Please remember to create the diff with -U


Please capitalize variables




This statement has no effect => This loop can be avoided if Output.push_back(...) is done in the above loop.


What does the 'optimize' part of the name mean? Looks like the function creates shuffle indices for given two source indices.


Starting from here, IMHO, the comment doesn't add much as the code below is very similar.


Instead of passing the bool args, pass the variable offsets which would be both either 0 or NumElts/2.


missing an 'e'?

m_zuckerman marked 2 inline comments as done.Sep 24 2017, 10:27 AM
m_zuckerman added inline comments.

I am sorry but I do not follow you, why?
In the first loop, I only compute the different indexes.
In the second loop, I use the result of the first loop and create the mask.
So how do you suggest to combine them together?

0-1564- 79

For the sequence to work as a mirror to the load. We must consider the elements order as above. You need somehow to combine two types of shuffles.
The first one is vpshufed and the second is a type of "blend" shuffle. By computing the shuffle on a sequence of 16 elements (one lane) and add the correct offset you are doing "optimizedShuffle" that doesn't add any new IR and creating a vpsuffed sequence between two shuffles.


What about the case of low from one shuffle and high from the other ?

m_zuckerman marked 2 inline comments as done.
zvi added inline comments.Sep 24 2017, 1:50 PM

What i mean is that IndexGroup is a function-local array. Writing to the i-th index *after* the last read from the i-th index is redundant, so this loop can be rewritten as:

for (int i = 0; i < VF / Lane; i++)
  Output.push_back(IndexGroup[i % 3]);

Next, is there a better way to rewrite the code in these two loops into one loop? Not sure about that. Can you please look into this possibility?


Also this one needs to be capitalized


You mean this case?

optimizeShuffle(VT, VPShuf, OptimizeShuf[1], true, false);

It would be changed to:

optimizeShuffle(VT, VPShuf, OptimizeShuf[1], 0, VT.getVectorNumElements()/2);
m_zuckerman marked 2 inline comments as done.
zvi accepted this revision.Sep 25 2017, 5:26 AM


This revision is now accepted and ready to land.Sep 25 2017, 5:26 AM
m_zuckerman closed this revision.Sep 26 2017, 12:25 PM