This is an archive of the discontinued LLVM Phabricator instance.

[X86][LLVM] Expand support for lowerInterleavedStore() in X86InterleavedAccess (VF{8|16|32} stride 3).
ClosedPublic

Authored by m_zuckerman on Aug 24 2017, 12:45 PM.

Details

Summary

This patch expands the support of lowerInterleavedStore to {8|16|32}xi8 with stride 3.

LLVM creates suboptimal shuffle code-gen for AVX2. Overall, this patch is a specific fix for the pattern (stride=3, VF={8|16|32}). This patch is the second of two patches and covers the store (interleaved) side.

The goal of the patch is to optimize the following sequence:
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7

into
a0 b0 c0 a1 b1 c1 a2 b2
c2 a3 b3 c3 a4 b4 c4 a5
b5 c5 a6 b6 c6 a7 b7 c7

Diff Detail

Event Timeline

zvi edited edge metadata.Aug 28 2017, 2:34 AM

Some of the comments i wrote in D39690, apply here, so will wait for the comments to be addressed in both patches.

lib/Target/X86/X86InterleavedAccess.cpp
675–678

Please check the formatting

m_zuckerman updated this revision to Diff 114513.
m_zuckerman marked an inline comment as done.
zvi added a comment.Sep 24 2017, 5:11 AM

Please remember to create the diff with -U

lib/Target/X86/X86InterleavedAccess.cpp
506

Please capitalize variables

508

*VectorWidth

519

This statement has no effect => This loop can be avoided if Output.push_back(...) is done in the above loop.

523

What does the 'optimize' part of the name mean? Looks like the function creates shuffle indices for given two source indices.

528

Starting from here, IMHO, the comment doesn't add much as the code below is very similar.

538

Instead of passing the bool args, pass the variable offsets which would be both either 0 or NumElts/2.

567

missing an 'e'?

m_zuckerman marked 2 inline comments as done.Sep 24 2017, 10:27 AM
m_zuckerman added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
519

I am sorry, but I do not follow you. Why?
In the first loop, I only compute the different indexes.
In the second loop, I use the result of the first loop to create the mask.
So how do you suggest combining them?

523
 0-15     64-79
16-31     80-95
32-47     96-111
48-63    112-127

For the sequence to work as a mirror of the load, we must consider the element order as above. You need to somehow combine two types of shuffles:
the first is a vpshufb and the second is a "blend"-type shuffle. By computing the shuffle on a sequence of 16 elements (one lane) and adding the correct offset, "optimizeShuffle" avoids adding any new IR and creates a vpshufb sequence between two shuffles.

538

What about the case of the low half from one shuffle and the high half from the other?

m_zuckerman marked 2 inline comments as done.
zvi added inline comments.Sep 24 2017, 1:50 PM
lib/Target/X86/X86InterleavedAccess.cpp
519

What i mean is that IndexGroup is a function-local array. Writing to the i-th index *after* the last read from the i-th index is redundant, so this loop can be rewritten as:

for (int i = 0; i < VF / Lane; i++)
  Output.push_back(IndexGroup[i % 3]);

Next, is there a better way to rewrite the code in these two loops into one loop? Not sure about that. Can you please look into this possibility?

524

Also this one needs to be capitalized

538

You mean this case?

optimizeShuffle(VT, VPShuf, OptimizeShuf[1], true, false);

It would be changed to:

optimizeShuffle(VT, VPShuf, OptimizeShuf[1], 0, VT.getVectorNumElements()/2);
m_zuckerman marked 2 inline comments as done.
zvi accepted this revision.Sep 25 2017, 5:26 AM

LGTM

This revision is now accepted and ready to land.Sep 25 2017, 5:26 AM
m_zuckerman closed this revision.Sep 26 2017, 12:25 PM