Page MenuHomePhabricator

Optimize patterns of vectorized interleaved memory accesses for X86.
ClosedPublic

Authored by Farhana on Sep 16 2016, 1:00 PM.

Details

Summary

[X86InterleavedAccess] Optimize patterns of vectorized interleaved memory accesses for X86.

Prior to this, there were no x86 implementation of InterleavedAccessPass which detects a set of interleaved accesses and generates target specific intrinsics.
Here is an example of interleaved loads:

%wide.vec = load <8 x i32>, <8 x i32>* %ptr
%v0 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <0, 2, 4, 6>
// %v1 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <1, 3, 5, 7>

The ARM implementation generates ldn/stn intrinsics.

The change-set here places the basic framework to support InterleavedAccessPass on X86. It also, tries to detect an interleaved pattern(with 4 interleaved accesses, stride:4, 64-bit on AVX/AVX2) and generate optimized sequence for that.

This is just the first step of a long effort. The short-term plan is to continue supporting a few patterns this way while we work out a more general solution.

In order to allow code sharing between multiple transpose functions, the next change-set will introduce a class that will encapsulate all the necessary information.

Due to this change-set,
/ Current supported interleaved loads: here, T = {i/f}
/ %wide.vec = load <16 x T64>, <16 x T64>* %ptr
/ %v0 = shuffle %wide.vec, undef, <0, 4, 8, 12> ;
/ %v1 = shuffle %wide.vec, undef, <1, 5, 9, 13> ;
/ %v2 = shuffle %wide.vec, undef, <2, 6, 10, 14> ;
/ %v3 = shuffle %wide.vec, undef, <3, 7, 11, 15> ;
/
/ Into:
/ %load0 = load <4 x T64>, <4 x T64>* %ptr
/ %load1 = load <4 x T64>, <4 x T64>* %ptr+32
/ %load2 = load <4 x T64>, <4 x T64>* %ptr+64
/ %load3 = load <4 x T64>, <4 x T64>* %ptr+96
/
/ %intrshuffvec1 = shuffle %load0, %load2, <0, 1, 4, 5>;
/ %intrshuffvec2 = shuffle %load1, %load3, <0, 1, 4, 5>;
/ %v0 = shuffle %intrshuffvec1, %intrshuffvec2, <0, 4, 2, 6>;
/ %v1 = shuffle %intrshuffvec1, %intrshuffvec2, <1, 5, 3, 7>;
/
/ %intrshuffvec3 = shuffle %load0, %load2, <2, 3, 6, 7>;
/ %intrshuffvec4 = shuffle %load1, %load3, <2, 3, 6, 7>;
/ %v2 = shuffle %intrshuffvec3, %intrshuffvec4, <0, 4, 2, 6>;
/ %v3 = shuffle %intrshuffvec3, %intrshuffvec4, <1, 5, 3, 7>;

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
zvi added a subscriber: zvi.Sep 26 2016, 8:15 AM
Farhana updated this revision to Diff 72540.Sep 26 2016, 12:26 PM
zvi added a comment.Sep 26 2016, 1:05 PM

Some minor comments

lib/Target/X86/X86InterleavedAccess.cpp
34

Need to check that Factor==4?
What about checking load size?

80

unexpected load size

92

From this point, is the code hard coded' for Factor==4? Maybe replacing occurrences of Factor with literal '4' will make it more obvious until we support more cases?

Farhana updated this revision to Diff 72853.Sep 28 2016, 10:22 AM

Hi Zvi,

Thanks for your comments, I incorporated them.

Farhana

lib/Target/X86/X86InterleavedAccess.cpp
35

Yes, we need to check whether factor ==4. So, the check Shuffles.size() != UniqueIndices.size() at line 150would cover that and it would also make sure that we don't have duplicate shuffles.

I don't think we to check the load size, all we care about the shuffles. And if the loadsize is not correct then the assertion in line 78 will hit since at that point the load is expected to load only the elements in the shuffles.

Farhana marked 3 inline comments as done.Sep 28 2016, 10:35 AM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
81

Good suggestion!

Farhana marked an inline comment as done.Sep 28 2016, 10:36 AM
Farhana added inline comments.Sep 28 2016, 10:39 AM
lib/Target/X86/X86InterleavedAccess.cpp
93

Yes, it is hardcoded currently, but I have a plan to generalize this part with any number of factors in the next change-set. So, keeping it as it is will require smaller changes in the next one.

What's the reasoning behind adding X86InterleavedAccess.cpp instead of including inside X86ISelLowering.cpp?

lib/Target/X86/X86InterleavedAccess.cpp
151

Move both the tests to the top and test (Shuffles.size() != Indices.size()) directly?

156

Remove braces (style).

test/CodeGen/X86/x86-interleaved-access.ll
2

Is this actually true? The checks below don't look like what the script would generate.

DavidKreitzer requested changes to this revision.Sep 30 2016, 9:59 AM
DavidKreitzer edited edge metadata.

Hi Farhana,

The overall architecture of the optimization looks good to me, but a few fixes are needed, most notably a fix for the correctness problem related to the ordering of the Shuffles array.

Thanks,
-Dave

lib/Target/X86/X86InterleavedAccess.cpp
23

are --> is

88

Wouldn't it be simpler to create a bitcast to a pointer to an array of the smaller vector type? That would avoid the need for further bitcasts, saving 4 LLVM IR instructions. For example, in the 4 interleaved double case, the bitcast would look like this:

%bc = bitcast <16 x double>* %ptr to [4 x <4 x double>]*

And then your GEPs would look like this:

%gep0 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 0
%gep1 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 1
%gep2 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 2
%gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3
120

You seem to be making unwarranted assumptions about the order of the Shuffles array. Your code is only correct if

Indices[0] = 3
Indices[1] = 2
Indices[2] = 1
Indices[3] = 0

I see nothing in the interleaved access pass that guarantees this ordering.

A better way to write this function would be to populate a new array of shuffles where NewShuffles[0] corresponds to an index of 0, NewShuffles[1] corresponds to an index of 1, etc. Then you can do the shuffle replacement like this:

for (unsigned i = 0; i < Shuffles.size(); i++) {
  unsigned Index = Indices[i];
  Shuffles[i]->replaceAllUsesWith(NewShuffles[Index]);
}
145

The indentation is off here. Can you please run clang-format on this file? To make things easy on your reviewers, please do this in a separate upload without ANY other changes.

147

Do you really need this test for unique indices? I see no reason not to handle multiple shuffles with the same index.

I also don't see a need for all possible indices to be covered, which you are implicitly requiring by forcing all the indices to be unique and checking that Shuffles.size() == 4 in isSupported. If you remove this requirement, you will automatically handle more cases. For example, suppose we vectorize a loop that accesses only fields a, b, and d of an array of this type of structure:

struct {
  double a, b, c, d;
};

Your shuffle sequence will still work well for this case. One of the output shuffles will just go unused.

This suggestion has implications on the isSupported routine. You would have to replace the Shuffles.size() == 4 check with Factor == 4.

This revision now requires changes to proceed.Sep 30 2016, 9:59 AM
Farhana updated this revision to Diff 73086.Sep 30 2016, 10:22 AM
Farhana edited edge metadata.

This version is created after running clang-format.

Farhana updated this revision to Diff 73093.Sep 30 2016, 10:47 AM
Farhana edited edge metadata.

Uploaded the whole patch after running clang-format.

RKSimon added inline comments.Sep 30 2016, 11:20 AM
lib/Target/X86/X86InterleavedAccess.cpp
151

Discard this - didn't notice that it was creating a set.

Farhana updated this revision to Diff 73129.Sep 30 2016, 1:46 PM

Includes changes that reflect Dave and Simon's comment.

Farhana added inline comments.Sep 30 2016, 2:03 PM
lib/Target/X86/X86InterleavedAccess.cpp
88

Good suggestion!

120

Well, vectorizer guarantees the order, but certainly I should not have relied on that since any other optimizations can mess up the order.

147

What about the case where we only have only one strided_load and generating 4 extracts would be more profitable than generating 8 vector instructions, right?

test/CodeGen/X86/x86-interleaved-access.ll
2

Hi Simon,
I am not sure whether I understand your concern. Which checks are you talking about?
Farhana

DavidKreitzer added inline comments.Oct 3 2016, 1:37 PM
lib/Target/X86/X86InterleavedAccess.cpp
36

Please remove this check for Shuffles.size() != 4. I don't think you need it.

Also, I think Zvi is right that you need to check the load size here. Even in the previous version of the code, there is nothing to guard against the load being larger than you expect. So (as you pointed out), the "unexpected load size" assertion might fire on valid LLVM IR input.

And in the new version of the code where you are no longer checking that a shuffle exists for each possible Index (0, 1, 2, and 3), there is the oddball possibility that the load is smaller than you expect. Probably what you should be checking here is that LoadSize >= Factor * ShuffleVecSize since that is the size of the expanded set of loads.

88

Thanks for the fix. I'm okay with the way you did this, but note that by modelling the address as

%gep3 = getelementptr <4 x double>, <4 x double>* %bc, i32 3

instead of

%gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3

the size of the underlying memory reference is no longer explicit in the GEP. You can imagine a scenario where this could have a negative impact on subsequent optimization.

147

Even in the case of only one strided load, I think you will find that the CG will generate good code from your expansion sequence. It is true that 5 of the 8 shuffle instructions you are generating are unnecessary in that case, but those shuffle instructions should be optimized away as dead.

Try it out, and perhaps add a unit test for this situation, and possibly for some of the other unusual situations like multiple shuffles with the same index.

RKSimon added inline comments.Oct 5 2016, 9:00 AM
test/CodeGen/X86/x86-interleaved-access.ll
2

The 'AVX-NEXT'/'AVX1-NEXT'/'AVX2-NEXT' checks - the update script would generate quite a bit more than what is shown below.

Farhana updated this revision to Diff 73680.Oct 5 2016, 11:56 AM

Hi Farhana,

Aside from one minor issue in the test, this looks great.

-Dave

test/CodeGen/X86/x86-interleaved-access.ll
26

There should be a vunpckhpd here too, right? Is there a reason you are not checking for it?

delena added inline comments.Oct 6 2016, 8:34 AM
lib/Target/X86/X86InterleavedAccess.cpp
71

AVX512 probably has another set of shuffles

90

inbounds GEP?

144

It is not a good name for function. I think that you don't need additional function call here at all.

Farhana updated this revision to Diff 73869.Oct 6 2016, 5:47 PM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
36

Enforcing LoadSize >= Factor * ShuffleVecSize makes the support of any combination of shuffles kind of pointless, It will only support the shuffles starting with 0 and 3. That is the reason I did not want to support it in this change-set.

My plan was to support it completely by generating maskload in the follow-up change-set, also avoid generating unnecessary shuffles.

90

Since GEP does not come in a fixed order before the load-instruction, it requires traversing the instructions. I have a plan to add it in the next change-set. I added a todo comment to the code.

Also, vectorizer does not generate inboundsGEP today, so there were not much incentive to add that in this change-set.

144

You are right in the current context. My plan is to define a class to encapsulate all the information and allow data sharing where I will have two main functions one for load generation and the other one for shuffle-generation. In order to keep the follow-up change-set with minimal changes I decided to create a function here. I hope it's ok to do so.

147

I know CG will get rid of the other shuffles, I am not talking about this particular case. In general, I think we have to incorporate some concept of cost in order to support any number of accesses.

test/CodeGen/X86/x86-interleaved-access.ll
2

If I understand your comment correctly, you are saying the optimization will generate more instructions than it is checking for. Yes, it only checks for the must instructions, because the rest can be optimized away depending on the uses.

26

Right. The first four are the stepping stones, which guarantee the rest. I did not add it in order to keep the file size small. I know this is going to be populated with lot of tests.

DavidKreitzer accepted this revision.Oct 7 2016, 7:06 AM
DavidKreitzer edited edge metadata.

This LGTM, but I'd like you to also get approvals from Michael & Elena before proceeding.

This revision is now accepted and ready to land.Oct 7 2016, 7:06 AM
delena edited edge metadata.Oct 7 2016, 7:39 AM

What about shuffles set for AVX-512?

lib/Target/X86/X86InterleavedAccess.cpp
144

Each patch should look good regardless of future plans.

Farhana updated this revision to Diff 73940.Oct 7 2016, 9:42 AM
Farhana edited edge metadata.
Farhana updated this revision to Diff 73941.Oct 7 2016, 10:20 AM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
71

Yes. The plan is to support AVX512 in a separate check-in, also handle the patterns that take advantage of its extended shuffle instructions and the wider vector length. This change-set is meant to support only AVX1 and AVX2.

144

I totally agree with you. I in-lined the function.

test/CodeGen/X86/x86-interleaved-access.ll
2

Hi Simon,

I think I understand your question now (Dave helped me).

You are right the script update_llc_test_checks.py generates quite a bit more checks than what I have here. Yes, the checks are not auto-generated by the script. I got rid of the NOTE.

But now I am wondering whether I should have used the script or not. I did not want to put all the checks because in my opinion putting all of them would be unnecessary in this case, checking for first few instructions would be enough to ensure the behavior.

Let me know if you think it's good practice to use the script always...

RKSimon added inline comments.Oct 7 2016, 12:21 PM
test/CodeGen/X86/x86-interleaved-access.ll
2

Generally yes, the script output is great as its easy to regenerate, it means you're not hiding anything and its easier to grok the entire codegen.

There are plenty of cases where bulky codesize is just too off putting and CHECKs should be more selective, but if the codesize could be reduced in the future I'd tend to include it as its very useful to show the delta.

But for these interleave cases I think it'd be useful, especially as we don't have any other reference examples of x86 interleave codegen at present. If it means you need to split the tests into multiple files (we often have 128 / 256 / 512 versions of test files), so be it.

Farhana updated this revision to Diff 74004.Oct 7 2016, 5:10 PM
Farhana added inline comments.
test/CodeGen/X86/x86-interleaved-access.ll
2

Sounds good.

Right now, I don't need to split the file, may be in the future I will have to consider it.

RKSimon accepted this revision.Oct 8 2016, 4:17 AM
RKSimon added a reviewer: RKSimon.

LGTM

delena accepted this revision.Oct 9 2016, 12:45 AM
delena edited edge metadata.
mkuper edited edge metadata.Oct 10 2016, 8:41 AM

I'm really sorry for coming into this review so late, but could you please also add stand-alone IR-level tests for the pass itself (and not only llc-level tests)?
Not only we'll get better testing granularity, but it'll be easier to see what the produced IR looks like, and whether it's really in the form we want.

lib/Target/X86/X86InterleavedAccess.cpp
77

There's an overload of CreateGEP that doesn't require the nullptr.

I'm really sorry for coming into this review so late, but could you please also add stand-alone IR-level tests for the pass itself (and not only llc-level tests)?
Not only we'll get better testing granularity, but it'll be easier to see what the produced IR looks like, and whether it's really in the form we want.

Hi Michael,

So, you want me add an opt-based IR-level test, right?

But this InterleavedAccessPass is a CodegenPass, relies on target information and it gets added to the Codegen library by TargetPassConfig::addIRPasses(). On the other hand, opt relies on PassManagerBuilder and uses Builder.populateFunctionPassManager(FPM) and

Builder.populateModulePassManager(MPM) to construct it's list of passes to run.

llc uses TargetMachine::addPassesToEmitFile in order to get the backend passes.

clang supports both.

It is possible to at least support the IRPasses of Codegen in opt, but I am not sure whether that is the intent of opt. Also, even if we want that support I would guess you would want me to support that in a separate checkin?

Farhana

Hi Farhana,

I'm not sure I see the problem.
opt already depends on both CodeGen and all-targets, and we already have opt-based tests for IR-to-IR passes that live in CodeGen. For example, you can find tests for AtomicExpandPass in test/Transforms/AtomicExtend
Note that for target-specific tests you need to create a subdirectory (within the pass' test directory) for that target, which contains a lit.local.cfg file that requires this target. Otherwise you'll get errors in configurations where this target is not compiled in.

What's more, we already have opt-level tests for this pass, they just live (in my opinion) in the wrong place - see test/CodeGen/AArch64/aarch64-interleaved-accesses-extract-user.ll
Matt, am I missing a reason those tests should live in test/CodeGen, or was that an oversight? The test tree doesn't really match the lib tree anyway, and I think the vast majority of IR test live in Transforms, regardless of where the pass is.

And I would prefer these tests to go in in the same commit - it will make it easier to make sure we don't have any surprises in the IR this produces.

Farhana updated this revision to Diff 74276.Oct 11 2016, 11:27 AM
Farhana edited edge metadata.

Hi Michael,

I added the IR-level tests.

Farhana

What's more, we already have opt-level tests for this pass, they just live (in my opinion) in the wrong place - see test/CodeGen/AArch64/aarch64-interleaved-accesses-extract-user.ll
Matt, am I missing a reason those tests should live in test/CodeGen, or was that an oversight? The test tree doesn't really match the lib tree anyway, and I think the vast majority of IR test live in Transforms, regardless of where the pass is.

No, there's no reason the IR-to-IR tests for InterleavedAccessPass should be under test/CodeGen. In fact, the existing llc tests for ARM and AArch64 in test/Codegen really should be opt tests. When we extended the pass and added extract-user.ll (an opt test), we just added it along side the original tests. This made some sense because the transformation does live under lib/Codegen and is a kind of preparation pass. But you're right, most of the other IR-to-IR codegen pass tests live under Transforms instead (atomic expand, global merge, etc.). The tests should really be under target-specific directories in test/Transforms/InterleavedAccess.

mkuper accepted this revision.Oct 12 2016, 4:24 PM
mkuper edited edge metadata.

LGTM

test/Transforms/InterleavedAccess/X86/interleaved-accesses-64bits-avx.ll
5 ↗(On Diff #74276)

Probably a good idea to generate this test with the update script - we want to see the loads, not rely on the numbered IR values, etc.

Farhana updated this revision to Diff 74512.Oct 13 2016, 7:32 AM
Farhana edited edge metadata.
This revision was automatically updated to reflect the committed changes.