This is an archive of the discontinued LLVM Phabricator instance.

Optimize patterns of vectorized interleaved memory accesses for X86.
ClosedPublic

Authored by Farhana on Sep 16 2016, 1:00 PM.

Details

Summary

[X86InterleavedAccess] Optimize patterns of vectorized interleaved memory accesses for X86.

Prior to this, there was no X86 implementation of InterleavedAccessPass, which detects a set of interleaved accesses and generates target-specific intrinsics.
Here is an example of interleaved loads:

%wide.vec = load <8 x i32>, <8 x i32>* %ptr
%v0 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <0, 2, 4, 6>
%v1 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <1, 3, 5, 7>

The ARM implementation generates ldN/stN intrinsics.
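For comparison, on AArch64 a stride-2 pattern gets lowered into roughly the following form (the exact intrinsic name and signature are approximate, shown for illustration only):

; approximate AArch64 lowering of a stride-2 interleaved load (illustration only)
%ld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0v4i32(<4 x i32>* %ptr)
%v0 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 0
%v1 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 1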

The change-set here puts in place the basic framework to support InterleavedAccessPass on X86. It also tries to detect an interleaved pattern (4 interleaved accesses, stride 4, 64-bit elements, on AVX/AVX2) and generates an optimized sequence for it.

This is just the first step of a long effort. The short-term plan is to continue supporting a few patterns this way while we work out a more general solution.

In order to allow code sharing between multiple transpose functions, the next change-set will introduce a class that will encapsulate all the necessary information.

Due to this change-set, the currently supported interleaved loads (where T64 stands for i64 or double) are:

%wide.vec = load <16 x T64>, <16 x T64>* %ptr
%v0 = shuffle %wide.vec, undef, <0, 4, 8, 12>
%v1 = shuffle %wide.vec, undef, <1, 5, 9, 13>
%v2 = shuffle %wide.vec, undef, <2, 6, 10, 14>
%v3 = shuffle %wide.vec, undef, <3, 7, 11, 15>

Into:

%load0 = load <4 x T64>, <4 x T64>* %ptr
%load1 = load <4 x T64>, <4 x T64>* %ptr+32
%load2 = load <4 x T64>, <4 x T64>* %ptr+64
%load3 = load <4 x T64>, <4 x T64>* %ptr+96

%intrshuffvec1 = shuffle %load0, %load2, <0, 1, 4, 5>
%intrshuffvec2 = shuffle %load1, %load3, <0, 1, 4, 5>
%v0 = shuffle %intrshuffvec1, %intrshuffvec2, <0, 4, 2, 6>
%v1 = shuffle %intrshuffvec1, %intrshuffvec2, <1, 5, 3, 7>

%intrshuffvec3 = shuffle %load0, %load2, <2, 3, 6, 7>
%intrshuffvec4 = shuffle %load1, %load3, <2, 3, 6, 7>
%v2 = shuffle %intrshuffvec3, %intrshuffvec4, <0, 4, 2, 6>
%v3 = shuffle %intrshuffvec3, %intrshuffvec4, <1, 5, 3, 7>

Diff Detail

Event Timeline

Farhana updated this revision to Diff 71690.Sep 16 2016, 1:00 PM
Farhana retitled this revision from to Optimize patterns of vectorized interleaved memory accesses for X86..
Farhana updated this object.
Farhana added reviewers: DavidKreitzer, mkuper, delena.
Farhana added a subscriber: llvm-commits.
zvi added a subscriber: zvi.Sep 26 2016, 8:15 AM
Farhana updated this revision to Diff 72540.Sep 26 2016, 12:26 PM
zvi added a comment.Sep 26 2016, 1:05 PM

Some minor comments

lib/Target/X86/X86InterleavedAccess.cpp
34

Need to check that Factor==4?
What about checking load size?

80

unexpected load size

92

From this point, is the code hard-coded for Factor==4? Maybe replacing occurrences of Factor with the literal '4' will make it more obvious until we support more cases?

Farhana updated this revision to Diff 72853.Sep 28 2016, 10:22 AM

Hi Zvi,

Thanks for your comments, I incorporated them.

Farhana

lib/Target/X86/X86InterleavedAccess.cpp
35

Yes, we need to check whether Factor == 4. The check Shuffles.size() != UniqueIndices.size() at line 150 covers that, and it also makes sure that we don't have duplicate shuffles.

I don't think we need to check the load size; all we care about is the shuffles. And if the load size is not correct, the assertion at line 78 will fire, since at that point the load is expected to load only the elements in the shuffles.

Farhana marked 3 inline comments as done.Sep 28 2016, 10:35 AM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
81

Good suggestion!

Farhana marked an inline comment as done.Sep 28 2016, 10:36 AM
Farhana added inline comments.Sep 28 2016, 10:39 AM
lib/Target/X86/X86InterleavedAccess.cpp
93

Yes, it is hard-coded currently, but I have a plan to generalize this part to any number of factors in the next change-set. So, keeping it as it is will require smaller changes in the next one.

What's the reasoning behind adding X86InterleavedAccess.cpp instead of including it inside X86ISelLowering.cpp?

lib/Target/X86/X86InterleavedAccess.cpp
151

Move both the tests to the top and test (Shuffles.size() != Indices.size()) directly?

156

Remove braces (style).

test/CodeGen/X86/x86-interleaved-access.ll
2

Is this actually true? The checks below don't look like what the script would generate.

DavidKreitzer requested changes to this revision.Sep 30 2016, 9:59 AM
DavidKreitzer edited edge metadata.

Hi Farhana,

The overall architecture of the optimization looks good to me, but a few fixes are needed, most notably a fix for the correctness problem related to the ordering of the Shuffles array.

Thanks,
-Dave

lib/Target/X86/X86InterleavedAccess.cpp
23

are --> is

88

Wouldn't it be simpler to create a bitcast to a pointer to an array of the smaller vector type? That would avoid the need for further bitcasts, saving 4 LLVM IR instructions. For example, in the 4 interleaved double case, the bitcast would look like this:

%bc = bitcast <16 x double>* %ptr to [4 x <4 x double>]*

And then your GEPs would look like this:

%gep0 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 0
%gep1 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 1
%gep2 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 2
%gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3
120

You seem to be making unwarranted assumptions about the order of the Shuffles array. Your code is only correct if

Indices[0] = 3
Indices[1] = 2
Indices[2] = 1
Indices[3] = 0

I see nothing in the interleaved access pass that guarantees this ordering.

A better way to write this function would be to populate a new array of shuffles where NewShuffles[0] corresponds to an index of 0, NewShuffles[1] corresponds to an index of 1, etc. Then you can do the shuffle replacement like this:

for (unsigned i = 0; i < Shuffles.size(); i++) {
  unsigned Index = Indices[i];
  Shuffles[i]->replaceAllUsesWith(NewShuffles[Index]);
}
145

The indentation is off here. Can you please run clang-format on this file? To make things easy on your reviewers, please do this in a separate upload without ANY other changes.

147

Do you really need this test for unique indices? I see no reason not to handle multiple shuffles with the same index.

I also don't see a need for all possible indices to be covered, which you are implicitly requiring by forcing all the indices to be unique and checking that Shuffles.size() == 4 in isSupported. If you remove this requirement, you will automatically handle more cases. For example, suppose we vectorize a loop that accesses only fields a, b, and d of an array of this type of structure:

struct {
  double a, b, c, d;
};

Your shuffle sequence will still work well for this case. One of the output shuffles will just go unused.

This suggestion has implications for the isSupported routine. You would have to replace the Shuffles.size() == 4 check with Factor == 4.
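For illustration (value names assumed), the vectorized IR for that case would carry only three of the four strided shuffles:

; hypothetical input: field c is never read, so there is no shuffle with index 2
%wide.vec = load <16 x double>, <16 x double>* %ptr
%a = shufflevector <16 x double> %wide.vec, <16 x double> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
%b = shufflevector <16 x double> %wide.vec, <16 x double> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
%d = shufflevector <16 x double> %wide.vec, <16 x double> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>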

This revision now requires changes to proceed.Sep 30 2016, 9:59 AM
Farhana updated this revision to Diff 73086.Sep 30 2016, 10:22 AM
Farhana edited edge metadata.

This version is created after running clang-format.

Farhana updated this revision to Diff 73093.Sep 30 2016, 10:47 AM
Farhana edited edge metadata.

Uploaded the whole patch after running clang-format.

RKSimon added inline comments.Sep 30 2016, 11:20 AM
lib/Target/X86/X86InterleavedAccess.cpp
151

Discard this - didn't notice that it was creating a set.

Farhana updated this revision to Diff 73129.Sep 30 2016, 1:46 PM

Includes changes that reflect Dave's and Simon's comments.

Farhana added inline comments.Sep 30 2016, 2:03 PM
lib/Target/X86/X86InterleavedAccess.cpp
88

Good suggestion!

120

Well, the vectorizer guarantees the order, but I certainly should not have relied on that, since other optimizations can change the order.

147

What about the case where we have only one strided load? Generating 4 extracts would be more profitable than generating 8 vector instructions, right?

test/CodeGen/X86/x86-interleaved-access.ll
2

Hi Simon,
I am not sure whether I understand your concern. Which checks are you talking about?
Farhana

DavidKreitzer added inline comments.Oct 3 2016, 1:37 PM
lib/Target/X86/X86InterleavedAccess.cpp
35

Please remove this check for Shuffles.size() != 4. I don't think you need it.

Also, I think Zvi is right that you need to check the load size here. Even in the previous version of the code, there is nothing to guard against the load being larger than you expect. So (as you pointed out), the "unexpected load size" assertion might fire on valid LLVM IR input.

And in the new version of the code where you are no longer checking that a shuffle exists for each possible Index (0, 1, 2, and 3), there is the oddball possibility that the load is smaller than you expect. Probably what you should be checking here is that LoadSize >= Factor * ShuffleVecSize since that is the size of the expanded set of loads.
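For illustration, a hypothetical (but valid) input of that kind, with names assumed, would be a factor-4 pattern sitting on top of a load that only covers 8 elements:

; hypothetical: mask indices 8 and 12 select from the undef operand, so the 8-element load
; is smaller than Factor (4) times the 4-element shuffle result
%wide.vec = load <8 x double>, <8 x double>* %ptr
%v0 = shufflevector <8 x double> %wide.vec, <8 x double> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>

Expanding this into four <4 x double> loads would read past the memory covered by the original load.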

88

Thanks for the fix. I'm okay with the way you did this, but note that by modelling the address as

%gep3 = getelementptr <4 x double>, <4 x double>* %bc, i32 3

instead of

%gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3

the size of the underlying memory reference is no longer explicit in the GEP. You can imagine a scenario where this could have a negative impact on subsequent optimization.

147

Even in the case of only one strided load, I think you will find that the CG will generate good code from your expansion sequence. It is true that 5 of the 8 shuffle instructions you are generating are unnecessary in that case, but those shuffle instructions should be optimized away as dead.

Try it out, and perhaps add a unit test for this situation, and possibly for some of the other unusual situations like multiple shuffles with the same index.
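For example (names assumed), a minimal single-stride test input could be just:

; hypothetical test: only one strided shuffle has uses; the other generated shuffles should be eliminated as dead
%wide.vec = load <16 x double>, <16 x double>* %ptr
%v0 = shufflevector <16 x double> %wide.vec, <16 x double> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>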

RKSimon added inline comments.Oct 5 2016, 9:00 AM
test/CodeGen/X86/x86-interleaved-access.ll
2

The 'AVX-NEXT'/'AVX1-NEXT'/'AVX2-NEXT' checks - the update script would generate quite a bit more than what is shown below.

Farhana updated this revision to Diff 73680.Oct 5 2016, 11:56 AM

Hi Farhana,

Aside from one minor issue in the test, this looks great.

-Dave

test/CodeGen/X86/x86-interleaved-access.ll
26

There should be a vunpckhpd here too, right? Is there a reason you are not checking for it?

delena added inline comments.Oct 6 2016, 8:34 AM
lib/Target/X86/X86InterleavedAccess.cpp
71

AVX512 probably has another set of shuffles

90

inbounds GEP?

144

It is not a good name for a function. I don't think you need an additional function call here at all.

Farhana updated this revision to Diff 73869.Oct 6 2016, 5:47 PM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
35

Enforcing LoadSize >= Factor * ShuffleVecSize makes supporting any combination of shuffles kind of pointless; it would only support the shuffles starting with 0 and 3. That is the reason I did not want to support it in this change-set.

My plan was to support it completely by generating maskload in the follow-up change-set, and also to avoid generating unnecessary shuffles.

90

Since the GEP does not come at a fixed position before the load instruction, it requires traversing the instructions. I plan to add that in the next change-set; I added a TODO comment to the code.

Also, the vectorizer does not generate inbounds GEPs today, so there was not much incentive to add that in this change-set.

144

You are right in the current context. My plan is to define a class to encapsulate all the information and allow data sharing, with two main functions: one for load generation and the other for shuffle generation. In order to keep the follow-up change-set minimal, I decided to create a function here. I hope it's OK to do so.

147

I know the CG will get rid of the other shuffles; I am not talking about this particular case. In general, I think we have to incorporate some concept of cost in order to support any number of accesses.

test/CodeGen/X86/x86-interleaved-access.ll
2

If I understand your comment correctly, you are saying the optimization will generate more instructions than it is checking for. Yes, it only checks for the must-have instructions, because the rest can be optimized away depending on the uses.

26

Right. There should be. The first four are the stepping stones, which guarantee the rest. I did not add it in order to keep the file size small; I know this file is going to be populated with a lot of tests.

DavidKreitzer accepted this revision.Oct 7 2016, 7:06 AM
DavidKreitzer edited edge metadata.

This LGTM, but I'd like you to also get approvals from Michael & Elena before proceeding.

This revision is now accepted and ready to land.Oct 7 2016, 7:06 AM
delena edited edge metadata.Oct 7 2016, 7:39 AM

What about the shuffle set for AVX-512?

lib/Target/X86/X86InterleavedAccess.cpp
144

Each patch should look good regardless of future plans.

Farhana updated this revision to Diff 73940.Oct 7 2016, 9:42 AM
Farhana edited edge metadata.
Farhana updated this revision to Diff 73941.Oct 7 2016, 10:20 AM
Farhana added inline comments.
lib/Target/X86/X86InterleavedAccess.cpp
71

Yes. The plan is to support AVX512 in a separate check-in, and also to handle the patterns that take advantage of its extended shuffle instructions and the wider vector length. This change-set is meant to support only AVX1 and AVX2.

144

I totally agree with you. I in-lined the function.

test/CodeGen/X86/x86-interleaved-access.ll
2

Hi Simon,

I think I understand your question now (Dave helped me).

You are right, the script update_llc_test_checks.py generates quite a few more checks than what I have here. Yes, the checks were not auto-generated by the script. I got rid of the NOTE.

But now I am wondering whether I should have used the script or not. I did not want to put in all the checks because, in my opinion, all of them would be unnecessary in this case; checking for the first few instructions would be enough to ensure the behavior.

Let me know if you think it's good practice to use the script always...

RKSimon added inline comments.Oct 7 2016, 12:21 PM
test/CodeGen/X86/x86-interleaved-access.ll
2

Generally yes, the script output is great as it's easy to regenerate, it means you're not hiding anything, and it's easier to grok the entire codegen.

There are plenty of cases where bulky code size is just too off-putting and CHECKs should be more selective, but if the code size could be reduced in the future I'd tend to include it, as it's very useful to show the delta.

But for these interleave cases I think it'd be useful, especially as we don't have any other reference examples of x86 interleave codegen at present. If it means you need to split the tests into multiple files (we often have 128 / 256 / 512 versions of test files), so be it.

Farhana updated this revision to Diff 74004.Oct 7 2016, 5:10 PM
Farhana added inline comments.
test/CodeGen/X86/x86-interleaved-access.ll
2

Sounds good.

Right now I don't need to split the file; maybe in the future I will have to consider it.

RKSimon accepted this revision.Oct 8 2016, 4:17 AM
RKSimon added a reviewer: RKSimon.

LGTM

delena accepted this revision.Oct 9 2016, 12:45 AM
delena edited edge metadata.
mkuper edited edge metadata.Oct 10 2016, 8:41 AM

I'm really sorry for coming into this review so late, but could you please also add stand-alone IR-level tests for the pass itself (and not only llc-level tests)?
Not only will we get better testing granularity, but it will also be easier to see what the produced IR looks like and whether it's really in the form we want.

lib/Target/X86/X86InterleavedAccess.cpp
78

There's an overload of CreateGEP that doesn't require the nullptr.

I'm really sorry for coming into this review so late, but could you please also add stand-alone IR-level tests for the pass itself (and not only llc-level tests)?
Not only will we get better testing granularity, but it will also be easier to see what the produced IR looks like and whether it's really in the form we want.

Hi Michael,

So, you want me to add an opt-based IR-level test, right?

But InterleavedAccessPass is a CodeGen pass: it relies on target information and gets added to the pipeline by TargetPassConfig::addIRPasses(). On the other hand, opt relies on PassManagerBuilder and uses Builder.populateFunctionPassManager(FPM) and Builder.populateModulePassManager(MPM) to construct its list of passes to run.

llc uses TargetMachine::addPassesToEmitFile in order to get the backend passes.

clang supports both.

It is possible to at least support the CodeGen IR passes in opt, but I am not sure whether that is the intent of opt. Also, even if we want that support, I would guess you would want me to add it in a separate check-in?

Farhana

Hi Farhana,

I'm not sure I see the problem.
opt already depends on both CodeGen and all-targets, and we already have opt-based tests for IR-to-IR passes that live in CodeGen. For example, you can find tests for AtomicExpandPass in test/Transforms/AtomicExpand.
Note that for target-specific tests you need to create a subdirectory (within the pass' test directory) for that target, which contains a lit.local.cfg file that requires this target. Otherwise you'll get errors in configurations where this target is not compiled in.

What's more, we already have opt-level tests for this pass; they just live (in my opinion) in the wrong place - see test/CodeGen/AArch64/aarch64-interleaved-accesses-extract-user.ll
Matt, am I missing a reason those tests should live in test/CodeGen, or was that an oversight? The test tree doesn't really match the lib tree anyway, and I think the vast majority of IR tests live in Transforms, regardless of where the pass is.

And I would prefer these tests to go in in the same commit - it will make it easier to make sure we don't have any surprises in the IR this produces.

Farhana updated this revision to Diff 74276.Oct 11 2016, 11:27 AM
Farhana edited edge metadata.

Hi Michael,

I added the IR-level tests.

Farhana

What's more, we already have opt-level tests for this pass; they just live (in my opinion) in the wrong place - see test/CodeGen/AArch64/aarch64-interleaved-accesses-extract-user.ll
Matt, am I missing a reason those tests should live in test/CodeGen, or was that an oversight? The test tree doesn't really match the lib tree anyway, and I think the vast majority of IR tests live in Transforms, regardless of where the pass is.

No, there's no reason the IR-to-IR tests for InterleavedAccessPass should be under test/CodeGen. In fact, the existing llc tests for ARM and AArch64 in test/CodeGen really should be opt tests. When we extended the pass and added extract-user.ll (an opt test), we just added it alongside the original tests. This made some sense because the transformation does live under lib/CodeGen and is a kind of preparation pass. But you're right, most of the other IR-to-IR codegen pass tests live under Transforms instead (atomic expand, global merge, etc.). The tests should really be under target-specific directories in test/Transforms/InterleavedAccess.

mkuper accepted this revision.Oct 12 2016, 4:24 PM
mkuper edited edge metadata.

LGTM

test/Transforms/InterleavedAccess/X86/interleaved-accesses-64bits-avx.ll
5 (On Diff #74276)

Probably a good idea to generate this test with the update script - we want to see the loads, not rely on the numbered IR values, etc.

Farhana updated this revision to Diff 74512.Oct 13 2016, 7:32 AM
Farhana edited edge metadata.
This revision was automatically updated to reflect the committed changes.