This is an archive of the discontinued LLVM Phabricator instance.

[PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations
Closed, Public

Authored by wschmidt on Mar 23 2015, 2:08 PM.

Details

Reviewers

wschmidt
kbarton
seurer
hfinkel
nemanjai

Summary

This patch adds a new SSA MI pass that runs on little-endian PPC64 code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors without alignment constraints are accomplished for little-endian using lxvd2x/xxswapd and xxswapd/stxvd2x. The existence of the additional xxswapd instructions hurts performance in comparison with big-endian code, but they are necessary in the general case to support correct semantics.

However, the general case does not apply to most vector code. Many vector instructions are lane-insensitive; they do not "care" which lanes the parallel computations are performed within, provided that the resulting data is stored into the correct locations. Thus this pass looks for computations that perform only lane-insensitive operations, and removes the unnecessary swaps from loads and stores in such computations.
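
To make the lane-insensitivity argument concrete, here is a minimal standalone C++ model. It is not LLVM code: the type `V2` and the helpers `swapd`, `lxvd2x`, `stxvd2x`, `add`, `extractLane0`, `addWithSwaps`, and `addWithoutSwaps` are illustrative assumptions that only mimic the doubleword reversal described above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative model only: a 2x64 vector as two doublewords.
using V2 = std::array<std::uint64_t, 2>;

// Models xxswapd: exchange the two doublewords.
inline V2 swapd(const V2 &v) { return {v[1], v[0]}; }

// Model the doubleword reversal described for lxvd2x/stxvd2x on little-endian.
inline V2 lxvd2x(const V2 &mem) { return swapd(mem); }
inline V2 stxvd2x(const V2 &reg) { return swapd(reg); }

// Lane-insensitive: lane i of the result depends only on lane i of the inputs.
inline V2 add(const V2 &a, const V2 &b) { return {a[0] + b[0], a[1] + b[1]}; }

// Lane-sensitive: the answer depends on which lane is numbered 0.
inline std::uint64_t extractLane0(const V2 &v) { return v[0]; }

// Conservative code: a swap after every load and before the store.
inline V2 addWithSwaps(const V2 &a, const V2 &b) {
  return stxvd2x(swapd(add(swapd(lxvd2x(a)), swapd(lxvd2x(b)))));
}

// After swap removal: the same loads, add, and store, with no xxswapd.
inline V2 addWithoutSwaps(const V2 &a, const V2 &b) {
  return stxvd2x(add(lxvd2x(a), lxvd2x(b)));
}
```

In this model `addWithSwaps` and `addWithoutSwaps` store the same values to memory, whereas `extractLane0` returns a different element depending on whether the swap after the load is kept. That is the distinction between lane-insensitive and lane-sensitive operations the summary relies on.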

Future improvements will allow computations using certain lane-sensitive operations to also be optimized in this manner, by modifying the lane-sensitive operations to account for the permuted order of the lanes. However, this patch only adds the infrastructure to permit this; no lane-sensitive operations are optimized at this time.

I have a few questions about infrastructure; these are flagged as FIXMEs in the commentary. Reviewers, I'd appreciate your opinions on these!

Diff Detail

Repository: rL LLVM

Event Timeline

wschmidt updated this revision to Diff 22514. (Mar 23 2015, 2:08 PM)

wschmidt retitled this revision to [PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations.

wschmidt updated this object.

wschmidt edited the test plan for this revision.

wschmidt added reviewers: hfinkel, kbarton, seurer, nemanjai.

wschmidt set the repository for this revision to rL LLVM.

wschmidt added a subscriber: Unknown Object (MLST).

Comment on my own patch: This needs to be controllable via option. (For GCC the similar transformation is controlled with -m[no-]optimize-swaps.)

Here's an updated patch that adds support for -mattr={+,-}optimize-swaps. I'll post a companion patch shortly for Clang, adding -m[no-]optimize-swaps in the front end to generate this attribute.

The test case now includes a -mno-optimize-swaps variant.

Why is a command line option a subtarget feature? More comments inline...

-eric

lib/Target/PowerPC/PPCTargetMachine.cpp
316	Comment on why.
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
111	DenseMap?
205	256?
252	What instructions not listed are you thinking here?
615	Good point, what _does_ this do for debug info? :)
647	?

Eric and I had a short chat about the option situation on IRC. Apparently there really isn't a good mechanism for disabling a specific pass for a specific target right now. I am fine with removing the option portion of this for now until LLVM supports a proper way of doing this. The down side is losing the chicken switch of a quick workaround. Since this is a -O1 optimization, the workaround becomes -O0 (or -O0 -no-fast-isel) which is somewhat harsh.

Responses to inline comments ... inline!

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205	Hey, a magic constant. That's why I gave it a name. :) Picking an initial vector size to avoid too many reallocations while keeping the size reasonable. Yes, it's a wet thumb to the breeze...
252	A vast number of lane-insensitive instructions. All of the vector math, logical, select, etc. Everything that's true SIMD.
615	It doesn't do anything, but it doesn't need to. When the load or store is initially expanded, its location information goes on the LXVD2X and XXPERMDI instructions, or on the XXPERMDI and STXVD2X instructions. This pass removes the XXPERMDI instructions, but the location information remains with the LXVD2X or STXVD2X.
647	As noted in the overall patch commentary, there are non-pure-SIMD instructions for which we can still perform this optimization, provided we change code generation for those instructions. My plan is to fill in this function with the details in future patches. The initial patch is for the "simple" part (which covers a great many cases).

As far as the option you can have a backend option to turn the pass on and off, see the aarch64 port for more information there.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205	Did you use science to figure it out (i.e. see what numbers would generally work etc) or just WAG?
252	Comment then? Just because there's another long list below so...
615	That's what I thought, just making sure.
647	Sounds good.

wschmidt mentioned this in D8706: [PPC64LE] Add -m[no-]optimize-swaps option. (Mar 30 2015, 1:53 PM)

I'd like to understand why you opted for doing this in MI instead of at the IR level? In particular, it seems odd to have the loop vectorizer and other tools build vectorized code in the "wrong" way and then fix it if we can later.

In D8565#149183, @chandlerc wrote:

I'd like to understand why you opted for doing this in MI instead of at the IR level? In particular, it seems odd to have the loop vectorizer and other tools build vectorized code in the "wrong" way and then fix it if we can later.

The extra swap instructions are not introduced until instruction selection, so the IR level is already correct. The instruction selector is too myopic to view entire computation webs, so we have to be conservative and generate code that uses true-LE register representations, then clean up at a later time when we can see the big picture. (So both the disease and the cure are introduced in the back end.)

I'll look at the backend option as a temporary measure, although someday I'd like to see a better solution. (That said, I can't volunteer to design it myself right now, so I have no right to complain...)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
111	Yes, that looks appropriate. I'll move to that.
205	Semi-science, yes. I've spent some time on this in GCC as well and gotten a pretty good feel for what happens. 256 is sufficiently large to prevent any reallocations in the usual case. For larger functions, we will see a reallocation or two. My testing on LLVM indicated that the code in projects/test-suite used no reallocations except for three of the benchmarks (I know this because my implementation was broken, so I noticed failures when the reallocations occurred). So I feel this is a good choice.
252	Yep, I can do that.
615	I can add a comment to this effect.
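
The trade-off behind the initial size of 256 can be illustrated with a small standalone sketch. It uses `std::vector` rather than the pass's actual container, and `countReallocations` is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how often a vector reserved to
// `initialCapacity` has to grow while `count` elements are appended.
inline std::size_t countReallocations(std::size_t initialCapacity,
                                      std::size_t count) {
  std::vector<int> entries;
  entries.reserve(initialCapacity);
  std::size_t reallocations = 0;
  std::size_t capacity = entries.capacity();
  for (std::size_t i = 0; i < count; ++i) {
    entries.push_back(static_cast<int>(i));
    if (entries.capacity() != capacity) {
      // The capacity changed, so the storage was reallocated.
      ++reallocations;
      capacity = entries.capacity();
    }
  }
  return reallocations;
}
```

With an initial capacity of 256, appending up to 256 entries causes no reallocation, while a larger function that needs more entries pays for at least one. This matches the "a reallocation or two" behavior described above.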

echristo added inline comments. (Mar 30 2015, 2:16 PM)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205	Good enough science for me. I was hoping you'd collected normal size, but this is probably close enough.

I've attempted to address all comments received to date. I've changed the option handling to ape what's done for some similar passes (will be controlled with -mllvm -disable-ppc-vsx-swap-removal).

I'm still interested in any opinions on some of the FIXME issues I raised.

Thanks for the reviews!

Bill

Seems pretty reasonable on this end. I haven't done an in-depth look at the actual optimization at this point in case Hal or someone else has.

Quick Question: The testcases with the removed xxpermdi instruction checks - not being emitted at all, emitting fewer, useless to check anyhow, something else?

-eric

In D8565#151320, @echristo wrote:

Quick Question: The testcases with the removed xxpermdi instruction checks - not being emitted at all, emitting fewer, useless to check anyhow, something else?

The tests check for total removal (xxswapd is an alternative mnemonic for this specific form of xxpermdi):

; CHECK-NOT: xxpermdi
; CHECK-NOT: xxswapd

Hm, pre-coffee comment. I suspect you mean the last two tests where I removed the checks. At this time they don't check anything to do with the xxpermdi. Alternatively, I could leave the check in place and replace "count 12" with "count 0" or otherwise look at this. Since those tests weren't designed to specifically check an optimization I initially chose to remove the check, but I could go either way.

(The results in both cases are that all xxpermdi instructions are removed.)

Nah, that part sounds fine to me, I was just verifying. I'll leave it to Hal for the pass itself.

-eric

I apologize for taking so long to get to this...

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
22	What does "particularly bad for vectorized code" mean? Doesn't this exclusively apply to vectorized code?
127	Why don't we just call these "vector instructions"? "Mentions" sounds stilted to me.
168	As a general note, this pass does not handle the case where a register has a subregister index. This should be mostly irrelevant here (we probably don't generate them for these kinds of instructions), but we also don't want to mishandle them should they occur. (maybe also give up in lookThruCopyLike should you run into a register with a subreg index too).
225	range-based for?
228	range-based for?
233	for (const MachineOperand &MO : MI->operands()) {
285	Where do we actually reject webs with non-swapping load/stores?
308	We should add a comment here explaining that this happens because some of the 128-bit VSX registers have 128-bit Altivec sub-registers.
334	Yes, that would be better. If nothing else, we should be able to use the InstrMapping facility to generate a lookup table (this is how we currently keep track of which instructions are record forms, and the mapping between record-form and the non-record-form variants).
460	add an assert that isSubregToReg() is true.
492	range-based for?

Hal, thanks for the review! Responses to your comments are inline. Please let me know if I'm off base on any of them. I'll work on a new version shortly.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
22	I suppose I had meant "autovectorized code" at the time, because the user has no way to anticipate these swaps are going to show up out of nowhere and reduce performance in a performance-enhancing optimization. However, the sentence really adds nothing, so I'll remove it. ;)
127	That's fine. "Mentions" is a more or less standard term for defs U kills U uses. The important thing is to get anything that mentions a vector register, including things that just copy into or out of them. But in our current ISA, everything that lives in the Category:Altivec and Category:VSX fits this description, so "vector instructions" is equivalent.
168	I don't believe any special handling is necessary for the current patch. If a subregister mention is connected, directly or indirectly, to a full register load or store, it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG. But these are operations that kill the optimization anyway (except SUBREG_TO_REG for full 128-bit copies). We don't want to exclude them from the analysis, because we can handle EXTRACT_SUBREG and INSERT_SUBREG by adjusting the subregister number to account for the doubleword swap, and dealing with SUBREG_TO_REG accordingly as well. I plan to incorporate this into a future patch. However, there may be subtleties in how subregs are handled in LLVM that I'm not familiar with, so please let me know if I'm missing something.
225	OK -- I'll rework this and the ones below.
285	The fact that these are not marked as either IsSwap or IsSwappable causes them to be rejected. See lines 537-538.
308	OK.
334	I'll look at that, thanks. There didn't seem to be a way to just create a simple flag that I could see, which would have been nice...
460	OK.
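
The rejection rule described for line 285 can be sketched as follows. This is a simplified standalone model, not the code in the patch: `MockEntry` and `webIsRemovable` are hypothetical names, and only the two flags named in the response (IsSwap, IsSwappable) are modelled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one instruction's record within a web.
struct MockEntry {
  bool IsSwap;      // the instruction is itself a doubleword swap
  bool IsSwappable; // the instruction tolerates swapped lanes
};

// A web is a candidate for swap removal only if every member is either a
// swap or swappable; any other instruction causes the web to be rejected.
inline bool webIsRemovable(const std::vector<MockEntry> &web) {
  for (const MockEntry &entry : web)
    if (!entry.IsSwap && !entry.IsSwappable)
      return false;
  return true;
}
```

Under this model a load or store that does not swap is neither IsSwap nor IsSwappable, so its presence alone disqualifies the web without any dedicated check.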

hfinkel added inline comments. (Apr 11 2015, 6:54 PM)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
127	Feel free to keep mentions if you'd prefer. I don't have a strong opinion.
168	"it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG." No, I think you've misunderstood my comment. MI register operands have implicit subregister indices. These are separate from the subreg pseudo instructions you've named above. For example, MO.getReg() may equal PPC::CR0, and the operand refers to the whole register if MO.getSubReg() == 0. But, if MO.getReg() == PPC::CR0 and MO.getSubReg() == PPC::sub_eq, then the operand is really referring to PPC::CR0EQ. Now this seems silly for physical registers, but is useful for virtual registers when you know that you want a particular subregister of a virtual register.

Should have a revised patch shortly. There are a couple of responses inline.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
127	That's ok, I'll make the switch.
168	OK. Well, I really don't think this can actually happen today for any of the various vector register classes. (If it can, I'd be interested in an example.) I will add some logic that will just kill the optimization if it ever occurs, but to my knowledge subregs of vector registers are always generated explicitly by operation (as they should be). I can understand using this for fixed bits of a status register, but it would be really bad practice to use this for vector subregs.

hfinkel added inline comments. (Apr 13 2015, 12:42 PM)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
168	Right; I don't think it can happen today either, but if it does for whatever reason, I'd rather we handle it gracefully instead of crashing/miscompiling. That was my entire point ;)
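
The conservative guard agreed on here could look roughly like the sketch below. It is a standalone mock, not LLVM code: `MockOperand` stands in for an MI register operand and models only the `getReg()` and `getSubReg()` accessors mentioned in the discussion, and `hasSubRegOperand` is a hypothetical helper name.

```cpp
#include <cassert>
#include <vector>

// Mock stand-in for an MI register operand (assumption for illustration).
struct MockOperand {
  unsigned Reg;
  unsigned SubReg; // 0 means the operand refers to the whole register
  unsigned getReg() const { return Reg; }
  unsigned getSubReg() const { return SubReg; }
};

// Returns true if any operand carries a subregister index. A conservative
// pass would then give up on the web containing that instruction rather
// than risk a crash or miscompile.
inline bool hasSubRegOperand(const std::vector<MockOperand> &operands) {
  for (const MockOperand &MO : operands)
    if (MO.getSubReg() != 0)
      return true;
  return false;
}
```

The point of the design is that the check is cheap and almost never fires today, yet it keeps the optimization safe if subregister indices on vector operands ever do appear.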

OK, I've attempted to address all comments. The biggest change is adding an InstrMapping to PPC.td to record which instructions kill the swap removal optimization, rather than just listing them in the switch statement. It is a bit ugly to use a mapping to represent a set, but it appears to be the only option for doing this via TableGen. Let me know if you have any better suggestions.

Thanks again for the thorough reviews!

Bill

hfinkel added inline comments. (Apr 16 2015, 2:07 PM)

lib/Target/PowerPC/PPC.td
196 ↗	(On Diff #23695)	It would be nice to keep this as an intrinsic instruction property. Something that means that the instruction operates on all vector lanes independently. Then special cases that are handled in the code can be handled there without involving the instruction definitions.
lib/Target/PowerPC/PPCInstrAltivec.td
388	Why are you setting SwapFlag to 1 only to set it to 0? You can add the flag to the base instruction class if it just needs to be defined first.

One comment I understand, the other I really don't. Please advise. ;)

lib/Target/PowerPC/PPC.td
196 ↗	(On Diff #23695)	OK, but how would that be implemented? I don't understand what you are proposing. At some point we have to identify either the "good" ones or the "bad" ones; how do you propose to do that without involving the instruction definitions?
lib/Target/PowerPC/PPCInstrAltivec.td
388	At first I didn't understand this, but I see what you're getting at -- actually I'm setting it to zero first, and then 1 here. Yes, that could just be 1 in the base class and removed from the instruction descriptions. Sorry, this is left over from the several iterations of trying to get this to work...

Hal, thanks for the offline discussion. I've changed the filter class to "LaneSensitive" and cleaned up the SwapFlag foolishness.

nemanjai added inline comments. (Apr 21 2015, 8:49 AM)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
15	Nit: v4r32 -> v4f32?

ping ;)

This looks good, but we need to get rid of the

let BaseName = "BLA" in
def BLA: XForm_n<whatever...>

pattern. Here's one way to do it:

Change the definition of LaneSensitive to something like this:

class LaneSensitive {
  bit SwapFlag = 1;
  string BaseName = NAME;
}

(this is similar to what MUBUFAddr64Table does in R600/SIInstrInfo.td).

Now the problem is that LaneSensitive will only work properly when used inside of a multiclass definition. For that, you'll need some multiclass wrappers for the XForm classes (we already have a few of these, see the definition of XForm_6r and friends in PPCInstrInfo.td). We can then make the multiclass something like this:

multiclass XForm_nls<whatever...> {
  def NAME : XForm_n<whatever...>, LaneSensitive;
}

and then you just need to change the definitions of the lane-sensitive instructions to something like:

defm BLA : XForm_nls<whatever...>

OK, thanks! This makes sense.

Frankly this gets pretty baroque when all we want is a way to flag an instruction with an attribute. Providing an easy way to form sets of instructions would be a good project for somebody who wants to modify TableGen someday.

Hal, I've looked this over a bit more, and I'm not very happy with this approach.

To support this, I'll have to add multiclass wrappers for all of the following:

XForm_1
XForm_8
VA1a_int_Ty3
VA1a_Int_Ty
VAForm_2
VXForm_1
VX1_Int_Ty2
VX1_Int_Ty
VX1_Int_Ty3
VXCR_Int_Ty
XX1Form
XX3Form
XX2Form_2

This is becoming a huge mangled mess, in an attempt to avoid listing these instructions in a switch statement. At this point I would like to suggest that TableGen is not set up to cleanly create sets of instructions, and revert to using the switch statement. Any objections?

In D8565#161928, @wschmidt wrote:
Hal, I've looked this over a bit more, and I'm not very happy with this approach.

To support this, I'll have to add multiclass wrappers for all of the following:
XForm_1
XForm_8
VA1a_int_Ty3
VA1a_Int_Ty
VAForm_2
VXForm_1
VX1_Int_Ty2
VX1_Int_Ty
VX1_Int_Ty3
VXCR_Int_Ty
XX1Form
XX3Form
XX2Form_2
This is becoming a huge mangled mess, in an attempt to avoid listing these instructions in a switch statement. At this point I would like to suggest that TableGen is not set up to cleanly create sets of instructions, and revert to using the switch statement. Any objections?

Unfortunately, I don't object :( -- It is not clear that the multiclass setup is any more obvious or more maintainable than the switch statement. You can go back to the switch statement, with a FIXME. Also, add a prominent note in PPCInstrAltivec.td and in PPCInstrVSX.td reminding us that we may need to update the list if new vector instructions are added.

Thanks, Hal. This revision reverts back to the switch statement, and adds the commentary you asked for. Thanks for helping me think through trying to do this in TableGen.

LGTM, thanks!

This revision is now accepted and ready to land. (Apr 27 2015, 11:29 AM)

Committed as r235910.

Work is complete.

Revision Contents

Path	Size
lib/Target/PowerPC/ (five files; names not shown)	1 line, 1 line, 15 lines, 15 lines, 14 lines
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp	778 lines
test/CodeGen/PowerPC/swaps-le-1.ll	147 lines
test/CodeGen/PowerPC/vsx-ldst-builtin-le.ll	1 line
test/CodeGen/PowerPC/vsx-ldst.ll	1 line

Diff 24491

lib/Target/PowerPC/CMakeLists.txt

[30 unchanged lines not shown; context: add_llvm_target(PowerPCCodeGen]
    PPCSubtarget.cpp
    PPCTargetMachine.cpp
    PPCTargetObjectFile.cpp
    PPCTargetTransformInfo.cpp
    PPCSelectionDAGInfo.cpp
    PPCTLSDynamicCall.cpp
    PPCVSXCopy.cpp
    PPCVSXFMAMutate.cpp
+   PPCVSXSwapRemoval.cpp
    )

  add_subdirectory(AsmParser)
  add_subdirectory(Disassembler)
  add_subdirectory(InstPrinter)
  add_subdirectory(TargetInfo)
  add_subdirectory(MCTargetDesc)

lib/Target/PowerPC/PPC.h

[33 unchanged lines not shown]
  #ifndef NDEBUG
    FunctionPass *createPPCCTRLoopsVerify();
  #endif
    FunctionPass *createPPCLoopDataPrefetchPass();
    FunctionPass *createPPCLoopPreIncPrepPass(PPCTargetMachine &TM);
    FunctionPass *createPPCEarlyReturnPass();
    FunctionPass *createPPCVSXCopyPass();
    FunctionPass *createPPCVSXFMAMutatePass();
+   FunctionPass *createPPCVSXSwapRemovalPass();
    FunctionPass *createPPCBranchSelectionPass();
    FunctionPass *createPPCISelDag(PPCTargetMachine &TM);
    FunctionPass *createPPCTLSDynamicCallPass();
    void LowerPPCMachineInstrToMCInst(const MachineInstr *MI, MCInst &OutMI,
                                      AsmPrinter &AP, bool isDarwin);

    void initializePPCVSXFMAMutatePass(PassRegistry&);
    extern char &PPCVSXFMAMutateID;
[52 unchanged lines not shown]

lib/Target/PowerPC/PPCInstrAltivec.td

  //===-- PPCInstrAltivec.td - The PowerPC Altivec Extension -- tablegen --===//
  //
  // The LLVM Compiler Infrastructure
  //
  // This file is distributed under the University of Illinois Open Source
  // License. See LICENSE.TXT for details.
  //
  //===----------------------------------------------------------------------===//
  //
  // This file describes the Altivec extension to the PowerPC instruction set.
  //
  //===----------------------------------------------------------------------===//

+ // ********************************* NOTE *********************************
+ // For POWER8 Little Endian, the VSX swap optimization relies on knowing
+ // which VMX and VSX instructions are lane-sensitive and which are not.
+ // A lane-sensitive instruction relies, implicitly or explicitly, on
+ // whether lanes are numbered from left to right. An instruction like
+ // VADDFP is not lane-sensitive, because each lane of the result vector
+ // relies only on the corresponding lane of the source vectors. However,
+ // an instruction like VMULESB is lane-sensitive, because "even" and
+ // "odd" lanes are different for big-endian and little-endian numbering.
+ //
+ // When adding new VMX and VSX instructions, please consider whether they
+ // are lane-sensitive. If so, they must be added to a switch statement
+ // in PPCVSXSwapRemoval::gatherVectorInstructions().
+ // ****************************************************************************

  //===----------------------------------------------------------------------===//
  // Altivec transformation functions and pattern fragments.
  //

  // Since we canonicalize buildvectors to v16i8, all vnots "-1" operands will be
  // of that type.
  def vnot_ppc : PatFrag<(ops node:$in),
                         (xor node:$in, (bitconvert (v16i8 immAllOnesV)))>;
[343 unchanged lines not shown]
  def LVX : XForm_1<31, 103, (outs vrrc:$vD), (ins memrr:$src),
                    "lvx $vD, $src", IIC_LdStLoad,
                    [(set v4i32:$vD, (int_ppc_altivec_lvx xoaddr:$src))]>;
  def LVXL : XForm_1<31, 359, (outs vrrc:$vD), (ins memrr:$src),
                     "lvxl $vD, $src", IIC_LdStLoad,
                     [(set v4i32:$vD, (int_ppc_altivec_lvxl xoaddr:$src))]>;
  }

  def LVSL : XForm_1<31, 6, (outs vrrc:$vD), (ins memrr:$src),
                     "lvsl $vD, $src", IIC_LdStLoad,
                     [(set v16i8:$vD, (int_ppc_altivec_lvsl xoaddr:$src))]>,
                     PPC970_Unit_LSU;
  def LVSR : XForm_1<31, 38, (outs vrrc:$vD), (ins memrr:$src),
                     "lvsr $vD, $src", IIC_LdStLoad,
                     [(set v16i8:$vD, (int_ppc_altivec_lvsr xoaddr:$src))]>,
                     PPC970_Unit_LSU;

[704 unchanged lines not shown]

lib/Target/PowerPC/PPCInstrVSX.td

  //===- PPCInstrVSX.td - The PowerPC VSX Extension --- tablegen --===//
  //
  // The LLVM Compiler Infrastructure
  //
  // This file is distributed under the University of Illinois Open Source
  // License. See LICENSE.TXT for details.
  //
  //===----------------------------------------------------------------------===//
  //
  // This file describes the VSX extension to the PowerPC instruction set.
  //
  //===----------------------------------------------------------------------===//

+ // ********************************* NOTE *********************************
+ // For POWER8 Little Endian, the VSX swap optimization relies on knowing
+ // which VMX and VSX instructions are lane-sensitive and which are not.
+ // A lane-sensitive instruction relies, implicitly or explicitly, on
+ // whether lanes are numbered from left to right. An instruction like
+ // VADDFP is not lane-sensitive, because each lane of the result vector
+ // relies only on the corresponding lane of the source vectors. However,
+ // an instruction like VMULESB is lane-sensitive, because "even" and
+ // "odd" lanes are different for big-endian and little-endian numbering.
+ //
+ // When adding new VMX and VSX instructions, please consider whether they
+ // are lane-sensitive. If so, they must be added to a switch statement
+ // in PPCVSXSwapRemoval::gatherVectorInstructions().
+ // ****************************************************************************

  def PPCRegVSRCAsmOperand : AsmOperandClass {
    let Name = "RegVSRC"; let PredicateMethod = "isVSRegNumber";
  }
  def vsrc : RegisterOperand<VSRC> {
    let ParserMatchClass = PPCRegVSRCAsmOperand;
  }

  def PPCRegVSFRCAsmOperand : AsmOperandClass {
[971 unchanged lines not shown]

lib/Target/PowerPC/PPCTargetMachine.cpp

[32 unchanged lines not shown]
  static cl::
  opt<bool> DisablePreIncPrep("disable-ppc-preinc-prep", cl::Hidden,
                              cl::desc("Disable PPC loop preinc prep"));

  static cl::opt<bool>
  VSXFMAMutateEarly("schedule-ppc-vsx-fma-mutation-early",
    cl::Hidden, cl::desc("Schedule VSX FMA instruction mutation early"));

+ static cl::
+ opt<bool> DisableVSXSwapRemoval("disable-ppc-vsx-swap-removal", cl::Hidden,
+                                 cl::desc("Disable VSX Swap Removal for PPC"));

  static cl::opt<bool>
  EnableGEPOpt("ppc-gep-opt", cl::Hidden,
               cl::desc("Enable optimizations on complex GEPs"),
               cl::init(true));

  static cl::opt<bool>
  EnablePrefetch("enable-ppc-prefetching",
                 cl::desc("disable software prefetching on PPC"),
[185 unchanged lines not shown; context: public:]
    PPCTargetMachine &getPPCTargetMachine() const {
      return getTM<PPCTargetMachine>();
    }

    void addIRPasses() override;
    bool addPreISel() override;
    bool addILPOpts() override;
    bool addInstSelector() override;
+   void addMachineSSAOptimization() override;
    void addPreRegAlloc() override;
    void addPreSched2() override;
    void addPreEmitPass() override;
  };
  } // namespace

  TargetPassConfig *PPCTargetMachine::createPassConfig(PassManagerBase &PM) {
    return new PPCPassConfig(this, PM);
[51 unchanged lines not shown; context: #ifndef NDEBUG]
    if (!DisableCTRLoops && getOptLevel() != CodeGenOpt::None)
      addPass(createPPCCTRLoopsVerify());
  #endif

    addPass(createPPCVSXCopyPass());
    return false;
  }

+ void PPCPassConfig::addMachineSSAOptimization() {
+   TargetPassConfig::addMachineSSAOptimization();
+   // For little endian, remove where possible the vector swap instructions
+   // introduced at code generation to normalize vector element order.
+   if (Triple(TM->getTargetTriple()).getArch() == Triple::ppc64le &&
+       !DisableVSXSwapRemoval)
+     addPass(createPPCVSXSwapRemovalPass());
+ }

  void PPCPassConfig::addPreRegAlloc() {
    initializePPCVSXFMAMutatePass(*PassRegistry::getPassRegistry());
    insertPass(VSXFMAMutateEarly ? &RegisterCoalescerID : &MachineSchedulerID,
               &PPCVSXFMAMutateID);
    if (getPPCTargetMachine().getRelocationModel() == Reloc::PIC_)
      addPass(createPPCTLSDynamicCallPass());
  }

[16 unchanged lines not shown]

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp

				//===----------- PPCVSXSwapRemoval.cpp - Remove VSX LE Swaps -------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===---------------------------------------------------------------------===//
				//
				// This pass analyzes vector computations and removes unnecessary
				// doubleword swaps (xxswapd instructions). This pass is performed
				// only for little-endian VSX code generation.
				//
				// For this specific case, loads and stores of v4i32, v4f32, v2i64,
				// and v2f64 vectors are inefficient. These are implemented using
				nemanjai: Nit: v4r32 -> v4f32?
				// the lxvd2x and stxvd2x instructions, which invert the order of
				// doublewords in a vector register. Thus code generation inserts
				// an xxswapd after each such load, and prior to each such store.
				//
				// The extra xxswapd instructions reduce performance. The purpose
				// of this pass is to reduce the number of xxswapd instructions
				// required for correctness.
				hfinkel: What does "particularly bad for vectorized code" mean? Doesn't this exclusively apply to vectorized code?
				wschmidt: I suppose I had meant "autovectorized code" at the time, because the user has no way to anticipate these swaps are going to show up out of nowhere and reduce performance in a performance-enhancing optimization. However, the sentence really adds nothing, so I'll remove it. ;)
				//
				// The primary insight is that much code that operates on vectors
				// does not care about the relative order of elements in a register,
				// so long as the correct memory order is preserved. If we have a
				// computation where all input values are provided by lxvd2x/xxswapd,
				// all outputs are stored using xxswapd/stxvd2x, and all intermediate
				// computations are lane-insensitive (independent of element order),
				// then all the xxswapd instructions associated with the loads and
				// stores may be removed without changing observable semantics.
				//
				// This pass uses standard equivalence class infrastructure to create
				// maximal webs of computations fitting the above description. Each
				// such web is then optimized by removing its unnecessary xxswapd
				// instructions.
				//
				// There are some lane-sensitive operations for which we can still
				// permit the optimization, provided we modify those operations
				// accordingly. Such operations are identified as using "special
				// handling" within this module.
				//
				//===---------------------------------------------------------------------===//
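
For illustration only (this is not part of the patch): a minimal standalone sketch of the lane-insensitivity argument made in the header comment above. It assumes a simplified model in which a vector register is just two 64-bit doublewords; `V2`, `swapd`, `loadReversed`, `storeReversed`, and `addLanes` are hypothetical names chosen here, not LLVM or Power ISA APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified model (an assumption): a 128-bit register as two doublewords.
using V2 = std::array<std::uint64_t, 2>;

// Models xxswapd: exchange the two doublewords of a register.
V2 swapd(const V2 &Reg) { return {Reg[1], Reg[0]}; }

// Models lxvd2x/stxvd2x on little-endian: the doublewords arrive in (or
// leave from) the register in the reverse of their memory order.
V2 loadReversed(const V2 &Mem) { return {Mem[1], Mem[0]}; }
V2 storeReversed(const V2 &Reg) { return {Reg[1], Reg[0]}; }

// A lane-insensitive operation: each lane is computed independently.
V2 addLanes(const V2 &A, const V2 &B) { return {A[0] + B[0], A[1] + B[1]}; }

// Code as generated before the pass: a swap after each load and before
// the store. The return value models the resulting memory contents.
V2 addWithSwaps(const V2 &MemA, const V2 &MemB) {
  V2 A = swapd(loadReversed(MemA));
  V2 B = swapd(loadReversed(MemB));
  return storeReversed(swapd(addLanes(A, B)));
}

// Code after the swaps are removed: the lanes are permuted inside the
// registers, but the value written back to memory is unchanged.
V2 addWithoutSwaps(const V2 &MemA, const V2 &MemB) {
  return storeReversed(addLanes(loadReversed(MemA), loadReversed(MemB)));
}
```

In this model both functions write the same value to memory for any inputs, which is the property the pass relies on. An operation that reads a specific lane would break the equivalence, which is why such operations cause a web to be rejected.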

				#include "PPCInstrInfo.h"
				#include "PPC.h"
				#include "PPCInstrBuilder.h"
				#include "PPCTargetMachine.h"
				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/EquivalenceClasses.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/Format.h"
				#include "llvm/Support/raw_ostream.h"

				using namespace llvm;

				#define DEBUG_TYPE "ppc-vsx-swaps"

				namespace llvm {
				void initializePPCVSXSwapRemovalPass(PassRegistry&);
				}

				namespace {

				// A PPCVSXSwapEntry is created for each machine instruction that
				// is relevant to a vector computation.
				struct PPCVSXSwapEntry {
				// Pointer to the instruction.
				MachineInstr *VSEMI;

				// Unique ID (position in the swap vector).
				int VSEId;

				// Attributes of this node.
				unsigned int IsLoad : 1;
				unsigned int IsStore : 1;
				unsigned int IsSwap : 1;
				unsigned int MentionsPhysVR : 1;
				unsigned int HasImplicitSubreg : 1;
				unsigned int IsSwappable : 1;
				unsigned int SpecialHandling : 3;
				unsigned int WebRejected : 1;
				unsigned int WillRemove : 1;
				};

				enum SHValues {
				SH_NONE = 0,
				SH_BUILDVEC,
				SH_EXTRACT,
				SH_INSERT,
				SH_NOSWAP_LD,
				SH_NOSWAP_ST,
				SH_SPLAT
				};

				struct PPCVSXSwapRemoval : public MachineFunctionPass {

				static char ID;
				const PPCInstrInfo *TII;
				MachineFunction *MF;
				MachineRegisterInfo *MRI;

				// Swap entries are allocated in a vector for better performance.
				std::vector<PPCVSXSwapEntry> SwapVector;

				// A mapping is maintained between machine instructions and
				// their swap entries. The key is the address of the MI.
				DenseMap<MachineInstr*, int> SwapMap;
				echristo: DenseMap?
				wschmidt: Yes, that looks appropriate. I'll move to that.

				// Equivalence classes are used to gather webs of related computation.
				// Swap entries are represented by their VSEId fields.
				EquivalenceClasses<int> *EC;

				PPCVSXSwapRemoval() : MachineFunctionPass(ID) {
				initializePPCVSXSwapRemovalPass(*PassRegistry::getPassRegistry());
				}

				private:
				// Initialize data structures.
				void initialize(MachineFunction &MFParm);

				// Walk the machine instructions to gather vector usage information.
				// Return true iff vector mentions are present.
				bool gatherVectorInstructions();
				hfinkel: Why don't we just call these "vector instructions"? mentions sounds stilted to me.
				wschmidt: That's fine. "Mentions" is a more or less standard term for defs U kills U uses. The important thing is to get anything that mentions a vector register, including things that just copy into or out of them. But in our current ISA, everything that lives in the Category:Altivec and Category:VSX fits this description, so "vector instructions" is equivalent.
				hfinkel: Feel free to keep mentions if you'd prefer. I don't have a strong opinion.
				wschmidt: That's ok, I'll make the switch.

				// Add an entry to the swap vector and swap map.
				int addSwapEntry(MachineInstr *MI, PPCVSXSwapEntry &SwapEntry);

				// Hunt backwards through COPY and SUBREG_TO_REG chains for a
				// source register. VecIdx indicates the swap vector entry to
				// mark as mentioning a physical register if the search leads
				// to one.
				unsigned lookThruCopyLike(unsigned SrcReg, unsigned VecIdx);

				// Generate equivalence classes for related computations (webs).
				void formWebs();

				// Analyze webs and determine those that cannot be optimized.
				void recordUnoptimizableWebs();

				// Record which swap instructions can be safely removed.
				void markSwapsForRemoval();

				// Remove swaps and update other instructions requiring special
				// handling. Return true iff any changes are made.
				bool removeSwaps();

				// Update instructions requiring special handling.
				void handleSpecialSwappables(int EntryIdx);

				// Dump a description of the entries in the swap vector.
				void dumpSwapVector();

				// Return true iff the given register is in the given class.
				bool isRegInClass(unsigned Reg, const TargetRegisterClass *RC) {
				if (TargetRegisterInfo::isVirtualRegister(Reg))
				return RC->hasSubClassEq(MRI->getRegClass(Reg));
				if (RC->contains(Reg))
				return true;
				return false;
				}

				// Return true iff the given register is a full vector register.
				bool isVecReg(unsigned Reg) {
				return (isRegInClass(Reg, &PPC::VSRCRegClass) ||
				hfinkel: As a general note, this pass does not handle the case where a register has a subregister index. This should be mostly irrelevant here (we probably don't generate them for these kinds of instructions), but we also don't want to mishandle them should they occur. (maybe also give up in lookThruCopyLike should you run into a register with a subreg index too).
				wschmidt: I don't believe any special handling is necessary for the current patch. If a subregister mention is connected, directly or indirectly, to a full register load or store, it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG. But these are operations that kill the optimization anyway (except SUBREG_TO_REG for full 128-bit copies). We don't want to exclude them from the analysis, because we can handle EXTRACT_SUBREG and INSERT_SUBREG by adjusting the subregister number to account for the doubleword swap, and dealing with SUBREG_TO_REG accordingly as well. I plan to incorporate this into a future patch. However, there may be subtleties in how subregs are handled in LLVM that I'm not familiar with, so please let me know if I'm missing something.
				hfinkel: > it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG.
				No, I think you've misunderstood my comment. MI register operands can have implicit subregister indices. These are separate from the subreg pseudo instructions you've named above. For example, MO.getReg() may equal PPC::CR0, and the operand refers to the whole register if MO.getSubReg() == 0. But, if MO.getReg() == PPC::CR0 and MO.getSubReg() == PPC::sub_eq, then the operand is really referring to PPC::CR0EQ. Now this seems silly for physical registers, but is useful for virtual registers when you know that you want a particular subregister of a virtual register.
				wschmidt: OK. Well, I really don't think this can actually happen today for any of the various vector register classes. (If it can, I'd be interested in an example.) I will add some logic that will just kill the optimization if it ever occurs, but to my knowledge subregs of vector registers are always generated explicitly by operation (as they should be). I can understand using this for fixed bits of a status register, but it would be really bad practice to use this for vector subregs.
				hfinkel: Right; I don't think it can happen today either, but if it does for whatever reason, I'd rather we handle it gracefully instead of crashing/miscompiling. That was my entire point ;)
				isRegInClass(Reg, &PPC::VRRCRegClass));
				}

				public:
				// Main entry point for this pass.
				bool runOnMachineFunction(MachineFunction &MF) override {
				// If we don't have VSX on the subtarget, don't do anything.
				const PPCSubtarget &STI = MF.getSubtarget<PPCSubtarget>();
				if (!STI.hasVSX())
				return false;

				bool Changed = false;
				initialize(MF);

				if (gatherVectorInstructions()) {
				formWebs();
				recordUnoptimizableWebs();
				markSwapsForRemoval();
				Changed = removeSwaps();
				}

				// FIXME: See the allocation of EC in initialize().
				delete EC;
				return Changed;
				}
				};

				// Initialize data structures for this pass. In particular, clear the
				// swap vector and allocate the equivalence class mapping before
				// processing each function.
				void PPCVSXSwapRemoval::initialize(MachineFunction &MFParm) {
				MF = &MFParm;
				MRI = &MF->getRegInfo();
				TII = static_cast<const PPCInstrInfo*>(MF->getSubtarget().getInstrInfo());

				// An initial vector size of 256 appears to work well in practice.
				// Small/medium functions with vector content tend not to incur a
				echristo: 256?
				wschmidt: Hey, a magic constant. That's why I gave it a name. :) Picking an initial vector size to avoid too many reallocations while keeping the size reasonable. Yes, it's a wet thumb to the breeze...
				echristo: Did you use science to figure it out (i.e. see what numbers would generally work etc) or just WAG?
				wschmidt: Semi-science, yes. I've spent some time on this in GCC as well and gotten a pretty good feel for what happens. 256 is sufficiently large to prevent any reallocations in the usual case. For larger functions, we will see a reallocation or two. My testing on LLVM indicated that the code in projects/test-suite used no reallocations except for three of the benchmarks (I know this because my implementation was broken, so I noticed failures when the reallocations occurred). So I feel this is a good choice.
				echristo: Good enough science for me. I was hoping you'd collected normal size, but this is probably close enough.
				// reallocation at this size. Three of the vector tests in
				// projects/test-suite reallocate, which seems like a reasonable rate.
				const int InitialVectorSize(256);
				SwapVector.clear();
				SwapVector.reserve(InitialVectorSize);

				// FIXME: Currently we allocate EC each time because we don't have
				// access to the set representation on which to call clear(). Should
				// consider adding a clear() method to the EquivalenceClasses class.
				EC = new EquivalenceClasses<int>;
				}

				// Create an entry in the swap vector for each instruction that mentions
				// a full vector register, recording various characteristics of the
				// instructions there.
				bool PPCVSXSwapRemoval::gatherVectorInstructions() {
				bool RelevantFunction = false;

				for (MachineBasicBlock &MBB : *MF) {
				for (MachineInstr &MI : MBB) {
				hfinkel: range-based for?
				wschmidt: OK -- I'll rework this and the ones below.

				bool RelevantInstr = false;
				bool ImplicitSubreg = false;
				hfinkel: range-based for?

				for (const MachineOperand &MO : MI.operands()) {
				if (!MO.isReg())
				continue;
				unsigned Reg = MO.getReg();
				hfinkel: for (const MachineOperand &MO : MI->operands()) {
				if (isVecReg(Reg)) {
				RelevantInstr = true;
				if (MO.getSubReg() != 0)
				ImplicitSubreg = true;
				break;
				}
				}

				if (!RelevantInstr)
				continue;

				RelevantFunction = true;

				// Create a SwapEntry initialized to zeros, then fill in the
				// instruction and ID fields before pushing it to the back
				// of the swap vector.
				PPCVSXSwapEntry SwapEntry{};
				int VecIdx = addSwapEntry(&MI, SwapEntry);

				echristo: What instructions not listed are you thinking here?
				wschmidt: A vast number of lane-insensitive instructions. All of the vector math, logical, select, etc. Everything that's true SIMD.
				echristo: Comment then? Just because there's another long list below so...
				wschmidt: Yep, I can do that.
				if (ImplicitSubreg)
				SwapVector[VecIdx].HasImplicitSubreg = 1;

				switch(MI.getOpcode()) {
				default:
				// Unless noted otherwise, an instruction is considered
				// safe for the optimization. There are a large number of
				// such true-SIMD instructions (all vector math, logical,
				// select, compare, etc.).
				SwapVector[VecIdx].IsSwappable = 1;
				break;
				case PPC::XXPERMDI:
				// This is a swap if it is of the form XXPERMDI t, s, s, 2.
				// Unfortunately, MachineCSE ignores COPY and SUBREG_TO_REG, so we
				// can also see XXPERMDI t, SUBREG_TO_REG(s), SUBREG_TO_REG(s), 2,
				// for example. We have to look through chains of COPY and
				// SUBREG_TO_REG to find the real source value for comparison.
				// If the real source value is a physical register, then mark the
				// XXPERMDI as mentioning a physical register.
				// Any other form of XXPERMDI is lane-sensitive and unsafe
				// for the optimization.
				if (MI.getOperand(3).getImm() == 2) {
				unsigned trueReg1 = lookThruCopyLike(MI.getOperand(1).getReg(),
				VecIdx);
				unsigned trueReg2 = lookThruCopyLike(MI.getOperand(2).getReg(),
				VecIdx);
				if (trueReg1 == trueReg2)
				SwapVector[VecIdx].IsSwap = 1;
				}
				break;
				case PPC::LVX:
				// Non-permuting loads are currently unsafe. We can use special
				// handling for this in the future. By not marking these as
				hfinkel: Where do we actually reject webs with non-swapping load/stores?
				wschmidt: The fact that these are not marked as either IsSwap or IsSwappable causes them to be rejected. See lines 537-538.
				// IsSwap, we ensure computations containing them will be rejected
				// for now.
				SwapVector[VecIdx].IsLoad = 1;
				break;
				case PPC::LXVD2X:
				case PPC::LXVW4X:
				// Permuting loads are marked as both load and swap, and are
				// safe for optimization.
				SwapVector[VecIdx].IsLoad = 1;
				SwapVector[VecIdx].IsSwap = 1;
				break;
				case PPC::STVX:
				// Non-permuting stores are currently unsafe. We can use special
				// handling for this in the future. By not marking these as
				// IsSwap, we ensure computations containing them will be rejected
				// for now.
				SwapVector[VecIdx].IsStore = 1;
				break;
				case PPC::STXVD2X:
				case PPC::STXVW4X:
				// Permuting stores are marked as both store and swap, and are
				// safe for optimization.
				SwapVector[VecIdx].IsStore = 1;
				hfinkel: We should add a comment here explaining that this happens because some of the 128-bit VSX registers have 128-bit Altivec sub-registers.
				wschmidt: OK.
				SwapVector[VecIdx].IsSwap = 1;
				break;
				case PPC::SUBREG_TO_REG:
				// These are fine provided they are moving between full vector
				// register classes. For example, the VRs are a subset of the
				// VSRs, but each VR and each VSR is a full 128-bit register.
				if (isVecReg(MI.getOperand(0).getReg()) &&
				isVecReg(MI.getOperand(2).getReg()))
				SwapVector[VecIdx].IsSwappable = 1;
				break;
				case PPC::COPY:
				// These are fine provided they are moving between full vector
				// register classes.
				if (isVecReg(MI.getOperand(0).getReg()) &&
				isVecReg(MI.getOperand(1).getReg()))
				SwapVector[VecIdx].IsSwappable = 1;
				break;
				case PPC::VSPLTB:
				case PPC::VSPLTH:
				case PPC::VSPLTW:
				// Splats are lane-sensitive, but we can use special handling
				// to adjust the source lane for the splat. This is not yet
				// implemented. When it is, we need to uncomment the following:
				// SwapVector[VecIdx].IsSwappable = 1;
				SwapVector[VecIdx].SpecialHandling = SHValues::SH_SPLAT;
				break;
				hfinkel: Yes, that would be better. If nothing else, we should be able to use the InstrMapping facility to generate a lookup table (this is how we currently keep track of which instructions are record forms, and the mapping between record-form and the non-record-form variants).
				wschmidt: I'll look at that, thanks. There didn't seem to be a way to just create a simple flag that I could see, which would have been nice...
				// The presence of the following lane-sensitive operations in a
				// web will kill the optimization, at least for now. For these
				// we do nothing, causing the optimization to fail.
				// FIXME: Some of these could be permitted with special handling,
				// and will be phased in as time permits.
				// FIXME: There is no simple and maintainable way to express a set
				// of opcodes having a common attribute in TableGen. Should this
				// change, this is a prime candidate to use such a mechanism.
				case PPC::INLINEASM:
				case PPC::EXTRACT_SUBREG:
				case PPC::INSERT_SUBREG:
				case PPC::COPY_TO_REGCLASS:
				case PPC::LVEBX:
				case PPC::LVEHX:
				case PPC::LVEWX:
				case PPC::LVSL:
				case PPC::LVSR:
				case PPC::LVXL:
				case PPC::LXVDSX:
				case PPC::STVEBX:
				case PPC::STVEHX:
				case PPC::STVEWX:
				case PPC::STVXL:
				case PPC::STXSDX:
				case PPC::VCIPHER:
				case PPC::VCIPHERLAST:
				case PPC::VMRGHB:
				case PPC::VMRGHH:
				case PPC::VMRGHW:
				case PPC::VMRGLB:
				case PPC::VMRGLH:
				case PPC::VMRGLW:
				case PPC::VMULESB:
				case PPC::VMULESH:
				case PPC::VMULESW:
				case PPC::VMULEUB:
				case PPC::VMULEUH:
				case PPC::VMULEUW:
				case PPC::VMULOSB:
				case PPC::VMULOSH:
				case PPC::VMULOSW:
				case PPC::VMULOUB:
				case PPC::VMULOUH:
				case PPC::VMULOUW:
				case PPC::VNCIPHER:
				case PPC::VNCIPHERLAST:
				case PPC::VPERM:
				case PPC::VPERMXOR:
				case PPC::VPKPX:
				case PPC::VPKSHSS:
				case PPC::VPKSHUS:
				case PPC::VPKSWSS:
				case PPC::VPKSWUS:
				case PPC::VPKUHUM:
				case PPC::VPKUHUS:
				case PPC::VPKUWUM:
				case PPC::VPKUWUS:
				case PPC::VPMSUMB:
				case PPC::VPMSUMD:
				case PPC::VPMSUMH:
				case PPC::VPMSUMW:
				case PPC::VRLB:
				case PPC::VRLD:
				case PPC::VRLH:
				case PPC::VRLW:
				case PPC::VSBOX:
				case PPC::VSHASIGMAD:
				case PPC::VSHASIGMAW:
				case PPC::VSL:
				case PPC::VSLDOI:
				case PPC::VSLO:
				case PPC::VSR:
				case PPC::VSRO:
				case PPC::VSUM2SWS:
				case PPC::VSUM4SBS:
				case PPC::VSUM4SHS:
				case PPC::VSUM4UBS:
				case PPC::VSUMSWS:
				case PPC::VUPKHPX:
				case PPC::VUPKHSB:
				case PPC::VUPKHSH:
				case PPC::VUPKLPX:
				case PPC::VUPKLSB:
				case PPC::VUPKLSH:
				case PPC::XXMRGHW:
				case PPC::XXMRGLW:
				case PPC::XXSPLTW:
				break;
				}
				}
				}

				if (RelevantFunction) {
				DEBUG(dbgs() << "Swap vector when first built\n\n");
				dumpSwapVector();
				}

				return RelevantFunction;
				}

				// Add an entry to the swap vector and swap map, and make a
				// singleton equivalence class for the entry.
				int PPCVSXSwapRemoval::addSwapEntry(MachineInstr *MI,
				PPCVSXSwapEntry& SwapEntry) {
				SwapEntry.VSEMI = MI;
				SwapEntry.VSEId = SwapVector.size();
				SwapVector.push_back(SwapEntry);
				EC->insert(SwapEntry.VSEId);
				SwapMap[MI] = SwapEntry.VSEId;
				return SwapEntry.VSEId;
				}

				// This is used to find the "true" source register for an
				// XXPERMDI instruction, since MachineCSE does not handle the
				// "copy-like" operations (Copy and SubregToReg). Returns
				// the original SrcReg unless it is the target of a copy-like
				// operation, in which case we chain backwards through all
				// such operations to the ultimate source register. If a
				// physical register is encountered, we stop the search and
				// flag the swap entry indicated by VecIdx (the original
				// XXPERMDI) as mentioning a physical register. Similarly
				// for implicit subregister mentions (which should never
				// happen).
				unsigned PPCVSXSwapRemoval::lookThruCopyLike(unsigned SrcReg,
				unsigned VecIdx) {
				MachineInstr *MI = MRI->getVRegDef(SrcReg);
				hfinkel: add an assert that isSubregToReg() is true.
				wschmidt: OK.
				if (!MI->isCopyLike())
				return SrcReg;

				unsigned CopySrcReg, CopySrcSubreg;
				if (MI->isCopy()) {
				CopySrcReg = MI->getOperand(1).getReg();
				CopySrcSubreg = MI->getOperand(1).getSubReg();
				} else {
				assert(MI->isSubregToReg() && "bad opcode for lookThruCopyLike");
				CopySrcReg = MI->getOperand(2).getReg();
				CopySrcSubreg = MI->getOperand(2).getSubReg();
				}

				if (!TargetRegisterInfo::isVirtualRegister(CopySrcReg)) {
				SwapVector[VecIdx].MentionsPhysVR = 1;
				return CopySrcReg;
				}

				if (CopySrcSubreg != 0) {
				SwapVector[VecIdx].HasImplicitSubreg = 1;
				return CopySrcReg;
				}

				return lookThruCopyLike(CopySrcReg, VecIdx);
				}

				// Generate equivalence classes for related computations (webs) by
				// def-use relationships of virtual registers. Mention of a physical
				// register terminates the generation of equivalence classes as this
				// indicates a use of a parameter, definition of a return value, use
				// of a value returned from a call, or definition of a parameter to a
				// call. Computations with physical register mentions are flagged
				hfinkel: range-based for?
				// as such so their containing webs will not be optimized.
				void PPCVSXSwapRemoval::formWebs() {

				DEBUG(dbgs() << "\n* Forming webs for swap removal *\n\n");

				for (unsigned EntryIdx = 0; EntryIdx < SwapVector.size(); ++EntryIdx) {

				MachineInstr *MI = SwapVector[EntryIdx].VSEMI;

				DEBUG(dbgs() << "\n" << SwapVector[EntryIdx].VSEId << " ");
				DEBUG(MI->dump());

				// It's sufficient to walk vector uses and join them to their unique
				// definitions. In addition, check all vector register operands
				// for physical regs.
				for (const MachineOperand &MO : MI->operands()) {
				if (!MO.isReg())
				continue;

				unsigned Reg = MO.getReg();
				if (!isVecReg(Reg))
				continue;

				if (!TargetRegisterInfo::isVirtualRegister(Reg)) {
				SwapVector[EntryIdx].MentionsPhysVR = 1;
				continue;
				}

				if (!MO.isUse())
				continue;

				MachineInstr* DefMI = MRI->getVRegDef(Reg);
				assert(SwapMap.find(DefMI) != SwapMap.end() &&
				"Inconsistency: def of vector reg not found in swap map!");
				int DefIdx = SwapMap[DefMI];
				(void)EC->unionSets(SwapVector[DefIdx].VSEId,
				SwapVector[EntryIdx].VSEId);

				DEBUG(dbgs() << format("Unioning %d with %d\n", SwapVector[DefIdx].VSEId,
				SwapVector[EntryIdx].VSEId));
				DEBUG(dbgs() << " Def: ");
				DEBUG(DefMI->dump());
				}
				}
				}

				// Walk the swap vector entries looking for conditions that prevent their
				// containing computations from being optimized. When such conditions are
				// found, mark the representative of the computation's equivalence class
				// as rejected.
				void PPCVSXSwapRemoval::recordUnoptimizableWebs() {

				DEBUG(dbgs() << "\n* Rejecting webs for swap removal *\n\n");

				for (unsigned EntryIdx = 0; EntryIdx < SwapVector.size(); ++EntryIdx) {
				int Repr = EC->getLeaderValue(SwapVector[EntryIdx].VSEId);

				// Reject webs containing mentions of physical registers or implicit
				// subregs, or containing operations that we don't know how to handle
				// in a lane-permuted region.
				if (SwapVector[EntryIdx].MentionsPhysVR ||
				SwapVector[EntryIdx].HasImplicitSubreg ||
				!(SwapVector[EntryIdx].IsSwappable || SwapVector[EntryIdx].IsSwap)) {

				SwapVector[Repr].WebRejected = 1;

				DEBUG(dbgs() <<
				format("Web %d rejected for physreg, subreg, or not swap[pable]\n",
				Repr));
				DEBUG(dbgs() << " in " << EntryIdx << ": ");
				DEBUG(SwapVector[EntryIdx].VSEMI->dump());
				DEBUG(dbgs() << "\n");
				}

				// Reject webs that contain swapping loads that feed something other
				// than a swap instruction.
				else if (SwapVector[EntryIdx].IsLoad && SwapVector[EntryIdx].IsSwap) {
				MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				unsigned DefReg = MI->getOperand(0).getReg();

				// We skip debug instructions in the analysis. (Note that debug
				// location information is still maintained by this optimization
				// because it remains on the LXVD2X and STXVD2X instructions after
				// the XXPERMDIs are removed.)
				for (MachineInstr &UseMI : MRI->use_nodbg_instructions(DefReg)) {
				int UseIdx = SwapMap[&UseMI];

				if (!SwapVector[UseIdx].IsSwap || SwapVector[UseIdx].IsLoad ||
				SwapVector[UseIdx].IsStore) {

				SwapVector[Repr].WebRejected = 1;

				DEBUG(dbgs() <<
				format("Web %d rejected for load not feeding swap\n", Repr));
				DEBUG(dbgs() << " def " << EntryIdx << ": ");
				DEBUG(MI->dump());
				DEBUG(dbgs() << " use " << UseIdx << ": ");
				DEBUG(UseMI.dump());
				DEBUG(dbgs() << "\n");
				}
				}

				// Reject webs that contain swapping stores that are fed by something
				// other than a swap instruction.
				} else if (SwapVector[EntryIdx].IsStore && SwapVector[EntryIdx].IsSwap) {
				MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				unsigned UseReg = MI->getOperand(0).getReg();
				MachineInstr *DefMI = MRI->getVRegDef(UseReg);
				int DefIdx = SwapMap[DefMI];

				if (!SwapVector[DefIdx].IsSwap || SwapVector[DefIdx].IsLoad ||
				SwapVector[DefIdx].IsStore) {

				SwapVector[Repr].WebRejected = 1;

				DEBUG(dbgs() <<
				format("Web %d rejected for store not fed by swap\n", Repr));
				DEBUG(dbgs() << " def " << DefIdx << ": ");
				DEBUG(DefMI->dump());
				DEBUG(dbgs() << " use " << EntryIdx << ": ");
				DEBUG(MI->dump());
				DEBUG(dbgs() << "\n");
				}
				echristo: Good point, what _does_ this do for debug info? :)
				wschmidt: It doesn't do anything, but it doesn't need to. When the load or store is initially expanded, its location information goes on the LXVD2X and XXPERMDI instructions, or on the XXPERMDI and STXVD2X instructions. This pass removes the XXPERMDI instructions, but the location information remains with the LXVD2X or STXVD2X.
				echristo: That's what I thought, just making sure.
				wschmidt: I can add a comment to this effect.
				    }
				  }

				  DEBUG(dbgs() << "Swap vector after web analysis:\n\n");
				  dumpSwapVector();
				}

				// Walk the swap vector entries looking for swaps fed by permuting loads
				// and swaps that feed permuting stores. If the containing computation
				// has not been marked rejected, mark each such swap for removal.
				// (Removal is delayed in case optimization has disturbed the pattern,
				// such that multiple loads feed the same swap, etc.)
				void PPCVSXSwapRemoval::markSwapsForRemoval() {

				  DEBUG(dbgs() << "\n* Marking swaps for removal *\n\n");

				  for (unsigned EntryIdx = 0; EntryIdx < SwapVector.size(); ++EntryIdx) {

				    if (SwapVector[EntryIdx].IsLoad && SwapVector[EntryIdx].IsSwap) {
				      int Repr = EC->getLeaderValue(SwapVector[EntryIdx].VSEId);

				      if (!SwapVector[Repr].WebRejected) {
				        MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				        unsigned DefReg = MI->getOperand(0).getReg();

				        for (MachineInstr &UseMI : MRI->use_nodbg_instructions(DefReg)) {
				          int UseIdx = SwapMap[&UseMI];
				          SwapVector[UseIdx].WillRemove = 1;

				          DEBUG(dbgs() << "Marking swap fed by load for removal: ");
				          DEBUG(UseMI.dump());
				        }
echristo: ?
wschmidt (author): As noted in the overall patch commentary, there are non-pure-SIMD instructions for which we can still perform this optimization, provided we change code generation for those instructions. My plan is to fill in this function with the details in future patches. The initial patch is for the "simple" part (which covers a great many cases).
echristo: Sounds good.
				      }

				    } else if (SwapVector[EntryIdx].IsStore && SwapVector[EntryIdx].IsSwap) {
				      int Repr = EC->getLeaderValue(SwapVector[EntryIdx].VSEId);

				      if (!SwapVector[Repr].WebRejected) {
				        MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				        unsigned UseReg = MI->getOperand(0).getReg();
				        MachineInstr *DefMI = MRI->getVRegDef(UseReg);
				        int DefIdx = SwapMap[DefMI];
				        SwapVector[DefIdx].WillRemove = 1;

				        DEBUG(dbgs() << "Marking swap feeding store for removal: ");
				        DEBUG(DefMI->dump());
				      }

				    } else if (SwapVector[EntryIdx].IsSwappable &&
				               SwapVector[EntryIdx].SpecialHandling != 0)
				      handleSpecialSwappables(EntryIdx);
				  }
				}

				// The identified swap entry requires special handling to allow its
				// containing computation to be optimized. Perform that handling
				// here.
				// FIXME: This code is to be phased in with subsequent patches.
				void PPCVSXSwapRemoval::handleSpecialSwappables(int EntryIdx) {
				}

				// Walk the swap vector and replace each entry marked for removal with
				// a copy operation.
				bool PPCVSXSwapRemoval::removeSwaps() {

				  DEBUG(dbgs() << "\n* Removing swaps *\n\n");

				  bool Changed = false;

				  for (unsigned EntryIdx = 0; EntryIdx < SwapVector.size(); ++EntryIdx) {
				    if (SwapVector[EntryIdx].WillRemove) {
				      Changed = true;
				      MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				      MachineBasicBlock *MBB = MI->getParent();
				      BuildMI(*MBB, MI, MI->getDebugLoc(),
				              TII->get(TargetOpcode::COPY), MI->getOperand(0).getReg())
				        .addOperand(MI->getOperand(1));

				      DEBUG(dbgs() << format("Replaced %d with copy: ",
				                             SwapVector[EntryIdx].VSEId));
				      DEBUG(MI->dump());

				      MI->eraseFromParent();
				    }
				  }

				  return Changed;
				}

				// For debug purposes, dump the contents of the swap vector.
				void PPCVSXSwapRemoval::dumpSwapVector() {

				  for (unsigned EntryIdx = 0; EntryIdx < SwapVector.size(); ++EntryIdx) {

				    MachineInstr *MI = SwapVector[EntryIdx].VSEMI;
				    int ID = SwapVector[EntryIdx].VSEId;

				    DEBUG(dbgs() << format("%6d", ID));
				    DEBUG(dbgs() << format("%6d", EC->getLeaderValue(ID)));
				    DEBUG(dbgs() << format(" BB#%3d", MI->getParent()->getNumber()));
				    DEBUG(dbgs() << format(" %14s ", TII->getName(MI->getOpcode())));

				    if (SwapVector[EntryIdx].IsLoad)
				      DEBUG(dbgs() << "load ");
				    if (SwapVector[EntryIdx].IsStore)
				      DEBUG(dbgs() << "store ");
				    if (SwapVector[EntryIdx].IsSwap)
				      DEBUG(dbgs() << "swap ");
				    if (SwapVector[EntryIdx].MentionsPhysVR)
				      DEBUG(dbgs() << "physreg ");
				    if (SwapVector[EntryIdx].HasImplicitSubreg)
				      DEBUG(dbgs() << "implsubreg ");

				    if (SwapVector[EntryIdx].IsSwappable) {
				      DEBUG(dbgs() << "swappable ");
				      switch(SwapVector[EntryIdx].SpecialHandling) {
				      default:
				        DEBUG(dbgs() << "special:unknown");
				        break;
				      case SH_NONE:
				        break;
				      case SH_BUILDVEC:
				        DEBUG(dbgs() << "special:buildvec ");
				        break;
				      case SH_EXTRACT:
				        DEBUG(dbgs() << "special:extract ");
				        break;
				      case SH_INSERT:
				        DEBUG(dbgs() << "special:insert ");
				        break;
				      case SH_NOSWAP_LD:
				        DEBUG(dbgs() << "special:load ");
				        break;
				      case SH_NOSWAP_ST:
				        DEBUG(dbgs() << "special:store ");
				        break;
				      case SH_SPLAT:
				        DEBUG(dbgs() << "special:splat ");
				        break;
				      }
				    }

				    if (SwapVector[EntryIdx].WebRejected)
				      DEBUG(dbgs() << "rejected ");
				    if (SwapVector[EntryIdx].WillRemove)
				      DEBUG(dbgs() << "remove ");

				    DEBUG(dbgs() << "\n");
				  }

				  DEBUG(dbgs() << "\n");
				}

				} // end default namespace

				INITIALIZE_PASS_BEGIN(PPCVSXSwapRemoval, DEBUG_TYPE,
				"PowerPC VSX Swap Removal", false, false)
				INITIALIZE_PASS_END(PPCVSXSwapRemoval, DEBUG_TYPE,
				"PowerPC VSX Swap Removal", false, false)

				char PPCVSXSwapRemoval::ID = 0;
				FunctionPass*
				llvm::createPPCVSXSwapRemovalPass() { return new PPCVSXSwapRemoval(); }
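
The two-phase structure of the pass (reject whole webs first, then mark swaps only in the webs that survive) can be illustrated outside LLVM. What follows is a minimal, self-contained C++ sketch under stated assumptions, not the pass itself: `Entry`, `Web`, `Uses`, `Def`, `isPureSwap`, `makeChain`, and `countMarked` are hypothetical names standing in for the swap vector, the equivalence classes (`EC`), and the def-use queries on `MRI`. Only the flag names (`IsLoad`, `IsStore`, `IsSwap`, `WebRejected`, `WillRemove`) are taken from the patch.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for one swap-vector entry. The flag names follow the patch;
// Web, Uses and Def are assumptions that replace EC, MRI and SwapMap.
struct Entry {
  bool IsLoad = false, IsStore = false, IsSwap = false;
  int Web = 0;            // index of the web representative
  std::vector<int> Uses;  // swapping load: indices of the users of its result
  int Def = -1;           // swapping store: index of the def of its input
  bool WebRejected = false;
  bool WillRemove = false;
};

// A "pure" swap is a swap that is not itself a permuting load or store.
static bool isPureSwap(const Entry &E) {
  return E.IsSwap && !E.IsLoad && !E.IsStore;
}

// Phase 1: reject every web in which a swapping load feeds a non-swap,
// or a swapping store is fed by a non-swap.
void rejectWebs(std::vector<Entry> &V) {
  for (Entry &E : V) {
    if (E.IsLoad && E.IsSwap) {
      for (int U : E.Uses)
        if (!isPureSwap(V[U]))
          V[E.Web].WebRejected = true;
    } else if (E.IsStore && E.IsSwap) {
      if (!isPureSwap(V[E.Def]))
        V[E.Web].WebRejected = true;
    }
  }
}

// Phase 2: in surviving webs, mark the swaps adjacent to loads and stores.
void markSwaps(std::vector<Entry> &V) {
  for (Entry &E : V) {
    if (V[E.Web].WebRejected)
      continue;
    if (E.IsLoad && E.IsSwap) {
      for (int U : E.Uses)
        V[U].WillRemove = true;
    } else if (E.IsStore && E.IsSwap) {
      V[E.Def].WillRemove = true;
    }
  }
}

// Builds load -> swap -> add -> swap -> store as entries 0..4, all in web 0.
// If LoadAlsoFeedsAdd is set, the load result also reaches the add directly,
// without passing through a swap.
std::vector<Entry> makeChain(bool LoadAlsoFeedsAdd) {
  std::vector<Entry> V(5);
  V[0].IsLoad = V[0].IsSwap = true;
  V[0].Uses = {1};
  if (LoadAlsoFeedsAdd)
    V[0].Uses.push_back(2);
  V[1].IsSwap = true;
  V[3].IsSwap = true;
  V[4].IsStore = V[4].IsSwap = true;
  V[4].Def = 3;
  return V;
}

// Runs both phases on the toy chain and counts the swaps marked for removal.
int countMarked(bool LoadAlsoFeedsAdd) {
  std::vector<Entry> V = makeChain(LoadAlsoFeedsAdd);
  rejectWebs(V);
  markSwaps(V);
  int N = 0;
  for (const Entry &E : V)
    N += E.WillRemove;
  return N;
}
```

Under these assumptions, `countMarked(false)` marks the two pure swaps in the chain, while `countMarked(true)` marks none: the load also feeds the add directly, so the whole web is rejected before any marking happens.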

test/CodeGen/PowerPC/swaps-le-1.ll

				; RUN: llc -O3 -mcpu=pwr8 -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck %s
				; RUN: llc -O3 -mcpu=pwr8 -disable-ppc-vsx-swap-removal -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck -check-prefix=NOOPTSWAP %s

				; This test was generated from the following source:
				;
				; #define N 4096
				; int ca[N] __attribute__((aligned(16)));
				; int cb[N] __attribute__((aligned(16)));
				; int cc[N] __attribute__((aligned(16)));
				; int cd[N] __attribute__((aligned(16)));
				;
				; void foo ()
				; {
				; int i;
				; for (i = 0; i < N; i++) {
				; ca[i] = (cb[i] + cc[i]) * cd[i];
				; }
				; }

				@cb = common global [4096 x i32] zeroinitializer, align 16
				@cc = common global [4096 x i32] zeroinitializer, align 16
				@cd = common global [4096 x i32] zeroinitializer, align 16
				@ca = common global [4096 x i32] zeroinitializer, align 16

				define void @foo() {
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next.3, %vector.body ]
				%0 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index
				%1 = bitcast i32* %0 to <4 x i32>*
				%wide.load = load <4 x i32>, <4 x i32>* %1, align 16
				%2 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index
				%3 = bitcast i32* %2 to <4 x i32>*
				%wide.load13 = load <4 x i32>, <4 x i32>* %3, align 16
				%4 = add nsw <4 x i32> %wide.load13, %wide.load
				%5 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index
				%6 = bitcast i32* %5 to <4 x i32>*
				%wide.load14 = load <4 x i32>, <4 x i32>* %6, align 16
				%7 = mul nsw <4 x i32> %4, %wide.load14
				%8 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index
				%9 = bitcast i32* %8 to <4 x i32>*
				store <4 x i32> %7, <4 x i32>* %9, align 16
				%index.next = add nuw nsw i64 %index, 4
				%10 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next
				%11 = bitcast i32* %10 to <4 x i32>*
				%wide.load.1 = load <4 x i32>, <4 x i32>* %11, align 16
				%12 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next
				%13 = bitcast i32* %12 to <4 x i32>*
				%wide.load13.1 = load <4 x i32>, <4 x i32>* %13, align 16
				%14 = add nsw <4 x i32> %wide.load13.1, %wide.load.1
				%15 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next
				%16 = bitcast i32* %15 to <4 x i32>*
				%wide.load14.1 = load <4 x i32>, <4 x i32>* %16, align 16
				%17 = mul nsw <4 x i32> %14, %wide.load14.1
				%18 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next
				%19 = bitcast i32* %18 to <4 x i32>*
				store <4 x i32> %17, <4 x i32>* %19, align 16
				%index.next.1 = add nuw nsw i64 %index.next, 4
				%20 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next.1
				%21 = bitcast i32* %20 to <4 x i32>*
				%wide.load.2 = load <4 x i32>, <4 x i32>* %21, align 16
				%22 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next.1
				%23 = bitcast i32* %22 to <4 x i32>*
				%wide.load13.2 = load <4 x i32>, <4 x i32>* %23, align 16
				%24 = add nsw <4 x i32> %wide.load13.2, %wide.load.2
				%25 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next.1
				%26 = bitcast i32* %25 to <4 x i32>*
				%wide.load14.2 = load <4 x i32>, <4 x i32>* %26, align 16
				%27 = mul nsw <4 x i32> %24, %wide.load14.2
				%28 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next.1
				%29 = bitcast i32* %28 to <4 x i32>*
				store <4 x i32> %27, <4 x i32>* %29, align 16
				%index.next.2 = add nuw nsw i64 %index.next.1, 4
				%30 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next.2
				%31 = bitcast i32* %30 to <4 x i32>*
				%wide.load.3 = load <4 x i32>, <4 x i32>* %31, align 16
				%32 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next.2
				%33 = bitcast i32* %32 to <4 x i32>*
				%wide.load13.3 = load <4 x i32>, <4 x i32>* %33, align 16
				%34 = add nsw <4 x i32> %wide.load13.3, %wide.load.3
				%35 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next.2
				%36 = bitcast i32* %35 to <4 x i32>*
				%wide.load14.3 = load <4 x i32>, <4 x i32>* %36, align 16
				%37 = mul nsw <4 x i32> %34, %wide.load14.3
				%38 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next.2
				%39 = bitcast i32* %38 to <4 x i32>*
				store <4 x i32> %37, <4 x i32>* %39, align 16
				%index.next.3 = add nuw nsw i64 %index.next.2, 4
				%40 = icmp eq i64 %index.next.3, 4096
				br i1 %40, label %for.end, label %vector.body

				for.end:
				ret void
				}

				; CHECK-LABEL: @foo
				; CHECK-NOT: xxpermdi
				; CHECK-NOT: xxswapd

				; CHECK: lxvd2x
				; CHECK: lxvd2x
				; CHECK-DAG: lxvd2x
				; CHECK-DAG: vadduwm
				; CHECK: vmuluwm
				; CHECK: stxvd2x

				; CHECK: lxvd2x
				; CHECK: lxvd2x
				; CHECK-DAG: lxvd2x
				; CHECK-DAG: vadduwm
				; CHECK: vmuluwm
				; CHECK: stxvd2x

				; CHECK: lxvd2x
				; CHECK: lxvd2x
				; CHECK-DAG: lxvd2x
				; CHECK-DAG: vadduwm
				; CHECK: vmuluwm
				; CHECK: stxvd2x

				; CHECK: lxvd2x
				; CHECK: lxvd2x
				; CHECK-DAG: lxvd2x
				; CHECK-DAG: vadduwm
				; CHECK: vmuluwm
				; CHECK: stxvd2x


				; NOOPTSWAP-LABEL: @foo

				; NOOPTSWAP: lxvd2x
				; NOOPTSWAP-DAG: lxvd2x
				; NOOPTSWAP-DAG: lxvd2x
				; NOOPTSWAP-DAG: xxpermdi
				; NOOPTSWAP-DAG: xxpermdi
				; NOOPTSWAP-DAG: xxpermdi
				; NOOPTSWAP-DAG: vadduwm
				; NOOPTSWAP: vmuluwm
				; NOOPTSWAP: xxpermdi
				; NOOPTSWAP-DAG: xxpermdi
				; NOOPTSWAP-DAG: xxpermdi
				; NOOPTSWAP-DAG: stxvd2x
				; NOOPTSWAP-DAG: stxvd2x
				; NOOPTSWAP: stxvd2x
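
Why the swaps in this test are removable can be checked with a small model. The sketch below is an illustration under stated assumptions, not code from the patch: it models a 4 x i32 vector as a `std::array`, models `xxswapd` as an exchange of the two 64-bit halves, and assumes that little-endian `lxvd2x` and `stxvd2x` can each be modeled as the same exchange between memory image and register image. All function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<std::int32_t, 4>;

// Exchange the two 64-bit halves of a 4 x 32-bit vector (models xxswapd).
V4 swapD(const V4 &X) { return V4{{X[2], X[3], X[0], X[1]}}; }

// Assumed model of lxvd2x / stxvd2x on little-endian: the register image is
// the memory image with its doublewords exchanged, and vice versa.
V4 loadPermuted(const V4 &Mem) { return swapD(Mem); }
V4 storePermuted(const V4 &Reg) { return swapD(Reg); }

// Lane-insensitive (elementwise) operations.
V4 add(const V4 &A, const V4 &B) {
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = A[I] + B[I];
  return R;
}

V4 mul(const V4 &A, const V4 &B) {
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = A[I] * B[I];
  return R;
}

// ca = (cb + cc) * cd with a swap after every load and before the store.
// Returns the memory image that ends up in ca.
V4 withSwaps(const V4 &B, const V4 &C, const V4 &D) {
  V4 R = mul(add(swapD(loadPermuted(B)), swapD(loadPermuted(C))),
             swapD(loadPermuted(D)));
  return storePermuted(swapD(R));
}

// The same computation with every swap removed.
V4 withoutSwaps(const V4 &B, const V4 &C, const V4 &D) {
  return storePermuted(
      mul(add(loadPermuted(B), loadPermuted(C)), loadPermuted(D)));
}

// A lane-sensitive operation: read element 0 of the loaded value.
std::int32_t lane0WithSwap(const V4 &Mem) { return swapD(loadPermuted(Mem))[0]; }
std::int32_t lane0WithoutSwap(const V4 &Mem) { return loadPermuted(Mem)[0]; }
```

In this model `withSwaps` and `withoutSwaps` store the same memory image for any inputs, because the elementwise operations commute with the lane permutation. The two `lane0` helpers differ, which is why a web containing such an operation cannot simply have its swaps dropped.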

test/CodeGen/PowerPC/vsx-ldst-builtin-le.ll

	; RUN: llc -mcpu=pwr8 -mattr=+vsx -O2 -mtriple=powerpc64le-unknown-linux-gnu < %s > %t
	; RUN: grep lxvd2x < %t | count 18
	; RUN: grep stxvd2x < %t | count 18
-	; RUN: grep xxpermdi < %t | count 36

	@vf = global <4 x float> <float -1.500000e+00, float 2.500000e+00, float -3.500000e+00, float 4.500000e+00>, align 16
	@vd = global <2 x double> <double 3.500000e+00, double -7.500000e+00>, align 16
	@vsi = global <4 x i32> <i32 -1, i32 2, i32 -3, i32 4>, align 16
	@vui = global <4 x i32> <i32 0, i32 1, i32 2, i32 3>, align 16
	@vsll = global <2 x i64> <i64 255, i64 -937>, align 16
	@vull = global <2 x i64> <i64 1447, i64 2894>, align 16
	@res_vsi = common global <4 x i32> zeroinitializer, align 16

test/CodeGen/PowerPC/vsx-ldst.ll

	; RUN: llc -mcpu=pwr8 -mattr=+vsx -O2 -mtriple=powerpc64-unknown-linux-gnu < %s > %t
	; RUN: grep lxvw4x < %t | count 3
	; RUN: grep lxvd2x < %t | count 3
	; RUN: grep stxvw4x < %t | count 3
	; RUN: grep stxvd2x < %t | count 3
	; RUN: llc -mcpu=pwr8 -mattr=+vsx -O0 -fast-isel=1 -mtriple=powerpc64-unknown-linux-gnu < %s > %t
	; RUN: grep lxvw4x < %t | count 3
	; RUN: grep lxvd2x < %t | count 3
	; RUN: grep stxvw4x < %t | count 3
	; RUN: grep stxvd2x < %t | count 3

	; RUN: llc -mcpu=pwr8 -mattr=+vsx -O2 -mtriple=powerpc64le-unknown-linux-gnu < %s > %t
	; RUN: grep lxvd2x < %t | count 6
	; RUN: grep stxvd2x < %t | count 6
-	; RUN: grep xxpermdi < %t | count 12

	@vsi = global <4 x i32> <i32 -1, i32 2, i32 -3, i32 4>, align 16
	@vui = global <4 x i32> <i32 0, i32 1, i32 2, i32 3>, align 16
	@vf = global <4 x float> <float -1.500000e+00, float 2.500000e+00, float -3.500000e+00, float 4.500000e+00>, align 16
	@vsll = global <2 x i64> <i64 255, i64 -937>, align 16
	@vull = global <2 x i64> <i64 1447, i64 2894>, align 16
	@vd = global <2 x double> <double 3.500000e+00, double -7.500000e+00>, align 16
	@res_vsi = common global <4 x i32> zeroinitializer, align 16

Revision Contents

Diff 24491

lib/Target/PowerPC/CMakeLists.txt

lib/Target/PowerPC/PPC.h

lib/Target/PowerPC/PPCInstrAltivec.td

lib/Target/PowerPC/PPCInstrVSX.td

lib/Target/PowerPC/PPCTargetMachine.cpp

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp

test/CodeGen/PowerPC/swaps-le-1.ll

test/CodeGen/PowerPC/vsx-ldst-builtin-le.ll

test/CodeGen/PowerPC/vsx-ldst.ll
