Page MenuHomePhabricator

[PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations
ClosedPublic

Authored by wschmidt on Mar 23 2015, 2:08 PM.

Details

Summary

This patch adds a new SSA MI pass that runs on little-endian PPC64 code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors without alignment constraints are accomplished for little-endian using lxvd2x/xxswapd and xxswapd/stxvd2x. The existence of the additional xxswapd instructions hurts performance in comparison with big-endian code, but they are necessary in the general case to support correct semantics.

However, the general case does not apply to most vector code. Many vector instructions are lane-insensitive; they do not "care" which lanes the parallel computations are performed within, provided that the resulting data is stored into the correct locations. Thus this pass looks for computations that perform only lane-insensitive operations, and remove the unnecessary swaps from loads and stores in such computations.

Future improvements will allow computations using certain lane-sensitive operations to also be optimized in this manner, by modifying the lane-sensitive operations to account for the permuted order of the lanes. However, this patch only adds the infrastructure to permit this; no lane-sensitive operations are optimized at this time.

I have a few questions about infrastructure; these are flagged as FIXMEs in the commentary. Reviewers, I'd appreciate your opinions on these!

Diff Detail

Repository
rL LLVM

Event Timeline

wschmidt updated this revision to Diff 22514.Mar 23 2015, 2:08 PM
wschmidt retitled this revision from to [PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations.
wschmidt updated this object.
wschmidt edited the test plan for this revision. (Show Details)
wschmidt added reviewers: hfinkel, kbarton, seurer, nemanjai.
wschmidt set the repository for this revision to rL LLVM.
wschmidt added a subscriber: Unknown Object (MLST).

Comment on my own patch: This needs to be controllable via option. (For GCC the similar transformation is controlled with -m[no-]optimize-swaps.)

wschmidt updated this revision to Diff 22895.Mar 30 2015, 12:13 PM

Here's an updated patch that adds support for -mattr={+,-}optimize-swaps. I'll post a companion patch shortly for Clang, adding -m[no-]optimize-swaps in the front end to generate this attribute.

The test case now includes a -mno-optimize-swaps variant.

Why is a command line option a subtarget feature? More comments inline...

-eric

lib/Target/PowerPC/PPCTargetMachine.cpp
316

Comment on why.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
111

DenseMap?

205

256?

252

What instructions not listed are you thinking here?

615

Good point, what _does_ this do for debug info? :)

647

?

Eric and I had a short chat about the option situation on IRC. Apparently there really isn't a good mechanism for disabling a specific pass for a specific target right now. I am fine with removing the option portion of this for now until LLVM supports a proper way of doing this. The down side is losing the chicken switch of a quick workaround. Since this is a -O1 optimization, the workaround becomes -O0 (or -O0 -no-fast-isel) which is somewhat harsh.

Responses to inline comments ... inline!

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205

Hey, a magic constant. That's why I gave it a name. :) Picking an initial vector size to avoid too many reallocations while keeping the size reasonable. Yes, it's a wet thumb to the breeze...

252

A vast number of lane-insensitive instructions. All of the vector math, logical, select, etc. Everything that's true SIMD.

615

It doesn't do anything, but it doesn't need to. When the load or store is initially expanded, its location information goes on the LXVD2X and XXPERMDI instructions, or on the XXPERMDI and STXVD2X instructions. This pass removes the XXPERMDI instructions, but the location information remains with the LXVD2X or STXVD2X.

647

As noted in the overall patch commentary, there are non-pure-SIMD instructions for which we can still perform this optimization, provided we change code generation for those instructions. My plan is to fill in this function with the details in future patches. The initial patch is for the "simple" part (which covers a great many cases).

As far as the option you can have a backend option to turn the pass on and off, see the aarch64 port for more information there.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205

Did you use science to figure it out (i.e. see what numbers would generally work etc) or just WAG?

252

Comment then? Just because there's another long list below so...

615

That's what I thought, just making sure.

647

Sounds good.

I'd like to understand why you opted for doing this in MI instead of at the IR level? In particular, it seems odd to have the loop vectorizer and other tools build vectorized code in the "wrong" way and then fix it if we can later.

I'd like to understand why you opted for doing this in MI instead of at the IR level? In particular, it seems odd to have the loop vectorizer and other tools build vectorized code in the "wrong" way and then fix it if we can later.

The extra swap instructions are not introduced until instruction selection, so the IR level is already correct. The instruction selector is too myopic to view entire computation webs, so we have to be conservative and generate code that uses true-LE register representations, then clean up at a later time when we can see the big picture. (So both the disease and the cure are introduced in the back end.)

I'll look at the backend option as a temporary measure, although someday I'd like to see a better solution. (That said, I can't volunteer to design it myself right now, so I have no right to complain...)

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
111

Yes, that looks appropriate. I'll move to that.

205

Semi-science, yes. I've spent some time on this in GCC as well and gotten a pretty good feel for what happens. 256 is sufficiently large to prevent any reallocations in the usual case. For larger functions, we will see a reallocation or two. My testing on LLVM indicated that the code in projects/test-suite used no reallocations except for three of the benchmarks (I know this because my implementation was broken, so I noticed failures when the reallocations occurred). So I feel this is a good choice.

252

Yep, I can do that.

615

I can add a comment to this effect.

echristo added inline comments.Mar 30 2015, 2:16 PM
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
205

Good enough science for me. I was hoping you'd collected normal size, but this is probably close enough.

wschmidt updated this revision to Diff 22968.Mar 31 2015, 9:44 AM

I've attempted to address all comments received to date. I've changed the option handling to ape what's done for some similar passes (will be controlled with -mllvm -disable-ppc-vsx-swap-removal).

I'm still interested in any opinions on some of the FIXME issues I raised.

Thanks for the reviews!

Bill

Seems pretty reasonable on this end. I haven't done an in-depth look at the actual optimization at this point in case Hal or someone else has.

Quick Question: The testcases with the removed xxpermdi instruction checks - not being emitted at all, emitting fewer, useless to check anyhow, something else?

-eric

Quick Question: The testcases with the removed xxpermdi instruction checks - not being emitted at all, emitting fewer, useless to check anyhow, something else?

The tests check for total removal (xxswapd is an alternative mnemonic for this specific form of xxpermdi):

; CHECK-NOT: xxpermdi
; CHECK-NOT: xxswapd

wschmidt added a comment.EditedApr 3 2015, 6:20 AM

Hm, pre-coffee comment. I suspect you mean the last two tests where I removed the checks. At this time they don't check anything to do with the xxpermdi. Alternatively, I could leave the check in place and replace "count 12" with "count 0" or otherwise look at this. Since those tests weren't designed to specifically check an optimization I initially chose to remove the check, but I could go either way.

(The results in both cases are that all xxpermdi instructions are removed.)

Nah, that part sounds fine to me, I was just verifying. I'll leave it to Hal for the pass itself.

-eric

hfinkel edited edge metadata.Apr 9 2015, 6:17 PM

I apologize for taking so long to get to this...

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
22

What does "particularly bad for vectorized code" mean? Doesn't this exclusively apply to vectorized code?

127

Why don't we just call these "vector instructions"? mentions sounds stilted to me.

168

As a general note, this pass does not handle the case where a register has a subregister index. This should be mostly irrelevant here (we probably don't generate them for these kinds of instructions), but we also don't want to mishandle them should they occur.

(maybe also give up in lookThruCopyLike should you run into a register with a subreg index too).

225

range-based for?

228

range-based for?

233
for (const MachineOperand &MO : MI->operands()) {
285

Where to we actually reject webs with non-swapping load/stores?

308

We should add a comment here explaining that this happens because some of the 128-bit VSX registers have 128-bit Altivec sub-registers.

334

Yes, that would be better. If nothing else, we should be able to use the InstrMapping facility to generate a lookup table (this is how we currently keep track of which instructions are record forms, and the mapping between record-form and the non-record-form variants).

460

add an assert that isSubregToReg() is true.

492

range-based for?

Hal, thanks for the review! Responses to your comments are inline. Please let me know if I'm off base on any of them. I'll work on a new version shortly.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
22

I suppose I had meant "autovectorized code" at the time, because the user has no way to anticipate these swaps are going to show up out of nowhere and reduce performance in a performance-enhancing optimization. However, the sentence really adds nothing, so I'll remove it. ;)

127

That's fine. "Mentions" is a more or less standard term for defs U kills U uses. The important thing is to get anything that mentions a vector register, including things that just copy into or out of them. But in our current ISA, everything that lives in the Category:Altivec and Category:VSX fits this description, so "vector instructions" is equivalent.

168

I don't believe any special handling is necessary for the current patch. If a subregister mention is connected, directly or indirectly, to a full register load or store, it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG. But these are operations that kill the optimization anyway (except SUBREG_TO_REG for full 128-bit copies).

We don't want to exclude them from the analysis, because we can handle EXTRACT_SUBREG and INSERT_SUBREG by adjusting the subregister number to account for the doubleword swap, and dealing with SUBREG_TO_REG accordingly as well. I plan to incorporate this into a future patch.

However, there may be subtleties in how subregs are handled in LLVM that I'm not familiar with, so please let me know if I'm missing something.

225

OK -- I'll rework this and the ones below.

285

The fact that these are not marked as either IsSwap or IsSwappable causes them to be rejected. See lines 537-538.

308

OK.

334

I'll look at that, thanks. There didn't seem to be a way to just create a simple flag that I could see, which would have been nice...

460

OK.

hfinkel added inline comments.Apr 11 2015, 6:54 PM
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
127

Feel free to keep mentions if you'd prefer. I don't have a strong opinion.

168

it will have to be done via an EXTRACT_SUBREG, INSERT_SUBREG, or SUBREG_TO_REG.

No, I think you've misunderstood my comment. MI register operands have have *implicit* subregister indices. These are separate from the subreg pseudo instructions you've named above.

For example, MO.getReg() may equal PPC::CR0, and the operand refers to the whole register if MO.getSubReg() == 0. But, if MO.getReg() == PPC::CR0 and MO.getSubReg() == PPC::sub_eq, then the operand is really referring to PPC::CR0EQ. Now this seems silly for physical registers, but is useful for virtual registers when you know that you want a particular subregister of a virtual register.

Should have a revised patch shortly. There are a couple of responses inline.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
127

That's ok, I'll make the switch.

168

OK. Well, I really don't think this can actually happen today for any of the various vector register classes. (If it can, I'd be interested in an example.) I will add some logic that will just kill the optimization if it ever occurs, but to my knowledge subregs of vector registers are always generated explicitly by operation (as they should be). I can understand using this for fixed bits of a status register, but it would be really bad practice to use this for vector subregs.

hfinkel added inline comments.Apr 13 2015, 12:42 PM
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
168

Right; I don't think it can happen today either, but if it does for whatever reason, I'd rather we handle it gracefully instead of crashing/miscompiling. That was my entire point ;)

wschmidt updated this revision to Diff 23695.Apr 13 2015, 1:42 PM
wschmidt edited edge metadata.

OK, I've attempted to address all comments. The biggest change is adding an InstrMapping to PPC.td to record which instructions kill the swap removal optimization, rather then just listing them in the switch statement. It is a bit ugly to use a mapping to represent a set, but it appears to be the only option for doing this via TableGen. Let me know if you have any better suggestions.

Thanks again for the thorough reviews!

Bill

hfinkel added inline comments.Apr 16 2015, 2:07 PM
lib/Target/PowerPC/PPC.td
196 ↗(On Diff #23695)

It would be nice to keep this as an intrinsic instruction property. Something that means that the instruction operates on all vector lanes independently. Then special cases that are handled in the code can be handled there without involving the instruction definitions.

lib/Target/PowerPC/PPCInstrAltivec.td
388

Why are you setting SwapFlag to 1 only to set it to 0? You can add the flag to the base instruction class if it just needs to be defined first.

One comment I understand, the other I really don't. Please advise. ;)

lib/Target/PowerPC/PPC.td
196 ↗(On Diff #23695)

OK, but how would that be implemented? I don't understand what you are proposing. At some point we have to identify either the "good" ones or the "bad" ones; how do you propose to do that without involving the instruction definitions?

lib/Target/PowerPC/PPCInstrAltivec.td
388

At first I didn't understand this, but I see what you're getting at -- actually I'm setting it to zero first, and then 1 here. Yes, that could just be 1 in the base class and removed from the instruction descriptions. Sorry, this is left over from the several iterations of trying to get this to work...

wschmidt updated this revision to Diff 23994.Apr 19 2015, 10:21 AM

Hal, thanks for the offline discussion. I've changed the filter class to "LaneSensitive" and cleaned up the SwapFlag foolishness.

nemanjai added inline comments.Apr 21 2015, 8:49 AM
lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
15

Nit: v4r32 -> v4f32?

This looks good, but we need to get rid of the

let BaseName = "BLA" in
def BLA: XForm_n<whatever...>

pattern. Here's one way to do it:

Change the definition of LaneSensitive to something like this:

class LaneSensitive {
  bit SwapFlag = 1;
  string BaseName = NAME;
}

(this is similar to what MUBUFAddr64Table does in R600/SIInstrInfo.td).

Now the problem is that LaneSensitive will only work properly when used inside of a multiclass definition. For that, you'll need some multclass wrappers for the XForm classes (we already have a few of these, see the definition of XForm_6r and friends in PPCInstrInfo.td). We can then make the multiclass something like this:

multiclass XForm_nls<whatever...> {
  def NAME : XForm_n<whatever...>, LaneSensitive;
}

and then you just need to change the definitions of the lane-sensitive instructions to something like:

defm BLA : XForm_nls<whatever...>

OK, thanks! This makes sense.

Frankly this gets pretty baroque when all we want is a way to flag an instruction with an attribute. Providing an easy way to form sets of instructions would be a good project for somebody who wants to modify TableGen someday.

Hal, I've looked this over a bit more, and I'm not very happy with this approach.

To support this, I'll have to add multiclass wrappers for all of the following:

XForm_1
XForm_8
VA1a_int_Ty3
VA1a_Int_Ty
VAForm_2
VXForm_1
VX1_Int_Ty2
VX1_Int_Ty
VX1_Int_Ty3
VXCR_Int_Ty
XX1Form
XX3Form
XX2Form_2

This is becoming a huge mangled mess, in an attempt to avoid listing these instructions in a switch statement. At this point I would like to suggest that TableGen is not set up to cleanly create sets of instructions, and revert to using the switch statement. Any objections?

Hal, I've looked this over a bit more, and I'm not very happy with this approach.

To support this, I'll have to add multiclass wrappers for all of the following:

XForm_1
XForm_8
VA1a_int_Ty3
VA1a_Int_Ty
VAForm_2
VXForm_1
VX1_Int_Ty2
VX1_Int_Ty
VX1_Int_Ty3
VXCR_Int_Ty
XX1Form
XX3Form
XX2Form_2

This is becoming a huge mangled mess, in an attempt to avoid listing these instructions in a switch statement. At this point I would like to suggest that TableGen is not set up to cleanly create sets of instructions, and revert to using the switch statement. Any objections?

Unfortunately, I don't object :( -- It is not clear that the multiclass setup is any more obvious or more maintainable than the switch statement. You can go back to the switch statement, with a FIXME. Also, add a prominent note in PPCInstrAltivec.td and in PPCInstrVSX.td reminding us that we may need to update the list if new vector instructions are added.

wschmidt updated this revision to Diff 24491.Apr 27 2015, 11:25 AM

Thanks, Hal. This revision reverts back to the switch statement, and adds the commentary you asked for. Thanks for helping me think through trying to do this in TableGen.

hfinkel accepted this revision.Apr 27 2015, 11:29 AM
hfinkel edited edge metadata.

LGTM, thanks!

This revision is now accepted and ready to land.Apr 27 2015, 11:29 AM
wschmidt accepted this revision.Apr 27 2015, 1:01 PM
wschmidt added a reviewer: wschmidt.

Committed as r235910.

wschmidt closed this revision.Apr 27 2015, 1:01 PM

Work is complete.