This is an archive of the discontinued LLVM Phabricator instance.

[x86] Begin a significant overhaul of how vector lowering is done in the x86 backend.
ClosedPublic

Authored by chandlerc on Jun 19 2014, 6:21 PM.

Details

Summary

This sketches out a new code path for vector lowering, hidden behind an
off-by-default flag while it is under development. The fundamental idea
behind the new code path is to aggressively break down the problem space
in ways that ease selecting the odd set of instructions available on
x86, and carefully avoid scalarizing code even when forced to use older
ISAs. Notably, this starts off restricting itself to SSE2 and implements
the complete vector shuffle and blend space for 128-bit vectors in SSE2
without scalarizing. The plan is to layer on top of this ISA extensions
where we can bail out of the complex SSE2 lowering and opt for
a cheaper, specialized instruction (or set of instructions). It also
needs to be generalized to AVX and AVX512 vector widths.

Currently, this does a decent but not perfect job for SSE2. There are
some specific shortcomings that I plan to address:

  • We need a peephole combine to fold together shuffles where possible. There are cases where a previous shuffle could be modified slightly to arrange for elements to be in the correct position and a later shuffle eliminated. Doing this eagerly added quite a bit of complexity, and so my plan is to combine away these redundancies afterward.
  • There are a lot more clever ways to use unpck and pack that need to be added. This is essential for real world shuffles as it turns out...

Once SSE2 is polished a bit I should be able to get interesting numbers
on performance improvements on benchmarks conducive to vectorization.
All of this will be off by default until it is functionally equivalent
of course.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc updated this revision to Diff 10669.Jun 19 2014, 6:21 PM
chandlerc retitled this revision from to [x86] Begin a significant overhaul of how vector lowering is done in the x86 backend..
chandlerc updated this object.
chandlerc edited the test plan for this revision. (Show Details)
chandlerc added reviewers: grosbach, filcab.
chandlerc set the repository for this revision to rL LLVM.
chandlerc added a subscriber: Unknown Object (MLST).
grosbach edited edge metadata.Jun 20 2014, 12:55 PM

This is fantastic. Some comments inline. Mostly of the shed color choice variety that you can ignore freely if you prefer.

I really like the overall direction of this patch. It takes a very systematic approach to breaking down the problem.

In the commit message, you mention needing a later peephole to combine some combinations of instructions. Would it make sense to call some of those sequences out explicitly in the code comments as well? Since this code is going to be relying on other things in order to get performant results, I'd like to see that called out so that anyone later reading this code can see where else they should look to fully understand how things are happening.

lib/Target/X86/X86ISelLowering.cpp
7219 ↗(On Diff #10669)

s/test/tests/

7254 ↗(On Diff #10669)

Same thing as above. Function comment describing the approach will go a long way.

7579 ↗(On Diff #10669)

Only need to "try to collapse" once. ;)

grosbach added inline comments.Jun 20 2014, 12:55 PM
lib/Target/X86/X86ISelLowering.cpp
6761 ↗(On Diff #10669)

Style nit. I prefer not to refer to things as "new" in code comments, as the comments tend to live on that way long after the new code becomes the old code.

6769 ↗(On Diff #10669)

You comment on this below, but IMO it's worth calling out explicitly here that UNDEF masks are required to be implemented as preserving the input lane and thus are considered a NOP.

This and isSingleInputShuffleMask() should probably be static methods of ShuffleVectorSDNode instead. That would be similar to how isSplatMask() is implemented (side note, that should probably switch to use an ArrayRef instead of an EVT). Then add simple non-static wrappers isNoopShuffle() and isSingleInputShuff() to the class which call down to the static versions using the instance's mask.

6781 ↗(On Diff #10669)

"NB?"

6803 ↗(On Diff #10669)

Likewise, it may be useful to expose a direct method on ShuffleVectorSDNode for getting the lane count.

int getNumElements() const { return getValueType(0).getVectorNumElements(); }

6805 ↗(On Diff #10669)

Example of the above. This would be a bit clearer if written as SVOp->isSingleInput().

Ditto on similar constructs later in the patch.

6807 ↗(On Diff #10669)

does not parse: "by passing sharing the"?

6813 ↗(On Diff #10669)

Comment strings on the asserts.

6831 ↗(On Diff #10669)

s/But we have to map the mask as it/We have to map the mask, as it/

6859 ↗(On Diff #10669)

space after "{" and before "}"

6864 ↗(On Diff #10669)

Why not use the helper function you defined above to check for the shuffle being single input?

I guess because you want to use the element count directly in the following code?

6873 ↗(On Diff #10669)

std::find_if instead?

6874 ↗(On Diff #10669)

This is a neat trick for getting the other lane of the high half, but it isn't immediately obvious that's what it's doing. Either a ?: to make the selection explicit or a comment explaining the bitwise trick would be good.

6886 ↗(On Diff #10669)

Big ask, but I'd love some ASCII art type examples of the transforms, in particular for those like this one that need more than one instruction. Not the complete set of possibilities; just a single example that's illustrative showing arrows for which lanes get moved where by which insn.

6917 ↗(On Diff #10669)

Spacing for "{" "}".

6965 ↗(On Diff #10669)

This is a really tricky part of the patch. Can you add a comment describing the approach and algorithm to the function? Without a lot of very explicit exposition, I'm concerned about future "improvements" to this code path inadvertently breaking things.

6996 ↗(On Diff #10669)

An example in the comment would really help here.

majnemer added inline comments.
lib/Target/X86/X86ISelLowering.cpp
6781 ↗(On Diff #10669)

It is probably an abbreviation for nota bene

Bike-sheddy comments addressed below. ;] I seem to like a somewhat different shade than you. Lemme know if its too different.

To the combine stuff, I'm not sure where would be a great place to put it. The point of the combines being needed is that there *isn't* one place where we make a potentially bad decision, but that two distant decisions could (in theory) collude to save an operation. Hard to comment that. The test cases should cover it very well though...

Will attach an updated patch momentarily...

lib/Target/X86/X86ISelLowering.cpp
6761 ↗(On Diff #10669)

OK, "experimental" maybe?

6769 ↗(On Diff #10669)

Regarding the undef mask contract below, it doesn't impact this at all. An undef lane, by its nature, is a NOP?

I really dislike static methods for these at the moment. These have nothing to do with an SDNode and everything to do with the specific ways in which masks are used in this lowering code. I'm hesitant to over-generalize here.

Also, static methods are horrible to actually use.

If there is useful common logic to refactor into common locations, I'd rather do it if and when it is actually needed to implement common functionality. As it stands, it wouldn't save anything?

6781 ↗(On Diff #10669)

Correct. I don't care particularly about comment style; I'll probably clean this up along with adding the comment above.

6803 ↗(On Diff #10669)

I don't know that this makes the code any better. By directly asserting properties of the Mask, to me, it makes the following code that assumes the properties on the mask much more clear.

6805 ↗(On Diff #10669)

I think this method might make sense on SVOp as it really isn't used anywhere on raw / hypothetical masks.

But at the same time, it was actually rather convenient to write all of this code in terms of the mask. I don't find the ShuffleVectorSDNode to add any utility. :: shrug ::

6807 ↗(On Diff #10669)

English is hard. Will reword.

6813 ↗(On Diff #10669)

When there is a generally useful bit of information that is being asserted, I try to do so. However, these asserts are just documenting the mathematical invariants established by the above condition. There isn't any useful comment to give here IMO.

6831 ↗(On Diff #10669)

Done.

6859 ↗(On Diff #10669)

See the coding guidelines and clang-format -- this is how it formats braces specifically to interoperate cleanly with braced init lists in C++11.

6864 ↗(On Diff #10669)

Exactly. When I have to count anyways, I may as well re-use that.

6873 ↗(On Diff #10669)

Sure. The first time I wrote this I was annoyed by re-computing the index from find_if and switched it to a while loop, but maybe the loop is too high cost.

6874 ↗(On Diff #10669)

I have a distaste for comments which merely explain what the standard says the C++ in question does.

I tried to help this by naming the variable after the "adjacent" index. Is there a better variable name that would give the context? If not I can do something else...

6886 ↗(On Diff #10669)

I'm hesitant to do this. Every time I drew an ascii art picture (and I drew several) or something on a white board, it actually messed me up because I started to silently assume that the one example I had was actually representative of the other ways to hit the same scenario.

Unless this is breaking your ability to reason about the code, I'm going to hold off investing a ton of time in crafting these...

6917 ↗(On Diff #10669)

Same comment, this is the clang-format enforced style and matches the coding standards.

I'll give a citation to help: http://llvm.org/docs/CodingStandards.html#braced-initializer-lists

6965 ↗(On Diff #10669)

Yea, I didn't have that comment because there were many tens of different approaches tried before one worked. I'm also still hoping to simplify some parts of it.

Also, all the details are already documented in each branch of the hybrid approach.

6996 ↗(On Diff #10669)

Hmm, ok. Not sure it'll actually help, but done. =] This trick took me weeks to figure out.

7219 ↗(On Diff #10669)

Don.

7254 ↗(On Diff #10669)

I've added them to most.

Note that these comments will only get more confusing as time goes on. =/ I'm not sure of a good way to document these. Each new ISA extension or trick I add will require updates. Hopefully I won't miss any. =]

7579 ↗(On Diff #10669)

Done.

chandlerc updated this revision to Diff 10716.Jun 20 2014, 6:55 PM
chandlerc edited edge metadata.

Updates from review thus far...

filcab edited edge metadata.Jun 20 2014, 7:18 PM

Many thanks for this patch! Making the shuffle code saner and tidier is awesome!

I'm posting a few comments for v2* and v4* for now. Will check the other ones later. Most of these are nitpicks, but the ones on line 6816 and 6836 look like bugs (either on the code or in my brain).

I also don't like the phrasing “We rely heavily on "undef" masks preserving the input lane.”
From what I've seen, it seems that we will treat undefs as if the corresponding element for the source vector was specified. But the code doesn't assume that anywhere (maybe it does in places further down). My first reading parsed it as “we need to be sure that undefs are treated this way”.

I would also echo most of grosbach's comments.

lib/Target/X86/X86ISelLowering.cpp
6782 ↗(On Diff #10669)

This is similar to a (DAG) shuffle mask, but not quite (it only goes up (non-inclusive) to 4, not 8, which you would have on shuffles for two v4* vectors). Could you add some word (Shufp?) to the name, to make it easier to spot the difference?

6803 ↗(On Diff #10669)

Shouldn't we test, at a higher level, that the shuffle's type is v2f64? I don't think we can get a malformed SDValue that's a v2f64 with more (or fewer) than 2 elements in the mask. The same for the other functions.

6808 ↗(On Diff #10669)

Since the possible values are UNDEF (-1), 0, and 1, wouldn't simply Mask[x] == 1u be easier to understand (This would invert how we handle UNDEFs in this case)

6816 ↗(On Diff #10669)

I might be missing something. How does this handle an <i32 2, i32 0> mask? Wouldn't it generate

{ V1[0], V2[0] }

instead of

{ V2[0], V1[0] }

?

6836 ↗(On Diff #10669)

What about bitcasting the PSHUFD output back to v2i64?

6864 ↗(On Diff #10669)

Since we use NumV2Elements a bunch of times, I'd be ok with not going through the mask two times when we can just go one time.

6955 ↗(On Diff #10669)

Nitpick: I would change the phrasing. If we only use one shufps, shufps is smaller (2 opcode bytes vs. 3 opcode bytes). If we need two shufps, then it's better to just use shufpd. The comment, as it is, seems, to me, like it suggests we still need to check if there's any case where this is helpful.

My overarching concern is to make sure the code is understandable a year or three or five from now. There's no getting around the necessary complexity of the code due to the ISAs we have to work with today, but as new ISA extensions come out, it's only going to get more so. As such, it's immensely important that the code be understandable to a level that guides someone doing that to work where and how to insert those new tricks. Thus my hounding more about comments than the actual details of the code. The code itself looks great. I want to make sure it stays that way after barbarians like me get their hands on it in the coming years. :)

Looks like one of my comments from before got eaten. Trying again, but in case it doesn't go through there this time either, you're missing a "#include <numeric>". Otherwise you don't have a declaration for std::accumulate.

lib/Target/X86/X86ISelLowering.cpp
6769 ↗(On Diff #10669)

I'm more concerned that it's inconsistent with what we already have (isSplatMask()). I don't care much which idiom we use (static method vs. static function), but do think we should be consistent.

I disagree that they have nothing to do with the SDNode. They're for asking questions about a mask either of or for an SDNode. Now, strictly speaking, the mask is a distinct thing, it's true, and we could have a dedicated class or namespace for that, I suppose.

Static methods are no harder to use than the static functions you already have. It's just a scope.

On the other hand, do other targets actually want answers to the questions these helpers ask? If not, maybe you're right that they're more closely related to the X86 lowering code than the generic stuff. So I suggest we get an answer pragmatically. Can you have a look at the other targets that do shuffles and see if they have something similar? If so, combine the use cases with something generic and if not, leave it as you've written it currently. Sound reasonable?

6781 ↗(On Diff #10669)

Cool. Just not used to latin in source code. ;)

6813 ↗(On Diff #10669)

I'd still prefer there be something. As a matter of consistent style, if nothing else. Just something that which tells me what other properties I should be looking at to see what happened. For example, distinguishing between a mask that's invalid on its face for the vector type vs. one that's valid but shouldn't have gotten this far because the code above was supposed to handle it.

6859 ↗(On Diff #10669)

clang-format agrees with me. That's why I made the comment, actually.

6874 ↗(On Diff #10669)

It's not just explaining what the standard says, though. It explains why. I spend a couple minutes looking at that code wondering what you were up to before I realized and I had to work through the bit patterns in my head to do it. Maybe I'm just being dense and it should have been obvious, but it wasn't. I really think a comment explaining the intent (not the language semantics) is very useful here.

The abbreviation certainly didn't help. I thought it meant "Adjusted" not "Adjacent".

6886 ↗(On Diff #10669)

I'm drawing lots of stuff on paper to be able to reason about the code. Some comments to help guide that thought process would help.

From another perspective, comments breaking down what the code is intended to cover help review to make sure it matches what it actually does cover.

I understand that a mathematically sound exposition of all the cases and how they transpose into one another is likely not practical, but I still think a few example would be useful.

Doesn't have to be ASCII art. I just like pictures. Simple examples, even expressed in terms of __builtin_shufflevector() .c code perhaps, would go a long way.

6917 ↗(On Diff #10669)

Then clang-format should be fixed. That's what I used to verify I was correct. I just tried again and got the same result.

6965 ↗(On Diff #10669)

Maybe something about "there be dragons here, and not the friendly kind" then? If it was that tricky to get right, I'm more concerned about someone helpfully coming along and "fixing" it later without understanding all the cases. Tests go a long way there, of course, but I don't like relying on that exclusively.

7000 ↗(On Diff #10669)

llvm/lib/Target/X86/X86ISelLowering.cpp:7000:24: error: no member named 'accumulate' in namespace 'std'

Need to add "#include <numeric>" I think.

OK, I think I've responded to all the comments reasonably at this point. Will upload an updated patch momentarily. Please let me know if this LG to submit at this point! =D

lib/Target/X86/X86ISelLowering.cpp
6769 ↗(On Diff #10669)

Chatted about this on IRC some and I've add lots of comments which will even explain the specific peculiarities of these routines.

The essence is that these seem pretyt special-purpose today. They work on raw masks, and those masks sometimes have fairly specific meaning to this cod ethat might not generalize well. I Actually tried to move the single input test and it is just wrong in general; it only works here with this code path.

Hopefully with the documentation this helps clarify what's going on.

6782 ↗(On Diff #10669)

Do you have a concrete suggestion here? I don't have any ideas other than V4. We use this for SHUFP, PSHUF, PSHUFH, and PSHUFL instructions. =/

The way I think about it is that this is just the way that x86 always represents a 4-lane shuffle, regardless of where the 4 lanes come from.

6803 ↗(On Diff #10669)

Yea, testing this stuff is a great idea. I've added lots of asserts to this effect. Also fixed a bunch of places where we were playing fast and loose with types to instead go back through the full vector shuffle code to ensure type correctness throughout.

6808 ↗(On Diff #10669)

Sure.

6813 ↗(On Diff #10669)

I've added strings to every assert. I've no idea of many of these really add value, but hey. =D

6816 ↗(On Diff #10669)

How? There is a test case that covers this. ;]

I think the key is that we canonicalize the mask first, so we know that by the time we gete here there is exactly one entry from each input, and when we have equal numbers of inptus, we canonicalize s.t. that first input provides more low-half inputs than the second input.

6836 ↗(On Diff #10669)

Done. I think the DAG stuff was fixing this for me. This code is definitely covered by tests. =/

6859 ↗(On Diff #10669)

To record here what took place on IRC -- Jim's build of clang-format was stale for truly mysterious reasons... no idea why...

6874 ↗(On Diff #10669)

Comments added.

6886 ↗(On Diff #10669)

Ok.

However, this comment was just bad. Bordering on terrible. I've rewritten it to be less confusing. =]

I'm not sure after that exactly where diagrams or examples would most help? Could you point at the specific places that need more illustration in light of the added comments from the last couple of iterations?

6917 ↗(On Diff #10669)

As noted above, we sorted this out...

6955 ↗(On Diff #10669)

Not sure what you're really saying. First, we're talking about shufps vs. pshufd. Second, we would never need more or fewer of shufps vs. pshufd, they are both equally powerful when using a single input.

However, yes, I've now looked, and there is no reason for this. I'll nuke the FIXME.

If anything, we might want a combine somewhere *very* late that can see when we don't need the free copy pshufd provides and we don't pay a domain crossing penalty, and so it is an easy code size win to rewrite it as a shufps.

6965 ↗(On Diff #10669)

I've tried to do a decent job of commenting the strategy with appropriate warnings and cross references. I think it ended up being simpler to document than I feared, so hopefully its good now.

7000 ↗(On Diff #10669)

Done.

chandlerc updated this revision to Diff 10870.Jun 25 2014, 6:00 PM
chandlerc edited edge metadata.

New patch with significant updates based on review comments by Jim and Filipe.

filcab added inline comments.Jun 25 2014, 6:04 PM
lib/Target/X86/X86ISelLowering.cpp
7005 ↗(On Diff #10716)

This seems wrong. How can you have [{0,1}{2,7}{4,5}{6,3}] -> [{0,1}{4,7}{2,3}{6,5}]? How did the {2,7} get mixed with the {4,5}? Wouldn't pshufd just shuffle double words around but not break them?

Shouldn't it become [{0,1}{4,5}{2,7}{6,3}]?

7013 ↗(On Diff #10716)

Commenting how that subtraction gets us the index that's not used would be nice.

7073 ↗(On Diff #10716)

Wouldn't InPlaceInputs[0] ^ 1 do the same thing? And it would be the same trick you used in the previous function.

7401 ↗(On Diff #10716)

NumV1Inputs == 0 && NumV2Inputs == 4 => assert in lowerV8I16BasicBlendVectorShuffle:7293

The previous if should have an else if (NumV1Inputs == 0) ...SingleInput...(...V2...)

If this condition doesn't arise because it's been handled earlier, I'd like to have an assert here.

chandlerc added inline comments.Jun 25 2014, 6:07 PM
lib/Target/X86/X86ISelLowering.cpp
7005 ↗(On Diff #10716)

Yea, this is why examples aren't clear.

The PSHUFD applies to the *input* to this shuffle. The result here is the new mask needed after the pshufd to produce the same final output.

How can i make this more clear?

7401 ↗(On Diff #10716)

Correct, we can't hit this because if there is only a single input, it is always V1. Adding assert here.

chandlerc added inline comments.Jun 25 2014, 6:20 PM
lib/Target/X86/X86ISelLowering.cpp
7005 ↗(On Diff #10716)

Think I have a better way to illustrate it...

7013 ↗(On Diff #10716)

Yea, this is way down the clever rabbit hole.

7073 ↗(On Diff #10716)

Good idea. ;] I should be way more consistent with these tricks.

chandlerc updated this revision to Diff 10873.Jun 25 2014, 6:21 PM

Fix up comments and add another assert based on Filipe's comments.

LGTM unless Filipe has any additional requests.

lib/Target/X86/X86ISelLowering.cpp
7321 ↗(On Diff #10873)

Different from what?

I don't have a better suggestion for the message, unfortunately.

filcab accepted this revision.Jun 26 2014, 3:08 PM
filcab edited edge metadata.

LGTM, too. I'll still have another look at some comments, but if there's any minor issue with those, it can be resolved on IRC.

lib/Target/X86/X86ISelLowering.cpp
6782 ↗(On Diff #10669)

Maybe V4X86Shuffle? At least it shows it might not be a “regular” DAG shuffle mask.

This revision is now accepted and ready to land.Jun 26 2014, 3:08 PM
chandlerc closed this revision.Jun 27 2014, 4:32 AM
chandlerc updated this revision to Diff 10922.

Closed by commit rL211888 (authored by @chandlerc).