This is an archive of the discontinued LLVM Phabricator instance.

[X86] Remove isel patterns for MOVSS/MOVSD ISD opcodes with integer types.
ClosedPublic

Authored by craig.topper on Jul 12 2018, 10:08 PM.

Details

Summary

Ideally our ISD node types going into the isel table would have types consistent with their instruction domain. This prevents us having to duplicate patterns with different types for the same instruction.

Unfortunately, it seems our shuffle combining is currently relying on this a little remove some bitcasts. This seems to enable some switching between shufps and shufd. Hopefully there's some way we can address this in the combining.

Diff Detail

Event Timeline

craig.topper created this revision.Jul 12 2018, 10:08 PM

The domain-switching in the tests is expected, correct? From the deleted comment I gather this is intentional. Do you have any performance measurement of this change? Not that I necessarily object, I'm just curious. I'm surprised only two tests changed from a use of movsd and no tests changes from a use of movss. Makes me a bit concerned about our test coverage.

greened added inline comments.Jul 13 2018, 8:07 AM
lib/Target/X86/X86ISelLowering.cpp
30711

It would be worth replicating this comment somewhere, wherever this decision is made, I suppose. It's helpful to know when such things are intentional.

The domain switching isn't really intentiional. The code I removed from X86ISelLowering was removing some bitcasts that made this optimize better.

What used to happen is this
-Lowering selects an fp typed MOVSS/MOVSD for the first shuffle and a PSHUFD for the second shuffle. The MOVSS/SD has bitcasts int->fp before it and a bitcast fp->int after it.
-Combining sees the MOVSS/SD and the bitcasts int->fp and turns into an integer MOVSS/MOVSD ISD node. This removes the bitcasts before and after it.
-Combining sees the PSHUFD producer has integer type and does nothing to chang it.

What happens now is
-Lowering selects an fp typed MOVSS/MOVSD for the first shuffle and a PSHUFD for the second shuffle. The MOVSS/SD has bitcasts int->fp before it and a bitcast fp->int after it.
-Combining sees the MOVSS/SD and the bitcasts int->fp, but does nothing to change it.
-Combining sees the PSHUFD, looks through the fp->int bitcast and finds the FP typed MOVSS/SD. Decides to rewrite the PSHUFD to SHUFPS to remove the fp->int bitcast, but ends up creating a new fp->int bitcast after the SHUFPS. The subsequent shuffles are on v8i16 type and so can't be fixed so the fp->int bitcast after the SHUFPS stays.

I don't have perf numbers. Based on the changes I think we're just moving the domain crossing one instruction later. And we picked up a move. The move is probably more costly on older CPUs that don't have move elimination.

I'm hoping there's somethign we can do in DAG combine to fix this. Maybe recognize the bitcast+shufps from fp->int and change it to pshufd+bitcast? Need to be careful to avoid an infinite loop with teh combine that turns pshud+bitcast into bitcast+shufps

greened added a comment.EditedJul 13 2018, 2:24 PM

Ok, thanks for the explanation, that helps. It sounds like no additional domain switching issues have been introduced, so this seems fine to me as far as that goes.

The extra moves are a little concerning. Even with move elimination they still consume processor resources. I'm not sure that trade-off is worth it just
to eliminate two really simple TableGen patterns.

EDIT: I realized the above is a bit unfair. It's getting rid of a fair amount of custom lowering code and eliminating the MOVS patterns in multiple files. It also does shrink the first test by a couple of instructions but overall, more instructions (in the form of movs) are introduced than eliminated.

I'm not really objecting to this, just wishing we had some hard numbers to look at.

It's 2 patterns repeated 3 times. And as of yesterday these patterns are all qualified with "optsize || !sse41) and we have new patterns to select blend under (sse41 & !optsize). This fixes a long standing issue where we turned movss/sd into blend later in the pipeline through an accidental double call to commuteInstruction. So with that fixed its now 12 patterns that can be removed. I need to rebase this patch.

I believe we're also missing a pattern to turn (v2f64 (movsd (v2f64), (loadv2f64)) into movlpd(probably really movlps since that's 1 byte shorter and identical in behavior). That would also need to be repeated for v2i64 for SSE, AVX, and AVX512

I think the blend patterns I added last night should also have load versions for consistency unless we want to just depend on the peephole pass.

I think if we were to fix those issues and not take this patch we would have 21 extra patterns. Which I guess is still not a huge number compared to the total number of patterns

I really hope we can find a different way to fix the DAG combiner. Simon is the expert on that code and I'm hoping he has an idea.

There might be more we can do to encourage lowering/combines to domain swappable style instruuction patterns (shufd+punpck etc.) - plus I've raised https://bugs.llvm.org/show_bug.cgi?id=38157 to see what we can do to do more exotic domain swaps (SHUFPD/SHUFPS <-> PSHUFD etc.).

Glad to hear the blend commute has been fixed!

I believe we're also missing a pattern to turn (v2f64 (movsd (v2f64), (loadv2f64)) into movlpd(probably really movlps since that's 1 byte shorter and identical in behavior).

But not necessarily identical in performance. Some microarchitectures don't like ps/pd domain shifting like that.

Which microarchitecture cares about switching PD/PS? To my knowledge, no Intel architecture cares. Do any of the AMD architectures care?

Which microarchitecture cares about switching PD/PS? To my knowledge, no Intel architecture cares. Do any of the AMD architectures care?

It tends to be only the 'weird mixture' PS/PD domain shifts that cause a stall: VADDPS then VMULPD, that kind of thing - shuffles and bitops tend to be more forgiving (and more easy to fix.)

Which microarchitecture cares about switching PD/PS? To my knowledge, no Intel architecture cares. Do any of the AMD architectures care?

It tends to be only the 'weird mixture' PS/PD domain shifts that cause a stall: VADDPS then VMULPD, that kind of thing - shuffles and bitops tend to be more forgiving (and more easy to fix.)

http://www.agner.org/optimize/microarchitecture.pdf, "21 AMD Bobcat and Jaguar pipeline", page 222:

21.9 Data delay between differently typed instructions
...
There is a penalty of 40 clock cycles when the output of a floating point calculation is input
to a floating point calculation with a different precision, for example if the output of a double
precision floating point addition is input to a single precision addition. This has hardly any
practical significance since such a sequence is most likely to be a programming error, but it
indicates that the processor stores extra information about floating point numbers beyond
the 128 bits in an XMM register.

Which microarchitecture cares about switching PD/PS? To my knowledge, no Intel architecture cares. Do any of the AMD architectures care?

It tends to be only the 'weird mixture' PS/PD domain shifts that cause a stall: VADDPS then VMULPD, that kind of thing - shuffles and bitops tend to be more forgiving (and more easy to fix.)

The AMD 17h Optimization Guide has this to say:

"Try to use consistent data types for instructions operating on the same data. For example, use
VANDPS, VMAXPS, and so on when consuming the output of MULPS."

The inclusion of VANDPS in that list makes me think all such domain crossings may incur a penalty.
I have not seen an explicit list anywhere of what is and is not bad. A VADDPD feeding a VANDPS
is probably not a programming error but it would incur a penalty.

Which microarchitecture cares about switching PD/PS? To my knowledge, no Intel architecture cares. Do any of the AMD architectures care?

It tends to be only the 'weird mixture' PS/PD domain shifts that cause a stall: VADDPS then VMULPD, that kind of thing - shuffles and bitops tend to be more forgiving (and more easy to fix.)

The AMD 17h Optimization Guide has this to say:

"Try to use consistent data types for instructions operating on the same data. For example, use
VANDPS, VMAXPS, and so on when consuming the output of MULPS."

The inclusion of VANDPS in that list makes me think all such domain crossings may incur a penalty.
I have not seen an explicit list anywhere of what is and is not bad. A VADDPD feeding a VANDPS
is probably not a programming error but it would incur a penalty.

We do our hardest to keep to the same domain already - like I said on PR38157 bitops and (some) unpacks/shuffles are already handled but we can always do more (patches welcome!).

test/CodeGen/X86/oddshuffles.ll
1317

TBH, this would be better if we stayed purely in the float domain.

Rebase again

RKSimon accepted this revision.Jul 20 2018, 3:24 AM

LGTM - this isn't introducing any more domain switches than what is already present., although as you mentioned there is still room for improvement.

This revision is now accepted and ready to land.Jul 20 2018, 3:24 AM

Ditto. Just wanted to clarify that I think this is fine. Thanks for a good discussion!

This revision was automatically updated to reflect the committed changes.