This is an archive of the discontinued LLVM Phabricator instance.

[x86] transform vector inc/dec to use -1 constant (PR33483)
ClosedPublic

Authored by spatel on Jun 18 2017, 3:53 PM.

Details

Summary

Convert vector increment or decrement to sub/add with an all-ones constant:

add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>

The all-ones vector constant can be materialized using a pcmpeq instruction that is commonly recognized as an idiom (has no register dependency and/or has no latency), so that's better than loading a splat 1 constant.

The SSE and AVX1/2 diffs look like what I expected - we prefer 'pcmpeq' even over a folded load. AVX512 uses 'vpternlogd' for 512-bit vectors. Is that optimal?

This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Jun 18 2017, 3:53 PM
craig.topper edited edge metadata.Jun 18 2017, 4:05 PM

vpternlog does not have any idiom recognition.

For pcmpeq I think intel only avoids the dependency but still executes it. What does AMD do?

aymanmus edited edge metadata.Jun 19 2017, 12:35 AM

Taken from Intel's optimization manual:

There is another dependency breaking idiom - the "ones idiom".
• CMPEQ XMM1, XMM1; "ones idiom" set all elements to all "ones"
In this case, the micro-op must execute, however, since it is known that regardless of the input data the
output data is always "all ones" the micro-op dependency upon its sources does not exist as with the zero
idiom and it can execute as soon as it finds a free execution port.

No mentioning for vpternlog in the manual, but I think it would behave the same.

craig.topper added a comment.EditedJun 19 2017, 12:41 AM

I've looked into this before and talked to one of our architects. There is no instruction to generate 512-bit all ones that gets any special treatment. I've wondered before if we should use a zero idiom first to break the dependency, but that would add code bloat.

RKSimon edited edge metadata.Jun 19 2017, 4:04 AM

vpternlog does not have any idiom recognition.

For pcmpeq I think intel only avoids the dependency but still executes it. What does AMD do?

Confirmed with Agner's docs - Jaguar/Bulldozer/Ryzen all avoid input register dependencies for PCMPEQ/PCMPGT/PSUB/XOR/ANDN simd instructions when the two inputs are the same. They still create+execute uops (unlike move elimination) - but as integer ops can often go down most simd pipes and are low latency they are very unlikely to cause a performance regression.

IMO we are much better off with this approach than the constant load, although the higher register pressure is a minor concern.

lib/Target/X86/X86ISelLowering.cpp
34997 ↗(On Diff #102980)

!SplatVal->isOneValue()

test/CodeGen/X86/widen_cast-2.ll
11 ↗(On Diff #102980)

2 annoying (but low priority) things here - (1) these constants weren't merged and (2) different [] <> brackets for the comments......

vpternlog does not have any idiom recognition.

For pcmpeq I think intel only avoids the dependency but still executes it. What does AMD do?

Confirmed with Agner's docs - Jaguar/Bulldozer/Ryzen all avoid input register dependencies for PCMPEQ/PCMPGT/PSUB/XOR/ANDN simd instructions when the two inputs are the same. They still create+execute uops (unlike move elimination) - but as integer ops can often go down most simd pipes and are low latency they are very unlikely to cause a performance regression.

Ah, right - I was confusing move elimination with input reg elimination. In any case, I think there are a few reasons to favor this form in the DAG:

  1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables, so in theory, this could be better for perf, but...
  2. That seems unlikely to affect any OOO implementation, and I can't measure any real perf difference from this transform on Haswell or Jaguar (I'll attach a test program next), but...
  3. It doesn't look like it from the diffs, but this is an overall size win because we eliminate 16-64 constant bytes in the case of a vector load. If we're broadcasting a scalar load (which might itself be a bug), then we're replacing a scalar constant load + broadcast with a single cheap op, so that should always be smaller/better too.
  4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1 and psub x, -1, so we should use that form for +1 too because we can. If there's some reason to favor a constant load on some CPU, let's make the reverse transform for all of these cases (either here in the DAG or in a later machine pass).

IMO we are much better off with this approach than the constant load, although the higher register pressure is a minor concern.

Here's a test loop:

And asm (Linux x86 flavor):

compile with:
$ clang -O1 inc_test.c inc.s

I used this to benchmark this transform on Haswell and Jaguar, but I couldn't measure any difference. I also had expanded versions of the asm that repeated the padd/psub up to 128 times...but still no perf diff on either chip.

spatel updated this revision to Diff 103055.Jun 19 2017, 8:36 AM

Patch updated (no functional changes):

  1. Use APInt::isOneValue().
  2. Change comment about potential zero-latency implementation of pcmpeq - although that does seem possible for some uarch to implement if it wanted to. :)

Does anyone have any concerns about this? Otherwise I'd like to accept it.

The increased register pressure is offset by the constant being trivial to re-materialize. And avoiding loads/constant pool entries is something I'm always keen to aim for.

zvi accepted this revision.Jun 26 2017, 6:37 AM

LGTM

This revision is now accepted and ready to land.Jun 26 2017, 6:37 AM
This revision was automatically updated to reflect the committed changes.