This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Optimize the truncation of vector comparison results with PACKSS
ClosedPublic

Authored by RKSimon on Jul 26 2016, 9:03 AM.

Details

Summary

We currently default to using either generic shuffles or MASK+PACKUS/PACKSS to truncate all integer vectors. For vector comparisons, we know that the result will be either all or zero bits in every element, which can be efficiently truncated by directly using PACKSS to repeatedly halve the size of each element.

Due to the limited input values (-1 or 0) we don't need to account for vector element size, so for simplicity we just use the PACKSS(vXi16,vXi16) implementation in all cases. Additionally for AVX2 PACKSS of 256bit data we must perform a PERMQ shuffle to reorder the data into the correct order. I did investigate performing a single shuffle after all the PACKSS calls but the need to cross 128bit lanes makes this difficult to achieve efficiently.

We avoid performing this on AVX512 as it should have better alternative truncation instructions - is this right?

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 65535.Jul 26 2016, 9:03 AM
RKSimon retitled this revision from to [X86][SSE] Optimize the truncation of vector comparison results with PACKSS.
RKSimon updated this object.
RKSimon added reviewers: mkuper, delena, ab, spatel, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
mkuper edited edge metadata.Jul 27 2016, 3:12 PM

Thanks Simon, nice trick!

I don't know about AVX512 - hopefully Elena can answer, but, in any case, I don't see a reason not to enable this for AVX2 and below for now.

lib/Target/X86/X86ISelLowering.cpp
14150 ↗(On Diff #65535)

Is this valid? If not, maybe assert instead?
For IR-level truncs, we explicitly disallow equal types, and I'd be slightly surprised if we allowed them for the ISD.

14185 ↗(On Diff #65535)

Sorry, I'm still confused about this.

We just truncated, say, a v8i64 into a v8i32. The high 4 values should have ended up in the high 4 lanes of the ymm, and the low 4 values should have ended up in the 4 low lanes. Why isn't this what we want?

14200 ↗(On Diff #65535)

I may have missed it, but I didn't see this is checked explicitly anywhere in the callers.
What happens if you hit the combine version of this, pre-legalization, truncating from v32i64 to v32i8, for instance? The destination is a 256-bit type, and the source is simple.

(Also, greater -> smaller)

14280 ↗(On Diff #65535)

How about VPCOM and VPCOMU?

29861 ↗(On Diff #65535)

Perhaps add an assert that getBooleanContents() is ZeroOrNegativeOneBooleanContent?
I don't expect this to change anytime soon, but... :-)

delena added inline comments.Jul 28 2016, 2:39 AM
lib/Target/X86/X86ISelLowering.cpp
29869 ↗(On Diff #65535)

It should be another function that says that all bits in each element all-ones or all-zeroes. It is not only compare. SIGN_EXTEND_IN_REG has the same effect, for example.

RKSimon added inline comments.Jul 28 2016, 7:28 AM
lib/Target/X86/X86ISelLowering.cpp
14150 ↗(On Diff #65535)

I used this to make the recursion calls simpler (otherwise they all need equivalent tests). I can add a comment to explain this.

14185 ↗(On Diff #65535)

We're truncating a 8i64 which will be 2 YMMs of 4i64 each:

YMM0 = ((i64[0], i64[1]), (i64[2], i64[3]))
YMM1 = ((i64[4], i64[5]), (i64[6], i64[7]))

PACKSS acts on each 128-bit vector lane separately:

PACKSS(((i64[0], i64[1]), (i64[2], i64[3])), ((i64[4], i64[5]), (i64[6], i64[7])))
-->
(i32[0], i32[1], i32[4], i32[5]),(i32[2], i32[3], i32[6], i32[7])
--> PERMQ
(i32[0], i32[1], i32[2], i32[3]),(i32[4], i32[5], i32[6], i32[7])

14200 ↗(On Diff #65535)

Near the start of the function we check that the destination value type is a 128-bit vector or more, and combineVectorCompareTruncation sanitizes the inputs. I'll beef up the tests/asserts at the start of truncateVectorCompareWithPACKSS just to be safe though.

14280 ↗(On Diff #65535)

Yes I can add those, I just hadn't made any test cases for them yet. I'd prefer to do all that in a post-commit.

29869 ↗(On Diff #65535)

Yes, although it quickly gets more complicated with regard to acceptable PACKSS input sizes - which is why I simplified a lot of this code just to call PACKSS with i16 inputs (which only compares can easily conform to). I'll add a TODO comment.

RKSimon updated this revision to Diff 65931.Jul 28 2016, 7:36 AM
RKSimon edited edge metadata.

Updated based on feedback

mkuper accepted this revision.Jul 28 2016, 10:30 AM
mkuper edited edge metadata.

LGTM

lib/Target/X86/X86ISelLowering.cpp
14150 ↗(On Diff #65931)

Ah, ok, this makes sense.

14185 ↗(On Diff #65931)

Oh, I just completely missed that in the definition of the VEX.256-encoded versions.

14200 ↗(On Diff #65931)

Never mind, I completely misread the code (hence the greater -> smaller comment).
Forget I said anything.

14283 ↗(On Diff #65931)

Sounds good.

This revision is now accepted and ready to land.Jul 28 2016, 10:30 AM
This revision was automatically updated to reflect the committed changes.