This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Allow SplitOpsAndApply to split to diminishing vector sizes
AbandonedPublic

Authored by RKSimon on Jan 9 2019, 11:05 AM.

Details

Summary

This patch relaxes the SplitOpsAndApply requirements for 128/256/512-bit sized vectors for SSE2/AVX2/AVX512BW so that it splits illegal types by a variety vector sizes down to 128-bits in size and I've updated the AVG combine to demonstrate this with its existing v48i8 + v40i16 test cases.

It splits the vectors into 512, 256 and finally 128-bit vector results and concatenates them back together (using the minimum necessary subvector size).

If the result size is less than 128-bits then it gets passed through in the old way as a single op, but otherwise the result size must be modulo 128-bits.

I've had to put in a limit that the operands must be the same vector width (not type) as the result - otherwise the operand subvector splitting becomes a lot more complex.

Diff Detail

Event Timeline

RKSimon created this revision.Jan 9 2019, 11:05 AM

PMULDQ/PMULUDQ is interacting poorly with the fact that we convert zext/sext to zext_vector_inreg/sext_vector_inreg before type legalization. So we split the PMULDQ/PMULUDQ when we create them. Then SimplfiyDemandedbits can't optimize the zext/sext to aext because the splitting messed up the use count. Then the zext/sext becomes a split zext_invec/sext_invec, but SimplifyDemandedBit won't turn those into aext_invec. So its really a gross ordering problem that probably goes away with -x86-experimental-vector-widening-legalization since we won't eagerly create zext_invec/sext_invec ops.

For this AVG case, I've considered trying to see if we could emit a v48i8 pavg and let type legalization custom widen it to v64i8 using undef and then split it. I think that requires us to use a (v64i8 (insert_subvector undef, (v48i8 X))) to widen the inputs in custom legalization. Then generic legalization would need support for legalizing the v64i8 insert_subvector with v48i8 input. Once its widened and split we should have one v64i8 pavg, or two v32i8 pavg, or four v16i8 pavg depending on the target.

Another idea is that we could teach custom type legalization to split the v48i8 avg as it widens into v16i8 undef, v16i8 avg, v32i8 avg for avx2.

craig.topper added inline comments.Jan 9 2019, 7:50 PM
lib/Target/X86/X86ISelLowering.cpp
37004 ↗(On Diff #180883)

This crashes on v24i8. Probably need InVT.getSizeInBits() % 128 == 0 instead of NumElems % 8 == 0

RKSimon planned changes to this revision.Jan 10 2019, 2:34 AM
RKSimon updated this revision to Diff 251888.Mar 22 2020, 7:46 AM
RKSimon edited the summary of this revision. (Show Details)

rebase an old patch - looking a lot better now that type widening is default - I've added the crashing avg test cases.

Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2020, 7:46 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon updated this revision to Diff 261648.May 2 2020, 7:39 AM
RKSimon retitled this revision from [X86][SSE] Allow SplitOpsAndApply to split to lowest common vector size to [X86][SSE] Allow SplitOpsAndApply to split to diminishing vector sizes.
RKSimon edited the summary of this revision. (Show Details)
RKSimon abandoned this revision.Sep 15 2020, 2:17 AM

Solved this with rGfc446935d724 instead