This is an archive of the discontinued LLVM Phabricator instance.

[x86] split 256-bit vector selects if operands are vector concats
ClosedPublic

Authored by spatel on Jun 14 2019, 3:05 PM.

Details

Summary

This is similar logic/motivation to the select splitting in D62969.

In D63233, the pattern changes so that we no longer have an extract_subvector of vselect, but the operands of the select are still being concatenated.

The closest case is represented in either the first or last test diffs here - we have an extra instruction, but we converted 3-4 ymm instructions into 4-5 xmm instructions. I think that's the right trade-off for most AVX1 targets.

In the example based on PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428
...this makes the loop about 30% faster (tested on Haswell by compiling with -mavx).

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Jun 14 2019, 3:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2019, 3:05 PM

Looks ok.
Is there some costmodel here, or do we always (well, when we see concatenation, we don't seem to introduce it
intentionally) want to do this, in the hope that two smaller ops are always at least as good as one wider op?

RKSimon accepted this revision.Jun 15 2019, 5:53 AM

LGTM but there are a couple of cases that are bordering on regression that need investigating (llvm-mca comparisons, TODO comments, bug report, whatever).

@lebedev.ri The TTI costs try to include the extra costs of 256-bit integer vector ops for AVX1 but its often tricky to completely account for it - because the costs work on an individual instruction level many of the 'holistic' effects aren't considered at all. This is something that has made it difficult to make D46276 actually useful - slightly better costs for individual instructions didn't help improve costs/codgen decisions for the entire sequence.

llvm/test/CodeGen/X86/cast-vsel.ll
494 ↗(On Diff #204857)

This is a annoying - even though many AVX1 targets have 128-bit ALUs, we were avoiding xmm insertion/extraction completely which was the better option.

This revision is now accepted and ready to land.Jun 15 2019, 5:53 AM
spatel marked an inline comment as done.Jun 15 2019, 7:14 AM

Looks ok.
Is there some costmodel here, or do we always (well, when we see concatenation, we don't seem to introduce it
intentionally) want to do this, in the hope that two smaller ops are always at least as good as one wider op?

I'm expecting the existing concatenation in the match to arise from AVX1 legalization. So in the worst case, we're removing those 2 concats but adding a concat of the condition operand and the blend results.
As Simon mentioned, this is bordering on a heuristic decision. The part of this that we really have no way to model is the frequency throttling that can occur with wider vector ops - that gets eliminated by using 128-bit (xmm) ops.

llvm/test/CodeGen/X86/cast-vsel.ll
494 ↗(On Diff #204857)

Agreed - we could limit the transform based on type of the select condition and/or whether it is extracted in addition to the true/false operands. I'll put a TODO here.

This revision was automatically updated to reflect the committed changes.