This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Combine (some) target shuffles with multiple uses
ClosedPublic

Authored by RKSimon on Aug 6 2018, 5:51 AM.

Details

Summary

As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.

This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should allow the total number of shuffles to still drop without increasing the constant pool.

However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - I've included the fix in this patch but can commit it separately as a followup if you wish to better show the effect

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Aug 6 2018, 5:51 AM
greened added inline comments.
lib/Target/X86/X86ISelLowering.cpp
39640 ↗(On Diff #159280)

This would be best done as a separate change.

test/CodeGen/X86/2012-01-12-extract-sv.ll
12 ↗(On Diff #159280)

Can we make this test less brittle by using FileCheck variables? This goes for pretty much every test in this patch.

test/CodeGen/X86/avx512-intrinsics-fast-isel.ll
6498 ↗(On Diff #159280)

Make this less brittle with FileCheck variables.

6740 ↗(On Diff #159280)

Make this less brittle with FileCheck variables.

RKSimon added inline comments.Aug 6 2018, 1:26 PM
lib/Target/X86/X86ISelLowering.cpp
39640 ↗(On Diff #159280)

Sure, that was what I meant in the summary:

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - I've included the fix in this patch but can commit it separately as a followup if you wish to better show the effect

test/CodeGen/X86/2012-01-12-extract-sv.ll
12 ↗(On Diff #159280)

I'm sorry but no - its been repeatedly proven that using update_llc_test_checks.py on the majority of x86 tests is the way forward - it speeds up creation of tests (x86 by far has the highest test coverage), makes regeneration of checks trivial and it prevents dodgy code being 'hidden' (either on purpose or by accident). Additionally many x86 subtargets have different instruction behaviours depending on the registers used so hidng the registers behind regexps make it that more difficult to track.

ping?

test/CodeGen/X86/2012-01-12-extract-sv.ll
12 ↗(On Diff #159280)

Just to be clear, this isn't just a regalloc diff - there is a codegen change here - the xmm0 value on line 12 is no longer dependent on the pervious perm that was on line 9.

test/CodeGen/X86/avx512-intrinsics-fast-isel.ll
6498 ↗(On Diff #159280)

Again, there is a codegen change here.

andreadb accepted this revision.Aug 9 2018, 3:21 AM

LGTM.

The change to combinePMULDQ() should be committed as a separate patch (as suggested by you and David).

Thanks
-Andrea

This revision is now accepted and ready to land.Aug 9 2018, 3:21 AM
This revision was automatically updated to reflect the committed changes.