Page MenuHomePhabricator

[DAGCombiner] Convert constant AND masks to shuffle clear masks down to the byte level
ClosedPublic

Authored by RKSimon on Jul 26 2015, 2:43 PM.

Details

Summary

The XformToShuffleWithZero method currently checks AND masks at the per-lane level for all-one and all-zero constants and attempts to converts them to legal shuffle clear masks.

This patch generalises XformToShuffleWithZero, splitting and checking the sub-lanes of the constants down to the byte level to see if any legal shuffle clear masks are possible. This allows a lot of masks (often from legalization or truncation) to be folded into existing shuffle patterns and removes a lot of constant mask loading.

The patch involves a number of additional minor tweaks to improve codegen, I can commit these separately or generate patches for extra review if any of you wish:

  • XformToShuffleWithZero is only attempted if constant folding has failed.
  • A lot more X86 byte vector blends are now generated. I've added a stage to the VPBLENDVB lowering in lowerVectorShuffleAsBlend to attempt to lower back to a AND mask using lowerVectorShuffleAsBitMask if possible. VPAND is a lot faster than VPBLENDVB.
  • X86 v8i16 shuffle lowering now attempts to use lowerVectorShuffleAsBitMask before resorting to VPSHUFB (matches v16i8 shuffle lowering). VPAND is a lot faster than VPSHUFB.

There are still a few examples of poor shuffle lowering that are exposed that we can cleanup in future patches (e.g. x86 legalized v8i8 zero extension uses PMOVZX+AND+AND instead of AND+PMOVZX)

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 30667.Jul 26 2015, 2:43 PM
RKSimon retitled this revision from to [DAGCombiner] Convert constant AND masks to shuffle clear masks down to the byte level.
RKSimon updated this object.
RKSimon added reviewers: chandlerc, qcolombet, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
chandlerc edited edge metadata.Jul 26 2015, 10:09 PM

By and large, really, really nice.

Comments in-line.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12991–12994 ↗(On Diff #30667)

How annoying would it be to fully split this out into a helper function? The indentation is already reasonably significant here.

12999–13001 ↗(On Diff #30667)

I would just consistently use 'int' here.

13118–13121 ↗(On Diff #30667)

Is there a particular reason to need to adjust this in this way?

lib/Target/X86/X86ISelLowering.cpp
6884–6885 ↗(On Diff #30667)

Please send a separate review with just this (the bit mask lowering) change?

It seems likely it should be testable independently.

test/CodeGen/X86/sse3.ll
272–275 ↗(On Diff #30667)

This seems like a pretty bad regression. Any hope of recovering it?

test/CodeGen/X86/vec_cast2.ll
49–54 ↗(On Diff #30667)

Same question as above...

test/CodeGen/X86/vec_int_to_fp.ll
1762–1765 ↗(On Diff #30667)

and here, wow!

Thanks Chandler - I've replied to your comments and I'll get the 'sub patches' done first.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12991–12994 ↗(On Diff #30667)

Moderately annoying - the clear mask test is the only test in XformToShuffleWithZero. I could replace the if (RHS.getOpcode() == ISD::BUILD_VECTOR) block with a if (RHS.getOpcode() != ISD::BUILD_VECTOR) early out to reduce one level indentation instead?

13118–13121 ↗(On Diff #30667)

We carry on constant folding of binops - in this case (AND c1 c2) - a lot later in the lowering stages than we do constant folding of bitcasts and shuffles, so a number of tests started failing as they'd been converted to bitcasted shuffles, resulting in a lot of unnecessary code.

lib/Target/X86/X86ISelLowering.cpp
6884–6885 ↗(On Diff #30667)

Easily done (and understandable...) - I'll do a NFC commit beforehand to move the unaltered function so that the diff for a subpatch is as clear as possible.

test/CodeGen/X86/sse3.ll
272–275 ↗(On Diff #30667)

Yes - it requires DAGCombiner::visitVECTOR_SHUFFLE to be able to peek through shuffles of bitcasts of two inputs, so far it only handles the one input case. That should fix both the domain crossing and number of instructions. I've moved it up my todo list.

test/CodeGen/X86/vec_cast2.ll
49–54 ↗(On Diff #30667)

This is the zero-extend issue that I mentioned in the summary. Yak shaving......

test/CodeGen/X86/vec_int_to_fp.ll
1762–1765 ↗(On Diff #30667)

This is the zero-extend issue that I mentioned in the summary.

RKSimon updated this revision to Diff 30812.Jul 28 2015, 6:41 AM
RKSimon edited edge metadata.

rebased now that D11541 has been committed

Chandler - would you be happy for this patch to be submitted? I've started looking at the remaining issues (shuffle through bitcasts, extension + masks and merging 2x128 bitops -> 1x256 mask), all of them are more easily exposed after this patch and I'm working on improvements right now.

chandlerc accepted this revision.Jul 31 2015, 2:13 AM
chandlerc edited edge metadata.

Yea, LGTM and looking forward to the two-input fix and zext fix. Awesome work.

This revision is now accepted and ready to land.Jul 31 2015, 2:13 AM
This revision was automatically updated to reflect the committed changes.