This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types
ClosedPublic

Authored by craig.topper on Jan 15 2017, 11:50 AM.

Details

Summary

We don't seem to have great rules on what a valid VBROADCAST node looks like. And as a consequence we end up with a lot of patterns to try to catch everything. We have patterns with scalar inputs, 128-bit vector inputs, 256-bit vector inputs, and 512-bit vector inputs.

As you can see from the things improved here we are currently missing patterns for 128-bit loads being extended to 256-bit before the vbroadcast.

I'd like to propose that VBROADCAST should always take a 128-bit vector type as input. As a first step towards that this patch adds an EXTRACT_SUBVECTOR in front of VBROADCAST when the input is 256 or 512-bits. In the future I would like to add scalar_to_vector around all the scalar operations. And maybe we should consider adding a VBROADCAST+load node to avoid separating loads from the broadcasting operation when the load itself isn't foldable.

This requires an additional change in target shuffle combining to look for the extract subvector and look through it to find the original operand. I'm sure this change isn't perfect but was enough to fix a few test failures that were being caused.

Another interesting thing I noticed is that the changes in masked_gather_scatter.ll show cases were we don't remove a useless insert into element 1 before broadcasting element 0.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper retitled this revision from to [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types.
craig.topper updated this object.
craig.topper added reviewers: delena, zvi, RKSimon.
craig.topper added a subscriber: llvm-commits.
RKSimon edited edge metadata.Jan 15 2017, 12:16 PM

I think we've spoken in the past of splitting broadcast (and subvector broadcast) into memory and register opcode variants.

lib/Target/X86/X86ISelLowering.cpp
5488 ↗(On Diff #84497)

I've got the horrid feeling this will cause problems for combineX86ShufflesRecursively which decodes shuffles but at the moment asserts if the inputs are not the same size as the shuffle result itself.

test/CodeGen/X86/widened-broadcast.ll
128 ↗(On Diff #84497)

Any idea why AVX1 fails?

What became of previous conversations about splitting the opcodes?

lib/Target/X86/X86ISelLowering.cpp
5488 ↗(On Diff #84497)

Does the fact that I put an node with the matching type into the Ops vector not help?

test/CodeGen/X86/widened-broadcast.ll
128 ↗(On Diff #84497)

Probably something to do with the lowering as broadcast only supporting integer types with AVX2. But that's just a guess.

zvi added inline comments.Jan 16 2017, 2:27 PM
lib/Target/X86/X86ISelLowering.cpp
5479 ↗(On Diff #84497)

EXTRACT_SUBVECTOR's Index operand must be a constant, so you don't need a dyn_cast<>.

5484 ↗(On Diff #84497)

You can drop the outter braces

5487 ↗(On Diff #84497)

Comment need to be updated

9677 ↗(On Diff #84497)

Can you please add here a comment explaining why?

RKSimon added inline comments.Jan 19 2017, 6:57 AM
lib/Target/X86/X86ISelLowering.cpp
5488 ↗(On Diff #84497)

My mistake - yes that should work.

What became of previous conversations about splitting the opcodes?

I think it came up after D22460 when we found some regressions. Even if we don't go the route of fully splitting broadcast/subvbroadcast into reg and mem intrinsics I'm going to look into adding broadcast support into EltsFromConsecutiveLoads soon.

zvi edited edge metadata.Feb 13 2017, 12:52 PM

What is the status of this patch?

craig.topper marked 6 inline comments as done.

Address previous review comments.

RKSimon added inline comments.Feb 14 2017, 5:55 AM
test/CodeGen/X86/masked_gather_scatter.ll
1719 ↗(On Diff #88299)

Current codegen should be committed to trunk if you want to show a delta, otherwise possibly just keep it as it is with ALL-NOT check?

craig.topper added inline comments.Feb 14 2017, 10:05 PM
test/CodeGen/X86/masked_gather_scatter.ll
1719 ↗(On Diff #88299)

I just naively reran the update script when previous patch failed to apply. I'll change it back

Reverted portions of the scatter gather test case that the update script unnecessarily changed.

zvi accepted this revision.Feb 14 2017, 10:52 PM

This patch LGTM. If you won't improve this patch to handle the AVX1 cases, please create a bug.

This revision is now accepted and ready to land.Feb 14 2017, 10:52 PM

Is using an fp broadcast for an integer operation for AVX1 a good idea. Is there a stack cross penalty for that on Sandy Bridge?

This revision was automatically updated to reflect the committed changes.

Is using an fp broadcast for an integer operation for AVX1 a good idea. Is there a stack cross penalty for that on Sandy Bridge?

For 256-bit cases using broadcastss/broadcastsd is a definite win - plus nearly all the 256-bit AVX1 operations are in the fp-domain (including how we lower v8i32/v4i64 shuffles). For 128-bit cases its less clear.