This is an archive of the discontinued LLVM Phabricator instance.

[AVX512] Correct isel patterns to support selecting masked vbroadcastf32x2/vbroadcasti32x2
ClosedPublic

Authored by craig.topper on Aug 29 2017, 10:37 PM.

Details

Summary

This patch adjusts the patterns so that the result type of the broadcast node is vXf64/vXi64, followed by a bitcast to vXi32. Intrinsic lowering was also adjusted to generate this new pattern.
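To illustrate why the 64-bit broadcast plus bitcast is a valid way to express this, here is a minimal scalar sketch in C++. It is not the LLVM code from the patch: the type aliases and function names are hypothetical, and it models only a 256-bit (8 x i32) result. It shows that repeating a pair of 32-bit elements gives the same bits as broadcasting one 64-bit element and reinterpreting the result, while the write mask still applies per 32-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of a 256-bit result; not LLVM types.
using V4i32 = std::array<std::uint32_t, 4>;
using V8i32 = std::array<std::uint32_t, 8>;
using V4i64 = std::array<std::uint64_t, 4>;

// Direct view: repeat the 32-bit pair {src[0], src[1]} across all eight lanes.
V8i32 broadcast_i32x2(const V4i32& src) {
    V8i32 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = src[i % 2];
    return r;
}

// View described in the summary: broadcast the low 64 bits as a single
// 64-bit element, then reinterpret ("bitcast") the result as 8 x i32.
V8i32 broadcast_i64_then_bitcast(const V4i32& src) {
    std::uint64_t lo = 0;
    std::memcpy(&lo, src.data(), sizeof lo);      // low two 32-bit lanes
    V4i64 wide{};
    wide.fill(lo);                                // 64-bit broadcast
    V8i32 r{};
    static_assert(sizeof(V8i32) == sizeof(V4i64), "same vector width");
    std::memcpy(r.data(), wide.data(), sizeof r); // bitcast to 8 x i32
    return r;
}

// Merge masking at 32-bit granularity: bit i of k selects lane i of val,
// otherwise lane i of passthru is kept.
V8i32 mask_merge(std::uint8_t k, const V8i32& val, const V8i32& passthru) {
    V8i32 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = ((k >> i) & 1) ? val[i] : passthru[i];
    return r;
}
```

Because the mask is applied after the reinterpretation, it can select individual 32-bit lanes even though the broadcast itself was expressed on 64-bit elements.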

Fixes PR34357

We should probably just drop the intrinsic entirely and use native IR, but I'll leave that for a future patch.

Any idea what instruction we should be lowering the floating point 128-bit result version of this pattern to? There's a 128-bit v2i32 integer broadcast but not an fp one.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision. Aug 29 2017, 10:37 PM
craig.topper retitled this revision from [AVX512] Correct isel patterns to support selecting masked vbroadcastf32x2/vbroadi32x2 to [AVX512] Correct isel patterns to support selecting masked vbroadcastf32x2/vbroadcasti32x2.
aymanmus accepted this revision. Aug 30 2017, 12:33 AM

LGTM

Regarding the 128-bit floating point version, the resulting sequence in the test looks fine.
It can be lowered to either vmovddup or vshufps; both showed the same throughput results on IACA.

This revision is now accepted and ready to land. Aug 30 2017, 12:33 AM

movddup won't allow the mask to fold. shufps would allow the masking to fold. The only annoying thing is that we can't fold the 64-bit load with shufps.
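A rough scalar sketch of the mask-folding point, in C++. The function names are hypothetical, and the sketch assumes the usual semantics: shufps with immediate 0x44 on identical operands selects lanes {0, 1, 0, 1}, and a masked movddup takes one mask bit per 64-bit element, whereas a masked shufps takes one mask bit per 32-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4f32 = std::array<float, 4>;

// Scalar model of shufps: the two low 2-bit selectors index a,
// the two high 2-bit selectors index b.
V4f32 shufps(const V4f32& a, const V4f32& b, std::uint8_t imm) {
    return {a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]};
}

// Scalar model of movddup viewed as 4 x f32: the low 64 bits
// (lanes 0 and 1) are duplicated into the high 64 bits.
V4f32 movddup(const V4f32& a) {
    return {a[0], a[1], a[0], a[1]};
}

// Zero masking with one mask bit per 32-bit lane (the granularity the
// vXi32/vXf32 broadcast pattern needs).
V4f32 mask_zero_32(std::uint8_t k, const V4f32& v) {
    V4f32 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = ((k >> i) & 1) ? v[i] : 0.0f;
    return r;
}

// Zero masking with one mask bit per 64-bit element, i.e. per pair of
// 32-bit lanes (assumed granularity of a masked movddup).
V4f32 mask_zero_64(std::uint8_t k, const V4f32& v) {
    V4f32 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = ((k >> (i / 2)) & 1) ? v[i] : 0.0f;
    return r;
}
```

Under these assumptions the unmasked results are identical, but a mask such as 0b0101 (keep lanes 0 and 2 only) has no equivalent at 64-bit granularity, which is why only the shufps form could absorb the 32-bit mask.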

This revision was automatically updated to reflect the committed changes.

That's right, but in the merge-mask version it doesn't really improve anything: you must have the mov instruction from xmm1 to xmm0 anyway, so folding the mask into the mov or into the shuffle/duplicate is equivalent.
Only in the zero-mask version can it save us the last masked mov (if we fold the mask), and even there IACA showed no throughput improvement.