This is an archive of the discontinued LLVM Phabricator instance.

[X86] Remove the v16i8->v16i16 path for MULHS with AVX2.
Closed, Public

Authored by craig.topper on May 8 2020, 2:28 PM.

Details

Summary

We have a couple of main strategies for legalizing MULH (both sketched
below):

-If the vXi16 type is legal, extend to do the full i16 multiply, then
shift and truncate the results.
-Use unpcks to split each 128-bit lane into high and low halves.
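
As a rough illustration, here is a minimal sketch of both strategies
using SSE2/AVX2 intrinsics. This is a hedged rewrite for exposition,
not the actual DAG lowering, and the helper names are hypothetical:

#include <immintrin.h>

// Strategy 1 (hypothetical helper): sign-extend to i16, do the full
// multiply, then shift and truncate.
__m128i mulhs_v16i8_extend(__m128i a, __m128i b) {
  __m256i a16 = _mm256_cvtepi8_epi16(a); // vpmovsxbw
  __m256i b16 = _mm256_cvtepi8_epi16(b);
  __m256i hi = _mm256_srai_epi16(_mm256_mullo_epi16(a16, b16), 8);
  // Truncating back to v16i8 requires a cross-lane extract with AVX2.
  return _mm_packs_epi16(_mm256_castsi256_si128(hi),
                         _mm256_extracti128_si256(hi, 1));
}

// Strategy 2 (hypothetical helper): unpck each 128-bit lane into high
// and low halves. Interleaving a byte with itself and doing an
// arithmetic shift right by 8 sign-extends it in place, so everything
// stays in-lane.
__m128i mulhs_v16i8_unpck(__m128i a, __m128i b) {
  __m128i alo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);
  __m128i blo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
  __m128i ahi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);
  __m128i bhi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);
  __m128i plo = _mm_srli_epi16(_mm_mullo_epi16(alo, blo), 8);
  __m128i phi = _mm_srli_epi16(_mm_mullo_epi16(ahi, bhi), 8);
  // The high byte of each i16 lane is zero after the logical shift, so
  // packus repacks losslessly.
  return _mm_packus_epi16(plo, phi);
}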

For signed multiplies we have an extra case that splits a v32i8 into
two v16i8 halves and then uses the extend-to-v16i16 strategy on each
(sketched below).
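
In the same hedged intrinsic style, a sketch of this removed path; the
final vpermq is the kind of cross-lane shuffle the tradeoff below
refers to:

#include <immintrin.h>

// Hypothetical sketch: split v32i8 into 128-bit halves, sign-extend
// each half to v16i16, multiply, shift, and repack.
__m256i mulhs_v32i8_split_extend(__m256i a, __m256i b) {
  __m128i a0 = _mm256_castsi256_si128(a);
  __m128i a1 = _mm256_extracti128_si256(a, 1);
  __m128i b0 = _mm256_castsi256_si128(b);
  __m128i b1 = _mm256_extracti128_si256(b, 1);
  __m256i p0 = _mm256_srai_epi16(
      _mm256_mullo_epi16(_mm256_cvtepi8_epi16(a0), _mm256_cvtepi8_epi16(b0)), 8);
  __m256i p1 = _mm256_srai_epi16(
      _mm256_mullo_epi16(_mm256_cvtepi8_epi16(a1), _mm256_cvtepi8_epi16(b1)), 8);
  // vpacksswb interleaves per 128-bit lane, so a cross-lane vpermq is
  // needed to restore element order.
  return _mm256_permute4x64_epi64(_mm256_packs_epi16(p0, p1),
                                  _MM_SHUFFLE(3, 1, 2, 0));
}

Applying the unpck strategy per 128-bit lane instead eliminates that
final vpermq, which is the shuffle trade described below.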

This patch proposes using the unpck strategy instead, which is what we
already do for unsigned.

This seems to be one instruction shorter when the RHS is constant, as
in the idiv case, and one instruction longer for the smulo case. But
we're trading cross-lane shuffles for in-lane shuffles and a shift.

I'm not sure this is the right answer, so I wanted to put it out
there for discussion.

Diff Detail

Event Timeline

craig.topper created this revision. May 8 2020, 2:28 PM
Herald added a project: Restricted Project. May 8 2020, 2:28 PM
Herald added a subscriber: hiraditya.

llvm-mca suggests this is beneficial (znver1/skx) or neutral (haswell): https://godbolt.org/z/Pi3Mqb - and it's closer to what we do for regular vXi8 multiplies.

llvm/lib/Target/X86/X86ISelLowering.cpp
26563–26564

Update the comment

Update comment

Rebase after the punpck false dependency fix

RKSimon accepted this revision. May 12 2020, 2:44 AM

LGTM with one minor comment

llvm/test/CodeGen/X86/vec_smulo.ll
2842

How come this can't be a broadcast of a smaller constant?

This revision is now accepted and ready to land. May 12 2020, 2:44 AM
craig.topper marked an inline comment as done. May 12 2020, 10:38 AM
craig.topper added inline comments.
llvm/test/CodeGen/X86/vec_smulo.ll
2842

We ended up in the code below, which doesn't know to broadcast i16 for some reason.

// On Sandybridge (no AVX2), it is still better to load a constant vector
// from the constant pool and not to broadcast it from a scalar.
// But override that restriction when optimizing for size.
// TODO: Check if splatting is recommended for other AVX-capable CPUs.
if (ConstSplatVal && (Subtarget.hasAVX2() || OptForSize)) {
  EVT CVT = Ld.getValueType();
  assert(!CVT.isVector() && "Must not broadcast a vector type");

  // Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
  // For size optimization, also splat v2f64 and v2i64, and for size opt
  // with AVX2, also splat i8 and i16.
  // With pattern matching, the VBROADCAST node may become a VMOVDDUP.
  if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
      (OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {

This revision was automatically updated to reflect the committed changes.