This is an archive of the discontinued LLVM Phabricator instance.

[X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
ClosedPublic

Authored by craig.topper on Mar 13 2021, 11:44 AM.

Details

Summary

For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshudfd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.

Diff Detail

Event Timeline

craig.topper created this revision.Mar 13 2021, 11:44 AM
craig.topper requested review of this revision.Mar 13 2021, 11:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 13 2021, 11:44 AM
RKSimon added inline comments.Mar 17 2021, 9:28 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
27494

punpcklbw/punpckhbw

27495

Isn't the shift just for RHS constants?

27516

Do we need this constant path any more? It seemed to be to avoid problems with the SSE41 path that's been removed.

llvm/test/CodeGen/X86/vec_smulo.ll
1785–1786

Not related - but we should be able to remove this sext-in-reg code by replace the pmovzxbd with pmovzxsd (is that a hidden any_extend?)

craig.topper added inline comments.Mar 17 2021, 9:44 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
27495

We only use a shift for constants, though its should get constant folded away. That was me trying to explain the layout of the data. We use unpck in a way that extends the data and shifts it in the upper bits at the same time.

27516

Maybe not. I'll check.

llvm/test/CodeGen/X86/vec_smulo.ll
1785–1786

I'll check. Any idea why these tests use large result types like 16xi32 for a 16xi8 multiply?

RKSimon added inline comments.Mar 18 2021, 5:27 AM
llvm/test/CodeGen/X86/vec_smulo.ll
1785–1786

Not sure - either we copied+pasted from somewhere else or maybe its legacy from before we cleaned up the types promotion/widening. I'll see if the fold would be used anywhere else.

Rebase test cases.

craig.topper added inline comments.Mar 19 2021, 10:48 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
27516

I think the constant path is still needed. Instruction counts increased on some tests I checked.

Revise comment slightly

RKSimon added inline comments.Mar 23 2021, 5:40 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
27492

"put zero extend the bytes" ?

27505

Pull out repeated DAG.getConstant(0, dl, VT) calls

27547

More repeated DAG.getConstant(0, dl, VT) calls?

Address review comments.

RKSimon accepted this revision.Mar 28 2021, 4:01 AM

LGTM - those sextintreg issues shouldn't block this

llvm/lib/Target/X86/X86ISelLowering.cpp
27559

We don't do a bitcast here anymore - its all done by the PACKUS

This revision is now accepted and ready to land.Mar 28 2021, 4:01 AM
This revision was landed with ongoing or failed builds.Mar 28 2021, 11:50 AM
This revision was automatically updated to reflect the committed changes.