This is an archive of the discontinued LLVM Phabricator instance.

[X86] decomposeMulByConstant - decompose legal vXi32 multiplies on SlowPMULLD targets and all vXi64 multiplies
ClosedPublic

Authored by RKSimon on Sep 27 2021, 2:08 PM.

Details

Summary

X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.

Unfortunately this isn't great for SSE41+ targets which have PMULLD for vXi32 multiplies, but is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even latest IceLake has really poor latencies for PMULLQ.

Diff Detail

Event Timeline

RKSimon created this revision.Sep 27 2021, 2:08 PM
RKSimon requested review of this revision.Sep 27 2021, 2:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2021, 2:08 PM

ping - any thoughts? should we do not do this at all? limit this to SlowPMULLD or always do it for vXi32 and vXi64?

ping - any thoughts? should we do not do this at all? limit this to SlowPMULLD or always do it for vXi32 and vXi64?

I think it makes sense for SlowPMULLD. I think this makes sense for vXi32 on Haswell and later Intel Core CPUs. That's where it went from one uop to two serialize 5 cycle uops. I think it makes sense for vXi64 on all Intel CPUs with PMULLQ since that instruction is 3 serialized uops.

RKSimon updated this revision to Diff 376561.Oct 1 2021, 10:04 AM
RKSimon retitled this revision from [X86] decomposeMulByConstant - decompose legal vXi32 mutliplies on SlowPMULLD targets to [X86] decomposeMulByConstant - decompose legal vXi32 mutliplies on SlowPMULLD targets and all xXi64 mutliplies.
RKSimon edited the summary of this revision. (Show Details)

Always decompose vXi64 multiplies

RKSimon retitled this revision from [X86] decomposeMulByConstant - decompose legal vXi32 mutliplies on SlowPMULLD targets and all xXi64 mutliplies to [X86] decomposeMulByConstant - decompose legal vXi32 mutliplies on SlowPMULLD targets and all vXi64 mutliplies.Oct 1 2021, 10:04 AM
This revision is now accepted and ready to land.Oct 1 2021, 10:44 AM

Thanks Craig

Can this patch solve first part of PR52039?

https://bugs.llvm.org/show_bug.cgi?id=52039

Can this patch solve first part of PR52039?

https://bugs.llvm.org/show_bug.cgi?id=52039

I think this will only help that on Silvermont :)

We might have to redefine SlowPMULLD slightly so we can enable it on Haswell/Broadwell (and maybe Jaguar) (see PR35948). As well as this patch its used in reduceVMULWidth for PMULLD->PMULLW+PMULHW+SHUFFLE expansion of small values which is good for SLM (which has REALLY bad microcoded PMULLD) but not usually for anything else where its just a 2/3uop instruction.

I've been trying to improve PMADDWD combines as an alternative to reduceVMULWidth but there's a still some work to do (PR47437 and PR45897).

RKSimon retitled this revision from [X86] decomposeMulByConstant - decompose legal vXi32 mutliplies on SlowPMULLD targets and all vXi64 mutliplies to [X86] decomposeMulByConstant - decompose legal vXi32 multiplies on SlowPMULLD targets and all vXi64 multiplies.Oct 2 2021, 4:35 AM