This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
ClosedPublic

Authored by RKSimon on Dec 21 2017, 4:26 AM.

Details

Summary

If the v4i32 elements have 17 or more leading zero bits, then we can use PMADDWD for the integer multiply when PMULLD is unavailable or slow.

The upper 17 bits need to be zero because PMADDWD performs a v8i16 signed-mul-extend + pairwise-add: the upper 16 bits so that the high pair contributes a zero product to the add, and the 17th bit (bit 15) so that the low half isn't incorrectly sign extended.
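To illustrate the reasoning above, here is a minimal scalar sketch of the PMADDWD lane arithmetic. It is not code from the patch: the names (`pmaddwdLane`, `has17LeadingZeros`, and so on) are hypothetical, it models the instruction in plain C++ rather than using intrinsics, and it assumes that both multiply operands satisfy the 17-leading-zeros condition.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4i32 = std::array<std::uint32_t, 4>;

// True if x has at least 17 leading zero bits, i.e. bits 31..15 are clear.
bool has17LeadingZeros(std::uint32_t x) { return (x >> 15) == 0; }

// Scalar model of one 32-bit lane of PMADDWD: split each lane into two
// signed 16-bit halves, multiply matching halves into 32-bit products,
// then add the two products (wrapping on overflow).
std::uint32_t pmaddwdLane(std::uint32_t a, std::uint32_t b) {
  std::int32_t aLo = static_cast<std::int16_t>(a & 0xFFFF);
  std::int32_t aHi = static_cast<std::int16_t>(a >> 16);
  std::int32_t bLo = static_cast<std::int16_t>(b & 0xFFFF);
  std::int32_t bHi = static_cast<std::int16_t>(b >> 16);
  // Each product fits in int32_t; add as unsigned to get wrapping behaviour.
  return static_cast<std::uint32_t>(aLo * bLo) +
         static_cast<std::uint32_t>(aHi * bHi);
}

// Model of the whole v4i32 operation, lane by lane.
V4i32 pmaddwd(const V4i32 &a, const V4i32 &b) {
  V4i32 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = pmaddwdLane(a[i], b[i]);
  return r;
}

// Reference: low 32 bits of the lane-wise product (what PMULLD computes).
V4i32 mulLow32(const V4i32 &a, const V4i32 &b) {
  V4i32 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = a[i] * b[i];
  return r;
}

// Spot checks: equal when both inputs have 17 leading zeros, different when
// bit 15 is set (sign extension) or the upper half is non-zero (extra product).
void selfCheck() {
  assert(has17LeadingZeros(0x7FFFu) && !has17LeadingZeros(0x8000u));
  assert(pmaddwdLane(0x7FFFu, 0x7FFFu) == 0x7FFFu * 0x7FFFu);
  assert(pmaddwdLane(0x8000u, 2u) != 0x8000u * 2u);
  assert(pmaddwdLane(0x10001u, 0x10001u) != 0x10001u * 0x10001u);
}
```

With bit 15 set, the low half is read as a negative i16, so `pmaddwdLane(0x8000, 2)` gives 0xFFFF0000 rather than 0x10000; with a non-zero upper half, the second product leaks into the sum.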

If people want, I can try to incorporate this more into the ShrinkMode enum returned by canReduceVMulWidth?

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. Dec 21 2017, 4:26 AM
craig.topper added inline comments. Dec 21 2017, 2:00 PM
test/CodeGen/X86/shrink_vmul.ll:1

Why doesn't this test have any AVX command lines? I assume some of the unpcks in the modified test case would be a zero extend on newer feature sets?

RKSimon updated this revision to Diff 128106. Dec 24 2017, 4:56 AM

Rebased after adding AVX tests to shrink_vmul.ll

craig.topper accepted this revision. Dec 27 2017, 12:38 PM

LGTM

I didn't realize when I made that avx comment that shrink vmul only applies to pre-sse4.1

This revision is now accepted and ready to land. Dec 27 2017, 12:38 PM
This revision was automatically updated to reflect the committed changes.

Thanks - I'm wondering whether we should try to use MADD for SSE41+ targets as well - realistically v2Xi16 multiplies are always going to be faster than vXi32 (1cy or more latency saving according to Agner). Similar to your avx512 vXi64 multiply patches I guess.

I'm wondering whether we should try to use MADD for SSE41+ targets as well

Yes, absolutely. Look for alternatives to PMULLD whenever possible except with -march=sandybridge / ivybridge, or KNL.

PMADDWD has twice the throughput (and half the latency) of PMULLD on Haswell and Skylake, although Skylake does have vector-integer multiply on two ports, so PMULLD there is 10c latency, 1c throughput. PMULLD is also half throughput on Core2 (4 uops) and Nehalem (2 uops).

On Jaguar, PMULLD is half-throughput like on Haswell. On Silvermont, it's 7 uops with 11c throughput (11x worse than PMADDWD).

On Ryzen, they're both single-uop, but PMADDWD has 3c instead of 4c latency, and 1c instead of 2c throughput. Same thing on Bulldozer-family: 4c vs. 5c latency, and 1c vs. 2c throughput.

PMULUDQ (widening multiply of the even elements) is usually as fast as PMADDWD, but 32-bit low-half PMULLD multiply is slow on everything except Intel Sandybridge / Ivybridge, and KNL. The throughput penalty is at least a factor of 2 on CPUs other than those.