This is an archive of the discontinued LLVM Phabricator instance.

[X86] Attempt to improve v32i8/v64i8 multiply lowering by applying the v16i8 non-AVX2 algorithm to each 128-bit lane.
ClosedPublic

Authored by craig.topper on Nov 17 2018, 11:34 AM.

Details

Summary

Previously we split the vectors in half so that the two halves could be any_extended and multiplied, then concatenated the results back together.

This patch instead applies the v16i8 SSE algorithm within each 128-bit lane: it widens each half of a lane using punpcklbw/punpckhbw, multiplies all the low halves and all the high halves in separate operations, then merges the half-lane results back together using packuswb.
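As a concrete illustration, here is a minimal sketch of the per-lane algorithm using AVX2 intrinsics (the helper name mul_v32i8 is made up for this note, and the sketch approximates the lowering rather than reproducing the patch's code). On AVX2, vpunpcklbw/vpunpckhbw and vpackuswb all operate within each 128-bit lane, which is exactly what lets the v16i8 algorithm apply per lane:

#include <immintrin.h> // AVX2

__m256i mul_v32i8(__m256i a, __m256i b) {
  const __m256i zero = _mm256_setzero_si256();
  // vpunpck{l,h}bw against zero: widen each half of every 128-bit lane
  // from i8 to i16 (zero extension).
  __m256i alo = _mm256_unpacklo_epi8(a, zero);
  __m256i ahi = _mm256_unpackhi_epi8(a, zero);
  __m256i blo = _mm256_unpacklo_epi8(b, zero);
  __m256i bhi = _mm256_unpackhi_epi8(b, zero);
  // Multiply the low halves and the high halves in separate operations;
  // the low 8 bits of each i16 product are the wrapped i8 product.
  __m256i rlo = _mm256_mullo_epi16(alo, blo);
  __m256i rhi = _mm256_mullo_epi16(ahi, bhi);
  // Mask each product to its low byte so vpackuswb doesn't saturate,
  // then merge the half-lane results back together.
  const __m256i mask = _mm256_set1_epi16(0x00FF);
  return _mm256_packus_epi16(_mm256_and_si256(rlo, mask),
                             _mm256_and_si256(rhi, mask));
}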

Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The previous splitting somewhat made up for that, since halves were discarded after each split.
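To see why narrowing matters, here is an illustrative scalar sketch (not code from the patch; the function name is invented) of the shape a multiply reduction wants: each step multiplies the top half into the bottom half, so only half the elements stay live after every step, and full-width multiplies do wasted work.

#include <cstdint>

uint8_t reduce_mul_v16i8(uint8_t v[16]) {
  // After each step only v[0..width/2) is live; an ideal lowering would
  // shrink the vector multiplies to match.
  for (int width = 16; width > 1; width /= 2)
    for (int i = 0; i < width / 2; ++i)
      v[i] = (uint8_t)(v[i] * v[i + width / 2]);
  return v[0];
}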

Diff Detail

Event Timeline

craig.topper created this revision. Nov 17 2018, 11:34 AM
craig.topper added inline comments. Nov 17 2018, 11:38 AM
test/CodeGen/X86/min-legal-vector-width.ll
653

This code is longer, but it does 4 256-bit loads; the original code did 8 128-bit loads that were folded into the zero extends.

test/CodeGen/X86/vector-mul.ll
575

Between vector op legalization and DAG combine we lost the combine here that turned the and+packuswb into vpperm. I'm not sure why shuffle combining wasn't able to do the same with the regular shuffle.
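For reference, the equivalence the lost combine exploits: and+packuswb keeps the low byte of each i16 lane of two vectors, and XOP's vpperm can express that as a single two-source byte shuffle. A hedged sketch with XOP intrinsics (the helper name is made up):

#include <x86intrin.h> // AMD XOP; compile with -mxop

__m128i even_bytes(__m128i lo, __m128i hi) {
  // Selector bytes 0-15 index into `lo` and 16-31 into `hi`, so taking
  // the even bytes of both sources reproduces and(0x00FF)+packuswb.
  const __m128i sel = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                    16, 18, 20, 22, 24, 26, 28, 30);
  return _mm_perm_epi8(lo, hi, sel);
}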

craig.topper edited the summary of this revision. Nov 17 2018, 11:39 AM

What does IACA/llvm-mca say about the regressions in min-legal-vector-width.ll and vector-reduce-mul.ll?

test/CodeGen/X86/vector-mul.ll
575

This is rather odd - I'll take a look once this has landed.

craig.topper added inline comments. Nov 18 2018, 10:56 PM
test/CodeGen/X86/min-legal-vector-width.ll
653

llvm-mca says the new code is better for this case: https://godbolt.org/z/FLcYCe

test/CodeGen/X86/vector-reduce-mul.ll
2973

This is also kind of terrible. We extend i16->i32, truncate i32->i8, shuffle, then extend i8->i16.

3073

This vpunpcklbw is unneeded, since it just unpacks the result of the packuswb with undef for the odd elements. Similar patterns occur throughout the reductions; fixing that would go a long way toward improving the new code, I think.
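To spell out the redundancy (a minimal sketch, not the generated code; the helper name is invented): once the products are masked to their low bytes, packuswb followed by punpcklbw just recreates the value it started from.

#include <immintrin.h>

__m128i pack_then_unpack(__m128i x) {
  // Keep the low byte of each i16 lane, pack with an undef high half,
  // then re-widen: the result equals lobytes, so the unpack is dead.
  // (The actual generated code interleaves with undef instead of zero,
  // which is just as redundant, since the low byte of an i16 multiply
  // depends only on the low bytes of its operands.)
  __m128i lobytes = _mm_and_si128(x, _mm_set1_epi16(0x00FF));
  __m128i packed  = _mm_packus_epi16(lobytes, _mm_undefined_si128());
  return _mm_unpacklo_epi8(packed, _mm_setzero_si128());
}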

LGTM - please can you raise bugs describing the regressions?

RKSimon accepted this revision. Nov 19 2018, 9:27 AM
This revision is now accepted and ready to land. Nov 19 2018, 9:27 AM
This revision was automatically updated to reflect the committed changes.