This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add v16i8/v32i8 multiplication support
ClosedPublic

Authored by RKSimon on Apr 20 2015, 11:11 AM.

Details

Summary

Patch to allow int8 vectors to be multiplied on the SSE unit instead of being scalarized.

The patch sign extends the i8 lanes to i16, uses the SSE2 pmullw multiplication instruction, then packs the lower byte from each result.

Once vpackuswb zmm support is present this should also work for v64i8 multiplication on AVX512BW targets.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 24030.Apr 20 2015, 11:11 AM
RKSimon retitled this revision from to [X86][SSE] Add v16i8/v32i8 multiplication support.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: qcolombet, andreadb, spatel, delena.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
delena edited edge metadata.Apr 20 2015, 1:33 PM

I think that this sequence is good for SSE2. But for SSE4, AVX2 and definitely AVX-512 we can find a better chain. See my comments inside.

lib/Target/X86/X86ISelLowering.cpp
15899

There is a more optimal way for sign extend on SSE4, AVX2, at least for lower part. just VPMOVSXBW.
And for AVX-512 (skx) we have truncate from W to B.
So I suggest to write more generic code and then lower it according to target:

  1. ALo = sign extend lower part of A from "B" to "W" (ISD::SIGN_EXTEND)
  2. BLo = sign extend lower part of B from "B" to "W"
  3. multiply ALo * BLo
  4. shift the whole vector A right to put the high part instead of low (VPALIGNR)
  5. do the same with AHi, BHi
  6. use ISD::TRUNCATE for writing result back

you can optimize truncate/extend according to the target capabilities

RKSimon updated this revision to Diff 24247.Apr 22 2015, 11:30 AM
RKSimon edited edge metadata.

Thanks Elena for the review. I've updated the patch with your suggestions for SSE2/SSE4.1/AVX2 specific optimizations. If AVX512BW support for vpmovsxbw (zmm) and vpmovwb (xmm,ymm,zmm) (TRUNCATE) were added I could include support for v64i8 as well.

Reviewing this updated patch, it is quite bulky. Something that I'm considering is postponing this and improving support for SSE2/SSE41 for SIGN_EXTEND and SIGN_EXTEND_VECTOR_INREG first which would permit all of their specific code to be removed.

The code is correct. I just suggest to divide it in 2-3 functions,
like LowerMUL_v16i8_sse41(), LowerMUL_v16i8_sse2() or you can choose better names.
But it is only my suggestion, you can commit the code as is if you want.

This revision was automatically updated to reflect the committed changes.

Thanks Elena. I've submitted the patch as is. I have an upcoming patch for sign extension that will simplify much of this code and should make any additional effort to split off SSE2/SSE41 implementations redundant.