Patch to allow int8 vectors to be multiplied on the SSE unit instead of being scalarized.
The patch sign extends the i8 lanes to i16, uses the SSE2 pmullw multiplication instruction, then packs the lower byte from each result.
Once vpackuswb zmm support is present, this should also work for v64i8 multiplication on AVX512BW targets.
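For reference, the sequence described above can be sketched with SSE2 intrinsics. This is only an illustration of the instruction pattern for one v16i8 multiply, not the lowering code from the patch. The function name `mul_v16i8_sse2` is hypothetical, and masking each product to its low byte before `packuswb` is an assumption of this sketch, made so that the pack's unsigned saturation cannot clamp any lane.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: multiply 16 i8 lanes and keep the low byte of each
 * product, i.e. the wrapping result of an i8 multiply. */
static __m128i mul_v16i8_sse2(__m128i a, __m128i b)
{
    /* Sign extend i8 -> i16: duplicate each byte into both halves of a word
     * (punpcklbw/punpckhbw), then arithmetic shift right by 8 (psraw). */
    __m128i a_lo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);
    __m128i a_hi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);
    __m128i b_lo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
    __m128i b_hi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);

    /* pmullw: low 16 bits of each 16x16 product. */
    __m128i p_lo = _mm_mullo_epi16(a_lo, b_lo);
    __m128i p_hi = _mm_mullo_epi16(a_hi, b_hi);

    /* Keep only the low byte of every word so that packuswb, which
     * saturates to 0..255, passes each value through unchanged. */
    const __m128i mask = _mm_set1_epi16(0x00FF);
    p_lo = _mm_and_si128(p_lo, mask);
    p_hi = _mm_and_si128(p_hi, mask);

    /* packuswb: narrow the two i16 halves back to one v16i8. */
    return _mm_packus_epi16(p_lo, p_hi);
}
```

Callers would load the operands with `_mm_loadu_si128` and store the result with `_mm_storeu_si128`; each output byte is the product of the corresponding input bytes modulo 256.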
There is a cheaper way to sign extend on SSE4.1 and AVX2, at least for the lower half: just use (V)PMOVSXBW.
And on AVX-512 (SKX) we have a truncate from W to B.
So I suggest writing more generic code and then lowering it according to the target:
you can optimize the truncate/extend according to the target's capabilities.
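A rough sketch of that idea at the intrinsics level, assuming compile-time feature macros stand in for the target query: the multiply is written once in terms of generic extend and truncate helpers, and only the helpers change per target. The helper names are hypothetical and this is not the lowering code itself. The truncate is left in its SSE2 form; an AVX-512BW target could substitute a word-to-byte truncate at that point.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#ifdef __SSE4_1__
#include <smmintrin.h>  /* SSE4.1: pmovsxbw */
#endif

/* Hypothetical helper: sign extend the low 8 bytes of v to 8 words. */
static __m128i sext_lo_i8_to_i16(__m128i v)
{
#ifdef __SSE4_1__
    return _mm_cvtepi8_epi16(v);                        /* pmovsxbw */
#else
    return _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);  /* punpcklbw + psraw */
#endif
}

/* Hypothetical helper: sign extend the high 8 bytes of v to 8 words. */
static __m128i sext_hi_i8_to_i16(__m128i v)
{
#ifdef __SSE4_1__
    /* Move the upper half down first, then extend it with pmovsxbw. */
    return _mm_cvtepi8_epi16(_mm_srli_si128(v, 8));
#else
    return _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 8);  /* punpckhbw + psraw */
#endif
}

/* Hypothetical helper: truncate two v8i16 halves to one v16i8 (SSE2 form).
 * The mask keeps packuswb's unsigned saturation from clamping any lane. */
static __m128i trunc_i16_to_i8(__m128i lo, __m128i hi)
{
    const __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(lo, mask), _mm_and_si128(hi, mask));
}

/* Generic v16i8 multiply: extend, pmullw, truncate. */
static __m128i mul_v16i8_generic(__m128i a, __m128i b)
{
    __m128i p_lo = _mm_mullo_epi16(sext_lo_i8_to_i16(a), sext_lo_i8_to_i16(b));
    __m128i p_hi = _mm_mullo_epi16(sext_hi_i8_to_i16(a), sext_hi_i8_to_i16(b));
    return trunc_i16_to_i8(p_lo, p_hi);
}
```

Built with SSE4.1 enabled (for example `-msse4.1`), the extends become `pmovsxbw`; without it the same source falls back to the unpack and shift pair, and the result is identical either way.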