Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16 bits rather than wrapping around, and that one operand is treated as signed and the other as unsigned.
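For reference, the arithmetic being matched can be written as a scalar model. This is only a sketch of the documented instruction semantics, not code from the patch; the names maddubs_lane and maddubs_ref are made up for illustration.

```c
#include <stdint.h>

/* One 16-bit output lane of PMADDUBSW (hypothetical helper name).
 * u0/u1 come from the operand treated as unsigned bytes,
 * s0/s1 from the operand treated as signed bytes. */
static int16_t maddubs_lane(uint8_t u0, int8_t s0, uint8_t u1, int8_t s1) {
    /* Each product fits in 16 bits, but the sum of two may not,
     * so accumulate in 32 bits before clamping. */
    int32_t sum = (int32_t)u0 * (int32_t)s0 + (int32_t)u1 * (int32_t)s1;

    /* Signed saturation to 16 bits instead of wrapping. */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Applies the lane model to n_out adjacent byte pairs
 * (hypothetical helper; u and s must each hold 2 * n_out bytes). */
static void maddubs_ref(const uint8_t *u, const int8_t *s,
                        int16_t *out, int n_out) {
    for (int i = 0; i < n_out; ++i)
        out[i] = maddubs_lane(u[2 * i], s[2 * i], u[2 * i + 1], s[2 * i + 1]);
}
```

With both unsigned bytes at 255 and both signed bytes at 127, the unclamped sum is 64770, so the lane saturates to 32767; with the signed bytes at -128 it saturates to -32768.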
I changed the madd.ll test command line from sse2 to ssse3 to ensure this instruction is available, which also caused some test changes for phadd. I can commit that separately if desired, add a new run line, or add a new test file, whichever is preferable.
A C example that triggers this pattern:

  #include <stdint.h>

  static const int N = 128;
  int8_t A[2*N];
  uint8_t B[2*N];
  int16_t C[N];

  /* Outer parentheses keep the macros safe inside larger expressions. */
  #define MIN(x, y) (((x) < (y)) ? (x) : (y))
  #define MAX(x, y) (((x) > (y)) ? (x) : (y))

  void foo() {
    for (int i = 0; i != N; ++i)
      C[i] = MIN(MAX((int16_t)A[2*i]*(int16_t)B[2*i] +
                     (int16_t)A[2*i+1]*(int16_t)B[2*i+1], -32768), 32767);
  }
Couldn't you merge all these sets of canonicalization early-outs together to save space?