PMADDWD can help improve the performance of 8/16-bit integer multiply-add operations in cases like:
  for (int i = 0; i < count; i++)
    a += x[i] * y[i];
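For readers unfamiliar with the instruction, here is a minimal scalar sketch of what one PMADDWD lane computes: two signed 16-bit products summed into a 32-bit result. The helper names `pmaddwd_pair` and `dot_product_pairs` are hypothetical and are not part of the patch; the sketch only models the arithmetic and says nothing about how the lowering itself is implemented.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of one 32-bit PMADDWD result lane:
// multiply two pairs of signed 16-bit values and add the two products.
// The add is done in unsigned arithmetic so that the single overflowing
// input (all four operands equal to -32768) wraps instead of being UB.
inline int32_t pmaddwd_pair(int16_t x0, int16_t y0, int16_t x1, int16_t y1) {
    uint32_t p0 = static_cast<uint32_t>(static_cast<int32_t>(x0) * y0);
    uint32_t p1 = static_cast<uint32_t>(static_cast<int32_t>(x1) * y1);
    return static_cast<int32_t>(p0 + p1);
}

// The reduction loop from the summary, rewritten to consume two elements
// per step, which is the shape a PMADDWD-based expansion can exploit.
// Assumes the running sum itself does not overflow 32 bits.
inline int32_t dot_product_pairs(const int16_t *x, const int16_t *y,
                                 std::size_t count) {
    int32_t a = 0;
    std::size_t i = 0;
    for (; i + 1 < count; i += 2)
        a += pmaddwd_pair(x[i], y[i], x[i + 1], y[i + 1]);
    if (i < count)  // odd trailing element
        a += static_cast<int32_t>(x[i]) * y[i];
    return a;
}
```

On real hardware one 128-bit PMADDWD produces four such 32-bit lanes at once.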
Differential D31679: Use PMADDWD to expand reduction in a loop
Closed, Public. Authored by danielcdh on Apr 4 2017, 2:05 PM.
Event Timeline
danielcdh marked an inline comment as done.
Remove the support for PMADDUBSW, as it cannot handle the overflow case.
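To illustrate the overflow concern, here is a scalar sketch of one PMADDUBSW result lane. The instruction multiplies unsigned bytes by signed bytes and adds adjacent products with signed saturation to 16 bits, so the result can differ from the exact sum that a plain `a += x[i] * y[i]` reduction expects. The helper names are hypothetical, and reading the comment above as referring to this saturation is an assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Exact (unsaturated) sum of two unsigned-by-signed byte products.
inline int32_t exact_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    return static_cast<int32_t>(a0) * b0 + static_cast<int32_t>(a1) * b1;
}

// Hypothetical scalar model of one 16-bit PMADDUBSW result lane:
// the exact sum above, clamped to the signed 16-bit range.
inline int16_t pmaddubsw_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t exact = exact_pair(a0, b0, a1, b1);
    int32_t clamped = std::min<int32_t>(32767, std::max<int32_t>(-32768, exact));
    return static_cast<int16_t>(clamped);
}
```

For example, with both pairs equal to (255, 127) the exact sum is 64770, while the modeled instruction result is 32767.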
danielcdh retitled this revision from Support PMADDWD and PMADDUBSW to Use PMADDWD to expand reduction in a loop. Apr 4 2017, 5:02 PM
Thanks for working on this patch. Regarding support for PMADDUBSW, can we match something like the following?
  for (int i = 0; i < count; i++) {
    a = saturate(a + x[i] * y[i]);
  }
How does the user specify "saturate"? Is it a general builtin in clang?

I suggest we leave the PMADDUBSW discussion for a separate patch. Some minor comments inline.
This revision is now accepted and ready to land. Apr 6 2017, 10:00 PM
I'm not aware of such a builtin, and my snippet above was more of pseudo-code.
  int sat_sint16(int x) { return std::min(32767, std::max(-32768, x)); }
AFAIK, the loop vectorizer will not vectorize the reduction for PMADDUBSW, so I agree with @mkuper to do this in a different patch.
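Putting the pseudo-code loop and `sat_sint16` together gives the following self-contained sketch of the saturating reduction under discussion. The element types (`uint8_t` for `x`, `int8_t` for `y`) and the function name `saturating_dot` are assumptions chosen to mirror PMADDUBSW's operand types; the thread does not specify them.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Clamp to the signed 16-bit range, as in the comment above.
inline int sat_sint16(int x) {
    return std::min(32767, std::max(-32768, x));
}

// Saturating multiply-add reduction: the accumulator is clamped after
// every step, so once it hits a bound it stays pinned there while
// further products keep pushing in the same direction.
inline int saturating_dot(const uint8_t *x, const int8_t *y,
                          std::size_t count) {
    int a = 0;
    for (std::size_t i = 0; i < count; i++)
        a = sat_sint16(a + x[i] * y[i]);
    return a;
}
```

Note that this clamps after every element, whereas the instruction saturates the sum of each adjacent pair, so whether the two can be matched is exactly the open question left for the separate patch.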
Revision Contents
Diff 94119
lib/Target/X86/X86ISelLowering.cpp
test/CodeGen/X86/madd.ll
Maybe use std::swap, so that Op0 and Op1 are unnecessary.
  MulOp = N->getOperand(0);
  Phi = N->getOperand(1);
  if (MulOp.getOpcode() != ISD::MUL) {
    std::swap(MulOp, Phi);
    if (MulOp.getOpcode() != ISD::MUL)
      return SDValue();
  }
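The same swap-based canonicalization can be tried outside LLVM with a toy stand-in. In the sketch below, `Opcode`, `Node`, and `canonicalizeMulFirst` are hypothetical illustration types and are not SelectionDAG APIs; the point is only that, for a commutative parent node, one `std::swap` covers both operand orders without extra temporaries.

```cpp
#include <utility>

// Toy stand-ins for DAG nodes (hypothetical, not LLVM types).
enum class Opcode { Mul, Phi, Other };
struct Node {
    Opcode opcode;
};

// Reorders the two operands of a commutative node so that mulOp holds
// the multiply. Returns false if neither operand is a multiply, which
// corresponds to the `return SDValue();` bail-out in the snippet above.
inline bool canonicalizeMulFirst(Node &mulOp, Node &phi) {
    if (mulOp.opcode != Opcode::Mul) {
        std::swap(mulOp, phi);
        if (mulOp.opcode != Opcode::Mul)
            return false;
    }
    return true;
}
```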