This is an archive of the discontinued LLVM Phabricator instance.

Use PMADDWD to expand reduction in a loop
ClosedPublic

Authored by danielcdh on Apr 4 2017, 2:05 PM.

Details

Summary

PMADDWD can help improve 8/16 bit integer mutliply-add operation performance for cases like:

for (int i = 0; i < count; i++)

a += x[i] * y[i];

Event Timeline

danielcdh created this revision.Apr 4 2017, 2:05 PM
wmi added inline comments.Apr 4 2017, 3:46 PM
lib/Target/X86/X86ISelLowering.cpp
34588–34599

Maybe use std::swap, so that Op0 and Op1 are unnecessary.
MulOp = N->getOperand(0);
Phi = N->getOperand(1);
if (MulOp.getOpcode() != ISD::MUL) {
std::swap(MulOp, Phi);
if (MulOp.getOpcode() != ISD::MUL)
return SDValue();
}

34605–34607

Just realize maybe we need to prove no wrap before use PMADDUBSW? It is possible that i8*i8+i8*i8 will overflow for i16 and PMADDUBSW will generate saturate result. However, the original i32 operations will not have overflow issue.

danielcdh updated this revision to Diff 94140.Apr 4 2017, 5:00 PM
danielcdh marked an inline comment as done.

remove the support for PMADDUBSW as it cannot handle overflow case.

lib/Target/X86/X86ISelLowering.cpp
34605–34607

Good catch!

Looks like we should not use compiler to generate PMADDUBSW directly. Patch updated.

danielcdh retitled this revision from Support PMADDWD and PMADDUBSW to Use PMADDWD to expand reduction in a loop.Apr 4 2017, 5:02 PM
danielcdh edited the summary of this revision. (Show Details)
wmi added inline comments.Apr 5 2017, 10:06 AM
test/CodeGen/X86/madd.ll
25

The test can be simpler by removing the preheader and exit. just like what test/CodeGen/X86/sad.ll does. It will be easier if it is needed to extend the test and check different kinds of element types or different vector lengths.

zvi edited edge metadata.Apr 5 2017, 9:55 PM

Thanks for working on this patch. Regarding support for PMADDUBSW, can we match something like the following?

for (int i = 0; i < count; i++) {
  a = saturate(a + x[i] * y[i]);
}
danielcdh marked an inline comment as done.Apr 6 2017, 8:31 AM
In D31679#719786, @zvi wrote:

Thanks for working on this patch. Regarding support for PMADDUBSW, can we match something like the following?

for (int i = 0; i < count; i++) {
  a = saturate(a + x[i] * y[i]);
}

How does user specify "saturate"? Is it a general builtin in clang?

danielcdh updated this revision to Diff 94376.Apr 6 2017, 8:31 AM
danielcdh edited the summary of this revision. (Show Details)

simplify test

mkuper edited edge metadata.Apr 6 2017, 6:24 PM

I suggest we leave the PMADDUBSW discussion for a separate patch.

Some minor comments inline.

lib/Target/X86/X86ISelLowering.cpp
34600

Since this version always uses PMADDWD, we no longer care about Mode at all, and this entire if can go away.

34630

Isn't Madd.getSimpleValueType() always MVT::i32?

34630

Just a sanity check - this is correct irrespective of whether the initial value for the reduction phi is 0 or not, right?

34708

Bikeshedding: maybe call this combineLoopMAddPattern() to match the other name?

34708

Sad -> Madd?

danielcdh updated this revision to Diff 94474.Apr 6 2017, 7:25 PM
danielcdh marked 3 inline comments as done.

update

lib/Target/X86/X86ISelLowering.cpp
34630

I suppose yes.

34630

It should be vector type instead of scalar type.

mkuper accepted this revision.Apr 6 2017, 10:00 PM

LGTM

lib/Target/X86/X86ISelLowering.cpp
34630

Argh, right, sorry.

This revision is now accepted and ready to land.Apr 6 2017, 10:00 PM
danielcdh closed this revision.Apr 7 2017, 8:54 AM
zvi added a comment.Apr 9 2017, 1:06 AM

How does user specify "saturate"? Is it a general builtin in clang?

I'm not aware of such a builtin and my snippet above was more of pseudo-code.
In c++ you may implement saturation as:

int sat_sint16(int x) {
  return std::min(32767, std::max(-32768, x));
}

AFAIK, the loop vectorizer will not vectorize the reduction for PMADDUBSW, so i agree with @mkuper to do this in a different patch,