This is an archive of the discontinued LLVM Phabricator instance.

[LV] Add a new reduction pattern match
ClosedPublic

Authored by rengolin on Jul 11 2018, 12:30 AM.

Details

Summary

Adding a new reduction pattern match for vectorizing code similar to TSVC s3111:

for (int i = 0; i < N; i++)
  if (a[i] > b)
    sum += a[i];

This patch adds support for fadd, fsub and fmull, as well as multiple
branches and different (but compatible) instructions (ex. add+sub) in
different branches.

I have forwarded to trunk, added fsub and fmul functionality and
additional tests, but the credit goes to Takahiro, who did most of the
actual work.

Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>.

Diff Detail

Event Timeline

Hi Takahiro,

The patch looks good, but I'm adding more people to have a closer look, as it has been a while since last time I touched this code.

cheers,
--renato

hsaito added a comment.EditedJul 24 2018, 6:12 PM

Takahiro,

I'm not familiar with the Recurrence Descriptor code, but I suppose the following is considered as RK_FloatAdd. If that's the case, we should be beefing up RK_FloatAdd rather than adding a new Kind. We can't keep adding new kind every time we encounter a different pattern of reduction sum/product. Downside is possibly exposing a downstream bug, but that should only help generalizing reduction handling code. From reduction analysis perspective, select (IF-converted) and phi (IF) should be the same thing. So, trying to handle this within FloatAdd/FloatMult should also help generalize recurrence analysis code. That's how I look at the issue. Hope this helps.

Thanks,
Hideki


float foo(float *a, int n){

float sum=0;
for (int i=0;i<n;i++){
  if (a[i]>1.0){
    sum+=a[i];
  }
  else if (a[i]<3.0){
    sum+=2*a[i];
  }
}
return sum;

}

for.body: ; preds = %for.inc, %for.body.preheader

%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.inc ]
%sum.026 = phi float [ 0.000000e+00, %for.body.preheader ], [ %sum.1, %for.inc ]
%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
%0 = load float, float* %arrayidx, align 4, !tbaa !2
%cmp1 = fcmp ogt float %0, 1.000000e+00
br i1 %cmp1, label %if.then, label %if.else

if.then: ; preds = %for.body

%add = fadd fast float %0, %sum.026
br label %for.inc

if.else: ; preds = %for.body

%cmp8 = fcmp olt float %0, 3.000000e+00
br i1 %cmp8, label %if.then10, label %for.inc

if.then10: ; preds = %if.else

%mul = fmul fast float %0, 2.000000e+00
%add13 = fadd fast float %mul, %sum.026
br label %for.inc

for.inc: ; preds = %if.then, %if.then10, %if.else

%sum.1 = phi float [ %add, %if.then ], [ %add13, %if.then10 ], [ %sum.026, %if.else ]
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body

Hi Hideki,

Thank you for your comments.

At first, as you said, I supposed that the target loop was FloatAdd pattern. But, I supposed this was also a little similar to FloatMixMax.
It means that the target one contains the meaning of FloatAdd and FloatMixMax, and I add the new recurrence descriptor to express this.

Certainly, I think keeping on adding new kind for a different pattern of reduction doesn't make sense.
So, I will retry to handle this within FloatAdd/FloatMult.

Best regards,
Takahiro

takahiro.miyoshi removed rL LLVM as the repository for this revision.

I modified my patch to use RK_FloatAdd instead of adding a new recurrence descriptor.
And, input IRs of this target loop is already converted into a select instruction, so I don't extend If-convert functionality.

I modified my patch to use RK_FloatAdd instead of adding a new recurrence descriptor.
And, input IRs of this target loop is already converted into a select instruction, so I don't extend If-convert functionality.

Is this ready for another round of review? I took a quick look. I think this is the right direction to follow. Any specific reasons for restricting to FloatAdd? Should be the same for integers and SUB and MUL, as well, isn't it?
Please also add a negative test for "two use" cases but outside of the pattern you are looking for as well as "three use" negative test.

By any chance, did you try looking into creating a LIT test for the following conditional reduction where both IFs are converted to selects? I'm just curious about how much extension of your code would be needed to capture that.
If this is already caught great. If low hanging, its nice to extend a bit further (and try to see if that covers 3, 4, 5, ..., N cases).

if (cond1)

sum+=...

if (cond2)

sum+=...

[
if (cond3)

sum+=...

if (cond4)

sum+=...

...
if (condN)

sum+=...

]

Thanks,
Hideki

rengolin commandeered this revision.Oct 8 2018, 10:14 AM
rengolin edited reviewers, added: takahiro.miyoshi; removed: rengolin.

Hi Hideki,

Takahiro is on leave, so I'm taking this work to make sure we don't delay much more.

I have added fsub and fmul functionality and a few tests (with multiple branches), including some that should not vectorise.

All the credit still goes to Takahiro.

cheers,
--renato

rengolin updated this revision to Diff 168678.Oct 8 2018, 10:16 AM
rengolin edited the summary of this revision. (Show Details)
rengolin edited the summary of this revision. (Show Details)

Thanks a lot, Renato. Will take a look quick.

hsaito added a comment.Oct 8 2018, 1:28 PM

Code looks good. Just a minor suggestion on the comment. Looking at the LIT test.

lib/Analysis/IVDescriptors.cpp
503 ↗(On Diff #168678)

where the Instruction argument I is the last select in the chain.

rengolin added inline comments.Oct 8 2018, 2:05 PM
lib/Analysis/IVDescriptors.cpp
503 ↗(On Diff #168678)

Good point! Probably better to also rename the argument and simplify the cast in the beginning of the function. I didn´t want to change much, but I guess that's more cosmetic than anything. :)

hsaito accepted this revision.Oct 8 2018, 2:28 PM

LGTM. Please wait for a few days to give others time to respond if they'd like to.

test/Transforms/LoopVectorize/if-reduction.ll
2

My preference is to have vectorization/non-vectorization checked by itself and then have another RUN line to check the expected optimization by InstCombine. That way, we'll quickly know which part changed when the test fails. I don't insist, though.

584

Is this the correct check here?

This revision is now accepted and ready to land.Oct 8 2018, 2:28 PM

LGTM. Please wait for a few days to give others time to respond if they'd like to.

Thanks Hideki!

I'll update with the review comments and wait a few days.

test/Transforms/LoopVectorize/if-reduction.ll
2

I see what you mean, will try to separate them. I'm not even sure the instcombine is necessary for the results we check, though.

584

ouch, no, regex left-over. Will fix.

hsaito added inline comments.Oct 8 2018, 3:12 PM
test/Transforms/LoopVectorize/if-reduction.ll
37

One comment somehow went missing.

I suggest adding one more negative test, for example, storing %add to y[i]. Single use of %add should be checked, I think. If we find a bug there, that's an easy thing to remedy.

rengolin updated this revision to Diff 168768.Oct 9 2018, 3:08 AM

Changes to comment:

  • Improved comments on isConditionalRdxPattern
  • Removed instcombine pass from test
  • Added write negative test
  • Fix typo in CHECK line
rengolin marked 7 inline comments as done.Oct 9 2018, 3:09 AM
hsaito accepted this revision.Oct 10 2018, 10:48 AM

LGTM.

This revision was automatically updated to reflect the committed changes.
Carrot added a subscriber: Carrot.Oct 26 2018, 3:00 PM

This patch doesn't correctly handle isFast(). By default isRecurrenceInstr() should check I->isFast(), but for this pattern I is Select, isFast() doesn't apply to it, it should be checked against FAdd/FMul inside isConditionalRdxPattern().

It caused our several internal applications failed. Following is a simple reproduction.

static double bar(double* v) {

double t = 0.0;
for (int i=0; i<10000; i++) {
  double s = v[i];
  if (s > 0) {
    t += s;
  }
}
return t;

}

double foo(double* v)
{

return bar(v);

}

clang++ -msse4.2 -c -O2 t9.cc -save-temps

In the generated code, the loop is wrongly vectorized.

This patch doesn't correctly handle isFast(). By default isRecurrenceInstr() should check I->isFast(), but for this pattern I is Select, isFast() doesn't apply to it, it should be checked against FAdd/FMul inside isConditionalRdxPattern().

Benjamin has said something similar, but I could not reproduce. I'll revert the patch and fix the issue. Thanks for the reproducer case!

--renato

Reverted in r345465. Will take a look and land again when fixed. Thanks!