This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Do not reduce repeated values, use scalar red ops instead.
ClosedPublic

Authored by ABataev on Aug 19 2022, 1:58 PM.

Details

Summary

Metric: size..text

size..text                 results     results0    diff

SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-980605-1.test 445.00 461.00 3.6%
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 428477.00 428445.00 -0.0%
External/SPEC/CFP2006/447.dealII/447.dealII.test 618849.00 618785.00 -0.0%

For all tests some extra code was optimized; GCC-C-execute has some more inlining after.

Diff Detail

Event Timeline

ABataev created this revision.Aug 19 2022, 1:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 19 2022, 1:58 PM
ABataev requested review of this revision.Aug 19 2022, 1:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 19 2022, 1:58 PM
ABataev updated this revision to Diff 461984.Sep 21 2022, 12:33 PM

Rebase, ping

The patch seems to be trying to move the SLP vectorizer into InstCombine territory. I'm not sure why we should do that. Did you try to analyze which specific patterns helped in, for example, the GCC-C-execute test?
Can these represent any cases where instcombine could be improved instead? It is difficult to assess this patch based on test changes, as a lot of the SLP vectorizer tests were not designed to be run through instcombine.
If you run "-instcombine -slp-vectorizer" instead of just -slp-vectorizer, then how many of the affected LIT tests would still benefit from the patch?

The patch seems to be trying to move the SLP vectorizer into InstCombine territory. I'm not sure why we should do that. Did you try to analyze which specific patterns helped in, for example, the GCC-C-execute test?
Can these represent any cases where instcombine could be improved instead? It is difficult to assess this patch based on test changes, as a lot of the SLP vectorizer tests were not designed to be run through instcombine.
If you run "-instcombine -slp-vectorizer" instead of just -slp-vectorizer, then how many of the affected LIT tests would still benefit from the patch?

Most of these tests will certainly be optimized, since they are pretty simple. But it looks like there are some other places where the reduction analysis in SLP is better than the similar analysis in the instcombiner. Also, if we can do some optimization here, it should reduce compile time, since the instcombiner consumes a lot of time.

Now the difference is even bigger:

            test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   431225.00   431288.00  0.0%
       test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2198241.00  2198465.00  0.0%
                      test-suite :: MultiSource/Applications/SPASS/SPASS.test   530608.00   530640.00  0.0%
        test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1135953.00  1135937.00 -0.0%
               test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   651440.00   651360.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test    48631.00    48311.00 -0.7%

InstCombine can simplify most of the currently unoptimized stuff, but it requires extra passes (like ReassociatePass) and extra compile time, while we can do it easily in the SLP vectorizer, since we already have all the required data. The PhaseOrdering tests are the proof.

A couple of general thoughts. Can you please add a knob that allows turning off the optimization? And can some sort of debug tracing be added, such as the values that have been optimized away?

ABataev updated this revision to Diff 489611.Jan 16 2023, 11:25 AM

Address comments

ABataev updated this revision to Diff 490845.Jan 20 2023, 7:25 AM

Rebase, ping.

I'll try to look at this closely next week (but not earlier than Monday). Just a quick note about the terminology used: the option name "slp-same-scalars-reduction" does not actually say much about what it controls.
Since you are basically trying to optimize away identity operations in reduction sequences, I'd suggest using a name for the option, and across the code, that better reflects that.
The option name could be "-slp-optimize-identity-hor-reduction-ops=true|false", for example.

ABataev updated this revision to Diff 490963.Jan 20 2023, 1:35 PM

Rebase, renamed option, added a test run for false option value

ABataev updated this revision to Diff 492126.Jan 25 2023, 8:24 AM

Rebase, ping!

ABataev updated this revision to Diff 493402.Jan 30 2023, 1:38 PM

Rebase, ping!

RKSimon added inline comments.Feb 4 2023, 1:08 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12430

Use allConstant() ?

12550

allConstant(Candidates)

ABataev updated this revision to Diff 495111.Feb 6 2023, 6:54 AM

Address comments

I'm sorry for the delay. I was a bit overloaded with internal stuff.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
131

Would the following description
"Allow optimization of original scalar identity operations on matched horizontal reductions."
sound better?
Same question wrt the AllowSameScalarReduction variable name. Would AllowHorRdxIdenityOptimization fit better?

It probably makes sense to add a note (as a comment above, or in the commit message): will the optimization run if we match a reduction but do not vectorize in the end? Based on the test updates it looks like it will. IMO it should be clearly stated that the pass outcome might not be vectorized code.

12244

nit: noticed by chance while refreshing on the code. This call to getRdxKind can be safely replaced with RdxKind.

12421

nit: maybe make it unsigned too (+ deduction rule)?

12557–12566

This code pattern appears multiple times. Any chance to outline it?
Also (for better readability): swap the then/else branches and remove the '!' from the condition?

12569

We don't need to collect repeating counters if IsSupportedReusedRdxOp is false.

12573–12574

This code should form a member of the HorizontalReduction class, and emitReusedOps() needs to assert that it can handle a case by calling the former before doing any processing.

12575

This needs a comment. Are you trying to catch a case for special processing when each value is used the same number of times? Please describe the benefit of catching such cases.
Why are you trying to calculate it this early?

12581

drop_front() ?

12698

Are you trying to repurpose the data?
To be honest, I'd refrain from doing that.

12699–12708

Fuse these loops?

for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {
  if (Cnt >= Pos && Cnt < Pos + ReduxWidth)
    continue;
  ...
}
12710–12731

Fuse these loops?

13084

Please add a description.

vdmitrie added inline comments.Feb 13 2023, 3:25 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12581

drop_front() ?

This was a "nit" actually. drop_front() needs an ArrayRef, so it would look like the following:
ArrayRef<std::pair<Value *, unsigned>>(SameValuesCounter.takeVector()).drop_front()

Probably not worth the effort to save just one comparison.

vdmitrie added inline comments.Feb 13 2023, 6:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12577

"IsSupportedReusedRdxOp && HasReusedScalars" is the most common use pattern.
What about defining it as below

bool OptReusedScalars = IsSupportedReusedRdxOp &&
                        SameValuesCounter.size() != NumReducedVals;

and just checking OptReusedScalars everywhere across the code?
It is only used differently at line 12683, but the check for HasReusedScalars can be safely dropped there.

12581

Uhh, I did not realize takeVector clears the map.
Finally... this one should do the trick:

all_of(drop_begin(SameValuesCounter),
       [&SameValuesCounter](const std::pair<Value *, unsigned> &P) {
         return P.second == SameValuesCounter.begin()->second;
       })
13188

Maybe reorganize it a bit to add a return statement? It can produce the warning "control reaches end of non-void function [-Wreturn-type]".

ABataev added inline comments.Feb 15 2023, 8:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12569

No, it's still needed; it is used to produce the vector of unique scalars, see line 12577.

12575

It may help to produce better vector ops. Say we have 8 x aabbccdd. It will require a reduction of 8 elements, and the built tree will still operate on 4 x elements, but the last node will have a reshuffle from a 4 x vector to an 8 x vector, plus red(8 x). Instead we may have just red(4 x abcd)*2.

ABataev updated this revision to Diff 497694.Feb 15 2023, 8:42 AM

Address comments

vdmitrie added inline comments.Feb 16 2023, 10:04 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12567

This comment line seems to be misplaced.

13095–13113

What I basically see here is that you packed two methods into one and differentiate them by the VL argument.
Maybe just split this method in two? One to handle the identity optimization and another for the same-scale-factor optimization. They do not seem to share any essential code, only the switch. But sharing the switch does not look right to me, as these optimizations do not fully share reduction kinds.

13175

Are you relying on the instcombiner to optimize this further? I mean, for example, mul X, 2 -> add X, X.

ABataev added inline comments.Feb 16 2023, 10:34 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13095–13113

I thought about it; it's just that in many cases we reuse the same code, so I decided to put it into a single switch (just (f)add and xor have special processing).

13175

Yes, because there might be some other numbers, say 3, 4, etc.

ABataev updated this revision to Diff 498088.Feb 16 2023, 11:06 AM

Fix comment

vdmitrie added inline comments.Feb 16 2023, 11:39 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13095–13113

I don't buy this kind of savings. Ideally we want to assert that we have one of the add/fadd/xor reduction kinds when we fall into the same-scale-factor optimization. This is why I advocate for splitting it in two. We do not save a lot on a switch which will be handling just 3 cases plus the default one for the second method. The code inside the switch cases is already separate, so it is easy to do from the very beginning.

ABataev updated this revision to Diff 498125.Feb 16 2023, 12:47 PM

Address comments

Thanks. This revision basically looks good to me. So I'm going to accept it.
@RKSimon , do you have any remarks/comments/concerns?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13093–13096

okay :-) that works too

just a nit remark: you could reduce the arguments to just these if the code were a dedicated method:
(Value *VectorizedValue, IRBuilderBase &Builder, ArrayRef<Value *> VL, unsigned Cnt)

ABataev added inline comments.Feb 16 2023, 4:32 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13093–13096

It is not possible; both SameValuesCounter and TrackedToOrig are used for each element of VL in the second switch to build the correct constant vector.

vdmitrie added inline comments.Feb 16 2023, 4:51 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13093–13096

I don't see that.
Isn't the following the only use of them in this part (under the condition VL.empty())?

unsigned Cnt =
    SameValuesCounter.lookup(TrackedToOrig.find(VectorizedValue)->second);

Aha, I even overlooked that you don't actually use the VL elements in this part either.
So here are the only essential arguments for it:
(Value *VectorizedValue, IRBuilderBase &Builder, unsigned Cnt)

vdmitrie added inline comments.Feb 16 2023, 4:56 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
13093–13096

Ah, I missed that you said "second switch"; I was thinking about the first part of the method.
Maybe I wasn't clear enough.
I meant to have a separate method for the part which is currently under VL.empty():

Value *emitScaleForReusedOps(Value *VectorizedValue, IRBuilderBase &Builder, unsigned Scale) {
  ... here goes the code which is now under VL.empty() ...
}

ABataev updated this revision to Diff 498204.Feb 16 2023, 5:23 PM

Address comments

vdmitrie accepted this revision.Feb 16 2023, 5:43 PM

Perfect! Looks good. Thanks, Alexey!

This revision is now accepted and ready to land.Feb 16 2023, 5:43 PM
dyung added a subscriber: dyung.Mar 6 2023, 10:46 AM

@ABataev, one of our internal tests seems to hit an infinite loop in the compiler after this change. I have filed issue 61224 for the issue, can you take a look?

@ABataev, one of our internal tests seems to hit an infinite loop in the compiler after this change. I have filed issue 61224 for the issue, can you take a look?

Will check/fix ASAP.

Hi Alexey,
something isn't quite right with the transformation:
$ opt -passes=slp-vectorizer -slp-optimize-identity-hor-reduction-ops=false -S -o j.ll t.ll && llc -O0 j.ll && clang j.s -o a.out && ./a.out
209
$ opt -passes=slp-vectorizer -slp-optimize-identity-hor-reduction-ops=true -S -o j.ll t.ll && llc -O0 j.ll && clang j.s -o a.out && ./a.out
202

Hi Alexey,

It turned out that in some cases the transformation introduced in this change doesn't work properly. Let's consider this code:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @main() {
entry:
  %sq = alloca [64 x i32], i32 0, align 16
  
  %0 = getelementptr inbounds [64 x i32], ptr %sq, i64 0, i64 1
  %elt_1 = load i32, ptr %0, align 4
  %1 = getelementptr [64 x i32], ptr %sq, i64 0, i64 2
  %elt_2 = load i32, ptr %1, align 8
  %2 = getelementptr [64 x i32], ptr %sq, i64 0, i64 3
  %elt_3 = load i32, ptr %2, align 4
  %3 = getelementptr [64 x i32], ptr %sq, i64 0, i64 4
  %elt_4 = load i32, ptr %3, align 16  
  
  %4 = add i32 %elt_2, %elt_3
  %5 = add i32 %4, %elt_2
  %6 = add i32 %5, %elt_1  
  %7 = add i32 %6, %elt_4
  %8 = add i32 %7, %elt_3
  %9 = add i32 %8, %elt_2
  %10 = add i32 %9, %elt_1
  
  %call = tail call i32 (ptr, ...) null(ptr null, i32 %10)
  ret i32 0
}

It's easy to see that the final value of %10 is 2*%elt_1 + 3*%elt_2 + 2*%elt_3 + %elt_4. But actually we get this code:

$ opt -S test.ll -passes=slp-vectorizer
...
define i32 @main() {
entry:
  %sq = alloca [64 x i32], i32 0, align 16
  %0 = getelementptr inbounds [64 x i32], ptr %sq, i64 0, i64 1
  %1 = load <4 x i32>, ptr %0, align 4
  %2 = mul <4 x i32> %1, <i32 3, i32 2, i32 2, i32 1>
  %3 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %2)
  %call = tail call i32 (ptr, ...) null(ptr null, i32 %3)
  ret i32 0
}

The <i32 3, i32 2, i32 2, i32 1> value doesn't look correct; it should be <i32 2, i32 3, i32 2, i32 1>. Running the test with -slp-optimize-identity-hor-reduction-ops=false gives us this code:

$ opt -S test.ll -passes=slp-vectorizer -slp-optimize-identity-hor-reduction-ops=false
...
define i32 @main() {
entry:
  %sq = alloca [64 x i32], i32 0, align 16
  %0 = getelementptr inbounds [64 x i32], ptr %sq, i64 0, i64 1
  %1 = load <4 x i32>, ptr %0, align 4
  %2 = shufflevector <4 x i32> %1, <4 x i32> poison, <8 x i32> <i32 1, i32 1, i32 1, i32 2, i32 2, i32 0, i32 0, i32 3>
  %3 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  %call = tail call i32 (ptr, ...) null(ptr null, i32 %3)
  ret i32 0
}

<i32 1, i32 1, i32 1, i32 2, i32 2, i32 0, i32 0, i32 3> consists of two zeros and three ones, which is correct, meaning the incorrectness was introduced by this change.

I'll investigate all reports, thanks!

To add on to the above: the order of the values in the scale vector is determined by the default order of Candidates, and in these test cases there's a mismatch between that order and the lanes the scalar values occupy in the vectorized value.

Ok, I have a fix for the issue; it is very simple. I forgot to take the ordering into account, which can be dropped for reductions.

Matt added a subscriber: Matt.May 3 2023, 3:08 PM

@ABataev May I ask whether the change to constant folding semantics is intended with -slp-optimize-identity-hor-reduction-ops?

Cf. https://godbolt.org/z/G44xhKcTc

I wonder whether this may be connected to the constant folding similar to createOp related to this patch, https://github.com/llvm/llvm-project/issues/61224, https://github.com/llvm/llvm-project/commit/c411965820eb803dd7eac39f80357cad663b7ba0

  • LLVM version 16.0.0

opt -passes=slp-vectorizer

%mul = fmul reassoc nsz float 0x39B4484C00000000, 0x39B4484C00000000
ret float %mul
  • LLVM version 17.0.0git

opt -passes=slp-vectorizer -slp-optimize-identity-hor-reduction-ops=0

%mul = fmul reassoc nsz float 0x39B4484C00000000, 0x39B4484C00000000
ret float %mul
  • LLVM version 17.0.0git

opt -passes=slp-vectorizer

ret float 0.000000e+00

@ABataev May I ask whether the change to constant folding semantics is intended with -slp-optimize-identity-hor-reduction-ops?

Cf. https://godbolt.org/z/G44xhKcTc

I wonder whether this may be connected to the constant folding similar to createOp related to this patch, https://github.com/llvm/llvm-project/issues/61224, https://github.com/llvm/llvm-project/commit/c411965820eb803dd7eac39f80357cad663b7ba0

  • LLVM version 16.0.0

opt -passes=slp-vectorizer

%mul = fmul reassoc nsz float 0x39B4484C00000000, 0x39B4484C00000000
ret float %mul
  • LLVM version 17.0.0git

opt -passes=slp-vectorizer -slp-optimize-identity-hor-reduction-ops=0

%mul = fmul reassoc nsz float 0x39B4484C00000000, 0x39B4484C00000000
ret float %mul
  • LLVM version 17.0.0git

opt -passes=slp-vectorizer

ret float 0.000000e+00

Yes, the slp-vectorizer may fold constants as a side effect of the reduction optimization.

Matt added a comment.May 3 2023, 3:28 PM

Thanks!

This explains some of the behavior I've observed.
Context: a downstream implementation with custom floating-point trapping behavior which generally disables constant folding for would-be trapping operations (such as multiplying 1e-30f * 1e-30f, which would raise FE_UNDERFLOW due to underflow to zero).

The issue I've observed is exactly like in https://github.com/llvm/llvm-project/issues/61224, i.e., an infinite loop.
I suppose this is due to the fact that FMul (as in the example above) is not constant folded.

I think I may be able to deal with this downstream if I can prevent the constant folding alone while not regressing back to the infinite loop from issue 61224 (I suppose it ran into the infinite loop because the pass at that point didn't expect non-constant-folded instructions?).
As a workaround I may be able to disable AllowHorRdxIdenityOptimization whenever the custom trapping is enabled.
For my context, I'm wondering, would you happen to know whether disabling the constant-folding alone without running into an infinite loop issue is even possible?

I've noticed there are 4 occurrences of AllowHorRdxIdenityOptimization, and I can try to experiment with selective disablement (I presume the allConstant branches are most relevant here), but I thought I'd ask just in case this is a general issue.

Thanks!

This explains some of the behavior I've observed.
Context: a downstream implementation with custom floating-point trapping behavior which generally disables constant folding for would-be trapping operations (such as multiplying 1e-30f * 1e-30f, which would raise FE_UNDERFLOW due to underflow to zero).

The issue I've observed is exactly like in https://github.com/llvm/llvm-project/issues/61224, i.e., an infinite loop.
I suppose this is due to the fact that FMul (as in the example above) is not constant folded.

I think I may be able to deal with this downstream if I can prevent the constant folding alone while not regressing back to the infinite loop from issue 61224 (I suppose it ran into the infinite loop because the pass at that point didn't expect non-constant-folded instructions?).
As a workaround I may be able to disable AllowHorRdxIdenityOptimization whenever the custom trapping is enabled.
For my context, I'm wondering, would you happen to know whether disabling the constant-folding alone without running into an infinite loop issue is even possible?

I've noticed there are 4 occurrences of AllowHorRdxIdenityOptimization, and I can try to experiment with selective disablement (I presume the allConstant branches are most relevant here), but I thought I'd ask just in case this is a general issue.

Yes, use -slp-optimize-identity-hor-reduction-ops=false to disable the constant optimizations. Meanwhile, I will try to provide a more general fix for the problem.

Matt added a comment.May 3 2023, 3:36 PM

Thank you!

Matt added a comment.May 5 2023, 4:49 PM

Just tested with the commit and it appears to be working--thank you once again!