This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Enable fold select into operand for FAdd, FMul, FSub and FDiv.
ClosedPublic

Authored by huihuiz on Nov 8 2021, 4:27 PM.

Details

Summary

For FAdd, FMul, FSub and FDiv, fold the select into one of the operands to enable
further optimizations, e.g., floating-point reduction detection.

Turn code:

%C = fadd %A, %B
%D = select %cond, %C, %A

into:

%C = select %cond, %B, -0.000000e+00
%D = fadd %A, %C
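As a quick numerical sanity check (mine, not part of the patch; the helper names are hypothetical), the fadd case can be mimicked in Python. The key point is that the identity constant must be -0.0, since x + (-0.0) == x for every x including -0.0, whereas folding with +0.0 would turn -0.0 into +0.0:

```python
import math

def select_after_fadd(cond, a, b):
    # Original: %C = fadd %A, %B ; %D = select %cond, %C, %A
    c = a + b
    return c if cond else a

def fadd_after_select(cond, a, b):
    # Folded: %C = select %cond, %B, -0.0 ; %D = fadd %A, %C
    c = b if cond else -0.0
    return a + c

# NaN is left out only because == cannot compare it; the fold itself
# propagates NaN identically on both sides.
for cond in (True, False):
    for a in (1.5, -0.0, 0.0, float("inf")):
        for b in (2.5, -0.0, 0.0):
            x = select_after_fadd(cond, a, b)
            y = fadd_after_select(cond, a, b)
            assert x == y and math.copysign(1.0, x) == math.copysign(1.0, y)
```

With +0.0 as the constant, the sign check would fail for cond == False, a == -0.0, because (-0.0) + 0.0 is +0.0.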

Alive2 verification (with --disable-undef-input; it timed out otherwise):
FAdd - https://alive2.llvm.org/ce/z/eUxN4Y
FMul - https://alive2.llvm.org/ce/z/5SWZz4
FSub - https://alive2.llvm.org/ce/z/Dhj8dU
FDiv - https://alive2.llvm.org/ce/z/Yj_NA2

Diff Detail

Event Timeline

huihuiz created this revision.Nov 8 2021, 4:27 PM
huihuiz requested review of this revision.Nov 8 2021, 4:27 PM

Take the attached test.ll.

Run: opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-codegen-verify -analyze -polly-scops test.ll
Then you will see that the reduction is not detected:
MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]

    { Stmt_for_body[i0] -> MemRef_sum_014_reg2mem[0] };

After enabling the fold of select into an operand for FAdd, the reduction is detected.
Run (with this patch): opt -S -instcombine test.ll -o test2.ll
Then: opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-codegen-verify -analyze -polly-scops test2.ll

MustWriteAccess := [Reduction Type: +] [Scalar: 0]

    { Stmt_for_body[i0] -> MemRef_sum_014_reg2mem[0] };
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-linux-gnu"

define float @test(i32 %n, float* noalias nocapture readonly %a) {
entry:
  %sum.014.reg2mem = alloca float, align 4
  %sum.0.lcssa.reg2mem = alloca float, align 4
  br label %entry.split15

entry.split15:                                    ; preds = %entry
  br label %entry.split

entry.split:                                      ; preds = %entry.split15
  %cmp12 = icmp sgt i32 %n, 0
  store float 0.000000e+00, float* %sum.0.lcssa.reg2mem, align 4
  br i1 %cmp12, label %for.body.preheader, label %for.end

for.body.preheader:                               ; preds = %entry.split
  %wide.trip.count = zext i32 %n to i64
  store float 0.000000e+00, float* %sum.014.reg2mem, align 4
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %sum.014.reload = load float, float* %sum.014.reg2mem, align 4
  %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
  %0 = load float, float* %arrayidx, align 4
  %cmp1 = fcmp fast ogt float %0, 0.000000e+00
  %add = fadd fast float %0, %sum.014.reload
  %sum.1 = select i1 %cmp1, float %add, float %sum.014.reload
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  store float %sum.1, float* %sum.014.reg2mem, align 4
  br i1 %exitcond.not, label %for.end.loopexit, label %for.body

for.end.loopexit:                                 ; preds = %for.body
  %1 = load float, float* %sum.014.reg2mem, align 4
  store float %1, float* %sum.0.lcssa.reg2mem, align 4
  br label %for.end

for.end:                                          ; preds = %for.end.loopexit, %entry.split
  %sum.0.lcssa.reload = load float, float* %sum.0.lcssa.reg2mem, align 4
  ret float %sum.0.lcssa.reload
}
spatel added inline comments.Nov 9 2021, 5:44 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
257

Why exclude fdiv?

llvm/test/Transforms/InstCombine/select-binop-cmp.ll
307 ↗(On Diff #385653)

Please pre-commit the baseline tests. Some of these tests should include fast-math-flags on the FP binop, so we can verify that FMF propagates as expected.

330 ↗(On Diff #385653)

Why is the compare constant (0.0 or -0.0) relevant for this fold?

The true/false operands should be swapped so we have coverage for the pattern that replaces the true value with a constant. Similarly for fmul, there should be two tests.

spatel added inline comments.Nov 9 2021, 5:50 AM
llvm/test/Transforms/InstCombine/select-binop-cmp.ll
330 ↗(On Diff #385653)

A better question might be - why is there an fcmp in any of these tests? That isn't part of the minimal pattern is it?

Thanks Sanjay for the comments; I will update the unit tests as suggested.

I have a concern from checking with alive2: for fadd, https://alive2.llvm.org/ce/z/UjAMM_ alive2 complains about mismatched outputs.
For integer add this folding seems correct: https://alive2.llvm.org/ce/z/oE4UQJ

Let me know if such a transformation is illegal for floating-point types, as alive2 suggests, or if there is anything I missed, before I go fix the unit tests.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
257

I checked with alive2; it looks like adding fdiv, sdiv and udiv can trigger undefined behavior:
https://alive2.llvm.org/ce/z/KvFYev

I have a concern from checking with alive2: for fadd, https://alive2.llvm.org/ce/z/UjAMM_ alive2 complains about mismatched outputs.

For fadd, we need to use -0.0 as the binop identity constant. I'm not getting the online instance of Alive2 to verify this without timing out, but I think this should work given enough time:
https://alive2.llvm.org/ce/z/TtGDNR

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
257

I think the integer ops are not safe because we can trigger immediate UB that might have been avoided in the original code, but fdiv should be fine - it doesn't have any different UB characteristics vs. fadd/fmul/fsub in the default FP environment.
https://alive2.llvm.org/ce/z/Yj_NA2
(This is timing out on the online version when allowing undef/poison, so please double-check on a local machine.)
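As an aside (my reading of the fold, not quoted from the patch), the identity constants needed when the select feeds the second operand of each binop can be spot-checked in Python, sign of zero included:

```python
import math

# Candidate identities: x op ident == x for all x, preserving the sign of zero.
#   fadd: -0.0  (with +0.0, (-0.0) + 0.0 would give +0.0)
#   fsub: +0.0  (with -0.0, (-0.0) - (-0.0) would give +0.0)
#   fmul:  1.0
#   fdiv:  1.0
def check_identity(op, ident):
    for x in (3.5, -2.0, -0.0, 0.0, float("inf"), float("-inf")):
        y = op(x, ident)
        assert y == x and math.copysign(1.0, y) == math.copysign(1.0, x), (x, y)

check_identity(lambda a, b: a + b, -0.0)
check_identity(lambda a, b: a - b, 0.0)
check_identity(lambda a, b: a * b, 1.0)
check_identity(lambda a, b: a / b, 1.0)
```

This is exactly why fsub and fadd use opposite signs of zero as their identity.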

huihuiz updated this revision to Diff 386409.Nov 10 2021, 7:49 PM
huihuiz marked 2 inline comments as done.

Addressed review comments.
Pre-committed the baseline test test/Transforms/InstCombine/select-binop-foldable-floating-point.ll.

huihuiz retitled this revision from [InstCombine] Enable fold select into operand for FAdd, FMul, and FSub. to [InstCombine] Enable fold select into operand for FAdd, FMul, FSub and FDiv..Nov 10 2021, 7:50 PM
huihuiz edited the summary of this revision. (Show Details)
huihuiz added inline comments.Nov 10 2021, 7:54 PM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
257

Thanks Sanjay for the explanation. Included 'fdiv' to allow folding on the divisor operand.

I did an alive-tv run on my local machine with the timeout increased:
./alive-tv src.ll tgt.ll -smt-to 100000000

But alive did not converge after 1.5 hours. Doing an overnight run now; I will update with the result tomorrow morning.

llvm/test/Transforms/InstCombine/select-binop-cmp.ll
330 ↗(On Diff #385653)

Removed the fcmp instruction. The minimal pattern to fold select into a binop operand does not require an fcmp, so I am moving this into a separate test. The original test select-binop-cmp.ll checks that fcmp ignores the sign of 0.0 (for example, test @select_fadd_fcmp), so I am keeping the original test unchanged for now.

spatel added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
257

Hmm...I think it's a good sign if it did not find a failure after 1.5 hours, but that is a long time to spend on 2 instructions.
cc @nlopes @regehr @aqjune in case they see something wrong or room for improvement here:
https://alive2.llvm.org/ce/z/TtGDNR

Update on the overnight run for fdiv:

My run below did not return anything or capture any failure after 12 hours. 1000000000 is probably the maximum timeout we can set for alive.
But I agree that for two instructions, alive is taking too long to converge.

Please point out if there is anything wrong.

./alive-tv src.ll -smt-to 1000000000

----------------------------------------
define half @select_fdiv(i1 %cond, half %A, half %B) {
%0:
  %C = fdiv half %A, %B
  %D = select i1 %cond, half %C, half %A
  ret half %D
}
=>
define half @select_fdiv(i1 %cond, half %A, half %B) {
%0:
  %C = select i1 %cond, half %B, half 15360
  %D = fdiv half %A, %C
  ret half %D
}
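Side note on reading the printout above: Alive2 shows half constants as raw bit patterns, so `half 15360` is 0x3C00, the IEEE-754 binary16 encoding of 1.0 — i.e., the fdiv identity. This can be decoded with Python's struct 'e' (half-precision) format code:

```python
import struct

# 15360 == 0x3C00: sign 0, exponent 0b01111 (bias 15, so 2**0), mantissa 0 -> 1.0
assert 15360 == 0x3C00
value = struct.unpack('<e', (15360).to_bytes(2, 'little'))[0]
print(value)  # 1.0
```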
huihuiz updated this revision to Diff 388002.Nov 17 2021, 11:25 AM
huihuiz edited the summary of this revision. (Show Details)

Rebased; a gentle ping.

Let me know if there are concerns with the current implementation.

spatel accepted this revision.Nov 22 2021, 6:56 AM

LGTM.

You wrote that fdiv ran overnight without completing. I wonder if that opcode is slower than the others. For example, did fadd/fsub also time out?

This revision is now accepted and ready to land.Nov 22 2021, 6:56 AM

Thanks Sanjay for the review!
I did another local run for fadd; with "--disable-undef-input" it finished within a minute.
When removing "--disable-undef-input", it has been running for about an hour now and still has not finished.

The problem is probably the floating-point type; I assume we are hitting some limitations of alive2.


Reasoning about floats is already quite expensive, and mixing that with undefs is really hard.
The fact that it doesn't find any bug within 1 hour is a good sign, though. Right now there's nothing better we can do. We have plans to improve this in the longer term.

This revision was landed with ongoing or failed builds.Nov 22 2021, 3:10 PM
This revision was automatically updated to reflect the committed changes.

Just FYI: I found that this change might affect the Ptrdist-ks benchmark from the MultiSource suite on AMD Rome cores by up to 30%. I don't have many details, but I am leaving this here in case anyone else is affected.

LuoYuanke added a subscriber: LuoYuanke.EditedJan 29 2022, 6:37 AM

This patch causes a regression (https://godbolt.org/z/7WYboTe16) on the x86 AVX512 target, because AVX512 has mask instructions which can match the select instruction during ISel pattern matching. Moving the select before the fadd causes the pattern match to fail. I don't quite understand the optimization opportunity for this transformation. Could you elaborate with an example?

This patch has a side effect on AVX512 ISel. Another example is https://godbolt.org/z/xdr8xs4cb.

This needs the backend undo-transformation to be implemented first.

Please revert, @huihuiz.

Yeah, this patch was committed without checking how this transformation affects x86 codegen. :/ It needs to be reverted before LLVM 14.

cc @spatel


Note that this patch makes IR more consistent (FP ops are treated the same as integer ops). So is the integer codegen also not optimal?
https://godbolt.org/z/jd3q6Yq55

If there's still time for clang 14, can we fix x86 codegen rather than revert? There was already a similar bug filed for compares:
https://github.com/llvm/llvm-project/issues/51842

LuoYuanke added a comment.EditedJan 29 2022, 7:29 PM


I think the integer codegen is also NOT optimal with an AVX512 target. See https://godbolt.org/z/Wahef3b3v. For _mm_mask_add_epi64() and _mm_mask_sub_epi64(), it eventually generates extra instructions. But for integer mul, or, and, xor and shl in the test case, the InstCombine pass doesn't transform the code, so it still generates good code.

I still don't understand the benefit of moving the select instruction before the binary operation (add, sub, mul, ...). Can someone indicate why it is profitable? If it is profitable in the general case, I think we can call a TTI interface to ask the backend whether the transform is profitable and then make the decision to transform. We could add interfaces like the ones below to TTI.

+
+  bool isLegalMaskedFAdd(Type *DataType);
+  bool isLegalMaskedFSub(Type *DataType);
+  bool isLegalMaskedFMul(Type *DataType);
+  bool isLegalMaskedFDiv(Type *DataType);
+

I believe at least one reason for the integer combine was that it reduces the number of uses of the input.

ARM and RISCV both reverse it in the backend for scalars. Though they should probably be using FREEZE to do it.

X86 could probably reverse it for vectors.

For vectors - Arm MVE has to reverse these transforms (as in https://reviews.llvm.org/rGd9af9c2c5a53c9ba6aa0255240a2a40e8bea27aa). It is simpler in general to match (vselect cc, (add x, y), x) as a predicated add than a folded select with an identity element, but we do manage to reverse that at the moment, and I've not seen any cases of it folding to something we couldn't convert back. It can be quite important for performance in places, for example where the vectorizer produces a predicated reduction (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L4158, which is only enabled for Arm MVE at the moment).

SVE doesn't seem to attempt to convert add+select into a predicated add yet: https://godbolt.org/z/TPbE95h5x

Maybe worth mentioning:
https://reviews.llvm.org/D90113

Thanks for the link. I posted a much more limited x86-only patch -- D118644 -- to deal with the regression noted here.

Thank you guys for looking into this!

One of the optimizations enabled by this patch is reduction detection. The same applies to its integer equivalent.

Take the test.ll example attached in the very first comment and run:
opt -S -instcombine test.ll -o test2.ll
opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-codegen-verify -analyze -polly-scops test2.ll

Then you will see the reduction being detected, eventually allowing the loop to be vectorized:
MustWriteAccess := [Reduction Type: +] [Scalar: 0]

This folding in particular reduces %sum.014.reload to a single user, so we don't need to over-complicate the data-flow analysis algorithm used by the vectorizer, and we don't need to combine multiple reduction operators.

before

%cmp1 = fcmp fast ogt float %0, 0.000000e+00
%add = fadd fast float %0, %sum.014.reload
%sum.1 = select i1 %cmp1, float %add, float %sum.014.reload

after

%.inv = fcmp fast ole float %0, -0.000000e+00
%1 = select fast i1 %.inv, float -0.000000e+00, float %0
%sum.1 = fadd fast float %sum.014.reload, %1
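A small equivalence check on the before/after sequences above (mine, not from the review); NaN inputs are left out, which matches the nnan implied by the fast flags:

```python
def sum_before(s, x):
    # %cmp1 = fcmp fast ogt %0, 0.0 ; %add = fadd fast %0, %sum ; select
    return (x + s) if x > 0.0 else s

def sum_after(s, x):
    # %.inv = fcmp fast ole %0, -0.0 ; select -0.0 or %0 ; fadd fast
    t = -0.0 if x <= -0.0 else x
    return s + t

# x == 0.0 exercises the inverted compare: ogt(0.0, 0.0) is false and
# ole(0.0, -0.0) is true, so both sides leave the sum unchanged.
for s in (0.0, 1.25, -3.5):
    for x in (2.0, -2.0, 0.0, -0.0):
        assert sum_before(s, x) == sum_after(s, x)
```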
xbolva00 added a comment.EditedFeb 1 2022, 8:49 AM

Just FYI: I found that this change might affect the Ptrdist-ks benchmark from the MultiSource suite on AMD Rome cores by up to 30%. I don't have many details, but I am leaving this here in case anyone else is affected.

This is a huge regression. Does @spatel's patch fix it?

Any performance measurements of this patch alone? Or will Phoronix surprise us (good or bad?) again...

D118644 only handles vectors and AVX512 targets, so this regression wouldn't be fixed by it.

Following up:
https://github.com/llvm/llvm-project/issues/53866 lists the commits that are in main/branch to avoid the AVX512 FP regressions.

Further work to invert this transform in the backend is ongoing. For example:
D119654

This has to be done in relatively small steps to avoid regressions on x86 - it's not clear without looking at asm if the transformed code will be better (may depend on operation, data type, and subtarget features).