This is an archive of the discontinued LLVM Phabricator instance.

[SLP][NFC] Pre-commit test showing vectorization preventing FMA
ClosedPublic

Authored by wjschmidt on May 3 2022, 11:56 AM.

Details

Summary

When we generate a horizontal reduction of floating adds fed by a vectorized tree rooted at floating multiplies, we should account for the cost of no longer being able to generate scalar FMAs. Similarly, if we vectorize a list of floating multiplies that each feeds a single floating add, we should again account for this cost.

The first test was reduced from a case where the vectorizable tree looked barely profitable (cost -1) with a horizontal reduction, but produced substantially worse code than allowing the FMAs to be generated. The second test was derived from the first; we again generate a horizontal reduction here, but even if the horizontal reduction is forced to be unprofitable, we try to vectorize the multiplies. I have two follow-up patches to address these issues.

Diff Detail

Event Timeline

wjschmidt created this revision. May 3 2022, 11:56 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 3 2022, 11:56 AM
wjschmidt requested review of this revision. May 3 2022, 11:56 AM

Try to reduce test more. I think you can remove attributes, datalayout, pass triple as an argument, remove comments for incoming branches

vporpo added inline comments. May 3 2022, 12:14 PM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
37 ↗(On Diff #426796)

Nit: Please try to use slightly more readable names. These are chains of fmul, fadd so perhaps name them like:

%mul0 = fmul ...
%add0 = fadd ... %mul0 ...
%mul1 = fmul ...
%add1 = fadd ... %mul1 ...
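Written out as a complete function, the suggested naming scheme might look like this (a hypothetical sketch, not the actual test contents):

```llvm
define float @chain(float %a, float %b, float %c, float %d, float %acc) {
  %mul0 = fmul fast float %a, %b
  %add0 = fadd fast float %acc, %mul0
  %mul1 = fmul fast float %c, %d
  %add1 = fadd fast float %add0, %mul1
  ret float %add1
}
```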
40 ↗(On Diff #426796)

The test should still work without the loop

Try to reduce test more. I think you can remove attributes, datalayout, pass triple as an argument, remove comments for incoming branches

Thanks, I'm a noob at test case reduction. All these can indeed be removed.

wjschmidt added inline comments. May 3 2022, 1:03 PM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
37 ↗(On Diff #426796)

Thanks, will do.

40 ↗(On Diff #426796)

Sadly, it does not. If I remove the loop branches and the phi and replace the phi uses with undef, the vectorizer no longer finds the horizontal reduction profitable.

fhahn added a subscriber: fhahn. May 4 2022, 1:11 AM
fhahn added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
40 ↗(On Diff #426796)

Please avoid using undef. In most cases, it prevents automatic verification or at least makes it more difficult.

wjschmidt added inline comments. May 4 2022, 6:26 AM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
40 ↗(On Diff #426796)

OK, will replace all undefs with constants.

wjschmidt updated this revision to Diff 427037. May 4 2022, 9:10 AM

I've made all requested changes, with the exception that I can't remove the loop structure or any of the undefs without breaking the test. In both cases, we no longer generate the horizontal reduction. I've made all the reductions Alexey requested, and changed the variable names as Vasileios requested.

Hi! I'd like to ping this revision, please.

The undefs are still here.
Also, why are these sequences not optimized by the InstructionCombiner to FMA?

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Why? Are there any target-specific limitations?

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.
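For reference, the threshold can be forced from a test RUN line so the reduction no longer depends on a delicately balanced cost (a sketch; the value -100 is arbitrary and illustrative — a large negative threshold makes vectorization appear profitable):

```llvm
; RUN: opt < %s -passes=slp-vectorizer -slp-threshold=-100 -S | FileCheck %s
```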

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Why? Are there any target-specific limitations?

I can't speak to the choices that were made by the InstCombine designers. There don't appear to be any remarks about it in the code. I do see that InstCombineMulDivRem.cpp goes out of its way to create opportunities for later FMA combining by generating FMul followed by FAdd or FSub, so it appears to be a deliberate choice not to create an FMA. There are also some small optimizations on existing Intrinsic::fma in InstCombineCalls.cpp, but nothing that creates one.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.

Thanks for the helpful suggestion! I was able to remove the undefs with a slight adjustment. I also ran across another FMA-inhibiting variant that I want to look at before re-posting. I appreciate the feedback.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.

Thanks for the helpful suggestion! I was able to remove the undefs with a slight adjustment. I also ran across another FMA-inhibiting variant that I want to look at before re-posting. I appreciate the feedback.

You're welcome!

wjschmidt updated this revision to Diff 430413. May 18 2022, 9:49 AM
wjschmidt retitled this revision from [SLP][NFC] Pre-commit test showing horizontal reduction preventing FMA to [SLP][NFC] Pre-commit test showing vectorization preventing FMA.
wjschmidt edited the summary of this revision. (Show Details)

Thanks for the helpful comments to date! In this version, I've managed to remove the undefs from the original test. I also added a second test that removes the loop structure. For both tests, today we will generate an unprofitable horizontal reduction. With the first test, adding cost modeling to constrain the horizontal reduction allows FMAs to be generated. With the second test, this is insufficient, as we then decide to vectorize the multiplies in an unprofitable way. The two tests demonstrate the need to account for lost FMAs in the cost modeling both when vectorizing for a reduction and when vectorizing a list of multiplies.

I will have two follow-up patches. The first introduces costing for lost FMAs, and applies it to the horizontal reduction. The expected test case results are modified to show the first test is properly handled, but the second still has vectorized multiplies. The second patch applies the costing changes to the case of vectorizing a list, and both tests then leave the FMA opportunities in place. Breaking this into two patches hopefully makes it clearer what happens with the tests.

ABataev accepted this revision. May 18 2022, 9:51 AM

LG with some nits

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
11

Remove #0

52

Remove #0

This revision is now accepted and ready to land. May 18 2022, 9:51 AM

LG with some nits

Thanks, sorry those crept back in... Removed.

This revision was landed with ongoing or failed builds. May 19 2022, 6:58 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment. May 19 2022, 8:15 AM

Thanks for the updates!

You're welcome! Thanks for the good advice!