This is an archive of the discontinued LLVM Phabricator instance.

[SLP][NFC] Pre-commit test showing vectorization preventing FMA
ClosedPublic

Authored by wjschmidt on May 3 2022, 11:56 AM.

Details

Summary

When we generate a horizontal reduction of floating adds fed by a vectorized tree rooted at floating multiplies, we should account for the cost of no longer being able to generate scalar FMAs. Similarly, if we vectorize a list of floating multiplies that each feeds a single floating add, we should again account for this cost.

The first test was reduced from a case where the vectorizable tree looked barely profitable (cost -1) with a horizontal reduction, but produced substantially worse code than allowing the FMAs to be generated. The second test was derived from the first; we again generate a horizontal reduction here, but even if the horizontal reduction is forced to be unprofitable, we try to vectorize the multiplies. I have two follow-up patches to address these issues.

Diff Detail

Event Timeline

wjschmidt created this revision. May 3 2022, 11:56 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 3 2022, 11:56 AM
wjschmidt requested review of this revision. May 3 2022, 11:56 AM

Try to reduce test more. I think you can remove attributes, datalayout, pass triple as an argument, remove comments for incoming branches

vporpo added inline comments. May 3 2022, 12:14 PM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
37 ↗(On Diff #426796)

Nit: Please try to use slightly more readable names. These are chains of fmul, fadd so perhaps name them like:

%mul0 = fmul ...
%add0 = fadd ... %mul0 ...
%mul1 = fmul ...
%add1 = fadd ... %mul1 ...
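Written out as a complete function, the suggested naming scheme might look like this (a hypothetical sketch, not the actual test contents):

```llvm
define float @chain(float %a, float %b, float %c, float %d, float %acc) {
  %mul0 = fmul fast float %a, %b
  %add0 = fadd fast float %acc, %mul0
  %mul1 = fmul fast float %c, %d
  %add1 = fadd fast float %add0, %mul1
  ret float %add1
}
```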
40 ↗(On Diff #426796)

The test should still work without the loop

Try to reduce test more. I think you can remove attributes, datalayout, pass triple as an argument, remove comments for incoming branches

Thanks, I'm a noob at test case reduction. All these can indeed be removed.

wjschmidt added inline comments. May 3 2022, 1:03 PM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
37 ↗(On Diff #426796)

Thanks, will do.

40 ↗(On Diff #426796)

Sadly, it does not. If I remove the loop branches and the phi and replace the phi uses with undef, the vectorizer no longer finds the horizontal reduction profitable.

fhahn added a subscriber: fhahn. May 4 2022, 1:11 AM
fhahn added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
40 ↗(On Diff #426796)

Please avoid using undef. In most cases, it prevents automatic verification or at least makes it more difficult.

wjschmidt added inline comments. May 4 2022, 6:26 AM
llvm/test/Transforms/SLPVectorizer/X86/slp-reduc-fma-loss.ll
40 ↗(On Diff #426796)

OK, will replace all undefs with constants.

wjschmidt updated this revision to Diff 427037. May 4 2022, 9:10 AM

I've made all requested changes, with the exception that I can't remove the loop structure or any of the undefs without breaking the test. In both cases, we no longer generate the horizontal reduction. I've made all the reductions Alexey requested, and changed the variable names as Vasileios requested.

Hi! I'd like to ping this revision, please.

The undefs are still here.
Also, why are these sequences not optimized by the InstructionCombiner to FMA?

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Why? Are there any target-specific limitations?

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.
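For reference, the threshold can be forced from a test RUN line so the reduction no longer depends on a delicately balanced cost (a sketch; the value -100 is arbitrary and illustrative — a large negative threshold makes vectorization appear profitable):

```llvm
; RUN: opt < %s -passes=slp-vectorizer -slp-threshold=-100 -S | FileCheck %s
```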

Also, why are these sequences not optimized by the InstructionCombiner to FMA?

Phase ordering -- it seems the FMA combining happens quite late in the pipeline. When we replace the adds with a horizontal reduction, the opportunity is removed.

Why? Could you investigate it?

I'll have to refresh my memory, but my recollection is that the FMA combining is done in the MI-level instruction combiner.

Why? Are there any target-specific limitations?

I can't speak to the choices that were made by the InstCombine designers. There don't appear to be any remarks about it in the code. I do see that InstCombineMulDivRem.cpp goes out of its way to create opportunities for later FMA combining by generating FMul followed by FAdd or FSub, so it appears to be a deliberate choice not to create an FMA. There are also some small optimizations on existing Intrinsic::fma in InstCombineCalls.cpp, but nothing that creates one.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.

Thanks for the helpful suggestion! I was able to remove the undefs with a slight adjustment. I also ran across another FMA-inhibiting variant that I want to look at before re-posting. I appreciate the feedback.

The undefs are still here.

Yes, see above -- I was unable to find a sequence without the undefs that causes the horizontal reduction to kick in.

Use -slp-threshold option to avoid problems with the cost.

Thanks for the helpful suggestion! I was able to remove the undefs with a slight adjustment. I also ran across another FMA-inhibiting variant that I want to look at before re-posting. I appreciate the feedback.

You're welcome!

wjschmidt updated this revision to Diff 430413. May 18 2022, 9:49 AM
wjschmidt retitled this revision from [SLP][NFC] Pre-commit test showing horizontal reduction preventing FMA to [SLP][NFC] Pre-commit test showing vectorization preventing FMA.
wjschmidt edited the summary of this revision. (Show Details)

Thanks for the helpful comments to date! In this version, I've managed to remove the undefs from the original test. I also added a second test that removes the loop structure. For both tests, today we will generate an unprofitable horizontal reduction. With the first test, adding cost modeling to constrain the horizontal reduction allows FMAs to be generated. With the second test, this is insufficient, as we then decide to vectorize the multiplies in an unprofitable way. The two tests demonstrate the need to account for lost FMAs in the cost modeling both when vectorizing for a reduction and when vectorizing a list of multiplies.

I will have two follow-up patches. The first introduces costing for lost FMAs, and applies it to the horizontal reduction. The expected test case results are modified to show the first test is properly handled, but the second still has vectorized multiplies. The second patch applies the costing changes to the case of vectorizing a list, and both tests then leave the FMA opportunities in place. Breaking this into two patches hopefully makes it clearer what happens with the tests.

ABataev accepted this revision. May 18 2022, 9:51 AM

LG with some nits

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
11

Remove #0

52

Remove #0

This revision is now accepted and ready to land. May 18 2022, 9:51 AM

LG with some nits

Thanks, sorry those crept back in... Removed.

This revision was landed with ongoing or failed builds. May 19 2022, 6:58 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment. May 19 2022, 8:15 AM

Thanks for the updates!

You're welcome! Thanks for the good advice!