This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Account for cost of removing FMA opportunities
Needs ReviewPublic

Authored by wjschmidt on May 19 2022, 9:32 AM.

Details

Summary

When we generate a horizontal reduction of floating adds fed by a vectorized tree rooted at floating multiplies, or when we vectorize a list of floating multiplies that each feeds a floating add/subtract, account for the cost of no longer being able to generate scalar FMAs.

Diff Detail

Event Timeline

wjschmidt created this revision.May 19 2022, 9:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 9:32 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
wjschmidt requested review of this revision.May 19 2022, 9:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 9:32 AM
fhahn added a comment.May 19 2022, 9:36 AM

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

ABataev added inline comments.May 19 2022, 9:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

I think it must be a part of TTI::getArithmeticReductionCost

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

Yes. I have a second patch that covers that case. I split the patches so we could see that we still have the pessimization in the second test case after the reduction is avoided, which the second patch will fix.

On what target are you seeing this problem?

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
61

why is this a constant?

wjschmidt added inline comments.May 19 2022, 9:54 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

Because this will also be called in a case where it's not part of a reduction, I'd like to keep it separate. (See my comment in the summary.) Perhaps I shouldn't have split the patches so this would be more evident.

On what target are you seeing this problem?

-mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu

wjschmidt added inline comments.May 19 2022, 9:56 AM
llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
61

This was reduced from a big messy function. Bugpoint ended up with an undef here and elsewhere, and previous reviewers asked me to remove the undefs.

ABataev added inline comments.May 19 2022, 10:00 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

Anyway, maybe better to make it a part of TTI rather than put it to SLPVectorizer?

wjschmidt added inline comments.May 19 2022, 10:06 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

That's certainly possible, and probably a good idea. We'd need to pull out the final multiplication by Width and the SLP-specific debug stuff, but with that done it would fit nicely in TTI, and there might be potential for reuse.

RKSimon added inline comments.May 19 2022, 10:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

Ideally the getArithmeticReductionCost costs wouldn't be affected but getArithmeticInstrCost could be adjusted to account for potential FMA patterns in the scalar/vector FADD costs

ABataev added inline comments.May 19 2022, 10:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

Yep, good idea!

wjschmidt added inline comments.May 19 2022, 11:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10086

I'll look into possibilities. That might be a bit awkward for the other use case. It's clear that I should have combined the two patches to make this easier to judge, so I'll do that in the next revision. Thanks for the ideas!

wjschmidt updated this revision to Diff 431351.May 23 2022, 6:24 AM
wjschmidt edited the summary of this revision. (Show Details)

This revision now includes fixes for both cases: a horizontal reduction of fadds fed by a vectorized tree rooted at fmuls, and a vectorized tree of fmuls that feeds arbitrary fadds and/or fsubs. It was misguided to break this up into two patches without showing you both...

The reviewers of the previous revision requested that the FMA cost calculation be moved into TTI. I've done that here, but I kept it as a separate calculation rather than incorporating it somehow into getArithmeticInstrCost, for a couple of reasons. If I understood the proposal correctly, it was to have getArithmeticInstrCost when called on an FAdd (or FSub) produce a lesser cost when fed by an FMul as one of the operands, reporting that the FAdd/FSub is essentially free in those circumstances. (If I've misunderstood, please correct me.)

First, I didn't find this particularly helpful for the use cases I'm addressing here. For example, for the tryToVectorizeList() case, we have a collection of FMul instructions. We can follow the defs to get the FAdd/FSubs to ask about their cost, but that will just get us a raw number. It doesn't tell us how much we ought to add back to the cost for breaking the FMA. (I don't think that just having the altered cost for FAdd/FSub will avoid us considering the FMul vectorization in the first place.)

Second, I worry about unintended effects throughout the middle end by altering the cost model this way. I think it's better to provide the additional interface so that the lost FMA cost can be taken into account only where it's known to matter.

Thanks for the reviews!

ABataev added inline comments.May 25 2022, 7:55 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Can this be hidden in the TTI completely anyhow? To avoid the extra calls of the adjustForFMAs completely in SLP.

RKSimon added inline comments.May 25 2022, 9:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

+1 - the only problem if that this will probably end up having to be added on a target-by-target basis in getArithmeticInstrCost - hopefully we can provide a helper wrapper so its only a one-liner in each target.

wjschmidt added inline comments.May 25 2022, 9:56 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Hi -- sorry, I'll be out of town for the next week, so can't look at this right away. In the meantime, can you please give me a little bit more idea exactly what you're looking for? I've expressed my concerns about making direct changes to getArithmeticInstrCost for this in the comments for this revision -- can you please address those? It's not at all clear to me what you're looking for other than "hide it from SLP somehow."

wjschmidt added inline comments.Jun 3 2022, 7:00 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
923

Pasto here, this should be Instruction::FAdd.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

I took a run at prototyping this yesterday for X86 TTI. I put a shim in X86TTIImpl::getArithmeticInstrCost() to look for cases of FAdd and FSub that we expect to be folded into FMAs, and reduce their cost appropriately. This was sufficient to prevent the horizontal reduction from happening, but as I expected, this doesn't help us with the tryToVectorizeList() case, where the cost modeling is only looking at the multiplies and not the add/sub fed by them. (I.e., the second variant in the test case isn't fixed.) I don't see any way to avoid explicit FMA checking in the SLP vectorizer to solve this, as explained in my summary for this revision.

Given this result, I'd like to proceed with the patch provided here. It would be pleasant if we could hide the issue entirely from SLP, but it doesn't seem practical. Thanks!

wjschmidt updated this revision to Diff 436946.Jun 14 2022, 2:34 PM
wjschmidt retitled this revision from [SLP] Account for cost of removing FMA opportunities by horizontal reduction to [SLP] Account for cost of removing FMA opportunities.

Rebased; fixed pasto in getFMACostSavings(); ensured that getFMACostSavings() returns nonnegative values; and made a slight simplification in adjustForFMAs(). Thanks!

ABataev added inline comments.Jun 15 2022, 6:10 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Could you provide more details why this does not help with tryToVectorizeList?

wjschmidt marked an inline comment as not done.Jun 15 2022, 6:14 AM
wjschmidt added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Because tryToVectorizeList doesn't see the adds at all. The root of the tree is the multiplies. Each of those multiplies happens to feed an add that's outside the vectorizable tree. There's never any call to getArithmeticInstrCost for the adds, so modifying the cost there is of no help.

ABataev added inline comments.Jun 15 2022, 6:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Can we look through the users of these multiplies?

wjschmidt added inline comments.Jun 15 2022, 6:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

That's exactly what this patch is doing.

wjschmidt added inline comments.Jun 15 2022, 6:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

To be clear, the current patch does fix the tryToVectorizeList case. It just doesn't hide that aspect of things in TTI.

ABataev added inline comments.Jun 15 2022, 6:21 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

I mean, can we do it in TTI?

wjschmidt added inline comments.Jun 15 2022, 6:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

The problem is that you have to pick one way or the other to look at things if you do it in TTI. You either have to say, the cost of an add is free if it's fed by a multiply, or the cost of a multiply is free if it feeds into an add. If you do both of these things, you end up undercounting the cost. I don't see a practical way to do this in TTI that solves both problems in SLP. One is looking at things from a multiply point of view and the other from the add point of view.

ABataev added inline comments.Jun 15 2022, 6:29 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

Yes, agree. Can we try to have free muls if they are used by adds?Do you foresee any problems with this approach?

wjschmidt added inline comments.Jun 15 2022, 6:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

I implemented a prototype that gave us free adds if they use multiplies. I didn't see any regressions from this. I haven't tried it the other way around (free multiplies if used only by a single add each). If we did that, my guess is we probably still wouldn't see regressions. However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution. Either way we end up having to have code in SLP that special-cases FMAs somewhere.

My feeling is that it's sensible for TTI to provide the interface that calculates the savings of an FMA, but the burden needs to be on the users of TTI (SLP for now, but perhaps others) to determine when to use it. I really do understand why it would be preferable to hide everything in the TTI layer, so nobody ever has to worry about it again, but because users can come at the problem from two directions, it's hard for a simple interface like getArithmeticInstrCost to hide the complexity.

ABataev added inline comments.Jun 15 2022, 6:42 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution.

What's the problem with the reduction here?

I understand, but let's try to evaluate possible solutions before having final decision.

wjschmidt added inline comments.Jun 15 2022, 6:53 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.

It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity.

I.e., we might be able to hide everything in TTI specifically for today's needs for SLP, but it doesn't appear to me to be a satisfactory general solution that hides everything in TTI for all possible uses. That's what makes me reluctant to do this.

ABataev added inline comments.Jun 15 2022, 6:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.

We can add a check for this in TTI too.

It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity.

Same reason applies to SLP. If miss something, we need to fix it for all passes. We'd better to have one interface for all transformation passes, if possible.

wjschmidt added inline comments.Jun 15 2022, 7:00 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.

We can add a check for this in TTI too.

But you can't, because then you've said both the multiply and the add are free, as I said before. This will break some cost calculations.

ABataev added inline comments.Jun 15 2022, 7:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

But for this we'll need to modify a different function (I thought about getArithemticReductionCost()), though I'm not sure that we even need to that. Everything still should be handled by the getArithmeticCost function.

wjschmidt added inline comments.Jun 15 2022, 7:11 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10079

I remain very uncomfortable with trying to hide this in getArithmeticInstrCost for the reasons stated already; it's just too myopic an interface to be able to "look both ways" without making a costing error somewhere. But I may just be unable to see an obvious solution. Perhaps someone with more expertise in TTI would like to take a run at this.

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

I stumbled upon an issue on AArch64 where ignoring scalar fusion opportunities notably regresses performance of some SLP sensitive applications. I put up D132872 to sketch a potential solution for the non-horizontal reduction case.