This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Account for loss due to FMAs when estimating fmul costs.
Needs ReviewPublic

Authored by fhahn on Aug 29 2022, 11:01 AM.

Details

Summary

This patch extends the cost model for FMUL operations to consider scalar
FMA opportunities when considering multiplies. If a scalar operation in
the bundle has a single FADD/FSUB user, it is very likely those
instructions can be fused to a single fmuladd/fmulsub operation, which
means the multiply is effectively free.

The patch counts the number of fusable operations in a bundle. If all
entries in the bundle can be fused, it is likely that the resulting
vector instructions can also be fused. In this case, consider both
version free and return the common cost, with a tiny bias towards
vectorizing.

Otherwise just consider the fusable scalar FMULs as free.

This recovers a regression in an application very sensitive to SLP
changes after 65c7cecb13f8d2132a54103903501474083afe8f and overall
improves performance by 10%. There is no other measurable impact on the
other applications in a large proprietary benchmark suite on ARM64.

Excessive SLP vectorization in the presence of scalar FMA opportunities
has also been discussed in D131028 and mentioned by @dmgreen.

D125987 also tries to address a similar issue, but with a focus on
horizontal reductions.

I'd appreciate if someone could give this a test on the X86 side.

Diff Detail

Event Timeline

fhahn created this revision.Aug 29 2022, 11:01 AM
fhahn requested review of this revision.Aug 29 2022, 11:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 11:01 AM

Similar question to D125987, why we you cannot put related cost changes to getArithmeticInstrCost?

Similar question to D125987, why we you cannot put related cost changes to getArithmeticInstrCost?

I think one reason is that TTI usually avoids to rely on the contex instruction/use-list scans. Not saying this is a blocker though if we converge on this as solution. The other issue is that we can't estimate the fact that the widened FMUL will also be applicable for fusion because no context instruction to pass exists yet.

I think it would be good to role out a solution that first fixes the issue in SLP first and then possible move it to TTI once it has proven itself. This is an issue that mostly impacts the SLP vectorizer AFAICT.

Similar question to D125987, why we you cannot put related cost changes to getArithmeticInstrCost?

I think one reason is that TTI usually avoids to rely on the contex instruction/use-list scans. Not saying this is a blocker though if we converge on this as solution.

Not quite true, since it accepts operands as an argument.

The other issue is that we can't estimate the fact that the widened FMUL will also be applicable for fusion because no context instruction to pass exists yet.

Do not see a problem here.

I think it would be good to role out a solution that first fixes the issue in SLP first and then possible move it to TTI once it has proven itself. This is an issue that mostly impacts the SLP vectorizer AFAICT.

Not sure this a good decision.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6449–6450

The cost-based analysis may lead to the wrong final decision here, need to return something like a flag instead or implement this analysis in TTI. What if the cost of Intrinsic::fmuladd != TCC_Basic?

fhahn updated this revision to Diff 456654.Aug 30 2022, 7:05 AM

Add missing checks for reassoc FMF flag, compare fmuladd cost with FMUL cost to see if fmuladd is at least as cheap as FMUL.

Similar question to D125987, why we you cannot put related cost changes to getArithmeticInstrCost?

I think one reason is that TTI usually avoids to rely on the contex instruction/use-list scans. Not saying this is a blocker though if we converge on this as solution.

Not quite true, since it accepts operands as an argument.

Yes it accepts it, but AFAICT one of the only user of getArithmeticInstrCost is the ARMTargetTransformInfo AFAICT, so I'd argue that using it at the momeent is quite uncommon.

The other issue is that we can't estimate the fact that the widened FMUL will also be applicable for fusion because no context instruction to pass exists yet.

Do not see a problem here.

We need a way to estimate if the vector fmul will be fusable with its users, otherwise we may incorrectly overstate the benefit of scalar FMAs.

For that we would need to check if all entries in the bundle are fusable with their users. But when we call getArithmeticInstrCost for with the vector type, we can only pass a single context instruction, which is one of the scalar instructions of the bundle. So I don't think getArithmeticInstrCost can really figure out whether the vector FMUL will be free or not.

I think if we would want to sink this into TTI, it would need to be integerated into getArithmeticInstrCost, rather than adding a very specialized interface that would only be used by the SLP vectorizer.

I think it would be good to role out a solution that first fixes the issue in SLP first and then possible move it to TTI once it has proven itself. This is an issue that mostly impacts the SLP vectorizer AFAICT.

Not sure this a good decision.

Could you elaborate your concerns with making the decision in SLP to start with? Even with the current implementation in SLP, the target specific information (availability and cost of FMAs) is accessed through TTI.

I think it would be good to first converge on making the correct decision to address the regressions on both X86 and AArch64 and make sure it is actually the right decision. Then decide if and where to generalize the decision.

fhahn marked an inline comment as done.Aug 30 2022, 7:10 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6449–6450

Thanks! Updated the code to check if the the cost of the scalar FMA is <= the cost of the scalar FMUL. I think that should indicate whether we can consider the FMUL as free if the FMA is used instead FMUL + FADD.

Add missing checks for reassoc FMF flag, compare fmuladd cost with FMUL cost to see if fmuladd is at least as cheap as FMUL.

Similar question to D125987, why we you cannot put related cost changes to getArithmeticInstrCost?

I think one reason is that TTI usually avoids to rely on the contex instruction/use-list scans. Not saying this is a blocker though if we converge on this as solution.

Not quite true, since it accepts operands as an argument.

Yes it accepts it, but AFAICT one of the only user of getArithmeticInstrCost is the ARMTargetTransformInfo AFAICT, so I'd argue that using it at the momeent is quite uncommon.

Anyway, you introducing another level of the dependency when we already have one. It makes the codebase complex and makes it harder to maintain.

The other issue is that we can't estimate the fact that the widened FMUL will also be applicable for fusion because no context instruction to pass exists yet.

Do not see a problem here.

We need a way to estimate if the vector fmul will be fusable with its users, otherwise we may incorrectly overstate the benefit of scalar FMAs.

For that we would need to check if all entries in the bundle are fusable with their users. But when we call getArithmeticInstrCost for with the vector type, we can only pass a single context instruction, which is one of the scalar instructions of the bundle. So I don't think getArithmeticInstrCost can really figure out whether the vector FMUL will be free or not.

It can be fixed by adding an extra parameter, extending existing function with array of instruction instead of single instruction, flags, etc.

I think if we would want to sink this into TTI, it would need to be integerated into getArithmeticInstrCost, rather than adding a very specialized interface that would only be used by the SLP vectorizer.

I think it would be good to role out a solution that first fixes the issue in SLP first and then possible move it to TTI once it has proven itself. This is an issue that mostly impacts the SLP vectorizer AFAICT.

Not sure this a good decision.

Could you elaborate your concerns with making the decision in SLP to start with? Even with the current implementation in SLP, the target specific information (availability and cost of FMAs) is accessed through TTI.

I think after this it will live here forever, nobody will try to rework it. It is the same problem as before - the patch does not try to address global problem, instead it introduces an immediate hack.

I think it would be good to first converge on making the correct decision to address the regressions on both X86 and AArch64 and make sure it is actually the right decision. Then decide if and where to generalize the decision.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6451–6453

Still it looks like a hack.

fhahn marked an inline comment as done.Aug 30 2022, 12:46 PM

I think after this it will live here forever, nobody will try to rework it. It is the same problem as before - the patch does not try to address global problem, instead it introduces an immediate hack.

Fair enough, I put up an alternative series that instead fully sinks the cost logic into TTI: D132968. Personally I am not sure if adding this to TTI is too specific (i.e. encoding SLP specific info into TTI), but if people generally prefer this direction I am happy to go that direction.