Page MenuHomePhabricator

[Analysis] Add simple cost model for strict (in-order) reductions
ClosedPublic

Authored by david-arm on Jul 5 2021, 6:17 AM.

Details

Summary

I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves continually splitting a vector up into halves and adding each half together until we get a scalar result. This is the default behaviour for integers, whereas for floating point we only do this if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing a strict vector reduction by treating it as a series of scalar operations in lane order. This is the case when FP reassociation is not permitted. For scalable vectors this is more difficult because at compile time we do not know how many lanes there are, and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dmgreen added inline comments.Jul 7 2021, 9:51 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Could we represent this by passing FastMathFlags, not a new enum? The fast math flags are what dictates in-order vs relaxed for example. And I don't believe that an integer "Ordered" reduction is ever different from a un-ordered reduction.

kmclaughlin added inline comments.Jul 7 2021, 10:04 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
55

Since the -force-vector-width flag is used in the RUN lines for this test, can you please remove this hint and add -force-vector-interleave=1?

sdesmalen added inline comments.Jul 9 2021, 12:50 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

The idea is that an fadda will have a low throughput because the operation is conceptually scalarized, because the fadd's can't be performed in parallel i.e.

double result = ((((init + v0) + v1) + v2) + ...) + vn; // where v0 .. vn are the lanes of the vector

Perhaps this is more a latency than a 'throughput' issue, but if an operation has a very long latency and blocks one of the functional units, I guess that has an impact the throughput as well.

The more important thing for now is that we want to have some conservative cost value for these, so that we don't assume in-order/in-loop reductions are cheap, so that we can tune it to return more sensible values later when we can experiment with this (after all, scalable auto-vec isn't fully functional yet). The other thing we're planning to improve is that when targeting a specific CPU, getMaxVScale returns the values from the max-vscale attribute in the IR, so that this cost-function no longer assumes the worst-case cost, but rather a more realistic cost based on the targeted vector length. The current implementation doesn't do this yet, but that's on our active to-do list.

https://godbolt.org/z/hfeaYh8r8
TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".

getArithmeticReductionCostSVE already returns Invalid for the multiplication, but only for the unordered reductions. This now doesn't happen for the ordered case. @david-arm can you look into that?

david-arm updated this revision to Diff 357465.Jul 9 2021, 3:12 AM
david-arm edited the summary of this revision. (Show Details)
  • Added doxygen comments above new enum and commented on the new RedType argument in the TTI interface.
  • Renamed Split enum member to TreeWise.
  • Removed the two FIXMEs in getTypeBasedIntrinsicInstrCost and ensured that getArithmeticReductionCost is called with the correct reduction type according to the FastMathFlags object. This meant fixing up some X86 tests that previously did not use the fast attribute on the intrinsic call. In general callers now have to explicitly add fast to the intrinsic for the tree-based reduction cost.
  • Added code check for supported opcodes (ISD::FADD) in AArch64TargetTransformInfo::getArithmeticReductionCost for the ordered case. For all other opcodes (ISD::FMUL, etc.) we return Invalid.
david-arm marked 13 inline comments as done.Jul 9 2021, 3:16 AM
david-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I guess we could do that, but the flags contain a lot more than just the reassoc flag, which is the only thing we care about here. I personally think it seems a bit more obvious to the caller what's going on if we specify the type directly. Not sure what others think?

dmgreen added inline comments.Jul 9 2021, 5:02 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

That would be my point really, that it contains more info. Why invent a new enum when the fast math flags already represent all the information we need here. It is the fundamental information we require here, so why not use it directly?

We are trying to get the cost of a reduction intrinsic, and that intrinsic has some FMFlags. Those flags may dictate the cost of the resulting code, so the FMF's should be passed to the cost function.

fmin/fmax could be in the same situations. It could be possible (although I'm not sure it occurs anywhere at the moment) that a "no-nan" fmax reduction has a different cost to one without the flag.

david-arm marked an inline comment as done.Jul 9 2021, 5:33 AM
david-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Would there ever be a case where we didn't have the fast math flags, but wanted to calculate a theoretical reduction cost anyway? Then you'd have to explicitly create a FastMathFlags object and have to know exactly which flag needs setting in order to get the desired cost? For example, the user would have to know to set the reassoc flag in order to get the tree-based reduction cost. I mean, not saying that's a bad thing necessarily, but it felt a bit less intuitive and would certainly require additional comments above the getArithmeticReductionCost function saying how the flags map precisely to the cost.

Would there potentially be value in having an overloaded interface that maps one to the other (FastMathFlags<>Enum), or do you strongly prefer a single interface that takes FastMathFlags?

I wonder if other reviewers have a preference here?

david-arm added inline comments.Jul 9 2021, 5:37 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I forgot to mention - just for reference I think the fmin/fmax cases go through a different cost function - getMinMaxReductionCost

RKSimon added inline comments.Jul 9 2021, 8:26 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1160

add explicit parentheses to make this even more obvious?

result = (((InitVal + v0) + v1) + v2) + v3

sdesmalen added inline comments.Jul 12 2021, 8:12 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Now that PairWise has been removed, I'd expect we'd only ever need to ask for the cost of either an ordered reduction or an unordered reduction ("fastest/most parallel way possible"). Would the fast-math flags ever need to specify anything else than those two options?

Integer operations have no FastMath flags, it seems awkward to have to construct them for an operation that doesn't support it, so I'd say the options are either to pass in a bool IsOrdered, or an Optional<FastMathFlags> which defaults to None (<=> implies 'unordered').

dmgreen added inline comments.Jul 13 2021, 1:44 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

OK. Adding ReductionType feels like re-adding IsPairwise to me, which we just went and removed. The combination of Integer reduction + "Ordered" should still be a "TreeWise" reduction, right? There is no such things as an ordered integer reduction from codegen perspective. The code at the beginning of getArithmeticReductionCost should really be if ((Opcode==FAdd || Opcode==FMul) && !FMF.allowReassoc()) ....

You can get empty FastMathFlags with just FastMathFlags() (and even give them a default value if needed)
I agree that min/max go through a different function - I think getMinMaxReductionCost should be updated with FMF too :) According to the ExpandReduction pass they require nnan for performing a shuffle reduction on fmin/fmax. But that cost could depend on what instructions the target has available.

An enum/bool feels like the wrong interface to me, from an llvm perspective. But I don't hold that opinion strongly enough to disagree if folks all think that is a better way to go.

david-arm added inline comments.Jul 13 2021, 7:02 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I'm trying to see what happens when we always pass in FastMathFlags and it seems there are occasions when we have to construct the flags manually for FP instructions. See llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:getReductionCost:

FastMathFlags FMF;
switch (RdxKind) {
case RecurKind::FAdd:
case RecurKind::FMul:
  FMF.setAllowReassoc();
  LLVM_FALLTHROUGH;
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, FMF);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

This does work, but in this particular case it doesn't look as nice as before so I'm tempted to go with an Optional as @sdesmalen suggested.

dmgreen added inline comments.Jul 13 2021, 7:38 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Can you explain how Optional makes things better?

It looks like the SLP vectorizer has already computed RdxFMF, which are set in the builder. Otherwise they should in general come from the instruction the reduction is created from, which in this case would be from FirstReducedVal. It looks like the RdxFMF are exactly what is needed though, could it use those directly?

david-arm added inline comments.Jul 13 2021, 8:04 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

So i just meant that with @sdesmalen 's suggestion of using None to imply unordered we could simply do:

switch (RdxKind) {
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: 
case RecurKind::FAdd:
case RecurKind::FMul: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, None);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

Ideally I'd like to avoid constructing the flags here so if there is a way to pull out existing flags that's better. I'll have a look.

david-arm updated this revision to Diff 358582.Jul 14 2021, 6:32 AM
  • Changed interface getArithmeticReductionCost to take a new FastMathFlags parameter, that is only useful for FP ops.
  • Tried to improve the comments above getArithmeticCost and explain how the FastMathFlags map to the choice of algorithm used.
  • Added new helper function called isOrderedReduction to avoid unnecessary duplication of logic in 5 places and make the behaviour consistent.
david-arm edited the summary of this revision. (Show Details)Jul 14 2021, 6:33 AM
david-arm added inline comments.Jul 14 2021, 6:36 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1978

I just spotted this unnecessary change and I'll fix it!

Thanks for adding the FMF's. This looks useful in the long run to me

llvm/include/llvm/Analysis/TargetTransformInfo.h
1151

Nit: I believe FMF is the common spelling throughout llvm.
And maybe isOrderedReduction -> requiresOrderedReduction ?

1160

This comment looks useful. I'm wondering if it's worth emphasizing a bit that these are the default lowerings, and the cost should be whatever the fastest way the target can legally lower the intrinsic would be.
Maybe spelling out that that float operations without Reassoc require ordered reductions that look like ((((init + v0) + v1) + v2) + ... And otherwise the reduction can happen in any order, which by default will follow a treewise reduction.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
1661

Could these still use FMF? It shouldn't matter much I suppose, either way they should be empty.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138

getScalarizationCostFactor implies that it will be scalarized by codegen, and sounds similar to the already present getScalarizationOverhead. What do you think about something like getMaxNumLanes, as that appears to be what it computes.

sdesmalen added inline comments.Jul 19 2021, 12:28 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Can you explain how Optional makes things better?

It's mostly conceptual that fast-math flags have no meaning for integer types, so it makes no sense to construct and pass some nonsensical FMF value for it.

1152

Why does this need to check the opcode?

1180

nit: Not sure if it's worth it, but should you make FPFlags a default argument? That may simplify some of the calls where the flags are unnecessary.

david-arm added inline comments.Jul 19 2021, 1:41 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

Sadly this is a result of passing using FastMathFlags to determine the algorithm. The flags are usually empty for integer operations, which means the allow reassoc flag will not be set. If we don't check the opcode then we end up using strict reductions for all integer operations.

sdesmalen added inline comments.Jul 19 2021, 2:59 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

If the default FMF constructor results in setting AllowReassoc=false, then I think that's a more concrete argument for using Optional<FastMathFlags>, i.e. if there are no fast-math flags, there is nothing that can ask for a 'strict ordering', and so the function would return false.

dmgreen added inline comments.Jul 19 2021, 3:47 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

The only reduction that _can_ require strict orderings are fp operations without reassoc set. It's not a product of the fast math flags alone that means codegen will have to expand in-order. It is a product of the opcode _and_ fast math flags.

Integer reduction cannot be expanded in-order, and the cost needn't ever look at FMF's for them.

david-arm added inline comments.Jul 19 2021, 3:49 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

I think @sdesmalen's point is more that we now have a bit of an ugly check for the opcode in addition to the flags. I agree that using an Optional<FastMathFlags> here is nicer than always having to look for a FP opcode.

foad removed a subscriber: foad.Jul 19 2021, 3:57 AM
david-arm added inline comments.Jul 19 2021, 4:01 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138

Sure I can do that. I named it this way because I imagined we'll want to tweak this in future to something less pessimistic, perhaps based on mid-point between min and max? However, getMaxNumElements or Lanes works for now!

@david-arm Please can you rebase? I've tried to add fast/non-fast reduction costs coverage on X86

david-arm added inline comments.Jul 19 2021, 5:27 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1160

Hi @dmgreen, thanks for the suggestions. In the 1) case below I do state this is the default for integer operations and FP when reassociation is allowed. I was trying to list the two types of reduction, then explain which is the default. What I could do is emphasise early on that "Tree-wise" is the default, i.e

  1. Tree-wise. This is the default, 'fast' reduction ...

I can also mention after 2) that the cost should correspond to the fastest way the target can lower the intrinsic?

david-arm updated this revision to Diff 359765.Jul 19 2021, 6:09 AM
  • Rebased.
  • Changed FastMathFlags argument to be Optional<FastMathFlags>.
  • Renamed isOrderedReduction -> requiresOrderedReduction
  • Renamed getScalarizationCostFactor -> getMaxNumElements
RKSimon added inline comments.Jul 19 2021, 6:42 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
77

please can you remove this change now that we have this test coverage below

151

please can you remove this change now that we have this test coverage below

llvm/test/Analysis/CostModel/X86/reduce-fmul.ll
77

please can you remove this change now that we have this test coverage below

151

please can you remove this change now that we have this test coverage below

david-arm added inline comments.Jul 19 2021, 7:13 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
151

Hi @RKSimon, the problem is that if I revert the change the tests fail. This is because we were previously calling the intrinsic without fast, which means that we choose the very expensive cost for an ordered reduction. If you're happy I can remove the fast attribute, but I will then have to update all of the costs?

RKSimon added inline comments.Jul 19 2021, 7:30 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
151

Yes please - the update_analyze_test_checks.py script should do it all for you

david-arm updated this revision to Diff 360118.Jul 20 2021, 7:18 AM
  • Reverted previous changes to X86 tests and re-ran the update_analyze_test_checks.py script to calculate intrinsic costs without the 'fast' flag.
david-arm marked 5 inline comments as done.Jul 20 2021, 7:19 AM
RKSimon added inline comments.Jul 20 2021, 8:04 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

This looks too high for what is just a single f64 fadd (SSE floating point extract from 0 is free) - it might be a problem in x86 scalarization overhead ?

david-arm added inline comments.Jul 20 2021, 8:13 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

Possibly so - I guess that's not something that should be fixed in this patch though? I imagine that's equally a problem for other X86 operations on a v1f64 type, i.e. see the end of getIntrinsicInstrCost that does the same thing and makes no special case for v1XX types.

RKSimon added inline comments.Jul 20 2021, 8:24 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

Agreed, we can investigate that in a followup

RKSimon added inline comments.Jul 22 2021, 3:15 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7261

Aren't the calls to TTI.getArithmeticReductionCost the same?

InstructionCost BaseCost = TTI.getArithmeticReductionCost(
      RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

if (useOrderedReductions(RdxDesc))
  return BaseCost;
david-arm updated this revision to Diff 360760.Jul 22 2021, 4:00 AM
  • Simplified call to getArithmeticReductionCost in LoopVectorize.cpp
david-arm marked an inline comment as done.Jul 22 2021, 4:01 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7261

Good point! That's just a leftover from the initial version that used an enum.

Thanks for the changes @david-arm. Just left a few more nits.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152–1154

nit: return FMF && !(*FMF).allowReassoc());

1158–1178

I think the opcode is no longer what decides what the type of the reduction is. It is mostly the type and the FMF.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2086

nit: please add a comment saying that targets must implement a default for the scalable case, because we can't know how many lanes the vector has.

2088

nit: add empty line above cast<FixedVectorType>

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1548

Why are you changing the GatherScatterOpCost in this patch?

david-arm marked an inline comment as done.Jul 22 2021, 5:09 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1548

I'm not actually changing it - I just wanted to avoid duplicating the max vscale code for the ordered reduction case so I moved the logic into a common function getMaxNumElements. This is all related to calculating the scalarisation cost.

sdesmalen added inline comments.Jul 22 2021, 5:47 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
140

nit: VF.getFixedValue()

david-arm updated this revision to Diff 360800.Jul 22 2021, 7:04 AM
  • Addressed review comments
david-arm marked 25 inline comments as done.Jul 22 2021, 7:06 AM
sdesmalen accepted this revision.Jul 23 2021, 12:57 AM

Thanks for all the changes @david-arm, the patch looks good to me!

This revision is now accepted and ready to land.Jul 23 2021, 12:57 AM
RKSimon accepted this revision.Jul 23 2021, 5:42 AM

LGTM - cheers

This revision was landed with ongoing or failed builds.Jul 26 2021, 2:26 AM
This revision was automatically updated to reflect the committed changes.
RKSimon added inline comments.Jul 26 2021, 7:42 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2095

@david-arm I think this needs to be thisT()->getArithmeticInstrCost to use the correct TTI implementation?