This is an archive of the discontinued LLVM Phabricator instance.

[Analysis] Add simple cost model for strict (in-order) reductions
ClosedPublic

Authored by david-arm on Jul 5 2021, 6:17 AM.

Details

Summary

I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves continually splitting a vector into halves and adding the halves together until we get a scalar result. This is the default behaviour for integers, whereas for floating point we only do this if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing a strict vector reduction by treating it as a series of scalar operations in lane order. This is the case when FP reassociation is not permitted. For scalable vectors this is more difficult because at compile time we do not know how many lanes there are, so we use the worst-case maximum vscale value. (A small sketch of both reduction shapes follows this list.)
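
For illustration, below is a minimal C++ sketch of the two reduction shapes being costed. It is purely conceptual - plain scalar code over a fixed-width 8-element array, not the patch itself or LLVM's actual lowering:

#include <array>
#include <cstddef>

// 1. Tree-wise: repeatedly fold the upper half of the vector into the lower
//    half until a single lane remains. For FP this is only legal when
//    reassociation is allowed.
float treeWiseReduce(std::array<float, 8> V) {
  for (std::size_t Width = V.size() / 2; Width >= 1; Width /= 2)
    for (std::size_t I = 0; I < Width; ++I)
      V[I] += V[I + Width];
  return V[0];
}

// 2. Ordered (strict): accumulate lane by lane in order, as required when FP
//    reassociation is not permitted; modelled as one scalar fadd per lane.
float orderedReduce(float Init, const std::array<float, 8> &V) {
  float Acc = Init;
  for (float Lane : V)
    Acc += Lane;
  return Acc;
}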

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Diff Detail

Event Timeline

david-arm created this revision.Jul 5 2021, 6:17 AM
david-arm requested review of this revision.Jul 5 2021, 6:17 AM
Herald added a project: Restricted Project.Jul 5 2021, 6:17 AM
Matt added a subscriber: Matt.Jul 5 2021, 9:50 AM
sdesmalen added inline comments.Jul 6 2021, 3:22 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

Can you unify the interface for getArithmeticReductionCost to take an enum which specifies the kind of reduction we want: tree, pair-wise or in-order?
If there is only a single getReductionCost interface whose parameters specify what cost is being asked for, then that's difficult to get wrong (you need to fill in the parameters); but if there's another, slightly more specialized variant available, people may not realise it exists and just call the most basic cost interface instead.

2086

This seems to be missing the cost of extracting each of the elements, which you can get using getScalarizationOverhead(..., /*bool Extract=*/true, ...).
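
A rough sketch of what that might look like for the fixed-width path (the overloads and surrounding variables here are assumptions for illustration, taken to sit inside BasicTTIImplBase::getArithmeticReductionCost where Opcode, Ty and thisT() are in scope - not the patch's final code):

// Charge one extract per lane on top of the per-lane scalar arithmetic.
unsigned NumElts = cast<FixedVectorType>(Ty)->getNumElements();
InstructionCost ExtractCost =
    thisT()->getScalarizationOverhead(Ty, /*Insert=*/false, /*Extract=*/true);
InstructionCost ArithCost =
    NumElts * thisT()->getArithmeticInstrCost(Opcode, Ty->getElementType());
return ExtractCost + ArithCost;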

Can we add cost model tests too? There is currently a fixme for // FIXME: Add new flag for cost of strict reductions.

Is it better to add a new method, or to add a flag to the existing methods and base the decision on the fast math flags? Any FP reduction, not just from the vectorizer, may be strict/in-order if the FMFs on it do not allow reassoc. Similarly for fmin/fmax, where nnan/ninf/reassoc may make a difference.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2083

Should this be checking for scalable vectors?

2085

I think this should also include the cost of extracts (which will sometimes be free, but that is target specific), probably using the ScalarizationOverhead.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF implied by the -mcpu argument, if one is provided. If it is not, then we would guess at a value; probably VF=2 would be a sensible default. This is using the maximum possible VF, which sounds like a large over-estimate in most cases.

Can you speak to why the max VF is a better value to use?

I'm not sure I understand why this is scalarizing though.

dmgreen added inline comments.Jul 6 2021, 3:31 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

Oh you got here first. :)

I think we can possibly remove IsPairwise entirely. It seems to be used in only a single place nowadays, and that place doesn't look to me like it would be giving very good cost estimates.

david-arm added inline comments.Jul 6 2021, 3:48 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2083

I don't think we can easily calculate a cost for scalable vectors here, and it is very target-specific. At the moment I have modelled this on the worst case, i.e. scalarising the operation, and for scalable vectors we don't know the number of elements. The approach taken here is similar to functions elsewhere in this file, e.g. getCommonMaskedMemoryOpCost. I think if we get here for scalable vectors it's actually a bug, since the target should really have dealt with this separately. Any thoughts, @sdesmalen?

2085

Fair point. @sdesmalen suggested this too privately.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

Hi @dmgreen, yeah we are aware of this problem. It's not ideal - at the moment we do this for gather-scatters too. We took the decision to be conservative and use the maximum value of vscale as the worst-case scenario. In practice, the runtime value could vary from machine to machine, and we thought it better to wait a while and revisit this again at some point. In fact, that's partly why I created this function, so that we only have to change one place in future. :)

I also think that always choosing the most optimistic case could lead to poor performance, so we have to be careful. One option we have is to use the new vscale range IR attributes to refine this, or to choose a value of vscale that represents some sort of average of real use cases?

david-arm added inline comments.Jul 6 2021, 5:57 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

Hi @dmgreen, it turns out removing the pairwise form breaks *lots* of tests. :) I think it's still needed because TTI::matchVectorReduction returns the pairwise form for a lot of cases.

dmgreen added inline comments.Jul 6 2021, 6:54 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

I happened to be looking at whether we can remove it too. Are you assuming that IsPairwise = false? The SLP vectorizer works differently now to how it did in the past.

I can put up a patch for that if you like. There is one cost test that changes, but it seems to me that it would be OK to change, and the code creating those costs removed.

It is at least a (mostly) separate issue from this patch.

2083

Could it just return an invalid cost?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

OK - more work is needed. Got it. I would have expected these cost factors to come from the subtarget, not an IR attribute.

What is being scalarized here though? From https://godbolt.org/z/fcz71dPeY for example? Some of the Illegal types were hitting errors.

david-arm added inline comments.Jul 6 2021, 7:15 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

The problem is that TTI::matchVectorReduction sets IsPairwise to true in some tests run by "make check-all", causing them to fail. I'm not sure exactly where this function gets called from, but I can't just remove the IsPairwise option as part of this patch. I imagine it requires some investigation to work out why the callers need that? However, if you've already got a patch that removes this then I can just rebase on top of it!

dmgreen added inline comments.Jul 6 2021, 7:27 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2081

Yeah, leave that bit to me. Consider it to be removed, but I think the patches can end up happening in either order.

david-arm added inline comments.Jul 7 2021, 3:29 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

Even though there is a single faddv instruction, I think for now it still makes sense to model this as being scalarised, because conceptually the lane-wise FP additions still have to be done in sequence rather than tree-based.

david-arm updated this revision to Diff 356979.Jul 7 2021, 9:05 AM
david-arm edited the summary of this revision. (Show Details)
  • Addressed various review comments.
  • Rebased off D105484
RKSimon added a subscriber: RKSimon.
RKSimon added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Please can you add doxygen description comments for these?

1160–1182

This comment still discusses pairwise reduction modes, but doesn't mention RedType

sdesmalen added inline comments.Jul 7 2021, 9:18 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

nit: may I request that this (and any other references to it) be renamed to TreeWise? In my experience that's the more common term used for these kinds of reductions.

dmgreen added inline comments.Jul 7 2021, 9:51 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

Then what does the scalarization?
https://godbolt.org/z/hfeaYh8r8
TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".

Or do you mean that the fadda will have a low throughput?

dmgreen added inline comments.Jul 7 2021, 9:51 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Could we represent this by passing FastMathFlags, not a new enum? The fast math flags are what dictate in-order vs relaxed, for example. And I don't believe that an integer "Ordered" reduction is ever different from an unordered reduction.

kmclaughlin added inline comments.Jul 7 2021, 10:04 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
55

Since the -force-vector-width flag is used in the RUN lines for this test, can you please remove this hint and add -force-vector-interleave=1?

sdesmalen added inline comments.Jul 9 2021, 12:50 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137

The idea is that an fadda will have a low throughput because the operation is conceptually scalarized, since the fadds can't be performed in parallel, i.e.

double result = ((((init + v0) + v1) + v2) + ...) + vn; // where v0 .. vn are the lanes of the vector

Perhaps this is more a latency than a 'throughput' issue, but if an operation has a very long latency and blocks one of the functional units, I guess that has an impact on the throughput as well.

The more important thing for now is that we want some conservative cost value for these, so that we don't assume in-order/in-loop reductions are cheap, and so that we can tune it to return more sensible values later once we can experiment with this (after all, scalable auto-vec isn't fully functional yet). The other thing we're planning to improve is that, when targeting a specific CPU, getMaxVScale will return the value from the max-vscale attribute in the IR, so that this cost function no longer assumes the worst-case cost but rather a more realistic cost based on the targeted vector length. The current implementation doesn't do this yet, but that's on our active to-do list.
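
To make the worst-case assumption concrete, a hypothetical worked example (assuming SVE's architectural maximum vector length of 2048 bits, i.e. a maximum vscale of 16, and an ordered fadd reduction over <vscale x 2 x double>):

MaxLanes = MaxVScale * MinLanesPerVector = 16 * 2 = 32
Cost     = MaxLanes * Cost(scalar fadd)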

https://godbolt.org/z/hfeaYh8r8
TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".

getArithmeticReductionCostSVE already returns Invalid for the multiplication, but only for the unordered reductions. This now doesn't happen for the ordered case. @david-arm can you look into that?

david-arm updated this revision to Diff 357465.Jul 9 2021, 3:12 AM
david-arm edited the summary of this revision. (Show Details)
  • Added doxygen comments above new enum and commented on the new RedType argument in the TTI interface.
  • Renamed Split enum member to TreeWise.
  • Removed the two FIXMEs in getTypeBasedIntrinsicInstrCost and ensured that getArithmeticReductionCost is called with the correct reduction type according to the FastMathFlags object. This meant fixing up some X86 tests that previously did not use the fast attribute on the intrinsic call. In general callers now have to explicitly add fast to the intrinsic for the tree-based reduction cost.
  • Added code check for supported opcodes (ISD::FADD) in AArch64TargetTransformInfo::getArithmeticReductionCost for the ordered case. For all other opcodes (ISD::FMUL, etc.) we return Invalid.
david-arm marked 13 inline comments as done.Jul 9 2021, 3:16 AM
david-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I guess we could do that, but the flags contain a lot more than just the reassoc flag, which is the only thing we care about here. I personally think it seems a bit more obvious to the caller what's going on if we specify the type directly. Not sure what others think?

dmgreen added inline comments.Jul 9 2021, 5:02 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

That would be my point really - that it contains more info. Why invent a new enum when the fast math flags already represent all the information we need here? It is the fundamental information we require, so why not use it directly?

We are trying to get the cost of a reduction intrinsic, and that intrinsic has some FMFlags. Those flags may dictate the cost of the resulting code, so the FMF's should be passed to the cost function.

fmin/fmax could be in the same situation. It could be possible (although I'm not sure it occurs anywhere at the moment) that a "no-nan" fmax reduction has a different cost to one without the flag.

david-arm marked an inline comment as done.Jul 9 2021, 5:33 AM
david-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Would there ever be a case where we didn't have the fast math flags, but wanted to calculate a theoretical reduction cost anyway? Then you'd have to explicitly create a FastMathFlags object and know exactly which flag needs setting in order to get the desired cost. For example, the user would have to know to set the reassoc flag in order to get the tree-based reduction cost. I mean, I'm not saying that's a bad thing necessarily, but it feels a bit less intuitive and would certainly require additional comments above the getArithmeticReductionCost function saying how the flags map precisely to the cost.

Would there potentially be value in having an overloaded interface that maps one to the other (FastMathFlags<>Enum), or do you strongly prefer a single interface that takes FastMathFlags?

I wonder if other reviewers have a preference here?

david-arm added inline comments.Jul 9 2021, 5:37 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I forgot to mention - just for reference, I think the fmin/fmax cases go through a different cost function, getMinMaxReductionCost.

RKSimon added inline comments.Jul 9 2021, 8:26 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1160

add explicit parentheses to make this even more obvious?

result = (((InitVal + v0) + v1) + v2) + v3

sdesmalen added inline comments.Jul 12 2021, 8:12 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Now that PairWise has been removed, I'd expect we'd only ever need to ask for the cost of either an ordered reduction or an unordered reduction ("fastest/most parallel way possible"). Would the fast-math flags ever need to specify anything other than those two options?

Integer operations have no FastMath flags, and it seems awkward to have to construct them for an operation that doesn't support them, so I'd say the options are either to pass in a bool IsOrdered, or an Optional<FastMathFlags> which defaults to None (<=> implies 'unordered').

dmgreen added inline comments.Jul 13 2021, 1:44 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

OK. Adding ReductionType feels like re-adding IsPairwise to me, which we just went and removed. The combination of an integer reduction + "Ordered" should still be a "TreeWise" reduction, right? There is no such thing as an ordered integer reduction from a codegen perspective. The code at the beginning of getArithmeticReductionCost should really be if ((Opcode==FAdd || Opcode==FMul) && !FMF.allowReassoc()) ....
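
A minimal sketch of that check as a standalone helper (hypothetical name and free-function form; the patch later adds a similar helper, eventually named requiresOrderedReduction and taking an Optional<FastMathFlags>):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Operator.h" // FastMathFlags
using namespace llvm;

static bool needsOrderedReduction(unsigned Opcode, FastMathFlags FMF) {
  // Only FP add/mul reductions can be forced into strict, in-order
  // evaluation; integer reductions can always be evaluated tree-wise.
  if (Opcode != Instruction::FAdd && Opcode != Instruction::FMul)
    return false;
  return !FMF.allowReassoc();
}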

You can get empty FastMathFlags with just FastMathFlags() (and even give them a default value if needed).
I agree that min/max go through a different function - I think getMinMaxReductionCost should be updated with FMF too :) According to the ExpandReduction pass, they require nnan for performing a shuffle reduction on fmin/fmax. But that cost could depend on what instructions the target has available.

An enum/bool feels like the wrong interface to me, from an LLVM perspective. But I don't hold that opinion strongly enough to disagree if folks all think that is a better way to go.

david-arm added inline comments.Jul 13 2021, 7:02 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

I'm trying to see what happens when we always pass in FastMathFlags and it seems there are occasions when we have to construct the flags manually for FP instructions. See llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:getReductionCost:

FastMathFlags FMF;
switch (RdxKind) {
case RecurKind::FAdd:
case RecurKind::FMul:
  FMF.setAllowReassoc();
  LLVM_FALLTHROUGH;
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, FMF);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

This does work, but in this particular case it doesn't look as nice as before so I'm tempted to go with an Optional as @sdesmalen suggested.

dmgreen added inline comments.Jul 13 2021, 7:38 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Can you explain how Optional makes things better?

It looks like the SLP vectorizer has already computed RdxFMF, which are set in the builder. Otherwise they should in general come from the instruction the reduction is created from, which in this case would be FirstReducedVal. It looks like the RdxFMF are exactly what is needed here, though - could it use those directly?

david-arm added inline comments.Jul 13 2021, 8:04 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

So I just meant that with @sdesmalen's suggestion of using None to imply unordered, we could simply do:

switch (RdxKind) {
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: 
case RecurKind::FAdd:
case RecurKind::FMul: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, None);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

Ideally I'd like to avoid constructing the flags here, so if there is a way to pull out existing flags, that's better. I'll have a look.

david-arm updated this revision to Diff 358582.Jul 14 2021, 6:32 AM
  • Changed the getArithmeticReductionCost interface to take a new FastMathFlags parameter, which is only relevant for FP ops.
  • Tried to improve the comments above getArithmeticReductionCost and explain how the FastMathFlags map to the choice of algorithm used.
  • Added a new helper function called isOrderedReduction to avoid unnecessary duplication of logic in 5 places and make the behaviour consistent.
david-arm edited the summary of this revision. (Show Details)Jul 14 2021, 6:33 AM
david-arm added inline comments.Jul 14 2021, 6:36 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1831

I just spotted this unnecessary change and I'll fix it!

Thanks for adding the FMFs. This looks useful in the long run to me.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1151

Nit: I believe FMF is the common spelling throughout LLVM.
And maybe isOrderedReduction -> requiresOrderedReduction?

1163

This comment looks useful. I'm wondering if it's worth emphasizing a bit that these are the default lowerings, and that the cost should be whatever the fastest way the target can legally lower the intrinsic would be.
Maybe spell out that float operations without Reassoc require ordered reductions that look like ((((init + v0) + v1) + v2) + ..., and that otherwise the reduction can happen in any order, which by default will follow a tree-wise reduction.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
1657

Could these still use FMF? It shouldn't matter much, I suppose; either way they should be empty.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138

getScalarizationCostFactor implies that it will be scalarized by codegen, and sounds similar to the already-present getScalarizationOverhead. What do you think about something like getMaxNumLanes, as that appears to be what it computes?
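
For reference, a sketch of the shape such a helper could take as an AArch64TTIImpl member (the body here is an assumption pieced together from this review - e.g. the getMaxVScale discussion above and the VF.getFixedValue() nit below - not necessarily the landed code):

// Return the number of lanes to assume when costing per-lane work. For
// fixed-width vectors this is exact; for scalable vectors the lane count is
// unknown at compile time, so pessimistically assume the maximum vscale.
unsigned getMaxNumElements(ElementCount VF) const {
  if (!VF.isScalable())
    return VF.getFixedValue();
  Optional<unsigned> MaxNumVScale = getMaxVScale();
  assert(MaxNumVScale && "Expected valid max vscale value");
  return *MaxNumVScale * VF.getKnownMinValue();
}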

sdesmalen added inline comments.Jul 19 2021, 12:28 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1149

Can you explain how Optional makes things better?

It's mostly conceptual: fast-math flags have no meaning for integer types, so it makes no sense to construct and pass a nonsensical FMF value for them.

1152

Why does this need to check the opcode?

1184

nit: Not sure if it's worth it, but should you make FPFlags a default argument? That may simplify some of the calls where the flags are unnecessary.

david-arm added inline comments.Jul 19 2021, 1:41 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

Sadly this is a result of using FastMathFlags to determine the algorithm. The flags are usually empty for integer operations, which means the allow-reassoc flag will not be set. If we don't check the opcode, then we end up using strict reductions for all integer operations.

sdesmalen added inline comments.Jul 19 2021, 2:59 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

If the default FMF constructor results in setting AllowReassoc=false, then I think that's a more concrete argument for using Optional<FastMathFlags>, i.e. if there are no fast-math flags, there is nothing that can ask for a 'strict ordering', and so the function would return false.

dmgreen added inline comments.Jul 19 2021, 3:47 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

The only reductions that _can_ require strict ordering are FP operations without reassoc set. It's not the fast math flags alone that mean codegen will have to expand in-order; it is a product of the opcode _and_ the fast math flags.

Integer reductions cannot be expanded in-order, and the cost needn't ever look at FMFs for them.

david-arm added inline comments.Jul 19 2021, 3:49 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1152

I think @sdesmalen's point is more that we now have a bit of an ugly check for the opcode in addition to the flags. I agree that using an Optional<FastMathFlags> here is nicer than always having to look for an FP opcode.

foad removed a subscriber: foad.Jul 19 2021, 3:57 AM
david-arm added inline comments.Jul 19 2021, 4:01 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138

Sure, I can do that. I named it this way because I imagined we'll want to tweak this in future to something less pessimistic, perhaps based on the mid-point between min and max. However, getMaxNumElements or getMaxNumLanes works for now!

@david-arm Please can you rebase? I've tried to add fast/non-fast reduction cost coverage on X86.

david-arm added inline comments.Jul 19 2021, 5:27 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1163

Hi @dmgreen, thanks for the suggestions. In the 1) case below I do state this is the default for integer operations and for FP when reassociation is allowed. I was trying to list the two types of reduction, then explain which is the default. What I could do is emphasise early on that "Tree-wise" is the default, i.e.

  1. Tree-wise. This is the default, 'fast' reduction ...

I can also mention after 2) that the cost should correspond to the fastest way the target can lower the intrinsic?

david-arm updated this revision to Diff 359765.Jul 19 2021, 6:09 AM
  • Rebased.
  • Changed FastMathFlags argument to be Optional<FastMathFlags>.
  • Renamed isOrderedReduction -> requiresOrderedReduction
  • Renamed getScalarizationCostFactor -> getMaxNumElements
RKSimon added inline comments.Jul 19 2021, 6:42 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
73

please can you remove this change now that we have this test coverage below

146

please can you remove this change now that we have this test coverage below

llvm/test/Analysis/CostModel/X86/reduce-fmul.ll
73

please can you remove this change now that we have this test coverage below

146

please can you remove this change now that we have this test coverage below

david-arm added inline comments.Jul 19 2021, 7:13 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
146

Hi @RKSimon, the problem is that if I revert the change, the tests fail. This is because we were previously calling the intrinsic without fast, which means we choose the very expensive cost for an ordered reduction. If you're happy, I can remove the fast attribute, but I will then have to update all of the costs?

RKSimon added inline comments.Jul 19 2021, 7:30 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
146

Yes please - the update_analyze_test_checks.py script should do it all for you.

david-arm updated this revision to Diff 360118.Jul 20 2021, 7:18 AM
  • Reverted previous changes to X86 tests and re-ran the update_analyze_test_checks.py script to calculate intrinsic costs without the 'fast' flag.
david-arm marked 5 inline comments as done.Jul 20 2021, 7:19 AM
RKSimon added inline comments.Jul 20 2021, 8:04 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

This looks too high for what is just a single f64 fadd (SSE floating-point extract from element 0 is free) - it might be a problem in the x86 scalarization overhead?

david-arm added inline comments.Jul 20 2021, 8:13 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

Possibly so - I guess that's not something that should be fixed in this patch though? I imagine that's equally a problem for other X86 operations on a v1f64 type, i.e. see the end of getIntrinsicInstrCost that does the same thing and makes no special case for v1XX types.

RKSimon added inline comments.Jul 20 2021, 8:24 AM
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18

Agreed, we can investigate that in a followup

RKSimon added inline comments.Jul 22 2021, 3:15 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7191

Aren't the calls to TTI.getArithmeticReductionCost the same?

InstructionCost BaseCost = TTI.getArithmeticReductionCost(
      RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

if (useOrderedReductions(RdxDesc))
  return BaseCost;
david-arm updated this revision to Diff 360760.Jul 22 2021, 4:00 AM
  • Simplified call to getArithmeticReductionCost in LoopVectorize.cpp
david-arm marked an inline comment as done.Jul 22 2021, 4:01 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7191

Good point! That's just a leftover from the initial version that used an enum.

Thanks for the changes @david-arm. Just left a few more nits.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152–1154

nit: return FMF && !(*FMF).allowReassoc();

1161–1182

I think the opcode is no longer what decides the type of the reduction; it is mostly the type and the FMF.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2082

nit: please add a comment saying that targets must implement a default for the scalable case, because we can't know how many lanes the vector has.

2084

nit: add empty line above cast<FixedVectorType>

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1410

Why are you changing the GatherScatterOpCost in this patch?

david-arm marked an inline comment as done.Jul 22 2021, 5:09 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1410

I'm not actually changing it - I just wanted to avoid duplicating the max-vscale code for the ordered reduction case, so I moved the logic into a common function, getMaxNumElements. This is all related to calculating the scalarisation cost.

sdesmalen added inline comments.Jul 22 2021, 5:47 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
140

nit: VF.getFixedValue()

david-arm updated this revision to Diff 360800.Jul 22 2021, 7:04 AM
  • Addressed review comments
david-arm marked 25 inline comments as done.Jul 22 2021, 7:06 AM
sdesmalen accepted this revision.Jul 23 2021, 12:57 AM

Thanks for all the changes @david-arm, the patch looks good to me!

This revision is now accepted and ready to land.Jul 23 2021, 12:57 AM
RKSimon accepted this revision.Jul 23 2021, 5:42 AM

LGTM - cheers

This revision was landed with ongoing or failed builds.Jul 26 2021, 2:26 AM
This revision was automatically updated to reflect the committed changes.
RKSimon added inline comments.Jul 26 2021, 7:42 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2091

@david-arm I think this needs to be thisT()->getArithmeticInstrCost to use the correct TTI implementation?
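
For context on why thisT() matters, a simplified, non-LLVM illustration of the CRTP dispatch that BasicTTIImplBase uses (the real classes have different interfaces; this only shows the dispatch mechanism):

#include <iostream>

template <typename T> struct BasicTTIImplBase {
  T *thisT() { return static_cast<T *>(this); }
  int getReductionCost() {
    // Dispatches to the most-derived implementation, if one exists.
    return thisT()->getArithmeticInstrCost();
  }
  int getArithmeticInstrCost() { return 1; } // generic fallback
};

struct AArch64TTIImpl : BasicTTIImplBase<AArch64TTIImpl> {
  int getArithmeticInstrCost() { return 2; } // target-specific cost
};

int main() {
  AArch64TTIImpl TTI;
  std::cout << TTI.getReductionCost() << "\n"; // prints 2, not 1
}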