This is an archive of the discontinued LLVM Phabricator instance.

Differential D105432

[Analysis] Add simple cost model for strict (in-order) reductions
ClosedPublic

Authored by david-arm on Jul 5 2021, 6:17 AM.

Details

Reviewers

sdesmalen
kmclaughlin
CarolineConcatto
dmgreen
spatel
RKSimon

Commits

rG0aff1798b572: [Analysis] Add simple cost model for strict (in-order) reductions

Summary

I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

Tree-wise. This is the typical fast-math reduction that involves continually splitting a vector up into halves and adding each half together until we get a scalar result. This is the default behaviour for integers, whereas for floating point we only do this if reassociation is allowed.
Ordered. This now allows us to estimate the cost of performing a strict vector reduction by treating it as a series of scalar operations in lane order. This is the case when FP reassociation is not permitted. For scalable vectors this is more difficult because at compile time we do not know how many lanes there are, and so we use the worst case maximum vscale value.
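To make the two cases concrete, the sketch below models them outside LLVM. It is a simplified illustration and not the code from this patch: the struct `UnitCosts` and the functions `treeWiseCost`, `orderedCost` and `orderedCostScalable` are hypothetical names, and the unit costs are placeholders rather than real target numbers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-operation costs; real targets supply these through TTI.
struct UnitCosts {
  uint64_t ScalarOp; // one scalar arithmetic op (e.g. fadd)
  uint64_t VectorOp; // one vector arithmetic op on a (narrowed) vector
  uint64_t Shuffle;  // one shuffle that moves the upper half down
  uint64_t Extract;  // extracting one lane into a scalar
};

// Tree-wise: halve the vector log2(NumLanes) times, each level costing one
// shuffle plus one vector op, then extract the single remaining lane.
uint64_t treeWiseCost(unsigned NumLanes, const UnitCosts &C) {
  assert(NumLanes != 0 && (NumLanes & (NumLanes - 1)) == 0 &&
         "sketch assumes a power-of-two lane count");
  uint64_t Cost = 0;
  for (unsigned Width = NumLanes; Width > 1; Width /= 2)
    Cost += C.Shuffle + C.VectorOp;
  return Cost + C.Extract;
}

// Ordered: extract every lane and chain NumLanes scalar ops in lane order.
uint64_t orderedCost(unsigned NumLanes, const UnitCosts &C) {
  return NumLanes * (C.Extract + C.ScalarOp);
}

// For scalable vectors the lane count is unknown at compile time, so the
// worst case is assumed: known-minimum lanes times the maximum vscale.
uint64_t orderedCostScalable(unsigned MinLanes, unsigned MaxVScale,
                             const UnitCosts &C) {
  return orderedCost(MinLanes * MaxVScale, C);
}
```

With every unit cost set to 1, an 8-lane tree-wise reduction costs 7 in this model while the ordered form costs 16, which shows why the ordered estimate grows linearly with the lane count.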

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision.Jul 5 2021, 6:17 AM

Herald added subscribers: hiraditya, kristof.beyls.Jul 5 2021, 6:17 AM

david-arm requested review of this revision.Jul 5 2021, 6:17 AM

Herald added a project: Restricted Project.Jul 5 2021, 6:17 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B112439: Diff 356491.Jul 5 2021, 6:52 AM

Matt added a subscriber: Matt.Jul 5 2021, 9:50 AM

sdesmalen added inline comments.Jul 6 2021, 3:22 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	Can you unify the interface for getArithmeticReductionCost to take an enum which specifies the kind of reduction we want: tree, pair-wise or in-order? If there is only a single `getReductionCost` interface where the parameters dictate having to specify what cost is being asked for, then that's difficult to get wrong (you need to fill in the parameters), but if there's another slightly more specialized variant available, people may not realise it exists and just call the most basic cost interface instead.
2090	This seems to be missing the cost of extracting each of the elements, which you can get using `getScalarizationOverhead(..., /*bool Extract=*/true, ...)`.

Can we add cost model tests too? There is currently a fixme for // FIXME: Add new flag for cost of strict reductions.

Is it better to add a new method, or to add a flag to the existing methods and base the decision on the fast math flags? Any fp reduction, not just from the vectorizer, may be strict-inorder if the fmf's on it do not allow reassoc. Similarly for fmin/fmax where nnan/ninf/reassoc may make a difference.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2087	Should this be checking for scalable vectors?
2089	I think this should also include the cost of extracts (which will sometimes be free, but that is target specific). Probably using the ScalarizationOverhead.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF from the -mcpu argument if one is provided. If it is not, then we would guess at a value; VF=2 would probably be a sensible default. This is using the maximum possible VF, which sounds like a large over-estimate in most cases. Can you speak to why the max VF is a better value to use? I'm not sure I understand why this is scalarizing though.

dmgreen added inline comments.Jul 6 2021, 3:31 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	Oh you got here first. :) I think we can possibly remove IsPairwise entirely. It seems to be only used in a single place nowadays that doesn't seem to me like it would be giving very good cost estimates.

david-arm added inline comments.Jul 6 2021, 3:48 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2087	I don't think we can easily calculate a cost for scalable vectors here, and it is very target-specific. At the moment I have modelled this on the worst case, i.e. scalarising the operation, and for scalable vectors we don't know the number of elements. The approach taken here is similar to functions elsewhere in this file, e.g. getCommonMaskedMemoryOpCost. I think if we get here for scalable vectors it's actually a bug, since the target should really have dealt with this separately. Any thoughts, @sdesmalen?
2089	Fair point. @sdesmalen suggested this too privately.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	Hi @dmgreen, yeah we are aware of this problem. It's not ideal - at the moment we also do this for gather-scatters too. We took the decision to be conservative and use the maximum value of vscale as the worst case scenario. In practice, the runtime value could vary from machine to machine and we thought it better to wait a while and revisit this again at some point. In fact, that's partly why I created this function so that we only have to change one place in future. :) I also think that always choosing the most optimistic case could lead to poor performance so we have to be careful. One option we have is to use the new vscale range IR attributes to refine this, or choose a value of vscale that represents some sort of average of real use cases?

david-arm added inline comments.Jul 6 2021, 5:57 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	Hi @dmgreen, it turns out removing the pairwise form breaks lots of tests. :) I think it's still needed because `TTI::matchVectorReduction` returns the pairwise form for a lot of cases.

dmgreen added inline comments.Jul 6 2021, 6:54 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	I happened to be looking at if we can remove it too. Are you assuming that IsPairwise = false? The SLP vectorizer works differently now to how it did in the past. I can put a patch for that if you like. There is one cost test that changes, but it seems to me like they would be OK to change, and the code creating the costs removed. It is at least a (mostly) separate issue from this patch.
2087	Could it just return an invalid cost?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	OK - more work is needed. Got it. I would have expected these cost factors to come from the subtarget, not an IR attribute. What is being scalarized here though? From https://godbolt.org/z/fcz71dPeY for example? Some of the Illegal types were hitting errors.

david-arm added inline comments.Jul 6 2021, 7:15 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	The problem is that TTI::matchVectorReduction sets IsPairwise to true in some tests run by "make check-all" causing them to fail. I'm not sure where exactly this function gets called from, but I can't just remove the IsPairwise option as part of this patch. I imagine it requires some investigation to work out why the callers need that? However, if you've already got a patch that removes this then I can just rebase on top of it!

dmgreen added inline comments.Jul 6 2021, 7:27 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2085	Yeah, leave that bit to me. Consider it to be removed, but I think the patches can end up happening in either order.

david-arm added inline comments.Jul 7 2021, 3:29 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	Even though there is a single faddv instruction I think for now it still makes sense to model this as being scalarised because conceptually the lanewise FP additions still have to be done in sequence, rather than tree-based.

Addressed various review comments.
Rebased off D105484

Herald added subscribers: foad, kerbowa, pengfei and 3 others.Jul 7 2021, 9:05 AM

david-arm added a parent revision: D105484: [TTI] Remove IsPairwiseForm from getArithmeticReductionCost.Jul 7 2021, 9:05 AM

Harbormaster completed remote builds in B112798: Diff 356979.Jul 7 2021, 9:06 AM

RKSimon added a reviewer: RKSimon.Jul 7 2021, 9:14 AM

RKSimon added a subscriber: RKSimon.

RKSimon added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	Please can you add doxygen description comments for these?
1157–1178	This comment still discusses pairwise reduction modes, but doesn't mention RedType

sdesmalen added inline comments.Jul 7 2021, 9:18 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	nit: may I request this (and any other references to it) to be renamed to TreeWise? In my experience that's a more common term used with these kinds of reductions.

dmgreen added inline comments.Jul 7 2021, 9:51 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	Then what does the scalarization? https://godbolt.org/z/hfeaYh8r8 TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid". Or do you mean that the fadda will have a low throughput?

dmgreen added inline comments.Jul 7 2021, 9:51 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	Could we represent this by passing FastMathFlags, not a new enum? The fast math flags are what dictates in-order vs relaxed for example. And I don't believe that an integer "Ordered" reduction is ever different from a un-ordered reduction.

kmclaughlin added inline comments.Jul 7 2021, 10:04 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
55	Since the `-force-vector-width` flag is used in the RUN lines for this test, can you please remove this hint and add `-force-vector-interleave=1`?

sdesmalen added inline comments.Jul 9 2021, 12:50 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
136–137	The idea is that an fadda will have a low throughput because the operation is conceptually scalarized, because the fadd's can't be performed in parallel, i.e.

double result = ((((init + v0) + v1) + v2) + ...) + vn; // where v0 .. vn are the lanes of the vector

Perhaps this is more a latency than a 'throughput' issue, but if an operation has a very long latency and blocks one of the functional units, I guess that has an impact on the throughput as well.

The more important thing for now is that we want to have some conservative cost value for these, so that we don't assume in-order/in-loop reductions are cheap, and so that we can tune it to return more sensible values later when we can experiment with this (after all, scalable auto-vec isn't fully functional yet).

The other thing we're planning to improve is that when targeting a specific CPU, `getMaxVScale` returns the values from the max-vscale attribute in the IR, so that this cost-function no longer assumes the worst-case cost, but rather a more realistic cost based on the targeted vector length. The current implementation doesn't do this yet, but that's on our active to-do list.

> https://godbolt.org/z/hfeaYh8r8 TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".

`getArithmeticReductionCostSVE` already returns `Invalid` for the multiplication, but only for the unordered reductions. This now doesn't happen for the ordered case. @david-arm can you look into that?
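The reason the in-order form cannot simply be replaced by a tree-wise one is that floating-point addition is not associative. The standalone sketch below illustrates this with plain `float` arithmetic; `orderedSum` and `treeWiseSum` are hypothetical helpers written for this note and say nothing about how fadda or any target lowering is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order (strict) reduction: ((((Init + v0) + v1) + v2) + ...) + vn.
float orderedSum(float Init, const std::vector<float> &V) {
  float Acc = Init;
  for (float X : V)
    Acc += X;
  return Acc;
}

// Tree-wise reduction: repeatedly add the upper half onto the lower half,
// then fold the initial value into the single remaining lane.
float treeWiseSum(float Init, std::vector<float> V) {
  assert(!V.empty() && (V.size() & (V.size() - 1)) == 0 &&
         "sketch assumes a power-of-two lane count");
  for (std::size_t Half = V.size() / 2; Half >= 1; Half /= 2)
    for (std::size_t I = 0; I < Half; ++I)
      V[I] += V[I + Half];
  return Init + V[0];
}
```

For the lanes `{1e8f, 1.0f, -1e8f, 1.0f}` with an initial value of `0.0f`, the in-order sum loses the first `1.0f` to rounding and yields `1.0f`, whereas the tree-wise sum cancels the large values first and yields `2.0f`. Without the reassoc flag the compiler has to preserve the first result.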

Added doxygen comments above new enum and commented on the new RedType argument in the TTI interface.
Renamed Split enum member to TreeWise.
Removed the two FIXMEs in getTypeBasedIntrinsicInstrCost and ensured that getArithmeticReductionCost is called with the correct reduction type according to the FastMathFlags object. This meant fixing up some X86 tests that previously did not use the fast attribute on the intrinsic call. In general callers now have to explicitly add fast to the intrinsic for the tree-based reduction cost.
Added code check for supported opcodes (ISD::FADD) in AArch64TargetTransformInfo::getArithmeticReductionCost for the ordered case. For all other opcodes (ISD::FMUL, etc.) we return Invalid.

Harbormaster completed remote builds in B113166: Diff 357465.Jul 9 2021, 3:12 AM

david-arm marked 13 inline comments as done.Jul 9 2021, 3:16 AM

david-arm added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	I guess we could do that, but the flags contain a lot more than just the reassoc flag, which is the only thing we care about here. I personally think it seems a bit more obvious to the caller what's going on if we specify the type directly. Not sure what others think?

dmgreen added inline comments.Jul 9 2021, 5:02 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	That would be my point really, that it contains more info. Why invent a new enum when the fast math flags already represent all the information we need here. It is the fundamental information we require here, so why not use it directly? We are trying to get the cost of a reduction intrinsic, and that intrinsic has some FMFlags. Those flags may dictate the cost of the resulting code, so the FMF's should be passed to the cost function. fmin/fmax could be in the same situations. It could be possible (although I'm not sure it occurs anywhere at the moment) that a "no-nan" fmax reduction has a different cost to one without the flag.

david-arm marked an inline comment as done.Jul 9 2021, 5:33 AM

david-arm added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	Would there ever be a case where we didn't have the fast math flags, but wanted to calculate a theoretical reduction cost anyway? Then you'd have to explicitly create a FastMathFlags object and have to know exactly which flag needs setting in order to get the desired cost? For example, the user would have to know to set the reassoc flag in order to get the tree-based reduction cost. I mean, not saying that's a bad thing necessarily, but it felt a bit less intuitive and would certainly require additional comments above the `getArithmeticReductionCost` function saying how the flags map precisely to the cost. Would there potentially be value in having an overloaded interface that maps one to the other (FastMathFlags<>Enum), or do you strongly prefer a single interface that takes FastMathFlags? I wonder if other reviewers have a preference here?

david-arm added inline comments.Jul 9 2021, 5:37 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	I forgot to mention - just for reference I think the fmin/fmax cases go through a different cost function - getMinMaxReductionCost

RKSimon added inline comments.Jul 9 2021, 8:26 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1160	add explicit parentheses to make this even more obvious? result = (((InitVal + v0) + v1) + v2) + v3

sdesmalen added inline comments.Jul 12 2021, 8:12 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	Now that `PairWise` has been removed, I'd expect we'd only ever need to ask for the cost of either an ordered reduction or an unordered reduction ("fastest/most parallel way possible"). Would the fast-math flags ever need to specify anything else than those two options? Integer operations have no FastMath flags, it seems awkward to have to construct them for an operation that doesn't support it, so I'd say the options are either to pass in a `bool IsOrdered`, or an `Optional<FastMathFlags>` which defaults to `None` (<=> implies 'unordered').

dmgreen added inline comments.Jul 13 2021, 1:44 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	OK. Adding ReductionType feels like re-adding IsPairwise to me, which we just went and removed. The combination of Integer reduction + "Ordered" should still be a "TreeWise" reduction, right? There is no such thing as an ordered integer reduction from a codegen perspective. The code at the beginning of getArithmeticReductionCost should really be `if ((Opcode==FAdd || Opcode==FMul) && !FMF.allowReassoc()) ...`. You can get empty FastMathFlags with just `FastMathFlags()` (and even give them a default value if needed).

I agree that min/max go through a different function - I think getMinMaxReductionCost should be updated with FMF too :) According to the ExpandReduction pass they require nnan for performing a shuffle reduction on fmin/fmax. But that cost could depend on what instructions the target has available.

An enum/bool feels like the wrong interface to me, from an llvm perspective. But I don't hold that opinion strongly enough to disagree if folks all think that is a better way to go.
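The check proposed in this comment can be sketched in isolation as follows. The types here are deliberately minimal stand-ins, not the real `llvm::FastMathFlags` or instruction opcodes, and the names `Opcode`, `FastMathFlagsSketch` and `requiresOrderedReductionSketch` are hypothetical; the only logic carried over from the thread is "FP add/mul without reassoc is ordered, everything else is not".

```cpp
#include <cassert>

// Stand-ins for LLVM's opcodes and FastMathFlags, reduced to what this
// sketch needs. They are not the real LLVM types.
enum class Opcode { Add, Mul, And, Or, Xor, FAdd, FMul };

struct FastMathFlagsSketch {
  bool Reassoc = false;
  bool allowReassoc() const { return Reassoc; }
};

// Only FP add/mul without 'reassoc' must be evaluated in lane order; integer
// reductions may always be reassociated, whatever the flags say.
bool requiresOrderedReductionSketch(Opcode Op, FastMathFlagsSketch FMF) {
  bool IsFPReduction = Op == Opcode::FAdd || Op == Opcode::FMul;
  return IsFPReduction && !FMF.allowReassoc();
}

// Usage: default-constructed flags make an fadd ordered but leave an
// integer add unordered.
void usageSketch() {
  FastMathFlagsSketch Strict;
  assert(requiresOrderedReductionSketch(Opcode::FAdd, Strict));
  assert(!requiresOrderedReductionSketch(Opcode::Add, Strict));
}
```

The opcode test is what keeps default-constructed (empty) flags from turning every integer reduction into a strict one.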

david-arm added inline comments.Jul 13 2021, 7:02 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h

1149

I'm trying to see what happens when we always pass in FastMathFlags and it seems there are occasions when we have to construct the flags manually for FP instructions. See llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:getReductionCost:

FastMathFlags FMF;
switch (RdxKind) {
case RecurKind::FAdd:
case RecurKind::FMul:
  FMF.setAllowReassoc();
  LLVM_FALLTHROUGH;
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, FMF);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

This does work, but in this particular case it doesn't look as nice as before so I'm tempted to go with an Optional as @sdesmalen suggested.

dmgreen added inline comments.Jul 13 2021, 7:38 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	Can you explain how Optional makes things better? It looks like the SLP vectorizer has already computed `RdxFMF`, which are set in the builder. Otherwise they should in general come from the instruction the reduction is created from, which in this case would be from `FirstReducedVal`. It looks like the RdxFMF are exactly what is needed though, could it use those directly?

david-arm added inline comments.Jul 13 2021, 8:04 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h

1149

So I just meant that with @sdesmalen's suggestion of using None to imply unordered we could simply do:

switch (RdxKind) {
case RecurKind::Add:
case RecurKind::Mul:
case RecurKind::Or:
case RecurKind::And:
case RecurKind::Xor: 
case RecurKind::FAdd:
case RecurKind::FMul: {
  unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
  VectorCost = TTI->getArithmeticReductionCost(
      RdxOpcode, VectorTy, None);
  ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
  break;
}

Ideally I'd like to avoid constructing the flags here so if there is a way to pull out existing flags that's better. I'll have a look.

Changed interface getArithmeticReductionCost to take a new FastMathFlags parameter, that is only useful for FP ops.
Tried to improve the comments above getArithmeticReductionCost and explain how the FastMathFlags map to the choice of algorithm used.
Added new helper function called isOrderedReduction to avoid unnecessary duplication of logic in 5 places and make the behaviour consistent.

david-arm edited the summary of this revision.Jul 14 2021, 6:33 AM

david-arm added inline comments.Jul 14 2021, 6:36 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1978	I just spotted this unnecessary change and I'll fix it!

Harbormaster completed remote builds in B113963: Diff 358582.Jul 14 2021, 7:16 AM

Thanks for adding the FMF's. This looks useful in the long run to me

llvm/include/llvm/Analysis/TargetTransformInfo.h
1151	Nit: I believe FMF is the common spelling throughout llvm. And maybe isOrderedReduction -> requiresOrderedReduction ?
1160	This comment looks useful. I'm wondering if it's worth emphasizing a bit that these are the default lowerings, and the cost should be whatever the fastest way the target can legally lower the intrinsic would be. Maybe spelling out that float operations without Reassoc require ordered reductions that look like `((((init + v0) + v1) + v2) + ..`. And otherwise the reduction can happen in any order, which by default will follow a treewise reduction.
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1661	Could these still use FMF? It shouldn't matter much I suppose, either way they should be empty.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138	getScalarizationCostFactor implies that it will be scalarized by codegen, and sounds similar to the already present getScalarizationOverhead. What do you think about something like getMaxNumLanes, as that appears to be what it computes.

sdesmalen added inline comments.Jul 19 2021, 12:28 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1149	> Can you explain how Optional makes things better?
It's mostly conceptual that fast-math flags have no meaning for integer types, so it makes no sense to construct and pass some nonsensical FMF value for it.
1152	Why does this need to check the opcode?
1180	nit: Not sure if it's worth it, but should you make FPFlags a default argument? That may simplify some of the calls where the flags are unnecessary.

david-arm added inline comments.Jul 19 2021, 1:41 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152	Sadly this is a result of using FastMathFlags to determine the algorithm. The flags are usually empty for integer operations, which means the allow reassoc flag will not be set. If we don't check the opcode then we end up using strict reductions for all integer operations.

sdesmalen added inline comments.Jul 19 2021, 2:59 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152	If the default FMF constructor results in setting `AllowReassoc=false`, then I think that's a more concrete argument for using `Optional<FastMathFlags>`, i.e. if there are no fast-math flags, there is nothing that can ask for a 'strict ordering', and so the function would return false.

dmgreen added inline comments.Jul 19 2021, 3:47 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152	The only reduction that _can_ require strict orderings are fp operations without reassoc set. It's not a product of the fast math flags alone that means codegen will have to expand in-order. It is a product of the opcode _and_ fast math flags. Integer reduction cannot be expanded in-order, and the cost needn't ever look at FMF's for them.

david-arm added inline comments.Jul 19 2021, 3:49 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152	I think @sdesmalen's point is more that we now have a bit of an ugly check for the opcode in addition to the flags. I agree that using an Optional<FastMathFlags> here is nicer than always having to look for a FP opcode.
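The alternative argued for here, where absent flags mean "nothing can ask for strict ordering", can be sketched as below. This uses C++17 `std::optional` as a stand-in for LLVM's `Optional`, and `FMFStandIn` and `requiresOrderedWithOptional` are hypothetical names chosen for this note rather than the names used in the patch.

```cpp
#include <cassert>
#include <optional>

// Minimal stand-in for llvm::FastMathFlags; not the real type.
struct FMFStandIn {
  bool Reassoc = false;
  bool allowReassoc() const { return Reassoc; }
};

// No flags at all (e.g. an integer reduction) can never request ordering, so
// the opcode no longer needs to be inspected.
bool requiresOrderedWithOptional(std::optional<FMFStandIn> FMF) {
  return FMF && !FMF->allowReassoc();
}

// Usage: an integer caller passes std::nullopt, an FP caller passes the
// flags taken from the reduction it is costing.
void usageWithOptional() {
  assert(!requiresOrderedWithOptional(std::nullopt));
  assert(requiresOrderedWithOptional(FMFStandIn{}));
}
```

The trade-off discussed in the thread is visible here: the function body is simpler, but callers now carry the responsibility of passing an empty optional for integer operations.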

foad removed a subscriber: foad.Jul 19 2021, 3:57 AM

david-arm added inline comments.Jul 19 2021, 4:01 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
138	Sure I can do that. I named it this way because I imagined we'll want to tweak this in future to something less pessimistic, perhaps based on mid-point between min and max? However, getMaxNumElements or Lanes works for now!

RKSimon mentioned this in rG5939c642aeed: [CostModel][X86] Add fast math tests for float reductions.Jul 19 2021, 5:05 AM

@david-arm Please can you rebase? I've tried to add fast/non-fast reduction costs coverage on X86

david-arm added inline comments.Jul 19 2021, 5:27 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1160	Hi @dmgreen, thanks for the suggestions. In the 1) case below I do state this is the default for integer operations and FP when reassociation is allowed. I was trying to list the two types of reduction, then explain which is the default. What I could do is emphasise early on that "Tree-wise" is the default, i.e Tree-wise. This is the default, 'fast' reduction ... I can also mention after 2) that the cost should correspond to the fastest way the target can lower the intrinsic?

Rebased.
Changed FastMathFlags argument to be Optional<FastMathFlags>.
Renamed isOrderedReduction -> requiresOrderedReduction
Renamed getScalarizationCostFactor -> getMaxNumElements

RKSimon added inline comments.Jul 19 2021, 6:42 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
77	please can you remove this change now that we have this test coverage below
151	please can you remove this change now that we have this test coverage below
llvm/test/Analysis/CostModel/X86/reduce-fmul.ll
77	please can you remove this change now that we have this test coverage below
151	please can you remove this change now that we have this test coverage below

david-arm added inline comments.Jul 19 2021, 7:13 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
151	Hi @RKSimon, the problem is that if I revert the change the tests fail. This is because we were previously calling the intrinsic without `fast`, which means that we choose the very expensive cost for an ordered reduction. If you're happy I can remove the `fast` attribute, but I will then have to update all of the costs?

Harbormaster completed remote builds in B114841: Diff 359765.Jul 19 2021, 7:17 AM

RKSimon added inline comments.Jul 19 2021, 7:30 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
151	Yes please - the update_analyze_test_checks.py script should do it all for you

Reverted previous changes to X86 tests and re-ran the update_analyze_test_checks.py script to calculate intrinsic costs without the 'fast' flag.

david-arm marked 5 inline comments as done.Jul 20 2021, 7:19 AM

RKSimon added inline comments.Jul 20 2021, 8:04 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18	This looks too high for what is just a single f64 fadd (SSE floating point extract from 0 is free) - it might be a problem in x86 scalarization overhead ?

Harbormaster completed remote builds in B115093: Diff 360118.Jul 20 2021, 8:05 AM

david-arm added inline comments.Jul 20 2021, 8:13 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18	Possibly so - I guess that's not something that should be fixed in this patch though? I imagine that's equally a problem for other X86 operations on a v1f64 type, i.e. see the end of getIntrinsicInstrCost that does the same thing and makes no special case for v1XX types.

RKSimon added inline comments.Jul 20 2021, 8:24 AM

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll
14–18	Agreed, we can investigate that in a followup

RKSimon added inline comments.Jul 22 2021, 3:15 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7261	Aren't the calls to TTI.getArithmeticReductionCost the same?

InstructionCost BaseCost = TTI.getArithmeticReductionCost(
    RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);
if (useOrderedReductions(RdxDesc))
  return BaseCost;

Simplified call to getArithmeticReductionCost in LoopVectorize.cpp

david-arm marked an inline comment as done.Jul 22 2021, 4:01 AM

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7261	Good point! That's just a leftover from the initial version that used an enum.

Thanks for the changes @david-arm. Just left a few more nits.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1152–1154	nit: `return FMF && !(*FMF).allowReassoc();`
1158–1178	I think the opcode is no longer what decides what the type of the reduction is. It is mostly the type and the FMF.
llvm/include/llvm/CodeGen/BasicTTIImpl.h
2086	nit: please add a comment saying that targets must implement a default for the scalable case, because we can't know how many lanes the vector has.
2088	nit: add empty line above cast<FixedVectorType>
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1548	Why are you changing the GatherScatterOpCost in this patch?

Harbormaster completed remote builds in B115534: Diff 360760.Jul 22 2021, 4:49 AM

david-arm marked an inline comment as done.Jul 22 2021, 5:09 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1548	I'm not actually changing it - I just wanted to avoid duplicating the max vscale code for the ordered reduction case so I moved the logic into a common function `getMaxNumElements`. This is all related to calculating the scalarisation cost.
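As a rough illustration of what such a shared helper computes, the sketch below derives an upper bound on the lane count from a known-minimum count and a maximum vscale. It is written under assumptions: `ElementCountSketch` and `maxNumElementsSketch` are hypothetical stand-ins, the real helper takes an LLVM `ElementCount` and obtains the maximum vscale from the target, and its exact body is not shown on this page.

```cpp
#include <cassert>

// Stand-in for llvm::ElementCount: a known-minimum lane count plus a flag
// saying whether the real count is that minimum multiplied by vscale.
struct ElementCountSketch {
  unsigned MinLanes;
  bool Scalable;
};

// Upper bound on the lane count. MaxVScale is a parameter in this sketch;
// for SVE the architectural maximum is 16 (2048-bit vectors made of
// 128-bit granules).
unsigned maxNumElementsSketch(ElementCountSketch VF, unsigned MaxVScale) {
  assert(MaxVScale >= 1 && "vscale is at least one");
  return VF.Scalable ? VF.MinLanes * MaxVScale : VF.MinLanes;
}
```

Under these assumptions a `<vscale x 4 x float>` reduction is costed as if it had 64 lanes, which is the pessimism discussed earlier in the review.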

sdesmalen added inline comments.Jul 22 2021, 5:47 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
140	nit: VF.getFixedValue()

Addressed review comments

david-arm marked 25 inline comments as done.Jul 22 2021, 7:06 AM

Harbormaster completed remote builds in B115558: Diff 360800.Jul 22 2021, 7:40 AM

Thanks for all the changes @david-arm, the patch looks good to me!

This revision is now accepted and ready to land.Jul 23 2021, 12:57 AM

david-arm added a child revision: D106646: [LoopVectorize] Don't interleave scalar ordered reductions for inner loops.Jul 23 2021, 3:49 AM

LGTM - cheers

This revision was landed with ongoing or failed builds.Jul 26 2021, 2:26 AM

Closed by commit rG0aff1798b572: [Analysis] Add simple cost model for strict (in-order) reductions (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rG0aff1798b572: [Analysis] Add simple cost model for strict (in-order) reductions.

RKSimon added inline comments.Jul 26 2021, 7:42 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2095	@david-arm I think this needs to be thisT()->getArithmeticInstrCost to use the correct TTI implementation?
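The issue raised here comes from the CRTP pattern that BasicTTIImplBase uses: a call made directly from the base class binds to the base implementation, while a call through `thisT()` reaches the derived target's override. The sketch below reproduces that effect with hypothetical classes (`BasicCostSketch`, `TargetCostSketch`) and made-up costs; it is not the LLVM code itself.

```cpp
#include <cassert>

// CRTP base in the style of BasicTTIImplBase<T>; names are hypothetical.
template <typename T> class BasicCostSketch {
  T *thisT() { return static_cast<T *>(this); }

public:
  // Generic fallback cost.
  unsigned getArithmeticInstrCost() { return 1; }

  // Calls the base implementation and so ignores any target override.
  unsigned orderedCostDirect(unsigned NumLanes) {
    return NumLanes * getArithmeticInstrCost();
  }

  // Dispatches statically to the derived class, picking up its override.
  unsigned orderedCostViaThisT(unsigned NumLanes) {
    return NumLanes * thisT()->getArithmeticInstrCost();
  }
};

// A target that reports a more expensive arithmetic instruction.
class TargetCostSketch : public BasicCostSketch<TargetCostSketch> {
public:
  unsigned getArithmeticInstrCost() { return 3; }
};
```

With four lanes, `orderedCostDirect` returns 4 in this sketch while `orderedCostViaThisT` returns 12, so only the second reflects the target's own cost.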

RKSimon mentioned this in D106795: [Analysis] Fix getOrderedReductionCost to call target's getArithmeticInstrCost implementation.Jul 26 2021, 8:10 AM

sdesmalen mentioned this in D106277: [SVE] Remove usage of getMaxVScale for AArch64, in favour of IR Attribute.Jul 26 2021, 9:06 AM

RKSimon mentioned this in rG77c5e6ba900a: [Analysis] Fix getOrderedReductionCost to call target's getArithmeticInstrCost….Jul 26 2021, 9:16 AM

Revision Contents

Path                                                                  Size
llvm/include/llvm/Analysis/TargetTransformInfo.h                      34 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h                  1 line
llvm/include/llvm/CodeGen/BasicTTIImpl.h                              65 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                             5 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h                  14 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp                24 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h                    2 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp                  6 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h                          1 line
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp                        8 lines
llvm/lib/Target/X86/X86TargetTransformInfo.h                          2 lines
llvm/lib/Target/X86/X86TargetTransformInfo.cpp                        10 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                       11 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp                       8 lines
llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll                   20 lines
llvm/test/Analysis/CostModel/AArch64/sve-intrinsics.ll                41 lines
llvm/test/Analysis/CostModel/X86/intrinsic-cost-kinds.ll              6 lines
llvm/test/Analysis/CostModel/X86/reduce-fadd.ll                       154 lines
llvm/test/Analysis/CostModel/X86/reduce-fmul.ll                       154 lines
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll        49 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll    54 lines

Diff 361600

llvm/include/llvm/Analysis/TargetTransformInfo.h

public:
  /// \p UseMaskForCond indicates if the memory access is predicated.
  /// \p UseMaskForGaps indicates if gaps should be masked.
  InstructionCost getInterleavedMemoryOpCost(
      unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
      Align Alignment, unsigned AddressSpace,
      TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput,
      bool UseMaskForCond = false, bool UseMaskForGaps = false) const;

  /// A helper function to determine the type of reduction algorithm used
		RKSimon: Please can you add doxygen description comments for these?
		sdesmalen: nit: may I request this (and any other references to it) to be renamed to TreeWise? In my experience that's a more common term used with these kinds of reductions.
		dmgreen: Could we represent this by passing FastMathFlags, not a new enum? The fast math flags are what dictates in-order vs relaxed for example. And I don't believe that an integer "Ordered" reduction is ever different from an un-ordered reduction.
		david-arm (Author): I guess we could do that, but the flags contain a lot more than just the reassoc flag, which is the only thing we care about here. I personally think it seems a bit more obvious to the caller what's going on if we specify the type directly. Not sure what others think?
		dmgreen: That would be my point really, that it contains more info. Why invent a new enum when the fast math flags already represent all the information we need here. It is the fundamental information we require here, so why not use it directly? We are trying to get the cost of a reduction intrinsic, and that intrinsic has some FMFlags. Those flags may dictate the cost of the resulting code, so the FMF's should be passed to the cost function. fmin/fmax could be in the same situations. It could be possible (although I'm not sure it occurs anywhere at the moment) that a "no-nan" fmax reduction has a different cost to one without the flag.
		david-arm (Author): Would there ever be a case where we didn't have the fast math flags, but wanted to calculate a theoretical reduction cost anyway? Then you'd have to explicitly create a FastMathFlags object and have to know exactly which flag needs setting in order to get the desired cost? For example, the user would have to know to set the reassoc flag in order to get the tree-based reduction cost. I mean, not saying that's a bad thing necessarily, but it felt a bit less intuitive and would certainly require additional comments above the `getArithmeticReductionCost` function saying how the flags map precisely to the cost. Would there potentially be value in having an overloaded interface that maps one to the other (FastMathFlags<>Enum), or do you strongly prefer a single interface that takes FastMathFlags? I wonder if other reviewers have a preference here?
		david-arm (Author): I forgot to mention - just for reference I think the fmin/fmax cases go through a different cost function - getMinMaxReductionCost.
		sdesmalen: Now that `PairWise` has been removed, I'd expect we'd only ever need to ask for the cost of either an ordered reduction or an unordered reduction ("fastest/most parallel way possible"). Would the fast-math flags ever need to specify anything else than those two options? Integer operations have no FastMath flags, it seems awkward to have to construct them for an operation that doesn't support it, so I'd say the options are either to pass in a `bool IsOrdered`, or an `Optional<FastMathFlags>` which defaults to `None` (<=> implies 'unordered').
		dmgreen: OK. Adding ReductionType feels like re-adding IsPairwise to me, which we just went and removed. The combination of Integer reduction + "Ordered" should still be a "TreeWise" reduction, right? There is no such thing as an ordered integer reduction from codegen perspective. The code at the beginning of getArithmeticReductionCost should really be `if ((Opcode==FAdd || Opcode==FMul) && !FMF.allowReassoc()) ...`. You can get empty FastMathFlags with just `FastMathFlags()` (and even give them a default value if needed). I agree that min/max go through a different function - I think getMinMaxReductionCost should be updated with FMF too :) According to the ExpandReduction pass they require nnan for performing a shuffle reduction on fmin/fmax. But that cost could depend on what instructions the target has available. An enum/bool feels like the wrong interface to me, from an llvm perspective. But I don't hold that opinion strongly enough to disagree if folks all think that is a better way to go.
		david-arm (Author): I'm trying to see what happens when we always pass in FastMathFlags and it seems there are occasions when we have to construct the flags manually for FP instructions. See llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:getReductionCost:
		    FastMathFlags FMF;
		    switch (RdxKind) {
		    case RecurKind::FAdd:
		    case RecurKind::FMul:
		      FMF.setAllowReassoc();
		      LLVM_FALLTHROUGH;
		    case RecurKind::Add:
		    case RecurKind::Mul:
		    case RecurKind::Or:
		    case RecurKind::And:
		    case RecurKind::Xor: {
		      unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
		      VectorCost = TTI->getArithmeticReductionCost(
		          RdxOpcode, VectorTy, FMF);
		      ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
		      break;
		    }
		This does work, but in this particular case it doesn't look as nice as before so I'm tempted to go with an Optional as @sdesmalen suggested.
		dmgreen: Can you explain how Optional makes things better? It looks like the SLP vectorizer has already computed `RdxFMF`, which are set in the builder. Otherwise they should in general come from the instruction the reduction is created from, which in this case would be from `FirstReducedVal`. It looks like the RdxFMF are exactly what is needed though, could it use those directly?
		david-arm (Author): So I just meant that with @sdesmalen's suggestion of using `None` to imply unordered we could simply do:
		    switch (RdxKind) {
		    case RecurKind::Add:
		    case RecurKind::Mul:
		    case RecurKind::Or:
		    case RecurKind::And:
		    case RecurKind::Xor:
		    case RecurKind::FAdd:
		    case RecurKind::FMul: {
		      unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
		      VectorCost = TTI->getArithmeticReductionCost(
		          RdxOpcode, VectorTy, None);
		      ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
		      break;
		    }
		Ideally I'd like to avoid constructing the flags here so if there is a way to pull out existing flags that's better. I'll have a look.
		sdesmalen:
		    > Can you explain how Optional makes things better?
		It's mostly conceptual that fast-math flags have no meaning for integer types, so it makes no sense to construct and pass some nonsensical FMF value for it.
		/// for a given \p Opcode and set of FastMathFlags \p FMF.
		static bool requiresOrderedReduction(Optional<FastMathFlags> FMF) {
		dmgreen: Nit: I believe FMF is the common spelling throughout llvm. And maybe isOrderedReduction -> requiresOrderedReduction ?
		return FMF != None && !(*FMF).allowReassoc();
		sdesmalen: Why does this need to check the opcode?
		david-arm (Author): Sadly this is a result of using FastMathFlags to determine the algorithm. The flags are usually empty for integer operations, which means the allow reassoc flag will not be set. If we don't check the opcode then we end up using strict reductions for all integer operations.
		sdesmalen: If the default FMF constructor results in setting `AllowReassoc=false`, then I think that's a more concrete argument for using `Optional<FastMathFlags>`, i.e. if there are no fast-math flags, there is nothing that can ask for a 'strict ordering', and so the function would return false.
		dmgreen: The only reduction that _can_ require strict orderings are fp operations without reassoc set. It's not a product of the fast math flags alone that means codegen will have to expand in-order. It is a product of the opcode _and_ fast math flags. Integer reduction cannot be expanded in-order, and the cost needn't ever look at FMF's for them.
		david-arm (Author): I think @sdesmalen's point is more that we now have a bit of an ugly check for the opcode in addition to the flags. I agree that using an Optional<FastMathFlags> here is nicer than always having to look for a FP opcode.
		}

		sdesmalen: nit: `return FMF && !(*FMF).allowReassoc());`
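
The outcome of this thread (absent flags mean "unordered", present flags require an ordered reduction only when reassociation is disallowed) can be sketched in standalone C++. This is an illustration only: `Flags` is a hypothetical stand-in for `FastMathFlags` that models just the reassoc bit, and `std::optional` is used in place of LLVM's `Optional`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for llvm::FastMathFlags; only reassoc is modelled.
struct Flags {
  bool AllowReassoc = false;
  bool allowReassoc() const { return AllowReassoc; }
};

// Absent flags (e.g. integer reductions) never require an ordered reduction.
// Present flags require one only when reassociation is not allowed.
inline bool requiresOrderedReduction(const std::optional<Flags> &FMF) {
  return FMF && !FMF->allowReassoc();
}
```

Passing `std::nullopt` models an integer reduction, a default-constructed `Flags` models a strict FP reduction, and a `Flags` with `AllowReassoc` set models a fast-math FP reduction. No opcode check is needed, which is the argument made above in favour of the optional parameter.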
/// Calculate the cost of vector reduction intrinsics.		/// Calculate the cost of vector reduction intrinsics.
///		///
/// This is the cost of reducing the vector value of type \p Ty to a scalar		/// This is the cost of reducing the vector value of type \p Ty to a scalar
/// value using the operation denoted by \p Opcode.		/// value using the operation denoted by \p Opcode. The FastMathFlags
		/// parameter \p FMF indicates what type of reduction we are performing:
		/// 1. Tree-wise. This is the typical 'fast' reduction performed that
		RKSimon: add explicit parentheses to make this even more obvious? result = (((InitVal + v0) + v1) + v2) + v3
		dmgreen: This comment looks useful. I'm wondering if it's worth emphasizing a bit that these are the default lowerings, and the cost should be whatever the fastest way the target can legally lower the intrinsic would be. Maybe spelling out that float operations without Reassoc require ordered reductions that look like `((((init + v0) + v1) + v2) + ..`. And otherwise the reduction can happen in any order, which by default will follow a treewise reduction.
		david-arm (Author): Hi @dmgreen, thanks for the suggestions. In the 1) case below I do state this is the default for integer operations and FP when reassociation is allowed. I was trying to list the two types of reduction, then explain which is the default. What I could do is emphasise early on that "Tree-wise" is the default, i.e. "Tree-wise. This is the default, 'fast' reduction ...". I can also mention after 2) that the cost should correspond to the fastest way the target can lower the intrinsic?
		/// involves successively splitting a vector into half and doing the
		/// operation on the pair of halves until you have a scalar value. For
		/// example:
		/// (v0, v1, v2, v3)
		/// ((v0+v2), (v1+v3), undef, undef)
		/// ((v0+v2+v1+v3), undef, undef, undef)
		/// This is the default behaviour for integer operations, whereas for
		/// floating point we only do this if \p FMF indicates that
		/// reassociation is allowed.
		/// 2. Ordered. For a vector with N elements this involves performing N
		/// operations in lane order, starting with an initial scalar value, i.e.
		/// result = InitVal + v0
		/// result = result + v1
		/// result = result + v2
		/// result = result + v3
		/// This is only the case for FP operations and when reassociation is not
		/// allowed.
		///
		RKSimon: This comment still discusses pairwise reduction modes, but doesn't mention RedType.
		sdesmalen: I think the opcode is no longer what decides what the type of the reduction is. It is mostly the type and the FMF.
InstructionCost getArithmeticReductionCost(		InstructionCost getArithmeticReductionCost(
unsigned Opcode, VectorType *Ty,		unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
		sdesmalen: nit: Not sure if it's worth it, but should you make FPFlags a default argument? That may simplify some of the calls where the flags are unnecessary.
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

InstructionCost getMinMaxReductionCost(		InstructionCost getMinMaxReductionCost(
VectorType *Ty, VectorType *CondTy, bool IsUnsigned,		VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

/// Calculate the cost of an extended reduction pattern, similar to		/// Calculate the cost of an extended reduction pattern, similar to
/// getArithmeticReductionCost of an Add reduction with an extension and		/// getArithmeticReductionCost of an Add reduction with an extension and
▲ Show 20 Lines • Show All 450 Lines • ▼ Show 20 Lines	getGatherScatterOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
const Instruction *I = nullptr) = 0;		const Instruction *I = nullptr) = 0;

virtual InstructionCost getInterleavedMemoryOpCost(		virtual InstructionCost getInterleavedMemoryOpCost(
unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,		unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,		Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond = false, bool UseMaskForGaps = false) = 0;		bool UseMaskForCond = false, bool UseMaskForGaps = false) = 0;
virtual InstructionCost		virtual InstructionCost
getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual InstructionCost		virtual InstructionCost
getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,		getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual InstructionCost getExtendedAddReductionCost(		virtual InstructionCost getExtendedAddReductionCost(
bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;
virtual InstructionCost		virtual InstructionCost
▲ Show 20 Lines • Show All 485 Lines • ▼ Show 20 Lines	InstructionCost getInterleavedMemoryOpCost(
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,		Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond, bool UseMaskForGaps) override {		bool UseMaskForCond, bool UseMaskForGaps) override {
return Impl.getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		return Impl.getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
Alignment, AddressSpace, CostKind,		Alignment, AddressSpace, CostKind,
UseMaskForCond, UseMaskForGaps);		UseMaskForCond, UseMaskForGaps);
}		}
InstructionCost		InstructionCost
getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getArithmeticReductionCost(Opcode, Ty, CostKind);		return Impl.getArithmeticReductionCost(Opcode, Ty, FMF, CostKind);
}		}
InstructionCost		InstructionCost
getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,		getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getMinMaxReductionCost(Ty, CondTy, IsUnsigned, CostKind);		return Impl.getMinMaxReductionCost(Ty, CondTy, IsUnsigned, CostKind);
}		}
InstructionCost getExtendedAddReductionCost(		InstructionCost getExtendedAddReductionCost(
bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
▲ Show 20 Lines • Show All 234 Lines • Show Last 20 Lines
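
The doc comment added to `getArithmeticReductionCost` above distinguishes tree-wise from ordered reductions. The following standalone sketch shows why the distinction matters for floating point: the two orders can produce different results, so the tree form is only legal when reassociation is allowed. The function names are hypothetical, the vector is fixed at four lanes, and folding the initial value in at the end of the tree-wise form is an assumption made for illustration.

```cpp
#include <array>
#include <cassert>

// Ordered: ((((Init + v0) + v1) + v2) + v3), strictly in lane order.
inline float orderedReduce(float Init, const std::array<float, 4> &V) {
  float Result = Init;
  for (float X : V)
    Result += X;
  return Result;
}

// Tree-wise: (v0, v1, v2, v3) -> (v0+v2, v1+v3) -> (v0+v2)+(v1+v3).
// The initial value is folded in last (an assumption of this sketch).
inline float treeReduce(float Init, const std::array<float, 4> &V) {
  float Lane0 = V[0] + V[2];
  float Lane1 = V[1] + V[3];
  return Init + (Lane0 + Lane1);
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` with an initial value of `0.0f`, and assuming IEEE-754 single-precision arithmetic without fast-math compiler options, the ordered form loses the first `1.0f` to rounding and yields `1.0f`, while the tree-wise form cancels the large values first and yields `2.0f`.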

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 616 Lines • ▼ Show 20 Lines	public:
unsigned getNumberOfParts(Type *Tp) const { return 0; }		unsigned getNumberOfParts(Type *Tp) const { return 0; }

InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,		InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,
const SCEV *) const {		const SCEV *) const {
return 0;		return 0;
}		}

InstructionCost getArithmeticReductionCost(unsigned, VectorType *,		InstructionCost getArithmeticReductionCost(unsigned, VectorType *,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind) const {		TTI::TargetCostKind) const {
return 1;		return 1;
}		}

InstructionCost getMinMaxReductionCost(VectorType *, VectorType *, bool,		InstructionCost getMinMaxReductionCost(VectorType *, VectorType *, bool,
TTI::TargetCostKind) const {		TTI::TargetCostKind) const {
return 1;		return 1;
}		}
▲ Show 20 Lines • Show All 524 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 1,652 Lines • ▼ Show 20 Lines	getTypeBasedIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
case Intrinsic::masked_load: {		case Intrinsic::masked_load: {
Type *Ty = RetTy;		Type *Ty = RetTy;
Align TyAlign = thisT()->DL.getABITypeAlign(Ty);		Align TyAlign = thisT()->DL.getABITypeAlign(Ty);
return thisT()->getMaskedMemoryOpCost(Instruction::Load, Ty, TyAlign, 0,		return thisT()->getMaskedMemoryOpCost(Instruction::Load, Ty, TyAlign, 0,
CostKind);		CostKind);
}		}
case Intrinsic::vector_reduce_add:		case Intrinsic::vector_reduce_add:
return thisT()->getArithmeticReductionCost(Instruction::Add, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::Add, VecOpTy,
CostKind);		None, CostKind);
		dmgreen: Could these still use FMF? It shouldn't matter much I suppose, either way they should be empty.
case Intrinsic::vector_reduce_mul:		case Intrinsic::vector_reduce_mul:
return thisT()->getArithmeticReductionCost(Instruction::Mul, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::Mul, VecOpTy,
CostKind);		None, CostKind);
case Intrinsic::vector_reduce_and:		case Intrinsic::vector_reduce_and:
return thisT()->getArithmeticReductionCost(Instruction::And, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::And, VecOpTy,
CostKind);		None, CostKind);
case Intrinsic::vector_reduce_or:		case Intrinsic::vector_reduce_or:
return thisT()->getArithmeticReductionCost(Instruction::Or, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::Or, VecOpTy, None,
CostKind);		CostKind);
case Intrinsic::vector_reduce_xor:		case Intrinsic::vector_reduce_xor:
return thisT()->getArithmeticReductionCost(Instruction::Xor, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::Xor, VecOpTy,
CostKind);		None, CostKind);
case Intrinsic::vector_reduce_fadd:		case Intrinsic::vector_reduce_fadd:
// FIXME: Add new flag for cost of strict reductions.
return thisT()->getArithmeticReductionCost(Instruction::FAdd, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::FAdd, VecOpTy,
CostKind);		FMF, CostKind);
case Intrinsic::vector_reduce_fmul:		case Intrinsic::vector_reduce_fmul:
// FIXME: Add new flag for cost of strict reductions.
return thisT()->getArithmeticReductionCost(Instruction::FMul, VecOpTy,		return thisT()->getArithmeticReductionCost(Instruction::FMul, VecOpTy,
CostKind);		FMF, CostKind);
case Intrinsic::vector_reduce_smax:		case Intrinsic::vector_reduce_smax:
case Intrinsic::vector_reduce_smin:		case Intrinsic::vector_reduce_smin:
case Intrinsic::vector_reduce_fmax:		case Intrinsic::vector_reduce_fmax:
case Intrinsic::vector_reduce_fmin:		case Intrinsic::vector_reduce_fmin:
return thisT()->getMinMaxReductionCost(		return thisT()->getMinMaxReductionCost(
VecOpTy, cast<VectorType>(CmpInst::makeCmpResultType(VecOpTy)),		VecOpTy, cast<VectorType>(CmpInst::makeCmpResultType(VecOpTy)),
/*IsUnsigned=*/false, CostKind);		/*IsUnsigned=*/false, CostKind);
case Intrinsic::vector_reduce_umax:		case Intrinsic::vector_reduce_umax:
▲ Show 20 Lines • Show All 319 Lines • ▼ Show 20 Lines
/// <n x i32> <i32 n/4, i32 n/4 + 1, ..., i32 n/2, i32 undef, ..., i32 undef>		/// <n x i32> <i32 n/4, i32 n/4 + 1, ..., i32 n/2, i32 undef, ..., i32 undef>
/// \----------------v-------------/ \----------v------------/		/// \----------------v-------------/ \----------v------------/
/// n/4 elements 3*n/4 elements		/// n/4 elements 3*n/4 elements
/// %red2 = op <n x t> %red1, <n x t> val2 - working with the vector of		/// %red2 = op <n x t> %red1, <n x t> val2 - working with the vector of
/// length n/2, the resulting vector has length n/4 etc.		/// length n/2, the resulting vector has length n/4 etc.
///		///
/// The cost model should take into account that the actual length of the		/// The cost model should take into account that the actual length of the
/// vector is reduced on each iteration.		/// vector is reduced on each iteration.
InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		InstructionCost getTreeReductionCost(unsigned Opcode, VectorType *Ty,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
Type *ScalarTy = Ty->getElementType();		Type *ScalarTy = Ty->getElementType();
unsigned NumVecElts = cast<FixedVectorType>(Ty)->getNumElements();		unsigned NumVecElts = cast<FixedVectorType>(Ty)->getNumElements();
if ((Opcode == Instruction::Or || Opcode == Instruction::And) &&		if ((Opcode == Instruction::Or || Opcode == Instruction::And) &&
ScalarTy == IntegerType::getInt1Ty(Ty->getContext()) &&		ScalarTy == IntegerType::getInt1Ty(Ty->getContext()) &&
NumVecElts >= 2) {		NumVecElts >= 2) {
// Or reduction for i1 is represented as:		// Or reduction for i1 is represented as:
// %val = bitcast <ReduxWidth x i1> to iReduxWidth		// %val = bitcast <ReduxWidth x i1> to iReduxWidth
// %res = cmp ne iReduxWidth %val, 0		// %res = cmp ne iReduxWidth %val, 0
Show All 35 Lines	InstructionCost getTreeReductionCost(unsigned Opcode, VectorType *Ty,
// By default reductions need one shuffle per reduction level.		// By default reductions need one shuffle per reduction level.
ShuffleCost += NumReduxLevels * thisT()->getShuffleCost(		ShuffleCost += NumReduxLevels * thisT()->getShuffleCost(
TTI::SK_PermuteSingleSrc, Ty, None, 0, Ty);		TTI::SK_PermuteSingleSrc, Ty, None, 0, Ty);
ArithCost += NumReduxLevels * thisT()->getArithmeticInstrCost(Opcode, Ty);		ArithCost += NumReduxLevels * thisT()->getArithmeticInstrCost(Opcode, Ty);
return ShuffleCost + ArithCost +		return ShuffleCost + ArithCost +
thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0);		thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0);
}		}

		/// Try to calculate the cost of performing strict (in-order) reductions,
		/// which involves doing a sequence of floating point additions in lane
		/// order, starting with an initial value. For example, consider a scalar
		/// initial value 'InitVal' of type float and a vector of type <4 x float>:
		///
		/// Vector = <float %v0, float %v1, float %v2, float %v3>
		///
		/// %add1 = %InitVal + %v0
		/// %add2 = %add1 + %v1
		/// %add3 = %add2 + %v2
		/// %add4 = %add3 + %v3
		///
		/// As a simple estimate we can say the cost of such a reduction is 4 times
		/// the cost of a scalar FP addition. We can only estimate the costs for
		/// fixed-width vectors here because for scalable vectors we do not know the
		/// runtime number of operations.
		InstructionCost getOrderedReductionCost(unsigned Opcode, VectorType *Ty,
		TTI::TargetCostKind CostKind) {
		sdesmalen: Can you unify the interface for getArithmeticReductionCost to take an enum which specifies the kind of reduction we want: tree, pair-wise or in-order? If there is only a single `getReductionCost` interface where the parameters dictate having to specify what cost is being asked for, then that's difficult to get wrong (you need to fill in the parameters), but if there's another slightly more specialized variant available, people may not realise it exists and just call the most basic cost interface instead.
		dmgreen: Oh you got here first. :) I think we can possibly remove IsPairwise entirely. It seems to be only used in a single place nowadays that doesn't seem to me like it would be giving very good cost estimates.
		david-arm (Author): Hi @dmgreen, it turns out removing the pairwise form breaks lots of tests. :) I think it's still needed because `TTI::matchVectorReduction` returns the pairwise form for a lot of cases.
		dmgreen: I happened to be looking at if we can remove it too. Are you assuming that IsPairwise = false? The SLP vectorizer works differently now to how it did in the past. I can put a patch for that if you like. There is one cost test that changes, but it seems to me like they would be OK to change, and the code creating the costs removed. It is at least a (mostly) separate issue from this patch.
		david-arm (Author): The problem is that TTI::matchVectorReduction sets IsPairwise to true in some tests run by "make check-all" causing them to fail. I'm not sure where exactly this function gets called from, but I can't just remove the IsPairwise option as part of this patch. I imagine it requires some investigation to work out why the callers need that? However, if you've already got a patch that removes this then I can just rebase on top of it!
		dmgreen: Yeah, leave that bit to me. Consider it to be removed, but I think the patches can end up happening in either order.
		// Targets must implement a default value for the scalable case, since
		sdesmalen: nit: please add a comment saying that targets must implement a default for the scalable case, because we can't know how many lanes the vector has.
		// we don't know how many lanes the vector has.
		dmgreen: Should this be checking for scalable vectors?
		david-arm (Author): I don't think we can easily calculate a cost for scalable vectors here and it is very target-specific. At the moment I have modelled this on the worst case, i.e. scalarising the operation, and for scalable vectors we don't know the number of elements. The approach taken here is similar to functions elsewhere in this file, e.g. getCommonMaskedMemoryOpCost. I think if we get here for scalable vectors it's actually a bug, since the target should really have dealt with this separately. Any thoughts @sdesmalen?
		dmgreen: Could it just return an invalid cost?
		if (isa<ScalableVectorType>(Ty))
		sdesmalen: nit: add empty line above cast<FixedVectorType>
		return InstructionCost::getInvalid();
		dmgreen: I think this should also include the cost of extracts (which will sometimes be free, but that is target specific). Probably using the ScalarizationOverhead.
		david-arm (Author): Fair point. @sdesmalen suggested this too privately.

		sdesmalen: This seems to be missing the cost of extracting each of the elements which you can get using `getScalarizationOverhead(..., /*bool Extract=*/true, ...)`
		auto *VTy = cast<FixedVectorType>(Ty);
		InstructionCost ExtractCost =
		getScalarizationOverhead(VTy, /*Insert=*/false, /*Extract=*/true);
		InstructionCost ArithCost =
		getArithmeticInstrCost(Opcode, VTy->getElementType(), CostKind);
		RKSimon (not done): @david-arm I think this needs to be thisT()->getArithmeticInstrCost to use the correct TTI implementation?
		ArithCost *= VTy->getNumElements();

		return ExtractCost + ArithCost;
		}
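
The ordered-reduction cost above can be illustrated with a small standalone sketch. It does not use the LLVM API: `Cost`, `orderedReductionCost`, and the uniform per-lane extract cost are hypothetical stand-ins chosen for illustration, and the real `getScalarizationOverhead` may well return something other than a simple per-lane multiple.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for InstructionCost: std::nullopt models an
// invalid cost (the real class has its own invalid state).
using Cost = std::optional<std::uint64_t>;

// Sketch of the in-order reduction cost discussed in the review:
// every lane is extracted, then combined with one scalar operation.
// Assumption: extracting a lane has a uniform cost per lane.
Cost orderedReductionCost(bool IsScalable, unsigned NumLanes,
                          std::uint64_t ExtractCostPerLane,
                          std::uint64_t ScalarOpCost) {
  // The lane count of a scalable vector is unknown at compile time,
  // so the generic model gives up and reports an invalid cost.
  if (IsScalable)
    return std::nullopt;

  assert(NumLanes > 0 && "Expected a non-empty vector");
  std::uint64_t ExtractCost = ExtractCostPerLane * NumLanes;
  std::uint64_t ArithCost = ScalarOpCost * NumLanes;
  return ExtractCost + ArithCost;
}
```

Under these assumptions the cost grows linearly with the lane count, which is the conservative behaviour the reviewers asked for.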

		InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
		Optional<FastMathFlags> FMF,
		TTI::TargetCostKind CostKind) {
		if (TTI::requiresOrderedReduction(FMF))
		return getOrderedReductionCost(Opcode, Ty, CostKind);
		return getTreeReductionCost(Opcode, Ty, CostKind);
		}
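
As a rough illustration of why the two reduction kinds have to be distinguished, the sketch below sums the same lanes in lane order and tree-wise. The function names `orderedSum` and `treeSum` are hypothetical and not part of the patch; the point is only that floating-point addition is not associative, so the tree-wise form is legal only when reassociation is allowed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order (strict) reduction: ((((Init + v0) + v1) + v2) + ...) + vn.
double orderedSum(double Init, const std::vector<double> &Lanes) {
  double Acc = Init;
  for (double V : Lanes)
    Acc += V;
  return Acc;
}

// Tree-wise reduction: repeatedly add the upper half of the vector onto
// the lower half until a single lane remains.
// Assumption: the lane count is a power of two.
double treeSum(std::vector<double> Lanes) {
  assert(!Lanes.empty() && (Lanes.size() & (Lanes.size() - 1)) == 0 &&
         "Expected a power-of-two lane count");
  for (std::size_t Half = Lanes.size() / 2; Half >= 1; Half /= 2)
    for (std::size_t I = 0; I < Half; ++I)
      Lanes[I] += Lanes[I + Half];
  return Lanes[0];
}
```

With lanes such as `{1e17, 1.0, -1e17, 1.0}` the two orders round differently, so a strict reduction cannot be replaced by the tree form without changing the result.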

/// Try to calculate op costs for min/max reduction operations.		/// Try to calculate op costs for min/max reduction operations.
/// \param CondTy Conditional type for the Select instruction.		/// \param CondTy Conditional type for the Select instruction.
InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
bool IsUnsigned,		bool IsUnsigned,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
Type *ScalarTy = Ty->getElementType();		Type *ScalarTy = Ty->getElementType();
Type *ScalarCondTy = CondTy->getElementType();		Type *ScalarCondTy = CondTy->getElementType();
unsigned NumVecElts = cast<FixedVectorType>(Ty)->getNumElements();		unsigned NumVecElts = cast<FixedVectorType>(Ty)->getNumElements();
▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines
}		}

InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,		InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
Type *ResTy, VectorType *Ty,		Type *ResTy, VectorType *Ty,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
// Without any native support, this is equivalent to the cost of		// Without any native support, this is equivalent to the cost of
// vecreduce.add(ext) or if IsMLA vecreduce.add(mul(ext, ext))		// vecreduce.add(ext) or if IsMLA vecreduce.add(mul(ext, ext))
VectorType *ExtTy = VectorType::get(ResTy, Ty);		VectorType *ExtTy = VectorType::get(ResTy, Ty);
InstructionCost RedCost =		InstructionCost RedCost = thisT()->getArithmeticReductionCost(
thisT()->getArithmeticReductionCost(Instruction::Add, ExtTy, CostKind);		Instruction::Add, ExtTy, None, CostKind);
InstructionCost MulCost = 0;		InstructionCost MulCost = 0;
InstructionCost ExtCost = thisT()->getCastInstrCost(		InstructionCost ExtCost = thisT()->getCastInstrCost(
IsUnsigned ? Instruction::ZExt : Instruction::SExt, ExtTy, Ty,		IsUnsigned ? Instruction::ZExt : Instruction::SExt, ExtTy, Ty,
TTI::CastContextHint::None, CostKind);		TTI::CastContextHint::None, CostKind);
if (IsMLA) {		if (IsMLA) {
MulCost =		MulCost =
thisT()->getArithmeticInstrCost(Instruction::Mul, ExtTy, CostKind);		thisT()->getArithmeticInstrCost(Instruction::Mul, ExtTy, CostKind);
ExtCost *= 2;		ExtCost *= 2;
Show All 30 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 888 Lines • ▼ Show 20 Lines

	InstructionCost TargetTransformInfo::getMemcpyCost(const Instruction *I) const {			InstructionCost TargetTransformInfo::getMemcpyCost(const Instruction *I) const {
	InstructionCost Cost = TTIImpl->getMemcpyCost(I);			InstructionCost Cost = TTIImpl->getMemcpyCost(I);
	assert(Cost >= 0 && "TTI should not produce negative costs!");			assert(Cost >= 0 && "TTI should not produce negative costs!");
	return Cost;			return Cost;
	}			}

	InstructionCost TargetTransformInfo::getArithmeticReductionCost(			InstructionCost TargetTransformInfo::getArithmeticReductionCost(
	unsigned Opcode, VectorType *Ty, TTI::TargetCostKind CostKind) const {			unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
				TTI::TargetCostKind CostKind) const {
	InstructionCost Cost =			InstructionCost Cost =
	TTIImpl->getArithmeticReductionCost(Opcode, Ty, CostKind);			TTIImpl->getArithmeticReductionCost(Opcode, Ty, FMF, CostKind);
	assert(Cost >= 0 && "TTI should not produce negative costs!");			assert(Cost >= 0 && "TTI should not produce negative costs!");
	return Cost;			return Cost;
	}			}

	InstructionCost TargetTransformInfo::getMinMaxReductionCost(			InstructionCost TargetTransformInfo::getMinMaxReductionCost(
	VectorType *Ty, VectorType *CondTy, bool IsUnsigned,			VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
	TTI::TargetCostKind CostKind) const {			TTI::TargetCostKind CostKind) const {
	InstructionCost Cost =			InstructionCost Cost =
	▲ Show 20 Lines • Show All 259 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 125 Lines • ▼ Show 20 Lines	public:
}		}

Optional<unsigned> getMaxVScale() const {		Optional<unsigned> getMaxVScale() const {
if (ST->hasSVE())		if (ST->hasSVE())
return AArch64::SVEMaxBitsPerVector / AArch64::SVEBitsPerBlock;		return AArch64::SVEMaxBitsPerVector / AArch64::SVEBitsPerBlock;
return BaseT::getMaxVScale();		return BaseT::getMaxVScale();
}		}

		/// Try to return an estimate cost factor that can be used as a multiplier
		/// when scalarizing an operation for a vector with ElementCount \p VF.
		/// For scalable vectors this currently takes the most pessimistic view based
		/// upon the maximum possible value for vscale.
		dmgreen: I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF from the -mcpu argument if one is provided. If it is not, then we would guess at a value; probably VF=2 would be a sensible default. This is using the maximum possible VF, which sounds like a large over-estimate in most cases. Can you speak to why the max VF is a better value to use? I'm not sure I understand why this is scalarizing though.
		david-arm: Hi @dmgreen, yeah we are aware of this problem. It's not ideal - at the moment we also do this for gather-scatters too. We took the decision to be conservative and use the maximum value of vscale as the worst case scenario. In practice, the runtime value could vary from machine to machine and we thought it better to wait a while and revisit this again at some point. In fact, that's partly why I created this function so that we only have to change one place in future. :) I also think that always choosing the most optimistic case could lead to poor performance so we have to be careful. One option we have is to use the new vscale range IR attributes to refine this, or choose a value of vscale that represents some sort of average of real use cases?
		dmgreen: OK - more work is needed. Got it. I would have expected these cost factors to come from the subtarget, not an IR attribute. What is being scalarized here though? From https://godbolt.org/z/fcz71dPeY for example? Some of the Illegal types were hitting errors.
		david-arm: Even though there is a single faddv instruction I think for now it still makes sense to model this as being scalarised because conceptually the lanewise FP additions still have to be done in sequence, rather than tree-based.
		dmgreen: Then what does the scalarization? https://godbolt.org/z/hfeaYh8r8 TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid". Or do you mean that the fadda will have a low throughput?
		sdesmalen: The idea is that an fadda will have a low throughput because the operation is conceptually scalarized: the fadds can't be performed in parallel, i.e. `double result = ((((init + v0) + v1) + v2) + ...) + vn; // where v0 .. vn are the lanes of the vector`
		Perhaps this is more a latency than a 'throughput' issue, but if an operation has a very long latency and blocks one of the functional units, I guess that has an impact on the throughput as well. The more important thing for now is that we want to have some conservative cost value for these, so that we don't assume in-order/in-loop reductions are cheap, and so that we can tune it to return more sensible values later when we can experiment with this (after all, scalable auto-vec isn't fully functional yet). The other thing we're planning to improve is that when targeting a specific CPU, `getMaxVScale` returns the values from the max-vscale attribute in the IR, so that this cost-function no longer assumes the worst-case cost, but rather a more realistic cost based on the targeted vector length. The current implementation doesn't do this yet, but that's on our active to-do list.
		> https://godbolt.org/z/hfeaYh8r8 TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".
		`getArithmeticReductionCostSVE` already returns `Invalid` for the multiplication, but only for the unordered reductions. This now doesn't happen for the ordered case. @david-arm can you look into that?
		unsigned getMaxNumElements(ElementCount VF) const {
		dmgreen: getScalarizationCostFactor implies that it will be scalarized by codegen, and sounds similar to the already present getScalarizationOverhead. What do you think about something like getMaxNumLanes, as that appears to be what it computes.
		david-arm: Sure I can do that. I named it this way because I imagined we'll want to tweak this in future to something less pessimistic, perhaps based on mid-point between min and max? However, getMaxNumElements or Lanes works for now!
		if (!VF.isScalable())
		return VF.getFixedValue();
		sdesmalen (not done): nit: VF.getFixedValue()
		Optional<unsigned> MaxNumVScale = getMaxVScale();
		assert(MaxNumVScale && "Expected valid max vscale value");
		return *MaxNumVScale * VF.getKnownMinValue();
		}
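
To make the worst-case lane count concrete, here is a minimal standalone sketch of the same calculation. `LaneCount` and `maxNumElements` are hypothetical stand-ins for `ElementCount` and `getMaxNumElements`, not the LLVM types themselves. The figures in the note below rest on the SVE architectural limits of 2048 bits per vector and 128 bits per block, which give a maximum vscale of 16.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for llvm::ElementCount: a minimum lane count
// plus a flag saying whether it is multiplied by vscale at run time.
struct LaneCount {
  unsigned KnownMin;
  bool Scalable;
};

// Sketch of the pessimistic lane count: fixed-width vectors report
// their exact size, scalable vectors assume the largest vscale.
unsigned maxNumElements(LaneCount VF, std::optional<unsigned> MaxVScale) {
  if (!VF.Scalable)
    return VF.KnownMin;

  assert(MaxVScale && "Expected valid max vscale value");
  return *MaxVScale * VF.KnownMin;
}
```

Under that assumption, a `<vscale x 4 x float>` reduction is costed as if it had 64 lanes, which is the over-estimate dmgreen questions in the thread above.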

unsigned getMaxInterleaveFactor(unsigned VF);		unsigned getMaxInterleaveFactor(unsigned VF);

InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,		InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
Align Alignment, unsigned AddressSpace,		Align Alignment, unsigned AddressSpace,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,		InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
const Value *Ptr, bool VariableMask,		const Value *Ptr, bool VariableMask,
▲ Show 20 Lines • Show All 158 Lines • ▼ Show 20 Lines	public:
}		}

bool supportsScalableVectors() const { return ST->hasSVE(); }		bool supportsScalableVectors() const { return ST->hasSVE(); }

bool isLegalToVectorizeReduction(const RecurrenceDescriptor &RdxDesc,		bool isLegalToVectorizeReduction(const RecurrenceDescriptor &RdxDesc,
ElementCount VF) const;		ElementCount VF) const;

InstructionCost getArithmeticReductionCost(		InstructionCost getArithmeticReductionCost(
unsigned Opcode, VectorType *Ty,		unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);

InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,		InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,
ArrayRef<int> Mask, int Index,		ArrayRef<int> Mask, int Index,
VectorType *SubTp);		VectorType *SubTp);
/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_AARCH64_AARCH64TARGETTRANSFORMINFO_H		#endif // LLVM_LIB_TARGET_AARCH64_AARCH64TARGETTRANSFORMINFO_H

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 1,537 Lines • ▼ Show 20 Lines	InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
// of <vscale x 1 x eltty> yet, so return an invalid cost to avoid selecting		// of <vscale x 1 x eltty> yet, so return an invalid cost to avoid selecting
// it. This change will be removed when code-generation for these types is		// it. This change will be removed when code-generation for these types is
// sufficiently reliable.		// sufficiently reliable.
if (cast<VectorType>(DataTy)->getElementCount() ==		if (cast<VectorType>(DataTy)->getElementCount() ==
ElementCount::getScalable(1))		ElementCount::getScalable(1))
return InstructionCost::getInvalid();		return InstructionCost::getInvalid();

ElementCount LegalVF = LT.second.getVectorElementCount();		ElementCount LegalVF = LT.second.getVectorElementCount();
Optional<unsigned> MaxNumVScale = getMaxVScale();
assert(MaxNumVScale && "Expected valid max vscale value");

InstructionCost MemOpCost =		InstructionCost MemOpCost =
getMemoryOpCost(Opcode, VT->getElementType(), Alignment, 0, CostKind, I);		getMemoryOpCost(Opcode, VT->getElementType(), Alignment, 0, CostKind, I);
unsigned MaxNumElementsPerGather =		return LT.first * MemOpCost * getMaxNumElements(LegalVF);
		sdesmalen: Why are you changing the GatherScatterOpCost in this patch?
		david-arm: I'm not actually changing it - I just wanted to avoid duplicating the max vscale code for the ordered reduction case so I moved the logic into a common function `getMaxNumElements`. This is all related to calculating the scalarisation cost.
MaxNumVScale.getValue() * LegalVF.getKnownMinValue();
return LT.first * MaxNumElementsPerGather * MemOpCost;
}		}

bool AArch64TTIImpl::useNeonVector(const Type *Ty) const {		bool AArch64TTIImpl::useNeonVector(const Type *Ty) const {
return isa<FixedVectorType>(Ty) && !ST->useSVEForFixedLengthVectors();		return isa<FixedVectorType>(Ty) && !ST->useSVEForFixedLengthVectors();
}		}

InstructionCost AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,		InstructionCost AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,
MaybeAlign Alignment,		MaybeAlign Alignment,
▲ Show 20 Lines • Show All 389 Lines • ▼ Show 20 Lines	case ISD::FADD:
return LegalizationCost + 2;		return LegalizationCost + 2;
default:		default:
return InstructionCost::getInvalid();		return InstructionCost::getInvalid();
}		}
}		}

InstructionCost		InstructionCost
AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,		AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
		if (TTI::requiresOrderedReduction(FMF)) {
		if (!isa<ScalableVectorType>(ValTy))
		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);

		if (Opcode != Instruction::FAdd)
		return InstructionCost::getInvalid();

		auto *VTy = cast<ScalableVectorType>(ValTy);
		InstructionCost Cost =
		getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
		Cost *= getMaxNumElements(VTy->getElementCount());
		return Cost;
		}

if (isa<ScalableVectorType>(ValTy))		if (isa<ScalableVectorType>(ValTy))
return getArithmeticReductionCostSVE(Opcode, ValTy, CostKind);		return getArithmeticReductionCostSVE(Opcode, ValTy, CostKind);

std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
MVT MTy = LT.second;		MVT MTy = LT.second;
int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
assert(ISD && "Invalid opcode");		assert(ISD && "Invalid opcode");

// Horizontal adds can use the 'addv' instruction. We model the cost of these		// Horizontal adds can use the 'addv' instruction. We model the cost of these
		david-arm: I just spotted this unnecessary change and I'll fix it!
// instructions as twice a normal vector add, plus 1 for each legalization		// instructions as twice a normal vector add, plus 1 for each legalization
// step (LT.first). This is the only arithmetic vector reduction operation for		// step (LT.first). This is the only arithmetic vector reduction operation for
// which we have an instruction.		// which we have an instruction.
// OR, XOR and AND costs should match the codegen from:		// OR, XOR and AND costs should match the codegen from:
// OR: llvm/test/CodeGen/AArch64/reduce-or.ll		// OR: llvm/test/CodeGen/AArch64/reduce-or.ll
// XOR: llvm/test/CodeGen/AArch64/reduce-xor.ll		// XOR: llvm/test/CodeGen/AArch64/reduce-xor.ll
// AND: llvm/test/CodeGen/AArch64/reduce-and.ll		// AND: llvm/test/CodeGen/AArch64/reduce-and.ll
static const CostTblEntry CostTblNoPairwise[]{		static const CostTblEntry CostTblNoPairwise[]{
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	if (!ValVTy->getElementType()->isIntegerTy(1) &&
MTy.getVectorNumElements());		MTy.getVectorNumElements());
ExtraCost = getArithmeticInstrCost(Opcode, Ty, CostKind);		ExtraCost = getArithmeticInstrCost(Opcode, Ty, CostKind);
ExtraCost *= LT.first - 1;		ExtraCost *= LT.first - 1;
}		}
return Entry->Cost + ExtraCost;		return Entry->Cost + ExtraCost;
}		}
break;		break;
}		}
return BaseT::getArithmeticReductionCost(Opcode, ValTy, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
}		}

InstructionCost AArch64TTIImpl::getSpliceCost(VectorType *Tp, int Index) {		InstructionCost AArch64TTIImpl::getSpliceCost(VectorType *Tp, int Index) {
static const CostTblEntry ShuffleTbl[] = {		static const CostTblEntry ShuffleTbl[] = {
{ TTI::SK_Splice, MVT::nxv16i8, 1 },		{ TTI::SK_Splice, MVT::nxv16i8, 1 },
{ TTI::SK_Splice, MVT::nxv8i16, 1 },		{ TTI::SK_Splice, MVT::nxv8i16, 1 },
{ TTI::SK_Splice, MVT::nxv4i32, 1 },		{ TTI::SK_Splice, MVT::nxv4i32, 1 },
{ TTI::SK_Splice, MVT::nxv2i64, 1 },		{ TTI::SK_Splice, MVT::nxv2i64, 1 },
▲ Show 20 Lines • Show All 150 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

Show First 20 Lines • Show All 206 Lines • ▼ Show 20 Lines	bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;		const Function *Callee) const;

unsigned getInliningThresholdMultiplier() { return 11; }		unsigned getInliningThresholdMultiplier() { return 11; }
unsigned adjustInliningThreshold(const CallBase *CB) const;		unsigned adjustInliningThreshold(const CallBase *CB) const;

int getInlinerVectorBonusPercent() { return 0; }		int getInlinerVectorBonusPercent() { return 0; }

InstructionCost getArithmeticReductionCost(		InstructionCost getArithmeticReductionCost(
unsigned Opcode, VectorType *Ty,		unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);

InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);
InstructionCost getMinMaxReductionCost(		InstructionCost getMinMaxReductionCost(
VectorType *Ty, VectorType *CondTy, bool IsUnsigned,		VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);
};		};
▲ Show 20 Lines • Show All 42 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

Show First 20 Lines • Show All 837 Lines • ▼ Show 20 Lines	InstructionCost GCNTTIImpl::getCFInstrCost(unsigned Opcode,
case Instruction::Ret:		case Instruction::Ret:
return SCost ? 1 : 10;		return SCost ? 1 : 10;
}		}
return BaseT::getCFInstrCost(Opcode, CostKind, I);		return BaseT::getCFInstrCost(Opcode, CostKind, I);
}		}

InstructionCost		InstructionCost
GCNTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		GCNTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
		if (TTI::requiresOrderedReduction(FMF))
		return BaseT::getArithmeticReductionCost(Opcode, Ty, FMF, CostKind);

EVT OrigTy = TLI->getValueType(DL, Ty);		EVT OrigTy = TLI->getValueType(DL, Ty);

// Computes cost on targets that have packed math instructions(which support		// Computes cost on targets that have packed math instructions(which support
// 16-bit types only).		// 16-bit types only).
if (!ST->hasVOP3PInsts() || OrigTy.getScalarSizeInBits() != 16)		if (!ST->hasVOP3PInsts() || OrigTy.getScalarSizeInBits() != 16)
return BaseT::getArithmeticReductionCost(Opcode, Ty, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, Ty, FMF, CostKind);

std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
return LT.first * getFullRateInstrCost();		return LT.first * getFullRateInstrCost();
}		}

InstructionCost		InstructionCost
GCNTTIImpl::getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		GCNTTIImpl::getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
bool IsUnsigned,		bool IsUnsigned,
▲ Show 20 Lines • Show All 496 Lines • Show Last 20 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 251 Lines • ▼ Show 20 Lines	public:

InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,		InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
const Value *Ptr, bool VariableMask,		const Value *Ptr, bool VariableMask,
Align Alignment,		Align Alignment,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);		const Instruction *I = nullptr);

InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,		InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);
InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,		InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
Type *ResTy, VectorType *ValTy,		Type *ResTy, VectorType *ValTy,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

▲ Show 20 Lines • Show All 63 Lines • Show Last 20 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 1,588 Lines • ▼ Show 20 Lines	if (const auto *GEP = dyn_cast<GetElementPtrInst>(Ptr)) {
}		}
return ScalarCost;		return ScalarCost;
}		}
return ScalarCost;		return ScalarCost;
}		}

InstructionCost		InstructionCost
ARMTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,		ARMTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
		if (TTI::requiresOrderedReduction(FMF))
		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);

EVT ValVT = TLI->getValueType(DL, ValTy);		EVT ValVT = TLI->getValueType(DL, ValTy);
int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
if (!ST->hasMVEIntegerOps() || !ValVT.isSimple() || ISD != ISD::ADD)		if (!ST->hasMVEIntegerOps() || !ValVT.isSimple() || ISD != ISD::ADD)
return BaseT::getArithmeticReductionCost(Opcode, ValTy, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);

std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);

static const CostTblEntry CostTblAdd[]{		static const CostTblEntry CostTblAdd[]{
{ISD::ADD, MVT::v16i8, 1},		{ISD::ADD, MVT::v16i8, 1},
{ISD::ADD, MVT::v8i16, 1},		{ISD::ADD, MVT::v8i16, 1},
{ISD::ADD, MVT::v4i32, 1},		{ISD::ADD, MVT::v4i32, 1},
};		};
if (const auto *Entry = CostTableLookup(CostTblAdd, ISD, LT.second))		if (const auto *Entry = CostTableLookup(CostTblAdd, ISD, LT.second))
return Entry->Cost * ST->getMVEVectorCostFactor(CostKind) * LT.first;		return Entry->Cost * ST->getMVEVectorCostFactor(CostKind) * LT.first;

return BaseT::getArithmeticReductionCost(Opcode, ValTy, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
}		}

InstructionCost		InstructionCost
ARMTTIImpl::getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,		ARMTTIImpl::getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
Type *ResTy, VectorType *ValTy,		Type *ResTy, VectorType *ValTy,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
EVT ValVT = TLI->getValueType(DL, ValTy);		EVT ValVT = TLI->getValueType(DL, ValTy);
EVT ResVT = TLI->getValueType(DL, ResTy);		EVT ResVT = TLI->getValueType(DL, ResTy);
▲ Show 20 Lines • Show All 638 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86TargetTransformInfo.h

Show First 20 Lines • Show All 175 Lines • ▼ Show 20 Lines	public:

InstructionCost		InstructionCost
getTypeBasedIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		getTypeBasedIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);
InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

InstructionCost getArithmeticReductionCost(		InstructionCost getArithmeticReductionCost(
unsigned Opcode, VectorType *Ty,		unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind = TTI::TCK_SizeAndLatency);		TTI::TargetCostKind CostKind = TTI::TCK_SizeAndLatency);

InstructionCost getMinMaxCost(Type *Ty, Type *CondTy, bool IsUnsigned);		InstructionCost getMinMaxCost(Type *Ty, Type *CondTy, bool IsUnsigned);

InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
bool IsUnsigned,		bool IsUnsigned,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

▲ Show 20 Lines • Show All 70 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

Show First 20 Lines • Show All 3,740 Lines • ▼ Show 20 Lines	if (!BaseT::getConstantStrideStep(SE, Ptr))
return 1;		return 1;
}		}

return BaseT::getAddressComputationCost(Ty, SE, Ptr);		return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}		}

InstructionCost		InstructionCost
X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,		X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
		Optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
		if (TTI::requiresOrderedReduction(FMF))
		return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);

// We use the Intel Architecture Code Analyzer(IACA) to measure the throughput		// We use the Intel Architecture Code Analyzer(IACA) to measure the throughput
// and make it as the cost.		// and make it as the cost.

static const CostTblEntry SLMCostTblNoPairWise[] = {		static const CostTblEntry SLMCostTblNoPairWise[] = {
{ ISD::FADD, MVT::v2f64, 3 },		{ ISD::FADD, MVT::v2f64, 3 },
{ ISD::ADD, MVT::v2i64, 5 },		{ ISD::ADD, MVT::v2i64, 5 },
};		};

▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,

// Special case: vXi8 mul reductions are performed as vXi16.		// Special case: vXi8 mul reductions are performed as vXi16.
if (ISD == ISD::MUL && MTy.getScalarType() == MVT::i8) {		if (ISD == ISD::MUL && MTy.getScalarType() == MVT::i8) {
auto *WideSclTy = IntegerType::get(ValVTy->getContext(), 16);		auto *WideSclTy = IntegerType::get(ValVTy->getContext(), 16);
auto *WideVecTy = FixedVectorType::get(WideSclTy, ValVTy->getNumElements());		auto *WideVecTy = FixedVectorType::get(WideSclTy, ValVTy->getNumElements());
return getCastInstrCost(Instruction::ZExt, WideVecTy, ValTy,		return getCastInstrCost(Instruction::ZExt, WideVecTy, ValTy,
TargetTransformInfo::CastContextHint::None,		TargetTransformInfo::CastContextHint::None,
CostKind) +		CostKind) +
getArithmeticReductionCost(Opcode, WideVecTy, CostKind);		getArithmeticReductionCost(Opcode, WideVecTy, FMF, CostKind);
}		}

InstructionCost ArithmeticCost = 0;		InstructionCost ArithmeticCost = 0;
if (LT.first != 1 && MTy.isVector() &&		if (LT.first != 1 && MTy.isVector() &&
MTy.getVectorNumElements() < ValVTy->getNumElements()) {		MTy.getVectorNumElements() < ValVTy->getNumElements()) {
// Type needs to be split. We need LT.first - 1 arithmetic ops.		// Type needs to be split. We need LT.first - 1 arithmetic ops.
auto *SingleOpTy = FixedVectorType::get(ValVTy->getElementType(),		auto *SingleOpTy = FixedVectorType::get(ValVTy->getElementType(),
MTy.getVectorNumElements());		MTy.getVectorNumElements());
▲ Show 20 Lines • Show All 79 Lines • ▼ Show 20 Lines	if (ST->hasAVX2())
return ArithmeticCost + Entry->Cost;		return ArithmeticCost + Entry->Cost;
if (ST->hasAVX())		if (ST->hasAVX())
if (const auto *Entry = CostTableLookup(AVX1BoolReduction, ISD, MTy))		if (const auto *Entry = CostTableLookup(AVX1BoolReduction, ISD, MTy))
return ArithmeticCost + Entry->Cost;		return ArithmeticCost + Entry->Cost;
if (ST->hasSSE2())		if (ST->hasSSE2())
if (const auto *Entry = CostTableLookup(SSE2BoolReduction, ISD, MTy))		if (const auto *Entry = CostTableLookup(SSE2BoolReduction, ISD, MTy))
return ArithmeticCost + Entry->Cost;		return ArithmeticCost + Entry->Cost;

return BaseT::getArithmeticReductionCost(Opcode, ValVTy, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, ValVTy, FMF, CostKind);
}		}

unsigned NumVecElts = ValVTy->getNumElements();		unsigned NumVecElts = ValVTy->getNumElements();
unsigned ScalarSize = ValVTy->getScalarSizeInBits();		unsigned ScalarSize = ValVTy->getScalarSizeInBits();

// Special case power of 2 reductions where the scalar type isn't changed		// Special case power of 2 reductions where the scalar type isn't changed
// by type legalization.		// by type legalization.
if (!isPowerOf2_32(NumVecElts) || ScalarSize != MTy.getScalarSizeInBits())		if (!isPowerOf2_32(NumVecElts) || ScalarSize != MTy.getScalarSizeInBits())
return BaseT::getArithmeticReductionCost(Opcode, ValVTy, CostKind);		return BaseT::getArithmeticReductionCost(Opcode, ValVTy, FMF, CostKind);

InstructionCost ReductionCost = 0;		InstructionCost ReductionCost = 0;

auto *Ty = ValVTy;		auto *Ty = ValVTy;
if (LT.first != 1 && MTy.isVector() &&		if (LT.first != 1 && MTy.isVector() &&
MTy.getVectorNumElements() < ValVTy->getNumElements()) {		MTy.getVectorNumElements() < ValVTy->getNumElements()) {
// Type needs to be split. We need LT.first - 1 arithmetic ops.		// Type needs to be split. We need LT.first - 1 arithmetic ops.
Ty = FixedVectorType::get(ValVTy->getElementType(),		Ty = FixedVectorType::get(ValVTy->getElementType(),
[...]

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[...] Optional<InstructionCost> LoopVectorizationCostModel::getReductionPatternCost(
// the reduction on its own.		// the reduction on its own.
Instruction *LastChain = InLoopReductionImmediateChains[RetI];		Instruction *LastChain = InLoopReductionImmediateChains[RetI];
Instruction *ReductionPhi = LastChain;		Instruction *ReductionPhi = LastChain;
while (!isa<PHINode>(ReductionPhi))		while (!isa<PHINode>(ReductionPhi))
ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];		ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];

const RecurrenceDescriptor &RdxDesc =		const RecurrenceDescriptor &RdxDesc =
Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];		Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];
InstructionCost BaseCost =
TTI.getArithmeticReductionCost(RdxDesc.getOpcode(), VectorTy, CostKind);		InstructionCost BaseCost = TTI.getArithmeticReductionCost(
		RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

		// If we're using ordered reductions then we can just return the base cost
		// here, since getArithmeticReductionCost calculates the full ordered
		// reduction cost when FP reassociation is not allowed.
		[Inline comment] RKSimon: Aren't the calls to TTI.getArithmeticReductionCost the same? InstructionCost BaseCost = TTI.getArithmeticReductionCost(RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind); if (useOrderedReductions(RdxDesc)) return BaseCost;
		[Inline comment] david-arm (author): Good point! That's just a leftover from the initial version that used an enum.
		if (useOrderedReductions(RdxDesc))
		return BaseCost;

// Get the operand that was not the reduction chain and match it to one of the		// Get the operand that was not the reduction chain and match it to one of the
// patterns, returning the better cost if it is found.		// patterns, returning the better cost if it is found.
Instruction *RedOp = RetI->getOperand(1) == LastChain		Instruction *RedOp = RetI->getOperand(1) == LastChain
? dyn_cast<Instruction>(RetI->getOperand(0))		? dyn_cast<Instruction>(RetI->getOperand(0))
: dyn_cast<Instruction>(RetI->getOperand(1));		: dyn_cast<Instruction>(RetI->getOperand(1));

VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);		VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);
[...]

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[...] while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
InstructionCost TreeCost =		InstructionCost TreeCost =
V.getTreeCost(makeArrayRef(&ReducedVals[i], ReduxWidth));		V.getTreeCost(makeArrayRef(&ReducedVals[i], ReduxWidth));
InstructionCost ReductionCost =		InstructionCost ReductionCost =
getReductionCost(TTI, ReducedVals[i], ReduxWidth);		getReductionCost(TTI, ReducedVals[i], ReduxWidth, RdxFMF);
InstructionCost Cost = TreeCost + ReductionCost;		InstructionCost Cost = TreeCost + ReductionCost;
if (!Cost.isValid()) {		if (!Cost.isValid()) {
LLVM_DEBUG(dbgs() << "Encountered invalid baseline cost.\n");		LLVM_DEBUG(dbgs() << "Encountered invalid baseline cost.\n");
return false;		return false;
}		}
if (Cost >= -SLPCostThreshold) {		if (Cost >= -SLPCostThreshold) {
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",		return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",
[...] bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
return VectorizedTree != nullptr;		return VectorizedTree != nullptr;
}		}

unsigned numReductionValues() const { return ReducedVals.size(); }		unsigned numReductionValues() const { return ReducedVals.size(); }

private:		private:
/// Calculate the cost of a reduction.		/// Calculate the cost of a reduction.
InstructionCost getReductionCost(TargetTransformInfo *TTI,		InstructionCost getReductionCost(TargetTransformInfo *TTI,
Value *FirstReducedVal,		Value *FirstReducedVal, unsigned ReduxWidth,
unsigned ReduxWidth) {		FastMathFlags FMF) {
Type *ScalarTy = FirstReducedVal->getType();		Type *ScalarTy = FirstReducedVal->getType();
FixedVectorType *VectorTy = FixedVectorType::get(ScalarTy, ReduxWidth);		FixedVectorType *VectorTy = FixedVectorType::get(ScalarTy, ReduxWidth);
InstructionCost VectorCost, ScalarCost;		InstructionCost VectorCost, ScalarCost;
switch (RdxKind) {		switch (RdxKind) {
case RecurKind::Add:		case RecurKind::Add:
case RecurKind::Mul:		case RecurKind::Mul:
case RecurKind::Or:		case RecurKind::Or:
case RecurKind::And:		case RecurKind::And:
case RecurKind::Xor:		case RecurKind::Xor:
case RecurKind::FAdd:		case RecurKind::FAdd:
case RecurKind::FMul: {		case RecurKind::FMul: {
unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);		unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
VectorCost = TTI->getArithmeticReductionCost(RdxOpcode, VectorTy);		VectorCost = TTI->getArithmeticReductionCost(RdxOpcode, VectorTy, FMF);
ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);		ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
break;		break;
}		}
case RecurKind::FMax:		case RecurKind::FMax:
case RecurKind::FMin: {		case RecurKind::FMin: {
auto *VecCondTy = cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));		auto *VecCondTy = cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));
VectorCost = TTI->getMinMaxReductionCost(VectorTy, VecCondTy,		VectorCost = TTI->getMinMaxReductionCost(VectorTy, VecCondTy,
/*unsigned=*/false);		/*unsigned=*/false);
[...]

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll

This file was added.

				; RUN: opt -cost-model -analyze -mtriple=aarch64--linux-gnu < %s | FileCheck %s

				define void @strict_fp_reductions() {
				; CHECK-LABEL: strict_fp_reductions
				; CHECK-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
				%fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)
				%fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)
				%fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)
				%fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)

				ret void
				}

				declare float @llvm.vector.reduce.fadd.v4f32(float, <4 x float>)
				declare float @llvm.vector.reduce.fadd.v8f32(float, <8 x float>)
				declare double @llvm.vector.reduce.fadd.v2f64(double, <2 x double>)
				declare double @llvm.vector.reduce.fadd.v4f64(double, <4 x double>)
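
To illustrate why the strict reductions exercised by this test need a different cost from fast-math ones, here is a small standalone sketch (not LLVM code). The names `orderedFAdd`, `treeFAdd`, and `orderedCostSketch` are hypothetical, and the cost formula is an assumption made for illustration; it is not intended to reproduce the numbers in the CHECK lines above.

```cpp
#include <cstddef>
#include <vector>

// Strict (in-order) reduction: ((Start + V[0]) + V[1]) + ... in lane order.
// Every addition depends on the previous one, so the work is serial.
float orderedFAdd(float Start, const std::vector<float> &V) {
  float Acc = Start;
  for (float X : V)
    Acc += X;
  return Acc;
}

// Tree-wise reduction: repeatedly add the upper half onto the lower half.
// Assumes V is non-empty and its size is a power of two. This reassociates
// the additions, which is only valid when FP reassociation is allowed.
float treeFAdd(std::vector<float> V) {
  for (std::size_t Half = V.size() / 2; Half >= 1; Half /= 2)
    for (std::size_t I = 0; I < Half; ++I)
      V[I] += V[I + Half];
  return V[0];
}

// Hypothetical cost sketch (an assumption, not the target's real numbers):
// one lane extract plus one scalar operation for every lane.
unsigned orderedCostSketch(unsigned NumLanes, unsigned ExtractCost,
                           unsigned ScalarOpCost) {
  return NumLanes * (ExtractCost + ScalarOpCost);
}
```

With IEEE single precision and no excess precision, the input `{1e8f, 1.0f, -1e8f, 1.0f}` gives `1.0f` from `orderedFAdd(0.0f, ...)` but `2.0f` from `treeFAdd`, so the two forms are not interchangeable without reassociation. The ordered cost grows linearly with the lane count, whereas a tree reduction needs only a logarithmic number of vector steps.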

llvm/test/Analysis/CostModel/AArch64/sve-intrinsics.ll

[...] ; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %smax_nxv4i64 = call i64 @llvm.vector.reduce.smax.nxv4i64(<vscale x 4 x i64> %v1)
%umin_nxv4i64 = call i64 @llvm.vector.reduce.umin.nxv4i64(<vscale x 4 x i64> %v1)		%umin_nxv4i64 = call i64 @llvm.vector.reduce.umin.nxv4i64(<vscale x 4 x i64> %v1)
%smin_nxv4i32 = call i32 @llvm.vector.reduce.smin.nxv4i32(<vscale x 4 x i32> %v0)		%smin_nxv4i32 = call i32 @llvm.vector.reduce.smin.nxv4i32(<vscale x 4 x i32> %v0)
%smin_nxv4i64 = call i64 @llvm.vector.reduce.smin.nxv4i64(<vscale x 4 x i64> %v1)		%smin_nxv4i64 = call i64 @llvm.vector.reduce.smin.nxv4i64(<vscale x 4 x i64> %v1)
%umax_nxv4i32 = call i32 @llvm.vector.reduce.umax.nxv4i32(<vscale x 4 x i32> %v0)		%umax_nxv4i32 = call i32 @llvm.vector.reduce.umax.nxv4i32(<vscale x 4 x i32> %v0)
%umax_nxv4i64 = call i64 @llvm.vector.reduce.umax.nxv4i64(<vscale x 4 x i64> %v1)		%umax_nxv4i64 = call i64 @llvm.vector.reduce.umax.nxv4i64(<vscale x 4 x i64> %v1)
%smax_nxv4i32 = call i32 @llvm.vector.reduce.smax.nxv4i32(<vscale x 4 x i32> %v0)		%smax_nxv4i32 = call i32 @llvm.vector.reduce.smax.nxv4i32(<vscale x 4 x i32> %v0)
%smax_nxv4i64 = call i64 @llvm.vector.reduce.smax.nxv4i64(<vscale x 4 x i64> %v1)		%smax_nxv4i64 = call i64 @llvm.vector.reduce.smax.nxv4i64(<vscale x 4 x i64> %v1)

; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_nxv4f32 = call float @llvm.vector.reduce.fadd.nxv4f32(float 0.000000e+00, <vscale x 4 x float> %v2)		; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_nxv4f32 = call fast float @llvm.vector.reduce.fadd.nxv4f32(float 0.000000e+00, <vscale x 4 x float> %v2)
; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_nxv4f64 = call double @llvm.vector.reduce.fadd.nxv4f64(double 0.000000e+00, <vscale x 4 x double> %v3)		; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_nxv4f64 = call fast double @llvm.vector.reduce.fadd.nxv4f64(double 0.000000e+00, <vscale x 4 x double> %v3)
; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fmin_nxv4f32 = call float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float> %v2)		; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fmin_nxv4f32 = call fast float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float> %v2)
; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fmin_nxv4f64 = call double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double> %v3)		; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fmin_nxv4f64 = call fast double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double> %v3)
; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fmax_nxv4f32 = call float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float> %v2)		; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fmax_nxv4f32 = call fast float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float> %v2)
; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fmax_nxv4f64 = call double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double> %v3)		; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fmax_nxv4f64 = call fast double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double> %v3)
%fadd_nxv4f32 = call float @llvm.vector.reduce.fadd.nxv4f32(float 0.0, <vscale x 4 x float> %v2)		%fadd_nxv4f32 = call fast float @llvm.vector.reduce.fadd.nxv4f32(float 0.0, <vscale x 4 x float> %v2)
%fadd_nxv4f64 = call double @llvm.vector.reduce.fadd.nxv4f64(double 0.0, <vscale x 4 x double> %v3)		%fadd_nxv4f64 = call fast double @llvm.vector.reduce.fadd.nxv4f64(double 0.0, <vscale x 4 x double> %v3)
%fmin_nxv4f32 = call float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float> %v2)		%fmin_nxv4f32 = call fast float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float> %v2)
%fmin_nxv4f64 = call double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double> %v3)		%fmin_nxv4f64 = call fast double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double> %v3)
%fmax_nxv4f32 = call float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float> %v2)		%fmax_nxv4f32 = call fast float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float> %v2)
%fmax_nxv4f64 = call double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double> %v3)		%fmax_nxv4f64 = call fast double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double> %v3)

ret void		ret void
}		}

		define void @strict_fp_reductions(<vscale x 4 x float> %v0, <vscale x 4 x double> %v1) {
		; CHECK-LABEL: 'strict_fp_reductions'
		; CHECK-NEXT: Cost Model: Found an estimated cost of 128 for instruction: %fadd_nxv4f32 = call float @llvm.vector.reduce.fadd.nxv4f32(float 0.000000e+00, <vscale x 4 x float> %v0)
		; CHECK-NEXT: Cost Model: Found an estimated cost of 128 for instruction: %fadd_nxv4f64 = call double @llvm.vector.reduce.fadd.nxv4f64(double 0.000000e+00, <vscale x 4 x double> %v1)
		; CHECK-NEXT: Cost Model: Invalid cost for instruction: %fmul_nxv4f32 = call float @llvm.vector.reduce.fmul.nxv4f32(float 0.000000e+00, <vscale x 4 x float> %v0)
		; CHECK-NEXT: Cost Model: Invalid cost for instruction: %fmul_nxv4f64 = call double @llvm.vector.reduce.fmul.nxv4f64(double 0.000000e+00, <vscale x 4 x double> %v1)
		%fadd_nxv4f32 = call float @llvm.vector.reduce.fadd.nxv4f32(float 0.0, <vscale x 4 x float> %v0)
		%fadd_nxv4f64 = call double @llvm.vector.reduce.fadd.nxv4f64(double 0.0, <vscale x 4 x double> %v1)
		%fmul_nxv4f32 = call float @llvm.vector.reduce.fmul.nxv4f32(float 0.0, <vscale x 4 x float> %v0)
		%fmul_nxv4f64 = call double @llvm.vector.reduce.fmul.nxv4f64(double 0.0, <vscale x 4 x double> %v1)

		ret void
		}

declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.add.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.add.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.mul.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.mul.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.mul.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.mul.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.and.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.and.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.and.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.and.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.or.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.or.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.or.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.or.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.xor.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.xor.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.xor.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.xor.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.umin.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.umin.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.umin.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.umin.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.smin.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.smin.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.smin.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.smin.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.umax.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.umax.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.umax.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.umax.nxv4i64(<vscale x 4 x i64>)
declare i32 @llvm.vector.reduce.smax.nxv4i32(<vscale x 4 x i32>)		declare i32 @llvm.vector.reduce.smax.nxv4i32(<vscale x 4 x i32>)
declare i64 @llvm.vector.reduce.smax.nxv4i64(<vscale x 4 x i64>)		declare i64 @llvm.vector.reduce.smax.nxv4i64(<vscale x 4 x i64>)
declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)		declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
declare double @llvm.vector.reduce.fadd.nxv4f64(double, <vscale x 4 x double>)		declare double @llvm.vector.reduce.fadd.nxv4f64(double, <vscale x 4 x double>)
		declare float @llvm.vector.reduce.fmul.nxv4f32(float, <vscale x 4 x float>)
		declare double @llvm.vector.reduce.fmul.nxv4f64(double, <vscale x 4 x double>)
declare float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float>)		declare float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float>)
declare double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double>)		declare double @llvm.vector.reduce.fmin.nxv4f64(<vscale x 4 x double>)
declare float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float>)		declare float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float>)
declare double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double>)		declare double @llvm.vector.reduce.fmax.nxv4f64(<vscale x 4 x double>)


define void @count_zeroes(<vscale x 4 x i32> %A) {		define void @count_zeroes(<vscale x 4 x i32> %A) {
; CHECK-LABEL: 'count_zeroes'		; CHECK-LABEL: 'count_zeroes'
[...]
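
The `strict_fp_reductions` checks in sve-intrinsics.ll above rely on the worst-case lane count described in the summary: for a scalable vector the number of lanes is unknown at compile time, so the maximum vscale is assumed. The sketch below shows that arithmetic only. The helper names are hypothetical, the maximum vscale of 16 assumes SVE's 2048-bit architectural limit divided by the 128-bit granule, and the per-lane cost of 2 is an illustrative assumption rather than a value taken from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: worst-case lane count of <vscale x MinLanes x T>
// when vscale can be at most MaxVScale.
std::uint64_t maxLanes(std::uint64_t MinLanes, std::uint64_t MaxVScale) {
  return MinLanes * MaxVScale;
}

// Hypothetical strict-reduction cost: every worst-case lane is charged one
// scalar operation of cost PerLaneCost (an assumed figure).
std::uint64_t strictScalableCostSketch(std::uint64_t MinLanes,
                                       std::uint64_t MaxVScale,
                                       std::uint64_t PerLaneCost) {
  return maxLanes(MinLanes, MaxVScale) * PerLaneCost;
}
```

Under these assumptions, `strictScalableCostSketch(4, 16, 2)` evaluates to 128, the same figure the test checks for the strict fadd reductions; the actual breakdown inside the target cost hook may differ.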

llvm/test/Analysis/CostModel/X86/intrinsic-cost-kinds.ll

	[...]
	; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void			; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void
	;			;
	%v = call float @llvm.vector.reduce.fmax.v16f32(<16 x float> %va)			%v = call float @llvm.vector.reduce.fmax.v16f32(<16 x float> %va)
	ret void			ret void
	}			}

	define void @reduce_fmul(<16 x float> %va) {			define void @reduce_fmul(<16 x float> %va) {
	; THRU-LABEL: 'reduce_fmul'			; THRU-LABEL: 'reduce_fmul'
	; THRU-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)			; THRU-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)
	; THRU-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; THRU-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; LATE-LABEL: 'reduce_fmul'			; LATE-LABEL: 'reduce_fmul'
	; LATE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)			; LATE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)
	; LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void			; LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void
	;			;
	; SIZE-LABEL: 'reduce_fmul'			; SIZE-LABEL: 'reduce_fmul'
	; SIZE-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)			; SIZE-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)
	; SIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void			; SIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void
	;			;
	; SIZE_LATE-LABEL: 'reduce_fmul'			; SIZE_LATE-LABEL: 'reduce_fmul'
	; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)			; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %v = call float @llvm.vector.reduce.fmul.v16f32(float 4.200000e+01, <16 x float> %va)
	; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void			; SIZE_LATE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void
	;			;
	%v = call float @llvm.vector.reduce.fmul.v16f32(float 42.0, <16 x float> %va)			%v = call float @llvm.vector.reduce.fmul.v16f32(float 42.0, <16 x float> %va)
	ret void			ret void
	}			}

	define void @reduce_fadd_fast(<16 x float> %va) {			define void @reduce_fadd_fast(<16 x float> %va) {
	; THRU-LABEL: 'reduce_fadd_fast'			; THRU-LABEL: 'reduce_fadd_fast'
	[...]

llvm/test/Analysis/CostModel/X86/reduce-fadd.ll

	; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 | FileCheck %s --check-prefixes=SSE2			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 | FileCheck %s --check-prefixes=SSE2
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 | FileCheck %s --check-prefixes=SSSE3			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 | FileCheck %s --check-prefixes=SSSE3
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.1 | FileCheck %s --check-prefixes=SSE41			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.1 | FileCheck %s --check-prefixes=SSE41
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 | FileCheck %s --check-prefixes=SSE42			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 | FileCheck %s --check-prefixes=SSE42
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx | FileCheck %s --check-prefixes=AVX1			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx | FileCheck %s --check-prefixes=AVX1
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 | FileCheck %s --check-prefixes=AVX2			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 | FileCheck %s --check-prefixes=AVX2
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f | FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f | FileCheck %s --check-prefixes=AVX512
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=AVX512
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq | FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq | FileCheck %s --check-prefixes=AVX512

	define void @reduce_f64(double %arg) {			define void @reduce_f64(double %arg) {
	; SSE2-LABEL: 'reduce_f64'			; SSE2-LABEL: 'reduce_f64'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
				[Inline comment] RKSimon: This looks too high for what is just a single f64 fadd (SSE floating point extract from 0 is free). It might be a problem in x86 scalarization overhead?
				[Inline comment] david-arm (author): Possibly so. I guess that's not something that should be fixed in this patch though? I imagine that's equally a problem for other X86 operations on a v1f64 type, i.e. see the end of getIntrinsicInstrCost, which does the same thing and makes no special case for v1XX types.
				[Inline comment] RKSimon: Agreed, we can investigate that in a follow-up.
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSSE3-LABEL: 'reduce_f64'			; SSSE3-LABEL: 'reduce_f64'
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE41-LABEL: 'reduce_f64'			; SSE41-LABEL: 'reduce_f64'
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE42-LABEL: 'reduce_f64'			; SSE42-LABEL: 'reduce_f64'
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX1-LABEL: 'reduce_f64'			; AVX1-LABEL: 'reduce_f64'
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX2-LABEL: 'reduce_f64'			; AVX2-LABEL: 'reduce_f64'
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX512-LABEL: 'reduce_f64'			; AVX512-LABEL: 'reduce_f64'
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 84 for instruction: %V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			%V1 = call double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	%V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			%V2 = call double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	%V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			%V4 = call double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	%V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			%V8 = call double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	%V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			%V16 = call double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	ret void			ret void
	}			}

	define void @reduce_f32(float %arg) {			define void @reduce_f32(float %arg) {
				RKSimon: please can you remove this change now that we have this test coverage below
	; SSE2-LABEL: 'reduce_f32'			; SSE2-LABEL: 'reduce_f32'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 76 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 152 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSSE3-LABEL: 'reduce_f32'			; SSSE3-LABEL: 'reduce_f32'
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 76 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 152 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE41-LABEL: 'reduce_f32'			; SSE41-LABEL: 'reduce_f32'
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 76 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 152 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE42-LABEL: 'reduce_f32'			; SSE42-LABEL: 'reduce_f32'
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 76 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 152 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX1-LABEL: 'reduce_f32'			; AVX1-LABEL: 'reduce_f32'
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 84 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 168 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX2-LABEL: 'reduce_f32'			; AVX2-LABEL: 'reduce_f32'
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 84 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 168 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX512-LABEL: 'reduce_f32'			; AVX512-LABEL: 'reduce_f32'
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 176 for instruction: %V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)			%V1 = call float @llvm.vector.reduce.fadd.v1f32(float %arg, <1 x float> undef)
	%V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)			%V2 = call float @llvm.vector.reduce.fadd.v2f32(float %arg, <2 x float> undef)
	%V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)			%V4 = call float @llvm.vector.reduce.fadd.v4f32(float %arg, <4 x float> undef)
	%V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)			%V8 = call float @llvm.vector.reduce.fadd.v8f32(float %arg, <8 x float> undef)
	%V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)			%V16 = call float @llvm.vector.reduce.fadd.v16f32(float %arg, <16 x float> undef)
	%V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)			%V32 = call float @llvm.vector.reduce.fadd.v32f32(float %arg, <32 x float> undef)
	ret void			ret void
	}			}

	; Fast Reductions			; Fast Reductions

				RKSimon: please can you remove this change now that we have this test coverage below
				david-arm (Author): Hi @RKSimon, the problem is that if I revert the change the tests fail. This is because we were previously calling the intrinsic without `fast`, which means that we choose the very expensive cost for an ordered reduction. If you're happy I can remove the `fast` attribute, but I will then have to update all of the costs?
				RKSimon: Yes please - the update_analyze_test_checks.py script should do it all for you
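
The distinction discussed in this thread, between a call without `fast` (ordered) and a call with `fast` (tree-wise), can be illustrated with a small standalone sketch. This is not LLVM code and does not reproduce the cost numbers in these tests. The function names `reduceOrdered` and `reduceTree` are hypothetical, and the tree version assumes a non-empty, power-of-two lane count. It only shows why an ordered reduction has to be modelled as a chain of scalar additions in lane order: reassociating the additions can change the floating-point result.

```cpp
#include <cstddef>
#include <vector>

// Strict (in-order) reduction: start from the accumulator and add each lane
// in lane order. This mirrors the semantics of a vector fadd reduction when
// reassociation is not permitted (hypothetical helper, not an LLVM API).
float reduceOrdered(float Start, const std::vector<float> &Lanes) {
  float Acc = Start;
  for (float Lane : Lanes)
    Acc += Lane;
  return Acc;
}

// Tree-wise reduction: repeatedly add the upper half of the vector onto the
// lower half until one lane remains, then fold in the start value.
// Assumes Lanes is non-empty and its size is a power of two.
float reduceTree(float Start, std::vector<float> Lanes) {
  for (std::size_t Half = Lanes.size() / 2; Half >= 1; Half /= 2) {
    for (std::size_t I = 0; I < Half; ++I)
      Lanes[I] += Lanes[I + Half];
  }
  return Start + Lanes[0];
}
```

With IEEE single precision and the lanes `{1e8f, 1.0f, -1e8f, 1.0f}`, the ordered version loses the first `1.0f` when it is added to `1e8f`, while the tree version cancels the two large values first, so the two functions return different sums. That difference is why the tree-wise form is only legal when reassociation is allowed.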
	define void @reduce_f64_fast(double %arg) {			define void @reduce_f64_fast(double %arg) {
	; SSE2-LABEL: 'reduce_f64_fast'			; SSE2-LABEL: 'reduce_f64_fast'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call fast double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call fast double @llvm.vector.reduce.fadd.v1f64(double %arg, <1 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call fast double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call fast double @llvm.vector.reduce.fadd.v2f64(double %arg, <2 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast double @llvm.vector.reduce.fadd.v4f64(double %arg, <4 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call fast double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call fast double @llvm.vector.reduce.fadd.v8f64(double %arg, <8 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16 = call fast double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16 = call fast double @llvm.vector.reduce.fadd.v16f64(double %arg, <16 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void

llvm/test/Analysis/CostModel/X86/reduce-fmul.ll

	; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 \| FileCheck %s --check-prefixes=SSE2			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 \| FileCheck %s --check-prefixes=SSE2
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 \| FileCheck %s --check-prefixes=SSSE3			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 \| FileCheck %s --check-prefixes=SSSE3
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.1 \| FileCheck %s --check-prefixes=SSE41			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.1 \| FileCheck %s --check-prefixes=SSE41
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 \| FileCheck %s --check-prefixes=SSE42			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 \| FileCheck %s --check-prefixes=SSE42
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx \| FileCheck %s --check-prefixes=AVX1			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx \| FileCheck %s --check-prefixes=AVX1
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 \| FileCheck %s --check-prefixes=AVX2			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 \| FileCheck %s --check-prefixes=AVX2
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f \| FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f \| FileCheck %s --check-prefixes=AVX512
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw \| FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw \| FileCheck %s --check-prefixes=AVX512
	; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq \| FileCheck %s --check-prefixes=AVX512			; RUN: opt < %s -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq \| FileCheck %s --check-prefixes=AVX512

	define void @reduce_f64(double %arg) {			define void @reduce_f64(double %arg) {
	; SSE2-LABEL: 'reduce_f64'			; SSE2-LABEL: 'reduce_f64'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSSE3-LABEL: 'reduce_f64'			; SSSE3-LABEL: 'reduce_f64'
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE41-LABEL: 'reduce_f64'			; SSE41-LABEL: 'reduce_f64'
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE42-LABEL: 'reduce_f64'			; SSE42-LABEL: 'reduce_f64'
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX1-LABEL: 'reduce_f64'			; AVX1-LABEL: 'reduce_f64'
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX2-LABEL: 'reduce_f64'			; AVX2-LABEL: 'reduce_f64'
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX512-LABEL: 'reduce_f64'			; AVX512-LABEL: 'reduce_f64'
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			%V1 = call double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	%V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			%V2 = call double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	%V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			%V4 = call double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	%V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			%V8 = call double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	%V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			%V16 = call double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	ret void			ret void
	}			}

	define void @reduce_f32(float %arg) {			define void @reduce_f32(float %arg) {
				RKSimon: Please can you remove this change now that we have this test coverage below?
	; SSE2-LABEL: 'reduce_f32'			; SSE2-LABEL: 'reduce_f32'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSSE3-LABEL: 'reduce_f32'			; SSSE3-LABEL: 'reduce_f32'
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; SSSE3-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE41-LABEL: 'reduce_f32'			; SSE41-LABEL: 'reduce_f32'
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; SSE41-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE41-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; SSE42-LABEL: 'reduce_f32'			; SSE42-LABEL: 'reduce_f32'
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; SSE42-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE42-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX1-LABEL: 'reduce_f32'			; AVX1-LABEL: 'reduce_f32'
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; AVX1-NEXT: Cost Model: Found an estimated cost of 104 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX1-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX2-LABEL: 'reduce_f32'			; AVX2-LABEL: 'reduce_f32'
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; AVX2-NEXT: Cost Model: Found an estimated cost of 104 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	; AVX512-LABEL: 'reduce_f32'			; AVX512-LABEL: 'reduce_f32'
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 56 for instruction: %V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			; AVX512-NEXT: Cost Model: Found an estimated cost of 112 for instruction: %V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)			%V1 = call float @llvm.vector.reduce.fmul.v1f32(float %arg, <1 x float> undef)
	%V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)			%V2 = call float @llvm.vector.reduce.fmul.v2f32(float %arg, <2 x float> undef)
	%V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)			%V4 = call float @llvm.vector.reduce.fmul.v4f32(float %arg, <4 x float> undef)
	%V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)			%V8 = call float @llvm.vector.reduce.fmul.v8f32(float %arg, <8 x float> undef)
	%V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)			%V16 = call float @llvm.vector.reduce.fmul.v16f32(float %arg, <16 x float> undef)
	%V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)			%V32 = call float @llvm.vector.reduce.fmul.v32f32(float %arg, <32 x float> undef)
	ret void			ret void
	}			}

	; Fast Reductions			; Fast Reductions

				RKSimon: Please can you remove this change now that we have this test coverage below?
	define void @reduce_f64_fast(double %arg) {			define void @reduce_f64_fast(double %arg) {
	; SSE2-LABEL: 'reduce_f64_fast'			; SSE2-LABEL: 'reduce_f64_fast'
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call fast double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V1 = call fast double @llvm.vector.reduce.fmul.v1f64(double %arg, <1 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast double @llvm.vector.reduce.fmul.v2f64(double %arg, <2 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4 = call fast double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4 = call fast double @llvm.vector.reduce.fmul.v4f64(double %arg, <4 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V8 = call fast double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %V8 = call fast double @llvm.vector.reduce.fmul.v8f64(double %arg, <8 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V16 = call fast double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)			; SSE2-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V16 = call fast double @llvm.vector.reduce.fmul.v16f64(double %arg, <16 x double> undef)
	; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	; ... (142 lines of this file not shown)

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -debug -disable-output -enable-strict-reductions=true -hints-allow-reordering=false \
				; RUN: -force-vector-width=4 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF4
				; RUN: opt < %s -loop-vectorize -debug -disable-output -enable-strict-reductions=true -hints-allow-reordering=false \
				; RUN: -force-vector-width=8 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF8

				target triple="aarch64-unknown-linux-gnu"

				; CHECK-VF4: Found an estimated cost of 17 for VF 4 For instruction: %add = fadd float %0, %sum.07
				; CHECK-VF8: Found an estimated cost of 34 for VF 8 For instruction: %add = fadd float %0, %sum.07

				define float @fadd_strict32(float* noalias nocapture readonly %a, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%add = fadd float %0, %sum.07
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				ret float %add
				}


				; CHECK-VF4: Found an estimated cost of 14 for VF 4 For instruction: %add = fadd double %0, %sum.07
				; CHECK-VF8: Found an estimated cost of 28 for VF 8 For instruction: %add = fadd double %0, %sum.07

				define double @fadd_strict64(double* noalias nocapture readonly %a, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %a, i64 %iv
				%0 = load double, double* %arrayidx, align 4
				%add = fadd double %0, %sum.07
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				ret double %add
				}
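
For readers unfamiliar with why these loops need a separate cost, the following is a minimal sketch (not code from the patch) contrasting an in-order reduction with a tree-wise one. The function names `orderedReduce` and `treeReduce` are hypothetical and chosen for illustration only. With the inputs `{1e8f, 1.0f, -1e8f, 1.0f}` and a start value of `0.0f`, the in-order form yields `1.0f` while the tree-wise form yields `2.0f`, which is why the vectorizer may not reassociate the `fadd` chain in the tests above unless reordering is allowed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order ("strict") reduction: one scalar add per lane, in lane order.
// This mirrors the scalar loop in the tests, where each iteration adds
// the loaded element to the running sum.
float orderedReduce(float start, const std::vector<float> &lanes) {
  float acc = start;
  for (float v : lanes)
    acc += v;
  return acc;
}

// Tree-wise reduction: repeatedly add the upper half of the vector onto
// the lower half until one lane remains, then fold in the start value.
// Assumption: lanes.size() is a non-zero power of two.
float treeReduce(float start, std::vector<float> lanes) {
  for (std::size_t half = lanes.size() / 2; half >= 1; half /= 2)
    for (std::size_t i = 0; i < half; ++i)
      lanes[i] += lanes[i + half];
  return start + lanes[0];
}
```

The in-order version performs as many dependent scalar additions as there are lanes, whereas the tree-wise version needs only log2(lanes) vector steps, so it is plausible for the two to be costed very differently.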

llvm/test/Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -debug -disable-output -enable-strict-reductions=true -hints-allow-reordering=false \
				; RUN: -scalable-vectorization=on -force-vector-width=4 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF4
				; RUN: opt < %s -loop-vectorize -debug -disable-output -enable-strict-reductions=true -hints-allow-reordering=false \
				; RUN: -scalable-vectorization=on -force-vector-width=8 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF8

				target triple="aarch64-unknown-linux-gnu"

				; CHECK-VF4: Found an estimated cost of 128 for VF vscale x 4 For instruction: %add = fadd float %0, %sum.07
				; CHECK-VF8: Found an estimated cost of 256 for VF vscale x 8 For instruction: %add = fadd float %0, %sum.07

				define float @fadd_strict32(float* noalias nocapture readonly %a, i64 %n) #0 {
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%add = fadd float %0, %sum.07
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				ret float %add
				}


				; CHECK-VF4: Found an estimated cost of 128 for VF vscale x 4 For instruction: %add = fadd double %0, %sum.07
				; CHECK-VF8: Found an estimated cost of 256 for VF vscale x 8 For instruction: %add = fadd double %0, %sum.07

				define double @fadd_strict64(double* noalias nocapture readonly %a, i64 %n) #0 {
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %a, i64 %iv
				%0 = load double, double* %arrayidx, align 4
				%add = fadd double %0, %sum.07
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				ret double %add
				}

				attributes #0 = { "target-features"="+sve" }

				!0 = distinct !{!0, !1}
				!1 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				kmclaughlin: Since the `-force-vector-width` flag is used in the RUN lines for this test, can you please remove this hint and add `-force-vector-interleave=1`?
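
As a rough sanity check on the SVE expectations in this test, here is a small sketch (again, not the patch's implementation) of the worst-case arithmetic described in the summary. It assumes a maximum `vscale` of 16 for SVE, which follows from the architectural maximum vector length of 2048 bits over a 128-bit granule. The per-lane cost of 2 is back-derived from the CHECK lines and is purely an assumption; the names `worstCaseLanes`, `kAssumedMaxVScale`, `kAssumedPerLaneCost` and `orderedCostEstimate` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case number of lanes in <vscale x minLanes x T> when vscale takes
// its largest permitted value.
constexpr std::uint64_t worstCaseLanes(std::uint64_t minLanes,
                                       std::uint64_t maxVScale) {
  return minLanes * maxVScale;
}

// Assumption: SVE allows vectors of up to 2048 bits in 128-bit granules.
constexpr std::uint64_t kAssumedMaxVScale = 2048 / 128;

// Assumption: inferred from the CHECK lines in this test, not taken from
// the cost model source.
constexpr std::uint64_t kAssumedPerLaneCost = 2;

// Estimated cost of an in-order reduction treated as one scalar operation
// per worst-case lane.
constexpr std::uint64_t orderedCostEstimate(std::uint64_t minLanes) {
  return worstCaseLanes(minLanes, kAssumedMaxVScale) * kAssumedPerLaneCost;
}
```

Under these assumptions, `orderedCostEstimate(4)` is 128 and `orderedCostEstimate(8)` is 256, matching the CHECK-VF4 and CHECK-VF8 values for both the float and double loops.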

Diff 361600