Page MenuHomePhabricator

[SLP]Cost for a constant buildvector.
ClosedPublic

Authored by ABataev on Jun 2 2022, 7:56 AM.

Details

Summary

Usually, constant buildvector results in a vector load from a
constant/data pool.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2022, 7:56 AM
ABataev requested review of this revision.Jun 2 2022, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2022, 7:56 AM

Some unfortunate regressions

llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)
vdmitrie added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5932

This estimate should be bit more complicated.
Here are the things that can additionally be considered:

  • for scalar floating point ops a constant operand is normally loaded from memory too.
  • if it is an operand of instruction that becomes immediate (like shift value) and is splat - cost is zero.
  • for a scalar integer op a constant operand is typically an immediate, so this estimate works in most cases but there is an exception: 64 bits operations on a 32bits target. That should be taken into account too.
ABataev updated this revision to Diff 434112.Jun 3 2022, 12:46 PM

Address comments.

ABataev added inline comments.Jun 3 2022, 12:51 PM
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)

llvm-mca reports throughputs:
For scalar code - 5.5
AVX vector - 8.0
AVX2 vector - 5.0

https://godbolt.org/z/rEc74dxza

xbolva00 added inline comments.Jun 3 2022, 1:03 PM
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)

Ah, right, these checks are for avx.

Test case for collection.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3810

Note that it is free for splat shift value only.
https://godbolt.org/z/KMf6sW5n3

Test case for collection.

mca shows that these 2 instructions has the same cost, so it actually doers not matter. Probably worth to add some other instructions, which can load params directly from memory for x86

vdmitrie added inline comments.Jun 3 2022, 2:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936

Just wondering is that possible for UserTreeIndices to be empty here? AFAIU it can be for root only but constants do not seed vtree.
if alternate opcodes are for shl/shr but shift value is splat it is still can be immediate for both of them.

5955

drop it?

ABataev added inline comments.Jun 3 2022, 2:42 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936
  1. If constants are reduced values in reduction ops.
  2. That's why there is a TODO above.
5955

What do you mean?

vdmitrie added inline comments.Jun 3 2022, 2:47 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936

okay. Although I believe it is not SLP vectorizer job to do constant folding.

5955

Drop extra definition of ScalarCost.
Otherwise loop at line 5807 is updating variable from 5802, but it is not used.
LIne 5811 will subtract one defined at 5795.

ABataev added inline comments.Jun 3 2022, 2:50 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5955

Ah, yes, sure.

vdmitrie added inline comments.Jun 3 2022, 3:14 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5960

Isn't this interface already puts assumption that a constant is a legal immediate?
I was trying to explore this too and I found that it does not seem to cover correctly 32bit target specifically for 64bit operations. Ideally we should have interface that tells whether immediate is a legal imm operand for a target but I have not found anything like that.
One way to figure this (which I found -may be wrongful) is
when condition
DL->getTypeStoreSizeInBits(ScalarTy) > DL->getLargestLegalIntTypeSizeInBits()
is true we cannot assume operand as a legal immediate.

ABataev added inline comments.Jun 3 2022, 3:24 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5960

I'll check it.

ABataev added inline comments.Jun 6 2022, 9:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936

Do you suggest to hide it in getConstBuildVectorInstrCost? And return the difference? Or just add a new member function?

vdmitrie added inline comments.Jun 6 2022, 11:34 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936

Alternate opcodes is SLP vectorizer specific. For that reason trying to sink that logic into inside the TTI interface does not look like right thing to do.
But outlining this whole new code into a separate member is a good idea.

What sounds weird for me is that constants may seed vtree for reduction. Although that is not directly related to this patch but you are placing here work arounds of that. IMO it is unpractical to run constants reduction through SLP vectorizer machinery.

Probably, to make the work around of that issue simpler in this patch, add an early return:
if (E->UserTreeIndices.empty())

return 0;

Otherwise it will be returning memory-op cost for a foldable operation.

ABataev updated this revision to Diff 435177.Jun 8 2022, 8:19 AM

Address comments

ABataev added inline comments.Jun 8 2022, 8:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5936

What sounds weird for me is that constants may seed vtree for reduction.

InstCombiner and other passes are not always able to handle them (or require some extra work and compile time). E.g:

define i32 @foo(i32 %v, i32 %a) {
  %s1 = add i32 %v, 1
  %s2 = add i32 %a, 2
  %s3 = add i32 %s1, %s2
  %s11 = add i32 %v, %a
  %s31 = add i32 %s3, %s11
  %s4 = add i32 %v, 3
  %s5 = add i32 %a, 4
  %s6 = add i32 %s4, %s5
  %s7 = add i32 %s31, %s6
  ret i32 %s7
}

SLP is able to transform it to:

define i32 @foo(i32 %v, i32 %a) {
  %1 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> <i32 4, i32 3, i32 2, i32 1>)
  %op.rdx = add i32 %a, %a
  %op.rdx1 = add i32 %a, %v
  %op.rdx2 = add i32 %v, %v
  %op.rdx3 = add i32 %op.rdx, %op.rdx1
  %op.rdx4 = add i32 %op.rdx3, %op.rdx2
  %op.rdx5 = add i32 %1, %op.rdx4
  ret i32 %op.rdx5
}

which can be optimized

%1 = i32 10

But I agree, that it requires improvement. We don't need to estimate the cost and emit reduction here. I have a patch that improves it. Need to work on it for some time, though.

RKSimon added inline comments.Jun 16 2022, 3:32 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3810

Yes, vector shifts must be splats without AVX2 or XOP

3830

Instruction::Add/Sub?

Also, we'd need to allow Idx ==0 || Idx == 1 for commutable ops.

ABataev added inline comments.Jun 16 2022, 10:17 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3830
  1. I excluded Add/Sub here because scalar Add/Sub with Imm has less cost than the vector Add/Subs (0.2-0.33 vs ~0.5)
  2. We can add it later, currently no such kind of analysis in getIntImmCostInst
ABataev updated this revision to Diff 437675.Jun 16 2022, 1:03 PM

Address comments

RKSimon added inline comments.Jun 22 2022, 10:10 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3823

XOP can memory fold from Idx == 0 as well.

ABataev updated this revision to Diff 439156.Jun 22 2022, 2:16 PM

Removed getConstantBuildVectorCost, the analysis for constant values already exists in getArithmeticInstrCost. Added support for const operand for stores in getMemoryOpCost function.

RKSimon added inline comments.Aug 15 2022, 4:15 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4108

Just do this once before the !VTy || !LT.second.isVector()) check?

ABataev updated this revision to Diff 452648.Aug 15 2022, 6:27 AM

Address comment

RKSimon accepted this revision.Aug 17 2022, 10:11 AM

LGTM - it might be worth splitting the refactoring of adding the OperandValueKind arg to getMemoryOpCost? That way any fall out from the cost changes are more localised.

This revision is now accepted and ready to land.Aug 17 2022, 10:11 AM

LGTM - it might be worth splitting the refactoring of adding the OperandValueKind arg to getMemoryOpCost? That way any fall out from the cost changes are more localised.

Ok, will commit in a separate patch.

This revision was landed with ongoing or failed builds.Aug 19 2022, 8:04 AM
This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Aug 19 2022, 8:44 AM

Coming into this a bit late.

I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.

We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.

There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.

I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?

p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.

Coming into this a bit late.

I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.

We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.

There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.

I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?

Check the cost model of arithmetic instructions etc, they already include the cost analysis for constant values.

p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.

I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.

Coming into this a bit late.

I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.

We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.

There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.

I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?

Check the cost model of arithmetic instructions etc, they already include the cost analysis for constant values.

I think I found the code you're referring to in X86TTIImpl::getArithmeticInstrCost. I'd summarize this code as we have various alternate cost tables which seem to assume one constant splat operand or sometimes just one constant operand gets folded into the instruction. I don't know enough about avx512 instruction encoding to reason about this, but I will accept that it exists. Though it does look very weird to me that *all* constants are assumed to be folded into the encoding? Whatever, out of scope for this discussion.

p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.

I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.

Ok, so I see the concern here. I'm not thrilled with the conclusion, but I think I agree that the current state of the art is having each operation reason about the cost of the constant materialization independently.

Given that, I see why you took the approach you did here.

However, we're still left with the problem that the current interface is insufficient for RISC-V. On the vector side, we can generate various non-splat sequences (e.g. vid and friends) at low cost. As such, the current expressibility of interface isn't really sufficient.

I see two possible paths, both with downsides. I'm curious what you think:

  • Extend the OperandValueProperties enum with a bunch more options for describing build vectors. I don't really see the semantic distinction between OperandValueProperties and OperandValueKind, so we'd probably end up merging them into a single info struct with a bunch more properties on it. This arguably works more naturally with scalable vectors, but it's a bunch of complexity.
  • Add the getConstBuildVectorInstrCost interface anyways. Document the contract as being to return zero cost when the constant could fold into the using instruction. Existing backends which don't need the additional expressibility continue with the old scheme, RISC-V uses this approach to cost build vectors instead (i.e. arithmetic cost et al don't include constant mat costs).

As I said, both approaches have some obvious downsides. If you have an alternate idea, definitely open to hearing it.

Also, to be clear, I've accepted that this patch is reasonable. I'm asking now about future direction for my own work, not asking for you to volunteer for any of the above. :)

Coming into this a bit late.

I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.

We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.

There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.

I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?

Check the cost model of arithmetic instructions etc, they already include the cost analysis for constant values.

I think I found the code you're referring to in X86TTIImpl::getArithmeticInstrCost. I'd summarize this code as we have various alternate cost tables which seem to assume one constant splat operand or sometimes just one constant operand gets folded into the instruction. I don't know enough about avx512 instruction encoding to reason about this, but I will accept that it exists. Though it does look very weird to me that *all* constants are assumed to be folded into the encoding? Whatever, out of scope for this discussion.

p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.

I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.

Ok, so I see the concern here. I'm not thrilled with the conclusion, but I think I agree that the current state of the art is having each operation reason about the cost of the constant materialization independently.

Given that, I see why you took the approach you did here.

However, we're still left with the problem that the current interface is insufficient for RISC-V. On the vector side, we can generate various non-splat sequences (e.g. vid and friends) at low cost. As such, the current expressibility of interface isn't really sufficient.

I see two possible paths, both with downsides. I'm curious what you think:

  • Extend the OperandValueProperties enum with a bunch more options for describing build vectors. I don't really see the semantic distinction between OperandValueProperties and OperandValueKind, so we'd probably end up merging them into a single info struct with a bunch more properties on it. This arguably works more naturally with scalable vectors, but it's a bunch of complexity.
  • Add the getConstBuildVectorInstrCost interface anyways. Document the contract as being to return zero cost when the constant could fold into the using instruction. Existing backends which don't need the additional expressibility continue with the old scheme, RISC-V uses this approach to cost build vectors instead (i.e. arithmetic cost et al don't include constant mat costs).

As I said, both approaches have some obvious downsides. If you have an alternate idea, definitely open to hearing it.

I would do both (in some way) as a first step. Introduce getConstBuildVectorInstrCost (local to RiscV TTI interfac) and use it in TTI functions (I mean in getArithmeticInstrCost, getMemoryOpCost, etc.) for better constant build vector cost estimation (if the user provides operands or OperandValueProperties). Later we can make it public for all TTI interfaces. Thoughts?

Also, to be clear, I've accepted that this patch is reasonable. I'm asking now about future direction for my own work, not asking for you to volunteer for any of the above. :)

I understand, no problem.

Refactoring OperandValueKind/Properties from enums into a single properties list has come up several times (IIRC KnownNeverZero/KnownNeverNegative properties and even KnownBits/SignBits/Min+Max have been mentioned as useful for some cases).

TBH just merging them as an initial cleanup (and improving TargetTransformInfo::getOperandInfo) would be worth it and would make it easier for future changes.

Coming into this a bit late.

I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.

We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.

There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.

I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?

Check the cost model of arithmetic instructions etc, they already include the cost analysis for constant values.

I think I found the code you're referring to in X86TTIImpl::getArithmeticInstrCost. I'd summarize this code as we have various alternate cost tables which seem to assume one constant splat operand or sometimes just one constant operand gets folded into the instruction. I don't know enough about avx512 instruction encoding to reason about this, but I will accept that it exists. Though it does look very weird to me that *all* constants are assumed to be folded into the encoding? Whatever, out of scope for this discussion.

p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.

I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.

Ok, so I see the concern here. I'm not thrilled with the conclusion, but I think I agree that the current state of the art is having each operation reason about the cost of the constant materialization independently.

Given that, I see why you took the approach you did here.

However, we're still left with the problem that the current interface is insufficient for RISC-V. On the vector side, we can generate various non-splat sequences (e.g. vid and friends) at low cost. As such, the current expressibility of interface isn't really sufficient.

I see two possible paths, both with downsides. I'm curious what you think:

  • Extend the OperandValueProperties enum with a bunch more options for describing build vectors. I don't really see the semantic distinction between OperandValueProperties and OperandValueKind, so we'd probably end up merging them into a single info struct with a bunch more properties on it. This arguably works more naturally with scalable vectors, but it's a bunch of complexity.
  • Add the getConstBuildVectorInstrCost interface anyways. Document the contract as being to return zero cost when the constant could fold into the using instruction. Existing backends which don't need the additional expressibility continue with the old scheme, RISC-V uses this approach to cost build vectors instead (i.e. arithmetic cost et al don't include constant mat costs).

As I said, both approaches have some obvious downsides. If you have an alternate idea, definitely open to hearing it.

I would do both (in some way) as a first step. Introduce getConstBuildVectorInstrCost (local to RiscV TTI interfac) and use it in TTI functions (I mean in getArithmeticInstrCost, getMemoryOpCost, etc.) for better constant build vector cost estimation (if the user provides operands or OperandValueProperties). Later we can make it public for all TTI interfaces. Thoughts?

I split out the costing code in 59960e8d with plans to extend it. However, this isn't quite the same as getConstBuildVectorInstrCost as we don't have the actual values forming the build vector. For that, we'd need to significantly change the interface of TTI to pass through all of the Values making up the build vector.

Refactoring OperandValueKind/Properties from enums into a single properties list has come up several times (IIRC KnownNeverZero/KnownNeverNegative properties and even KnownBits/SignBits/Min+Max have been mentioned as useful for some cases).

TBH just merging them as an initial cleanup (and improving TargetTransformInfo::getOperandInfo) would be worth it and would make it easier for future changes.

Given this has come up many times, I went ahead and did it. Changes to wrap both existing properties enum in a class have been plumbed through all of TTI and client code. If we want to add new properties, it should be pretty straight forward to do so.

I'm still not sure this is the right direction overall - as opposed to costing the actual constant value - but I'm still toying with ideas here. One thing I did notice is that basically only X86 costs immediate operands with the current approach. So this really isn't "existing targets do X"; it's "most targets ignore this issue, and X86 does X".

FYI, this change causes regression with Flang due to store-forwarding issues. I am not sure if it is Flang-specific - please take a look: https://github.com/llvm/llvm-project/issues/57322

nikic added a subscriber: nikic.Aug 29 2022, 3:18 AM

FYI this change caused a noticeable compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=31fbcccb3136b9da99e7bc95007e553403fcd641&to=0e7ed32c71362f3547329c6ee8573a8bc191f58a&stat=instructions Highest impact seems to be 7.5% on constants.c from mafft. I don't see anything obvious that can be optimized here though.

FYI this change caused a noticeable compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=31fbcccb3136b9da99e7bc95007e553403fcd641&to=0e7ed32c71362f3547329c6ee8573a8bc191f58a&stat=instructions Highest impact seems to be 7.5% on constants.c from mafft. I don't see anything obvious that can be optimized here though.

Hope D132750 will fix it in some cases for FP cases.