Usually, constant buildvector results in a vector load from a
constant/data pool.
Details
- Reviewers
RKSimon craig.topper - Commits
- rG0e7ed32c7136: [SLP]Cost for a constant buildvector.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Some unfortunate regressions
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll | ||
---|---|---|
146 ↗ | (On Diff #433753) |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5932 | This estimate should be bit more complicated.
|
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll | ||
---|---|---|
146 ↗ | (On Diff #433753) | llvm-mca reports throughputs: |
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll | ||
---|---|---|
146 ↗ | (On Diff #433753) | Ah, right, these checks are for avx. |
Test case for collection.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
3810 | Note that it is free for splat shift value only. |
mca shows that these 2 instructions has the same cost, so it actually doers not matter. Probably worth to add some other instructions, which can load params directly from memory for x86
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5936 | Just wondering is that possible for UserTreeIndices to be empty here? AFAIU it can be for root only but constants do not seed vtree. | |
5955 | drop it? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5955 | Ah, yes, sure. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5960 | Isn't this interface already puts assumption that a constant is a legal immediate? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5960 | I'll check it. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5936 | Do you suggest to hide it in getConstBuildVectorInstrCost? And return the difference? Or just add a new member function? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5936 | Alternate opcodes is SLP vectorizer specific. For that reason trying to sink that logic into inside the TTI interface does not look like right thing to do. What sounds weird for me is that constants may seed vtree for reduction. Although that is not directly related to this patch but you are placing here work arounds of that. IMO it is unpractical to run constants reduction through SLP vectorizer machinery. Probably, to make the work around of that issue simpler in this patch, add an early return: return 0; Otherwise it will be returning memory-op cost for a foldable operation. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5936 |
InstCombiner and other passes are not always able to handle them (or require some extra work and compile time). E.g: define i32 @foo(i32 %v, i32 %a) { %s1 = add i32 %v, 1 %s2 = add i32 %a, 2 %s3 = add i32 %s1, %s2 %s11 = add i32 %v, %a %s31 = add i32 %s3, %s11 %s4 = add i32 %v, 3 %s5 = add i32 %a, 4 %s6 = add i32 %s4, %s5 %s7 = add i32 %s31, %s6 ret i32 %s7 } SLP is able to transform it to: define i32 @foo(i32 %v, i32 %a) { %1 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> <i32 4, i32 3, i32 2, i32 1>) %op.rdx = add i32 %a, %a %op.rdx1 = add i32 %a, %v %op.rdx2 = add i32 %v, %v %op.rdx3 = add i32 %op.rdx, %op.rdx1 %op.rdx4 = add i32 %op.rdx3, %op.rdx2 %op.rdx5 = add i32 %1, %op.rdx4 ret i32 %op.rdx5 } which can be optimized %1 = i32 10 But I agree, that it requires improvement. We don't need to estimate the cost and emit reduction here. I have a patch that improves it. Need to work on it for some time, though. |
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
3830 |
|
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
3823 | XOP can memory fold from Idx == 0 as well. |
Removed getConstantBuildVectorCost, the analysis for constant values already exists in getArithmeticInstrCost. Added support for const operand for stores in getMemoryOpCost function.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
4108 | Just do this once before the !VTy || !LT.second.isVector()) check? |
LGTM - it might be worth splitting the refactoring of adding the OperandValueKind arg to getMemoryOpCost? That way any fall out from the cost changes are more localised.
Coming into this a bit late.
I stumbled into this myself when looking at the impact of SLP on RISC-V. I think this is addressing an important problem, but I'm really not happy with the structure of the change that landed.
We have a general problem here of needing to account for cost of a constant build vector. This change ended up being specific to stores of constant build vectors, but the same basic problem still exists if e.g. you have a load, add constant-build-vector, and store sequence which gets vectorized. The problem here is not in any way related to the cost of the store; it's related to the cost of materializing the value to be stored.
There's an additional problem that the cost model added for RISC-V is way overly simplistic. It's out of sync with the existing build vector lowering code, and thus will result in costs which differ from the actual lowering chosen. More importantly, the interface chosen in this patch prevents a more sophisticated cost model from being used.
I think we need to undo this, and return to the getConstBuildVectorInstrCost approach used in early versions of this patch. There was a mention of existing build vector costing in getConstBuildVectorInstrCost, but I can't find this in generic code. Can you point me to the code you were referring to?
p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.
Check the cost model of arithmetic instructions etc, they already include the cost analysis for constant values.
p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.
I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.
I think I found the code you're referring to in X86TTIImpl::getArithmeticInstrCost. I'd summarize this code as we have various alternate cost tables which seem to assume one constant splat operand or sometimes just one constant operand gets folded into the instruction. I don't know enough about avx512 instruction encoding to reason about this, but I will accept that it exists. Though it does look very weird to me that *all* constants are assumed to be folded into the encoding? Whatever, out of scope for this discussion.
p.s. I used the word "undo" specifically to avoid "revert". I'm not asking the change be reverted, simply that we work in the direction of a better interface overall. Doing that will have the effect of semantically reverting the landed change, but I'm not picky about the order of operations here.
I tried initially to implement it but our cost model already includes the cost of constants/constant buildvectors for many operations. It requires significant rework of TTI and some extra investigation because we need to account cross dependency between constants and the operations. And I'm not sure if it would better/easier to implement, it requires some extra (re)design investigation.
Ok, so I see the concern here. I'm not thrilled with the conclusion, but I think I agree that the current state of the art is having each operation reason about the cost of the constant materialization independently.
Given that, I see why you took the approach you did here.
However, we're still left with the problem that the current interface is insufficient for RISC-V. On the vector side, we can generate various non-splat sequences (e.g. vid and friends) at low cost. As such, the current expressibility of interface isn't really sufficient.
I see two possible paths, both with downsides. I'm curious what you think:
- Extend the OperandValueProperties enum with a bunch more options for describing build vectors. I don't really see the semantic distinction between OperandValueProperties and OperandValueKind, so we'd probably end up merging them into a single info struct with a bunch more properties on it. This arguably works more naturally with scalable vectors, but it's a bunch of complexity.
- Add the getConstBuildVectorInstrCost interface anyways. Document the contract as being to return zero cost when the constant could fold into the using instruction. Existing backends which don't need the additional expressibility continue with the old scheme, RISC-V uses this approach to cost build vectors instead (i.e. arithmetic cost et al don't include constant mat costs).
As I said, both approaches have some obvious downsides. If you have an alternate idea, definitely open to hearing it.
Also, to be clear, I've accepted that this patch is reasonable. I'm asking now about future direction for my own work, not asking for you to volunteer for any of the above. :)
I would do both (in some way) as a first step. Introduce getConstBuildVectorInstrCost (local to RiscV TTI interfac) and use it in TTI functions (I mean in getArithmeticInstrCost, getMemoryOpCost, etc.) for better constant build vector cost estimation (if the user provides operands or OperandValueProperties). Later we can make it public for all TTI interfaces. Thoughts?
Also, to be clear, I've accepted that this patch is reasonable. I'm asking now about future direction for my own work, not asking for you to volunteer for any of the above. :)
I understand, no problem.
Refactoring OperandValueKind/Properties from enums into a single properties list has come up several times (IIRC KnownNeverZero/KnownNeverNegative properties and even KnownBits/SignBits/Min+Max have been mentioned as useful for some cases).
TBH just merging them as an initial cleanup (and improving TargetTransformInfo::getOperandInfo) would be worth it and would make it easier for future changes.
I split out the costing code in 59960e8d with plans to extend it. However, this isn't quite the same as getConstBuildVectorInstrCost as we don't have the actual values forming the build vector. For that, we'd need to significantly change the interface of TTI to pass through all of the Values making up the build vector.
Given this has come up many times, I went ahead and did it. Changes to wrap both existing properties enum in a class have been plumbed through all of TTI and client code. If we want to add new properties, it should be pretty straight forward to do so.
I'm still not sure this is the right direction overall - as opposed to costing the actual constant value - but I'm still toying with ideas here. One thing I did notice is that basically only X86 costs immediate operands with the current approach. So this really isn't "existing targets do X"; it's "most targets ignore this issue, and X86 does X".
FYI, this change causes regression with Flang due to store-forwarding issues. I am not sure if it is Flang-specific - please take a look: https://github.com/llvm/llvm-project/issues/57322
FYI this change caused a noticeable compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=31fbcccb3136b9da99e7bc95007e553403fcd641&to=0e7ed32c71362f3547329c6ee8573a8bc191f58a&stat=instructions Highest impact seems to be 7.5% on constants.c from mafft. I don't see anything obvious that can be optimized here though.
Note that it is free for splat shift value only.
https://godbolt.org/z/KMf6sW5n3