Many scalar X86 instructions can use memory references directly, without
first loading the value into a register. The cost model for the X86 target
does not take this into account. This patch treats the cost of a load
instruction as 0 if it is the only load feeding a binary instruction that
supports a memory operand, or if it is only the second such load.
Repository: rL LLVM
The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we try to estimate the cost of the standalone load instructions.
I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.
Sure. Here in PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1, %mul2; 2) the load instructions %p1, %p2; 3) a gather of %x, %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis. 1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar muls) = -2. 2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already accounted for when we calculated the cost of vectorizing the floating-point multiplications. The real cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1. 3) The cost of the gather is 1 for the gather + 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost becomes 1.
This does not mean we should not fix the cost of the floating-point operations, but the same problem exists for integer operations. Because of the extra cost attributed to the scalar loads, we are too optimistic about vectorized code in many cases, and that leads to performance degradation.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2411 | A little tricky, but we only fold certain operands (Operand(1) typically) so if it doesn't commute then we can't fold - ISD::SUB/FSUB/FDIV/SHL/ASHR/LSHR will all definitely be affected by this. Also we need to handle the case where both operands are loads - we will still have at least one load cost. | |
test/Transforms/SLPVectorizer/X86/arith-mul.ll | ||
266 | This looks like a cost model issue - scalar integer muls on Silvermont are pretty awful, just like the vector imuls..... |
test/Transforms/SLPVectorizer/X86/arith-mul.ll | ||
---|---|---|
266 | Yes, we just don't have a correct cost model for the scalar ops. That is a separate issue that must be fixed. |
Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.
I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2407 | Shouldn't this be I->getOperand(1) == OpI ? | |
2411 | Shift instructions can only fold the second operand as well (and even then very poorly....). I'd be tempted to just not include them in isFreeOp at all tbh. | |
2423 | Please can you add a comment explaining the reason for the ScalarSizeInBits condition? | |
2429 | Why have you only put the constant condition on FADD/FMUL ? | |
2498 | Comment this please. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2407 | Yes, missed this when updated the patch last time, thanks! | |
2411 | No, actually only the first operand can be the memory operand and thus can be folded. | |
2423 | Yes, sure. | |
2429 | Integer operations allow using a register or an immediate value as one of the operands. | |
2498 | Ok. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2411 | I mean, you're right in terms of asm instructions, but in LLVM IR we shift the first operand, so it must be Operand(0), not Operand(1). Also, I limited the size of the data. |
I think test coverage is lacking here: there are no direct cost model tests for all those X86TargetTransformInfo.cpp changes; they are only indirectly tested via the vectorizer changes.
Sorry, I had no time to work on this. Plus, I don't have access to an x86 machine to check the performance.
Rebased and reworked. Excluded SLM, as its cost model is not quite correct.
Gives a ~3% perf improvement on cpu2017 511.povray_r and improves performance by up to 8% in some other cases.
Please can you explain why in some cases the cost increases, and why we no longer vectorize some test cases, or vectorize with a smaller vector width?
It does not increase the cost of the vector code; it lowers the cost of scalar code with memory operands, so the resulting cost difference between the vector and scalar versions grows. As I explained already, that is a consequence of the cost model. For example, the actual throughput cost of add r,r on X86 is 0.25, and of add m,r it is 0.5. But the cost model assigns a cost of 1 to add r,r and 2 to add m,r (load + add: 1 for the load and 1 for the add, as two LLVM IR instructions).
At the same time, the actual throughput cost of VMOV+VADD is ~1.5, while the cost model estimates it as 2 (again, 1 for the vmov and 1 for the vadd). As you can see, add m,r and VMOV+VADD get the same model cost, although the cost of add m,r should really be lower.
So the relative cost of the two scalar forms, add r,r and add m,r, is still not modeled correctly, but that does not matter: we never compare the cost of one scalar instruction against another, only the scalar version against the vector version.
In short, the patch lowers the cost of scalar instructions with memory accesses relative to the vector versions of those instructions.
Are you sure we are talking about the same diff?
llvm/test/Transforms/LoopVectorize/X86/interleaving.ll | ||
---|---|---|
35–47 ↗ | (On Diff #326154) | No longer vectorized for AVX? To me that reads as if we've increased the cost of vector variant. |
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-expanded.ll | ||
251 ↗ | (On Diff #326154) | Same, no longer vectorized? |
294 ↗ | (On Diff #326154) | Same, no longer vectorized? |
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll | ||
129 ↗ | (On Diff #326154) | Same, no longer vectorized? |
251 ↗ | (On Diff #326154) | Same, no longer vectorized? |
llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll | ||
156 ↗ | (On Diff #326154) | Same, no longer vectorized? |
llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll | ||
156 ↗ | (On Diff #326154) | Same, no longer vectorized? |
llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll | ||
329 ↗ | (On Diff #326154) | Same, no longer vectorized? |
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll | ||
37 ↗ | (On Diff #326154) | Same, no longer vectorized? |
(To be noted, I agree that the change makes sense; I'm just surprised by its visible effect on the test cases.)
Roman, yes, this patch lowers the cost of scalar instructions and thus changes the balance between the costs of the vector and scalar variants of the code. For x86, in some cases the scalar instructions with memory accesses are more profitable than the vector variant of the same instructions (a pair of load + binop, actually).
llvm/test/Transforms/LoopVectorize/X86/interleaving.ll | ||
---|---|---|
35–47 ↗ | (On Diff #326154) | Yes, it looks like it increases the cost of the vector code; actually, it lowers the difference between the cost of the scalar code and the vector code. Most probably this is a corner case where the cost of the vectorized code is the same as the cost of the scalar code. |
llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll | ||
156 ↗ | (On Diff #326154) | Yes, because it is profitable to execute the scalar code, not the vector variant. |
I think this is fine. We lower the cost of scalar instructions, and of course that affects vectorization. It is good that the tests reflect that. Or perhaps there is a miscommunication and I just don't quite understand your concerns.
I have a few concerns
- we're increasing register pressure (x86 fp scalars share regs with the vector types)
- I'm not certain we are accounting for the impact of increased AGU usage - even though the load has been folded "for free", are we correctly handling the gep costs?
llvm/test/Transforms/LoopVectorize/X86/pr34438.ll | ||
---|---|---|
35 ↗ | (On Diff #326154) | This looks like a VECTOR regression on AVX targets? |
Yes, probably. That's why I did thorough internal performance testing of this patch, and one of my colleagues did as well; we got almost the same results. We see about a +9% gain on one test and +2% and +3% gains on two others (for AVX2), with no significant changes for AVX512, so mostly the older targets are affected. No significant perf losses were found during testing.
llvm/test/Transforms/LoopVectorize/X86/pr34438.ll | ||
---|---|---|
35 ↗ | (On Diff #326154) | Looks like it prefers 256 bit vectors rather than 2 x 256 + SPLIT vector instructions (8 x i64). I think this is fine for older targets |
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll | ||
---|---|---|
4 ↗ | (On Diff #327449) | You are missing an AVX512 run - but you have AVX512 check prefixes below? |
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll | ||
---|---|---|
4 ↗ | (On Diff #327449) | I'll fix it, thanks |