Many of the scalar X86 instructions allow to use memory references directly,
without preliminary loading to registers. Cost model for X86 target does
not take into account this situation. This patch considers the cost of
the load instruction as 0, if this is the only load instruction in the
binary instruction, that supports loading from memory or if this is only
the second load instruction.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
This cost is already the part of the cost of the arithmetic instructions and we count this cost one more time when we try to estimate the cost of standalone load instructions.
I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.
Sure. Here in PR36280 we have a vectorization tree of 3 nodes: 1) float mul insts %mul1, %mul2; 2) load insts %p1, %p2; 3) Gather %x, %y. Also we have 2 external uses %add1 and %add2. When we calculate the cvectorization cost, it is done on per tree-node basis. 1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of scalar mults) = -2; 2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note, that in the resulting code these loads are folded in the vmulss instructions and the cost of these instructions is calculated already when we calculated the cost of the vectorization of the float point multiplacations. The resl cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1; 3) The cost of gather is 1 for gather. + 1 for an extract op. The total thus is -1. If we correct that cost of loads, the final cost is going to be 1.
It does not mean, that we should not fix the cost of the floating point operations, but the same problem exists for integer operations. Because of the extra cost of the scalar loads we are too optimistic about vectorized code in many cases and it leads to performance degradation.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1824 ↗ | (On Diff #133273) | A little tricky, but we only fold certain operands (Operand(1) typically) so if it doesn't commute then we can't fold - ISD::SUB/FSUB/FDIV/SHL/ASHR/LSHR will all definitely be affected by this. Also we need to handle the case where both operands are loads - we will still have at least one load cost. |
test/Transforms/SLPVectorizer/X86/arith-mul.ll | ||
265 ↗ | (On Diff #133273) | This looks like a cost model issue - scalar integer muls on silvermont are pretty awful just like the vector imuls..... |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1824 ↗ | (On Diff #133273) |
|
test/Transforms/SLPVectorizer/X86/arith-mul.ll | ||
---|---|---|
265 ↗ | (On Diff #133273) | Yes, we just don't have correct cost model for scalar ops. This is another issue that must be fixed |
Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.
I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1824 ↗ | (On Diff #133273) | Shift instructions can only fold the second operand as well (and even then very poorly....). I'd be tempted to just not include them in isFreeOp at all tbh. |
1820 ↗ | (On Diff #133622) | Shouldn't this be I->getOperand(1) == OpI ? |
1836 ↗ | (On Diff #133622) | Please can you add a comment explaining the reason for the ScalarSIzeInBits condition? |
1842 ↗ | (On Diff #133622) | Why have you only put the constant condition on FADD/FMUL ? |
1886 ↗ | (On Diff #133622) | Comment this please. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1824 ↗ | (On Diff #133273) | No, actually only the first operand can be the memory address and, thus, can be folded |
1820 ↗ | (On Diff #133622) | Yes, missed this when updated the patch last time, thanks! |
1836 ↗ | (On Diff #133622) | Yes, sure. |
1842 ↗ | (On Diff #133622) | Integer operations allow using register/immediate value as one of the operands. |
1886 ↗ | (On Diff #133622) | Ok. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1824 ↗ | (On Diff #133273) | I mean, you're right in terms of asm instructions, but in LLVM IR we shift the first operand, so it must Operand(0), not Operand(1). Also, I limited the size of the data. |
I think test coverage is lacking here, i think there are no direct costmodel tests for all those X86TargetTransformInfo.cpp changes, they are only indirectly tested via vectorizer changes.
Sorry, I had no time to work on this. Plus, I don't have the access to x86 machine to check the performance.
Rebased ands reworked. Excluded SLM as its cost model is not quite correct.
Gives perf improvement for cpu2017.511.povray_r ~3% and improves performance up to 8% in some other cases.