This is an archive of the discontinued LLVM Phabricator instance.

[COST] Fix cost model of load instructions on X86
Needs ReviewPublic

Authored by ABataev on Feb 6 2018, 12:13 PM.

Details

Summary

Many scalar X86 instructions can use a memory reference directly as an operand,
without first loading the value into a register. The cost model for the X86
target does not take this into account. This patch treats the cost of a load
instruction as 0 if it is the only load operand of a binary instruction that
supports loading from memory, or if it is only the second of two load operands
(so at most one load per instruction is considered free).
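
As a minimal sketch of the idea (hypothetical helper name and simplified rules; the actual checks live in the X86TargetTransformInfo.cpp changes in this diff and are more involved), a load would be considered free when its single user is a binary operation that x86 can execute with a memory operand:

    // Sketch only -- not the patch code. Operand position, type width and the
    // "both operands are loads" case are discussed later in this review.
    #include "llvm/IR/InstrTypes.h"
    #include "llvm/IR/Instructions.h"

    using namespace llvm;

    static bool mayFoldIntoSingleUser(LoadInst *Ld) {
      if (!Ld->hasOneUse())
        return false;
      auto *BO = dyn_cast<BinaryOperator>(Ld->user_back());
      if (!BO)
        return false;
      switch (BO->getOpcode()) {
      case Instruction::Add:
      case Instruction::Mul:
      case Instruction::And:
      case Instruction::Or:
      case Instruction::Xor:
      case Instruction::FAdd:
      case Instruction::FMul:
        return true;  // commutative ops: either source operand may be a memop
      default:
        return false; // non-commutative ops need operand-position checks
      }
    }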

Diff Detail

Event Timeline

ABataev created this revision.Feb 6 2018, 12:13 PM
ABataev updated this revision to Diff 133273.Feb 7 2018, 12:15 PM

Fixed cost model for floating point math instructions.

spatel added a comment.Feb 7 2018, 1:04 PM

The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
>
> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we estimate the cost of the standalone load instructions.

spatel added a comment.Feb 7 2018, 1:30 PM

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
>
> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
>
> This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we estimate the cost of the standalone load instructions.

I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.

fhahn added a subscriber: fhahn.Feb 8 2018, 1:59 AM

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
>
> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
>
> This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we estimate the cost of the standalone load instructions.
>
> I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.

Sure. In PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1 and %mul2; 2) the load instructions %p1 and %p2; 3) a gather of %x and %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis: 1) the node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar muls) = -2; 2) the node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already accounted for when we calculated the cost of vectorizing the floating-point multiplications. The real cost should be 1 (vector load) - (0 + 0) (cost of the scalar loads) = 1; 3) the gather node costs 1 for the gather plus 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost becomes 1.

This does not mean we should not fix the cost of the floating-point operations, but the same problem exists for integer operations. Because of the extra cost attributed to the scalar loads we are too optimistic about vectorized code in many cases, and it leads to performance degradation.
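
As a rough source-level analogue of the pattern described above (hypothetical code for illustration only, not the actual PR36280 test; the comments mirror the value names used in the explanation):

    // Two scalar loads feed two fmuls whose results are also consumed by
    // scalar adds (the "external uses" of the SLP tree).
    float pr36280_like(float x, float y, const float *p) {
      float p1 = p[0];                      // %p1 -- the loads that fold into
      float p2 = p[1];                      // %p2    the scalar vmulss insts
      float mul1 = x * p1;                  // %mul1
      float mul2 = y * p2;                  // %mul2
      return (mul1 + 1.0f) + (mul2 + 2.0f); // %add1, %add2: external uses
    }

In the scalar code each multiply becomes a single vmulss with a memory operand, which is why counting the scalar loads separately double-charges them.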

RKSimon added inline comments.Feb 9 2018, 5:38 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2411

A little tricky, but we only fold certain operands (typically Operand(1)), so if the operation doesn't commute then we can't fold - ISD::SUB/FSUB/FDIV/SHL/ASHR/LSHR will all definitely be affected by this.

Also we need to handle the case where both operands are loads - we will still have at least one load cost.

test/Transforms/SLPVectorizer/X86/arith-mul.ll
266

This looks like a cost model issue - scalar integer muls on silvermont are pretty awful just like the vector imuls.....

ABataev added inline comments.Feb 9 2018, 6:56 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2411
  1. Agree, missed it. Will be fixed.
  2. Yes, it is handled already. The additional checks here exist exactly for this situation.
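
A sketch of what these checks can look like (hypothetical helpers, not the code in this diff): for non-commutative operations only one operand position can become the memory operand, and when both operands are loads at most one of them folds, so at least one load cost remains.

    // Illustrative only; names and exact rules are assumptions.
    #include "llvm/IR/InstrTypes.h"
    #include "llvm/IR/Instructions.h"

    using namespace llvm;

    // Can Op be the operand of BO that is folded as a memory operand?
    static bool operandMayFold(BinaryOperator *BO, Value *Op) {
      if (BO->isCommutative())
        return true;                  // add/mul/and/or/xor/fadd/fmul
      // Non-commutative ops (sub/fsub/fdiv, ...) fold only one fixed operand
      // position; shifts are a special case discussed later in this review.
      return BO->getOperand(1) == Op;
    }

    // With two load operands only one can be folded, so at least one load
    // keeps its normal cost.
    static unsigned loadsThatKeepTheirCost(BinaryOperator *BO) {
      unsigned NumLoads = isa<LoadInst>(BO->getOperand(0)) +
                          isa<LoadInst>(BO->getOperand(1));
      return NumLoads > 1 ? NumLoads - 1 : 0;
    }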
ABataev added inline comments.Feb 9 2018, 7:23 AM
test/Transforms/SLPVectorizer/X86/arith-mul.ll
266

Yes, we just don't have a correct cost model for the scalar ops. That is another issue that must be fixed.

ABataev updated this revision to Diff 133622.Feb 9 2018, 7:56 AM

Fixed checks for non-commutative instructions

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
>
> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
>
> This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we estimate the cost of the standalone load instructions.
>
> I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.
>
> Sure. In PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1 and %mul2; 2) the load instructions %p1 and %p2; 3) a gather of %x and %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis: 1) the node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar muls) = -2; 2) the node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already accounted for when we calculated the cost of vectorizing the floating-point multiplications. The real cost should be 1 (vector load) - (0 + 0) (cost of the scalar loads) = 1; 3) the gather node costs 1 for the gather plus 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost becomes 1.

Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.

I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.

RKSimon added inline comments.Feb 18 2018, 7:03 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2407

Shouldn't this be I->getOperand(1) == OpI ?

2411

Shift instructions can only fold the second operand as well (and even then very poorly....). I'd be tempted to just not include them in isFreeOp at all tbh.

2423

Please can you add a comment explaining the reason for the ScalarSizeInBits condition?

2429

Why have you only put the constant condition on FADD/FMUL?

2498

Comment this please.

ABataev added inline comments.Feb 20 2018, 8:13 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2407

Yes, I missed this when I updated the patch last time, thanks!

2411

No, actually only the first operand can be the memory address and, thus, can be folded

2423

Yes, sure.

2429

Integer operations allow using a register or an immediate value as one of the operands.

2498

Ok.
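
The FADD/FMUL point above is easier to see with a small example (hypothetical functions; the assembly in the comments is typical x86-64 codegen, not output generated for this patch): integer operations have immediate encodings, while floating-point operations do not, so an FP constant operand becomes a constant-pool load that can itself be folded into the instruction.

    // x86 integer ops have immediate forms, e.g.:  addl $42, %eax
    int addImmediate(int x) { return x + 42; }

    // There is no immediate form of addss/addsd; the constant lives in the
    // constant pool and is folded as a memory operand, e.g.:
    //   addss .LCPI0_0(%rip), %xmm0
    float addFPConstant(float x) { return x + 42.0f; }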

ABataev added inline comments.Feb 20 2018, 8:21 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2411

I mean, you're right in terms of asm instructions, but in LLVM IR we shift the first operand, so it must be Operand(0), not Operand(1). Also, I limited the size of the data.
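
To make the shift case concrete (hypothetical snippet; the IR and assembly in the comments show the usual correspondence and are not taken from this patch):

    #include "llvm/IR/InstrTypes.h"

    using namespace llvm;

    // For a pattern like "*p <<= 2":
    //   IR:   %v = load i64, i64* %p
    //         %r = shl i64 %v, 2          ; %v is getOperand(0)
    //         store i64 %r, i64* %p
    //   asm:  shlq $2, (%rdi)             ; the load folds into the RMW shift
    // so for shifts it is IR operand 0 (the shifted value), never the shift
    // amount, that can come from memory.
    static bool shiftOperandMayFold(BinaryOperator *BO, Value *Op) {
      switch (BO->getOpcode()) {
      case Instruction::Shl:
      case Instruction::LShr:
      case Instruction::AShr:
        return BO->getOperand(0) == Op;
      default:
        return false;
      }
    }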

ABataev updated this revision to Diff 135113.Feb 20 2018, 11:28 AM

Update after review

ABataev updated this revision to Diff 135119.Feb 20 2018, 11:56 AM

Fixed the LoopVectorizer tests affected by the cost change.

Herald added a project: Restricted Project.Sep 5 2019, 7:28 AM
Herald added a subscriber: zzheng.

I think test coverage is lacking here; there are no direct cost model tests for all those X86TargetTransformInfo.cpp changes, they are only indirectly tested via the vectorizer changes.

Re-ping

Sorry, I had no time to work on this. Plus, I don't have access to an x86 machine to check the performance.

ABataev updated this revision to Diff 309604.Dec 4 2020, 12:04 PM

Rebased and reworked. Excluded SLM as its cost model is not quite correct.
Gives a ~3% perf improvement on cpu2017.511.povray_r and improves performance by up to 8% in some other cases.

There is already a very similar solution for SystemZ in LLVM.

lebedev.ri added a comment.EditedFeb 25 2021, 1:14 AM

Please can you explain why in some cases the cost increases,
and we no longer vectorize some test cases/vectorize with smaller vector width?

> Please can you explain why in some cases the cost increases,
> and we no longer vectorize some test cases/vectorize with smaller vector width?

It does not increase the cost of the vector code, it lowers the cost of the scalar code with memops. Just the resulting cost difference between the vector version and the scalar version increases. As I explained already, that is because of the cost model. For example, the throughput cost of add r,r on X86 is 0.25 and of add m,r is 0.5, but the cost model gives a cost of 1 for add r,r and 2 for add m,r (load + add, 1 for the load and 1 for the add LLVM IR instructions).
At the same time, the actual throughput cost of VMOV+VADD is ~1.5, and the cost model estimates it as 2 (again, 1 for the vmov and 1 for the vadd). As you can see, add m,r and VMOV+VADD get the same cost, although the cost of add m,r should actually be lower.
So, when comparing the costs of the two scalar instructions add r,r and add m,r the model is still not correct, but that is not important, because we do not need to compare the costs of scalar instructions against each other; we compare the cost of the scalar version against the cost of the vector version.
So, it lowers the cost of scalar instructions with memory accesses relative to the vector versions of the instructions.
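
Lining up the numbers quoted above (reciprocal throughputs as stated in this comment, model costs as described for the current cost model):

    instruction     quoted rcp. throughput    current model cost
    add r,r         0.25                      1
    add m,r         0.5                       2  (1 load + 1 add)
    vmov + vadd     ~1.5                      2  (1 vmov + 1 vadd)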

Are you sure we are talking about the same diff?

llvm/test/Transforms/LoopVectorize/X86/interleaving.ll
35–47 ↗(On Diff #326154)

No longer vectorized for AVX? To me that reads as if we've increased the cost of vector variant.

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-expanded.ll
251 ↗(On Diff #326154)

Same, no longer vectorized?

294 ↗(On Diff #326154)

Same, no longer vectorized?

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
129 ↗(On Diff #326154)

Same, no longer vectorized?

251 ↗(On Diff #326154)

Same, no longer vectorized?

llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll
156 ↗(On Diff #326154)

Same, no longer vectorized?

llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
156 ↗(On Diff #326154)

Same, no longer vectorized?

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll
329 ↗(On Diff #326154)

Same, no longer vectorized?

llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
37 ↗(On Diff #326154)

Same, no longer vectorized?

(To be noted, I agree that the change makes sense, I'm just surprised by its visible effect on the test cases)

Roman, yes, this patch lowers the cost of scalar instructions and thus the difference between the costs of the vector and scalar variants of the code. For x86, in some cases the scalar instructions with memory accesses are more profitable than the vector variant of the same instructions (a pair of load + binop, actually).

llvm/test/Transforms/LoopVectorize/X86/interleaving.ll
35–47 ↗(On Diff #326154)

Yes, it looks like it increases the cost of the vector code, but actually it lowers the difference between the cost of the scalar code and the vector code. Most probably we have a corner case here, where the cost of the vectorized code is the same as the cost of the scalar code.

llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll
156 ↗(On Diff #326154)

Yes, because it is profitable to execute the scalar code, not the vector variant.

> (To be noted, I agree that the change makes sense, I'm just surprised by its visible effect on the test cases)

I think this is fine. We lower the cost of scalar instructions and of course this affects the vectorization. It is good that the tests are affected by that. Or perhaps there is a miscommunication and I just don't quite understand your concerns.

I have a few concerns

  • we're increasing register pressure (x86 fp scalars share regs with the vector types)
  • I'm not certain we are accounting for the impact of increased AGU usage - even though the load has been folded "for free", are we correctly handling the gep costs?
llvm/test/Transforms/LoopVectorize/X86/pr34438.ll
35 ↗(On Diff #326154)

This looks like a VECTOR regression on AVX targets?

> I have a few concerns
>
>   • we're increasing register pressure (x86 fp scalars share regs with the vector types)
>   • I'm not certain we are accounting for the impact of increased AGU usage - even though the load has been folded "for free", are we correctly handling the gep costs?

Yes, probably. That's why I did thorough internal performance testing of this patch, and one of my colleagues did too; we got almost the same results. We see about a +9% gain on one of the tests, and +2% and +3% gains on two other tests (for AVX2), with no significant changes for AVX512, so mostly the older targets are affected. No significant performance losses were found during testing.

llvm/test/Transforms/LoopVectorize/X86/pr34438.ll
35 ↗(On Diff #326154)

Looks like it prefers 256-bit vectors rather than 2 x 256 + SPLIT vector instructions (8 x i64). I think this is fine for older targets.

RKSimon added inline comments.Mar 2 2021, 8:15 AM
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
4 ↗(On Diff #327449)

You are missing an AVX512 run - but you have AVX512 check prefixes below?

ABataev added inline comments.Mar 2 2021, 8:17 AM
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
4 ↗(On Diff #327449)

I'll fix it, thanks

ABataev updated this revision to Diff 327582.Mar 2 2021, 1:57 PM

Removed extra checks