
[COST] Fix cost model of load instructions on X86
Needs Review · Public

Authored by ABataev on Feb 6 2018, 12:13 PM.

Details

Summary

Many scalar X86 instructions can use memory operands directly, without
first loading the value into a register. The cost model for the X86
target does not take this into account. This patch treats the cost of a
load instruction as 0 if it is the only load operand of a binary
instruction that supports folding a memory operand, or if it is the
second of two load operands (only one of the two can be folded).

Diff Detail

Event Timeline

ABataev created this revision. Feb 6 2018, 12:13 PM
ABataev updated this revision to Diff 133273. Feb 7 2018, 12:15 PM

Fixed cost model for floating point math instructions.

spatel added a comment. Feb 7 2018, 1:04 PM

The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

This cost is already part of the cost of the arithmetic instructions, and we count it one more time when we try to estimate the cost of standalone load instructions.

spatel added a comment. Feb 7 2018, 1:30 PM

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

> This cost is already part of the cost of the arithmetic instructions, and we count it one more time when we try to estimate the cost of standalone load instructions.

I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.

fhahn added a subscriber: fhahn. Feb 8 2018, 1:59 AM

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

> This cost is already part of the cost of the arithmetic instructions, and we count it one more time when we try to estimate the cost of standalone load instructions.

> I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.

Sure. In PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1 and %mul2; 2) the load instructions %p1 and %p2; 3) a gather of %x and %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis:

1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar fmuls) = -2.
2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already counted when we calculated the cost of vectorizing the floating point multiplications. The real cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1.
3) The cost of the gather node is 1 for the gather + 1 for an extract op.

The total is thus -1. If we correct the cost of the loads, the final cost becomes 1.

This does not mean we should not fix the cost of the floating point operations, but the same problem exists for integer operations. Because of the extra cost of the scalar loads, we are too optimistic about vectorized code in many cases, and that leads to performance degradation.

RKSimon added inline comments. Feb 9 2018, 5:38 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1826

A little tricky, but we only fold certain operands (Operand(1) typically), so if the operation doesn't commute then we can't fold - ISD::SUB/FSUB/FDIV/SHL/ASHR/LSHR will all definitely be affected by this.

Also, we need to handle the case where both operands are loads - we will still have at least one load cost.

test/Transforms/SLPVectorizer/X86/arith-mul.ll
265

This looks like a cost model issue - scalar integer muls on Silvermont are pretty awful, just like the vector imuls.....

ABataev added inline comments. Feb 9 2018, 6:56 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1826
  1. Agreed, I missed it. It will be fixed.
  2. Yes, it is handled already. The additional checks here are exactly to catch this situation.
ABataev added inline comments. Feb 9 2018, 7:23 AM
test/Transforms/SLPVectorizer/X86/arith-mul.ll
265

Yes, we just don't have a correct cost model for scalar ops. This is another issue that must be fixed.

ABataev updated this revision to Diff 133622. Feb 9 2018, 7:56 AM

Fixed checks for non-commutative instructions

> The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.

> A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.

> This cost is already part of the cost of the arithmetic instructions, and we count it one more time when we try to estimate the cost of standalone load instructions.

> I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.

> Sure. In PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1 and %mul2; 2) the load instructions %p1 and %p2; 3) a gather of %x and %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis. 1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar fmuls) = -2. 2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already counted when we calculated the cost of vectorizing the floating point multiplications. The real cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1. 3) The cost of the gather node is 1 for the gather + 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost becomes 1.

Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.

I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.

RKSimon added inline comments. Feb 18 2018, 7:03 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1822

Shouldn't this be I->getOperand(1) == OpI ?

1826

Shift instructions can only fold the second operand as well (and even then very poorly....). I'd be tempted to just not include them in isFreeOp at all, tbh.

1838

Please can you add a comment explaining the reason for the ScalarSizeInBits condition?

1844

Why have you only put the constant condition on FADD/FMUL?

1913

Comment this please.

ABataev added inline comments. Feb 20 2018, 8:13 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1822

Yes, I missed this when I updated the patch last time, thanks!

1826

No, actually only the first operand can be the memory address and thus can be folded.

1838

Yes, sure.

1844

Integer operations allow using a register or an immediate value as one of the operands.

1913

Ok.

ABataev added inline comments. Feb 20 2018, 8:21 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1826

I mean, you're right in terms of asm instructions, but in LLVM IR we shift the first operand, so it must be Operand(0), not Operand(1). Also, I limited the size of the data.

ABataev updated this revision to Diff 135113. Feb 20 2018, 11:28 AM

Update after review

ABataev updated this revision to Diff 135119. Feb 20 2018, 11:56 AM

Fixed tests for LoopVectorizer, affected by the cost change.