[TTI] rename getArithmeticInstructionCost() to getUnitThroughput(); NFC
ClosedPublic

Authored by spatel on Feb 26 2018, 10:18 AM.

Details

Summary

Let's explicitly state the meaning of this API in its name and the code comments to make the intent clear.

This assumes that the x86 overrides of the base cost model are doing the right thing already and D43733 is improving on that. The definition and usage of this cost were not obvious in D43079 / D42981.

If clients are not using this cost as intended, then the name change should make it apparent. Out-of-tree targets will have to update for the cosmetic change, but that's an opportunity to verify that they're either ok with the base model costs or have a reasonable override for those costs.

Diff Detail

spatel created this revision.Feb 26 2018, 10:18 AM
ABataev added inline comments.Feb 26 2018, 10:22 AM
include/llvm/Analysis/TargetTransformInfo.h
729

Seems to me very general, the name does not show that this is the throughput for arithmetic instructions only. I think it is going to be enough just to clarify the comment.

spatel added inline comments.Feb 26 2018, 10:50 AM
include/llvm/Analysis/TargetTransformInfo.h
729

getXXXCost is the most ambiguous. getALUUnitThroughput is better?

I'd like to make it so we don't have to read the comments to distinguish this from getUserCost and getOperationCost. So if we don't change this name, then change those?

I suppose we should change all of the throughput APIs in one shot if we're going to do this. But if there's no support, then I'll just update the code comment.

ABataev added inline comments.Feb 26 2018, 11:29 AM
include/llvm/Analysis/TargetTransformInfo.h
729

Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput.

spatel added inline comments.Feb 26 2018, 12:45 PM
include/llvm/Analysis/TargetTransformInfo.h
729

Maybe I'm still confused. As I understand D43733, we're not estimating the LLVM instruction throughput for the target as a whole. If we were, then fmul/fadd/fsub should have a higher cost than integer add/sub because all recent x86 have greater overall throughput for scalar integer ops than FP ops because there are more scalar ALUs.

IMO, the more important part is that we change Cost -> Throughput for everything (I think) that the vectorizers are using, so we're at least different than 'user cost' or 'operation cost'. I guess people will have to read the comment anyway to distinguish the subtlety between 'getArithmeticThroughput' / 'getALUThroughput'.

ABataev added inline comments.Feb 26 2018, 12:57 PM
include/llvm/Analysis/TargetTransformInfo.h
729

As I understand, we estimate throughout for the target as a whole. And because of that we assume that the cost of the integer scalar instructions is just 1 (though it may be lower). This just needs to be fixed somehow, e.g by introducing some multiplier.

RKSimon added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
729

If we did have access to the target's scheduling model, maybe something like:

int Cost = (int) std::max(Inst->getThroughput() / Model->IssueWidth, 1.0f) ?

Although I don't think we're even close to being able to use scheduling models here yet - x86 at least has very few CPUs correctly modelled, although there does seem to be some interest from @andreadb @courbet et al. to use the more aggressively,

fhahn added inline comments.Feb 26 2018, 2:41 PM
include/llvm/Analysis/TargetTransformInfo.h
729

Maybe AArch64 would be a good target to test using the scheduling model info? The available and required CPU units should be modelled quite well there.

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

Ok, but do you agree that changing the cost of FP ops to be the same as int ops moved us further away from x86 reality than before?

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

Ok, but do you agree that changing the cost of FP ops to be the same as int ops moved us further away from x86 reality than before?

Yes, I agree. But I think this is the less important problem because this cost model is intended for comparing the cost of the instructions with the same base types, like for comparing scalar float type with vector float type or scalar integer type with vector integer type.

Sanjay,
Maybe it's worth to send the API change to llvm-dev as RFC?

include/llvm/Analysis/TargetTransformInfo.h
729

Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput.

Agree with Alexey getUnitThroughput looks confusing whilst the information is requested for an operation. My understanding of 'throughput', when it is used for operations, is:

Throughput is the number of cycles after issue that another instruction can begin execution.

In the function comments I see it is the reciprocal throughput which means:

The reciprocal throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The reciprocal throughput is also called issue latency.

I don't know how many people are familiar with the difference. As a result there will be incorrect uses. Another point is that reciprocal throughput can be less than 1. I think such cases can be represented by a pair of integers: number of operations, number of clocks.

Is it important for the function to be ALU specific? Can the function be used for memory operations?

nhaehnle added inline comments.Feb 27 2018, 7:06 PM
include/llvm/Analysis/TargetTransformInfo.h
729

The name of the function should really be getUnitReciprocalThroughput. Throughput means higher numbers = better, whereas here higher numbers = worse.

At this point, I'll settle for just having an accurate function comment. :)
Let me update with just that much and see if we can reach consensus.

Can the function be used for memory operations?

I don't think so - we have "getMemoryOpCost" for that purpose IIUC. So again, a renaming exercise would be best if we make the whole API uniform.

spatel updated this revision to Diff 136397.Feb 28 2018, 2:18 PM

Patch updated:
Just try to make it clear that this is a reciprocal throughput and explain what that means in the function comment.

This makes sense to me, and is in line with how we use it in AMDGPU (where we don't have multiple execution units).

This revision is now accepted and ready to land.Mar 7 2018, 10:59 AM