This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
include/llvm/Analysis/
-
llvm/
-
Analysis/
9
TargetTransformInfo.h

Differential D43769

[TTI] rename getArithmeticInstructionCost() to getUnitThroughput(); NFC
ClosedPublic

Authored by spatel on Feb 26 2018, 10:18 AM.

Download Raw Diff

Details

Reviewers

ABataev
RKSimon
hfinkel
fhahn
anemet
craig.topper
eastig

Summary

Let's explicitly state the meaning of this API in its name and the code comments to make the intent clear.

This assumes that the x86 overrides of the base cost model are doing the right thing already and D43733 is improving on that. The definition and usage of this cost were not obvious in D43079 / D42981.

If clients are not using this cost as intended, then the name change should make it apparent. Out-of-tree targets will have to update for the cosmetic change, but that's an opportunity to verify that they're either ok with the base model costs or have a reasonable override for those costs.

Diff Detail

Event Timeline

spatel created this revision.Feb 26 2018, 10:18 AM

Herald added subscribers: kbarton, javed.absar, nhaehnle and 5 others. · View Herald TranscriptFeb 26 2018, 10:18 AM

ABataev added inline comments.Feb 26 2018, 10:22 AM

include/llvm/Analysis/TargetTransformInfo.h
729	Seems to me very general, the name does not show that this is the throughput for arithmetic instructions only. I think it is going to be enough just to clarify the comment.

spatel added inline comments.Feb 26 2018, 10:50 AM

include/llvm/Analysis/TargetTransformInfo.h
729	getXXXCost is the most ambiguous. getALUUnitThroughput is better? I'd like to make it so we don't have to read the comments to distinguish this from getUserCost and getOperationCost. So if we don't change this name, then change those? I suppose we should change all of the throughput APIs in one shot if we're going to do this. But if there's no support, then I'll just update the code comment.

ABataev added inline comments.Feb 26 2018, 11:29 AM

include/llvm/Analysis/TargetTransformInfo.h
729	Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput.

spatel added inline comments.Feb 26 2018, 12:45 PM

include/llvm/Analysis/TargetTransformInfo.h
729	Maybe I'm still confused. As I understand D43733, we're not estimating the LLVM instruction throughput for the target as a whole. If we were, then fmul/fadd/fsub should have a higher cost than integer add/sub because all recent x86 have greater overall throughput for scalar integer ops than FP ops because there are more scalar ALUs. IMO, the more important part is that we change Cost -> Throughput for everything (I think) that the vectorizers are using, so we're at least different than 'user cost' or 'operation cost'. I guess people will have to read the comment anyway to distinguish the subtlety between 'getArithmeticThroughput' / 'getALUThroughput'.

ABataev added inline comments.Feb 26 2018, 12:57 PM

include/llvm/Analysis/TargetTransformInfo.h
729	As I understand, we estimate throughout for the target as a whole. And because of that we assume that the cost of the integer scalar instructions is just 1 (though it may be lower). This just needs to be fixed somehow, e.g by introducing some multiplier.

RKSimon added subscribers: courbet, andreadb.Feb 26 2018, 2:21 PM

RKSimon added inline comments.

include/llvm/Analysis/TargetTransformInfo.h
729	If we did have access to the target's scheduling model, maybe something like: int Cost = (int) std::max(Inst->getThroughput() / Model->IssueWidth, 1.0f) ? Although I don't think we're even close to being able to use scheduling models here yet - x86 at least has very few CPUs correctly modelled, although there does seem to be some interest from @andreadb @courbet et al. to use the more aggressively,

fhahn added inline comments.Feb 26 2018, 2:41 PM

include/llvm/Analysis/TargetTransformInfo.h
729	Maybe AArch64 would be a good target to test using the scheduling model info? The available and required CPU units should be modelled quite well there.

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

In D43769#1020943, @spatel wrote:

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

In D43769#1020967, @ABataev wrote:

In D43769#1020943, @spatel wrote:

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

Ok, but do you agree that changing the cost of FP ops to be the same as int ops moved us further away from x86 reality than before?

In D43769#1020993, @spatel wrote:

In D43769#1020967, @ABataev wrote:

In D43769#1020943, @spatel wrote:

This just needs to be fixed somehow, e.g by introducing some multiplier.

Now I'm even more confused.

The cost model is approximating the throughput of the entire target, not a single pipeline/execution unit.
But D43733 did the opposite of that. We said that FP ops after Pentium4 have the lowest possible reciprocal throughput value which means they have the same throughput as an integer add.
Did we purposely introduce wrong costs because we acknowledge that the cost-based calculations in the vectorizers are broken? And this is how we are working around that?

The throughput of the integer ops is something about 0.25-0.5, but we just can't represent these numbers with integer type, so we just round these values to 1. That's why I say, that we need some kind of multiplier to fix this problem.

Ok, but do you agree that changing the cost of FP ops to be the same as int ops moved us further away from x86 reality than before?

Yes, I agree. But I think this is the less important problem because this cost model is intended for comparing the cost of the instructions with the same base types, like for comparing scalar float type with vector float type or scalar integer type with vector integer type.

Sanjay,
Maybe it's worth to send the API change to llvm-dev as RFC?

include/llvm/Analysis/TargetTransformInfo.h
729	Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput. Agree with Alexey getUnitThroughput looks confusing whilst the information is requested for an operation. My understanding of 'throughput', when it is used for operations, is: Throughput is the number of cycles after issue that another instruction can begin execution. In the function comments I see it is the reciprocal throughput which means: The reciprocal throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The reciprocal throughput is also called issue latency. I don't know how many people are familiar with the difference. As a result there will be incorrect uses. Another point is that reciprocal throughput can be less than 1. I think such cases can be represented by a pair of integers: number of operations, number of clocks. Is it important for the function to be ALU specific? Can the function be used for memory operations?

nhaehnle added inline comments.Feb 27 2018, 7:06 PM

include/llvm/Analysis/TargetTransformInfo.h
729	The name of the function should really be `getUnitReciprocalThroughput`. Throughput means higher numbers = better, whereas here higher numbers = worse.

At this point, I'll settle for just having an accurate function comment. :)
Let me update with just that much and see if we can reach consensus.

Can the function be used for memory operations?

I don't think so - we have "getMemoryOpCost" for that purpose IIUC. So again, a renaming exercise would be best if we make the whole API uniform.

Patch updated:
Just try to make it clear that this is a reciprocal throughput and explain what that means in the function comment.

This makes sense to me, and is in line with how we use it in AMDGPU (where we don't have multiple execution units).

Ping.

This revision is now accepted and ready to land.Mar 7 2018, 10:59 AM

rL326956

Revision Contents

Path

Size

include/

llvm/

Analysis/

TargetTransformInfo.h

18 lines

Diff 136397

include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 704 Lines • ▼ Show 20 Lines	public:
/// performed.		/// performed.
unsigned getMaxPrefetchIterationsAhead() const;		unsigned getMaxPrefetchIterationsAhead() const;

/// \return The maximum interleave factor that any transform should try to		/// \return The maximum interleave factor that any transform should try to
/// perform for this target. This number depends on the level of parallelism		/// perform for this target. This number depends on the level of parallelism
/// and the number of execution units in the CPU.		/// and the number of execution units in the CPU.
unsigned getMaxInterleaveFactor(unsigned VF) const;		unsigned getMaxInterleaveFactor(unsigned VF) const;

/// \return The expected cost of arithmetic ops, such as mul, xor, fsub, etc.		/// This is an approximation of reciprocal throughput of a math/logic op.
		/// A higher cost indicates less expected throughput.
		/// From Agner Fog's guides, reciprocal throughput is "the average number of
		/// clock cycles per instruction when the instructions are not part of a
		/// limiting dependency chain."
		/// Therefore, costs should be scaled to account for multiple execution units
		/// on the target that can process this type of instruction. For example, if
		/// there are 5 scalar integer units and 2 vector integer units that can
		/// calculate an 'add' in a single cycle, this model should indicate that the
		/// cost of the vector add instruction is 2.5 times the cost of the scalar
		/// add instruction.
/// \p Args is an optional argument which holds the instruction operands		/// \p Args is an optional argument which holds the instruction operands
/// values so the TTI can analyize those values searching for special		/// values so the TTI can analyze those values searching for special
/// cases\optimizations based on those values.		/// cases or optimizations based on those values.
int getArithmeticInstrCost(		int getArithmeticInstrCost(
unsigned Opcode, Type *Ty, OperandValueKind Opd1Info = OK_AnyValue,		unsigned Opcode, Type *Ty, OperandValueKind Opd1Info = OK_AnyValue,
OperandValueKind Opd2Info = OK_AnyValue,		OperandValueKind Opd2Info = OK_AnyValue,
		ABataevUnsubmitted Not Done Reply Inline Actions Seems to me very general, the name does not show that this is the throughput for arithmetic instructions only. I think it is going to be enough just to clarify the comment. ABataev: Seems to me very general, the name does not show that this is the throughput for arithmetic…
		spatelAuthorUnsubmitted Not Done Reply Inline Actions getXXXCost is the most ambiguous. getALUUnitThroughput is better? I'd like to make it so we don't have to read the comments to distinguish this from getUserCost and getOperationCost. So if we don't change this name, then change those? I suppose we should change all of the throughput APIs in one shot if we're going to do this. But if there's no support, then I'll just update the code comment. spatel: getXXXCost is the most ambiguous. getALUUnitThroughput is better? I'd like to make it so we…
		ABataevUnsubmitted Not Done Reply Inline Actions Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput. ABataev: Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of…
		spatelAuthorUnsubmitted Not Done Reply Inline Actions Maybe I'm still confused. As I understand D43733, we're not estimating the LLVM instruction throughput for the target as a whole. If we were, then fmul/fadd/fsub should have a higher cost than integer add/sub because all recent x86 have greater overall throughput for scalar integer ops than FP ops because there are more scalar ALUs. IMO, the more important part is that we change Cost -> Throughput for everything (I think) that the vectorizers are using, so we're at least different than 'user cost' or 'operation cost'. I guess people will have to read the comment anyway to distinguish the subtlety between 'getArithmeticThroughput' / 'getALUThroughput'. spatel: Maybe I'm still confused. As I understand D43733, we're not estimating the LLVM instruction…
		ABataevUnsubmitted Not Done Reply Inline Actions As I understand, we estimate throughout for the target as a whole. And because of that we assume that the cost of the integer scalar instructions is just 1 (though it may be lower). This just needs to be fixed somehow, e.g by introducing some multiplier. ABataev: As I understand, we estimate throughout for the target as a whole. And because of that we…
		RKSimonUnsubmitted Not Done Reply Inline Actions If we did have access to the target's scheduling model, maybe something like: int Cost = (int) std::max(Inst->getThroughput() / Model->IssueWidth, 1.0f) ? Although I don't think we're even close to being able to use scheduling models here yet - x86 at least has very few CPUs correctly modelled, although there does seem to be some interest from @andreadb @courbet et al. to use the more aggressively, RKSimon: If we did have access to the target's scheduling model, maybe something like: int Cost = (int)…
		fhahnUnsubmitted Not Done Reply Inline Actions Maybe AArch64 would be a good target to test using the scheduling model info? The available and required CPU units should be modelled quite well there. fhahn: Maybe AArch64 would be a good target to test using the scheduling model info? The available and…
		eastigUnsubmitted Not Done Reply Inline Actions Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms of throughput, not the ALU Unit throughput. Agree with Alexey getUnitThroughput looks confusing whilst the information is requested for an operation. My understanding of 'throughput', when it is used for operations, is: Throughput is the number of cycles after issue that another instruction can begin execution. In the function comments I see it is the reciprocal throughput which means: The reciprocal throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The reciprocal throughput is also called issue latency. I don't know how many people are familiar with the difference. As a result there will be incorrect uses. Another point is that reciprocal throughput can be less than 1. I think such cases can be represented by a pair of integers: number of operations, number of clocks. Is it important for the function to be ALU specific? Can the function be used for memory operations? eastig: > Still not sure. We're estimating the cost of the arithmetic LLVM instructions here in terms…
		nhaehnleUnsubmitted Not Done Reply Inline Actions The name of the function should really be `getUnitReciprocalThroughput`. Throughput means higher numbers = better, whereas here higher numbers = worse. nhaehnle: The name of the function should really be `getUnitReciprocalThroughput`. Throughput means…
OperandValueProperties Opd1PropInfo = OP_None,		OperandValueProperties Opd1PropInfo = OP_None,
OperandValueProperties Opd2PropInfo = OP_None,		OperandValueProperties Opd2PropInfo = OP_None,
ArrayRef<const Value > Args = ArrayRef<const Value >()) const;		ArrayRef<const Value > Args = ArrayRef<const Value >()) const;

/// \return The cost of a shuffle instruction of kind Kind and of type Tp.		/// \return The cost of a shuffle instruction of kind Kind and of type Tp.
/// The index and subtype parameters are used by the subvector insertion and		/// The index and subtype parameters are used by the subvector insertion and
/// extraction shuffle kinds.		/// extraction shuffle kinds.
int getShuffleCost(ShuffleKind Kind, Type *Tp, int Index = 0,		int getShuffleCost(ShuffleKind Kind, Type *Tp, int Index = 0,
▲ Show 20 Lines • Show All 878 Lines • Show Last 20 Lines