This is an archive of the discontinued LLVM Phabricator instance.

[TargetTransformInfo] Call target getMemoryOpCost for LoadInst from getUserCost
Needs ReviewPublic

Authored by Carrot on Aug 18 2017, 2:04 PM.

Details

Reviewers
hfinkel
Summary

On modern processors, a memory load is much slower than basic ALU instructions.
In TargetTransformInfoImplCRTPBase::getUserCost, TargetTransformInfoImplBase::getOperationCost is called to compute the cost of a memory load, and it returns TTI::TCC_Basic, which does not reflect the real-world cost.
This patch changes it to call the target-dependent getMemoryOpCost implementation to compute the cost of a memory load.

Diff Detail

Event Timeline

Carrot created this revision.Aug 18 2017, 2:04 PM
Carrot added a reviewer: hfinkel.

This will change *lots* of different things from the vectorizer to the inliner.

  1. It needs thorough benchmarking to tease out any performance shift.
  2. It should be accompanied by some motivating test cases where this changes the behavior of an LLVM pass.
hfinkel edited edge metadata.Aug 21 2017, 9:48 AM

We currently have two cost models in TTI. One is for the vectorizer and it is supposed to measure normalized reciprocal throughputs. The other model, the "user cost" model, is used by the inliner, the loop unroller, and a few other things. This model measures some less-well-defined combination of performance factors (throughput/latency) and code size. I don't know whether tying these things together makes sense. At the risk of going down some rabbit hole, we should probably answer that question. If it does make sense, we should tie them together more completely than just for memory operations.

> We currently have two cost models in TTI. One is for the vectorizer and it is supposed to measure normalized reciprocal throughputs. The other model, the "user cost" model, is used by the inliner, the loop unroller, and a few other things. This model measures some less-well-defined combination of performance factors (throughput/latency) and code size. I don't know whether tying these things together makes sense. At the risk of going down some rabbit hole, we should probably answer that question. If it does make sense, we should tie them together more completely than just for memory operations.

Thank you for clarifying that these actually belong to two different cost models.
My problem is in SimplifyCFG (if-conversion): I want to compute the latency of a dependence chain. Currently I use getUserCost for each instruction, but it returns 1 for a memory load, which is much lower than reality; the gap makes the result almost useless. So which API should I use to get a more precise latency for instructions?

>> We currently have two cost models in TTI. One is for the vectorizer and it is supposed to measure normalized reciprocal throughputs. The other model, the "user cost" model, is used by the inliner, the loop unroller, and a few other things. This model measures some less-well-defined combination of performance factors (throughput/latency) and code size. I don't know whether tying these things together makes sense. At the risk of going down some rabbit hole, we should probably answer that question. If it does make sense, we should tie them together more completely than just for memory operations.
>
> Thank you for clarifying that these actually belong to two different cost models.
> My problem is in SimplifyCFG (if-conversion): I want to compute the latency of a dependence chain. Currently I use getUserCost for each instruction, but it returns 1 for a memory load, which is much lower than reality; the gap makes the result almost useless. So which API should I use to get a more precise latency for instructions?

We don't have one (at the IR level). We should. I really want to see TTI have one cost model that returns <reciprocal throughput, latency, size>, all normalized, for each query. Then each client would be able to pick the right numbers for what it wants. While I dislike suggesting that someone take on a big project to fix a small problem, maybe we can do this incrementally somehow, so that we can keep things moving in the right direction. I suggest that you change the TTI cost interfaces to return some kind of tuple or structure, and then, make all of the latencies default to 1 for scalar operations, 5 for FP/vector ops (or 4 or 6 or whatever you think makes a reasonable default), and STI.getSchedModel().LoadLatency for loads. That will get you the right load latency, and provide an interface we can improve over time to get latency estimates for the other operations.

>> Thank you for clarifying that these actually belong to two different cost models.
>> My problem is in SimplifyCFG (if-conversion): I want to compute the latency of a dependence chain. Currently I use getUserCost for each instruction, but it returns 1 for a memory load, which is much lower than reality; the gap makes the result almost useless. So which API should I use to get a more precise latency for instructions?
>
> We don't have one (at the IR level). We should. I really want to see TTI have one cost model that returns <reciprocal throughput, latency, size>, all normalized, for each query. Then each client would be able to pick the right numbers for what it wants. While I dislike suggesting that someone take on a big project to fix a small problem, maybe we can do this incrementally somehow, so that we can keep things moving in the right direction. I suggest that you change the TTI cost interfaces to return some kind of tuple or structure, and then, make all of the latencies default to 1 for scalar operations, 5 for FP/vector ops (or 4 or 6 or whatever you think makes a reasonable default), and STI.getSchedModel().LoadLatency for loads. That will get you the right load latency, and provide an interface we can improve over time to get latency estimates for the other operations.

Thank you for the good suggestion.
I will add the following new public interface to TTI:

  enum TargetCostKind {
    TCK_THROUGHPUT,
    TCK_LATENCY,
    TCK_CODESIZE,
  };

  int getInstructionCost(Instruction *I, enum TargetCostKind kind);

So clients can use this consistent and clear API.

I will also add simple implementations of getInstructionLatency and getInstructionSize.

eastig added a subscriber: eastig.Aug 24 2017, 3:46 AM