Store operation takes 2 UOps on X86 processors. The exact cost calculation affects several optimization passes including loop unroling.
This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Elena, can you please add a test-case? Maybe a loop that this patch will prevent from being unrolled?
Hi Elena,
There is already
/// \return The cost of Load and Store instructions. int TargetTransformInfo::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment, unsigned AddressSpace, const Instruction *I = nullptr) const;
Maybe it should be updated and used?
We have 2 sets of cost functions in TTI. One set is used by vectorizer in order to compare same operations with different vector types. Vectorizer does not estimate num of UOps, it mostly counts instructions. The second set is getOperationCost() returns TCC_Free/Basic/Expensive and used by Unroller, Inliner and, I think, CFG-simplifiy pass. These 2 sets are not mixed AFAIK.
Is store is always 2 UOps on x86, independent of the type and register(xmm/ymm) ?
Thanks,
Ashutosh
Thank you for clarification. I didn't know about this. It is not clear from the documentation. IMHO these specifics should be reflected at the language level not at the doc level.
What about having getLoadCost? I think there will be a reasonable question why there is getStoreCost but no getLoadCost.
Or can we have the overloaded version of TargetTransformInfo::getMemoryOpCost:
int TargetTransformInfo::getMemoryOpCost(const Instruction *I)
Yes, if the address has index and scale. If the address is simple, IACA shows "micro fusion".
IACA is publicly available tool that shows number of UOPS and ports per instruction for Intel processors.
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
What about having getLoadCost? I think there will be a reasonable question why there is getStoreCost but no getLoadCost.
Or can we have the overloaded version of TargetTransformInfo::getMemoryOpCost:int TargetTransformInfo::getMemoryOpCost(const Instruction *I)
I don't think that we want to extend interface without any special need. Load is 1 Uop on X86, and, probably, on other targets, so everyone is happy with the common getOperatopnCost().
Maybe I am wrong but the instruction cost is the cost relative to the cost of a typical instruction. TTI does not operates in terms of microarchitecture: u-ops etc. Why do we need a general function which is only for X86? How will other architectures need it? If everyone is happy with getOperationCost why X86 is not? Is it possible to have this functionality in X86TTI without changing general TTI?
BTW ARM has LDM/STM (load and store multiple instructions) which can be quite heavy comparing to a typical instruction.
Maybe I am wrong but the instruction cost is the cost relative to the cost of a typical instruction. TTI does not operates in terms of microarchitecture: u-ops etc. Why do we need a general function which is only for X86? How will other architectures need it? If everyone is happy with getOperationCost why X86 is not? Is it possible to have this functionality in X86TTI without changing general TTI?
The second option is to customize the getOperationCost() for X86. Agree, it is also a way to provide target specific cost. Is that what you suggest?
BTW ARM has LDM/STM (load and store multiple instructions) which can be quite heavy comparing to a typical instruction.
May be ARM should also customize load/store? I don't know. X86 hits performance regression due to wrong estimation of unroll factor against LSD.
Yes, some kind of this. getOperationCost or getUserCost can be customized. For example, X86TTIImpl can override getUserCost in the following way:
unsigned X86TIImpl::getUserCost(const User *U, ArrayRef<const Value *> Operands) { if (isa<StoreInst>(U)) return getStoreCost(cast<Instruction>(U)); return BaseT::getUserCost(U, Operands); }
I updated the patch according to Evgeny's comments.
(Sorry for the delay, I was on vacation)