Changed basic cost of Store operation on X86.
ClosedPublic

Authored by delena on Jul 26 2017, 5:57 AM.

Details

Summary

Store operation takes 2 UOps on X86 processors. The exact cost calculation affects several optimization passes including loop unroling.
This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks.

Diff Detail

Repository
rL LLVM
delena created this revision.Jul 26 2017, 5:57 AM
zvi added a comment.Jul 26 2017, 6:25 AM

Elena, can you please add a test-case? Maybe a loop that this patch will prevent from being unrolled?

Hi Elena,

There is already

/// \return The cost of Load and Store instructions.
int TargetTransformInfo::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
                    unsigned AddressSpace, const Instruction *I = nullptr) const;

Maybe it should be updated and used?

Hi Elena,

There is already

/// \return The cost of Load and Store instructions.
int TargetTransformInfo::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
                    unsigned AddressSpace, const Instruction *I = nullptr) const;

Maybe it should be updated and used?

We have 2 sets of cost functions in TTI. One set is used by vectorizer in order to compare same operations with different vector types. Vectorizer does not estimate num of UOps, it mostly counts instructions. The second set is getOperationCost() returns TCC_Free/Basic/Expensive and used by Unroller, Inliner and, I think, CFG-simplifiy pass. These 2 sets are not mixed AFAIK.

delena updated this revision to Diff 108426.Jul 27 2017, 12:33 AM

Added a test case.

delena updated this revision to Diff 108430.Jul 27 2017, 1:02 AM
delena updated this revision to Diff 108437.Jul 27 2017, 2:34 AM
zvi accepted this revision.Jul 27 2017, 3:56 AM
This revision is now accepted and ready to land.Jul 27 2017, 3:56 AM

Is store is always 2 UOps on x86, independent of the type and register(xmm/ymm) ?

Thanks,
Ashutosh

We have 2 sets of cost functions in TTI. One set is used by vectorizer in order to compare same operations with different vector types. Vectorizer does not estimate num of UOps, it mostly counts instructions. The second set is getOperationCost() returns TCC_Free/Basic/Expensive and used by Unroller, Inliner and, I think, CFG-simplifiy pass. These 2 sets are not mixed AFAIK.

Thank you for clarification. I didn't know about this. It is not clear from the documentation. IMHO these specifics should be reflected at the language level not at the doc level.
What about having getLoadCost? I think there will be a reasonable question why there is getStoreCost but no getLoadCost.
Or can we have the overloaded version of TargetTransformInfo::getMemoryOpCost:

int TargetTransformInfo::getMemoryOpCost(const Instruction *I)

Is store is always 2 UOps on x86, independent of the type and register(xmm/ymm) ?

Thanks,
Ashutosh

Yes, if the address has index and scale. If the address is simple, IACA shows "micro fusion".
IACA is publicly available tool that shows number of UOPS and ports per instruction for Intel processors.
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

What about having getLoadCost? I think there will be a reasonable question why there is getStoreCost but no getLoadCost.
Or can we have the overloaded version of TargetTransformInfo::getMemoryOpCost:

int TargetTransformInfo::getMemoryOpCost(const Instruction *I)

I don't think that we want to extend interface without any special need. Load is 1 Uop on X86, and, probably, on other targets, so everyone is happy with the common getOperatopnCost().

What about having getLoadCost? I think there will be a reasonable question why there is getStoreCost but no getLoadCost.
Or can we have the overloaded version of TargetTransformInfo::getMemoryOpCost:

int TargetTransformInfo::getMemoryOpCost(const Instruction *I)

I don't think that we want to extend interface without any special need. Load is 1 Uop on X86, and, probably, on other targets, so everyone is happy with the common getOperatopnCost().

Maybe I am wrong but the instruction cost is the cost relative to the cost of a typical instruction. TTI does not operates in terms of microarchitecture: u-ops etc. Why do we need a general function which is only for X86? How will other architectures need it? If everyone is happy with getOperationCost why X86 is not? Is it possible to have this functionality in X86TTI without changing general TTI?

BTW ARM has LDM/STM (load and store multiple instructions) which can be quite heavy comparing to a typical instruction.

Maybe I am wrong but the instruction cost is the cost relative to the cost of a typical instruction. TTI does not operates in terms of microarchitecture: u-ops etc. Why do we need a general function which is only for X86? How will other architectures need it? If everyone is happy with getOperationCost why X86 is not? Is it possible to have this functionality in X86TTI without changing general TTI?

The second option is to customize the getOperationCost() for X86. Agree, it is also a way to provide target specific cost. Is that what you suggest?

BTW ARM has LDM/STM (load and store multiple instructions) which can be quite heavy comparing to a typical instruction.

May be ARM should also customize load/store? I don't know. X86 hits performance regression due to wrong estimation of unroll factor against LSD.

Maybe I am wrong but the instruction cost is the cost relative to the cost of a typical instruction. TTI does not operates in terms of microarchitecture: u-ops etc. Why do we need a general function which is only for X86? How will other architectures need it? If everyone is happy with getOperationCost why X86 is not? Is it possible to have this functionality in X86TTI without changing general TTI?

The second option is to customize the getOperationCost() for X86. Agree, it is also a way to provide target specific cost. Is that what you suggest?

Yes, some kind of this. getOperationCost or getUserCost can be customized. For example, X86TTIImpl can override getUserCost in the following way:

unsigned X86TIImpl::getUserCost(const User *U,                                 
                                 ArrayRef<const Value *> Operands) {            
  if (isa<StoreInst>(U))
      return getStoreCost(cast<Instruction>(U));
  return BaseT::getUserCost(U, Operands);                                       
}
delena updated this revision to Diff 110955.Aug 14 2017, 6:33 AM

I updated the patch according to Evgeny's comments.
(Sorry for the delay, I was on vacation)

zvi accepted this revision.Aug 20 2017, 5:02 AM
This revision was automatically updated to reflect the committed changes.

Just back from vacation.
LGTM.
Thank you, Elena.