This is an archive of the discontinued LLVM Phabricator instance.

[x86] fix cost model inaccuracy for vector memory ops
Closed, Public

Authored by spatel on Mar 9 2016, 10:28 AM.

Details

Reviewers

RKSimon
zansari
DavidKreitzer

Commits

rG9f6c4d50b4b9: [x86] fix cost model inaccuracy for vector memory ops
rL263069: [x86] fix cost model inaccuracy for vector memory ops

Summary

The irony of this patch is that the one CPU that is affected is AMD Jaguar, and Jaguar has a completely double-pumped AVX implementation. But getting the cost model to reflect that is a much bigger problem. The small goal here is simply to improve on the lie that !AVX2 == SandyBridge.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 50157. Mar 9 2016, 10:28 AM

spatel retitled this revision to [x86] fix cost model inaccuracy for vector memory ops.

spatel updated this object.

spatel added reviewers: DavidKreitzer, RKSimon, zansari.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Mar 9 2016, 10:28 AM

lgtm for being an improvement over the existing code.

This revision is now accepted and ready to land. Mar 9 2016, 11:34 AM

This seems reasonable to me also. It might be nice to have a separate X86Subtarget property for double pumped 32-byte load/stores, but that may be overkill for this one use.

Closed by commit rL263069: [x86] fix cost model inaccuracy for vector memory ops (authored by spatel). Mar 9 2016, 2:28 PM

This revision was automatically updated to reflect the committed changes.

In D18000#371254, @DavidKreitzer wrote:

This seems reasonable to me also. It might be nice to have a separate X86Subtarget property for double pumped 32-byte load/stores, but that may be overkill for this one use.

Thanks, Zia and Dave. I agree a separate property would be good if we really want to model this better. For reference, I noticed this bug as part of the TTI cost model discussion in PR26837:
https://llvm.org/bugs/show_bug.cgi?id=26837

There's a lot of bigger stuff that could be modeled better. :)
Ideally, I think we would lift latency/throughput data from the SchedMachineModel, but I'm not sure how to do that yet.
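
The separate-property idea raised in this thread can be sketched outside of LLVM. The block below is a hypothetical illustration only: `SketchSubtarget`, `DoublePumpedMem32`, and both cost functions are invented names that do not exist in LLVM. It assumes a CPU profile that is double-pumped for 32-byte accesses but does not report slow unaligned 32-byte accesses, which is the situation the summary describes for Jaguar.

```cpp
#include <cassert>

// Hypothetical model of the two subtarget facts under discussion.
// Neither field name is taken from LLVM.
struct SketchSubtarget {
  bool UnalignedMem32Slow; // the proxy queried by the committed patch
  bool DoublePumpedMem32;  // the dedicated property suggested in review
};

// Cost rule using the proxy, as in the committed patch.
int costWithProxy(int NumLegalizedOps, unsigned StoreSizeBytes,
                  const SketchSubtarget &ST) {
  int Cost = NumLegalizedOps;
  if (StoreSizeBytes == 32 && ST.UnalignedMem32Slow)
    Cost *= 2;
  return Cost;
}

// Cost rule using a dedicated (hypothetical) double-pump property.
int costWithProperty(int NumLegalizedOps, unsigned StoreSizeBytes,
                     const SketchSubtarget &ST) {
  int Cost = NumLegalizedOps;
  if (StoreSizeBytes == 32 && ST.DoublePumpedMem32)
    Cost *= 2;
  return Cost;
}

// Usage: the two rules disagree only when the flags disagree.
void checkPropertySketch() {
  // Assumed profile: double-pumped, but unaligned 32-byte access is not slow.
  SketchSubtarget DoublePumpedOnly{false, true};
  assert(costWithProxy(1, 32, DoublePumpedOnly) == 1);
  assert(costWithProperty(1, 32, DoublePumpedOnly) == 2);

  // Assumed profile: both flags set, so both rules double the cost.
  SketchSubtarget Both{true, true};
  assert(costWithProxy(1, 32, Both) == costWithProperty(1, 32, Both));
}
```

Under these assumptions the proxy undercounts the double-pumped profile, which is the gap a dedicated property would close.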

Revision Contents

Path                                                    Size
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp    8 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll    4 lines

Diff 50196

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp

@@ int X86TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
   // Legalize the type.
   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
   assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&
          "Invalid Opcode");
 
   // Each load/store unit costs 1.
   int Cost = LT.first * 1;
 
-  // On Sandybridge 256bit load/stores are double pumped
-  // (but not on Haswell).
-  if (LT.second.getSizeInBits() > 128 && !ST->hasAVX2())
-    Cost*=2;
+  // This isn't exactly right. We're using slow unaligned 32-byte accesses as a
+  // proxy for a double-pumped AVX memory interface such as on Sandybridge.
+  if (LT.second.getStoreSize() == 32 && ST->isUnalignedMem32Slow())
+    Cost *= 2;
 
   return Cost;
 }
 
 int X86TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *SrcTy,
                                       unsigned Alignment,
                                       unsigned AddressSpace) {
   VectorType *SrcVTy = dyn_cast<VectorType>(SrcTy);
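
The effect of the predicate change in getMemoryOpCost can be checked with a small standalone model. This is a minimal sketch, not LLVM code: `MockSubtarget` and both functions are hypothetical stand-ins for `ST->hasAVX2()`, `ST->isUnalignedMem32Slow()`, and the legalized type returned by `getTypeLegalizationCost`. The feature combinations in the usage function are assumed profiles, not flags read from any real CPU definition.

```cpp
#include <cassert>

// Hypothetical stand-in for the two subtarget queries used in the diff.
struct MockSubtarget {
  bool HasAVX2;
  bool UnalignedMem32Slow;
};

// Pre-patch rule: double the cost of any legal type wider than 128 bits
// whenever AVX2 is absent.
int oldMemoryOpCost(int NumLegalizedOps, unsigned LegalTypeBits,
                    const MockSubtarget &ST) {
  int Cost = NumLegalizedOps * 1;
  if (LegalTypeBits > 128 && !ST.HasAVX2)
    Cost *= 2;
  return Cost;
}

// Post-patch rule: double the cost only for 32-byte legal types on
// subtargets that report slow unaligned 32-byte accesses.
int newMemoryOpCost(int NumLegalizedOps, unsigned LegalTypeBits,
                    const MockSubtarget &ST) {
  int Cost = NumLegalizedOps * 1;
  unsigned StoreSizeBytes = (LegalTypeBits + 7) / 8;
  if (StoreSizeBytes == 32 && ST.UnalignedMem32Slow)
    Cost *= 2;
  return Cost;
}

// Usage: the rules differ only for an AVX-only profile without the
// slow-unaligned flag.
void checkCostExamples() {
  MockSubtarget AvxOnlySlow{false, true};  // assumed Sandybridge-like profile
  MockSubtarget AvxOnlyFast{false, false}; // assumed profile of the affected CPU
  MockSubtarget Avx2Fast{true, false};     // assumed Haswell-like profile

  assert(oldMemoryOpCost(1, 256, AvxOnlySlow) == 2);
  assert(newMemoryOpCost(1, 256, AvxOnlySlow) == 2);

  assert(oldMemoryOpCost(1, 256, AvxOnlyFast) == 2);
  assert(newMemoryOpCost(1, 256, AvxOnlyFast) == 1);

  assert(oldMemoryOpCost(1, 256, Avx2Fast) == 1);
  assert(newMemoryOpCost(1, 256, Avx2Fast) == 1);
}
```

In this model the lower cost for the middle profile is what lets the vectorizer consider a <4 x i64> load, matching the FASTMEM32 check in the test.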

llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll

@@ .lr.ph: ; preds = %0, %.lr.ph
   %lftr.wideiv = trunc i64 %indvars.iv.next to i32
   %exitcond = icmp eq i32 %lftr.wideiv, %n
   br i1 %exitcond, label %._crit_edge, label %.lr.ph
 
 ._crit_edge: ; preds = %.lr.ph, %0
   ret i32 undef
 }
 
-;;; FIXME: If 32-byte accesses are fast, this should use a <4 x i64> load.
 
 ; CHECK-LABEL: @read_mod_i64(
-; CHECK: load <2 x i64>
+; SLOWMEM32: load <2 x i64>
+; FASTMEM32: load <4 x i64>
 ; CHECK: ret i32
 define i32 @read_mod_i64(i64* nocapture %a, i32 %n) nounwind uwtable ssp {
   %1 = icmp sgt i32 %n, 0
   br i1 %1, label %.lr.ph, label %._crit_edge
 
 .lr.ph: ; preds = %0, %.lr.ph
   %indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
   %2 = getelementptr inbounds i64, i64* %a, i64 %indvars.iv