This is an archive of the discontinued LLVM Phabricator instance.

Stop the DAG combiner from combining vector stores greater than preferred vector width...
Closed · Public

Authored by echristo on May 3 2019, 8:16 PM.


Details

Reviewers

craig.topper
hfinkel

Summary

Originally we said that -mprefer-vector-width was only going to stop the vectorizer and some code generation, but here's another spot to handle if we want to make sure we don't canonicalize a memcpy/memmove and then lower it to the widest vector type.

Original testcase:

void Copy256(const char* src, char* dst) {
  char tmp[32];
  for (int i = 0; i < 32; ++i) tmp[i] = src[i];
  for (int i = 0; i < 32; ++i) dst[i] = tmp[i];
}

which is pretty boring, but shows the problem:

vmovups ymm0, ymmword ptr [rdi]
vmovups ymmword ptr [rsi], ymm0
vzeroupper
ret

While the option says that this doesn't necessarily mean no vector code, I think this is a fairly reasonable place to stop some optimization.

Thoughts?
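
For readers who want the idea without opening X86ISelLowering.cpp, here is a rough standalone sketch of the type choice being discussed. It is not LLVM code: `SubtargetModel` and `pickMemOpType` are hypothetical names made up for illustration, and the alignment and no-implicit-float checks are left out. It only assumes that a 32-byte vector type needs AVX plus a preferred width of at least 256 bits, and a 16-byte one needs SSE2 plus at least 128 bits.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the few subtarget queries that matter here.
// These names are illustrative only and are not LLVM API.
struct SubtargetModel {
  bool HasAVX;
  bool HasSSE2;
  unsigned PreferVectorWidth; // in bits, e.g. 128, 256, 512
};

// Pick the widest byte-vector type for a memcpy/memmove of Size bytes,
// capped by the preferred vector width. Returns "other" when no vector
// type qualifies and generic logic would have to decide.
std::string pickMemOpType(uint64_t Size, const SubtargetModel &ST) {
  if (Size >= 32 && ST.HasAVX && ST.PreferVectorWidth >= 256)
    return "v32i8";
  if (Size >= 16 && ST.HasSSE2 && ST.PreferVectorWidth >= 128)
    return "v16i8";
  return "other";
}
```

In this model, the 32-byte copy from the testcase on an AVX target with a preferred width of 128 gets a 16-byte type instead of a 32-byte one, which is the behavior the patch is after.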

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

echristo created this revision. May 3 2019, 8:16 PM

Herald added a project: Restricted Project. May 3 2019, 8:16 PM

Herald added subscribers: llvm-commits, hiraditya, mcrosier.

Arguably this could use some more comments and I'll add those as well.

Seems reasonable to me. Can we have both a 128- and 256-bit test case?

Seems reasonable to me too.

Update comments. Have testcase work for multiple sizes and multiple preferred vector sizes.

In D61546#1492881, @echristo wrote:

Update comments. Have testcase work for multiple sizes and multiple preferred vector sizes.

Thanks. LGTM.

This revision is now accepted and ready to land. May 6 2019, 10:53 PM

This happened here:

echristo@athyra ~/r/llvm-project> git llvm push
Pushing 1 commit:

96aa9dda693 Make sure that the DAG combiner doesn't merge stores that we explicitly asked not be greater than preferred vector width for the vectorizer. Test for both 128 and 256 with a skylake architecture.

Sending llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
Adding llvm/trunk/test/CodeGen/X86/vector-width-store-merge.ll
Transmitting file data ..done
Committing transaction...
Committed revision 360183.
Committed 96aa9dda693 to svn.

I'm bad at remembering to add it.

Revision Contents

Path                                                 Size
llvm/lib/Target/X86/X86ISelLowering.cpp              24 lines
llvm/test/CodeGen/X86/vector-width-store-merge.ll    53 lines

Diff 198395

llvm/lib/Target/X86/X86ISelLowering.cpp

 /// alignment can satisfy any constraint. Similarly if SrcAlign is zero it
 /// means there isn't a need to check it against alignment requirement,
 /// probably because the source does not need to be loaded. If 'IsMemset' is
 /// true, that means it's expanding a memset. If 'ZeroMemset' is true, that
 /// means it's a memset of zero. 'MemcpyStrSrc' indicates whether the memcpy
 /// source is constant so it does not need to be loaded.
 /// It returns EVT::Other if the type should be determined using generic
 /// target-independent logic.
-EVT
-X86TargetLowering::getOptimalMemOpType(
+/// For vector ops we check that the overall size isn't larger than our
+/// preferred vector width.
+EVT X86TargetLowering::getOptimalMemOpType(
     uint64_t Size, unsigned DstAlign, unsigned SrcAlign, bool IsMemset,
     bool ZeroMemset, bool MemcpyStrSrc,
     const AttributeList &FuncAttributes) const {
   if (!FuncAttributes.hasFnAttribute(Attribute::NoImplicitFloat)) {
-    if (Size >= 16 &&
-        (!Subtarget.isUnalignedMem16Slow() ||
-         ((DstAlign == 0 || DstAlign >= 16) &&
-          (SrcAlign == 0 || SrcAlign >= 16)))) {
+    if (Size >= 16 && (!Subtarget.isUnalignedMem16Slow() ||
+                       ((DstAlign == 0 || DstAlign >= 16) &&
+                        (SrcAlign == 0 || SrcAlign >= 16)))) {
       // FIXME: Check if unaligned 32-byte accesses are slow.
-      if (Size >= 32 && Subtarget.hasAVX()) {
+      if (Size >= 32 && Subtarget.hasAVX() &&
+          (Subtarget.getPreferVectorWidth() >= 256)) {
         // Although this isn't a well-supported type for AVX1, we'll let
         // legalization and shuffle lowering produce the optimal codegen. If we
         // choose an optimal type with a vector element larger than a byte,
         // getMemsetStores() may create an intermediate splat (using an integer
         // multiply) before we splat as a vector.
         return MVT::v32i8;
       }
-      if (Subtarget.hasSSE2())
+      if (Subtarget.hasSSE2() && (Subtarget.getPreferVectorWidth() >= 128))
         return MVT::v16i8;
       // TODO: Can SSE1 handle a byte vector?
       // If we have SSE1 registers we should be able to use them.
-      if (Subtarget.hasSSE1() && (Subtarget.is64Bit() || Subtarget.hasX87()))
+      if (Subtarget.hasSSE1() && (Subtarget.is64Bit() || Subtarget.hasX87()) &&
+          (Subtarget.getPreferVectorWidth() >= 128))
         return MVT::v4f32;
     } else if ((!IsMemset || ZeroMemset) && !MemcpyStrSrc && Size >= 8 &&
                !Subtarget.is64Bit() && Subtarget.hasSSE2()) {
       // Do not use f64 to lower memcpy if source is string constant. It's
       // better to use i32 to avoid the loads.
       // Also, do not use f64 to lower memset unless this is a memset of zeros.
       // The gymnastics of splatting a byte value into an XMM register and then
       // only using 8-byte stores (because this is a CPU with slow unaligned
@@ bool X86TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
   // float attribute is set.
   bool NoFloat = DAG.getMachineFunction().getFunction().hasFnAttribute(
       Attribute::NoImplicitFloat);
 
   if (NoFloat) {
     unsigned MaxIntSize = Subtarget.is64Bit() ? 64 : 32;
     return (MemVT.getSizeInBits() <= MaxIntSize);
   }
+  // Make sure we don't merge greater than our preferred vector
+  // width.
+  if (MemVT.getSizeInBits() > Subtarget.getPreferVectorWidth())
+    return false;
   return true;
 }
 
 bool X86TargetLowering::isCtlzFast() const {
   return Subtarget.hasFastLZCNT();
 }
 
 bool X86TargetLowering::isMaskAndCmp0FoldingBeneficial(
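
The store-merging check in canMergeStoresTo can be modeled in isolation roughly as follows. This is a sketch under stated assumptions, not the LLVM implementation: `MergeSubtargetModel` and `canMergeStoresToModel` are hypothetical names, and the memory type is reduced to its size in bits. It keeps the two rules visible in the diff: with no-implicit-float set, merged stores are capped at the native integer width; otherwise they are capped at the preferred vector width.

```cpp
// Hypothetical stand-in for the subtarget state used by the check.
// These names are illustrative only and are not LLVM API.
struct MergeSubtargetModel {
  bool Is64Bit;
  unsigned PreferVectorWidth; // in bits
};

// Returns true if stores may be merged into a single store of MemBits bits.
bool canMergeStoresToModel(unsigned MemBits, bool NoImplicitFloat,
                           const MergeSubtargetModel &ST) {
  if (NoImplicitFloat) {
    // No vector or FP registers allowed: cap at the widest integer register.
    unsigned MaxIntSize = ST.Is64Bit ? 64 : 32;
    return MemBits <= MaxIntSize;
  }
  // Do not merge past the preferred vector width.
  return MemBits <= ST.PreferVectorWidth;
}
```

Under this model a 256-bit merged store is rejected when the preferred width is 128, while a 128-bit one is still allowed.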

llvm/test/CodeGen/X86/vector-width-store-merge.ll

This file was added.

; RUN: llc < %s -mtriple=x86_64-- | FileCheck %s

; This tests whether or not we generate vectors larger than preferred vector width when
; lowering memmove.

; Function Attrs: nounwind uwtable
define weak_odr dso_local void @A(i8* %src, i8* %dst) local_unnamed_addr #0 {
entry:
; CHECK: A
; CHECK-NOT: vmovups %ymm
; CHECK: vmovups %xmm
  call void @llvm.memmove.p0i8.p0i8.i64(i8* align 1 %dst, i8* align 1 %src, i64 32, i1 false)
  ret void
}

; Function Attrs: nounwind uwtable
define weak_odr dso_local void @B(i8* %src, i8* %dst) local_unnamed_addr #0 {
entry:
; CHECK: B
; CHECK-NOT: vmovups %zmm
; CHECK: vmovups %xmm
  call void @llvm.memmove.p0i8.p0i8.i64(i8* align 1 %dst, i8* align 1 %src, i64 64, i1 false)
  ret void
}

; Function Attrs: nounwind uwtable
define weak_odr dso_local void @C(i8* %src, i8* %dst) local_unnamed_addr #2 {
entry:
; CHECK: C
; CHECK-NOT: vmovups %ymm
; CHECK: vmovups %ymm
  call void @llvm.memmove.p0i8.p0i8.i64(i8* align 1 %dst, i8* align 1 %src, i64 32, i1 false)
  ret void
}

; Function Attrs: nounwind uwtable
define weak_odr dso_local void @D(i8* %src, i8* %dst) local_unnamed_addr #2 {
entry:
; CHECK: D
; CHECK-NOT: vmovups %zmm
; CHECK: vmovups %ymm
  call void @llvm.memmove.p0i8.p0i8.i64(i8* align 1 %dst, i8* align 1 %src, i64 64, i1 false)
  ret void
}

; Function Attrs: argmemonly nounwind
declare void @llvm.memmove.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i1 immarg) #1

attributes #0 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="128" "stack-protector-buffer-size"="8" "target-cpu"="skylake-avx512" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+mpx,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { argmemonly nounwind }
attributes #2 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="256" "stack-protector-buffer-size"="8" "target-cpu"="skylake-avx512" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+mpx,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }

!0 = !{i32 1, !"wchar_size", i32 4}