This is an archive of the discontinued LLVM Phabricator instance.

[LegalizeTypes] Teach DAGTypeLegalizer::GenWidenVectorLoads to pad with undef if needed when concatenating smaller loads to match a larger load
Closed · Public

Authored by craig.topper on Jul 23 2020, 1:40 PM.

Details

Reviewers

spatel
efriedma
nadav
t.p.northover

Commits

rG152c2b1befb1: [LegalizeTypes] Teach DAGTypeLegalizer::GenWidenVectorLoads to pad with undef…
rG8131e190647a: [LegalizeTypes] Teach DAGTypeLegalizer::GenWidenVectorLoads to pad with undef…

Summary

In the included test case, the align 16 allowed the v23f32 load to be handled as load v16f32, load v4f32, and load v4f32 (one element not used). These loads all need to be concatenated together into a final vector. In this case we tried to concatenate the two v4f32 loads to match the type of the v16f32 load so we could do a second concat_vectors, but those loads alone only add up to v8f32. So we need two v4f32 undefs to pad it.
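The padding arithmetic described above can be sketched outside of LLVM. This is only an illustrative model, not the patch itself: each load is reduced to its width in bits, and the names `MergeStep` and `planMerges` are hypothetical. It walks the loads from last to first and, whenever a wider load is reached, reports how many real pieces are available and how many undef pieces of the narrower width are needed to fill one vector of the wider width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a load is represented only by its width in bits.
struct MergeStep {
  unsigned targetBits; // width of the wider load being matched
  unsigned realOps;    // narrower pieces that actually hold loaded data
  unsigned undefOps;   // undef pieces required as padding
};

// Walk the loads from last to first. Trailing loads of equal width are
// collected; when a wider load is reached, they are concatenated (with
// undef padding if needed) into one vector of that wider width.
std::vector<MergeStep> planMerges(const std::vector<unsigned> &loadBits) {
  std::vector<MergeStep> steps;
  if (loadBits.empty())
    return steps;
  unsigned curBits = loadBits.back();
  unsigned pending = 1; // pieces of width curBits collected so far
  for (std::size_t i = loadBits.size() - 1; i-- > 0;) {
    unsigned newBits = loadBits[i];
    if (newBits != curBits) {
      // Assumption of this model: widths divide evenly and the collected
      // pieces never exceed one vector of the wider type.
      assert(newBits % curBits == 0);
      unsigned numOps = newBits / curBits;
      assert(pending <= numOps);
      steps.push_back({newBits, pending, numOps - pending});
      curBits = newBits;
      pending = 1; // the merged result is now a single piece
    }
    ++pending; // account for the load at index i
  }
  return steps;
}
```

For the widths in the test case, `planMerges({512, 128, 128})` yields a single step with a 512-bit target, two real 128-bit pieces, and two undef pieces, which matches the two v4f32 undefs mentioned in the summary.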

It appears we've tried to hack around a similar issue in this code before by adding undef padding to loads in one of the earlier loops in this function. This was originally done in r147964 by padding all loads narrower than previous loads to the same size, and was later modified in r293088 to pad only the last load. This patch removes that earlier code and just handles it on demand where we know we need it.

Fixes PR46820

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

craig.topper created this revision. Jul 23 2020, 1:40 PM

Herald added a project: Restricted Project. Jul 23 2020, 1:40 PM

Herald added a subscriber: hiraditya.

-Add CHECK lines to the test

craig.topper edited the summary of this revision. Jul 23 2020, 1:42 PM

I was stepping through that function myself after seeing the bug, and it wasn't clear to me why we are OK with reading more than the actual load size. I.e., we are only allowed to read <23 x float>, but we created a <16 x float> load plus two <4 x float> loads, right? Is something later guaranteed to reduce that so we don't actually load the full <24 x float>?

No, we load the whole 24 x float. Because the load is 16-byte aligned, we know the last load won't cross a page boundary, so it won't fault.

On almost all targets, if N is small, reading N bytes from a dereferenceable pointer with align N is safe, even if the known dereferenceable bytes is less than N, due to the way memory allocation works. We do this in a few places, I think?

Err, I guess you know that. Yes, it's depending on the alignment here.

It might be a good idea to explicitly call this out on the line where we compute LdAlign.
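The alignment argument in this exchange can be checked with a small standalone sketch. It is not LLVM code; the helper names are hypothetical, and the 4096-byte page size used in the usage note is an assumption about the target. The idea is that an access of N bytes starting at an N-aligned address lies entirely inside one N-byte chunk, and since the page size is a multiple of N, that chunk lies inside one page.

```cpp
#include <cassert>
#include <cstdint>

// True if the byte range [addr, addr + size) lies within a single page.
bool staysInOnePage(std::uint64_t addr, std::uint64_t size,
                    std::uint64_t pageSize) {
  return size != 0 && (addr / pageSize) == ((addr + size - 1) / pageSize);
}

// Exhaustively checks every `align`-aligned start address across `pages`
// pages, reading exactly `align` bytes each time.
bool alignedAccessNeverCrossesPage(std::uint64_t align, std::uint64_t pageSize,
                                   std::uint64_t pages) {
  // Precondition assumed by the argument: the page size is a multiple of
  // the alignment.
  assert(align != 0 && pageSize % align == 0);
  for (std::uint64_t addr = 0; addr < pageSize * pages; addr += align)
    if (!staysInOnePage(addr, align, pageSize))
      return false;
  return true;
}
```

With an assumed 4096-byte page, `alignedAccessNeverCrossesPage(16, 4096, 4)` returns true, while an unaligned 16-byte read such as `staysInOnePage(4090, 16, 4096)` returns false. In the test case the last v4f32 load starts at byte offset 80 from a 16-byte-aligned base, so it is itself 16-byte aligned and the 4 bytes past the end of the <23 x float> stay on the same page as the valid data.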

Improve comment about widening loads based on alignment

Harbormaster failed remote builds in B65460: Diff 280270! Jul 23 2020, 3:53 PM

Ah, ok. Might be good to include the same test with minimal alignment, so we see that difference in asm?
The way this code is replacing at the end of the ConcatOps vector is confusing (not sure if we can assert anything from the power-of-2 size assumption?), but that's existing code, so I won't hold up the fix. LGTM.

This revision is now accepted and ready to land. Jul 23 2020, 4:46 PM

Closed by commit rG8131e190647a: [LegalizeTypes] Teach DAGTypeLegalizer::GenWidenVectorLoads to pad with undef… (authored by craig.topper). Jul 23 2020, 7:03 PM

This revision was automatically updated to reflect the committed changes.

frasercrmck mentioned this in D76682: [LegalizeTypes] Handle gaps in legal vector types while widening loads. Oct 27 2020, 2:48 AM

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	27 lines
llvm/test/CodeGen/X86/pr46820.ll	47 lines

Diff 280312

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

[4,906 lines not shown] SDValue DAGTypeLegalizer::GenWidenVectorLoads(SmallVectorImpl<SDValue> &LdChain,
   // Load information
   SDValue Chain = LD->getChain();
   SDValue BasePtr = LD->getBasePtr();
   MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
   AAMDNodes AAInfo = LD->getAAInfo();
 
   int LdWidth = LdVT.getSizeInBits();
   int WidthDiff = WidenWidth - LdWidth;
-  // Allow wider loads.
+  // Allow wider loads if they are sufficiently aligned to avoid memory faults
+  // and if the original load is simple.
   unsigned LdAlign = (!LD->isSimple()) ? 0 : LD->getAlignment();
 
   // Find the vector type that can load from.
   EVT NewVT = FindMemType(DAG, TLI, LdWidth, WidenVT, LdAlign, WidthDiff);
   int NewVTWidth = NewVT.getSizeInBits();
   SDValue LdOp = DAG.getLoad(NewVT, dl, Chain, BasePtr, LD->getPointerInfo(),
                              LD->getOriginalAlign(), MMOFlags, AAInfo);
   LdChain.push_back(LdOp.getValue(1));
[35 lines not shown] while (LdWidth > 0) {
     if (LdWidth < NewVTWidth) {
       // The current type we are using is too large. Find a better size.
       NewVT = FindMemType(DAG, TLI, LdWidth, WidenVT, LdAlign, WidthDiff);
       NewVTWidth = NewVT.getSizeInBits();
       L = DAG.getLoad(NewVT, dl, Chain, BasePtr,
                       LD->getPointerInfo().getWithOffset(Offset),
                       LD->getOriginalAlign(), MMOFlags, AAInfo);
       LdChain.push_back(L.getValue(1));
-      if (L->getValueType(0).isVector() && NewVTWidth >= LdWidth) {
-        // Later code assumes the vector loads produced will be mergeable, so we
-        // must pad the final entry up to the previous width. Scalars are
-        // combined separately.
-        SmallVector<SDValue, 16> Loads;
-        Loads.push_back(L);
-        unsigned size = L->getValueSizeInBits(0);
-        while (size < LdOp->getValueSizeInBits(0)) {
-          Loads.push_back(DAG.getUNDEF(L->getValueType(0)));
-          size += L->getValueSizeInBits(0);
-        }
-        L = DAG.getNode(ISD::CONCAT_VECTORS, dl, LdOp->getValueType(0), Loads);
-      }
     } else {
       L = DAG.getLoad(NewVT, dl, Chain, BasePtr,
                       LD->getPointerInfo().getWithOffset(Offset),
                       LD->getOriginalAlign(), MMOFlags, AAInfo);
       LdChain.push_back(L.getValue(1));
     }
 
     LdOps.push_back(L);
[24 lines not shown] if (!LdTy.isVector()) {
     }
     ConcatOps[--Idx] = BuildVectorFromScalar(DAG, LdTy, LdOps, i + 1, End);
   }
   ConcatOps[--Idx] = LdOps[i];
   for (--i; i >= 0; --i) {
     EVT NewLdTy = LdOps[i].getValueType();
     if (NewLdTy != LdTy) {
       // Create a larger vector.
+      unsigned NumOps = NewLdTy.getSizeInBits() / LdTy.getSizeInBits();
+      assert(NewLdTy.getSizeInBits() % LdTy.getSizeInBits() == 0);
+      SmallVector<SDValue, 16> WidenOps(NumOps);
+      unsigned j = 0;
+      for (; j != End-Idx; ++j)
+        WidenOps[j] = ConcatOps[Idx+j];
+      for (; j != NumOps; ++j)
+        WidenOps[j] = DAG.getUNDEF(LdTy);
+
       ConcatOps[End-1] = DAG.getNode(ISD::CONCAT_VECTORS, dl, NewLdTy,
-                                     makeArrayRef(&ConcatOps[Idx], End - Idx));
+                                     WidenOps);
       Idx = End - 1;
       LdTy = NewLdTy;
     }
     ConcatOps[--Idx] = LdOps[i];
   }
 
   if (WidenWidth == LdTy.getSizeInBits() * (End - Idx))
     return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT,
[217 lines not shown]

llvm/test/CodeGen/X86/pr46820.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx512f | FileCheck %s

				; The alignment of 16 causes type legalization to split this as 3 loads,
				; v16f32, v4f32, and v4f32. This loads 24 elements, but the load is aligned
				; to 16 bytes so this is safe. There was an issue with type legalization building
				; the proper concat_vectors for this because the two v4f32s don't add up to
				; v16f32 and require padding.

				define <23 x float> @load23(<23 x float>* %p) {
				; CHECK-LABEL: load23:
				; CHECK: # %bb.0:
				; CHECK-NEXT: movq %rdi, %rax
				; CHECK-NEXT: vmovups 64(%rsi), %ymm0
				; CHECK-NEXT: vmovups (%rsi), %zmm1
				; CHECK-NEXT: vmovaps 64(%rsi), %xmm2
				; CHECK-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
				; CHECK-NEXT: vmovss %xmm3, 88(%rdi)
				; CHECK-NEXT: vmovaps %xmm2, 64(%rdi)
				; CHECK-NEXT: vmovaps %zmm1, (%rdi)
				; CHECK-NEXT: vextractf128 $1, %ymm0, %xmm0
				; CHECK-NEXT: vmovlps %xmm0, 80(%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%t0 = load <23 x float>, <23 x float>* %p, align 16
				ret <23 x float> %t0
				}

				; Same test as above with minimal alignment just to demonstrate the different
				; codegen.
				define <23 x float> @load23_align_1(<23 x float>* %p) {
				; CHECK-LABEL: load23_align_1:
				; CHECK: # %bb.0:
				; CHECK-NEXT: movq %rdi, %rax
				; CHECK-NEXT: vmovups (%rsi), %zmm0
				; CHECK-NEXT: vmovups 64(%rsi), %xmm1
				; CHECK-NEXT: movq 80(%rsi), %rcx
				; CHECK-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
				; CHECK-NEXT: vmovss %xmm2, 88(%rdi)
				; CHECK-NEXT: movq %rcx, 80(%rdi)
				; CHECK-NEXT: vmovaps %xmm1, 64(%rdi)
				; CHECK-NEXT: vmovaps %zmm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%t0 = load <23 x float>, <23 x float>* %p, align 1
				ret <23 x float> %t0
				}
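
The difference between the two tests can be modeled with a simplified sketch of how the piece widths might be chosen. This is not the real `FindMemType`: the function name `splitLoad` is hypothetical, the candidate widths are an assumption standing in for the legal types on this AVX-512 target, and legality checks are omitted. A candidate is accepted if it fits in the remaining bits, or, when the alignment permits, if it is no wider than the alignment and the over-read stays within the widened type.

```cpp
#include <cassert>
#include <vector>

// Simplified model of the load split. All widths are in bits.
// loadBits:   size of the original load (736 for <23 x float>)
// widenBits:  size of the widened type (1024 for <32 x float>)
// alignBytes: known alignment, or 0 to forbid reading past the end
std::vector<unsigned> splitLoad(unsigned loadBits, unsigned widenBits,
                                unsigned alignBytes) {
  // Assumed candidate widths, largest first (hypothetical stand-in for the
  // target's legal memory types).
  static const unsigned kCandidates[] = {512, 256, 128, 64, 32};
  assert(loadBits <= widenBits);
  const unsigned alignBits = alignBytes * 8;
  const unsigned slack = widenBits - loadBits; // extra bits in the wide type
  std::vector<unsigned> pieces;
  int remaining = static_cast<int>(loadBits);
  while (remaining > 0) {
    const unsigned rem = static_cast<unsigned>(remaining);
    unsigned chosen = 0;
    for (unsigned w : kCandidates) {
      const bool fits = w <= rem;
      const bool overReadOk =
          alignBits != 0 && w <= alignBits && w <= rem + slack;
      if (fits || overReadOk) {
        chosen = w;
        break;
      }
    }
    if (chosen == 0)
      break; // no candidate applies; stop rather than loop forever
    pieces.push_back(chosen);
    remaining -= static_cast<int>(chosen);
  }
  return pieces;
}
```

Under these assumptions, `splitLoad(736, 1024, 16)` gives 512, 128, 128 (24 floats read, matching the v16f32, v4f32, v4f32 split described in the test comment), while `splitLoad(736, 1024, 1)` gives 512, 128, 64, 32 (exactly 23 floats, matching the zmm, xmm, 8-byte, and 4-byte loads in the `load23_align_1` checks).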