This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI
ClosedPublic

Authored by nhaehnle on Oct 11 2018, 12:26 PM.

Details

Summary

This works around a hardware issue in the (base + offset) calculation
when the base is negative. The impact on code quality should be limited,
since SILoadStoreOptimizer still runs afterwards and is able to
combine loads/stores based on known sign information.

This fixes visible corruption in Hitman on SI (easily reproducible
by running benchmark mode).

Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923
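
As a point of reference, here is a tiny model (an illustration added by the editor, not code from this patch) of how a ds_read2_b32-style access forms its two addresses from a per-lane byte base plus two dword offsets encoded in the instruction. The problematic case described in the summary is a negative base, e.g. -4, where the pair of accesses straddles the lower bound of LDS.

  // Illustrative model only: two dword addresses formed from a byte base plus
  // two instruction-encoded dword offsets, as a ds_read2_b32-style access does.
  // With baseBytes == -4 and offsets 0 and 1, the accesses straddle address 0,
  // which is the situation this change avoids on SI.
  #include <cstdint>
  #include <utility>

  std::pair<int32_t, int32_t> dsRead2Addresses(int32_t baseBytes,
                                               uint8_t offset0, uint8_t offset1) {
    return {baseBytes + 4 * int32_t(offset0), baseBytes + 4 * int32_t(offset1)};
  }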

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle created this revision.Oct 11 2018, 12:26 PM
arsenm added inline comments.Oct 12 2018, 4:43 AM
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
296 ↗(On Diff #169270)

This seems like an expensive check for this case. Is it really that important?

arsenm added inline comments.Oct 12 2018, 4:47 AM
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
296 ↗(On Diff #169270)

I mean I don't understand why this would really matter that much. If we ignore this problem and let it vectorize, the resulting code shouldn't be that different when selection fixes it. The advantage is just making the IR closer to the final hardware instructions, which has minor cost analysis benefits?

This still doesn't solve the real issue? We still have the selection. The vectorized load could appear in the original source program.

> This still doesn't solve the real issue? We still have the selection. The vectorized load could appear in the original source program.

Well, an argument could be made that the corruption issue really only occurred because the shader was doing something that would be undefined behavior in the usual C semantics, namely accessing out-of-bounds data. What semantics we want to support for such accesses is largely up to us. Do we want to support vectorized loads that are partially out-of-bounds? GLSL says that out-of-bounds loads should be handled gracefully, but Mesa will always create scalar loads, so this particular patch is fine.

That said, I suppose we could say that we want partially out-of-bounds vectorized loads to be handled gracefully as well (with per-element out-of-bounds checks), and scalarize those vectorized loads again during selection. I haven't looked into that yet.

nhaehnle updated this revision to Diff 169857.Oct 16 2018, 10:53 AM

@arsenm How about this approach?

nhaehnle updated this revision to Diff 169860.Oct 16 2018, 10:56 AM
nhaehnle retitled this revision from AMDGPU: Restrict DS load/store vectorizing on SI to AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI.
nhaehnle edited the summary of this revision. (Show Details)
nhaehnle removed a subscriber: tomswift98.

Update the summary

Yes, something like this. I still expected to see some change in SelectDS64Bit4ByteAligned, which has the FIXME for this bug? I suppose the check for the sign bit would need to be here in the lowering, since it would be more difficult to split during selection.

lib/Target/AMDGPU/SIISelLowering.cpp
6711 ↗(On Diff #169860)

NumElements == 2 is redundant and possibly wrong?

nhaehnle updated this revision to Diff 169870.Oct 16 2018, 11:33 AM

Right, I looked at SelectDS64Bit4ByteAligned, but doing the split there
requires more code and also seems wrong since it'd mean inserting
additional nodes fairly late without giving them a chance to be combined
(not that they'd be combined very often, but still).

In a sense, splitting the vectorized load is a form of legalization.

Anyway, removing the FIXME comment.
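
For reference, a minimal sketch of the kind of split being discussed, roughly as it might appear in SITargetLowering's handling of local-address loads (reconstructed by the editor from this thread, not the literal diff; the exact condition and the surrounding LowerLOAD context are assumptions):

  // Sketch only: inside the LOCAL_ADDRESS load path of SITargetLowering, split
  // 8-byte, 4-byte-aligned vector loads on SI so that ds_read2_b32 is never
  // formed with a possibly negative base. SILoadStoreOptimizer can re-combine
  // the two resulting ds_read_b32 accesses later once the sign of the base is
  // known.
  if (Subtarget->getGeneration() == AMDGPUSubtarget::SOUTHERN_ISLANDS &&
      NumElements == 2 && MemVT.getStoreSize() == 8 &&
      Load->getAlignment() < 8)
    return SplitVectorLoad(Op, DAG);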

nhaehnle added inline comments.Oct 16 2018, 11:34 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6711 ↗(On Diff #169860)

I don't know. We shouldn't have unaligned i64 loads at this point, I guess, but the check does ensure that we're really dealing with a vector load. And NumElements > 2 is dealt with above.

nhaehnle added inline comments.Oct 16 2018, 11:41 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6711 ↗(On Diff #169860)

And what's more, even if we did have unaligned i64 load/store at this point, it doesn't really make sense to try to fix them. The SI bug only affects the case where the 8 bytes straddle the lower bound of LDS (i.e., vaddr == -4). Trying to load an i64 from there is wrong anyway.

arsenm accepted this revision.Oct 16 2018, 2:24 PM

LGTM. Is this only actually a problem with the UB because we don't bother trying to set m0 to the allocated size?

This revision is now accepted and ready to land.Oct 16 2018, 2:24 PM
nhaehnle updated this revision to Diff 169959.Oct 17 2018, 1:00 AM
  • actually disable the pattern to select ds_{read,write}2_b32 to catch cases where we'd regress this fix

> LGTM. Is this only actually a problem with the UB because we don't bother trying to set m0 to the allocated size?

Thanks. Not quite: the issue is that the shader actually does an out-of-bounds load. That would be UB in a normal language, except that GLSL actually says it's okay in this case. Of course the shader doesn't use the resulting value (there's a loop, and the shader loads two adjacent i32 values in each loop iteration, from indices n-1 and n; the n-1 part ends up not contributing to the calculation during the first iteration).
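
To make that access pattern concrete, here is a hypothetical C-style sketch of the loop described above (the actual shader is not shown in the review; the function name and the combining arithmetic are made up for illustration):

  // Hypothetical sketch of the access pattern described above. Each iteration
  // reads LDS words n-1 and n; when n == 0 the n-1 read is out of bounds and
  // its value is discarded. A load/store vectorizer can merge the two reads
  // into a single 8-byte, 4-byte-aligned access, which on SI would select
  // ds_read2_b32 with a base of -4 in the first iteration, exactly the case
  // this patch avoids.
  int accumulatePairs(const int *ldsData, int count) {
    int acc = 0;
    for (int n = 0; n < count; ++n) {
      int prev = ldsData[n - 1]; // out of bounds when n == 0
      int cur = ldsData[n];
      acc += (n == 0) ? cur : prev + cur; // prev is unused when n == 0
    }
    return acc;
  }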

This revision was automatically updated to reflect the committed changes.