Details

Reviewers

foad
arsenm

Commits

rG8110fcc8fc56: AMDGPU/GlobalISel: Fix negative offset folding for buffer_load

Summary

Buffer_load does unsigned offset calculations. Don't fold
operands of 32-bit add that are likely to cause unsigned add
overflow (common case is when one of the operands is negative).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Petar.Avramovic created this revision. Nov 12 2020, 4:50 AM

Herald added a project: Restricted Project. Nov 12 2020, 4:50 AM

Herald added subscribers: llvm-commits, kerbowa, hiraditya and 8 others.

Petar.Avramovic requested review of this revision. Nov 12 2020, 4:50 AM

Herald added a subscriber: wdng. Nov 12 2020, 4:50 AM

foad added inline comments. Nov 12 2020, 5:28 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.mir
51	Can you pre-commit this test so we can see the diff?

Pre-commit the test.

foad added inline comments. Nov 12 2020, 6:32 AM

llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
16 ↗	(On Diff #304806)	I am really not sure about this change. I think `Reg` is always a 32-bit value here (perhaps we should assert this), so any G_ADD that we decompose will be a 32-bit add, so using `unsigned` (or perhaps `int`) for the offset makes sense.
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1355	I think the hardware zero-extends each component of the address from 32 to 64 bits before adding them together. So the question here is: does zext(add(x, y)) == add(zext(x), zext(y))? The answer is "only if add(x, y) has no unsigned overflow". So I don't think "Offset >= 0" is exactly the right condition here, but it does fix the common case of a small negative immediate offset, so I'm happy with this as a short term fix.
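The zero-extension argument above can be checked on concrete values. The following is only a sketch of the arithmetic, not code from the patch; the helper names `zext`, `addOverflowsU32` and `foldIsExact` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Zero-extend a 32-bit value to 64 bits (hypothetical helper).
constexpr uint64_t zext(uint32_t V) { return static_cast<uint64_t>(V); }

// True if the 32-bit unsigned add X + Y wraps around.
constexpr bool addOverflowsU32(uint32_t X, uint32_t Y) {
  return static_cast<uint32_t>(X + Y) < X;
}

// True if zext(add(x, y)) == add(zext(x), zext(y)), i.e. splitting the
// 32-bit add into separately zero-extended components gives the same address.
constexpr bool foldIsExact(uint32_t X, uint32_t Y) {
  return zext(static_cast<uint32_t>(X + Y)) == zext(X) + zext(Y);
}

// A small positive offset folds exactly.
static_assert(foldIsExact(100u, 256u), "no wrap, fold is exact");
// -60 is 0xFFFFFFC4 as a 32-bit value: the 32-bit add wraps to 40, but the
// zero-extended components sum to a value above 2^32.
static_assert(!foldIsExact(100u, static_cast<uint32_t>(-60)), "wraps");
```

For any pair of inputs, `foldIsExact(X, Y)` is the negation of `addOverflowsU32(X, Y)`, which is the condition stated in the comment.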

Petar.Avramovic added inline comments. Nov 12 2020, 8:03 AM

llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
16 ↗	(On Diff #304806)	There is range checking for the offset, so the container for the value should not make a big difference. Choosing int or unsigned brings ambiguity for '32-bit values with 1 in the highest bit' (I would assume that it is more probably a negative offset than a large positive offset, so int could work). With int64 we know whether the value was positive or negative in the IR. Also see (1) and (2) below:
28 ↗	(On Diff #304806)	(2) while we use getZExtValue here. Is this intended?
36 ↗	(On Diff #304806)	(1) m_ICst uses getSExtValue() on APInt

foad added inline comments. Nov 12 2020, 8:52 AM

llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
16 ↗	(On Diff #304806)	"With int64 we know if value was positive or negative in ir." No, it's just a 32-bit value, in IR and in MIR. You don't get any extra information by returning int64_t.
28 ↗	(On Diff #304806)	getSExtValue vs getZExtValue affects how the 32-bit value is extended to int64_t. But if you make this function return a 32-bit result, then that doesn't matter.
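To illustrate the point about extension, here is a small sketch using plain integer casts as stand-ins for `APInt::getSExtValue` and `APInt::getZExtValue` on a 32-bit constant. The helper names `sext32` and `zext32` are hypothetical, and the sketch assumes the usual two's complement conversion from `uint32_t` to `int32_t`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for sign-extending a 32-bit constant to 64 bits
// (assumes two's complement conversion, as on mainstream compilers).
constexpr int64_t sext32(uint32_t Bits) {
  return static_cast<int64_t>(static_cast<int32_t>(Bits));
}

// Stand-in for zero-extending a 32-bit constant to 64 bits.
constexpr int64_t zext32(uint32_t Bits) { return static_cast<int64_t>(Bits); }

// The same 32-bit pattern reads as -4 or 4294967292 depending on the extension.
static_assert(sext32(0xFFFFFFFCu) == -4, "sign-extended view");
static_assert(zext32(0xFFFFFFFCu) == 4294967292LL, "zero-extended view");

// Truncating either 64-bit result back to 32 bits yields the same bits, so a
// function that returns a 32-bit offset is unaffected by the choice.
static_assert(static_cast<uint32_t>(sext32(0xFFFFFFFCu)) ==
                  static_cast<uint32_t>(zext32(0xFFFFFFFCu)),
              "same low 32 bits");
```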

arsenm added inline comments. Nov 12 2020, 9:06 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1355	Don't we have the sign bit is zero check?

Petar.Avramovic added inline comments. Nov 12 2020, 9:48 AM

llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
16 ↗	(On Diff #304806)	I meant that APInt and int64_t (used by G_CONSTANT) keep sign info, so we know what the intended use of that 32-bit value was. Given the 32-bit value 0xFFFFFFFC, we can't handle it here if it was used as a negative offset (-4 as int64_t), since soffset behaves as unsigned in address calculation. However, if it is a large positive (unsigned) offset (4294967292 as int64_t), folding it into soffset behaves correctly.

Petar.Avramovic updated this revision to Diff 306394. Nov 19 2020, 6:26 AM

Petar.Avramovic edited the summary of this revision.

foad added inline comments. Nov 19 2020, 6:47 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1355	Not sure what you mean. Are you suggesting only decomposing the ADD if we know the sign bit is zero in both operands? That would be safe, but I'm not sure how often that test would succeed.

arsenm added inline comments. Nov 19 2020, 6:59 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1355	Yes, this is what we do in a few places in the DAG already
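The stricter condition discussed here (decompose the add only when the sign bit is known to be zero in both operands) is safe because two values below 2^31 cannot sum to 2^32 or more. A minimal sketch on concrete values follows; in the compiler this would be a known-bits query on registers rather than a test on constants, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if bit 31 of V is clear (models a "sign bit is known zero" query
// on a concrete value).
constexpr bool signBitIsZero(uint32_t V) { return (V >> 31) == 0; }

// Conservative guard: allow decomposing X + Y only if neither operand has
// its sign bit set.
constexpr bool safeToDecomposeAdd(uint32_t X, uint32_t Y) {
  return signBitIsZero(X) && signBitIsZero(Y);
}

// True if the 32-bit unsigned add X + Y wraps around.
constexpr bool addWrapsU32(uint32_t X, uint32_t Y) {
  return static_cast<uint32_t>(X + Y) < X;
}

// Even the largest operands that pass the guard do not wrap.
static_assert(safeToDecomposeAdd(0x7FFFFFFFu, 0x7FFFFFFFu), "guard passes");
static_assert(!addWrapsU32(0x7FFFFFFFu, 0x7FFFFFFFu), "sum is 0xFFFFFFFE");
// A negative constant such as -60 is rejected.
static_assert(!safeToDecomposeAdd(100u, static_cast<uint32_t>(-60)), "rejected");
```

The guard is sufficient but not necessary, which matches the concern that it may reject adds that would in fact be safe to fold.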

arsenm added inline comments. Dec 14 2020, 7:28 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1302	Shouldn't use unsigned here

foad added inline comments. Dec 14 2020, 7:33 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1302	I don't think we need this function at all. Surely `(int)Val >= 0` is clear enough?

Sorry about the delay. Use (int)Offset to check for a one in the most significant bit and potential unsigned overflow. The DAG equivalent in SITargetLowering::setBufferOffsets also uses int for the offset.
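As a rough illustration of what the `(int)Offset >= 0` test accepts and rejects, the sketch below models it on plain 32-bit values. `mayFoldConstantOffset` and `addWraps` are hypothetical names, not functions from the patch, and the cast assumes two's complement conversion.

```cpp
#include <cassert>
#include <cstdint>

// Models the (int)Offset >= 0 guard: accept only offsets whose most
// significant bit is clear.
constexpr bool mayFoldConstantOffset(uint32_t Offset) {
  return static_cast<int32_t>(Offset) >= 0;
}

// True if the 32-bit unsigned add Base + Offset wraps around.
constexpr bool addWraps(uint32_t Base, uint32_t Offset) {
  return static_cast<uint32_t>(Base + Offset) < Base;
}

static_assert(mayFoldConstantOffset(256u), "small positive offset is folded");
static_assert(!mayFoldConstantOffset(static_cast<uint32_t>(-60)),
              "small negative offset is not folded");
// Conservative: a genuinely large unsigned offset is also rejected.
static_assert(!mayFoldConstantOffset(0x80000000u), "MSB set, not folded");
// Not exact: the guard looks at the constant only, so a base close to 2^32
// can still wrap with an accepted offset.
static_assert(mayFoldConstantOffset(256u) && addWraps(0xFFFFFFF0u, 256u),
              "guard passes but the add wraps");
```

The last assertion is the residual case behind the earlier remark that "Offset >= 0" is not exactly the right condition.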

LGTM. It is correct in more cases than it was before, and it mostly matches what the DAG version does.

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1353	The DAG version of setBufferOffsets doesn't have this case. Is it done later on when the register classes are known?

This revision is now accepted and ready to land. Dec 18 2020, 6:04 AM

Petar.Avramovic added inline comments. Dec 18 2020, 6:39 AM

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1353	SDAG just gives up and uses add as base (1 extra instruction compared to what global-isel does). See `@s_buffer_load_f32_offset_add_vgpr_sgpr` in `GlobalISel/llvm.amdgcn.s.buffer.load.ll` for example.

Closed by commit rG8110fcc8fc56: AMDGPU/GlobalISel: Fix negative offset folding for buffer_load (authored by Petar.Avramovic). Apr 27 2021, 5:46 AM

This revision was automatically updated to reflect the committed changes.

Petar.Avramovic mentioned this in rG6a3e1b3531c0: AMDGPU/GlobalISel: Add test for buffer_load with negative offset.

Petar.Avramovic added a commit: rG8110fcc8fc56: AMDGPU/GlobalISel: Fix negative offset folding for buffer_load.

Diff 340805

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp

[1,293 lines not shown]	static Register getSrcRegIgnoringCopies(const MachineRegisterInfo &MRI,
MachineInstr *Def = getDefIgnoringCopies(Reg, MRI);		MachineInstr *Def = getDefIgnoringCopies(Reg, MRI);
if (!Def)		if (!Def)
return Reg;		return Reg;

// TODO: Guard against this being an implicit def		// TODO: Guard against this being an implicit def
return Def->getOperand(0).getReg();		return Def->getOperand(0).getReg();
}		}

// Analyze a combined offset from an llvm.amdgcn.s.buffer intrinsic and store		// Analyze a combined offset from an llvm.amdgcn.s.buffer intrinsic and store
		arsenm: Shouldn't use unsigned here
		foad: I don't think we need this function at all. Surely `(int)Val >= 0` is clear enough?
// the three offsets (voffset, soffset and instoffset)		// the three offsets (voffset, soffset and instoffset)
static unsigned setBufferOffsets(MachineIRBuilder &B,		static unsigned setBufferOffsets(MachineIRBuilder &B,
const AMDGPURegisterBankInfo &RBI,		const AMDGPURegisterBankInfo &RBI,
Register CombinedOffset, Register &VOffsetReg,		Register CombinedOffset, Register &VOffsetReg,
Register &SOffsetReg, int64_t &InstOffsetVal,		Register &SOffsetReg, int64_t &InstOffsetVal,
Align Alignment) {		Align Alignment) {
const LLT S32 = LLT::scalar(32);		const LLT S32 = LLT::scalar(32);
MachineRegisterInfo *MRI = B.getMRI();		MachineRegisterInfo *MRI = B.getMRI();
[14 lines not shown]	static unsigned setBufferOffsets(MachineIRBuilder &B,

Register Base;		Register Base;
unsigned Offset;		unsigned Offset;

std::tie(Base, Offset) =		std::tie(Base, Offset) =
AMDGPU::getBaseWithConstantOffset(*MRI, CombinedOffset);		AMDGPU::getBaseWithConstantOffset(*MRI, CombinedOffset);

uint32_t SOffset, ImmOffset;		uint32_t SOffset, ImmOffset;
if (Offset > 0 && AMDGPU::splitMUBUFOffset(Offset, SOffset, ImmOffset,		if ((int)Offset > 0 && AMDGPU::splitMUBUFOffset(Offset, SOffset, ImmOffset,
&RBI.Subtarget, Alignment)) {		&RBI.Subtarget, Alignment)) {
if (RBI.getRegBank(Base, *MRI, *RBI.TRI) == &AMDGPU::VGPRRegBank) {		if (RBI.getRegBank(Base, *MRI, *RBI.TRI) == &AMDGPU::VGPRRegBank) {
VOffsetReg = Base;		VOffsetReg = Base;
SOffsetReg = B.buildConstant(S32, SOffset).getReg(0);		SOffsetReg = B.buildConstant(S32, SOffset).getReg(0);
B.getMRI()->setRegBank(SOffsetReg, AMDGPU::SGPRRegBank);		B.getMRI()->setRegBank(SOffsetReg, AMDGPU::SGPRRegBank);
InstOffsetVal = ImmOffset;		InstOffsetVal = ImmOffset;
return 0; // XXX - Why is this 0?		return 0; // XXX - Why is this 0?
}		}

// If we have SGPR base, we can use it for soffset.		// If we have SGPR base, we can use it for soffset.
if (SOffset == 0) {		if (SOffset == 0) {
VOffsetReg = B.buildConstant(S32, 0).getReg(0);		VOffsetReg = B.buildConstant(S32, 0).getReg(0);
B.getMRI()->setRegBank(VOffsetReg, AMDGPU::VGPRRegBank);		B.getMRI()->setRegBank(VOffsetReg, AMDGPU::VGPRRegBank);
SOffsetReg = Base;		SOffsetReg = Base;
InstOffsetVal = ImmOffset;		InstOffsetVal = ImmOffset;
return 0; // XXX - Why is this 0?		return 0; // XXX - Why is this 0?
}		}
}		}

// Handle the variable sgpr + vgpr case.		// Handle the variable sgpr + vgpr case.
		foad: The DAG version of setBufferOffsets doesn't have this case. Is it done later on when the register classes are known?
		Petar.Avramovic: SDAG just gives up and uses add as base (1 extra instruction compared to what global-isel does). See `@s_buffer_load_f32_offset_add_vgpr_sgpr` in `GlobalISel/llvm.amdgcn.s.buffer.load.ll` for example.
if (MachineInstr *Add = getOpcodeDef(AMDGPU::G_ADD, CombinedOffset, *MRI)) {		MachineInstr *Add = getOpcodeDef(AMDGPU::G_ADD, CombinedOffset, *MRI);
		if (Add && (int)Offset >= 0) {
		foad: I think the hardware zero-extends each component of the address from 32 to 64 bits before adding them together. So the question here is: does zext(add(x, y)) == add(zext(x), zext(y))? The answer is "only if add(x, y) has no unsigned overflow". So I don't think "Offset >= 0" is exactly the right condition here, but it does fix the common case of a small negative immediate offset, so I'm happy with this as a short term fix.
		arsenm: Don't we have the sign bit is zero check?
		foad: Not sure what you mean. Are you suggesting only decomposing the ADD if we know the sign bit is zero in both operands? That would be safe, but I'm not sure how often that test would succeed.
		arsenm: Yes, this is what we do in a few places in the DAG already
Register Src0 = getSrcRegIgnoringCopies(*MRI, Add->getOperand(1).getReg());		Register Src0 = getSrcRegIgnoringCopies(*MRI, Add->getOperand(1).getReg());
Register Src1 = getSrcRegIgnoringCopies(*MRI, Add->getOperand(2).getReg());		Register Src1 = getSrcRegIgnoringCopies(*MRI, Add->getOperand(2).getReg());

const RegisterBank *Src0Bank = RBI.getRegBank(Src0, *MRI, *RBI.TRI);		const RegisterBank *Src0Bank = RBI.getRegBank(Src0, *MRI, *RBI.TRI);
const RegisterBank *Src1Bank = RBI.getRegBank(Src1, *MRI, *RBI.TRI);		const RegisterBank *Src1Bank = RBI.getRegBank(Src1, *MRI, *RBI.TRI);

if (Src0Bank == &AMDGPU::VGPRRegBank && Src1Bank == &AMDGPU::SGPRRegBank) {		if (Src0Bank == &AMDGPU::VGPRRegBank && Src1Bank == &AMDGPU::SGPRRegBank) {
VOffsetReg = Src0;		VOffsetReg = Src0;
[3,013 lines not shown]

llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.mir

[42 lines not shown]	bb.0:
%2:vgpr(s32) = G_CONSTANT i32 256		%2:vgpr(s32) = G_CONSTANT i32 256
%3:_(s32) = G_ADD %1, %2		%3:_(s32) = G_ADD %1, %2
%4:_(s32) = G_AMDGPU_S_BUFFER_LOAD %0, %3, 0		%4:_(s32) = G_AMDGPU_S_BUFFER_LOAD %0, %3, 0
S_ENDPGM 0, implicit %4		S_ENDPGM 0, implicit %4

...		...

---		---
name: s_buffer_load_negative_offset		name: s_buffer_load_negative_offset
		foad: Can you pre-commit this test so we can see the diff?
legalized: true		legalized: true
tracksRegLiveness: true		tracksRegLiveness: true
body: |		body: |
bb.0:		bb.0:
liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0		liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0

; FAST-LABEL: name: s_buffer_load_negative_offset		; FAST-LABEL: name: s_buffer_load_negative_offset
; FAST: liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0		; FAST: liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0
; FAST: [[COPY:%[0-9]+]]:sgpr(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3		; FAST: [[COPY:%[0-9]+]]:sgpr(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3
; FAST: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY $vgpr0		; FAST: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
; FAST: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 -60		; FAST: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 -60
; FAST: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)		; FAST: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)
; FAST: [[ADD:%[0-9]+]]:vgpr(s32) = G_ADD [[COPY1]], [[COPY2]]		; FAST: [[ADD:%[0-9]+]]:vgpr(s32) = G_ADD [[COPY1]], [[COPY2]]
; FAST: [[C1:%[0-9]+]]:vgpr(s32) = G_CONSTANT i32 0		; FAST: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 0
; FAST: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:vgpr(s32) = G_AMDGPU_BUFFER_LOAD [[COPY]](<4 x s32>), [[C1]](s32), [[COPY1]], [[C]], 0, 0, 0 :: (dereferenceable invariant load 4)		; FAST: [[C2:%[0-9]+]]:vgpr(s32) = G_CONSTANT i32 0
		; FAST: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:vgpr(s32) = G_AMDGPU_BUFFER_LOAD [[COPY]](<4 x s32>), [[C2]](s32), [[ADD]], [[C1]], 0, 0, 0 :: (dereferenceable invariant load 4)
; FAST: S_ENDPGM 0, implicit [[AMDGPU_BUFFER_LOAD]](s32)		; FAST: S_ENDPGM 0, implicit [[AMDGPU_BUFFER_LOAD]](s32)
; GREEDY-LABEL: name: s_buffer_load_negative_offset		; GREEDY-LABEL: name: s_buffer_load_negative_offset
; GREEDY: liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0		; GREEDY: liveins: $sgpr0_sgpr1_sgpr2_sgpr3, $vgpr0
; GREEDY: [[COPY:%[0-9]+]]:sgpr(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3		; GREEDY: [[COPY:%[0-9]+]]:sgpr(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3
; GREEDY: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY $vgpr0		; GREEDY: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
; GREEDY: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 -60		; GREEDY: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 -60
; GREEDY: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)		; GREEDY: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)
; GREEDY: [[ADD:%[0-9]+]]:vgpr(s32) = G_ADD [[COPY1]], [[COPY2]]		; GREEDY: [[ADD:%[0-9]+]]:vgpr(s32) = G_ADD [[COPY1]], [[COPY2]]
; GREEDY: [[C1:%[0-9]+]]:vgpr(s32) = G_CONSTANT i32 0		; GREEDY: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 0
; GREEDY: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:vgpr(s32) = G_AMDGPU_BUFFER_LOAD [[COPY]](<4 x s32>), [[C1]](s32), [[COPY1]], [[C]], 0, 0, 0 :: (dereferenceable invariant load 4)		; GREEDY: [[C2:%[0-9]+]]:vgpr(s32) = G_CONSTANT i32 0
		; GREEDY: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:vgpr(s32) = G_AMDGPU_BUFFER_LOAD [[COPY]](<4 x s32>), [[C2]](s32), [[ADD]], [[C1]], 0, 0, 0 :: (dereferenceable invariant load 4)
; GREEDY: S_ENDPGM 0, implicit [[AMDGPU_BUFFER_LOAD]](s32)		; GREEDY: S_ENDPGM 0, implicit [[AMDGPU_BUFFER_LOAD]](s32)
%0:_(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3		%0:_(<4 x s32>) = COPY $sgpr0_sgpr1_sgpr2_sgpr3
%1:_(s32) = COPY $vgpr0		%1:_(s32) = COPY $vgpr0
%2:_(s32) = G_CONSTANT i32 -60		%2:_(s32) = G_CONSTANT i32 -60
%3:_(s32) = G_ADD %1, %2		%3:_(s32) = G_ADD %1, %2
%4:_(s32) = G_AMDGPU_S_BUFFER_LOAD %0, %3, 0		%4:_(s32) = G_AMDGPU_S_BUFFER_LOAD %0, %3, 0
S_ENDPGM 0, implicit %4		S_ENDPGM 0, implicit %4

...		...

This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/GlobalISel: Fix negative offset folding for buffer_load
ClosedPublic
