Details

Reviewers

foad
arsenm

Group Reviewers

Restricted Project

Summary

The SILoadStoreOptimizer::dmasksCanBeCombined() function contains a shift that can lead to undefined behaviour (UB).

This boolean function decides whether two masks can be combined into one. The idea is that the bits that are "on" in one mask must not overlap with the "on" bits of the other. Consider an example (10 bits for simplicity):

Mask 1: 0101101000
Mask 2: 0000000110

Those can be combined into a single mask: 0101101110.

To check whether such an operation is possible, the code takes the larger mask and counts its trailing zeros, that is, the 0 bits starting from the LSB and stopping at the first 1. It then shifts 1u left by this count and compares the result with the smaller mask. The problem is that when both masks are 0, the counter finds 32 zeros in the larger mask and the code tries to shift by 32 positions, which is UB for a 32-bit value.

The fix is a simple sanity check on whether the larger mask is 0.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,030 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

konradkusiak97 created this revision. · Jul 12 2023, 1:57 AM

Herald added a project: Restricted Project. · Jul 12 2023, 1:57 AM

Herald added subscribers: StephenFan, kerbowa, hiraditya and 2 others.

konradkusiak97 requested review of this revision. · Jul 12 2023, 1:57 AM

Herald added a project: Restricted Project. · Jul 12 2023, 1:57 AM

Herald added a subscriber: llvm-commits.

Seems reasonable.

Stray ] in first line of commit message, and the target is called "AMDGPU" not "AMD".

An alternative fix would be to avoid the shift altogether, by checking something like countl_zero(MinMask) + countr_zero(MaxMask) >= 32.

konradkusiak97 retitled this revision from "[AMD] ]Add sanity check that fixes bad shift operation in AMD backend" to "[AMDGPU] Add sanity check that fixes bad shift operation in AMD backend". · Jul 12 2023, 2:29 AM

Herald added subscribers: tpr, dstuttard, yaxunl and 2 others. · Jul 12 2023, 2:29 AM

JonChesterfield added a subscriber: JonChesterfield. · Jul 12 2023, 2:52 AM

JonChesterfield added inline comments.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	Should this be if masks are equal? Not sure sanity check describes the condition, maybe drop the comment

Dropped the comment

konradkusiak97 added inline comments. · Jul 12 2023, 3:23 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	Hm, if the masks are equal, the function returns `false`. I think that's the behaviour the author of this function had in mind. That's based on the fact that there is a `<=` and not a `<` sign in `if ((1u << AllowedBitsForMin) <= MinMask)`, so it really checks whether the masks overlap, and two equal masks overlap fully.

JonChesterfield added inline comments. · Jul 12 2023, 3:31 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	If equal masks imply they can't be combined, that should probably be true for the special case of equal masks that are zero. I haven't looked at the call tree to determine what combining masks means in this context.

foad added inline comments. · Jul 12 2023, 3:46 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	Firstly, there is no real use case for dmask=0. The resulting instructions would be invalid. All we have to do is handle it gracefully and not crash. "If equal masks imply they can't be combined, that should probably be true for the special case of equal masks that are zero." I don't agree. This code is trying to check whether the range of bits set in one mask (from the lowest set bit to the highest including any gaps) overlaps with the range of set bits in the other. By that definition, two equal non-zero masks do overlap but two equal zero masks do not.

JonChesterfield added inline comments. · Jul 12 2023, 4:14 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	It's called dmasksCanBeCombined though. If equal masks return false, and the zero case doesn't matter other than some arithmetic in the compiler, then masks equal zero returning false seems reasonable. Drive by review, not a blocking comment.

foad added inline comments. · Jul 12 2023, 4:18 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	I still disagree. Returning false for all equal masks is no simpler to implement, and less easy to justify.

konradkusiak97 added inline comments. · Jul 12 2023, 4:24 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	I'm also leaning towards returning true when both masks are 0, because you can combine two 0 masks, whereas two equal, non-zero masks can't be trivially combined because they overlap.

JonChesterfield added inline comments. · Jul 12 2023, 4:32 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	We already return false for all other equal masks. I'm proposing if (!MaxMask) return false. I'm running on the heuristic that special cases show up in bug reports and there's no test case in this diff. I don't see zero as special other than we're tripping over it in the compiler. Usually (x64) shift 32 behaves as if masking the value, so the other UB fix is probably &31 in the right place, which would also return false (I think, haven't checked carefully)

Harbormaster completed remote builds in B244729: Diff 539468. · Jul 12 2023, 6:20 AM

Testcase?

clang-format

Harbormaster completed remote builds in B245352: Diff 540361. · Jul 14 2023, 4:48 AM

included all local changes

Harbormaster completed remote builds in B245367: Diff 540384. · Jul 14 2023, 6:24 AM

arsenm added inline comments. · Jul 15 2023, 9:52 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
877	First I'd say countr_zero is just broken if this is UB. Also needs testcase

Updated revision

I updated the fix to return false for this case. I've been trying to create a testcase for this but I'm still trying to figure out exactly how to write it so it triggers this specific case with 0 mask.

In D155051#4569610, @konradkusiak97 wrote:

I updated the fix to return false for this case. I've been trying to create a testcase for this but I'm still trying to figure out exactly how to write it so it triggers this specific case with 0 mask.

You probably just have to write a MIR test

Harbormaster completed remote builds in B251107: Diff 548222. · Aug 8 2023, 11:16 AM

Created a testcase

Herald added a subscriber: wenlei. · Aug 9 2023, 3:49 AM

Harbormaster completed remote builds in B251343: Diff 548553. · Aug 9 2023, 6:06 AM

Fixed the UB behaviour and included a testcase

arsenm accepted this revision. · Aug 9 2023, 10:43 AM

This revision is now accepted and ready to land. · Aug 9 2023, 10:43 AM

Harbormaster completed remote builds in B251400: Diff 548637. · Aug 9 2023, 11:04 AM

Thanks @arsenm. As I don't have commit access, could you land this patch for me? Please use "Konrad Kusiak konrad.kusiak@codeplay" to commit the change.

4fa8a5487e3b1a4b2ce743b0008a912026aa3524

arsenm mentioned this in rG4fa8a5487e3b: [AMDGPU] Add sanity check that fixes bad shift operation in AMD backend. · Aug 11 2023, 12:26 PM

Diff 548637

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

@@ ... @@ if (Idx != -1 &&
         CI.I->getOperand(Idx).getImm() != Paired.I->getOperand(Idx).getImm())
       return false;
   }

   // Check DMask for overlaps.
   unsigned MaxMask = std::max(CI.DMask, Paired.DMask);
   unsigned MinMask = std::min(CI.DMask, Paired.DMask);

+  if (!MaxMask)
+    return false;
+
   unsigned AllowedBitsForMin = llvm::countr_zero(MaxMask);
   if ((1u << AllowedBitsForMin) <= MinMask)
     return false;

   return true;
 }

 static unsigned getBufferFormatWithCompCount(unsigned OldFormat,
                                              unsigned ComponentCount,

llvm/test/CodeGen/AMDGPU/merge-image-load.mir

@@ ... @@ bb.0.entry:
     %3:sgpr_256 = S_LOAD_DWORDX8_IMM %1, 208, 0
     %4:vgpr_32 = COPY %2.sub3
     %5:vreg_128 = BUFFER_LOAD_DWORDX4_OFFSET %2:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable invariant load (s128))
     %6:vgpr_32 = IMAGE_LOAD_V1_V4 %5:vreg_128, %3:sgpr_256, 4, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
     %7:vreg_96 = IMAGE_LOAD_V3_V4 %5:vreg_128, %3:sgpr_256, 7, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
 ...
 ---

+# GFX9-LABEL: name: image_load_dmask_zero_not_merged
+# GFX9: %{{[0-9]+}}:vgpr_32 = IMAGE_LOAD_V1_V4 %5, %3, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
+# GFX9: %{{[0-9]+}}:vreg_96 = IMAGE_LOAD_V3_V4 %5, %3, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
+
+name: image_load_dmask_zero_not_merged
+body: |
+  bb.0.entry:
+    %0:sgpr_64 = COPY $sgpr0_sgpr1
+    %1:sreg_64_xexec = S_LOAD_DWORDX2_IMM %0, 36, 0
+    %2:sgpr_128 = COPY $sgpr96_sgpr97_sgpr98_sgpr99
+    %3:sgpr_256 = S_LOAD_DWORDX8_IMM %1, 208, 0
+    %4:vgpr_32 = COPY %2.sub3
+    %5:vreg_128 = BUFFER_LOAD_DWORDX4_OFFSET %2:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable invariant load (s128))
+    %6:vgpr_32 = IMAGE_LOAD_V1_V4 %5:vreg_128, %3:sgpr_256, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
+    %7:vreg_96 = IMAGE_LOAD_V3_V4 %5:vreg_128, %3:sgpr_256, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
+...
+---
+
 # GFX9-LABEL: name: image_load_dmask_not_disjoint_not_merged
 # GFX9: %{{[0-9]+}}:vgpr_32 = IMAGE_LOAD_V1_V4 %5, %3, 4, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
 # GFX9: %{{[0-9]+}}:vreg_96 = IMAGE_LOAD_V3_V4 %5, %3, 11, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)

 name: image_load_dmask_not_disjoint_not_merged
 body: |
   bb.0.entry:
     %0:sgpr_64 = COPY $sgpr0_sgpr1

llvm/test/CodeGen/AMDGPU/merge-image-sample.mir

@@ ... @@ bb.0.entry:
     %3:sgpr_256 = S_LOAD_DWORDX8_IMM %1, 208, 0
     %4:vgpr_32 = COPY %2.sub3
     %5:vreg_128 = BUFFER_LOAD_DWORDX4_OFFSET %2:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable invariant load (s128))
     %6:vgpr_32 = IMAGE_SAMPLE_L_V1_V4 %5:vreg_128, %3:sgpr_256, %2:sgpr_128, 4, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
     %7:vreg_96 = IMAGE_SAMPLE_L_V3_V4 %5:vreg_128, %3:sgpr_256, %2:sgpr_128, 7, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
 ...
 ---

+# GFX9-LABEL: name: image_sample_l_dmask_zero_not_merged
+# GFX9: %{{[0-9]+}}:vgpr_32 = IMAGE_SAMPLE_L_V1_V4 %5, %3, %2, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
+# GFX9: %{{[0-9]+}}:vreg_96 = IMAGE_SAMPLE_L_V3_V4 %5, %3, %2, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
+
+name: image_sample_l_dmask_zero_not_merged
+body: |
+  bb.0.entry:
+    %0:sgpr_64 = COPY $sgpr0_sgpr1
+    %1:sreg_64_xexec = S_LOAD_DWORDX2_IMM %0, 36, 0
+    %2:sgpr_128 = COPY $sgpr96_sgpr97_sgpr98_sgpr99
+    %3:sgpr_256 = S_LOAD_DWORDX8_IMM %1, 208, 0
+    %4:vgpr_32 = COPY %2.sub3
+    %5:vreg_128 = BUFFER_LOAD_DWORDX4_OFFSET %2:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable invariant load (s128))
+    %6:vgpr_32 = IMAGE_SAMPLE_L_V1_V4 %5:vreg_128, %3:sgpr_256, %2:sgpr_128, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
+    %7:vreg_96 = IMAGE_SAMPLE_L_V3_V4 %5:vreg_128, %3:sgpr_256, %2:sgpr_128, 0, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)
+...
+---
+
 # GFX9-LABEL: name: image_sample_l_dmask_not_disjoint_not_merged
 # GFX9: %{{[0-9]+}}:vgpr_32 = IMAGE_SAMPLE_L_V1_V4 %5, %3, %2, 4, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s32), addrspace 4)
 # GFX9: %{{[0-9]+}}:vreg_96 = IMAGE_SAMPLE_L_V3_V4 %5, %3, %2, 11, 0, 0, 0, 0, 0, -1, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 4)

 name: image_sample_l_dmask_not_disjoint_not_merged
 body: |
   bb.0.entry:
     %0:sgpr_64 = COPY $sgpr0_sgpr1

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add sanity check that fixes bad shift operation in AMD backend
Closed · Public

