This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix crash in SILoadStoreOptimizer
ClosedPublic

Authored by rampitec on Apr 1 2020, 3:07 PM.


Details

Reviewers

arsenm
tstellar
dstuttard

Commits

rGf2334a7ef255: [AMDGPU] Fix crash in SILoadStoreOptimizer

Summary

SILoadStoreOptimizer::checkAndPrepareMerge() expects the base and
paired instructions to come in order and scans the MBB from the base to
the paired instruction. The original order can change if
there was a dependent instruction in between and the base instruction
was moved.

Fixed by bailing out of the optimization. In theory it might still be
possible to perform the merge by swapping the instructions, but in practice
it bails anyway because it finds a dependency on the same instruction
that caused the base to move.
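
The failure mode and the bail-out described in the summary can be sketched outside of LLVM. The following is a minimal illustration only, not the code from the patch: it uses a std::list of strings as a hypothetical stand-in for a machine basic block, and the names Block and canScanForward are invented for this example.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical stand-in for a basic block: an ordered list of instruction names.
using Block = std::list<std::string>;

// Walks forward from the instruction after Base towards the instruction after
// Paired. Returns true if the walk reaches Paired, and false (bail out) if it
// hits the end of the block first, which means Paired does not follow Base.
bool canScanForward(const Block &BB, Block::const_iterator Base,
                    Block::const_iterator Paired) {
  Block::const_iterator E = std::next(Paired);
  for (Block::const_iterator I = std::next(Base); I != E; ++I) {
    if (I == BB.end())
      return false; // The ordering assumption was wrong; give up on the merge.
  }
  return true;
}
```

Without the end-of-block check, a scan that starts after the paired instruction would run past the end of the list, which is the analogue of the crash being fixed here. With the check, a swapped pair is simply reported as not mergeable.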

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Apr 1 2020, 3:07 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. Apr 1 2020, 3:07 PM

Related to D75741?

In D77245#1955795, @arsenm wrote:

Related to D75741?

Probably it is the same bug. I also think this is a very rare situation which is not worth all the machinery around it, but the crash still needs to be fixed.

LGTM, but an additional MIR test wouldn't hurt

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
893	s/times/time

This revision is now accepted and ready to land. Apr 2 2020, 7:36 AM

My fix has been hanging around for a little while now - so my memory of it is a little hazy.

However, the issue I saw (and attempted to fix) was similar to yours. Agreed that I think this is a rare occurrence, so perhaps your approach makes more sense (and is simpler/cheaper).
One question - the reason I went with the solution I did was that a previous iteration of the loop doing the optimisation was invalidating the iterator stored in a later list - which caused a crash - is your fix working around this by avoiding creating those potentially dependent lists in the first place?? (I wasn't 100% sure about the logic).

I guess one potential advantage with my approach is that you still get the potential optimisations at the expense of having to re-compute the lists for dependent instructions.

In D77245#1957290, @dstuttard wrote:

My fix has been hanging around for a little while now - so my memory of it is a little hazy.

However, the issue I saw (and attempted to fix) was similar to yours. Agreed that I think this is a rare occurrence, so perhaps your approach makes more sense (and is simpler/cheaper).
One question - the reason I went with the solution I did was that a previous iteration of the loop doing the optimisation was invalidating the iterator stored in a later list - which caused a crash - is your fix working around this by avoiding creating those potentially dependent lists in the first place?? (I wasn't 100% sure about the logic).

I guess one potential advantage with my approach is that you still get the potential optimisations at the expense of having to re-compute the lists for dependent instructions.

The current implementation always *sinks* instructions, no matter whether it's a load or a store. That makes the instruction movement inevitable. But if we *lift* loads and sink stores, no instructions (or very few address calculations) need moving, as uses of loads are still preceded by their loads.

In D77245#1957290, @dstuttard wrote:

My fix has been hanging around for a little while now - so my memory of it is a little hazy.

However, the issue I saw (and attempted to fix) was similar to yours. Agreed that I think this is a rare occurrence, so perhaps your approach makes more sense (and is simpler/cheaper).
One question - the reason I went with the solution I did was that a previous iteration of the loop doing the optimisation was invalidating the iterator stored in a later list - which caused a crash - is your fix working around this by avoiding creating those potentially dependent lists in the first place?? (I wasn't 100% sure about the logic).

I guess one potential advantage with my approach is that you still get the potential optimisations at the expense of having to re-compute the lists for dependent instructions.

It sounds like a related but not exactly identical issue. In my case the iterators are valid, but the order is swapped.

In general the idea to collect all merge lists in advance seems problematic. We may do a better job collecting one list at a time. It is also simpler, although slower.
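
The distinction drawn above, between an invalidated iterator and a valid iterator whose relative order has changed, can be illustrated with a plain std::list. This is a sketch under assumptions: std::list stands in for the instruction list of a basic block, and the helpers precedes and moveAfter are hypothetical names introduced for this example, not LLVM functions.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Returns true if A comes strictly before B when walking L from the front.
bool precedes(const std::list<int> &L, std::list<int>::const_iterator A,
              std::list<int>::const_iterator B) {
  for (auto I = L.begin(); I != L.end(); ++I) {
    if (I == A)
      return A != B;
    if (I == B)
      return false;
  }
  return false;
}

// Moves the node at It so that it sits immediately after the node at After.
// std::list::splice relinks nodes without copying, so iterators stay valid.
void moveAfter(std::list<int> &L, std::list<int>::iterator It,
               std::list<int>::iterator After) {
  L.splice(std::next(After), L, It);
}
```

For a list {1, 2, 3} with one iterator on 1 and another on 2, calling moveAfter to place 1 after 3 leaves both iterators dereferenceable, but precedes now reports the opposite order. A merge list that recorded the old order ahead of time would therefore hold a stale ordering hint even though nothing was invalidated.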

Fixed typo in the comment.
Added mir test.

Closed by commit rGf2334a7ef255: [AMDGPU] Fix crash in SILoadStoreOptimizer (authored by rampitec). Apr 2 2020, 10:50 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Apr 2 2020, 10:50 AM

Revision Contents

Path                                                    Size
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp         11 lines
llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.ll     28 lines
llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.mir    23 lines

Diff 254565

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

(Excerpt; lines added by this revision are prefixed with +.)

   if (Swizzled != -1 && CI.I->getOperand(Swizzled).getImm())
     return false;

   DenseSet<unsigned> RegDefsToMove;
   DenseSet<unsigned> PhysRegUsesToMove;
   addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);

   MachineBasicBlock::iterator E = std::next(Paired.I);
   MachineBasicBlock::iterator MBBI = std::next(CI.I);
+  MachineBasicBlock::iterator MBBE = CI.I->getParent()->end();
   for (; MBBI != E; ++MBBI) {

+    if (MBBI == MBBE) {
+      // CombineInfo::Order is a hint on the instruction ordering within the
+      // basic block. This hint suggests that CI precedes Paired, which is
+      // true most of the time. However, moveInstsAfter() processing a
+      // previous list may have changed this order in a situation when it
+      // moves an instruction which exists in some other merge list.
+      // In this case it must be dependent.
+      return false;
+    }

     if ((getInstClass(MBBI->getOpcode(), *TII) != InstClass) ||
         (getInstSubclass(MBBI->getOpcode(), *TII) != InstSubclass)) {
       // This is not a matching instruction, but we can keep looking as
       // long as one of these conditions are met:
       // 1. It is safe to move I down past MBBI.
       // 2. It is safe to move MBBI down past the instruction that I will
       //    be merged into.

llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.ll

This file was added.

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

@L = external local_unnamed_addr addrspace(3) global [9 x double], align 16
@Ldisp = external local_unnamed_addr addrspace(3) global [96 x double], align 16

; Stores are reordered during loads merge. This case used to assert while
; scanning for a paired instruction because it used to expect paired one
; to follow a base one.

; GCN-LABEL: {{^}}out_of_order_merge:
; GCN-COUNT2: ds_read2_b64
; GCN-COUNT3: ds_write_b64
define amdgpu_kernel void @out_of_order_merge() {
entry:
  %gep1 = getelementptr inbounds [96 x double], [96 x double] addrspace(3)* @Ldisp, i32 0, i32 0
  %gep2 = getelementptr inbounds [96 x double], [96 x double] addrspace(3)* @Ldisp, i32 0, i32 1
  %tmp12 = load <2 x double>, <2 x double> addrspace(3)* bitcast (double addrspace(3)* getelementptr inbounds ([9 x double], [9 x double] addrspace(3)* @L, i32 0, i32 1) to <2 x double> addrspace(3)*), align 8
  %tmp14 = extractelement <2 x double> %tmp12, i32 0
  %tmp15 = extractelement <2 x double> %tmp12, i32 1
  %add50.i = fadd double %tmp14, %tmp15
  store double %add50.i, double addrspace(3)* %gep1, align 8
  %tmp16 = load double, double addrspace(3)* getelementptr inbounds ([9 x double], [9 x double] addrspace(3)* @L, i32 1, i32 0), align 8
  store double %tmp16, double addrspace(3)* %gep2, align 8
  %tmp17 = load <2 x double>, <2 x double> addrspace(3)* bitcast (double addrspace(3)* getelementptr inbounds ([9 x double], [9 x double] addrspace(3)* @L, i32 2, i32 1) to <2 x double> addrspace(3)*), align 8
  %tmp19 = extractelement <2 x double> %tmp17, i32 1
  store double %tmp19, double addrspace(3)* undef, align 8
  ret void
}

llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.mir

This file was added.

# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass si-load-store-opt %s -o - | FileCheck -check-prefix=GCN %s

# GCN-LABEL: name: out_of_order_merge
# GCN: DS_READ2_B64_gfx9
# GCN: DS_WRITE_B64_gfx9
# GCN: DS_READ2_B64_gfx9
# GCN: DS_WRITE_B64_gfx9
# GCN: DS_WRITE_B64_gfx9
---
name: out_of_order_merge
body: |
  bb.0:
    %4:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
    %5:vreg_64 = DS_READ_B64_gfx9 %4, 776, 0, implicit $exec :: (load 8 from `double addrspace(3)* undef`, addrspace 3)
    %6:vreg_64 = DS_READ_B64_gfx9 %4, 784, 0, implicit $exec :: (load 8 from `double addrspace(3)* undef` + 8, addrspace 3)
    %17:vreg_64 = DS_READ_B64_gfx9 %4, 840, 0, implicit $exec :: (load 8 from `double addrspace(3)* undef`, addrspace 3)
    DS_WRITE_B64_gfx9 %4, %17, 8, 0, implicit $exec :: (store 8 into `double addrspace(3)* undef` + 8, addrspace 3)
    DS_WRITE_B64_gfx9 %4, %6, 0, 0, implicit $exec :: (store 8 into `double addrspace(3)* undef`, align 16, addrspace 3)
    %24:vreg_64 = DS_READ_B64_gfx9 %4, 928, 0, implicit $exec :: (load 8 from `double addrspace(3)* undef` + 8, addrspace 3)
    DS_WRITE_B64_gfx9 undef %29:vgpr_32, %5, 0, 0, implicit $exec :: (store 8 into `double addrspace(3)* undef`, addrspace 3)
    S_ENDPGM 0

...