This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use instruction order in machine function to process workList of moveToVALU
AbandonedPublic

Authored by skc7 on Feb 5 2023, 2:23 AM.

Details

Summary

Update moveToVALU method to accept worklist of instructions.
Top/first element of worklist will be picked up to process.

Instruction will be moved to VALU or legalizeOperands is called on it before all the instruction prior to it in machine function are processed. legalizeOperands will be able to process the operands together with this change.

This patch is required to legalize rsrc and soffset of MBUF instruction together and create single waterfall loop. Prerequisite for D141030

Diff Detail

Event Timeline

skc7 created this revision.Feb 5 2023, 2:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2023, 2:23 AM
skc7 requested review of this revision.Feb 5 2023, 2:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2023, 2:23 AM
foad added a reviewer: alex-t.Feb 6 2023, 1:51 AM
foad added inline comments.Feb 6 2023, 3:42 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6186

Sorting the whole worklist every time round this loop is not acceptable. You need to find a more efficient way of doing this.

6191

Do you really need to guarantee the order between different basic blocks? If so, surely you need some kind of topological ordering, not just based on BB number. (And what if there are cycles in the CFG?)

skc7 updated this revision to Diff 496075.Feb 9 2023, 3:48 AM
skc7 edited the summary of this revision. (Show Details)
skc7 added a reviewer: foad.

Introduce SIInstrWorklist. It has a std::set with comparison operator to store instructions as per order in machine function. This tries to solve the previous issue with sorting the vector in every iteration.

skc7 added inline comments.Feb 10 2023, 2:12 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6186

Made changes in the current patch to use stl set with comparator for inserting and popping from worklist. This would not require to sort the worklist in every iteration. But the insertion and deletion from set have O(logn) time complexity.

6191

I'm finding it difficult to compare two instructions from different basic blocks other than using BB number. Can we use dominator tree information? May be DFS numbers be used to compare them?

arsenm added inline comments.Feb 10 2023, 4:58 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6191

We used to do this in RPO. Not sure why this was changed.

Would it be easier to check all the operands of the one buffer instruction the first time you encounter it?

skc7 marked an inline comment as not done.Feb 13 2023, 8:36 AM
skc7 added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6191

The first time a buffer instruction is processed in the worklist for legalizeOperands, both rsrc and soffset operands wouldn't have vgprs assigned by the pass. Only one of them would have vgpr. V2S copy result used by the other operand still need to be processed in the worklist.

So, if legalizeOperands has to legalize soffset and rsrc together, V2S copies that occur prior to buffer instruction have to be processed in the worklist first.

arsenm added inline comments.Feb 20 2023, 1:23 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6191

Do they really need to be processed first? If you processed them at once, when those instructions are later processed they should see a copy with no uses which can be deleted

skc7 added inline comments.Feb 21 2023, 4:14 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
6191

In the example below, si-fix-sgpr-copies pass calls moveToVALU for V2S copies in random order. So, BUFFER_LOAD instruction will have VGPRs for its soffset after processing %8:sreg_32 = COPY %5:vgpr_32 in the worklist. Legalize Operands would be called on the BUFFER_LOAD instruction which would only add a waterfall loop for soffset.

In further iterations, when rsrc would be converted to VGPR, then again LegalizeOperands would be called which would then add waterfall loop for it.

LegalizeOperands is called on the BUFFER_LOAD two times which is the cause of two waterfall loops in D141030.

So, the solution I think would work is to process the worklist instructions in order and each instruction will be called with legalizeOperands only once. This would make sure both rsrc and soffset are vgprs together once legalizeOperands is called. Then, D141030 logic can add single waterfall loop for them.

define float @llvm_amdgcn_raw_buffer_load_f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset) {
  %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
  ret float %val
}

Input to si-fix-sgpr-copies pass:

Function Live Ins: $vgpr0 in %0, $vgpr1 in %1, $vgpr2 in %2, $vgpr3 in %3, $vgpr4 in %4, $vgpr5 in %5
bb.0 (%ir-block.0):
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
%5:vgpr_32 = COPY $vgpr5
%4:vgpr_32 = COPY $vgpr4
%3:vgpr_32 = COPY $vgpr3
%2:vgpr_32 = COPY $vgpr2
%1:vgpr_32 = COPY $vgpr1
%0:vgpr_32 = COPY $vgpr0
%9:sgpr_32 = COPY %0:vgpr_32
%10:sgpr_32 = COPY %1:vgpr_32
%11:sgpr_32 = COPY %2:vgpr_32
%12:sgpr_32 = COPY %3:vgpr_32
%6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, %subreg.sub2, %12:sgpr_32, %subreg.sub3
%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 7)
$vgpr0 = COPY %7:vgpr_32
SI_RETURN implicit $vgpr0

++++++++++++++++++++++++++++++++++++++++++++++++

1st iteration
Top of worklist -> %8:sreg_32 = COPY %5:vgpr_32

%6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, 
%subreg.sub2, %12:sgpr_32, %subreg.sub3 
%8:sreg_32 = COPY %5:vgpr_32 
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, implicit $exec

++++++++++++++++++++++++++++++++++++++++++++++++

2nd iteration 
Top of worklist -> %11:sgpr_32 = COPY %2:vgpr_32

%6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, 
%subreg.sub2, %12:sgpr_32, %subreg.sub3 
%8:sreg_32 = IMPLICIT_DEF 
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, implicit $exec
skc7 updated this revision to Diff 501094.Feb 28 2023, 4:57 AM
skc7 set the repository for this revision to rG LLVM Github Monorepo.

Use ReversePostOrderTraversal list to compare machine instructions from different basic blocks.

arsenm added inline comments.Mar 10 2023, 4:31 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
68

It doesn't make sense to me to construct RPO to create a list like this. The whole iteration would just work in RPO

74–76

This should be implied by the iteration order, shouldn't need to sort

skc7 added inline comments.Mar 10 2023, 8:06 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
68

AFAIU, MBB order is created using RPOT list. New MBB might get created by legalizeOperands method once an instruction is processed in the worklist and the MBB order might change. So this list needs to be re-generated.
This list will be used by Cmp::Operator() to compare instructions from two different basic blocks.

74–76

New MachineInstructions might get added to worklist after it gets processed by legalizeOperands. So we need to maintain a sorted order of worklist. So, we need a Cmp::operator() for sort.

arsenm requested changes to this revision.Jun 2 2023, 3:23 PM

Can this be abandoned now?

This revision now requires changes to proceed.Jun 2 2023, 3:23 PM
skc7 abandoned this revision.Jun 5 2023, 8:09 AM
llvm/test/CodeGen/AMDGPU/udivrem.ll