This is an archive of the discontinued LLVM Phabricator instance.

[MCA] [AMDGPU] Adding CustomBehaviour implementation for AMDGPU.
Closed · Public

Authored by holland11 on Jun 22 2021, 11:20 AM.

Details

Summary

TL;DR: The s_waitcnt instructions currently do not behave correctly within llvm-mca, because the scheduling model itself does not express what the correct behaviour is. This patch uses the CustomBehaviour class within mca to enforce that behaviour so that mca can simulate the instructions properly. The patch also makes some slight modifications to the AMDGPU scheduling model, but the side effects of these changes should be isolated to mca (the RetireOOO flag that is added to the s_waitcnt instructions is specific to mca).

This patch is related to https://reviews.llvm.org/D104149. In that patch, I ended up pushing an empty implementation for the AMDGPUCustomBehaviour class. @foad then fixed some of the instruction flags and I was able to re-implement the waitcnt logic to be much more similar to SIInsertWaitcnts::updateEventWaitcntAfter().

Some imperfections within this implementation that I'd like to point out:

  • AFAIU, most waitcnt-related instructions only increment and decrement the relevant CNTs by 1, but a handful of instructions increment and decrement by more than 1. I do not understand these cases well enough, so the implementation treats every instruction as if it increments/decrements by only 1. Someone with a stronger understanding is welcome to make this more accurate.
  • AMDGPUCustomBehaviour::generateWaitCntInfo() is the function that determines whether an instruction interacts with any of the CNTs and, if so, which ones. The logic is very similar to SIInsertWaitcnts::updateEventWaitcntAfter(). However, that pass makes some function calls that I was unable to recreate, so for now I'm just making conservative assumptions. Refer to the comments within generateWaitCntInfo() for an explanation. Two of the functions that I couldn't recreate are related to memory operands, and I'm not entirely sure whether it's possible to recreate them using MCInst rather than the MachineInstr used in the pass this is all based on. The third function that I was unable to recreate is just used to determine how old the specific subtarget is. There is probably a way to recreate this functionality, but I'm not familiar enough with the different subtargets, so I didn't want to mess around with it.

Here is the relevant comment for the last two sentences above:

// This should be:
// if (GCNTarget.vmemWriteNeedsExpWaitcnt() &&
//    (MCID.mayStore() || (MCID.TSFlags & SIInstrFlags::IsAtomicRet)))
// where the GCNTarget::vmemWriteNeedsExpWaitcnt() function is
// { return getGeneration() < SEA_ISLANDS; }
// But I'm not sure how to get the subtarget's generation from here.
// For now, conservatively assume that this is true, but maybe an
// AMDGPU dev can suggest a solution.

So if anyone knows how I can recreate the { return getGeneration() < SEA_ISLANDS; } behaviour using either an AMDGPU::IsaVersion or an MCSubtargetInfo, I'd be happy to update this diff before it gets committed, or you can just patch it in the future.
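
One possible way to approximate that check is sketched below, using a stand-in for AMDGPU::IsaVersion (which carries Major, Minor and Stepping fields). It assumes that Southern Islands targets report a major version of 6 and Sea Islands targets report 7, so that getGeneration() < SEA_ISLANDS reduces to Major < 7. IsaVersionSketch and vmemWriteNeedsExpWaitcntSketch are hypothetical names used only for illustration; this is not code from the patch, and an AMDGPU dev may know a better way.

```cpp
// Hypothetical stand-in for AMDGPU::IsaVersion; only Major is consulted.
struct IsaVersionSketch {
  unsigned Major;
  unsigned Minor;
  unsigned Stepping;
};

// Assumption: Southern Islands reports Major == 6 and Sea Islands reports
// Major == 7, so "generation < SEA_ISLANDS" becomes "Major < 7".
bool vmemWriteNeedsExpWaitcntSketch(const IsaVersionSketch &Version) {
  return Version.Major < 7;
}
```

If that assumption holds, the conservative "always true" fallback would only need to remain for targets whose version cannot be determined.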

Here's a summary of what this implementation is doing:

  • The AMDGPUInstrPostProcess class is used during the lowering from MCInst to mca::Instruction. mca::Instruction objects do not carry any information about immediate operands, since these aren't normally relevant for hazard checking in mca. The s_waitcnt instructions, however, keep important information in an immediate operand, so we use the InstrPostProcess class to store those operands for any s_waitcnt instructions within the source.
  • When the AMDGPUCustomBehaviour class is constructed (shortly before the pipeline is created within mca), we call the generateWaitCntInfo() method. This method iterates over each of the mca::Instruction objects to determine (and store for later use) which instructions interact with which CNTs (vmcnt, expcnt, lgkmcnt, vscnt). This could also be done on the fly, but since mca generally simulates the source for multiple iterations, it's more efficient to run this logic only once for each instruction. A debugger can be used to pause after this method has run, and you can then inspect the InstrWaitCntInfo vector to check whether each instruction is associated with the correct CNTs.
  • During the pipeline simulation, whenever an s_waitcnt instruction is encountered, we check each of the instructions that are currently executing within the pipeline, and using the information stored within InstrWaitCntInfo, we determine if this s_waitcnt instruction should force a stall or not.
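
The stall check in the last step can be sketched roughly as follows. This is a simplified, self-contained illustration rather than the code in the patch: WaitCntInfoSketch, WaitCntThresholdsSketch and mustStallSketch are hypothetical names, every in-flight instruction is assumed to count as exactly 1 toward each counter it touches, and a threshold of -1 is an assumed convention meaning that the s_waitcnt does not name that counter.

```cpp
#include <vector>

// Which counters an in-flight instruction contributes to (a hypothetical
// mirror of one InstrWaitCntInfo entry).
struct WaitCntInfoSketch {
  bool VmCnt = false;
  bool ExpCnt = false;
  bool LgkmCnt = false;
  bool VsCnt = false;
};

// Limits taken from an s_waitcnt immediate; -1 means "not waited on"
// (an assumed convention for this sketch).
struct WaitCntThresholdsSketch {
  int VmCnt = -1;
  int ExpCnt = -1;
  int LgkmCnt = -1;
  int VsCnt = -1;
};

// Returns true when the s_waitcnt has to stall: at least one counter it
// names has more outstanding instructions than its limit allows.
bool mustStallSketch(const std::vector<WaitCntInfoSketch> &Executing,
                     const WaitCntThresholdsSketch &Wait) {
  int Vm = 0, Exp = 0, Lgkm = 0, Vs = 0;
  for (const WaitCntInfoSketch &Info : Executing) {
    // Each instruction counts as 1 per counter it touches.
    Vm += Info.VmCnt;
    Exp += Info.ExpCnt;
    Lgkm += Info.LgkmCnt;
    Vs += Info.VsCnt;
  }
  auto Exceeds = [](int Outstanding, int Limit) {
    return Limit >= 0 && Outstanding > Limit;
  };
  return Exceeds(Vm, Wait.VmCnt) || Exceeds(Exp, Wait.ExpCnt) ||
         Exceeds(Lgkm, Wait.LgkmCnt) || Exceeds(Vs, Wait.VsCnt);
}
```

With twelve ds_read instructions still executing, an s_waitcnt lgkmcnt(10) would stall under this model until two of them have finished.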

And a few more issues worth noting:

  • I'm not sure how the s_waitcnt_depctr instruction works. I couldn't find any information within the ISAs, so I just didn't implement anything for this instruction.
  • I believe the s_endpgm instruction also behaves as if there is an implicit s_waitcnt 0 instruction before it, but I did not implement this behaviour.
  • Before this patch, the s_waitcnt instructions all used the WriteSALU scheduling class. To give the s_waitcnt instructions the RetireOOO flag, I created a new scheduling class WriteSALUOOO that should behave exactly like WriteSALU and should only be associated with the different s_waitcnt instructions. For some models, these classes have a latency of 2, and for others they have a latency of 1. The WriteSALUOOO class also uses the same hardware units as WriteSALU. Someone who knows more about the low level behaviour of s_waitcnt may be able to fine tune this a bit better (with respect to latency and hardware units).
  • If AMDGPU would like to use llvm-mca in the future, I think one area to focus on in particular would be with respect to the RetireOOO flag. This flag allows instructions to finish out of order. Without the flag, mca enforces in-order writes which messes with potential latency hiding. It sounds like the decision of whether or not any given instruction can finish out of order depends on a lot of factors so it's not something I tried to take into account, but just looking at some examples, it seems like there could be some significant differences between how code will actually run on the GPU vs how mca predicts it will run (because of the in-order write enforcement).
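
To make the effect of the in-order write enforcement more concrete, here is a toy model of it. It is a sketch under simplifying assumptions and not mca's actual logic: the pipeline issues at most one instruction per cycle, an instruction without RetireOOO may not write back earlier than any previously issued instruction that also lacks the flag, and InstrSketch and issueCyclesSketch are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

struct InstrSketch {
  unsigned Latency; // cycles from issue to write-back
  bool RetireOOO;   // exempt from the in-order write-back check
};

// Returns the issue cycle of each instruction in a single-issue, in-order
// pipeline that enforces in-order write-back unless RetireOOO is set.
std::vector<unsigned> issueCyclesSketch(const std::vector<InstrSketch> &Code) {
  std::vector<unsigned> Issue;
  unsigned NextFree = 0;      // earliest cycle the next instruction may issue
  unsigned LastWriteBack = 0; // latest write-back among in-order instructions
  for (const InstrSketch &I : Code) {
    unsigned Cycle = NextFree;
    // Delay the issue so that the write-back does not overtake older ones.
    if (!I.RetireOOO && Cycle + I.Latency < LastWriteBack)
      Cycle = LastWriteBack - I.Latency;
    Issue.push_back(Cycle);
    if (!I.RetireOOO)
      LastWriteBack = std::max(LastWriteBack, Cycle + I.Latency);
    NextFree = Cycle + 1;
  }
  return Issue;
}
```

In this model, a 1-cycle move that follows an 80-cycle load is held back until cycle 79 when neither has the flag, but issues at cycle 1 when the move is marked RetireOOO. That is the kind of lost latency hiding visible in the flat_load example further down.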

Thank you for your time. If you have any questions or suggestions, I'd be happy to discuss further. If you want me to make any modifications, I'm happy to do so (as long as they are explained well enough for me to actually make the changes myself). If you want to make any modifications to this class in the future, I encourage you to do so.

Here are some examples that I took from some of the AMDGPU CodeGen test files:

Without patch:

Timeline view:
                    012345678
Index     0123456789         

[0,0]     DE   .    .    .  .   s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1]     .DE  .    .    .  .   v_add_i32_e32 v1, vcc, 8, v0
[0,2]     .  DE.    .    .  .   s_mov_b32 m0, -1
[0,3]     .   DeeeeE.    .  .   ds_read_b32 v2, v1
[0,4]     .    DeeeeE    .  .   ds_read_b64 v[0:1], v0
[0,5]     .    .   DE    .  .   s_waitcnt lgkmcnt(0)
[0,6]     .    .    DeeeeeeeE   s_setpc_b64 s[30:31]

With patch (not much difference here because the in-order write enforcement 'accidentally' causes an appropriate wait):

Timeline view:
                    0123456789
Index     0123456789          

[0,0]     DE   .    .    .   .   s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1]     .DE  .    .    .   .   v_add_i32_e32 v1, vcc, 8, v0
[0,2]     .  DE.    .    .   .   s_mov_b32 m0, -1
[0,3]     .   DeeeeE.    .   .   ds_read_b32 v2, v1
[0,4]     .    DeeeeE    .   .   ds_read_b64 v[0:1], v0
[0,5]     .    .    DE   .   .   s_waitcnt lgkmcnt(0)
[0,6]     .    .    .DeeeeeeeE   s_setpc_b64 s[30:31]

Without patch:

Timeline view:
                    0123456789          
Index     0123456789          0123456789

[0,0]     DE   .    .    .    .    .   .   s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1]     .DeeeeE   .    .    .    .   .   ds_read_u8 v1, v0
[0,2]     . DeeeeE  .    .    .    .   .   ds_read_u8 v2, v0 offset:1
[0,3]     .  DeeeeE .    .    .    .   .   ds_read_u8 v3, v0 offset:2
[0,4]     .   DeeeeE.    .    .    .   .   ds_read_u8 v4, v0 offset:3
[0,5]     .    DeeeeE    .    .    .   .   ds_read_u8 v5, v0 offset:4
[0,6]     .    .DeeeeE   .    .    .   .   ds_read_u8 v6, v0 offset:5
[0,7]     .    . DeeeeE  .    .    .   .   ds_read_u8 v7, v0 offset:6
[0,8]     .    .  DeeeeE .    .    .   .   ds_read_u8 v8, v0 offset:7
[0,9]     .    .   DeeeeE.    .    .   .   ds_read_u8 v9, v0 offset:8
[0,10]    .    .    DeeeeE    .    .   .   ds_read_u8 v10, v0 offset:9
[0,11]    .    .    .DeeeeE   .    .   .   ds_read_u8 v11, v0 offset:10
[0,12]    .    .    . DeeeeE  .    .   .   ds_read_u8 v12, v0 offset:11
[0,13]    .    .    .    .DE  .    .   .   s_waitcnt lgkmcnt(10)
[0,14]    .    .    .    . DE .    .   .   v_lshl_or_b32 v0, v2, 8, v1
[0,15]    .    .    .    .  DE.    .   .   s_waitcnt lgkmcnt(8)
[0,16]    .    .    .    .   DE    .   .   v_lshl_or_b32 v1, v4, 8, v3
[0,17]    .    .    .    .    DE   .   .   v_lshl_or_b32 v0, v1, 16, v0
[0,18]    .    .    .    .    .DE  .   .   s_waitcnt lgkmcnt(6)
[0,19]    .    .    .    .    . DE .   .   v_lshl_or_b32 v1, v6, 8, v5
[0,20]    .    .    .    .    .  DE.   .   s_waitcnt lgkmcnt(4)
[0,21]    .    .    .    .    .   DE   .   v_lshl_or_b32 v2, v8, 8, v7
[0,22]    .    .    .    .    .    DE  .   v_lshl_or_b32 v1, v2, 16, v1
[0,23]    .    .    .    .    .    .DE .   s_waitcnt lgkmcnt(2)
[0,24]    .    .    .    .    .    . DE.   v_lshl_or_b32 v2, v10, 8, v9
[0,25]    .    .    .    .    .    .  DE   s_waitcnt lgkmcnt(0)

With patch (first revision that has the RetireOOO flag on only s_waitcnt):

Timeline view:
                    0123456789         
Index     0123456789          012345678

[0,0]     DE   .    .    .    .    .  .   s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1]     .DeeeeE   .    .    .    .  .   ds_read_u8 v1, v0
[0,2]     . DeeeeE  .    .    .    .  .   ds_read_u8 v2, v0 offset:1
[0,3]     .  DeeeeE .    .    .    .  .   ds_read_u8 v3, v0 offset:2
[0,4]     .   DeeeeE.    .    .    .  .   ds_read_u8 v4, v0 offset:3
[0,5]     .    DeeeeE    .    .    .  .   ds_read_u8 v5, v0 offset:4
[0,6]     .    .DeeeeE   .    .    .  .   ds_read_u8 v6, v0 offset:5
[0,7]     .    . DeeeeE  .    .    .  .   ds_read_u8 v7, v0 offset:6
[0,8]     .    .  DeeeeE .    .    .  .   ds_read_u8 v8, v0 offset:7
[0,9]     .    .   DeeeeE.    .    .  .   ds_read_u8 v9, v0 offset:8
[0,10]    .    .    DeeeeE    .    .  .   ds_read_u8 v10, v0 offset:9
[0,11]    .    .    .DeeeeE   .    .  .   ds_read_u8 v11, v0 offset:10
[0,12]    .    .    . DeeeeE  .    .  .   ds_read_u8 v12, v0 offset:11
[0,13]    .    .    .  DE.    .    .  .   s_waitcnt lgkmcnt(10)
[0,14]    .    .    .    .DE  .    .  .   v_lshl_or_b32 v0, v2, 8, v1
[0,15]    .    .    .    . DE .    .  .   s_waitcnt lgkmcnt(8)
[0,16]    .    .    .    .  DE.    .  .   v_lshl_or_b32 v1, v4, 8, v3
[0,17]    .    .    .    .   DE    .  .   v_lshl_or_b32 v0, v1, 16, v0
[0,18]    .    .    .    .    DE   .  .   s_waitcnt lgkmcnt(6)
[0,19]    .    .    .    .    .DE  .  .   v_lshl_or_b32 v1, v6, 8, v5
[0,20]    .    .    .    .    . DE .  .   s_waitcnt lgkmcnt(4)
[0,21]    .    .    .    .    .  DE.  .   v_lshl_or_b32 v2, v8, 8, v7
[0,22]    .    .    .    .    .   DE  .   v_lshl_or_b32 v1, v2, 16, v1
[0,23]    .    .    .    .    .    DE .   s_waitcnt lgkmcnt(2)
[0,24]    .    .    .    .    .    .DE.   v_lshl_or_b32 v2, v10, 8, v9
[0,25]    .    .    .    .    .    . DE   s_waitcnt lgkmcnt(0)

With patch (RetireOOO flag set globally):

Timeline view:
                    0123456789       
Index     0123456789          0123456

[0,0]     DE   .    .    .    .    ..   s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1]     .DeeeeE   .    .    .    ..   ds_read_u8 v1, v0
[0,2]     . DeeeeE  .    .    .    ..   ds_read_u8 v2, v0 offset:1
[0,3]     .  DeeeeE .    .    .    ..   ds_read_u8 v3, v0 offset:2
[0,4]     .   DeeeeE.    .    .    ..   ds_read_u8 v4, v0 offset:3
[0,5]     .    DeeeeE    .    .    ..   ds_read_u8 v5, v0 offset:4
[0,6]     .    .DeeeeE   .    .    ..   ds_read_u8 v6, v0 offset:5
[0,7]     .    . DeeeeE  .    .    ..   ds_read_u8 v7, v0 offset:6
[0,8]     .    .  DeeeeE .    .    ..   ds_read_u8 v8, v0 offset:7
[0,9]     .    .   DeeeeE.    .    ..   ds_read_u8 v9, v0 offset:8
[0,10]    .    .    DeeeeE    .    ..   ds_read_u8 v10, v0 offset:9
[0,11]    .    .    .DeeeeE   .    ..   ds_read_u8 v11, v0 offset:10
[0,12]    .    .    . DeeeeE  .    ..   ds_read_u8 v12, v0 offset:11
[0,13]    .    .    .  DE.    .    ..   s_waitcnt lgkmcnt(10)
[0,14]    .    .    .   DE    .    ..   v_lshl_or_b32 v0, v2, 8, v1
[0,15]    .    .    .    DE   .    ..   s_waitcnt lgkmcnt(8)
[0,16]    .    .    .    .DE  .    ..   v_lshl_or_b32 v1, v4, 8, v3
[0,17]    .    .    .    . DE .    ..   v_lshl_or_b32 v0, v1, 16, v0
[0,18]    .    .    .    .  DE.    ..   s_waitcnt lgkmcnt(6)
[0,19]    .    .    .    .   DE    ..   v_lshl_or_b32 v1, v6, 8, v5
[0,20]    .    .    .    .    DE   ..   s_waitcnt lgkmcnt(4)
[0,21]    .    .    .    .    .DE  ..   v_lshl_or_b32 v2, v8, 8, v7
[0,22]    .    .    .    .    . DE ..   v_lshl_or_b32 v1, v2, 16, v1
[0,23]    .    .    .    .    .  DE..   s_waitcnt lgkmcnt(2)
[0,24]    .    .    .    .    .   DE.   v_lshl_or_b32 v2, v10, 8, v9
[0,25]    .    .    .    .    .    DE   s_waitcnt lgkmcnt(0)

The following example doesn't make much logical sense, but it demonstrates the difference that the RetireOOO flag can have on latency hiding. In case the line wrapping messes up the formatting of the timelines, I've included screenshots.
Without patch (https://i.imgur.com/pB4dgtW.png):

Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789
Index     0123456789          0123456789          0123456789          0123456789          0123456789          

[0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_load_dwordx2 s[2:3], s[0:1], 0x24
[0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
[0,2]     .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_waitcnt lgkmcnt(0)
[0,3]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v0, s2
[0,4]     .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v1, s3
[0,5]     .    .  DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE .    .   .   flat_load_dword v2, v[0:1]
[0,6]     .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.    .   .   flat_load_dword v3, v[0:1] offset:8
[0,7]     .    .    DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE    .   .   flat_load_dword v4, v[0:1] offset:16
[0,8]     .    .    .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE   .   .   flat_load_dword v5, v[0:1] offset:24
[0,9]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .   .   v_mov_b32_e32 v0, s0
[0,10]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .   .   v_mov_b32_e32 v1, s1
[0,11]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .   .   v_mov_b32_e32 v6, s6
[0,12]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.   .   v_mov_b32_e32 v7, s7
[0,13]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE   .   v_mov_b32_e32 v8, s8
[0,14]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE  .   v_mov_b32_e32 v9, s9
[0,15]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE .   v_mov_b32_e32 v10, s10
[0,16]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE.   v_mov_b32_e32 v11, s11
[0,17]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE   v_mov_b32_e32 v12, s12
[0,18]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v13, s13
[0,19]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v14, s14
[0,20]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v15, s15
[0,21]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v16, s16
[0,22]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v17, s17
[0,23]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v18, s18
[0,24]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v19, s19
[0,25]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v20, s20
[0,26]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v21, s21
[0,27]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v22, s22
[0,28]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v23, s23
[0,29]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v24, s24
[0,30]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v25, s25
[0,31]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v26, s26
[0,32]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v27, s27
[0,33]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v28, s28
[0,34]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v29, s29
[0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_waitcnt vmcnt(0) lgkmcnt(0)

With patch (first revision where RetireOOO is only set for s_waitcnt) (https://i.imgur.com/bmrLwrc.png):

Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789
Index     0123456789          0123456789          0123456789          0123456789          0123456789          

[0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_load_dwordx2 s[2:3], s[0:1], 0x24
[0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
[0,2]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_waitcnt lgkmcnt(0)
[0,3]     .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v0, s2
[0,4]     .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v1, s3
[0,5]     .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.    .   .   flat_load_dword v2, v[0:1]
[0,6]     .    .    DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE    .   .   flat_load_dword v3, v[0:1] offset:8
[0,7]     .    .    .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE   .   .   flat_load_dword v4, v[0:1] offset:16
[0,8]     .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .   .   flat_load_dword v5, v[0:1] offset:24
[0,9]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .   .   v_mov_b32_e32 v0, s0
[0,10]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .   .   v_mov_b32_e32 v1, s1
[0,11]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.   .   v_mov_b32_e32 v6, s6
[0,12]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE   .   v_mov_b32_e32 v7, s7
[0,13]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE  .   v_mov_b32_e32 v8, s8
[0,14]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE .   v_mov_b32_e32 v9, s9
[0,15]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE.   v_mov_b32_e32 v10, s10
[0,16]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE   v_mov_b32_e32 v11, s11
[0,17]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v12, s12
[0,18]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v13, s13
[0,19]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v14, s14
[0,20]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v15, s15
[0,21]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v16, s16
[0,22]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v17, s17
[0,23]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v18, s18
[0,24]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v19, s19
[0,25]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v20, s20
[0,26]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v21, s21
[0,27]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v22, s22
[0,28]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v23, s23
[0,29]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v24, s24
[0,30]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v25, s25
[0,31]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v26, s26
[0,32]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v27, s27
[0,33]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v28, s28
[0,34]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   v_mov_b32_e32 v29, s29
[0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   s_waitcnt vmcnt(0) lgkmcnt(0)

With patch (second revision where RetireOOO is set globally) (https://i.imgur.com/5PSq0vw.png):

Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123
Index     0123456789          0123456789          0123456789          0123456789          0123456789    

[0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_load_dwordx2 s[2:3], s[0:1], 0x24
[0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
[0,2]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_waitcnt lgkmcnt(0)
[0,3]     .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v0, s2
[0,4]     .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v1, s3
[0,5]     .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.  .   flat_load_dword v2, v[0:1]
[0,6]     .    .    DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .   flat_load_dword v3, v[0:1] offset:8
[0,7]     .    .    .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE .   flat_load_dword v4, v[0:1] offset:16
[0,8]     .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.   flat_load_dword v5, v[0:1] offset:24
[0,9]     .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v0, s0
[0,10]    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v1, s1
[0,11]    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v6, s6
[0,12]    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v7, s7
[0,13]    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v8, s8
[0,14]    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v9, s9
[0,15]    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v10, s10
[0,16]    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v11, s11
[0,17]    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v12, s12
[0,18]    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v13, s13
[0,19]    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v14, s14
[0,20]    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v15, s15
[0,21]    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v16, s16
[0,22]    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v17, s17
[0,23]    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v18, s18
[0,24]    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v19, s19
[0,25]    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v20, s20
[0,26]    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v21, s21
[0,27]    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v22, s22
[0,28]    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v23, s23
[0,29]    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v24, s24
[0,30]    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v25, s25
[0,31]    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v26, s26
[0,32]    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v27, s27
[0,33]    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v28, s28
[0,34]    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v29, s29
[0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE   s_waitcnt vmcnt(0) lgkmcnt(0)

Event Timeline

holland11 created this revision. Jun 22 2021, 11:20 AM
holland11 requested review of this revision. Jun 22 2021, 11:20 AM
Herald added a project: Restricted Project. Jun 22 2021, 11:20 AM
Herald added a subscriber: wdng.
holland11 edited the summary of this revision. Jun 22 2021, 5:34 PM
holland11 edited the summary of this revision.
foad added a comment. Jun 23 2021, 2:03 AM

Thanks for doing this! Generally it seems very useful, and my only high level concern is about the way you set RetireOOO only on s_waitcnt instructions. Would it make more sense to set it for all AMDGPU instructions? I'm really not sure what effect this would have. I've never really thought about the concept of "retiring" in our sched model, since it doesn't seem to have any impact on how you should schedule instructions.

To respond to some of your specific points:

Two of the functions that I couldn't recreate are related to memory operands and I'm not entirely sure if it's possible to recreate those functions using MCInst rather than the MachineInstr that are used in the pass this is all based on.

It's not. The memoperands carry higher level semantic information about pointer operands which simply is not (and should not be) available when you're working at the lower MCInst level.

So if anyone knows how I can recreate the { return getGeneration() < SEA_ISLANDS; } behaviour using either an AMDGPU::IsaVersion or an MCSubtargetInfo, I'd be happy to update this diff before it gets committed.

"Sea Islands" is ISA version 7 (don't be misled by the enumerator SEA_ISLANDS having the value 6 :-/ ) so you should be able to implement it as IsaVersion.Major < 7.
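
For illustration only, the suggested check might be sketched as below. The IsaVersion struct here is a hypothetical stand-in with just the fields the check needs (it is not the real llvm::AMDGPU::IsaVersion definition), and isOlderThanSeaIslands is an invented helper name; the only thing taken from the comment above is the comparison Major < 7.

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::AMDGPU::IsaVersion (assumption: only the
// Major field matters for this check).
struct IsaVersion {
  unsigned Major;
  unsigned Minor;
  unsigned Stepping;
};

// Approximates `getGeneration() < SEA_ISLANDS` using the ISA major version,
// relying on the statement above that Sea Islands is ISA version 7.
constexpr bool isOlderThanSeaIslands(const IsaVersion &IV) {
  return IV.Major < 7;
}
```

With this sketch, an ISA version 6 target counts as older than Sea Islands, while versions 7 and up do not.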

I'm not sure how the s_waitcnt_depctr instruction works. I couldn't find any information within the ISAs, so I just didn't implement anything for this instruction.

It tests some other counters, completely unrelated to normal s_waitcnt, so you are right to ignore it.

llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp
25–30

I don't think you'll ever see Pseudos in mca, will you? It should all be Real instructions.

73–78

Likewise, no need to handle Pseudos?

291–294

Hmm, I wonder if you could examine SIInstrFlags::LGKM_CNT instead of having to list these opcodes explicitly?

holland11 added a comment. Edited Jun 23 2021, 10:27 AM

Thanks for doing this! Generally it seems very useful, and my only high level concern is about the way you set RetireOOO only on s_waitcnt instructions. Would it make more sense to set it for all AMDGPU instructions? I'm really not sure what effect this would have. I've never really thought about the concept of "retiring" in our sched model, since it doesn't seem to have any impact on how you should schedule instructions.

Yeah, this is actually what I would personally recommend. MCA is smart enough to detect hardware, register, and most memory dependencies. So even with the RetireOOO flag, instructions still won't be dispatched if there are any hazards. My (very naive) guess would be that you'd get much more accurate simulations with the RetireOOO flag set globally. I'll make this change and produce a couple examples and you can decide if you prefer it.

"Sea Islands" is ISA version 7 (don't be misled by the enumerator SEA_ISLANDS having the value 6 :-/ ) so you should be able to implement it as IsaVersion.Major < 7.

Great! I figured it might be as simple as this. I'll make the change and update the diff.

I don't think you'll ever see Pseudos in mca, will you? It should all be Real instructions.

AFAIU, many of the design decisions within mca are made with the idea that mca could potentially end up being used in a backend pass one day. I can remove the pseudo cases if you'd like, but in the hypothetical future where mca gets used in an AMDGPU backend pass, you'd probably want to add them back in.

Hmm, I wonder if you could examine SIInstrFlags::LGKM_CNT instead of having to list these opcodes explicitly?

I can make that change if you'd like, but those specific opcodes are checked within the SIInsertWaitcnts::updateEventWaitcntAfter() function which all of this is based on.

else {
    switch (Inst.getOpcode()) {
    case AMDGPU::S_SENDMSG:
    case AMDGPU::S_SENDMSGHALT:
      ScoreBrackets->updateByEvent(TII, TRI, MRI, SQ_MESSAGE, Inst);
      break;
    case AMDGPU::S_MEMTIME:
    case AMDGPU::S_MEMREALTIME:
      ScoreBrackets->updateByEvent(TII, TRI, MRI, SMEM_ACCESS, Inst);
      break;
    }
}

The reason for checking these specific opcodes could just be because they give different 'event types' (SQ_MESSAGE vs SMEM_ACCESS). Both of those event types map to lgkmcnt so I have them bundled together, but if you want me to just change it to checking for the LGKM_CNT flag, then I can do that.
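
To make the bundling concrete, here is a rough sketch of the two-step mapping described above: opcode to event type, then event type to counter. All of the enums and function names are hypothetical stand-ins (they are not the backend's real definitions); the only facts taken from the thread are which opcodes produce SQ_MESSAGE versus SMEM_ACCESS and that both event types map to lgkmcnt.

```cpp
#include <cassert>

// Hypothetical stand-ins for the backend's opcodes, event types and counters.
enum class Opcode { S_SENDMSG, S_SENDMSGHALT, S_MEMTIME, S_MEMREALTIME, Other };
enum class EventType { SQ_MESSAGE, SMEM_ACCESS, None };
enum class Counter { LGKM_CNT, None };

// Step 1: the opcode decides the event type (two distinct events here).
EventType eventFor(Opcode Op) {
  switch (Op) {
  case Opcode::S_SENDMSG:
  case Opcode::S_SENDMSGHALT:
    return EventType::SQ_MESSAGE;
  case Opcode::S_MEMTIME:
  case Opcode::S_MEMREALTIME:
    return EventType::SMEM_ACCESS;
  case Opcode::Other:
    return EventType::None;
  }
  return EventType::None;
}

// Step 2: both event types land on the same counter, which is why the four
// opcodes can be bundled together when only the counter matters.
Counter counterFor(EventType E) {
  switch (E) {
  case EventType::SQ_MESSAGE:
  case EventType::SMEM_ACCESS:
    return Counter::LGKM_CNT;
  case EventType::None:
    return Counter::None;
  }
  return Counter::None;
}
```

If only the counter is needed, the intermediate event type carries no extra information, which is the argument for either bundling the opcodes or checking a single flag.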

holland11 edited the summary of this revision. Jun 23 2021, 10:49 AM
holland11 edited the summary of this revision. Jun 23 2021, 11:19 AM
holland11 updated this revision to Diff 354065. Jun 23 2021, 1:24 PM

Set the RetireOOO flag globally rather than just on the s_waitcnt instructions. Also got rid of the WriteSALUOOO sched class since s_waitcnt can go back to being in the WriteSALU class.

These changes caused one of the AMDGPU mca tests to fail since the output is now different. So I updated that test case as well.

Also added the check for ISAVersion.Major < 7 within AMDGPUCustomBehaviour::generateWaitCntInfo().

Added an example to the end of the original post's body to demonstrate the effects that the global RetireOOO flag has.

@foad I implemented some of your suggestions and updated the diff. I did not yet remove the pseudo instruction cases and I did not yet change

else {
    switch (Inst.getOpcode()) {
    case AMDGPU::S_SENDMSG:
    case AMDGPU::S_SENDMSGHALT:
      ScoreBrackets->updateByEvent(TII, TRI, MRI, SQ_MESSAGE, Inst);
      break;
    case AMDGPU::S_MEMTIME:
    case AMDGPU::S_MEMREALTIME:
      ScoreBrackets->updateByEvent(TII, TRI, MRI, SMEM_ACCESS, Inst);
      break;
    }
}

to be just a check for the LGKM_CNT flag. I am happy to make those changes if you want though and then I'll update the diff again. Let me know what you think!

foad added a comment. Jun 24 2021, 2:01 AM

AFAIU, many of the design decisions within mca are made with the idea that mca could potentially end up being used in a backend pass one day. I can remove the pseudo cases if you'd like, but in the hypothetical future where mca gets used in an AMDGPU backend pass, you'd probably want to add them back in.

OK, it's fine to leave the Pseudos in this patch. It's only a pain when we have long lists of opcodes, and hopefully in future most of those can be replaced by checking some flags (possibly new flags that don't exist yet).

Hmm, I wonder if you could examine SIInstrFlags::LGKM_CNT instead of having to list these opcodes explicitly?

I can make that change if you'd like, but those specific opcodes are checked within the SIInsertWaitcnts::updateEventWaitcntAfter() function which all of this is based on.

In that case it's fine to leave it as-is. Any future cleanup can be done in both places.

Added an example to the end of the original post's body to demonstrate the effects that the global RetireOOO flag has.

Thanks. The RetireOOO output looks much better. Independent VALU instructions can definitely execute while there are outstanding loads. It would be great to add a new mca test case that shows this effect directly, if you wouldn't mind.

llvm/lib/Target/AMDGPU/SISchedule.td
140

Nit: it's a shame that this adds 10 lines to the patch instead of just 1. As an alternative...

183

... could you change this to let SchedModel = SIFullSpeedModel, RetireOOO = 1 and the same for all the other let SchedModel = ... lines? Would that still work? I think that would make it slightly easier for the next person who cut'n'pastes one of these schedmodels to create a new one, to get it right.

llvm/test/tools/llvm-mca/AMDGPU/gfx10-double.s
44

I'm surprised that there are no related changes to the tables below. Is this a case where the timeline view would have changed, except that it gets truncated due to some archaic notion of screen width? :-)

llvm/test/tools/llvm-mca/AMDGPU/gfx10-double.s
44

You can pass the timeline-max-cycles flag if you need to show more columns in the timeline view.

-timeline-max-cycles=<cycles>
foad added inline comments. Jun 24 2021, 4:03 AM
llvm/test/tools/llvm-mca/AMDGPU/gfx10-double.s
44

Thanks. It would be nice if there was some way of setting this limit to infinity. I've tried implementing that in D104846.

holland11 added a comment. Edited Jun 24 2021, 10:31 AM

... could you change this to let SchedModel = SIFullSpeedModel, RetireOOO = 1 and the same for all the other let SchedModel = ... lines? Would that still work? I think that would make it slightly easier for the next person who cut'n'pastes one of these schedmodels to create a new one, to get it right.

Unfortunately, this does not build. The RetireOOO flag can only be applied to scheduling classes so its let statement can't include any InstRW expressions.

I understand and agree with your point though. Do you have any other ideas to achieve something similar? I'm a rookie when it comes to tablegen so I can't really think of a better way to do it than I'm doing it right now.

I could maybe format it a bit better. Something like (moving the flag to the top of the block so it's a bit more obvious):

let SchedModel = GFX10SpeedModel in {
let RetireOOO = 1 in { // llvm-mca specific flag

// The latency values are 1 / (operations / cycle).
// Add 1 stall cycle for VGPR read.
def : HWWriteRes<Write32Bit,         [HWVALU, HWRC],   5>;
def : HWWriteRes<WriteFloatCvt,      [HWVALU, HWRC],   5>;
def : HWWriteRes<Write64Bit,         [HWVALU, HWRC],   6>;
def : HWWriteRes<WriteTrans32,       [HWTransVALU, HWRC], 10>;
def : HWWriteRes<WriteQuarterRate32, [HWVALU, HWRC],   8>;
def : HWWriteRes<WriteFloatFMA,      [HWVALU, HWRC],   5>;
def : HWWriteRes<WriteDouble,        [HWVALU, HWRC],   22>;
def : HWWriteRes<WriteDoubleAdd,     [HWVALU, HWRC],   22>;
def : HWWriteRes<WriteDoubleCvt,     [HWVALU, HWRC],   22>;
def : HWWriteRes<WriteIntMul,        [HWVALU, HWRC],   8>;
def : HWWriteRes<WriteTrans64,       [HWVALU, HWTransVALU, HWRC], 24>;

def : HWWriteRes<WriteBranch,        [HWBranch],       32>;
def : HWWriteRes<WriteExport,        [HWExport, HWRC], 16>;
def : HWWriteRes<WriteLDS,           [HWLGKM,   HWRC], 20>;
def : HWWriteRes<WriteSALU,          [HWSALU,   HWRC], 2>;
def : HWWriteRes<WriteSMEM,          [HWLGKM,   HWRC], 20>;
def : HWWriteRes<WriteVMEM,          [HWVMEM,   HWRC], 320>;
def : HWWriteRes<WriteBarrier,       [HWBranch],       2000>;
} // End RetireOOO = 1 (Can't be applied to InstRW expressions)

def : InstRW<[WriteCopy], (instrs COPY)>;

}  // End SchedModel = GFX10SpeedModel

Although, I can't do this in general because, for example:

let SchedModel = SIDPFullSpeedModel in {

defm : SICommonWriteRes;

let RetireOOO = 1 in { // llvm-mca specific flag
def : HWVALUWriteRes<WriteFloatFMA,    1>;
def : HWVALUWriteRes<WriteDouble,      1>;
def : HWVALUWriteRes<WriteDoubleAdd,   1>;
def : HWVALUWriteRes<WriteDoubleCvt,   1>;
def : HWVALUWriteRes<WriteTrans64,     4>;
def : HWVALUWriteRes<WriteIntMul,      1>;
def : HWVALUWriteRes<Write64Bit,       1>;
} // End RetireOOO = 1

def : InstRW<[WriteCopy], (instrs COPY)>;
def : InstRW<[Write64Bit], (instregex "^V_ACCVGPR_WRITE_B32_e64$")>;
def : InstRW<[Write2PassMAI,   MIMFMARead], (instregex "^V_MFMA_.32_4X4X")>;
def : InstRW<[Write8PassMAI,   MIMFMARead], (instregex "^V_MFMA_.32_16X16X")>;
def : InstRW<[Write16PassMAI,  MIMFMARead], (instregex "^V_MFMA_.32_32X32X")>;
def : InstRW<[Write4PassDGEMM, MIMFMARead], (instregex "^V_MFMA_.64_4X4X")>;
def : InstRW<[Write8PassDGEMM, MIMFMARead], (instregex "^V_MFMA_.64_16X16X")>;

} // End SchedModel = SIDPFullSpeedModel

In the above block, I can't include the defm : SICommonWriteRes; line within the let RetireOOO block. (So I can't have the let RetireOOO line come immediately after the let SchedModel line in general.) It might be most consistent to just keep it the way it is (with the let RetireOOO block always surrounding all the def : HWVALUWriteRes lines). But I'm open to trying different ideas if you have any.

Thanks. The RetireOOO output looks much better. Independent VALU instructions can definitely execute while there are outstanding loads. It would be great to add a new mca test case that shows this effect directly, if you wouldn't mind.

I will update the current tests with your new timeline-max-cycles change and I will also add 1-2 new tests.

Updated the previous AMDGPU mca tests and added a new one that highlights the new RetireOOO flag being set globally.

foad accepted this revision. Jun 25 2021, 1:57 AM

Unfortunately, this does not build. The RetireOOO flag can only be applied to scheduling classes so its let statement can't include any InstRW expressions.

I understand and agree with your point though. Do you have any other ideas to achieve something similar? I'm a rookie when it comes to tablegen so I can't really think of a better way to do it than I'm doing it right now.

Thanks for trying, and no I don't have any better ideas. I think it's fine the way you already have it in the patch.

Please remove all mention of S_WAITCNT_DEPCTR from the patch and then I think it's good to go, thanks again!

llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp
226

Just curious, have you run clang-format on this whole patch? Cos these lines look rather short.

This revision is now accepted and ready to land. Jun 25 2021, 1:57 AM

Please remove all mention of S_WAITCNT_DEPCTR from the patch and then I think it's good to go, thanks again!

Sounds good. I'll leave a comment specifying that s_waitcnt_depctr is not handled yet, but take it out from everywhere else.

Just curious, have you run clang-format on this whole patch? Cos these lines look rather short.

I have been running it regularly (and just tried again with no effect). I'll squish that block of comments (vertically) and then run it again so that it can force the correct width. I've been trying to 'get ahead' of clang-format (with respect to comment line length), but I guess I shouldn't be trying to do its job myself.

holland11 updated this revision to Diff 354544. Edited Jun 25 2021, 10:28 AM

Removed all lines with s_waitcnt_depctr and added a comment mentioning that we do not currently handle it. Also added a comment explaining why the pseudo instructions are included in the switch statement. Also modified the formatting of one comment block per @foad 's suggestion.

foad added a comment. Jul 6 2021, 10:38 AM

@foad Any updates?

It's already accepted. Are you waiting for anything more from me?

Oh okay, sounds good. I made some changes based on your previous comment so I was just waiting to see if they were good from your perspective. I'll push this tomorrow. Thanks!

foad added a comment. Jul 7 2021, 1:42 AM

Still looks good, thanks! :)

thakis added a subscriber: thakis. Jul 7 2021, 3:07 PM

Looks like this breaks check-llvm: http://45.33.8.238/linux/50601/step_12.txt

Please take a look, and please revert for now if it takes a while to fix.

thakis added a comment. Jul 7 2021, 5:52 PM

Hm, looks like it passes on most bots. It also fails for me locally with the GN build though.

Here's what llvm/utils/update_mca_test_checks.py thinks it should look like: https://pastebin.com/RcKi9ud9

Any idea what could cause this? Maybe a nondeterminism somewhere? How would I debug this?

thakis added a comment. Jul 7 2021, 5:54 PM

Oh I see this needs the cmake goop in https://reviews.llvm.org/D104149 . Looking…

thakis added a comment. Jul 7 2021, 7:01 PM

(That did the trick, e37dbc6e5703c2755d5fb81949eb32f07bc6ebd6 – sorry for the noise!)

@thakis Hey sorry for not responding quicker. Thanks for taking care of that. I actually don't even know what gn is, but I'll be looking into it so that I can understand what was happening here.

It looks like this patch is causing build failures with -DLLVM_BUILD_LLVM_DYLIB=ON and -DLLVM_LINK_LLVM_DYLIB=ON. I think the problem is that it's trying to access AMDGPU symbols from libLLVM-13.so which are not exported in the shared library build. If you need to use these symbols, then I think the only solution is to pass DISABLE_LLVM_LINK_LLVM_DYLIB to add_llvm_library() in /llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt

Here is the full failure log:
/usr/bin/g++ -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O2 -g -DNDEBUG -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -Wl,-rpath-link,/builddir/build/BUILD/llvm-13.0.0.src/x86_64-redhat-linux-gnu/./lib64 -Wl,-O3 -Wl,--gc-sections tools/llvm-mca/CMakeFiles/llvm-mca.dir/llvm-mca.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegion.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegionGenerator.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/PipelinePrinter.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/BottleneckAnalysis.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/DispatchStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionInfoView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RegisterFileStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/ResourcePressureView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RetireControlUnitStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SchedulerStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SummaryView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/TimelineView.cpp.o 
tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/View.cpp.o -o bin/llvm-mca -lpthread lib64/libLLVMMCACustomBehaviourAMDGPU.a lib64/libLLVM-13.so && :
/usr/bin/ld: lib64/libLLVMMCACustomBehaviourAMDGPU.a(AMDGPUCustomBehaviour.cpp.o): in function `llvm::mca::AMDGPUCustomBehaviour::computeWaitCnt(llvm::mca::InstRef const&, unsigned int&, unsigned int&, unsigned int&, unsigned int&)':
/builddir/build/BUILD/llvm-13.0.0.src/x86_64-redhat-linux-gnu/../tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp:222: undefined reference to `llvm::AMDGPU::decodeWaitcnt(llvm::AMDGPU::IsaVersion const&, unsigned int, unsigned int&, unsigned int&, unsigned int&)'
/usr/bin/ld: lib64/libLLVMMCACustomBehaviourAMDGPU.a(AMDGPUCustomBehaviour.cpp.o): in function `llvm::mca::AMDGPUCustomBehaviour::hasModifiersSet(std::unique_ptr<llvm::mca::Instruction, std::default_delete<llvm::mca::Instruction> > const&, unsigned int) const':
/builddir/build/BUILD/llvm-13.0.0.src/x86_64-redhat-linux-gnu/../tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp:308: undefined reference to `llvm::AMDGPU::getNamedOperandIdx(unsigned short, unsigned short)'
/usr/bin/ld: lib64/libLLVMMCACustomBehaviourAMDGPU.a(AMDGPUCustomBehaviour.cpp.o): in function `llvm::mca::AMDGPUCustomBehaviour::generateWaitCntInfo()':
/builddir/build/BUILD/llvm-13.0.0.src/x86_64-redhat-linux-gnu/../tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp:263: undefined reference to `llvm::AMDGPU::getMUBUFIsBufferInv(unsigned int)'
collect2: error: ld returned 1 exit status

holland11 added a comment. Edited Jul 7 2021, 8:43 PM

@tstellar Is it possible that it's the same issue seen in https://reviews.llvm.org/D104401 ? In which case, I'd just need to move destructors into the .cpp files rather than the .h files?

Although your error message references an undefined reference to llvm::AMDGPU::decodeWaitcnt() which is not part of the patch so maybe it's a different issue and I need to resolve it with the cmake suggestion you mentioned.

I will revert the patch for now.

@holland11 I think it's a different issue.

@tstellar Is it possible that it's the same issue seen in https://reviews.llvm.org/D104401 ? In which case, I'd just need to move destructors into the .cpp files rather than the .h files?

I don't think it's the same issue. I've seen similar problems in some of the unit tests that reference non-public symbols from the backends, so I think the CMake fix is the only solution.

@tstellar I have been unable to reproduce the error on my machine. I checked out my repo to this commit and tried building with

cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON ../../llvm
ninja

But it built without errors. I then tried the same thing, but with -DLLVM_TARGETS_TO_BUILD="X86" to see if it was related to building without AMDGPU, but that also built without issue.

I would like to reproduce the error so that when I attempt your fix, I can verify that it actually worked. Do you have any suggestions for how I can reproduce the problem? I've never built with shared libs before so I'm not sure if my build commands are correct. Also, I'm on macOS if that makes any difference.

@holland11 You won't be able to reproduce it on Mac OS, because we don't use -fvisibility=hidden by default there. You'll need to be on Linux to reproduce.

@tstellar Would you be willing to test the fix for me? If this request is inappropriate then I can setup a VM and reproduce the error myself then test the fix, but if you're willing to test it for me, that'd be awesome.

> git checkout af3baf1761bb

To get back to this commit.

Then modify /llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt to be

include_directories(
  ${LLVM_MAIN_SRC_DIR}/lib/Target/AMDGPU
  ${LLVM_BINARY_DIR}/lib/Target/AMDGPU
  )

set(LLVM_LINK_COMPONENTS
  AMDGPU
  Core
  MCA
  Support
  )

add_llvm_library(LLVMMCACustomBehaviourAMDGPU
  DISABLE_LLVM_LINK_LLVM_DYLIB
  
  AMDGPUCustomBehaviour.cpp

  DEPENDS
  AMDGPUCommonTableGen
  )

Then see if the error is gone when building with shared libs.

But again, I'd like to emphasize that if you're busy or this is just an inconsiderate request by me, then I can test it myself.

thakis added a comment. Jul 8 2021, 4:48 PM

@thakis Hey sorry for not responding quicker. Thanks for taking care of that. I actually don't even know what gn is, but I'll be looking into it so that I can understand what was happening here.

gn isn't anything supported and nothing you need to know about. Sorry I didn't realize it was a bug on my end.

I can test, no problem.

I had to add DISABLE_LLVM_LINK_LLVM_DYLIB in two places: in llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt as you did in your comment, and also in the add_llvm_tool call in llvm/tools/llvm-mca/CMakeLists.txt. When I did that, the tests passed.

I had to add DISABLE_LLVM_LINK_LLVM_DYLIB in two places: in llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt as you did in your comment, and also in the add_llvm_tool call in llvm/tools/llvm-mca/CMakeLists.txt. When I did that, the tests passed.

I really appreciate you testing this for me. Thank you.

@andreadb Would this be reasonable to you? If I understand correctly, this would make it so that llvm-mca's executable wouldn't benefit from shared lib builds. The other alternative would be for me to ask the AMDGPU devs if we can export the specific functions that are being used here. As far as I can tell, those are our only two options.

I had to add DISABLE_LLVM_LINK_LLVM_DYLIB in two places: in llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt as you did in your comment, and also in the add_llvm_tool call in llvm/tools/llvm-mca/CMakeLists.txt. When I did that, the tests passed.

I really appreciate you testing this for me. Thank you.

@andreadb Would this be reasonable to you? If I understand correctly, this would make it so that llvm-mca's executable wouldn't benefit from shared lib builds. The other alternative would be for me to ask the AMDGPU devs if we can export the specific functions that are being used here. As far as I can tell, those are our only two options.

If that's the case, then your second option is preferable (i.e. exporting those functions that are required by this patch).
I would consider the other alternative only as a last resort.

This comment was removed by holland11.
holland11 added a comment. Edited Jul 12 2021, 12:37 PM

@tstellar You suggested in your original message to add the DISABLE_LLVM_LINK_LLVM_DYLIB to the AMDGPUCustomBehaviour cmake file. This would be a very reasonable solution since, in theory, it would let most of mca still benefit from shared libs. However, when you attempted the fix, you said that to get it to work, you had to also add that flag to the llvm-mca cmake file as well.

My work laptop is macOS, and my personal desktop is Win10, but Win10 has WSL (Windows Subsystem for Linux) which can be used for quite a bit of linux development without needing a VM. I wasn't sure if I was going to be able to recreate your original error using WSL, but I tried it and I got the error. (I cloned the llvm git repo, checked out this commit, then tried building with shared libs. This gave me the same error that you got.)

I then tried your original suggestion (only changing the AMDGPUCB cmake file) to confirm that it wasn't sufficient by itself. But this build succeeded for me?

This could definitely be related to the fact that I'm building in WSL and not pure linux, but if we can get it to work by only modifying the AMDGPUCB cmake file, that would be ideal.

(Static libs llvm-mca executable is 57MB,
pure shared libs is 2.1MB,
shared libs w/ shared libs disabled on only AMDGPUCB is 3MB,
shared libs w/ shared libs disabled on llvm-mca is 57MB.)

Could you elaborate as to what the issue was when you tried your original fix and why you had to also change the llvm-mca cmake file?

@holland11 Did you run make check? The reason I had to update llvm/tools/llvm-mca/CMakeLists.txt was so that the tests pass (without it, all invocations of llvm-mca failed with the 'multiple registered command line options' error).

As for a fix, you can export symbols by adding the LLVM_EXTERNAL_VISIBILITY to the function definition, but I don't know if exporting random symbols is necessarily a good idea if it's only needed for an internal tool. Is it possible to rewrite this patch to not use the symbols? Or maybe can we put these symbols into their own static library? We've run into this problem before with some unittests that reference target symbols, so having some kind of static library solution for this would be generally useful.
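
As a rough sketch of what exporting a symbol looks like in a build that hides symbols by default: the macro below is a hypothetical stand-in written for this example (it is not LLVM's own LLVM_EXTERNAL_VISIBILITY definition), and getLowBits is an invented helper, not one of the functions named in the link errors. The sketch assumes a GCC- or Clang-compatible compiler for the attribute to have any effect.

```cpp
#include <cassert>

// Hypothetical stand-in for a visibility macro. With GCC/Clang, "default"
// visibility keeps the symbol exported from a shared library even when the
// build otherwise hides symbols.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_EXTERNAL_VISIBILITY __attribute__((visibility("default")))
#else
#define SKETCH_EXTERNAL_VISIBILITY
#endif

namespace sketch {

// Invented backend-style helper: returns the low `Width` bits of `Encoded`.
// Assumes Width is less than 32. The attribute sits on the definition.
SKETCH_EXTERNAL_VISIBILITY unsigned getLowBits(unsigned Encoded,
                                               unsigned Width) {
  return Encoded & ((1u << Width) - 1u);
}

} // namespace sketch
```

The trade-off raised above still applies: each annotated function becomes part of the shared library's exported surface, which is why a separate static library was floated as an alternative.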

@holland11 Did you run make check? The reason I had to update llvm/tools/llvm-mca/CMakeLists.txt was so that the tests pass (without it, all invocations of llvm-mca failed with the 'multiple registered command line options' error).

As for a fix, you can export symbols by adding the LLVM_EXTERNAL_VISIBILITY to the function definition, but I don't know if exporting random symbols is necessarily a good idea if it's only needed for an internal tool. Is it possible to rewrite this patch to not use the symbols? Or maybe can we put these symbols into their own static library? We've run into this problem before with some unittests that reference target symbols, so having some kind of static library solution for this would be generally useful.

Ah yeah, I was just in the process of trying to fix the 'multiple registered command line options' error. I was hoping it was related to WSL, so I was rebuilding within an Ubuntu docker image to see if that fixed it. But I guess that's not going to fix it =/.

I'll have to think about this for a while longer. I appreciate all your help.

We've run into this problem before with some unittests that reference target symbols, so having some kind of static library solution for this would be generally useful.

@tstellar Could you elaborate a little more on this and mention some of the unittests so Quentin and I can look into it?

@holland11 If you grep llvm/unittests for add_llvm_target_unittest, that will give you all the unittests that we have had to use static linking for because they reference non-public target symbols.

This is essentially the same problem that you are running into with this patch, so having some kind of generic solution, so we don't need to statically link all these binaries would be really nice.

@tstellar @andreadb I've been thinking about the problem for a bit now and have been brainstorming different ideas and digging through the code. I just made a post on the llvm discourse to hopefully get some feedback on the issue (and some potential solutions that I've come up with). If you guys are curious or would be willing to take a look, here's the post https://llvm.discourse.group/t/need-help-deciding-how-to-access-backend-functions-from-within-a-tool/3915 .

Hey Patrick,

tl;dr: I prefer point 2.

I have replied to that thread on discourse.

-Andrea

@andreadb @tstellar Appreciate both of your inputs. I have been working on implementing that setup. Am currently in the process of testing it with the linux shared lib build. If it is working, I'll first submit a patch with the CustomBehaviour changes only (this includes additions to multiple cmake config files to get the CustomBehaviour and InstrPostProcess classes initialized and registered). If/once that's approved and pushed, I'll re-submit this patch but with the new directory structure.

holland11 added a comment. Edited Aug 23 2021, 4:04 PM

@andreadb @foad Sorry for coming back to this so late. To recap everything that happened:

  • I posted this patch for review, got some feedback and made some edits to the point that it was approved.
  • Pushed the patch.
  • Got some feedback that the patch breaks llvm when building on linux with shared libs so I reverted the patch.
  • Discussed the issue and ended up resolving it by moving the target specific CustomBehaviour implementations out of /tools/llvm-mca/lib/ and into /lib/Target/<TargetName>/MCA/.
  • In that patch, I moved the currently empty AMDGPUCustomBehaviour implementation into /lib/Target/AMDGPU/MCA/.
  • I then added the implementation from this patch back into that new AMDGPUCB location (in my local repo).
  • During the time between posting this patch and moving target CBs into /lib/Target/, a LoadStoreUnit was added to the in-order pipeline of mca.
  • This LSUnit ended up changing the behaviour of one of the test cases present in this patch. At the time, I didn't know if it was the LSUnit, or something I did, or something else and I wanted to get to the bottom of the change before I pushed the patch. Unfortunately, I had some other things going on at the time that took priority and I didn't want to push the patch when I wasn't sure what was causing the test case to be "wrong".

Fast forward to now. With Andrea's help, I now understand what the issue is and have an idea of how to fix it. I've got some other things that I'd like to take care of first, but I'd prefer to finally push this patch as is, if that's okay with both of you. Then, once I implement a fix for the LSUnit issue, I can update the test case to be more accurate.

For further clarification, here is what the test case used to look like:

# CHECK:      Timeline view:
# CHECK-NEXT:                     0123456789          0123456789          0123456789          0123456789          0123
# CHECK-NEXT: Index     0123456789          0123456789          0123456789          0123456789          0123456789

# CHECK:      [0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_load_dwordx2 s[2:3], s[0:1], 0x24
# CHECK-NEXT: [0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
# CHECK-NEXT: [0,2]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   s_waitcnt lgkmcnt(0)
# CHECK-NEXT: [0,3]     .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v0, s2
# CHECK-NEXT: [0,4]     .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v1, s3
# CHECK-NEXT: [0,5]     .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.  .   flat_load_dword v2, v[0:1]
# CHECK-NEXT: [0,6]     .    .    DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .   flat_load_dword v3, v[0:1] offset:8
# CHECK-NEXT: [0,7]     .    .    .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE .   flat_load_dword v4, v[0:1] offset:16
# CHECK-NEXT: [0,8]     .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.   flat_load_dword v5, v[0:1] offset:24
# CHECK-NEXT: [0,9]     .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v0, s0
# CHECK-NEXT: [0,10]    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v1, s1
# CHECK-NEXT: [0,11]    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v6, s6
# CHECK-NEXT: [0,12]    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v7, s7
# CHECK-NEXT: [0,13]    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v8, s8
# CHECK-NEXT: [0,14]    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v9, s9
# CHECK-NEXT: [0,15]    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v10, s10
# CHECK-NEXT: [0,16]    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v11, s11
# CHECK-NEXT: [0,17]    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v12, s12
# CHECK-NEXT: [0,18]    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v13, s13
# CHECK-NEXT: [0,19]    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v14, s14
# CHECK-NEXT: [0,20]    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v15, s15
# CHECK-NEXT: [0,21]    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v16, s16
# CHECK-NEXT: [0,22]    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v17, s17
# CHECK-NEXT: [0,23]    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v18, s18
# CHECK-NEXT: [0,24]    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v19, s19
# CHECK-NEXT: [0,25]    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v20, s20
# CHECK-NEXT: [0,26]    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v21, s21
# CHECK-NEXT: [0,27]    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v22, s22
# CHECK-NEXT: [0,28]    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v23, s23
# CHECK-NEXT: [0,29]    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v24, s24
# CHECK-NEXT: [0,30]    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v25, s25
# CHECK-NEXT: [0,31]    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v26, s26
# CHECK-NEXT: [0,32]    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v27, s27
# CHECK-NEXT: [0,33]    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v28, s28
# CHECK-NEXT: [0,34]    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .  .   v_mov_b32_e32 v29, s29
# CHECK-NEXT: [0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE   s_waitcnt vmcnt(0) lgkmcnt(0)

And here is what it looks like currently:

# CHECK:      Timeline view:
# CHECK-NEXT:                     0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0
# CHECK-NEXT: Index     0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789

# CHECK:      [0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   s_load_dwordx2 s[2:3], s[0:1], 0x24
# CHECK-NEXT: [0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
# CHECK-NEXT: [0,2]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   s_waitcnt lgkmcnt(0)
# CHECK-NEXT: [0,3]     .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v0, s2
# CHECK-NEXT: [0,4]     .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v1, s3
# CHECK-NEXT: [0,5]     .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   flat_load_dword v2, v[0:1]
# CHECK-NEXT: [0,6]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   flat_load_dword v3, v[0:1] offset:8
# CHECK-NEXT: [0,7]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   flat_load_dword v4, v[0:1] offset:16
# CHECK-NEXT: [0,8]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE.   flat_load_dword v5, v[0:1] offset:24
# CHECK-NEXT: [0,9]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v0, s0
# CHECK-NEXT: [0,10]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v1, s1
# CHECK-NEXT: [0,11]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v6, s6
# CHECK-NEXT: [0,12]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v7, s7
# CHECK-NEXT: [0,13]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v8, s8
# CHECK-NEXT: [0,14]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v9, s9
# CHECK-NEXT: [0,15]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v10, s10
# CHECK-NEXT: [0,16]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v11, s11
# CHECK-NEXT: [0,17]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v12, s12
# CHECK-NEXT: [0,18]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v13, s13
# CHECK-NEXT: [0,19]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v14, s14
# CHECK-NEXT: [0,20]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v15, s15
# CHECK-NEXT: [0,21]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v16, s16
# CHECK-NEXT: [0,22]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v17, s17
# CHECK-NEXT: [0,23]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v18, s18
# CHECK-NEXT: [0,24]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v19, s19
# CHECK-NEXT: [0,25]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v20, s20
# CHECK-NEXT: [0,26]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v21, s21
# CHECK-NEXT: [0,27]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v22, s22
# CHECK-NEXT: [0,28]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v23, s23
# CHECK-NEXT: [0,29]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v24, s24
# CHECK-NEXT: [0,30]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v25, s25
# CHECK-NEXT: [0,31]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v26, s26
# CHECK-NEXT: [0,32]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v27, s27
# CHECK-NEXT: [0,33]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v28, s28
# CHECK-NEXT: [0,34]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .   v_mov_b32_e32 v29, s29
# CHECK-NEXT: [0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE   s_waitcnt vmcnt(0) lgkmcnt(0)

You may need to paste it into a text editor to be able to see it properly, but essentially the flat_load_dword instructions are not being allowed to execute at the same time. This has nothing to do with CustomBehaviour or this patch and while it is likely not accurate, I would still like to push this patch since I've put it off for so long already. And as mentioned, once I can get around to fixing the underlying issue, I'll be able to update this test to be more accurate.
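For anyone who would rather not eyeball the columns, the same comparison can be sketched in a few lines of Python. This is only an illustration and not part of the patch or of llvm-mca: the helper names are made up, and it assumes (based on the rows shown here) that the "[iteration,index]" field is padded to 10 columns, so the character at offset 10 + N of a row corresponds to cycle N.

```python
import re

# Hypothetical helpers, not part of llvm-mca or of this patch.
# Assumption: the "[iteration,index]" field is padded to 10 columns, so the
# character at offset INDEX_WIDTH + N of a row corresponds to cycle N.
INDEX_WIDTH = 10
CHECK_PREFIX = re.compile(r"^#\s*CHECK(?:-NEXT)?:\s+")


def dispatch_and_execute_cycles(row):
    """Return (dispatch_cycle, executed_cycle, instruction) for one timeline row."""
    row = CHECK_PREFIX.sub("", row)        # drop an optional FileCheck prefix
    body = row[INDEX_WIDTH:]               # timeline cells followed by the asm text
    dispatch = body.index("D")             # 'D' marks the dispatch cycle
    executed = body.index("E", dispatch)   # 'E' marks the cycle execution finishes
    instruction = body[executed + 1:].lstrip(" .").rstrip()
    return dispatch, executed, instruction


def windows_overlap(a, b):
    """True if the second instruction is dispatched before the first has finished."""
    return a[0] < b[1] and b[0] < a[1]
```

Running dispatch_and_execute_cycles() on two consecutive flat_load_dword rows from each timeline and passing the results to windows_overlap() gives a quick way to tell whether the loads are in flight together or are being serialized.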

Just want to make sure you guys understand what's going on / why it's taken me so long to end up pushing this patch after fixing the original issue that forced me to revert it in the first place. And also want to make sure you guys are both okay with me pushing it now.

Edit: For extra information, here's the same test case but with CB disabled (showing that the issue has nothing to do with CB or this patch):

Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789        
Index     0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          0123456789          01234567

[0,0]     DeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   s_load_dwordx2 s[2:3], s[0:1], 0x24
[0,1]     .DeeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   s_load_dwordx2 s[0:1], s[0:1], 0x2c
[0,2]     . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   s_waitcnt lgkmcnt(0)
[0,3]     .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v0, s2
[0,4]     .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v1, s3
[0,5]     .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   flat_load_dword v2, v[0:1]
[0,6]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   flat_load_dword v3, v[0:1] offset:8
[0,7]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   flat_load_dword v4, v[0:1] offset:16
[0,8]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE   flat_load_dword v5, v[0:1] offset:24
[0,9]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v0, s0
[0,10]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v1, s1
[0,11]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v6, s6
[0,12]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v7, s7
[0,13]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v8, s8
[0,14]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v9, s9
[0,15]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v10, s10
[0,16]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v11, s11
[0,17]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v12, s12
[0,18]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v13, s13
[0,19]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v14, s14
[0,20]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v15, s15
[0,21]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v16, s16
[0,22]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v17, s17
[0,23]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v18, s18
[0,24]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v19, s19
[0,25]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v20, s20
[0,26]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v21, s21
[0,27]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v22, s22
[0,28]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v23, s23
[0,29]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v24, s24
[0,30]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v25, s25
[0,31]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v26, s26
[0,32]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v27, s27
[0,33]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v28, s28
[0,34]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    . .   v_mov_b32_e32 v29, s29
[0,35]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    . .   s_waitcnt vmcnt(0) lgkmcnt(0)
foad added a comment. Aug 24 2021, 2:20 AM

Thanks for the recap (I've been away too) and yes I'm very happy for you to proceed with this patch.

Same. I am happy for you to commit your patch.

Was this from you @andreadb ? I'm assuming it was, but just want to make sure.

Yes, that was me. I didn't reply through phab.
Sorry for the noise!