TLDR: llvm-mca does not currently simulate the s_waitcnt instructions correctly, because the scheduling model itself does not express what the proper behaviour is. This patch uses the CustomBehaviour class within mca to enforce that behaviour so that mca can simulate the instructions properly. The patch also makes some slight modifications to the AMDGPU scheduling model, but the side effects of these changes should be isolated to mca (the RetireOOO flag that is added to the s_waitcnt instructions is specific to mca).
This patch is related to https://reviews.llvm.org/D104149. In that patch, I ended up pushing an empty implementation for the AMDGPUCustomBehaviour class. @foad then fixed some of the instruction flags and I was able to re-implement the waitcnt logic to be much more similar to SIInsertWaitcnts::updateEventWaitcntAfter().
Some imperfections within this implementation that I'd like to point out:
- AFAIU, most waitcnt-related instructions increment and decrement the relevant CNTs by only 1, but a handful of instructions increment and decrement them by more than 1. I do not understand this well enough, so the implementation treats every instruction as if it increments/decrements by only 1. Someone with a stronger understanding should feel free to make this more accurate.
- AMDGPUCustomBehaviour::generateWaitCntInfo() is the function that determines whether an instruction interacts with any of the CNTs and, if so, which ones. The logic is very similar to SIInsertWaitcnts::updateEventWaitcntAfter(). However, that pass makes some function calls that I was unable to recreate, so for now I'm just making conservative assumptions. Refer to the comments within generateWaitCntInfo() for an explanation. Two of the functions that I couldn't recreate are related to memory operands, and I'm not entirely sure whether it's possible to recreate them using MCInst rather than the MachineInstr used by the pass that this is all based on. The third function that I was unable to recreate is just used to determine how old the specific subtarget is. There is probably a way to recreate this functionality, but I'm not familiar enough with the different subtargets, so I didn't want to mess around with it.
Here is the relevant comment for the last two sentences above:
// This should be:
// if (GCNTarget.vmemWriteNeedsExpWaitcnt() &&
//     (MCID.mayStore() || (MCID.TSFlags & SIInstrFlags::IsAtomicRet)))
// where the GCNTarget::vmemWriteNeedsExpWaitcnt() function is
// { return getGeneration() < SEA_ISLANDS; }
// But I'm not sure how to get the subtarget's generation from here.
// For now, conservatively assume that this is true, but maybe an
// AMDGPU dev can suggest a solution.
So if anyone knows how I can recreate the { return getGeneration() < SEA_ISLANDS; } behaviour using either an AMDGPU::IsaVersion or an MCSubtargetInfo, I'd be happy to update this diff before it gets committed, or you can just patch it in the future.
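One possible shape for such a check is sketched below as a standalone example. It rests on an assumption that is not confirmed anywhere in this patch: that Southern Islands targets report ISA major version 6 and Sea Islands targets report major version 7, so that "generation < SEA_ISLANDS" reduces to "Major < 7". The IsaVersionLike struct is a hypothetical stand-in that only mirrors the Major/Minor/Stepping fields of AMDGPU::IsaVersion; it is not the real LLVM type, and the free function reuses the name vmemWriteNeedsExpWaitcnt purely for illustration.

```cpp
// Hypothetical stand-in that mirrors the Major/Minor/Stepping fields of
// AMDGPU::IsaVersion. It is NOT the real LLVM type.
struct IsaVersionLike {
  unsigned Major;
  unsigned Minor;
  unsigned Stepping;
};

// Assumption: Southern Islands targets have Major == 6 and Sea Islands
// targets have Major == 7, so "getGeneration() < SEA_ISLANDS" is
// approximated here by "Major < 7".
bool vmemWriteNeedsExpWaitcnt(const IsaVersionLike &Version) {
  return Version.Major < 7;
}
```

If that mapping holds, the conservative "always true" assumption in the comment above could be replaced by a call of this form; an AMDGPU developer would still need to confirm the mapping.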
Here's a summary of what this implementation is doing:
- The AMDGPUInstrPostProcess class is used during the lowering from MCInst to mca::Instruction. mca::Instruction objects do not have any information about immediate operands (since they aren't normally relevant for hazard checking in mca). However, the s_waitcnt instructions carry important information in an immediate operand, so we use the InstrPostProcess class to store those operands for any s_waitcnt instructions within the source.
- When the AMDGPUCustomBehaviour class is constructed (shortly before the pipeline is created within mca), we call the generateWaitCntInfo() method. This method iterates over each of the mca::Instruction objects to determine (and store for later use) which instructions interact with which CNTs (vmcnt, expcnt, lgkmcnt, vscnt). This could also be done on the fly, but since mca generally simulates the source for multiple iterations, it's more efficient to run this logic only once for each instruction. You can use a debugger to pause after this method has run and inspect the InstrWaitCntInfo vector to check whether each instruction is associated with the correct CNTs.
- During the pipeline simulation, whenever an s_waitcnt instruction is encountered, we check each of the instructions that are currently executing within the pipeline, and using the information stored within InstrWaitCntInfo, we determine if this s_waitcnt instruction should force a stall or not.
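The stall decision in the last step can be illustrated with a small standalone model. This is a simplified sketch and not the code in the patch: CounterKind, WaitCntInfoSketch, WaitThresholds and mustStall are hypothetical names, every instruction is assumed to bump a counter by exactly 1 (the simplification described earlier), and an s_waitcnt is assumed to stall whenever more in-flight instructions touch a waited-on counter than its immediate allows.

```cpp
#include <array>
#include <vector>

// The counters that s_waitcnt can wait on, as listed in the summary.
enum CounterKind { VmCnt = 0, ExpCnt, LgkmCnt, VsCnt, NumCounters };

// Hypothetical per-instruction record: which counters the instruction
// increments while it is in flight (simplified: always by exactly 1).
struct WaitCntInfoSketch {
  std::array<bool, NumCounters> Touches{};
};

// Thresholds taken from the s_waitcnt immediate operand. A negative value
// means that this counter is not waited on.
struct WaitThresholds {
  std::array<int, NumCounters> Max = {{-1, -1, -1, -1}};
};

// Returns true if the s_waitcnt has to stall: for at least one waited-on
// counter, more in-flight instructions touch it than the threshold allows.
bool mustStall(const std::vector<WaitCntInfoSketch> &InFlight,
               const WaitThresholds &Wait) {
  std::array<int, NumCounters> Outstanding{};
  for (const WaitCntInfoSketch &Info : InFlight)
    for (int K = 0; K < NumCounters; ++K)
      if (Info.Touches[K])
        ++Outstanding[K];

  for (int K = 0; K < NumCounters; ++K)
    if (Wait.Max[K] >= 0 && Outstanding[K] > Wait.Max[K])
      return true;
  return false;
}
```

With twelve in-flight instructions that touch lgkmcnt, a threshold of 10 would stall in this model, while the same threshold with ten or fewer such instructions in flight would not.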
And a few more issues worth noting:
- I'm not sure how the s_waitcnt_depctr instruction works. I couldn't find any information within the ISAs, so I just didn't implement anything for this instruction.
- I believe the s_endpgm instruction also behaves as if there is an implicit s_waitcnt 0 instruction before it, but I did not implement this behaviour.
- Before this patch, the s_waitcnt instructions all used the WriteSALU scheduling class. To give the s_waitcnt instructions the RetireOOO flag, I created a new scheduling class WriteSALUOOO that should behave exactly like WriteSALU and should only be associated with the different s_waitcnt instructions. For some models, these classes have a latency of 2, and for others they have a latency of 1. The WriteSALUOOO class also uses the same hardware units as WriteSALU. Someone who knows more about the low level behaviour of s_waitcnt may be able to fine tune this a bit better (with respect to latency and hardware units).
- If AMDGPU would like to use llvm-mca in the future, I think one area to focus on in particular would be with respect to the RetireOOO flag. This flag allows instructions to finish out of order. Without the flag, mca enforces in-order writes which messes with potential latency hiding. It sounds like the decision of whether or not any given instruction can finish out of order depends on a lot of factors so it's not something I tried to take into account, but just looking at some examples, it seems like there could be some significant differences between how code will actually run on the GPU vs how mca predicts it will run (because of the in-order write enforcement).
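The effect of in-order write enforcement on latency hiding can be shown with a toy model. The sketch below is only loosely inspired by the behaviour described above and is not mca's implementation: SimInstr and issueCycles are hypothetical names, at most one instruction issues per cycle, and an instruction without the RetireOOO flag is delayed until it can no longer finish before the previous in-order instruction.

```cpp
#include <vector>

// Hypothetical, simplified instruction: a latency plus the RetireOOO flag.
struct SimInstr {
  unsigned Latency;
  bool RetireOOO;
};

// Returns the cycle in which each instruction issues. Instructions issue in
// program order, at most one per cycle. An instruction without RetireOOO may
// not finish before the most recent in-order instruction finishes.
std::vector<unsigned> issueCycles(const std::vector<SimInstr> &Program) {
  std::vector<unsigned> Issued;
  unsigned NextIssue = 0;     // earliest cycle the next instruction can issue
  unsigned LastWriteBack = 0; // finish cycle of the latest in-order instruction
  for (const SimInstr &I : Program) {
    unsigned Issue = NextIssue;
    if (!I.RetireOOO) {
      // Delay the issue so that the write does not overtake an older one.
      if (Issue + I.Latency < LastWriteBack)
        Issue = LastWriteBack - I.Latency;
      LastWriteBack = Issue + I.Latency;
    }
    Issued.push_back(Issue);
    NextIssue = Issue + 1;
  }
  return Issued;
}
```

In this model, an 80-cycle load followed by three 1-cycle moves issues at cycles 0, 79, 80 and 81 when nothing has the flag, and at cycles 0, 1, 2 and 3 when every instruction has it, which is the same kind of gap that appears between the long-load timelines further down.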
Thank you for your time. If you have any questions or suggestions, I'd be happy to discuss further. If you want me to make any modifications, I'm happy to do so (as long as they are explained well enough for me to actually make the changes myself). If you want to make any modifications to this class in the future, I encourage you to do so.
Here are some examples that I took from some of the AMDGPU CodeGen test files:
Without patch:
Timeline view:
012345678
Index 0123456789
[0,0] DE . . . . s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1] .DE . . . . v_add_i32_e32 v1, vcc, 8, v0
[0,2] . DE. . . . s_mov_b32 m0, -1
[0,3] . DeeeeE. . . ds_read_b32 v2, v1
[0,4] . DeeeeE . . . ds_read_b64 v[0:1], v0
[0,5] . . DE . . s_waitcnt lgkmcnt(0)
[0,6] . . DeeeeeeeE s_setpc_b64 s[30:31]
With patch (not much difference here because the in-order write enforcement 'accidentally' causes an appropriate wait):
Timeline view:
0123456789
Index 0123456789
[0,0] DE . . . . s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1] .DE . . . . v_add_i32_e32 v1, vcc, 8, v0
[0,2] . DE. . . . s_mov_b32 m0, -1
[0,3] . DeeeeE. . . ds_read_b32 v2, v1
[0,4] . DeeeeE . . . ds_read_b64 v[0:1], v0
[0,5] . . DE . . s_waitcnt lgkmcnt(0)
[0,6] . . .DeeeeeeeE s_setpc_b64 s[30:31]
Without patch:
Timeline view:
0123456789
Index 0123456789 0123456789
[0,0] DE . . . . . . s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1] .DeeeeE . . . . . ds_read_u8 v1, v0
[0,2] . DeeeeE . . . . . ds_read_u8 v2, v0 offset:1
[0,3] . DeeeeE . . . . . ds_read_u8 v3, v0 offset:2
[0,4] . DeeeeE. . . . . ds_read_u8 v4, v0 offset:3
[0,5] . DeeeeE . . . . ds_read_u8 v5, v0 offset:4
[0,6] . .DeeeeE . . . . ds_read_u8 v6, v0 offset:5
[0,7] . . DeeeeE . . . . ds_read_u8 v7, v0 offset:6
[0,8] . . DeeeeE . . . . ds_read_u8 v8, v0 offset:7
[0,9] . . DeeeeE. . . . ds_read_u8 v9, v0 offset:8
[0,10] . . DeeeeE . . . ds_read_u8 v10, v0 offset:9
[0,11] . . .DeeeeE . . . ds_read_u8 v11, v0 offset:10
[0,12] . . . DeeeeE . . . ds_read_u8 v12, v0 offset:11
[0,13] . . . .DE . . . s_waitcnt lgkmcnt(10)
[0,14] . . . . DE . . . v_lshl_or_b32 v0, v2, 8, v1
[0,15] . . . . DE. . . s_waitcnt lgkmcnt(8)
[0,16] . . . . DE . . v_lshl_or_b32 v1, v4, 8, v3
[0,17] . . . . DE . . v_lshl_or_b32 v0, v1, 16, v0
[0,18] . . . . .DE . . s_waitcnt lgkmcnt(6)
[0,19] . . . . . DE . . v_lshl_or_b32 v1, v6, 8, v5
[0,20] . . . . . DE. . s_waitcnt lgkmcnt(4)
[0,21] . . . . . DE . v_lshl_or_b32 v2, v8, 8, v7
[0,22] . . . . . DE . v_lshl_or_b32 v1, v2, 16, v1
[0,23] . . . . . .DE . s_waitcnt lgkmcnt(2)
[0,24] . . . . . . DE. v_lshl_or_b32 v2, v10, 8, v9
[0,25] . . . . . . DE s_waitcnt lgkmcnt(0)
With patch (first revision that has the RetireOOO flag on only s_waitcnt):
Timeline view:
0123456789
Index 0123456789 012345678
[0,0] DE . . . . . . s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1] .DeeeeE . . . . . ds_read_u8 v1, v0
[0,2] . DeeeeE . . . . . ds_read_u8 v2, v0 offset:1
[0,3] . DeeeeE . . . . . ds_read_u8 v3, v0 offset:2
[0,4] . DeeeeE. . . . . ds_read_u8 v4, v0 offset:3
[0,5] . DeeeeE . . . . ds_read_u8 v5, v0 offset:4
[0,6] . .DeeeeE . . . . ds_read_u8 v6, v0 offset:5
[0,7] . . DeeeeE . . . . ds_read_u8 v7, v0 offset:6
[0,8] . . DeeeeE . . . . ds_read_u8 v8, v0 offset:7
[0,9] . . DeeeeE. . . . ds_read_u8 v9, v0 offset:8
[0,10] . . DeeeeE . . . ds_read_u8 v10, v0 offset:9
[0,11] . . .DeeeeE . . . ds_read_u8 v11, v0 offset:10
[0,12] . . . DeeeeE . . . ds_read_u8 v12, v0 offset:11
[0,13] . . . DE. . . . s_waitcnt lgkmcnt(10)
[0,14] . . . .DE . . . v_lshl_or_b32 v0, v2, 8, v1
[0,15] . . . . DE . . . s_waitcnt lgkmcnt(8)
[0,16] . . . . DE. . . v_lshl_or_b32 v1, v4, 8, v3
[0,17] . . . . DE . . v_lshl_or_b32 v0, v1, 16, v0
[0,18] . . . . DE . . s_waitcnt lgkmcnt(6)
[0,19] . . . . .DE . . v_lshl_or_b32 v1, v6, 8, v5
[0,20] . . . . . DE . . s_waitcnt lgkmcnt(4)
[0,21] . . . . . DE. . v_lshl_or_b32 v2, v8, 8, v7
[0,22] . . . . . DE . v_lshl_or_b32 v1, v2, 16, v1
[0,23] . . . . . DE . s_waitcnt lgkmcnt(2)
[0,24] . . . . . .DE. v_lshl_or_b32 v2, v10, 8, v9
[0,25] . . . . . . DE s_waitcnt lgkmcnt(0)
With patch (RetireOOO flag set globally):
Timeline view:
0123456789
Index 0123456789 0123456
[0,0] DE . . . . .. s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[0,1] .DeeeeE . . . .. ds_read_u8 v1, v0
[0,2] . DeeeeE . . . .. ds_read_u8 v2, v0 offset:1
[0,3] . DeeeeE . . . .. ds_read_u8 v3, v0 offset:2
[0,4] . DeeeeE. . . .. ds_read_u8 v4, v0 offset:3
[0,5] . DeeeeE . . .. ds_read_u8 v5, v0 offset:4
[0,6] . .DeeeeE . . .. ds_read_u8 v6, v0 offset:5
[0,7] . . DeeeeE . . .. ds_read_u8 v7, v0 offset:6
[0,8] . . DeeeeE . . .. ds_read_u8 v8, v0 offset:7
[0,9] . . DeeeeE. . .. ds_read_u8 v9, v0 offset:8
[0,10] . . DeeeeE . .. ds_read_u8 v10, v0 offset:9
[0,11] . . .DeeeeE . .. ds_read_u8 v11, v0 offset:10
[0,12] . . . DeeeeE . .. ds_read_u8 v12, v0 offset:11
[0,13] . . . DE. . .. s_waitcnt lgkmcnt(10)
[0,14] . . . DE . .. v_lshl_or_b32 v0, v2, 8, v1
[0,15] . . . DE . .. s_waitcnt lgkmcnt(8)
[0,16] . . . .DE . .. v_lshl_or_b32 v1, v4, 8, v3
[0,17] . . . . DE . .. v_lshl_or_b32 v0, v1, 16, v0
[0,18] . . . . DE. .. s_waitcnt lgkmcnt(6)
[0,19] . . . . DE .. v_lshl_or_b32 v1, v6, 8, v5
[0,20] . . . . DE .. s_waitcnt lgkmcnt(4)
[0,21] . . . . .DE .. v_lshl_or_b32 v2, v8, 8, v7
[0,22] . . . . . DE .. v_lshl_or_b32 v1, v2, 16, v1
[0,23] . . . . . DE.. s_waitcnt lgkmcnt(2)
[0,24] . . . . . DE. v_lshl_or_b32 v2, v10, 8, v9
[0,25] . . . . . DE s_waitcnt lgkmcnt(0)
The following example doesn't make much logical sense, but it demonstrates the difference that the RetireOOO flag can have on latency hiding. In case the line wrapping messes up the formatting of the timelines, I've included screenshots.
Without patch (https://i.imgur.com/pB4dgtW.png):
Timeline view: 0123456789 0123456789 0123456789 0123456789 0123456789 Index 0123456789 0123456789 0123456789 0123456789 0123456789 [0,0] DeeeeE . . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[2:3], s[0:1], 0x24 [0,1] .DeeeeE . . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[0:1], s[0:1], 0x2c [0,2] . DE . . . . . . . . . . . . . . . . . . . s_waitcnt lgkmcnt(0) [0,3] . .DE . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v0, s2 [0,4] . . DE . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v1, s3 [0,5] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . . flat_load_dword v2, v[0:1] [0,6] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE. . . flat_load_dword v3, v[0:1] offset:8 [0,7] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . flat_load_dword v4, v[0:1] offset:16 [0,8] . . .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . flat_load_dword v5, v[0:1] offset:24 [0,9] . . . . . . . . . . . . . . . . . . DE . . v_mov_b32_e32 v0, s0 [0,10] . . . . . . . . . . . . . . . . . . .DE . . v_mov_b32_e32 v1, s1 [0,11] . . . . . . . . . . . . . . . . . . . DE . . v_mov_b32_e32 v6, s6 [0,12] . . . . . . . . . . . . . . . . . . . DE. . v_mov_b32_e32 v7, s7 [0,13] . . . . . . . . . . . . . . . . . . . DE . v_mov_b32_e32 v8, s8 [0,14] . . . . . . . . . . . . . . . . . . . DE . v_mov_b32_e32 v9, s9 [0,15] . . . . . . . . . . . . . . . . . . . .DE . v_mov_b32_e32 v10, s10 [0,16] . . . . . . . . . . . . . . . . . . . . DE. v_mov_b32_e32 v11, s11 [0,17] . . . . . . . . . . . . . . . . . . . . DE v_mov_b32_e32 v12, s12 [0,18] . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v13, s13 [0,19] . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v14, s14 [0,20] . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . 
. . . . . . v_mov_b32_e32 v15, s15 [0,21] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v16, s16 [0,22] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v17, s17 [0,23] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v18, s18 [0,24] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v19, s19 [0,25] . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v20, s20 [0,26] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v21, s21 [0,27] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v22, s22 [0,28] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v23, s23 [0,29] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v24, s24 [0,30] . . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v25, s25 [0,31] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v26, s26 [0,32] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v27, s27 [0,33] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v28, s28 [0,34] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v29, s29 [0,35] . . . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . s_waitcnt vmcnt(0) lgkmcnt(0)
With patch (first revision where RetireOOO is only set for s_waitcnt) (https://i.imgur.com/bmrLwrc.png):
Timeline view: 0123456789 0123456789 0123456789 0123456789 0123456789 Index 0123456789 0123456789 0123456789 0123456789 0123456789 [0,0] DeeeeE . . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[2:3], s[0:1], 0x24 [0,1] .DeeeeE . . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[0:1], s[0:1], 0x2c [0,2] . .DE . . . . . . . . . . . . . . . . . . . s_waitcnt lgkmcnt(0) [0,3] . . DE . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v0, s2 [0,4] . . DE. . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v1, s3 [0,5] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE. . . flat_load_dword v2, v[0:1] [0,6] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . flat_load_dword v3, v[0:1] offset:8 [0,7] . . .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . flat_load_dword v4, v[0:1] offset:16 [0,8] . . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . . flat_load_dword v5, v[0:1] offset:24 [0,9] . . . . . . . . . . . . . . . . . . .DE . . v_mov_b32_e32 v0, s0 [0,10] . . . . . . . . . . . . . . . . . . . DE . . v_mov_b32_e32 v1, s1 [0,11] . . . . . . . . . . . . . . . . . . . DE. . v_mov_b32_e32 v6, s6 [0,12] . . . . . . . . . . . . . . . . . . . DE . v_mov_b32_e32 v7, s7 [0,13] . . . . . . . . . . . . . . . . . . . DE . v_mov_b32_e32 v8, s8 [0,14] . . . . . . . . . . . . . . . . . . . .DE . v_mov_b32_e32 v9, s9 [0,15] . . . . . . . . . . . . . . . . . . . . DE. v_mov_b32_e32 v10, s10 [0,16] . . . . . . . . . . . . . . . . . . . . DE v_mov_b32_e32 v11, s11 [0,17] . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v12, s12 [0,18] . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v13, s13 [0,19] . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v14, s14 [0,20] . . . . . . . . . . . . . . . 
. . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v15, s15 [0,21] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v16, s16 [0,22] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v17, s17 [0,23] . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v18, s18 [0,24] . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v19, s19 [0,25] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v20, s20 [0,26] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v21, s21 [0,27] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v22, s22 [0,28] . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v23, s23 [0,29] . . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v24, s24 [0,30] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v25, s25 [0,31] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v26, s26 [0,32] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v27, s27 [0,33] . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v28, s28 [0,34] . . . . . . . . . . . . . . . . . . . . . . . .DE . . . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v29, s29 [0,35] . . . . . . . . . . . . . . . . . . . . . . . . DE . . . . . . . . . . . . . . . . . . . . s_waitcnt vmcnt(0) lgkmcnt(0)
With patch (second revision where RetireOOO is set globally) (https://i.imgur.com/5PSq0vw.png):
Timeline view: 0123456789 0123456789 0123456789 0123456789 0123 Index 0123456789 0123456789 0123456789 0123456789 0123456789 [0,0] DeeeeE . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[2:3], s[0:1], 0x24 [0,1] .DeeeeE . . . . . . . . . . . . . . . . . . s_load_dwordx2 s[0:1], s[0:1], 0x2c [0,2] . .DE . . . . . . . . . . . . . . . . . . s_waitcnt lgkmcnt(0) [0,3] . . DE . . . . . . . . . . . . . . . . . . v_mov_b32_e32 v0, s2 [0,4] . . DE. . . . . . . . . . . . . . . . . . v_mov_b32_e32 v1, s3 [0,5] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE. . flat_load_dword v2, v[0:1] [0,6] . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . flat_load_dword v3, v[0:1] offset:8 [0,7] . . .DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE . flat_load_dword v4, v[0:1] offset:16 [0,8] . . . DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeE. flat_load_dword v5, v[0:1] offset:24 [0,9] . . . DE. . . . . . . . . . . . . . . . . v_mov_b32_e32 v0, s0 [0,10] . . . DE . . . . . . . . . . . . . . . . v_mov_b32_e32 v1, s1 [0,11] . . . DE . . . . . . . . . . . . . . . . v_mov_b32_e32 v6, s6 [0,12] . . . .DE . . . . . . . . . . . . . . . . v_mov_b32_e32 v7, s7 [0,13] . . . . DE . . . . . . . . . . . . . . . . v_mov_b32_e32 v8, s8 [0,14] . . . . DE. . . . . . . . . . . . . . . . v_mov_b32_e32 v9, s9 [0,15] . . . . DE . . . . . . . . . . . . . . . v_mov_b32_e32 v10, s10 [0,16] . . . . DE . . . . . . . . . . . . . . . v_mov_b32_e32 v11, s11 [0,17] . . . . .DE . . . . . . . . . . . . . . . v_mov_b32_e32 v12, s12 [0,18] . . . . . DE . . . . . . . . . . . . . . . v_mov_b32_e32 v13, s13 [0,19] . . . . . DE. . . . . . . . . . . . . . . v_mov_b32_e32 v14, s14 [0,20] . . . . . DE . . . . . . . . . . . . . . v_mov_b32_e32 v15, s15 [0,21] . . . . . DE . . . . . . . . . . . . . . v_mov_b32_e32 v16, s16 [0,22] . . . . . .DE . . . . . . . . . . . . . . 
v_mov_b32_e32 v17, s17 [0,23] . . . . . . DE . . . . . . . . . . . . . . v_mov_b32_e32 v18, s18 [0,24] . . . . . . DE. . . . . . . . . . . . . . v_mov_b32_e32 v19, s19 [0,25] . . . . . . DE . . . . . . . . . . . . . v_mov_b32_e32 v20, s20 [0,26] . . . . . . DE . . . . . . . . . . . . . v_mov_b32_e32 v21, s21 [0,27] . . . . . . .DE . . . . . . . . . . . . . v_mov_b32_e32 v22, s22 [0,28] . . . . . . . DE . . . . . . . . . . . . . v_mov_b32_e32 v23, s23 [0,29] . . . . . . . DE. . . . . . . . . . . . . v_mov_b32_e32 v24, s24 [0,30] . . . . . . . DE . . . . . . . . . . . . v_mov_b32_e32 v25, s25 [0,31] . . . . . . . DE . . . . . . . . . . . . v_mov_b32_e32 v26, s26 [0,32] . . . . . . . .DE . . . . . . . . . . . . v_mov_b32_e32 v27, s27 [0,33] . . . . . . . . DE . . . . . . . . . . . . v_mov_b32_e32 v28, s28 [0,34] . . . . . . . . DE. . . . . . . . . . . . v_mov_b32_e32 v29, s29 [0,35] . . . . . . . . . . . . . . . . . . . DE s_waitcnt vmcnt(0) lgkmcnt(0)
Nit: it's a shame that this adds 10 lines to the patch instead of just 1. As an alternative...