This is an archive of the discontinued LLVM Phabricator instance.

[Scheduling] Implement a new way to cluster loads/stores
ClosedPublic

Authored by steven.zhang on Aug 7 2020, 4:57 AM.

Details

Summary

Before calling the target hook to determine whether two loads/stores are clusterable, we put them into different groups to avoid fake clustering caused by dependencies. For now, we put loads/stores into the same group if they have the same predecessor, on the assumption that two loads/stores sharing a predecessor are unlikely to depend on each other.

However, one SUnit might have several predecessors, and for now we just pick the first predecessor that has a non-data/non-artificial dependency, which is too arbitrary, and we have been struggling to fix it (D84139, D74524, D71717). A simplified sketch of this grouping idea is shown below.
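
For illustration only, here is a minimal sketch of this "group by the first chain predecessor" idea, assuming the usual LLVM headers and using namespace llvm, with DAG being the ScheduleDAGInstrs the mutation runs on. It is simplified and not the exact in-tree code:

  // Simplified sketch of the existing grouping heuristic (illustrative only):
  // key each memory SUnit by its first non-data, non-artificial predecessor,
  // and only consider SUnits within the same group for clustering.
  DenseMap<unsigned, SmallVector<SUnit *, 4>> Groups;
  for (SUnit &SU : DAG->SUnits) {
    const MachineInstr *MI = SU.getInstr();
    if (!MI->mayLoad() && !MI->mayStore())
      continue;
    // Default group for SUnits with no chain predecessor at all.
    unsigned ChainPredID = DAG->SUnits.size();
    for (const SDep &Pred : SU.Preds) {
      // isCtrl() is true for every non-data edge (anti, output, order).
      if (Pred.isCtrl() && !Pred.isArtificial()) {
        ChainPredID = Pred.getSUnit()->NodeNum;
        break;
      }
    }
    Groups[ChainPredID].push_back(&SU);
  }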

This algorithm becomes completely broken if the mutation is used post-RA, because more dependencies (anti, output) are added there. See this example:

           +----------------+
           |   y = add      +<----+
           +----------------+     |
                                  |
           +----------------+     |
      +---->   x = add      |     |
      |    +----------------+     |output
output|                           |
      |    +----------------+     |
      +----+  x = load A    |     |
           +----------------+     |
                                  |
                                  |
           +----------------+     |
           |  y = load A,4  +-----+
           +----------------+

We won't cluster these two loads even though we should, because they have output dependencies on different instructions and therefore end up in different groups.

Even for the pre-RA mutation, there are still problems, e.g.:

      +----------+
   +--+   Load A |                   +----------+
   |  +-----^----+                   |  Instr   |
   |        |                        +--+--^----+
   |        |         Dep               |  |
   |        +---------------------------+  |
Mem|                                       |
   |                  +----------+  Data   |
   |                  |  Load B  +---------+
   |                  +----+-----+
   |          Mem          |
   |     +-----------------+
   |     |
   |     |
+--v-----v--+
|   Store   |
+-----------+

Load A and Load B both have a memory dependency on Store, so for now they will be put into the same group, regardless of whether the data that Load B depends on has a dependency on Load A. If it does, we cannot cluster them, as there is a barrier instruction in between.

So, I am proposing a better implementation (there is no perfect grouping algorithm as far as I know, since it is an NP-complete problem):

  • Collect all the loads/stores that have memory info first, to reduce the complexity.
  • Sort these loads/stores so that we can stop the search as early as possible.
  • For each load/store, seek the first non-dependent instruction in the sorted order and check whether the two can be clustered (a rough sketch of this flow follows the list).
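
A rough sketch of this flow, for illustration only and not the exact patch: the MemOpInfo fields, the sort key, and the hasDependency() helper are placeholders; TII, TRI, and DAG stand for the TargetInstrInfo, TargetRegisterInfo, and ScheduleDAGInstrs available to the mutation, and the target hooks are shown with approximate signatures.

  // Illustrative sketch of the proposed clustering flow (not the exact patch).
  struct MemOpInfo {
    SUnit *SU;
    SmallVector<const MachineOperand *, 4> BaseOps;
    int64_t Offset;
    unsigned Width;
  };

  // 1) Collect only the loads/stores the target can describe with memory info.
  SmallVector<MemOpInfo, 32> MemOps;
  for (SUnit &SU : DAG->SUnits) {
    const MachineInstr &MI = *SU.getInstr();
    if (!MI.mayLoad() && !MI.mayStore())
      continue;
    SmallVector<const MachineOperand *, 4> BaseOps;
    int64_t Offset;
    bool OffsetIsScalable;
    unsigned Width;
    if (TII->getMemOperandsWithOffsetWidth(MI, BaseOps, Offset, OffsetIsScalable,
                                           Width, TRI))
      MemOps.push_back({&SU, BaseOps, Offset, Width});
  }

  // 2) Sort so that likely cluster candidates are adjacent and the search
  //    below can stop early (simplified key; base operands should also be
  //    compared in practice).
  llvm::sort(MemOps, [](const MemOpInfo &A, const MemOpInfo &B) {
    return A.Offset < B.Offset;
  });

  // 3) For each op, only the first following candidate it does not depend on
  //    is considered for clustering.
  for (unsigned I = 0, E = MemOps.size(); I + 1 < E; ++I) {
    for (unsigned J = I + 1; J < E; ++J) {
      if (hasDependency(MemOps[I].SU, MemOps[J].SU)) // placeholder DAG query
        continue;
      if (TII->shouldClusterMemOps(MemOps[I].BaseOps, MemOps[J].BaseOps,
                                   /*NumLoads=*/2,
                                   MemOps[I].Width + MemOps[J].Width))
        DAG->addEdge(MemOps[J].SU, SDep(MemOps[I].SU, SDep::Cluster));
      break;
    }
  }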

Our internal tests show a great improvement in load/store clustering with this patch. Any thoughts?

For the test changes, I do see some improvement in the stats data, but this needs confirmation from an AMDGPU expert. Note that the old implementation adds fake cluster edges which hurt the scheduler, e.g.:

CodeGen/AMDGPU/callee-special-input-vgprs.ll

Cluster ld/st SU(11) - SU(14)
  Copy Pred SU(13)
  Copy Pred SU(12)
  Copy Pred SU(12)
  Curr cluster length: 2, Curr cluster bytes: 8
Cluster ld/st SU(14) - SU(16)
  Copy Pred SU(15)
  Copy Pred SU(12)


And this is the final schedule with the old implementation (SU(11), SU(14), and SU(16) are not scheduled together due to the dependency). There are many similar cases.

SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
SU(15):   %15:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0.alloca, addrspace 5)
SU(17):   %16:sreg_64 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @too_many_args_use_workitem_id_x_byval + 4, target-flags(amdgpu-gotprel32-hi) @too_many_args_use_workitem_id_x_byval + 4, implicit-def dead $scc
SU(18):   %17:sreg_64_xexec = S_LOAD_DWORDX2_IMM %16:sreg_64, 0, 0, 0 :: (dereferenceable invariant load 8 from got, addrspace 4)
SU(19):   %20:vgpr_32 = V_LSHLREV_B32_e32 20, %2:vgpr_32(s32), implicit $exec
SU(20):   %22:vgpr_32 = V_LSHLREV_B32_e32 10, %1:vgpr_32(s32), implicit $exec
SU(21):   %23:vgpr_32 = V_OR_B32_e32 %0:vgpr_32(s32), %22:vgpr_32, implicit $exec
SU(22):   %24:vgpr_32 = V_OR_B32_e32 %23:vgpr_32, %20:vgpr_32, implicit $exec
SU(16):   BUFFER_STORE_DWORD_OFFSET %15:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack + 4, addrspace 5)

Diff Detail

Event Timeline

steven.zhang created this revision.Aug 7 2020, 4:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2020, 4:57 AM
steven.zhang requested review of this revision.Aug 7 2020, 4:57 AM
steven.zhang edited the summary of this revision.Aug 7 2020, 4:59 AM
steven.zhang edited the summary of this revision.
steven.zhang edited the summary of this revision.Aug 7 2020, 5:45 AM
steven.zhang edited the summary of this revision.
foad added a comment.Aug 7 2020, 8:48 AM

Thanks for the new detailed summary.

llvm/lib/CodeGen/MachineScheduler.cpp
1613–1614

This is supposed to prefer to keep loads/stores in their original code order. From the AMDGPU test case diffs (e.g. test/CodeGen/AMDGPU/global-saddr.ll) it looks like a lot of the clusters have been reordered. Do you have any idea why?

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This looks like a regression because the stores on lines 627 and 630 are no longer clustered. BUT see D85530: I don't think there is any reason for AMDGPU to try to cluster stores, so this may become a non-issue.

672–674

As above, these stores are no longer clustered.

llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165–166 ↗(On Diff #283866)

This looks good!

197–198 ↗(On Diff #283866)

This looks good!

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
759–761 ↗(On Diff #283866)

This looks like an improvement modulo D85530.

llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll
349–350 ↗(On Diff #283866)

This looks good!

llvm/test/CodeGen/AMDGPU/stack-realign.ll
167–171

The stores are no longer clustered.

steven.zhang added a comment.EditedAug 9 2020, 10:19 PM

Thank you for the review. I have added all the necessary comments to the test changes to explain what happens. Everything works as expected from my point of view.

llvm/lib/CodeGen/MachineScheduler.cpp
1613–1614

Good point.
The old implementation only clusters two loads (SU(8) and SU(9)); SU(12) isn't clustered with them. So it looks like the right order.

SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)
SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)

If we are clustering more than 2 SUs, I think the swap logic here isn't right, as it cannot keep them sorted. That is why we see the following with the new implementation (we are now clustering 3 SUs):

SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)
SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)

Cluster ld/st SU(8) - SU(9)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(8) - SU(12)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Copy Succ SU(9)
  Curr cluster length: 3, Curr cluster bytes: 24

SU(8) and SU(9) are clustered first, and then we try to cluster SU(8) and SU(12). We cannot make SU(12) the successor of SU(8) because SU(8) and SU(9) are already clustered. So the only available sequences would be:

SU(12) SU(8) SU(9), or
SU(9) SU(8) SU(12) (without swapping them; this is the offset order)
llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
531 ↗(On Diff #283866)

s_load_dwordx8 and s_load_dword are not a cluster pair according to the log, and the new order is aligned with node order.
Old:

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256

New:

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
551 ↗(On Diff #283866)

The old implementation didn't cluster these two loads, though they happened to be scheduled together by luck.

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256

The new implementation takes them as a cluster pair and schedules them in node order.

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(10)
  Copy Succ SU(5)
  Copy Succ SU(4)
  Copy Succ SU(3)
  Curr cluster length: 2, Curr cluster bytes: 4

*** Final schedule for %bb.0 ***
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This is the final sequence with the old implementation. In fact, there is a stack operation in between these two stores which acts as a barrier, and that's why we don't cluster them. So this works as expected, but if there is any other concern about this, please let me know.

SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)

The old implementation still clusters them, but it appears to be fine because we do nothing for the stack operation.

Cluster ld/st SU(11) - SU(14)
  Copy Pred SU(13)
  Copy Pred SU(12)
  Copy Pred SU(12)
  Curr cluster length: 2, Curr cluster bytes: 8
672–674

The same reason as above. The old implementation clusters these two stores.

SU(10):   BUFFER_STORE_DWORD_OFFEN %10:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(11):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(13):   BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
llvm/test/CodeGen/AMDGPU/global-saddr.ll
50 ↗(On Diff #283866)

See the reason explained above. This exposes an issue in our swap logic when more than 2 SUs are clustered.

llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll
241 ↗(On Diff #283866)

With the old implementation, SU(1) and SU(2) are not clustered.

SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 5
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;

The new implementation clusters them and sorts them in node order.

SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 17
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
    SU(2): Ord  Latency=0 Cluster
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;
*** Final schedule for %bb.0 ***
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(3):   undef %10.sub3:sgpr_128 = S_MOV_B32 61440
SU(4):   %10.sub2:sgpr_128 = S_MOV_B32 -1
SU(6):   %13:sreg_32 = S_MOV_B32 -1023
SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %1:sgpr_64(p4), 17, 0, 0 :: (dereferenceable invariant load 32 from %ir.in.kernarg.offset.cast, align 4, addrspace 4)
llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll
14 ↗(On Diff #283866)

The old implementation only clusters SU(1) and SU(2):

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 24

SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)

The new implementation clusters 3 SUs:

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(4)
  Copy Succ SU(4294967295)
  Curr cluster length: 3, Curr cluster bytes: 20


*** Final schedule for %bb.0 ***
SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)

The reason why we don't get SU(1) SU(2) SU(3) is a bit tricky and is a limitation of the scheduler. I can explain more if needed.

llvm/test/CodeGen/AMDGPU/shift-i128.ll
450 ↗(On Diff #283866)

The old implementation didn't cluster these two loads, while the new implementation does, and they are sorted in node order.
Old implementation:

SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
  # preds left       : 1
  # succs left       : 12
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 11
  Predecessors:
    SU(0): Data Latency=1 Reg=%2
  Successors:
    SU(51): Data Latency=5 Reg=%5
    SU(46): Data Latency=5 Reg=%5
    SU(44): Data Latency=5 Reg=%5
    SU(39): Data Latency=5 Reg=%5
    SU(32): Data Latency=5 Reg=%5
    SU(31): Data Latency=5 Reg=%5
    SU(29): Data Latency=5 Reg=%5
    SU(23): Data Latency=5 Reg=%5
    SU(18): Data Latency=5 Reg=%5
    SU(11): Data Latency=5 Reg=%5
    SU(10): Data Latency=5 Reg=%5
    SU(8): Data Latency=5 Reg=%5
  Pressure Diff      : SReg_32 -

New Implementation:

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(51)
  Copy Succ SU(46)
  Copy Succ SU(44)
  Copy Succ SU(39)
  Copy Succ SU(32)
  Copy Succ SU(31)
  Copy Succ SU(29)
  Copy Succ SU(23)
  Copy Succ SU(18)
  Copy Succ SU(11)
  Copy Succ SU(10)
  Copy Succ SU(8)
  Curr cluster length: 2, Curr cluster bytes: 32

SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
SU(2):   %6:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 8, 0, 0 :: (dereferenceable invariant load 32 from %ir.rhs.kernarg.offset.cast, align 16, addrspace 4)
520 ↗(On Diff #283866)

Same reason as above.

590 ↗(On Diff #283866)

Same reason as above.

llvm/test/CodeGen/AMDGPU/stack-realign.ll
167–171

The same reason as before: there is an ADJCALLSTACKUP in between these two stores, so we don't cluster them.

SU(35):   BUFFER_STORE_DWORD_OFFEN %35:vgpr_32, %stack.0.temp, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.temp, align 1024, addrspace 5)
SU(36):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(37):   BUFFER_STORE_DWORD_OFFSET %34:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)

And this is the final assembly with the old implementation; they are in fact not scheduled together.

buffer_store_dword v33, off, s[0:3], s33 offset:1024
s_waitcnt vmcnt(1) lgkmcnt(0)
buffer_store_dword v32, off, s[0:3], s32
foad added a comment.Aug 10 2020, 2:01 AM

Thanks for the detailed explanations! The test case diffs all look good to me now.

Perhaps in the future we can find some way to preserve original code order for clusters of more than two instructions. (Or even better, find a way to cluster them without enforcing any particular order at all. I.e. a way to say to the scheduler "I want these units to be adjacent but I don't care which one comes first.")

llvm/lib/CodeGen/MachineScheduler.cpp
1586

Typo "seek".

dmgreen added inline comments.
llvm/lib/CodeGen/MachineScheduler.cpp
1574

-> cluster

Fix typo ...

foad added a comment.Aug 10 2020, 6:17 AM

I've tested this patch by compiling several thousand shaders with the AMDGPU backend and analysing the generated code. It gives a nice ~ 4% increase in the average size of a load cluster (or equivalently a ~ 4% decrease in the total number of clusters, given that the total number of loads is unchanged).

foad added inline comments.Aug 10 2020, 6:35 AM
llvm/lib/CodeGen/MachineScheduler.cpp
1533–1536

I think this can still be an ArrayRef.

1536

I think this should take a SmallVectorImpl<MemOpInfo>&.

1600

Shouldn't this be MemOpa.Width+MemOpb.Width?

Address comments.

It is a pity that the increase in clustered loads/stores we saw on AMDGPU was caused by mistakenly relaxing the constraint. @foad, really sorry about this and thank you for pointing it out. I have cooked up a new test using AArch64 loads/stores to show the benefit for the pre-RA scheduler. This update will also remove some unnecessary clustering, as seen in the AMDGPU tests. Further, it fixes the problems we would see if the mutation were used post-RA.

steven.zhang added inline comments.Aug 10 2020, 8:45 PM
llvm/lib/CodeGen/MachineScheduler.cpp
1600

Good catch. I missed this, and it is the main reason why we saw more clustered loads/stores in the tests.

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This comment still applies. The old implementation clusters these three stores, which is not right.

672–674

The same reason as above.

Gentle ping ...

foad added inline comments.Aug 18 2020, 1:54 AM
llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
230

Would it make sense to pre-commit this test, so we can see how your patch affects it?

steven.zhang updated this revision to Diff 286235.EditedAug 18 2020, 3:04 AM

Rebase the patch on top of the pre-committed AArch64 test; we now see more cluster pairs in one new AMDGPU test, max.i16.ll.

steven.zhang added inline comments.Aug 18 2020, 3:13 AM
llvm/test/CodeGen/AMDGPU/max.i16.ll
148–149

This is an improvement according to the scheduler log.
The old implementation clusters 3 ld/st pairs:

Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Copy Succ SU(7)
  Curr cluster length: 2, Curr cluster bytes: 24
Cluster ld/st SU(8) - SU(10)
  Copy Succ SU(11)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 2, Curr cluster bytes: 8
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 0, OffsetIsScalable: 0, Width: 4
Cluster ld/st SU(13) - SU(14)
  Copy Pred SU(11)
  Copy Pred SU(10)
  Copy Pred SU(9)
  Copy Pred SU(8)
  Copy Pred SU(7)
  Copy Pred SU(4)
  Copy Pred SU(2)
  Copy Pred SU(3)
  Curr cluster length: 2, Curr cluster bytes: 8

Final:
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   %0:vgpr_32(s32) = COPY $vgpr0
SU(2):   %4:sgpr_128 = S_LOAD_DWORDX4_IMM %1:sgpr_64(p4), 36, 0, 0 :: (dereferenceable invariant load 16 from %ir.1, align 4, addrspace 4)
SU(3):   %14:sreg_64_xexec = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 52, 0, 0 :: (dereferenceable invariant load 8 from %ir.1 + 16, align 4, addrspace 4)
SU(4):   %16:vgpr_32 = V_LSHLREV_B32_e32 3, %0:vgpr_32(s32), implicit $exec
SU(5):   %20:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(6):   %18:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(7):   %18:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 4, 0, 0, 0, %18:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep0 + 4, align 4, addrspace 1)
SU(9):   %20:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %14:sreg_64_xexec, %16:vgpr_32, 4, 0, 0, 0, %20:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep1 + 4, align 4, addrspace 1)
SU(8):   %19:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep0, addrspace 1)
SU(10):   %21:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %14:sreg_64_xexec, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep1, addrspace 1)
SU(11):   %22:vgpr_32 = V_PK_MAX_I16 8, %19:vgpr_32, 8, %21:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(12):   %23:vgpr_32 = V_PK_MAX_I16 8, %18:vgpr_32, 8, %20:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(13):   GLOBAL_STORE_SHORT_SADDR %16:vgpr_32, %23:vgpr_32, %4.sub0_sub1:sgpr_128, 4, 0, 0, 0, implicit $exec :: (store 2 into %ir.outgep + 4, align 4, addrspace 1)
SU(14):   GLOBAL_STORE_DWORD_SADDR %16:vgpr_32, %22:vgpr_32, %4.sub0_sub1:sgpr_128, 0, 0, 0, 0, implicit $exec :: (store 4 into %ir.outgep, addrspace 1)

The new implementation clusters 5 pairs:

Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Copy Succ SU(7)
  Curr cluster length: 2, Curr cluster bytes: 24
Cluster ld/st SU(7) - SU(8)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 2, Curr cluster bytes: 8
Cluster ld/st SU(7) - SU(10)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Curr cluster length: 3, Curr cluster bytes: 12
Cluster ld/st SU(9) - SU(10)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 4, Curr cluster bytes: 16
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 0, OffsetIsScalable: 0, Width: 4
Cluster ld/st SU(13) - SU(14)
  Copy Pred SU(11)
  Copy Pred SU(10)
  Copy Pred SU(9)
  Copy Pred SU(8)
  Copy Pred SU(7)
  Copy Pred SU(4)
  Copy Pred SU(2)
  Copy Pred SU(3)
  Curr cluster length: 2, Curr cluster bytes: 8
Final:
*** Final schedule for %bb.0 ***
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   %0:vgpr_32(s32) = COPY $vgpr0
SU(2):   %4:sgpr_128 = S_LOAD_DWORDX4_IMM %1:sgpr_64(p4), 36, 0, 0 :: (dereferenceable invariant load 16 from %ir.1, align 4, addrspace 4)
SU(3):   %14:sreg_64_xexec = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 52, 0, 0 :: (dereferenceable invariant load 8 from %ir.1 + 16, align 4, addrspace 4)
SU(4):   %16:vgpr_32 = V_LSHLREV_B32_e32 3, %0:vgpr_32(s32), implicit $exec
SU(5):   %20:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(6):   %18:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(9):   %20:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %14:sreg_64_xexec, %16:vgpr_32, 4, 0, 0, 0, %20:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep1 + 4, align 4, addrspace 1)
SU(10):   %21:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %14:sreg_64_xexec, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep1, addrspace 1)
SU(7):   %18:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 4, 0, 0, 0, %18:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep0 + 4, align 4, addrspace 1)
SU(8):   %19:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep0, addrspace 1)
SU(11):   %22:vgpr_32 = V_PK_MAX_I16 8, %19:vgpr_32, 8, %21:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(12):   %23:vgpr_32 = V_PK_MAX_I16 8, %18:vgpr_32, 8, %20:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(13):   GLOBAL_STORE_SHORT_SADDR %16:vgpr_32, %23:vgpr_32, %4.sub0_sub1:sgpr_128, 4, 0, 0, 0, implicit $exec :: (store 2 into %ir.outgep + 4, align 4, addrspace 1)
SU(14):   GLOBAL_STORE_DWORD_SADDR %16:vgpr_32, %22:vgpr_32, %4.sub0_sub1:sgpr_128, 0, 0, 0, 0, implicit $exec :: (store 4 into %ir.outgep, addrspace 1)
steven.zhang added inline comments.Aug 18 2020, 3:15 AM
llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
230

Sure. Done.

Gentle ping ...

foad accepted this revision.Aug 25 2020, 2:24 AM

I've tested this patch by compiling several thousand shaders with the AMDGPU backend and analysing the generated code. It gives a nice ~ 4% increase in the average size of a load cluster (or equivalently a ~ 4% decrease in the total number of clusters, given that the total number of loads is unchanged).

I've tested this again and it now gives a ~ 0.25% increase in the amount of clustering. LGTM.

This revision is now accepted and ready to land.Aug 25 2020, 2:24 AM
This revision was landed with ongoing or failed builds.Aug 26 2020, 5:34 AM
This revision was automatically updated to reflect the committed changes.