This is an archive of the discontinued LLVM Phabricator instance.

[Scheduling] Implement a new way to cluster loads/stores
ClosedPublic

Authored by steven.zhang on Aug 7 2020, 4:57 AM.

Details

Summary

Before calling the target hook to determine whether two loads/stores are clusterable, we put them into different groups to avoid fake clustering caused by dependencies. For now, we put loads/stores into the same group if they have the same predecessor, on the assumption that two loads/stores sharing a predecessor are unlikely to depend on each other.

However, one SUnit might have several predecessors, and for now we just pick the first predecessor that has a non-data/non-artificial dependency, which is too arbitrary, and we have been struggling to fix it (D84139, D74524, D71717). A simplified sketch of this grouping idea is shown below.
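
For illustration only, here is a minimal sketch of this "group by the first chain predecessor" idea, assuming the usual LLVM headers and using namespace llvm, with DAG being the ScheduleDAGInstrs the mutation runs on. It is simplified and not the exact in-tree code:

  // Simplified sketch of the existing grouping heuristic (illustrative only):
  // key each memory SUnit by its first non-data, non-artificial predecessor,
  // and only consider SUnits within the same group for clustering.
  DenseMap<unsigned, SmallVector<SUnit *, 4>> Groups;
  for (SUnit &SU : DAG->SUnits) {
    const MachineInstr *MI = SU.getInstr();
    if (!MI->mayLoad() && !MI->mayStore())
      continue;
    // Default group for SUnits with no chain predecessor at all.
    unsigned ChainPredID = DAG->SUnits.size();
    for (const SDep &Pred : SU.Preds) {
      // isCtrl() is true for every non-data edge (anti, output, order).
      if (Pred.isCtrl() && !Pred.isArtificial()) {
        ChainPredID = Pred.getSUnit()->NodeNum;
        break;
      }
    }
    Groups[ChainPredID].push_back(&SU);
  }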

This algorithm becomes completely broken if the mutation is used post-RA, because more dependencies (anti, output) are added there. See this example:

           +----------------+
           |   y = add      +<----+
           +----------------+     |
                                  |
           +----------------+     |
      +---->   x = add      |     |
      |    +----------------+     |output
output|                           |
      |    +----------------+     |
      +----+  x = load A    |     |
           +----------------+     |
                                  |
                                  |
           +----------------+     |
           |  y = load A,4  +-----+
           +----------------+

We won't cluster these two loads even though we should, because they have output dependencies on different instructions and therefore end up in different groups.

Even for the pre-RA mutation, there are still problems, e.g.:

      +----------+
   +--+   Load A |                   +----------+
   |  +-----^----+                   |  Instr   |
   |        |                        +--+--^----+
   |        |         Dep               |  |
   |        +---------------------------+  |
Mem|                                       |
   |                  +----------+  Data   |
   |                  |  Load B  +---------+
   |                  +----+-----+
   |          Mem          |
   |     +-----------------+
   |     |
   |     |
+--v-----v--+
|   Store   |
+-----------+

Load A and Load B both have a memory dependency on Store, so for now they will be put into the same group, regardless of whether the data that Load B depends on has a dependency on Load A. If it does, we cannot cluster them, as there is a barrier instruction in between.

So, I am proposing a better implementation (there is no perfect grouping algorithm as far as I know, since it is an NP-complete problem):

  • Collect all the loads/stores that have memory info first, to reduce the complexity.
  • Sort these loads/stores so that we can stop the search as early as possible.
  • For each load/store, seek the first non-dependent instruction in the sorted order and check whether the two can be clustered (a rough sketch of this flow follows the list).
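
A rough sketch of this flow, for illustration only and not the exact patch: the MemOpInfo fields, the sort key, and the hasDependency() helper are placeholders; TII, TRI, and DAG stand for the TargetInstrInfo, TargetRegisterInfo, and ScheduleDAGInstrs available to the mutation, and the target hooks are shown with approximate signatures.

  // Illustrative sketch of the proposed clustering flow (not the exact patch).
  struct MemOpInfo {
    SUnit *SU;
    SmallVector<const MachineOperand *, 4> BaseOps;
    int64_t Offset;
    unsigned Width;
  };

  // 1) Collect only the loads/stores the target can describe with memory info.
  SmallVector<MemOpInfo, 32> MemOps;
  for (SUnit &SU : DAG->SUnits) {
    const MachineInstr &MI = *SU.getInstr();
    if (!MI.mayLoad() && !MI.mayStore())
      continue;
    SmallVector<const MachineOperand *, 4> BaseOps;
    int64_t Offset;
    bool OffsetIsScalable;
    unsigned Width;
    if (TII->getMemOperandsWithOffsetWidth(MI, BaseOps, Offset, OffsetIsScalable,
                                           Width, TRI))
      MemOps.push_back({&SU, BaseOps, Offset, Width});
  }

  // 2) Sort so that likely cluster candidates are adjacent and the search
  //    below can stop early (simplified key; base operands should also be
  //    compared in practice).
  llvm::sort(MemOps, [](const MemOpInfo &A, const MemOpInfo &B) {
    return A.Offset < B.Offset;
  });

  // 3) For each op, only the first following candidate it does not depend on
  //    is considered for clustering.
  for (unsigned I = 0, E = MemOps.size(); I + 1 < E; ++I) {
    for (unsigned J = I + 1; J < E; ++J) {
      if (hasDependency(MemOps[I].SU, MemOps[J].SU)) // placeholder DAG query
        continue;
      if (TII->shouldClusterMemOps(MemOps[I].BaseOps, MemOps[J].BaseOps,
                                   /*NumLoads=*/2,
                                   MemOps[I].Width + MemOps[J].Width))
        DAG->addEdge(MemOps[J].SU, SDep(MemOps[I].SU, SDep::Cluster));
      break;
    }
  }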

Our internal tests show a great improvement in load/store clustering with this patch. Any thoughts?

For the test changes, I do see some improvement in the stats data, but this needs confirmation from an AMDGPU expert. Note that the old implementation adds fake cluster edges which hurt the scheduler, e.g.:

CodeGen/AMDGPU/callee-special-input-vgprs.ll

Cluster ld/st SU(11) - SU(14)
  Copy Pred SU(13)
  Copy Pred SU(12)
  Copy Pred SU(12)
  Curr cluster length: 2, Curr cluster bytes: 8
Cluster ld/st SU(14) - SU(16)
  Copy Pred SU(15)
  Copy Pred SU(12)


And this is the final schedule with the old implementation (SU(11), SU(14), and SU(16) are not scheduled together due to the dependency). There are many similar cases.

SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
SU(15):   %15:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0.alloca, addrspace 5)
SU(17):   %16:sreg_64 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @too_many_args_use_workitem_id_x_byval + 4, target-flags(amdgpu-gotprel32-hi) @too_many_args_use_workitem_id_x_byval + 4, implicit-def dead $scc
SU(18):   %17:sreg_64_xexec = S_LOAD_DWORDX2_IMM %16:sreg_64, 0, 0, 0 :: (dereferenceable invariant load 8 from got, addrspace 4)
SU(19):   %20:vgpr_32 = V_LSHLREV_B32_e32 20, %2:vgpr_32(s32), implicit $exec
SU(20):   %22:vgpr_32 = V_LSHLREV_B32_e32 10, %1:vgpr_32(s32), implicit $exec
SU(21):   %23:vgpr_32 = V_OR_B32_e32 %0:vgpr_32(s32), %22:vgpr_32, implicit $exec
SU(22):   %24:vgpr_32 = V_OR_B32_e32 %23:vgpr_32, %20:vgpr_32, implicit $exec
SU(16):   BUFFER_STORE_DWORD_OFFSET %15:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack + 4, addrspace 5)

Diff Detail

Event Timeline

steven.zhang created this revision.Aug 7 2020, 4:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2020, 4:57 AM
steven.zhang requested review of this revision.Aug 7 2020, 4:57 AM
steven.zhang edited the summary of this revision.Aug 7 2020, 4:59 AM
steven.zhang edited the summary of this revision.
steven.zhang edited the summary of this revision.Aug 7 2020, 5:45 AM
steven.zhang edited the summary of this revision.
foad added a comment.Aug 7 2020, 8:48 AM

Thanks for the new detailed summary.

llvm/lib/CodeGen/MachineScheduler.cpp
1613–1614

This is supposed to prefer to keep loads/stores in their original code order. From the AMDGPU test case diffs (e.g. test/CodeGen/AMDGPU/global-saddr.ll) it looks like a lot of the clusters have been reordered. Do you have any idea why?

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This looks like a regression because the stores on lines 627 and 630 are no longer clustered. BUT see D85530: I don't think there is any reason for AMDGPU to try to cluster stores, so this may become a non-issue.

672–674

As above, these stores are no longer clustered.

llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165–166 ↗(On Diff #283866)

This looks good!

197–198 ↗(On Diff #283866)

This looks good!

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
759–761 ↗(On Diff #283866)

This looks like an improvement modulo D85530.

llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll
349–350 ↗(On Diff #283866)

This looks good!

llvm/test/CodeGen/AMDGPU/stack-realign.ll
167–171

The stores are no longer clustered.

steven.zhang added a comment.EditedAug 9 2020, 10:19 PM

Thank you for the review. I have added all the necessary comments to the test changes to explain what happens. Everything works as expected from my point of view.

llvm/lib/CodeGen/MachineScheduler.cpp
1613–1614

Good point.
The old implementation only clusters two loads (SU(8) and SU(9)); SU(12) isn't clustered with them. So it looks like the right order.

SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)
SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)

If we are clustering more than 2 SUs, I think the swap logic here isn't right, as it cannot keep them sorted. That is why we see the following with the new implementation (we are now clustering 3 SUs):

SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)
SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)

Cluster ld/st SU(8) - SU(9)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(8) - SU(12)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Copy Succ SU(9)
  Curr cluster length: 3, Curr cluster bytes: 24

SU(8) and SU(9) are clustered first, and then we try to cluster SU(8) and SU(12). We cannot make SU(12) the successor of SU(8) because SU(8) and SU(9) are already clustered. So the only available sequences would be:

SU(12) SU(8) SU(9), or
SU(9) SU(8) SU(12) (without swapping them; this is the offset order)
llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
531 ↗(On Diff #283866)

s_load_dwordx8 and s_load_dword are not a cluster pair according to the log, and the new order is aligned with node order.
Old:

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256

New:

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
551 ↗(On Diff #283866)

The old implementation didn't cluster these two loads, though they happened to be scheduled together by luck.

SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256

The new implementation takes them as a cluster pair and schedules them in node order.

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(10)
  Copy Succ SU(5)
  Copy Succ SU(4)
  Copy Succ SU(3)
  Curr cluster length: 2, Curr cluster bytes: 4

*** Final schedule for %bb.0 ***
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This is the final sequence with the old implementation. In fact, there is a stack operation in between these two stores which acts as a barrier, and that's why we don't cluster them. So this works as expected, but if there is any other concern about this, please let me know.

SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)

The old implementation still clusters them, but it appears to be fine because we do nothing for the stack operation.

Cluster ld/st SU(11) - SU(14)
  Copy Pred SU(13)
  Copy Pred SU(12)
  Copy Pred SU(12)
  Curr cluster length: 2, Curr cluster bytes: 8
672–674

The same reason as above. The old implementation clusters these two stores.

SU(10):   BUFFER_STORE_DWORD_OFFEN %10:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(11):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(13):   BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
llvm/test/CodeGen/AMDGPU/global-saddr.ll
50 ↗(On Diff #283866)

See the reason explained above. This exposes an issue in our swap logic when more than 2 SUs are clustered.

llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll
241 ↗(On Diff #283866)

With the old implementation, SU(1) and SU(2) are not clustered.

SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 5
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;

The new implementation clusters them and sorts them in node order.

SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 17
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
    SU(2): Ord  Latency=0 Cluster
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;
*** Final schedule for %bb.0 ***
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(3):   undef %10.sub3:sgpr_128 = S_MOV_B32 61440
SU(4):   %10.sub2:sgpr_128 = S_MOV_B32 -1
SU(6):   %13:sreg_32 = S_MOV_B32 -1023
SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %1:sgpr_64(p4), 17, 0, 0 :: (dereferenceable invariant load 32 from %ir.in.kernarg.offset.cast, align 4, addrspace 4)
llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll
14 ↗(On Diff #283866)

The old implementation only clusters SU(1) and SU(2):

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 24

SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)

The new implementation clusters 3 SUs:

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(4)
  Copy Succ SU(4294967295)
  Curr cluster length: 3, Curr cluster bytes: 20


*** Final schedule for %bb.0 ***
SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)

The reason why we don't get SU(1) SU(2) SU(3) is a bit tricky and is a limitation of the scheduler. I can explain more if needed.

llvm/test/CodeGen/AMDGPU/shift-i128.ll
450 ↗(On Diff #283866)

The old implementation didn't cluster these two loads, while the new implementation does, and they are sorted in node order.
Old implementation:

SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
  # preds left       : 1
  # succs left       : 12
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 11
  Predecessors:
    SU(0): Data Latency=1 Reg=%2
  Successors:
    SU(51): Data Latency=5 Reg=%5
    SU(46): Data Latency=5 Reg=%5
    SU(44): Data Latency=5 Reg=%5
    SU(39): Data Latency=5 Reg=%5
    SU(32): Data Latency=5 Reg=%5
    SU(31): Data Latency=5 Reg=%5
    SU(29): Data Latency=5 Reg=%5
    SU(23): Data Latency=5 Reg=%5
    SU(18): Data Latency=5 Reg=%5
    SU(11): Data Latency=5 Reg=%5
    SU(10): Data Latency=5 Reg=%5
    SU(8): Data Latency=5 Reg=%5
  Pressure Diff      : SReg_32 -

New Implementation:

Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(51)
  Copy Succ SU(46)
  Copy Succ SU(44)
  Copy Succ SU(39)
  Copy Succ SU(32)
  Copy Succ SU(31)
  Copy Succ SU(29)
  Copy Succ SU(23)
  Copy Succ SU(18)
  Copy Succ SU(11)
  Copy Succ SU(10)
  Copy Succ SU(8)
  Curr cluster length: 2, Curr cluster bytes: 32

SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
SU(2):   %6:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 8, 0, 0 :: (dereferenceable invariant load 32 from %ir.rhs.kernarg.offset.cast, align 16, addrspace 4)
520 ↗(On Diff #283866)

Same reason as above.

590 ↗(On Diff #283866)

Same reason as above.

llvm/test/CodeGen/AMDGPU/stack-realign.ll
167–171

The same reason as before: there is an ADJCALLSTACKUP in between these two stores, so we don't cluster them.

SU(35):   BUFFER_STORE_DWORD_OFFEN %35:vgpr_32, %stack.0.temp, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.temp, align 1024, addrspace 5)
SU(36):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(37):   BUFFER_STORE_DWORD_OFFSET %34:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)

And this is the final assembly with the old implementation; they are in fact not scheduled together.

buffer_store_dword v33, off, s[0:3], s33 offset:1024
s_waitcnt vmcnt(1) lgkmcnt(0)
buffer_store_dword v32, off, s[0:3], s32
foad added a comment.Aug 10 2020, 2:01 AM

Thanks for the detailed explanations! The test case diffs all look good to me now.

Perhaps in the future we can find some way to preserve original code order for clusters of more than two instructions. (Or even better, find a way to cluster them without enforcing any particular order at all. I.e. a way to say to the scheduler "I want these units to be adjacent but I don't care which one comes first.")

llvm/lib/CodeGen/MachineScheduler.cpp
1586

Typo "seek".

dmgreen added inline comments.
llvm/lib/CodeGen/MachineScheduler.cpp
1574

-> cluster

Fix typo ...

foad added a comment.Aug 10 2020, 6:17 AM

I've tested this patch by compiling several thousand shaders with the AMDGPU backend and analysing the generated code. It gives a nice ~ 4% increase in the average size of a load cluster (or equivalently a ~ 4% decrease in the total number of clusters, given that the total number of loads is unchanged).

foad added inline comments.Aug 10 2020, 6:35 AM
llvm/lib/CodeGen/MachineScheduler.cpp
1533–1536

I think this can still be an ArrayRef.

1536

I think this should take a SmallVectorImpl<MemOpInfo>&.

1600

Shouldn't this be MemOpa.Width+MemOpb.Width?

Address comments.

It is a pity that the increase in clustered loads/stores we saw on AMDGPU was caused by mistakenly relaxing the constraint. @foad, really sorry about this and thank you for pointing it out. I have cooked up a new test using AArch64 loads/stores to show the benefit for the pre-RA scheduler. This update will also remove some unnecessary clustering, as seen in the AMDGPU tests. Further, it fixes the problems we would see if the mutation were used post-RA.

steven.zhang added inline comments.Aug 10 2020, 8:45 PM
llvm/lib/CodeGen/MachineScheduler.cpp
1600

Good catch. I missed this, and it is the main reason why we saw more clustered loads/stores in the tests.

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
627–630

This comment still applies. The old implementation clusters these three stores, which is not right.

672–674

The same reason as above.

Gentle ping ...

foad added inline comments.Aug 18 2020, 1:54 AM
llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
230

Would it make sense to pre-commit this test, so we can see how your patch affects it?

steven.zhang updated this revision to Diff 286235.EditedAug 18 2020, 3:04 AM

Rebase the patch on top of the pre-committed AArch64 test; we now see more cluster pairs in one new AMDGPU test, max.i16.ll.

steven.zhang added inline comments.Aug 18 2020, 3:13 AM
llvm/test/CodeGen/AMDGPU/max.i16.ll
148–149

This is an improvement according to the scheduler log.
The old implementation clusters 3 ld/st pairs:

Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Copy Succ SU(7)
  Curr cluster length: 2, Curr cluster bytes: 24
Cluster ld/st SU(8) - SU(10)
  Copy Succ SU(11)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 2, Curr cluster bytes: 8
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 0, OffsetIsScalable: 0, Width: 4
Cluster ld/st SU(13) - SU(14)
  Copy Pred SU(11)
  Copy Pred SU(10)
  Copy Pred SU(9)
  Copy Pred SU(8)
  Copy Pred SU(7)
  Copy Pred SU(4)
  Copy Pred SU(2)
  Copy Pred SU(3)
  Curr cluster length: 2, Curr cluster bytes: 8

Final:
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   %0:vgpr_32(s32) = COPY $vgpr0
SU(2):   %4:sgpr_128 = S_LOAD_DWORDX4_IMM %1:sgpr_64(p4), 36, 0, 0 :: (dereferenceable invariant load 16 from %ir.1, align 4, addrspace 4)
SU(3):   %14:sreg_64_xexec = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 52, 0, 0 :: (dereferenceable invariant load 8 from %ir.1 + 16, align 4, addrspace 4)
SU(4):   %16:vgpr_32 = V_LSHLREV_B32_e32 3, %0:vgpr_32(s32), implicit $exec
SU(5):   %20:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(6):   %18:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(7):   %18:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 4, 0, 0, 0, %18:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep0 + 4, align 4, addrspace 1)
SU(9):   %20:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %14:sreg_64_xexec, %16:vgpr_32, 4, 0, 0, 0, %20:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep1 + 4, align 4, addrspace 1)
SU(8):   %19:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep0, addrspace 1)
SU(10):   %21:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %14:sreg_64_xexec, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep1, addrspace 1)
SU(11):   %22:vgpr_32 = V_PK_MAX_I16 8, %19:vgpr_32, 8, %21:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(12):   %23:vgpr_32 = V_PK_MAX_I16 8, %18:vgpr_32, 8, %20:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(13):   GLOBAL_STORE_SHORT_SADDR %16:vgpr_32, %23:vgpr_32, %4.sub0_sub1:sgpr_128, 4, 0, 0, 0, implicit $exec :: (store 2 into %ir.outgep + 4, align 4, addrspace 1)
SU(14):   GLOBAL_STORE_DWORD_SADDR %16:vgpr_32, %22:vgpr_32, %4.sub0_sub1:sgpr_128, 0, 0, 0, 0, implicit $exec :: (store 4 into %ir.outgep, addrspace 1)

The new implementation clusters 5 pairs:

Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Copy Succ SU(7)
  Curr cluster length: 2, Curr cluster bytes: 24
Cluster ld/st SU(7) - SU(8)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 2, Curr cluster bytes: 8
Cluster ld/st SU(7) - SU(10)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(8)
  Curr cluster length: 3, Curr cluster bytes: 12
Cluster ld/st SU(9) - SU(10)
  Copy Succ SU(12)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Curr cluster length: 4, Curr cluster bytes: 16
Num BaseOps: 2, Offset: 4, OffsetIsScalable: 0, Width: 4
Num BaseOps: 2, Offset: 0, OffsetIsScalable: 0, Width: 4
Cluster ld/st SU(13) - SU(14)
  Copy Pred SU(11)
  Copy Pred SU(10)
  Copy Pred SU(9)
  Copy Pred SU(8)
  Copy Pred SU(7)
  Copy Pred SU(4)
  Copy Pred SU(2)
  Copy Pred SU(3)
  Curr cluster length: 2, Curr cluster bytes: 8
Final:
*** Final schedule for %bb.0 ***
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   %0:vgpr_32(s32) = COPY $vgpr0
SU(2):   %4:sgpr_128 = S_LOAD_DWORDX4_IMM %1:sgpr_64(p4), 36, 0, 0 :: (dereferenceable invariant load 16 from %ir.1, align 4, addrspace 4)
SU(3):   %14:sreg_64_xexec = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 52, 0, 0 :: (dereferenceable invariant load 8 from %ir.1 + 16, align 4, addrspace 4)
SU(4):   %16:vgpr_32 = V_LSHLREV_B32_e32 3, %0:vgpr_32(s32), implicit $exec
SU(5):   %20:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(6):   %18:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
SU(9):   %20:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %14:sreg_64_xexec, %16:vgpr_32, 4, 0, 0, 0, %20:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep1 + 4, align 4, addrspace 1)
SU(10):   %21:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %14:sreg_64_xexec, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep1, addrspace 1)
SU(7):   %18:vgpr_32 = GLOBAL_LOAD_SHORT_D16_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 4, 0, 0, 0, %18:vgpr_32(tied-def 0), implicit $exec :: (load 2 from %ir.gep0 + 4, align 4, addrspace 1)
SU(8):   %19:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %4.sub2_sub3:sgpr_128, %16:vgpr_32, 0, 0, 0, 0, implicit $exec :: (load 4 from %ir.gep0, addrspace 1)
SU(11):   %22:vgpr_32 = V_PK_MAX_I16 8, %19:vgpr_32, 8, %21:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(12):   %23:vgpr_32 = V_PK_MAX_I16 8, %18:vgpr_32, 8, %20:vgpr_32, 0, 0, 0, 0, 0, implicit $exec
SU(13):   GLOBAL_STORE_SHORT_SADDR %16:vgpr_32, %23:vgpr_32, %4.sub0_sub1:sgpr_128, 4, 0, 0, 0, implicit $exec :: (store 2 into %ir.outgep + 4, align 4, addrspace 1)
SU(14):   GLOBAL_STORE_DWORD_SADDR %16:vgpr_32, %22:vgpr_32, %4.sub0_sub1:sgpr_128, 0, 0, 0, 0, implicit $exec :: (store 4 into %ir.outgep, addrspace 1)
steven.zhang added inline comments.Aug 18 2020, 3:15 AM
llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
230

Sure. Done.

Gentle ping ...

foad accepted this revision.Aug 25 2020, 2:24 AM

I've tested this patch by compiling several thousand shaders with the AMDGPU backend and analysing the generated code. It gives a nice ~ 4% increase in the average size of a load cluster (or equivalently a ~ 4% decrease in the total number of clusters, given that the total number of loads is unchanged).

I've tested this again and it now gives a ~ 0.25% increase in the amount of clustering. LGTM.

This revision is now accepted and ready to land.Aug 25 2020, 2:24 AM
This revision was landed with ongoing or failed builds.Aug 26 2020, 5:34 AM
This revision was automatically updated to reflect the committed changes.