Page MenuHomePhabricator

[MachineScheduler] relax successor chain on clustering
AbandonedPublic

Authored by rampitec on Jan 27 2020, 2:50 PM.

Details

Summary

BaseMemOpClusterMutation::apply forms store chains by looking for
control dependencies from one mem op to another.

In the test case, clusterNeighboringMemOps successfully clusters the
loads, and then adds artificial edges to the loads' successors.

The effect of this is that data dependencies from one load to a store
are copied as artificial dependencies from a different load to the
same store.

Then when BaseMemOpClusterMutation::apply looks at the stores, it finds
that some of them have a control dependency on a previous load, which
breaks the chains and means that the stores are not all considered part
of the same chain and won't all be clustered.

In fact to prevent scheduling of a successor of any load in a cluster
inside the cluster that is sufficient to transfer all successors only
to the last node in the chain. That last node becomes a sentinel for
the whole cluster, but no extra dependencies are created.

Diff Detail

Event Timeline

rampitec created this revision.Jan 27 2020, 2:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2020, 2:50 PM
foad added a comment.Jan 28 2020, 3:25 AM

This doesn't fix the problem that inspired D71717. Consider the first test case in memory_clause.ll. With baseline llvm I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(11) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness
[...]

Your patch doesn't change this, but with D71717 I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(10) - SU(11)
Cluster ld/st SU(12) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness
[...]

So now it is considering all of SU(10) .. SU(13) for clustering, instead of just SU(11) and SU(13). The relevant SUs are:

SU(6):   %12:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %26:vreg_64, %4:sreg_64_xexec, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.tmp3, addrspace 1)
SU(7):   %15:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %26:vreg_64, %4:sreg_64_xexec, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.tmp72, addrspace 1)
SU(8):   %17:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %26:vreg_64, %4:sreg_64_xexec, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.tmp116, addrspace 1)
SU(9):   %19:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %26:vreg_64, %4:sreg_64_xexec, 48, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.tmp1510, addrspace 1)
SU(10):   GLOBAL_STORE_DWORDX4_SADDR %26:vreg_64, %12:vreg_128, %5:sreg_64_xexec, 0, 0, 0, 0, implicit $exec, implicit $exec :: (store 16 into %ir.tmp5, addrspace 1)
SU(11):   GLOBAL_STORE_DWORDX4_SADDR %26:vreg_64, %15:vreg_128, %5:sreg_64_xexec, 16, 0, 0, 0, implicit $exec, implicit $exec :: (store 16 into %ir.tmp94, addrspace 1)
SU(12):   GLOBAL_STORE_DWORDX4_SADDR %26:vreg_64, %17:vreg_128, %5:sreg_64_xexec, 32, 0, 0, 0, implicit $exec, implicit $exec :: (store 16 into %ir.tmp138, addrspace 1)
SU(13):   GLOBAL_STORE_DWORDX4_SADDR %26:vreg_64, %19:vreg_128, %5:sreg_64_xexec, 48, 0, 0, 0, implicit $exec, implicit $exec :: (store 16 into %ir.tmp1712, addrspace 1)
foad added a comment.Jan 28 2020, 3:32 AM

Typo in summary: successfor -> successor ?

rampitec retitled this revision from [MachineScheduler] relax successfor chain on clustering to [MachineScheduler] relax successor chain on clustering.Jan 28 2020, 10:07 AM

This doesn't fix the problem that inspired D71717. Consider the first test case in memory_clause.ll. With baseline llvm I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(11) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

My problem is I cannot reproduce it. All I have is

# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

It just does not try to cluster all global loads and stores at all! It also does not do it even with D71717.
I will debug it...

This doesn't fix the problem that inspired D71717. Consider the first test case in memory_clause.ll. With baseline llvm I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(11) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

My problem is I cannot reproduce it. All I have is

# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

It just does not try to cluster all global loads and stores at all! It also does not do it even with D71717.
I will debug it...

I likely need to have some of your changes too, but I have reproduced it without -amdgpu-enable-global-sgpr-addr.

With master and my change, llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs < vector_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|^SU\(.*GLOBAL_"

Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(8) - SU(12)
Cluster ld/st SU(13) - SU(14)
Cluster ld/st SU(16) - SU(18)
SU(8):   %12:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 0, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp3, addrspace 1)
SU(12):   %15:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 16, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp72, addrspace 1)
SU(13):   %17:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 32, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp116, addrspace 1)
SU(14):   %19:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 48, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp1510, addrspace 1)
SU(15):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %12:vreg_128, 0, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp5, addrspace 1)
SU(16):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %15:vreg_128, 16, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp94, addrspace 1)
SU(17):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %17:vreg_128, 32, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp138, addrspace 1)
SU(18):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %19:vreg_128, 48, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp1712, addrspace 1)

With D71717:

Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(8) - SU(12)
Cluster ld/st SU(13) - SU(14)
Cluster ld/st SU(15) - SU(16)
Cluster ld/st SU(17) - SU(18)
SU(8):   %12:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 0, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp3, addrspace 1)
SU(12):   %15:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 16, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp72, addrspace 1)
SU(13):   %17:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 32, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp116, addrspace 1)
SU(14):   %19:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 48, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp1510, addrspace 1)
SU(15):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %12:vreg_128, 0, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp5, addrspace 1)
SU(16):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %15:vreg_128, 16, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp94, addrspace 1)
SU(17):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %17:vreg_128, 32, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp138, addrspace 1)
SU(18):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %19:vreg_128, 48, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp1712, addrspace 1)

Even with D71717 it does not do desired clustering, so both changes are insufficient.

Yeah, it was because our threshold is too low. I have increased it and I see my change misses one cluster which D71717 does. This change:

Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(8) - SU(12)
Cluster ld/st SU(12) - SU(13)
Cluster ld/st SU(13) - SU(14)
Cluster ld/st SU(15) - SU(16)
Cluster ld/st SU(16) - SU(17)
SU(8):   %12:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 0, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp3, addrspace 1)
SU(12):   %15:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 16, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp72, addrspace 1)
SU(13):   %17:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 32, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp116, addrspace 1)
SU(14):   %19:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 48, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp1510, addrspace 1)
SU(15):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %12:vreg_128, 0, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp5, addrspace 1)
SU(16):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %15:vreg_128, 16, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp94, addrspace 1)
SU(17):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %17:vreg_128, 32, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp138, addrspace 1)
SU(18):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %19:vreg_128, 48, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp1712, addrspace 1)

D71717:

Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(8) - SU(12)
Cluster ld/st SU(12) - SU(13)
Cluster ld/st SU(13) - SU(14)
Cluster ld/st SU(15) - SU(16)
Cluster ld/st SU(16) - SU(17)
Cluster ld/st SU(17) - SU(18)
SU(8):   %12:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 0, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp3, addrspace 1)
SU(12):   %15:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 16, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp72, addrspace 1)
SU(13):   %17:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 32, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp116, addrspace 1)
SU(14):   %19:vreg_128 = GLOBAL_LOAD_DWORDX4 %38:vreg_64, 48, 0, 0, 0, implicit $exec :: (load 16 from %ir.tmp1510, addrspace 1)
SU(15):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %12:vreg_128, 0, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp5, addrspace 1)
SU(16):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %15:vreg_128, 16, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp94, addrspace 1)
SU(17):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %17:vreg_128, 32, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp138, addrspace 1)
SU(18):   GLOBAL_STORE_DWORDX4 %28:vreg_64, %19:vreg_128, 48, 0, 0, 0, implicit $exec :: (store 16 into %ir.tmp1712, addrspace 1)

This last store escaped in my version.

After some meditation in gdb I have found that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. D71717 just skips some SDeps preferring another one. More or less we are lucky to find a correct SDep which may form a useful chain. We may be not that lucky if order of the SDeps is different and we would somehow use another register (for example data register). We probably need a callback to check if that register belongs to pointer operand and skip it otherwise.

Meanwhile D71717 looks good as artificial edge shall not create a proper chain anyway. This change I will likely abandon as it will not produce any visible impact if D71717 is submitted. Its only impact is the reduction of artificial edges.

rampitec abandoned this revision.Feb 14 2020, 4:01 PM