Page MenuHomePhabricator

[AMDGPU] Bundle loads before post-RA scheduler
ClosedPublic

Authored by rampitec on Jan 14 2020, 2:49 PM.

Details

Summary

We are relying on atrificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time since it
allows to schedule a completely independent instruction in the
middle of the cluster.

Removed the DAG mutation and added pass to bundle already
clustered instructions. These bundles are unpacked before the
memory legalizer because it does not work with bundles but also
because it allows to insert waitcounts in the middle of a store
cluster.

Removing artificial edges also allows a more relaxed scheduling.

Diff Detail

Event Timeline

rampitec created this revision.Jan 14 2020, 2:49 PM

Is this just working around the post-RA scheduler? Can we finally replace the post-RA scheduler with misched?

Is this just working around the post-RA scheduler? Can we finally replace the post-RA scheduler with misched?

Post RA scheduler in fact works pretty well. That is not really clear if we should replace it, then in any way we likely want a different set of heuristics than in pre-RA scheduler. The pre-RA shall deal with RP, ILP and clustering, where post-RA needs to deal with more subtle aspects of SQ. We just cannot address all of these (often opposite) heuristics in the same algorithm. But for sure it shall not undo clustering done pre-RA.

Like you say the machine scheduler can choose not to prioritize cluster edges, so this is much more restrictive. We should be really sure we always want this clustering. There are probably cases where the cost/benefit is not in favor of bundling, but it's not easy to predict.

By the way I think we actually do try to insert waitcnt into bundles. I wonder if it would be worth it to just update the memory legalizer to work with bundles as well.

rampitec planned changes to this revision.Jan 14 2020, 3:54 PM

I have found a bug in the logic. Will update the review.

Actually I misunderstood the change because of the title. I see now that the loads are bundled before post-RA scheduler.

rampitec retitled this revision from [AMDGPU] Bundle loads before pre-RA scheduler to [AMDGPU] Bundle loads before post-RA scheduler.Jan 14 2020, 4:19 PM

Actually I misunderstood the change because of the title. I see now that the loads are bundled before post-RA scheduler.

Ough, thanks! Fixed the title.

By the way I think we actually do try to insert waitcnt into bundles. I wonder if it would be worth it to just update the memory legalizer to work with bundles as well.

I saw waitcount inserted after the bundle if I do not unpack them, so probably we do not do it always if we really do.

rampitec updated this revision to Diff 238140.Jan 14 2020, 4:42 PM

Fixed bug in the bundling logic and added test for produced bundles and when do they break.

By the way I think we actually do try to insert waitcnt into bundles. I wonder if it would be worth it to just update the memory legalizer to work with bundles as well.

I saw waitcount inserted after the bundle if I do not unpack them, so probably we do not do it always if we really do.

Yeah maybe. I remember it was added here: c04aab9c0646461bc187808920b3d5ee7f5cc5ab
Was the waitcnt after the bundle that you saw in the correct place?

By the way I think we actually do try to insert waitcnt into bundles. I wonder if it would be worth it to just update the memory legalizer to work with bundles as well.

I saw waitcount inserted after the bundle if I do not unpack them, so probably we do not do it always if we really do.

Yeah maybe. I remember it was added here: c04aab9c0646461bc187808920b3d5ee7f5cc5ab
Was the waitcnt after the bundle that you saw in the correct place?

Yes, it was correct. It was just vmcnt(0) because we could not tear apart load bundle and need all the loaded values in a store bundle. When I unpack it each store was waiting for vmcnt(3) which is better.

foad added a comment.Jan 15 2020, 1:42 AM

There are nice changes in a bunch of tests, where we're preserving clusters instead of breaking them apart.

But there are also strange changes in some other tests, where the clustering hasn't changed, but some instructions that use the result of a load have moved around. Does this mean we're getting the latency of the load wrong now? (Or were we getting it wrong before?) For example:
insert_vector_elt
llvm.maxnum.f16.ll
saddo.ll
sign_extend.ll

llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
280

Nice.

llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll
188 ↗(On Diff #238144)

What happened here? Has some cost estimate changed because of the bundling? Can we fix it?

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
278

Nice.

llvm/test/CodeGen/AMDGPU/idot2.ll
2677

Nice.

2797

Is this just coincidence, or are we actually trying to cluster a FLAT load with an SMEM load?

llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll
582–583

Nice.

llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll
148–149

Nice.

llvm/test/CodeGen/AMDGPU/memory_clause.ll
8

Nice.

llvm/test/CodeGen/AMDGPU/shl.ll
166

Nice.

llvm/test/CodeGen/AMDGPU/shl.v2i16.ll
149–150

Nice.

There are nice changes in a bunch of tests, where we're preserving clusters instead of breaking them apart.

But there are also strange changes in some other tests, where the clustering hasn't changed, but some instructions that use the result of a load have moved around. Does this mean we're getting the latency of the load wrong now? (Or were we getting it wrong before?) For example:
insert_vector_elt
llvm.maxnum.f16.ll
saddo.ll
sign_extend.ll

We have moved uses of loaded values further from their loads, which is good. As far as I understand these changes are inducted by the removal of artificial edges which were created by MemOpClusterMutation. These edges were linking successors of any load to all the nodes in a cluster and restricted the scheduling.
In sign_extend.ll that is because of the store clustering, we have moved v_ashrrev_i32_e32 producing v2 past v_ashrrev_i32_e32 producing v3 because store cluster uses them in this order. Before it was harder to do because of the artificial edges linking all predecessors to all stores.

rampitec marked 2 inline comments as done.Jan 15 2020, 8:45 AM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll
188 ↗(On Diff #238144)

It was duplicated by Branch Probability Basic Block Placement immediately after the post-RA scheduler.
It is now duplicated because of -tail-dup-placement-threshold default value of 2. If you use 3 it will be duplicated w/o bundling.
That is because TailDuplicator::shouldTailDuplicate() simply count instructions and compare against the threshold: https://llvm.org/doxygen/TailDuplicator_8cpp_source.html#l00622

It can be fixed in a separate follow-up patch to add a bundle's size if it is a bundle, I am not sure if it may affect other targets or not.

llvm/test/CodeGen/AMDGPU/idot2.ll
2797

No, we do not:

BUNDLE implicit-def $vgpr2, implicit-def $vgpr0, implicit killed $vgpr2_vgpr3, implicit $exec, implicit killed $vgpr0_vgpr1 {
  renamable $vgpr2 = GLOBAL_LOAD_USHORT killed renamable $vgpr2_vgpr3, 0, 0, 0, 0, implicit $exec :: (load 2 from %ir.2, addrspace 1)
  renamable $vgpr0 = GLOBAL_LOAD_USHORT killed renamable $vgpr0_vgpr1, 0, 0, 0, 0, implicit $exec :: (load 2 from %ir.3, addrspace 1)
}
renamable $sgpr0 = S_LOAD_DWORD_IMM renamable $sgpr4_sgpr5, 0, 0, 0 :: (load 4 from %ir.4, addrspace 1)

That is because of removed mutation again I guess. Anyway, this is a better schedule because global loads take longer than SMRD.

foad added a comment.Jan 15 2020, 9:17 AM

We have moved uses of loaded values further from their loads, which is good. As far as I understand these changes are inducted by the removal of artificial edges which were created by MemOpClusterMutation. These edges were linking successors of any load to all the nodes in a cluster and restricted the scheduling.
In sign_extend.ll that is because of the store clustering, we have moved v_ashrrev_i32_e32 producing v2 past v_ashrrev_i32_e32 producing v3 because store cluster uses them in this order. Before it was harder to do because of the artificial edges linking all predecessors to all stores.

OK, that sounds plausible, thanks!

rampitec marked an inline comment as done.Jan 15 2020, 9:40 AM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll
188 ↗(On Diff #238144)

Apparently that does not affect any other target: D72783

rampitec updated this revision to Diff 238308.Jan 15 2020, 10:05 AM

Rebased on top of D72783 to prevent tail duplication of a bundle.

rampitec updated this revision to Diff 238874.Jan 17 2020, 1:30 PM

Prevent bundling of loads with overlapping destinations. They are dependent.

vpykhtin accepted this revision.Jan 21 2020, 9:12 AM

LGTM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1300

just out of curiosity: why is this needed?

This revision is now accepted and ready to land.Jan 21 2020, 9:12 AM
rampitec marked an inline comment as done.Jan 21 2020, 10:32 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1300

We need to set it when we forming bundle to model the fact a single bundle instruction reads a register which is defined inside the same instruction. Therefore, when we unpack the bundle this flag needs to be reversed.

foad added inline comments.Jan 24 2020, 3:54 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1300

You mean when the address of one load depends on the result of another load? Surely we shouldn't be bundling loads that are dependent like that?

rampitec marked an inline comment as done.Jan 24 2020, 8:00 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1300

Yes, we do not. But I really cannot rely on that when unpacking.

This revision was automatically updated to reflect the committed changes.