This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Omit unnecessary waitcnt before barriers
ClosedPublic

Authored by kerbowa on Feb 25 2022, 12:54 AM.

Details

Summary

It is not necessary to wait for all outstanding memory operations before
barriers on hardware that can back off of the barrier in the event of an
exception when traps are enabled. Add a new subtarget feature which
tracks which HW has this ability.

Diff Detail

Event Timeline

kerbowa created this revision.Feb 25 2022, 12:54 AM
kerbowa requested review of this revision.Feb 25 2022, 12:55 AM
Herald added a project: Restricted Project. · View Herald Transcript · Feb 25 2022, 12:55 AM

The test waitcnt-vscnt.ll is kind of muddied by this patch. It was relying on the behavior that s_barrier always adds an s_waitcnt_vscnt if there are any outstanding events. The singlethread fences were not actually doing anything, so I've added workgroup fences instead. With workgroup fences we currently always add the waitcnt. After my memory legalizer/waitcnt overhaul these will be optimized and the test will look as it did before. Consider this a precommit of this test for now.
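For illustration, a minimal sketch of the difference (the kernel names are made up, not taken from the test):

  declare void @llvm.amdgcn.s.barrier()

  ; A singlethread fence orders nothing with respect to other waves and lowers
  ; to no instructions, so it exercised nothing at the barrier.
  define amdgpu_kernel void @singlethread_fence(i32 addrspace(1)* %out) {
    store i32 0, i32 addrspace(1)* %out
    fence syncscope("singlethread") release
    call void @llvm.amdgcn.s.barrier()
    ret void
  }

  ; A workgroup fence requires the outstanding store to be made visible, so a
  ; wait is still emitted before the barrier.
  define amdgpu_kernel void @workgroup_fence(i32 addrspace(1)* %out) {
    store i32 0, i32 addrspace(1)* %out
    fence syncscope("workgroup") release
    call void @llvm.amdgcn.s.barrier()
    ret void
  }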

foad added a comment.Feb 25 2022, 1:29 AM

The test waitcnt-vscnt.ll is kind of muddied by this patch. It was relying on the behavior that s_barrier always adds an s_waitcnt_vscnt if there are any outstanding events. The singlethread fences were not actually doing anything, so I've added workgroup fences instead. With workgroup fences we currently always add the waitcnt. After my memory legalizer/waitcnt overhaul these will be optimized and the test will look as it did before. Consider this a precommit of this test for now.

Can you split this into two patches: first change the test, and then rebase this patch?

foad added inline comments.Feb 25 2022, 1:33 AM
llvm/lib/Target/AMDGPU/GCNSubtarget.h
498

Stray " character

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1139

Typo "subatarget"

llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir
38

Isn't this "waitcnt 1" redundant, since we have just done a "waitcnt 0" and then issued one store? Or am I completely misunderstanding?

arsenm added inline comments.Feb 25 2022, 7:22 AM
llvm/lib/Target/AMDGPU/AMDGPU.td
743

I don't understand what "backing off" means here

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1143

Is this really a distinct feature if it's the same check as auto waitcnt?

kerbowa added inline comments.Feb 25 2022, 11:51 AM
llvm/lib/Target/AMDGPU/AMDGPU.td
743

Ya, I cannot think of a great name for this. Suggestions are welcome.

The SPG talks about it in these terms, but also uses "early barrier exit".

The idea is that HW may need to "leave but not satisfy the barrier" to handle address-watch or MEMVIOL returned by in-flight memory ops.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1143

It's not the same as auto waitcnt. There are distinct subtargets that support each feature. This check is just saying we don't need an explicit waitcnt before barriers under any circumstances if there is an implicit wait by HW.

We don't actually use the auto waitcnt subtarget feature since it can be configured dynamically by HW. I was debating removing it entirely.

rampitec added inline comments.Feb 25 2022, 11:58 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1143

But then a waitcnt is still needed after the barrier? I.e., it does not mean that the barrier waits for all outstanding memory operations, right?

kerbowa added inline comments.Feb 25 2022, 12:13 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1143

If auto waitcnt is enabled, I believe it does mean waiting for all outstanding memory operations. However, the pass will still add a redundant waitcnt after the barrier because the auto waitcnt feature is not optimized. But the feature is not used right now AFAIK.

If the subtarget has the "BackOffBarrier" feature, then waitcnts are still needed after the barrier.

Normally barriers are paired with fences which may also enforce having waitcnt before the barrier. This change allows more flexibility and optimization in the cases where HW does not require waiting before barriers.

kerbowa updated this revision to Diff 411501.Feb 25 2022, 12:44 PM

Fix typo and rebase.

kerbowa marked an inline comment as done.Feb 25 2022, 12:48 PM
kerbowa added inline comments.
llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir
38

You are right that the waitcnt is redundant, but this is testing that waitcnts inserted by the memory legalizer pass are left as-is. This test will change when we attempt to optimize this.

kerbowa updated this revision to Diff 412935.Mar 3 2022, 10:42 PM
kerbowa marked an inline comment as done.

Update test to use -mattr.

Forgot that there is a bug (feature?) where disabling a feature that
is in a "SubtargetFeatureGeneration" class disables ALL of the
features in that definition. Add BackOffBarrier to each gfx10
definition individually so that I can use -mattr in the test.

Herald added a project: Restricted Project. · View Herald Transcript · Mar 3 2022, 10:42 PM
rampitec accepted this revision.Mar 4 2022, 11:15 AM

LGTM, but could you please also add it to gfx940?

This revision is now accepted and ready to land.Mar 4 2022, 11:15 AM
This revision was landed with ongoing or failed builds.Mar 7 2022, 8:25 AM
This revision was automatically updated to reflect the committed changes.
mceier added a subscriber: mceier.Apr 18 2022, 2:25 AM

This revision causes GPU hangs (Radeon 5700 XT) on Linux in many OpenGL apps. For example, for Godot I get something like this in dmesg:

[141585.653283] [drm:amdgpu_dm_commit_planes] *ERROR* Waiting for fences timed out!
[141587.702251] [drm:amdgpu_dm_commit_planes] *ERROR* Waiting for fences timed out!
[141590.783147] [drm:amdgpu_job_timedout] *ERROR* ring gfx_0.0.0 timeout, signaled seq=13399090, emitted seq=13399091
[141590.783726] [drm:amdgpu_job_timedout] *ERROR* Process information: process godot.x11.opt.t pid 527100 thread godot.x11.:cs0 pid 527120
[141590.783991] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[141591.148991] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring kiq_2.1.0 test failed (-110)
[141591.149000] [drm:gfx_v10_0_hw_fini] *ERROR* KGQ disable failed
[141591.334553] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring kiq_2.1.0 test failed (-110)
[141591.334559] [drm:gfx_v10_0_hw_fini] *ERROR* KCQ disable failed
[141591.521657] [drm:gfx_v10_0_cp_gfx_enable.isra.0] *ERROR* failed to halt cp gfx
[141591.528453] [drm] free PSP TMR buffer
[141591.560524] CPU: 0 PID: 492277 Comm: kworker/u8:1 Not tainted 5.18.0-rc2-x86_64+ #1
[141591.560531] Hardware name: Gigabyte Technology Co., Ltd. Z170X-Gaming 7/Z170X-Gaming 7, BIOS F22j 01/11/2018
[141591.560534] Workqueue: amdgpu-reset-dev drm_sched_job_timedout
[141591.560540] Call Trace:
[141591.560541]  <TASK>
[141591.560543]  show_stack+0x52/0x58
[141591.560548]  dump_stack_lvl+0x49/0x5e
[141591.560551]  dump_stack+0x10/0x12
[141591.560553]  amdgpu_do_asic_reset+0x24/0x43b
[141591.560557]  amdgpu_device_gpu_recover_imp.cold+0x660/0x755
[141591.560560]  amdgpu_job_timedout+0x14f/0x180
[141591.560563]  ? set_next_entity+0xe1/0x160
[141591.560567]  drm_sched_job_timedout+0x6d/0x100
[141591.560569]  ? trace_hardirqs_on+0x37/0xf0
[141591.560573]  process_one_work+0x216/0x3f0
[141591.560576]  worker_thread+0x50/0x3e0
[141591.560578]  ? rescuer_thread+0x380/0x380
[141591.560580]  kthread+0xfc/0x120
[141591.560583]  ? kthread_complete_and_exit+0x20/0x20
[141591.560586]  ret_from_fork+0x22/0x30
[141591.560590]  </TASK>
[141591.560788] amdgpu 0000:03:00.0: amdgpu: BACO reset
[141593.710773] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
...

git bisect log:

# bad: [6730b44480fcce18bfbbae0c46719250e9eae425] [HIP] Fix HIP include path
# good: [9b740c035c8b3e1a15effbfa73960fea259fd27d] Update normalizeAffineFor to canonicalize maps/operands before using them
git bisect start '--no-checkout' '6730b44480fcce18bfbbae0c46719250e9eae425' '9b740c035c8b3e1a15effbfa73960fea259fd27d' '--' 'llvm' 'cmake' 'third-party'
# bad: [9119eefe5fe513f188f8a2b8d5346ac0ce72d3f3] [X86] Add cheapX86FSETCC_SSE helper. NFC.
git bisect bad 9119eefe5fe513f188f8a2b8d5346ac0ce72d3f3
# bad: [f46fa4de4a95329b74ba0bb4ce9623b64ad84876] Revert "[clang][debug] port clang-cl /JMC flag to ELF"
git bisect bad f46fa4de4a95329b74ba0bb4ce9623b64ad84876
# bad: [0c2b43ab8cb1067dd1c7899094b824890803a7d2] [X86] Fix MCSymbolizer interface for X86Disassembler
git bisect bad 0c2b43ab8cb1067dd1c7899094b824890803a7d2
# bad: [e1069c1288d151f36f37c4c616b78b7b0a1e3a50] [AMDGPU] Ensure return address is save/restored if clobbered or when function has calls
git bisect bad e1069c1288d151f36f37c4c616b78b7b0a1e3a50
# good: [eadd1668d05df582281061460089562cfb6b3a90] update_analyze_test_checks.py: fix UTC_ARGS handling
git bisect good eadd1668d05df582281061460089562cfb6b3a90
# good: [20c4664552e2e5c1d85db13d3568b7d4a3e843ef] [gn build] Port 205557c908ff
git bisect good 20c4664552e2e5c1d85db13d3568b7d4a3e843ef
# bad: [8d0c34fd4fb66ea0d19563154a59658e4b7f35d4] [AMDGPU] Omit unnecessary waitcnt before barriers
git bisect bad 8d0c34fd4fb66ea0d19563154a59658e4b7f35d4
# first bad commit: [8d0c34fd4fb66ea0d19563154a59658e4b7f35d4] [AMDGPU] Omit unnecessary waitcnt before barriers

After reverting 8d0c34fd4fb66ea0d19563154a59658e4b7f35d4 on 4ffd0b6fde4da9a3ba4ee3a189504ce84c118b1c the GPU hangs disappear.

Steps to reproduce:

  1. Start Godot
  2. Open a project / create a new OpenGL ES project
  3. The hang happens either immediately or after a few seconds of moving the mouse

Can you post the IR and ISA for the failing shader?

Can you post the IR and ISA for the failing shader?

I will try to get them this weekend. Before bisecting I tried running these programs with RADV_DEBUG=hang, but that didn't produce any output.
I will try with MESA_GLSL=dump,log this weekend; hopefully that will capture something.

Reverted until I can investigate the failures.

Can you post the IR and ISA for the failing shader?

I ran Godot with the MESA_SHADER_CACHE_DISABLE=true RADV_DEBUG=shaders AMD_DEBUG=vs,ps,preoptir environment variables set. Are these logs enough?

  • LLVM Without Revert (hangs GPU): dmesg:
  • LLVM With Revert:

Unfortunately I don't know how to pinpoint exactly which shader is at fault. I only noticed that s_waitcnt is not emitted before s_barrier in the logs from the run that hangs.

If these logs are not enough, I will probably need detailed instructions on how to get the IR/ISA files...
I know I can dump the shaders' GLSL source by setting MESA_SHADER_DUMP_PATH, but I don't think those files are useful.

I don't see how this change could possibly have ever been correct. Whether backoff is supported or not, in most cases there will be no backoff, and in that case, the waitcnt is required.

Oh, I misunderstood what this patch does. It's possibly an unfortunate interaction between Mesa and LLVM, where Mesa assumed that the waitcnt was there, and your removal of it causes problems. In that case, can you please coordinate with Mesa to make sure it starts adding the required fence instructions in the LLVM IR? Cc @mareko

Oh, I misunderstood what this patch does. It's possibly an unfortunate interaction between Mesa and LLVM, where Mesa assumed that the waitcnt was there, and your removal of it causes problems. In that case, can you please coordinate with Mesa to make sure it starts adding the required fence instructions in the LLVM IR? Cc @mareko

I'm interested in what to do in Mesa. Do we just need to add LLVMBuildFence(... LLVMAtomicOrderingRelease ...) before amdgcn.s.barrier?

I believe HIP adds a release fence before the barrier and an acquire directly after. Would something like that make sense?

I believe HIP adds a release fence before the barrier and an acquire directly after. Would something like that make sense?

Yes, generally I expect that is what's required.
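In IR the pattern would look roughly like this (a minimal sketch, with a made-up kernel name):

  declare void @llvm.amdgcn.s.barrier()

  define amdgpu_kernel void @workgroup_sync() {
    ; Make this wave's prior writes visible to the rest of the workgroup ...
    fence syncscope("workgroup") release
    ; ... rendezvous ...
    call void @llvm.amdgcn.s.barrier()
    ; ... then make the other waves' writes visible to this wave.
    fence syncscope("workgroup") acquire
    ret void
  }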

mareko added a comment.May 2 2022, 3:00 PM

I believe HIP adds a release fence before the barrier and an acquire directly after. Would something like that make sense?

Yes, generally I expect that is what's required.

What does the release fence add? Waiting for only LDS or everything?

mareko added a comment.May 2 2022, 4:11 PM

We are going to insert amdgcn.s.waitcnt instead of a fence because fences wait for memory stores, which we don't want.

We are going to insert amdgcn.s.waitcnt instead of a fence because fences wait for memory stores, which we don't want.

Without this patch, the compiler will wait for everything, so I'm not sure how you will be differentiating between loads and stores.

What does the release fence add? Waiting for only LDS or everything?

It depends on the scope of the fence and the HW. See: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model

nhaehnle added a comment. Edited May 3 2022, 5:32 AM

That document is pretty useless, I have to say. There is also a very obvious gap here in the expressiveness of the fence instruction which Marek implicitly pointed out: it should be possible to tell the fence which memory types should be affected. syncscope is not enough for that. @t-tye

We really need a change that allows us to put a "memory kind" modifier on fence instructions. This would correspond to the {Uniform,Workgroup,...}Memory bits in SPIR-V MemorySemantics fields.

mareko added a comment. Edited May 3 2022, 9:26 AM

The IR (NIR) that we translate to LLVM typically contains this sequence of instructions:

  1. memory_barrier (fence)
  2. control_barrier (s_barrier)

Either of them is optional, which means a memory barrier can occur without a control barrier, and vice versa. SPIR-V also works like this.

The memory barrier (fence) has the expressive power to wait for one or more of these:

  • all shared memory opcodes
  • only HS output shared memory opcodes
  • all memory opcodes
  • only global memory opcodes (not buffer or image)
  • only buffer opcodes
  • only image opcodes
  • only buffer opcodes using shader storage buffer object variables (not other buffers even if aliased)
  • only buffer/image opcodes using image variables (image variables can also access buffers, but shouldn't include any other buffers even if aliased)
  • fences never wait for opcodes writing into write-only buffers
  • fences never wait for opcodes reading from read-only buffers and images

What the LLVM fence can do:

  • everything or nothing
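As a sketch of the gap (made-up names, not from any actual frontend): a shader that only needs to order LDS and one that needs to order all memory both have to emit the same fence, because ordering and syncscope are the fence's only knobs.

  @lds = addrspace(3) global i32 undef

  declare void @llvm.amdgcn.s.barrier()

  define amdgpu_kernel void @lds_only_intent(i32 addrspace(1)* %buf) {
    store i32 1, i32 addrspace(3)* @lds   ; shared (LDS) store
    store i32 2, i32 addrspace(1)* %buf   ; global/buffer store
    ; Even when the source barrier only asked for shared-memory ordering, this
    ; fence orders both stores above; there is no operand to narrow it to LDS.
    fence syncscope("workgroup") release
    call void @llvm.amdgcn.s.barrier()
    ret void
  }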

Thank you for the summary.

  • only HS output shared memory opcodes

This one is tricky.

  • all memory opcodes
  • only global memory opcodes (not buffer or image)
  • only buffer opcodes
  • only image opcodes
  • only buffer opcodes using shader storage buffer object variables (not other buffers if not aliased)
  • only buffer/image opcodes using image variables (image variables can also access buffers, but shouldn't include any other buffers if not aliased)

I suppose the only difference this makes here is potentially in the counter value we wait for? I.e., we may wait for vmcnt(N) with N != 0?

  • fences never wait for opcodes writing into write-only buffers
  • fences never wait for opcodes reading from read-only buffers and images

IMO this one should be modeled like private/nonprivate in Vulkan. We had some discussions on this internally but not enough pressure to actually make it happen.

We should make all of this happen. I'm thinking in the direction of a memoryscopes operand, somewhat analogous to syncscope, except that while syncscope captures the set of threads we're potentially communicating with, memoryscope would capture the set of memory through which we're potentially communicating.

I suppose the only difference this makes here is potentially in the counter value we wait for? I.e., we may wait for vmcnt(N) with N != 0?

LLVM IR can track everything accurately, and the compiler can choose what to do with it. The hw isn't that flexible, but it could either wait for nothing if that memory scope is not busy, vmcnt(N) with N > 0, or vmcnt(0).

I've changed the aliasing rules in my comment such that waiting for a memory scope shouldn't wait for any other memory scope even if they alias (the memory barrier is a user's choice).

JFYI, I'm planning to reintroduce this change in the next few weeks. Before this change, the compiler would ALWAYS wait for outstanding VMEM/LGKM at barriers. There is no HW requirement for this. As an optimization, we will be omitting these waitcnt on Navi/MI200. In order to retain the same behavior as before the change, fences must be added before barriers.