This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] New SIInsertHardClauses pass
ClosedPublic

Authored by foad on May 12 2020, 9:13 AM.

Details

Summary

Enable clausing of memory loads on gfx10 by adding a new pass to insert
the s_clause instructions that mark the start of each hard clause.

Diff Detail

Event Timeline

foad created this revision.May 12 2020, 9:13 AM
foad updated this revision to Diff 263456.May 12 2020, 9:16 AM

clang-format.

rampitec added inline comments.May 12 2020, 10:41 AM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
119

Just break the scan at 64 and restart. Also needed test for this.

125

Wrap the check into GCNSubtarget::hasHardClauses()?
Also add skipFunction() check.

rampitec added inline comments.May 12 2020, 10:44 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1048

Need to check OptLevel probably, the pass is optional.

foad updated this revision to Diff 263498.May 12 2020, 12:52 PM

Address review comments.

foad marked 4 inline comments as done.May 12 2020, 12:55 PM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
119

OK, but my way was simpler.

I tried to add a test (hard-clauses.mir) but this pass is guided by the SIInstrInfo::shouldClusterMemOps heuristic, which never clusters that many loads. I still think this pass should handle it correctly, in case the heuristic ever changes.

arsenm added inline comments.May 12 2020, 12:59 PM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119

Do we need to bundle the claused instructions? What prevents inserting something else between these?

153

Probably should use a named constant for the limit

rampitec added inline comments.May 12 2020, 1:08 PM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119

Looking at the passes which work after this it does not seem anything will be modified here. But you are probably right, just to be on a safer side.

119

This way you will restart new clause after the break, so it is better.

We may also want to add an option to shouldClusterMemOps() controlling a maximum cluster for this test. This option will be useful anyway for clustering experiments.

foad marked 2 inline comments as done.May 12 2020, 1:54 PM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119

I don't know. I haven't noticed any problems in testing. I guess it just relies on running late enough in the pipeline.

Actually depending on shouldCluster is a problem. First soft clauses are formed and then you turn them into hard clauses. Soft clauses impose higher register pressure but then you simply break them here, essentially wasting registers. It is either both passes should be guided by the same heuristic or none.

foad added a comment.May 12 2020, 11:59 PM

Actually depending on shouldCluster is a problem. First soft clauses are formed and then you turn them into hard clauses. Soft clauses impose higher register pressure but then you simply break them here, essentially wasting registers. It is either both passes should be guided by the same heuristic or none.

I don't understand what you're suggesting. In this pass we have to impose a limit of 64 instructions for correctness, not performance, because that's how the s_clause instruction's operand is encoded.

Perhaps we should teach shouldCluster not to bother clausing more than 64 loads, because the s_clause won't be able to honor it, but it already has a much lower limit than that, so I don't think there's any need to change it.

foad updated this revision to Diff 263630.May 13 2020, 12:04 AM

Rebase. Use LAST_REAL_HARDCLAUSE_TYPE.

Actually depending on shouldCluster is a problem. First soft clauses are formed and then you turn them into hard clauses. Soft clauses impose higher register pressure but then you simply break them here, essentially wasting registers. It is either both passes should be guided by the same heuristic or none.

I don't understand what you're suggesting. In this pass we have to impose a limit of 64 instructions for correctness, not performance, because that's how the s_clause instruction's operand is encoded.

Perhaps we should teach shouldCluster not to bother clausing more than 64 loads, because the s_clause won't be able to honor it, but it already has a much lower limit than that, so I don't think there's any need to change it.

In fact it seems clauses are broken every 4 instructions. Which means there is no reason to form up to 15 instructions into a soft clause which will then be broken here. So what I propose: either do not call shallClustetMemOps here, or also call it in SIFormMemoryClauses. I do no really have a preference here, but disagreement results in registers used by soft clauses were used for nothing. Does it make sense? I.e. consider you also have xnack enabled.

JBTW what happens to counters if hard clause exceeds 16 instructions? Every instruction increases the counter and there are only 4 bits available. Do I get it wrong?

foad added a comment.May 13 2020, 2:09 AM

Actually depending on shouldCluster is a problem. First soft clauses are formed and then you turn them into hard clauses. Soft clauses impose higher register pressure but then you simply break them here, essentially wasting registers. It is either both passes should be guided by the same heuristic or none.

I don't understand what you're suggesting. In this pass we have to impose a limit of 64 instructions for correctness, not performance, because that's how the s_clause instruction's operand is encoded.

Perhaps we should teach shouldCluster not to bother clausing more than 64 loads, because the s_clause won't be able to honor it, but it already has a much lower limit than that, so I don't think there's any need to change it.

In fact it seems clauses are broken every 4 instructions. Which means there is no reason to form up to 15 instructions into a soft clause which will then be broken here. So what I propose: either do not call shallClustetMemOps here, or also call it in SIFormMemoryClauses. I do no really have a preference here, but disagreement results in registers used by soft clauses were used for nothing. Does it make sense? I.e. consider you also have xnack enabled.

Using gfx10 terminology: SIFormMemoryClauses deals with restartable groups but SIInsertHardClauses deals with hard clauses. They are similar but not the same. That's what I tried to explain in the big comment at the top of SIInsertHardClauses.cpp. Hard clauses are all about performance. Groups are all about correctness in the presence of XNACK.

shouldClusterMemOps is a heuristic to decide whether loads should be claused for performance. We have to call it from SIInsertHardClauses, so it can tell us not to bother clausing loads that have a different base address.

I'm not at all sure that calling shouldClusterMemOps from SIFormMemoryClauses is a good idea, but in any case I think that should be a separate patch with its own discussion.

JBTW what happens to counters if hard clause exceeds 16 instructions? Every instruction increases the counter and there are only 4 bits available. Do I get it wrong?

I don't know, sorry. Maybe it's one of the conditions that can cause the hardware to break a hard clause?

foad marked an inline comment as done.May 13 2020, 2:41 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119

Would bundling actually fix the problem? What about other code that modifies bundles, like GCNHazardRecognizer::insertNoopInBundle? It feels like there's an arms race where one side invents mechanisms like bundling to stop instructions being inserted, and the other side goes and inserts instructions in the bundles anyway.

Having said that, I do have some patches in progress to remove GCNHazardRecognizer::insertNoopInBundle. But I don't know if there are any other places where bundles could be modified.

arsenm added inline comments.May 13 2020, 7:44 AM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119

I think the hazard recognizer is the only place with a potentially legitimate reason to try inserting anything else in a bundle

foad updated this revision to Diff 263725.May 13 2020, 8:06 AM

Bundle the clauses.

arsenm added inline comments.May 13 2020, 9:15 AM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
115

Start with lowerccase, and should use SIInstrInfo

124–132

You didn't finalize the bundle, so the register operands on the BUNDLE are missing

arsenm added inline comments.May 13 2020, 9:16 AM
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
124–132

Could also use MIBundleBuilder

Using gfx10 terminology: SIFormMemoryClauses deals with restartable groups but SIInsertHardClauses deals with hard clauses. They are similar but not the same. That's what I tried to explain in the big comment at the top of SIInsertHardClauses.cpp. Hard clauses are all about performance. Groups are all about correctness in the presence of XNACK.

SIFormMemoryClauses is an optimization pass. If it does not run nothing breaks, just clauses will be broken by the hazard recognizer. So in fact it does the same thing.

shouldClusterMemOps is a heuristic to decide whether loads should be claused for performance. We have to call it from SIInsertHardClauses, so it can tell us not to bother clausing loads that have a different base address.

Actually a very small limit set by shouldClusterMemOps is driven by the register pressure. There seems to be no reason to call it here at all because this is past RA.
Note that SIFormMemoryClauses does not call it but uses RP tracker to maintain occupancy instead.

foad updated this revision to Diff 263804.May 13 2020, 11:35 AM

Finalize bundles.

foad marked 5 inline comments as done.May 13 2020, 11:36 AM
foad added a comment.May 14 2020, 2:57 AM

shouldClusterMemOps is a heuristic to decide whether loads should be claused for performance. We have to call it from SIInsertHardClauses, so it can tell us not to bother clausing loads that have a different base address.

Actually a very small limit set by shouldClusterMemOps is driven by the register pressure. There seems to be no reason to call it here at all because this is past RA.
Note that SIFormMemoryClauses does not call it but uses RP tracker to maintain occupancy instead.

shouldClusterMemOps decides whether it is beneficial for performance to cluster two loads. For example: two VMEM instructions with the same base address should be clustered, unless one has a sampler and the other doesn't, etc etc. All that sort of logic belongs in shouldClusterMemOps, and we have to call it here because we don't want to insert s_clause instructions if it is not going to improve performance.

Yes, shouldClusterMemOps also imposes a limit on the length of the cluster. If that is *only* to help with register pressure, then perhaps I can bypass that check by always calling it with NumLoads=2 instead of NumLoads=CI.Size+1. What do you think?

Yes, shouldClusterMemOps also imposes a limit on the length of the cluster. If that is *only* to help with register pressure, then perhaps I can bypass that check by always calling it with NumLoads=2 instead of NumLoads=CI.Size+1. What do you think?

That's actually a very good idea. Just add a comment explaining it!

foad updated this revision to Diff 264019.May 14 2020, 9:53 AM

Don't let shouldClusterMemOps limit clause size.

This revision is now accepted and ready to land.May 14 2020, 10:44 AM
This revision was automatically updated to reflect the committed changes.

This pretty much broke the world with radeonsi on Navi 14:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fc20abf155b in __GI_abort () at abort.c:79
#2  0x00007fc20abf142f in __assert_fail_base
    (fmt=0x7fc20ad57b48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fc1ecec6cf6 "CI.Length == std::distance(CI.First->getIterator(), CI.Last->getIterator()) + 1", file=0x7fc1ec73aa57 "/home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp", line=114, function=<optimized out>) at assert.c:92
#3  0x00007fc20ac00092 in __GI___assert_fail
    (assertion=0x7fc1ecec6cf6 "CI.Length == std::distance(CI.First->getIterator(), CI.Last->getIterator()) + 1", file=0x7fc1ec73aa57 "/home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp", line=114, function=0x7fc1ecf1e303 "bool (anonymous namespace)::SIInsertHardClauses::emitClause(const (anonymous namespace)::SIInsertHardClauses::ClauseInfo &, const llvm::SIInstrInfo *)") at assert.c:101
#4  0x00007fc1ef7bfb58 in (anonymous namespace)::SIInsertHardClauses::emitClause((anonymous namespace)::SIInsertHardClauses::ClauseInfo const&, llvm::SIInstrInfo const*) (this=<optimized out>, CI=..., SII=<optimized out>)
    at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp:113
#5  0x00007fc1ef7bf6c7 in (anonymous namespace)::SIInsertHardClauses::runOnMachineFunction(llvm::MachineFunction&) (this=<optimized out>, MF=...)
    at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp:167
#6  0x00007fc1ee1abcad in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (this=0x7fc1c00458f0, F=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:73
#7  0x00007fc1edf953a5 in llvm::FPPassManager::runOnFunction(llvm::Function&) (this=<optimized out>, F=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1482
#8  0x00007fc1eefa2493 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&)
    (this=0x7fc1c0024970, P=0x7fc1c0025020, CurSCC=..., CG=..., CallGraphUpToDate=<optimized out>, DevirtualizedCall=<optimized out>) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:176
#9  (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) (this=0x7fc1c0024970, CurSCC=..., CG=..., DevirtualizedCall=<optimized out>)
    at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:441
#10 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (this=<optimized out>, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:497
#11 0x00007fc1edf95c83 in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) (this=<optimized out>, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1583
#12 llvm::legacy::PassManagerImpl::run(llvm::Module&) (this=0x7fc1c0009670, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1695
#13 0x00007fc1edf9613e in llvm::legacy::PassManager::run(llvm::Module&) (this=<optimized out>, this@entry=0x7fc1c0009650, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1726
#14 0x00007fc1f788eb9f in ac_compile_module_to_elf(ac_compiler_passes*, LLVMModuleRef, char**, size_t*) (p=p@entry=
    0x7fc1c0009610, module=<optimized out>, pelf_buffer=pelf_buffer@entry=0x563be208bc20, pelf_size=pelf_size@entry=0x563be208bc28) at /home/daenzer/src/llvm-git/llvm-project/llvm/include/llvm/IR/Module.h:892
#15 0x00007fc1f77b6c51 in si_compile_llvm
    (sscreen=sscreen@entry=0x563be1f066a0, binary=binary@entry=0x563be208bc20, conf=conf@entry=0x563be208bc38, compiler=compiler@entry=0x563be1f06f30, ac=ac@entry=0x7fc1e8f5e240, debug=debug@entry=0x563be208b5b0, shader_type=PIPE_SHADER_COMPUTE, name=0x7fc1f7aab177 "Compute Shader", less_optimized=false) at ../src/gallium/drivers/radeonsi/si_shader_llvm.c:104
#16 0x00007fc1f77b4103 in si_llvm_compile_shader (sscreen=sscreen@entry=0x563be1f066a0, compiler=compiler@entry=0x563be1f06f30, shader=shader@entry=0x563be208bb68, debug=debug@entry=0x563be208b5b0, nir=<optimized out>, 
    nir@entry=0x563be200e1b0, free_nir=<optimized out>) at ../src/gallium/drivers/radeonsi/si_shader.c:1581
#17 0x00007fc1f77b561d in si_compile_shader (sscreen=0x563be1f066a0, compiler=0x563be1f06f30, shader=<optimized out>, debug=0x563be208b5b0) at ../src/gallium/drivers/radeonsi/si_shader.c:1855
#18 0x00007fc1f77b64b7 in si_create_shader_variant (sscreen=sscreen@entry=0x563be1f066a0, compiler=compiler@entry=0x563be1f06f30, shader=shader@entry=0x563be208bb68, debug=debug@entry=0x563be208b5b0)
    at ../src/gallium/drivers/radeonsi/si_shader.c:2384
#19 0x00007fc1f78126e1 in si_create_compute_state_async (job=job@entry=0x563be208b580, thread_index=thread_index@entry=3) at ../src/gallium/drivers/radeonsi/si_compute.c:161
#20 0x00007fc1f72322f1 in util_queue_thread_func (input=input@entry=0x563be1e96bd0) at ../src/util/u_queue.c:308
#21 0x00007fc1f7231d48 in impl_thrd_routine (p=<optimized out>) at ../include/c11/threads_posix.h:87
#22 0x00007fc208cc8f27 in start_thread (arg=<optimized out>) at pthread_create.c:479
#23 0x00007fc20acc931f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
foad added a comment.May 15 2020, 6:41 AM

This pretty much broke the world with radeonsi on Navi 14:

Sorry. Can you please try D80007.