Enable clausing of memory loads on gfx10 by adding a new pass to insert
the s_clause instructions that mark the start of each hard clause.
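For context, here is a minimal sketch of what the new pass does with a clause it has already identified: emit an s_clause whose immediate is the clause length minus one, then bundle the group so later passes cannot insert anything into the middle of it. The function name, the includes, and the surrounding bookkeeping are illustrative, not the patch's actual code.

```cpp
#include "MCTargetDesc/AMDGPUMCTargetDesc.h" // AMDGPU::S_CLAUSE (paths approximate)
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineInstrBundle.h" // finalizeBundle()
#include <cassert>
#include <iterator>

using namespace llvm;

// Emit "s_clause (Length - 1)" in front of a previously identified clause
// [First, Last] and bundle the whole group so that nothing gets inserted
// into the middle of it later.
static bool emitHardClauseSketch(MachineBasicBlock &MBB,
                                 MachineBasicBlock::iterator First,
                                 MachineBasicBlock::iterator Last,
                                 unsigned Length, const SIInstrInfo *SII) {
  if (Length < 2)
    return false; // a single-instruction "clause" gains nothing
  assert(Length <= 64 && "s_clause encodes length - 1, so at most 64");
  auto ClauseMI = BuildMI(MBB, First, DebugLoc(), SII->get(AMDGPU::S_CLAUSE))
                      .addImm(Length - 1);
  finalizeBundle(MBB, ClauseMI->getIterator(),
                 std::next(Last->getIterator()));
  return true;
}
```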
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1048: Need to check OptLevel probably, the pass is optional.
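The inline comment above asks for an OptLevel check. A sketch of how such gating typically looks in GCNPassConfig (in AMDGPUTargetMachine.cpp) follows; only the relevant lines are shown, and the exact position among the other pre-emit passes is an assumption of this sketch.

```cpp
// Inside GCNPassConfig in AMDGPUTargetMachine.cpp (fragment, placement assumed).
void GCNPassConfig::addPreEmitPass() {
  // ... other pre-emit passes ...

  // The pass is purely an optimization, so skip it at -O0.
  if (getOptLevel() > CodeGenOpt::None)
    addPass(&SIInsertHardClausesID);

  // ... hazard recognition, etc. ...
}
```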
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp | ||
---|---|---|
119 | OK, but my way was simpler. I tried to add a test (hard-clauses.mir) but this pass is guided by the SIInstrInfo::shouldClusterMemOps heuristic, which never clusters that many loads. I still think this pass should handle it correctly, in case the heuristic ever changes. |
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119: Looking at the passes which run after this one, it does not seem anything will be modified here. But you are probably right, just to be on the safe side.
119: This way you will restart a new clause after the break, so it is better. We may also want to add an option to shouldClusterMemOps() controlling the maximum cluster size for this test. This option would be useful anyway for clustering experiments.
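A sketch of the option suggested in the comment above; the flag name, default value, and location (SIInstrInfo.cpp) are hypothetical, not anything implied to exist in-tree.

```cpp
#include "llvm/Support/CommandLine.h"

// Hypothetical flag: the name "amdgpu-max-mem-cluster-size" and the default of
// 4 are made up for this sketch.
static llvm::cl::opt<unsigned> MaxMemClusterSize(
    "amdgpu-max-mem-cluster-size", llvm::cl::Hidden, llvm::cl::init(4),
    llvm::cl::desc("Maximum number of memory operations to cluster together"));

// SIInstrInfo::shouldClusterMemOps() could then bail out early with something
// like:
//   if (NumLoads > MaxMemClusterSize)
//     return false;
```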
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119: I don't know. I haven't noticed any problems in testing. I guess it just relies on running late enough in the pipeline.
Actually, depending on shouldCluster is a problem. First soft clauses are formed, and then you turn them into hard clauses. Soft clauses impose higher register pressure, but then you simply break them here, essentially wasting registers. Either both passes should be guided by the same heuristic, or neither.
I don't understand what you're suggesting. In this pass we have to impose a limit of 64 instructions for correctness, not performance, because that's how the s_clause instruction's operand is encoded.
Perhaps we should teach shouldCluster not to bother clausing more than 64 loads, because the s_clause won't be able to honor it, but it already has a much lower limit than that, so I don't think there's any need to change it.
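To make the correctness constraint concrete, here is a tiny standalone sketch of the arithmetic: the s_clause immediate holds the clause length minus one, so any longer run of clauseable loads has to be split into chunks of at most 64. Nothing here is LLVM code, just the numbers.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Split a run of adjacent clauseable loads into hard clauses of at most 64
// instructions each; the emitted immediate is the clause length minus one.
std::vector<unsigned> splitIntoHardClauses(unsigned NumAdjacentLoads) {
  constexpr unsigned MaxClauseLength = 64;
  std::vector<unsigned> ClauseLengths;
  while (NumAdjacentLoads > 0) {
    unsigned Len = std::min(NumAdjacentLoads, MaxClauseLength);
    ClauseLengths.push_back(Len);
    NumAdjacentLoads -= Len;
  }
  return ClauseLengths;
}

int main() {
  // 100 adjacent loads -> "s_clause 0x3f" (64 instructions) followed by
  // "s_clause 0x23" (36 instructions).
  for (unsigned Len : splitIntoHardClauses(100))
    std::printf("s_clause 0x%x  ; %u instructions\n", Len - 1, Len);
  return 0;
}
```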
In fact it seems clauses are broken every 4 instructions, which means there is no reason to form up to 15 instructions into a soft clause which will then be broken here. So what I propose: either do not call shouldClusterMemOps here, or also call it in SIFormMemoryClauses. I do not really have a preference here, but the disagreement means the registers used by the soft clauses were used for nothing. Does it make sense? E.g. consider the case where you also have XNACK enabled.
BTW what happens to the counters if a hard clause exceeds 16 instructions? Every instruction increases the counter and there are only 4 bits available. Am I getting this wrong?
Using gfx10 terminology: SIFormMemoryClauses deals with restartable groups but SIInsertHardClauses deals with hard clauses. They are similar but not the same. That's what I tried to explain in the big comment at the top of SIInsertHardClauses.cpp. Hard clauses are all about performance. Groups are all about correctness in the presence of XNACK.
shouldClusterMemOps is a heuristic to decide whether loads should be claused for performance. We have to call it from SIInsertHardClauses, so it can tell us not to bother clausing loads that have a different base address.
I'm not at all sure that calling shouldClusterMemOps from SIFormMemoryClauses is a good idea, but in any case I think that should be a separate patch with its own discussion.
> BTW what happens to the counters if a hard clause exceeds 16 instructions? Every instruction increases the counter and there are only 4 bits available. Am I getting this wrong?
I don't know, sorry. Maybe it's one of the conditions that can cause the hardware to break a hard clause?
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119: Would bundling actually fix the problem? What about other code that modifies bundles, like GCNHazardRecognizer::insertNoopInBundle? It feels like there's an arms race where one side invents mechanisms like bundling to stop instructions being inserted, and the other side goes and inserts instructions in the bundles anyway. Having said that, I do have some patches in progress to remove GCNHazardRecognizer::insertNoopInBundle. But I don't know if there are any other places where bundles could be modified.
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
117–119: I think the hazard recognizer is the only place with a potentially legitimate reason to try inserting anything else in a bundle.
llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
123–131: Could also use MIBundleBuilder.
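For reference, a sketch of what the MIBundleBuilder alternative suggested above could look like, reusing the names from the earlier sketch (MBB, First, Last, Length, SII). Unlike finalizeBundle, MIBundleBuilder only sets the bundle flags and does not create a BUNDLE header instruction; treat this as an illustration of the API rather than the patch's code.

```cpp
#include "MCTargetDesc/AMDGPUMCTargetDesc.h" // AMDGPU::S_CLAUSE (paths approximate)
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineInstrBundle.h" // MIBundleBuilder
#include <iterator>

using namespace llvm;

// Bundle the clause instructions [First, Last] with MIBundleBuilder and make
// the s_clause the first instruction of the bundle.
static void emitClauseWithBundleBuilder(MachineBasicBlock &MBB,
                                        MachineBasicBlock::iterator First,
                                        MachineBasicBlock::iterator Last,
                                        unsigned Length,
                                        const SIInstrInfo *SII) {
  MachineFunction &MF = *MBB.getParent();
  // Create the s_clause detached from any block; prepend() inserts it.
  MachineInstr *ClauseMI =
      BuildMI(MF, DebugLoc(), SII->get(AMDGPU::S_CLAUSE)).addImm(Length - 1);
  MIBundleBuilder Bundler(MBB, First, std::next(Last));
  Bundler.prepend(ClauseMI);
}
```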
SIFormMemoryClauses is an optimization pass. If it does not run, nothing breaks; the clauses will just be broken by the hazard recognizer. So in fact it does the same thing.
> shouldClusterMemOps is a heuristic to decide whether loads should be claused for performance. We have to call it from SIInsertHardClauses, so it can tell us not to bother clausing loads that have a different base address.

Actually, the very small limit set by shouldClusterMemOps is driven by register pressure. There seems to be no reason to call it here at all, because this is past RA. Note that SIFormMemoryClauses does not call it but uses the RP tracker to maintain occupancy instead.
shouldClusterMemOps decides whether it is beneficial for performance to cluster two loads. For example: two VMEM instructions with the same base address should be clustered, unless one has a sampler and the other doesn't, and so on. All that sort of logic belongs in shouldClusterMemOps, and we have to call it here because we don't want to insert s_clause instructions if it is not going to improve performance.
Yes, shouldClusterMemOps also imposes a limit on the length of the cluster. If that is *only* to help with register pressure, then perhaps I can bypass that check by always calling it with NumLoads=2 instead of NumLoads=CI.Size+1. What do you think?
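As a reference point for this exchange, here is an illustrative standalone predicate restating the kind of rules described above. It is not the real shouldClusterMemOps hook (whose signature has changed across LLVM releases), and the finer-grained checks are only hinted at in comments.

```cpp
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

// Illustrative only: memory instructions are only worth clausing together if
// they are of the same kind; the finer rules (same base address, sampler vs.
// no sampler for image ops, ...) would need target helpers not shown here.
static bool worthClausingTogether(const MachineInstr &A,
                                  const MachineInstr &B) {
  if (SIInstrInfo::isSMRD(A) != SIInstrInfo::isSMRD(B) ||
      SIInstrInfo::isVMEM(A) != SIInstrInfo::isVMEM(B) ||
      SIInstrInfo::isFLAT(A) != SIInstrInfo::isFLAT(B))
    return false;
  // Sketch: compare base operands and image/sampler usage here.
  return true;
}
```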
This pretty much broke the world with radeonsi on Navi 14:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fc20abf155b in __GI_abort () at abort.c:79
#2 0x00007fc20abf142f in __assert_fail_base (fmt=0x7fc20ad57b48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fc1ecec6cf6 "CI.Length == std::distance(CI.First->getIterator(), CI.Last->getIterator()) + 1", file=0x7fc1ec73aa57 "/home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp", line=114, function=<optimized out>) at assert.c:92
#3 0x00007fc20ac00092 in __GI___assert_fail (assertion=0x7fc1ecec6cf6 "CI.Length == std::distance(CI.First->getIterator(), CI.Last->getIterator()) + 1", file=0x7fc1ec73aa57 "/home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp", line=114, function=0x7fc1ecf1e303 "bool (anonymous namespace)::SIInsertHardClauses::emitClause(const (anonymous namespace)::SIInsertHardClauses::ClauseInfo &, const llvm::SIInstrInfo *)") at assert.c:101
#4 0x00007fc1ef7bfb58 in (anonymous namespace)::SIInsertHardClauses::emitClause((anonymous namespace)::SIInsertHardClauses::ClauseInfo const&, llvm::SIInstrInfo const*) (this=<optimized out>, CI=..., SII=<optimized out>) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp:113
#5 0x00007fc1ef7bf6c7 in (anonymous namespace)::SIInsertHardClauses::runOnMachineFunction(llvm::MachineFunction&) (this=<optimized out>, MF=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp:167
#6 0x00007fc1ee1abcad in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (this=0x7fc1c00458f0, F=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:73
#7 0x00007fc1edf953a5 in llvm::FPPassManager::runOnFunction(llvm::Function&) (this=<optimized out>, F=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1482
#8 0x00007fc1eefa2493 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) (this=0x7fc1c0024970, P=0x7fc1c0025020, CurSCC=..., CG=..., CallGraphUpToDate=<optimized out>, DevirtualizedCall=<optimized out>) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:176
#9 (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) (this=0x7fc1c0024970, CurSCC=..., CG=..., DevirtualizedCall=<optimized out>) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:441
#10 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (this=<optimized out>, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/Analysis/CallGraphSCCPass.cpp:497
#11 0x00007fc1edf95c83 in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) (this=<optimized out>, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1583
#12 llvm::legacy::PassManagerImpl::run(llvm::Module&) (this=0x7fc1c0009670, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1695
#13 0x00007fc1edf9613e in llvm::legacy::PassManager::run(llvm::Module&) (this=<optimized out>, this@entry=0x7fc1c0009650, M=...) at /home/daenzer/src/llvm-git/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1726
#14 0x00007fc1f788eb9f in ac_compile_module_to_elf(ac_compiler_passes*, LLVMModuleRef, char**, size_t*) (p=p@entry=0x7fc1c0009610, module=<optimized out>, pelf_buffer=pelf_buffer@entry=0x563be208bc20, pelf_size=pelf_size@entry=0x563be208bc28) at /home/daenzer/src/llvm-git/llvm-project/llvm/include/llvm/IR/Module.h:892
#15 0x00007fc1f77b6c51 in si_compile_llvm (sscreen=sscreen@entry=0x563be1f066a0, binary=binary@entry=0x563be208bc20, conf=conf@entry=0x563be208bc38, compiler=compiler@entry=0x563be1f06f30, ac=ac@entry=0x7fc1e8f5e240, debug=debug@entry=0x563be208b5b0, shader_type=PIPE_SHADER_COMPUTE, name=0x7fc1f7aab177 "Compute Shader", less_optimized=false) at ../src/gallium/drivers/radeonsi/si_shader_llvm.c:104
#16 0x00007fc1f77b4103 in si_llvm_compile_shader (sscreen=sscreen@entry=0x563be1f066a0, compiler=compiler@entry=0x563be1f06f30, shader=shader@entry=0x563be208bb68, debug=debug@entry=0x563be208b5b0, nir=<optimized out>, nir@entry=0x563be200e1b0, free_nir=<optimized out>) at ../src/gallium/drivers/radeonsi/si_shader.c:1581
#17 0x00007fc1f77b561d in si_compile_shader (sscreen=0x563be1f066a0, compiler=0x563be1f06f30, shader=<optimized out>, debug=0x563be208b5b0) at ../src/gallium/drivers/radeonsi/si_shader.c:1855
#18 0x00007fc1f77b64b7 in si_create_shader_variant (sscreen=sscreen@entry=0x563be1f066a0, compiler=compiler@entry=0x563be1f06f30, shader=shader@entry=0x563be208bb68, debug=debug@entry=0x563be208b5b0) at ../src/gallium/drivers/radeonsi/si_shader.c:2384
#19 0x00007fc1f78126e1 in si_create_compute_state_async (job=job@entry=0x563be208b580, thread_index=thread_index@entry=3) at ../src/gallium/drivers/radeonsi/si_compute.c:161
#20 0x00007fc1f72322f1 in util_queue_thread_func (input=input@entry=0x563be1e96bd0) at ../src/util/u_queue.c:308
#21 0x00007fc1f7231d48 in impl_thrd_routine (p=<optimized out>) at ../include/c11/threads_posix.h:87
#22 0x00007fc208cc8f27 in start_thread (arg=<optimized out>) at pthread_create.c:479
#23 0x00007fc20acc931f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95