Details
Diff Detail
- Build Status
Buildable 5568 Build 5568: arc lint + arc unit
Event Timeline
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
117–121 | There are several reasons:
|
lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
290 | Why SSrc_b64 and not i64imm? It doesn't support non-immediate operands. |
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
117–121 | I don't think we should be adding intrinsics for the sake of working around codegen defects. A better workaround would be to only ever use init_exec and then have AMDGPUCodeGenPrepare insert calls to the second intrinsic until we fix the various SCC handling issues | |
lib/Target/AMDGPU/SIInstructions.td | ||
290 | To have the one intrinsic |
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
117–121 | That makes sense, however if we look at it from the hw perperctive, amdgcn.init.exec.from.input is all we need. We don't need any SCC handling, S_CMOV_B64 selection, etc. It would be desirable for have those, but not for init.exec. In fact, the init.exec instrinsic is already too flexible. The driver will only ever pass -1 into it. With all those in mind, it's unnecessary for init.exec to be more flexible. There is no use case for such flexibility. Would you implement something that nobody would ever use? |
I don't see anything wrong with the code.
I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.
Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.
Merged monolithic shaders set exec = -1 and then they use "if (tid < thread_count) ...". I think that's the only way to jump over the conditional code right now if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ..." that sets EXEC regardless of current EXEC and skips the conditional for thread_count == 0. The performance of that solution is unlikely to justify the implementation effort.
Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.
Yeah, I looked at the monolithic shader stuff briefly. I think the LLVM IR is fine. Adding another intrinsic for switching EXEC in the middle of a shader is bound to run into lots of problems in CodeGen around scheduling and such.
The LLVM CodeGen could generally grow some more smarts around EXEC and perhaps even pattern-match the v_mbcnt+v_cmp. I also think it's pretty low priority though.
Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/
Ah, so that's why the barrier instruction is needed between shader parts? Interesting.
There are two cases when the barrier isn't needed: 1) When GS is processing points without any amplification. 2) When HS has input control points == output control points, and each HS thread doesn't access other threads' inputs. In both cases, the barrier and the LDS traffic can be removed and the previous shader can put outputs into VGPRs to get a fully merged shader.
Why can't you emit this sequence and feed that into the first intrinsic?