This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add new amdgcn.init.exec intrinsics
Closed, Public

Authored by mareko on Apr 6 2017, 7:37 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

mareko created this revision. Apr 6 2017, 7:37 AM
arsenm added inline comments. Apr 6 2017, 9:47 AM
include/llvm/IR/IntrinsicsAMDGPU.td
117–121 (On Diff #94369)

Why can't you emit this sequence and feed that into the first intrinsic?

lib/Target/AMDGPU/SIInstructions.td
286 (On Diff #94369)

SSrc_b64:$src0

test/CodeGen/AMDGPU/set-initial-exec.ll
1 (On Diff #94369)

s/CHECK/GCN/

mareko added inline comments. Apr 6 2017, 10:06 AM
include/llvm/IR/IntrinsicsAMDGPU.td
117–121 (On Diff #94369)

There are several reasons:

  • It's easier this way, because the custom inserter only has to move the COPY opcode to the beginning instead of the whole expression.
  • LLVM can't select S_BFM_B64.
  • LLVM likely can't select S_CMP_U32_EQ in this case.
  • LLVM can't select S_CMOV_B64.
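
For context, the expansion under discussion would look roughly like this in GCN assembly. This is a sketch only: the register numbers and the BFE offset/width encoding are illustrative, not taken from the patch.

    s_bfe_u32    s2, s1, 0x80008   ; extract thread count from input SGPR s1 (hypothetical offset 8, width 8)
    s_bfm_b64    exec, s2, 0       ; exec = (1 << count) - 1
    s_cmp_eq_u32 s2, 64            ; S_BFM wraps at 64 lanes, so check for a full wave
    s_cmov_b64   exec, -1          ; count == 64: enable all lanes
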
mareko added inline comments. Apr 6 2017, 10:08 AM
lib/Target/AMDGPU/SIInstructions.td
286 (On Diff #94369)

Why SSrc_b64 and not i64imm? The intrinsic doesn't support non-immediate operands.

arsenm added inline comments. Apr 6 2017, 4:37 PM
include/llvm/IR/IntrinsicsAMDGPU.td
117–121 (On Diff #94369)

I don't think we should be adding intrinsics for the sake of working around codegen defects. A better workaround would be to only ever use init_exec, and then have AMDGPUCodeGenPrepare insert calls to the second intrinsic until we fix the various SCC handling issues.

lib/Target/AMDGPU/SIInstructions.td
286 (On Diff #94369)

To have the one intrinsic, i.e. so it can also accept non-immediate operands.

mareko added inline comments. Apr 7 2017, 6:41 AM
include/llvm/IR/IntrinsicsAMDGPU.td
117–121 (On Diff #94369)

That makes sense. However, if we look at it from the hw perspective, amdgcn.init.exec.from.input is all we need. We don't need any SCC handling, S_CMOV_B64 selection, etc. It would be desirable to have those, but not for init.exec. In fact, the init.exec intrinsic is already too flexible: the driver will only ever pass -1 into it. With all that in mind, it's unnecessary for init.exec to be more flexible; there is no use case for such flexibility. Would you implement something that nobody would ever use?
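
At the IR level, the two intrinsics being discussed look roughly like this (a sketch based on the discussion; the second operand of init.exec.from.input is assumed to be the immediate bit offset of the thread count within the input, and the offset value here is hypothetical):

    declare void @llvm.amdgcn.init.exec(i64)                 ; exec = immediate; the driver only ever passes -1
    declare void @llvm.amdgcn.init.exec.from.input(i32, i32) ; exec = mask built from a thread count in an input SGPR

    define amdgpu_ps float @example(i32 inreg %input) {
      ; must be first in the entry block; reads the count at bit offset 8 of %input
      call void @llvm.amdgcn.init.exec.from.input(i32 %input, i32 8)
      ret float 1.0
    }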

mareko updated this revision to Diff 95346. Apr 14 2017, 2:18 PM

Bug fixes, cosmetic changes, more tests.

nhaehnle accepted this revision. Apr 24 2017, 9:19 AM

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

This revision is now accepted and ready to land. Apr 24 2017, 9:19 AM

Merged monolithic shaders set exec = -1 and then use "if (tid < thread_count) ...". I think that's the only way right now to jump over the conditional code if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ...", which sets EXEC regardless of the current EXEC and skips the conditional for thread_count == 0. The performance gain from that solution is unlikely to justify the implementation effort.
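
The v_mbcnt+v_cmp pattern referred to here looks roughly like this in IR (a sketch; %thread_count is assumed to arrive as an SGPR input, and the label names are illustrative):

    ; tid = number of lanes below this one, i.e. the lane index within the wave
    %lo   = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
    %tid  = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
    %live = icmp ult i32 %tid, %thread_count
    br i1 %live, label %shader.part, label %merge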

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.

Yeah, I looked at the monolithic shader stuff briefly. I think the LLVM IR is fine. Adding another intrinsic for switching EXEC in the middle of a shader is bound to run into lots of problems in CodeGen around scheduling and such.

The LLVM CodeGen could generally grow some more smarts around EXEC and perhaps even pattern-match the v_mbcnt+v_cmp. I also think it's pretty low priority though.

Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/

Ah, so that's why the barrier instruction is needed between shader parts? Interesting.

There are two cases when the barrier isn't needed:

  • When GS is processing points without any amplification.
  • When HS has input control points == output control points, and each HS thread doesn't access other threads' inputs.

In both cases, the barrier and the LDS traffic can be removed, and the previous shader can put outputs into VGPRs to get a fully merged shader.

This revision was automatically updated to reflect the committed changes.