
AMDGPU: Add SIWholeQuadMode pass
ClosedPublic

Authored by nhaehnle on Mar 14 2016, 3:03 PM.

Details

Summary

Whole quad mode is already enabled for pixel shaders that compute
derivatives, but it must be suspended for instructions that cause a
shader to have side effects (i.e. stores and atomics).

This pass addresses the issue by storing the real (initial) live mask
in a register, masking EXEC before instructions that require exact
execution and (re-)enabling WQM where required.
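The lane-mask bookkeeping can be sketched as follows (a minimal Python model, purely illustrative — the pass itself of course manipulates EXEC with machine instructions such as s_wqm_b64 and s_and_b64):

```python
def wqm(mask, lanes=64):
    """Whole quad mode: each aligned group of 4 lanes (a 'quad') becomes
    all-live if any lane in the group is live. This models what the
    hardware s_wqm_b64 instruction computes."""
    out = 0
    for q in range(0, lanes, 4):
        if (mask >> q) & 0xF:
            out |= 0xF << q
    return out

# The pass saves the real (initial) live mask in a register; EXEC is
# then set to the WQM mask for derivative computations and restored to
# the saved mask before instructions that require exact execution.
live_mask = 0b00010110       # lanes 1, 2 and 4 live
exec_wqm = wqm(live_mask)    # helper lanes enabled for derivatives
exec_exact = live_mask       # restored before stores/atomics
```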

This pass is run before register coalescing so that we can use
machine SSA for analysis.

The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.

This patch fixes the problem by adding use-implicit operands to
target independent instructions. Some general codegen passes are
relaxed to work with such implicit use operands.

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 50655. Mar 14 2016, 3:03 PM
nhaehnle retitled this revision from to AMDGPU: Add SIWholeQuadMode pass.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
nhaehnle updated this revision to Diff 50671. Mar 14 2016, 4:09 PM

In the fast case where we have WQM but don't need to switch back and forth,
also mark COPYs as implicitly using EXEC in the entry block.

Noticed this in a more involved shader in a piglit run.

The target independent changes might get more visibility if they were in a separate patch.

The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.

Is the scheduler the only pass we need to worry about? Would we be able to avoid the problem by implementing TargetInstrInfo::isSchedulingBoundary()?

Hmm, I wasn't aware of that function. Yes, customizing that function so that it considers instructions with EXEC defs as scheduling boundaries should work as well.

It's a slightly more conservative solution because it also prevents movement of SALU and SMEM instructions. Then again, it should not impact the initial DAG-based scheduling. Also, I notice that the target-independent code mentions that considering stack pointer modifications to be scheduling boundaries is profitable, and the same logic likely applies to EXEC. Plus, it's a more robust solution.

I'll modify the patch to use isSchedulingBoundary instead.
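A toy model of that idea (hypothetical Python, not the LLVM C++ API — the real change overrides TargetInstrInfo::isSchedulingBoundary on MachineInstr): any instruction that defines EXEC splits the block into regions, and the scheduler may only reorder within a region, so nothing can move across a change to EXEC:

```python
def is_scheduling_boundary(inst):
    """Toy predicate: an instruction defining EXEC is a boundary."""
    return "EXEC" in inst.get("defs", ())

def schedule_regions(insts):
    """Split an instruction list into regions at boundaries; a scheduler
    may reorder instructions only within a single region."""
    regions, cur = [], []
    for inst in insts:
        if is_scheduling_boundary(inst):
            if cur:
                regions.append(cur)
            regions.append([inst])  # the boundary itself stays fixed
            cur = []
        else:
            cur.append(inst)
    if cur:
        regions.append(cur)
    return regions
```

With this model, an s_wqm_b64 (which defines EXEC) pins the surrounding SALU/SMEM and VALU instructions on their respective sides, which is the slightly more conservative behavior discussed above.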

nhaehnle updated this revision to Diff 50943. Mar 17 2016, 9:16 AM
nhaehnle edited edge metadata.

[This time with the correct --update parameter for arc]

Use isSchedulingBoundary instead of implicit-use of EXEC, which gets rid of
the target-independent modifications.

This is indeed more conservative, as you can tell from the change in
si-scheduler.ll: previously, the later scheduling passes managed to move the
initial s_wqm_b64 after the s_load_dwordx4 and s_load_dwordx8, i.e. we lose
slightly in latency hiding.

The impact should be small, and it does make sense to land a more conservative
and robust patch initially.

lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
67 ↗(On Diff #50943)

This should go in SIInstrInfo, since it is SI specific.

70–72 ↗(On Diff #50943)

Can we just call TargetInstrInfo::isSchedulingBoundary() at the end of the function rather than have this check?

nhaehnle updated this revision to Diff 51029. Mar 18 2016, 9:06 AM

Makes sense, here's an updated patch.

tstellarAMD accepted this revision. Mar 18 2016, 3:44 PM
tstellarAMD edited edge metadata.

LGTM, just a few things to fix before you commit.

lib/Target/AMDGPU/AMDGPUInstrInfo.h
65 ↗(On Diff #51029)

Extra whitespace change.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
311 ↗(On Diff #51029)

Coding Style: Brace on new line.

327 ↗(On Diff #51029)

Coding Style: Brace on new line.

This revision is now accepted and ready to land. Mar 18 2016, 3:44 PM
arsenm edited edge metadata. Mar 18 2016, 4:17 PM

Should wqm-related instructions be marked as convergent?

Thanks for the feedback Tom. I'll work in those changes before I commit.

Matt, I'm not sure "convergent" really captures the properties. In practice, I doubt it's a problem because the relevant instructions define EXEC - as far as I can tell, this means they're left alone by the passes that care about convergence, because EXEC is a physical register.

It's more likely to be a problem for the corresponding IR intrinsics

Ah okay. There are two intrinsic types related to all this:

  1. The kill intrinsic. This is marked as having side effects, which I think should imply convergent.
  2. The load / store / atomic intrinsics. They themselves don't involve derivatives. If their results are used in a derivative computation, then it is sufficient to ensure that they are always executed when the consumer is executed, but that automatically follows from plain data flow (just like any other normal computations).

So I think we're fine without convergent attributes anywhere.

nhaehnle added a comment. Edited Mar 19 2016, 7:29 AM

As an addendum: In part, correctness depends on the guarantees for derivatives that are required by GLSL. For example, if you have:

%tmp = call <2 x i32> @llvm.amdgcn.image.load.v2i32(...)
%coords = bitcast and extract from %tmp
...
br i1 %cc, label %IF, label %ELSE

IF:
%texel = call <4 x float> @llvm.SI.image.sample.v2i32(<2 x i32> %coords, ...)
...

ELSE:
... %coord not used here or later ...

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

Ah okay. There are two intrinsic (types) related to all this:

  1. The kill intrinsic. This is marked as having side effects, which I think should imply convergent.

Side effects do not imply convergent

  2. The load / store / atomic intrinsics. They themselves don't involve derivatives. If their results are used in a derivative computation, then it is sufficient to ensure that they are always executed when the consumer is executed, but that automatically follows from plain data flow (just like any other normal computations).

It shouldn't be necessary for these

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

This is the same situation as barriers, so I think this should still be convergent. The problem it solves is LLVM introducing uses that do not sit in uniform control flow, such as introducing a call into either side of an if/then block.

For kill: Okay, side effects do not imply convergent, but the kill can only be moved in a way that preserves its execution. So IR-level optimizations are allowed to move a kill into both the if and the else branch of a subsequent if-else block, but this still results in correct code: both branches may mask away some bits of EXEC, but since the control flow is joined again with a bit-wise OR, that's okay.
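That argument can be checked with a small lane-mask model (illustrative Python with 8 lanes; the names are made up): duplicating the kill into both sides and rejoining EXEC with a bitwise OR yields the same mask as executing the kill once before the branch:

```python
def kill(exec_mask, cond_mask):
    # kill disables the lanes where the kill condition holds
    return exec_mask & ~cond_mask

FULL = 0xFF             # 8 live lanes, for illustration
cc = 0b00001111         # lanes that take the IF side
kill_cond = 0b01010101  # lanes the kill discards

# Kill executed once, before the if/else:
before_branch = kill(FULL, kill_cond)

# Kill duplicated into both branches; control flow rejoins by OR-ing
# the two partial EXEC masks back together:
if_exec = kill(FULL & cc, kill_cond)
else_exec = kill(FULL & ~cc, kill_cond)
rejoined = if_exec | else_exec
```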

The derivative taken by the llvm.SI.image.sample is undefined in GLSL if the control-flow is dynamically non-uniform, so it is perfectly legal to sink the llvm.amdgcn.image.load into the IF block (and the same applies to any other computation that leads to a derivative).

This is the same situation as barriers, so I think this should still be convergent. The problem it solves is LLVM introducing uses that do not sit in uniform control flow, such as introducing a call into either side of an if/then block.

I do not think this is the same situation as barriers. Barriers need to be convergent because all threads have to execute the same barrier instruction (at the same program counter) for the semantics to remain correct.

If (plain) loads are pushed down into either side of an if/else block, that's inefficient but correct (of course assuming no stores in between etc.). There is no synchronization or data exchange between threads, so all that matters is that the right value is loaded into the right register on a per-thread basis; which instructions do the job at which program counter value is not important.
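A per-lane sketch of why that holds (illustrative Python, invented names): executing the load once before the branch, or duplicating it into both sides, leaves every lane with the same final value, because no cross-lane communication is involved:

```python
mem = {0: 10, 1: 20, 2: 30, 3: 40}  # per-lane memory contents

def run(lanes, if_lanes, sink_load):
    """Simulate each lane independently; sink_load chooses whether the
    load happens before the branch or inside each branch."""
    regs = {}
    for lane in lanes:
        if not sink_load:
            regs[lane] = mem[lane]      # load before the if/else
        if lane in if_lanes:
            if sink_load:
                regs[lane] = mem[lane]  # load duplicated into IF
            regs[lane] += 1             # IF-side work
        else:
            if sink_load:
                regs[lane] = mem[lane]  # load duplicated into ELSE
            regs[lane] += 2             # ELSE-side work
    return regs
```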

If I am still misunderstanding you, perhaps an example would be helpful.

This revision was automatically updated to reflect the committed changes.