This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic and live mask tracking
AbandonedPublic

Authored by critson on Sep 19 2019, 7:57 AM.

Details

Summary

Add intrinsic which demotes all active lanes to helper lanes.
This is used to implement demote to helper Vulkan extension.

In practice demoting a lane to helper simply means removing it
from the mask of live lanes used for WQM/WWM/Exact mode.
Where the shader does not use WQM, demotes just become kills.

To support demote live lanes must now be tracked through entire
shader. Move implementation of kill intrinsics to WQM pass
and use live lane tracking to enable early termination of
shaders are any point in control flow.

Additionally add llvm.amdgcn.live.mask intrinsic to complement
demote operations. In theory llvm.amdgcn.ps.live can be used
to detect helper lanes; however, ps.live can be moved by LICM.
The movement of ps.live cannot be remedied without changing
its type signature and such a change would require ps.live
users to update as well.

Diff Detail

Event Timeline

critson created this revision.Sep 19 2019, 7:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2019, 7:57 AM
arsenm requested changes to this revision.Sep 19 2019, 8:53 AM

The WQM pass should be moved to avoid recomputing these. Computing LiveIntervals twice is an unacceptable cost

This revision now requires changes to proceed.Sep 19 2019, 8:53 AM
critson updated this revision to Diff 258672.Apr 20 2020, 1:19 AM

Update to move comprehensive implementation now fairly tested
for graphics.
Now preserves LiveInterval analysis.
This is dependent on D78417.

arsenm added inline comments.Apr 24 2020, 4:08 PM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1267

I assume this needs to be convergent

1488

Ditto, and nomem

llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
242

Missing DivergenceAnalysis test

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
778–779

use Register

critson marked 3 inline comments as done.Apr 28 2020, 4:57 AM
critson added inline comments.
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1267

Could be, my understanding is that without flags the intrinsic is marked "has side effects", which is correct as then it will not moved by LICM or removed by CSE.

1488

Convergent maybe (as above), but not nomem.
If this is marked nomem then it will be eaten by early CSE.
Since this was modelled on kill, is there a reason we don't mark kill Convergent?

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
778–779

I assume you mean the variable name Src -> Register?

critson updated this revision to Diff 260595.Apr 28 2020, 4:58 AM

Add DivergenceAnalysis test.
Mark intrinsics as convergent.
Rename a variable.

nhaehnle added inline comments.Apr 28 2020, 10:32 AM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1267

Rethinking this, convergent isn't correct here, because there is no implied cross-thread communication.

Rather, the semantics are that amdgcn.wqm.helper reads from some hidden memory that is written to by wqm.demote. So by that logic, this should arguably be ReadInaccessibleMemOnly.

1488

Following the logic above, this should not be convergent but WritesInaccessibleMemOnly. At least I *think* that captures the semantics most accurately.

critson updated this revision to Diff 260819.Apr 28 2020, 7:04 PM

Update intrinsic parameters:
llvm.amdgcn.wqm.demote - IntrWriteMem, IntrInaccessibleMemOnly
llvm.amdgcn.wqm.helper - IntrReadMem, IntrInaccessibleMemOnly

Ping - I would like to get this moving again.

Ping - I would like to get this moving again.

Does it make sense to wait until the WQM pass is moved?

Ping - I would like to get this moving again.

Does it make sense to wait until the WQM pass is moved?

Sure, I will revise this in light of D82190 and D82192.

Can you add globalisel tests? I think you may need to add the operand class to the output patterns

critson updated this revision to Diff 285085.Aug 12 2020, 7:33 AM
critson added a subscriber: mbrkusanin.
  • Rebase and update for changes in other code.
  • Rename intrinsic wqm.helper to wqm.live (it provides live lanes).
  • Add initial GlobalISel support.
  • Add GlobalISel tests (some ported by @mbrkusanin).
foad added a subscriber: foad.Aug 21 2020, 4:37 AM
foad added inline comments.Aug 21 2020, 4:52 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
778–779

I'm pretty sure he meant s/unsigned Src/Register Src/. clang-tidy thinks the same.

Hi Carl, what's the status of this?

Hi Carl, what's the status of this?

As per your previous comment:

Does it make sense to wait until the WQM pass is moved?

Pending review of D88081 after which this will need a little bit of reworking to fit into post-MI / non-SSA world.

critson updated this revision to Diff 304383.Nov 10 2020, 7:15 PM

Heavily rewritten:

  • Use/reuse single virtual register for live mask. This removes need for PHIs and live mask register tracking. Assumes WQM is running non-SSA. (Supporting non-SSA and SSA operation would bloat code.)
  • Live mask tracks all kills and demotes.
  • Live mask manipulations terminate shader if all lanes are killed, even in non-uniform control flow.
  • Move all kill lowering to WQM pass, this simplifies later passes and avoids duplication when updating live mask. Removes the need for "clean up" operations.
  • WQM pass always modifies shader if it has any kills or demotes, even if there is no WQM.
critson retitled this revision from [AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic to [AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic and live mask tracking.Nov 10 2020, 7:17 PM
critson edited the summary of this revision. (Show Details)
foad added inline comments.Nov 11 2020, 1:48 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
386

EXEC,SCC

llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll
210–211 ↗(On Diff #304383)

Why the extra move here?

critson added inline comments.Nov 11 2020, 2:50 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
386

Thanks!

llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll
210–211 ↗(On Diff #304383)

This is unfortunate cruft from moving WQM pass later in compiler where there is less COPY elimination.
I have been thinking about putting some additional COPY clean up optimisations in another pass.
I had originally put some in WQM pass, but agreed with Nicolai this was probably not appropriate as the pass is already getting quite complicated.
In this specific case, first copy is saving live lane mask at begin of shader, second copy is PS_LIVE which copies live lane mask.
In practice these are very unlikely to be next to each other, and determining that they are the trivial copies of each other will need to consider control flow.

critson abandoned this revision.Jan 14 2021, 10:07 PM

Split into D94746, D94747 and D94748.