This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][GISel] Add inverse ballot intrinsic
ClosedPublic

Authored by OutOfCache on Mar 17 2023, 4:58 AM.

Details

Summary

The inverse ballot intrinsic takes in a boolean mask for all lanes and
returns the boolean for the current lane. See SPIR-V's
subgroupInverseBallot() in the GL_KHR_shader_subgroup extension.
This allows decision making via branch and select instructions with a manually
manipulated mask.

Implemented in GlobalISel and SelectionDAG, since currently both are supported.
The SelectionDAG required pseudo instructions to use the custom inserter.

The boolean mask needs to be uniform for all lanes.
Therefore we expect SGPR input. In case the source is in a
VGPR, we insert one or more v_readfirstlane instructions.

Diff Detail

Event Timeline

OutOfCache created this revision.Mar 17 2023, 4:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2023, 4:58 AM
OutOfCache requested review of this revision.Mar 17 2023, 4:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2023, 4:58 AM
OutOfCache added a project: Restricted Project.
OutOfCache edited the summary of this revision. (Show Details)Mar 17 2023, 5:22 AM
OutOfCache added inline comments.Mar 17 2023, 5:25 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i32.ll
21 ↗(On Diff #506042)

This s_sendmsg instruction is the only difference between gfx10 and gfx11 in every test. Should I remove the gfx10 test?

tsymalla added inline comments.Mar 17 2023, 6:06 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i32.ll
2 ↗(On Diff #506042)

Why do you require -risky-select?

21 ↗(On Diff #506042)

I'd add just a single test run if the differences are minor.

Drop the "[GISel]" from the description, this isn't adding existing functionality to it. Obsoletes D57748?

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2540

Should let it fall through to fail to select instead

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
1363–1364

This isn't enforced by the verifier

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4480

Why do you have to go through selecting the pseudo? Can't you do this split during the select?

foad added inline comments.Mar 17 2023, 10:17 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
1380

Emitting the readfirstlanes directly during instruction selection is pretty unusual. Usually it is done by legalizeOperands during SIFixSGPRCopies. Is there a reason that inverse ballot has to be done differently?

tsymalla added inline comments.Mar 20 2023, 1:18 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4489

Use a better name, like Is32BitMask.

4504

Add a comment on why 0 is used in 32-bit mode.

4513

Could also use a comment.

llvm/lib/Target/AMDGPU/SIInstructions.td
194

Can unknown be changed to a more constrained register class instead of allowing arbitrary inputs?

194

Rather call $src $mask.

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i32.ll
153 ↗(On Diff #506042)

Nit: Missing whitespace after %entry

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i64.ll
52 ↗(On Diff #506042)

Can you add tests for more complex bitmask inputs?

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll
16

Can be optimized, as s0-s2 and v2-v3 are zero. Probably add another test to show the MIR your code reduces the intrinsic call to

rovka added a subscriber: rovka.Mar 20 2023, 5:01 AM
rovka added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4497

Nit: Most of this looks the same for GlobalISel, could you try to extract a helper function to use for both?

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i32.ll
3 ↗(On Diff #506042)

I have 2 nits here:

  1. Why go all the way to assembly? You could -stop-after=finalize-isel (or even before it). You'll probably have to use update_mir_test_checks.py though.
  2. There's precedent for having a single test with -global-isel=0 and -global-isel=1 RUN lines, so you don't really need separate files in this directory.
  • Simplify GISel Implementation by using legalizeOperands
  • Remove unnecessary checks.
  • Remove redundant tests and move GISel ones to SDAG tests
  • Increase readability.
OutOfCache marked 7 inline comments as done.Mar 21 2023, 10:35 AM
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
1380

Turns out there isn't! Thank you for the suggestion. It is much cleaner now.

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.inverse.ballot.i32.ll
3 ↗(On Diff #506042)

Thank you for the suggestion. Combining the GISel and SelDAG Tests into one file is much better.

I run into an issue when trying the -stop-after=finalize-isel however. The CHECK lines seem to be generated fine with update_mir_test_checks.py, but running the tests with llvm-lit fails the tests. Changing back to asm tests works again, with no other changes to the code. I will need to investigate further.

OutOfCache marked an inline comment as done.
  • Fix typos
  • Rename variable
OutOfCache marked 2 inline comments as done.Mar 21 2023, 10:44 AM
OutOfCache added inline comments.Mar 28 2023, 3:57 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4480

I have attempted simply selecting the node to a copy, similar to the current GISel implementation. However, I have run into issues in case the mask is inside a VGPR. The following passes changed the VGPR->SGPR copy into a VGPR->VGPR copy instead of producing the v_readfirstlanes. Do you have any other suggestion of things I could try?

nhaehnle added inline comments.Mar 28 2023, 12:30 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2540

Probably better to error out explicitly, no? A fall-through is easily missed by future changes.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4502–4523

This entire sequence could probably be replaced by just calling SIInstrInfo::readlaneVGPRToSGPR.

llvm/lib/Target/AMDGPU/SIInstructions.td
193

Remove the Defs = [VCC]. Defs is used to designate implicit defs of physical registers, but these instructions don't do that, they only use virtual registers.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll
119–130

This code is wrong. That's a problem that's orthogonal to your change and has to do with GlobalISel not correctly taking uniformity analysis into account (yet). But please add a comment to make a note of it (same in the corresponding wave64 test).

The SDAG code is correct.

OutOfCache added inline comments.Mar 30 2023, 2:28 AM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll
119–130

What is incorrect here? I only see the additional v_mov in the SelDAG code, but every other instruction seems to be the same.

  • Simplify SelDAG with call to SIInstrInfo::readlaneVGPRToSGPR
  • Add comment to clarify that current GISel test is incorrect
  • remove [VCC] form pseudo instruction def
OutOfCache marked 3 inline comments as done.Mar 30 2023, 4:12 AM
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4502–4523

Good call. Would it make sense to also use this method for other pseudos that generate readfirstlanes here? Like S_ADD_CO_PSEUDO. In another change, of course.

nhaehnle added inline comments.Mar 30 2023, 4:26 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4502–4523

I think so, yes. And yes, it should be a separate change.

What about the comment(s) in AMDGPUISelDAGToDAG.cpp?

OutOfCache marked an inline comment as done.
  • add internal error if the mask size is not 32 or 64
OutOfCache marked an inline comment as done.Mar 31 2023, 6:05 AM
nhaehnle accepted this revision.Apr 3 2023, 3:03 AM

LGTM

This revision is now accepted and ready to land.Apr 3 2023, 3:03 AM
OutOfCache updated this revision to Diff 510534.Apr 3 2023, 9:27 AM
  • Change llvm_unreachable_internal to llvm_unreachable

I decided to keep the error output since it is more
descriptive than the Invalid Opcode error we
would otherwise output.

This revision was landed with ongoing or failed builds.Apr 5 2023, 10:47 PM
This revision was automatically updated to reflect the committed changes.