This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add inverse ballot intrinsic
Needs RevisionPublic

Authored by cwabbott on Feb 5 2019, 5:57 AM.

Details

Reviewers
arsenm
nhaehnle
Summary

This takes a uniform 64-bit bitmask, and returns a boolean value which
is true for each thread if the corresponding bit is 1. The
implementation of subgroupInverseBallot() in radv (and AMDVLK presumably) is
currently implements this by shifting by the thread-id, but this is a
complicated way of doing what's basically just a no-op thanks to how booleans
are stored in SGPR's.

Although the user guarantees that the value is uniform, it may still
wind up in VGPR's thanks to deficiencies in the backend, in which case
we have to emit two readfirstlane's.

Diff Detail

Event Timeline

cwabbott created this revision.Feb 5 2019, 5:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2019, 5:57 AM
arsenm added a comment.Feb 5 2019, 6:09 AM

The user can't actually guarantee the argument is uniform, since transforms are allowed to touch the argument. We're accumulating quite a few places that assume this though

arsenm added inline comments.Feb 5 2019, 6:13 AM
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21

You can just use an i64 argument here instead of this build vector and cast thing

27

Need to add a few more tests with more complex situations (particularly one with a uniform phi, and one with a divergent phi)

The user can't actually guarantee the argument is uniform, since transforms are allowed to touch the argument. We're accumulating quite a few places that assume this though

Yeah, that's true. We're already currently disabling some problematic transforms in Mesa because of that. We can probably get a proper solution for that when we add thread group semantics to the LangRef (c.f. the llvm-dev thread), but we shouldn't need to wait on that to get this in.

cwabbott marked an inline comment as done.Feb 5 2019, 6:19 AM
cwabbott added inline comments.
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21

No, that crashes:

Formal argument #0 has unhandled type i64
UNREACHABLE executed at ../lib/CodeGen/CallingConvLower.cpp:98!

I just copied the amdgpu_ps calling convention from somewhere else, is there a better one to use?

arsenm added inline comments.Feb 5 2019, 6:23 AM
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21

I know I've written the patch to fix this before, but I guess I never committed it.

You can just use a default calling convention function for this just as well

cwabbott marked an inline comment as done.Feb 5 2019, 6:25 AM
cwabbott added inline comments.
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
27

Why would having a phi node change anything? As far as I can see these tests cover the only two cases that really matter for lowering this (input in SGPR and input in VGPR).

arsenm added inline comments.Feb 5 2019, 6:33 AM
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
27

I'm specifically worried about the change isVGPRToSGPRCopy. Phi and i1 handling has been a problematic area in SIFixSGPRCopies.

I would probably be more comfortable using a pseudo instruction for this until this point rather than making a regular COPY to VReg_1 legal

cwabbott updated this revision to Diff 185317.Feb 5 2019, 8:29 AM
  • Added a pseudoinstruction lowered in SILowerI1Copies, and legalized it more

similarly to how other instructions are legalized.

  • Added tests where the source is a uniform and non-uniform phi node. I had to

to use the amdgpu_ps calling convention here to get the arguments into SGPR's.

arsenm added inline comments.Feb 6 2019, 5:40 AM
lib/Target/AMDGPU/SIInstructions.td
597

extra whitespace change

lib/Target/AMDGPU/SILowerI1Copies.cpp
651–652 ↗(On Diff #185317)

Can you handle this in EmitInstrWithCustomInserter instead? You'll need to set usesCustomInserter on the instruction

cwabbott updated this revision to Diff 185757.Feb 7 2019, 6:46 AM
  • Remove spurious whitespace change.
  • Lower S_INV_BALLOT with EmitInstrWithCustomInserter.

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.

arsenm added a comment.Feb 7 2019, 9:45 AM

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.

We could always make CodeGenPrepare always sink these

arsenm added a comment.Feb 7 2019, 9:48 AM

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.

We could always make CodeGenPrepare always sink these

I think this would be cleaner since then we wouldn't have to worry about the conceptually disturbing readfirstlane case

Adding a pattern for this wouldn't work for what I wanted to do, which was a ballot/inverseballot pair to operate directly on the bitmask representation of a boolean, since there's a bug where SelectionDAG forgets that ballot removes divergence, and it needs to be non-divergent for the pattern to fire. That being said, inserting two readlanes isn't that much better, so maybe I should just fix that instead...

arsenm added a comment.Feb 8 2019, 6:18 AM

Adding a pattern for this wouldn't work for what I wanted to do, which was a ballot/inverseballot pair to operate directly on the bitmask representation of a boolean, since there's a bug where SelectionDAG forgets that ballot removes divergence, and it needs to be non-divergent for the pattern to fire. That being said, inserting two readlanes isn't that much better, so maybe I should just fix that instead...

Fixing DAG bugs is good. This would be also be if you could use mbcnt here instead of hoping readfirstlane works out OK

arsenm requested changes to this revision.Dec 22 2022, 3:52 PM

Please rebase if still relevant

This revision now requires changes to proceed.Dec 22 2022, 3:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2022, 3:52 PM