My understanding, and the current behavior of the backend, is that
booleans lowered to bitmasks have undefined values in bits corresponding
to lanes that aren't active. If we lower such a boolean directly to a
WQM intrinsic and an inactive lane happens to hold a 1, the result can
change whenever a quad is only partially active. Fix this by ANDing with
EXEC to mask out the garbage lanes.
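
To make the failure mode concrete, here is a minimal sketch in Python that models a wave's boolean as a 64-bit integer. The helper `wqm` is a hypothetical stand-in for the whole-quad-mode operation (any set bit in a quad sets all four bits of that quad); it is not the backend's code, and the lane values are made up for illustration.

```python
def wqm(mask: int) -> int:
    """Toy model of whole-quad mode on a 64-lane mask: every group of
    four lanes that has at least one bit set becomes 0b1111."""
    out = 0
    for quad in range(16):
        shift = 4 * quad
        if (mask >> shift) & 0xF:
            out |= 0xF << shift
    return out

# Assumed scenario: only lanes 0 and 1 of the first quad are active.
exec_mask = 0b0011
vote = 0b0000        # both active lanes vote "false"
garbage = 0b0100     # inactive lane 2 happens to hold a stale 1
raw = vote | garbage

unmasked = wqm(raw)             # garbage leaks into the whole quad
masked = wqm(raw & exec_mask)   # AND with EXEC first, as in this patch

print(bin(unmasked), bin(masked))
```

In this model the unmasked form reports "true" for the quad even though no active lane voted true, while the masked form stays all zeros.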

This fixes some Vulkan conformance tests that fail when the clustered
subgroup reduce operations are made to use @llvm.amdgcn.wqm.vote. I
added an extra test which
currently miscompiles.

An alternative would be to change how we lower boolean operations such
as NOT so that inactive lanes are always 0. As far as I can tell, the
only code-quality cost is that a few transforms would no longer apply,
like S_ORN2_B64 for src0 | ~src1. The main change would be lowering NOT
to an XOR with EXEC. Without that invariant, we'd have to do an analysis
to prove that the AND is redundant in order to get rid of these
instructions.
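
The alternative can be sketched the same way, again as a toy Python model on 64-bit integers rather than actual backend code. The function names are hypothetical; the only assumption is the proposed invariant that inputs already have 0 in every inactive lane.

```python
MASK64 = (1 << 64) - 1

def not_plain(src: int) -> int:
    """Plain bitwise NOT: flips inactive lanes too."""
    return ~src & MASK64

def not_via_exec(src: int, exec_mask: int) -> int:
    """NOT as XOR with EXEC: flips only the active lanes."""
    return src ^ exec_mask

def orn2(src0: int, src1: int) -> int:
    """Model of the fused src0 | ~src1 form."""
    return (src0 | ~src1) & MASK64

exec_mask = 0b0011          # assumed: lanes 0 and 1 active
src0, src1 = 0b0001, 0b0010  # inactive lanes already 0

inactive = ~exec_mask & MASK64
plain = not_plain(src1)
clean = not_via_exec(src1, exec_mask)
fused = orn2(src0, src1)
unfused = src0 | not_via_exec(src1, exec_mask)

print(plain & inactive != 0)   # plain NOT dirties inactive lanes
print(clean & inactive == 0)   # XOR with EXEC keeps them 0
print(fused & inactive != 0)   # the fused ORN2 form breaks the invariant
```

Both forms agree on the active lanes; they differ only in what they leave in the inactive ones, which is why the fused ORN2 form would have to be given up under this scheme.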
Thoughts?