Control flow in AMDGPU is lowered not just via branches, but also via bit
manipulation of the EXEC mask. This means every basic block might have
some ALU instructions at the beginning and end to setup the EXEC mask
correctly. Bad things can happen when target-independent passes like
scheduling or (especially) register allocation mess with these parts. This
has become more of an issue now that control flow pseudo instructions are
lowered earlier (which is how things should be).
For example, a conditional branch from an if-statement might be lowered as
s_and_saveexec_b64 s[10:11], vcc s_xor_b64 s[10:11], exec, s[10:11] ; mask branch BB5_2
As long as only the ; mask branch pseudo instruction is a terminator, the
register allocator might decide to spill registers after the s_and_saveexec
instruction, which (in the case of a vector register) means that the
register will not be spilled correctly.
Another problem is that we want to move the AMDGPU-specific Whole Quad Mode
pass to after machine scheduling. That pass needs to introduce its own
exec-manipulating instructions while being aware of the meaning of the
already existing exec-instructions. That meaning is currently lost.
So here's a suggestion for dealing with all that: Allow arbitrary
instructions to be marked as Terminator (end of BB) or Initiator (beginning
of BB) instructions.
The intention is that target-independent passes will mess with those
instructions as little as possible: no scheduling changes, and register
spilling and restoring happens outside those regions whenever possible.
The diff here is not complete, but it shows the direction I want to go in.
It passes all the AMDGPU lit tests except for spill-m0.ll, which needs
RegAllocFast to be fixed to become aware of the Initiator/Terminator
rules.