The AMDGPU target has a convention that defines all VGPRs
(except the initial 32 argument registers) as callee-saved.
This convention is not always efficient: a callee that
requires many registers ends up emitting a large number of
spills, even when its caller requires only a few.
This patch revises the ABI by introducing more scratch registers
that a callee can freely use.
The 256 VGPRs are now divided into:
32 argument registers, 112 scratch registers, and 112 callee-saved registers.
The scratch registers and the CSRs are interleaved at regular
intervals (a split boundary of 8 registers) to obtain better occupancy.
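
The partition described above can be sketched as a small classifier. This is
only an illustration, not code from the patch: the names `classify_vgpr`,
`NUM_ARG_VGPRS`, and `BLOCK_SIZE` are hypothetical, and the sketch assumes
that the first block of 8 after the argument registers is scratch and that
blocks alternate from there. The description does not state which kind comes
first.

```python
from collections import Counter

NUM_VGPRS = 256      # total VGPRs, per the description
NUM_ARG_VGPRS = 32   # the initial 32 registers carry arguments
BLOCK_SIZE = 8       # the "split boundary of 8"


def classify_vgpr(index: int) -> str:
    """Return 'argument', 'scratch', or 'callee-saved' for a VGPR index."""
    if not 0 <= index < NUM_VGPRS:
        raise ValueError(f"VGPR index out of range: {index}")
    if index < NUM_ARG_VGPRS:
        return "argument"
    # Assumption: blocks of 8 alternate, starting with a scratch block.
    # The real ordering may be the opposite; the totals are unaffected.
    block = (index - NUM_ARG_VGPRS) // BLOCK_SIZE
    return "scratch" if block % 2 == 0 else "callee-saved"


# Tally how many registers fall into each class.
counts = Counter(classify_vgpr(i) for i in range(NUM_VGPRS))
print(dict(counts))
```

The 224 non-argument registers form 28 blocks of 8, so alternating them gives
14 blocks (112 registers) of each kind, which matches the totals above
whichever kind comes first. The occupancy benefit presumably comes from a
function being able to reach both scratch registers and CSRs without touching
high-numbered VGPRs, but the description itself does not spell this out.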
A description of why it's split this way may be helpful.