Code motion into a gpu::LaunchOp region requires operands of moved
instructions to be threaded through operands of the gpu.launch to
ensure that the gpu.launch remains closed from above.
To enable this, gpu.launch operations are now created with extensible
operand storage. The overhead is expected to be low given that
gpu.launch is a relatively rare operation.
This says "gpu.launch" but the line above says "gpu.func". Let's use one everywhere and say the other is equivalent.
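
The operand threading described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use the MLIR C++ API, and `Value`, `Op`, `Launch`, `add_operand`, `is_closed_from_above`, and `sink_into_launch` are hypothetical names invented for the illustration. The growable Python lists stand in for the extensible operand storage mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Value:
    """Toy SSA value, compared by identity."""
    name: str


@dataclass(eq=False)
class Op:
    """Toy operation with operands and a single result."""
    name: str
    operands: list
    result: Value


@dataclass(eq=False)
class Launch:
    """Toy stand-in for gpu.launch (hypothetical, not the MLIR class).

    The body may only use block arguments or values defined in the body.
    """
    operands: list = field(default_factory=list)    # values captured from outside
    block_args: list = field(default_factory=list)  # one region argument per operand
    body: list = field(default_factory=list)

    def add_operand(self, outer: Value) -> Value:
        # Reuse the existing block argument if the value is already threaded.
        for captured, arg in zip(self.operands, self.block_args):
            if captured is outer:
                return arg
        # Otherwise grow the operand list; this is the step that needs
        # extensible operand storage.
        arg = Value(f"arg_{outer.name}")
        self.operands.append(outer)
        self.block_args.append(arg)
        return arg

    def is_closed_from_above(self) -> bool:
        visible = list(self.block_args)
        for op in self.body:
            if not all(any(v is w for w in visible) for v in op.operands):
                return False
            visible.append(op.result)
        return True


def sink_into_launch(launch: Launch, op: Op) -> None:
    """Move `op` to the front of the launch body, threading its operands."""
    # Operands of the moved op now enter the region as launch operands.
    op.operands = [launch.add_operand(v) for v in op.operands]
    # The op's result no longer needs to be passed in: rewire its users
    # inside the body and drop the corresponding launch operand.
    for i, (captured, arg) in enumerate(zip(launch.operands, launch.block_args)):
        if captured is op.result:
            for user in launch.body:
                user.operands = [op.result if v is arg else v for v in user.operands]
            del launch.operands[i]
            del launch.block_args[i]
            break
    launch.body.insert(0, op)


# Usage: an addition defined outside the launch feeds one op inside it.
a, b = Value("a"), Value("b")
add = Op("add", [a, b], Value("sum"))
launch = Launch()
sum_arg = launch.add_operand(add.result)
launch.body.append(Op("use", [sum_arg], Value("out")))

sink_into_launch(launch, add)
```

In this toy run the launch starts with one operand (`sum`) and ends with two (`a` and `b`), which is why a fixed-size operand list would not be enough for this kind of code motion.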