Add a lowering of vector.warp_execute_on_lane_0 to scf.if, plus memory transfers for the operands and yielded values.
This also adds an integration test that runs on a GPU warp. The same tests can later be reused with different comment lines to test distribution transformations.
This is mostly based on @springerm's contribution.
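
For illustration only, a rough sketch of the kind of rewrite described above. It is not taken from the patch: the op "test.compute", the function names, passing the buffer %buf in as a function argument, the memory space 3, and the placement of gpu.barrier are all assumptions made to keep the example small. Operands of the warp op, which this example does not have, would presumably be passed through memory in the same way, in the opposite direction.

```mlir
// Before: the region runs on lane 0 only. It yields a vector<32xf32>,
// and each of the 32 lanes receives a vector<1xf32> slice of it.
func.func @before(%laneid: index) -> vector<1xf32> {
  %r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    // "test.compute" is a hypothetical placeholder op.
    %v = "test.compute"() : () -> vector<32xf32>
    vector.yield %v : vector<32xf32>
  }
  func.return %r : vector<1xf32>
}

// After (sketch): lane 0 computes the value inside scf.if and writes it
// to a buffer visible to the whole warp; every lane then loads its slice.
// %buf is assumed to be provided by the caller.
func.func @after(%laneid: index, %buf: memref<32xf32, 3>) -> vector<1xf32> {
  %c0 = arith.constant 0 : index
  %is_lane_0 = arith.cmpi eq, %laneid, %c0 : index
  scf.if %is_lane_0 {
    %v = "test.compute"() : () -> vector<32xf32>
    vector.store %v, %buf[%c0] : memref<32xf32, 3>, vector<32xf32>
  }
  // Assumed synchronization so the store is visible before the loads.
  gpu.barrier
  %r = vector.load %buf[%laneid] : memref<32xf32, 3>, vector<1xf32>
  func.return %r : vector<1xf32>
}
```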
is this needed?