This combines two separate ops (D88972: gpu.create_token, D89043: gpu.host_wait) into a single gpu.wait op.
After all, I do like the idea of combining the two ops, because it exactly matches the pattern that the
other gpu ops implementing the AsyncOpInterface (launch_func, copies, alloc) will follow:
If the op is async, we return a !gpu.async.token. Otherwise, we synchronize with the host and don't return a token.
The use cases for gpu.wait async and gpu.wait are further apart than those of, e.g., gpu.h2d async and gpu.h2d,
but I like that the async keyword keeps a consistent meaning across GPU ops.
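
To make the async/non-async pattern concrete, here is a minimal sketch as a toy Python model. It is not MLIR and not the dialect's implementation: ToyRuntime, AsyncToken, and run are hypothetical names invented for illustration, and "completion" is only simulated with a set. The one thing it is meant to show is that the same op either returns a token (async) or synchronizes with the host and returns nothing.

```python
import itertools
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AsyncToken:
    """Hypothetical stand-in for a !gpu.async.token value."""
    id: int
    op: str
    deps: Tuple["AsyncToken", ...] = ()


class ToyRuntime:
    """Toy model of the pattern; nothing here touches a real GPU."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self.completed = set()  # tokens the host has synchronized with

    def _complete(self, token: AsyncToken) -> None:
        # A token is complete only once everything it depends on is complete.
        if token in self.completed:
            return
        for dep in token.deps:
            self._complete(dep)
        self.completed.add(token)

    def run(self, op: str, deps: Sequence[AsyncToken] = (),
            is_async: bool = False) -> Optional[AsyncToken]:
        if is_async:
            # Async form: do not block, hand back a token to order later ops.
            return AsyncToken(next(self._ids), op, tuple(deps))
        # Non-async form: synchronize with the host, return no token.
        for dep in deps:
            self._complete(dep)
        return None


rt = ToyRuntime()
t0 = rt.run("wait", is_async=True)                    # role of gpu.create_token
t1 = rt.run("launch_func", deps=[t0], is_async=True)  # any other async op
done = rt.run("wait", deps=[t1])                      # role of gpu.host_wait
```

In this model the async wait plays the role of the former gpu.create_token, and the non-async wait plays the role of the former gpu.host_wait, which is why a single op with an optional async keyword covers both.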