Initial approach to get the values stored in the offloading arrays passed to the runtime calls.
The main idea is to split each runtime call that involves host-to-device memory offloading into two runtime calls, "issue" and "wait". The "issue" is the asynchronous version of the original runtime call and additionally returns a handle. The "wait" uses that handle to block until the transfer completes. The objective is to move the "issue" as early as possible, so that the memory transfer starts while computation that does not depend on it continues, and to move the "wait" as late as possible, down to the point where another runtime call uses the transferred data. This keeps the processor busy with other work during the transfer, so that ideally the transfer has already finished by the time the "wait" is reached.
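To illustrate the intended reordering, here is a minimal host-only sketch of the issue/wait split. It is not the runtime's API: `TransferHandle`, `issueTransfer`, `waitTransfer`, and `independentWork` are hypothetical names, and the transfer is simulated with `std::async` copying a vector on another thread.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical handle returned by the "issue" call (not the real runtime type).
struct TransferHandle {
  std::future<std::vector<int>> Pending;
};

// "Issue": start the transfer asynchronously and return a handle to it.
// The host-to-device copy is simulated by copying the data on another thread.
TransferHandle issueTransfer(const std::vector<int> &HostData) {
  return TransferHandle{
      std::async(std::launch::async, [HostData] { return HostData; })};
}

// "Wait": block until the transfer identified by the handle has completed.
std::vector<int> waitTransfer(TransferHandle &Handle) {
  assert(Handle.Pending.valid() && "handle was already waited on");
  return Handle.Pending.get();
}

// Host computation that does not depend on the transferred data.
int independentWork(int N) {
  int Sum = 0;
  for (int I = 1; I <= N; ++I)
    Sum += I;
  return Sum;
}

// The shape the transformation aims for: issue early, compute, wait late.
int overlappedRegion(const std::vector<int> &HostData) {
  TransferHandle Handle = issueTransfer(HostData);    // moved up
  int Local = independentWork(100);                   // overlaps the transfer
  std::vector<int> DeviceData = waitTransfer(Handle); // moved down to first use
  return Local + std::accumulate(DeviceData.begin(), DeviceData.end(), 0);
}
```

In this sketch, `independentWork` runs while the simulated transfer is in flight, and `waitTransfer` sits directly before the first use of the transferred data, which is the placement described above.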
Do we expect to call this with a nullptr? If not, we should make it a reference instead.