Carefully work around not knowing the thread mask that nvptx intrinsic
functions require.
If the warp is converged when calling try_lock, a single rpc call will handle
all lanes within it. Otherwise, more than one rpc call will occur, each with
its own thread mask, and those masks together make up the unknown one.
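
As a rough illustration of the mask composition described above, here is a
host-side C++ model. It is not the nvptx code path and uses no GPU intrinsics.
The names LaneMask, masks_per_call and composes_to are hypothetical, the warp
is assumed to be 32 lanes wide, and the model assumes each active lane takes
part in exactly one rpc call, so the per-call masks are disjoint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model only: one bit per lane of an assumed 32-lane warp.
using LaneMask = std::uint32_t;

// Hypothetical helper. group_of_lane[i] names the converged group that lane i
// belongs to when it reaches try_lock. Returns one mask per group, i.e. one
// mask per rpc call in this model, in order of first appearance by lane index.
std::vector<LaneMask> masks_per_call(LaneMask active,
                                     const std::vector<int> &group_of_lane) {
  assert(group_of_lane.size() == 32);
  std::vector<int> seen_groups;
  std::vector<LaneMask> masks;
  for (int lane = 0; lane < 32; ++lane) {
    const LaneMask bit = LaneMask{1} << lane;
    if (!(active & bit))
      continue; // Inactive lanes never make a call.
    const int group = group_of_lane[lane];
    std::size_t idx = 0;
    while (idx < seen_groups.size() && seen_groups[idx] != group)
      ++idx;
    if (idx == seen_groups.size()) {
      seen_groups.push_back(group);
      masks.push_back(0);
    }
    masks[idx] |= bit;
  }
  return masks;
}

// True if the masks are pairwise disjoint and their union equals `active`.
bool composes_to(const std::vector<LaneMask> &masks, LaneMask active) {
  LaneMask seen = 0;
  for (LaneMask m : masks) {
    if (seen & m)
      return false; // A lane appeared in two calls.
    seen |= m;
  }
  return seen == active;
}
```

With every lane in the same group the model yields a single mask equal to the
active mask. Splitting the warp into two groups yields two masks whose union
is the active mask, without either call needing to know that mask up front.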
We might need another lane sync here; I'll test that once I put the
parallelism back in.