Changes the NVPTX ABI to pass aggregates directly. Only clang-generated IR is
affected. The change does not alter the function signatures in the generated
PTX.
Discussion: https://llvm.discourse.group/t/nvptx-calling-convention-for-aggregate-arguments-passed-by-value
Currently, the NVPTX ABI passes aggregate values indirectly, as a byval pointer.
When we need to pass a *value*, LLVM has to store it in an alloca so that it has
a pointer to pass on. This is a double whammy for NVPTX. First, LLVM often fails
to eliminate that alloca (SROA usually considers such a pointer escaped and
gives up), which is a noticeable performance hit. Second, when we lower IR to
PTX, the argument is actually passed by copy, so we end up doing extra work just
to load the value back from the alloca. In short, we do more work for less
performance. Switching to passing aggregates directly allows us to generate
better code.
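To make the difference concrete, here is a minimal sketch of the kind of source
this affects. The names Vec3, sum, and sumScaled are hypothetical and not taken
from the patch, and the IR shapes described in the comments are approximate;
the exact output depends on the LLVM version and optimization level. In CUDA
code these would be __device__ functions; plain C++ is used here so the snippet
stands alone.

```cpp
// Hypothetical aggregate, used only for illustration.
struct Vec3 {
  float x, y, z;
};

// The callee takes the aggregate by value.
//
// Indirect (byval) convention: the IR parameter is roughly a pointer marked
// byval, so the caller must first store its Vec3 into an alloca and pass the
// address of that slot.
//
// Direct convention: the IR parameter is roughly the struct value itself, so
// the caller does not need an alloca just to have a pointer to pass.
inline float sum(Vec3 v) { return v.x + v.y + v.z; }

// The caller builds a temporary value and passes it on. With the byval
// convention this temporary is what ends up in the caller-side alloca.
inline float sumScaled(float a, float s) {
  return sum(Vec3{a * s, 2.0f * a * s, 3.0f * a * s});
}
```

Comparing the IR that clang emits for sumScaled before and after the change
should show whether the caller-side alloca and its stores disappear.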
Nit: Maybe add a note that this effectively disables passing values via byval.