This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Efficiently support dynamic index on CUDA kernel aggregate parameters.
Needs Review · Public

Authored by hliao on Nov 16 2020, 11:15 PM.

Details

Reviewers
tra
jlebar

Event Timeline

hliao created this revision. Nov 16 2020, 11:15 PM
Herald added projects: Restricted Project, Restricted Project. Nov 16 2020, 11:15 PM
hliao requested review of this revision. Nov 16 2020, 11:15 PM

This is an experimental, demo-only patch, written in my spare time, that eliminates the private memory usage in https://godbolt.org/z/EPPn6h. The attachment includes both the reference and new IR, PTX, and SASS (sm_60) output. In the new code, the aggregate argument is loaded through the LDC instruction in SASS instead of MOV because the address is non-static. I don't have sm_60 hardware to verify that. Could you try it on real hardware?

BTW, according to the PTX ISA document, the parameter space is read-only for input parameters and write-only for output parameters. If that's right, even non-kernel functions may require a similar change, as the semantics differ from the language model, where the argument variable can be modified in the function body.

tra added a reviewer: jlebar. Nov 17 2020, 10:27 AM

This is an experimental, demo-only patch, written in my spare time, that eliminates the private memory usage in https://godbolt.org/z/EPPn6h. The attachment includes both the reference and new IR, PTX, and SASS (sm_60) output. In the new code, the aggregate argument is loaded through the LDC instruction in SASS instead of MOV because the address is non-static. I don't have sm_60 hardware to verify that. Could you try it on real hardware?

I'll give it a try.

BTW, according to the PTX ISA document, the parameter space is read-only for input parameters and write-only for output parameters. If that's right, even non-kernel functions may require a similar change, as the semantics differ from the language model, where the argument variable can be modified in the function body.

Regular functions currently handle parameters exactly the same way as kernels - via a copy to a local buffer, which can then be modified. https://godbolt.org/z/W9PY17
So, if we need to change a parameter, it would have to be done on a local copy.
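
The by-value semantics that force this local copy can be illustrated in plain C++. This is only a minimal sketch: the names `Blob` and `bump_and_pick` are hypothetical and are not taken from the patch or from either godbolt example.

```cpp
// Hypothetical aggregate standing in for a by-value function or kernel argument.
struct Blob {
  int v[4];
};

// The callee is allowed to write to its by-value parameter. The write lands in
// the callee's own copy, which is why an argument kept in a read-only space
// needs a writable local copy for functions like this one.
int bump_and_pick(Blob b, int i) {
  b.v[i] += 100;  // modifies the callee's copy only
  return b.v[i];
}
```

Calling `bump_and_pick(x, 2)` on a `Blob x` leaves `x` itself unchanged in the caller, which is the language-level behavior any lowering has to preserve.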

clang/test/CodeGenCUDA/kernel-args.cu
13–14

Is the idea here to rely on PTX to store the value in param space (so we do actually pass the parameter by value) and to represent it at the IR level as a reference to externally provided storage holding the value?
So:

  • C++ passes argument by value
  • IR knows that PTX will store it somewhere in param space and uses byref
  • we still generate PTX which has parameter passed by value, but now we can access it directly via a reference to param-space value.

Presumably for parameters we do want to modify, we'll need to fall back to having a local copy.

So far so good. However, now we may have a problem distinguishing between C++-level arguments passed by value vs by reference -- they all will look like byref on IR level. That is, unless you rely on addrspace(101) to indicate that it's actually a byval in disguise.

It looks plausible as long as we can guarantee that we never modify it, neither in the current function nor in any of the callees if we pass it by reference.

I'm not particularly familiar with the AA machinery. I'd appreciate it if you could elaborate on how you see it all working end-to-end.

I am legit excited about this if we could figure out how to make it work, but I don't have anything to add beyond what tra said.

As mentioned earlier, this support is very experimental. Even though the SASS looks reasonable, it still needs to be verified on real systems. Non-kernel functions seem to share the same path, so we should do a similar thing for them. The current approach handles that in the codegen phase by adding the alloca back to match the parameter-space semantics. Once that alloca is dynamically indexed, SROA won't promote it. Only instcombine eliminates that alloca, and only when it is modified just once by a copy from constant memory. Because instcombine could break certain patterns prepared during codegen preparation, it isn't run in the backend, so that dynamically indexed alloca won't be removed.
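
For concreteness, the two access patterns being contrasted might look like the sketch below. It is not the code behind the godbolt link; `Coeffs`, `scale_static`, `scale_dynamic`, and the `SKETCH_*` macros are made-up names, and the host fallback exists only so the snippet also builds as ordinary C++.

```cpp
// Builds as CUDA (nvcc or clang) or, for illustration only, as plain host C++.
#if defined(__CUDACC__) || defined(__CUDA__)
#define SKETCH_KERNEL __global__
#define SKETCH_TID (blockIdx.x * blockDim.x + threadIdx.x)
#else
#define SKETCH_KERNEL
static unsigned sketch_host_tid = 0;  // stand-in for the thread index on host
#define SKETCH_TID sketch_host_tid
#endif

// Hypothetical aggregate passed by value as a kernel parameter.
struct Coeffs {
  float c[8];
};

// Static index: the element offset is known at compile time.
SKETCH_KERNEL void scale_static(Coeffs k, const float *in, float *out) {
  unsigned t = SKETCH_TID;
  out[t] = in[t] * k.c[3];
}

// Dynamic index: the element offset is only known at run time. This is the
// case where the local copy of the parameter is dynamically indexed.
SKETCH_KERNEL void scale_dynamic(Coeffs k, const int *sel, const float *in,
                                 float *out) {
  unsigned t = SKETCH_TID;
  out[t] = in[t] * k.c[sel[t] & 7];  // mask keeps the index in bounds
}
```

Neither kernel writes to `k`, so both only need read access to the argument; the difference lies solely in whether the index is a compile-time constant.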

clang/test/CodeGenCUDA/kernel-args.cu
13–14

It does the same thing as nvptx-lower-args but applies it earlier, in the frontend. The upside is that the IR is then optimized by all the middle-end optimizations. instcombine will remove the dynamically indexed alloca if it's only modified by copying from constant memory. AA teaches the compiler that the parameter space is constant. Even though we run SROA after nvptx-lower-args, we generally won't run instcombine in the backend, as it potentially breaks certain patterns prepared in the codegen preparation phase.

byref (newly added) in LLVM IR is different from by-reference in C++. The latter is translated into a pointer. byref in LLVM IR says that the content of that pointer should not be modified in the function body. It won't be ambiguous on the IR side.
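
On the C++ side, the by-value versus by-reference distinction that has to survive into the IR can be seen in a small sketch. The names `Packet`, `aliases_caller_by_ref`, and `aliases_caller_by_val` are hypothetical and serve only as illustration.

```cpp
// Hypothetical aggregate used only for this illustration.
struct Packet {
  int v[4];
};

// C++ by-reference: the parameter refers to the caller's object, so its
// address matches the address the caller passes alongside it.
bool aliases_caller_by_ref(const Packet &p, const Packet *caller) {
  return &p == caller;
}

// C++ by-value: the parameter is a distinct object owned by the callee, so
// its address never matches the caller's object.
bool aliases_caller_by_val(Packet p, const Packet *caller) {
  return &p == caller;
}
```

For a `Packet x`, `aliases_caller_by_ref(x, &x)` is true and `aliases_caller_by_val(x, &x)` is false; only the first form is a pointer to caller-visible storage at the language level.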

It's still possible for the backend to do something similar. Once that byval argument is marked readonly, the alloca could be skipped.