This is an archive of the discontinued LLVM Phabricator instance.

[PATCH] [NVPTX] Backend support for variadic functions
ClosedPublic

Authored by pavelkopyl on Nov 22 2022, 4:26 PM.

Details

Summary

This patch adds lowering for function calls with variadic number of arguments
as well as enables support for the following instructions/intrinsics:

  • va_arg
  • va_start
  • va_end
  • va_copy

Note that this patch doesn't intent to include clang's support for
variadic functions for CUDA.

According to the docs:

PTX version 6.0 supports passing unsized array parameter to a function which
can be used to implement variadic functions. [0]
The last parameter in the parameter list may be a .param array of type .b8 with
no size specified. It is used to pass an arbitrary number of parameters to
the function packed into a single array object.

When calling a function with such an unsized last argument, the last argument
may be omitted from the call instruction if no parameter is passed through it.
Accesses to this array parameter must be within the bounds of the array.
The result of an access is undefined if no array was passed, or if the access
was outside the bounds of the actual array being passed. [1]

Note that aggregates passed by value as variadic arguments are not currently
supported.

[0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#variadic-functions
[1] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-and-function-directives-func

Diff Detail

Event Timeline

pavelkopyl created this revision.Nov 22 2022, 4:26 PM
pavelkopyl requested review of this revision.Nov 22 2022, 4:26 PM
tra added a subscriber: yaxunl.Nov 22 2022, 10:05 PM

Nice.

I'm out of office this week and will take a closer look when I'm back next week, probably closer to the end of it.

@yaxunl Does HIP currently allow variadic functions on GPU? Of so, does that include kernels, or only regular functions?

llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1676

What determines the alignment here?
NVIDIA does not seem to specify anything regarding alignment here and their example shows align 4:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-and-function-directives-func

llvm/test/CodeGen/NVPTX/vaargs.ll
14

NVCC does not seem to allow varargs for kernels, only for __device__ functions. https://godbolt.org/z/s75vWsfbK

Not sure if we can do much about that on LLVM level, that would need to be something to be enforced in the front-end.

19

Would it be possible to reduce the checks to the minimum number of the instruction necessary to illustrate that we've lowered varargs correctly? Everything else just obscures what is ti exactly that we're testing for here.
If the remaining checks are still verbose, it may be useful to interleave the checks with the IR itself, so it's easier to tell which IR produced particular PTX.

@yaxunl Does HIP currently allow variadic functions on GPU? Of so, does that include kernels, or only regular functions?

No.

Fix issue after rebasing.

pavelkopyl added inline comments.Nov 23 2022, 4:56 PM
llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1676

It seems the documentation is a little bit outdated, because NVCC 11.7 generates .align 8 for the last parameter (unsized array): https://godbolt.org/z/7W7YThMf8

llvm/test/CodeGen/NVPTX/vaargs.ll
19

OK, I'll try to make it more clear.

tra added a comment.Nov 29 2022, 12:25 PM

Note that aggregates passed by value as variadic arguments are not currently supported.

What happens when a user does try to pass an aggregate as a var arg?

llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1676

The question remains. Do we set alignment to 8 because that's what NVCC does or is there some other reason behind it?
I.e. should it follow the alignment guarantees provided by e.g. malloc which returns a pointer sufficiently aligned to access any type.

I think this should be retrieved from DataLayout or TargetInfo, instead of being hardcoded here.
Based on NVPTXTargetLowering::getFunctionParamOptimizedAlign, we may have argument alignment as high as 16.

Note that aggregates passed by value as variadic arguments are not currently supported.

What happens when a user does try to pass an aggregate as a var arg?

That will trigger llvm_unreachable() at llvm/lib/CodeGen/ValueTypes.cpp:551
But this is common issue - aggregates are not allowed (at least now) in variadic arguments.

llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1676

I agree, that would be a right way to get alignment value from DataLayout. To be honest, it's not clear which LLVM IR type corresponds to unsized byte array and PTX documentation allows any alignment - 1, 2, 4, 8 or 16, but it doesn't specify which one should be used in what cases. Furthermore. from the correctness point of view exact value of the array alignment doesn't matter: both LowerCall() and LowerVAARG() insert instructions that align va_lits pointer according to a value type being stored/loaded (please, see vaargs.ll test). If we specify ".param .align 1 .b8 %VAParam[]" that may lead just to a padding space between the first variadic argument and beginning of the array itself. On the other hand, ".align 16" may also lead to wasting of stack space. So, ".aling 8" seems to be an optimal value. NVCC also uses ".align 8". That's why I chose exactly this value.

pavelkopyl added inline comments.Dec 3 2022, 4:55 AM
llvm/test/CodeGen/NVPTX/vaargs.ll
19

I reworked the test. Now it has only what is related to vaarg stuff.

pavelkopyl updated this revision to Diff 479831.Dec 3 2022, 4:58 AM

Updated vaargs.ll test to make it more clear how vaarg related instructions / intrinsics get lowered.

pavelkopyl added inline comments.Dec 7 2022, 3:12 PM
llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1676

After digging into this, it seems 8 byte - is the maximum value of alignments of data types which may be passed to a variadic function: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#vector-data-types
That is the reason for using exactly this value.
I moved this hardcode to NVPTXSubtarget where it's available via getMaxRequiredAlignment().

pavelkopyl updated this revision to Diff 481083.Dec 7 2022, 3:14 PM

Move align hardcode to NVPTXSubtarget where it's available via getMaxRequiredAlignment()

pavelkopyl updated this revision to Diff 481212.Dec 8 2022, 2:22 AM

Fix regexp for .align

tra added inline comments.Dec 8 2022, 11:26 AM
llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1677

We should probably follow the naming convention we use for other arguments <function>_paramN or in this case, <function>_vararg.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
1435

Is this needed? AFAICT raw_string_ostream is unbuffered and everything just gets appended to the string with nothing to flush.

1592

Nit: None of () are needed here.

1594–1595

nor here.

1716–1717

In practice we may want/need to deal with f16x2 and bf16x2 as variadic arguments. While nominally they are vectors in IR, they are passed as scalars and thus we should be able to pass them as variadic arguments. It's OK to deal with this later, in which case, this should have a TODO comment.

2337

almostly. It's a good one. :-)

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
200–203

Can we use existing MOV_ADDR/MOV_ADDR64 instead? It would also avoid hardcoding the symbol name.

That said, the fixed name has the benefit of being simpler, with the downside that the name we generate must be in sync with the name used by the instruction.

Another minor downside of hardcoded name is that it would be harder to search for in the generated PTX -- as it would be the same in all the functions. Having vararg argument name prefixed with function name as we do for other arguments would work better, IMO.

pavelkopyl added inline comments.Dec 11 2022, 11:23 AM
llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1677

OK, done.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
1435

Yes, it's unbuffered. Fixed.

1594–1595

Agree. Both statements are fixed.

1716–1717

Yes, probably in a perspective we will also support vectors in variadic arguments. I've added TODO about this.

2337

Fixed)

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
200–203

Thank you for advice.
Done. The only thing is that technically I create (Wrapper texternalsym) DAG to select IMOV64ri or IMOV32ri instructions. This is how fixed-sized .param arrays are lowered.
To be selected, MOV_ADDR requires a bit different DAG - (Wrapper tglobaladdr).

Replace fixed %VAParam name with <function>_vararg.

tra accepted this revision.Dec 12 2022, 12:05 PM
tra added inline comments.
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
2618

Nit: We could define and use VARARG_IDX = -1 or just document that a negative index is for a vararg, instead of adding a new isVarArg argument.
The call would just use /* vararg*/ -1 which is a slight improvement, IMO over having to use the comment *and* an extra argument.

This revision is now accepted and ready to land.Dec 12 2022, 12:05 PM
  • Rebased
  • Added description for getParamSymbol()
krisb added inline comments.Dec 13 2022, 4:27 AM
llvm/test/CodeGen/NVPTX/vaargs.ll
2

nit: I guess check-lines are no longer autogenerated, so it's better to remove this note.

pavelkopyl added inline comments.Dec 13 2022, 6:05 AM
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
2618

OK, done.

llvm/test/CodeGen/NVPTX/vaargs.ll
2

I agree, thank you.

  • Removed unneeded comment from vaarg.ll test
  • Removed unneeded comment from LowerFormalArguments()
This revision was landed with ongoing or failed builds.Dec 13 2022, 8:08 AM
This revision was automatically updated to reflect the committed changes.
tra added a comment.May 16 2023, 1:57 PM

There's an interesting discrepancy between what the PTX spec says and what NVCC does.

PTX spec (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-and-function-directives-func) allows passing vararg arguments as an unsized array parameter to the function. According to the same spec "Parameters in .param space are accessed using ld.param and st.param instructions in the body."
However, when I look at the code generated by nvcc, it appears that it uses ld.local to access vararg parameters: https://godbolt.org/z/qh4rq5xxK

I'm talking to NVIDIA folks and they seem to struggle to address the discrepancy. Considering that the local access for variadics has been there from the very beginning and that there's probably no other viable ways to pass arbitrary amount of data to a thread, local memory is probably the only choice. This suggests that they probably just forgot to document this quirk.

Can you elaborate on what was your reason for lowering va_arg as a local AS access? Was it to mimic what NVCC does, or is this documented somewhere.

I agree, the documentation is a bit messy. I used the information from section 5.1 State Spaces, in particular 5.1.6.4. At least, it admits that one can take the address of a ".param" formal parameter and then use ld.local. On the other hand, it says nothing about when should do this, so yes, I mostly tried to mimic NVCC.

tra added a comment.May 17 2023, 4:33 PM

I agree, the documentation is a bit messy. I used the information from section 5.1 State Spaces, in particular 5.1.6.4. At least, it admits that one can take the address of a ".param" formal parameter and then use ld.local. On the other hand, it says nothing about when should do this, so yes, I mostly tried to mimic NVCC.

The conclusion from NVIDIA's side was exactly that -- if address is taken, everything gets magically copied into .local.
However, if one were to directly access the vararg, we'd still need to use ld.param.

Local copies tend to be expensive. We may eventually consider whether we can calculate access using offset vs the vararg argument w/o doing the math on the actual pointer. For now, accessing them via a local pointer would do.