Page MenuHomePhabricator

[NVPTX] Improve lowering of byval args of device functions.

Authored by tra on Jun 15 2016, 4:38 PM.



Lower byval arguments of device functions the same way
we lower them for kernels and ensure that it can be accessed
via argument's symbol.

This allows loading value of the argument using [symbol+offset]
instead of converting argument to general space pointer and using it
for indexing (which also implicitly converts param space pointer to
local space one on SASS level and triggers copying of argument into
local space in the process).

This reduces call overhead, uses less registers and reduces overall
SASS size by 2-4% on thrust tests.

Diff Detail

Event Timeline

tra updated this revision to Diff 60899.Jun 15 2016, 4:38 PM
tra retitled this revision from to [NVPTX] Improve lowering of byval args of device functions..
tra updated this object.
tra added reviewers: jholewinski, jlebar, jingyue.
tra added a subscriber: llvm-commits.
jlebar requested changes to this revision.Jun 15 2016, 5:22 PM
jlebar edited edge metadata.

Pending addition of tests.

This revision now requires changes to proceed.Jun 15 2016, 5:22 PM
jlebar added inline comments.Jun 15 2016, 6:03 PM

I'm not sure this is the right place to do this transformation, as opposed to (say) at the machine instruction level.

My first approximation mental model for selection dag is that it does a reasonably direct translation of IR to machine instructions. It's responsible for recognizing mapping constructions in IR to fast machine instructions (e.g. x86's various addressing modes), but optimizations that recognize that e.g. f(f^-1(x)) == x are less common in isel. (Maybe someone will tell me this is wrong.)

So I'm not sure whether this belongs here. Especially if it's a correctness transformation -- what if the source isn't directly a MoveParam? Like, we moved the param to one reg, then moved it to another reg, then converted *that*. Are we clear of the bug we're working around? If this is just an optimization, at least it should be guarded so we don't run it at -O0.

I would also like to be explicit in our source (somewhere) exactly what is the nvptx bug that we're working around, so that this change doesn't just sit in our backend and confuse all hackers who come after us. :)

jingyue added inline comments.Jun 15 2016, 8:32 PM

Byval parameters of device functions might not be in the .param space (

void device_function(struct S x) {

__global__ void kernel() {
  struct S x;
tra updated this revision to Diff 60993.Jun 16 2016, 10:37 AM
tra edited edge metadata.

Added test case for lowering byval values in device functions.

tra added a comment.Jun 16 2016, 10:56 AM

Here's a bit more details on what I'm trying to do. Let's take ptr_in_byval_func() I've added in lower-kernel-ptr-arg.ll as an example.
Currently we produce this PTX:

	mov.b64	%rd1, ptr_in_byval_func_param_0;
	cvta.local.u64 	%rd2, %rd1;
	ld.param.u64 	%rd3, [ptr_in_byval_func_param_1];
	ld.u32 	%rd4, [%rd2+8];
	ld.u32 	%rd5, [%rd2+12];
	shl.b64 	%rd6, %rd5, 32;
	or.b64  	%rd7, %rd6, %rd4;
	ld.u32 	%r1, [%rd7];
	st.u32 	[%rd3], %r1;

We're doing fair amount of unnecessary things here (I wonder why we're loading a pointer as two 32-bit words here, too). To make things worse, it also forces PTX-to-SASS compiler to spill byval argument into local memory when we take address of parameter. It adds even more unnecessary code on SASS level.

With this patch things look a bit better:

	ld.param.u64 	%rd1, [ptr_in_byval_func_param_1];
	ld.param.u64 	%rd2, [ptr_in_byval_func_param_0+8];
	ld.u32 	%r1, [%rd2];
	st.u32 	[%rd1], %r1;

All function arguments are passed via either .reg or .param. Quoting from the link above:

The parameter (.param) state space is used [...] (2b) to declare locally-scoped byte array variables that serve as function call arguments, typically for passing large structures by value to a function.

We copy all argument to .param space when we lower a call.
In case of your example we get this:

	{ // callseq 0
	.reg .b32 temp_param_reg;
	.param .align 4 .b8 param0[8];
	st.param.b32	[param0+0], %r4;
	st.param.b32	[param0+4], %r3;
	.param .b32 retval0;
	call.uni (retval0), 
	ld.param.b32	%r5, [retval0+0];
tra added inline comments.Jun 16 2016, 1:59 PM

LowerKernelArgs() makes sure that IR itself no longer touches argument directly.
Here's a typical example of what happens:

Input IR with byval argument:

define i32 @gen_arg(%struct.S* byval align 1) #0 {
  %2 = getelementptr inbounds %struct.S, %struct.S* %0, i32 0, i32 0
  %3 = load i8, i8* %2, align 1
  %4 = sext i8 %3 to i32
  %5 = mul nsw i32 %4, 3
  ret i32 %5

After LowerKernelArgs:

define i32 @gen_arg(%struct.S* byval align 1) #0 {
  %2 = alloca %struct.S, align 1
  %3 = addrspacecast %struct.S* %0 to %struct.S addrspace(101)*
  %4 = load %struct.S, %struct.S addrspace(101)* %3
  store %struct.S %4, %struct.S* %2
  %5 = getelementptr inbounds %struct.S, %struct.S* %2, i32 0, i32 0
  %6 = load i8, i8* %5, align 1
  %7 = sext i8 %6 to i32
  %8 = mul nsw i32 %7, 3
  ret i32 %8

That guarantees that the only part of IR that touches the argument is the addrspacecast instruction. Everything else operates on its value. That covers our goal of controlling direct access to byval argument on IR level.

Argument itself is lowered to MoveParam (which is just a wrapper over argument symbol) in LowerFormalArguments() which becomes an input to addrspacecast.

Normally addrspacecast would be lowered as an intrinsic converting from generic param space.
In this case we check if it's addrspacecast(moveParam) (which would only ever happen for an argument) and lower it as the symbol of the argument. Behavior of all other addrspacecast variants is not affected.

I think that covers lowering of the IR in the body of the function.

One remaining case is CopyToExportRegsIfNeeded() which is called by SelectionDAGISel::LowerArguments() and which creates CopyToReg node which directly copies arg pointer to a register during unoptimized builds. We end up taking address of byval argument just to copy it to a register that's never used (AFAICT) for any purpose.

I'm not sure yet what to do about this -- do not copy byval pointer to export regs, copy something else to export regs (what?), or eliminate CopyToReg later. Ideas or suggestions are welcome.

We need to add comments in this code explaining

  • although this is an optimization, it is also needed to work around a bug in ptxas.
  • exactly what that bug is.
  • exactly how this works around the problem.

If it's OK with you, I'm not going to read the discussion here and just look at the comments in the code, as a way to prove to myself that the comments there actually explain the problem in a way that at least I can understand.

tra updated this revision to Diff 61042.Jun 16 2016, 3:30 PM
tra edited edge metadata.

Added code comments with details as jlebar@ suggested.

tra updated this revision to Diff 61045.Jun 16 2016, 3:41 PM

rephrased code comments a bit.

This is great, thank you. I wanted to try to help with articles and commas, so I pointed out all of them; hope that's not too annoying.


Nit, missing close quote.


the address


an argument


a 32-bit write



(Comma after long prepositional phrase at the beginning of a sentence.)


space, which

(Comma before non-exclusive "which" clause. It's the difference between

"I use the compiler which I prefer." (I choose my favorite compiler from among a variety of options.)

"I use the compiler, which I prefer." (I use a compiler, as opposed to an assembler, or writing bits out with a magnetized needle.))


unnecessary, as
variable, and

(Comma before conjunction.)


the argument


the address of the local copy


Based on this comment, I am unsure why this transformation is correct:

Suppose someone takes the address of a local variable and then writes to that address. That write needs to show up for everyone who reads from the param. Which means that everyone needs to be using that local copy. Which really means that the transformation we're describing is only safe because of how we lower the kernel args in the IR transformation.

So I think we should say something about that, if that makes any sense.


Would suggest keeping the "//"s here, so it doesn't look like separate comments (and is consistent with above).


sure the IR


Instead, the argument pointer


the default address space


, and the data


it, which


s/it/the spill/


Suggest "and gives us an opportunity" -- we're now a bit far from the original "gives".


OK, this is the argument I was looking for above. Maybe we just need a pointer there down to this paragraph ("This is safe because of how we lower kernel args in IR, see (a) below." or something.)


This is a bit confusing as a comment, because it's describing a change relative to code that no longer exists. I'm not sure this part is necessary at all, actually, which is maybe good because long comment is long. :)


The second sentence is a bit confusing, maybe we can say

It replaces the addrspacecast (MoveParam) from step (a) with the arg symbol itself. This can then be used for [symbol + offset] addressing.

tra updated this revision to Diff 61108.Jun 17 2016, 10:23 AM
tra marked 19 inline comments as done.

Incorporated jlebar's editing suggestions.


In current code because we move value of param pointer to a register, it magically gets spilled to local memory by ptxas and we actually get a pointer to that local copy. If we want to use that pointer to modify something, we'll be modifying local copy. However, if we attempt to read from param space as well, that will be a problem. You can probably do that with inline asm, but it will not happen with IR we generate -- neither currently, nor after the patch. In both cases IR can only do writes to a local copy (the only way to access parameter in IR is via that pointer to a local copy). With this patch direct reads from param space will only happen if we don't modify local copy, so we're safe there, too.


I've added a reference there to step (a).


Simplified to just describe where/how and arg is lowered to MoveParam.

jlebar accepted this revision.Jun 17 2016, 11:13 AM
jlebar edited edge metadata.



the argument

This revision is now accepted and ready to land.Jun 17 2016, 11:13 AM

Art, are you waiting for Jingue or Justin H. to look at this? (Justin H. told me to ping him via e-mail if I need something.)

jingyue accepted this revision.Jun 24 2016, 3:48 PM
jingyue edited edge metadata.
tra added a subscriber: tra.Jun 24 2016, 3:53 PM

Ah, the patch has landed in 273313, but wasn't picked up by phabricator.


tra planned changes to this revision.Jul 18 2016, 4:44 PM

Patch was rolled back in r274163 because it was breaking compilation with -O0.
I'll post updated patch shortly.

tra updated this revision to Diff 64422.Jul 18 2016, 4:58 PM
tra edited edge metadata.

ptxas bug workaround has been committed separately, so now it's purely an optimization.

Now byval arguments are handled the same way for both global and device functions:

  • NVPTXLowerKernelArgs pass creates local copy which is addrspacecast'ed to param space.
  • LowerFormalArguemts lowers them to MoveParams(param_symbol)
  • SelectDirectAddr() recognizes addrspacecast([0->101] MoveParam(arg_symbol)) created above and returns arg_symbol.
This revision is now accepted and ready to land.Jul 18 2016, 4:58 PM
tra planned changes to this revision.Jul 18 2016, 5:03 PM

Justin, Jingyue can you take another look?

jlebar added inline comments.Jul 18 2016, 5:30 PM

As discussed, I think it would be good to have a brief comment here along the lines of

SelectDirectAddr() recognizes addrspacecast([0->101] MoveParam(arg_symbol)) created above and returns arg_symbol.

I don't think we even need to to explain where this comes from -- it's a sufficiently generic construct that it makes sense to me we'd want to recognize it here.


Unnecessary parens.


This comment needs to be updated.

Actually, doing any of this work in a pass called "LowerKernelArgs" seems wrong. Should we rename the pass?


Actually it seems like they handle them exactly the same?


This comment isn't there anymore.

tra updated this revision to Diff 64521.Jul 19 2016, 10:29 AM
tra marked 4 inline comments as done.

Updated comments.
Use CastN for accessing SDNode arguments for consistency.

This revision is now accepted and ready to land.Jul 19 2016, 10:29 AM
tra planned changes to this revision.Jul 19 2016, 10:30 AM

I have no idea why phabricator marks this revision as accepted when I update the patch. Marking it as "Plan changes", again.


Comment updated. As for renaming the file/pass, I'd rather do it separately.

tra updated this revision to Diff 64522.Jul 19 2016, 10:34 AM

CastN is a pointer.

This revision is now accepted and ready to land.Jul 19 2016, 10:34 AM

lgtm. I don't have a problem if we rename the pass and in a separate patch, but I do think we should do that.


OK, but carrying this one list when we use it for two different things ("things we do to all functions" vs "things we do to kernel functions") is confusing.

Can we make it so this first one isn't part of the list?

tra updated this revision to Diff 64714.Jul 20 2016, 11:05 AM

Reedited comments describing why we need to copy args to local space.

tra closed this revision.Jul 20 2016, 11:50 AM

Landed in r276153.