This is an archive of the discontinued LLVM Phabricator instance.

[OpenCL] Block captured variables in dynamic parallelism - OpenCL 2.0
AbandonedPublic

Authored by mshockwave on Sep 18 2016, 10:52 PM.

Details

Reviewers
yaxunl
Anastasia
Summary

Pass block captured variables as function arguments to block invoke function rather than passing the pointer of block literal structure.
This revision is based on clang 3.9 final version and barely a prototype, feel free to leave comment on coding structure or even coding style.

Diff Detail

Event Timeline

mshockwave retitled this revision from to [OpenCL] Block captured variables in dynamic parallelism - OpenCL 2.0.
mshockwave updated this object.
mshockwave added reviewers: Anastasia, bader, yaxunl.
Anastasia edited edge metadata.Sep 26 2016, 9:42 AM
  1. Referring to the earlier discussion here: http://lists.llvm.org/pipermail/cfe-dev/2016-September/050775.html

I am still not able to understand why this helps avoiding memcpy to global memory? I am guessing somewhere in __enqueue_kernel_XXX implementation all captured variables will have to be copied to the global area to be accessed as parameters to the block invoke function during its enqueue in any case.

  1. This change seems to break original block use case as simple lambda function.

i. e. the following example:

void f1() {
  int x=1;
  global int* ptr;
  int (^bl1)() = ^() {return 1-x-*ptr;};
  bl1();
}

produces the IR:

%block.invoke = getelementptr inbounds <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i8*, i32*, i32 }>, <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i8*, i32*, i32 }>* %block, i32 0, i32 3
store i8* bitcast (i32 (i32*, i32)* @__f1_block_invoke to i8*), i8** %block.invoke, align 8
...
%5 = getelementptr inbounds %struct.__block_literal_generic, %struct.__block_literal_generic* %block.literal, i32 0, i32 3
%7 = load i8*, i8** %5, align 8
%8 = bitcast i8* %7 to i32 (i8*, ...)*
%call = call i32 (i8*, ...) %8(i8* %6)

Which results in the indirect call with the wrong prototype passing 1 argument instead of 2 expected:

define internal i32 @__f1_block_invoke(i32* %ptr, i32 %x).

Please, consider it too.

  1. A comment to the statement in http://lists.llvm.org/pipermail/cfe-dev/2016-September/050775.html regarding the offsets to the captures in the block literal struct. According to my understanding the struct header (before capture) has a fixed structure with all captures appended to the end of it. Therefore, the offset to the captures is statically known, although having an explicit copy helper might still be beneficial to have.
  1. General comment regarding missing tests to check the program paths you modify in this change.
Anastasia edited edge metadata.Sep 26 2016, 10:14 AM
Anastasia added a subscriber: cfe-commits.
mshockwave updated this revision to Diff 73254.EditedOct 3 2016, 2:47 AM
mshockwave removed a reviewer: bader.

@Anastasia
Sorry for late responding, I'd just attach a new version of patch that fixes block function use cases as normal lambda functions. But testing code is not included in this revision, we're still working on it.

About the questions you asked in the previous comment, I'm going to explain them from another aspect: How will one implement __enqueue_kernel_XXX? It might be classified into two categories:

  1. Library implementation like libclc/libclcxx written in OpenCL-C
  2. Implement builtins directly in compiler.

If we choose the first one, which most of people would do regarding its simplicity and flexibility, and we want to fetch captured variables inside the implementation of __enqueue_kernel_XXX, the possible approach would be:

void* block_as_voidptr = (void*)arg_child_kernel;
block_literal_ty *block = (block_literal_ty*)block_as_voidptr;
block->capA;
block->capB;

This seems promise, but what exactly block_literal_ty looks like? We all know that block_literal_ty would look similar to:

typedef struct {
  /*
  * Fields of block header. 
  * e.g. isa, block_descriptor...
  */

  int capA;
  int capB;
  ...
} block_literal_ty;

But since we're discussing a static type language, the definition of this struct, in OpenCL-C, must be known. However, the EXACT appearence of block_literal_ty would vary among programs, or even functions. That's the thing cap_copy_helper want to aid.

Of course there is another library approach: Keep the child kernel's invoke_function prototype untouched, pass block_literal variable(in void pointer type) as its first function argument. Since instructions for extracting captured variables had been generated during the codegen of invoke_function body. Also, we don't need to tackle any captured variables inside __enqueue_kernel_XXX.
However, the OpenCL spec says that global address space is the only address space shared between parent and child kernel; and the block_literal variable itself, is allocated as private(stack) variable in parent kernel. So we need to copy the block_literal variable(not its pointer) into some global space. Nevertheless, OpenCL doesn't allow dynamic-sized memory in global space, so we need to define a block of static size memory, perhaps array, in our library implementation. Here is the place might require global memory management since static size implies potential risk of running out pre-allocated space.

Regarding the improvement proposed by us which "flatten" captured variables into invoke_function argument list and block_literal pointer wouldn't be passed as first argument(to invoke_function) anymore. The reason why it doesn't require global memory management is that we can retrieve captured variables with cap_num field and cap_copy_helper routine INSIDE __enqueue_kernel_XXX and passed those captures as arguments to child kernel, rather than saving block_literal variable globally and postpone the retrieving actions until invoke_function, the child kernel body.

About the second implementation category, which implement builtins directly in compiler, we haven't spend time thinking about its detail approach. Maybe we can discuss about this.

Regarding the improvement proposed by us which "flatten" captured variables into invoke_function argument list and block_literal pointer wouldn't be passed as first argument(to invoke_function) anymore. The reason why it doesn't require global memory management is that we can retrieve captured variables with cap_num field and cap_copy_helper routine INSIDE __enqueue_kernel_XXX and passed those captures as arguments to child kernel, rather than saving block_literal variable globally and postpone the retrieving actions until invoke_function, the child kernel body.

Just to be clear, we are now comparing the following two approaches:

(1)

__enqueue_kernel_XXX ( ... block_literal, size ... ) { // size seems to be missing in the current implementation (?)
   ...
  // copy block literal into accessible for enqueued kernel memory
  memcpy(accessible_mem, block_literal, size); // for efficiency (block_literal + static_header_size) can be used instead
  // activate  block_literal->invoke as a kernel
  ...
}

void invoke_function(block_decriptor* d){
  use(d->capA, d->capB);
}

(2)

__enqueue_kernel_XXX ( ... block_literal ... ) {
  ...
  // copy block literal captures into accessible for enqueued kernel memory
  memcpy(accessible_mem, &(block_literal->capA), sizeof(block_literal->capA)); // which can be done using cap_copy_helper instead
  memcpy(accessible_mem + sizeof(block_literal->capA), &(block_literal->capB), sizeof(block_literal->capB)); // which can be done using cap_copy_helper instead
  // activate  block_literal->invoke as a kernel
  ...
}

void invoke_function(capA_t capA, capB_t capB){
  use(capA, capB);
}

From this picture I don't see how the flattening itself can help us to avoid using global memory. Surely in both cases the captures content will have to be copied into the memory accessible for the enqueued kernel (which is a global memory in a general case, but doesn't have to be in some cases I am guessing). Perhaps I am missing some extra step in the approach you are proposing. If you rely on the parameter passing using normal function call into the block_invoke then in both cases we can skip the memcpy of captures at all. Otherwise both appoaches will need to make a copy of the captures.

What we can improve though is avoiding extra data copy using the copy helpers you are proposing (this can also be achieved by calling mempy passing the capture offset pointer into block_literal and captures size instead of the the whole block_literal as highlighted above). We can also potentially avoid reloading of the captures in the enqueued kernel though the capture flattening, but this depends on the calling convention (or rather enqueueing convension I would say).

About the second implementation category, which implement builtins directly in compiler, we haven't spend time thinking about its detail approach. Maybe we can discuss about this.

Neither did I look at it to be honest as I also find it quite difficult from the maintanance point of view as well as future modifications.

From this picture I don't see how the flattening itself can help us to avoid using global memory. Surely in both cases the captures content will have to be copied into the memory accessible for the enqueued kernel (which is a global memory in a general case, but doesn't have to be in some cases I am guessing). Perhaps I am missing some extra step in the approach you are proposing. If you rely on the parameter passing using normal function call into the block_invoke then in both cases we can skip the memcpy of captures at all. Otherwise both appoaches will need to make a copy of the captures.

Now I agree with the necessity of global accessable_memory. After all, kernel enqueuing functions in the host side(clEnqueueNDRangeXXX) also require pre-allocated __global memory, we should follow the same fashion.

What we can improve though is avoiding extra data copy using the copy helpers you are proposing (this can also be achieved by calling mempy passing the capture offset pointer into block_literal and captures size instead of the the whole block_literal as highlighted above). We can also potentially avoid reloading of the captures in the enqueued kernel though the capture flattening, but this depends on the calling convention (or rather enqueueing convension I would say).

It seems that the flattening approach leave little space for the implementation, which violate the generic property of llvm. Also, although cap_copy_helper looks helpful, I think there is little chance for one to copy individual captured variables - copy the entire block_literal is sufficient. Of course, we can reduce block_literal size by removing redundant fields for the sake of optimization but that's another discussion topic I think.

I would like to remove this code review.
@Anastasia thanks for your patient discussing with newbie llvm/clang developer like me :)

mshockwave abandoned this revision.Oct 5 2016, 5:49 PM

However, it seems like we didn't provide quite complete implementation with respect to the captures yet as it's not possible at the moment for __enqueue_kernel_XXX to know the size of the captures or even the block literal struct itself to be able to copy the block data correctly.