Page MenuHomePhabricator

[OpenCL] Hierarchical/dynamic parallelism - enqueue kernel in OpenCL 2.0

Authored by Anastasia on May 13 2016, 10:51 AM.



An implementation of device side enqueue (DSE) - enqueue_kernel and related BIFs from OpenCL v2.0 s6.13.17.

This change includes:

  1. adding enqueue_kernel, get_kernel_work_group_size and get_kernel_preferred_work_group_size_multiple as Clang builtins with a custom check.


enqueue_kernel(.../*ommited params*/, block, /*optional sizes of passed block args if any*/)

This allows diagnosing parameters of the passed block variable (the spec mandates them to be 'local void*' type) and we can check different overloads too (Table 6.31).

  1. IR generation with an internal library call for each new builtins used in the CL code, reusing ObjC block generation.

For the following example of CL code:

kernel void device_side_enqueue(…) {
  … /*declare default_queue, flags, ndrange, a, b here*/
  enqueue_kernel(default_queue, flags, ndrange, ^(void) { a + b; });

The generated IR could be:

; from ObjC block CodeGen (the second field contains the size of the block literal record)
@__block_descriptor_tmp = internal constant { i64, i64, i8*, i8* } { i64 0, i64 52, i8* getelementptr inbounds ([35 x i8]* @.str, i32 0, i32 0), i8* null } 
define void @device_side_enqueue() {
  ; from ObjC block CodeGen (block literal record with a capture)
  %block = alloca <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>
  ; from ObjC block CodeGen - store block descriptor and block captures below
  ; from ObjC block CodeGen (set pointer to block definition code)
  %block.invoke = getelementptr inbounds <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>* %block, i64 0, i32 3 * 
  store i8* bitcast (void (i8*)* @__device_side_enqueue_block_invoke to i8*), i8** %block.invoke *
  ; potential impl of OpenCL CodeGen (cast from block literal record ptr to void ptr)
  %1 = bitcast <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>* %block to i8*
  ; potential impl of OpenCL CodeGen (this function will have additional integer params at the end if the block has any parameters to be passed to)
   ... call i32 @__enqueue_kernel_impl(..., i8* %1)

define internal void @__device_side_enqueue_block_invoke(i8* nocapture readonly %.block_descriptor) { ; from ObjC block CodeGen (this can have more params of local void* type)
  ; from ObjC block CodeGen - load captures below
  ; from ObjC block CodeGen - original body of block

Note that there are different versions of enqueue_kernel_impl with unique name each. These functions will have to be implemented as a part of an OpenCL runtime library which will get a block literal data structure (allocated locally as in this example if capture is present or as a global variable otherwise), sizes of each block literal parameter (from 'local void*' list) and other omitted arguments at the beginning - mainly opaque objects, and will perform necessary steps to enqueue work specified by the block. The block literal record itself contains all important bits to facilitate basic implementation of DSE: a pointer to a block function definition, captured fields, and size of the block literal record. We can also discuss and implement some optimisations later on or as a part of this work. The implementation of enqueue_kernel_impl will have to take care of (1) initiating execution of the block invoke code pointed to by the block literal record (%block.invoke in the example above), (2) copying captured variables in the accessible memory location, (3) performing some sort of memory management to allocate space for 'local void*' parameters passed to the block if any.

Additional changes not included in this change:

  1. Modifications of ObjC blocks IR generation. A block literal record currently contains a number of fields that are not needed for OpenCL, i.e. isa, flags, copy and dispose helpers. They can be removed when compiling in OpenCL mode. We might potentially add extra fields to enable more efficient support of DSE or facilitate compiler optimisations. Ideas are welcome! I expect some places might require taking care of address spaces too.
  1. Potentially change of existing OpenCL types is needed. At least it seems like we might need to handle the ndrange_t type differently than we do currently. It's an opaque type now, but we need it to be allocated on a stack because a local variable of that type can be declared in CL code.

Diff Detail

Event Timeline

Anastasia updated this revision to Diff 57215.May 13 2016, 10:51 AM
Anastasia retitled this revision from to [OpenCL] Hierarchical/dynamic parallelism - enqueue kernel in OpenCL 2.0.
Anastasia updated this object.
Anastasia added reviewers: bader, yaxunl.
Anastasia updated this object.
Anastasia added inline comments.May 13 2016, 10:54 AM

This is necessary in order to remove CL2.0 BIFs from the list of Clang identifiers in other CL versions.


This diagnostic should follow the latest version reporting style.

yaxunl added inline comments.May 17 2016, 1:18 PM

Is it better to move this msg to the generic msg part for OpenCL?


Can we remove the non-vararg version and keep only the vararg version? The vararg can be empty, so the non-vararg version is redundant.


Can we drop NumVaargs? It seem redundant.


joint with the previous line?


joint with previous line?


rename to SemaOpenCLBuiltin... to be consistent with others?

Anastasia updated this revision to Diff 58219.May 24 2016, 4:22 AM
  • Improved with suggestions from Sam: reformatting, renaming!
  • Removed unused err_opencl_function_not_supported diagnostic.
  • Rebased and changed to_addr builtins to allow the same identifier to be used in earlier than CL2.0 versions.
Anastasia marked 6 inline comments as done.May 24 2016, 4:25 AM
Anastasia added inline comments.

It's not used actually!


We still need it for checking the parameters of blocks are 'local void*'.

Also I am not sure there is a way to combine Clang builtin and non-builtin prototype for the same function easily...

bader added inline comments.May 24 2016, 7:40 AM

There should be some check before cast.
Here is the code snippet that will crash the compiler:

extern queue_t get_default_queue();
extern int get_global_id(int);
extern ndrange_t get_ndrange();
typedef void (^MyBlock)(local void*, local int*);

const MyBlock myBlock = (MyBlock)^(local int *p1, local int *p2) {
  int id = get_global_id(0);
  p1[id] += p2[id];

void kernel f2(global int* a, global int* b) {
  enqueue_kernel(get_default_queue(), 0, get_ndrange(), myBlock, 2U, 1U);

This check is failing on my machine:

test\CodeGenOpenCL\ error: expected string not found in input
// CHECK: [[BL:%[0-9]+]] = bitcast <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 addrspace(1)*, i32 addrspace(1)*, i32 }>* %block to void ()*
<stdin>:47:60: note: scanning from here
%3 = load %opencl.ndrange_t*, %opencl.ndrange_t** %ndrange, align 4
<stdin>:67:2: note: possible intended match here
%7 = bitcast <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 addrspace(1)*, i32, i32 addrspace(1)* }>* %block to void ()*

It looks like the order of captured arguments is different from expected? Is the order specified or test shouldn't check it?

Anastasia updated this revision to Diff 58403.May 25 2016, 4:45 AM
Anastasia marked an inline comment as done.

Updates from Alexey's comments:

  • Use canonical type while casting to block.
  • Added failing case to Sema tests.
  • Modified CodeGen tests due to captures ordering issue.
Anastasia marked 2 inline comments as done.May 25 2016, 4:49 AM
Anastasia added inline comments.

Good spot! Thanks! I have added this case to the tests!


It seems that capture order is enforced by the order in which the parser parses statements, and it isn't really enforced anywhere by ObjC implementation. Their tests only contain 1 capture or captures of the same type, so this issue couldn't be caught.

I am changing the test now, but I have no idea why the parsing order is different though.

If you tell me your revision, I could try to see if I can reproduce this locally.

bader edited edge metadata.May 31 2016, 7:37 AM

Sorry for the delay.

Is this code valid:

clk_event_t e1, e2, e3;
clk_event_t events[] = {e1, e2};
enqueue_kernel(get_default_queue(), 0, get_ndrange(), 2, events, &e3, ...);

With this patch clang rejects it with an error:

'illegal call to enqueue_kernel, expected 'clk_event_t *' argument type'

C rules allows implicit conversion of to pointer by taking address of the first element of the array.


Here block type must be canonical too.

Anastasia updated this revision to Diff 60330.Jun 10 2016, 4:09 AM
Anastasia edited edge metadata.
Anastasia marked 2 inline comments as done.
    • Allow passing array of events as a valid event list object (following C implicit cast rules of arrays to pointers) in enqueue_kernel function.
  • Added failing test case!
  • Removed declarations of get_kernel_work_group_size and get_kernel_preferred_work_group_size_multiple from the OpenCL header.
Anastasia updated this revision to Diff 60339.Jun 10 2016, 4:48 AM

Fixed issue with block typedef.

Anastasia marked 2 inline comments as done.Jun 10 2016, 4:49 AM
yaxunl accepted this revision.Jun 16 2016, 11:43 AM
yaxunl edited edge metadata.

LGTM. Thanks!

This revision is now accepted and ready to land.Jun 16 2016, 11:43 AM

Alexey, do you have any additional comments?

bader added a comment.Jun 23 2016, 5:01 AM

Hi Anastasia,

Sorry for the delay.
I have just a few comments.



It looks like this function will report only first invalid parameter. Can we report all of them by moving code from lines 109-120 inside the loop?


If I understand it correctly the intention is to convert this integer type to i32. Am I right?
By the spec it must be unsigned 32-bit integer, not just any integer type.


block arguments must be pointers to the local memory (the same as previous declaration).


Could you clarify that code path?
My understanding is that if NumArgs ==4, clang should except only block with empty parameter list. Am I right?

/// int enqueue_kernel(queue_t queue,
///                    kernel_enqueue_flags_t flags,
///                    const ndrange_t ndrange,
///                    void (^block)(void))

This check is redundant.
It's known to be always true at this point.


This check is redundant. This condition is known to be true at line 250.


Shouldn't by default we return false and report an error only if checks above find inconsistency?
I expect code to be much simpler with this approach.

Anastasia updated this revision to Diff 61816.Jun 24 2016, 11:52 AM
Anastasia edited edge metadata.

Addressed Alexey's comments:

  • Removed redundant checks;
  • Improved diagnostics of parameters to blocks and sizes of block parameters;
  • Handled casting of integer types for specifying sizes of block parameters;
Anastasia marked 4 inline comments as done.Jun 24 2016, 12:09 PM
Anastasia added inline comments.

I think the problem is that in C99 there are implicit casts among integer types, therefore making this check more restrictive would imply the cast has to be done explicitly instead.

Thus, this would have to be modified as follows:

enqueue_kernel(...,  64); -> enqueue_kernel(...,  (uint)64); or enqueue_kernel(...,  64U);

which in my opinion is undesirable and also a bit unexpected because by analogy to C99 you can compile the following successfully:

void g(unsigned);

void f() {
  char i;

I have added a check for a size however not to be larger than 32 bits and handled type cast in CodeGen. The test cases are added too.

What's your opinion about it?


This error is when we can't deduce exactly which overload is meant to be and we are giving more general error (like it's ambiguous). You can see some examples in tests. We could of course go further and specify possible candidates, but it would complicate the story even more.

bader accepted this revision.Jun 27 2016, 2:42 AM
bader edited edge metadata.

A few style nitpicks.


[Style] Minor suggestion to consider.
To avoid indentation of almost 100 lines of code inside if statement this can be implemented as:
if (NumArgs < 5) llvm_unreachable("Unhandled enqueue_kernel signature");

There is no else branch for that condition anyway.


I'm OK with allowing implicit conversions as long as there is a way to enforce strict check for unsigned 32-bit integer here (e.g. with -Wconversion or in pedatic mode).
Can we check that mode?

Anastasia updated this revision to Diff 62376.Jun 30 2016, 9:19 AM
Anastasia edited edge metadata.
  • Fixed style issues
  • Added warning diagnostics for conversions of integer types
Anastasia marked 12 inline comments as done.Jun 30 2016, 9:24 AM
Anastasia added inline comments.

Would assertion be more appropriate even?


I am reusing C diagnostics functionality here now to keep consistency!

bader added inline comments.Jul 1 2016, 1:33 AM

The con of this change - it will export all the symbols from this unnamed namespace.
The better approach would be to move the code that uses CheckImplicitConversion method below this namespace.

Anastasia updated this revision to Diff 62485.Jul 1 2016, 5:19 AM
Anastasia marked an inline comment as done.

Undo removal of anonymous namespace and move the position of function definition instead.

Anastasia marked an inline comment as done.Jul 1 2016, 5:20 AM
bader added a comment.Jul 1 2016, 5:28 AM

Thanks a lot for working on this.

Anastasia closed this revision.Jul 12 2016, 9:58 AM

r274540 and r274509