Page MenuHomePhabricator

[OpenMP][IRBuilder][WIP] Prototype `omp task` support
Needs ReviewPublic

Authored by jdoerfert on Dec 29 2019, 11:20 PM.

Details

Summary
NOTE: The patch is not complete and lacks testing but it should allow people to see how the IRBuilder will evolve and give feedback.

This patch adds support for omp task to the OpenMPIRBuilder. To
simplify this *and* to make optimizations of tasks possible, we
introduce a new interface: __kmpc_task.

Similar to TRegions, the new interface exposes front-end information in
a direct way. More importantly, the task create and task issue step are
conceptually not separated anymore as it is, together with other
non-user related code, hidden inside __kmpc_task.

Most of the properties of a task are supported already, incl. dependences.

The outlining logic used for omp parallel was (slightly) generalized
and reused but better abstractions are needed.

Diff Detail

Event Timeline

jdoerfert created this revision.Dec 29 2019, 11:20 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 29 2019, 11:20 PM
lebedev.ri added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
86–87

This doesn't seem correctt.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

openmp/runtime/src/kmp.h
2242

Existing code, but the comments about number of bits here should probably be executable. E.g. give the compiler and the library a uint16_t field each, assert that sizeof kmp_tasking_flags == sizeof(uint32_t). Optionally four byte align the first short.

jdoerfert marked 2 inline comments as done.Dec 30 2019, 9:55 AM
jdoerfert added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
86–87

It is not, thx, 62.

openmp/runtime/src/kmp.h
2242

I'd like to make a bunch of changes like that now that I started to look into the library closer. In addition to the TODOs I added below (in another patch), I found seemingly unused functions, arguments, ... Versioning the runtime seems interesting to me now.

rogfer01 added inline comments.Jan 13 2020, 11:43 PM
openmp/runtime/src/kmp_tasking.cpp
1764

Just double-checking here I understand what would happen here for C++ firstprivatized objects (sorry if I'm asking the obvious):

We would capture the copy onto an alloced memory and then we would copy that memory to the task struct?

So given something like

struct A {
  A(int x = 3);
  void bar();
};

void foo() {
  A a;
  #pragma omp task
  {
    a.bar();
  }
}

we'd do (in pseudo-C)

void foo() {
   struct task_env_0 { 
      char A_storage[sizeof(A)]; // properly aligned and all that
   } tenv_0;

   A::A(&tenv_0.A_storage, 3); // capture

   // This would happen (dynamically) inside __kmpc_task
   kmp_task_t* new_task = __kmp_omp_task_alloc( , sizeof(tenv_0), ... );
   memcpy(new_task->shareds, &tenv_0, sizeof(tenv_0));
   kmpc_omp_task(..., new_task).
   //
}

So we go from

  • create task
  • capture environment in the task context
  • queue task / run immediately for if0

to

  • capture environment in a local storage
  • allocate task + copy local environment to task environment if needed (for tasks that are not if0) + queue task / run immediately for if0

Did I get that right?

Thanks!

the task create and task issue step are
conceptually not separated anymore as it is

I don't think this can work reliably. Because not all C++ objects can be mem-copied.
E.g. an object can keep its own address or reference, and mem-copy will make it broken.
This could be fixed by generating (optional) thunk routine which would create all needed objects
in the library-allocated space, and similar routine which would destroy all the created objects.

the task create and task issue step are
conceptually not separated anymore as it is

I don't think this can work reliably. Because not all C++ objects can be mem-copied.
E.g. an object can keep its own address or reference, and mem-copy will make it broken.
This could be fixed by generating (optional) thunk routine which would create all needed objects
in the library-allocated space, and similar routine which would destroy all the created objects.

I agree. memcpy will only work for certain types (incl. all PoD types). We need to extend this later to pass the copy constructor function pointers and locations of the original object.
We should keep kmp_uint32 sizeof_shared_and_private_vars, void *shared_and_private_vars for the simple cases and add something like:

/// Copy wrapper takes the address of an object and the address the copy is going to be initialized in and returns the address right after the new object. 
typedef void*(*copy_wrapper)(void * /* src_addr */, void * /* trg_addr */);
 
__kmpc_task( ..., kmp_uint32 sizeof_copy_infos, copy_wrapper * copy_wrapper_list, void * copied_obj_list, ...)

and we then call the copy constructors like this:

void *local_addr = ...;
for (kmp_uint32 u = 0; u < sizeof_copy_infos; ++u)
  local_addr = copy_wrapper_list[u](copied_obj_list[u], local_addr);

@rogfer01 @AndreyChurbanov What do you think? (I can do this in this patch or later)

and we then call the copy constructors like this:

void *local_addr = ...;
for (kmp_uint32 u = 0; u < sizeof_copy_infos; ++u)
  local_addr = copy_wrapper_list[u](copied_obj_list[u], local_addr);

Just to confirm: that local_addr would be somehow linked to the task, I imagine it'd be initialized to something like task->shareds + offset_to_firstprivates, wouldn't it?

Also, perhaps you already considered if it makes sense to just have a copy function generated by the front-end rather than one of for each "firstprivatized variable that can't just be memcpy'ed"?

Also I'm curious why are you moving away (perhaps I did get this wrong!) from the current model of

  1. kmpc_omp_task_alloc
  2. capture environment
  3. queue the task kmpc_omp_taskor (if the task is if(0)) do an "immediate" execution but use the environment captured in 2

to something that looks like

  1. (partially?) capture the environment. I think I didn't understand what void *shared_and_private_vars will do here. Is this something that the front-end precaptured for us?
  2. copy the environment you obtained from shared_and_private_vars to a task-local storage and then queue the task or (if the task is if(0)) do an immediate execution using the environment you got from the argument to kmpc_task

I guess there is a benefit in the new approach. So far I read this as improving the if(0) case (but I may be missing something here).

Thanks!

and we then call the copy constructors like this:

void *local_addr = ...;
for (kmp_uint32 u = 0; u < sizeof_copy_infos; ++u)
  local_addr = copy_wrapper_list[u](copied_obj_list[u], local_addr);

Just to confirm: that local_addr would be somehow linked to the task, I imagine it'd be initialized to something like task->shareds + offset_to_firstprivates, wouldn't it?

Yes. It is some location in which the task local variables live.

Also, perhaps you already considered if it makes sense to just have a copy function generated by the front-end rather than one of for each "firstprivatized variable that can't just be memcpy'ed"?

Having a copy function per type allows us to reuse it, otherwise we have one copy function per static task location (at worst). Either works for me I think.

Also I'm curious why are you moving away (perhaps I did get this wrong!) from the current model of

  1. kmpc_omp_task_alloc
  2. capture environment
  3. queue the task kmpc_omp_taskor (if the task is if(0)) do an "immediate" execution but use the environment captured in 2

to something that looks like

  1. (partially?) capture the environment. I think I didn't understand what void *shared_and_private_vars will do here. Is this something that the front-end precaptured for us?

The idea is that we can allocate variables (for which we do not invoke a copy constructor) in a smart way so that we can easily copy them if we need to or not copy them at all if we don't. This interface is a first version though roughly designed like the tregion interface. Any suggestions are welcome!

  1. copy the environment you obtained from shared_and_private_vars to a task-local storage and then queue the task or (if the task is if(0)) do an immediate execution using the environment you got from the argument to kmpc_task

Optimally, you would only copy if you need to.

I guess there is a benefit in the new approach. So far I read this as improving the if(0) case (but I may be missing something here).

There are multiple reasons but the main one for me is for sure the ability to use callback metadata to link the values passed at the call site of kmpc_task with the values received at the outlined body function. This is really complicated in the old approach, because there is no direct link possible between the captured values and the local copies in the thread (the stores go to some really hard to describe location and the body function reference is only at some point later).
In the new approach you link shared_and_private_vars to the argument the task function receives. Now you encapsulate the call site in an additional, modifiable level of indirection (see D71505), and the Attributor will unpack the struct (see D68852), propagate constants, alias information, ... between the task invocation and the task function.

Having a copy function per type allows us to reuse it, otherwise we have one copy function per static task location (at worst). Either works for me I think.

I also would be OK with either option.
But note that using per-type functions will require more additions to the interface, like:

(..., num_objects, array_of_copy_wrappers, array_of_desctuctor_wrappers, array_of_obj_offsets).

Then the library can iterate over objects to copy-construct them, then iterate to destroy them after the task is complete. Without any possibility of inlining of any wrappers.

Per-task function only needs two additions - copy_wrapper and destructor_wrapper, all other details can live inside them, including possible inlining of constructors and destructors.

Having a copy function per type allows us to reuse it, otherwise we have one copy function per static task location (at worst). Either works for me I think.

I also would be OK with either option.
But note that using per-type functions will require more additions to the interface, like:

(..., num_objects, array_of_copy_wrappers, array_of_desctuctor_wrappers, array_of_obj_offsets).

Then the library can iterate over objects to copy-construct them, then iterate to destroy them after the task is complete. Without any possibility of inlining of any wrappers.

Per-task function only needs two additions - copy_wrapper and destructor_wrapper, all other details can live inside them, including possible inlining of constructors and destructors.

Agreed. Interestingly, the "per-type" solution should allow you to implement the "per-task" version on top. You pretend all objects are part of a (task) meta-object which will copy them one by one.