This is an archive of the discontinued LLVM Phabricator instance.

[LIBOMPTARGET] Add support for mapping of lambda captures.
ClosedPublic

Authored by ABataev on Aug 22 2018, 9:02 AM.

Details

Summary

Added support for correct mapping of variables captured by reference in
lambdas. That kind of mapping may appear only in target-executable
regions and must follow the original lambda or another lambda capture
for the same lambda.
The expected data: base address - the address of the lambda, begin
pointer - pointer to the address of the lambda capture, size - size of
the captured variable.
When OMP_TGT_MAPTYPE_LAMBDA_REF mapping type is seen in
target-executable region, the target address of the last processed item
is taken as the address of the original lambda tgt_lambda_ptr. Then,
the pointer to capture on the device is calculated like `tgt_lambda_ptr
+ (host_begin_pointer - host_begin_base)` and the target-based address
of the original variable (which host address is
*(void**)begin_pointer) is written to that pointer.

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.Aug 22 2018, 9:02 AM

Is this patch about lambdas inside target regions or about lambdas containing a target region? From the description I assume it's the former. In that case, what happens if there are more than 1 lambdas in the target region?

Is this patch about lambdas inside target regions or about lambdas containing a target region? From the description I assume it's the former. In that case, what happens if there are more than 1 lambdas in the target region?

  1. This is for lambdas inside target region.
  2. For each lambda we generate this sequence: <lambda1>, <capture1_1>, .., <capture1_n>, <lambda2>, <capture2_1>, ..., <capture2_m>, ...
Hahnfeld added inline comments.
libomptarget/src/omptarget.cpp
637–638 ↗(On Diff #161976)

After D50522 this should probably be return OFFLOAD_FAIL

Is this patch about lambdas inside target regions or about lambdas containing a target region? From the description I assume it's the former. In that case, what happens if there are more than 1 lambdas in the target region?

  1. This is for lambdas inside target region.
  2. For each lambda we generate this sequence: <lambda1>, <capture1_1>, .., <capture1_n>, <lambda2>, <capture2_1>, ..., <capture2_m>, ...

OK, I understand.

As far as I know, lambdas are internally represented as structs, right? In that case, wouldn't it make sense to treat captured references as struct members? Then we would just use the MEMBER_OF field to find the base address of a lambda instead of relying on the last element in tgt_args (omptarget.cpp:619). Also, the reference would be marked with the PTR_AND_OBJ flag and there is already code to handle such entries, so there is no need to duplicate it (lines 612-640 of this patch basically replicate what is already implemented in target_data_begin in lines 254-270 and 316-332).

Is this patch about lambdas inside target regions or about lambdas containing a target region? From the description I assume it's the former. In that case, what happens if there are more than 1 lambdas in the target region?

  1. This is for lambdas inside target region.
  2. For each lambda we generate this sequence: <lambda1>, <capture1_1>, .., <capture1_n>, <lambda2>, <capture2_1>, ..., <capture2_m>, ...

OK, I understand.

As far as I know, lambdas are internally represented as structs, right? In that case, wouldn't it make sense to treat captured references as struct members? Then we would just use the MEMBER_OF field to find the base address of a lambda instead of relying on the last element in tgt_args (omptarget.cpp:619). Also, the reference would be marked with the PTR_AND_OBJ flag and there is already code to handle such entries, so there is no need to duplicate it (lines 612-640 of this patch basically replicate what is already implemented in target_data_begin in lines 254-270 and 316-332).

I tried it, it won't work.

  1. Lambdas are captured as firstprivates, thus we cannot translate the host address of the lambda to the device address. MEMBER_OF does not work here because of that.
  2. PTR_AND_OBJ performs the deep copy, which is incorrect. LAMBDA_REF translates host reference address to the previously mapped device reference address. It makes it possible to implement the correct behavior of the lambdas with the captures by reference.
ABataev added inline comments.Aug 23 2018, 6:32 AM
libomptarget/src/omptarget.cpp
637–638 ↗(On Diff #161976)

I think this code is ok. It exits from the main loop and then returns rc as a result.

Hahnfeld added inline comments.Aug 23 2018, 6:36 AM
libomptarget/src/omptarget.cpp
637–638 ↗(On Diff #161976)

Yep, and D50522 changes the existing code to return OFFLOAD_FAIL so this patch should do the same.

ABataev added inline comments.Aug 23 2018, 6:40 AM
libomptarget/src/omptarget.cpp
637–638 ↗(On Diff #161976)

Ahh, now I see. Do you want me to make it return OFFLOAD_FAIL right now or after D50522 is landed?

Hahnfeld added inline comments.Aug 23 2018, 6:43 AM
libomptarget/src/omptarget.cpp
637–638 ↗(On Diff #161976)

Depends on the order in which the changes are committed :) the end result will be the same, it's just a matter of who needs to update his patch.

libomptarget/include/omptarget.h
52 ↗(On Diff #161976)

Head's up.

The new flag is already in use for XL/LOMP. I am investigating now if that XL/LOMP flag is really needed. It might, in which case that functionality need to be added to CLANG and the conflict need to be avoided. Will confirm as soon as I get the data.

I tried it, it won't work.

  1. Lambdas are captured as firstprivates, thus we cannot translate the host address of the lambda to the device address. MEMBER_OF does not work here because of that.

I agree that this doesn't work with the current code. IIRC the reason is that libomptarget has two loops dealing with map clauses: one in target_data_begin (which also handles MEMBER_OF) and the one in target which this patch is modifying. The problem is that firstprivate arguments are handled in the second, so the one in target_data_begin (which runs first) doesn't have a device address to resolve entries with MEMBER_OF. IMO the proper solution would be to have all allocations in one single loop in target_data_begin. The loop in target should only deal with constructing the arguments passed to the device.

To implement this efficiently we need a temporary data structure for each mapping so that we can store the device address allocated in target_data_begin. That pointer could then be queried in target to populate tgt_args.
(Ideally the data structure would wrap each entry in the args_base, args, arg_sizes, and arg_types arrays into a nice object. This would have functions to hide all this bitmasking to find out the type.)

Long story short, I think that lambdas are indeed firstprivate. What needs to be fixed are firstprivate "mappings" together with MEMBER_OF (@AlexEichenberger this already works in lomp, right?).

I tried it, it won't work.

  1. Lambdas are captured as firstprivates, thus we cannot translate the host address of the lambda to the device address. MEMBER_OF does not work here because of that.

I agree that this doesn't work with the current code. IIRC the reason is that libomptarget has two loops dealing with map clauses: one in target_data_begin (which also handles MEMBER_OF) and the one in target which this patch is modifying. The problem is that firstprivate arguments are handled in the second, so the one in target_data_begin (which runs first) doesn't have a device address to resolve entries with MEMBER_OF. IMO the proper solution would be to have all allocations in one single loop in target_data_begin. The loop in target should only deal with constructing the arguments passed to the device.

To implement this efficiently we need a temporary data structure for each mapping so that we can store the device address allocated in target_data_begin. That pointer could then be queried in target to populate tgt_args.
(Ideally the data structure would wrap each entry in the args_base, args, arg_sizes, and arg_types arrays into a nice object. This would have functions to hide all this bitmasking to find out the type.)

Long story short, I think that lambdas are indeed firstprivate. What needs to be fixed are firstprivate "mappings" together with MEMBER_OF (@AlexEichenberger this already works in lomp, right?).

This requires some significant redesign of the library itself + possibly the compiler. It would be better if somebody else, more familiar with libomptarget, could implement this.

I tried it, it won't work.

  1. Lambdas are captured as firstprivates, thus we cannot translate the host address of the lambda to the device address. MEMBER_OF does not work here because of that.

I agree that this doesn't work with the current code. IIRC the reason is that libomptarget has two loops dealing with map clauses: one in target_data_begin (which also handles MEMBER_OF) and the one in target which this patch is modifying. The problem is that firstprivate arguments are handled in the second, so the one in target_data_begin (which runs first) doesn't have a device address to resolve entries with MEMBER_OF. IMO the proper solution would be to have all allocations in one single loop in target_data_begin. The loop in target should only deal with constructing the arguments passed to the device.

To implement this efficiently we need a temporary data structure for each mapping so that we can store the device address allocated in target_data_begin. That pointer could then be queried in target to populate tgt_args.
(Ideally the data structure would wrap each entry in the args_base, args, arg_sizes, and arg_types arrays into a nice object. This would have functions to hide all this bitmasking to find out the type.)

Long story short, I think that lambdas are indeed firstprivate. What needs to be fixed are firstprivate "mappings" together with MEMBER_OF (@AlexEichenberger this already works in lomp, right?).

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

This requires some significant redesign of the library itself + possibly the compiler. It would be better if somebody else, more familiar with libomptarget, could implement this.

I'd hope this to be an implementation detail of the library. This is on my list for quite some time...

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

Hmm, by going through the full stack of data allocation? Yes, I think that would be an acceptable workaround to get things running (certainly better than my hack in https://github.com/clang-ykt/openmp/commit/b836d5ddfc48faa8a61e79668f364608012075e4).
The refactoring that I proposed would then become an optimization to avoid needless queries to find the target pointer (because we can just cache it).

RaviNarayanaswamy added inline comments.
libomptarget/src/omptarget.cpp
623 ↗(On Diff #161976)

We should not be doing target arithmetic in omptarget.cpp. The target pointer can be an opaque object for some implementation.
Yes we should clean up code in other places which do this.

ABataev added inline comments.Aug 23 2018, 10:05 AM
libomptarget/src/omptarget.cpp
623 ↗(On Diff #161976)

How should we handle elements that are part of structures?

Provide additional APIs in plugins if necessary to accomplish the arithmetic operations.
By doing arithmetic for target address in libomptarget you are assuming the pointer size is 64.

Provide additional APIs in plugins if necessary to accomplish the arithmetic operations.
By doing arithmetic for target address in libomptarget you are assuming the pointer size is 64.

I don't think it will solve the problem. If the host is 32 bits, the arithmetics also will be 32 bits. IIUC, you're talking about the situation where the host is 64 bits and device is 32 bits, right? In this case, you still will have troubles, because you just cannot bitmap complex types correctly.

If the objects which are copied are not compatible between host and target due to use of pointers which have different size in a structure then it is users responsibility.
I am more interested in where you allocate memory on device you get back an opaque object of size 64. The opaque object may be 64bit device pointer or a high 32bit pointer to device memory and low 32bit offset from that memory or it can be a pointer to a structure which has other information. The idea having libomptarget is to provide a thin layer between compiler and device specific plugin and we need to support different types of plugin.

If the objects which are copied are not compatible between host and target due to use of pointers which have different size in a structure then it is users responsibility.
I am more interested in where you allocate memory on device you get back an opaque object of size 64. The opaque object may be 64bit device pointer or a high 32bit pointer to device memory and low 32bit offset from that memory or it can be a pointer to a structure which has other information. The idea having libomptarget is to provide a thin layer between compiler and device specific plugin and we need to support different types of plugin.

Ok, I see. Then ask Georgios to fix this, this problem is not part of this patch.

I agree that this doesn't work with the current code. IIRC the reason is that libomptarget has two loops dealing with map clauses: one in target_data_begin (which also handles MEMBER_OF) and the one in target which this patch is modifying. The problem is that firstprivate arguments are handled in the second, so the one in target_data_begin (which runs first) doesn't have a device address to resolve entries with MEMBER_OF. IMO the proper solution would be to have all allocations in one single loop in target_data_begin. The loop in target should only deal with constructing the arguments passed to the device.

To implement this efficiently we need a temporary data structure for each mapping so that we can store the device address allocated in target_data_begin. That pointer could then be queried in target to populate tgt_args.
(Ideally the data structure would wrap each entry in the args_base, args, arg_sizes, and arg_types arrays into a nice object. This would have functions to hide all this bitmasking to find out the type.)

Long story short, I think that lambdas are indeed firstprivate. What needs to be fixed are firstprivate "mappings" together with MEMBER_OF (@AlexEichenberger this already works in lomp, right?).

While I am not 100% familiar with the libomptarget code yet, it is possible to handle the beginning of a target construct like a target begin (phase 1) , and the end of a target construct like a target end (phase 3), with the invocation of the target being phase 2. Keeping a data structure that hold the data from phase 1 to phase 3 is quite advantageous, as locating maps is generally expensive. Flow of info from phase 1 to phase 2 is minimal (only the list of parameters, which are indicated by a flag in the arg_types parameter.

Incidentally, having a data structure becomes necessary once async is implemented, as there are then a lot more state to remember (e.g. events associated with the termination of specific data transfers, book keeping for deleting temporary buffers once data is copied over...)

In my mind, MEMBER_OF does not imply MAP, it can be used for firstprivate just the same (though member_of firstprivate cannot be given by a user, as only map can have alloc of a whole structure, with to, from, tofrom for subsets of the structure).

PTR_AND_OBJ does not imply a deep copy, if the OBJ is already mapped, it will find it, increment the ref count on entry/decrement on exit. If we need a PTR_AND_MAPPED_OBJ, where NULL is set if the object is not present, that is something we could do. Because of the oncoming unified mem, PTR_AND_MAPPED_OBJ would simply preserve the original address (no attachment) with the requires unified_shared_memory.

I agree that this doesn't work with the current code. IIRC the reason is that libomptarget has two loops dealing with map clauses: one in target_data_begin (which also handles MEMBER_OF) and the one in target which this patch is modifying. The problem is that firstprivate arguments are handled in the second, so the one in target_data_begin (which runs first) doesn't have a device address to resolve entries with MEMBER_OF. IMO the proper solution would be to have all allocations in one single loop in target_data_begin. The loop in target should only deal with constructing the arguments passed to the device.

To implement this efficiently we need a temporary data structure for each mapping so that we can store the device address allocated in target_data_begin. That pointer could then be queried in target to populate tgt_args.
(Ideally the data structure would wrap each entry in the args_base, args, arg_sizes, and arg_types arrays into a nice object. This would have functions to hide all this bitmasking to find out the type.)

Long story short, I think that lambdas are indeed firstprivate. What needs to be fixed are firstprivate "mappings" together with MEMBER_OF (@AlexEichenberger this already works in lomp, right?).

While I am not 100% familiar with the libomptarget code yet, it is possible to handle the beginning of a target construct like a target begin (phase 1) , and the end of a target construct like a target end (phase 3), with the invocation of the target being phase 2. Keeping a data structure that hold the data from phase 1 to phase 3 is quite advantageous, as locating maps is generally expensive. Flow of info from phase 1 to phase 2 is minimal (only the list of parameters, which are indicated by a flag in the arg_types parameter.

Yes, that's how it is implemented in libomptarget. In fact it's the "problem" with the two loops that I described above: firstprivate "mappings" are not handled by phase 1, but instead in target (phase 2) which is why MEMBER_OF doesn't work.

Incidentally, having a data structure becomes necessary once async is implemented, as there are then a lot more state to remember (e.g. events associated with the termination of specific data transfers, book keeping for deleting temporary buffers once data is copied over...)

That's one of my motiviation why I think this refactoring is worth it ;-)

In my mind, MEMBER_OF does not imply MAP, it can be used for firstprivate just the same (though member_of firstprivate cannot be given by a user, as only map can have alloc of a whole structure, with to, from, tofrom for subsets of the structure).

... and that's the point that should be fixed in libomptarget. For now I think it's acceptable to do more expensive lookups before implementing the caching information.

grokos added a comment.Sep 3 2018, 2:52 PM

@AlexEichenberger: Have you confirmed whether flag 0x400 is free to be used for LAMBDA_REF or has been reserved for something else in XL?

@AlexEichenberger: Have you confirmed whether flag 0x400 is free to be used for LAMBDA_REF or has been reserved for something else in XL?

It does not matter, I'll rework it to use PTR_AND_OBJ instead.

ABataev updated this revision to Diff 163856.Sep 4 2018, 10:01 AM

Reworked handling of lambdas captures.

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

I don't see this path taken in the latest patch. Is there a major drawback of doing this?

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

I don't see this path taken in the latest patch. Is there a major drawback of doing this?

I did it using new tgtArgsPositions, which handles mappings between the firstprivates and the corresponding target pointers.

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

I don't see this path taken in the latest patch. Is there a major drawback of doing this?

I did it using new tgtArgsPositions, which handles mappings between the firstprivates and the corresponding target pointers.

I think it's more consistent to handle all data mapping in a single loop (in target_data_begin) as discussed before.

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

I don't see this path taken in the latest patch. Is there a major drawback of doing this?

I did it using new tgtArgsPositions, which handles mappings between the firstprivates and the corresponding target pointers.

I think it's more consistent to handle all data mapping in a single loop (in target_data_begin) as discussed before.

I think this is a different problem that should be solved separately. It requires a redesign of the existing solution, while I'm just trying to implement a new feature.

grokos added a comment.Sep 6 2018, 8:49 AM

BTW, the main problem here is that there is no mapping between host pointer and device pointer for the firstprivates, just like for the mapped data. We just need to rework the way we allocate the memory for the firstprivate to make MEMBER_OF to work.

I don't see this path taken in the latest patch. Is there a major drawback of doing this?

I did it using new tgtArgsPositions, which handles mappings between the firstprivates and the corresponding target pointers.

I think it's more consistent to handle all data mapping in a single loop (in target_data_begin) as discussed before.

I think this is a different problem that should be solved separately. It requires a redesign of the existing solution, while I'm just trying to implement a new feature.

I think I agree with Alexey here. I favor letting support for lambda captures in and later on we can redesign the way libomptarget handles mappings. Is there any objection?

I think it's more consistent to handle all data mapping in a single loop (in target_data_begin) as discussed before.

I think this is a different problem that should be solved separately. It requires a redesign of the existing solution, while I'm just trying to implement a new feature.

I think I agree with Alexey here. I favor letting support for lambda captures in and later on we can redesign the way libomptarget handles mappings. Is there any objection?

I didn't ask to do the refactoring to land support for lambda captures. Instead I thought Alexey proposed allocating firstprivates in target_data_begin and fully registering them with the mapping machinery. This should make member_of work without duplicating the code because the firstprivate "mapping" will be allocated when the loop reaches the depending entries. In target you'd only need to query all target pointers that need to be passed as a parameter to the kernel function.

ABataev added a comment.EditedSep 11 2018, 9:51 AM

It is not very easy to do, this rework requires redesign. I'm not interested in redesigning the whole library, I have a lot of other work to do.

Sorry, I've only just become aware of this discussion,
I had already implemented the lambda mapping change in the clang-ykt branch (and for that I did use MEMBER_OF and PTR_AND_OBJ which worked as expected).
Could we possibly reuse (some of?) that implementation?
I've tested that on a few different applications and it works quite well.
I'm not that familiar with the difference between clang-ykt and upstream so it's probable that I'm missing something.

This revision is now accepted and ready to land.Oct 30 2018, 8:06 AM
This revision was automatically updated to reflect the committed changes.