This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP 5.0] declare mapper runtime implementation
ClosedPublic

Authored by grokos on Sep 26 2019, 12:32 PM.

Details

Summary

This patch implements the runtime functionality to support the OpenMP 5.0 declare mapper directive. It introduces a set of new interfaces so that user-defined mapper functions can be passed to the runtime. The runtime calls the mapper functions to fill an internal data structure and later maps every component recorded in that data structure.
The design slides can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
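For context, a minimal use of the feature this runtime support enables might look like the following (an illustrative sketch, not a test from this patch; the mapper name vec_map and the Vec type are made up for the example):

#include <cstdio>

#define NUM 1024

struct Vec {
  int len;
  double *data;
};

// Each map of a Vec also maps the pointed-to array through the named mapper.
#pragma omp declare mapper(vec_map : Vec v) map(v, v.data[0:v.len])

int main() {
  double storage[NUM];
  Vec v{NUM, storage};
  for (int i = 0; i < NUM; i++)
    v.data[i] = 1.0;

  // The mapper function generated for vec_map is what gets passed to the
  // new runtime interfaces introduced by this patch.
  #pragma omp target teams distribute parallel for map(mapper(vec_map), tofrom: v)
  for (int i = 0; i < NUM; i++)
    v.data[i] += 1.0;

  printf("v.data[0] = %f\n", v.data[0]); // expect 2.0
  return 0;
}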

Diff Detail

Event Timeline

lildmh created this revision. Sep 26 2019, 12:32 PM
lildmh updated this revision to Diff 222173. Sep 27 2019, 7:43 AM

Please review when you have time, thanks

Maybe it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicating them? Something like __tgt_mappers(...) that stores them in the runtime. Plus, modify the original functions to use the mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be considerably smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime APIs. My arguments are:

  1. This patch will deprecate the old interfaces, as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used; it is kept only for legacy code support. So I didn't actually introduce any new runtime APIs.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionality into it, we will need to pass extra arguments to distinguish whether it is for __tgt_target, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_target, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of the OpenMP runtime better: have a function for each case (see the sketch below).
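Illustrative signatures for the two alternatives being discussed (the exact spellings are assumptions for the sake of the example, not the committed interface):

#include <cstdint>

// Direction taken by this patch: one new "_mapper" variant per existing
// entry point, carrying an extra array of mapper function pointers.
extern "C" void __tgt_target_data_begin_mapper(
    int64_t device_id, int32_t arg_num, void **args_base, void **args,
    int64_t *arg_sizes, int64_t *arg_types, void **arg_mappers);

// Alternative raised above: a single call that stashes the mappers in the
// runtime, followed by the unchanged legacy entry point.
extern "C" void __tgt_mappers(int32_t arg_num, void **arg_mappers);
extern "C" void __tgt_target_data_begin(int64_t device_id, int32_t arg_num,
                                        void **args_base, void **args,
                                        int64_t *arg_sizes, int64_t *arg_types);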

Why would we need this extra argument? The new call just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what to do with these mappers if they were stored.

Okay, I think I understand your idea now. In that case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers stored for previous calls. I don't particularly like this idea, since it introduces implicit dependencies between different runtime calls and a program will make twice as many runtime calls. But if most people like it, I'm okay.

I think another problem is that this may not work with legacy code, since it doesn't contain calls to __tgt_mapper. This may be a bigger problem when legacy code and code compiled with the new Clang are mixed together: a legacy call to __tgt_target may pick up a mapper which was intended for some new code.

No, there should not be a problem with the legacy code. If the array of mappers is empty, use the default mapping through bit-copying.

Sorry, you are right. I didn't think about always cleaning up the mappers after we finish using them.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example? I don't think we can keep per-task/thread mapper storage in the current implementation.

__tgt_mapper must be called immediately before __tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has been lost, and the __tgt_target call will not get the mapper. I don't think there is a per-task context in libomptarget.

libomptarget already uses some functions from libomp; you can use them to check for the task context.

Lingda is right, we faced the same issue in the loop trip count implementation. The loop trip count should be set per task, but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

I don't see a big problem here. You can store the mapper data per task using the thread id from libomp, as we do for the trip count. We can use the same solution for mappers.
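A rough sketch of the kind of per-thread bookkeeping being suggested, modeled on how the loop trip count is stored (the store itself and the __tgt_mappers entry point are hypothetical; only __kmpc_global_thread_num is an existing libomp call, declared here with a simplified signature):

#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

extern "C" int32_t __kmpc_global_thread_num(void *loc);

// Hypothetical per-thread mapper store, keyed by the libomp global thread id,
// analogous to how the loop trip count is kept per task.
static std::mutex MapperStoreMtx;
static std::map<int32_t, std::vector<void *>> MapperStore;

extern "C" void __tgt_mappers(int32_t arg_num, void **arg_mappers) {
  int32_t Gtid = __kmpc_global_thread_num(nullptr);
  std::lock_guard<std::mutex> Guard(MapperStoreMtx);
  // Overwrite whatever the previous target construct on this thread stored.
  MapperStore[Gtid].assign(arg_mappers, arg_mappers + arg_num);
}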

Alexey and George: This is a big decision to make. We need most people's consent. I'll send it to the mailing list later.

Btw, I never understood why we have a separate function to push the loop trip count. Why is that?

The same problem: to not bloat the interface of the runtime library.

ABataev added inline comments. Nov 11 2019, 7:08 AM
libomptarget/src/exports
16–25

On the last telecon, we decided to support this solution so adding new functions is accepted.

libomptarget/src/interface.cpp
103

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

171

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

240

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

293

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

358

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

libomptarget/src/omptarget.cpp
659–662

Why do we have this limitation?

I think the direction is good here. Some duplication is inevitable in the interface. Between the non-functional changes and the code duplication in the implementation, it's difficult to work out exactly what the code is doing, though.

libomptarget/src/omptarget.cpp
653

typedef/using instead of copying the type list for the cast?
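For illustration, such an alias might look like this (the mapper function's parameter list is an assumption based on this patch's design, not a confirmed declaration):

#include <cstdint>

// Hypothetical alias for the compiler-generated mapper function pointer type.
using MapperFuncTy = void (*)(void *RTMapperHandle, void *Base, void *Begin,
                              int64_t Size, int64_t Type);

// Instead of spelling out the full function type at every cast site:
void callMapper(void *MapperRawPtr, void *Handle, void *Base, void *Begin,
                int64_t Size, int64_t Type) {
  auto MapperFunc = reinterpret_cast<MapperFuncTy>(MapperRawPtr);
  MapperFunc(Handle, Base, Begin, Size, Type);
}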

690

It's probably worth doing the arg_types[i] => Type refactor first. Combined with the whitespace change it's quite a lot of this diff.

libomptarget/src/omptarget.cpp
699

Likewise data_size => Size. Separating the NFC from the FC makes it easier to parse the latter.

791

I think this is the same type list copy & paste that suggested a typedef above

794

The rest of this looks quite familiar too. Perhaps factor the copy & paste into helper functions that are called by both locations?

989

And again

lildmh updated this revision to Diff 229199. Nov 13 2019, 4:11 PM
lildmh marked 6 inline comments as done.

Thanks Alexey and Jon for your review. Fixed the issues and rebased

libomptarget/src/omptarget.cpp
653

Sounds good, thanks

659–662

It's because that's the width of the parent-index field in the map type. If we have more components, the parent bits will break.

690

I extracted the mapping of each component into a separate function for code reuse purposes, like target_data_end_component here. It uses Type as the input argument, so there is no longer an arg_types[i]. It's the same for Size.

So I don't think it will make things clearer to change arg_types[i] to Type first. What do you think?

794

The duplication is not too much, though. Do you think it is worth having a helper function?

lildmh marked 6 inline comments as done. Nov 13 2019, 4:12 PM
ABataev added inline comments. Nov 14 2019, 8:23 AM
libomptarget/src/omptarget.cpp
794

+1 for refactoring.

lildmh marked an inline comment as done. Nov 15 2019, 8:46 AM
lildmh added inline comments.
libomptarget/src/omptarget.cpp
794

Hi Alexey and Jon,

I didn't find an elegant way to merge the code below. It's mainly because the two places access other components in different ways:
E.g., for the mapper, Components.get(parent_idx) is used to get its parent, whereas args[parent_idx] is used for the arguments. One is an array of structs, the other a struct of arrays.

ABataev added inline comments. Nov 25 2019, 10:28 AM
libomptarget/src/omptarget.cpp
794

Still, I do not understand what the problem with the refactoring is. You can use lambdas if you need some differences in data, or something similar. Anyway, it would be better than just copy-paste.

796

Usually, we use something like the for (i = 0, e = end(); i < e; ++i) pattern.

My thoughts are the same as before. This change mixes a refactor with a functional change plus duplicates a bunch of code. The overall change might work but I can't tell from the diff.

libomptarget/src/omptarget.cpp
794

Some options:

  • wrap object in a class that adapts the interface
  • pass in a function that does the access
  • refactor one data type to the same layout as the other
  • extract small functions which are called by both
lildmh updated this revision to Diff 230971. Nov 25 2019, 2:40 PM

Thanks for your reviews. Hope this looks better.

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

Thanks

Splitting large patches into NFC and functional change doesn't seem contentious but is not required.

The advantage is seen when the build breaks. It's less annoying for the author to have the functional change part temporarily reverted than to lose the whole lot, especially when the functional change is the smaller diff as I think it would be here.

lildmh updated this revision to Diff 235142. Dec 23 2019, 6:48 AM

Rebase and rediff with the NFC version

JonChesterfield requested changes to this revision. Dec 23 2019, 7:01 AM

The premise seems OK, but three copies of a large block of control flow is not so good. Why the duplication?

openmp/libomptarget/src/omptarget.cpp
349 ↗(On Diff #235142)

What makes the mapper valid? I don't see any checking in the source. Perhaps just strike the word valid from the comment

359 ↗(On Diff #235142)

What limitation? Why 0xffff?

366 ↗(On Diff #235142)

Size probably returns size_t, why is the induction variable signed?

536 ↗(On Diff #235142)

This appears to be a copy and paste of the above

667 ↗(On Diff #235142)

And another copy and paste

This revision now requires changes to proceed. Dec 23 2019, 7:01 AM
lildmh updated this revision to Diff 235145. Dec 23 2019, 7:38 AM
lildmh marked 4 inline comments as done.

Address Jon's comments

openmp/libomptarget/src/omptarget.cpp
349 ↗(On Diff #235142)

If there is a valid pointer, which is generated by the compiler, I say it's valid. I will remove the word valid from the comment.

359 ↗(On Diff #235142)

Because the parent index in the map type has 16 bits, we cannot handle more components than that.
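A sketch of what this limit amounts to, assuming the parent (MEMBER_OF) index occupies the upper 16 bits of the 64-bit map-type word (the constant and helper names here are illustrative, not taken from the patch):

#include <cstdint>
#include <cstdio>

// Illustrative: the parent index is packed into the top 16 bits of the
// 64-bit map type, so a mapper can describe at most 0xffff components.
constexpr uint64_t kMemberOfShift = 48;

constexpr uint64_t makeMemberOf(uint64_t ParentIdx) {
  return ParentIdx << kMemberOfShift;
}

bool checkComponentCount(size_t NumComponents) {
  if (NumComponents >= 0xffffu) {
    fprintf(stderr, "mapper has too many components (max %u)\n", 0xffffu);
    return false;
  }
  return true;
}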

366 ↗(On Diff #235142)

I think the indices are always of type int32_t in libomptarget, so I followed that convention. Otherwise there would be a signed/unsigned comparison warning.

667 ↗(On Diff #235142)

Okay, will get the common part into a function

gentle ping :)

There's a lot of copy and paste remaining, and no test cases. Do we want this anyway? At some point it can be better to land the patch and keep moving than to iterate in Phabricator.

Test cases will be uploaded in another patch once the Clang patch is upstreamed. That Clang patch depends on this one (https://reviews.llvm.org/D67833). So I think the order is: this patch, the Clang patch, then the test patch.

JonChesterfield resigned from this revision. Apr 20 2020, 5:09 AM
grokos commandeered this revision. Jun 3 2020, 12:59 PM
grokos edited reviewers, added: lildmh; removed: grokos.
grokos updated this revision to Diff 275793. Jul 6 2020, 12:24 PM

I tried to address our previous complaints about code duplication and came up with a scheme which results in a much shorter and cleaner diff with virtually no code duplication. Instead of refactoring code out of target_data_begin/end/update, I introduced a new internal function target_data_mapper which generates new arrays args_base, args, arg_sizes and arg_types for the custom mapper and calls target_data_begin/end/update again using the new arguments.
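A rough sketch of that scheme, under the assumption that the mapper callback fills a vector of components which is then re-packed into fresh argument arrays (all names below are illustrative, not the committed code):

#include <cstdint>
#include <vector>

struct MapComponentInfoTy {
  void *Base;
  void *Begin;
  int64_t Size;
  int64_t Type;
};

struct MapperComponentsTy {
  std::vector<MapComponentInfoTy> Components;
};

// Assumed shapes of the compiler-generated mapper function and of
// target_data_begin/end/update.
using MapperFuncTy = void (*)(void *Handle, void *Base, void *Begin,
                              int64_t Size, int64_t Type);
using TargetDataFuncTy = int (*)(int64_t DeviceId, int32_t ArgNum,
                                 void **ArgsBase, void **Args,
                                 int64_t *ArgSizes, int64_t *ArgTypes);

int targetDataMapperSketch(int64_t DeviceId, void *ArgBase, void *Arg,
                           int64_t ArgSize, int64_t ArgType, void *Mapper,
                           TargetDataFuncTy TargetDataFunction) {
  // Let the user-defined mapper push one component per map-clause item.
  MapperComponentsTy MC;
  reinterpret_cast<MapperFuncTy>(Mapper)(&MC, ArgBase, Arg, ArgSize, ArgType);

  // Re-pack into the array-of-arrays layout the existing entry points expect,
  // then call back into target_data_begin/end/update with the new arguments.
  std::vector<void *> Bases, Begins;
  std::vector<int64_t> Sizes, Types;
  for (auto &C : MC.Components) {
    Bases.push_back(C.Base);
    Begins.push_back(C.Begin);
    Sizes.push_back(C.Size);
    Types.push_back(C.Type);
  }
  return TargetDataFunction(DeviceId, static_cast<int32_t>(Bases.size()),
                            Bases.data(), Begins.data(), Sizes.data(),
                            Types.data());
}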

Ping. If the patch lands today or tomorrow, then we will meet the clang-11 deadline and include support for declare mapper.

This revision is now accepted and ready to land. Jul 14 2020, 12:40 PM

The codegen patch needs to land before this one to pass these test programs

Also, not sure if https://reviews.llvm.org/D71782 is still needed. Please check

No, it's not needed anymore. This patch bypasses the need to do that refactoring. Can you please abandon that revision?

Ok, please ignore it then. Thanks for working on this! The only thing left is to have https://reviews.llvm.org/D67833 accepted @ABataev

Nice, thanks. All my concerns were addressed by the above revision.

This revision was automatically updated to reflect the committed changes.

declare_mapper_target.cpp still fails for me consistently in RUN line 2, i.e. for the nvptx version. I execute on x86_64 with a Tesla V100 and CUDA 10.0.
When I execute the test with export LIBOMPTARGET_DEBUG=1, the test succeeds.
In case of failure, the test prints Sum = 1024; in case of success, the test prints Sum = 2048 as expected.

I ran the tests with a fresh build (ae31d7838c36).

Interesting, how about the other declare_mapper_*.cpp tests? Do they fail or pass?

The other tests failed after the commit. They started to succeed with various commits.
At the above-mentioned commit, only this single test fails.

Very interesting. Any guess what the problem is? I'll look into it. @grokos your test passed before, right?

I can reproduce this. When running the test itself, Sum=1024.
When running the test with nvprof, Sum=2048. Combined with the fact that you said Sum=2048 when LIBOMPTARGET_DEBUG=1, I suspect the GPU offloading is disabled in the above case. Any idea what happened recently to libomptarget that could potentially cause this problem? I didn't follow the recent development, so I have no idea.

For the commit of this patch, the test fails with and without env LIBOMPTARGET_DEBUG=1. I'm using a release build, but have -DLIBOMPTARGET_ENABLE_DEBUG=on. This allows activating debug output by setting the env variable.

I'm currently bisecting for the commit where the test started to succeed with env LIBOMPTARGET_DEBUG=1. I'm hoping it's sufficient to bisect commits on /openmp/.

Thanks. Another weird thing is that it passes with nvprof. Not sure why using nvprof makes a difference here.

I tried to run declare_mapper_target.cpp on an NVIDIA GPU. The problem occurs while loading the device image:

Target CUDA RTL --> Error returned from cuModuleLoadDataEx
Target CUDA RTL --> CUDA error is: device kernel image is invalid

Sounds like a Clang problem; I don't see why libomptarget could be the culprit here.

Ok, the bisecting did not really reveal anything new. The test fails with and without the env var for 140ab574; starting with 537b16e9, the test succeeds with env LIBOMPTARGET_DEBUG=1.

@grokos Why do you think it's not the runtime, if the same executable is behaving differently just based on this env variable?

To me this looks like one of the debugging statements has a side effect, which disappears when I execute without the debug variable.

$ env LIBOMPTARGET_DEBUG=0 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Sum = 1024
$ env LIBOMPTARGET_DEBUG=1 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000401350 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x000000000063d430 has index 0!
Libomptarget --> Registering image 0x0000000000401350 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000401350
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_13_a6fc814_main_l25 (0x0000000000ee5510)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> __kmpc_push_target_tripcount(0, 1024)
Libomptarget --> Entering target region with entry point 0x0000000000401301 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc0985c510, HstBegin=0x00007ffc0985c510, HstEnd=0x00007ffc0985c518, TgtBegin=0x00002ad71e600000
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Has a pointer entry: 
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x00000000006a0600, HstBegin=0x00000000006a0600, HstEnd=0x00000000006a1600, TgtBegin=0x00002ad71e600200
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x00000000006a0600) -> (tgt:0x00002ad71e600200)
Libomptarget --> Update pointer (0x00002ad71e600000) -> [0x00002ad71e600200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> Obtained target argument 0x00002ad71e600000 from host pointer 0x00007ffc0985c510
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_13_a6fc814_main_l25 with pointer 0x0000000000eb1d10 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using 8 teams due to loop trip count 1024 and number of threads per block 128
Target CUDA RTL --> Launch kernel with 8 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000000eb1d10 successful!
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096, updated RefCount=1
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is last
Libomptarget --> Moving 4096 bytes (tgt:0x00002ad71e600200) -> (hst:0x00000000006a0600)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Deleting tgt data 0x00002ad71e600200 of size 4096
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is last
Libomptarget --> Removing shadow pointer 0x00007ffc0985c510
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Deleting tgt data 0x00002ad71e600000 of size 8
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8
Sum = 2048
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL 0x000000000063d430!
Libomptarget --> Unregistered image 0x0000000000401350 from RTL 0x000000000063d430!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000423900
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
$ 

My system was messed up and used libraries and compilers from different builds. Please ignore my previous message.

I was able to reproduce what @protze.joachim described, i.e. different runtime behavior when LIBOMPTARGET_DEBUG=1 is used. I'm looking at the issue.

OK, I suspect there is a race condition involving the CUDA plugin. If I compile the test on x86_64-pc-linux-gnu then I always get the correct result, no matter whether we print debug output or not.

On CUDA, I tried to increase the test size from 1024 to 16M. With debug output off, I always get 16M as a result (instead of 32M) - this tells me that the CUDA kernel is launched and host code proceeds to verify the result before the kernel returns. Because the problem size is large, verification on the host always finishes before the kernel returns and data is copied back from the device.

With debug output on, I get inconsistent results from execution to execution, ranging from 16M to 32M, meaning that the host is busier printing output messages so verification starts later, while data is being copied back.

@lildmh I've got a question unrelated to the problem we are discussing here. I ran declare_mapper_target.cpp and when libomptarget calls the mapper function it prints the following:

Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x00007ffcebd101e0, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x000000000231bfe0, Size=4096, Type=0x2000000000013)

Why is the second entry's MEMBER_OF field set to 2? It should be MEMBER_OF 1, since the pointer-pointee pair c.a[0:N] is part of struct c which is the first entry on the list.

Good point. May be a bug. Let me check later

@lildmh: I think I've found a bug. I used declare_mapper_target.cpp. When we call the mapper function, it generates 3 components. The first 2 are identical and correspond to the parent struct. This is what MapperComponents looks like inside function target_data_mapper:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 3, capacity 4 = {{Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}}}

Mapping the parent struct twice is problematic. If we have more struct members and some of them are NOT pointers, then upon target_data_end libomptarget will check the parent struct's reference counter to determine whether the scalar member must be copied back to the host. If the reference counter is greater than 1, then the runtime will skip copying back the scalar. Mapping the parent struct two times in a row results in RefCount=2.

So in the example below (modified declare_mapper_target.cpp), the scalar is processed by libomptarget, but because at that time the struct's RefCount is 2, we never copy the scalar back:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

#define NUM 1024

class C {
public:
  int *a;
  int onHost;
};

#pragma omp declare mapper(id: C s) map(s.a[0:NUM], s.onHost)

int main() {
  C c;
  c.a = (int*) malloc(sizeof(int)*NUM);
  c.onHost = -1;
  for (int i = 0; i < NUM; i++) {
    c.a[i] = 1;
  }
  #pragma omp target teams distribute parallel for map(mapper(id),tofrom: c)
  for (int i = 0; i < NUM; i++) {
    ++c.a[i];
    if (i == 0) {
      c.onHost = omp_is_initial_device();
    }
  }

  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += c.a[i];
  }
  printf("Executed on %s\n", c.onHost==1 ? "host" : c.onHost==0 ? "device" : "unknown");
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Upon target_data_end the mapper function will generate this:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 4, capacity 4 = {{Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 16, Type = 32}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 12, Type = 32}, {Base = 0x7fffffffb588,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb590, Size = 4, Type = 562949953421315}}}

When libomptarget processes the scalar, the parent struct's RefCount is 2, so inside the if-block in omptarget.cpp:507-524 CopyMember will never be set to true and the scalar will never be copied back to the host.

Can you revert the patches for declare mapper until it is fixed?

Thanks George for looking into this, and sorry for the late response.

I believe this is not a bug, it's a design choice we made early. The design choice is that we map the whole structure at the beginning in one piece, so we don't map its individual parts separately, which may cause a lot of memcpys.

For the RefCount, when the runtime checks the 2nd component in your example, it will find it's already mapped and will not increase the RefCount, so I think it's not a bug and the behavior is correct.

No, this is not related to our design choices. Here we are mapping the whole struct twice for no reason. The entries should be:

1) combined entry (i.e. the entry that maps the whole struct)
    base = &c, begin = &c.a, size = sizeof(class S), type = TARGET_PARAM
2) member entry for c.a[0:NUM]
    base = &c.a, begin = &c.a[0], size = NUM*sizeof(int), type = MEMBER_OF(1) | PTR_AND_OBJ | TO | FROM
3) member entry for c.onHost
    base = &c, begin = &c.onHost, size = sizeof(int), type = MEMBER_OF(1) | TO | FROM

But what happens now is that the combined entry is emitted twice, so MapperComponents looks like this:

<combined entry>, <combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

instead of

<combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

And what's more, the first combined entry has size=16 whereas the second combined entry has size=12. Where does this 16 come from? The size of the struct is 12 bytes (a pointer + an int). This also explains why the MEMBER_OF field is set to 2, because the second element in the list of arguments is also the combined entry.

There is no rationale behind emitting the combined entry twice; on the contrary, it leads to errors because the RefCount is indeed incremented when it shouldn't be.

This is libomptarget's debug output from the provided example upon entering the target region:

Libomptarget --> Entering target region with entry point 0x0000000000401409 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e50
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc24203368) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=12, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x0000000000d89ff0, Size=4096, Type=0x2000000000013).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a70, Size=4, Type=0x2000000000003).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc24203a68, HstBegin=0x00007ffc24203a68, HstEnd=0x00007ffc24203a78, TgtBegin=0x00007fa582400000
Libomptarget --> There are 16 bytes allocated at target address 0x00007fa582400000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=12)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=12, updated RefCount=2
Libomptarget --> There are 12 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000d89ff0, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000d89ff0, HstBegin=0x0000000000d89ff0, HstEnd=0x0000000000d8aff0, TgtBegin=0x00007fa582400200
Libomptarget --> There are 4096 bytes allocated at target address 0x00007fa582400200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x0000000000d89ff0) -> (tgt:0x00007fa582400200)
Libomptarget --> Update pointer (0x00007fa582400000) -> [0x00007fa582400200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a70, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a70, TgtPtrBegin=0x00007fa582400008, Size=4, RefCount=2
Libomptarget --> There are 4 bytes allocated at target address 0x00007fa582400008 - is not new
Libomptarget --> DeviceTy::getMapEntry: requested entry found
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=16, RefCount=2
Libomptarget --> Obtained target argument 0x00007fa582400000 from host pointer 0x00007ffc24203a68
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_801_1ee0443_main_l28 with pointer 0x0000000001427dc0 (index=0).

When we process the 16-byte combined entry we allocate space for the struct and RefCount=1, then we process the 12-byte combined entry and RefCount is incremented to 2.
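For reference, the map-type words in the log above can be decoded as follows (flag values as in libomptarget's omptarget.h of that era; treat the exact constants as an assumption):

#include <cstdint>
#include <cstdio>

constexpr uint64_t OMP_TGT_MAPTYPE_TO = 0x001;
constexpr uint64_t OMP_TGT_MAPTYPE_FROM = 0x002;
constexpr uint64_t OMP_TGT_MAPTYPE_PTR_AND_OBJ = 0x010;
// (The 0x20 entries in the log carry OMP_TGT_MAPTYPE_TARGET_PARAM = 0x020.)
constexpr uint64_t OMP_TGT_MAPTYPE_MEMBER_OF = 0xffff000000000000ULL;

int main() {
  uint64_t Type = 0x2000000000013ULL; // the c.a[0:NUM] entry in the log
  uint64_t Parent = (Type & OMP_TGT_MAPTYPE_MEMBER_OF) >> 48;
  printf("MEMBER_OF=%llu TO=%d FROM=%d PTR_AND_OBJ=%d\n",
         (unsigned long long)Parent, !!(Type & OMP_TGT_MAPTYPE_TO),
         !!(Type & OMP_TGT_MAPTYPE_FROM),
         !!(Type & OMP_TGT_MAPTYPE_PTR_AND_OBJ));
  // Prints MEMBER_OF=2: the member is attributed to the second entry in the
  // argument list, which is exactly the duplication being discussed.
  return 0;
}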

The first combined entry comes from mapping the whole structure. I think because of alignment, the structure is actually 16 bytes. The 2nd combined entry is the real entry emitted to map the structure. The reason it looks like there are 2 of them is that at the beginning of a mapper function we map the whole structure no matter what, which generates the 1st combined entry you saw here. Then we generate the detailed mapping entries, which produce the 2nd combined entry you saw here. They are not necessarily the same; they just happen to be similar in this example.

You are right: it indeed increases RefCount, after I checked the code. I think it should not cause any problem, because RefCount will be reduced to 0 at exit (the combined entry looks like it's mapped twice, so it should also be 'deleted' twice when the target region exits).

I assure you that's not how structs are mapped.

You don't map "the whole struct", you only map what is needed. For this you emit a combined entry which has a size large enough to encompass all members we are interested in. Then entries for individual members follow. One combined entry + as many member entries as needed. The first entry which "maps the whole struct" should not be there and is plain wrong.

The problem is that when individual members are processed in target_data_end, RefCount = 2 so these members will not be copied back to the host. RefCount must be 1 for data motion to take place and in this case it's not.

Anyway, I modified libomptarget locally to ignore the 16-byte combined entry and now all tests pass. Can you please submit a clang patch which removes the first combined entry?

This is an optimization brought up by Deepak. I guess you were in that meeting too but forgot. It could be quite useful when you map an array of structs/classes. Assume you map 1000 of these structures; with this optimization, most memory allocation can be done in a single allocation instead of allocating 12 bytes of memory 1000 times.

Thinking about it, it's actually important for correctness too. Assume you map C a[2]. If you map the elements separately, a[0] and a[1] could be mapped to non-contiguous locations, which will cause errors/segfaults when the GPU kernel accesses this array. If you allocate the whole array a[2] together, such a problem won't happen.
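For instance, such an array-of-mapped-structs case might look like the following (an illustrative variant in the style of declare_mapper_target.cpp, not an actual test from the patch):

#include <cstdio>
#include <cstdlib>

#define NUM 64

class C {
public:
  int *a;
};

#pragma omp declare mapper(id: C s) map(s.a[0:NUM])

int main() {
  C c[2];
  for (int j = 0; j < 2; j++) {
    c[j].a = (int *)malloc(sizeof(int) * NUM);
    for (int i = 0; i < NUM; i++)
      c[j].a[i] = 1;
  }
  // Mapping the whole section c[0:2]: both C objects need to land in one
  // contiguous device allocation, which is the correctness argument above.
  #pragma omp target teams distribute parallel for map(mapper(id), tofrom: c[0:2])
  for (int i = 0; i < 2 * NUM; i++)
    ++c[i / NUM].a[i % NUM];

  int sum = 0;
  for (int j = 0; j < 2; j++)
    for (int i = 0; i < NUM; i++)
      sum += c[j].a[i];
  printf("Sum = %d\n", sum); // expect 4 * NUM = 256
  return 0;
}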

I believe RefCount should be reduced to 1 when we want to copy it back in target_data_end. Could you post the whole trace of debug output showing how RefCount changes in target_data_end?

From my perspective, the declare_mapper_target.cpp code is semantically equivalent to:

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}

and

#pragma omp target enter data map(to: c)
#pragma omp target enter data map(to: c.a[0:NUM])

#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
#pragma omp target exit data map(from: c.a[0:NUM])
#pragma omp target exit data map(from: c)

Can you express the behavior of your mapping implementation in terms of OpenMP target enter/exit data primitives?

You are basically right. In the implementation, a function is generated for every mapper to do all the internal mapping. More details can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

This document does not say which concrete mapping operations you push for the concrete case of the failing test. Can you perhaps express this in terms of omp target enter/exit data operations?

Your understanding above is exactly right. It should be equivalent to:

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
grokos added a comment. Aug 3 2020, 6:15 PM

Sorry for the late response. Here you are talking about something else. The case you are considering is an array of structs. In this case, indeed we have to allocate the whole array beforehand. It's not an optimization, it's a correctness issue as you point out (array objects must be allocated consecutively). In the failing tests, however, we have single structs, not an array of structs. The difference is that in the former case the object we are mapping is the array, whereas in the latter case it's the struct. The two cases are not related to one another unless we intend to treat both of them uniformly, i.e. even if we have a single struct we still treat it as if it were the sole element of a length-1 array. Do I understand correctly?