This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
openmp/libomptarget/
-
libomptarget/
-
include/
-
omptarget.h
-
src/
-
exports
-
interface.cpp
4/9
omptarget.cpp
-
private.h
-
rtl.cpp
-
test/mapping/
-
mapping/
-
declare_mapper_api.cpp
-
declare_mapper_target.cpp
-
declare_mapper_target_data.cpp
-
declare_mapper_target_data_enter_exit.cpp
-
declare_mapper_target_update.cpp

Differential D68100

[OpenMP 5.0] declare mapper runtime implementation
ClosedPublic

Authored by grokos on Sep 26 2019, 12:32 PM.

Download Raw Diff

Details

Reviewers

ABataev
kkwli0
hfinkel
Meinersbur
jdoerfert
lildmh
JonChesterfield

Commits

rG140ab574a1c8: [OpenMP][Offload] Declare mapper runtime implementation

Summary

This patch implements the runtime functionality to support the OpenMP 5.0 declare mapper. It introduces a set of new interfaces so user-defined mapper functions can be passed to the runtime. The runtime will call mapper functions to fill up an internal data structure. Later it will map every component in the internal data structure.
The design slides can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lildmh created this revision.Sep 26 2019, 12:32 PM

Herald added subscribers: openmp-commits, guansong. · View Herald TranscriptSep 26 2019, 12:32 PM

Please review when you have time, thanks

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

In D68100#1686337, @grokos wrote:

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.

If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

In D68100#1686395, @lildmh wrote:

In D68100#1686337, @grokos wrote:

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.

If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

In D68100#1686397, @ABataev wrote:

In D68100#1686395, @lildmh wrote:

In D68100#1686337, @grokos wrote:

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.

If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

In D68100#1686413, @lildmh wrote:

In D68100#1686397, @ABataev wrote:

In D68100#1686395, @lildmh wrote:

In D68100#1686337, @grokos wrote:

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.

If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

I think another problem is this may not work with legacy code, since they don't have calls to __tgt_mapper. This may be a bigger problem when legacy code and code compiled with new Clang are mixed together: a legacy call to __tgt_target may get a mapper which is intended for some new code.

In D68100#1686416, @lildmh wrote:

In D68100#1686413, @lildmh wrote:

In D68100#1686397, @ABataev wrote:

In D68100#1686395, @lildmh wrote:

In D68100#1686337, @grokos wrote:

In D68100#1686044, @ABataev wrote:

In D68100#1685975, @lildmh wrote:

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.

If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

I think another problem is this may not work with legacy code, since they don't have calls to __tgt_mapper. This may be a bigger problem when legacy code and code compiled with new Clang are mixed together: a legacy call to __tgt_target may get a mapper which is intended for some new code.

No, there should not be problem with the legacy code. If the array of mappers is empty - use default mapping through the bitcopying.

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

In D68100#1686432, @lildmh wrote:

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

In D68100#1686437, @ABataev wrote:

In D68100#1686432, @lildmh wrote:

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has lost, and the execution of the __tgt_target will not get the mapper. I don't think there is a per-task context in libomptarget.

In D68100#1686451, @lildmh wrote:

In D68100#1686437, @ABataev wrote:

In D68100#1686432, @lildmh wrote:

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has lost, and the execution of the __tgt_target will not get the mapper. I don't think there is a per-task context in libomptarget.

libomptarget already uses some functions from libomp, you can use it to check for the task context.

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

In D68100#1686458, @grokos wrote:

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

I don't see a big problems here. You can store the mapper data per task using thread id from libomp, as we do it for tripcount. We can use the same solution for mappers.

Alexey and George: This is a big decision to make. We need to have most people's consents. I'll send it to the mailing list later.

In D68100#1686458, @grokos wrote:

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

Btw, I never understand why we have a separate function to push loop trip count. Why is that?

In D68100#1686480, @lildmh wrote:

In D68100#1686458, @grokos wrote:

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

Btw, I never understand why we have a separate function to push loop trip count. Why is that?

The same problem, to not bloat interface of the runtime library

ABataev added inline comments.Nov 11 2019, 7:08 AM

libomptarget/src/exports
16–25 ↗	(On Diff #222173)	On the last telecon, we decided to support this solution so adding new functions is accepted.
libomptarget/src/interface.cpp
103 ↗	(On Diff #222173)	`__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));`
171 ↗	(On Diff #222173)	`__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));`
240 ↗	(On Diff #222173)	`__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));`
293 ↗	(On Diff #222173)	`__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));`
358 ↗	(On Diff #222173)	`__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));`
libomptarget/src/omptarget.cpp
361–364 ↗	(On Diff #222173)	Why we have this limitation?

JonChesterfield added a subscriber: JonChesterfield.Nov 11 2019, 12:27 PM

I think the direction is good here. Some duplication is inevitable in the interface. Between the non functional changes and the code duplication in the implementation it's difficult to work out exactly what the code is doing though.

libomptarget/src/omptarget.cpp
355 ↗	(On Diff #222173)	typedef/using instead of copying the type list for the cast?
438 ↗	(On Diff #222173)	It's probably worth doing the arg_types[i] => Type refactor first. Combined with the whitespace change it's quite a lot of this diff.

JonChesterfield added inline comments.Nov 11 2019, 12:38 PM

libomptarget/src/omptarget.cpp
466 ↗	(On Diff #222173)	Likewise data_size => Size. Separating the NFC from the FC makes it easier to parse the latter.
547 ↗	(On Diff #222173)	I think this is the same type list copy & paste that suggested a typedef above
550 ↗	(On Diff #222173)	The rest of this looks quite familiar too. Perhaps factor the copy & paste into helper functions that are called by both locations?
698 ↗	(On Diff #222173)	And again

Thanks Alexey and Jon for your review. Fixed the issues and rebased

libomptarget/src/omptarget.cpp
355 ↗	(On Diff #222173)	Sounds good, thanks
361–364 ↗	(On Diff #222173)	It's because that's the length of the parent bit. If we have more components, the parent bits will break.
438 ↗	(On Diff #222173)	I extracted the mapping of each component into a separate function for code reuse purpose, like `target_data_end_component` here. It uses `Type` as the input argument, so there is no longer `arg_types[i]`. It's the same for `Size`. So I don't think it will make it more clear to change `arg_types[i]` to `Type` first. What do you think?
550 ↗	(On Diff #222173)	The duplication is not too much though. Do you think it will worth it to have a helper function?

lildmh marked 6 inline comments as done.Nov 13 2019, 4:12 PM

lildmh added a child revision: D67833: [OpenMP 5.0] Codegen support to pass user-defined mapper functions to runtime.Nov 13 2019, 4:20 PM

ABataev added inline comments.Nov 14 2019, 8:23 AM

libomptarget/src/omptarget.cpp
550 ↗	(On Diff #222173)	+1 for refactoring.

lildmh marked an inline comment as done.Nov 15 2019, 8:46 AM

lildmh added inline comments.

libomptarget/src/omptarget.cpp
550 ↗	(On Diff #222173)	Hi Alexey and Jon, I didn't find an elegant way to merge the code below. It's mainly because they have different way to access other components: E.g., for mapper, `Components.get(parent_idx)` is used to get its parent, on the other hand, `args[parent_idx]` is used for arguments. One is array of struct, the other is struct of array.

Hahnfeld removed a reviewer: Hahnfeld.Nov 16 2019, 9:56 AM

@ABataev @JonChesterfield ping

ABataev added inline comments.Nov 25 2019, 10:28 AM

libomptarget/src/omptarget.cpp
546 ↗	(On Diff #229199)	Usually, we use something like `(for i = 0, e = end(); i < e; ++i)` pattern.
550 ↗	(On Diff #222173)	Still, do not understand what is a problem with the refactoring. You can use lambdas, if need some differences in data, or something similar. Anyway, it would better rather than just copy-paste.

In D68100#1759099, @jdoerfert wrote:

@ABataev @JonChesterfield ping

My thoughts are the same as before. This change mixes a refactor with a functional change plus duplicates a bunch of code. The overall change might work but I can't tell from the diff.

libomptarget/src/omptarget.cpp
550 ↗	(On Diff #222173)	Some options: wrap object in a class that adapts the interface pass in a function that does the access refactor one data type to the same layout as the other extract small functions which are called by both

Thanks for your reviews. Hope this looks better.

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

In D68100#1759502, @lildmh wrote:

Thanks for your reviews. Hope this looks better.

Thanks

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

Splitting large patches into NFC and functional change doesn't seem contentious but is not required.

The advantage is seen when the build breaks. It's less annoying for the author to have the functional change part temporarily reverted than to lose the whole lot, especially when the functional change is the smaller diff as I think it would be here.

lildmh mentioned this in D67833: [OpenMP 5.0] Codegen support to pass user-defined mapper functions to runtime.Dec 18 2019, 11:14 AM

lildmh added a parent revision: D71782: [OpenMP] [NFC] Refactor data mapping to prepare for declare mapper implementation.Dec 20 2019, 1:23 PM

JonChesterfield mentioned this in D71782: [OpenMP] [NFC] Refactor data mapping to prepare for declare mapper implementation.Dec 20 2019, 1:50 PM

Rebase and rediff with the nfc version

The premise seems OK, but three copies of a large block of control flow is not so good. Why the duplication?

openmp/libomptarget/src/omptarget.cpp
263–283	What makes the mapper valid? I don't see any checking in the source. Perhaps just strike the word valid from the comment
273	What limitation? Why 0xffff?
280	Size probably returns size_t, why is the induction variable signed?
426	This appears to be a copy and paste of the above
575	And another copy and paste

This revision now requires changes to proceed.Dec 23 2019, 7:01 AM

Address Jon's comments

openmp/libomptarget/src/omptarget.cpp
263–283	If there is a valid pointer, which is generated by compiler, I say it's valid. I will remove word valid from the comment
273	Because the parent idx in map type has 16 bits, we cannot handle components more than that.
280	I think the indices are always type `int32_t` in libomptarget, so I followed the rules. Otherwise there will be a signed and unsigned compariion warning
575	Okay, will get the common part into a function

gentle ping :)

Ping

There's a lot of copy and paste remaining, and no test cases. Do we want this anyway? At some point it can be better to patch and keep moving than to iterate in phabricator.

Test cases will be uploaded in another patch when the Clang patch is upstreamed. That Clang patch depends on this (https://reviews.llvm.org/D67833). So I think the order is this patch, clang patch, test patch.

JonChesterfield resigned from this revision.Apr 20 2020, 5:09 AM

Herald added a subscriber: yaxunl. · View Herald TranscriptApr 20 2020, 5:09 AM

grokos commandeered this revision.Jun 3 2020, 12:59 PM

grokos edited reviewers, added: lildmh; removed: grokos.

Herald added a subscriber: sstefan1. · View Herald TranscriptJun 3 2020, 12:59 PM

I tried to address our previous complaints about code duplication and came up with a scheme which results in a much shorter and cleaner diff with virtually no code duplication. Instead of refactoring code from taget_data_begin/end/update, I introduced a new internal function target_data_mapper which generates new arrays args_base, args, arg_sizes and arg_types for the custom mapper and calls target_data_begin/end/update again using the new arguments.

Harbormaster failed remote builds in B63063: Diff 275793!Jul 6 2020, 12:24 PM

Ping. If the patch lands toady or tomorrow, then we will meet the clang-11 deadline and include support for declare mapper.

LGTM

This revision is now accepted and ready to land.Jul 14 2020, 12:40 PM

The codegen patch needs to land before this one to pass these test programs

Also, not sure if https://reviews.llvm.org/D71782 is still needed. Please check

No, it's not needed anymore. This patch bypasses the need to do that refactoring. Can you please abandon that revision?

In D68100#2151386, @grokos wrote:

No, it's not needed anymore. This patch bypasses the need to do that refactoring. Can you please abandon that revision?

Ok, please ignore it then. Thanks for working on this! The only thing left is to have https://reviews.llvm.org/D67833 accepted @ABataev

Nice, thanks. All my concerns were addressed by the above revision.

Closed by commit rG140ab574a1c8: [OpenMP][Offload] Declare mapper runtime implementation (authored by grokos). · Explain WhyJul 15 2020, 7:11 PM

This revision was automatically updated to reflect the committed changes.

saiislam mentioned this in D64571: [OPENMP]Fix threadid in __kmpc_omp_taskwait call for dependent target calls..Jul 20 2020, 8:39 AM

declare_mapper_target.cpp still fails for me consistently in RUN line 2, so for the nvptx version. I execute on x86_64 with Tesla V100 and cuda 10.0.
When I execute the test with export LIBOMPTARGET_DEBUG=1, the test succeeds.
In case of failure, the test prints Sum = 1024, in case of success, the test prints Sum = 2048 as expected.

I run the tests with a fresh build (ae31d7838c36).

In D68100#2166450, @protze.joachim wrote:

declare_mapper_target.cpp still fails for me consistently in RUN line 2, so for the nvptx version. I execute on x86_64 with Tesla V100 and cuda 10.0.
When I execute the test with export LIBOMPTARGET_DEBUG=1, the test succeeds.
In case of failure, the test prints Sum = 1024, in case of success, the test prints Sum = 2048 as expected.

I run the tests with a fresh build (ae31d7838c36).

Interesting, how about other tests declare_mapper_*.cpp? Do they fail or pass?

The other tests failed after the commit. They started to succeed with various commits.
At above mentioned commit, only this single test fails.

In D68100#2167208, @protze.joachim wrote:

The other tests failed after the commit. They started to succeed with various commits.
At above mentioned commit, only this single test fails.

Very interesting. Any guess what's the problem? I'll look into it. @grokos your test passed before, right?

I can reproduce this. When running the test itself, Sum=1024.
When running the test with nvprof, Sum=2048. Combining with that you said Sum=2048 when LIBOMPTARGET_DEBUG=1, I suspect the GPU offloading is disabled in the above case. Any idea what happened recently to libomptarget which can potentially cause this problem? I didn't follow the recent development so have no idea.

For the commit of this patch, the test fails with and without env LIBOMPTARGET_DEBUG=1. I'm using a release build, but have -DLIBOMPTARGET_ENABLE_DEBUG=on. This allows to activate debug output by setting the env variable.

I'm currently bisecting for the commit, when the test started to succeed with env LIBOMPTARGET_DEBUG=1. I'm hoping its sufficient to bisect commits on /openmp/.

Thanks. Another weird place is it passes with nvprof. Not sure why using nvprof makes a difference here.

I tried to run declare_mapper_target.cpp on a Nvidia GPU. The problem occurs while loading the device image:

Target CUDA RTL --> Error returned from cuModuleLoadDataEx
Target CUDA RTL --> CUDA error is: device kernel image is invalid

Sounds like a clang problem, I don't see why libomptarget could be the culprit here.

Ok, the bisecting did not really reveal anything new. The test fails with and without the env var for 140ab574 , starting with 537b16e9 the test succeeds with env LIBOMPTARGET_DEBUG=1

@grokos Why do you think, it's not the runtime, if the same executable is behaving differently just based on this env variable?

From me this looks like one of the debugging statements has a side effect, which disappears when I execute without the debugging variable.

$ env LIBOMPTARGET_DEBUG=0 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Sum = 1024
$ env LIBOMPTARGET_DEBUG=1 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000401350 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x000000000063d430 has index 0!
Libomptarget --> Registering image 0x0000000000401350 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000401350
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_13_a6fc814_main_l25 (0x0000000000ee5510)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> __kmpc_push_target_tripcount(0, 1024)
Libomptarget --> Entering target region with entry point 0x0000000000401301 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc0985c510, HstBegin=0x00007ffc0985c510, HstEnd=0x00007ffc0985c518, TgtBegin=0x00002ad71e600000
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Has a pointer entry: 
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x00000000006a0600, HstBegin=0x00000000006a0600, HstEnd=0x00000000006a1600, TgtBegin=0x00002ad71e600200
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x00000000006a0600) -> (tgt:0x00002ad71e600200)
Libomptarget --> Update pointer (0x00002ad71e600000) -> [0x00002ad71e600200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> Obtained target argument 0x00002ad71e600000 from host pointer 0x00007ffc0985c510
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_13_a6fc814_main_l25 with pointer 0x0000000000eb1d10 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using 8 teams due to loop trip count 1024 and number of threads per block 128
Target CUDA RTL --> Launch kernel with 8 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000000eb1d10 successful!
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096, updated RefCount=1
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is last
Libomptarget --> Moving 4096 bytes (tgt:0x00002ad71e600200) -> (hst:0x00000000006a0600)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Deleting tgt data 0x00002ad71e600200 of size 4096
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is last
Libomptarget --> Removing shadow pointer 0x00007ffc0985c510
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Deleting tgt data 0x00002ad71e600000 of size 8
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8
Sum = 2048
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL 0x000000000063d430!
Libomptarget --> Unregistered image 0x0000000000401350 from RTL 0x000000000063d430!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000423900
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
$

My system was messed up and used libraries and compilers from different builds. Please ignore my previous message.

I was able to reproduce what @protze.joachim described, i.e. different runtime behavior when LIBOMPTARGET_DEBUG=1 is used. I'm looking at the issue.

OK, I suspect there is a race condition involving the CUDA plugin. If I compile the test on x86_64-pc-linux-gnu then I always get the correct result, no matter whether we print debug output or not.

On CUDA, I tried to increase the test size from 1024 to 16M. With debug output off, I always get 16M as a result (instead of 32M) - this tells me that the CUDA kernel is launched and host code proceeds to verify the result before the kernel returns. Because the problem size is large, verification on the host always finished before the kernel returns and data is copied back from the device.

With debug output on, I get inconsistent results from execution to execution ranging from 16M to 32M, meaning that the host is busier printing output messages so verification can start later while data is being copied back.

@lildmh I've got a question unrelated to the problem we are discussing here. I ran declare_mapper_target.cpp and when libomptarget calls the mapper function it prints the following:

Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x00007ffcebd101e0, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x000000000231bfe0, Size=4096, Type=0x2000000000013)

Why is the second entry's MEMBER_OF field set to 2? It should be MEMBER_OF 1, since the pointer-pointee pair c.a[0:N] is part of struct c which is the first entry on the list.

In D68100#2168350, @grokos wrote:
@lildmh I've got a question unrelated to the problem we are discussing here. I ran declare_mapper_target.cpp and when libomptarget calls the mapper function it prints the following:
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x00007ffcebd101e0, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x000000000231bfe0, Size=4096, Type=0x2000000000013)
Why is the second entry's MEMBER_OF field set to 2? It should be MEMBER_OF 1, since the pointer-pointee pair c.a[0:N] is part of struct c which is the first entry on the list.

Good point. May be a bug. Let me check later

@lildmh: I think I've found a bug. I used declare_mapper_target.cpp. When we call the mapper function, it generates 3 components. The first 2 are identical and correspond to the parent struct. This is what MapperComponents looks like inside function target_data_mapper:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 3, capacity 4 = {{Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}}}

Mapping the parent struct twice is problematic. If we have more struct members and some of them are NOT pointers, then upon target_data_end libomptarget will check the parent struct's reference counter to determine whether the scalar member must be copied back to the host. If the reference counter is greater than 1, then the runtime will skip copying back the scalar. Mapping the parent struct two times in a row results in RefCount=2.

So in the example below (modified declare_mapper_target.cpp) the scalar is processed by libomptarget but because at that time the struct's RefCount is 2 we never copy the scalar back:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

#define NUM 1024

class C {
public:
  int *a;
  int onHost;
};

#pragma omp declare mapper(id: C s) map(s.a[0:NUM], s.onHost)

int main() {
  C c;
  c.a = (int*) malloc(sizeof(int)*NUM);
  c.onHost = -1;
  for (int i = 0; i < NUM; i++) {
    c.a[i] = 1;
  }
  #pragma omp target teams distribute parallel for map(mapper(id),tofrom: c)
  for (int i = 0; i < NUM; i++) {
    ++c.a[i];
    if (i == 0) {
      c.onHost = omp_is_initial_device();
    }
  }

  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += c.a[i];
  }
  printf("Executed on %s\n", c.onHost==1 ? "host" : c.onHost==0 ? "device" : "unknown");
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Upon target_data_end the mapper function will generate this:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 4, capacity 4 = {{Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 16, Type = 32}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 12, Type = 32}, {Base = 0x7fffffffb588,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb590, Size = 4, Type = 562949953421315}}}

When libomptarget processes the scalar, the parent struct's RefCount is 2, so inside the if-block in omptarget.cpp:507-524 CopyMember will never be set to true and the scalar will never be copied back to the host.

Can you revert the patches for declare mapper until it is fixed?

Thanks George for looking into this, and sorry for the late response.

I believe this is not a bug, it's a design choice we made early. The design choice is we map the whole structure at the beginning for one piece so we don't map the individual parts of them separately, which may cause a lot of memcpy.

For the RefCount, when the runtime check the 2nd component in your example, it will find it's already mapped and will not increase the RefCount, so I think it's not a bug and the behavior is correct.

In D68100#2173460, @lildmh wrote:

Thanks George for looking into this, and sorry for the late response.

I believe this is not a bug, it's a design choice we made early. The design choice is we map the whole structure at the beginning for one piece so we don't map the individual parts of them separately, which may cause a lot of memcpy.

For the RefCount, when the runtime check the 2nd component in your example, it will find it's already mapped and will not increase the RefCount, so I think it's not a bug and the behavior is correct.

No, this is not related to our design choices. Here we are mapping the whole struct twice for no reason. The entries should be:

1) combined entry (i.e. the entry that maps the whole struct
    base = &c, begin = &c.a, size = sizeof(class S), type = TARGET_PARAM
2) member entry for c.a[0:NUM]
    base = &c.a, begin = &c.a[0], size = NUM*sizeof(int), type = MEMBER_OF(1) | PTR_AND_OBJ | TO | FROM
3) member entry for c.onHost
    base = &c, begin = &c.onHost, size = sizeof(int), type = MEMBER_OF(1) | TO | FROM

But what happens now is that the combined entry is emitted twice, so MapperComponents looks like this:

<combined entry>, <combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

instead of

<combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

And what's more, the first combined entry has size=16 whereas the second combined entry has size=12. Where does this 16 come from? The size of the struct is 12 bytes (a pointer + an int). This also explains why the MEMBER_OF field is set to 2, because the second element in the list of arguments is also the combined entry.

There is no rationale behind emitting the combined entry twice, on the contrary it leads to errors because the RefCount is indeed incremented when it shouldn't.

This is libomptarget's debug output from the provided example upon entering the target region:

Libomptarget --> Entering target region with entry point 0x0000000000401409 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e50
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc24203368) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=12, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x0000000000d89ff0, Size=4096, Type=0x2000000000013).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a70, Size=4, Type=0x2000000000003).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc24203a68, HstBegin=0x00007ffc24203a68, HstEnd=0x00007ffc24203a78, TgtBegin=0x00007fa582400000
Libomptarget --> There are 16 bytes allocated at target address 0x00007fa582400000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=12)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=12, updated RefCount=2
Libomptarget --> There are 12 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000d89ff0, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000d89ff0, HstBegin=0x0000000000d89ff0, HstEnd=0x0000000000d8aff0, TgtBegin=0x00007fa582400200
Libomptarget --> There are 4096 bytes allocated at target address 0x00007fa582400200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x0000000000d89ff0) -> (tgt:0x00007fa582400200)
Libomptarget --> Update pointer (0x00007fa582400000) -> [0x00007fa582400200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a70, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a70, TgtPtrBegin=0x00007fa582400008, Size=4, RefCount=2
Libomptarget --> There are 4 bytes allocated at target address 0x00007fa582400008 - is not new
Libomptarget --> DeviceTy::getMapEntry: requested entry found
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=16, RefCount=2
Libomptarget --> Obtained target argument 0x00007fa582400000 from host pointer 0x00007ffc24203a68
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_801_1ee0443_main_l28 with pointer 0x0000000001427dc0 (index=0).

When we process the 16-byte combined entry we allocate space for the struct and RefCount=1, then we process the 12-byte combined entry and RefCount is incremented to 2.

In D68100#2173541, @grokos wrote:
In D68100#2173460, @lildmh wrote:

Thanks George for looking into this, and sorry for the late response.

I believe this is not a bug, it's a design choice we made early. The design choice is we map the whole structure at the beginning for one piece so we don't map the individual parts of them separately, which may cause a lot of memcpy.

For the RefCount, when the runtime check the 2nd component in your example, it will find it's already mapped and will not increase the RefCount, so I think it's not a bug and the behavior is correct.

No, this is not related to our design choices. Here we are mapping the whole struct twice for no reason. The entries should be:
1) combined entry (i.e. the entry that maps the whole struct
    base = &c, begin = &c.a, size = sizeof(class S), type = TARGET_PARAM
2) member entry for c.a[0:NUM]
    base = &c.a, begin = &c.a[0], size = NUM*sizeof(int), type = MEMBER_OF(1) | PTR_AND_OBJ | TO | FROM
3) member entry for c.onHost
    base = &c, begin = &c.onHost, size = sizeof(int), type = MEMBER_OF(1) | TO | FROM
But what happens now is that the combined entry is emitted twice, so MapperComponents looks like this:
<combined entry>, <combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>
instead of
<combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>
And what's more, the first combined entry has size=16 whereas the second combined entry has size=12. Where does this 16 come from? The size of the struct is 12 bytes (a pointer + an int). This also explains why the MEMBER_OF field is set to 2, because the second element in the list of arguments is also the combined entry.

The first combined entry comes from mapping the whole structure. I think because of the alignment, the structure is actually 16 bytes. The 2nd combined entry is the real entry emitted to map the structure. Why it looks like there are 2 of them, because at the beginning of a mapper function, it maps the whole structure no matter what, which generate the 1st combined entry you saw here. Then we generate detailed mapping entry, which generates the 2nd combined entry you saw here. They are not necessarily the same. It happens to be similar in this example.

There is no rationale behind emitting the combined entry twice, on the contrary it leads to errors because the RefCount is indeed incremented when it shouldn't.

This is libomptarget's debug output from the provided example upon entering the target region:

Libomptarget --> Entering target region with entry point 0x0000000000401409 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e50
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc24203368) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=12, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x0000000000d89ff0, Size=4096, Type=0x2000000000013).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a70, Size=4, Type=0x2000000000003).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc24203a68, HstBegin=0x00007ffc24203a68, HstEnd=0x00007ffc24203a78, TgtBegin=0x00007fa582400000
Libomptarget --> There are 16 bytes allocated at target address 0x00007fa582400000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=12)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=12, updated RefCount=2
Libomptarget --> There are 12 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000d89ff0, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000d89ff0, HstBegin=0x0000000000d89ff0, HstEnd=0x0000000000d8aff0, TgtBegin=0x00007fa582400200
Libomptarget --> There are 4096 bytes allocated at target address 0x00007fa582400200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x0000000000d89ff0) -> (tgt:0x00007fa582400200)
Libomptarget --> Update pointer (0x00007fa582400000) -> [0x00007fa582400200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a70, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a70, TgtPtrBegin=0x00007fa582400008, Size=4, RefCount=2
Libomptarget --> There are 4 bytes allocated at target address 0x00007fa582400008 - is not new
Libomptarget --> DeviceTy::getMapEntry: requested entry found
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=16, RefCount=2
Libomptarget --> Obtained target argument 0x00007fa582400000 from host pointer 0x00007ffc24203a68
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_801_1ee0443_main_l28 with pointer 0x0000000001427dc0 (index=0).

When we process the 16-byte combined entry we allocate space for the struct and RefCount=1, then we process the 12-byte combined entry and RefCount is incremented to 2.

It indeed increases RefCount after I checked the code and you are right. I think it should not cause any problem? Because RefCount will be reduced before to 0 at exit (It looks like the combined entry's mapped twice, it should also be 'deleted' twice when the target region exits).

In D68100#2173612, @lildmh wrote:

The first combined entry comes from mapping the whole structure. I think because of the alignment, the structure is actually 16 bytes. The 2nd combined entry is the real entry emitted to map the structure. Why it looks like there are 2 of them, because at the beginning of a mapper function, it maps the whole structure no matter what, which generate the 1st combined entry you saw here. Then we generate detailed mapping entry, which generates the 2nd combined entry you saw here. They are not necessarily the same. It happens to be similar in this example.

I assure you that's not how structs are mapped.

You don't map "the whole struct", you only map what is needed. For this you emit a combined entry which has a size large enough to encompass all members we are interested in. Then entries for individual members follow. One combined entry + as many member entries as needed. The first entry which "maps the whole struct" should not be there and is plain wrong.

In D68100#2173612, @lildmh wrote:

It indeed increases RefCount after I checked the code and you are right. I think it should not cause any problem? Because RefCount will be reduced before to 0 at exit (It looks like the combined entry's mapped twice, it should also be 'deleted' twice when the target region exits).

The problem is that when individual members are processed in target_data_end, RefCount = 2 so these members will not be copied back to the host. RefCount must be 1 for data motion to take place and in this case it's not.

Anyway, I modified libomptarget locally to ignore the 16-byte combined entry and now all tests pass. Can you please submit a clang patch which removes the first combined entry?

In D68100#2173703, @grokos wrote:

In D68100#2173612, @lildmh wrote:

The first combined entry comes from mapping the whole structure. I think because of the alignment, the structure is actually 16 bytes. The 2nd combined entry is the real entry emitted to map the structure. Why it looks like there are 2 of them, because at the beginning of a mapper function, it maps the whole structure no matter what, which generate the 1st combined entry you saw here. Then we generate detailed mapping entry, which generates the 2nd combined entry you saw here. They are not necessarily the same. It happens to be similar in this example.

I assure you that's not how structs are mapped.

You don't map "the whole struct", you only map what is needed. For this you emit a combined entry which has a size large enough to encompass all members we are interested in. Then entries for individual members follow. One combined entry + as many member entries as needed. The first entry which "maps the whole struct" should not be there and is plain wrong.

This is an optimization brought up by Deepak. I guess you were in that meeting too but forgot. It could be quite useful when you map an array of struct/class. Assume you map 1000 of this structure, with this optimization most memory allocation can be done in a single allocation, instead of allocation 12 bytes memory 1000 times.

Thinking about it, it's actually important for correctness too. Assume you map C a[2]. If you map separately, a[0] and a[1] could be mapped to not contiguous locations, and it will cause error/segfault when GPU kernel access this array. If you allocate the whole array a[2] together, such problem won't happen.

In D68100#2173612, @lildmh wrote:

It indeed increases RefCount after I checked the code and you are right. I think it should not cause any problem? Because RefCount will be reduced before to 0 at exit (It looks like the combined entry's mapped twice, it should also be 'deleted' twice when the target region exits).

The problem is that when individual members are processed in target_data_end, RefCount = 2 so these members will not be copied back to the host. RefCount must be 1 for data motion to take place and in this case it's not.

Anyway, I modified libomptarget locally to ignore the 16-byte combined entry and now all tests pass. Can you please submit a clang patch which removes the first combined entry?

I believe RefCount should be reduced to 1 when want to copy it back in target_data_end, could you post the whole trace of debug output how RefCount changes in target_data_end?

From my perspective, the declare_mapper_target.cpp code is semantically equivalent to:

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}

and

#pragma omp target enter data map(to: c)
#pragma omp target enter data map(to: c.a[0:NUM])

#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
#pragma omp target exit data map(from: c.a[0:NUM])
#pragma omp target exit data map(from: c)

Can you express the behavior of your mapping implementation in terms of OpenMP target enter/exit data primitives?

In D68100#2173949, @protze.joachim wrote:
From my perspective, the declare_mapper_target.cpp code is semantically equivalent to:
#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
and
#pragma omp target enter data map(to: c)
#pragma omp target enter data map(to: c.a[0:NUM])

#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
#pragma omp target exit data map(from: c.a[0:NUM])
#pragma omp target exit data map(from: c)
Can you express the behavior of your mapping implementation in terms of OpenMP target enter/exit data primitives?

You are basically right. In implementation, a function is generated for every mapper to do all internal mapping. More details can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

You are basically right. In implementation, a function is generated for every mapper to do all internal mapping. More details can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

This document does not say, which concrete mapping operations you push for the concrete case of the failing test. Can you probably express this in terms of omp target enter/exit data operations?

Your understanding above is exactly right. It should be equivalent to

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}

protze.joachim mentioned this in D84557: [OpenMP][Tests] Enable nvptx64 testing for most libomptarget tests.Jul 28 2020, 2:15 AM

In D68100#2173822, @lildmh wrote:

This is an optimization brought up by Deepak. I guess you were in that meeting too but forgot. It could be quite useful when you map an array of struct/class. Assume you map 1000 of this structure, with this optimization most memory allocation can be done in a single allocation, instead of allocation 12 bytes memory 1000 times.

Thinking about it, it's actually important for correctness too. Assume you map C a[2]. If you map separately, a[0] and a[1] could be mapped to not contiguous locations, and it will cause error/segfault when GPU kernel access this array. If you allocate the whole array a[2] together, such problem won't happen.

Sorry for the late response. Here you are talking about something else. The case you are considering is an array of structs. In this case, indeed we have to allocate the whole array beforehand. It's not an optimization, it's a correctness issue as you point out (array objects must be allocated consecutively). In the failing tests, however, we have single structs, not an array of structs. The difference is that in the former case the object we are mapping is the array, whereas in the latter case it's the struct. The two cases are not related to one another unless we intend to treat both of them uniformly, i.e. even if we have a single struct we still treat it as if it were the sole element of a length-1 array. Do I understand correctly?

Revision Contents

Path

Size

openmp/

libomptarget/

include/

omptarget.h

46 lines

src/

10 lines

153 lines

113 lines

25 lines

4 lines

test/

mapping/

declare_mapper_api.cpp

3 lines

declare_mapper_target.cpp

37 lines

declare_mapper_target_data.cpp

40 lines

declare_mapper_target_data_enter_exit.cpp

39 lines

declare_mapper_target_update.cpp

61 lines

Diff 278352

openmp/libomptarget/include/omptarget.h

Show First 20 Lines • Show All 154 Lines • ▼ Show 20 Lines	void __tgt_target_data_begin(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types);		int64_t *arg_types);
void __tgt_target_data_begin_nowait(int64_t device_id, int32_t arg_num,		void __tgt_target_data_begin_nowait(int64_t device_id, int32_t arg_num,
void args_base, void args,		void args_base, void args,
int64_t arg_sizes, int64_t arg_types,		int64_t arg_sizes, int64_t arg_types,
int32_t depNum, void *depList,		int32_t depNum, void *depList,
int32_t noAliasDepNum,		int32_t noAliasDepNum,
void *noAliasDepList);		void *noAliasDepList);
		void __tgt_target_data_begin_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers);
		void __tgt_target_data_begin_nowait_mapper(
		int64_t device_id, int32_t arg_num, void args_base, void args,
		int64_t arg_sizes, int64_t arg_types, void **arg_mappers, int32_t depNum,
		void depList, int32_t noAliasDepNum, void noAliasDepList);

// passes data from the target, release target memory and destroys the		// passes data from the target, release target memory and destroys the
// host-target mapping (top entry from the stack of data maps) created by		// host-target mapping (top entry from the stack of data maps) created by
// the last __tgt_target_data_begin		// the last __tgt_target_data_begin
void __tgt_target_data_end(int64_t device_id, int32_t arg_num, void **args_base,		void __tgt_target_data_end(int64_t device_id, int32_t arg_num, void **args_base,
void *args, int64_t arg_sizes, int64_t *arg_types);		void *args, int64_t arg_sizes, int64_t *arg_types);
void __tgt_target_data_end_nowait(int64_t device_id, int32_t arg_num,		void __tgt_target_data_end_nowait(int64_t device_id, int32_t arg_num,
void args_base, void args,		void args_base, void args,
int64_t arg_sizes, int64_t arg_types,		int64_t arg_sizes, int64_t arg_types,
int32_t depNum, void *depList,		int32_t depNum, void *depList,
int32_t noAliasDepNum, void *noAliasDepList);		int32_t noAliasDepNum, void *noAliasDepList);
		void __tgt_target_data_end_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers);
		void __tgt_target_data_end_nowait_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers, int32_t depNum,
		void *depList, int32_t noAliasDepNum,
		void *noAliasDepList);

/// passes data to/from the target		/// passes data to/from the target
void __tgt_target_data_update(int64_t device_id, int32_t arg_num,		void __tgt_target_data_update(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types);		int64_t *arg_types);
void __tgt_target_data_update_nowait(int64_t device_id, int32_t arg_num,		void __tgt_target_data_update_nowait(int64_t device_id, int32_t arg_num,
void args_base, void args,		void args_base, void args,
int64_t arg_sizes, int64_t arg_types,		int64_t arg_sizes, int64_t arg_types,
int32_t depNum, void *depList,		int32_t depNum, void *depList,
int32_t noAliasDepNum,		int32_t noAliasDepNum,
void *noAliasDepList);		void *noAliasDepList);
		void __tgt_target_data_update_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers);
		void __tgt_target_data_update_nowait_mapper(
		int64_t device_id, int32_t arg_num, void args_base, void args,
		int64_t arg_sizes, int64_t arg_types, void **arg_mappers, int32_t depNum,
		void depList, int32_t noAliasDepNum, void noAliasDepList);

// Performs the same actions as data_begin in case arg_num is non-zero		// Performs the same actions as data_begin in case arg_num is non-zero
// and initiates run of offloaded region on target platform; if arg_num		// and initiates run of offloaded region on target platform; if arg_num
// is non-zero after the region execution is done it also performs the		// is non-zero after the region execution is done it also performs the
// same action as data_end above. The following types are used; this		// same action as data_end above. The following types are used; this
// function returns 0 if it was able to transfer the execution to a		// function returns 0 if it was able to transfer the execution to a
// target and an int different from zero otherwise.		// target and an int different from zero otherwise.
int __tgt_target(int64_t device_id, void *host_ptr, int32_t arg_num,		int __tgt_target(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types);		int64_t *arg_types);
int __tgt_target_nowait(int64_t device_id, void *host_ptr, int32_t arg_num,		int __tgt_target_nowait(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t arg_types, int32_t depNum, void depList,		int64_t arg_types, int32_t depNum, void depList,
int32_t noAliasDepNum, void *noAliasDepList);		int32_t noAliasDepNum, void *noAliasDepList);
		int __tgt_target_mapper(int64_t device_id, void *host_ptr, int32_t arg_num,
		void args_base, void args, int64_t *arg_sizes,
		int64_t arg_types, void *arg_mappers);
		int __tgt_target_nowait_mapper(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers, int32_t depNum,
		void *depList, int32_t noAliasDepNum,
		void *noAliasDepList);

int __tgt_target_teams(int64_t device_id, void *host_ptr, int32_t arg_num,		int __tgt_target_teams(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types, int32_t num_teams,		int64_t *arg_types, int32_t num_teams,
int32_t thread_limit);		int32_t thread_limit);
int __tgt_target_teams_nowait(int64_t device_id, void *host_ptr,		int __tgt_target_teams_nowait(int64_t device_id, void *host_ptr,
int32_t arg_num, void args_base, void args,		int32_t arg_num, void args_base, void args,
int64_t arg_sizes, int64_t arg_types,		int64_t arg_sizes, int64_t arg_types,
int32_t num_teams, int32_t thread_limit,		int32_t num_teams, int32_t thread_limit,
int32_t depNum, void *depList,		int32_t depNum, void *depList,
int32_t noAliasDepNum, void *noAliasDepList);		int32_t noAliasDepNum, void *noAliasDepList);
		int __tgt_target_teams_mapper(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers, int32_t num_teams,
		int32_t thread_limit);
		int __tgt_target_teams_nowait_mapper(
		int64_t device_id, void host_ptr, int32_t arg_num, void *args_base,
		void *args, int64_t arg_sizes, int64_t arg_types, void *arg_mappers,
		int32_t num_teams, int32_t thread_limit, int32_t depNum, void *depList,
		int32_t noAliasDepNum, void *noAliasDepList);

void __kmpc_push_target_tripcount(int64_t device_id, uint64_t loop_tripcount);		void __kmpc_push_target_tripcount(int64_t device_id, uint64_t loop_tripcount);

#ifdef __cplusplus		#ifdef __cplusplus
}		}
#endif		#endif

#ifdef OMPTARGET_DEBUG		#ifdef OMPTARGET_DEBUG
#include <stdio.h>		#include <stdio.h>
▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

openmp/libomptarget/src/exports

	VERS1.0 {			VERS1.0 {
	global:			global:
	__tgt_register_requires;			__tgt_register_requires;
	__tgt_register_lib;			__tgt_register_lib;
	__tgt_unregister_lib;			__tgt_unregister_lib;
	__tgt_target_data_begin;			__tgt_target_data_begin;
	__tgt_target_data_end;			__tgt_target_data_end;
	__tgt_target_data_update;			__tgt_target_data_update;
	__tgt_target;			__tgt_target;
	__tgt_target_teams;			__tgt_target_teams;
	__tgt_target_data_begin_nowait;			__tgt_target_data_begin_nowait;
	__tgt_target_data_end_nowait;			__tgt_target_data_end_nowait;
	__tgt_target_data_update_nowait;			__tgt_target_data_update_nowait;
	__tgt_target_nowait;			__tgt_target_nowait;
	__tgt_target_teams_nowait;			__tgt_target_teams_nowait;
				__tgt_target_data_begin_mapper;
				__tgt_target_data_end_mapper;
				__tgt_target_data_update_mapper;
				__tgt_target_mapper;
				__tgt_target_teams_mapper;
				__tgt_target_data_begin_nowait_mapper;
				__tgt_target_data_end_nowait_mapper;
				__tgt_target_data_update_nowait_mapper;
				__tgt_target_nowait_mapper;
				__tgt_target_teams_nowait_mapper;
	__tgt_mapper_num_components;			__tgt_mapper_num_components;
	__tgt_push_mapper_component;			__tgt_push_mapper_component;
	omp_get_num_devices;			omp_get_num_devices;
	omp_get_initial_device;			omp_get_initial_device;
	omp_target_alloc;			omp_target_alloc;
	omp_target_free;			omp_target_free;
	omp_target_is_present;			omp_target_is_present;
	omp_target_memcpy;			omp_target_memcpy;
	omp_target_memcpy_rect;			omp_target_memcpy_rect;
	omp_target_associate_ptr;			omp_target_associate_ptr;
	omp_target_disassociate_ptr;			omp_target_disassociate_ptr;
	__kmpc_push_target_tripcount;			__kmpc_push_target_tripcount;
	local:			local:
	*;			*;
	};			};

openmp/libomptarget/src/interface.cpp

Show First 20 Lines • Show All 85 Lines • ▼ Show 20 Lines	EXTERN void __tgt_unregister_lib(__tgt_bin_desc *desc) {
RTLs->UnregisterLib(desc);		RTLs->UnregisterLib(desc);
}		}

/// creates host-to-target data mapping, stores it in the		/// creates host-to-target data mapping, stores it in the
/// libomptarget.so internal structure (an entry in a stack of data maps)		/// libomptarget.so internal structure (an entry in a stack of data maps)
/// and passes the data to the device.		/// and passes the data to the device.
EXTERN void __tgt_target_data_begin(int64_t device_id, int32_t arg_num,		EXTERN void __tgt_target_data_begin(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types) {		void args_base, void args, int64_t arg_sizes, int64_t arg_types) {
		__tgt_target_data_begin_mapper(device_id, arg_num, args_base, args,
		arg_sizes, arg_types, nullptr);
		}

		EXTERN void __tgt_target_data_begin_nowait(int64_t device_id, int32_t arg_num,
		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		int32_t depNum, void *depList, int32_t noAliasDepNum,
		void *noAliasDepList) {
		if (depNum + noAliasDepNum > 0)
		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

		__tgt_target_data_begin_mapper(device_id, arg_num, args_base, args,
		arg_sizes, arg_types, nullptr);
		}

		EXTERN void __tgt_target_data_begin_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers) {
if (IsOffloadDisabled()) return;		if (IsOffloadDisabled()) return;

DP("Entering data begin region for device %" PRId64 " with %d mappings\n",		DP("Entering data begin region for device %" PRId64 " with %d mappings\n",
device_id, arg_num);		device_id, arg_num);

// No devices available?		// No devices available?
if (device_id == OFFLOAD_DEVICE_DEFAULT) {		if (device_id == OFFLOAD_DEVICE_DEFAULT) {
device_id = omp_get_default_device();		device_id = omp_get_default_device();
Show All 12 Lines	#ifdef OMPTARGET_DEBUG
for (int i = 0; i < arg_num; ++i) {		for (int i = 0; i < arg_num; ++i) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 "\n",		", Type=0x%" PRIx64 "\n",
i, DPxPTR(args_base[i]), DPxPTR(args[i]), arg_sizes[i], arg_types[i]);		i, DPxPTR(args_base[i]), DPxPTR(args[i]), arg_sizes[i], arg_types[i]);
}		}
#endif		#endif

int rc = target_data_begin(Device, arg_num, args_base, args, arg_sizes,		int rc = target_data_begin(Device, arg_num, args_base, args, arg_sizes,
arg_types, nullptr);		arg_types, arg_mappers, nullptr);
HandleTargetOutcome(rc == OFFLOAD_SUCCESS);		HandleTargetOutcome(rc == OFFLOAD_SUCCESS);
}		}

EXTERN void __tgt_target_data_begin_nowait(int64_t device_id, int32_t arg_num,		EXTERN void __tgt_target_data_begin_nowait_mapper(int64_t device_id,
void args_base, void args, int64_t arg_sizes, int64_t arg_types,		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
int32_t depNum, void *depList, int32_t noAliasDepNum,		int64_t arg_types, void arg_mappers, int32_t depNum, void depList,
void *noAliasDepList) {		int32_t noAliasDepNum, void *noAliasDepList) {
if (depNum + noAliasDepNum > 0)		if (depNum + noAliasDepNum > 0)
__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

__tgt_target_data_begin(device_id, arg_num, args_base, args, arg_sizes,		__tgt_target_data_begin_mapper(device_id, arg_num, args_base, args,
arg_types);		arg_sizes, arg_types, arg_mappers);
}		}

/// passes data from the target, releases target memory and destroys		/// passes data from the target, releases target memory and destroys
/// the host-target mapping (top entry from the stack of data maps)		/// the host-target mapping (top entry from the stack of data maps)
/// created by the last __tgt_target_data_begin.		/// created by the last __tgt_target_data_begin.
EXTERN void __tgt_target_data_end(int64_t device_id, int32_t arg_num,		EXTERN void __tgt_target_data_end(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types) {		void args_base, void args, int64_t arg_sizes, int64_t arg_types) {
		__tgt_target_data_end_mapper(device_id, arg_num, args_base, args, arg_sizes,
		arg_types, nullptr);
		}

		EXTERN void __tgt_target_data_end_nowait(int64_t device_id, int32_t arg_num,
		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		int32_t depNum, void *depList, int32_t noAliasDepNum,
		void *noAliasDepList) {
		if (depNum + noAliasDepNum > 0)
		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

		__tgt_target_data_end_mapper(device_id, arg_num, args_base, args, arg_sizes,
		arg_types, nullptr);
		}

		EXTERN void __tgt_target_data_end_mapper(int64_t device_id, int32_t arg_num,
		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers) {
if (IsOffloadDisabled()) return;		if (IsOffloadDisabled()) return;
DP("Entering data end region with %d mappings\n", arg_num);		DP("Entering data end region with %d mappings\n", arg_num);

// No devices available?		// No devices available?
if (device_id == OFFLOAD_DEVICE_DEFAULT) {		if (device_id == OFFLOAD_DEVICE_DEFAULT) {
device_id = omp_get_default_device();		device_id = omp_get_default_device();
}		}

Show All 17 Lines	#ifdef OMPTARGET_DEBUG
for (int i=0; i<arg_num; ++i) {		for (int i=0; i<arg_num; ++i) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),		", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),
arg_sizes[i], arg_types[i]);		arg_sizes[i], arg_types[i]);
}		}
#endif		#endif

int rc = target_data_end(Device, arg_num, args_base, args, arg_sizes,		int rc = target_data_end(Device, arg_num, args_base, args, arg_sizes,
arg_types, nullptr);		arg_types, arg_mappers, nullptr);
HandleTargetOutcome(rc == OFFLOAD_SUCCESS);		HandleTargetOutcome(rc == OFFLOAD_SUCCESS);
}		}

EXTERN void __tgt_target_data_end_nowait(int64_t device_id, int32_t arg_num,		EXTERN void __tgt_target_data_end_nowait_mapper(int64_t device_id,
		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
		int64_t arg_types, void arg_mappers, int32_t depNum, void depList,
		int32_t noAliasDepNum, void *noAliasDepList) {
		if (depNum + noAliasDepNum > 0)
		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

		__tgt_target_data_end_mapper(device_id, arg_num, args_base, args, arg_sizes,
		arg_types, arg_mappers);
		}

		EXTERN void __tgt_target_data_update(int64_t device_id, int32_t arg_num,
		void args_base, void args, int64_t arg_sizes, int64_t arg_types) {
		__tgt_target_data_update_mapper(device_id, arg_num, args_base, args,
		arg_sizes, arg_types, nullptr);
		}

		EXTERN void __tgt_target_data_update_nowait(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types,		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
int32_t depNum, void *depList, int32_t noAliasDepNum,		int32_t depNum, void *depList, int32_t noAliasDepNum,
void *noAliasDepList) {		void *noAliasDepList) {
if (depNum + noAliasDepNum > 0)		if (depNum + noAliasDepNum > 0)
__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

__tgt_target_data_end(device_id, arg_num, args_base, args, arg_sizes,		__tgt_target_data_update_mapper(device_id, arg_num, args_base, args,
arg_types);		arg_sizes, arg_types, nullptr);
}		}

EXTERN void __tgt_target_data_update(int64_t device_id, int32_t arg_num,		EXTERN void __tgt_target_data_update_mapper(int64_t device_id, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types) {		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers) {
if (IsOffloadDisabled()) return;		if (IsOffloadDisabled()) return;
DP("Entering data update with %d mappings\n", arg_num);		DP("Entering data update with %d mappings\n", arg_num);

// No devices available?		// No devices available?
if (device_id == OFFLOAD_DEVICE_DEFAULT) {		if (device_id == OFFLOAD_DEVICE_DEFAULT) {
device_id = omp_get_default_device();		device_id = omp_get_default_device();
}		}

if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {		if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {
DP("Failed to get device %" PRId64 " ready\n", device_id);		DP("Failed to get device %" PRId64 " ready\n", device_id);
HandleTargetOutcome(false);		HandleTargetOutcome(false);
return;		return;
}		}

DeviceTy& Device = Devices[device_id];		DeviceTy& Device = Devices[device_id];
int rc = target_data_update(Device, arg_num, args_base,		int rc = target_data_update(Device, arg_num, args_base,
args, arg_sizes, arg_types);		args, arg_sizes, arg_types, arg_mappers);
HandleTargetOutcome(rc == OFFLOAD_SUCCESS);		HandleTargetOutcome(rc == OFFLOAD_SUCCESS);
}		}

EXTERN void __tgt_target_data_update_nowait(		EXTERN void __tgt_target_data_update_nowait_mapper(int64_t device_id,
int64_t device_id, int32_t arg_num, void args_base, void args,		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
int64_t arg_sizes, int64_t arg_types, int32_t depNum, void *depList,		int64_t arg_types, void arg_mappers, int32_t depNum, void depList,
int32_t noAliasDepNum, void *noAliasDepList) {		int32_t noAliasDepNum, void *noAliasDepList) {
if (depNum + noAliasDepNum > 0)		if (depNum + noAliasDepNum > 0)
__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

__tgt_target_data_update(device_id, arg_num, args_base, args, arg_sizes,		__tgt_target_data_update_mapper(device_id, arg_num, args_base, args,
arg_types);		arg_sizes, arg_types, arg_mappers);
}		}

EXTERN int __tgt_target(int64_t device_id, void *host_ptr, int32_t arg_num,		EXTERN int __tgt_target(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types) {		void args_base, void args, int64_t arg_sizes, int64_t arg_types) {
		return __tgt_target_mapper(device_id, host_ptr, arg_num, args_base, args,
		arg_sizes, arg_types, nullptr);
		}

		EXTERN int __tgt_target_nowait(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
		int64_t arg_types, int32_t depNum, void depList, int32_t noAliasDepNum,
		void *noAliasDepList) {
		if (depNum + noAliasDepNum > 0)
		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

		return __tgt_target_mapper(device_id, host_ptr, arg_num, args_base, args,
		arg_sizes, arg_types, nullptr);
		}

		EXTERN int __tgt_target_mapper(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
		int64_t arg_types, void *arg_mappers) {
if (IsOffloadDisabled()) return OFFLOAD_FAIL;		if (IsOffloadDisabled()) return OFFLOAD_FAIL;
DP("Entering target region with entry point " DPxMOD " and device Id %"		DP("Entering target region with entry point " DPxMOD " and device Id %"
PRId64 "\n", DPxPTR(host_ptr), device_id);		PRId64 "\n", DPxPTR(host_ptr), device_id);

if (device_id == OFFLOAD_DEVICE_DEFAULT) {		if (device_id == OFFLOAD_DEVICE_DEFAULT) {
device_id = omp_get_default_device();		device_id = omp_get_default_device();
}		}

if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {		if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {
DP("Failed to get device %" PRId64 " ready\n", device_id);		DP("Failed to get device %" PRId64 " ready\n", device_id);
HandleTargetOutcome(false);		HandleTargetOutcome(false);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

#ifdef OMPTARGET_DEBUG		#ifdef OMPTARGET_DEBUG
for (int i=0; i<arg_num; ++i) {		for (int i=0; i<arg_num; ++i) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),		", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),
arg_sizes[i], arg_types[i]);		arg_sizes[i], arg_types[i]);
}		}
#endif		#endif

int rc = target(device_id, host_ptr, arg_num, args_base, args, arg_sizes,		int rc = target(device_id, host_ptr, arg_num, args_base, args, arg_sizes,
arg_types, 0, 0, false /team/);		arg_types, arg_mappers, 0, 0, false /team/);
HandleTargetOutcome(rc == OFFLOAD_SUCCESS);		HandleTargetOutcome(rc == OFFLOAD_SUCCESS);
return rc;		return rc;
}		}

EXTERN int __tgt_target_nowait(int64_t device_id, void *host_ptr,		EXTERN int __tgt_target_nowait_mapper(int64_t device_id, void *host_ptr,
int32_t arg_num, void args_base, void args, int64_t *arg_sizes,		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
int64_t arg_types, int32_t depNum, void depList, int32_t noAliasDepNum,		int64_t arg_types, void arg_mappers, int32_t depNum, void depList,
void *noAliasDepList) {		int32_t noAliasDepNum, void *noAliasDepList) {
if (depNum + noAliasDepNum > 0)		if (depNum + noAliasDepNum > 0)
__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

return __tgt_target(device_id, host_ptr, arg_num, args_base, args, arg_sizes,		return __tgt_target_mapper(device_id, host_ptr, arg_num, args_base, args,
arg_types);		arg_sizes, arg_types, arg_mappers);
}		}

EXTERN int __tgt_target_teams(int64_t device_id, void *host_ptr,		EXTERN int __tgt_target_teams(int64_t device_id, void *host_ptr,
int32_t arg_num, void args_base, void args, int64_t *arg_sizes,		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types, int32_t team_num, int32_t thread_limit) {		int64_t *arg_types, int32_t team_num, int32_t thread_limit) {
		return __tgt_target_teams_mapper(device_id, host_ptr, arg_num, args_base,
		args, arg_sizes, arg_types, nullptr, team_num, thread_limit);
		}

		EXTERN int __tgt_target_teams_nowait(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
		int64_t *arg_types, int32_t team_num, int32_t thread_limit, int32_t depNum,
		void depList, int32_t noAliasDepNum, void noAliasDepList) {
		if (depNum + noAliasDepNum > 0)
		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

		return __tgt_target_teams_mapper(device_id, host_ptr, arg_num, args_base,
		args, arg_sizes, arg_types, nullptr, team_num, thread_limit);
		}

		EXTERN int __tgt_target_teams_mapper(int64_t device_id, void *host_ptr,
		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
		int64_t arg_types, void *arg_mappers, int32_t team_num, int32_t thread_limit) {
if (IsOffloadDisabled()) return OFFLOAD_FAIL;		if (IsOffloadDisabled()) return OFFLOAD_FAIL;
DP("Entering target region with entry point " DPxMOD " and device Id %"		DP("Entering target region with entry point " DPxMOD " and device Id %"
PRId64 "\n", DPxPTR(host_ptr), device_id);		PRId64 "\n", DPxPTR(host_ptr), device_id);

if (device_id == OFFLOAD_DEVICE_DEFAULT) {		if (device_id == OFFLOAD_DEVICE_DEFAULT) {
device_id = omp_get_default_device();		device_id = omp_get_default_device();
}		}

if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {		if (CheckDeviceAndCtors(device_id) != OFFLOAD_SUCCESS) {
DP("Failed to get device %" PRId64 " ready\n", device_id);		DP("Failed to get device %" PRId64 " ready\n", device_id);
HandleTargetOutcome(false);		HandleTargetOutcome(false);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

#ifdef OMPTARGET_DEBUG		#ifdef OMPTARGET_DEBUG
for (int i=0; i<arg_num; ++i) {		for (int i=0; i<arg_num; ++i) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),		", Type=0x%" PRIx64 "\n", i, DPxPTR(args_base[i]), DPxPTR(args[i]),
arg_sizes[i], arg_types[i]);		arg_sizes[i], arg_types[i]);
}		}
#endif		#endif

int rc = target(device_id, host_ptr, arg_num, args_base, args, arg_sizes,		int rc = target(device_id, host_ptr, arg_num, args_base, args, arg_sizes,
arg_types, team_num, thread_limit, true /team/);		arg_types, arg_mappers, team_num, thread_limit, true /team/);
HandleTargetOutcome(rc == OFFLOAD_SUCCESS);		HandleTargetOutcome(rc == OFFLOAD_SUCCESS);

return rc;		return rc;
}		}

EXTERN int __tgt_target_teams_nowait(int64_t device_id, void *host_ptr,		EXTERN int __tgt_target_teams_nowait_mapper(int64_t device_id, void *host_ptr,
int32_t arg_num, void args_base, void args, int64_t *arg_sizes,		int32_t arg_num, void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types, int32_t team_num, int32_t thread_limit, int32_t depNum,		int64_t arg_types, void *arg_mappers, int32_t team_num,
void depList, int32_t noAliasDepNum, void noAliasDepList) {		int32_t thread_limit, int32_t depNum, void *depList, int32_t noAliasDepNum,
		void *noAliasDepList) {
if (depNum + noAliasDepNum > 0)		if (depNum + noAliasDepNum > 0)
__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));		__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

return __tgt_target_teams(device_id, host_ptr, arg_num, args_base, args,		return __tgt_target_teams_mapper(device_id, host_ptr, arg_num, args_base,
arg_sizes, arg_types, team_num, thread_limit);		args, arg_sizes, arg_types, arg_mappers, team_num, thread_limit);
}		}

// Get the current number of components for a user-defined mapper.		// Get the current number of components for a user-defined mapper.
EXTERN int64_t __tgt_mapper_num_components(void *rt_mapper_handle) {		EXTERN int64_t __tgt_mapper_num_components(void *rt_mapper_handle) {
auto MapperComponentsPtr = (struct MapperComponentsTy )rt_mapper_handle;		auto MapperComponentsPtr = (struct MapperComponentsTy )rt_mapper_handle;
int64_t size = MapperComponentsPtr->Components.size();		int64_t size = MapperComponentsPtr->Components.size();
DP("__tgt_mapper_num_components(Handle=" DPxMOD ") returns %" PRId64 "\n",		DP("__tgt_mapper_num_components(Handle=" DPxMOD ") returns %" PRId64 "\n",
DPxPTR(rt_mapper_handle), size);		DPxPTR(rt_mapper_handle), size);
Show All 38 Lines

openmp/libomptarget/src/omptarget.cpp

Show First 20 Lines • Show All 160 Lines • ▼ Show 20 Lines	static int InitLibrary(DeviceTy& Device) {
*/		*/
if (!Device.PendingCtorsDtors.empty()) {		if (!Device.PendingCtorsDtors.empty()) {
// Call all ctors for all libraries registered so far		// Call all ctors for all libraries registered so far
for (auto &lib : Device.PendingCtorsDtors) {		for (auto &lib : Device.PendingCtorsDtors) {
if (!lib.second.PendingCtors.empty()) {		if (!lib.second.PendingCtors.empty()) {
DP("Has pending ctors... call now\n");		DP("Has pending ctors... call now\n");
for (auto &entry : lib.second.PendingCtors) {		for (auto &entry : lib.second.PendingCtors) {
void *ctor = entry;		void *ctor = entry;
int rc = target(device_id, ctor, 0, NULL, NULL, NULL,		int rc = target(device_id, ctor, 0, NULL, NULL, NULL, NULL, NULL, 1,
NULL, 1, 1, true /team/);		1, true /team/);
if (rc != OFFLOAD_SUCCESS) {		if (rc != OFFLOAD_SUCCESS) {
DP("Running ctor " DPxMOD " failed.\n", DPxPTR(ctor));		DP("Running ctor " DPxMOD " failed.\n", DPxPTR(ctor));
Device.PendingGlobalsMtx.unlock();		Device.PendingGlobalsMtx.unlock();
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}
}		}
// Clear the list to indicate that this device has been used		// Clear the list to indicate that this device has been used
lib.second.PendingCtors.clear();		lib.second.PendingCtors.clear();
Show All 30 Lines	int CheckDeviceAndCtors(int64_t device_id) {

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

static int32_t member_of(int64_t type) {		static int32_t member_of(int64_t type) {
return ((type & OMP_TGT_MAPTYPE_MEMBER_OF) >> 48) - 1;		return ((type & OMP_TGT_MAPTYPE_MEMBER_OF) >> 48) - 1;
}		}

		/// Call the user-defined mapper function followed by the appropriate
		// target_data_* function (target_data_{begin,end,update}).
		int target_data_mapper(DeviceTy &Device, void *arg_base,
		void arg, int64_t arg_size, int64_t arg_type, void arg_mapper,
		TargetDataFuncPtrTy target_data_function) {
		DP("Calling the mapper function " DPxMOD "\n", DPxPTR(arg_mapper));

		// The mapper function fills up Components.
		MapperComponentsTy MapperComponents;
		MapperFuncPtrTy MapperFuncPtr = (MapperFuncPtrTy)(arg_mapper);
		(MapperFuncPtr)((void )&MapperComponents, arg_base, arg, arg_size,
		arg_type);

		// Construct new arrays for args_base, args, arg_sizes and arg_types
		// using the information in MapperComponents and call the corresponding
		// target_data_* function using these new arrays.
		std::vector<void *> mapper_args_base;
		std::vector<void *> mapper_args;
		std::vector<int64_t> mapper_arg_sizes;
		std::vector<int64_t> mapper_arg_types;

		for (auto& C : MapperComponents.Components) {
		mapper_args_base.push_back(C.Base);
		mapper_args.push_back(C.Begin);
		mapper_arg_sizes.push_back(C.Size);
		mapper_arg_types.push_back(C.Type);
		}

		int rc = target_data_function(Device, MapperComponents.Components.size(),
		mapper_args_base.data(), mapper_args.data(), mapper_arg_sizes.data(),
		mapper_arg_types.data(), /arg_mappers/ nullptr,
		/__tgt_async_info/ nullptr);

		return rc;
		}

/// Internal function to do the mapping and transfer the data to the device		/// Internal function to do the mapping and transfer the data to the device
int target_data_begin(DeviceTy &Device, int32_t arg_num, void **args_base,		int target_data_begin(DeviceTy &Device, int32_t arg_num, void **args_base,
void *args, int64_t arg_sizes, int64_t *arg_types,		void *args, int64_t arg_sizes, int64_t *arg_types,
__tgt_async_info *async_info_ptr) {		void *arg_mappers, __tgt_async_info async_info_ptr) {
// process each input.		// process each input.
for (int32_t i = 0; i < arg_num; ++i) {		for (int32_t i = 0; i < arg_num; ++i) {
// Ignore private variables and arrays - there is no mapping for them.		// Ignore private variables and arrays - there is no mapping for them.
if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|		if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|
(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))		(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))
continue;		continue;

		if (arg_mappers && arg_mappers[i]) {
		// Instead of executing the regular path of target_data_begin, call the
		// target_data_mapper variant which will call target_data_begin again
		// with new arguments.
		DP("Calling target_data_mapper for the %dth argument\n", i);

		int rc = target_data_mapper(Device, args_base[i], args[i], arg_sizes[i],
		arg_types[i], arg_mappers[i], target_data_begin);

		if (rc != OFFLOAD_SUCCESS) {
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions What limitation? Why 0xffff? JonChesterfield: What limitation? Why 0xffff?
		lildmhUnsubmitted Done Reply Inline Actions Because the parent idx in map type has 16 bits, we cannot handle components more than that. lildmh: Because the parent idx in map type has 16 bits, we cannot handle components more than that.
		DP("Call to target_data_begin via target_data_mapper for custom mapper"
		" failed.\n");
		return OFFLOAD_FAIL;
		}

		// Skip the rest of this function, continue to the next argument.
		continue;
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions Size probably returns size_t, why is the induction variable signed? JonChesterfield: Size probably returns size_t, why is the induction variable signed?
		lildmhUnsubmitted Done Reply Inline Actions I think the indices are always type `int32_t` in libomptarget, so I followed the rules. Otherwise there will be a signed and unsigned compariion warning lildmh: I think the indices are always type `int32_t` in libomptarget, so I followed the rules.
		}

void *HstPtrBegin = args[i];		void *HstPtrBegin = args[i];
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions What makes the mapper valid? I don't see any checking in the source. Perhaps just strike the word valid from the comment JonChesterfield: What makes the mapper valid? I don't see any checking in the source. Perhaps just strike the…
		lildmhUnsubmitted Done Reply Inline Actions If there is a valid pointer, which is generated by compiler, I say it's valid. I will remove word valid from the comment lildmh: If there is a valid pointer, which is generated by compiler, I say it's valid. I will remove…
void *HstPtrBase = args_base[i];		void *HstPtrBase = args_base[i];
int64_t data_size = arg_sizes[i];		int64_t data_size = arg_sizes[i];

// Adjust for proper alignment if this is a combined entry (for structs).		// Adjust for proper alignment if this is a combined entry (for structs).
// Look at the next argument - if that is MEMBER_OF this one, then this one		// Look at the next argument - if that is MEMBER_OF this one, then this one
// is a combined entry.		// is a combined entry.
int64_t padding = 0;		int64_t padding = 0;
const int next_i = i+1;		const int next_i = i+1;
▲ Show 20 Lines • Show All 111 Lines • ▼ Show 20 Lines	int target_data_begin(DeviceTy &Device, int32_t arg_num, void **args_base,
}		}

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

/// Internal function to undo the mapping and retrieve the data from the device.		/// Internal function to undo the mapping and retrieve the data from the device.
int target_data_end(DeviceTy &Device, int32_t arg_num, void **args_base,		int target_data_end(DeviceTy &Device, int32_t arg_num, void **args_base,
void *args, int64_t arg_sizes, int64_t *arg_types,		void *args, int64_t arg_sizes, int64_t *arg_types,
__tgt_async_info *async_info_ptr) {		void *arg_mappers, __tgt_async_info async_info_ptr) {
// process each input.		// process each input.
for (int32_t i = arg_num - 1; i >= 0; --i) {		for (int32_t i = arg_num - 1; i >= 0; --i) {
// Ignore private variables and arrays - there is no mapping for them.		// Ignore private variables and arrays - there is no mapping for them.
// Also, ignore the use_device_ptr directive, it has no effect here.		// Also, ignore the use_device_ptr directive, it has no effect here.
if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|		if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|
(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))		(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))
continue;		continue;

		if (arg_mappers && arg_mappers[i]) {
		// Instead of executing the regular path of target_data_end, call the
		// target_data_mapper variant which will call target_data_end again
		// with new arguments.
		DP("Calling target_data_mapper for the %dth argument\n", i);

		int rc = target_data_mapper(Device, args_base[i], args[i], arg_sizes[i],
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions This appears to be a copy and paste of the above JonChesterfield: This appears to be a copy and paste of the above
		arg_types[i], arg_mappers[i], target_data_end);

		if (rc != OFFLOAD_SUCCESS) {
		DP("Call to target_data_end via target_data_mapper for custom mapper"
		" failed.\n");
		return OFFLOAD_FAIL;
		}

		// Skip the rest of this function, continue to the next argument.
		continue;
		}

void *HstPtrBegin = args[i];		void *HstPtrBegin = args[i];
int64_t data_size = arg_sizes[i];		int64_t data_size = arg_sizes[i];
// Adjust for proper alignment if this is a combined entry (for structs).		// Adjust for proper alignment if this is a combined entry (for structs).
// Look at the next argument - if that is MEMBER_OF this one, then this one		// Look at the next argument - if that is MEMBER_OF this one, then this one
// is a combined entry.		// is a combined entry.
int64_t padding = 0;		int64_t padding = 0;
const int next_i = i+1;		const int next_i = i+1;
if (member_of(arg_types[i]) < 0 && next_i < arg_num &&		if (member_of(arg_types[i]) < 0 && next_i < arg_num &&
▲ Show 20 Lines • Show All 108 Lines • ▼ Show 20 Lines	if ((arg_types[i] & OMP_TGT_MAPTYPE_FROM) \|\| DelEntry) {
}		}
}		}
}		}

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

/// Internal function to pass data to/from the target.		/// Internal function to pass data to/from the target.
		// async_info_ptr is currently unused, added here so target_data_update has the
		// same signature as target_data_begin and target_data_end.
int target_data_update(DeviceTy &Device, int32_t arg_num,		int target_data_update(DeviceTy &Device, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types) {		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
		void *arg_mappers, __tgt_async_info async_info_ptr) {
// process each input.		// process each input.
for (int32_t i = 0; i < arg_num; ++i) {		for (int32_t i = 0; i < arg_num; ++i) {
if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|		if ((arg_types[i] & OMP_TGT_MAPTYPE_LITERAL) \|\|
(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))		(arg_types[i] & OMP_TGT_MAPTYPE_PRIVATE))
continue;		continue;

		if (arg_mappers && arg_mappers[i]) {
		// Instead of executing the regular path of target_data_update, call the
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions And another copy and paste JonChesterfield: And another copy and paste
		lildmhUnsubmitted Done Reply Inline Actions Okay, will get the common part into a function lildmh: Okay, will get the common part into a function
		// target_data_mapper variant which will call target_data_update again
		// with new arguments.
		DP("Calling target_data_mapper for the %dth argument\n", i);

		int rc = target_data_mapper(Device, args_base[i], args[i], arg_sizes[i],
		arg_types[i], arg_mappers[i], target_data_update);

		if (rc != OFFLOAD_SUCCESS) {
		DP("Call to target_data_update via target_data_mapper for custom mapper"
		" failed.\n");
		return OFFLOAD_FAIL;
		}

		// Skip the rest of this function, continue to the next argument.
		continue;
		}

void *HstPtrBegin = args[i];		void *HstPtrBegin = args[i];
int64_t MapSize = arg_sizes[i];		int64_t MapSize = arg_sizes[i];
bool IsLast, IsHostPtr;		bool IsLast, IsHostPtr;
void *TgtPtrBegin = Device.getTgtPtrBegin(HstPtrBegin, MapSize, IsLast,		void *TgtPtrBegin = Device.getTgtPtrBegin(HstPtrBegin, MapSize, IsLast,
false, IsHostPtr);		false, IsHostPtr);
if (!TgtPtrBegin) {		if (!TgtPtrBegin) {
DP("hst data:" DPxMOD " not found, becomes a noop\n", DPxPTR(HstPtrBegin));		DP("hst data:" DPxMOD " not found, becomes a noop\n", DPxPTR(HstPtrBegin));
continue;		continue;
▲ Show 20 Lines • Show All 79 Lines • ▼ Show 20 Lines
/// performs the same actions as data_begin in case arg_num is		/// performs the same actions as data_begin in case arg_num is
/// non-zero and initiates run of the offloaded region on the target platform;		/// non-zero and initiates run of the offloaded region on the target platform;
/// if arg_num is non-zero after the region execution is done it also		/// if arg_num is non-zero after the region execution is done it also
/// performs the same action as data_update and data_end above. This function		/// performs the same action as data_update and data_end above. This function
/// returns 0 if it was able to transfer the execution to a target and an		/// returns 0 if it was able to transfer the execution to a target and an
/// integer different from zero otherwise.		/// integer different from zero otherwise.
int target(int64_t device_id, void *host_ptr, int32_t arg_num,		int target(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types,		void args_base, void args, int64_t arg_sizes, int64_t arg_types,
int32_t team_num, int32_t thread_limit, int IsTeamConstruct) {		void **arg_mappers, int32_t team_num, int32_t thread_limit,
		int IsTeamConstruct) {
DeviceTy &Device = Devices[device_id];		DeviceTy &Device = Devices[device_id];

// Find the table information in the map or look it up in the translation		// Find the table information in the map or look it up in the translation
// tables.		// tables.
TableMap *TM = 0;		TableMap *TM = 0;
TblMapMtx->lock();		TblMapMtx->lock();
HostPtrToTableMapTy::iterator TableMapIt = HostPtrToTableMap->find(host_ptr);		HostPtrToTableMapTy::iterator TableMapIt = HostPtrToTableMap->find(host_ptr);
if (TableMapIt == HostPtrToTableMap->end()) {		if (TableMapIt == HostPtrToTableMap->end()) {
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	int target(int64_t device_id, void *host_ptr, int32_t arg_num,
__tgt_target_table *TargetTable = TM->Table->TargetsTable[device_id];		__tgt_target_table *TargetTable = TM->Table->TargetsTable[device_id];
TrlTblMtx->unlock();		TrlTblMtx->unlock();
assert(TargetTable && "Global data has not been mapped\n");		assert(TargetTable && "Global data has not been mapped\n");

__tgt_async_info AsyncInfo;		__tgt_async_info AsyncInfo;

// Move data to device.		// Move data to device.
int rc = target_data_begin(Device, arg_num, args_base, args, arg_sizes,		int rc = target_data_begin(Device, arg_num, args_base, args, arg_sizes,
arg_types, &AsyncInfo);		arg_types, arg_mappers, &AsyncInfo);
if (rc != OFFLOAD_SUCCESS) {		if (rc != OFFLOAD_SUCCESS) {
DP("Call to target_data_begin failed, abort target.\n");		DP("Call to target_data_begin failed, abort target.\n");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

std::vector<void *> tgt_args;		std::vector<void *> tgt_args;
std::vector<ptrdiff_t> tgt_offsets;		std::vector<ptrdiff_t> tgt_offsets;

▲ Show 20 Lines • Show All 147 Lines • ▼ Show 20 Lines	for (auto it : fpArrays) {
if (rt != OFFLOAD_SUCCESS) {		if (rt != OFFLOAD_SUCCESS) {
DP("Deallocation of (first-)private arrays failed.\n");		DP("Deallocation of (first-)private arrays failed.\n");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}
}		}

// Move data from device.		// Move data from device.
int rt = target_data_end(Device, arg_num, args_base, args, arg_sizes,		int rt = target_data_end(Device, arg_num, args_base, args, arg_sizes,
arg_types, &AsyncInfo);		arg_types, arg_mappers, &AsyncInfo);
if (rt != OFFLOAD_SUCCESS) {		if (rt != OFFLOAD_SUCCESS) {
DP("Call to target_data_end failed, abort targe.\n");		DP("Call to target_data_end failed, abort targe.\n");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

if (Device.RTL->synchronize)		if (Device.RTL->synchronize)
return Device.RTL->synchronize(device_id, &AsyncInfo);		return Device.RTL->synchronize(device_id, &AsyncInfo);

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

openmp/libomptarget/src/private.h

Show All 13 Lines
#define _OMPTARGET_PRIVATE_H		#define _OMPTARGET_PRIVATE_H

#include <omptarget.h>		#include <omptarget.h>

#include <cstdint>		#include <cstdint>

extern int target_data_begin(DeviceTy &Device, int32_t arg_num,		extern int target_data_begin(DeviceTy &Device, int32_t arg_num,
void args_base, void args, int64_t *arg_sizes,		void args_base, void args, int64_t *arg_sizes,
int64_t *arg_types,		int64_t arg_types, void *arg_mappers,
__tgt_async_info *async_info_ptr);		__tgt_async_info *async_info_ptr);

extern int target_data_end(DeviceTy &Device, int32_t arg_num, void **args_base,		extern int target_data_end(DeviceTy &Device, int32_t arg_num, void **args_base,
void *args, int64_t arg_sizes, int64_t *arg_types,		void *args, int64_t arg_sizes, int64_t *arg_types,
		void **arg_mappers,
__tgt_async_info *async_info_ptr);		__tgt_async_info *async_info_ptr);

extern int target_data_update(DeviceTy &Device, int32_t arg_num,		extern int target_data_update(DeviceTy &Device, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types);		void args_base, void args,
		int64_t arg_sizes, int64_t arg_types,
		void **arg_mappers,
		__tgt_async_info *async_info_ptr = nullptr);

extern int target(int64_t device_id, void *host_ptr, int32_t arg_num,		extern int target(int64_t device_id, void *host_ptr, int32_t arg_num,
void args_base, void args, int64_t arg_sizes, int64_t arg_types,		void args_base, void args, int64_t *arg_sizes,
int32_t team_num, int32_t thread_limit, int IsTeamConstruct);		int64_t arg_types, void *arg_mappers, int32_t team_num,
		int32_t thread_limit, int IsTeamConstruct);

extern int CheckDeviceAndCtors(int64_t device_id);		extern int CheckDeviceAndCtors(int64_t device_id);

// enum for OMP_TARGET_OFFLOAD; keep in sync with kmp.h definition		// enum for OMP_TARGET_OFFLOAD; keep in sync with kmp.h definition
enum kmp_target_offload_kind {		enum kmp_target_offload_kind {
tgt_disabled = 0,		tgt_disabled = 0,
tgt_default = 1,		tgt_default = 1,
tgt_mandatory = 2		tgt_mandatory = 2
Show All 12 Lines	MapComponentInfoTy(void Base, void Begin, int64_t Size, int64_t Type)
: Base(Base), Begin(Begin), Size(Size), Type(Type) {}		: Base(Base), Begin(Begin), Size(Size), Type(Type) {}
};		};

// This structure stores all components of a user-defined mapper. The number of		// This structure stores all components of a user-defined mapper. The number of
// components are dynamically decided, so we utilize C++ STL vector		// components are dynamically decided, so we utilize C++ STL vector
// implementation here.		// implementation here.
struct MapperComponentsTy {		struct MapperComponentsTy {
std::vector<MapComponentInfoTy> Components;		std::vector<MapComponentInfoTy> Components;
		int32_t size() { return Components.size(); }
};		};

		// The mapper function pointer type. It follows the signature below:
		// void .omp_mapper.<type_name>.<mapper_id>.(void *rt_mapper_handle,
		// void base, void begin,
		// size_t size, int64_t type);
		typedef void (MapperFuncPtrTy)(void , void , void , int64_t, int64_t);

		// Function pointer type for target_data_* functions (target_data_begin,
		// target_data_end and target_data_update).
		typedef int (TargetDataFuncPtrTy)(DeviceTy &, int32_t, void , void *,
		int64_t , int64_t , void *, __tgt_async_info );

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// implementation for fatal messages		// implementation for fatal messages
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////

#define FATAL_MESSAGE0(_num, _str) \		#define FATAL_MESSAGE0(_num, _str) \
do { \		do { \
fprintf(stderr, "Libomptarget fatal error %d: %s\n", _num, _str); \		fprintf(stderr, "Libomptarget fatal error %d: %s\n", _num, _str); \
exit(1); \		exit(1); \
Show All 36 Lines

openmp/libomptarget/src/rtl.cpp

Show First 20 Lines • Show All 381 Lines • ▼ Show 20 Lines	for (auto *R : UsedRTLs) {

// Execute dtors for static objects if the device has been used, i.e.		// Execute dtors for static objects if the device has been used, i.e.
// if its PendingCtors list has been emptied.		// if its PendingCtors list has been emptied.
for (int32_t i = 0; i < FoundRTL->NumberOfDevices; ++i) {		for (int32_t i = 0; i < FoundRTL->NumberOfDevices; ++i) {
DeviceTy &Device = Devices[FoundRTL->Idx + i];		DeviceTy &Device = Devices[FoundRTL->Idx + i];
Device.PendingGlobalsMtx.lock();		Device.PendingGlobalsMtx.lock();
if (Device.PendingCtorsDtors[desc].PendingCtors.empty()) {		if (Device.PendingCtorsDtors[desc].PendingCtors.empty()) {
for (auto &dtor : Device.PendingCtorsDtors[desc].PendingDtors) {		for (auto &dtor : Device.PendingCtorsDtors[desc].PendingDtors) {
int rc = target(Device.DeviceID, dtor, 0, NULL, NULL, NULL, NULL, 1,		int rc = target(Device.DeviceID, dtor, 0, NULL, NULL, NULL, NULL,
1, true /team/);		NULL, 1, 1, true /team/);
if (rc != OFFLOAD_SUCCESS) {		if (rc != OFFLOAD_SUCCESS) {
DP("Running destructor " DPxMOD " failed.\n", DPxPTR(dtor));		DP("Running destructor " DPxMOD " failed.\n", DPxPTR(dtor));
}		}
}		}
// Remove this library's entry from PendingCtorsDtors		// Remove this library's entry from PendingCtorsDtors
Device.PendingCtorsDtors.erase(desc);		Device.PendingCtorsDtors.erase(desc);
}		}
Device.PendingGlobalsMtx.unlock();		Device.PendingGlobalsMtx.unlock();
▲ Show 20 Lines • Show All 42 Lines • Show Last 20 Lines

openmp/libomptarget/test/mapping/declare_mapper_api.cpp

// RUN: %libomptarget-compilexx-run-and-check-aarch64-unknown-linux-gnu		// RUN: %libomptarget-compilexx-run-and-check-aarch64-unknown-linux-gnu
// RUN: %libomptarget-compilexx-run-and-check-powerpc64-ibm-linux-gnu		// RUN: %libomptarget-compilexx-run-and-check-powerpc64-ibm-linux-gnu
// RUN: %libomptarget-compilexx-run-and-check-powerpc64le-ibm-linux-gnu		// RUN: %libomptarget-compilexx-run-and-check-powerpc64le-ibm-linux-gnu
// RUN: %libomptarget-compilexx-run-and-check-x86_64-pc-linux-gnu		// RUN: %libomptarget-compilexx-run-and-check-x86_64-pc-linux-gnu

#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <vector>		#include <vector>
		#include <cinttypes>

// Data structure definitions copied from OpenMP RTL.		// Data structure definitions copied from OpenMP RTL.
struct MapComponentInfoTy {		struct MapComponentInfoTy {
void *Base;		void *Base;
void *Begin;		void *Begin;
int64_t Size;		int64_t Size;
int64_t Type;		int64_t Type;
MapComponentInfoTy() = default;		MapComponentInfoTy() = default;
Show All 20 Lines	int main(int argc, char *argv[]) {
MapperComponentsTy MC;		MapperComponentsTy MC;
void base, begin;		void base, begin;
int64_t size, type;		int64_t size, type;
// Push 2 elements into MC.		// Push 2 elements into MC.
__tgt_push_mapper_component((void *)&MC, base, begin, size, type);		__tgt_push_mapper_component((void *)&MC, base, begin, size, type);
__tgt_push_mapper_component((void *)&MC, base, begin, size, type);		__tgt_push_mapper_component((void *)&MC, base, begin, size, type);
int64_t num = __tgt_mapper_num_components((void *)&MC);		int64_t num = __tgt_mapper_num_components((void *)&MC);
// CHECK: num=2		// CHECK: num=2
printf("num=%lld\n", num);		printf("num=%" PRId64 "\n", num);
return 0;		return 0;
}		}

openmp/libomptarget/test/mapping/declare_mapper_target.cpp

This file was added.

				// RUN: %libomptarget-compile-run-and-check-aarch64-unknown-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
				// RUN: %libomptarget-compile-run-and-check-powerpc64-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-powerpc64le-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-x86_64-pc-linux-gnu

				#include <cstdio>
				#include <cstdlib>

				#define NUM 1024

				class C {
				public:
				int *a;
				};

				#pragma omp declare mapper(id: C s) map(s.a[0:NUM])

				int main() {
				C c;
				c.a = (int) malloc(sizeof(int)NUM);
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 1;
				}
				#pragma omp target teams distribute parallel for map(mapper(id),tofrom: c)
				for (int i = 0; i < NUM; i++) {
				++c.a[i];
				}
				int sum = 0;
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum = 2048
				printf("Sum = %d\n", sum);
				return 0;
				}

openmp/libomptarget/test/mapping/declare_mapper_target_data.cpp

This file was added.

				// RUN: %libomptarget-compile-run-and-check-aarch64-unknown-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
				// RUN: %libomptarget-compile-run-and-check-powerpc64-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-powerpc64le-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-x86_64-pc-linux-gnu

				#include <cstdio>
				#include <cstdlib>

				#define NUM 1024

				class C {
				public:
				int *a;
				};

				#pragma omp declare mapper(id: C s) map(s.a[0:NUM])

				int main() {
				C c;
				c.a = (int) malloc(sizeof(int)NUM);
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 1;
				}
				#pragma omp target data map(mapper(id),tofrom: c)
				{
				#pragma omp target teams distribute parallel for
				for (int i = 0; i < NUM; i++) {
				++c.a[i];
				}
				}
				int sum = 0;
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum = 2048
				printf("Sum = %d\n", sum);
				return 0;
				}

openmp/libomptarget/test/mapping/declare_mapper_target_data_enter_exit.cpp

This file was added.

				// RUN: %libomptarget-compile-run-and-check-aarch64-unknown-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
				// RUN: %libomptarget-compile-run-and-check-powerpc64-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-powerpc64le-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-x86_64-pc-linux-gnu

				#include <cstdio>
				#include <cstdlib>

				#define NUM 1024

				class C {
				public:
				int *a;
				};

				#pragma omp declare mapper(id: C s) map(s.a[0:NUM])

				int main() {
				C c;
				c.a = (int) malloc(sizeof(int)NUM);
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 1;
				}
				#pragma omp target enter data map(mapper(id),to: c)
				#pragma omp target teams distribute parallel for
				for (int i = 0; i < NUM; i++) {
				++c.a[i];
				}
				#pragma omp target exit data map(mapper(id),from: c)
				int sum = 0;
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum = 2048
				printf("Sum = %d\n", sum);
				return 0;
				}

openmp/libomptarget/test/mapping/declare_mapper_target_update.cpp

This file was added.

				// RUN: %libomptarget-compile-run-and-check-aarch64-unknown-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
				// RUN: %libomptarget-compile-run-and-check-powerpc64-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-powerpc64le-ibm-linux-gnu
				// RUN: %libomptarget-compile-run-and-check-x86_64-pc-linux-gnu

				#include <cstdio>
				#include <cstdlib>

				#define NUM 1024

				class C {
				public:
				int *a;
				};

				#pragma omp declare mapper(id: C s) map(s.a[0:NUM])

				int main() {
				C c;
				int sum = 0;
				c.a = (int) malloc(sizeof(int)NUM);
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 1;
				}
				#pragma omp target enter data map(mapper(id),alloc: c)
				#pragma omp target teams distribute parallel for
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 0;
				}
				#pragma omp target update from(mapper(id): c)
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum (after first update from) = 0
				printf("Sum (after first update from) = %d\n", sum);
				for (int i = 0; i < NUM; i++) {
				c.a[i] = 1;
				}
				#pragma omp target update to(mapper(id): c)
				#pragma omp target teams distribute parallel for
				for (int i = 0; i < NUM; i++) {
				++c.a[i];
				}
				sum = 0;
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum (after update to) = 1024
				printf("Sum (after update to) = %d\n", sum);
				#pragma omp target update from(mapper(id): c)
				sum = 0;
				for (int i = 0; i < NUM; i++) {
				sum += c.a[i];
				}
				// CHECK: Sum (after second update from) = 2048
				printf("Sum (after second update from) = %d\n", sum);
				#pragma omp target exit data map(mapper(id),delete: c)
				return 0;
				}