This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP 5.0] declare mapper runtime implementation
ClosedPublic

Authored by grokos on Sep 26 2019, 12:32 PM.

Details

Summary

This patch implements the runtime functionality to support the OpenMP 5.0 declare mapper. It introduces a set of new interfaces so user-defined mapper functions can be passed to the runtime. The runtime will call mapper functions to fill up an internal data structure. Later it will map every component in the internal data structure.
The design slides can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
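As a rough sketch of the interface shape (hedged; everything below is inferred from this review and from the debug traces quoted later in the thread, not copied from the sources, and omp_mapper_C is a hypothetical name):

#include <cstdint>

// Opaque handle to the runtime's internal list of map components.
typedef void *MapperHandleTy;

// Entry points the compiler-generated mapper functions call to fill the
// internal component list; both names show up in the libomptarget debug
// output further down in this thread.
extern "C" void __tgt_push_mapper_component(MapperHandleTy Handle, void *Base,
                                            void *Begin, int64_t Size,
                                            int64_t Type);
extern "C" int64_t __tgt_mapper_num_components(MapperHandleTy Handle);

// Illustrative shape of a compiler-generated mapper for a type with a mapped
// pointer member:
extern "C" void omp_mapper_C(MapperHandleTy Handle, void *Base, void *Begin,
                             int64_t Size, int64_t Type) {
  // One entry for the whole object, then one entry per mapped member, with
  // the MEMBER_OF/PTR_AND_OBJ bits set in Type as appropriate.
  __tgt_push_mapper_component(Handle, Base, Begin, Size, Type);
  // ... member entries ...
}

The __tgt_target* entry points then gain variants that additionally take an array of such mapper function pointers, one per map argument; that duplication is what most of the discussion below is about.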

Diff Detail

Event Timeline

lildmh created this revision.Sep 26 2019, 12:32 PM
lildmh updated this revision to Diff 222173.Sep 27 2019, 7:43 AM

Please review when you have time, thanks

Maybe it is better to add a single function that passes the list of mappers to the runtime and keep the old functions rather than duplicating them? Something like __tgt_mappers(...) which stores them in the runtime. Then modify the original functions to use the mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite a bit smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime APIs. My arguments are:

  1. This patch will deprecate the old interfaces, as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime APIs.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionality into it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_target, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_target, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of the OpenMP runtime better: have a function for each case.

Why would we need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what to do with these mappers if they were stored.

Okay, I think I understand your idea now. In that case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers stored for previous calls. I don't particularly like this idea, since it introduces implicit dependencies between different runtime calls and a program will make twice as many runtime calls. But if most people like it, I'm okay with it.

I think another problem is this may not work with legacy code, since it doesn't have calls to __tgt_mapper. This becomes a bigger problem when legacy code and code compiled with a new Clang are mixed together: a legacy call to __tgt_target may pick up a mapper which was intended for some new code.

No, there should not be a problem with the legacy code. If the array of mappers is empty, use the default mapping through bit-copying.

Sorry, you are right. I didn't think about always cleaning up the mappers after we finish using them.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example? I don't think we can keep per-task/thread mapper storage in the current implementation.

__tgt_mapper must be called immediately before __tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has been lost, and the execution of __tgt_target will not get the mapper. I don't think there is a per-task context in libomptarget.

libomptarget already uses some functions from libomp; you can use them to check for the task context.

Lingda is right; we faced the same issue in the loop trip count implementation. The loop trip count should be set per task, but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that would be the more elegant solution.

I don't see a big problem here. You can store the mapper data per task using the thread id from libomp, as we do for the trip count. We can use the same solution for mappers.

Alexey and George: This is a big decision to make. We need most people's consent. I'll send it to the mailing list later.

Btw, I never understood why we have a separate function to push the loop trip count. Why is that?

The same problem: to not bloat the interface of the runtime library.

ABataev added inline comments.Nov 11 2019, 7:08 AM
libomptarget/src/exports
16–25

On the last telecon, we decided to support this solution so adding new functions is accepted.

libomptarget/src/interface.cpp
103

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

171

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

240

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

293

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

358

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

libomptarget/src/omptarget.cpp
379–382

Why do we have this limitation?

I think the direction is good here. Some duplication is inevitable in the interface. Between the non-functional changes and the code duplication in the implementation, though, it's difficult to work out exactly what the code is doing.

libomptarget/src/omptarget.cpp
373

typedef/using instead of copying the type list for the cast?

435

It's probably worth doing the arg_types[i] => Type refactor first. Combined with the whitespace change it's quite a lot of this diff.

libomptarget/src/omptarget.cpp
458

Likewise data_size => Size. Separating the NFC from the FC makes it easier to parse the latter.

539

I think this is the same type-list copy & paste that prompted the typedef suggestion above

542

The rest of this looks quite familiar too. Perhaps factor the copy & paste into helper functions that are called by both locations?

699

And again

lildmh updated this revision to Diff 229199.Nov 13 2019, 4:11 PM
lildmh marked 6 inline comments as done.

Thanks Alexey and Jon for your review. Fixed the issues and rebased

libomptarget/src/omptarget.cpp
373

Sounds good, thanks

379–382

It's because that's the width of the parent bits. If we have more components, the parent bits will overflow.

435

I extracted the mapping of each component into a separate function for code reuse purposes, like target_data_end_component here. It uses Type as the input argument, so there is no longer an arg_types[i]. It's the same for Size.

So I don't think doing the arg_types[i] => Type change first will make things clearer. What do you think?

542

The duplication is not that much, though. Do you think it is worth having a helper function?

lildmh marked 6 inline comments as done.Nov 13 2019, 4:12 PM
ABataev added inline comments.Nov 14 2019, 8:23 AM
libomptarget/src/omptarget.cpp
542

+1 for refactoring.

lildmh marked an inline comment as done.Nov 15 2019, 8:46 AM
lildmh added inline comments.
libomptarget/src/omptarget.cpp
542

Hi Alexey and Jon,

I didn't find an elegant way to merge the code below. It's mainly because the two paths access other components differently:
E.g., for a mapper, Components.get(parent_idx) is used to get the parent; for plain arguments, args[parent_idx] is used instead. One is an array of structs, the other a struct of arrays.

ABataev added inline comments.Nov 25 2019, 10:28 AM
libomptarget/src/omptarget.cpp
542

Still, I do not understand what the problem with the refactoring is. You can use lambdas, if you need some differences in data access, or something similar. Anyway, it would be better than just copy-paste.

544

Usually, we use something like the (for i = 0, e = end(); i < e; ++i) pattern.

My thoughts are the same as before. This change mixes a refactor with a functional change plus duplicates a bunch of code. The overall change might work but I can't tell from the diff.

libomptarget/src/omptarget.cpp
542

Some options:

  • wrap object in a class that adapts the interface
  • pass in a function that does the access
  • refactor one data type to the same layout as the other
  • extract small functions which are called by both
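For illustration, a minimal sketch of the second option (pass in a function that does the access); all names are hypothetical, not the actual libomptarget code:

#include <cstdint>

// Uniform per-component view; field names follow the ones used in this review.
struct MapComponentView {
  void *Base;
  void *Begin;
  int64_t Size;
  int64_t Type;
};

// Shared loop: the caller supplies an accessor that hides whether the data is
// an array of structs (mapper components) or a struct of arrays (plain args).
template <typename GetComponentFn>
int processComponents(int32_t Count, GetComponentFn GetComponent) {
  for (int32_t I = 0; I < Count; ++I) {
    MapComponentView C = GetComponent(I);
    // ... common begin/end mapping logic using C.Base, C.Begin, C.Size, C.Type ...
    (void)C;
  }
  return 0; // OFFLOAD_SUCCESS in the real code
}

// Mapper path (array of structs):
//   processComponents(N, [&](int32_t I) { auto &MC = Components.get(I);
//     return MapComponentView{MC.Base, MC.Begin, MC.Size, MC.Type}; });
// Plain-argument path (struct of arrays):
//   processComponents(N, [&](int32_t I) {
//     return MapComponentView{args_base[I], args[I], arg_sizes[I], arg_types[I]}; });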
lildmh updated this revision to Diff 230971.Nov 25 2019, 2:40 PM

Thanks for your reviews. Hope this looks better.

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

Thanks

Splitting large patches into NFC and functional change doesn't seem contentious but is not required.

The advantage is seen when the build breaks. It's less annoying for the author to have the functional change part temporarily reverted than to lose the whole lot, especially when the functional change is the smaller diff as I think it would be here.

lildmh updated this revision to Diff 235142.Dec 23 2019, 6:48 AM

Rebase and rediff on top of the NFC version

JonChesterfield requested changes to this revision.Dec 23 2019, 7:01 AM

The premise seems OK, but three copies of a large block of control flow is not so good. Why the duplication?

openmp/libomptarget/src/omptarget.cpp
349 ↗(On Diff #235142)

What makes the mapper valid? I don't see any checking in the source. Perhaps just strike the word valid from the comment

359 ↗(On Diff #235142)

What limitation? Why 0xffff?

366 ↗(On Diff #235142)

Size probably returns size_t, why is the induction variable signed?

536 ↗(On Diff #235142)

This appears to be a copy and paste of the above

667 ↗(On Diff #235142)

And another copy and paste

This revision now requires changes to proceed.Dec 23 2019, 7:01 AM
lildmh updated this revision to Diff 235145.Dec 23 2019, 7:38 AM
lildmh marked 4 inline comments as done.

Address Jon's comments

openmp/libomptarget/src/omptarget.cpp
349 ↗(On Diff #235142)

If there is a pointer generated by the compiler, I consider it valid. I will remove the word valid from the comment.

359 ↗(On Diff #235142)

Because the parent index in the map type has 16 bits, we cannot handle more components than that.
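For reference, a sketch of the encoding being described; the constants are illustrative and only reflect the convention visible in the debug traces elsewhere in this thread (e.g. Type=0x2000000000013 decodes to MEMBER_OF(2) plus PTR_AND_OBJ | TO | FROM in the low bits):

#include <cstdint>

// The parent (MEMBER_OF) index lives in the top 16 bits of the 64-bit map
// type, hence the 0xffff cap on the number of components a mapper can emit.
constexpr uint64_t MemberOfMask = 0xffff000000000000ULL;
constexpr unsigned MemberOfShift = 48;

inline uint16_t getMemberOfIndex(uint64_t MapType) {
  return static_cast<uint16_t>((MapType & MemberOfMask) >> MemberOfShift);
}

inline uint64_t withMemberOfIndex(uint64_t MapType, uint16_t ParentIdx) {
  return (MapType & ~MemberOfMask) |
         (static_cast<uint64_t>(ParentIdx) << MemberOfShift);
}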

366 ↗(On Diff #235142)

I think the indices are always of type int32_t in libomptarget, so I followed that convention. Otherwise there would be a signed/unsigned comparison warning.

667 ↗(On Diff #235142)

Okay, will get the common part into a function

gentle ping :)

There's a lot of copy and paste remaining, and no test cases. Do we want this anyway? At some point it can be better to patch and keep moving than to iterate in phabricator.

Test cases will be uploaded in another patch once the Clang patch is upstreamed. That Clang patch depends on this one (https://reviews.llvm.org/D67833). So I think the order is: this patch, the Clang patch, the test patch.

JonChesterfield resigned from this revision.Apr 20 2020, 5:09 AM
grokos commandeered this revision.Jun 3 2020, 12:59 PM
grokos edited reviewers, added: lildmh; removed: grokos.
grokos updated this revision to Diff 275793.Jul 6 2020, 12:24 PM

I tried to address our previous complaints about code duplication and came up with a scheme which results in a much shorter and cleaner diff with virtually no code duplication. Instead of refactoring code from target_data_begin/end/update, I introduced a new internal function target_data_mapper which generates new arrays args_base, args, arg_sizes and arg_types for the custom mapper and calls target_data_begin/end/update again using the new arguments.
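In rough pseudocode, the scheme looks like this (a sketch of the description above with placeholder names; the committed code differs in detail):

#include <cstdint>
#include <vector>

struct MapComponentInfoTy { void *Base; void *Begin; int64_t Size; int64_t Type; };
struct MapperComponentsTy { std::vector<MapComponentInfoTy> Components; };

// The compiler-generated mapper fills the component list through the handle.
using MapperFuncTy = void (*)(void *Handle, void *Base, void *Begin,
                              int64_t Size, int64_t Type);
// Stand-in for target_data_begin / target_data_end / target_data_update.
using TargetDataFuncTy = int (*)(int32_t ArgNum, void **ArgsBase, void **Args,
                                 int64_t *ArgSizes, int64_t *ArgTypes);

int target_data_mapper(void *ArgBase, void *Arg, int64_t ArgSize, int64_t ArgType,
                       void *Mapper, TargetDataFuncTy TargetDataFunction) {
  // 1. Call the mapper function to fill the internal component list.
  MapperComponentsTy MC;
  reinterpret_cast<MapperFuncTy>(Mapper)(&MC, ArgBase, Arg, ArgSize, ArgType);

  // 2. Flatten the components into fresh parallel arrays.
  int32_t N = static_cast<int32_t>(MC.Components.size());
  std::vector<void *> NewArgsBase(N), NewArgs(N);
  std::vector<int64_t> NewSizes(N), NewTypes(N);
  for (int32_t I = 0; I < N; ++I) {
    NewArgsBase[I] = MC.Components[I].Base;
    NewArgs[I] = MC.Components[I].Begin;
    NewSizes[I] = MC.Components[I].Size;
    NewTypes[I] = MC.Components[I].Type;
  }

  // 3. Re-enter the ordinary data-mapping path with the new arguments.
  return TargetDataFunction(N, NewArgsBase.data(), NewArgs.data(),
                            NewSizes.data(), NewTypes.data());
}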

Ping. If the patch lands today or tomorrow, then we will meet the clang-11 deadline and include support for declare mapper.

This revision is now accepted and ready to land.Jul 14 2020, 12:40 PM

The codegen patch needs to land before this one to pass these test programs

Also, not sure if https://reviews.llvm.org/D71782 is still needed. Please check

No, it's not needed anymore. This patch bypasses the need to do that refactoring. Can you please abandon that revision?

Ok, please ignore it then. Thanks for working on this! The only thing left is to have https://reviews.llvm.org/D67833 accepted @ABataev

Nice, thanks. All my concerns were addressed by the above revision.

This revision was automatically updated to reflect the committed changes.

declare_mapper_target.cpp still fails for me consistently in RUN line 2, i.e. for the nvptx version. I execute on x86_64 with a Tesla V100 and CUDA 10.0.
When I execute the test with export LIBOMPTARGET_DEBUG=1, the test succeeds.
In case of failure, the test prints Sum = 1024; in case of success, it prints Sum = 2048 as expected.

I ran the tests with a fresh build (ae31d7838c36).

Interesting. How about the other declare_mapper_*.cpp tests? Do they fail or pass?

The other tests failed after the commit. They started to succeed with various later commits.
At the above-mentioned commit, only this single test fails.

Very interesting. Any guess what the problem is? I'll look into it. @grokos, your test passed before, right?

I can reproduce this. When running the test itself, Sum=1024.
When running the test with nvprof, Sum=2048. Combined with your report that Sum=2048 when LIBOMPTARGET_DEBUG=1, I suspect GPU offloading is disabled in the above case. Any idea what happened recently in libomptarget that could potentially cause this problem? I haven't followed the recent development, so I have no idea.

For the commit of this patch, the test fails with and without env LIBOMPTARGET_DEBUG=1. I'm using a release build, but have -DLIBOMPTARGET_ENABLE_DEBUG=on. This allows activating the debug output by setting the env variable.

I'm currently bisecting for the commit where the test started to succeed with env LIBOMPTARGET_DEBUG=1. I'm hoping it's sufficient to bisect commits in /openmp/.

Thanks. Another weird thing is that it passes with nvprof. Not sure why using nvprof makes a difference here.

I tried to run declare_mapper_target.cpp on an NVIDIA GPU. The problem occurs while loading the device image:

Target CUDA RTL --> Error returned from cuModuleLoadDataEx
Target CUDA RTL --> CUDA error is: device kernel image is invalid

Sounds like a clang problem; I don't see why libomptarget would be the culprit here.

Ok, the bisecting did not really reveal anything new. The test fails with and without the env var for 140ab574; starting with 537b16e9 the test succeeds with env LIBOMPTARGET_DEBUG=1.

@grokos Why do you think it's not the runtime, if the same executable behaves differently based only on this env variable?

To me this looks like one of the debugging statements has a side effect, which disappears when I execute without the debug variable.

$ env LIBOMPTARGET_DEBUG=0 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Sum = 1024
$ env LIBOMPTARGET_DEBUG=1 projects/openmp/libomptarget/test/mapping/Output/declare_mapper_target.cpp.tmp-nvptx64-nvidia-cuda 
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000401350 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x000000000063d430 has index 0!
Libomptarget --> Registering image 0x0000000000401350 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000401350
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_13_a6fc814_main_l25 (0x0000000000ee5510)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> __kmpc_push_target_tripcount(0, 1024)
Libomptarget --> Entering target region with entry point 0x0000000000401301 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc0985c510, HstBegin=0x00007ffc0985c510, HstEnd=0x00007ffc0985c518, TgtBegin=0x00002ad71e600000
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Has a pointer entry: 
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x00000000006a0600, HstBegin=0x00000000006a0600, HstEnd=0x00000000006a1600, TgtBegin=0x00002ad71e600200
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x00000000006a0600) -> (tgt:0x00002ad71e600200)
Libomptarget --> Update pointer (0x00002ad71e600000) -> [0x00002ad71e600200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, RefCount=2
Libomptarget --> Obtained target argument 0x00002ad71e600000 from host pointer 0x00007ffc0985c510
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_13_a6fc814_main_l25 with pointer 0x0000000000eb1d10 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using 8 teams due to loop trip count 1024 and number of threads per block 128
Target CUDA RTL --> Launch kernel with 8 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000000eb1d10 successful!
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e90
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc0985c190) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00007ffc0985c510, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc0985c190) adds an entry (Base=0x00007ffc0985c510, Begin=0x00000000006a0600, Size=4096, Type=0x2000000000013).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096, updated RefCount=1
Libomptarget --> There are 4096 bytes allocated at target address 0x00002ad71e600200 - is last
Libomptarget --> Moving 4096 bytes (tgt:0x00002ad71e600200) -> (hst:0x00000000006a0600)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000006a0600, Size=4096)...
Libomptarget --> Deleting tgt data 0x00002ad71e600200 of size 4096
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000006a0600, TgtPtrBegin=0x00002ad71e600200, Size=4096
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8, updated RefCount=1
Libomptarget --> There are 8 bytes allocated at target address 0x00002ad71e600000 - is last
Libomptarget --> Removing shadow pointer 0x00007ffc0985c510
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc0985c510, Size=8)...
Libomptarget --> Deleting tgt data 0x00002ad71e600000 of size 8
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffc0985c510, TgtPtrBegin=0x00002ad71e600000, Size=8
Sum = 2048
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000401350 is compatible with RTL 0x000000000063d430!
Libomptarget --> Unregistered image 0x0000000000401350 from RTL 0x000000000063d430!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000423900
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
$ 

My system was messed up and used libraries and compilers from different builds. Please ignore my previous message.

I was able to reproduce what @protze.joachim described, i.e. different runtime behavior when LIBOMPTARGET_DEBUG=1 is used. I'm looking at the issue.

OK, I suspect there is a race condition involving the CUDA plugin. If I compile the test on x86_64-pc-linux-gnu then I always get the correct result, no matter whether we print debug output or not.

On CUDA, I tried to increase the test size from 1024 to 16M. With debug output off, I always get 16M as a result (instead of 32M) - this tells me that the CUDA kernel is launched and the host code proceeds to verify the result before the kernel returns. Because the problem size is large, verification on the host always finishes before the kernel returns and data is copied back from the device.

With debug output on, I get inconsistent results from execution to execution, ranging from 16M to 32M, meaning that the host is busy printing output messages, so verification starts later, while data is being copied back.

@lildmh I've got a question unrelated to the problem we are discussing here. I ran declare_mapper_target.cpp and when libomptarget calls the mapper function it prints the following:

Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x00007ffcebd101e0, Size=8, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffcebd0fb48) adds an entry (Base=0x00007ffcebd101e0, Begin=0x000000000231bfe0, Size=4096, Type=0x2000000000013)

Why is the second entry's MEMBER_OF field set to 2? It should be MEMBER_OF(1), since the pointer-pointee pair c.a[0:N] is part of struct c, which is the first entry on the list.

Good point. It may be a bug. Let me check later.

@lildmh: I think I've found a bug. I used declare_mapper_target.cpp. When we call the mapper function, it generates 3 components. The first 2 are identical and correspond to the parent struct. This is what MapperComponents looks like inside function target_data_mapper:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 3, capacity 4 = {{Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598, Begin = 0x7fffffffb598, Size = 8, Type = 32}, {Base = 0x7fffffffb598,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}}}

Mapping the parent struct twice is problematic. If we have more struct members and some of them are NOT pointers, then upon target_data_end libomptarget will check the parent struct's reference counter to determine whether the scalar member must be copied back to the host. If the reference counter is greater than 1, then the runtime will skip copying back the scalar. Mapping the parent struct two times in a row results in RefCount=2.

So in the example below (modified declare_mapper_target.cpp) the scalar is processed by libomptarget but because at that time the struct's RefCount is 2 we never copy the scalar back:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

#define NUM 1024

class C {
public:
  int *a;
  int onHost;
};

#pragma omp declare mapper(id: C s) map(s.a[0:NUM], s.onHost)

int main() {
  C c;
  c.a = (int*) malloc(sizeof(int)*NUM);
  c.onHost = -1;
  for (int i = 0; i < NUM; i++) {
    c.a[i] = 1;
  }
  #pragma omp target teams distribute parallel for map(mapper(id),tofrom: c)
  for (int i = 0; i < NUM; i++) {
    ++c.a[i];
    if (i == 0) {
      c.onHost = omp_is_initial_device();
    }
  }

  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += c.a[i];
  }
  printf("Executed on %s\n", c.onHost==1 ? "host" : c.onHost==0 ? "device" : "unknown");
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Upon target_data_end the mapper function will generate this:

(gdb) print MapperComponents
$1 = {Components = std::vector of length 4, capacity 4 = {{Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 16, Type = 32}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb588, Size = 12, Type = 32}, {Base = 0x7fffffffb588,
      Begin = 0x62efd0, Size = 4096, Type = 562949953421331}, {Base = 0x7fffffffb588, Begin = 0x7fffffffb590, Size = 4, Type = 562949953421315}}}

When libomptarget processes the scalar, the parent struct's RefCount is 2, so inside the if-block in omptarget.cpp:507-524 CopyMember will never be set to true and the scalar will never be copied back to the host.

Can you revert the patches for declare mapper until it is fixed?

Thanks George for looking into this, and sorry for the late response.

I believe this is not a bug; it's a design choice we made early on. The choice is that we map the whole structure as one piece at the beginning, so we don't map its individual parts separately, which could cause a lot of memcpys.

For the RefCount, when the runtime checks the 2nd component in your example, it will find it's already mapped and will not increase the RefCount, so I think it's not a bug and the behavior is correct.

No, this is not related to our design choices. Here we are mapping the whole struct twice for no reason. The entries should be:

1) combined entry (i.e. the entry that maps the whole struct)
    base = &c, begin = &c.a, size = sizeof(class C), type = TARGET_PARAM
2) member entry for c.a[0:NUM]
    base = &c.a, begin = &c.a[0], size = NUM*sizeof(int), type = MEMBER_OF(1) | PTR_AND_OBJ | TO | FROM
3) member entry for c.onHost
    base = &c, begin = &c.onHost, size = sizeof(int), type = MEMBER_OF(1) | TO | FROM

But what happens now is that the combined entry is emitted twice, so MapperComponents looks like this:

<combined entry>, <combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

instead of

<combined entry>, <entry for c.a[0:NUM]>, <entry for c.onHost>

And what's more, the first combined entry has size=16 whereas the second combined entry has size=12. Where does this 16 come from? The size of the struct is 12 bytes (a pointer + an int). This also explains why the MEMBER_OF field is set to 2, because the second element in the list of arguments is also the combined entry.

There is no rationale behind emitting the combined entry twice, on the contrary it leads to errors because the RefCount is indeed incremented when it shouldn't.

This is libomptarget's debug output from the provided example upon entering the target region:

Libomptarget --> Entering target region with entry point 0x0000000000401409 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry  0: Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x23
Libomptarget --> Calling target_data_mapper for the 0th argument
Libomptarget --> Calling the mapper function 0x0000000000400e50
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=16, Type=0x20).
Libomptarget --> __tgt_mapper_num_components(Handle=0x00007ffc24203368) returns 1
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a68, Size=12, Type=0x20).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x0000000000d89ff0, Size=4096, Type=0x2000000000013).
Libomptarget --> __tgt_push_mapper_component(Handle=0x00007ffc24203368) adds an entry (Base=0x00007ffc24203a68, Begin=0x00007ffc24203a70, Size=4, Type=0x2000000000003).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Creating new map entry: HstBase=0x00007ffc24203a68, HstBegin=0x00007ffc24203a68, HstEnd=0x00007ffc24203a78, TgtBegin=0x00007fa582400000
Libomptarget --> There are 16 bytes allocated at target address 0x00007fa582400000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=12)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=12, updated RefCount=2
Libomptarget --> There are 12 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=8, RefCount=2
Libomptarget --> There are 8 bytes allocated at target address 0x00007fa582400000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000d89ff0, Size=4096)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000d89ff0, HstBegin=0x0000000000d89ff0, HstEnd=0x0000000000d8aff0, TgtBegin=0x00007fa582400200
Libomptarget --> There are 4096 bytes allocated at target address 0x00007fa582400200 - is new
Libomptarget --> Moving 4096 bytes (hst:0x0000000000d89ff0) -> (tgt:0x00007fa582400200)
Libomptarget --> Update pointer (0x00007fa582400000) -> [0x00007fa582400200]
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a70, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a70, TgtPtrBegin=0x00007fa582400008, Size=4, RefCount=2
Libomptarget --> There are 4 bytes allocated at target address 0x00007fa582400008 - is not new
Libomptarget --> DeviceTy::getMapEntry: requested entry found
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc24203a68, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffc24203a68, TgtPtrBegin=0x00007fa582400000, Size=16, RefCount=2
Libomptarget --> Obtained target argument 0x00007fa582400000 from host pointer 0x00007ffc24203a68
Libomptarget --> loop trip count is 1024.
Libomptarget --> Launching target execution __omp_offloading_801_1ee0443_main_l28 with pointer 0x0000000001427dc0 (index=0).

When we process the 16-byte combined entry we allocate space for the struct and RefCount=1, then we process the 12-byte combined entry and RefCount is incremented to 2.

The first combined entry comes from mapping the whole structure. I think because of alignment, the structure is actually 16 bytes. The 2nd combined entry is the real entry emitted to map the structure. The reason it looks like there are 2 of them is that at the beginning of a mapper function we map the whole structure no matter what, which generates the 1st combined entry you saw here. Then we generate the detailed mapping entries, which produce the 2nd combined entry you saw here. They are not necessarily the same; they happen to be similar in this example.

It does indeed increase RefCount; I checked the code and you are right. But I think it should not cause any problem, because RefCount will be reduced back to 0 at exit (since the combined entry is mapped twice, it should also be 'deleted' twice when the target region exits).

I assure you that's not how structs are mapped.

You don't map "the whole struct", you only map what is needed. For this you emit a combined entry which has a size large enough to encompass all members we are interested in. Then entries for individual members follow. One combined entry + as many member entries as needed. The first entry which "maps the whole struct" should not be there and is plain wrong.

The problem is that when individual members are processed in target_data_end, RefCount = 2, so these members will not be copied back to the host. RefCount must be 1 for data motion to take place, and in this case it's not.
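In pseudocode, the condition being described is roughly the following (an illustration of the behavior, not the actual omptarget.cpp source; MapTableEntry and shouldCopyBack are made-up names):

#include <cstdint>

struct MapTableEntry {
  uint64_t RefCount; // outstanding mappings covering this host range
};

// target_data_end only moves a member back to the host when this is the last
// reference; with the combined struct entry (wrongly) mapped twice, RefCount
// is still 2 at this point and the copy-back is silently skipped.
bool shouldCopyBack(const MapTableEntry &Entry, int64_t MapType) {
  const bool HasFrom = MapType & 0x2; // the FROM bit of the map type
  return HasFrom && Entry.RefCount == 1;
}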

Anyway, I modified libomptarget locally to ignore the 16-byte combined entry and now all tests pass. Can you please submit a clang patch which removes the first combined entry?

This is an optimization brought up by Deepak. I guess you were in that meeting too but forgot. It could be quite useful when you map an array of structs/classes. Assume you map 1000 of these structures; with this optimization most of the memory allocation can be done in a single allocation, instead of allocating 12 bytes 1000 times.

Thinking about it, it's actually important for correctness too. Assume you map C a[2]. If you map the elements separately, a[0] and a[1] could be mapped to non-contiguous locations, which will cause errors/segfaults when the GPU kernel accesses this array. If you allocate the whole array a[2] together, such a problem won't happen.

I believe RefCount should be reduced to 1 by the time we want to copy it back in target_data_end. Could you post the whole debug output trace of how RefCount changes in target_data_end?

From my perspective, the declare_mapper_target.cpp code is semantically equivalent to:

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}

and

#pragma omp target enter data map(to: c)
#pragma omp target enter data map(to: c.a[0:NUM])

#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
#pragma omp target exit data map(from: c.a[0:NUM])
#pragma omp target exit data map(from: c)

Can you express the behavior of your mapping implementation in terms of OpenMP target enter/exit data primitives?

You are basically right. In the implementation, a function is generated for every mapper to do all the internal mapping. More details can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

This document does not say which concrete mapping operations you push for the concrete case of the failing test. Can you express this in terms of omp target enter/exit data operations?

Your understanding above is exactly right. It should be equivalent to

#pragma omp target data map(tofrom: c)
#pragma omp target data map(tofrom: c.a[0:NUM])
#pragma omp target teams distribute parallel for
for (int i = 0; i < NUM; i++) {
  ++c.a[i];
}
grokos added a comment.Aug 3 2020, 6:15 PM

Sorry for the late response. Here you are talking about something else. The case you are considering is an array of structs. In this case, indeed we have to allocate the whole array beforehand. It's not an optimization, it's a correctness issue as you point out (array objects must be allocated consecutively). In the failing tests, however, we have single structs, not an array of structs. The difference is that in the former case the object we are mapping is the array, whereas in the latter case it's the struct. The two cases are not related to one another unless we intend to treat both of them uniformly, i.e. even if we have a single struct we still treat it as if it were the sole element of a length-1 array. Do I understand correctly?