This is an archive of the discontinued LLVM Phabricator instance.

Differential D132306

Automatic asynchronous execution of OpenMP Target Regions
Needs Review · Public

Authored by randreshg on Aug 20 2022, 9:50 AM.

Details

Reviewers

jdoerfert
josemonsalve2
jhuber6
tianshilei1992
JonChesterfield
ronlieb
ye-luo

Summary

This patch enables automatic asynchronous execution of OpenMP target regions.
When the environment variable LIBOMPTARGET_INTRA_THREAD_ASYNC is enabled
(LIBOMPTARGET_INTRA_THREAD_ASYNC=1), the implicit barrier at the end
of every target region is removed, and synchronization is performed only on memory transfers
or at the end of the program, when the OpenMP runtime calls its destructors.

Event Timeline

randreshg created this revision. Aug 20 2022, 9:50 AM

Herald added a project: Restricted Project. Aug 20 2022, 9:50 AM

Herald added subscribers: guansong, yaxunl.

randreshg requested review of this revision. Aug 20 2022, 9:50 AM

Herald added a subscriber: sstefan1. Aug 20 2022, 9:50 AM

Harbormaster completed remote builds in B182395: Diff 454232. Aug 20 2022, 9:52 AM

josemonsalve2 added reviewers: jhuber6, tianshilei1992, JonChesterfield. Aug 20 2022, 1:28 PM

josemonsalve2 added a reviewer: ronlieb.

randreshg edited the summary of this revision. Aug 20 2022, 1:31 PM

Could you please explain why removing synchronization at the end of a target region (without nowait) is a valid optimization? Also, there should be no valid program that relies on synchronization at the end of the program.

This revision now requires changes to proceed. Aug 20 2022, 2:06 PM

Ye,

It is only valid under the assumption of non-unified shared memory. State shared across host and device becomes visible only during data movements, so changes to host or device data are not reflected on the other side until then. Assuming there are no external runtimes, it is possible to synchronize only on data movements while preserving the data dependencies between host and device.

Suppose I run target region A and then B, where B consumes numbers generated by A on the device and no transfer is involved. B may start before A has the numbers ready.

As long as the RTL of the device provides a queue mechanism to execute target tasks sequentially, the dependencies among different tasks will be respected. For NVIDIA GPUs, the queue mechanism is the stream. This patch, in contrast with the current OpenMP offloading implementation, launches the execution of tasks into the same stream.

CUDA streams are FIFO queues. But it’s true this will not work if the device queue is not FIFO. In the case of CUDA, this works for the case you described even without an explicit read-after-write dependency.
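
A minimal host-only sketch of the FIFO argument, under stated assumptions: `FifoStream` is a hypothetical stand-in for a device queue (it is not CUDA or libomptarget code) that runs enqueued tasks one at a time in submission order. "Region B" consumes what "region A" produced, and the only host-side wait is a single `synchronize()` at the end.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a device stream: tasks run one at a time,
// in the order they were enqueued (FIFO), on a single worker thread.
class FifoStream {
  std::mutex Mtx;
  std::condition_variable CV;
  std::queue<std::function<void()>> Tasks;
  bool Busy = false; // true while a task is executing
  bool Stop = false; // set by the destructor
  std::thread Worker;

  void run() {
    std::unique_lock<std::mutex> Lock(Mtx);
    while (true) {
      CV.wait(Lock, [this] { return Stop || !Tasks.empty(); });
      if (Tasks.empty())
        return; // Stop was requested and nothing is left to run.
      std::function<void()> Task = std::move(Tasks.front());
      Tasks.pop();
      Busy = true;
      Lock.unlock();
      Task(); // run outside the lock so enqueue() never blocks on a task
      Lock.lock();
      Busy = false;
      CV.notify_all();
    }
  }

public:
  FifoStream() : Worker([this] { run(); }) {}
  ~FifoStream() {
    {
      std::lock_guard<std::mutex> Lock(Mtx);
      Stop = true;
    }
    CV.notify_all();
    Worker.join();
  }

  // Returns immediately; the caller does not wait for the task to finish.
  void enqueue(std::function<void()> Task) {
    {
      std::lock_guard<std::mutex> Lock(Mtx);
      Tasks.push(std::move(Task));
    }
    CV.notify_all();
  }

  // Blocks until every task enqueued so far has completed.
  void synchronize() {
    std::unique_lock<std::mutex> Lock(Mtx);
    CV.wait(Lock, [this] { return Tasks.empty() && !Busy; });
  }
};

// No host-side barrier separates A and B; the FIFO order alone puts B after A.
int runAThenB() {
  std::vector<int> DeviceData(4, 0);
  int Sum = 0;
  FifoStream Stream;
  Stream.enqueue([&] { for (int &X : DeviceData) X = 10; });  // region A
  Stream.enqueue([&] { for (int X : DeviceData) Sum += X; }); // region B
  Stream.synchronize(); // single sync point, e.g. at a data transfer
  return Sum;
}
```

The sketch only holds while both regions land in the same queue, which is exactly the condition the next comments question.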

1. Right now in CUDA plugin, streams are pooled. There is no guarantee that A and B get the same Stream when multiple threads all are doing their own As and Bs.
2. Need to minimize designing libomptarget based on CUDA behaviors.

I agree on 2. Any recommendations? We can move some of the logic there.

Regarding 1, this patch does not return the stream to the pool right away. It is held by the thread until a synchronization occurs. Synchronizations are evaluated lazily on data movement, but the stream does not change between consecutive target invocations with no data movement.
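
The "held until synchronization" behaviour described above can be sketched as follows. This is an illustration under assumptions, not the patch's code: `StreamPool` and `ThreadStreamHolder` are hypothetical names, and streams are modelled as plain integer IDs.

```cpp
#include <mutex>
#include <vector>

// Hypothetical pool of stream handles (plain integer IDs here).
class StreamPool {
  std::mutex Mtx;
  std::vector<int> Free;

public:
  explicit StreamPool(int NumStreams) {
    for (int Id = NumStreams - 1; Id >= 0; --Id)
      Free.push_back(Id);
  }

  // Sketch only: assumes the pool is never exhausted.
  int acquire() {
    std::lock_guard<std::mutex> Lock(Mtx);
    int Id = Free.back();
    Free.pop_back();
    return Id;
  }

  void release(int Id) {
    std::lock_guard<std::mutex> Lock(Mtx);
    Free.push_back(Id);
  }
};

// Per-thread state: the stream is acquired lazily on the first launch and
// kept across launches until a synchronization point returns it to the pool.
class ThreadStreamHolder {
  StreamPool &Pool;
  int Held = -1; // -1 means no stream is currently held

public:
  explicit ThreadStreamHolder(StreamPool &Pool) : Pool(Pool) {}
  ~ThreadStreamHolder() { synchronize(); }

  // Called for every target launch; returns the stream the launch would use.
  int launch() {
    if (Held < 0)
      Held = Pool.acquire();
    return Held;
  }

  // Called when a data movement (or teardown) forces synchronization.
  void synchronize() {
    if (Held < 0)
      return;
    // A real runtime would wait for the stream to drain before releasing it.
    Pool.release(Held);
    Held = -1;
  }
};
```

Two consecutive `launch()` calls with no `synchronize()` in between return the same stream ID, which is the property the reply relies on.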

In D132306#3737736, @josemonsalve2 wrote:

I agree on 2. Any recommendations? We can move some of the logic there.

Regarding 1, this patch does not return the stream to the pool right away. It is held by the thread until a synchronization occurs. Synchronizations are evaluated lazily on data movement, but the stream does not change between consecutive target invocations with no data movement.

Explicitly tying this to the thread context is even worse. Suppose I wrap both regions in OpenMP CPU tasks and make task B depend on task A. Task A completes but the kernel is still in flight. Task B may start at any time, and there is no guarantee that it runs on the same thread.

That is a good point. Let us revise that, since the effect we actually want is the sequential creation of the task graph.

jdoerfert added inline comments. Aug 29 2022, 7:31 AM

openmp/libomptarget/include/device.h
340	Why is this static, that seems wrong.
471	Documentation, please.
openmp/libomptarget/include/omptarget.h
195–200	You cannot just remove this. See https://reviews.llvm.org/D132045, as it introduces a flag you can use to disable synchronization here.
openmp/libomptarget/src/device.cpp
69	Reverse the condition, single line alternative comes first, then the complex consequence (without the else).
81–83	Use early exits and no else after return.
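
The two style requests for device.cpp match the LLVM Coding Standards guidance on early exits and on avoiding `else` after `return`. A small sketch of the transformation, using hypothetical function names that do not come from the patch:

```cpp
#include <string>

// Hypothetical example; the names are not taken from the patch.

// Before: nested branches with an else after a return.
std::string describeNested(bool DeviceReady, int PendingOps) {
  if (DeviceReady) {
    if (PendingOps > 0) {
      return "synchronize";
    } else {
      return "idle";
    }
  } else {
    return "not ready";
  }
}

// After: the short alternative is handled first and exits early, so the
// longer path needs neither extra nesting nor an else.
std::string describeEarlyExit(bool DeviceReady, int PendingOps) {
  if (!DeviceReady)
    return "not ready";
  if (PendingOps > 0)
    return "synchronize";
  return "idle";
}
```

Both functions return the same result for every input; only the control flow is flattened.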

josemonsalve2 added inline comments. Aug 31 2022, 2:25 PM

openmp/libomptarget/src/device.cpp
61–65	It may be better to merge these into: DP("Asynchronous execution %s\n", AsyncFlag ? "Enabled" : "Disabled");

randreshg marked 4 inline comments as done. Sep 5 2022, 7:17 AM

randreshg added inline comments. Sep 5 2022, 7:31 AM

openmp/libomptarget/include/device.h
340	The idea of AsyncInfoMng is to have a way to control the AsyncInfo object to skip synchronization that is not needed. It's part of the Device class, so we can sync like this: Device.syncAsyncInfo(AsyncInfo, true); however, it is not necessary for every device to have a copy of AsyncInfoMng. That's why it's static.

randreshg marked 2 inline comments as done. Sep 13 2022, 6:26 AM

diff updated.

The AsyncInfoManager is now part of the omptarget.h and not device.h
Global variable AIM was added in interface.cpp
Flag was added to the Synchronization in asyncinfo destructor

Harbormaster completed remote builds in B186362: Diff 459736. Sep 13 2022, 6:39 AM

Documentation added

Harbormaster completed remote builds in B187516: Diff 461250. Sep 19 2022, 10:07 AM

Some more initial comments

openmp/docs/optimizations/OpenMPOpt.rst
113	Doesn't the length need to match?
116	For `nowait` regions this is maybe misleading. We should probably say they can be executed synchronously depending on the pragmas and implementation. In that case ...
134	This is misleading. Rather explain what would happen if the map on the target was triggering a transfer.
openmp/libomptarget/include/omptarget.h
228	This default is dangerous. Either swap it or avoid the default. Also, explain what it means to synchronize but not to force synchronize.
232	Use doxygen comments `///` not `//`.
openmp/libomptarget/src/device.cpp
52	Unrelated
openmp/libomptarget/src/interface.cpp
25	We should not have a global like this. It probably belongs in/to the Device object.
254	A comment would be nice as this is the one place we do not force synchronization.

randreshg marked 8 inline comments as done. Sep 26 2022, 11:54 AM

This new version has the following changes:

Updated documentation
The AsyncInfoManager is part of the Device class and now has a map that contains thread ids as its key values.
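
A minimal sketch of what a manager keyed by thread ID could look like. The types `AsyncInfoSketch` and `AsyncInfoManagerSketch` are hypothetical stand-ins; the actual AsyncInfoManager in the patch is not shown in this part of the page, so its members and locking are assumptions here.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for AsyncInfoTy: only counts queued operations.
struct AsyncInfoSketch {
  int PendingOps = 0;
};

// Hypothetical manager holding one AsyncInfoSketch per host thread.
class AsyncInfoManagerSketch {
  std::mutex Mtx;
  std::map<std::thread::id, std::unique_ptr<AsyncInfoSketch>> PerThread;

public:
  // Returns the calling thread's object, creating it on first use.
  AsyncInfoSketch &getForCurrentThread() {
    std::lock_guard<std::mutex> Lock(Mtx);
    std::unique_ptr<AsyncInfoSketch> &Slot =
        PerThread[std::this_thread::get_id()];
    if (!Slot)
      Slot = std::make_unique<AsyncInfoSketch>();
    return *Slot;
  }

  // Number of threads that have requested an object so far.
  std::size_t size() {
    std::lock_guard<std::mutex> Lock(Mtx);
    return PerThread.size();
  }
};
```

Storing the objects behind `std::unique_ptr` keeps the returned references stable while other threads insert new entries.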

Harbormaster completed remote builds in B188766: Diff 462993. Sep 26 2022, 12:28 PM

ping

jdoerfert added inline comments. Oct 18 2022, 10:40 AM

openmp/docs/optimizations/OpenMPOpt.rst
120
135–140
openmp/libomptarget/include/device.h
320	Documentation, please.
349	I would rather expose the AIM than have 3 functions that just forward to it.
openmp/libomptarget/include/omptarget.h
22	We can use LLVM data structures now too, e.g., maps.
openmp/libomptarget/src/interface.cpp
254	multi-line conditionals should have braces. The comment is unhelpful. It states what the code already says, not why the code is this way.
openmp/libomptarget/src/omptarget.cpp
53	Lots of static flags to simply look up an env var once. Why don't we do it the same way as for other env vars?
55
61
62	You did get an iterator, why do another lookup via `[...]`? You can do a single `[...]` lookup and use the reference result to check and update it.
64	And this is the 3rd lookup into the map..
65	Do we ever have to clear the map?
82	I doubt we need both these lines.
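
The map-lookup comments on omptarget.cpp (lines 62 and 64) can be illustrated with a standard container. This is a sketch with a hypothetical `CounterMap`, not the map from the patch; the same idea applies to any map whose `operator[]` inserts a default value for a missing key.

```cpp
#include <map>

// Hypothetical map from an integer ID to a launch counter.
using CounterMap = std::map<int, int>;

// Pattern the review objects to: find() plus repeated operator[] calls walk
// the map several times for the same key.
int bumpWithRepeatedLookups(CounterMap &Counters, int Id) {
  auto It = Counters.find(Id); // lookup 1
  if (It == Counters.end())
    Counters[Id] = 0;          // lookup 2
  Counters[Id] += 1;           // lookup 3
  return Counters[Id];         // lookup 4
}

// Suggested pattern: operator[] value-initializes a missing entry (0 for int)
// and returns a reference, so one lookup serves both the check and the update.
int bumpWithSingleLookup(CounterMap &Counters, int Id) {
  int &Count = Counters[Id];
  ++Count;
  return Count;
}
```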

This new patch addresses all previous comments from reviewers

randreshg marked 11 inline comments as done. Oct 19 2022, 8:51 AM

Harbormaster completed remote builds in B193018: Diff 468932. Oct 19 2022, 8:53 AM

I know this is getting tiresome but we need to make sure people understand what's happening and this plays well with future extensions. More comments.

openmp/libomptarget/include/omptarget.h
192	ShouldSyncWhenDestroyed
228	You never return a nullptr, make it a reference.
236	The description is unhelpful. What AsyncInfo, etc. The function is also unused, do we need it?
openmp/libomptarget/src/interface.cpp
101	Now this pattern is somewhat unfortunate. You get the AsyncInfo from the AIM and then you need to be careful to call the right synchronize. If the AsyncFlag was global you could move the new "sync" logic into the regular AsyncInfo sync, right? AIM would just be used to manage the map ID -> AsyncInfo. WDYT about this scheme? You could check the env variable once, like we do it for some others: https://github.com/llvm/llvm-project/blob/23bc343855fdf6fb7668abadf2b064034b207981/openmp/libomptarget/src/rtl.cpp#L43
openmp/libomptarget/src/omptarget.cpp
54–57
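
One common way to check an environment variable once is a function-local static, sketched below. This is not the code at the linked rtl.cpp line; the helper names and the parsing rule are assumptions, and the variable name is the one given in the revision summary.

```cpp
#include <cstdlib>
#include <cstring>

// Simplified parsing (an assumption of this sketch): the flag is on only when
// the variable is set to a non-empty value other than "0".
bool parseAsyncFlag(const char *Value) {
  if (Value == nullptr || Value[0] == '\0')
    return false;
  return std::strcmp(Value, "0") != 0;
}

// The function-local static is initialized on the first call only (and
// thread-safely since C++11), so the environment is read exactly once.
bool isIntraThreadAsyncEnabled() {
  static const bool Enabled =
      parseAsyncFlag(std::getenv("LIBOMPTARGET_INTRA_THREAD_ASYNC"));
  return Enabled;
}
```

With a single global query like this, the synchronization logic can consult the flag directly instead of routing every call through the manager.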

randreshg marked 6 inline comments as done. Oct 20 2022, 12:35 PM

randreshg added inline comments.

openmp/libomptarget/src/interface.cpp
101	Agreed! The pattern you suggest allows for separating the synchronization logic and AsyncInfo management.

randreshg updated this revision to Diff 469594. Oct 21 2022, 6:59 AM

randreshg marked an inline comment as done.

Harbormaster completed remote builds in B193512: Diff 469594. Oct 21 2022, 7:02 AM

Ping

I think this is fine for now. @ye-luo can this go in as opt-in feature that we probably refine as we go?

openmp/libomptarget/src/interface.cpp
101	Better. Not super happy about the explicit free call but that's fine for now.

How much testing has been done? It seems only workable on toy examples.

openmp/docs/optimizations/OpenMPOpt.rst
140	Through this example, I only see TT1 and TT2 racing when the async feature is enabled.
openmp/libomptarget/src/device.cpp
56	Why does AIM capture a pointer instead of a reference?
openmp/libomptarget/src/omptarget.cpp
33	It is a mess. Synchronizing or not depends on a bunch of states. Move the if-statement to the caller side and make the logic directly exposed on the use side.
71	I think this is against coding principles: passing in a reference and deleting its memory.
openmp/libomptarget/src/rtl.cpp
42 ↗	(On Diff #469594)	Is AsyncFlag documented?

ye-luo added inline comments. Oct 26 2022, 1:29 PM

openmp/libomptarget/include/omptarget.h
192	Add const
216	This is insufficient documentation. Please explain what this struct actually does, not just what it is used for.
217	Why is this a struct instead of a class? It encourages code like `AIM.AsyncInfoM[1]`. Use a class and appropriate private/public.
218	Why does AsyncInfoTy need to be associated with thread::id? Does this add unnecessary entanglement between target tasks?

ye-luo added inline comments. Oct 26 2022, 1:48 PM

openmp/libomptarget/include/omptarget.h
218	It is necessary to document the design choice that AsyncInfoTy objects and thread IDs have a one-to-one mapping when AsyncFlag is true.
openmp/libomptarget/src/omptarget.cpp
54	A pointer is allocated with new and then returned as a reference. Reference types should not be used to manage ownership.

tianshilei1992 added inline comments. Oct 27 2022, 11:20 AM

openmp/libomptarget/src/rtl.cpp
42 ↗	(On Diff #469594)	I don't think this is a good design, that tries to modify the state of `libomptarget` from plugins, especially the corresponding env is called `LIBOMPTARGET_ASYNC`, which indicates it should be handled in `libomptarget` instead of in each plugin. Potentially I think we could add a plugin interface function to tell what opt-in feature is enabled. This could be done in a separate patch, but is better to land it before this patch. I don't think it makes sense to open the door (now) and then close it later, because I don't think "later" will come. BTW, `LIBOMPTARGET_ASYNC` is a little bit confusing. It's too general.

tianshilei1992 added inline comments. Oct 27 2022, 11:23 AM

openmp/libomptarget/include/omptarget.h
17	It is recommended to add a blank line between LLVM and STL headers.
218	Better to use LLVM ADT here

tianshilei1992 added inline comments. Oct 27 2022, 11:25 AM

openmp/libomptarget/src/rtl.cpp
42 ↗	(On Diff #469594)	nvm, I didn't look at it right. It's in `libomptarget`. My bad. But the point about the env name still holds.

In D132306#3886372, @ye-luo wrote:

How much testing has been done? It seems only workable on toy examples.

Wrt toy examples:

openmp/docs/optimizations/OpenMPOpt.rst
140	As discussed yesterday, there is no race.
openmp/libomptarget/include/omptarget.h
218	This does not depend on tasks but threads. If threads are independent wrt offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used.
openmp/libomptarget/src/omptarget.cpp
33	Exposing this to the user side is not only not helpful but will actively harm things. The logic depends on two flags, not "a bunch of states". One is passed by the user, one describes the system setup. This is totally fine.
54	What about this: Replace "get" with AsyncInfoMng::register(AsyncInfo &AI); Remove "free". Register is implemented as no-op w/o the AsyncInfo flag set. Otherwise it'll replace AI with a dynamically allocated one.
openmp/libomptarget/src/rtl.cpp
42 ↗	(On Diff #469594)	What about `LIBOMPTARGET_INTRA_THREAD_ASYNC`?

tianshilei1992 added inline comments. Oct 27 2022, 12:24 PM

openmp/libomptarget/src/rtl.cpp
42 ↗	(On Diff #469594)	that sounds good!

ye-luo added inline comments. Oct 27 2022, 12:36 PM

openmp/libomptarget/include/omptarget.h
218	"This does not depend on tasks but threads. If threads are independent wrt offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used." Fair enough. This feature relies on thread id to impose dependency. It can only be used under certain restrictions.
openmp/libomptarget/src/omptarget.cpp
33	Two flags = 4 states. Not a low number of variants. Calling synchronize() but it may or may not do the sync. When should the user set ForceSync to true? Better to have some explanation.
54	Not getting what you mean. I'm still expecting, AsyncInfoTy object being destroyed properly at the end of the target region, when AsyncInfo=false.

jdoerfert added inline comments. Oct 27 2022, 1:28 PM

openmp/libomptarget/src/omptarget.cpp
33	4 states, ok. However, they collapse to 2; it's a single conditional after all: synchronize or not. The rules are fairly simple (one or condition) and documented: /// Synchronize all pending actions when the LIBOMPTARGET_ASYNC env var /// is disabled or when synchronization is forced (ForceSync = true) // Otherwise, synchronization is skipped /// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.
54	"I'm still expecting, AsyncInfoTy object being destroyed properly at the end of the target region, when AsyncInfo=false." It is properly destroyed with this patch, and it will be with the proposed scheme. In the proposed scheme we have a local AsyncInfo (as we have upstream now) and we use it if AsyncInfo=false.
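
The "four states collapse to two" rule quoted in the comment on line 33 can be written out as a small model. This is a sketch under assumptions: `AsyncInfoModel` and its members are hypothetical, and the wait on the device queue is replaced by clearing a counter.

```cpp
// Return codes modelled after the OFFLOAD_SUCCESS / OFFLOAD_FAIL convention.
constexpr int OffloadSuccess = 0;
constexpr int OffloadFail = ~0;

// Hypothetical model of the documented rule; not the patch's code.
struct AsyncInfoModel {
  bool AsyncEnabled;  // models the env-var flag
  int PendingOps = 0; // operations queued but not yet waited on
  int SyncCalls = 0;  // how many times a real wait was performed

  explicit AsyncInfoModel(bool AsyncEnabled) : AsyncEnabled(AsyncEnabled) {}

  void enqueue() { ++PendingOps; }

  // AsyncEnabled | ForceSync | action
  //    false     |   false   | synchronize
  //    false     |   true    | synchronize
  //    true      |   false   | skip
  //    true      |   true    | synchronize
  int synchronize(bool ForceSync) {
    if (AsyncEnabled && !ForceSync)
      return OffloadSuccess; // skipped: work stays queued
    PendingOps = 0;          // stand-in for waiting on the device queue
    ++SyncCalls;
    return OffloadSuccess;   // a real wait could also return OffloadFail
  }
};
```

Only one of the four combinations skips the wait, which is the single "or" condition the comment describes.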

randreshg added inline comments. Oct 28 2022, 12:28 PM

openmp/libomptarget/src/omptarget.cpp
54	I'm not following this. Could you please provide pseudocode?

randreshg marked 22 inline comments as done. Nov 21 2022, 7:52 AM

randreshg marked 5 inline comments as done.

This patch addresses comments from reviewers:

Env var name is LIBOMPTARGET_INTRA_THREAD_ASYNC
The pattern to get and destroy the AsyncInfoTy object changed.
A DenseMap is used instead of the std::map.

Harbormaster completed remote builds in B198795: Diff 476906. Nov 21 2022, 10:00 AM

ping

Patch updated to the trunk version.
Changes compared to the last patch:

The HasDataTransfer flag was added.
The function AsyncInfoTy *get() was added to both AsyncInfoTy and TaskAsyncInfoWrapperTy.

Herald added subscribers: jplehr, sunshaoce. Mar 20 2023, 8:34 AM

Harbormaster completed remote builds in B220456: Diff 506605. Mar 20 2023, 8:38 AM

Adding another limitation to this approach. In the following code:

void aaaa(int b) {
    // TARGET A
    #pragma omp target
    {}

    // TARGET B
    #pragma omp target nowait
    {}

    // TARGET C
    #pragma omp target
    {}
}

The A->B task dependency should be respected: since A is originally synchronous, it should have executed before B. However, B will be spawned in a different thread, and A can potentially execute after B.

openmp/libomptarget/include/omptarget.h
205	It may be a good idea to add a comment here explaining why this is necessary. It has to do with garbage collection of the device RTL, as per our conversation today.

In D132306#4207335, @josemonsalve2 wrote:
Adding another limitation to this approach. In the following code:
void aaaa(int b) {
    // TARGET A
    #pragma omp target
    {}

    // TARGET B
    #pragma omp target nowait
    {}

    // TARGET C
    #pragma omp target
    {}
}
The A->B task dependency should be respected: since A is originally synchronous, it should have executed before B. However, B will be spawned in a different thread, and A can potentially execute after B.

The thread which encounters A will not reach B until A has completed, so how can B be spawned while A is active?

Hi Ravi,

The purpose of this is to have regions with nowait to be asynchronous (but ordered within the stream). (See paper). So using the solution presented by @randreshg will have this issue. The problem is that the encountering thread will push A to the queue, but B is going to be lowered as a host task. B can be pushed into another thread, and potentially delayed way before A finishes. A must be synchronized such that the original order is maintained. My example was rather simplistic, but it demonstrates the issue. While A and C will be in order, B will not be ordered w.r.t. A.

Of course this is not default OpenMP behavior where what you said does apply.

In D132306#4207789, @RaviNarayanaswamy wrote:

The thread which encounters A will not reach B until A has completed, so how can B be spawned while A is active?

Test added

Harbormaster completed remote builds in B223444: Diff 510619. Apr 3 2023, 3:22 PM

Revision Contents

Path	Size
openmp/docs/optimizations/OpenMPOpt.rst	41 lines
openmp/libomptarget/include/device.h	5 lines
openmp/libomptarget/include/omptarget.h	31 lines
openmp/libomptarget/src/device.cpp	11 lines
openmp/libomptarget/src/interface.cpp	18 lines
openmp/libomptarget/src/omptarget.cpp	61 lines

Diff 461250

openmp/docs/optimizations/OpenMPOpt.rst

[... 94 lines hidden ...]

These optimizations can have very large performance implications. Both of these
optimizations rely heavily on inter-procedural analysis. Because of this,
offloading applications should ideally be contained in a single translation unit
and functions should not be externally visible unless needed. OpenMPOpt will
inform the user if any globalization calls remain if remarks are enabled. This
should be treated as a defect in the program.

.. _Others:

Others

=========

.. contents::

:local:

:depth: 1

Automatic Asynchronous Execution of Target Regions

---------------------------------

jdoerfert: Doesn't the length need to match?

By default, offloaded regions are executed synchronously,

thus the host thread blocks until their completion.

jdoerfert: For `nowait` regions this is maybe misleading. We should probably say they can be executed synchronously depending on the pragmas and implementation. In that case ...

By using the environment flag `LIBOMPTARGET_ASYNC=1` the implicit

barrier that exists at the end of every target region is removed.

.. code-block:: c++

jdoerfert (suggested edit):

  only on memory transfers or at the end of the program when the
- OpenMP runtime call its destructors.
+ OpenMP runtime is torn down.
  .. code-block:: c++

LIBOMPTARGET_ASYNC=1 ./simple1 //For async execution

Limitations:

- It is necessary to define host and target data environments

(e.g. `#pragma omp target data map`). For example:

.. code-block:: c++

#pragma omp target enter data map(alloc: array[:N])

#pragma omp target map(tofrom: array[:N])

busy_device(array);

#pragma omp target map(tofrom: array[:N])

busy_device(array);

busy_host();

#pragma omp target exit data map(from: array[:N])

jdoerfert: This is misleading. Rather explain what would happen if the map on the target was triggering a transfer.

- Cross-device synchronization primitives or atomics are not allowed.

- Timers wrapping target regions will not show the time it takes to

execute the target task but the time the runtime takes to launch

the execution of the task.

- Synchronization using different threads: Target task and

synchronizations in different threads is not supported yet

jdoerfert (suggested edit):

  #pragma omp target exit data map(from: array[:N])
- In the example above, the target regions inside the device environment
- don't perform any memory transfer, since the used variables are already
+ In the example above, the target regions inside the enter/exit data region
+ do not perform any memory transfer, since the mapped memory is already
  on the device. So, the host launches the execution of both target
  regions (TT1 and TT2) in the device queue and continues the execution
  on its end (busy host). Offloaded regions that trigger memory transfers
  will not benefit from this optimization.
  Limitations:

ye-luo: Through this example, I only see TT1 and TT2 racing when the async feature is enabled.

jdoerfert: As discussed yesterday, there is no race.

when the flag for asynchronous execution

is enabled.

Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30

openmp/libomptarget/include/device.h

[... 300 lines hidden ...]
///		///
struct PendingCtorDtorListsTy {		struct PendingCtorDtorListsTy {
std::list<void *> PendingCtors;		std::list<void *> PendingCtors;
std::list<void *> PendingDtors;		std::list<void *> PendingDtors;
};		};
typedef std::map<__tgt_bin_desc *, PendingCtorDtorListsTy>		typedef std::map<__tgt_bin_desc *, PendingCtorDtorListsTy>
PendingCtorsDtorsPerLibrary;		PendingCtorsDtorsPerLibrary;

		struct AsyncInfoMng;

struct DeviceTy {		struct DeviceTy {
int32_t DeviceID;		int32_t DeviceID;
RTLInfoTy *RTL;		RTLInfoTy *RTL;
int32_t RTLDeviceID;		int32_t RTLDeviceID;

bool IsInit;		bool IsInit;
std::once_flag InitFlag;		std::once_flag InitFlag;
bool HasPendingGlobals;		bool HasPendingGlobals;

/// Host data to device map type with a wrapper key indirection that allows		/// Host data to device map type with a wrapper key indirection that allows
		jdoerfert: Documentation, please.
/// concurrent modification of the entries without invalidating the underlying		/// concurrent modification of the entries without invalidating the underlying
/// entries.		/// entries.
using HostDataToTargetListTy =		using HostDataToTargetListTy =
std::set<HostDataToTargetMapKeyTy, std::less<>>;		std::set<HostDataToTargetMapKeyTy, std::less<>>;

/// The HDTTMap is a protected object that can only be accessed by one thread		/// The HDTTMap is a protected object that can only be accessed by one thread
/// at a time.		/// at a time.
ProtectedObj<HostDataToTargetListTy> HostDataToTargetMap;		ProtectedObj<HostDataToTargetListTy> HostDataToTargetMap;

/// The type used to access the HDTT map.		/// The type used to access the HDTT map.
using HDTTMapAccessorTy = decltype(HostDataToTargetMap)::AccessorTy;		using HDTTMapAccessorTy = decltype(HostDataToTargetMap)::AccessorTy;

PendingCtorsDtorsPerLibrary PendingCtorsDtors;		PendingCtorsDtorsPerLibrary PendingCtorsDtors;

ShadowPtrListTy ShadowPtrMap;		ShadowPtrListTy ShadowPtrMap;

std::mutex PendingGlobalsMtx, ShadowMtx;		std::mutex PendingGlobalsMtx, ShadowMtx;

DeviceTy(RTLInfoTy *RTL);		DeviceTy(RTLInfoTy *RTL);
// DeviceTy is not copyable		// DeviceTy is not copyable
		jdoerfert: Why is this static, that seems wrong.
		randreshg (author): The idea of AsyncInfoMng is to have a way to control the AsyncInfo object to skip synchronization that is not needed. It's part of the Device class, so we can sync like this: Device.syncAsyncInfo(AsyncInfo, true); however, it is not necessary for every device to have a copy of AsyncInfoMng. That's why it's static.
DeviceTy(const DeviceTy &D) = delete;		DeviceTy(const DeviceTy &D) = delete;
DeviceTy &operator=(const DeviceTy &D) = delete;		DeviceTy &operator=(const DeviceTy &D) = delete;

~DeviceTy();		~DeviceTy();

// Return true if data can be copied to DstDevice directly		// Return true if data can be copied to DstDevice directly
bool isDataExchangable(const DeviceTy &DstDevice);		bool isDataExchangable(const DeviceTy &DstDevice);

/// Lookup the mapping of \p HstPtrBegin in \p HDTTMap. The accessor ensures		/// Lookup the mapping of \p HstPtrBegin in \p HDTTMap. The accessor ensures
		jdoerfert: I would rather expose the AIM than have 3 functions that just forward to it.
/// exclusive access to the HDTT map.		/// exclusive access to the HDTT map.
LookupResult lookupMapping(HDTTMapAccessorTy &HDTTMap, void *HstPtrBegin,		LookupResult lookupMapping(HDTTMapAccessorTy &HDTTMap, void *HstPtrBegin,
int64_t Size);		int64_t Size);

/// Get the target pointer based on host pointer begin and base. If the		/// Get the target pointer based on host pointer begin and base. If the
/// mapping already exists, the target pointer will be returned directly. In		/// mapping already exists, the target pointer will be returned directly. In
/// addition, if required, the memory region pointed by \p HstPtrBegin of size		/// addition, if required, the memory region pointed by \p HstPtrBegin of size
/// \p Size will also be transferred to the device. If the mapping doesn't		/// \p Size will also be transferred to the device. If the mapping doesn't
[... 23 lines hidden ...]	struct DeviceTy {

/// Deallocate \p LR and remove the entry. Assume the total reference count is		/// Deallocate \p LR and remove the entry. Assume the total reference count is
/// zero and the calling thread is the deleting thread for \p LR. \p HDTTMap		/// zero and the calling thread is the deleting thread for \p LR. \p HDTTMap
/// ensure the caller holds exclusive access and can modify the map. Return \c		/// ensure the caller holds exclusive access and can modify the map. Return \c
/// OFFLOAD_SUCCESS if the map entry existed, and return \c OFFLOAD_FAIL if		/// OFFLOAD_SUCCESS if the map entry existed, and return \c OFFLOAD_FAIL if
/// not. It is the caller's responsibility to skip calling this function if		/// not. It is the caller's responsibility to skip calling this function if
/// the map entry is not expected to exist because \p HstPtrBegin uses shared		/// the map entry is not expected to exist because \p HstPtrBegin uses shared
/// memory.		/// memory.
int deallocTgtPtr(HDTTMapAccessorTy &HDTTMap, LookupResult LR, int64_t Size);		int deallocTgtPtr(HDTTMapAccessorTy &HDTTMap, LookupResult LR, int64_t Size,
		AsyncInfoTy &AsyncInfo);

int associatePtr(void *HstPtrBegin, void *TgtPtrBegin, int64_t Size);		int associatePtr(void *HstPtrBegin, void *TgtPtrBegin, int64_t Size);
int disassociatePtr(void *HstPtrBegin);		int disassociatePtr(void *HstPtrBegin);

// calls to RTL		// calls to RTL
int32_t initOnce();		int32_t initOnce();
__tgt_target_table *loadBinary(void *Img);		__tgt_target_table *loadBinary(void *Img);

[... 64 lines hidden ...]
private:		private:
// Call to RTL		// Call to RTL
void init(); // To be called only via DeviceTy::initOnce()		void init(); // To be called only via DeviceTy::initOnce()

/// Deinitialize the device (and plugin).		/// Deinitialize the device (and plugin).
void deinit();		void deinit();
};		};

extern bool deviceIsReady(int DeviceNum);		extern bool deviceIsReady(int DeviceNum);
		jdoerfert: Documentation, please.

/// Struct for the data required to handle plugins		/// Struct for the data required to handle plugins
struct PluginManager {		struct PluginManager {
PluginManager(bool UseEventsForAtomicTransfers)		PluginManager(bool UseEventsForAtomicTransfers)
: UseEventsForAtomicTransfers(UseEventsForAtomicTransfers) {}		: UseEventsForAtomicTransfers(UseEventsForAtomicTransfers) {}

/// RTLs identified on the host		/// RTLs identified on the host
RTLsTy RTLs;		RTLsTy RTLs;
[... 31 lines hidden ...]

openmp/libomptarget/include/omptarget.h

	//===-------- omptarget.h - Target independent OpenMP target RTL -- C++ -*-===//			//===-------- omptarget.h - Target independent OpenMP target RTL -- C++ -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Interface to be used by Clang during the codegen of a			// Interface to be used by Clang during the codegen of a
	// target region.			// target region.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef _OMPTARGET_H_			#ifndef _OMPTARGET_H_
	#define _OMPTARGET_H_			#define _OMPTARGET_H_

	#include <deque>			#include <deque>
				tianshilei1992 (Done): It is recommended to add a blank line between LLVM and STL headers.
	#include <stddef.h>			#include <stddef.h>
	#include <stdint.h>			#include <stdint.h>
				#include <memory>
				#include <vector>
	#include <SourceInfo.h>			#include <SourceInfo.h>
				jdoerfert (Done): We can use LLVM data structures now too, e.g., maps.

	#define OFFLOAD_SUCCESS (0)			#define OFFLOAD_SUCCESS (0)
	#define OFFLOAD_FAIL (~0)			#define OFFLOAD_FAIL (~0)

	#define OFFLOAD_DEVICE_DEFAULT -1			#define OFFLOAD_DEVICE_DEFAULT -1

	// Don't format out enums and structs.			// Don't format out enums and structs.
	// clang-format off			// clang-format off
	[... 152 lines not shown ...]
	/// mistakes.			/// mistakes.
	class AsyncInfoTy {			class AsyncInfoTy {
	/// Locations we used in (potentially) asynchronous calls which should live			/// Locations we used in (potentially) asynchronous calls which should live
	/// as long as this AsyncInfoTy object.			/// as long as this AsyncInfoTy object.
	std::deque<void *> BufferLocations;			std::deque<void *> BufferLocations;

	__tgt_async_info AsyncInfo;			__tgt_async_info AsyncInfo;
	DeviceTy &Device;			DeviceTy &Device;
				// Enables/Disable synchronization in destructor
				bool ShouldSync;
				jdoerfert (Done): ShouldSyncWhenDestroyed
				ye-luo (Done): Add const

	public:			public:
	AsyncInfoTy(DeviceTy &Device) : Device(Device) {}			AsyncInfoTy(DeviceTy &Device, bool ShouldSync = true)
	~AsyncInfoTy() { synchronize(); }			: Device(Device), ShouldSync(ShouldSync) {}
				~AsyncInfoTy() {
				if(ShouldSync)
				synchronize();
				}
				jdoerfert (Done): You cannot just remove this. See https://reviews.llvm.org/D132045, as it introduces a flag you can use to disable synchronization here.

	/// Implicit conversion to the __tgt_async_info which is used in the			/// Implicit conversion to the __tgt_async_info which is used in the
	/// plugin interface.			/// plugin interface.
	operator __tgt_async_info *() { return &AsyncInfo; }			operator __tgt_async_info *() { return &AsyncInfo; }

				josemonsalve2 (Not Done): It may be a good idea to add a comment here explaining why this is necessary. It has to do with garbage collection of the device RTL, as per our conversation today.
	/// Synchronize all pending actions.			/// Synchronize all pending actions.
	///			///
	/// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.			/// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.
	int synchronize();			int synchronize();

	/// Return a void* reference with a lifetime that is at least as long as this			/// Return a void* reference with a lifetime that is at least as long as this
	/// AsyncInfoTy object. The location can be used as intermediate buffer.			/// AsyncInfoTy object. The location can be used as intermediate buffer.
	void *&getVoidPtrLocation();			void *&getVoidPtrLocation();
	};			};

				/// This structs allows for automatic asynchronous execution of target regions.
				ye-luo (Done): This is insufficient documentation. Please explain what this struct actually does, not just what it is used for.
				/// It controls whether to synchronize or not based on the value of AsyncFlag.
				ye-luo (Done): Why is this a struct instead of a class? It encourages code like `AIM.AsyncInfoM[1]`. Use a class with appropriate private/public sections.
				struct AsyncInfoMng {
				ye-luo (Done): Why does AsyncInfoTy need to be associated with thread::id? Does this add unnecessary entanglement between target tasks?
				jdoerfert (Done): This does not depend on tasks but threads. If threads are independent wrt. offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used.
				ye-luo (Done): > This does not depend on tasks but threads. If threads are independent wrt. offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used.
				Fair enough. This feature relies on the thread ID to impose dependencies. It can only be used under certain restrictions.
				ye-luo (Done): It is necessary to document the design choice that AsyncInfoTy objects and thread IDs have a one-to-one mapping when AsyncFlag is true.
				tianshilei1992 (Done): Better to use LLVM ADT here.
				static thread_local std::vector<std::unique_ptr<AsyncInfoTy>> AsyncInfoV;
				bool AsyncFlag;

				AsyncInfoMng();

				// Get async info object
				AsyncInfoTy *get(DeviceTy &device);

				// Synchronize asyncinfo
				int synchronize(AsyncInfoTy &AsyncInfo, bool ForceSync = false);
				jdoerfert (Done): This default is dangerous. Either swap it or avoid the default. Also, explain what it means to synchronize but not to force synchronize.
				jdoerfert (Done): You never return a nullptr, make it a reference.

				// Free asyncinfo
				void free(DeviceTy &device);
				};
				jdoerfert (Done): Use doxygen comments `///` not `//`.

	/// This struct is a record of non-contiguous information			/// This struct is a record of non-contiguous information
	struct __tgt_target_non_contig {			struct __tgt_target_non_contig {
	uint64_t Offset;			uint64_t Offset;
				jdoerfert (Done): The description is unhelpful. What AsyncInfo, etc. The function is also unused, do we need it?
	uint64_t Count;			uint64_t Count;
	uint64_t Stride;			uint64_t Stride;
	};			};

	struct __tgt_device_info {			struct __tgt_device_info {
	void *Context = nullptr;			void *Context = nullptr;
	void *Device = nullptr;			void *Device = nullptr;
	};			};
	[... 138 lines not shown ...]

openmp/libomptarget/src/device.cpp

[... 43 lines not shown ...]

int HostDataToTargetTy::addEventIfNecessary(DeviceTy &Device,

}

if (NeedNewEvent)

setEvent(Event);

return OFFLOAD_SUCCESS;

}

// Device

jdoerfert (Done): Unrelated

DeviceTy::DeviceTy(RTLInfoTy *RTL)

: DeviceID(-1), RTL(RTL), RTLDeviceID(-1), IsInit(false), InitFlag(),

HasPendingGlobals(false), PendingCtorsDtors(), ShadowPtrMap(),

PendingGlobalsMtx(), ShadowMtx() {}

ye-luo (Done): Why does AIM capture a pointer instead of a reference?

DeviceTy::~DeviceTy() {

if (DeviceID == -1 || !(getInfoLevel() & OMP_INFOTYPE_DUMP_TABLE))

return;

ident_t Loc = {0, 0, 0, 0, ";libomptarget;libomptarget;0;0;;"};

dumpTargetPointerMappings(&Loc, *this);

}

josemonsalve2 (Done): It may be better to merge these into:

DP("Asynchronous execution %s\n", AsyncFlag ? "Enabled" : "Disabled");

int DeviceTy::associatePtr(void *HstPtrBegin, void *TgtPtrBegin, int64_t Size) {

HDTTMapAccessorTy HDTTMap = HostDataToTargetMap.getExclusiveAccessor();

// Check if entry exists

jdoerfert (Done): Reverse the condition: the single-line alternative comes first, then the complex consequence (without the else).

auto It = HDTTMap->find(HstPtrBegin);

if (It != HDTTMap->end()) {

HostDataToTargetTy &HDTT = *It->HDTT;

// Mapping already exists

bool IsValid = HDTT.HstPtrEnd == (uintptr_t)HstPtrBegin + Size &&

HDTT.TgtPtrBegin == (uintptr_t)TgtPtrBegin;

if (IsValid) {

DP("Attempt to re-associate the same device ptr+offset with the same "

"host ptr, nothing to do\n");

return OFFLOAD_SUCCESS;

}

REPORT("Not allowed to re-associate a different device ptr+offset with "

"the same host ptr\n");

return OFFLOAD_FAIL;

jdoerfert (Done): Use early exits and no else after return. Suggested change:

return AsyncInfoV[device.DeviceID].get();
- } else {
- return new AsyncInfoTy(device);
}
+ return new AsyncInfoTy(device);
}

int AsyncInfoMng::synchronize(AsyncInfoTy &AsyncInfo, bool ForceSync) {

}

// Mapping does not exist, allocate it with refCount=INF

const HostDataToTargetTy &NewEntry =

*HDTTMap

->emplace(new HostDataToTargetTy(

/*HstPtrBase=*/(uintptr_t)HstPtrBegin,

/*HstPtrBegin=*/(uintptr_t)HstPtrBegin,

[... 342 lines not shown ...]

if (LR.Flags.IsContained || LR.Flags.ExtendsBefore || LR.Flags.ExtendsAfter) {

uintptr_t TP = HT.TgtPtrBegin + (HP - HT.HstPtrBegin);

return (void *)TP;

}

return NULL;

}

int DeviceTy::deallocTgtPtr(HDTTMapAccessorTy &HDTTMap, LookupResult LR,

int64_t Size) {

int64_t Size, AsyncInfoTy &AsyncInfo) {

// Check if the pointer is contained in any sub-nodes.

if (!(LR.Flags.IsContained || LR.Flags.ExtendsBefore ||

LR.Flags.ExtendsAfter)) {

REPORT("Section to delete (hst addr " DPxMOD ") does not exist in the"

" allocated memory\n",

DPxPTR(LR.Entry->HstPtrBegin));

return OFFLOAD_FAIL;

}

auto &HT = *LR.Entry;

// Verify this thread is still in charge of deleting the entry.

assert(HT.getTotalRefCount() == 0 &&

HT.getDeleteThreadId() == std::this_thread::get_id() &&

"Trying to delete entry that is in use or owned by another thread.");

// Do synchronization

int Ret = AsyncInfo.synchronize();

if (Ret != OFFLOAD_SUCCESS)

return OFFLOAD_FAIL;

// Delete tgt data

DP("Deleting tgt data " DPxMOD " of size %" PRId64 "\n",

DPxPTR(HT.TgtPtrBegin), Size);

deleteData((void *)HT.TgtPtrBegin);

INFO(OMP_INFOTYPE_MAPPING_CHANGED, DeviceID,

"Removing map entry with HstPtrBegin=" DPxMOD ", TgtPtrBegin=" DPxMOD

", Size=%" PRId64 ", Name=%s\n",

DPxPTR(HT.HstPtrBegin), DPxPTR(HT.TgtPtrBegin), Size,

(HT.HstPtrName) ? getNameFromMapping(HT.HstPtrName).c_str() : "unknown");

void *Event = LR.Entry->getEvent();

HDTTMap->erase(LR.Entry);

delete LR.Entry;

int Ret = OFFLOAD_SUCCESS;

Ret = OFFLOAD_SUCCESS;

if (Event && destroyEvent(Event) != OFFLOAD_SUCCESS) {

REPORT("Failed to destroy event " DPxMOD "\n", DPxPTR(Event));

Ret = OFFLOAD_FAIL;

}

return Ret;

}

[... 220 lines not shown ...]

openmp/libomptarget/src/interface.cpp

[... 15 lines not shown ...]
#include "private.h"		#include "private.h"
#include "rtl.h"		#include "rtl.h"

#include <cassert>		#include <cassert>
#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <mutex>		#include <mutex>

		AsyncInfoMng AIM;

		jdoerfert (Done): We should not have a global like this. It probably belongs in/to the Device object.
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
/// adds requires flags		/// adds requires flags
EXTERN void __tgt_register_requires(int64_t Flags) {		EXTERN void __tgt_register_requires(int64_t Flags) {
TIMESCOPE();		TIMESCOPE();
PM->RTLs.registerRequires(Flags);		PM->RTLs.registerRequires(Flags);
}		}

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
[... 55 lines not shown ...]
#ifdef OMPTARGET_DEBUG
for (int I = 0; I < ArgNum; ++I) {		for (int I = 0; I < ArgNum; ++I) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 ", Name=%s\n",		", Type=0x%" PRIx64 ", Name=%s\n",
I, DPxPTR(ArgsBase[I]), DPxPTR(Args[I]), ArgSizes[I], ArgTypes[I],		I, DPxPTR(ArgsBase[I]), DPxPTR(Args[I]), ArgSizes[I], ArgTypes[I],
(ArgNames) ? getNameFromMapping(ArgNames[I]).c_str() : "unknown");		(ArgNames) ? getNameFromMapping(ArgNames[I]).c_str() : "unknown");
}		}
#endif		#endif

AsyncInfoTy AsyncInfo(Device);		AsyncInfoTy &AsyncInfo = *AIM.get(Device);
int Rc = targetDataBegin(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,		int Rc = targetDataBegin(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,
ArgTypes, ArgNames, ArgMappers, AsyncInfo);		ArgTypes, ArgNames, ArgMappers, AsyncInfo);
if (Rc == OFFLOAD_SUCCESS)		if (Rc == OFFLOAD_SUCCESS)
Rc = AsyncInfo.synchronize();		Rc = AIM.synchronize(AsyncInfo, true);
		jdoerfert (Done): Now this pattern is somewhat unfortunate. You get the AsyncInfo from the AIM and then you need to be careful to call the right synchronize. If the AsyncFlag was global you could move the new "sync" logic into the regular AsyncInfo sync, right? AIM would just be used to manage the map ID -> AsyncInfo. WDYT about this scheme? You could check the env variable once, like we do it for some others: https://github.com/llvm/llvm-project/blob/23bc343855fdf6fb7668abadf2b064034b207981/openmp/libomptarget/src/rtl.cpp#L43
		randreshg (Author, Done): Agreed! The pattern you suggest allows for separating the synchronization logic and AsyncInfo management.
		jdoerfert (Done): Better. Not super happy about the explicit free call but that's fine for now.
handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);		handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);
}		}

EXTERN void __tgt_target_data_begin_nowait_mapper(		EXTERN void __tgt_target_data_begin_nowait_mapper(
ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,		ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,
void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,		void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,
void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,		void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,
void *NoAliasDepList) {		void *NoAliasDepList) {
[... 28 lines not shown ...]
#ifdef OMPTARGET_DEBUG
for (int I = 0; I < ArgNum; ++I) {		for (int I = 0; I < ArgNum; ++I) {
DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64		DP("Entry %2d: Base=" DPxMOD ", Begin=" DPxMOD ", Size=%" PRId64
", Type=0x%" PRIx64 ", Name=%s\n",		", Type=0x%" PRIx64 ", Name=%s\n",
I, DPxPTR(ArgsBase[I]), DPxPTR(Args[I]), ArgSizes[I], ArgTypes[I],		I, DPxPTR(ArgsBase[I]), DPxPTR(Args[I]), ArgSizes[I], ArgTypes[I],
(ArgNames) ? getNameFromMapping(ArgNames[I]).c_str() : "unknown");		(ArgNames) ? getNameFromMapping(ArgNames[I]).c_str() : "unknown");
}		}
#endif		#endif

AsyncInfoTy AsyncInfo(Device);		AsyncInfoTy &AsyncInfo = *AIM.get(Device);
int Rc = targetDataEnd(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,		int Rc = targetDataEnd(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,
ArgTypes, ArgNames, ArgMappers, AsyncInfo);		ArgTypes, ArgNames, ArgMappers, AsyncInfo);
if (Rc == OFFLOAD_SUCCESS)		if (Rc == OFFLOAD_SUCCESS)
Rc = AsyncInfo.synchronize();		Rc = AIM.synchronize(AsyncInfo, true);
handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);		handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);
}		}

EXTERN void __tgt_target_data_end_nowait_mapper(		EXTERN void __tgt_target_data_end_nowait_mapper(
ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,		ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,
void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,		void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,
void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,		void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,
void *NoAliasDepList) {		void *NoAliasDepList) {
[... 16 lines not shown ...]
if (checkDeviceAndCtors(DeviceId, Loc)) {
return;		return;
}		}

if (getInfoLevel() & OMP_INFOTYPE_KERNEL_ARGS)		if (getInfoLevel() & OMP_INFOTYPE_KERNEL_ARGS)
printKernelArguments(Loc, DeviceId, ArgNum, ArgSizes, ArgTypes, ArgNames,		printKernelArguments(Loc, DeviceId, ArgNum, ArgSizes, ArgTypes, ArgNames,
"Updating OpenMP data");		"Updating OpenMP data");

DeviceTy &Device = *PM->Devices[DeviceId];		DeviceTy &Device = *PM->Devices[DeviceId];
AsyncInfoTy AsyncInfo(Device);		AsyncInfoTy &AsyncInfo = *AIM.get(Device);
int Rc = targetDataUpdate(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,		int Rc = targetDataUpdate(Loc, Device, ArgNum, ArgsBase, Args, ArgSizes,
ArgTypes, ArgNames, ArgMappers, AsyncInfo);		ArgTypes, ArgNames, ArgMappers, AsyncInfo);
if (Rc == OFFLOAD_SUCCESS)		if (Rc == OFFLOAD_SUCCESS)
Rc = AsyncInfo.synchronize();		Rc = AIM.synchronize(AsyncInfo, true);
handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);		handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);
}		}

EXTERN void __tgt_target_data_update_nowait_mapper(		EXTERN void __tgt_target_data_update_nowait_mapper(
ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,		ident_t Loc, int64_t DeviceId, int32_t ArgNum, void *ArgsBase,
void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,		void *Args, int64_t ArgSizes, int64_t ArgTypes, map_var_info_t ArgNames,
void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,		void *ArgMappers, int32_t DepNum, void DepList, int32_t NoAliasDepNum,
void *NoAliasDepList) {		void *NoAliasDepList) {
[... 44 lines not shown ...]
#ifdef OMPTARGET_DEBUG
}		}
#endif		#endif

bool IsTeams = NumTeams != -1;		bool IsTeams = NumTeams != -1;
if (!IsTeams)		if (!IsTeams)
NumTeams = 0;		NumTeams = 0;

DeviceTy &Device = *PM->Devices[DeviceId];		DeviceTy &Device = *PM->Devices[DeviceId];
AsyncInfoTy AsyncInfo(Device);		AsyncInfoTy &AsyncInfo = *AIM.get(Device);
int Rc = target(Loc, Device, HostPtr, Args->NumArgs, Args->ArgBasePtrs,		int Rc = target(Loc, Device, HostPtr, Args->NumArgs, Args->ArgBasePtrs,
Args->ArgPtrs, Args->ArgSizes, Args->ArgTypes, Args->ArgNames,		Args->ArgPtrs, Args->ArgSizes, Args->ArgTypes, Args->ArgNames,
Args->ArgMappers, NumTeams, ThreadLimit, Args->Tripcount,		Args->ArgMappers, NumTeams, ThreadLimit, Args->Tripcount,
IsTeams, AsyncInfo);		IsTeams, AsyncInfo);
if (Rc == OFFLOAD_SUCCESS)		if (Rc == OFFLOAD_SUCCESS)
Rc = AsyncInfo.synchronize();		Rc = AIM.synchronize(AsyncInfo);
		jdoerfert (Done): A comment would be nice as this is the one place we do not force synchronization.
		jdoerfert (Done): Multi-line conditionals should have braces. The comment is unhelpful. It states what the code already says, not why the code is this way.
handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);		handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);
assert(Rc == OFFLOAD_SUCCESS && "__tgt_target_kernel unexpected failure!");		assert(Rc == OFFLOAD_SUCCESS && "__tgt_target_kernel unexpected failure!");
return OMP_TGT_SUCCESS;		return OMP_TGT_SUCCESS;
}		}

EXTERN int __tgt_target_kernel_nowait(		EXTERN int __tgt_target_kernel_nowait(
ident_t *Loc, int64_t DeviceId, int32_t NumTeams, int32_t ThreadLimit,		ident_t *Loc, int64_t DeviceId, int32_t NumTeams, int32_t ThreadLimit,
void HostPtr, __tgt_kernel_arguments Args, int32_t DepNum, void *DepList,		void HostPtr, __tgt_kernel_arguments Args, int32_t DepNum, void *DepList,
[... 45 lines not shown ...]

openmp/libomptarget/src/omptarget.cpp

[... 19 lines not shown ...]

#include <cstdint> #include <cstdint>

#include <vector> #include <vector>

using llvm::SmallVector; using llvm::SmallVector;

int AsyncInfoTy::synchronize() { int AsyncInfoTy::synchronize() {

int Result = OFFLOAD_SUCCESS; int Result = OFFLOAD_SUCCESS;

if (AsyncInfo.Queue) { if (AsyncInfo.Queue) {

DP("Device synchronization\n");

// If we have a queue we need to synchronize it now. // If we have a queue we need to synchronize it now.

Result = Device.synchronize(*this); Result = Device.synchronize(*this);

assert(AsyncInfo.Queue == nullptr && assert(AsyncInfo.Queue == nullptr &&

"The device plugin should have nulled the queue to indicate there " "The device plugin should have nulled the queue to indicate there "

"are no outstanding actions!"); "are no outstanding actions!");

ye-luo (Done): This is a mess. Whether synchronization happens depends on a bunch of states. Move the if-statement to the caller side so the logic is directly exposed at the point of use.

jdoerfert (Done): Exposing this to the user side is not only not helpful but will actively harm things. The logic depends on two flags, not "a bunch of states". One is passed by the user, one describes the system setup. This is totally fine.

ye-luo (Done): Two flags = 4 states. Not a low number of variants. The caller invokes synchronize(), but it may or may not do the sync. When should the user set ForceSync to true? Better to have some explanation.

jdoerfert (Done): 4 states, ok. However, they collapse to 2; it's a single conditional after all: synchronize or not.
The rules are fairly simple (one or condition) and documented:

/// Synchronize all pending actions when the LIBOMPTARGET_ASYNC env var
/// is disabled or when synchronization is forced (ForceSync = true).
/// Otherwise, synchronization is skipped.
/// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.
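
The rule discussed in this thread reduces to one boolean expression. The following is a minimal sketch of it, assuming a mock `MockAsyncInfo` type and a free function `maybeSynchronize` (both hypothetical names, not part of libomptarget). It leaves out the object deletion that the patch performs in the non-async path.

```cpp
#include <cassert>

constexpr int OffloadSuccess = 0; // mirrors OFFLOAD_SUCCESS (0)

// Hypothetical stand-in for AsyncInfoTy that counts real synchronizations.
struct MockAsyncInfo {
  int SyncCount = 0;
  int synchronize() {
    ++SyncCount;
    return OffloadSuccess;
  }
};

// Synchronize when async mode is off or when the caller forces it.
int maybeSynchronize(MockAsyncInfo &AI, bool AsyncFlag, bool ForceSync) {
  if (!AsyncFlag || ForceSync)
    return AI.synchronize();
  return OffloadSuccess; // async mode and not forced: skip
}

// Walk all four flag combinations; only (async on, not forced) skips.
void checkAllCombinations() {
  MockAsyncInfo AI;
  maybeSynchronize(AI, /*AsyncFlag=*/false, /*ForceSync=*/false);
  maybeSynchronize(AI, /*AsyncFlag=*/false, /*ForceSync=*/true);
  maybeSynchronize(AI, /*AsyncFlag=*/true, /*ForceSync=*/true);
  assert(AI.SyncCount == 3);
  maybeSynchronize(AI, /*AsyncFlag=*/true, /*ForceSync=*/false);
  assert(AI.SyncCount == 3);
}
```

Three of the four combinations synchronize, which is why the kernel launch path (async enabled, not forced) is the only call site where work can remain outstanding.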

} }

return Result; return Result;

} }

void *&AsyncInfoTy::getVoidPtrLocation() { void *&AsyncInfoTy::getVoidPtrLocation() {

BufferLocations.push_back(nullptr); BufferLocations.push_back(nullptr);

return BufferLocations.back(); return BufferLocations.back();

} }

// Async info manager

thread_local std::vector<std::unique_ptr<AsyncInfoTy>> AsyncInfoMng::AsyncInfoV;

AsyncInfoMng::AsyncInfoMng() {

if (char *EnvStr = getenv("LIBOMPTARGET_ASYNC"))

AsyncFlag = std::stoi(EnvStr) ? true : false;

else

AsyncFlag = false;

DP("Asynchronous execution %s\n", AsyncFlag ? "Enabled" : "Disabled");

}

jdoerfert (Done): Lots of static flags to simply look up an env var once. Why don't we do it the same way as for other env vars?
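
One way to read an environment variable exactly once is a function-local static. The sketch below assumes hypothetical helpers `parseBoolFlag` and `isAsyncEnabled`; it is not the code referenced in rtl.cpp. It also uses `std::strtol` rather than `std::stoi`, because `std::stoi` throws `std::invalid_argument` on non-numeric input.

```cpp
#include <cassert>
#include <cstdlib>

// Parse a numeric flag string; fall back to Default when the string is
// missing or does not start with a number.
bool parseBoolFlag(const char *Str, bool Default) {
  if (!Str)
    return Default;
  char *End = nullptr;
  long Value = std::strtol(Str, &End, 10);
  if (End == Str) // no digits consumed
    return Default;
  return Value != 0;
}

// Hypothetical accessor: the environment is queried once, on first use.
bool isAsyncEnabled() {
  static const bool Enabled =
      parseBoolFlag(std::getenv("LIBOMPTARGET_ASYNC"), /*Default=*/false);
  return Enabled;
}

// Small self-check of the parsing rules.
void checkParsing() {
  assert(parseBoolFlag("1", false));
  assert(!parseBoolFlag("0", true));
  assert(parseBoolFlag(nullptr, true));
  assert(!parseBoolFlag("abc", false));
}
```

Initialization of a function-local static is thread-safe since C++11, so no extra flag or mutex is needed for the lookup itself.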

AsyncInfoTy *AsyncInfoMng::get(DeviceTy &device) {

ye-luo (Done): This allocates with new and then returns a reference to the result. Reference types should not be used to manage ownership.

jdoerfert (Done): What about this:

Replace "get" with

AsyncInfoMng::register(AsyncInfo &AI);

Remove "free".
Register is implemented as a no-op without the AsyncInfo flag set.
Otherwise it'll replace AI with a dynamically allocated one.

ye-luo (Done): I don't get what you mean.
I still expect the AsyncInfoTy object to be destroyed properly at the end of the target region when AsyncInfo=false.

jdoerfert (Done):
> I'm still expecting, AsyncInfoTy object being destroyed properly at the end of the target region, when AsyncInfo=false.

It is properly destroyed with this patch, and it will be with the proposed scheme. In the proposed scheme we have a local AsyncInfo (as we have upstream now) and we use it if AsyncInfo=false.

randreshg (Author, Done): I'm not following this. Could you please provide pseudocode?

if (!AsyncFlag)

jdoerfert (Done), suggested change:

});

}

- AsyncInfoTy *AsyncInfoMng::get(DeviceTy &device) {

+ AsyncInfoTy *AsyncInfoMng::get(DeviceTy &Device) {

if (!AsyncFlag)

return new AsyncInfoTy(device);

jdoerfert (Done), suggested change:

AsyncInfoMng::~AsyncInfoMng() {
- std::map<std::thread::id, AsyncInfoTy *>::iterator it;
- for (it = AsyncInfoM.begin(); it != AsyncInfoM.end(); it++)
- delete (it->second);
- AsyncInfoM.clear();
- }
+ for (const auto &It : AsyncInfoM)
+ delete(It.second);
+ // or use the helper with a name similar to:
+ llvm::DeleteContainerSeconds(AsyncInfoM)
}

AsyncInfoTy *AsyncInfoMng::get() {

//Async execution

if (AsyncInfoV.empty()) {

auto num_devices = omp_get_num_devices();

AsyncInfoV.reserve(num_devices);

jdoerfert (Done), suggested change:

std::lock_guard<std::mutex> MapLock(AsyncMtx);
- std::map<std::thread::id, std::unique_ptr<AsyncInfoTy>>::iterator it = AsyncInfoM.find(std::this_thread::get_id());
+ auto& It = AsyncInfoM.find(std::this_thread::get_id());
if (it == AsyncInfoM.end() || !AsyncInfoM[std::this_thread::get_id()])

for (auto i = 0; i < num_devices; i++)

jdoerfert (Done): You did get an iterator, why do another lookup via `[...]`? You can do a single `[...]` lookup and use the reference result to check and update it.
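
The single-lookup pattern suggested here can be sketched as follows. `Info`, `InfoMap`, and `getOrCreate` are hypothetical stand-ins for the patch's types, and the mutex that the reviewed snippet holds around the map access is omitted for brevity.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <thread>

// Hypothetical stand-in for AsyncInfoTy.
struct Info {
  int Uses = 0;
};

using InfoMap = std::map<std::thread::id, std::unique_ptr<Info>>;

// One operator[] lookup: it inserts a null unique_ptr if the key is absent
// and returns a reference to the slot, which is then checked and filled.
Info &getOrCreate(InfoMap &Map) {
  std::unique_ptr<Info> &Slot = Map[std::this_thread::get_id()];
  if (!Slot)
    Slot = std::make_unique<Info>();
  assert(Slot && "slot is populated at this point");
  return *Slot;
}
```

Calling `getOrCreate` twice from the same thread returns the same object and leaves exactly one entry in the map.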

AsyncInfoV.push_back(nullptr);

}

jdoerfert (Done): And this is the 3rd lookup into the map.

// Get async info

jdoerfert (Done): Do we ever have to clear the map?

if (!AsyncInfoV[device.DeviceID])

AsyncInfoV[device.DeviceID] = std::make_unique<AsyncInfoTy>(device, false);

return AsyncInfoV[device.DeviceID].get();

}

int AsyncInfoMng::synchronize(AsyncInfoTy &AsyncInfo, bool ForceSync) {

ye-luo (Done): I think this is against coding principles: it takes a reference and then deletes the memory behind it.
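
One way to avoid deleting through a reference is to make ownership explicit in a small RAII handle. The sketch below is an illustration only, not the scheme adopted in the patch: `Info`, `persistentInfo`, and `InfoHandle` are hypothetical names, and `std::optional` requires C++17.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for AsyncInfoTy.
struct Info {
  bool Synced = false;
  void synchronize() { Synced = true; }
};

// Thread-local storage owns the long-lived object used in async mode.
Info &persistentInfo() {
  thread_local Info TL;
  return TL;
}

// Owns a local Info in synchronous mode; borrows the thread-local one in
// async mode. Nothing is ever deleted through a reference.
class InfoHandle {
  std::optional<Info> Local; // engaged only in synchronous mode
  Info *Ptr = nullptr;

public:
  explicit InfoHandle(bool AsyncFlag) {
    if (AsyncFlag) {
      Ptr = &persistentInfo();
    } else {
      Local.emplace();
      Ptr = &*Local;
    }
  }
  InfoHandle(const InfoHandle &) = delete;
  InfoHandle &operator=(const InfoHandle &) = delete;

  Info &operator*() {
    assert(Ptr && "handle always refers to an Info");
    return *Ptr;
  }
  bool ownsLocal() const { return Local.has_value(); }
};
```

In synchronous mode the `Info` lives and dies with the handle at the end of the enclosing scope; in async mode the handle only borrows the thread-local object.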

int Rc = OFFLOAD_SUCCESS;

if (!AsyncFlag) {

Rc = AsyncInfo.synchronize();

delete &AsyncInfo;

} else if (ForceSync) {

Rc = AsyncInfo.synchronize();

}

return Rc;

}

void AsyncInfoMng::free(DeviceTy &device) {

jdoerfert (Done): I doubt we need both these lines.
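
For a `std::unique_ptr`, `reset()` and assignment from `nullptr` have the same effect, so either line alone is sufficient. A minimal sketch, using a hypothetical `Tracked` type that counts live instances so that destruction is observable:

```cpp
#include <cassert>
#include <memory>

// Hypothetical type that counts live instances.
struct Tracked {
  static int Live;
  Tracked() { ++Live; }
  ~Tracked() { --Live; }
};
int Tracked::Live = 0;

// Returns the number of live objects after releasing via reset().
int liveAfterReset() {
  auto P = std::make_unique<Tracked>();
  P.reset(); // destroys the object and leaves P null
  assert(!P);
  P = nullptr; // no further effect: P is already null
  return Tracked::Live;
}

// Returns the number of live objects after releasing via nullptr assignment.
int liveAfterNullAssign() {
  auto P = std::make_unique<Tracked>();
  P = nullptr; // same effect as P.reset()
  assert(!P);
  return Tracked::Live;
}
```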

if (AsyncFlag) {

AsyncInfoV[device.DeviceID].reset();

AsyncInfoV[device.DeviceID] = nullptr;

}

/* All begin addresses for partially mapped structs must be 8-aligned in order /* All begin addresses for partially mapped structs must be 8-aligned in order

* to ensure proper alignment of members. E.g. * to ensure proper alignment of members. E.g.

* *

* struct S { * struct S {

* int a; // 4-aligned * int a; // 4-aligned

* int b; // 4-aligned * int b; // 4-aligned

* int *p; // 8-aligned * int *p; // 8-aligned

* } s1; * } s1;

[... 625 lines not shown ...]

} // namespace } // namespace

/// Internal function to undo the mapping and retrieve the data from the device. /// Internal function to undo the mapping and retrieve the data from the device.

int targetDataEnd(ident_t *Loc, DeviceTy &Device, int32_t ArgNum, int targetDataEnd(ident_t *Loc, DeviceTy &Device, int32_t ArgNum,

void **ArgBases, void **Args, int64_t *ArgSizes, void **ArgBases, void **Args, int64_t *ArgSizes,

int64_t *ArgTypes, map_var_info_t *ArgNames, int64_t *ArgTypes, map_var_info_t *ArgNames,

void **ArgMappers, AsyncInfoTy &AsyncInfo, bool FromMapper) { void **ArgMappers, AsyncInfoTy &AsyncInfo, bool FromMapper) {

int Ret; int Ret = OFFLOAD_SUCCESS;

SmallVector<PostProcessingInfo> PostProcessingPtrs; SmallVector<PostProcessingInfo> PostProcessingPtrs;

void *FromMapperBase = nullptr; void *FromMapperBase = nullptr;

// process each input. // process each input.

for (int32_t I = ArgNum - 1; I >= 0; --I) { for (int32_t I = ArgNum - 1; I >= 0; --I) {

// Ignore private variables and arrays - there is no mapping for them. // Ignore private variables and arrays - there is no mapping for them.

// Also, ignore the use_device_ptr directive, it has no effect here. // Also, ignore the use_device_ptr directive, it has no effect here.

if ((ArgTypes[I] & OMP_TGT_MAPTYPE_LITERAL) || if ((ArgTypes[I] & OMP_TGT_MAPTYPE_LITERAL) ||

(ArgTypes[I] & OMP_TGT_MAPTYPE_PRIVATE)) (ArgTypes[I] & OMP_TGT_MAPTYPE_PRIVATE))

[... 142 lines not shown ...]

if ((ArgTypes[I] & OMP_TGT_MAPTYPE_FROM) || DelEntry) {

} }

// Add pointer to the buffer for post-synchronize processing. // Add pointer to the buffer for post-synchronize processing.

PostProcessingPtrs.emplace_back(HstPtrBegin, DataSize, ArgTypes[I], PostProcessingPtrs.emplace_back(HstPtrBegin, DataSize, ArgTypes[I],

DelEntry && !IsHostPtr, TPR); DelEntry && !IsHostPtr, TPR);

} }

// TODO: We should not synchronize here but pass the AsyncInfo object to the

// allocate/deallocate device APIs.

// We need to synchronize before deallocating data.

Ret = AsyncInfo.synchronize();

if (Ret != OFFLOAD_SUCCESS)

return OFFLOAD_FAIL;

// Deallocate target pointer
for (PostProcessingInfo &Info : PostProcessingPtrs) {
// If we marked the entry to be deleted we need to verify no other thread
// reused it by now. If deletion is still supposed to happen by this thread
// LR will be set and exclusive access to the HDTT map will avoid another
// thread reusing the entry now. Note that we do not request (exclusive)
// access to the HDTT map if Info.DelEntry is not set.
LookupResult LR;
DeviceTy::HDTTMapAccessorTy HDTTMap =
Device.HostDataToTargetMap.getExclusiveAccessor(!Info.DelEntry);
if (Info.DelEntry) {
LR = Device.lookupMapping(HDTTMap, Info.HstPtrBegin, Info.DataSize);
if (LR.Entry->getTotalRefCount() != 0 ||
LR.Entry->getDeleteThreadId() != std::this_thread::get_id()) {
// The thread is not in charge of deletion anymore. Give up access to
// the HDTT map and unset the deletion flag.
HDTTMap.destroy();
Info.DelEntry = false;
[... 21 lines not shown; context: auto CB = [&](ShadowPtrListTy::iterator &Itr) { ...]
Device.ShadowPtrMap.erase(OldItr);
} else {
++Itr;
}
return OFFLOAD_SUCCESS;
};
applyToShadowMapEntries(Device, CB, Info.HstPtrBegin, Info.DataSize,
Info.TPR);
// If we are deleting the entry the DataMapMtx is locked and we own the
// entry.
if (Info.DelEntry) {
if (!FromMapperBase || FromMapperBase != Info.HstPtrBegin)

- Ret = Device.deallocTgtPtr(HDTTMap, LR, Info.DataSize);
+ Ret = Device.deallocTgtPtr(HDTTMap, LR, Info.DataSize, AsyncInfo);

if (Ret != OFFLOAD_SUCCESS) {
REPORT("Deallocating data from device failed.\n");
break;
}
[... 647 lines not shown ...]
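The hunk above drops the blocking AsyncInfo.synchronize() call and instead hands AsyncInfo to deallocTgtPtr. How the patch implements that inside deallocTgtPtr is not visible in this excerpt. The following is only a minimal sketch of one way a deallocation could be deferred until the next synchronization point. Every name in it (AsyncQueue, FakeDevice, addPostSyncAction, deallocDeferred, runDemo) is hypothetical and is not libomptarget's API.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async-info object: it records actions that
// may only run once previously enqueued device work has completed.
struct AsyncQueue {
  std::vector<std::function<void()>> Deferred;
  int SyncCount = 0;

  // Register an action to run at the next synchronization point.
  void addPostSyncAction(std::function<void()> Fn) {
    Deferred.push_back(std::move(Fn));
  }

  // Simulate waiting for outstanding work, then run the deferred actions.
  // Returns 0 on success.
  int synchronize() {
    ++SyncCount;
    std::vector<std::function<void()>> Pending;
    Pending.swap(Deferred);
    for (auto &Fn : Pending)
      Fn();
    return 0;
  }
};

// Hypothetical device that only counts live allocations.
struct FakeDevice {
  int LiveAllocations = 0;

  void alloc() { ++LiveAllocations; }

  // Deferred deallocation: nothing is freed until Queue.synchronize() runs.
  // The device must outlive the queue's pending actions.
  void deallocDeferred(AsyncQueue &Queue) {
    Queue.addPostSyncAction([this] { --LiveAllocations; });
  }
};

// Small demonstration: two buffers stay alive until the single sync.
int runDemo() {
  FakeDevice Dev;
  AsyncQueue Queue;
  Dev.alloc();
  Dev.alloc();
  Dev.deallocDeferred(Queue);
  Dev.deallocDeferred(Queue);
  assert(Dev.LiveAllocations == 2); // still allocated, no sync yet
  int Ret = Queue.synchronize();
  assert(Dev.LiveAllocations == 0); // freed after the sync point
  return Ret;
}
```

In this sketch the per-region barrier is replaced by a single synchronization that also releases the buffers, which mirrors the intent stated in the removed TODO comment. Whether the real runtime queues the free on a device stream or uses a callback list like this one is an open assumption.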


Revision Contents

Diff 461250

openmp/docs/optimizations/OpenMPOpt.rst

openmp/libomptarget/include/device.h

openmp/libomptarget/include/omptarget.h

openmp/libomptarget/src/device.cpp

openmp/libomptarget/src/interface.cpp

openmp/libomptarget/src/omptarget.cpp
