This patch adds support for event-related interfaces, which will be used later to fix a data race. See D104418 for more details.
Event Timeline
openmp/libomptarget/include/omptargetplugin.h:160
Please remove AsyncInfo, which is not used.

openmp/libomptarget/include/omptargetplugin.h:169
Please remove AsyncInfo, which is not used.

openmp/libomptarget/include/omptargetplugin.h:172
Please remove AsyncInfo, which is not used. If left in, it only adds confusion.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:1400
I feel it is better not to hard-code 0.
openmp/libomptarget/include/omptargetplugin.h:160
I would keep it for compatibility considerations. Note that we are not writing a library only for CUDA; there might be programming models that require the queue.
openmp/libomptarget/include/omptargetplugin.h:160
If you think the compatibility consideration is real now, please provide details of one case where it is needed. It is not about CUDA; it is better to avoid hypothetical needs. Additional unused arguments add confusion and maintenance cost. With minimal arguments, the interface stays clear and easy to maintain.
openmp/libomptarget/include/omptargetplugin.h:160
That's true if the interface is easy to change. This one doesn't seem to be: we carry around legacy entry points in case they're still in use, at least for some period of time. I wonder if we should start putting version numbers in it to establish a path to dropping parts.

To a fair approximation, every C callback interface should take an additional void*. Otherwise you end up with qsort vs qsort_r when, inevitably, the callback has to do something with state. In this case, the interface is roughly:

    void *async_info_create();
    // do things with the async info
    // awkwardly, some of these make a new instance if passed 0
    void async_info_destroy(void *);

Threading a single void* through the sequence of calls gives the plugin a way of associating them with each other. Given multiple host threads and multiple targets, that is otherwise extremely difficult, hence every function getting an integer ID.

I would guess that eliding the void* will work fine until we want to have two independent asynchronous regions running on the same target at the same time, at which point we won't know which one to 'record_event' onto. That's not an argument for this interface in particular (I haven't thought about whether an event should have an identity independent of whatever it is coordinating), but each function in the __tgt_rtl interface that can do things asynchronously should get the async_info instance it is acting on.
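To make the two-regions point concrete, here is a minimal runnable sketch of the state-threading pattern being argued for. The example_* names, signatures, and stub bodies are illustrative assumptions in the spirit of the __tgt_rtl entry points, not the declarations this patch adds:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical shapes, for illustration only: an opaque per-region handle is
// created by the plugin and threaded through every asynchronous call, so the
// plugin can tell two concurrent regions on the same device apart. The stub
// bodies just trace the calls.
void *example_async_info_create(int32_t Dev) {
  std::printf("create async info on device %d\n", (int)Dev);
  return new int32_t(Dev);
}
int32_t example_record_event(int32_t Dev, void *Event, void *AsyncInfo) {
  std::printf("record event %p on device %d, region %p\n", Event, (int)Dev,
              AsyncInfo);
  return 0;
}
int32_t example_async_info_destroy(int32_t Dev, void *AsyncInfo) {
  (void)Dev;
  delete static_cast<int32_t *>(AsyncInfo);
  return 0;
}

// Two independent asynchronous regions on the same device: without the handle,
// the plugin could not know which in-flight work a record_event call refers to.
int main() {
  void *RegionA = example_async_info_create(0);
  void *RegionB = example_async_info_create(0);
  int EvA = 0, EvB = 0; // stand-ins for opaque event handles
  example_record_event(0, &EvA, RegionA); // unambiguously captures region A
  example_record_event(0, &EvB, RegionB); // unambiguously captures region B
  example_async_info_destroy(0, RegionA);
  example_async_info_destroy(0, RegionB);
  return 0;
}
```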
openmp/libomptarget/include/omptargetplugin.h:157
Is this a means of coordinating across different async objects? Across different targets, or just between a CPU thread and one target? I am trying to guess what the sequence of record / wait / sync might map onto. In particular, it is difficult to map this onto the HSA operations without knowing the CUDA terminology. This is presumably a way of making some kernels wait for others to complete before they start executing, but I'd expect that to be a single API call, probably named 'barrier'.

Does 'waitEvent' mean the GPU waits until all the kernels before it have completed, and 'syncEvent' mean the host thread waits for the same?
openmp/libomptarget/include/omptargetplugin.h:160
Changing an API may never be as easy as one might hope, but that is secondary.
openmp/libomptarget/include/omptargetplugin.h:160
What do wait_event or sync_event do? It sounds like they might involve launching a kernel on the GPU that writes to global memory to indicate it has executed, in which case it probably needs an async info. Destroy event probably has to wait for some resources on the GPU to no longer be in use, which may be at some distant point in the future, in which case the host probably doesn't want to wait for it.

Maybe this one should be renamed QueueWaitEvent or WaitEventInQueue to be more explicit. CUDA has this as cuStreamWaitEvent and categorizes it under Stream Management rather than Event Management.

Wait: attach a wait to the AsyncObj stream so that anything added after it does not run before the event is fulfilled.
Sync: block until the event is fulfilled.
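For readers less familiar with the CUDA terminology, here is a minimal sketch of the driver calls the two operations correspond to, assuming the event and streams already exist and omitting error handling (illustration only, not the plugin code):

```cpp
#include <cuda.h>

// Sketch only: work was enqueued on StreamA, StreamB must not run past the
// event, and the host can also block on it.
void sketch(CUevent Event, CUstream StreamA, CUstream StreamB) {
  // Record: capture the work enqueued on StreamA so far into Event.
  cuEventRecord(Event, StreamA);

  // Wait (cuStreamWaitEvent): work enqueued on StreamB after this call does
  // not start until Event is fulfilled; the host thread does not block here.
  cuStreamWaitEvent(StreamB, Event, 0 /* CU_EVENT_WAIT_DEFAULT */);

  // Sync: the host thread blocks until Event is fulfilled.
  cuEventSynchronize(Event);
}
```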
openmp/libomptarget/include/omptargetplugin.h:160
Remove the AsyncInfo from the ones that do not attach the event to a stream/AsyncInfo object.

Would 'fulfilled' mean that the kernels launched before it have all completed?
That probably involves a barrier packet on amdgpu. It needs to go on the same HSA queue as the associated kernels, which is probably what will be in the async info.
> Sync: block until the event is fulfilled.
That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.
Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.
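To make the barrier-packet idea concrete, a rough sketch of what a queue-side wait could look like on amdgpu, under the assumption that the event is represented by an hsa_signal_t; queue-full handling, fence scopes, and error checking are omitted:

```cpp
#include <hsa/hsa.h>
#include <cstring>

// Sketch: enqueue a barrier-AND packet that depends on EventSignal, so packets
// submitted to Queue after it do not launch until the signal is satisfied.
void enqueueWaitOnSignal(hsa_queue_t *Queue, hsa_signal_t EventSignal) {
  // Reserve a packet slot.
  uint64_t Index = hsa_queue_add_write_index_relaxed(Queue, 1);
  auto *Slot = &reinterpret_cast<hsa_barrier_and_packet_t *>(
      Queue->base_address)[Index & (Queue->size - 1)];

  // Fill the packet body: a single dependency on the event's signal.
  std::memset(Slot, 0, sizeof(*Slot));
  Slot->dep_signal[0] = EventSignal;

  // Publish the packet type last (real code also checks that the queue is not
  // full and marks the slot invalid while writing), then ring the doorbell.
  uint16_t Header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
  __atomic_store_n(&Slot->header, Header, __ATOMIC_RELEASE);
  hsa_signal_store_relaxed(Queue->doorbell_signal, Index);
}
```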
If you think of the event as a kernel on the stream/queue, it is fulfilled if it is "executed".
> Sync: block until the event is fulfilled.
> That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.
> Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.
I don't mind either way, passing it or not. Events are opaque too; you can even reference the AsyncInfo/stream/queue from them if you want. That said, I doubt you need the AsyncInfo to print stuff about the event.
In CUDA we really only need both to make the connection, which happens two times: put the event on the queue, and put a wait for the event in the queue.
If HSA needs it for anything else, we can also change the API before a release without much hassle.

It is actually meaningless. When you pass in an event, that event may or may not have been recorded on the AsyncInfo->Queue passed in. It is not a one-to-one mapping.
openmp/libomptarget/include/omptargetplugin.h:160
Changing plugin interfaces will be very painful. All the changes since 2019 have been to add new interfaces, just to make sure not to break existing applications. If we decide to do it one way and in the future we ask for another, existing applications will be broken. Of course we can then add another interface such as __tgt_rtl_create_event_async, but that is going to be more confusing. Anyway, I'm fine with removing it, but I am still more inclined to leave some room.

> Changing plugin interfaces will be very painful. All the changes since 2019 have been to add new interfaces, just to make sure not to break existing applications.

The interface you're talking about is the one between clang and libomptarget (the __tgt_target_* functions). Here we are dealing with the interface between the base library and the plugins (the __tgt_rtl_* functions), i.e. an internal interface within the library, so user applications have nothing to do with it. We can change it without breaking anything (I assume libomptarget and the plugins are compiled and distributed together; I don't expect anyone to mix and match components from different commits). The only caveat is that if the __tgt_rtl_* interface changes for one plugin, we have to propagate the change to all other plugins, but that shouldn't be a major concern.
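For context on why the __tgt_rtl_* boundary is internal: libomptarget dlopens the plugins shipped alongside it and resolves these symbols at runtime, so the two sides only need to agree within one source tree. A simplified sketch of that binding (the plugin name, symbol, and function type below are assumptions for illustration, not the real loader code):

```cpp
#include <dlfcn.h>
#include <cstdint>

// Simplified sketch of how the base library might bind a plugin entry point.
typedef int32_t (*record_event_ty)(int32_t DeviceId, void *Event,
                                   void *AsyncInfo);

record_event_ty loadRecordEvent() {
  void *Handle = dlopen("libomptarget.rtl.cuda.so", RTLD_NOW);
  if (!Handle)
    return nullptr;
  // If the plugin built in this tree does not export the symbol, the base
  // library just sees a null pointer and can treat the feature as unsupported.
  return reinterpret_cast<record_event_ty>(
      dlsym(Handle, "__tgt_rtl_record_event"));
}
```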
Oh, yeah, you're right. However, I can still remember back in 2019 when I added the *_async series: I did want to change the existing interfaces, but it was rejected.
I think we need to pursue well-defined APIs. I have not seen solid proof that the additional arguments are necessary.
Whether changing the API is easy or not is secondary to me.
openmp/libomptarget/plugins/cuda/src/rtl.cpp:1400
CU_EVENT_WAIT_DEFAULT is an enum value, so we cannot check whether it exists.
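To spell out why the 0 ends up hard-coded: CU_EVENT_WAIT_DEFAULT is an enumerator, not a preprocessor macro, so there is no #ifdef-style guard that can detect whether the header provides it. A small generic illustration of the limitation, using a made-up enumerator rather than cuda.h:

```cpp
// An enumerator behaves like this hypothetical one, not like a #define.
enum example_wait_flags { EXAMPLE_WAIT_DEFAULT = 0x0 };

unsigned int waitFlags() {
#ifdef EXAMPLE_WAIT_DEFAULT
  // Never taken: #ifdef only sees macros; enumerators are invisible to the
  // preprocessor even though the name is perfectly usable in normal code.
  return EXAMPLE_WAIT_DEFAULT;
#else
  // Hence the call site falls back to the enumerator's documented value, 0.
  return 0;
#endif
}
```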
openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
I just realized that this one needs to be [...]. This is probably my last concern.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
That's a good one. Done.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
Should the return type be int, like the others?
openmp/libomptarget/src/device.cpp:560
If there is no event support, should createEvent be called at all? If it should not be called, OFFLOAD_FAIL is better. I don't have deep thoughts on this. What do you think?

openmp/libomptarget/src/device.cpp:560
Calling event-related APIs without actual plugin support would generally be considered undefined behavior, so it is better to avoid such calls. When a device is initialized, we can attempt to create an event. If it remains nullptr, we can flag events as not supported on this device.
openmp/libomptarget/src/device.cpp:560
If a plugin doesn't support the event mechanism, it simply returns OFFLOAD_SUCCESS for all the event interfaces. That is consistent behavior.
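One way the probe-at-init suggestion could look, sketched against a hypothetical device wrapper rather than the actual device.cpp structures:

```cpp
#include <cstdint>

// Hypothetical wrapper, for illustration: probe event support once at device
// init and record the result, so later code can skip the event path entirely
// instead of calling entry points the plugin does not really implement.
struct ExampleDevice {
  int32_t DeviceId = 0;
  bool HasEvents = false;

  // Stand-ins for the plugin's createEvent/destroyEvent entry points, which
  // may be null if the plugin does not export them.
  int32_t (*createEvent)(int32_t, void **) = nullptr;
  int32_t (*destroyEvent)(int32_t, void *) = nullptr;

  void init() {
    void *Probe = nullptr;
    if (createEvent && createEvent(DeviceId, &Probe) == 0 /*OFFLOAD_SUCCESS*/ &&
        Probe != nullptr) {
      HasEvents = true;
      destroyEvent(DeviceId, Probe);
    }
    // Otherwise HasEvents stays false and the event APIs are never invoked.
  }
};
```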
> Is this a means of coordinating across different async objects? Across different targets, or just between a CPU thread and one target? I am trying to guess what the sequence of record / wait / sync might map onto.
> In particular, it is difficult to map this onto the HSA operations without knowing the CUDA terminology. This is presumably a way of making some kernels wait for others to complete before they start executing, but I'd expect that to be a single API call, probably named 'barrier'.