This is an archive of the discontinued LLVM Phabricator instance.

Differential D138389

[OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior
ClosedPublic

Authored by kevinsala on Nov 20 2022, 6:24 PM.

Details

Reviewers

jdoerfert
jhuber6
tianshilei1992
JonChesterfield
josemonsalve2
ye-luo

Group Reviewers

Restricted Project

Commits

rG6bbf9c0cca6f: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior
rG87e6b96b0009: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior

Summary

This patch adds the AMDGPU NextGen plugin inheriting from PluginInterface's classes. It also implements the asynchronous behavior in the plugin operations: kernel launches and memory transfers. To this end, it implements the concept of streams of asynchronous operations. The streams are implemented using the HSA signals to define input and output dependencies between asynchronous operations.

Missing features:

Retrieve the maximum number of threads per group that a kernel can run. This requires reading the image.
Implement __tgt_rtl_sync_event, not used on the libomptarget side.
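
As a rough illustration of the stream concept described in the summary (not the patch's code), the following host-only sketch chains operations so that each one takes the previous operation's output signal as its input dependency. `MockSignal`, `StreamSlot`, and `MockStream` are hypothetical names, and an atomic counter stands in for an HSA signal; a real implementation would hand the signals to the HSA runtime instead of draining the stream on the host.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for an HSA signal: 1 means pending, 0 means done.
struct MockSignal {
  std::atomic<int64_t> Value{1};
  void complete() { Value.store(0, std::memory_order_release); }
  bool isDone() const { return Value.load(std::memory_order_acquire) == 0; }
};

// One asynchronous operation in the stream.
struct StreamSlot {
  std::function<void()> Work;
  MockSignal *Input = nullptr; // Output signal of the previous operation.
  MockSignal Output;           // Completed once this operation finishes.
};

class MockStream {
  std::deque<StreamSlot> Slots;
  std::size_t NextToRun = 0;

public:
  // Append an operation that depends on the one pushed before it.
  void push(std::function<void()> Work) {
    MockSignal *Input = Slots.empty() ? nullptr : &Slots.back().Output;
    Slots.emplace_back();
    StreamSlot &Slot = Slots.back();
    Slot.Work = std::move(Work);
    Slot.Input = Input;
  }

  std::size_t pending() const { return Slots.size() - NextToRun; }

  // Simulates the device: runs operations in order and signals completion.
  void drain() {
    for (; NextToRun < Slots.size(); ++NextToRun) {
      StreamSlot &Slot = Slots[NextToRun];
      assert(!Slot.Input || Slot.Input->isDone());
      Slot.Work();
      Slot.Output.complete();
    }
  }
};
```

Pushing two operations and then calling `drain()` runs them in push order, because the second operation's input signal is the first one's output signal.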

Event Timeline

kevinsala created this revision. Nov 20 2022, 6:24 PM

Herald added a project: Restricted Project. Nov 20 2022, 6:24 PM

Herald added subscribers: kosarev, kerbowa, guansong and 5 others.

kevinsala requested review of this revision. Nov 20 2022, 6:24 PM

Herald added subscribers: openmp-commits, sstefan1, wdng. Nov 20 2022, 6:24 PM

Harbormaster completed remote builds in B198700: Diff 476782. Nov 20 2022, 6:27 PM

Looks good overall to me, mostly nits. I'm not the most qualified to scrutinize the HSA usage, maybe @JonChesterfield or @saiislam could help on that front.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
44	This is declared as a struct here then defined as a class later.
65	Empty argument isn't needed in C++17
149	These can be `StringRef` since they're constants. Avoids the heap allocation.
293	I'm not deeply familiar with the `MemoryManager`, but I thought that we didn't use it for AMD because the HSA plugin already handled its own caching in the memory pool? Maybe @JonChesterfield and @tianshilei1992 could comment further.
416	Never looked into it much myself, but does `std::atomic` allow us to avoid using a compiler builtin for this? Not a big deal if not.
537–538	Small nit, can use LLVM's string, it's basically just a small vector of chars.
1010	typo
1200	Probably best to specify the template arguments.
1573	Unused
1713	Using `const_cast` isn't great but otherwise the compiler will complain with warnings on.
1734	Same here.
1761	Nit, no else after return
2049	Is this supposed to be marked override?
2177	Best make a new one here as this was checked above.
2298	Nit. no else after return.
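
To make the comment on line 149 concrete, here is a small sketch of a string constant held as a non-owning view instead of a `std::string`. It uses `std::string_view` as a stand-in for `llvm::StringRef` so that it compiles without LLVM headers; the names `KernelDescriptorSuffix` and `makeKernelDescriptorName` are hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// A view over a string literal: no heap allocation, no runtime constructor.
// std::string_view is used here in place of llvm::StringRef.
constexpr std::string_view KernelDescriptorSuffix = ".kd";

// Builds "<kernel name>.kd"; only the returned string owns memory.
inline std::string makeKernelDescriptorName(std::string_view KernelName) {
  std::string Name(KernelName);
  Name.append(KernelDescriptorSuffix);
  return Name;
}
```

The constant itself never allocates; an owning string is created only at the point where a new name actually has to be built.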

There's a lot of code here. Some of it very familiar, in copy/paste fashion from the current plugin. Some of it quite subtle in terms of whether it's going to work or not. Not an easy thing to review successfully in one block.

How was this tested? What's the intended lifetime plan for new and old plugins (it looks like cuda's old one is still with us)?

In D138389#3942377, @JonChesterfield wrote:

There's a lot of code here. Some of it very familiar, in copy/paste fashion from the current plugin. Some of it quite subtle in terms of whether it's going to work or not. Not an easy thing to review successfully in one block.

I fondly remember when we merged the AMD plugin, which is to this day full of dead code... That said, @kevinsala, everything that is not AMD related should be split off. Make one or more pre-commits to update interfaces etc. of the generic parts.

How was this tested?

We need a configuration for our regression tests. I thought we had one (or a patch) already. @kevinsala @jhuber6

What's the intended lifetime plan for new and old plugins (it looks like cuda's old one is still with us)?

The next release will have both. After we branch we'll delete the old plugins. Like we did with the runtime.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/GlobalHandler.cpp
136 ↗	(On Diff #476782)	@kevinsala: Split these changes into a separate pre-commit please.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
570 ↗	(On Diff #476782)	@kevinsala: Split these changes into a separate pre-commit please.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
764 ↗	(On Diff #476782)	@kevinsala: Split these changes into a separate pre-commit please.
openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
1004 ↗	(On Diff #476782)	@kevinsala: Split these changes into a separate pre-commit please.
openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp
371 ↗	(On Diff #476782)	@kevinsala: Split these changes into a separate pre-commit please.

jplehr added a subscriber: jplehr. Nov 23 2022, 6:49 AM

jdoerfert added inline comments. Nov 23 2022, 11:10 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
293	I'm generally in favor of using it either way.
335
416	We'd need to bend over backwards as the definitions are plain pointers. Let's keep it this way.
674	Pick a size that is at least the default Capacity.
830	Isn't this a problem? Shouldn't we wait instead? Same in other places. Let's mark it as TODO and address it in a follow up.
1252–1253	CUDA + AMD
1462–1463	4 x 512 x 128 might be enough

ye-luo added a subscriber: ye-luo. Nov 23 2022, 11:21 AM

ye-luo added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1462	What is QUEUE_SIZE? Prefer not to use SIZE but NUM_XXX_PER_QUEUE
1463	What is the STREAM_SIZE?
1464	LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES?

jdoerfert added inline comments. Nov 23 2022, 2:41 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1464	We need documentation for those in openmp/docs, next to the other env vars.

How was this tested?

We need a configuration for our regression tests. I thought we had one (or a patch) already. @kevinsala @jhuber6

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
830	Yes, there is no problem waiting on it. Even so, this check will go away with the new version of dynamically sized streams.
1464	I'll add the documentation for these envars. They control the following aspects:
`LIBOMPTARGET_AMDGPU_QUEUE_SIZE`: The number of HSA packets that can be pushed into each HSA queue without waiting for the driver to process them.
`LIBOMPTARGET_AMDGPU_STREAM_SIZE`: The number of asynchronous operations (e.g., kernel launches, memory transfers) that can be pushed into each of our streams without waiting for them to complete. With the upcoming patch implementing dynamically sized streams, this envar will be renamed and become a hint for the initial stream size.
`LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE`: Up to this size, memory copies (`__tgt_rtl_data_submit`, `__tgt_rtl_data_retrieve`) are asynchronous operations appended to the corresponding stream. Larger transfers become synchronous. This can be seen in `dataSubmitImpl` (1695).
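
The size threshold described for `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE` boils down to a single comparison. The sketch below is an assumption about how such a decision could look, not the code in `dataSubmitImpl`; `CopyMode` and `chooseCopyMode` are hypothetical names, and treating the limit as inclusive is a guess based on the wording "up to this size".

```cpp
#include <cassert>
#include <cstdint>

enum class CopyMode { Async, Sync };

// Hypothetical helper: copies of at most MaxAsyncCopyBytes are appended to the
// stream as asynchronous operations; anything larger is done synchronously.
// The inclusive comparison is an assumption, not confirmed by the review.
constexpr CopyMode chooseCopyMode(uint64_t Bytes, uint64_t MaxAsyncCopyBytes) {
  return Bytes <= MaxAsyncCopyBytes ? CopyMode::Async : CopyMode::Sync;
}
```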

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

This plugin introduces new functionality which seems unlikely to be covered by existing make check, particularly given quite a lot of that is disabled on amdgpu anyway.

AMD's aomp and rocm forks have more aggressive runtime testing but presumably don't make heavy use of async given it doesn't exist before this patch.

Code review isn't likely to catch problems in thousands of lines of async c++. Can we split this into a non-functional change where the current plugin is refactored to use the new interface, or equivalently split all the async feature gain out of this to get a more boring alternative plugin which is expected to behave identically to the current one? That lets us distinguish between unintentional behaviour changes (when the boring/refactor doesn't work) and intentional ones (which ideally would have tests).

The big bang replace a large functional unit with a totally new one that behaves differently tends to leave intel and amd's product forks behind, which is unfortunate when upstream testing is relatively superficial and code review struggles to spot bugs in large commits.

ye-luo added inline comments. Nov 25 2022, 10:08 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1464	Thanks for adding docs. What I found in the HSA doc is "Number of packets the queue is expected to hold". This is more understandable than your line. I think your description is true but that is a second level interpretation. I think both lines can be useful in the doc. LIBOMPTARGET_AMDGPU_QUEUE_SIZE is an insufficient name. LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE is better IMO. A STREAM may have many contents with different sizes. STREAM_SIZE is too vague. Better to have something like ASYNC_OP_DEPTH_PER_STREAM. Still prefer COPY_BYTES.

In D138389#3950741, @JonChesterfield wrote:

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

This plugin introduces new functionality which seems unlikely to be covered by existing make check, particularly given quite a lot of that is disabled on amdgpu anyway.

AMD's aomp and rocm forks have more aggressive runtime testing but presumably don't make heavy use of async given it doesn't exist before this patch.

Code review isn't likely to catch problems in thousands of lines of async c++. Can we split this into a non-functional change where the current plugin is refactored to use the new interface, or equivalently split all the async feature gain out of this to get a more boring alternative plugin which is expected to behave identically to the current one? That lets us distinguish between unintentional behaviour changes (when the boring/refactor doesn't work) and intentional ones (which ideally would have tests).

The big bang replace a large functional unit with a totally new one that behaves differently tends to leave intel and amd's product forks behind, which is unfortunate when upstream testing is relatively superficial and code review struggles to spot bugs in large commits.

Intel and AMD not upstreaming their tests is not our problem per se. I'd love for them to do it but they would also need to upstream/use the functionality in their plugins. They had 2+ years to do that, "waiting" doesn't help. Maybe if we diverge more they finally have some incentive to work closer to/with upstream.
We should not apply two different sets of review rules only to make it easier for AMD. Your initial drop of the "old" plugin (https://reviews.llvm.org/D85742) has all the problems you listed above and more (including lots more lines of dead and untested code). Most of the things you argue are bad about this patch are literally stated as the state of the "old plugin" in the commit message of D85742. Still, then the sentiment was to merge it "as-is and then improve it in-tree", with an explanation of why that is better.
The splits of non-AMD related parts and the actual AMD plugin was already part of my prior review and I'm sure @kevinsala will address the last of those comments soon (=before we merge this).

kosarev added a reviewer: Restricted Project. Nov 28 2022, 10:42 AM

arsenm added a subscriber: arsenm. Nov 28 2022, 11:18 AM

arsenm added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138	Do we not have this in a shared place already?

jhuber6 added inline comments. Nov 28 2022, 11:20 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138	I believe we have some support in `clang` somewhere, we could probably move that to somewhere in `llvm` and include it here instead.

jdoerfert added inline comments. Nov 28 2022, 1:26 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138	I used the github search and the only place I found that looks for the sramecc+ string is the existing AMD plugin (same function basically): https://github.com/llvm/llvm-project/blob/a35ad711d90497994701a99723a81badf3d4348e/openmp/libomptarget/plugins/amdgpu/src/rtl.cpp#L1855

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138	If you find it fine, I can move it to a new utils header in `plugins-nextgen/amdgpu/` and be included by both AMDGPU plugins.

In D138389#3954160, @jdoerfert wrote:

Intel and AMD not upstreaming their tests is not our problem per se. I'd love for them to do it but they would also need to upstream/use the functionality in their plugins. They had 2+ years to do that, "waiting" doesn't help. Maybe if we diverge more they finally have some incentive to work closer to/with upstream.

Yeah, likewise. There was a period where trunk, rocm and aomp all had exactly the same implementation for this plugin which was hard won. After the initial drop I mean, I spent a few weeks to pull them back into exact alignment and encouraged the internal devs to patch trunk instead. Didn't take. Today's status is the trunk plugin looks pretty much like the rocm one used to, but the rocm one has a bunch of stuff changed in it for a new code object ABI and some variant of asynchronous offloading. Further divergence seems unlikely to persuade the people who prefer working internally to stop doing that.

We should not apply two different sets of review rules only to make it easier for AMD. Your initial drop of the "old" plugin (https://reviews.llvm.org/D85742) has all the problems you listed above and more (including lots more lines of dead and untested code). Most of the things you argue are bad about this patch are literally stated as the state of the "old plugin" in the commit message of D85742. Still, then the sentiment was to merge it "as-is and then improve it in-tree", with an explanation of why that is better.

Pretty much. The version today is smaller and saner than the initial drop but not refactored to a fixpoint by any stretch. This suffered from me handing it over to @pdhaliwal who did good work before moving on to greener pastures, and I haven't really picked it back up. I'm not clear why various AMD devs decided to patch the rocm version instead of upstream - gentle encouragement and declining to review things on the internal board hasn't really changed anything there.

One thing the initial drop of non-review-compliant and suspected buggy code did have going for it was that it passed a bunch of application testing when run alongside a broadly similar clang toolchain. So it didn't come with in tree tests to keep it known working but at least it started out OK. I would guess we can do similar here - land the new plugin without tests, and manually feed it through various levels of internal testing and report back what works. That'll probably suffice, painfully, for getting it to functional equivalence with the current plugin.

If there's a load of tests for the asynchronous behaviour that landed for cuda (which I didn't notice at the time) then we can xfail the ones that don't work on amdgpu and (ideally) fix those over time. I'd quite like the plugins to be unit tested, as opposed to running whole openmp programs through them, but am not sure what can be done in terms of infra for that. Feels like we might need a mock/stub GPU to tie it to, which perhaps works better with more of the plugin code shared than is currently the case.

A few comments above. General case, I don't see the benefit of getting constructs like:
hsa_signal_t get() const { return Signal; }
can we either not allow access to the private member, or just make it public and drop the getter?
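
For readers without the diff at hand, the construct under discussion looks roughly like the sketch below: a class that owns a raw handle privately and exposes it through `get()`. `MockHsaSignal` is a hypothetical stand-in for `hsa_signal_t` (a struct holding a 64-bit handle) so the example builds without the HSA headers, and `SignalWrapper` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for hsa_signal_t, which wraps a 64-bit handle.
struct MockHsaSignal {
  uint64_t handle;
};

// The handle stays private; callers that need the raw value must go through
// get(), which makes every place that touches the handle easy to find.
class SignalWrapper {
  MockHsaSignal Signal{0};

public:
  explicit SignalWrapper(MockHsaSignal S) : Signal(S) {}
  MockHsaSignal get() const { return Signal; }
};
```

The alternative raised in the comment is to make `Signal` a public member and drop `get()`; both forms give callers the same read access, so the choice is mostly about naming consistency and how easy it is to audit uses of the handle.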

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182	this is pointless, just make the member public if we have to leak it. but again, existing code...
221	this is a scary comment, though presumably in existing code
293	IIUC the HSA interface with 'memory pool' in the name doesn't implement a memory pool so caching it here is probably faster. I expect it'll interact badly with the asan development effort but that was rocm-internal last time I looked at it so can't influence decisions here
394	does this behave sensibly on wrap around? the arithmetic used elsewhere is
	static uint64_t acquireAvailablePacketId(hsa_queue_t *Queue) {
	  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
	  bool Full = true;
	  while (Full) {
	    Full = PacketId >= (Queue->size + hsa_queue_load_read_index_scacquire(Queue));
	  }
	  return PacketId;
	}
453	This is strange. The whole point of the queue abstraction is that multiple threads can enqueue work onto it at the same time, so why would we be locking it?
482	Is {0} a meaningful value for a signal? e.g. can we pass it to signal_destroy or similar without it falling over?
500	comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad
508	This looks racy. If a few threads are calling decrement and one calls reset, it's going to break.
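
The wrap-around question on line 394 can be explored with a host-only model of the queue indices. In the sketch below, `MockQueue` and its helpers are hypothetical; the read and write indices are 64-bit counters that only ever increase, and the slot in the ring buffer is derived by masking, which assumes the queue size is a power of two.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a packet queue: indices grow monotonically and are
// mapped onto Size slots. Size is assumed to be a power of two.
struct MockQueue {
  uint32_t Size;
  std::atomic<uint64_t> WriteIndex{0};
  std::atomic<uint64_t> ReadIndex{0};
};

// Reserves the next packet id for the calling thread.
inline uint64_t reservePacketId(MockQueue &Q) {
  return Q.WriteIndex.fetch_add(1, std::memory_order_relaxed);
}

// A reserved id may be written once it is less than Size ahead of the read
// index; this is the negation of the "Full" condition quoted above.
inline bool hasFreeSlot(const MockQueue &Q, uint64_t PacketId) {
  return PacketId < Q.Size + Q.ReadIndex.load(std::memory_order_acquire);
}

// Maps a monotonically increasing packet id to its slot in the ring buffer.
inline uint32_t slotFor(const MockQueue &Q, uint64_t PacketId) {
  return static_cast<uint32_t>(PacketId & (Q.Size - 1));
}
```

Because the ids themselves are never reduced modulo the size, only the slot computation wraps; the fullness check compares two counters that grow in step.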

JonChesterfield added inline comments. Nov 29 2022, 4:06 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
726	Confused by this, why are we locking the queue?
969	passing queue.getAgent() as two arguments here seems suspicious - shouldn't one of these be the GPU and one the host?
1932	I can't work out what's going on here. The corresponding logic for erase looks up the pointer directly, should found not be the same? Also can't tell why we're recording the size of the allocation next to the pointer, as opposed to a DenseSet<void*>
1974	this is implemented as a tree - why std::map?

kevinsala added inline comments. Dec 3 2022, 4:27 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
221	I can implement it, but I would perform this check only in debug mode.
394	I obtained that method from the HSA standard document.
453	Multiple threads may enqueue work onto the queue concurrently thanks to the atomic operations to retrieve a packet slot and publishing/ringing the doorbell. But, if I understood correctly, we shouldn't allow publishing a packet (ringing the doorbell with that packet id) if there is any previous packet in the queue that is not ready to be processed yet. For instance, this sequence will be invalid according to my understanding:

	Thread 1 (T1):                                  Thread 2 (T2):
	========                                        ========
	1. Gets packet slot 0
	2. Fills packet 0's fields                      Gets packet slot 1
	3.                                              Fills packet 1's fields
	4.                                              Publish packet writing doorbell with id 1
	5. Publish packet writing doorbell with id 0

	Probably, we could reduce the locked section or remove the lock, by preventing T2 (or any thread in the generalized case) from ringing the doorbell if any previous packet is not ready. But I didn't find that critical currently. If we see this lock is a bottleneck, we can try to optimize it.
482	We shouldn't call destroy with the {0} handle. I can add asserts.
500	True, that's something I was experimenting with. I'll change it back to blocked wait.
508	Reset is expected to be called when the signal is not being used by any other thread nor the HSA runtime. It should be called just before being reused.
1932	The `__tgt_rtl_data_delete` operation should pass the same pointer provided by the `__tgt_rtl_data_alloc`. As far as I know, it's not valid to make a partial deletion of an allocated buffer. But in the case of `__tgt_rtl_data_submit/retrieve`, the pointer can come with an applied offset. Thus, we should check whether the provided pointer is inside any host allocation, considering the sizes of the allocations.
1974	To perform `lower_bound` operations with logarithmic complexity.
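
The lookup described for lines 1932 and 1974 can be sketched with an ordered map keyed by base address. This is an illustration under assumptions, not the patch's data structure: `HostAllocationMap` is a hypothetical name, and the sketch uses `upper_bound` followed by a step back, whereas the reply mentions `lower_bound`; both are logarithmic on `std::map`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical registry of host allocations: base address -> size in bytes.
class HostAllocationMap {
  std::map<std::uintptr_t, std::size_t> Allocations;

public:
  void insert(const void *Base, std::size_t Size) {
    Allocations[reinterpret_cast<std::uintptr_t>(Base)] = Size;
  }

  // Deletion must use the exact base pointer, so a direct key lookup suffices.
  bool erase(const void *Base) {
    return Allocations.erase(reinterpret_cast<std::uintptr_t>(Base)) == 1;
  }

  // Submit/retrieve may pass a pointer with an offset applied, so find the
  // allocation with the greatest base <= Ptr and check Ptr against its size.
  bool contains(const void *Ptr) const {
    auto Addr = reinterpret_cast<std::uintptr_t>(Ptr);
    auto It = Allocations.upper_bound(Addr); // First base greater than Addr.
    if (It == Allocations.begin())
      return false;
    --It;
    return Addr - It->first < It->second;
  }
};
```

A hash-based set of base pointers would answer the exact-match case faster, but it cannot answer the "is this offset pointer inside some allocation" query without scanning, which is the reason given in the thread for the ordered container.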

jdoerfert added inline comments. Dec 3 2022, 6:48 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182	It's not pointless, it gives a common name for abstracted resources. Often this would be `operator*` or similar but `get` is as good as anything for now. Let's not go down rabbit holes that are indeed pointless.
500	I'm not sure which way to go. Add a TODO to try this out. I'm uncertain if the host thread can do something useful while it's not busy waiting.
508	Add that to the function comment please.
726	I think that's related to the answer above. Add a TODO to look into this.
1932	We might want to make our lives easier here and not do a search. Later we want two changes that will help: Have a pre-allocated pinned buffer for arguments. Use the hsa lookup function if it's not a pointer to pinned memory allocated by us. @kevinsala You think we need the search for the current use cases?

kevinsala added inline comments. Dec 4 2022, 1:13 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182	I don't think it's pointless; we limit the access to the HSA handle and only expose it when it's really needed. And we can also detect easily where these handles are being accessed.
1932	We can replace the search by a call to `hsa_amd_pointer_info` and let the HSA runtime do the work. If the buffer is explicitly locked by the user (malloc + HSA lock), it should return `HSA_EXT_POINTER_TYPE_LOCKED`. If the buffer was allocated using the HSA allocator functions, I guess it will return `HSA_EXT_POINTER_TYPE_HSA`. This latter does not mean that the buffer is host pinned memory because it may also be any other kind of memory allocated through HSA API. But we can assume the user won't pass such invalid buffer types.

This has fared OK under ad hoc testing but we're struggling with moving code between branches, preference would be to ship this ~ as-is and iterate in place. Going to mark it accepted for what it's worth, but probably want a confirmation from @jdoerfert before merging.

Thank you for digging through the HSA interface to put this together!

This revision is now accepted and ready to land. Dec 7 2022, 1:32 PM

I'll accept it under the expectation that the comments Johannes and I made get addressed.

I'm working on fixing all your comments, allowing the streams to change their size based on the demand of pushed async operations, and also re-using their HSA signals. These last aspects should reduce the number of signals we need at a given time. I'll update the patch soon. Thanks for the reviews!

jhuber6 mentioned this in D139730: [OpenMP][DeviceRTL][AMDGPU] Support code object version 5. Dec 9 2022, 11:11 AM

Rebasing onto main to include several pre-commits. Fixing several of the reviewers' issues. Also making streams dynamically sized (std::deque to avoid invalidating elements on insertions). In this way, we can re-use signals to reduce signal consumption through a signal resource manager.

There are some issues to fix:

There is still one error when running the tests. The stream hangs with the recent changes regarding dynamically sized streams. Currently under investigation.
Missing documentation on openmp/docs. We can open a separate patch for that.
Missing std::atomic_thread_fence inside the stream implementation. May be related to the issue mentioned above.
Some reviewers' comments not fixed yet. Currently working on it.
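
The remark about `std::deque` in the update above relies on a standard guarantee: inserting at either end of a deque invalidates iterators but leaves references and pointers to existing elements valid. The sketch below checks that property; `AsyncSlot` and `dequeKeepsElementAddresses` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical slot type standing in for one asynchronous operation.
struct AsyncSlot {
  int Id;
};

// Grows a deque at the back and reports whether the first element is still at
// the same address with the same contents afterwards.
inline bool dequeKeepsElementAddresses(std::size_t ExtraSlots) {
  std::deque<AsyncSlot> Slots;
  Slots.push_back(AsyncSlot{0});
  const AsyncSlot *First = &Slots.front();
  for (std::size_t I = 1; I <= ExtraSlots; ++I)
    Slots.push_back(AsyncSlot{static_cast<int>(I)});
  return First == &Slots.front() && First->Id == 0;
}
```

With `std::vector`, growth may reallocate and move every element, so a pointer to a slot that an in-flight operation still refers to could dangle; that is the hazard the deque avoids.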

Harbormaster completed remote builds in B202612: Diff 482161. Dec 12 2022, 9:29 AM

kevinsala added a parent revision: D139792: [OpenMP][libomptarget] Add utility header for AMDGPU plugins. Dec 12 2022, 9:32 AM

kevinsala marked 38 inline comments as done and an inline comment as not done. Dec 12 2022, 10:03 AM

kevinsala added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138	Opened a separate patch for that: https://reviews.llvm.org/D139792
500	Kept as BLOCKED waiting and added a TODO comment about it.
537–538	For some reason, that doesn't seem to work. It's printing strange characters after `".kd"`, even if I add a `\0`. I'm looking into it.
726	Added a long comment about it on top of the Stream's `std::mutex` data member.
969	I'm looking into it. The documentation of `hsa_amd_memory_async_copy` in the header doesn't give much detail on the meaning of those two parameters or on the consequences of passing GPU/CPU agents. I passed the GPU agent as both parameters, as the original AMDGPU plugin does. If we pass the CPU agent as the src/dst agent, we will need to allow the CPU agent access to each of the device allocations. Otherwise, it will fail to comply with "the agent must be able to directly access both the source and destination buffers in their current locations" (header doc).
1464	Current names:
`LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES` (default: 8)
`LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE` (default: 1024)
`LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` (default: 1*1024*1024, 1MB)
`LIBOMPTARGET_AMDGPU_NUM_INITIAL_SIGNALS` (default: 64), probably better to name `HSA_SIGNALS` instead of `SIGNALS`
The description of each one appears on top of the envar's declarations. I'll add documentation on openmp/docs in another patch.
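
Reading such an envar with a fallback default can be sketched as follows. This is not how the plugin does it (the patch has its own envar declarations, which are not shown in this thread); `readUInt64Envar` is a hypothetical helper built only on the C standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns the envar parsed as an unsigned decimal number,
// or Default if it is unset, empty, negative, or has trailing characters.
// Overflow handling is omitted to keep the sketch short.
inline uint64_t readUInt64Envar(const char *Name, uint64_t Default) {
  const char *Raw = std::getenv(Name);
  if (!Raw || *Raw == '\0' || *Raw == '-')
    return Default;
  char *End = nullptr;
  unsigned long long Value = std::strtoull(Raw, &End, 10);
  if (End == Raw || *End != '\0')
    return Default;
  return static_cast<uint64_t>(Value);
}
```

For example, `readUInt64Envar("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 1024)` would yield 1024 when the variable is unset, matching the default listed above.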

jdoerfert added inline comments. Dec 12 2022, 10:55 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2296	Nit early exit first.

Fixing format with clang-format and renaming envar to LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS.

Harbormaster completed remote builds in B202631: Diff 482193. Dec 12 2022, 11:02 AM

kevinsala added inline comments. Dec 14 2022, 12:41 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1688	Should be a return

Rebasing and several fixes:

Fixing hang in streams (thanks @jdoerfert for finding it!)
Fixing reuse of signals
Fixing dataSubmitImpl when host pinned memory is detected
Implementing queryAsyncImpl introduced in https://reviews.llvm.org/D132005
Adding changes to be compatible with https://reviews.llvm.org/D139792
Other minor fixes

Harbormaster completed remote builds in B203387: Diff 483232. Dec 15 2022, 10:43 AM

Checking whether agents can actually access a memory pool (debug mode)
Fixing two asserts in the events' implementation

Harbormaster completed remote builds in B203479: Diff 483347. Dec 15 2022, 2:33 PM

Adding minor fix in eventHandler.

kevinsala marked an inline comment as done. Dec 15 2022, 2:45 PM

kevinsala added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
221	Now we're checking it when debug is enabled.

Harbormaster completed remote builds in B203482: Diff 483351. Dec 15 2022, 2:48 PM

Let's get this in so we can iterate on it.

My requests have been addressed.

Closed by commit rG87e6b96b0009: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior (authored by kevinsala). Dec 15 2022, 3:31 PM

This revision was automatically updated to reflect the committed changes.

kevinsala added a commit: rG87e6b96b0009: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior.

Seems that it caused this problem:
https://github.com/llvm/llvm-project/issues/59543

could you please fix it asap or revert this?

kevinsala added a reverting change: rGa66826a23381: Revert "[OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous…. Dec 16 2022, 2:54 AM

Removing changes from dynamic_hsa and moving them to https://reviews.llvm.org/D140213

Harbormaster completed remote builds in B203604: Diff 483517. Dec 16 2022, 6:39 AM

kevinsala mentioned this in D140213: [OpenMP][libomptarget] Add missing symbols in dynamic_hsa. Dec 16 2022, 6:41 AM

kevinsala added a commit: rG6bbf9c0cca6f: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior. Dec 16 2022, 3:02 PM

Revision Contents

Path	Size
openmp/libomptarget/plugins-nextgen/CMakeLists.txt	1 line
openmp/libomptarget/plugins-nextgen/amdgpu/CMakeLists.txt	107 lines
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp	2454 lines
openmp/libomptarget/plugins/amdgpu/dynamic_hsa/hsa.h	63 lines

Diff 482161

openmp/libomptarget/plugins-nextgen/CMakeLists.txt

  else(LIBOMPTARGET_DEP_LIBFFI_FOUND)
  libomptarget_say("Not building ${tmachine_name} NextGen offloading plugin: libffi dependency not found.")
  endif(LIBOMPTARGET_DEP_LIBFFI_FOUND)
  else()
  libomptarget_say("Not building ${tmachine_name} NextGen offloading plugin: machine not found in the system.")
  endif()
  endmacro()

  add_subdirectory(aarch64)
+ add_subdirectory(amdgpu)
  add_subdirectory(cuda)
  add_subdirectory(ppc64)
  add_subdirectory(ppc64le)
  add_subdirectory(x86_64)

  # Make sure the parent scope can see the plugins that will be created.
  set(LIBOMPTARGET_SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}" PARENT_SCOPE)
  set(LIBOMPTARGET_TESTED_PLUGINS "${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)

openmp/libomptarget/plugins-nextgen/amdgpu/CMakeLists.txt

This file was added.

				##===----------------------------------------------------------------------===##
				#
				# The LLVM Compiler Infrastructure
				#
				# This file is dual licensed under the MIT and the University of Illinois Open
				# Source Licenses. See LICENSE.txt for details.
				#
				##===----------------------------------------------------------------------===##
				#
				# Build a plugin for an AMDGPU machine if available.
				#
				##===----------------------------------------------------------------------===##

				################################################################################
				set(LIBOMPTARGET_BUILD_AMDGPU_PLUGIN TRUE CACHE BOOL
				"Whether to build AMDGPU plugin")
				if (NOT LIBOMPTARGET_BUILD_AMDGPU_PLUGIN)
				libomptarget_say("Not building AMDGPU NextGen offloading plugin: LIBOMPTARGET_BUILD_AMDGPU_PLUGIN is false")
				return()
				endif()

				# as of rocm-3.7, hsa is installed with cmake packages and kmt is found via hsa
				find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)

				if(NOT CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)\|(ppc64le)\|(aarch64)$" AND CMAKE_SYSTEM_NAME MATCHES "Linux")
				libomptarget_say("Not building AMDGPU NextGen plugin: only support AMDGPU in Linux x86_64, ppc64le, or aarch64 hosts")
				return()
				endif()

				################################################################################
				# Define the suffix for the runtime messaging dumps.
				add_definitions(-DTARGET_NAME=AMDGPU)

				# Define debug prefix. TODO: This should be automatized in the Debug.h but it
				# requires changing the original plugins.
				add_definitions(-DDEBUG_PREFIX="TARGET AMDGPU RTL")

				if(CMAKE_SYSTEM_PROCESSOR MATCHES "(ppc64le)\|(aarch64)$")
				add_definitions(-DLITTLEENDIAN_CPU=1)
				endif()

				if(CMAKE_BUILD_TYPE MATCHES Debug)
				add_definitions(-DDEBUG)
				endif()

				set(LIBOMPTARGET_DLOPEN_LIBHSA OFF)
				option(LIBOMPTARGET_FORCE_DLOPEN_LIBHSA "Build with dlopened libhsa" ${LIBOMPTARGET_DLOPEN_LIBHSA})

				if (${hsa-runtime64_FOUND} AND NOT LIBOMPTARGET_FORCE_DLOPEN_LIBHSA)
				libomptarget_say("Building AMDGPU NextGen plugin linked against libhsa")
				set(LIBOMPTARGET_EXTRA_SOURCE)
				set(LIBOMPTARGET_DEP_LIBRARIES hsa-runtime64::hsa-runtime64)
				else()
				libomptarget_say("Building AMDGPU NextGen plugin for dlopened libhsa")
				include_directories(../../plugins/amdgpu/dynamic_hsa)
				set(LIBOMPTARGET_EXTRA_SOURCE ../../plugins/amdgpu/dynamic_hsa/hsa.cpp)
				set(LIBOMPTARGET_DEP_LIBRARIES)
				endif()

				if(CMAKE_SYSTEM_NAME MATCHES "FreeBSD")
				# On FreeBSD, the 'environ' symbol is undefined at link time, but resolved by
				# the dynamic linker at runtime. Therefore, allow the symbol to be undefined
				# when creating a shared library.
				set(LDFLAGS_UNDEFINED "-Wl,--allow-shlib-undefined")
				else()
				set(LDFLAGS_UNDEFINED "-Wl,-z,defs")
				endif()

				add_llvm_library(omptarget.rtl.amdgpu.nextgen SHARED
				src/rtl.cpp
				${LIBOMPTARGET_EXTRA_SOURCE}

				ADDITIONAL_HEADER_DIRS
				${LIBOMPTARGET_INCLUDE_DIR}
				${CMAKE_CURRENT_SOURCE_DIR}/utils

				LINK_COMPONENTS
				Support
				Object

				LINK_LIBS
				PRIVATE
				elf_common
				MemoryManager
				PluginInterface
				${LIBOMPTARGET_DEP_LIBRARIES}
				${OPENMP_PTHREAD_LIB}
				"-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/../exports"
				${LDFLAGS_UNDEFINED}

				NO_INSTALL_RPATH
				)
				add_dependencies(omptarget.rtl.amdgpu.nextgen omptarget.devicertl.amdgpu)

				target_include_directories(
				omptarget.rtl.amdgpu.nextgen
				PRIVATE
				${LIBOMPTARGET_INCLUDE_DIR}
				${CMAKE_CURRENT_SOURCE_DIR}/utils
				)


				# Install plugin under the lib destination folder.
				install(TARGETS omptarget.rtl.amdgpu.nextgen LIBRARY DESTINATION "${OPENMP_INSTALL_LIBDIR}")
				set_target_properties(omptarget.rtl.amdgpu.nextgen PROPERTIES
				INSTALL_RPATH "$ORIGIN" BUILD_RPATH "$ORIGIN:${CMAKE_CURRENT_BINARY_DIR}/.."
				CXX_VISIBILITY_PRESET protected)

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

This file was added.

//===----RTLs/amdgpu/src/rtl.cpp - Target RTLs Implementation ----- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// RTL NextGen for AMDGPU machine

//===----------------------------------------------------------------------===//

#include <cassert>

#include <cstddef>

#include <deque>

#include <hsa.h>

#include <hsa_ext_amd.h>

#include <mutex>

#include <shared_mutex>

#include <string>

#include <unistd.h>

#include <unordered_map>

#include "Debug.h"

#include "DeviceEnvironment.h"

#include "GlobalHandler.h"

#include "PluginInterface.h"

#include "Utilities.h"

#include "UtilitiesRTL.h"

#include "llvm/ADT/SmallString.h"

#include "llvm/ADT/StringRef.h"

#include "llvm/BinaryFormat/ELF.h"

#include "llvm/Frontend/OpenMP/OMPConstants.h"

#include "llvm/Frontend/OpenMP/OMPGridValues.h"

namespace llvm {

namespace omp {

namespace target {

namespace plugin {

/// Forward declarations for all specialized data structures.

struct AMDGPUKernelTy;

struct AMDGPUDeviceTy;

jhuber6Unsubmitted

Done

This is declared as a struct here then defined as a class later.

struct AMDGPUPluginTy;

struct AMDGPUStreamTy;

struct AMDGPUEventTy;

struct AMDGPUStreamManagerTy;

struct AMDGPUEventManagerTy;

struct AMDGPUDeviceImageTy;

struct AMDGPUMemoryManagerTy;

struct AMDGPUMemoryPoolTy;

namespace utils {

/// Iterate elements using an HSA iterate function. Do not use this function

/// directly but the specialized ones below instead.

template <typename ElemTy, typename IterFuncTy, typename CallbackTy>

hsa_status_t iterate(IterFuncTy Func, CallbackTy Cb) {

auto L = [](ElemTy Elem, void *Data) -> hsa_status_t {

CallbackTy *Unwrapped = static_cast<CallbackTy *>(Data);

return (*Unwrapped)(Elem);

};

return Func(L, static_cast<void *>(&Cb));

}

jhuber6Unsubmitted

Done

Empty argument isn't needed in C++17

/// Iterate elements using an HSA iterate function passing a parameter. Do not

/// use this function directly but the specialized ones below instead.

template <typename ElemTy, typename IterFuncTy, typename IterFuncArgTy,

typename CallbackTy>

hsa_status_t iterate(IterFuncTy Func, IterFuncArgTy FuncArg, CallbackTy Cb) {

auto L = [](ElemTy Elem, void *Data) -> hsa_status_t {

CallbackTy *Unwrapped = static_cast<CallbackTy *>(Data);

return (*Unwrapped)(Elem);

};

return Func(FuncArg, L, static_cast<void *>(&Cb));

}

/// Iterate elements using an HSA iterate function passing a parameter. Do not

/// use this function directly but the specialized ones below instead.

template <typename Elem1Ty, typename Elem2Ty, typename IterFuncTy,

typename IterFuncArgTy, typename CallbackTy>

hsa_status_t iterate(IterFuncTy Func, IterFuncArgTy FuncArg, CallbackTy Cb) {

auto L = [](Elem1Ty Elem1, Elem2Ty Elem2, void *Data) -> hsa_status_t {

CallbackTy *Unwrapped = static_cast<CallbackTy *>(Data);

return (*Unwrapped)(Elem1, Elem2);

};

return Func(FuncArg, L, static_cast<void *>(&Cb));

}

/// Iterate agents.

template <typename CallbackTy> Error iterateAgents(CallbackTy Callback) {

hsa_status_t Status = iterate<hsa_agent_t>(hsa_iterate_agents, Callback);

return Plugin::check(Status, "Error in hsa_iterate_agents: %s");

}

/// Iterate ISAs of an agent.

template <typename CallbackTy>

Error iterateAgentISAs(hsa_agent_t Agent, CallbackTy Cb) {

hsa_status_t Status = iterate<hsa_isa_t>(hsa_agent_iterate_isas, Agent, Cb);

return Plugin::check(Status, "Error in hsa_agent_iterate_isas: %s");

}

/// Iterate memory pools of an agent.

template <typename CallbackTy>

Error iterateAgentMemoryPools(hsa_agent_t Agent, CallbackTy Cb) {

hsa_status_t Status = iterate<hsa_amd_memory_pool_t>(

hsa_amd_agent_iterate_memory_pools, Agent, Cb);

return Plugin::check(Status,

"Error in hsa_amd_agent_iterate_memory_pools: %s");

}

} // namespace utils
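
The `iterate` helpers above adapt a C++ callable to a C-style callback API: a captureless lambda serves as the function pointer, and the real callable travels through the `void *` user-data argument. A minimal, self-contained sketch of that pattern follows. `status_t`, `agent_t`, `iterate_agents`, and `collectHandles` are hypothetical stand-ins defined here for illustration, not HSA types or functions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for an HSA-style C API (not the real HSA types).
enum status_t { STATUS_SUCCESS = 0, STATUS_ERROR = 1 };
struct agent_t {
  int handle;
};

// C-style iteration function: invokes Callback once per element and stops
// early if the callback returns anything other than STATUS_SUCCESS.
status_t iterate_agents(status_t (*Callback)(agent_t, void *), void *Data) {
  for (int I = 0; I < 4; ++I) {
    status_t Status = Callback(agent_t{I}, Data);
    if (Status != STATUS_SUCCESS)
      return Status;
  }
  return STATUS_SUCCESS;
}

// Same shape as the helper in the patch: the captureless lambda converts to a
// plain function pointer and forwards to the C++ callable stored behind Data.
template <typename ElemTy, typename IterFuncTy, typename CallbackTy>
status_t iterate(IterFuncTy Func, CallbackTy Cb) {
  auto L = [](ElemTy Elem, void *Data) -> status_t {
    CallbackTy *Unwrapped = static_cast<CallbackTy *>(Data);
    return (*Unwrapped)(Elem);
  };
  return Func(L, static_cast<void *>(&Cb));
}

// Usage: collect every handle with a capturing lambda.
std::vector<int> collectHandles() {
  std::vector<int> Handles;
  iterate<agent_t>(iterate_agents, [&](agent_t Agent) {
    Handles.push_back(Agent.handle);
    return STATUS_SUCCESS;
  });
  return Handles;
}
```

Because `Cb` lives on the stack frame of `iterate`, this pattern only suits APIs that invoke the callback synchronously before returning, as the mock does.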

/// Utility class representing generic resource references to AMDGPU resources.

template <typename ResourceTy>

struct AMDGPUResourceRef : public GenericDeviceResourceRef {

/// Create an empty reference to an invalid resource.

AMDGPUResourceRef() : Resource(nullptr) {}

/// Create a reference to an existing resource.

AMDGPUResourceRef(ResourceTy *Resource) : Resource(Resource) {}

/// Create a new resource and save the reference. The reference must be empty

/// before calling this function.

Error create(GenericDeviceTy &Device) override;

/// Destroy the referenced resource and invalidate the reference. The reference

/// must be to a valid resource before calling this function.

Error destroy(GenericDeviceTy &Device) override {

if (!Resource)

return Plugin::error("Destroying an invalid resource");

if (auto Err = Resource->deinit())

return Err;

delete Resource;

arsenmUnsubmitted

Done

Do we not have this in a shared place already?

jhuber6Unsubmitted

Done

I believe we have some support in clang somewhere, we could probably move that to somewhere in llvm and include it here instead.

jdoerfertUnsubmitted

Done

I used the github search and the only place I found that looks for the sramecc+ string is the existing AMD plugin (same function basically):
https://github.com/llvm/llvm-project/blob/a35ad711d90497994701a99723a81badf3d4348e/openmp/libomptarget/plugins/amdgpu/src/rtl.cpp#L1855

kevinsalaAuthorUnsubmitted

Done

If you find it fine, I can move it to a new utils header in plugins-nextgen/amdgpu/ and be included by both AMDGPU plugins.

kevinsalaAuthorUnsubmitted

Done

Opened a separate patch for that: https://reviews.llvm.org/D139792

Resource = nullptr;

return Plugin::success();

}

/// Get the underlying resource pointer.

operator ResourceTy *() const { return Resource; }

private:

/// The reference to the actual resource.

ResourceTy *Resource;

jhuber6Unsubmitted

Done

if (Features.contains("sramecc+")) {

- FeatureMap.insert(std::pair<std::string, bool>("sramecc", true));

+ FeatureMap.insert(std::pair<StringRef, bool>("sramecc", true));

} else if (Features.contains("sramecc-")) {

These can be StringRef since they're constants. Avoids the heap allocation.

};

/// Class holding an HSA memory pool.

struct AMDGPUMemoryPoolTy {

/// Create a memory pool from an HSA memory pool.

AMDGPUMemoryPoolTy(hsa_amd_memory_pool_t MemoryPool)

: MemoryPool(MemoryPool), GlobalFlags(0) {}

/// Initialize the memory pool retrieving its properties.

Error init() {

if (auto Err = getAttr(HSA_AMD_MEMORY_POOL_INFO_SEGMENT, Segment))

return Err;

if (auto Err = getAttr(HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, GlobalFlags))

return Err;

return Plugin::success();

}

/// Getter of the HSA memory pool.

hsa_amd_memory_pool_t get() const { return MemoryPool; }

/// Indicate if it belongs to the global segment.

bool isGlobal() const { return (Segment == HSA_AMD_SEGMENT_GLOBAL); }

/// Indicate if it is fine-grained memory. Valid only for global.

bool isFineGrained() const {

assert(isGlobal() && "Not global memory");

return (GlobalFlags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED);

}

/// Indicate if it is coarse-grained memory. Valid only for global.

bool isCoarseGrained() const {

JonChesterfieldUnsubmitted

Done

this is pointless, just make the member public if we have to leak it. but again, existing code...

jdoerfertUnsubmitted

Done

It's not pointless, it gives a common name for abstracted resources. Often this would be operator* or similar but get is as good as anything for now. Let's not go down rabbit holes that are indeed pointless.

kevinsalaAuthorUnsubmitted

Done

I don't think it's pointless; we limit the access to the HSA handle and only expose it when it's really needed. And we can also detect easily where these handles are being accessed.

assert(isGlobal() && "Not global memory");

return (GlobalFlags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_COARSE_GRAINED);

}

/// Indicate if it supports storing kernel arguments. Valid only for global.

bool supportsKernelArgs() const {

assert(isGlobal() && "Not global memory");

return (GlobalFlags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_KERNARG_INIT);

}

/// Allocate memory on the memory pool.

Error allocate(size_t Size, void **PtrStorage) {

hsa_status_t Status =

hsa_amd_memory_pool_allocate(MemoryPool, Size, 0, PtrStorage);

return Plugin::check(Status, "Error in hsa_amd_memory_pool_allocate: %s");

}

/// Return memory to the memory pool.

Error deallocate(void *Ptr) {

hsa_status_t Status = hsa_amd_memory_pool_free(Ptr);

return Plugin::check(Status, "Error in hsa_amd_memory_pool_free: %s");

}

/// Allow the device to access a specific allocation.

Error enableAccess(void *Ptr, int64_t Size,

const llvm::SmallVector<hsa_agent_t> &Agents) const {

// TODO: Ensure it is possible to enable the access. This can be retrieved

// through HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS. If it is not possible,

// enabling the access results in undefined behavior.

// We can access but it is disabled by default. Enable the access then.

hsa_status_t Status =

hsa_amd_agents_allow_access(Agents.size(), Agents.data(), nullptr, Ptr);

return Plugin::check(Status, "Error in hsa_amd_agents_allow_access: %s");

}

private:

/// Get attribute from the memory pool.

template <typename Ty>

JonChesterfieldUnsubmitted

Done

this is a scary comment, though presumably in existing code

kevinsalaAuthorUnsubmitted

Done

I can implement it, but I would perform this check only in debug mode.

kevinsalaAuthorUnsubmitted

Done

Now we're checking it when debug is enabled.

Error getAttr(hsa_amd_memory_pool_info_t Kind, Ty &Value) const {

hsa_status_t Status;

Status = hsa_amd_memory_pool_get_info(MemoryPool, Kind, &Value);

return Plugin::check(Status, "Error in hsa_amd_memory_pool_get_info: %s");

}

/// Get attribute from the memory pool relating to an agent.

template <typename Ty>

Error getAttr(hsa_agent_t Agent, hsa_amd_agent_memory_pool_info_t Kind,

Ty &Value) const {

hsa_status_t Status;

Status =

hsa_amd_agent_memory_pool_get_info(Agent, MemoryPool, Kind, &Value);

return Plugin::check(Status,

"Error in hsa_amd_agent_memory_pool_get_info: %s");

}

/// The HSA memory pool.

hsa_amd_memory_pool_t MemoryPool;

/// The segment the memory pool belongs to.

hsa_amd_segment_t Segment;

/// The global flags of memory pool. Only valid if the memory pool belongs to

/// the global segment.

uint32_t GlobalFlags;

};

/// Class that implements a memory manager that gets memory from a specific

/// memory pool.

struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {

/// Create an empty memory manager.

AMDGPUMemoryManagerTy() : MemoryPool(nullptr), MemoryManager(nullptr) {}

/// Initialize the memory manager from a memory pool.

Error init(AMDGPUMemoryPoolTy &MemoryPool) {

const uint32_t Threshold = 1 << 30;

this->MemoryManager = new MemoryManagerTy(*this, Threshold);

this->MemoryPool = &MemoryPool;

return Plugin::success();

}

/// Deinitialize the memory manager and free its allocations.

Error deinit() {

assert(MemoryManager && "Invalid memory manager");

// Delete and invalidate the memory manager. At this point, the memory

// manager will deallocate all its allocations.

delete MemoryManager;

MemoryManager = nullptr;

return Plugin::success();

}

/// Reuse or allocate memory through the memory manager.

Error allocate(size_t Size, void **PtrStorage) {

assert(MemoryManager && "Invalid memory manager");

assert(PtrStorage && "Invalid pointer storage");

*PtrStorage = MemoryManager->allocate(Size, nullptr);

if (*PtrStorage == nullptr)

return Plugin::error("Failure to allocate from AMDGPU memory manager");

return Plugin::success();

}

/// Release an allocation to be reused.

Error deallocate(void *Ptr) {

assert(Ptr && "Invalid pointer");

if (MemoryManager->free(Ptr))

jhuber6Unsubmitted

Done

I'm not deeply familiar with the MemoryManager, but I thought that we didn't use it for AMD because the HSA plugin already handled its own caching in the memory pool? Maybe @JonChesterfield and @tianshilei1992 could comment further.

jdoerfertUnsubmitted

Done

I'm generally in favor of using it either way.

JonChesterfieldUnsubmitted

Done

IIUC the HSA interface with 'memory pool' in the name doesn't implement a memory pool so caching it here is probably faster. I expect it'll interact badly with the asan development effort but that was rocm-internal last time I looked at it so can't influence decisions here

return Plugin::error("Failure to deallocate from AMDGPU memory manager");

return Plugin::success();

}

private:

/// Allocation callback that will be called once the memory manager does not

/// have more previously allocated buffers.

void *allocate(size_t Size, void *HstPtr, TargetAllocTy Kind) override;

/// Deallocation callback that will be called by the memory manager.

int free(void *TgtPtr, TargetAllocTy Kind) override {

if (auto Err = MemoryPool->deallocate(TgtPtr)) {

consumeError(std::move(Err));

return OFFLOAD_FAIL;

}

return OFFLOAD_SUCCESS;

}

/// The memory pool used to allocate memory.

AMDGPUMemoryPoolTy *MemoryPool;

/// Reference to the actual memory manager.

MemoryManagerTy *MemoryManager;

};

/// Class implementing the AMDGPU device images' properties.

struct AMDGPUDeviceImageTy : public DeviceImageTy {

/// Create the AMDGPU image with the id and the target image pointer.

AMDGPUDeviceImageTy(int32_t ImageId, const __tgt_device_image *TgtImage)

: DeviceImageTy(ImageId, TgtImage) {}

/// Prepare and load the executable corresponding to the image.

Error loadExecutable(const AMDGPUDeviceTy &Device);

/// Unload the executable.

Error unloadExecutable() {

hsa_status_t Status = hsa_executable_destroy(Executable);

if (auto Err = Plugin::check(Status, "Error in hsa_executable_destroy: %s"))

return Err;

Status = hsa_code_object_destroy(CodeObject);

jdoerfertUnsubmitted

Done

AMDGPUDeviceImageTy(int32_t ImageId, const __tgt_device_image *TgtImage)

- : DeviceImageTy(ImageId, TgtImage) /*, Module(nullptr)*/ {}

+ : DeviceImageTy(ImageId, TgtImage) {}

/// Prepare and load the executable corresponding to the image.

return Plugin::check(Status, "Error in hsa_code_object_destroy: %s");

}

/// Get the executable.

hsa_executable_t getExecutable() const { return Executable; }

/// Find an HSA device symbol by its name on the executable.

Expected<hsa_executable_symbol_t>

findDeviceSymbol(GenericDeviceTy &Device, StringRef SymbolName) const;

private:

/// The executable loaded on the agent.

hsa_executable_t Executable;

hsa_code_object_t CodeObject;

};

/// Class implementing the AMDGPU kernel functionalities which derives from the

/// generic kernel class.

struct AMDGPUKernelTy : public GenericKernelTy {

/// Create an AMDGPU kernel with a name and an execution mode.

AMDGPUKernelTy(const char *Name, OMPTgtExecModeFlags ExecutionMode)

: GenericKernelTy(Name, ExecutionMode),

ImplicitArgsSize(sizeof(utils::impl_implicit_args_t)) {}

/// Initialize the AMDGPU kernel.

Error initImpl(GenericDeviceTy &Device, DeviceImageTy &Image) override {

AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(Image);

// Kernel symbols have a ".kd" suffix.

std::string KernelName(getName());

KernelName += ".kd";

// Find the symbol on the device executable.

auto SymbolOrErr = AMDImage.findDeviceSymbol(Device, KernelName);

if (!SymbolOrErr)

return SymbolOrErr.takeError();

hsa_executable_symbol_t Symbol = *SymbolOrErr;

hsa_symbol_kind_t SymbolType;

hsa_status_t Status;

// Retrieve different properties of the kernel symbol.

std::pair<hsa_executable_symbol_info_t, void *> RequiredInfos[] = {

{HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &SymbolType},

{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &KernelObject},

{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &ArgsSize},

{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &GroupSize},

{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &PrivateSize}};

for (auto &Info : RequiredInfos) {

Status = hsa_executable_symbol_get_info(Symbol, Info.first, Info.second);

if (auto Err = Plugin::check(

Status, "Error in hsa_executable_symbol_get_info: %s"))

return Err;

}

// Make sure it is a kernel symbol.

if (SymbolType != HSA_SYMBOL_KIND_KERNEL)

return Plugin::error("Symbol %s is not a kernel function", KernelName.c_str());

JonChesterfieldUnsubmitted

Done

does this behave sensibly on wrap around? the arithmetic used elsewhere is

static uint64_t acquireAvailablePacketId(hsa_queue_t *Queue) {
  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
  bool Full = true;
  while (Full) {
    Full =
        PacketId >= (Queue->size + hsa_queue_load_read_index_scacquire(Queue));
  }
  return PacketId;
}

kevinsalaAuthorUnsubmitted

Done

I obtained that method from the HSA standard document.
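
The wrap-around question in this thread can be checked with plain unsigned arithmetic. The sketch below compares the subtraction form used by `acquirePacket` in this patch (`PacketId - ReadIndex >= Size`) with the addition form quoted in the comment (`PacketId >= Size + ReadIndex`). The function names are made up for illustration, and both assume the write index is logically ahead of (or equal to) the read index.

```cpp
#include <cassert>
#include <cstdint>

// Subtraction form: unsigned subtraction yields the modular distance between
// the two indices, so it stays meaningful even if an index has wrapped.
bool isFullSubtract(uint64_t PacketId, uint64_t ReadIndex, uint64_t Size) {
  return PacketId - ReadIndex >= Size;
}

// Addition form: Size + ReadIndex can itself overflow when ReadIndex is close
// to UINT64_MAX, which changes the result of the comparison.
bool isFullAdd(uint64_t PacketId, uint64_t ReadIndex, uint64_t Size) {
  return PacketId >= Size + ReadIndex;
}

// Slot of a packet in a ring whose size is a power of two.
uint64_t slotIndex(uint64_t PacketId, uint64_t Size) {
  return PacketId & (Size - 1);
}
```

Far from the wrap point the two forms agree. Near `UINT64_MAX` they can differ: with `Size = 8`, `ReadIndex = UINT64_MAX - 3` and `PacketId = UINT64_MAX - 1`, only two packets are outstanding, yet the addition form reports the queue as full. A 64-bit index is practically unreachable in real runs, so this is a boundary-case observation rather than a bug report against either variant.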

// TODO: Read the kernel descriptor for the max threads per block. May be

// read from the image.

return Plugin::success();

}

/// Launch the AMDGPU kernel function.

Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,

uint64_t NumBlocks, uint32_t DynamicMemorySize,

int32_t NumKernelArgs, void *KernelArgs,

AsyncInfoWrapperTy &AsyncInfoWrapper) const override;

/// The default number of blocks is common to the whole device.

uint64_t getDefaultNumBlocks(GenericDeviceTy &GenericDevice) const override {

return GenericDevice.getDefaultNumBlocks();

}

/// The default number of threads is common to the whole device.

uint32_t getDefaultNumThreads(GenericDeviceTy &GenericDevice) const override {

return GenericDevice.getDefaultNumThreads();

}

jhuber6Unsubmitted

Done

Never looked into it much myself, but does std::atomic allow us to avoid using a compiler builtin for this? Not a big deal if not.

jdoerfertUnsubmitted

Done

We'd need to bend over backwards as the definitions are plain pointers. Let's keep it this way.
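
The code this exchange refers to is not visible in this excerpt, so the following is only a sketch of the general point: when a C API exposes plain integer fields, the GCC/Clang `__atomic_*` builtins can operate on them directly, whereas `std::atomic` would require the field itself to have an atomic type. `packet_t`, `publish`, and `readHeader` are hypothetical names, not part of HSA or of this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C-style packet with plain (non-atomic) fields.
struct packet_t {
  uint16_t header;
  uint16_t setup;
};

// Fill the body first, then publish the header with release ordering so a
// reader that observes the header also observes the earlier writes.
void publish(packet_t *Packet, uint16_t Header, uint16_t Setup) {
  Packet->setup = Setup;
  __atomic_store_n(&Packet->header, Header, __ATOMIC_RELEASE);
}

// Read the header with acquire ordering; pairs with the release store above.
uint16_t readHeader(packet_t *Packet) {
  return __atomic_load_n(&Packet->header, __ATOMIC_ACQUIRE);
}
```

This relies on compiler builtins available in GCC and Clang; it is not portable standard C++.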

/// Get group and private segment kernel size.

uint32_t getGroupSize() const { return GroupSize; }

uint32_t getPrivateSize() const { return PrivateSize; }

/// Get the HSA kernel object representing the kernel function.

uint64_t getKernelObject() const { return KernelObject; }

private:

/// The kernel object to execute.

uint64_t KernelObject;

/// The args, group and private segment sizes required by a kernel instance.

uint32_t ArgsSize;

uint32_t GroupSize;

uint32_t PrivateSize;

/// The size of implicit kernel arguments.

const uint32_t ImplicitArgsSize;

};

/// Class representing an HSA signal. Signals are used to define dependencies

/// between asynchronous operations: kernel launches and memory transfers.

struct AMDGPUSignalTy {

/// Create an empty signal.

AMDGPUSignalTy() : Signal({0}), UseCount() {}

AMDGPUSignalTy(AMDGPUDeviceTy &Device) : Signal({0}), UseCount() {}

/// Initialize the signal with an initial value.

Error init(uint32_t InitialValue = 1) {

hsa_status_t Status =

hsa_amd_signal_create(InitialValue, 0, nullptr, 0, &Signal);

return Plugin::check(Status, "Error in hsa_amd_signal_create: %s");

}

/// Deinitialize the signal.

Error deinit() {

JonChesterfieldUnsubmitted

Done

This is strange. The whole point of the queue abstraction is that multiple threads can enqueue work onto it at the same time, so why would we be locking it?

kevinsalaAuthorUnsubmitted

Done

Multiple threads may enqueue work onto the queue concurrently thanks to the atomic operations to retrieve a packet slot and publishing/ringing the doorbell. But, if I understood correctly, we shouldn't allow publishing a packet (ringing the doorbell with that packet id) if there is any previous packet in the queue that is not ready to be processed yet. For instance, this sequence will be invalid according to my understanding:

   Thread 1 (T1):                               Thread 2 (T2):
   ========                                     ========
1. Gets packet slot 0
2. Fills packet 0's fields                      Gets packet slot 1
3.                                              Fills packet 1's fields
4.                                              Publish packet writing doorbell with id 1
5. Publish packet writing doorbell with id 0

Probably, we could reduce the locked section or remove the lock, by preventing T2 (or any thread in the generalized case) from ringing the doorbell if any previous packet is not ready. But I didn't find that critical currently. If we see this lock is a bottleneck, we can try to optimize it.
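
The locking approach described in this comment can be sketched with a mock queue: holding one mutex across "get slot, fill packet, ring doorbell" means packet ids are always published in increasing order. `MockQueue` and `publishesInOrder` are hypothetical and only model the ordering argument; they do not use the HSA API.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical mock of a queue with a doorbell; not the HSA API.
struct MockQueue {
  std::mutex Mutex;
  uint64_t WriteIndex = 0;
  std::vector<uint64_t> DoorbellLog; // Packet ids in publication order.

  void push() {
    // The lock covers the whole sequence, so no thread can ring the doorbell
    // for packet N+1 while packet N is still being filled.
    std::lock_guard<std::mutex> Lock(Mutex);
    uint64_t PacketId = WriteIndex++; // 1. Get a packet slot.
    // 2. Fill the packet fields (omitted in this mock).
    DoorbellLog.push_back(PacketId);  // 3. Ring the doorbell with this id.
  }
};

// Run several threads and check that ids were published as 0, 1, 2, ...
bool publishesInOrder(unsigned NumThreads, unsigned PushesPerThread) {
  MockQueue Queue;
  std::vector<std::thread> Threads;
  for (unsigned T = 0; T < NumThreads; ++T)
    Threads.emplace_back([&Queue, PushesPerThread] {
      for (unsigned I = 0; I < PushesPerThread; ++I)
        Queue.push();
    });
  for (auto &Thread : Threads)
    Thread.join();
  for (uint64_t I = 0; I < Queue.DoorbellLog.size(); ++I)
    if (Queue.DoorbellLog[I] != I)
      return false;
  return Queue.DoorbellLog.size() == uint64_t(NumThreads) * PushesPerThread;
}
```

The cost is that all producers serialize on the mutex, which is the trade-off the comment acknowledges when it mentions shrinking or removing the locked section later.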

hsa_status_t Status = hsa_signal_destroy(Signal);

return Plugin::check(Status, "Error in hsa_signal_destroy: %s");

}

/// Wait until the signal gets a zero value.

Error wait() const {

// TODO: Is it better to use busy waiting or blocking the thread?

while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,

UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)

;

return Plugin::success();

}

/// Load the value on the signal.

hsa_signal_value_t load() const { return hsa_signal_load_scacquire(Signal); }

/// Signal decrementing by one.

void signal() {

assert(load() > 0 && "Invalid signal value");

hsa_signal_subtract_screlease(Signal, 1);

}

/// Reset the signal value before reusing the signal. Do not call this

/// function if the signal is being currently used by any watcher, such as a

/// plugin thread or the HSA runtime.

void reset() { hsa_signal_store_screlease(Signal, 1); }

/// Increase the number of concurrent uses.

void increaseUseCount() { UseCount.increase(); }

JonChesterfieldUnsubmitted

Not Done

Is {0} a meaningful value for a signal? e.g. can we pass it to signal_destroy or similar without it falling over?

kevinsalaAuthorUnsubmitted

Done

We shouldn't call destroy with the {0} handle. I can add asserts.

/// Decrease the number of concurrent uses and return whether it was the last.

bool decreaseUseCount() { return UseCount.decrease(); }

hsa_signal_t get() const { return Signal; }

private:

/// The underlying HSA signal.

hsa_signal_t Signal;

/// Reference counter for tracking the concurrent use count. This is mainly

/// used for knowing how many streams are using the signal.

RefCountTy<> UseCount;

};
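
The signal protocol used by the class above (start at 1, decrement on completion, wait for 0, reset before reuse) can be modeled without HSA. The sketch below is an assumption-laden stand-in built on `std::atomic`; `MockSignal` and `isComplete` are hypothetical names, and the real class wraps `hsa_signal_t` instead.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a completion signal.
class MockSignal {
public:
  explicit MockSignal(int64_t InitialValue = 1) : Value(InitialValue) {}

  // Load the current value with acquire ordering.
  int64_t load() const { return Value.load(std::memory_order_acquire); }

  // Mark one pending operation as finished by decrementing the value.
  void signal() {
    assert(load() > 0 && "Invalid signal value");
    Value.fetch_sub(1, std::memory_order_release);
  }

  // A waiter considers the operation complete once the value reaches zero.
  bool isComplete() const { return load() == 0; }

  // Re-arm the signal. Only safe when no thread is still watching it.
  void reset() { Value.store(1, std::memory_order_release); }

private:
  std::atomic<int64_t> Value;
};
```

As in the patch, `reset` is not synchronized against concurrent `signal` calls; the caller is responsible for ensuring the signal is idle before re-arming it.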

/// Classes for holding AMDGPU signals and managing signals.

using AMDGPUSignalRef = AMDGPUResourceRef<AMDGPUSignalTy>;

using AMDGPUSignalManagerTy = GenericDeviceResourceManagerTy<AMDGPUSignalRef>;

JonChesterfieldUnsubmitted

Done

comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad

kevinsalaAuthorUnsubmitted

Done

True, that's something I was experimenting with. I'll change it back to blocked wait.

jdoerfertUnsubmitted

Done

I'm not sure which way to go. Add a TODO to try this out. I'm uncertain if the host thread can do something useful while it's not busy waiting.

kevinsalaAuthorUnsubmitted

Done

Kept as BLOCKED waiting and added a TODO comment about it.

/// Class holding an HSA queue to submit kernel and barrier packets.

struct AMDGPUQueueTy {

/// Create an empty queue.

AMDGPUQueueTy() : Queue(nullptr), Mutex() {}

/// Initialize a new queue belonging to a specific agent.

Error init(hsa_agent_t Agent, int32_t QueueSize) {

JonChesterfieldUnsubmitted

Done

This looks racy. If a few threads are calling decrement and one calls reset, it's going to break.

kevinsalaAuthorUnsubmitted

Done

Reset is expected to be called when the signal is not being used by any other thread nor the HSA runtime. It should be called just before being reused.

jdoerfertUnsubmitted

Done

Add that to the function comment please.

hsa_status_t Status =

hsa_queue_create(Agent, QueueSize, HSA_QUEUE_TYPE_MULTI, callbackError,

nullptr, UINT32_MAX, UINT32_MAX, &Queue);

return Plugin::check(Status, "Error in hsa_queue_create: %s");

}

/// Deinitialize the queue and destroy its resources.

Error deinit() {

hsa_status_t Status = hsa_queue_destroy(Queue);

return Plugin::check(Status, "Error in hsa_queue_destroy: %s");

}

/// Push a kernel launch to the queue. The kernel launch requires an output

/// signal and can define an optional input signal (nullptr if none).

Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,

uint32_t NumThreads, uint64_t NumBlocks,

AMDGPUSignalTy *OutputSignal,

AMDGPUSignalTy *InputSignal) {

assert(OutputSignal && "Invalid kernel output signal");

// Lock the queue during the packet publishing process. Notice this blocks

// the addition of other packets to the queue. The following piece of code

// should be lightweight; do not block the thread, allocate memory, etc.

std::lock_guard<std::mutex> Lock(Mutex);

// Add a barrier packet before the kernel packet in case there is a pending

// preceding operation. The barrier packet will delay the processing of

// subsequent queue's packets until the barrier input signals are satisfied.

// No output signal is needed because the dependency is already guaranteed

// by the queue barrier itself.

jhuber6Unsubmitted

Not Done

// Kernel symbols have a ".kd" suffix.

- std::string KernelName(getName());

- KernelName += ".kd";

+ SmallString<128> KernelName({getName(), ".kd"});

// Find the symbol on the device executable.

Small nit, can use LLVM's string, it's basically just a small vector of chars.

kevinsalaAuthorUnsubmitted

Done

For some reason, that doesn't seem to work. It's printing strange characters after ".kd", even if I add a \0. I'm looking into it.

if (InputSignal)

if (auto Err = pushBarrierImpl(nullptr, InputSignal))

return Err;

// Now prepare the kernel packet.

uint64_t PacketId;

hsa_kernel_dispatch_packet_t *Packet = acquirePacket(PacketId);

assert(Packet && "Invalid packet");

// The header of the packet is written at the last moment.

Packet->setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;

Packet->workgroup_size_x = NumThreads;

Packet->workgroup_size_y = 1;

Packet->workgroup_size_z = 1;

Packet->reserved0 = 0;

Packet->grid_size_x = NumBlocks * NumThreads;

Packet->grid_size_y = 1;

Packet->grid_size_z = 1;

Packet->private_segment_size = Kernel.getPrivateSize();

Packet->group_segment_size = Kernel.getGroupSize();

Packet->kernel_object = Kernel.getKernelObject();

Packet->kernarg_address = KernelArgs;

Packet->reserved2 = 0;

Packet->completion_signal = OutputSignal->get();

// Publish the packet. Do not modify the packet after this point.

publishKernelPacket(PacketId, Packet);

return Plugin::success();

}

/// Push a barrier packet that will wait on up to two input signals. All signals

/// are optional (nullptr if none).

Error pushBarrier(AMDGPUSignalTy *OutputSignal,

const AMDGPUSignalTy *InputSignal1,

const AMDGPUSignalTy *InputSignal2) {

// Lock the queue during the packet publishing process.

std::lock_guard<std::mutex> Lock(Mutex);

// Push the barrier with the lock acquired.

return pushBarrierImpl(OutputSignal, InputSignal1, InputSignal2);

}

private:

/// Push a barrier packet that will wait on up to two input signals. Assumes

/// the queue lock is acquired.

Error pushBarrierImpl(AMDGPUSignalTy *OutputSignal,

const AMDGPUSignalTy *InputSignal1,

const AMDGPUSignalTy *InputSignal2 = nullptr) {

// Add a queue barrier waiting on both the other stream's operation and the

// last operation on the current stream (if any).

uint64_t PacketId;

hsa_barrier_and_packet_t *Packet =

(hsa_barrier_and_packet_t *)acquirePacket(PacketId);

assert(Packet && "Invalid packet");

Packet->reserved0 = 0;

Packet->reserved1 = 0;

Packet->dep_signal[0] = {0};

Packet->dep_signal[1] = {0};

Packet->dep_signal[2] = {0};

Packet->dep_signal[3] = {0};

Packet->dep_signal[4] = {0};

Packet->reserved2 = 0;

Packet->completion_signal = {0};

// Set input and output dependencies if needed.

if (OutputSignal)

Packet->completion_signal = OutputSignal->get();

if (InputSignal1)

Packet->dep_signal[0] = InputSignal1->get();

if (InputSignal2)

Packet->dep_signal[1] = InputSignal2->get();

// Publish the packet. Do not modify the packet after this point.

publishBarrierPacket(PacketId, Packet);

return Plugin::success();

}

/// Acquire a packet from the queue. This call may block the thread if there

/// is no space in the underlying HSA queue. It may need to wait until the HSA

/// runtime processes some packets. Assumes the queue lock is acquired.

hsa_kernel_dispatch_packet_t *acquirePacket(uint64_t &PacketId) {

// Increase the queue index with relaxed memory order. Notice this will need

// another subsequent atomic operation with acquire order.

PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);

// Wait for the packet to be available. Notice the atomic operation uses

// the acquire memory order.

while (PacketId - hsa_queue_load_read_index_scacquire(Queue) >= Queue->size)

;

// Return the packet reference.

const uint32_t Mask = Queue->size - 1; // The size is a power of 2.

return (hsa_kernel_dispatch_packet_t *)Queue->base_address +

(PacketId & Mask);

}
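The index arithmetic in `acquirePacket` can be looked at in isolation. The following is a minimal sketch under the assumption that the queue size is a power of two, as the comment in the code states; `slotIndex` and `slotIsFree` are hypothetical helper names that do not exist in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Map a monotonically increasing packet id to a slot of a ring whose size is
// a power of two. Masking is equivalent to PacketId % QueueSize in that case.
uint64_t slotIndex(uint64_t PacketId, uint32_t QueueSize) {
  assert(QueueSize != 0 && (QueueSize & (QueueSize - 1)) == 0 &&
         "Size must be a power of two");
  return PacketId & (QueueSize - 1);
}

// A reserved packet id becomes usable once fewer than QueueSize packets
// separate it from the read index, i.e., the consumer has freed its slot.
bool slotIsFree(uint64_t PacketId, uint64_t ReadIndex, uint32_t QueueSize) {
  return PacketId - ReadIndex < QueueSize;
}
```

The spin loop in the patch keeps polling while the second condition is false.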

/// Publish the kernel packet so that the HSA runtime can start processing

/// the kernel launch. Do not modify the packet once this function is called.

/// Assumes the queue lock is acquired.

void publishKernelPacket(uint64_t PacketId,

hsa_kernel_dispatch_packet_t *Packet) {

uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);

uint16_t Setup = Packet->setup;

uint16_t Header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;

Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;

Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;

// Publish the packet. Do not modify the packet after this point.

__atomic_store_n(PacketPtr, Header | (Setup << 16), __ATOMIC_RELEASE);

// Signal the doorbell about the published packet.

hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);

}

/// Publish the barrier packet so that the HSA runtime can start processing

/// the barrier. Subsequent packets in the queue will not be processed until

/// all barrier dependencies (signals) are satisfied. Assumes the queue is locked.

void publishBarrierPacket(uint64_t PacketId, hsa_barrier_and_packet_t *Packet) {

uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);

uint16_t Setup = 0;

uint16_t Header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;

Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;

Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;

// Publish the packet. Do not modify the packet after this point.

__atomic_store_n(PacketPtr, Header | (Setup << 16), __ATOMIC_RELEASE);

// Signal the doorbell about the published packet.

hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);

}

jdoerfert: Pick a size that is at least the default Capacity.

/// Callback that will be called when an error is detected on the HSA queue.

static void callbackError(hsa_status_t Status, hsa_queue_t *Source, void *) {

auto Err = Plugin::check(Status, "Received error in queue %p: %s", Source);

FATAL_MESSAGE(1, "%s", toString(std::move(Err)).data());

}

/// The HSA queue.

hsa_queue_t *Queue;

/// Mutex to protect the acquiring and publishing of packets. For the moment,

/// we need this mutex to prevent publishing packets that are not ready to be

/// published in a multi-thread scenario. Without a queue lock, a thread T1

/// could acquire packet P and thread T2 acquire packet P+1. Thread T2 could

/// publish its packet P+1 (signaling the queue's doorbell) before packet P

/// from T1 is ready to be processed. That scenario should be invalid. Thus,

/// we use the following mutex to make packet acquiring and publishing atomic.

/// TODO: There are other more advanced approaches to avoid this mutex using

/// atomic operations. We can further investigate it if this is a bottleneck.

std::mutex Mutex;

};

/// Struct that implements a stream of asynchronous operations for AMDGPU

/// devices. This class relies on signals to implement streams and define the

/// dependencies between asynchronous operations.

struct AMDGPUStreamTy {

private:

/// Utility struct holding arguments for async H2H memory copies.

struct MemcpyArgsTy {

void *Dst;

const void *Src;

size_t Size;

};

/// Utility struct holding arguments for freeing buffers to memory managers.

struct ReleaseBufferArgsTy {

void *Buffer;

AMDGPUMemoryManagerTy *MemoryManager;

};

/// Utility struct holding arguments for releasing signals to signal managers.

struct ReleaseSignalArgsTy {

AMDGPUSignalTy *Signal;

AMDGPUSignalManagerTy *SignalManager;

};

/// The stream is composed of N stream slots. The struct below represents

/// the fields of each slot. Each slot has a signal and an optional action

/// function. When appending an HSA asynchronous operation to the stream, one

/// slot is consumed and used to store the operation's information. The

/// operation's output signal is set to the consumed slot's signal. If there

/// is a previous asynchronous operation on the previous slot, the HSA async

/// operation's input signal is set to the signal of the previous slot. This

JonChesterfield: Confused by this, why are we locking the queue?

jdoerfert: I think that's related to the answer above. Add a TODO to look into this.

kevinsala (Author): Added a long comment about it on top of the Stream's `std::mutex` data member.
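The interleaving described in that long comment can be replayed deterministically without threads. This is a minimal sketch, assuming a hypothetical `MockRing` model that is not the HSA queue API: one function replays the problematic order (T2 publishes P+1 before T1 has written P), the other treats acquire, fill, and publish as a single critical section, which is what the mutex enforces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a packet ring; it is not the HSA queue API.
struct MockRing {
  explicit MockRing(std::size_t Size) : Ready(Size, false) {}

  // Reserve the next packet index (models bumping the write index).
  uint64_t acquire() { return WriteIndex++; }

  // Mark a packet as fully written (models storing the packet header).
  void fill(uint64_t Id) {
    assert(Id < Ready.size() && "Packet id out of range");
    Ready[Id] = true;
  }

  // Models ringing the doorbell with Id: the consumer may now process every
  // packet up to and including Id. Returns false if one of them is not ready.
  bool ringDoorbell(uint64_t Id) const {
    for (uint64_t I = 0; I <= Id; ++I)
      if (!Ready[I])
        return false;
    return true;
  }

  std::vector<bool> Ready;
  uint64_t WriteIndex = 0;
};

// The order the lock rules out: T1 acquires P, T2 acquires P+1 and publishes
// it before T1 has written P.
bool interleavedPublish() {
  MockRing Ring(2);
  uint64_t P = Ring.acquire();       // T1
  uint64_t Next = Ring.acquire();    // T2
  Ring.fill(Next);                   // T2
  bool Ok = Ring.ringDoorbell(Next); // T2: packet P is still unwritten
  Ring.fill(P);                      // T1 finishes too late
  return Ok;
}

// Acquire, fill and publish as one critical section per thread.
bool serializedPublish() {
  MockRing Ring(2);
  bool Ok = true;
  for (int Thread = 0; Thread < 2; ++Thread) {
    uint64_t Id = Ring.acquire();
    Ring.fill(Id);
    Ok = Ring.ringDoorbell(Id) && Ok;
  }
  return Ok;
}
```

In this model `interleavedPublish()` reports a violation and `serializedPublish()` does not. A lock-free alternative, as the TODO in the patch hints, would need another way to keep the doorbell from overtaking unwritten packets.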

/// way, we obtain a chain of dependent async operations. The action is a

/// function that will be executed eventually after the operation is completed,

/// e.g., for releasing a buffer.

struct StreamSlotTy {

/// The output signal of the stream operation. May be used by the subsequent

/// operation as input signal.

AMDGPUSignalTy *Signal;

/// The action that must be performed after the operation's completion. Set

/// to nullptr when there is no action to perform.

Error (*ActionFunction)(void *);

/// Space for the action's arguments. A pointer to these arguments is passed

/// to the action function. Notice the space of arguments is limited.

union {

MemcpyArgsTy MemcpyArgs;

ReleaseBufferArgsTy ReleaseBufferArgs;

ReleaseSignalArgsTy ReleaseSignalArgs;

} ActionArgs;

/// Create an empty slot.

StreamSlotTy() : Signal(nullptr), ActionFunction(nullptr) {}

/// Schedule a host memory copy action on the slot.

Error schedHostMemoryCopy(void *Dst, const void *Src, size_t Size) {

ActionFunction = memcpyAction;

ActionArgs.MemcpyArgs = MemcpyArgsTy{ Dst, Src, Size };

return Plugin::success();

}

/// Schedule a release buffer action on the slot.

Error schedReleaseBuffer(void *Buffer, AMDGPUMemoryManagerTy &Manager) {

ActionFunction = releaseBufferAction;

ActionArgs.ReleaseBufferArgs = ReleaseBufferArgsTy{ Buffer, &Manager };

return Plugin::success();

}

/// Schedule a release signal action on the slot.

Error schedReleaseSignal(AMDGPUSignalTy *SignalToRelease,

AMDGPUSignalManagerTy *SignalManager) {

ActionFunction = releaseSignalAction;

ActionArgs.ReleaseSignalArgs = ReleaseSignalArgsTy{ SignalToRelease, SignalManager };

return Plugin::success();

}

// Perform the action if needed.

Error performAction() {

if (!ActionFunction)

return Plugin::success();

// Perform the action.

if (auto Err = (*ActionFunction)(&ActionArgs))

return Err;

// Invalidate the action.

ActionFunction = nullptr;

return Plugin::success();

}

};

/// The device agent where the stream was created.

hsa_agent_t Agent;

/// The queue that the stream uses to launch kernels.

AMDGPUQueueTy &Queue;

/// The manager of signals to reuse signals.

AMDGPUSignalManagerTy &SignalManager;

/// Array of stream slots. Use std::deque because it can dynamically grow

/// without invalidating the already inserted elements. For instance, the

/// std::vector may invalidate the elements by reallocating the internal

/// array if there is not enough space on new insertions.

std::deque<StreamSlotTy> Slots;

/// The next available slot on the queue. This is reset to zero each time the

/// stream is synchronized. It also indicates the current number of consumed

/// slots at a given time.

uint32_t NextSlot;

/// The synchronization id. This number is increased each time the stream is

/// synchronized. It is useful to detect if an AMDGPUEventTy points to an

/// operation that was already finalized in a previous stream synchronization.

uint32_t SyncId;

/// Mutex to protect stream's management.

mutable std::mutex Mutex;

/// Return the current number of asynchronous operations on the stream.

uint32_t size() const { return NextSlot; }

/// Consume one slot from the stream. Since the stream uses signals on demand

/// and releases them once the slot is no longer used, the function requires

/// an idle signal for the new consumed slot.

std::pair<uint32_t, AMDGPUSignalTy *> consume(AMDGPUSignalTy *OutputSignal) {

// Double the stream size if needed. Since we use std::deque, this operation

// does not invalidate the already added slots.

if (Slots.size() == NextSlot)

Slots.resize(Slots.size() * 2);

// Update the next available slot and the stream size.

uint32_t Curr = NextSlot++;

jdoerfert: Isn't this a problem? Shouldn't we wait instead? Same in other places. Let's mark it as TODO and address it in a follow up.

kevinsala (Author): Yes, there is no problem waiting on it. Even so, this check will go away with the new version of dynamically sized streams.
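The dynamically sized stream relies on two properties of `consume`: each new slot takes the previous slot's signal as its input dependency, and growing a `std::deque` at the end leaves pointers and references to existing elements valid (iterators are invalidated). A minimal sketch of both, assuming hypothetical `MiniSlot` and `MiniStream` types that model signals as plain `int` pointers:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical, simplified slot: only the output signal is modeled.
struct MiniSlot {
  const int *Signal = nullptr;
};

struct MiniStream {
  MiniStream() : Slots(2) {}

  // Consume one slot: store the output signal and return the slot index
  // together with the previous slot's signal (nullptr for the first slot).
  std::pair<uint32_t, const int *> consume(const int *OutputSignal) {
    assert(OutputSignal && "Invalid signal");
    // Grow by doubling. Appending to a std::deque keeps references and
    // pointers to the existing elements valid.
    if (Slots.size() == NextSlot)
      Slots.resize(Slots.size() * 2);

    uint32_t Curr = NextSlot++;
    const int *InputSignal = (Curr > 0) ? Slots[Curr - 1].Signal : nullptr;
    Slots[Curr].Signal = OutputSignal;
    return {Curr, InputSignal};
  }

  std::deque<MiniSlot> Slots;
  uint32_t NextSlot = 0;
};

// Returns true if a pointer to slot 0 survives growing the stream twice.
bool slotAddressIsStable() {
  MiniStream Stream;
  int Signals[5] = {0};
  Stream.consume(&Signals[0]);
  const MiniSlot *First = &Stream.Slots[0];
  for (int I = 1; I < 5; ++I)
    Stream.consume(&Signals[I]);
  return First == &Stream.Slots[0] && First->Signal == &Signals[0];
}
```

Pointer stability matters because the patch hands `&Slots[Curr]` to an asynchronous handler that may run after later slots have been appended.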

// Retrieve the input signal, if any, of the current operation.

AMDGPUSignalTy *InputSignal = (Curr > 0) ? Slots[Curr - 1].Signal : nullptr;

// Set the output signal of the current slot.

Slots[Curr].Signal = OutputSignal;

return std::make_pair(Curr, InputSignal);

}

/// Make the current stream wait on a specific operation of another stream.

/// The idea is to make the current stream wait on two signals: 1) the last

/// signal of the current stream, and 2) the last signal of the other stream.

/// Use a barrier packet with two input signals.

Error waitOnStreamOperation(AMDGPUStreamTy &OtherStream, uint32_t Slot) {

/// The signal that we must wait from the other stream.

AMDGPUSignalTy *OtherSignal = OtherStream.Slots[Slot].Signal;

// Prevent the release of the other stream's signal.

OtherSignal->increaseUseCount();

// Retrieve an available signal for the operation's output.

AMDGPUSignalTy *OutputSignal = SignalManager.getResource();

OutputSignal->reset();

OutputSignal->increaseUseCount();

// Consume stream slot and compute dependencies.

auto [Curr, InputSignal] = consume(OutputSignal);

// Setup the post action to release the signal.

if (auto Err = Slots[Curr].schedReleaseSignal(OtherSignal, &SignalManager))

return Err;

// Push a barrier into the queue with both input signals.

return Queue.pushBarrier(OutputSignal, InputSignal, OtherSignal);

}

/// Callback for running a specific asynchronous operation. This callback is

/// used for hsa_amd_signal_async_handler. The argument is the operation that

/// should be executed. Notice we use the post action mechanism to codify the

/// asynchronous operation.

static bool asyncActionCallback(hsa_signal_value_t Value, void *Args) {

StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);

assert(Slot && "Invalid slot");

assert(Slot->Signal && "Invalid signal");

// Perform the operation.

if (auto Err = Slot->performAction())

FATAL_MESSAGE(1, "Error performing post action: %s",

toString(std::move(Err)).data());

// Signal the output signal to notify that the asynchronous operation finalized.

Slot->Signal->signal();

// Unregister callback.

return false;

}

// Callback for host-to-host memory copies.

static Error memcpyAction(void *Data) {

MemcpyArgsTy *Args = reinterpret_cast<MemcpyArgsTy *>(Data);

assert(Args && "Invalid arguments");

assert(Args->Dst && "Invalid destination buffer");

assert(Args->Src && "Invalid source buffer");

std::memcpy(Args->Dst, Args->Src, Args->Size);

return Plugin::success();

}

// Callback for releasing a memory buffer to a memory manager.

static Error releaseBufferAction(void *Data) {

ReleaseBufferArgsTy *Args = reinterpret_cast<ReleaseBufferArgsTy *>(Data);

assert(Args && "Invalid arguments");

assert(Args->MemoryManager && "Invalid memory manager");

assert(Args->Buffer && "Invalid buffer");

// Release the allocation to the memory manager.

return Args->MemoryManager->deallocate(Args->Buffer);

}

static Error releaseSignalAction(void *Data) {

ReleaseSignalArgsTy *Args = reinterpret_cast<ReleaseSignalArgsTy *>(Data);

assert(Args && "Invalid arguments");

assert(Args->Signal && "Invalid signal");

assert(Args->SignalManager && "Invalid signal manager");

// Release the signal if needed.

if (Args->Signal->decreaseUseCount())

Args->SignalManager->returnResource(Args->Signal);

return Plugin::success();

}

public:

/// Create an empty stream associated with a specific device.

AMDGPUStreamTy(AMDGPUDeviceTy &Device);

/// Initialize the stream's signals.

Error init() { return Plugin::success(); }

/// Deinitialize the stream's signals.

Error deinit() { return Plugin::success(); }

/// Push an asynchronous kernel to the stream. The kernel arguments must be

/// placed in a special allocation for kernel args and must be kept alive until

/// the kernel finalizes. Once the kernel is finished, the stream will release

/// the kernel args buffer to the specified memory manager.

Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,

uint32_t NumThreads, uint64_t NumBlocks,

AMDGPUMemoryManagerTy &MemoryManager) {

// Retrieve an available signal for the operation's output.

AMDGPUSignalTy *OutputSignal = SignalManager.getResource();

OutputSignal->reset();

OutputSignal->increaseUseCount();

std::lock_guard<std::mutex> StreamLock(Mutex);

// Consume stream slot and compute dependencies.

auto [Curr, InputSignal] = consume(OutputSignal);

// Avoid defining the input dependency if already satisfied.

if (InputSignal && !InputSignal->load())

InputSignal = nullptr;

// Setup the post action to release the kernel args buffer.

if (auto Err = Slots[Curr].schedReleaseBuffer(KernelArgs, MemoryManager))

return Err;

// Push the kernel with the output signal and an input signal (optional)

return Queue.pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,

OutputSignal, InputSignal);

}

/// Push an asynchronous memory copy between pinned memory buffers.

Error pushPinnedMemoryCopyAsync(void *Dst, const void *Src,

uint64_t CopySize) {

// Retrieve an available signal for the operation's output.

AMDGPUSignalTy *OutputSignal = SignalManager.getResource();

OutputSignal->reset();

JonChesterfield: Passing `queue.getAgent()` as two arguments here seems suspicious. Shouldn't one of these be the GPU and one the host?

kevinsala (Author): I'm looking into it. The documentation of `hsa_amd_memory_async_copy` in the header doesn't give much detail on the meaning of those two parameters or on the consequences of passing GPU/CPU agents. I passed the GPU agent as both parameters, as the original AMDGPU plugin does.

If we pass the CPU agent as the src/dst agent, we will need to allow the CPU agent access to each of the device allocations. Otherwise, it will fail to comply with "the agent must be able to directly access both the source and destination buffers in their current locations" (header doc).

OutputSignal->increaseUseCount();

std::lock_guard<std::mutex> Lock(Mutex);

// Consume stream slot and compute dependencies.

auto [Curr, InputSignal] = consume(OutputSignal);

// Avoid defining the input dependency if already satisfied.

if (InputSignal && !InputSignal->load())

InputSignal = nullptr;

// Issue the async memory copy.

hsa_status_t Status;

if (InputSignal) {

hsa_signal_t InputSignalRaw = InputSignal->get();

Status = hsa_amd_memory_async_copy(Dst, Agent, Src,

Agent, CopySize, 1,

&InputSignalRaw, OutputSignal->get());

} else

Status = hsa_amd_memory_async_copy(Dst, Agent, Src,

Agent, CopySize, 0, nullptr,

OutputSignal->get());

return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");

}

/// Push an asynchronous memory copy device-to-host involving an unpinned

/// memory buffer. The operation consists of a two-step copy from the

/// device buffer to an intermediate pinned host buffer, and then, to an

/// unpinned host buffer. Both operations are asynchronous and dependent.

/// The intermediate pinned buffer will be released to the specified memory

/// manager once the operation completes.

Error pushMemoryCopyD2HAsync(void *Dst, const void *Src, void *Inter,

uint64_t CopySize,

AMDGPUMemoryManagerTy &MemoryManager) {

// TODO: Managers should define a function to retrieve multiple resources

// in a single call.

// Retrieve available signals for the operation's outputs.

AMDGPUSignalTy *OutputSignal1 = SignalManager.getResource();

AMDGPUSignalTy *OutputSignal2 = SignalManager.getResource();

OutputSignal1->reset();

OutputSignal2->reset();


return Plugin::success();

}

- /// Push an asynchronous memory copy host-to-device involving a unpinned

+ /// Push an asynchronous memory copy host-to-device involving an unpinned

/// memory buffer. The operation consists of a two-step copy from the

jhuber6: typo

OutputSignal1->increaseUseCount();

OutputSignal2->increaseUseCount();

std::lock_guard<std::mutex> Lock(Mutex);

// Consume stream slot and compute dependencies.

auto [Curr, InputSignal] = consume(OutputSignal1);

// Avoid defining the input dependency if already satisfied.

if (InputSignal && !InputSignal->load())

InputSignal = nullptr;

// Setup the post action for releasing the intermediate buffer.

if (auto Err = Slots[Curr].schedReleaseBuffer(Inter, MemoryManager))

return Err;

// Issue the first step: device to host transfer. Avoid defining the input

// dependency if already satisfied.

hsa_status_t Status;

if (InputSignal) {

hsa_signal_t InputSignalRaw = InputSignal->get();

Status = hsa_amd_memory_async_copy(Inter, Agent, Src,

Agent, CopySize, 1,

&InputSignalRaw, OutputSignal1->get());

} else {

Status = hsa_amd_memory_async_copy(Inter, Agent, Src,

Agent, CopySize, 0, nullptr,

OutputSignal1->get());

}

if (auto Err =

Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))

return Err;

// Consume another stream slot and compute dependencies.

std::tie(Curr, InputSignal) = consume(OutputSignal2);

// The std::memcpy is done asynchronously using an async handler. We store

// the function's information in the action but it's not actually an action.

if (auto Err = Slots[Curr].schedHostMemoryCopy(Dst, Inter, CopySize))

return Err;

// TODO: Need a memory fence (release) here.

// Issue the second step: host to host transfer.

Status = hsa_amd_signal_async_handler(

OutputSignal1->get(), HSA_SIGNAL_CONDITION_EQ, 0, asyncActionCallback,

(void *)&Slots[Curr]);

return Plugin::check(Status, "Error in hsa_amd_signal_async_handler: %s");

}

/// Push an asynchronous memory copy host-to-device involving an unpinned

/// memory buffer. The operation consists of a two-step copy from the

/// unpinned host buffer to an intermediate pinned host buffer, and then, to

/// the device buffer. Both operations are asynchronous and dependent.

/// The intermediate pinned buffer will be released to the specified memory

/// manager once the operation completes.

Error pushMemoryCopyH2DAsync(void *Dst, const void *Src, void *Inter,

uint64_t CopySize,

AMDGPUMemoryManagerTy &MemoryManager) {

// Retrieve available signals for the operation's outputs.

AMDGPUSignalTy *OutputSignal1 = SignalManager.getResource();

AMDGPUSignalTy *OutputSignal2 = SignalManager.getResource();

OutputSignal1->reset();

OutputSignal2->reset();

OutputSignal1->increaseUseCount();

OutputSignal2->increaseUseCount();

AMDGPUSignalTy *OutputSignal = OutputSignal1;

std::lock_guard<std::mutex> Lock(Mutex);

// Consume stream slot and compute dependencies.

auto [Curr, InputSignal] = consume(OutputSignal);

// Avoid defining the input dependency if already satisfied.

if (InputSignal && !InputSignal->load())

InputSignal = nullptr;

// Issue the first step: host to host transfer.

if (InputSignal) {

// The std::memcpy is done asynchronously using an async handler. We store

// the function's information in the action but it is not actually a

// post action.

if (auto Err = Slots[Curr].schedHostMemoryCopy(Inter, Src, CopySize))

return Err;

// TODO: Need a memory fence (release) here.

hsa_status_t Status = hsa_amd_signal_async_handler(

InputSignal->get(), HSA_SIGNAL_CONDITION_EQ, 0, asyncActionCallback,

(void *)&Slots[NextSlot]);

if (auto Err = Plugin::check(Status,

"Error in hsa_amd_signal_async_handler: %s"))

return Err;

// Let's use now the second output signal.

OutputSignal = OutputSignal2;

// Consume another stream slot and compute dependencies.

std::tie(Curr, InputSignal) = consume(OutputSignal);

} else {

// All preceding operations completed, copy the memory synchronously.

std::memcpy(Inter, Src, CopySize);

// Return the second signal because it will not be used.

SignalManager.returnResource(OutputSignal2);

}

// Setup the post action to release the intermediate pinned buffer.

if (auto Err = Slots[Curr].schedReleaseBuffer(Inter, MemoryManager))

return Err;

// Issue the second step: host to device transfer. Avoid defining the input

// dependency if already satisfied.

hsa_status_t Status;

if (InputSignal && InputSignal->load()) {

hsa_signal_t InputSignalRaw = InputSignal->get();

Status = hsa_amd_memory_async_copy(Dst, Agent, Inter,

Agent, CopySize, 1,

&InputSignalRaw, OutputSignal->get());

} else

Status = hsa_amd_memory_async_copy(Dst, Agent, Inter,

Agent, CopySize, 0, nullptr,

OutputSignal->get());

return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");

}

/// Synchronize with the stream. The current thread waits until all operations

/// are finalized and it performs the pending post actions (i.e., releasing

/// intermediate buffers).

Error synchronize() {

std::lock_guard<std::mutex> Lock(Mutex);

// Increase the number of synchronizations.

SyncId += 1;

// No need to synchronize anything.

if (NextSlot == 0)

return Plugin::success();

// Wait until all previous operations on the stream have completed.

if (auto Err = Slots[NextSlot-1].Signal->wait())

return Err;

for (uint32_t Slot = 0; Slot < NextSlot; ++Slot) {

// Take the post action of the operation if any.

if (auto Err = Slots[Slot].performAction())

return Err;

// Release the signal if possible.

if (Slots[Slot].Signal->decreaseUseCount())

SignalManager.returnResource(Slots[Slot].Signal);

Slots[Slot].Signal = nullptr;

}

// Reset the stream.

NextSlot = 0;

return Plugin::success();

}

/// Record the state of the stream on an event.

Error recordEvent(AMDGPUEventTy &Event) const;

/// Make the stream wait on an event.

Error waitEvent(const AMDGPUEventTy &Event);

};

/// Class representing an event on AMDGPU. The event basically stores some

/// information regarding the state of the recorded stream.

struct AMDGPUEventTy {

/// Create an empty event.

AMDGPUEventTy(AMDGPUDeviceTy &Device)

: RecordedStream(nullptr), RecordedOperation(-1), RecordedSyncId(-1) {}

/// Initialize and deinitialize.

Error init() { return Plugin::success(); }

Error deinit() { return Plugin::success(); }

/// Record the state of a stream on the event.

Error record(AMDGPUStreamTy &Stream) {

std::lock_guard<std::mutex> Lock(Mutex);

// Ignore the last recorded stream.

RecordedStream = &Stream;


const AMDGPUStreamTy &RecordedStream = *Event.RecordedStream;

- std::scoped_lock MultiLock(Mutex, RecordedStream.Mutex);

+ std::scoped_lock<std::mutex, std::mutex> MultiLock(Mutex, RecordedStream.Mutex);

// The recorded stream already completed the operation because the synchronize

jhuber6: Probably best to specify the template arguments.
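For context on the suggestion, a minimal C++17 sketch of locking two mutexes with `std::scoped_lock` and explicit template arguments. Class template argument deduction would normally infer the same types, so spelling them out is presumably a readability or portability preference. `LockedCounter` and `transfer` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical object guarded by its own mutex. The mutex is mutable so that
// const member functions or const references can still lock it.
struct LockedCounter {
  mutable std::mutex Mutex;
  int Value = 0;
};

// Move Amount from From to To while holding both mutexes. std::scoped_lock
// acquires them with a deadlock-avoidance algorithm, so concurrent calls to
// transfer(A, B) and transfer(B, A) cannot deadlock on lock ordering.
void transfer(LockedCounter &From, LockedCounter &To, int Amount) {
  // Locking the same std::mutex twice from one thread is undefined behavior.
  assert(&From != &To && "Cannot lock the same object twice");
  std::scoped_lock<std::mutex, std::mutex> Lock(From.Mutex, To.Mutex);
  From.Value -= Amount;
  To.Value += Amount;
}
```

The same precondition applies in the patch: `AMDGPUEventTy::wait` returns early when the recorded stream is the waiting stream, so `waitEvent` never tries to lock one mutex twice.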

return Stream.recordEvent(*this);

}

/// Make a stream wait on the current event.

Error wait(AMDGPUStreamTy &Stream) {

std::lock_guard<std::mutex> Lock(Mutex);

if (!RecordedStream)

return Plugin::error("Event does not have any recorded stream");

// Synchronizing the same stream. Do nothing.

if (RecordedStream == &Stream)

return Plugin::success();

// No need to wait for anything, the recorded stream already finished the

// corresponding operation.

if (RecordedOperation < 0)

return Plugin::success();

if (auto Err = Stream.waitEvent(*this))

return Err;

return Plugin::success();

}

protected:

/// The stream registered in this event.

AMDGPUStreamTy *RecordedStream;

/// The recorded operation on the recorded stream.

int32_t RecordedOperation;

/// The sync number when the stream was recorded.

int32_t RecordedSyncId;

/// Mutex to safely access event fields.

mutable std::mutex Mutex;

friend struct AMDGPUStreamTy;

};

Error AMDGPUStreamTy::recordEvent(AMDGPUEventTy &Event) const {

std::lock_guard<std::mutex> Lock(Mutex);

if (size() > 0) {

// Record the synchronize identifier (to detect stale recordings) and

// the last valid stream's operation.

Event.RecordedSyncId = SyncId;

Event.RecordedOperation = NextSlot - 1;

} else {

// The stream is empty, everything already completed, record nothing.

Event.RecordedSyncId = -1;

jdoerfert: CUDA + AMD

Event.RecordedOperation = -1;

}

return Plugin::success();

}

Error AMDGPUStreamTy::waitEvent(const AMDGPUEventTy &Event) {

// Retrieve the recorded stream on the event.

AMDGPUStreamTy &RecordedStream = *Event.RecordedStream;

std::scoped_lock<std::mutex, std::mutex> Lock(Mutex, RecordedStream.Mutex);

// The recorded stream already completed the operation because the synchronize

// identifier is already outdated.

if (RecordedStream.SyncId != (uint32_t)Event.RecordedSyncId)

return Plugin::success();

// Again, the recorded stream already completed the operation, the last

// operation's output signal is satisfied.

if (!RecordedStream.Slots[Event.RecordedOperation].Signal->load())

return Plugin::success();

// Otherwise, make the current stream wait on the other stream's operation.

return waitOnStreamOperation(RecordedStream, Event.RecordedOperation);

}
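The stale-event check in `recordEvent` and `waitEvent` can be reduced to a few lines. This is a simplified sketch with hypothetical `SketchStream` and `SketchEvent` types; it leaves out the mutexes and the signal value check that the patch also performs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stream: only the counters used by events.
struct SketchStream {
  uint32_t SyncId = 0;
  uint32_t NextSlot = 0;

  void push() { ++NextSlot; }

  // Every synchronization bumps the id and empties the stream.
  void synchronize() {
    ++SyncId;
    NextSlot = 0;
  }
};

// Hypothetical, simplified event: -1 means nothing was recorded.
struct SketchEvent {
  int32_t RecordedOperation = -1;
  int32_t RecordedSyncId = -1;
};

// Record the last operation of the stream, or nothing if it is empty.
void record(const SketchStream &Stream, SketchEvent &Event) {
  if (Stream.NextSlot > 0) {
    Event.RecordedSyncId = static_cast<int32_t>(Stream.SyncId);
    Event.RecordedOperation = static_cast<int32_t>(Stream.NextSlot - 1);
  } else {
    Event.RecordedSyncId = -1;
    Event.RecordedOperation = -1;
  }
}

// A waiter only has work to do if something was recorded and the recorded
// stream has not been synchronized since the recording.
bool mustWait(const SketchStream &Recorded, const SketchEvent &Event) {
  if (Event.RecordedOperation < 0)
    return false;
  assert(Event.RecordedSyncId >= 0 && "Recorded operation without sync id");
  return Recorded.SyncId == static_cast<uint32_t>(Event.RecordedSyncId);
}
```

Once the recorded stream synchronizes, its slot indices are reused, which is why the event compares sync ids before it touches `Slots[RecordedOperation]`.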

/// Abstract class that holds the common members of the actual kernel devices

/// and the host device. Both types should inherit from this class.

struct AMDGenericDeviceTy {

AMDGenericDeviceTy() {}

virtual ~AMDGenericDeviceTy() {}

/// Create all memory pools which the device has access to and classify them.

Error initMemoryPools() {

// Retrieve all memory pools from the device agent(s).

Error Err = retrieveAllMemoryPools();

if (Err)

return Err;

for (AMDGPUMemoryPoolTy *MemoryPool : AllMemoryPools) {

// Initialize the memory pool and retrieve some basic info.

Error Err = MemoryPool->init();

if (Err)

return Err;

if (!MemoryPool->isGlobal())

continue;

// Classify the memory pools depending on their properties.

if (MemoryPool->isFineGrained()) {

FineGrainedMemoryPools.push_back(MemoryPool);

if (MemoryPool->supportsKernelArgs())

ArgsMemoryPools.push_back(MemoryPool);

} else if (MemoryPool->isCoarseGrained()) {

CoarseGrainedMemoryPools.push_back(MemoryPool);

}

}

return Plugin::success();

}

/// Destroy all memory pools.

Error deinitMemoryPools() {

for (AMDGPUMemoryPoolTy *Pool : AllMemoryPools)

delete Pool;

AllMemoryPools.clear();

FineGrainedMemoryPools.clear();

CoarseGrainedMemoryPools.clear();

ArgsMemoryPools.clear();

return Plugin::success();

}

/// Retrieve and construct all memory pools from the device agent(s).

virtual Error retrieveAllMemoryPools() = 0;

/// Get the device agent.

virtual hsa_agent_t getAgent() const = 0;

protected:

/// Array of all memory pools available to the host agents.

llvm::SmallVector<AMDGPUMemoryPoolTy *> AllMemoryPools;

/// Array of fine-grained memory pools available to the host agents.

llvm::SmallVector<AMDGPUMemoryPoolTy *> FineGrainedMemoryPools;

/// Array of coarse-grained memory pools available to the host agents.

llvm::SmallVector<AMDGPUMemoryPoolTy *> CoarseGrainedMemoryPools;

/// Array of kernel args memory pools available to the host agents.

llvm::SmallVector<AMDGPUMemoryPoolTy *> ArgsMemoryPools;

};

/// Class representing the host device. This host device may have more than one

/// HSA host agent. We aggregate all its resources into the same instance.

struct AMDHostDeviceTy : public AMDGenericDeviceTy {

/// Create a host device from an array of host agents.

AMDHostDeviceTy(const llvm::SmallVector<hsa_agent_t> &HostAgents)

: AMDGenericDeviceTy(), Agents(HostAgents), ArgsMemoryManager(),

PinnedMemoryManager() {

assert(HostAgents.size() && "No host agent found");

}

/// Initialize the host device memory pools and the memory managers for

/// kernel args and host pinned memory allocations.

Error init() {

if (auto Err = initMemoryPools())

return Err;

if (auto Err = ArgsMemoryManager.init(getArgsMemoryPool()))

return Err;

if (auto Err = PinnedMemoryManager.init(getHostMemoryPool()))

return Err;

return Plugin::success();

}

/// Deinitialize memory pools and managers.

Error deinit() {

if (auto Err = deinitMemoryPools())

return Err;

if (auto Err = ArgsMemoryManager.deinit())

return Err;

if (auto Err = PinnedMemoryManager.deinit())

return Err;

return Plugin::success();

}

/// Retrieve and construct all memory pools from the host agents.

Error retrieveAllMemoryPools() override {

// Iterate through the available pools across the host agents.

for (hsa_agent_t Agent : Agents) {

Error Err = utils::iterateAgentMemoryPools(

Agent, [&](hsa_amd_memory_pool_t HSAMemoryPool) {

AMDGPUMemoryPoolTy *MemoryPool =

new AMDGPUMemoryPoolTy(HSAMemoryPool);

AllMemoryPools.push_back(MemoryPool);

return HSA_STATUS_SUCCESS;

});

if (Err)

return Err;

}

return Plugin::success();

}

/// Get one of the host agents. Return always the first agent.

hsa_agent_t getAgent() const override { return Agents[0]; }

/// Get a memory pool for host pinned allocations.

AMDGPUMemoryPoolTy &getHostMemoryPool() {

assert(!FineGrainedMemoryPools.empty() && "No fine-grained mempool");

// Retrieve any memory pool.

return *FineGrainedMemoryPools[0];

}

/// Get a memory pool for kernel args allocations.

AMDGPUMemoryPoolTy &getArgsMemoryPool() {

assert(!ArgsMemoryPools.empty() && "No kernelargs mempool");

// Retrieve any memory pool.

return *ArgsMemoryPools[0];

}

/// Getters for kernel args and host pinned memory managers.

AMDGPUMemoryManagerTy &getArgsMemoryManager() { return ArgsMemoryManager; }

AMDGPUMemoryManagerTy &getPinnedMemoryManager() {

return PinnedMemoryManager;

}

private:

/// Array of agents on the host side.

const llvm::SmallVector<hsa_agent_t> Agents;

// Memory manager for kernel arguments.

AMDGPUMemoryManagerTy ArgsMemoryManager;

// Memory manager for pinned memory.

AMDGPUMemoryManagerTy PinnedMemoryManager;

};

/// Class implementing the AMDGPU device functionalities which derives from the

/// generic device class.

struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {

// Create an AMDGPU device with a device id and default AMDGPU grid values.

AMDGPUDeviceTy(int32_t DeviceId, int32_t NumDevices,

AMDHostDeviceTy &HostDevice, hsa_agent_t Agent)

: GenericDeviceTy(DeviceId, NumDevices, {0}), AMDGenericDeviceTy(),

OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 8),

OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 1024),

OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",

1 * 1024 * 1024), // 1MB

OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_SIGNALS", 64),

AMDGPUStreamManager(*this), AMDGPUEventManager(*this),

AMDGPUSignalManager(*this), Agent(Agent),

HostDevice(HostDevice), Queues() {}

~AMDGPUDeviceTy() {}

/// Initialize the device, its resources and get its properties.

Error initImpl(GenericPluginTy &Plugin) override {

// First set up all the memory pools.

if (auto Err = initMemoryPools())

return Err;

// Get the wavefront size.

uint32_t WavefrontSize = 0;

ye-luoUnsubmitted

Done

What is QUEUE_SIZE? Prefer not to use SIZE but NUM_XXX_PER_QUEUE

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_WAVEFRONT_SIZE, WavefrontSize))

jdoerfertUnsubmitted

Done

4 x 512 x 128 might be enough

ye-luoUnsubmitted

Done

What is the STREAM_SIZE?

return Err;

ye-luoUnsubmitted

Done

LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES?

jdoerfertUnsubmitted

Not Done

We need documentation for those in openmp/docs, next to the other env vars.

kevinsalaAuthorUnsubmitted

Done

I'll add the documentation for these envars. They control the following aspects:

LIBOMPTARGET_AMDGPU_QUEUE_SIZE: The number of HSA packets that can be pushed into each HSA queue without waiting for the driver to process them.
LIBOMPTARGET_AMDGPU_STREAM_SIZE: The number of asynchronous operations (e.g., kernel launches, memory transfers) that can be pushed into each of our streams without waiting for them to complete. With the upcoming patch implementing dynamically sized streams, this envar will be renamed and become a hint for the initial stream size.
LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE: Up to this size, memory copies (tgt_rtl_submit_data, tgt_rtl_retrieve_data) are asynchronous operations appended to the corresponding stream. Larger transfers become synchronous. This can be seen in dataSubmitImpl (1695).
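
The copy policy described in the list above, together with the branches of dataSubmitImpl in this diff, can be sketched as a small decision function. This is only an illustration: CopyPath and selectCopyPath are hypothetical names that do not appear in the patch. The comparison mirrors the `Size >= OMPX_MaxAsyncCopyBytes` check in the diff, so a transfer of exactly the threshold size takes the synchronous path.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, used only to illustrate the policy.
enum class CopyPath {
  PinnedAsync, // Host buffer already pinned: one-step asynchronous copy.
  LockedSync,  // Large transfer: lock the host buffer and copy synchronously.
  StagedAsync  // Small transfer: two-step copy through a pinned staging buffer.
};

// Pick the transfer strategy in the same order as the branches in the diff.
CopyPath selectCopyPath(bool HostIsPinned, std::uint64_t Size,
                        std::uint64_t MaxAsyncCopyBytes) {
  if (HostIsPinned)
    return CopyPath::PinnedAsync;
  if (Size >= MaxAsyncCopyBytes)
    return CopyPath::LockedSync;
  return CopyPath::StagedAsync;
}
```

With the 1 MB default, an unpinned 4 KB transfer would be staged and pushed to the stream, while an unpinned 2 MB transfer would be handled synchronously.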

ye-luoUnsubmitted

Done

Thanks for adding docs.

What I found in hsa doc is "Number of packets the queue is expected to hold". This is more understandable than your line. I think your description is true but that is a second level interpretation. I think both lines can be useful in the doc. LIBOMPTARGET_AMDGPU_QUEUE_SIZE is an insufficient name. LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE is better IMO.
A STREAM may have many contents with different sizes. STREAM_SIZE is too vague. Better to have something like ASYNC_OP_DEPTH_PER_STREAM.
Still prefer COPY_BYTES.

kevinsalaAuthorUnsubmitted

Done

Current names:

LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES (default: 8)
LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE (default: 1024)
LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES (default: 1*1024*1024, 1MB)
LIBOMPTARGET_AMDGPU_NUM_INITIAL_SIGNALS (default: 64), probably better to name HSA_SIGNALS instead of SIGNALS

The description of each one appears on top of the envar's declarations. I'll add documentation on openmp/docs in another patch.
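
For readers unfamiliar with how such envars are typically consumed, here is a minimal sketch, assuming a plain std::getenv lookup with a fallback default. readUInt32Env and clampToDeviceLimit are hypothetical helpers, not the UInt32Envar class used by the plugin. The clamp mirrors the std::min calls in initImpl, which cap the requested value at the limit reported by the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <exception>
#include <limits>
#include <string>

// Hypothetical helper: read an unsigned 32-bit value from the environment,
// falling back to Default when the variable is unset, empty, or malformed.
std::uint32_t readUInt32Env(const char *Name, std::uint32_t Default) {
  const char *Raw = std::getenv(Name);
  if (!Raw || *Raw == '\0')
    return Default;
  try {
    const std::string Text(Raw);
    std::size_t Consumed = 0;
    unsigned long long Value = std::stoull(Text, &Consumed);
    // Reject trailing garbage and values that do not fit in 32 bits.
    if (Consumed != Text.size() ||
        Value > std::numeric_limits<std::uint32_t>::max())
      return Default;
    return static_cast<std::uint32_t>(Value);
  } catch (const std::exception &) {
    return Default;
  }
}

// Hypothetical helper: never request more than the device supports.
std::uint32_t clampToDeviceLimit(std::uint32_t Requested,
                                 std::uint32_t DeviceMax) {
  return std::min(Requested, DeviceMax);
}
```

Under these assumptions, readUInt32Env("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 1024) yields 1024 when the variable is not set, and the result would then be clamped against the queue size limit queried from the agent.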

GridValues.GV_Warp_Size = WavefrontSize;

// Load the grid values depending on the wavefront.

if (WavefrontSize == 32)

GridValues = getAMDGPUGridValues<32>();

else if (WavefrontSize == 64)

GridValues = getAMDGPUGridValues<64>();

else

return Plugin::error("Unexpected AMDGPU wavefront %d", WavefrontSize);

// Get maximum number of workitems per workgroup.

uint16_t WorkgroupMaxDim[3];

if (auto Err =

getDeviceAttr(HSA_AGENT_INFO_WORKGROUP_MAX_DIM, WorkgroupMaxDim))

return Err;

GridValues.GV_Max_WG_Size = WorkgroupMaxDim[0];

// Get maximum number of workgroups.

hsa_dim3_t GridMaxDim;

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_GRID_MAX_DIM, GridMaxDim))

return Err;

GridValues.GV_Max_Teams = GridMaxDim.x / GridValues.GV_Max_WG_Size;

if (GridValues.GV_Max_Teams == 0)

return Plugin::error("Maximum number of teams cannot be zero");

// Get maximum size of any device queues and maximum number of queues.

uint32_t MaxQueueSize;

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))

return Err;

uint32_t MaxQueues;

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))

return Err;

// Compute the number of queues and their size.

const uint32_t NumQueues = std::min(OMPX_NumQueues.get(), MaxQueues);

const uint32_t QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);

// Construct and initialize each device queue.

Queues = std::vector<AMDGPUQueueTy>(NumQueues);

for (AMDGPUQueueTy &Queue : Queues)

if (auto Err = Queue.init(Agent, QueueSize))

return Err;

// Initialize stream pool.

if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams))

return Err;

// Initialize event pool.

if (auto Err = AMDGPUEventManager.init(OMPX_InitialNumEvents))

return Err;

// Initialize signal pool.

if (auto Err = AMDGPUSignalManager.init(OMPX_InitialNumSignals))

return Err;

return Plugin::success();

}

/// Deinitialize the device and release its resources.

Error deinitImpl() override {

// Deinitialize the stream and event pools.

if (auto Err = AMDGPUStreamManager.deinit())

return Err;

if (auto Err = AMDGPUEventManager.deinit())

return Err;

if (auto Err = AMDGPUSignalManager.deinit())

return Err;

// Close modules if necessary.

if (!LoadedImages.empty()) {

// Each image has its own module.

for (DeviceImageTy *Image : LoadedImages) {

AMDGPUDeviceImageTy &AMDImage =

static_cast<AMDGPUDeviceImageTy &>(*Image);

// Unload the executable of the image.

if (auto Err = AMDImage.unloadExecutable())

return Err;

}

}

for (AMDGPUQueueTy &Queue : Queues) {

if (auto Err = Queue.deinit())

return Err;

}

// Invalidate agent reference.

Agent = {0};

return Plugin::success();

}

/// Allocate and construct an AMDGPU kernel.

Expected<GenericKernelTy *>

constructKernelEntry(const __tgt_offload_entry &KernelEntry,

DeviceImageTy &Image) override {

// Create a metadata object for the exec mode global (auto-generated).

StaticGlobalTy<llvm::omp::OMPTgtExecModeFlags> ExecModeGlobal(

KernelEntry.name, "_exec_mode");

// Retrieve execution mode for the kernel. This may fail since some kernels

// may not have an execution mode.

GenericGlobalHandlerTy &GHandler = Plugin::get().getGlobalHandler();

if (auto Err = GHandler.readGlobalFromImage(*this, Image, ExecModeGlobal)) {

DP("Failed to read execution mode for '%s': %s\n"

"Using default GENERIC (1) execution mode\n",

jhuber6Unsubmitted

Done

DeviceImageTy &Image) override {

- AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(Image);

// Create a metadata object for the exec mode global (auto-generated).

Unused

KernelEntry.name, toString(std::move(Err)).data());

// Consume the error since it is acceptable to fail.

consumeError(std::move(Err));

// In some cases the execution mode is not included, so use the default.

ExecModeGlobal.setValue(llvm::omp::OMP_TGT_EXEC_MODE_GENERIC);

}

// Check that the retrieved execution mode is valid.

if (!GenericKernelTy::isValidExecutionMode(ExecModeGlobal.getValue()))

return Plugin::error("Invalid execution mode %d for '%s'",

ExecModeGlobal.getValue(), KernelEntry.name);

// Allocate and initialize the AMDGPU kernel.

AMDGPUKernelTy *AMDKernel = Plugin::get().allocate<AMDGPUKernelTy>();

new (AMDKernel) AMDGPUKernelTy(KernelEntry.name, ExecModeGlobal.getValue());

return AMDKernel;

}

/// Set the current context to this device's context. Do nothing since the

/// AMDGPU devices do not have the concept of contexts.

Error setContext() override { return Plugin::success(); }

/// Get the stream of the asynchronous info structure or get a new one.

AMDGPUStreamTy &getStream(AsyncInfoWrapperTy &AsyncInfoWrapper) {

AMDGPUStreamTy *&Stream = AsyncInfoWrapper.getQueueAs<AMDGPUStreamTy *>();

if (!Stream)

Stream = AMDGPUStreamManager.getResource();

return *Stream;

}

/// Load the binary image into the device and allocate an image object.

Expected<DeviceImageTy *> loadBinaryImpl(const __tgt_device_image *TgtImage,

int32_t ImageId) override {

// Allocate and initialize the image object.

AMDGPUDeviceImageTy *AMDImage =

Plugin::get().allocate<AMDGPUDeviceImageTy>();

new (AMDImage) AMDGPUDeviceImageTy(ImageId, TgtImage);

// Load the HSA executable.

if (Error Err = AMDImage->loadExecutable(*this))

return std::move(Err);

return AMDImage;

}

/// Allocate memory on the device or related to the device.

void *allocate(size_t Size, void *, TargetAllocTy Kind) override;

/// Deallocate memory on the device or related to the device.

int free(void *TgtPtr, TargetAllocTy Kind) override {

if (TgtPtr == nullptr)

return OFFLOAD_SUCCESS;

AMDGPUMemoryPoolTy *MemoryPool = nullptr;

switch (Kind) {

case TARGET_ALLOC_DEFAULT:

case TARGET_ALLOC_DEVICE:

MemoryPool = CoarseGrainedMemoryPools[0];

break;

case TARGET_ALLOC_HOST:

MemoryPool = &HostDevice.getHostMemoryPool();

break;

case TARGET_ALLOC_SHARED:

// TODO: Not supported yet. We could look at fine-grained host memory

// pools that are accessible by this device. The allocation should be made

// explicitly accessible if it is not yet.

break;

}

if (!MemoryPool) {

REPORT("No memory pool for the specified allocation kind\n");

return OFFLOAD_FAIL;

}

if (Error Err = MemoryPool->deallocate(TgtPtr)) {

REPORT("%s\n", toString(std::move(Err)).data());

return OFFLOAD_FAIL;

}

if (Kind == TARGET_ALLOC_HOST) {

std::lock_guard<std::shared_mutex> Lock(HostAllocationsMutex);

size_t Erased = HostAllocations.erase(TgtPtr);

if (!Erased) {

REPORT("Cannot find a host allocation in the map\n");

return OFFLOAD_FAIL;

}

}

return OFFLOAD_SUCCESS;

}

/// Synchronize current thread with the pending operations on the async info.

Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {

AMDGPUStreamTy *Stream =

reinterpret_cast<AMDGPUStreamTy *>(AsyncInfo.Queue);

assert(Stream && "Invalid stream");

if (auto Err = Stream->synchronize())

return Err;

// Once the stream is synchronized, return it to stream pool and reset

// AsyncInfo. This is to make sure the synchronization only works for its

// own tasks.

AMDGPUStreamManager.returnResource(Stream);

AsyncInfo.Queue = nullptr;

return Plugin::success();

}

/// Submit data to the device (host to device transfer).

Error dataSubmitImpl(void *TgtPtr, const void *HstPtr, int64_t Size,

AsyncInfoWrapperTy &AsyncInfoWrapper) override {

// Use one-step asynchronous operation when host memory is already pinned.

kevinsalaAuthorUnsubmitted

Done

Should be a return

if (isHostPinnedMemory(HstPtr)) {

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

if (auto Err = Stream.pushPinnedMemoryCopyAsync(TgtPtr, HstPtr, Size))

return Err;

}

void *PinnedHstPtr = nullptr;

// For large transfers use synchronous behavior.

if (Size >= OMPX_MaxAsyncCopyBytes) {

if (AsyncInfoWrapper.hasQueue())

if (auto Err = synchronize(AsyncInfoWrapper))

return Err;

hsa_status_t Status;

Status =

hsa_amd_memory_lock(const_cast<void *>(HstPtr), Size, nullptr, 0, &PinnedHstPtr);

if (auto Err =

Plugin::check(Status, "Error in hsa_amd_memory_lock: %s\n"))

return Err;

AMDGPUSignalTy Signal;

if (auto Err = Signal.init())

return Err;

jhuber6Unsubmitted

Done

Status =

- hsa_amd_memory_lock((void *)HstPtr, Size, nullptr, 0, &PinnedHstPtr);

+ hsa_amd_memory_lock(const_cast<void *>(HstPtr), Size, nullptr, 0, &PinnedHstPtr);

if (auto Err =

Using const_cast isn't great but otherwise the compiler will complain with warnings on.
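
As a small illustration of the point above, the sketch below passes a const host pointer to a function that takes a mutable `void *`. fakeLock is a stand-in written for this example, not an HSA function, and lockConstBuffer is a hypothetical wrapper. A C-style cast such as `(void *)HstPtr` compiles as well, but some warning configurations (for example `-Wcast-qual`) flag it for silently dropping the qualifier, whereas const_cast makes the intent explicit.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a C API that takes a mutable pointer but does not write
// through it. Not an HSA function.
int fakeLock(void *HostPtr, std::size_t Size, void **LockedPtr) {
  (void)Size;
  *LockedPtr = HostPtr;
  return 0;
}

// Hypothetical wrapper: the const qualifier is removed explicitly, and only
// at the call boundary, so the rest of the code keeps the const-correct type.
int lockConstBuffer(const void *HstPtr, std::size_t Size, void **LockedPtr) {
  return fakeLock(const_cast<void *>(HstPtr), Size, LockedPtr);
}
```

The cast is only safe as long as the callee really does not modify the buffer, which is an assumption about the API rather than something the compiler checks.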

Status = hsa_amd_memory_async_copy(TgtPtr, Agent, PinnedHstPtr, Agent,

Size, 0, nullptr, Signal.get());

if (auto Err =

Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))

return Err;

if (auto Err = Signal.wait())

return Err;

if (auto Err = Signal.deinit())

return Err;

Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));

return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");

}

// Otherwise, use two-step copy with an intermediate pinned host buffer.

AMDGPUMemoryManagerTy &PinnedMemoryManager =

HostDevice.getPinnedMemoryManager();

if (auto Err = PinnedMemoryManager.allocate(Size, &PinnedHstPtr))

return Err;

jhuber6Unsubmitted

Done

return Err;

- Status = hsa_amd_memory_unlock((void *)HstPtr);

+ Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));

if (auto Err =

Same here.

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

return Stream.pushMemoryCopyH2DAsync(TgtPtr, HstPtr, PinnedHstPtr, Size,

PinnedMemoryManager);

}

/// Retrieve data from the device (device to host transfer).

Error dataRetrieveImpl(void *HstPtr, const void *TgtPtr, int64_t Size,

AsyncInfoWrapperTy &AsyncInfoWrapper) override {

if (isHostPinnedMemory(HstPtr)) {

// Use one-step asynchronous operation when host memory is already pinned.

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

return Stream.pushPinnedMemoryCopyAsync(HstPtr, TgtPtr, Size);

}

void *PinnedHstPtr = nullptr;

// For large transfers use synchronous behavior.

if (Size >= OMPX_MaxAsyncCopyBytes) {

if (AsyncInfoWrapper.hasQueue())

if (auto Err = synchronize(AsyncInfoWrapper))

return Err;

hsa_status_t Status;

Status = hsa_amd_memory_lock(const_cast<void *>(HstPtr), Size, nullptr, 0, &PinnedHstPtr);

if (auto Err =

Plugin::check(Status, "Error in hsa_amd_memory_lock: %s\n"))

jhuber6Unsubmitted

Done

Nit, no else after return

return Err;

AMDGPUSignalTy Signal;

if (auto Err = Signal.init())

return Err;

Status = hsa_amd_memory_async_copy(PinnedHstPtr, Agent, TgtPtr, Agent,

Size, 0, nullptr, Signal.get());

if (auto Err =

Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))

return Err;

if (auto Err = Signal.wait())

return Err;

if (auto Err = Signal.deinit())

return Err;

Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));

return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");

}

// Otherwise, use two-step copy with an intermediate pinned host buffer.

AMDGPUMemoryManagerTy &PinnedMemoryManager =

HostDevice.getPinnedMemoryManager();

if (auto Err = PinnedMemoryManager.allocate(Size, &PinnedHstPtr))

return Err;

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

return Stream.pushMemoryCopyD2HAsync(HstPtr, TgtPtr, PinnedHstPtr, Size,

PinnedMemoryManager);

}

/// Exchange data between two devices within the plugin. This function is not

/// supported in this plugin.

Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,

void *DstPtr, int64_t Size,

AsyncInfoWrapperTy &AsyncInfoWrapper) override {

// This function should never be called because the function

// AMDGPUPluginTy::isDataExchangable() returns false.

return Plugin::error("dataExchangeImpl not supported");

}

/// Initialize the async info for interoperability purposes.

Error initAsyncInfoImpl(AsyncInfoWrapperTy &AsyncInfoWrapper) override {

// TODO: Implement this function.

return Plugin::success();

}

/// Initialize the device info for interoperability purposes.

Error initDeviceInfoImpl(__tgt_device_info *DeviceInfo) override {

DeviceInfo->Context = nullptr;

if (!DeviceInfo->Device)

DeviceInfo->Device = reinterpret_cast<void *>(Agent.handle);

return Plugin::success();

}

/// Create an event.

Error createEventImpl(void **EventPtrStorage) override {

AMDGPUEventTy **Event = reinterpret_cast<AMDGPUEventTy **>(EventPtrStorage);

*Event = AMDGPUEventManager.getResource();

return Plugin::success();

}

/// Destroy a previously created event.

Error destroyEventImpl(void *EventPtr) override {

AMDGPUEventTy *Event = reinterpret_cast<AMDGPUEventTy *>(EventPtr);

AMDGPUEventManager.returnResource(Event);

return Plugin::success();

}

/// Record the event.

Error recordEventImpl(void *EventPtr,

AsyncInfoWrapperTy &AsyncInfoWrapper) override {

AMDGPUEventTy *Event = reinterpret_cast<AMDGPUEventTy *>(EventPtr);

assert(Event && "Invalid event");

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

return Event->record(Stream);

}

/// Make the stream wait on the event.

Error waitEventImpl(void *EventPtr,

AsyncInfoWrapperTy &AsyncInfoWrapper) override {

AMDGPUEventTy *Event = reinterpret_cast<AMDGPUEventTy *>(EventPtr);

AMDGPUStreamTy &Stream = getStream(AsyncInfoWrapper);

return Event->wait(Stream);

}

/// Synchronize the current thread with the event.

Error syncEventImpl(void *EventPtr) override {

return Plugin::error("Synchronize event not implemented");

}

/// Print information about the device.

Error printInfoImpl() override {

// TODO: Implement the basic info.

return Plugin::success();

}

/// Getters and setters for stack and heap sizes.

Error getDeviceStackSize(uint64_t &Value) override {

Value = 0;

return Plugin::success();

}

Error setDeviceStackSize(uint64_t Value) override {

return Plugin::success();

}

Error getDeviceHeapSize(uint64_t &Value) override {

Value = 0;

return Plugin::success();

}

Error setDeviceHeapSize(uint64_t Value) override { return Plugin::success(); }

/// AMDGPU-specific function to get device attributes.

template <typename Ty> Error getDeviceAttr(uint32_t Kind, Ty &Value) {

hsa_status_t Status =

hsa_agent_get_info(Agent, (hsa_agent_info_t)Kind, &Value);

return Plugin::check(Status, "Error in hsa_agent_get_info: %s");

}

/// Get the device agent.

hsa_agent_t getAgent() const override { return Agent; }

/// Get the signal manager.

AMDGPUSignalManagerTy &getSignalManager() { return AMDGPUSignalManager; }

/// Retrieve and construct all memory pools of the device agent.

Error retrieveAllMemoryPools() override {

// Iterate through the available pools of the device agent.

return utils::iterateAgentMemoryPools(

Agent, [&](hsa_amd_memory_pool_t HSAMemoryPool) {

AMDGPUMemoryPoolTy *MemoryPool =

Plugin::get().allocate<AMDGPUMemoryPoolTy>();

new (MemoryPool) AMDGPUMemoryPoolTy(HSAMemoryPool);

AllMemoryPools.push_back(MemoryPool);

return HSA_STATUS_SUCCESS;

});

}

/// Get the next queue in a round-robin fashion.

AMDGPUQueueTy &getNextQueue() {

static std::atomic<uint32_t> NextQueue(0);

uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed);

return Queues[Current % Queues.size()];

}

/// Check whether a buffer is a host pinned buffer.

bool isHostPinnedMemory(const void *Ptr) const {

bool Found = false;

HostAllocationsMutex.lock_shared();

if (!HostAllocations.empty()) {

auto It = HostAllocations.lower_bound((const void *)Ptr);

if (It != HostAllocations.end() && It->first == Ptr) {

Found = true;

} else if (It != HostAllocations.begin()) {

--It;

Found = ((const char *)It->first + It->second > (const char *)Ptr);

}

}

HostAllocationsMutex.unlock_shared();

return Found;

}

private:

JonChesterfieldUnsubmitted

Not Done

I can't work out what's going on here. The corresponding logic for erase looks up the pointer directly, should found not be the same? Also can't tell why we're recording the size of the allocation next to the pointer, as opposed to a DenseSet<void*>

kevinsalaAuthorUnsubmitted

Done

The __tgt_rtl_data_delete operation should pass the same pointer provided by the __tgt_rtl_data_alloc. As far as I know, it's not valid to make a partial deletion of an allocated buffer. But in the case of __tgt_rtl_data_submit/retrieve, the pointer can come with an applied offset. Thus, we should check whether the provided pointer is inside any host allocation, considering the sizes of the allocations.
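
The containment check described above can be sketched in isolation as follows. This is a simplified variant, not the code in the patch: it uses upper_bound instead of lower_bound so that a single step back covers both the exact-match and the offset case, and AllocationMap and isInsideAnyAllocation are hypothetical names. It assumes the recorded allocations do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical alias: base address of each allocation mapped to its size.
using AllocationMap = std::map<const void *, std::size_t>;

// Return true if Ptr falls inside [Base, Base + Size) of any recorded
// allocation. The lookup is logarithmic in the number of allocations.
bool isInsideAnyAllocation(const AllocationMap &Allocations, const void *Ptr) {
  // First allocation whose base is strictly greater than Ptr.
  auto It = Allocations.upper_bound(Ptr);
  if (It == Allocations.begin())
    return false; // Every allocation starts after Ptr.
  --It;           // Last allocation whose base is <= Ptr.
  const auto Base = reinterpret_cast<std::uintptr_t>(It->first);
  const auto Addr = reinterpret_cast<std::uintptr_t>(Ptr);
  return Addr - Base < It->second;
}
```

A pointer with an offset into a recorded buffer is reported as pinned, while the address one past the end of the buffer is not. Erasing on deallocation can still use the exact base pointer as the key.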

jdoerfertUnsubmitted

Not Done

We might want to make our lives easier here and not do a search.
Later we want two changes that will help:

Have a pre-allocated pinned buffer for arguments.
Use the hsa lookup function if it's not a pointer to pinned memory allocated by us.

@kevinsala You think we need the search for the current use cases?

kevinsalaAuthorUnsubmitted

Not Done

We can replace the search with a call to hsa_amd_pointer_info and let the HSA runtime do the work. If the buffer is explicitly locked by the user (malloc + HSA lock), it should return HSA_EXT_POINTER_TYPE_LOCKED. If the buffer was allocated using the HSA allocator functions, I guess it will return HSA_EXT_POINTER_TYPE_HSA. The latter does not mean that the buffer is host pinned memory, because it may also be any other kind of memory allocated through the HSA API. But we can assume the user won't pass such invalid buffer types.

using AMDGPUStreamRef = AMDGPUResourceRef<AMDGPUStreamTy>;

using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>;

using AMDGPUStreamManagerTy = GenericDeviceResourceManagerTy<AMDGPUStreamRef>;

using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>;

/// Envar for controlling the number of HSA queues per device. A high number

/// of queues may degrade performance.

UInt32Envar OMPX_NumQueues;

/// Envar for controlling the size of each HSA queue. The size is the number

/// of HSA packets a queue is expected to hold. It is also the number of HSA

/// packets that can be pushed into each queue without waiting for the driver

/// to process them.

UInt32Envar OMPX_QueueSize;

/// Envar specifying the maximum size in bytes where the memory copies are

/// asynchronous operations. Up to this transfer size, the memory copies are

/// asynchronous operations pushed to the corresponding stream. For larger

/// transfers, they are synchronous transfers.

UInt32Envar OMPX_MaxAsyncCopyBytes;

/// Envar controlling the initial number of HSA signals per device. There is

/// one manager of signals per device managing several pre-allocated signals.

/// These signals are mainly used by AMDGPU streams. If needed, more signals

/// will be created.

UInt32Envar OMPX_InitialNumSignals;

/// Stream manager for AMDGPU streams.

AMDGPUStreamManagerTy AMDGPUStreamManager;

/// Event manager for AMDGPU events.

AMDGPUEventManagerTy AMDGPUEventManager;

/// Signal manager for AMDGPU signals.

AMDGPUSignalManagerTy AMDGPUSignalManager;

/// The agent handle corresponding to the device.

hsa_agent_t Agent;

/// Reference to the host device.

AMDHostDeviceTy &HostDevice;

JonChesterfieldUnsubmitted

Done

this is implemented as a tree - why std::map?

kevinsalaAuthorUnsubmitted

Done

To perform lower_bound operations with logarithmic complexity.

/// List of device packet queues.

std::vector<AMDGPUQueueTy> Queues;

/// Map of host pinned allocations. We track these pinned allocations so that

/// memory transfers involving these allocations do not need a two-step copy

/// with an intermediate pinned buffer.

std::map<const void *, size_t> HostAllocations;

mutable std::shared_mutex HostAllocationsMutex;

};

Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {

hsa_status_t Status;

Status = hsa_code_object_deserialize(getStart(), getSize(), "", &CodeObject);

if (auto Err =

Plugin::check(Status, "Error in hsa_code_object_deserialize: %s"))

return Err;

Status = hsa_executable_create_alt(

HSA_PROFILE_FULL, HSA_DEFAULT_FLOAT_ROUNDING_MODE_ZERO, "", &Executable);

if (auto Err =

Plugin::check(Status, "Error in hsa_executable_create_alt: %s"))

return Err;

Status = hsa_executable_load_code_object(Executable, Device.getAgent(),

CodeObject, "");

if (auto Err =

Plugin::check(Status, "Error in hsa_executable_load_code_object: %s"))

return Err;

Status = hsa_executable_freeze(Executable, "");

if (auto Err = Plugin::check(Status, "Error in hsa_executable_freeze: %s"))

return Err;

uint32_t Result;

Status = hsa_executable_validate(Executable, &Result);

if (auto Err = Plugin::check(Status, "Error in hsa_executable_validate: %s"))

return Err;

if (Result)

return Plugin::error("Loaded HSA executable does not validate");

return Plugin::success();

}

Expected<hsa_executable_symbol_t>

AMDGPUDeviceImageTy::findDeviceSymbol(GenericDeviceTy &Device,

StringRef SymbolName) const {

AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(Device);

hsa_agent_t Agent = AMDGPUDevice.getAgent();

hsa_executable_symbol_t Symbol;

hsa_status_t Status = hsa_executable_get_symbol_by_name(

Executable, SymbolName.data(), &Agent, &Symbol);

if (auto Err = Plugin::check(

Status, "Error in hsa_executable_get_symbol_by_name(%s): %s",

SymbolName.data()))

return std::move(Err);

return Symbol;

}

template <typename ResourceTy>

Error AMDGPUResourceRef<ResourceTy>::create(GenericDeviceTy &Device) {

if (Resource)

return Plugin::error("Creating an existing resource");

AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(Device);

Resource = new ResourceTy(AMDGPUDevice);

return Resource->init();

}

jhuber6Unsubmitted

Done

const ELF64LE::Shdr &Section,

- GlobalTy &ImageGlobal) {

+ GlobalTy &ImageGlobal) override {

// The global's address in AMDGPU is computed as the image begin + the ELF

Is this supposed to be marked override?

AMDGPUStreamTy::AMDGPUStreamTy(AMDGPUDeviceTy &Device)

: Agent(Device.getAgent()), Queue(Device.getNextQueue()),

SignalManager(Device.getSignalManager()),

// Initialize the std::deque with some empty positions.

Slots(32), NextSlot(0), SyncId(0) {}

/// Class implementing the AMDGPU-specific functionalities of the global

/// handler.

struct AMDGPUGlobalHandlerTy final : public GenericGlobalHandlerTy {

/// Get the metadata of a global from the device. The name and size of the

/// global is read from DeviceGlobal and the address of the global is written

/// to DeviceGlobal.

Error getGlobalMetadataFromDevice(GenericDeviceTy &Device,

DeviceImageTy &Image,

GlobalTy &DeviceGlobal) override {

AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(Image);

// Find the symbol on the device executable.

auto SymbolOrErr =

AMDImage.findDeviceSymbol(Device, DeviceGlobal.getName());

if (!SymbolOrErr)

return SymbolOrErr.takeError();

hsa_executable_symbol_t Symbol = *SymbolOrErr;

hsa_symbol_kind_t SymbolType;

hsa_status_t Status;

uint64_t SymbolAddr;

uint32_t SymbolSize;

// Retrieve the type, address and size of the symbol.

std::pair<hsa_executable_symbol_info_t, void *> RequiredInfos[] = {

{HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &SymbolType},

{HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS, &SymbolAddr},

{HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_SIZE, &SymbolSize}};

for (auto &Info : RequiredInfos) {

Status = hsa_executable_symbol_get_info(Symbol, Info.first, Info.second);

if (auto Err = Plugin::check(

Status, "Error in hsa_executable_symbol_get_info: %s"))

return Err;

}

// Check the size of the symbol.

if (SymbolSize != DeviceGlobal.getSize())

return Plugin::error(

"Failed to load global '%s' due to size mismatch (%zu != %zu)",

DeviceGlobal.getName().data(), SymbolSize,

(size_t)DeviceGlobal.getSize());

// Store the symbol address on the device global metadata.

DeviceGlobal.setPtr(reinterpret_cast<void *>(SymbolAddr));

return Plugin::success();

}

private:

/// Extract the global's information from the ELF image, section, and symbol.

Error getGlobalMetadataFromELF(const DeviceImageTy &Image,

const ELF64LE::Sym &Symbol,

const ELF64LE::Shdr &Section,

GlobalTy &ImageGlobal) override {

// The global's address in AMDGPU is computed as the image begin + the ELF

// symbol value. Notice we do not add the ELF section offset.

ImageGlobal.setPtr((char *)Image.getStart() + Symbol.st_value);

// Set the global's size.

ImageGlobal.setSize(Symbol.st_size);

return Plugin::success();

}

};

/// Class implementing the AMDGPU-specific functionalities of the plugin.

struct AMDGPUPluginTy final : public GenericPluginTy {

/// Create an AMDGPU plugin and initialize the AMDGPU driver.

AMDGPUPluginTy() : GenericPluginTy(), HostDevice(nullptr) {}

/// This class should not be copied.

AMDGPUPluginTy(const AMDGPUPluginTy &) = delete;

AMDGPUPluginTy(AMDGPUPluginTy &&) = delete;

/// Initialize the plugin and return the number of devices.

Expected<int32_t> initImpl() override {

hsa_status_t Status = hsa_init();

if (Status != HSA_STATUS_SUCCESS) {

// Cannot call hsa_status_string.

DP("Failed to initialize AMDGPU's HSA library\n");

return 0;

}

// Register event handler to detect memory errors on the devices.

Status = hsa_amd_register_system_event_handler(eventHandler, nullptr);

if (auto Err = Plugin::check(

Status, "Error in hsa_amd_register_system_event_handler: %s"))

return std::move(Err);

// List of host (CPU) agents.

llvm::SmallVector<hsa_agent_t> HostAgents;

// Count the number of available agents.

auto Err = utils::iterateAgents([&](hsa_agent_t Agent) {

// Get the device type of the agent.

hsa_device_type_t DeviceType;

hsa_status_t Status =

hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &DeviceType);

if (Status != HSA_STATUS_SUCCESS)

return Status;

// Classify the agents into kernel (GPU) and host (CPU) agents.

if (DeviceType == HSA_DEVICE_TYPE_GPU) {

// Ensure that the GPU agent supports kernel dispatch packets.

hsa_agent_feature_t features;

Status = hsa_agent_get_info(Agent, HSA_AGENT_INFO_FEATURE, &features);

if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)

KernelAgents.push_back(Agent);

} else if (DeviceType == HSA_DEVICE_TYPE_CPU) {

HostAgents.push_back(Agent);

}

return HSA_STATUS_SUCCESS;

});

if (Err)

return std::move(Err);

int32_t NumDevices = KernelAgents.size();

if (NumDevices == 0) {

// Do not initialize if there are no devices.

DP("There are no devices supporting AMDGPU.\n");

jhuber6Unsubmitted

Done

// Set up the memory pools available for the host.

- if (Err = HostDevice->init())

+ if (auto Err = HostDevice->init())

return std::move(Err);

Best make a new one here as this was checked above.

jhuber6: Best make a new one here as this was checked above.

return 0;

}

// There are kernel agents but there is no host agent. That should be

// treated as an error.

if (HostAgents.empty())

return Plugin::error("No AMDGPU host agents");

// Initialize the host device using host agents.

HostDevice = allocate<AMDHostDeviceTy>();

new (HostDevice) AMDHostDeviceTy(HostAgents);

// Set up the memory pools available for the host.

if (auto Err = HostDevice->init())

return std::move(Err);

return NumDevices;

}

/// Deinitialize the plugin.

Error deinitImpl() override {

if (auto Err = HostDevice->deinit())

return Err;

// Finalize the HSA runtime.

hsa_status_t Status = hsa_shut_down();

return Plugin::check(Status, "Error in hsa_shut_down: %s");

}

/// Get the ELF code for recognizing the compatible image binary.

uint16_t getMagicElfBits() const override { return ELF::EM_AMDGPU; }

/// Check whether the image is compatible with an AMDGPU device.

Expected<bool> isImageCompatible(__tgt_image_info *Info) const override {

for (hsa_agent_t Agent : KernelAgents) {

std::string Target;

auto Err = utils::iterateAgentISAs(Agent, [&](hsa_isa_t ISA) {

uint32_t Length;

hsa_status_t Status;

Status = hsa_isa_get_info_alt(ISA, HSA_ISA_INFO_NAME_LENGTH, &Length);

if (Status != HSA_STATUS_SUCCESS)

return Status;

// TODO: Variable-length arrays are not allowed by the C++ standard.

char ISAName[Length];

Status = hsa_isa_get_info_alt(ISA, HSA_ISA_INFO_NAME, ISAName);

if (Status != HSA_STATUS_SUCCESS)

return Status;

llvm::StringRef TripleTarget(ISAName);

if (TripleTarget.consume_front("amdgcn-amd-amdhsa"))

Target = TripleTarget.ltrim('-').str();

return HSA_STATUS_SUCCESS;

});

if (Err)

return std::move(Err);

if (!utils::isImageCompatibleWithEnv(Info, Target))

return false;

}

return true;

}

/// This plugin does not support exchanging data between two devices.

bool isDataExchangable(int32_t SrcDeviceId, int32_t DstDeviceId) override {

return false;

}

/// Get the host device instance.

AMDHostDeviceTy &getHostDevice() {

assert(HostDevice && "Host device not initialized");

return *HostDevice;

}

/// Get the kernel agent with the corresponding agent id.

hsa_agent_t getKernelAgent(int32_t AgentId) const {

assert((uint32_t)AgentId < KernelAgents.size() && "Invalid agent id");

return KernelAgents[AgentId];

}

/// Get the list of the available kernel agents.

const llvm::SmallVector<hsa_agent_t> &getKernelAgents() const {

return KernelAgents;

}

private:

/// Event handler that will be called by ROCr if an event is detected.

static hsa_status_t eventHandler(const hsa_amd_event_t *Event, void *) {

if (Event->event_type == HSA_AMD_GPU_MEMORY_FAULT_EVENT) {

std::string Reasons;

uint32_t ReasonsMask = Event->memory_fault.fault_reason_mask;

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_PAGE_NOT_PRESENT)

Reasons += "HSA_AMD_MEMORY_FAULT_PAGE_NOT_PRESENT\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_READ_ONLY)

Reasons += " HSA_AMD_MEMORY_FAULT_READ_ONLY\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_NX)

Reasons += " HSA_AMD_MEMORY_FAULT_NX\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_HOST_ONLY)

Reasons += " HSA_AMD_MEMORY_FAULT_HOST_ONLY\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_DRAMECC)

Reasons += " HSA_AMD_MEMORY_FAULT_DRAMECC\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_IMPRECISE)

Reasons += " HSA_AMD_MEMORY_FAULT_IMPRECISE\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_SRAMECC)

Reasons += " HSA_AMD_MEMORY_FAULT_SRAMECC\n";

if (ReasonsMask & HSA_AMD_MEMORY_FAULT_HANG)

Reasons += " HSA_AMD_MEMORY_FAULT_HANG\n";

// Abort the execution since we do not recover from this error.

FATAL_MESSAGE(1,

"Found HSA_AMD_GPU_MEMORY_FAULT_EVENT in agent %" PRIu64

" at virtual address %p and reasons:\n %s",

Event->memory_fault.agent.handle,

(void *)Event->memory_fault.virtual_address,

Reasons.data());

return HSA_STATUS_ERROR;

}

return HSA_STATUS_SUCCESS;

jdoerfert (Unsubmitted, Done): Nit: early exit first.

}

jhuber6 (Unsubmitted, Done): Nit: no else after return.

/// Arrays of the available GPU and CPU agents. These arrays of handles should

/// not be here but in the AMDGPUDeviceTy structures directly. However, the

/// HSA standard does not provide API functions to retrieve agents directly,

/// only iterating functions. We cache the agents here for convenience.

llvm::SmallVector<hsa_agent_t> KernelAgents;

/// The device representing all HSA host agents.

AMDHostDeviceTy *HostDevice;

};
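The two nits on the event handler ("early exit first", "no else after return") describe a shape that can be sketched in isolation. The snippet below is only an illustration under stated assumptions: `MockEvent`, `MockEventType`, `MockFaultReason` and `describeFault` are hypothetical stand-ins introduced here, not the HSA types or the plugin's code, and only three reason bits are modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the HSA event types and fault-reason bits; the
// real definitions come from the HSA AMD extension header.
enum MockEventType { MOCK_GPU_MEMORY_FAULT_EVENT = 0, MOCK_OTHER_EVENT = 1 };
enum MockFaultReason : uint32_t {
  MOCK_FAULT_PAGE_NOT_PRESENT = 1u << 0,
  MOCK_FAULT_READ_ONLY = 1u << 1,
  MOCK_FAULT_NX = 1u << 2,
};
struct MockEvent {
  MockEventType Type;
  uint32_t FaultReasonMask;
};

// Returns an empty string for events that are not memory faults, otherwise
// one reason name per line, in bit order.
std::string describeFault(const MockEvent &Event) {
  // Early exit: anything that is not a memory fault needs no decoding.
  if (Event.Type != MOCK_GPU_MEMORY_FAULT_EVENT)
    return "";

  // No else branch is needed after the return above.
  std::string Reasons;
  if (Event.FaultReasonMask & MOCK_FAULT_PAGE_NOT_PRESENT)
    Reasons += "PAGE_NOT_PRESENT\n";
  if (Event.FaultReasonMask & MOCK_FAULT_READ_ONLY)
    Reasons += "READ_ONLY\n";
  if (Event.FaultReasonMask & MOCK_FAULT_NX)
    Reasons += "NX\n";
  return Reasons;
}
```

With the uninteresting case handled first, the decoding logic sits at the top indentation level of the function, which is the point of both review remarks.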

Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,

uint32_t NumThreads, uint64_t NumBlocks,

uint32_t DynamicMemorySize,

int32_t NumKernelArgs, void *KernelArgs,

AsyncInfoWrapperTy &AsyncInfoWrapper) const {

const uint32_t KernelArgsSize = NumKernelArgs * sizeof(void *);

if (ArgsSize < KernelArgsSize)

return Plugin::error("Mismatch of kernel arguments size");

// The args size reported by HSA may or may not contain the implicit args.

// For now, assume that HSA does not consider the implicit arguments when

// reporting the arguments of a kernel. In the worst case, we can waste

// 56 bytes per allocation.

uint32_t AllArgsSize = KernelArgsSize + ImplicitArgsSize;

AMDHostDeviceTy &HostDevice = Plugin::get<AMDGPUPluginTy>().getHostDevice();

AMDGPUMemoryManagerTy &ArgsMemoryManager = HostDevice.getArgsMemoryManager();

void *AllArgs = nullptr;

if (auto Err = ArgsMemoryManager.allocate(AllArgsSize, &AllArgs))

return Err;

// Initialize implicit arguments.

utils::impl_implicit_args_t *ImplArgs =

reinterpret_cast<utils::impl_implicit_args_t *>(

static_cast<char *>(AllArgs) + KernelArgsSize);

// Initialize the implicit arguments to zero.

std::memset(ImplArgs, 0, ImplicitArgsSize);

// Copy the explicit arguments.

for (int32_t ArgId = 0; ArgId < NumKernelArgs; ++ArgId) {

void *Dst = (char *)AllArgs + sizeof(void *) * ArgId;

void *Src = *((void **)KernelArgs + ArgId);

std::memcpy(Dst, Src, sizeof(void *));

}

AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(GenericDevice);

AMDGPUStreamTy &Stream = AMDGPUDevice.getStream(AsyncInfoWrapper);

// Push the kernel launch into the stream.

return Stream.pushKernelLaunch(*this, AllArgs, NumThreads, NumBlocks,

ArgsMemoryManager);

}
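The buffer layout built by launchImpl (explicit arguments first, zeroed implicit arguments right after) can be tried out on the host without HSA. This is a minimal sketch, not the plugin's code: `MockImplicitArgs` and `packKernelArgs` are hypothetical names, a `std::vector` replaces the plugin's memory manager, and the 56-byte size is only taken from the figure quoted in the comment above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for utils::impl_implicit_args_t. The real layout is
// defined elsewhere in the plugin; 56 bytes is assumed from the comment.
struct MockImplicitArgs {
  uint64_t Reserved[7];
};

// Builds [ explicit args | zeroed implicit args ] in one contiguous buffer.
// KernelArgs is an array of pointers, each pointing to a pointer-sized value.
std::vector<char> packKernelArgs(void **KernelArgs, int32_t NumKernelArgs) {
  const size_t KernelArgsSize = NumKernelArgs * sizeof(void *);
  std::vector<char> AllArgs(KernelArgsSize + sizeof(MockImplicitArgs));

  // Zero the implicit arguments that follow the explicit ones. The vector is
  // already zero-initialized; the memset mirrors the step in launchImpl.
  std::memset(AllArgs.data() + KernelArgsSize, 0, sizeof(MockImplicitArgs));

  // Copy each explicit argument by value into its pointer-sized slot.
  for (int32_t ArgId = 0; ArgId < NumKernelArgs; ++ArgId)
    std::memcpy(AllArgs.data() + sizeof(void *) * ArgId, KernelArgs[ArgId],
                sizeof(void *));
  return AllArgs;
}
```

Note that each entry of `KernelArgs` is dereferenced once: the buffer stores the argument values, not the addresses of the host variables that hold them.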

GenericPluginTy *Plugin::createPlugin() { return new AMDGPUPluginTy(); }

GenericDeviceTy *Plugin::createDevice(int32_t DeviceId, int32_t NumDevices) {

AMDGPUPluginTy &Plugin = get<AMDGPUPluginTy &>();

return new AMDGPUDeviceTy(DeviceId, NumDevices, Plugin.getHostDevice(),

Plugin.getKernelAgent(DeviceId));

}

GenericGlobalHandlerTy *Plugin::createGlobalHandler() {

return new AMDGPUGlobalHandlerTy();

}

template <typename... ArgsTy>

Error Plugin::check(int32_t Code, const char *ErrFmt, ArgsTy... Args) {

hsa_status_t ResultCode = static_cast<hsa_status_t>(Code);

if (ResultCode == HSA_STATUS_SUCCESS || ResultCode == HSA_STATUS_INFO_BREAK)

return Error::success();

const char *Desc = "Unknown error";

hsa_status_t Ret = hsa_status_string(ResultCode, &Desc);

if (Ret != HSA_STATUS_SUCCESS)

REPORT("Unrecognized " GETNAME(TARGET_NAME) " error code %d\n", Code);

return createStringError<ArgsTy..., const char *>(inconvertibleErrorCode(),

ErrFmt, Args..., Desc);

}

void *AMDGPUMemoryManagerTy::allocate(size_t Size, void *HstPtr,

TargetAllocTy Kind) {

// Allocate memory from the pool.

void *Ptr = nullptr;

if (auto Err = MemoryPool->allocate(Size, &Ptr)) {

consumeError(std::move(Err));

return nullptr;

}

assert(Ptr && "Invalid pointer");

auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();

// Allow all kernel agents to access the allocation.

if (auto Err = MemoryPool->enableAccess(Ptr, Size, KernelAgents)) {

REPORT("%s\n", toString(std::move(Err)).data());

return nullptr;

}

return Ptr;

}

void *AMDGPUDeviceTy::allocate(size_t Size, void *, TargetAllocTy Kind) {

if (Size == 0)

return nullptr;

// Find the correct memory pool.

AMDGPUMemoryPoolTy *MemoryPool = nullptr;

switch (Kind) {

case TARGET_ALLOC_DEFAULT:

case TARGET_ALLOC_DEVICE:

MemoryPool = CoarseGrainedMemoryPools[0];

break;

case TARGET_ALLOC_HOST:

MemoryPool = &HostDevice.getHostMemoryPool();

break;

case TARGET_ALLOC_SHARED:

// TODO: Not supported yet. We could look at fine-grained host memory

// pools that are accessible by this device. The allocation should be made

// explicitly accessible if it is not yet.

break;

}

if (!MemoryPool) {

REPORT("No memory pool for the specified allocation kind\n");

return nullptr;

}

// Allocate from the corresponding memory pool.

void *Alloc = nullptr;

if (Error Err = MemoryPool->allocate(Size, &Alloc)) {

REPORT("%s\n", toString(std::move(Err)).data());

return nullptr;

}

if (Kind == TARGET_ALLOC_HOST && Alloc) {

auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();

// Enable all kernel agents to access the host pinned buffer.

if (auto Err = MemoryPool->enableAccess(Alloc, Size, KernelAgents)) {

REPORT("%s\n", toString(std::move(Err)).data());

}

// Keep track of the host pinned allocations for optimizations in transfers.

std::lock_guard<std::shared_mutex> Lock(HostAllocationsMutex);

HostAllocations.insert({Alloc, Size});

}

return Alloc;

}

} // namespace plugin

} // namespace target

} // namespace omp

} // namespace llvm

openmp/libomptarget/plugins/amdgpu/dynamic_hsa/hsa.h

	[... 57 lines not shown ...]
	typedef enum {			typedef enum {
	HSA_ISA_INFO_NAME_LENGTH = 0,			HSA_ISA_INFO_NAME_LENGTH = 0,
	HSA_ISA_INFO_NAME = 1			HSA_ISA_INFO_NAME = 1
	} hsa_isa_info_t;			} hsa_isa_info_t;

	typedef enum {			typedef enum {
	HSA_AGENT_INFO_NAME = 0,			HSA_AGENT_INFO_NAME = 0,
	HSA_AGENT_INFO_VENDOR_NAME = 1,			HSA_AGENT_INFO_VENDOR_NAME = 1,
				HSA_AGENT_INFO_FEATURE = 2,
	HSA_AGENT_INFO_PROFILE = 4,			HSA_AGENT_INFO_PROFILE = 4,
	HSA_AGENT_INFO_WAVEFRONT_SIZE = 6,			HSA_AGENT_INFO_WAVEFRONT_SIZE = 6,
	HSA_AGENT_INFO_WORKGROUP_MAX_DIM = 7,			HSA_AGENT_INFO_WORKGROUP_MAX_DIM = 7,
	HSA_AGENT_INFO_WORKGROUP_MAX_SIZE = 8,			HSA_AGENT_INFO_WORKGROUP_MAX_SIZE = 8,
	HSA_AGENT_INFO_GRID_MAX_DIM = 9,			HSA_AGENT_INFO_GRID_MAX_DIM = 9,
	HSA_AGENT_INFO_GRID_MAX_SIZE = 10,			HSA_AGENT_INFO_GRID_MAX_SIZE = 10,
	HSA_AGENT_INFO_FBARRIER_MAX_SIZE = 11,			HSA_AGENT_INFO_FBARRIER_MAX_SIZE = 11,
	HSA_AGENT_INFO_QUEUES_MAX = 12,			HSA_AGENT_INFO_QUEUES_MAX = 12,
	HSA_AGENT_INFO_QUEUE_MIN_SIZE = 13,			HSA_AGENT_INFO_QUEUE_MIN_SIZE = 13,
	HSA_AGENT_INFO_QUEUE_MAX_SIZE = 14,			HSA_AGENT_INFO_QUEUE_MAX_SIZE = 14,
	HSA_AGENT_INFO_DEVICE = 17,			HSA_AGENT_INFO_DEVICE = 17,
	HSA_AGENT_INFO_CACHE_SIZE = 18,			HSA_AGENT_INFO_CACHE_SIZE = 18,
	HSA_AGENT_INFO_FAST_F16_OPERATION = 24,			HSA_AGENT_INFO_FAST_F16_OPERATION = 24,
	} hsa_agent_info_t;			} hsa_agent_info_t;

	typedef enum {			typedef enum {
	HSA_SYSTEM_INFO_VERSION_MAJOR = 0,			HSA_SYSTEM_INFO_VERSION_MAJOR = 0,
	HSA_SYSTEM_INFO_VERSION_MINOR = 1,			HSA_SYSTEM_INFO_VERSION_MINOR = 1,
	} hsa_system_info_t;			} hsa_system_info_t;

				typedef enum {
				HSA_AGENT_FEATURE_KERNEL_DISPATCH = 1,
				HSA_AGENT_FEATURE_AGENT_DISPATCH = 2,
				} hsa_agent_feature_t;

	typedef struct hsa_region_s {			typedef struct hsa_region_s {
	uint64_t handle;			uint64_t handle;
	} hsa_region_t;			} hsa_region_t;

	typedef struct hsa_isa_s {			typedef struct hsa_isa_s {
	uint64_t handle;			uint64_t handle;
	} hsa_isa_t;			} hsa_isa_t;

	[... 24 lines not shown ...]
	typedef int32_t hsa_signal_value_t;			typedef int32_t hsa_signal_value_t;
	#endif			#endif

	hsa_status_t hsa_signal_create(hsa_signal_value_t initial_value,			hsa_status_t hsa_signal_create(hsa_signal_value_t initial_value,
	uint32_t num_consumers,			uint32_t num_consumers,
	const hsa_agent_t *consumers,			const hsa_agent_t *consumers,
	hsa_signal_t *signal);			hsa_signal_t *signal);

				hsa_status_t hsa_amd_signal_create(hsa_signal_value_t initial_value,
				uint32_t num_consumers,
				const hsa_agent_t* consumers,
				uint64_t attributes,
				hsa_signal_t* signal);

	hsa_status_t hsa_signal_destroy(hsa_signal_t signal);			hsa_status_t hsa_signal_destroy(hsa_signal_t signal);

	void hsa_signal_store_relaxed(hsa_signal_t signal, hsa_signal_value_t value);			void hsa_signal_store_relaxed(hsa_signal_t signal, hsa_signal_value_t value);

	void hsa_signal_store_screlease(hsa_signal_t signal, hsa_signal_value_t value);			void hsa_signal_store_screlease(hsa_signal_t signal, hsa_signal_value_t value);

				hsa_signal_value_t hsa_signal_load_scacquire(hsa_signal_t signal);

				void hsa_signal_subtract_screlease(hsa_signal_t signal,
				hsa_signal_value_t value);

	typedef enum {			typedef enum {
	HSA_SIGNAL_CONDITION_EQ = 0,			HSA_SIGNAL_CONDITION_EQ = 0,
	HSA_SIGNAL_CONDITION_NE = 1,			HSA_SIGNAL_CONDITION_NE = 1,
	} hsa_signal_condition_t;			} hsa_signal_condition_t;

	typedef enum {			typedef enum {
	HSA_WAIT_STATE_BLOCKED = 0,			HSA_WAIT_STATE_BLOCKED = 0,
	HSA_WAIT_STATE_ACTIVE = 1			HSA_WAIT_STATE_ACTIVE = 1
	} hsa_wait_state_t;			} hsa_wait_state_t;

	hsa_signal_value_t hsa_signal_wait_scacquire(hsa_signal_t signal,			hsa_signal_value_t hsa_signal_wait_scacquire(hsa_signal_t signal,
	hsa_signal_condition_t condition,			hsa_signal_condition_t condition,
	hsa_signal_value_t compare_value,			hsa_signal_value_t compare_value,
	uint64_t timeout_hint,			uint64_t timeout_hint,
	hsa_wait_state_t wait_state_hint);			hsa_wait_state_t wait_state_hint);

	typedef enum {			typedef enum {
	HSA_QUEUE_TYPE_MULTI = 0,			HSA_QUEUE_TYPE_MULTI = 0,
	HSA_QUEUE_TYPE_SINGLE = 1,			HSA_QUEUE_TYPE_SINGLE = 1,
	} hsa_queue_type_t;			} hsa_queue_type_t;

				typedef enum {
				HSA_QUEUE_FEATURE_KERNEL_DISPATCH = 1,
				HSA_QUEUE_FEATURE_AGENT_DISPATCH = 2
				} hsa_queue_feature_t;

	typedef uint32_t hsa_queue_type32_t;			typedef uint32_t hsa_queue_type32_t;

	typedef struct hsa_queue_s {			typedef struct hsa_queue_s {
	hsa_queue_type32_t type;			hsa_queue_type32_t type;
	uint32_t features;			uint32_t features;

	#ifdef HSA_LARGE_MODEL			#ifdef HSA_LARGE_MODEL
	void *base_address;			void *base_address;
	[... 21 lines not shown ...]

	uint64_t hsa_queue_load_read_index_scacquire(const hsa_queue_t *queue);			uint64_t hsa_queue_load_read_index_scacquire(const hsa_queue_t *queue);

	uint64_t hsa_queue_add_write_index_relaxed(const hsa_queue_t *queue,			uint64_t hsa_queue_add_write_index_relaxed(const hsa_queue_t *queue,
	uint64_t value);			uint64_t value);

	typedef enum {			typedef enum {
	HSA_PACKET_TYPE_KERNEL_DISPATCH = 2,			HSA_PACKET_TYPE_KERNEL_DISPATCH = 2,
				HSA_PACKET_TYPE_BARRIER_AND = 3,
	} hsa_packet_type_t;			} hsa_packet_type_t;

	typedef enum { HSA_FENCE_SCOPE_SYSTEM = 2 } hsa_fence_scope_t;			typedef enum { HSA_FENCE_SCOPE_SYSTEM = 2 } hsa_fence_scope_t;

	typedef enum {			typedef enum {
	HSA_PACKET_HEADER_TYPE = 0,			HSA_PACKET_HEADER_TYPE = 0,
	HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE = 9,			HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE = 9,
	HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE = 11			HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE = 11
	[... 28 lines not shown ...]
	#else			#else
	uint32_t reserved1;			uint32_t reserved1;
	void *kernarg_address;			void *kernarg_address;
	#endif			#endif
	uint64_t reserved2;			uint64_t reserved2;
	hsa_signal_t completion_signal;			hsa_signal_t completion_signal;
	} hsa_kernel_dispatch_packet_t;			} hsa_kernel_dispatch_packet_t;

				typedef struct hsa_barrier_and_packet_s {
				uint16_t header;
				uint16_t reserved0;
				uint32_t reserved1;
				hsa_signal_t dep_signal[5];
				uint64_t reserved2;
				hsa_signal_t completion_signal;
				} hsa_barrier_and_packet_t;

	typedef enum { HSA_PROFILE_BASE = 0, HSA_PROFILE_FULL = 1 } hsa_profile_t;			typedef enum { HSA_PROFILE_BASE = 0, HSA_PROFILE_FULL = 1 } hsa_profile_t;

	typedef enum {			typedef enum {
	HSA_EXECUTABLE_STATE_UNFROZEN = 0,			HSA_EXECUTABLE_STATE_UNFROZEN = 0,
	HSA_EXECUTABLE_STATE_FROZEN = 1			HSA_EXECUTABLE_STATE_FROZEN = 1
	} hsa_executable_state_t;			} hsa_executable_state_t;

	typedef struct hsa_executable_s {			typedef struct hsa_executable_s {
	[... 21 lines not shown ...]
	} hsa_code_object_t;			} hsa_code_object_t;

	typedef enum {			typedef enum {
	HSA_SYMBOL_KIND_VARIABLE = 0,			HSA_SYMBOL_KIND_VARIABLE = 0,
	HSA_SYMBOL_KIND_KERNEL = 1,			HSA_SYMBOL_KIND_KERNEL = 1,
	HSA_SYMBOL_KIND_INDIRECT_FUNCTION = 2			HSA_SYMBOL_KIND_INDIRECT_FUNCTION = 2
	} hsa_symbol_kind_t;			} hsa_symbol_kind_t;

				typedef enum {
				HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT = 0,
				HSA_DEFAULT_FLOAT_ROUNDING_MODE_ZERO = 1,
				HSA_DEFAULT_FLOAT_ROUNDING_MODE_NEAR = 2,
				} hsa_default_float_rounding_mode_t;

	hsa_status_t hsa_memory_copy(void *dst, const void *src, size_t size);			hsa_status_t hsa_memory_copy(void *dst, const void *src, size_t size);

	hsa_status_t hsa_executable_create(hsa_profile_t profile,			hsa_status_t hsa_executable_create(hsa_profile_t profile,
	hsa_executable_state_t executable_state,			hsa_executable_state_t executable_state,
	const char *options,			const char *options,
	hsa_executable_t *executable);			hsa_executable_t *executable);

				hsa_status_t hsa_executable_create_alt(
				hsa_profile_t profile,
				hsa_default_float_rounding_mode_t default_float_rounding_mode,
				const char *options,
				hsa_executable_t *executable);

	hsa_status_t hsa_executable_destroy(hsa_executable_t executable);			hsa_status_t hsa_executable_destroy(hsa_executable_t executable);

	hsa_status_t hsa_executable_freeze(hsa_executable_t executable,			hsa_status_t hsa_executable_freeze(hsa_executable_t executable,
	const char *options);			const char *options);

				hsa_status_t hsa_executable_validate(hsa_executable_t executable,
				uint32_t *result);

	hsa_status_t			hsa_status_t
	hsa_executable_symbol_get_info(hsa_executable_symbol_t executable_symbol,			hsa_executable_symbol_get_info(hsa_executable_symbol_t executable_symbol,
	hsa_executable_symbol_info_t attribute,			hsa_executable_symbol_info_t attribute,
	void *value);			void *value);

	hsa_status_t hsa_executable_iterate_symbols(			hsa_status_t hsa_executable_iterate_symbols(
	hsa_executable_t executable,			hsa_executable_t executable,
	hsa_status_t (*callback)(hsa_executable_t exec,			hsa_status_t (*callback)(hsa_executable_t exec,
	hsa_executable_symbol_t symbol, void *data),			hsa_executable_symbol_t symbol, void *data),
	void *data);			void *data);

				hsa_status_t hsa_executable_get_symbol_by_name(
				hsa_executable_t executable,
				const char *symbol_name,
				const hsa_agent_t *agent,
				hsa_executable_symbol_t *symbol);

	hsa_status_t hsa_code_object_deserialize(void *serialized_code_object,			hsa_status_t hsa_code_object_deserialize(void *serialized_code_object,
	size_t serialized_code_object_size,			size_t serialized_code_object_size,
	const char *options,			const char *options,
	hsa_code_object_t *code_object);			hsa_code_object_t *code_object);

	hsa_status_t hsa_executable_load_code_object(hsa_executable_t executable,			hsa_status_t hsa_executable_load_code_object(hsa_executable_t executable,
	hsa_agent_t agent,			hsa_agent_t agent,
	hsa_code_object_t code_object,			hsa_code_object_t code_object,
	const char *options);			const char *options);

				hsa_status_t hsa_code_object_destroy(hsa_code_object_t code_object);

				typedef bool (*hsa_amd_signal_handler)(hsa_signal_value_t value, void *arg);

				hsa_status_t hsa_amd_signal_async_handler(hsa_signal_t signal,
				hsa_signal_condition_t cond,
				hsa_signal_value_t value,
				hsa_amd_signal_handler handler,
				void* arg);

	#ifdef __cplusplus			#ifdef __cplusplus
	}			}
	#endif			#endif

	#endif			#endif
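The header now declares hsa_barrier_and_packet_t with five dependency signals and a completion signal. As a rough host-side model of barrier-AND semantics (the packet retires once every dependency signal reads zero, then its completion signal is decremented), the following sketch may help. It assumes nothing about how rtl.cpp actually enqueues such packets: `MockSignal`, `MockBarrierAnd` and `tryRetire` are hypothetical names, and plain atomics stand in for HSA signals.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a signal: just an atomic counter.
using MockSignal = std::atomic<int64_t>;

// Model of a barrier-AND packet: up to five dependency signals (null entries
// are ignored) and one optional completion signal.
struct MockBarrierAnd {
  std::array<MockSignal *, 5> DepSignals{};
  MockSignal *Completion = nullptr;
};

// Returns true and decrements the completion signal if every non-null
// dependency has reached zero; otherwise leaves everything untouched.
bool tryRetire(MockBarrierAnd &Packet) {
  for (MockSignal *Dep : Packet.DepSignals)
    if (Dep && Dep->load(std::memory_order_acquire) != 0)
      return false;
  if (Packet.Completion)
    Packet.Completion->fetch_sub(1, std::memory_order_release);
  return true;
}
```

Chaining operations then amounts to using one operation's completion signal as a dependency signal of the next, which is the input/output dependency idea the revision summary describes for streams.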


Revision Contents

Diff 482161

openmp/libomptarget/plugins-nextgen/CMakeLists.txt

openmp/libomptarget/plugins-nextgen/amdgpu/CMakeLists.txt

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

openmp/libomptarget/plugins/amdgpu/dynamic_hsa/hsa.h
