This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior
ClosedPublic

Authored by kevinsala on Nov 20 2022, 6:24 PM.

Details

Summary

This patch adds the AMDGPU NextGen plugin, inheriting from the PluginInterface classes. It also implements asynchronous behavior for the plugin operations: kernel launches and memory transfers. To this end, it introduces the concept of streams of asynchronous operations, implemented on top of HSA signals, which define input and output dependencies between the asynchronous operations.
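
As a rough illustration of that signal-based chaining (a minimal sketch using plain HSA calls, not the patch's actual stream implementation), each operation's completion signal can be passed as an input dependency of the next one, so only the tail of the stream needs to be waited on:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstddef>
#include <cstdint>

// Two chained async copies: the second starts only after the first completes.
hsa_status_t chainedCopies(hsa_agent_t Agent, void *Dst1, const void *Src1,
                           void *Dst2, const void *Src2, size_t Size) {
  hsa_signal_t Done1, Done2;
  // Completion signals start at 1 and are decremented to 0 by the runtime.
  if (hsa_signal_create(1, 0, nullptr, &Done1) != HSA_STATUS_SUCCESS)
    return HSA_STATUS_ERROR;
  if (hsa_signal_create(1, 0, nullptr, &Done2) != HSA_STATUS_SUCCESS) {
    hsa_signal_destroy(Done1);
    return HSA_STATUS_ERROR;
  }

  // First copy: no input dependencies; Done1 is its output dependency.
  hsa_status_t S = hsa_amd_memory_async_copy(Dst1, Agent, Src1, Agent, Size,
                                             /*num_dep_signals=*/0, nullptr,
                                             Done1);
  if (S == HSA_STATUS_SUCCESS)
    // Second copy: Done1 is its input dependency, Done2 its output dependency.
    S = hsa_amd_memory_async_copy(Dst2, Agent, Src2, Agent, Size,
                                  /*num_dep_signals=*/1, &Done1, Done2);

  if (S == HSA_STATUS_SUCCESS)
    // Synchronizing on the stream only requires waiting for the last signal.
    hsa_signal_wait_scacquire(Done2, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);

  hsa_signal_destroy(Done1);
  hsa_signal_destroy(Done2);
  return S;
}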

Missing features:

  • Retrieve the maximum number of threads per group that a kernel can run with. This requires reading the image.
  • Implement __tgt_rtl_sync_event, not used on the libomptarget side.

Diff Detail

Event Timeline

kevinsala created this revision.Nov 20 2022, 6:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 20 2022, 6:24 PM
kevinsala requested review of this revision.Nov 20 2022, 6:24 PM

Looks good overall to me, mostly nits. I'm not the most qualified to scrutinize the HSA usage; maybe @JonChesterfield or @saiislam could help on that front.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
44

This is declared as a struct here then defined as a class later.

65

Empty argument isn't needed in C++17

149

These can be StringRef since they're constants. Avoids the heap allocation.

293

I'm not deeply familiar with the MemoryManager, but I thought that we didn't use it for AMD because the HSA plugin already handled its own caching in the memory pool? Maybe @JonChesterfield and @tianshilei1992 could comment further.

416

Never looked into it much myself, but does std::atomic allow us to avoid using a compiler builtin for this? Not a big deal if not.

537–538

Small nit, can use LLVM's string, it's basically just a small vector of chars.

1010

typo

1200

Probably best to specify the template arguments.

1573

Unused

1713

Using const_cast isn't great, but otherwise the compiler will complain with warnings enabled.

1734

Same here.

1761

Nit, no else after return

2049

Is this supposed to be marked override?

2177

Best make a new one here as this was checked above.

2298

Nit. no else after return.

There's a lot of code here. Some of it very familiar, in copy/paste fashion from the current plugin. Some of it quite subtle in terms of whether it's going to work or not. Not an easy thing to review successfully in one block.

How was this tested? What's the intended lifetime plan for new and old plugins (it looks like cuda's old one is still with us)?

There's a lot of code here. Some of it very familiar, in copy/paste fashion from the current plugin. Some of it quite subtle in terms of whether it's going to work or not. Not an easy thing to review successfully in one block.

I fondly remember when we merged the AMD plugin, which is to this day full of dead code... That said, @kevinsala, everything that is not AMD-related should be split off. Make one or more pre-commits to update interfaces etc. of the generic parts.

How was this tested?

We need a configuration for our regression tests. I thought we had one (or a patch) already. @kevinsala @jhuber6

What's the intended lifetime plan for new and old plugins (it looks like cuda's old one is still with us)?

The next release will have both. After we branch we'll delete the old plugins. Like we did with the runtime.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/GlobalHandler.cpp
136 ↗(On Diff #476782)

@kevinsala: Split these changes into a separate pre-commit please.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
570 ↗(On Diff #476782)

@kevinsala: Split these changes into a separate pre-commit please.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
764 ↗(On Diff #476782)

@kevinsala: Split these changes into a separate pre-commit please.

openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
1004 ↗(On Diff #476782)

@kevinsala: Split these changes into a separate pre-commit please.

openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp
371 ↗(On Diff #476782)

@kevinsala: Split these changes into a separate pre-commit please.

jplehr added a subscriber: jplehr.Nov 23 2022, 6:49 AM
jdoerfert added inline comments.Nov 23 2022, 11:10 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
293

I'm generally in favor of using it either way.

335
416

We'd need to bend over backwards as the definitions are plain pointers. Let's keep it this way.

674

Pick a size that is at least the default Capacity.

830

Isn't this a problem? Shouldn't we wait instead? Same in other places. Let's mark it as TODO and address it in a follow up.

1252–1253
  • CUDA + AMD
1462–1463

4 x 512 x 128 might be enough

ye-luo added a subscriber: ye-luo.Nov 23 2022, 11:21 AM
ye-luo added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1462

What is QUEUE_SIZE? Prefer not to use SIZE but NUM_XXX_PER_QUEUE

1463

What is the STREAM_SIZE?

1464

LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES?

jdoerfert added inline comments.Nov 23 2022, 2:41 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1464

We need documentation for those in openmp/docs, next to the other env vars.

How was this tested?

We need a configuration for our regression tests. I thought we had one (or a patch) already. @kevinsala @jhuber6

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
830

Yes, there is no problem waiting on it. Even so, this check will go away with the new version of dynamically sized streams.

1464

I'll add the documentation for these envars. They control the following aspects:

  1. LIBOMPTARGET_AMDGPU_QUEUE_SIZE: The number of HSA packets that can be pushed into each HSA queue without waiting for the driver to process them.
  2. LIBOMPTARGET_AMDGPU_STREAM_SIZE: The number of asynchronous operations (e.g., kernel launches, memory transfers) that can be pushed into each of our streams without waiting for them to complete. With the upcoming patch implementing dynamically sized streams, this envar will be renamed and become a hint for the initial stream size.
  3. LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE: Up to this size, memory copies (__tgt_rtl_data_submit, __tgt_rtl_data_retrieve) are asynchronous operations appended to the corresponding stream. Larger transfers become synchronous. This can be seen in dataSubmitImpl (line 1695); a sketch of the idea follows this list.
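
A minimal sketch of that threshold logic (the names appendAsyncCopy, syncCopy, and MaxAsyncCopyBytes are hypothetical stand-ins, not the plugin's actual API):

#include <cstddef>
#include <cstdio>

// Stand-in for staging the data through pinned memory and pushing the copy
// onto the stream without waiting for completion.
static void appendAsyncCopy(void *Dst, const void *Src, size_t Size) {
  (void)Dst; (void)Src;
  std::printf("queued async copy of %zu bytes\n", Size);
}

// Stand-in for a blocking copy that waits on its completion signal.
static void syncCopy(void *Dst, const void *Src, size_t Size) {
  (void)Dst; (void)Src;
  std::printf("performed synchronous copy of %zu bytes\n", Size);
}

void submitData(void *Dst, const void *Src, size_t Size,
                size_t MaxAsyncCopyBytes) {
  if (Size <= MaxAsyncCopyBytes)
    appendAsyncCopy(Dst, Src, Size); // Small transfers stay asynchronous.
  else
    syncCopy(Dst, Src, Size);        // Large transfers block the caller.
}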

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

This plugin introduces new functionality which seems unlikely to be covered by existing make check, particularly given quite a lot of that is disabled on amdgpu anyway.

AMD's aomp and rocm forks have more aggressive runtime testing but presumably don't make heavy use of async given it doesn't exist before this patch.

Code review isn't likely to catch problems in thousands of lines of async c++. Can we split this into a non-functional change where the current plugin is refactored to use the new interface, or equivalently split all the async feature gain out of this to get a more boring alternative plugin which is expected to behave identically to the current one? That lets us distinguish between unintentional behaviour changes (when the boring/refactor doesn't work) and intentional ones (which ideally would have tests).

The big-bang replacement of a large functional unit with a totally new one that behaves differently tends to leave Intel's and AMD's product forks behind, which is unfortunate when upstream testing is relatively superficial and code review struggles to spot bugs in large commits.

ye-luo added inline comments.Nov 25 2022, 10:08 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1464

Thanks for adding docs.

  1. What I found in the HSA doc is "Number of packets the queue is expected to hold". This is more understandable than your line. I think your description is true, but it is a second-level interpretation; both lines can be useful in the doc. LIBOMPTARGET_AMDGPU_QUEUE_SIZE is an insufficient name; LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE is better IMO.
  2. A STREAM may hold many items of different sizes, so STREAM_SIZE is too vague. Better to have something like ASYNC_OP_DEPTH_PER_STREAM.
  3. Still prefer COPY_BYTES.

For functional tests I've used the LLVM OpenMP target tests (make check) and miniQMC. For performance experiments, the miniQMC directly. Is there any test suite you use for functional/performance testing?

This plugin introduces new functionality which seems unlikely to be covered by existing make check, particularly given quite a lot of that is disabled on amdgpu anyway.

AMD's aomp and rocm forks have more aggressive runtime testing but presumably don't make heavy use of async given it doesn't exist before this patch.

Code review isn't likely to catch problems in thousands of lines of async c++. Can we split this into a non-functional change where the current plugin is refactored to use the new interface, or equivalently split all the async feature gain out of this to get a more boring alternative plugin which is expected to behave identically to the current one? That lets us distinguish between unintentional behaviour changes (when the boring/refactor doesn't work) and intentional ones (which ideally would have tests).

The big-bang replacement of a large functional unit with a totally new one that behaves differently tends to leave Intel's and AMD's product forks behind, which is unfortunate when upstream testing is relatively superficial and code review struggles to spot bugs in large commits.

  1. Intel and AMD not upstreaming their tests is not our problem per se. I'd love for them to do it, but they would also need to upstream/use the functionality in their plugins. They had 2+ years to do that; "waiting" doesn't help. Maybe if we diverge more they will finally have some incentive to work closer to/with upstream.
  2. We should not apply two different sets of review rules only to make it easier for AMD. Your initial drop of the "old" plugin (https://reviews.llvm.org/D85742) had all the problems you listed above and more (including many more lines of dead and untested code). Most of the things you argue are bad about this patch are literally stated as the state of the "old" plugin in the commit message of D85742. Still, the sentiment then was to merge it "as-is and then improve it in-tree", with an explanation of why that is better.
  3. The split between the non-AMD-related parts and the actual AMD plugin was already part of my prior review, and I'm sure @kevinsala will address the last of those comments soon (=before we merge this).
kosarev added a reviewer: Restricted Project.Nov 28 2022, 10:42 AM
arsenm added a subscriber: arsenm.Nov 28 2022, 11:18 AM
arsenm added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138

Do we not have this in a shared place already?

jhuber6 added inline comments.Nov 28 2022, 11:20 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138

I believe we have some support in clang somewhere, we could probably move that to somewhere in llvm and include it here instead.

jdoerfert added inline comments.Nov 28 2022, 1:26 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138

I used the github search and the only place I found that looks for the sramecc+ string is the existing AMD plugin (same function basically):
https://github.com/llvm/llvm-project/blob/a35ad711d90497994701a99723a81badf3d4348e/openmp/libomptarget/plugins/amdgpu/src/rtl.cpp#L1855

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138

If you find it fine, I can move it to a new utils header in plugins-nextgen/amdgpu/ and be included by both AMDGPU plugins.

  1. Intel and AMD not upstreaming their tests is not our problem per se. I'd love for them to do it, but they would also need to upstream/use the functionality in their plugins. They had 2+ years to do that; "waiting" doesn't help. Maybe if we diverge more they will finally have some incentive to work closer to/with upstream.

Yeah, likewise. There was a period where trunk, rocm and aomp all had exactly the same implementation for this plugin which was hard won. After the initial drop I mean, I spent a few weeks to pull them back into exact alignment and encouraged the internal devs to patch trunk instead. Didn't take. Today's status is the trunk plugin looks pretty much like the rocm one used to, but the rocm one has a bunch of stuff changed in it for a new code object ABI and some variant of asynchronous offloading. Further divergence seems unlikely to persuade the people who prefer working internally to stop doing that.

  1. We should not apply two different sets of review rules only to make it easier for AMD. Your initial drop of the "old" plugin (https://reviews.llvm.org/D85742) had all the problems you listed above and more (including many more lines of dead and untested code). Most of the things you argue are bad about this patch are literally stated as the state of the "old" plugin in the commit message of D85742. Still, the sentiment then was to merge it "as-is and then improve it in-tree", with an explanation of why that is better.

Pretty much. The version today is smaller and saner than the initial drop but not refactored to a fixpoint by any stretch. This suffered from me handing it over to @pdhaliwal who did good work before moving on to greener pastures, and I haven't really picked it back up. I'm not clear why various AMD devs decided to patch the rocm version instead of upstream - gentle encouragement and declining to review things on the internal board hasn't really changed anything there.

One thing the initial drop of non-review-compliant and suspected buggy code did have going for it was that it passed a bunch of application testing when run alongside a broadly similar clang toolchain. So it didn't come with in tree tests to keep it known working but at least it started out OK. I would guess we can do similar here - land the new plugin without tests, and manually feed it through various levels of internal testing and report back what works. That'll probably suffice, painfully, for getting it to functional equivalence with the current plugin.

If there's a load of tests for the asynchronous behaviour that landed for cuda (which I didn't notice at the time) then we can xfail the ones that don't work on amdgpu and (ideally) fix those over time. I'd quite like the plugins to be unit tested, as opposed to running whole openmp programs through them, but am not sure what can be done in terms of infra for that. Feels like we might need a mock/stub GPU to tie it to, which perhaps works better with more of the plugin code shared than is currently the case.

A few comments above. In the general case, I don't see the benefit of getter constructs like:
hsa_signal_t get() const { return Signal; }
can we either not allow access to the private member, or just make it public and drop the getter?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182

this is pointless, just make the member public if we have to leak it. but again, existing code...

221

this is a scary comment, though presumably in existing code

293

IIUC the HSA interface with 'memory pool' in the name doesn't implement a memory pool so caching it here is probably faster. I expect it'll interact badly with the asan development effort but that was rocm-internal last time I looked at it so can't influence decisions here

394

does this behave sensibly on wrap around? the arithmetic used elsewhere is

static uint64_t acquireAvailablePacketId(hsa_queue_t *Queue) {
  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
  bool Full = true;
  while (Full) {
    Full =
        PacketId >= (Queue->size + hsa_queue_load_read_index_scacquire(Queue));
  }
  return PacketId;
}
453

This is strange. The whole point of the queue abstraction is that multiple threads can enqueue work onto it at the same time, so why would we be locking it?

482

Is {0} a meaningful value for a signal? e.g. can we pass it to signal_destroy or similar without it falling over?

500

comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad

508

This looks racy. If a few threads are calling decrement and one calls reset, it's going to break.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
726

Confused by this, why are we locking the queue?

969

passing queue.getAgent() as two arguments here seems suspicious - shouldn't one of these be the GPU and one the host?

1932

I can't work out what's going on here. The corresponding logic for erase looks up the pointer directly, should found not be the same? Also can't tell why we're recording the size of the allocation next to the pointer, as opposed to a DenseSet<void*>

1974

this is implemented as a tree - why std::map?

kevinsala added inline comments.Dec 3 2022, 4:27 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
221

I can implement it, but I would perform this check only in debug mode.

394

I obtained that method from the HSA standard document.

453

Multiple threads may enqueue work onto the queue concurrently thanks to the atomic operations used to retrieve a packet slot and to publish it (ring the doorbell). But, if I understood correctly, we shouldn't publish a packet (ring the doorbell with that packet id) while any previous packet in the queue is not yet ready to be processed. For instance, this sequence would be invalid according to my understanding:

   Thread 1 (T1):                               Thread 2 (T2):
   ==============                               ==============
1. Gets packet slot 0
2. Fills packet 0's fields                      Gets packet slot 1
3.                                              Fills packet 1's fields
4.                                              Publishes packet, writing doorbell with id 1
5. Publishes packet, writing doorbell with id 0

We could probably reduce the locked section or remove the lock by preventing T2 (or any thread, in the general case) from ringing the doorbell while any previous packet is not ready. But I don't consider that critical currently; if this lock turns out to be a bottleneck, we can try to optimize it.
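
A minimal sketch of that locked publish sequence using plain HSA queue operations (illustrative only, not the patch's code):

#include <hsa/hsa.h>
#include <mutex>

// Publish one kernel dispatch packet under a lock so packets become visible
// to the packet processor in order. Prototype is assumed to carry an INVALID
// header; the valid header is stored last, with release semantics.
void publishDispatch(hsa_queue_t *Queue, std::mutex &QueueLock,
                     const hsa_kernel_dispatch_packet_t &Prototype) {
  std::lock_guard<std::mutex> Lock(QueueLock);

  // Acquire the next packet slot; spin if the queue is full.
  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
  while (PacketId - hsa_queue_load_read_index_scacquire(Queue) >= Queue->size)
    ; // Busy wait; a real implementation would back off or yield.

  // Fill the packet body while its header is still INVALID.
  auto *Packets =
      static_cast<hsa_kernel_dispatch_packet_t *>(Queue->base_address);
  hsa_kernel_dispatch_packet_t &Packet = Packets[PacketId % Queue->size];
  Packet = Prototype;

  // Publish: atomically store a valid header, then ring the doorbell.
  uint16_t Header =
      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
  __atomic_store_n(&Packet.header, Header, __ATOMIC_RELEASE);
  hsa_signal_store_screlease(Queue->doorbell_signal, PacketId);
}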

482

We shouldn't call destroy with the {0} handle. I can add asserts.

500

True, that's something I was experimenting with. I'll change it back to blocked wait.

508

Reset is expected to be called when the signal is no longer being used by any other thread or by the HSA runtime. It should be called just before the signal is reused.

1932

The __tgt_rtl_data_delete operation should pass the same pointer provided by the __tgt_rtl_data_alloc. As far as I know, it's not valid to make a partial deletion of an allocated buffer. But in the case of __tgt_rtl_data_submit/retrieve, the pointer can come with an applied offset. Thus, we should check whether the provided pointer is inside any host allocation, considering the sizes of the allocations.

1974

To perform lower_bound operations with logarithmic complexity.
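
For illustration, a logarithmic lookup of the allocation containing an offset pointer could look like the following (a sketch assuming a map from base pointer to allocation size, not the plugin's actual data structure):

#include <cstddef>
#include <cstdint>
#include <map>

using HostAllocMap = std::map<const void *, size_t>;

// Returns the base pointer of the recorded allocation that contains Ptr, or
// nullptr if Ptr does not fall inside any of them. O(log n) in the map size.
const void *findContainingAllocation(const HostAllocMap &Allocs,
                                     const void *Ptr) {
  // First entry with a base strictly greater than Ptr; the candidate is the
  // entry right before it.
  auto It = Allocs.upper_bound(Ptr);
  if (It == Allocs.begin())
    return nullptr;
  --It;

  auto Base = reinterpret_cast<uintptr_t>(It->first);
  auto Target = reinterpret_cast<uintptr_t>(Ptr);
  return (Target < Base + It->second) ? It->first : nullptr;
}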

jdoerfert added inline comments.Dec 3 2022, 6:48 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182

It's not pointless, it gives a common name for abstracted resources. Often this would be operator* or similar but get is as good as anything for now. Let's not go down rabbit holes that are indeed pointless.

500

I'm not sure which way to go. Add a TODO to try this out. I'm uncertain if the host thread can do something useful while it's not busy waiting.

508

Add that to the function comment please.

726

I think that's related to the answer above. Add a TODO to look into this.

1932

We might want to make our lives easier here and not do a search.
Later we want two changes that will help:

  1. Have a pre-allocated pinned buffer for arguments.
  2. Use the hsa lookup function if it's not a pointer to pinned memory allocated by us.

@kevinsala Do you think we need the search for the current use cases?

kevinsala added inline comments.Dec 4 2022, 1:13 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
182

I don't think it's pointless; we limit the access to the HSA handle and only expose it when it's really needed. And we can also detect easily where these handles are being accessed.

1932

We can replace the search with a call to hsa_amd_pointer_info and let the HSA runtime do the work. If the buffer was explicitly locked by the user (malloc + HSA lock), it should return HSA_EXT_POINTER_TYPE_LOCKED. If the buffer was allocated using the HSA allocator functions, I guess it will return HSA_EXT_POINTER_TYPE_HSA. The latter does not mean that the buffer is host-pinned memory, because it may also be any other kind of memory allocated through the HSA API, but we can assume the user won't pass such invalid buffer types.
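
A sketch of that query (illustrative only; it assumes treating both the LOCKED and HSA pointer types as usable, as discussed above):

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Classify a host pointer through the HSA runtime instead of our own search.
bool isPinnedHostMemory(void *Ptr) {
  hsa_amd_pointer_info_t Info;
  Info.size = sizeof(hsa_amd_pointer_info_t);
  hsa_status_t S =
      hsa_amd_pointer_info(Ptr, &Info, /*alloc=*/nullptr,
                           /*num_agents_accessible=*/nullptr,
                           /*accessible=*/nullptr);
  if (S != HSA_STATUS_SUCCESS)
    return false;

  // LOCKED: user memory locked via hsa_amd_memory_lock.
  // HSA: memory allocated through the HSA/HSA-AMD allocators.
  return Info.type == HSA_EXT_POINTER_TYPE_LOCKED ||
         Info.type == HSA_EXT_POINTER_TYPE_HSA;
}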

JonChesterfield accepted this revision.Dec 7 2022, 1:32 PM

This has fared OK under ad hoc testing, but we're struggling with moving code between branches, so my preference would be to ship this roughly as-is and iterate in place. I'm going to mark it accepted for what it's worth, but we probably want confirmation from @jdoerfert before merging.

Thank you for digging through the HSA interface to put this together!

This revision is now accepted and ready to land.Dec 7 2022, 1:32 PM
jhuber6 accepted this revision.Dec 7 2022, 1:34 PM

I'll accept it under the expectation that the comments Johannes and I made get addressed.

I'm working on fixing all your comments, allowing the streams to change their size based on the demand of pushed async operations, and also re-using their HSA signals. These last aspects should reduce the number of signals we need at a given time. I'll update the patch soon. Thanks for the reviews!

kevinsala updated this revision to Diff 482161.Dec 12 2022, 9:26 AM

Rebasing onto main to include several pre-commits. Fixing several reviewers' comments. Also making streams dynamically sized (using std::deque to avoid invalidating elements on insertion). In this way, we can re-use signals through a signal resource manager to reduce signal consumption.
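
As a rough sketch of such a signal resource manager (illustrative only, not the patch's actual classes), signals are created lazily, handed out for one asynchronous operation, and returned for reuse rather than destroyed:

#include <hsa/hsa.h>
#include <deque>
#include <mutex>

class SignalPoolSketch {
  std::mutex Mutex;
  std::deque<hsa_signal_t> FreeSignals;

public:
  // Hand out a signal initialized to 1 (decremented to 0 on completion).
  hsa_status_t acquire(hsa_signal_t &Signal) {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      if (!FreeSignals.empty()) {
        Signal = FreeSignals.front();
        FreeSignals.pop_front();
        // Safe to reset: nothing uses the signal while it sits in the pool.
        hsa_signal_store_relaxed(Signal, 1);
        return HSA_STATUS_SUCCESS;
      }
    }
    return hsa_signal_create(1, 0, nullptr, &Signal);
  }

  // Return a completed signal to the pool for later reuse.
  void release(hsa_signal_t Signal) {
    std::lock_guard<std::mutex> Lock(Mutex);
    FreeSignals.push_back(Signal);
  }

  ~SignalPoolSketch() {
    for (hsa_signal_t S : FreeSignals)
      hsa_signal_destroy(S);
  }
};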

There are some issues to fix:

  • There is still one error when running the tests. The stream hangs with the recent changes regarding dynamically sized streams. Currently under investigation.
  • Missing documentation on openmp/docs. We can open a separate patch for that.
  • Missing std::atomic_thread_fence inside the stream implementation. May be related to the issue mentioned above.
  • Some reviewers' comments are not addressed yet. Currently working on them.
kevinsala marked 38 inline comments as done and an inline comment as not done.Dec 12 2022, 10:03 AM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
138

Opened a separate patch for that: https://reviews.llvm.org/D139792

500

Kept as BLOCKED waiting and added a TODO comment about it.

537–538

For some reason, that doesn't seem to work. It's printing strange characters after ".kd", even if I add a \0. I'm looking into it.

726

Added a long comment about it on top of the Stream's std::mutex data member.

969

I'm looking into it. The documentation of hsa_amd_memory_async_copy in the header doesn't give much detail on the meaning of those two parameters or on the consequences of passing GPU/CPU agents. I passed the GPU agent for both parameters, as the original AMDGPU plugin does.

If we pass the CPU agent as the src/dst agent, we will need to grant the CPU agent access to each of the device allocations. Otherwise, it will fail to comply with "the agent must be able to directly access both the source and destination buffers in their current locations" (header doc).

1464

Current names:

  • LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES (default: 8)
  • LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE (default: 1024)
  • LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES (default: 1*1024*1024, 1MB)
  • LIBOMPTARGET_AMDGPU_NUM_INITIAL_SIGNALS (default: 64), probably better to name HSA_SIGNALS instead of SIGNALS

The description of each one appears on top of the envar's declarations. I'll add documentation on openmp/docs in another patch.

jdoerfert added inline comments.Dec 12 2022, 10:55 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2296

Nit early exit first.

kevinsala marked 3 inline comments as done.

Fixing format with clang-format and renaming envar to LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS.

kevinsala added inline comments.Dec 14 2022, 12:41 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1688

Should be a return

kevinsala set the repository for this revision to rG LLVM Github Monorepo.
kevinsala updated this revision to Diff 483347.Dec 15 2022, 2:29 PM

Rebasing and several fixes:

  • Checking whether agents can actually access a memory pool (debug mode)
  • Fixing two asserts in the events' implementation

kevinsala updated this revision to Diff 483351.Dec 15 2022, 2:45 PM
kevinsala marked 3 inline comments as done.

Adding minor fix in eventHandler.

kevinsala marked an inline comment as done.Dec 15 2022, 2:45 PM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
221

Now we're checking it when debug is enabled.

jdoerfert accepted this revision.Dec 15 2022, 2:50 PM

Let's get this in so we can iterate on it.

ye-luo accepted this revision.Dec 15 2022, 3:12 PM

My requests have been addressed.

It seems that this caused the following problem:
https://github.com/llvm/llvm-project/issues/59543

Could you please fix it ASAP or revert this?

kevinsala updated this revision to Diff 483517.Dec 16 2022, 6:36 AM

Removing changes from dynamic_hsa and moving them to https://reviews.llvm.org/D140213