This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
ClosedPublic

Authored by mhalk on Jul 5 2023, 9:32 AM.

Details

Summary

This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.

To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.

When a Stream is requested from the resource manager, it searches for an idle
queue to assign. During this search the manager may initialize additional
queues, up to the configured maximum (default: 4).
If no idle queue can be found, it falls back to round-robin selection.
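A minimal sketch of this selection policy (all names here are illustrative, not the identifiers used in rtl.cpp; the queue vector is assumed to be pre-sized to the maximum number of queues and filled in lazily):

#include <vector>

struct QueueSlot {
  bool Initialized = false; // Has the HSA queue been created yet?
  int NumUsers = 0;         // Streams currently assigned to this queue.
};

// Return the index of the queue a newly requested stream should use.
int selectQueue(std::vector<QueueSlot> &Queues, int &RoundRobinNext) {
  for (int I = 0, E = (int)Queues.size(); I != E; ++I) {
    if (Queues[I].Initialized && Queues[I].NumUsers > 0)
      continue; // Busy, keep searching.
    if (!Queues[I].Initialized)
      Queues[I].Initialized = true; // Lazily create the HSA queue here.
    return I; // Idle (possibly freshly initialized) queue.
  }
  // No idle queue could be found: fall back to round-robin selection.
  int Idx = RoundRobinNext;
  RoundRobinNext = (RoundRobinNext + 1) % (int)Queues.size();
  return Idx;
}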

With contributions from Johannes Doerfert <johannes@jdoerfert.de>

Depends on D156245

Diff Detail

Event Timeline

mhalk created this revision.Jul 5 2023, 9:32 AM
Herald added a project: Restricted Project. Jul 5 2023, 9:32 AM
mhalk requested review of this revision.Jul 5 2023, 9:32 AM
Herald added a project: Restricted Project.
jdoerfert added inline comments.Jul 5 2023, 9:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
798

Rename Busy into something more meaningful, e.g., NumUsers.
Spell out the inc/dec, maybe "addUser", "removeUser"
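A minimal sketch of the suggested naming (QueueUsage is a hypothetical stand-in for the relevant part of AMDGPUQueueTy):

struct QueueUsage {
  int NumUsers = 0; // Streams currently using the queue; replaces a "Busy" flag.

  void addUser() { ++NumUsers; }
  void removeUser() { --NumUsers; }
  bool isBusy() const { return NumUsers > 0; }
};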

1459

Make the class final.

1467

Just resize.

1503

This ignores the queue size, doesn't it? We should not grow larger than that value.

1507

Start with Current = NextQueue.fetch_add(1, std::memory_order_relaxed) % Size when you iterate, that way you won't check the first queue all the time.
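A sketch of this suggestion (NextQueue, the Busy flags, and the function name are stand-ins; note that the atomics were later dropped in this review once everything ended up guarded by the stream manager mutex):

#include <atomic>
#include <cstdint>
#include <vector>

uint32_t pickQueue(std::atomic<uint32_t> &NextQueue,
                   const std::vector<bool> &Busy) {
  uint32_t Size = (uint32_t)Busy.size();
  // Rotate the starting index so the scan does not always probe queue 0 first.
  uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed) % Size;
  for (uint32_t Checked = 0; Checked < Size; ++Checked) {
    uint32_t Idx = (Current + Checked) % Size;
    if (!Busy[Idx])
      return Idx; // First non-busy queue starting from the rotated index.
  }
  return Current; // All busy: keep the rotated (round-robin) choice.
}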

jplehr added inline comments.Jul 6 2023, 2:51 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1470

In case of error, I think we should report and return the error instead of reporting and continuing.

This is the initial queue, so if this fails, there is no queue available.
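A sketch of the distinction (using llvm::Error; createInitialQueue is a hypothetical stand-in for the real queue-creation path):

#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"

// Hypothetical stand-in for the queue-creation path discussed here.
llvm::Error createInitialQueue() { return llvm::Error::success(); }

// Report and keep going: the error is printed, but the caller never learns
// that no queue exists.
void initSwallowingErrors() {
  if (llvm::Error Err = createInitialQueue())
    llvm::logAllUnhandledErrors(std::move(Err), llvm::errs(), "queue init: ");
}

// Report and return: the failure propagates, so device initialization fails
// cleanly instead of continuing without any queue.
llvm::Error initPropagatingErrors() {
  if (llvm::Error Err = createInitialQueue())
    return Err;
  return llvm::Error::success();
}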

mhalk added inline comments.Jul 6 2023, 4:25 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1467

Initially, I wanted to do that, but AMDGPUQueueTy has a member (Mutex) with a deleted copy ctor.
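To illustrate why a plain resize does not work here (a minimal reproduction, not the actual AMDGPUQueueTy):

#include <mutex>
#include <vector>

// A std::mutex member makes the type non-copyable and non-movable.
struct QueueLike {
  std::mutex Mutex;
};

int main() {
  // Constructing the vector with its final size works: elements are
  // default-constructed in place, no copy or move needed.
  std::vector<QueueLike> Queues(4);
  (void)Queues;

  // Growing it afterwards would not compile for this element type:
  // Queues.resize(8); // error: QueueLike is not MoveInsertable
  return 0;
}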

1470

Agreed; as discussed, we'll return Err and not report it at this level.

It will be reported by the higher layer, for example like this:
"PluginInterface" error: Failure to initialize device 0: Error in hsa_queue_create: HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.

1503

To the best of my knowledge we should not exceed the set (max) number of queues.
This will only resize the streams, as the size of Queues is only set once, during init(...).

1507

Agreed; as discussed, we're going with two loops to search for an idle queue.

mhalk updated this revision to Diff 537666.Jul 6 2023, 4:52 AM

Thank you very much for the comments.
This should implement the current feedback.

kevinsala added inline comments.Jul 11 2023, 6:29 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608

Do we need these three atomic operations with memory_order_seq_cst? Or would a more relaxed memory order be enough?

1498

I'm afraid most of the code in getResource and returnResource is duplicated from the generic resource manager. Would it be possible to use the original GenericDeviceResourceManagerTy::getResource() instead? For instance:

ResourceRef getResource() override {
  ResourceRef resource = GenericDeviceResourceManagerTy::getResource();
  // Perform any change on resource
  ...

  return resource;
}

(Note that with this modification, the changes on the resource are no longer protected by the stream manager mutex)

1531

I don't think NextQueue needs atomic operations if this function is called while holding the stream manager mutex.

1568

I feel this patch implements a Queue manager inside a Stream manager. Wouldn't it be better to define this logic inside a new AMDGPUQueueManagerTy and just have a reference of it in the AMDGPUStreamManagerTy?
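A rough skeleton of the suggested split (hypothetical types; as the later discussion notes, the patch ultimately kept the selection logic in the stream manager and reused the generic resource manager instead):

// Would own the HSA queues and implement the idle-queue search / round-robin.
struct AMDGPUQueueManagerTy {};

struct AMDGPUStreamManagerTy {
  explicit AMDGPUStreamManagerTy(AMDGPUQueueManagerTy &QueueMgr)
      : QueueMgr(QueueMgr) {}

private:
  AMDGPUQueueManagerTy &QueueMgr; // Queue selection would be delegated here.
};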

kevinsala added inline comments.Jul 11 2023, 7:12 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1535

Also, shouldn't NextQueue be incremented as we advance in the search for non-busy queues?

1548

I believe we can simplify and merge these two loops into a single one.

ye-luo added a subscriber: ye-luo.Jul 11 2023, 7:42 AM
ye-luo added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608

Is this function already under the stream manager mutex when being called? I feel NumUsers doesn't need to be atomic.

kevinsala added inline comments.Jul 11 2023, 7:46 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608

Yes, as it is now (everything protected by the stream manager mutex), it seems it doesn't need to be atomic.

jdoerfert added inline comments.Jul 11 2023, 10:25 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1471–1473
1505

As @kevinsala said, the above, and the auto &Resource = part below, can be replaced with ResourceRef Resource = GenericDeviceResourceManagerTy::getResource();

1523

The above, except removeUser, should just be GenericDeviceResourceManagerTy::returnResource(Resource);, no?
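A self-contained mock of the suggested delegation (none of these are the real plugin types or signatures; only the shape of the override matters):

struct MockQueue {
  int NumUsers = 0;
  void removeUser() { --NumUsers; }
};

struct MockStream {
  MockQueue *Queue = nullptr;
};

struct MockGenericManager {
  // Stands in for GenericDeviceResourceManagerTy::returnResource().
  void returnResource(MockStream * /*Stream*/) {}
};

struct MockStreamManager : MockGenericManager {
  void returnResource(MockStream *Stream) {
    Stream->Queue->removeUser();                // AMDGPU-specific bookkeeping.
    MockGenericManager::returnResource(Stream); // Everything else is reused.
  }
};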

1537

Early exit: if (Busy) continue;

1548

Yeah, my bad, I suggested this; we can do something like:

for (I = 0; I < MaxNumQueues; ++I) {
  Idx = StartIndex++;
  if (StartIndex == MaxNumQueues) StartIndex = 0;
  // use Idx, not I
  ...
}
1568

We do not really manage the queues in the same way; we can do more reuse, see above.

mhalk updated this revision to Diff 540045.Jul 13 2023, 8:00 AM

Thanks for all the valuable feedback -- this should implement it.

Still trying to figure out which parts really need to be protected by Locks.
So, please take a close look at these parts of the code.

Since they are currently guarded, I removed the:
- atomics, as suggested
- Lock from initLazy

mhalk added inline comments.Jul 13 2023, 8:27 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502

Rather unhappy with this particular piece of code.
Now there's a chance for interleaving.
I don't know if that's good, "ok" or bad, but if I had to guess: the latter.

But without unlocking, this will cause a deadlock.
(Could also revert to a lock_guard and call its dtor, if preferred.)

Thoughts and comments on this would be highly appreciated.
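A minimal illustration of the trade-off described here (names are illustrative; the functor extension proposed below removes the need for the manual unlock):

#include <mutex>

std::mutex ManagerMutex;

void baseGetResource() {
  std::lock_guard<std::mutex> Lock(ManagerMutex); // Re-acquires the same mutex.
}

void overriddenGetResource() {
  std::unique_lock<std::mutex> Lock(ManagerMutex);
  // ... work that must happen under the lock ...
  Lock.unlock();     // Without this unlock the next call deadlocks,
  baseGetResource(); // but now another thread may interleave in between.
}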

mhalk added inline comments.Jul 13 2023, 9:31 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1027–1033

Oversight: Will be removed.

Looks pretty good to me, others?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1519–1529
kevinsala added inline comments.Jul 17 2023, 3:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502

Yes, this seems correct, but I agree that it may produce unnecessary overhead.

To cover this use case, we could extend the generic getResource and returnResource to accept an (optional) template functor that processes the element before returning (i.e., with the lock acquired). Something like:

class GenericDeviceResourceManagerTy {
protected:
    template <typename Func>
    ResourceRef getResource(Func processor)
    {
        ResourceRef ref = ...;
        processor(ref);
        return ref;
    }

public:
    virtual ResourceRef getResource()
    {
        return getResource([](ResourceRef &) { /* do nothing */ });
    }
};

And your getResource function would look something like:

ResourceRef getResource() override {
    return GenericDeviceResourceManagerTy::getResource(
        [this](ResourceRef &ref) {
            assignNextQueue(ref);
        });
}

I will implement this change in another patch. For the moment, this looks fine to me. Thanks!

kevinsala added inline comments.Jul 17 2023, 4:02 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502

Fix to my previous comment:

template <typename Func>
ResourceRef getResource(Func processor)
{
    std::lock_guard<std::mutex> Lock(Mutex);
    ResourceRef ref = ...;
    processor(ref);
    return ref;
}
mhalk added a comment.Jul 17 2023, 5:33 AM

Thanks for checking and letting me know!

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502

Ok, great -- will leave this as it is, for now.

mhalk updated this revision to Diff 544877.Jul 27 2023, 11:55 AM

Adapted to D156245 @kevinsala
Thanks for the heads-up!

Please take a look at the signature of the two lambdas; I decided to go with
AMDGPUStreamTy *, instead of duplicating the private ResourceHandleTy.

I'll try to do some further testing over the next few days.

kevinsala added inline comments.Jul 31 2023, 3:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1490

I would rename the parameter to Resource, Handle, or similar. The parameter is no longer a ResourceReference. The same for the lambda's parameter and the returnResource function.

1492

You can make this function return the error and propagate it outward. Then, the assignNextQueue function can return the error when it fails to initialize a queue, instead of handling it.
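A sketch of the suggested propagation (llvm::Error based; initLazy and assignNextQueue are the functions discussed here, but the signatures and the QueueStub type are assumed):

#include "llvm/Support/Error.h"

struct QueueStub {
  bool Initialized = false;
  llvm::Error initLazy() { // Assumed signature, for illustration only.
    Initialized = true;
    return llvm::Error::success();
  }
};

// assignNextQueue returns the error instead of handling it locally, so the
// getResource override (and ultimately device initialization) can propagate it.
llvm::Error assignNextQueue(QueueStub &Q) {
  if (!Q.Initialized)
    if (llvm::Error Err = Q.initLazy())
      return Err;
  return llvm::Error::success();
}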

1500
1508

You can work with a specific parameter AMDGPUStreamTy *Stream directly.

1522
if (auto Err = Q->initLazy(Agent, QueueSize))
  return Err;
mhalk updated this revision to Diff 546128.Aug 1 2023, 10:37 AM

Merged this diff with D152035 as discussed with @jdoerfert.
Updated commit message accordingly.

Implemented feedback from @kevinsala + rebased.
Thank you both!

mhalk retitled this revision from [OpenMP][AMDGPU] Tracking of busy HSA queues to [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking.Aug 1 2023, 10:39 AM
mhalk edited the summary of this revision. (Show Details)

Looks good to me now. Thanks!

mhalk updated this revision to Diff 546138.Aug 1 2023, 10:50 AM

Fixed oversight and made AMDGPUQueueTy::init lazy by default.

kevinsala accepted this revision.Aug 1 2023, 12:45 PM
This revision is now accepted and ready to land.Aug 1 2023, 12:45 PM
jdoerfert added inline comments.Aug 1 2023, 12:56 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1466

Style: I would suggest HSA rather than Hsa.

mhalk updated this revision to Diff 546211.Aug 1 2023, 1:49 PM

Sure!
Changed Hsa to HSA within names.

Thanks for all the feedback and patience :)