This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
ClosedPublic

Authored by mhalk on Jul 5 2023, 9:32 AM.

Details

Summary

This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.

To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.

When a Stream is requested from the resource manager, it will search for and
try to assign an idle queue. During this search, the manager may initialize
additional queues, up to the configured maximum (default: 4).
When no idle queue can be found, it resorts to round-robin selection.
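For illustration only, a minimal standalone sketch of the selection policy described above; all names (QueueSlot, selectQueue, NextQueue) are hypothetical stand-ins and not the plugin's actual AMDGPUStreamManagerTy code.

#include <cstdint>
#include <vector>

struct QueueSlot {
  bool Initialized = false; // Queues are created lazily, on first use.
  uint32_t NumUsers = 0;    // Streams currently mapped to this queue.
};

// Assumes Queues.size() == MaxNumQueues; NextQueue rotates the start index.
uint32_t selectQueue(std::vector<QueueSlot> &Queues, uint32_t &NextQueue,
                     uint32_t MaxNumQueues) {
  for (uint32_t I = 0; I < MaxNumQueues; ++I) {
    uint32_t Idx = (NextQueue + I) % MaxNumQueues;
    QueueSlot &Q = Queues[Idx];
    if (!Q.Initialized)
      Q.Initialized = true; // The real code creates an HSA queue here.
    if (Q.NumUsers == 0) {  // Idle queue found.
      ++Q.NumUsers;
      NextQueue = (Idx + 1) % MaxNumQueues;
      return Idx;
    }
  }
  // Every queue is busy: fall back to round-robin selection.
  uint32_t Idx = NextQueue;
  NextQueue = (NextQueue + 1) % MaxNumQueues;
  ++Queues[Idx].NumUsers;
  return Idx;
}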

With contributions from Johannes Doerfert <johannes@jdoerfert.de>

Depends on D156245

Diff Detail

Event Timeline

mhalk created this revision.Jul 5 2023, 9:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 9:32 AM
mhalk requested review of this revision.Jul 5 2023, 9:32 AM
Herald added a project: Restricted Project. · View Herald Transcript
jdoerfert added inline comments.Jul 5 2023, 9:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
791

Rename Busy to something more meaningful, e.g., NumUsers.
Spell out the inc/dec, maybe "addUser" / "removeUser".
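For illustration, a small sketch of the suggested naming; the struct is a hypothetical stand-in, not the actual AMDGPUQueueTy from the patch.

#include <cassert>
#include <cstdint>

struct QueueUserCount {
  uint32_t NumUsers = 0;

  void addUser() { ++NumUsers; }
  void removeUser() {
    assert(NumUsers > 0 && "removeUser called on an unused queue");
    --NumUsers;
  }
  bool isBusy() const { return NumUsers > 0; }
};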

1438

Make the class final.

1446

Just resize.

1482

This ignores the queue size, doesn't it? We should not grow larger than that value.

1486

Start with Current = NextQueue.fetch_add(1, std::memory_order_relaxed) % Size when you iterate; that way you won't check the first queue all the time.
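A minimal sketch of that suggestion, assuming NextQueue is a std::atomic<uint32_t> and Size is the number of queues; the helper function is illustrative only.

#include <atomic>
#include <cstdint>

// Each call rotates the starting point, so the scan does not probe the first
// queue every time.
uint32_t scanStart(std::atomic<uint32_t> &NextQueue, uint32_t Size) {
  return NextQueue.fetch_add(1, std::memory_order_relaxed) % Size;
}

// Typical use: start the scan at Current = scanStart(NextQueue, Size) and
// advance with (Current + I) % Size.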

jplehr added inline comments.Jul 6 2023, 2:51 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1449

In case of error, I think we should report and return the error instead of reporting and continuing.

This is the initial queue, so if this fails, there is no queue available.
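A sketch of the return-instead-of-report pattern using LLVM's Error type, which the plugin interface already uses for initialization; queueCreate and the function names are hypothetical stand-ins.

#include "llvm/Support/Error.h"

bool queueCreate(); // Hypothetical wrapper; returns true on failure.

llvm::Error initFirstQueue() {
  if (queueCreate())
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "failed to create the initial HSA queue");
  return llvm::Error::success();
}

llvm::Error init() {
  // Propagate instead of printing and continuing; a higher layer reports it.
  if (llvm::Error Err = initFirstQueue())
    return Err;
  return llvm::Error::success();
}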

mhalk added inline comments.Jul 6 2023, 4:25 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1446

Initially, I wanted to do that, but AMDGPUQueueTy has a member (Mutex) with a deleted copy constructor.
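A minimal illustration of that constraint (not the patch code): std::vector::resize requires a MoveInsertable element type, and a type holding a std::mutex is neither copyable nor movable, so the container size has to be fixed when it is created.

#include <cstddef>
#include <mutex>
#include <vector>

struct QueueLike {
  std::mutex Mutex; // std::mutex deletes copy/move, so QueueLike does too.
};

std::vector<QueueLike> makeQueues(std::size_t N) {
  // std::vector<QueueLike> Qs;
  // Qs.resize(N); // Ill-formed: resize needs MoveInsertable elements.
  return std::vector<QueueLike>(N); // OK: the count constructor only
                                    // default-constructs elements in place.
}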

1449

Agreed; as discussed, we'll return Err and not report it at this level.

It will be reported by the higher layer, for example like this:
"PluginInterface" error: Failure to initialize device 0: Error in hsa_queue_create: HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.

1482

To the best of my knowledge we should not exceed the set (max) number of queues.
This will only resize the streams, as the size of Queues is only set once, during init(...).

1486

Agreed; as discussed, we're going with two loops to search for an idle queue.

mhalk updated this revision to Diff 537666.Jul 6 2023, 4:52 AM

Thank you very much for the comments.
This should implement the current feedback.

kevinsala added inline comments.Jul 11 2023, 6:29 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
601

Do we need these three atomic operations with memory_order_seq_cst, or would a more relaxed memory order be enough?
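For reference, a small sketch of the difference being asked about; NumUsers stands in for the counters in question.

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> NumUsers{0};

// Plain ++/-- on a std::atomic default to memory_order_seq_cst.
void addUserSeqCst() { ++NumUsers; }

// A relaxed increment suffices when the counter is pure bookkeeping and the
// surrounding state is already protected by the stream manager mutex.
void addUserRelaxed() { NumUsers.fetch_add(1, std::memory_order_relaxed); }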

1477

I'm afraid most of the code in getResource and returnResource is duplicated from the generic resource manager. Would it be possible to use the original GenericDeviceResourceManagerTy::getResource() instead? For instance:

ResourceRef getResource() override {
  ResourceRef resource = GenericDeviceResourceManagerTy::getResource();
  // Perform any change on resource
  ...

  return resource;
}

(Note that with this modification, the changes on the resource are no longer protected by the stream manager mutex)

1510

I don't think NextQueue needs atomic operations if this function is called while holding the stream manager mutex.

1547

I feel this patch implements a Queue manager inside a Stream manager. Wouldn't it be better to define this logic inside a new AMDGPUQueueManagerTy and just have a reference to it in the AMDGPUStreamManagerTy?

kevinsala added inline comments.Jul 11 2023, 7:12 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1514

Also, shouldn't NextQueue be incremented while we advance in the search for non-busy queues?

1527

I believe we can simplify and merge these two loops into a single one.

ye-luo added a subscriber: ye-luo.Jul 11 2023, 7:42 AM
ye-luo added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
601

Is this function already under the stream manager mutex when being called? I feel NumUsers doesn't need to be atomic.

kevinsala added inline comments.Jul 11 2023, 7:46 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
601

Yes, as it is now (everything protected by the stream manager mutex), it seems it shouldn't be atomic.

jdoerfert added inline comments.Jul 11 2023, 10:25 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1450–1452
1484

As @kevinsala said, the above, and the auto &Resource = part below, can be replaced with ResourceRef Resource = GenericDeviceResourceManagerTy::getResource();

1502

The above, except removeUser, should just be GenericDeviceResourceManagerTy::returnResource(Resource);, no?

1516

early exit if (busy) continue

1527

Yeah, my bad, I suggested this; we can do something like:

for (I = 0; I < MaxNumQueues; ++I) {
  Idx = StartIndex++;
  if (StartIndex == MaxNumQueues)
    StartIndex = 0;
  // use Idx, not I
  ...
}
1547

We do not really manage the queues the same way; we can do more reuse, see above.

mhalk updated this revision to Diff 540045.Jul 13 2023, 8:00 AM

Thanks for all the valuable feedback -- this should implement it.

Still trying to figure out which parts really need to be protected by Locks.
So, please take a close look at these parts of the code.

Since they are currently guarded, I removed:
- the atomics, as suggested
- the Lock from initLazy

mhalk added inline comments.Jul 13 2023, 8:27 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1478–1481

Rather unhappy with this particular piece of code.
Now there's a chance for interleaving.
I don't know if that's good, "ok" or bad, but if I had to guess: the latter.

But without unlocking, this will cause a deadlock.
(Could also revert to a lock_guard and call its dtor, if preferred.)

Thoughts and comments on this would be highly appreciated.
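A simplified illustration of the deadlock and the interleaving window (hypothetical types, not the plugin code): the override holds the manager mutex, but the base-class getResource locks the same non-recursive mutex, so the override has to unlock around the call.

#include <mutex>

struct BaseManager {
  std::mutex Mutex;
  int getResource() {
    std::lock_guard<std::mutex> LG(Mutex); // Locks the manager mutex.
    return 0;                              // ... pick a resource ...
  }
};

struct DerivedManager : BaseManager {
  int getResource() {
    std::unique_lock<std::mutex> UL(Mutex);
    // ... inspect manager state under the lock ...
    UL.unlock();                        // Without this, the next call would
    int R = BaseManager::getResource(); // self-deadlock on Mutex.
    UL.lock();                          // Another thread may have run in
                                        // between: the interleaving window.
    // ... post-process R under the lock ...
    return R;
  }
};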

mhalk added inline comments.Jul 13 2023, 9:31 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1009–1015

Oversight: Will be removed.

Looks pretty good to me, others?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1498–1508
kevinsala added inline comments.Jul 17 2023, 3:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1478–1481

Yes, this seems correct, but I agree that it may produce unnecessary overhead.

To cover this use case, we could extend the generic getResource and returnResource to accept an (optional) template functor that processes the element before returning (i.e., with the lock acquired). Something like:

class GenericDeviceResourceManagerTy {
protected:
    template <typename Func>
    ResourceRef getResourceAndProcess(Func processor)
    {
        ResourceRef ref = ...;
        processor(ref);
        return ref;
    }

public:
    virtual ResourceRef getResource()
    {
        return getResourceAndProcess([](ResourceRef &) { /* do nothing */ });
    }
};

And your getResource function would look like something like:

ResourceRef getResource() override {
    return GenericDeviceResourceManagerTy::getResource(
        [this](ResourceRef &ref) {
            assignNextQueue(ref);
        });
}

I will implement this change in another patch. For the moment, this looks fine to me. Thanks!

kevinsala added inline comments.Jul 17 2023, 4:02 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1478–1481

Fix to my previous comment:

template <typename Func>
ResourceRef getResource(Func processor)
{
    std::lock_guard<std::mutex> Lock(Mutex);
    ResourceRef ref = ...;
    processor(ref);
    return ref;
}
mhalk added a comment.Jul 17 2023, 5:33 AM

Thanks for checking and letting me know!

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1478–1481

Ok, great -- will leave this as it is, for now.

mhalk updated this revision to Diff 544877.Jul 27 2023, 11:55 AM

Adapted to D156245 @kevinsala
Thanks for the heads-up!

Please take a look at the signature of the two lambdas; I decided to go with
AMDGPUStreamTy *, instead of duplicating the private ResourceHandleTy.

I'll try to do some further testing in the coming days.

kevinsala added inline comments.Jul 31 2023, 3:57 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1469

I would rename the parameter to Resource, Handle, or similar. The parameter is no longer a ResourceReference. The same for the lambda's parameter and the returnResource function.

1471

You can make this function return the error and propagate it outward. Then, the assignNextQueue function can return the error when failing to initialize a queue, instead of handling it.

1479
1487

You can work with a specific parameter AMDGPUStreamTy *Stream directly.

1501
if (auto Err = Q->initLazy(Agent, QueueSize))
  return Err;
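A hedged sketch of the suggested propagation, with simplified stand-in types: assignNextQueue forwards the Error from the lazy queue initialization to its caller instead of consuming it.

#include "llvm/Support/Error.h"

struct QueueLike {
  bool Initialized = false;
  llvm::Error initLazy(/* hsa_agent_t Agent, uint32_t QueueSize */) {
    Initialized = true; // The real code would call hsa_queue_create here.
    return llvm::Error::success();
  }
};

llvm::Error assignNextQueue(QueueLike &Q /*, AMDGPUStreamTy *Stream */) {
  if (!Q.Initialized)
    if (llvm::Error Err = Q.initLazy())
      return Err; // Propagated up through getResource to the caller.
  // ... attach the chosen queue to the stream ...
  return llvm::Error::success();
}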
mhalk updated this revision to Diff 546128.Aug 1 2023, 10:37 AM

Merged this diff with D152035 as discussed with @jdoerfert.
Updated commit message accordingly.

Implemented feedback from @kevinsala + rebased.
Thank you both!

mhalk retitled this revision from [OpenMP][AMDGPU] Tracking of busy HSA queues to [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking.Aug 1 2023, 10:39 AM
mhalk edited the summary of this revision. (Show Details)

Looks good to me now. Thanks!

mhalk updated this revision to Diff 546138.Aug 1 2023, 10:50 AM

Fixed oversight and made AMDGPUQueueTy::init lazy by default.

kevinsala accepted this revision.Aug 1 2023, 12:45 PM
This revision is now accepted and ready to land.Aug 1 2023, 12:45 PM
jdoerfert added inline comments.Aug 1 2023, 12:56 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1445

Style: I would suggest HSA rather than Hsa.

mhalk updated this revision to Diff 546211.Aug 1 2023, 1:49 PM

Sure!
Changed Hsa to HSA within names.

Thanks for all the feedback and patience :)