This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking
ClosedPublic

Authored by mhalk on Aug 3 2023, 5:59 AM.

Details

Summary

If the Envar is set to true (default), busy HSA queues will be
actively avoided when assigning a queue to a Stream.

Otherwise, we will initialize a new HSA queue for each requested
Stream, then default to round robin once the set maximum has been
reached.

Diff Detail

Event Timeline

mhalk created this revision.Aug 3 2023, 5:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2023, 5:59 AM
mhalk requested review of this revision.Aug 3 2023, 5:59 AM
Herald added a project: Restricted Project. · View Herald Transcript
mhalk added a subscriber: kevinsala.Aug 3 2023, 6:01 AM

In case this might be interesting for upstream. @kevinsala

Also implemented a suggestion that I lost track of in D154523.

jdoerfert accepted this revision.Aug 3 2023, 8:49 AM

Add the option to the docs though.

This revision is now accepted and ready to land.Aug 3 2023, 8:49 AM
kevinsala added inline comments.Aug 3 2023, 9:42 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1476

This shouldn't be needed. Envar objects load the value from the environment in their constructor.

1540

I rewrote the function trying to be a bit simplier. Conceptually, I think the only difference is that NextQueue is always increased. Would it make sense?

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
  uint32_t StartIndex = NextQueue % MaxNumQueues;

  if (OMPX_QueueTracking) {
    // Find the first idle queue.
    for (uint32_t I = 0; I < MaxNumQueues; ++I) {
      if (!Queues[StartIndex].isBusy())
        break;

      StartIndex = (StartIndex + 1) % MaxNumQueues;
    }
  }

  // Try to initialize queue and increase its users.
  if (Queues[StartIndex].init(Agent, QueueSize))
    return Err;
  Queues[StartIndex].addUser();

  // Assign the queue.
  Stream->Queue = &Queues[StartIndex];

  // Always move to the next queue.
  ++NextQueue;

  return Plugin::success();
}

Also, in the case OMPX_QueueTracking is enabled, when no queues are idle, we could fallback to the queue with less users. The minimum can be computed while iterating the queues in that loop.

mhalk updated this revision to Diff 546912.Aug 3 2023, 9:55 AM

Added documentation.

Thank you -- much appreciated!

mhalk added inline comments.Aug 3 2023, 10:30 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1476

Good catch, thanks! I wasn't aware of that.

1540

Really like the reduced complexity.

Now that NextQueue is always increased wouldn't that forfeit the purpose of handling busy queues? -- at least for the first MaxNumQueues Stream-requests.
Because I think we will now always look at a Queue that has not been initialized and is therefore considered "idle"/not busy.
I guess keeping NextQueue at zero until the maximum number of queues is actually initialized is important, unless we find sth. equivalent.
I'll think about it a bit more.

kevinsala added inline comments.Aug 3 2023, 11:28 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1540

I don't think updating NextQueue will have an impact on the tracking mechanism. For the tracking, NextQueue just indicates which queue to start looking at. But you will traverse the whole array of queues if we are not finding idle queues.

So, in the first MaxNumQueues Stream-request, the mechanism will do the following:

1st request: NextQueue is 0 -> check queue 0 (idle) -> break
2nd request: NextQueue is 1 -> check queue 1 (idle) -> break
3rd request: NextQueue is 2 -> check queue 2 (idle) -> break
...

In these first MaxNumQueues requests, we will always break at the first loop iteration because no one will be using the queue at`NextQueue` positions yet. Then, after those first requests, we will probably start having to iterate over the queues to find if any is idle.

mhalk added inline comments.Aug 3 2023, 11:43 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1540

Yes, but what if upon the 2nd request queue_0 is idle.
Won't we initialize queue_1 albeit there is an already initialized one ready?

Or do we not care about this? If so & should you have time -- please share some insight in this.

kevinsala added inline comments.Aug 3 2023, 11:51 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1540

True, it would make sense to re-use the already initialized one. That can be changed on the simplified function easily.

I've some other doubts regarding falling back to round robin when there is no idle queue. Since the number of queues is very low, I think that improving the fallback (i.e., when all queues are busy) is important. The default number of queues is four, so we will already move to the fallback mechanism (simple round-robin) when having 4 streams working concurrently.

For the same cost as now, we can chose the queue with the minimum number of users instead. I believe this could keep the users (more or less) well-balanced among queue. Does it make sense?

mhalk added inline comments.Aug 3 2023, 12:11 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1540

Yes, that does make sense to me.

When you say "for the same cost", what do you have in mind?

I guess, I'll check if there are measurable drawbacks to round robin vs. "least contention", performance-wise.

mhalk added a comment.Aug 4 2023, 4:53 AM

Just as a heads-up, what I'll be looking at.
Maybe this far away from what you had in mind.

One thing I noticed when refactoring this to the selection dependent on "user count" is that:
In conjunction with an early exit, which I definitely wanted, I think we can eliminate one iteration on the for loop?!
(Alleviates the re-introduced complexity a bit, I guess.)
I'm positive we find a good solution to this.

Changed isBusy to getUserCount and added isInitialized.

@kevinsala Let me know what you think.
Also, tell me if I should just update the diff rather than posting code snippets.

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
    uint32_t SelectedIndex = NextQueue % MaxNumQueues;

    if (OMPX_QueueTracking) {
      // Take utilization into account, begin at SelectedIndex
      uint32_t Index = SelectedIndex;

      for (uint32_t I = 1; I < MaxNumQueues; ++I) {
        // Early exit when an initialized queue is idle
        if (Queues[SelectedIndex].isInitialized() &&
            Queues[SelectedIndex].getUserCount() == 0)
          break;

        // Increment Index & potential wrap around
        if (++Index >= MaxNumQueues)
          Index = 0;

        // Update the least contested queue
        if (Queues[SelectedIndex].getUserCount() > Queues[Index].getUserCount())
          SelectedIndex = Index;
      }
    }

    // Make sure we assign an initialized queue, then add user & assign
    if (auto Err = Queues[SelectedIndex].init(Agent, QueueSize))
      return Err;
    Queues[SelectedIndex].addUser();
    Stream->Queue = &Queues[SelectedIndex];

    // Move cursor to the next queue
    ++NextQueue;
    return Plugin::success();
  }
kevinsala added a comment.EditedAug 4 2023, 6:50 AM

I would keep it simple:

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
  uint32_t SelectedIndex = 0;

  if (OMPX_QueueTracking) {
    // Find the least used queue.
    for (uint32_t I = 0; I < MaxNumQueues; ++I) {
      // Early exit when an initialized queue is idle
      if (Queues[I].isInitialized() && Queues[I].getUserCount() == 0) {
        SelectedIndex = I;
        break;
      }
      
      // Update the least contested queue
      if (Queues[SelectedIndex].getUserCount() > Queues[I].getUserCount())
        SelectedIndex = I;
    }
  } else {
    // Round-robin policy.
    SelectedIndex = NextQueue++ % MaxNumQueues;
  }
  
  // Make sure we assign an initialized queue, then add user & assign
  if (auto Err = Queues[SelectedIndex].init(Agent, QueueSize))
    return Err;
  Queues[SelectedIndex].addUser();
  Stream->Queue = &Queues[SelectedIndex];
  
  return Plugin::success();
}

This variant is based on your last snippet. NextQueue is not used at all when queue tracking is enabled; it's only used when working in round-robin policy. In queue tracking mode, we start from the first queue. It features the early exit you wrote. Although it doesn't skip any iteration, I don't think it'll impact the performance.

mhalk updated this revision to Diff 547277.Aug 4 2023, 11:02 AM

Implemented feedback and adapted docs.

Thanks @kevinsala for the helpful feedback!
IMHO think this is a very readable solution.

Changes to the snippet:
Decided to rename the Index variable and change its assignment.

kevinsala accepted this revision.Aug 6 2023, 2:02 AM

LGTM. Thanks!