This is an archive of the discontinued LLVM Phabricator instance.


Differential D146849

[OpenMP][libomptarget] Active and blocking HSA wait states
Abandoned · Public

Authored by jplehr on Mar 24 2023, 3:28 PM.


Details

Reviewers

jdoerfert
jhuber6
JonChesterfield
tianshilei1992
ye-luo

Summary

This patch adds a timed active wait for HSA signals: a waiting thread first waits in the active state (HSA_WAIT_STATE_ACTIVE) and switches to a blocking wait (HSA_WAIT_STATE_BLOCKED) only after a timeout expires.
The idea is copied from the AOMP implementation.
It came out of a discussion about a hang observed in the wait() implementation of the NextGen plugin.
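The wait strategy summarized above can be modeled without a GPU. The sketch below is a standalone illustration under stated assumptions, not the plugin code and not the HSA API: a hypothetical `ModelSignal` (an atomic counter plus a condition variable) stands in for `hsa_signal_t`, a spin loop against a deadline stands in for `hsa_signal_wait_scacquire(..., HSA_WAIT_STATE_ACTIVE)`, and a condition-variable wait stands in for the `HSA_WAIT_STATE_BLOCKED` loop. Unlike Diff 508230, the model uses the same `== 0` exit condition in both phases, and `MicrosToWait == 0` selects the blocking-only path with a single check.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Which phase of the wait observed completion (for illustration only).
enum class WaitPhase { Active, Blocked };

// Hypothetical stand-in for an HSA signal: an atomic counter plus a condition
// variable, so that a blocked waiter can be woken by the producer.
class ModelSignal {
public:
  explicit ModelSignal(int64_t Initial) : Value(Initial) {}

  // Producer side: decrement by one and wake any blocked waiters.
  void signal() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      assert(Value.load() > 0 && "Invalid signal value");
      Value.fetch_sub(1, std::memory_order_release);
    }
    CV.notify_all();
  }

  // Spin for up to MicrosToWait microseconds, then block until the value is 0.
  // MicrosToWait == 0 skips the spin and goes straight to the blocking phase.
  WaitPhase wait(uint64_t MicrosToWait) {
    if (MicrosToWait) {
      const auto Deadline = std::chrono::steady_clock::now() +
                            std::chrono::microseconds(MicrosToWait);
      while (std::chrono::steady_clock::now() < Deadline)
        if (Value.load(std::memory_order_acquire) == 0)
          return WaitPhase::Active;
    }
    // Blocking phase: the predicate is re-checked under the mutex, so a
    // decrement that races with the end of the spin is not lost.
    std::unique_lock<std::mutex> Lock(Mutex);
    CV.wait(Lock, [this] { return Value.load(std::memory_order_acquire) == 0; });
    return WaitPhase::Blocked;
  }

private:
  std::atomic<int64_t> Value;
  std::mutex Mutex;
  std::condition_variable CV;
};

// Illustration helper: decrement Sig from another thread after Delay.
std::thread signalAfter(ModelSignal &Sig, std::chrono::milliseconds Delay) {
  return std::thread([&Sig, Delay] {
    std::this_thread::sleep_for(Delay);
    Sig.signal();
  });
}
```

For instance, a signal that has already been decremented to zero is picked up in the active phase, `wait(0)` always takes the blocking path, and a producer that fires 200 ms after a 1 ms spin window is picked up by the blocked phase.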

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jplehr created this revision. Mar 24 2023, 3:28 PM

Herald added a project: Restricted Project. Mar 24 2023, 3:28 PM

Herald added subscribers: sunshaoce, kosarev, kerbowa and 3 others.

jplehr requested review of this revision. Mar 24 2023, 3:28 PM


Herald added subscribers: openmp-commits, sstefan1.

Harbormaster completed remote builds in B221692: Diff 508230. Mar 24 2023, 3:31 PM

jdoerfert added inline comments. Mar 24 2023, 3:44 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
501	I would think we want an env var, with all signals behaving the same, so no argument here. WDYT?
558
570	assert?
573–575

ye-luo added inline comments. Mar 24 2023, 4:46 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
555	const uint64_t MicrosToWait
1964	Not getting what you mean by "Example use". Could you document the magic number 300000?

jplehr added inline comments. Mar 27 2023, 1:04 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
501	In AOMP, we actually use different values for data transfers and kernel launches. I don't think we need this for now, but we should do some experimentation to see what works best with the NextGen plugin. So, I'll change that and make it an EnvVar.
1964	Example use is copied from AOMP. For data transfers we use this timeout (for kernels we use a different one). As said above, we should probably remove it for now and do some experimentation.
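Diff 508230 hardcodes 300000 at the data-transfer call site. One way to document that number, and to sketch the single process-wide setting jdoerfert asks for, is a named constant plus an environment override. `DefaultTransferActiveWaitMicros`, `activeWaitMicrosFromEnv`, and the variable name in the example below are hypothetical; this is not code from the patch or from AOMP. The "300 ms" reading assumes the value is in microseconds, as the name `MicrosToWait` suggests; the HSA specification defines the `timeout_hint` argument of `hsa_signal_wait_scacquire` in system-timestamp units, so the real duration depends on the timestamp frequency.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <cstdlib>

// The value hardcoded for data transfers in Diff 508230, read as microseconds:
// 300000 us of active waiting, i.e. 300 ms, before switching to a blocking wait.
constexpr uint64_t DefaultTransferActiveWaitMicros = 300000;
static_assert(std::chrono::microseconds(DefaultTransferActiveWaitMicros) ==
                  std::chrono::milliseconds(300),
              "300000 us is 300 ms");

// Hypothetical helper: read the active-wait timeout (in microseconds) from the
// environment variable Name. Unset, non-numeric, negative, or out-of-range
// values fall back to Default; "0" selects the blocking-only wait.
uint64_t activeWaitMicrosFromEnv(const char *Name, uint64_t Default) {
  const char *Str = std::getenv(Name);
  // Require a leading digit: rejects "", "-5", "+5", and leading whitespace.
  if (!Str || !std::isdigit(static_cast<unsigned char>(Str[0])))
    return Default;
  errno = 0;
  char *End = nullptr;
  const unsigned long long Parsed = std::strtoull(Str, &End, 10);
  if (errno == ERANGE || *End != '\0')
    return Default; // Overflow or trailing garbage such as "12abc".
  return static_cast<uint64_t>(Parsed);
}
```

For example, with a hypothetical `EXAMPLE_ACTIVE_WAIT_MICROS=42` in the environment, `activeWaitMicrosFromEnv("EXAMPLE_ACTIVE_WAIT_MICROS", DefaultTransferActiveWaitMicros)` returns 42; unset or malformed values give 300000, and `0` maps onto the `MicrosToWait == 0` fallback discussed in the review.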

jplehr added inline comments. Mar 27 2023, 1:25 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
570	I think the intention was that if `MicrosToWait == 0` we fall back to the original behavior, but I just saw that we are now checking in two places. If this patch is desirable, I would prefer to move the `activeWaitImpl` implementation into `AMDGPUSignalTy::wait()`.

kevinsala added a subscriber: kevinsala. Mar 27 2023, 2:49 AM

kevinsala added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
568	Nit. This assignment is not needed, right? We could define it when it's used: `hsa_signal_value_t Got = hsa_signal_wait_scacquire(...);`
571	Could we make the exit condition consistent in both `activeWaitImpl` and `waitImpl`? The first is waiting until `Signal != Init (1)` and the second is waiting until `Signal == 0`.

jplehr marked an inline comment as done. Mar 27 2023, 3:58 AM

jplehr added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
571	Sure. I would vote for the `!=` version, though I'd need to see what changes are necessary. Basically, this makes the `wait` implementation logically the same as in AOMP: https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L16 Or are there objections?

kevinsala added inline comments. Mar 27 2023, 4:48 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
571	That new condition looks good to me. But we should extend the code documentation of these two functions to specify the new behavior. The wait operation no longer stops when the signal is observed with a zero value, but when we stop observing the `Init` (default: 1) value.
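The behavioral difference described in this thread only shows up when a signal starts above 1. The sketch below is a hypothetical model, not HSA code: it replays a sequence of observed signal values and reports which observation ends the wait under each condition, with `EqZero` modeling `HSA_SIGNAL_CONDITION_EQ` against 0 (the blocking loop) and `NeInit` modeling `HSA_SIGNAL_CONDITION_NE` against `Init` (the active wait).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the two exit conditions discussed in the review.
enum class ExitCondition {
  EqZero, // stop once the observed value equals 0
  NeInit  // stop once the observed value differs from Init
};

bool isSatisfied(ExitCondition Cond, int64_t Observed, int64_t Init) {
  return Cond == ExitCondition::EqZero ? Observed == 0 : Observed != Init;
}

// Replays the values a waiter would observe and returns the index of the first
// one that ends the wait, or -1 if none does.
int firstExitIndex(ExitCondition Cond, const std::vector<int64_t> &Observed,
                   int64_t Init) {
  assert(Init > 0 && "signals in this model start at a positive value");
  for (std::size_t I = 0; I < Observed.size(); ++I)
    if (isSatisfied(Cond, Observed[I], Init))
      return static_cast<int>(I);
  return -1;
}
```

With `Init = 1` and a single decrement (`{1, 0}`), both conditions end on the same observation; with `Init = 2` (`{2, 1, 0}`), `NeInit` returns after the first decrement while `EqZero` waits for the second, which is why the documentation of both functions would need to change.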

Unfortunately, my original problem still persists with this patch. The active wait timed out, and the thread then still got stuck in the blocking wait.
I tested 3, 30, 300, 3000, 30000, 300000, and 3000000 on my W6800. I have to suspect that some software layer between HSA and the hardware is broken.

Could we please not have a customisable timeout until someone argues passionately for one? It seems unlikely that it'll get much use and we'll end up carrying that branch for ages. Also environment variables are bad things.

In D146849#4225134, @ye-luo wrote:

Unfortunately, my original problem still persists with this patch. The active wait timed out, and the thread then still got stuck in the blocking wait.
I tested 3, 30, 300, 3000, 30000, 300000, and 3000000 on my W6800. I have to suspect that some software layer between HSA and the hardware is broken.

That's too bad. In which use case are you running into the issue?

Given that it does not help with the problem, I agree with @JonChesterfield and would prefer to first do some more experiments with the NextGen plugin and this patch to see its impact. If we come to the conclusion that having this is desirable, we can move it forward.
I will still update the patch with the comments.

jplehr mentioned this in D148808: [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state. Apr 20 2023, 8:17 AM

Abandoning in favor of https://reviews.llvm.org/D148808

Revision Contents

Path: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
Size: 46 lines

Diff 508230

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

[488 lines not shown] private:
    /// Additional Info for the AMD GPU Kernel
    std::optional<utils::KernelMetaDataTy> KernelInfo;
  };

  /// Class representing an HSA signal. Signals are used to define dependencies
  /// between asynchronous operations: kernel launches and memory transfers.
  struct AMDGPUSignalTy {
    /// Create an empty signal.
-   AMDGPUSignalTy() : Signal({0}), UseCount() {}
-   AMDGPUSignalTy(AMDGPUDeviceTy &Device) : Signal({0}), UseCount() {}
+   AMDGPUSignalTy() : Signal({0}), UseCount(), MicrosToWait(0) {}
+   AMDGPUSignalTy(AMDGPUDeviceTy &Device)
+       : Signal({0}), UseCount(), MicrosToWait(0) {}
+   AMDGPUSignalTy(uint64_t MicrosToWait)
+       : Signal({0}), UseCount(), MicrosToWait(MicrosToWait) {}

jdoerfert (Not Done): I would think we want an env var, with all signals behaving the same, so no argument here. WDYT?

jplehr (Author, Done): In AOMP, we actually use different values for data transfers and kernel launches. I don't think we need this for now, but we should do some experimentation to see what works best with the NextGen plugin. So, I'll change that and make it an EnvVar.

    /// Initialize the signal with an initial value.
    Error init(uint32_t InitialValue = 1) {
      hsa_status_t Status =
          hsa_amd_signal_create(InitialValue, 0, nullptr, 0, &Signal);
      return Plugin::check(Status, "Error in hsa_signal_create: %s");
    }

    /// Deinitialize the signal.
    Error deinit() {
      hsa_status_t Status = hsa_signal_destroy(Signal);
      return Plugin::check(Status, "Error in hsa_signal_destroy: %s");
    }

    /// Wait until the signal gets a zero value.
    Error wait() const {
-     // TODO: Is it better to use busy waiting or blocking the thread?
-     while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
-                                      UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)
-       ;
-     return Plugin::success();
+     if (MicrosToWait)
+       return activeWaitImpl();
+     return waitImpl();
    }

    /// Load the value on the signal.
    hsa_signal_value_t load() const { return hsa_signal_load_scacquire(Signal); }

    /// Signal decrementing by one.
    void signal() {
      assert(load() > 0 && "Invalid signal value");
[15 lines not shown]
  private:
    /// The underlying HSA signal.
    hsa_signal_t Signal;

    /// Reference counter for tracking the concurrent use count. This is mainly
    /// used for knowing how many streams are using the signal.
    RefCountTy<> UseCount;
+   /// Microseconds to stay in HSA_WAIT_STATE_ACTIVE before switching to blocking
+   uint64_t MicrosToWait;

ye-luo (Done): const uint64_t MicrosToWait

+   /// Blocking the waiting thread
+   Error waitImpl() const {

jdoerfert (Not Done), suggested edit:
    /// Blocking the waiting thread
-   Error waitImpl() const {
+   Error blockingWaitImpl() const {
      // TODO: Is it better to use busy waiting or blocking the thread?

+     // TODO: Is it better to use busy waiting or blocking the thread?
+     while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
+                                      UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)
+       ;
+     return Plugin::success();
+   }
+   /// Switch to blocking wait state after specified timeout
+   Error activeWaitImpl(hsa_signal_value_t Init = 1) const {
+     hsa_signal_value_t Got = Init;

kevinsala (Not Done): Nit. This assignment is not needed, right? We could define it when it's used: `hsa_signal_value_t Got = hsa_signal_wait_scacquire(...);`

+     hsa_signal_value_t Success = 0;
+     if (MicrosToWait) {

jdoerfert (Not Done): assert?

jplehr (Author, Done): I think the intention was that if `MicrosToWait == 0` we fall back to the original behavior, but I just saw that we are now checking in two places. If this patch is desirable, I would prefer to move the `activeWaitImpl` implementation into `AMDGPUSignalTy::wait()`.

+       Got = hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_NE, Init,

kevinsala (Not Done): Could we make the exit condition consistent in both `activeWaitImpl` and `waitImpl`? The first is waiting until `Signal != Init (1)` and the second is waiting until `Signal == 0`.

jplehr (Author, Done): Sure. I would vote for the `!=` version, though I'd need to see what changes are necessary. Basically, this makes the `wait` implementation logically the same as in AOMP: https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L16 Or are there objections?

kevinsala (Not Done): That new condition looks good to me. But we should extend the code documentation of these two functions to specify the new behavior. The wait operation no longer stops when the signal is observed with a zero value, but when we stop observing the `Init` (default: 1) value.

+                                       MicrosToWait, HSA_WAIT_STATE_ACTIVE);
+       if (Got == Success) {
+         return Plugin::success();
+       }

jdoerfert (Not Done), suggested edit:
                                      MicrosToWait, HSA_WAIT_STATE_ACTIVE);
-       if (Got == Success) {
+       if (Got == Success)
          return Plugin::success();
-       }
      }
      // Switch to blocked state

+     }
+     // Switch to blocked state
+     return waitImpl();
+   }
  };

  /// Classes for holding AMDGPU signals and managing signals.
  using AMDGPUSignalRef = AMDGPUResourceRef<AMDGPUSignalTy>;
  using AMDGPUSignalManagerTy = GenericDeviceResourceManagerTy<AMDGPUSignalRef>;

  /// Class holding an HSA queue to submit kernel and barrier packets.
  struct AMDGPUQueueTy {

[1,367 lines not shown] if (Size >= OMPX_MaxAsyncCopyBytes) {
      hsa_status_t Status;
      Status = hsa_amd_memory_lock(const_cast<void *>(HstPtr), Size, nullptr, 0,
                                   &PinnedHstPtr);
      if (auto Err =
              Plugin::check(Status, "Error in hsa_amd_memory_lock: %s\n"))
        return Err;
-     AMDGPUSignalTy Signal;
+     /* Example use for microseconds to wait */
+     AMDGPUSignalTy Signal(300000);

ye-luo (Not Done): Not getting what you mean by "Example use". Could you document the magic number 300000?

jplehr (Author, Done): Example use is copied from AOMP. For data transfers we use this timeout (for kernels we use a different one). As said above, we should probably remove it for now and do some experimentation.

      if (auto Err = Signal.init())
        return Err;
      Status = hsa_amd_memory_async_copy(TgtPtr, Agent, PinnedHstPtr, Agent,
                                         Size, 0, nullptr, Signal.get());
      if (auto Err =
              Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
        return Err;
[806 lines not shown]
