This patch adds the timeout-wait for HSA signal wait states. Meaning
that the thread is going into a blocking state only after a certain
timeout.
The idea of the implementation is copied from the AOMP implementation.
This came from a discussion about a hang observed in the wait() implementation
of NextGen plugin.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
501 | In AOMP, we actually use different values for data transfers and kernel launches. | |
1964 | Example use is copied from AOMP. |
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
570 | I think the intention was that if MicrosToWait == 0 we fall back to the original behavior, but I just saw that we are now checking in two places. |
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
568 | Nit. This assignment is not needed, right? We could define it when it's used: hsa_signal_value_t Got = hsa_signal_wait_scacquire(...); | |
571 | Could we make the exit condition consistent in both activeWaitImpl and waitImpl? The first is waiting until Signal != Init (1) and the second is waiting until Signal == 0. |
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
571 | Sure. Or are there objections? |
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
571 | That new condition looks good to me. But we should extend the code documentation of these two functions to specify the new behavior. The wait operation does no longer stop when the signal is observed with a zero value, but when we stop observing the Init (default: 1) value. |
Unfortunately my original problem still persists with this patch. The active wait timed out and then still stuck in the blocking wait.
Tested 3, 30, 300, 3000, 30000, 300000, 3000000 on my W6800. I have to consider some software between hsa and the hardware is broken.
Could we please not have a customisable timeout until someone argues passionately for one? It seems unlikely that it'll get much use and we'll end up carrying that branch for ages. Also environment variables are bad things.
That's too bad. Which use case are you running into the issue?
Given that is does not help the problem, I agree with @JonChesterfield and would prefer to first do some more experiments with the NextGen plugin and this to see its impact. If we come to the conclusion that having this is desirable, we can move it forward.
I will still update the patch with the comments.
I would think we want a Env var and all Signal behave the same, so no argument here, wdyt?