[OpenMP][libomptarget][AMDGPU] Enable active HSA wait state Adds HSA timeout hint of 2 seconds to the AMDGPU nextgen-plugin to improve performance of small kernels. The HSA runtime may stay in HSA_WAIT_STATE_ACTIVE for up to the timeout value before switching to HSA_WAIT_STATE_BLOCKED. This can improve latency from which small kernels can benefit. The value was determined via experimentation w/ different benchmarks. The timeout value can be overriden using the environment variable LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT with a value in microseconds.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
521 | When reading the implementation of wait(). I'm still wondering how error got handled when kernels finished (signal equals 0) with an error. | |
1248 | Should this be private? | |
1575 | Should they be private? |
It is a matter of priorities. If you are doing async copies, you may not want them to every busy-wait because you have more important things to do on the CPU. However, in that same situation you may have high priority short kernels that you prefer to get the performance of busywait.
For long-term tuning, we may actually want more control over busy-wait timeouts. For now, we just separate the timeouts on data transfers and kernel completions.
In the OpenMP context, when an OpenMP target without nowait reaches a signal wait. The user intention is busy/active waiting. There are no more important things to do on the core that OpenMP thread resides. If you were thinking of CPU cores being oversubscribed by threads, I would say one may gain or lose all depending on the application and there is no need to put such cases in high priority of discussion.
If nowait is added, events are used instead of signal wait. I think active waiting is always preferred.
Thanks for all the comments, I think there is a bit of confusion due to my lack of documentation of the code.
- Units of the variables: microseconds.
- Wait state: This indicates whether the HSA runtime should actively wait. As far as I understand it, this means that it will likely not perform a context switch while waiting. This can improve the responsiveness to the signal value change. I did some basic profiles with babelstream using standard waiting and the timeout values we used in the old plugin. I can see improvements in the duration that the system stays within the hsa_signal_wait_scaqcuire. This suggests that this is indeed helping with latency, and therefore with the runtime "realizing" that a short running kernel has finished. Babelstream results do reproducibly improve.
- As a result of 2., I think, we used different values for kernels and data movements, as the idea is that the data transfers may just be a little slower.
I will go ahead and update the patch to address all code-related comments.
From the meeting discussion.
- default to active waiting rather than 0.
- Consider a single LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT.
One environment variable, defaulting to 2 seconds timeout for active waiting, rename, const
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp | ||
---|---|---|
903 | Could you move this right below line 891 mutex? Prefer not to have variables stuck between functions. |
When reading the implementation of wait(). I'm still wondering how error got handled when kernels finished (signal equals 0) with an error.