This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget][AMDGPU] Enable active HSA wait state
ClosedPublic

Authored by jplehr on Apr 20 2023, 8:17 AM.

Details

Summary
[OpenMP][libomptarget][AMDGPU] Enable active HSA wait state

Adds HSA timeout hint of 2 seconds to the AMDGPU nextgen-plugin to improve
performance of small kernels.
The HSA runtime may stay in HSA_WAIT_STATE_ACTIVE for up to the timeout
value before switching to HSA_WAIT_STATE_BLOCKED. This can improve
latency from which small kernels can benefit.
The value was determined via experimentation w/ different benchmarks.

The timeout value can be overriden using the environment variable
LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT with a value in microseconds.

Diff Detail

Event Timeline

jplehr created this revision.Apr 20 2023, 8:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 8:17 AM
jplehr requested review of this revision.Apr 20 2023, 8:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 8:17 AM

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
521

When reading the implementation of wait(). I'm still wondering how error got handled when kernels finished (signal equals 0) with an error.

1248

Should this be private?

1575

Should they be private?

kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1248

Can we place this class data member just after the existing ones?

1575

Are these fields really needed? We can access the envars directly (which have the values cached in a variable), right?

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

It is a matter of priorities. If you are doing async copies, you may not want them to every busy-wait because you have more important things to do on the CPU. However, in that same situation you may have high priority short kernels that you prefer to get the performance of busywait.

For long-term tuning, we may actually want more control over busy-wait timeouts. For now, we just separate the timeouts on data transfers and kernel completions.

ye-luo added a comment.EditedApr 25 2023, 8:52 PM

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

It is a matter of priorities. If you are doing async copies, you may not want them to every busy-wait because you have more important things to do on the CPU.

In the OpenMP context, when an OpenMP target without nowait reaches a signal wait. The user intention is busy/active waiting. There are no more important things to do on the core that OpenMP thread resides. If you were thinking of CPU cores being oversubscribed by threads, I would say one may gain or lose all depending on the application and there is no need to put such cases in high priority of discussion.
If nowait is added, events are used instead of signal wait. I think active waiting is always preferred.

Thanks for all the comments, I think there is a bit of confusion due to my lack of documentation of the code.

  1. Units of the variables: microseconds.
  2. Wait state: This indicates whether the HSA runtime should actively wait. As far as I understand it, this means that it will likely not perform a context switch while waiting. This can improve the responsiveness to the signal value change. I did some basic profiles with babelstream using standard waiting and the timeout values we used in the old plugin. I can see improvements in the duration that the system stays within the hsa_signal_wait_scaqcuire. This suggests that this is indeed helping with latency, and therefore with the runtime "realizing" that a short running kernel has finished. Babelstream results do reproducibly improve.
  3. As a result of 2., I think, we used different values for kernels and data movements, as the idea is that the data transfers may just be a little slower.

I will go ahead and update the patch to address all code-related comments.

jplehr updated this revision to Diff 517116.Apr 26 2023, 3:39 AM
  • rebase the patch
  • add documentation
  • address reviewer comments

From the meeting discussion.

  1. default to active waiting rather than 0.
  2. Consider a single LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT.
jplehr updated this revision to Diff 519049.May 3 2023, 6:06 AM

One environment variable, defaulting to 2 seconds timeout for active waiting, rename, const

ye-luo accepted this revision.May 3 2023, 7:11 AM
ye-luo added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
903

Could you move this right below line 891 mutex? Prefer not to have variables stuck between functions.

This revision is now accepted and ready to land.May 3 2023, 7:11 AM
ye-luo added a comment.EditedMay 3 2023, 7:11 AM

Updating the patch description is also needed.

jplehr updated this revision to Diff 519402.May 4 2023, 2:43 AM

Address inline comment

jplehr retitled this revision from [OpenMP][libomptarget][AMDGPU] Enable optional active HSA wait state to [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state.May 4 2023, 2:51 AM
jplehr edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.