This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][LibC] Run libc GPU tests with 1 thread
ClosedPublic

Authored by jplehr on May 5 2023, 5:18 AM.

Details

Summary

Executing the libc GPU tests multithreaded may overload the GPU runtimes
for CUDA or HSA, which results in test errors. In addition, libc seems
to not respect the -j option of ninja, when it is invoked at the top
level.

This patch adds special handling of the check-libc rule when passed as
additional check to the OpenMP builder. Ninja will thus first cd into
the runtimes build directory to make the libc testing respect the given
amount of parallelism.
We then limit the number of threads to 1 as to minimize the potential
for any errors with GPU runtimes.

Diff Detail

Event Timeline

jplehr created this revision.May 5 2023, 5:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2023, 5:18 AM
jplehr requested review of this revision.May 5 2023, 5:18 AM
jhuber6 accepted this revision.May 5 2023, 5:21 AM

I think this is fine, one thread is probably overly conservative but the least likely to fail. Maybe someone else more familiar with the bots can chime in, but this works around a problem in the HSA runtime for now.

This revision is now accepted and ready to land.May 5 2023, 5:21 AM

Thanks @jhuber6
I would love to hear @gkistanova opinion on the patch before I land it.

gkistanova accepted this revision.Jun 1 2023, 12:49 PM

LGTM.

A nit pick. It might worth adding a note why you prefer using -C ninja flag rather than changing workdir of the corresponding LitTestCommand.

jplehr added a comment.Jun 1 2023, 2:49 PM

Thank you Galina,
Using the -C flag instead of the workdir of LitTestCommand is probably me still finding my way around the buildbots and being able to test that it does what I want locally without impacting all the bots.

So, what i understand then is that setting the workdir of the LitTestCommand should have the same result, i.e., cding into the directory before invoking the actual ninja command?
if so, it seems that using workdir is preferable.

So, what i understand then is that setting the workdir of the LitTestCommand should have the same result, i.e., cding into the directory before invoking the actual ninja command?

Correct. workdir is basically a path to a directory (if relative, it is relative to the root dir of that builder on that worker) where the command will be run from.

The main difference with -C is the case when such a directory does not exist at the command start time. Unless I'm missing something, ninja with -C will fail with an error "chdir to <path> - No such file or directory", and with workdir, buildbot will create that directory, and ninja will fail with an error "loading 'build.ninja': No such file or directory" or something along these lines.

So, either way is fine. -C might be a bit better from the error reporting perspective, as it mentions the exact problem right away, and maybe a bit easier to code around properties and interpolations.

You may add something like "ninja -C is used instead of changing workdir to highlight that we run this command in a directory different than all the other check commands." or something, and commit.

jplehr added a comment.Jun 2 2023, 2:18 AM

Thank you so much for the elaborate explanation. I'll leave the patch as is and add a comment as you suggested.

jplehr updated this revision to Diff 527880.Jun 2 2023, 9:10 AM

Updated comment as suggested in review

This revision was automatically updated to reflect the committed changes.

Also, FWIW I'm pretty sure the failure in libc is due to the runtime running out of queues. If you check dmesg you'll see stuff like

[189013.108736] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[189013.136334] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[189013.159077] kfd kfd: amdgpu: Process 795079 (pasid 0x804b) got unhandled exception
[189013.170028] amdgpu: No more SDMA queue to allocate
[189013.170032] amdgpu: Pasid 0x8051 DQM create queue type 1 failed. ret -12