This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add initial support for 'puts' and 'fputs' to the GPU
ClosedPublic

Authored by jhuber6 on May 23 2023, 8:38 PM.

Details

Summary

This patch adds the initial support required to support basic priting in
stdio.h via puts and fputs. This is done using the existing LLVM C
library File API. In this sense we can think of the RPC interface as
our system call to dump the character string to the file. We carry a
uintptr_t reference as our native "file descriptor" as it will be used
as an opaque reference to the host's version once functions like
fopen are supported.

For some unknown reason the declaration of the StdIn variable causes
both the AMDGPU and NVPTX backends to crash if I use the READ flag.
This is not used currently as we only support output now, but it needs
to be fixed

Diff Detail

Event Timeline

jhuber6 created this revision.May 23 2023, 8:38 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 23 2023, 8:38 PM
jhuber6 requested review of this revision.May 23 2023, 8:38 PM
jhuber6 updated this revision to Diff 525275.May 24 2023, 11:19 AM

Fix the issue with stdin. This requires removing constexpr from this function and I'm unsure why it's problematic, maybe someone can help me understand why it would caue an infinite reference and crash the GPU backend.

Overall, the precense of all the buffering code in this implementation is detrimental to implementing this on the GPU. The register usage for puts is currently ~70 registers while the basic write_to_stderr is only ~25. There could be some discussions on how to best optionally disable the buffering, but as it stands I don't think the runtime configuration is enough. A templated class would be ideal, but then we'd have a hard time keeping compatibility with user's custom streams for the Linux case. Worst case scenario we have a separate implementation just for GPUs to use. It would be very short and unlikely to change so it might be the easiest solution rather than to force the other targets into this very specific mold.

jhuber6 updated this revision to Diff 525370.May 24 2023, 4:51 PM

Add the standard stream entrypoints

jhuber6 updated this revision to Diff 526135.May 26 2023, 11:16 AM

Updating. I decided that the most painless method would be to make a constexpr
function that the GPU can use to disable buffering features. This is a little
ugly, but not unreasonable I don't think. Previously a lot of buffering code was
either causing errors or leading to very inefficient implementations on the GPU.
Hopefully this is an acceptable amount of ugliness to implement these in
general.

For reference, the GPU doesn't support buffering for a few reasons. First, the
buffer is a global shared resource. This would require a functioning mutex which
the GPU does not have. In general locks on the GPU cannot be safely implemented
due to lack of forward progress guaruntees. I.e. one thread cannot depend on
another reliably. Even if they were to be implemented it would be undesireable
as the GPU is highly parallel with thousands of threads running at the same
time. Second, even if we did have a reliable lock, it would likely add a
significant amount of register pressure to the GPU kernel. It is quite easy to
run out of resources so we need to be sparse if possible.

Buffering in theory would be nice on the GPU as it could prevent us from
performing an RPC call in situations where the user needs to send less than 64
bits, but the rest of the cost is unlikely to make up for it.

jhuber6 updated this revision to Diff 526139.May 26 2023, 11:23 AM

A little uglier, need to completely suppress the custom destructor as making a no-op.

tra added a comment.May 26 2023, 12:14 PM

In general locks on the GPU cannot be safely implemented due to lack of forward progress guaruntees.

I think sm_70+ does provide forward progress guarantees. This was one of the major architectural changes introduced on Volta.
https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#independent-thread-scheduling

In general locks on the GPU cannot be safely implemented due to lack of forward progress guaruntees.

I think sm_70+ does provide forward progress guarantees. This was one of the major architectural changes introduced on Volta.
https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#independent-thread-scheduling

So, I believe what Volta added was forward progress guarantees within a warp. E.g. a warp cannot deadlock on itself because the threads within a warp can make independent progress. The OpenCL model is extremely pessimistic and offers absolutely zero forward progress guarantees. As far as I understand the underlying hardware is a different story. I haven't heard of any vendors guaranteeing forward progress, or a fair scheduler on the GPU globally. However, I think that Nvidia in general is on firmer ground here than AMD so it's less likely to be an issue there. The typical way to bypass this is to provide enough locks in parallel that an active warp / wavefront won't get blocked on any other active one. It's very wasteful but it's "safe". At least for the AMD case.

lntue added inline comments.May 30 2023, 10:28 AM
libc/src/__support/File/file.h
24

Since most of the usage of this constexpr function is kind of double-negative form if ( !buffer_disabled() ), I think it is better to define this as a static constexpr member of class File, and change the name to positive form like File::USE_BUFFER, File::ENABLE_BUFFER or something like that.

jhuber6 added inline comments.May 30 2023, 10:36 AM
libc/src/__support/File/file.h
24

Yeah that might make it more readable. Is @sivachandra okay with this somewhat hacky workaround? We cannot compile otherwise due to the GPU backend's not allowing circular references, i.e. initializers that use offsets from itself. Also this gets the resource usage within acceptable levels.

jhuber6 updated this revision to Diff 526738.May 30 2023, 12:31 PM

Changing to a constexpr member variable with a positive check.

lntue added inline comments.May 30 2023, 3:53 PM
libc/src/__support/File/file.cpp
345

Shouldn't this be !ENABLE_BUFFER?

jhuber6 added inline comments.May 30 2023, 3:54 PM
libc/src/__support/File/file.cpp
345

You're right, I'll fix it.

lntue accepted this revision.May 30 2023, 5:56 PM
This revision is now accepted and ready to land.May 30 2023, 5:56 PM
sivachandra added inline comments.Jun 2 2023, 5:23 PM
libc/src/__support/File/file.cpp
28

Can we repeat the order of conditionals so that the ENABLE_BUFFER is checked only once:

if (!ENABLE_BUFFER || bufmode == _IONBF) {
  ...
} else if (bufmode =_ _IOFBF) {
  ...
} else { // if (bufmode == _IOLBF)
  ...
}
327

Can we do an early return:

if constexpr (!ENABLE_BUFFER)
  return;
341

Same here:

if constexpr (!ENABLE_BUFFER)
  return;
libc/src/__support/File/gpu/dir.cpp
17

If they are just erroring out, why are they required?

sivachandra added inline comments.Jun 2 2023, 5:31 PM
libc/src/__support/File/file.h
54

Sorry, forgot to mention this in the earlier post: Just pointing out that files opened even with fopencookie will become unbuffered.

That we have new switch means we should add tests for it on non-GPU platforms also. So, one way would be to do something like this:

  1. Add a compile option, say LIBC_COPT_UNBUFFERED_FILE and update this conditional to:
#if defined(LIBC_TARGET_ARCH_IS_GPU) || defined(LIBC_COPT_UNBUFFERED_FILE)
  1. The above flag will allow to setup up a separate platform independent test only target for unbuffered files.
  2. Add a test for unbuffered files which depends on the ubuffered file target.
jhuber6 marked 4 inline comments as done.Jun 2 2023, 5:35 PM
jhuber6 added inline comments.
libc/src/__support/File/file.h
54

It would be fine to have other users of unbuffered IO, maybe for a more minimal implementation.

libc/src/__support/File/gpu/dir.cpp
17

Wasn't sure if this was required for linking somewhere.

jhuber6 updated this revision to Diff 528047.Jun 2 2023, 5:39 PM

Addressing comments

sivachandra added inline comments.Jun 2 2023, 10:49 PM
libc/src/__support/File/file.cpp
39

line buffered

libc/src/__support/File/file.h
54

I agree that it is fine to have this switch. My comment about fopencookie was just FYI. But, to protect accidental regressions around this switch, I am suggesting that you should add a test for it even for non-GPU platforms.

188

Can you add a comment why?

libc/src/__support/File/gpu/dir.cpp
17

You are seeing some linker error if you don't add this file?

jhuber6 marked an inline comment as done.Jun 3 2023, 4:19 AM
jhuber6 added inline comments.
libc/src/__support/File/file.h
54

I'm not sure how to add a test on a non-GPU platform since it's a build-time thing.

libc/src/__support/File/gpu/dir.cpp
17

The CMake expects it currently so I just provide an empty file.

jhuber6 updated this revision to Diff 528158.Jun 3 2023, 4:15 PM

Address comments

sivachandra accepted this revision.Jun 5 2023, 10:16 AM