This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add GPU support for `printf` and `fprintf`
AcceptedPublic

Authored by jhuber6 on Aug 24 2023, 1:33 PM.

Details

Summary

This patch adds support for calling the printf and fprintf functions
from the GPU. The overall goal is to provide an implementation of
printf that is light-weight enough to be used on the GPU without
straining resources or being too slow.

The RPC interface allows for streaming arbitrary bytes from the client to
the server. To implement printf, we first stream the format string to
the server. Then, both sides use an extremely minimal parser to identify
the sizes and number of arguments that will be exchanged. The client will
then use this information to traverse through the variadic list with the
appropriate sizes and send the argument to the server. The server uses
this to build an array of all the argument to printf.

Once the server has these arguments, we can re-use the LLVM C library
handling of printf. With a custom argument list we can instead
traverse this array of pointers instead of a varadic list.

This patch depends on correct codegen and implementation of variadics on
the GPU, which is being handled by @JonChesterfield.

Depends on D158246

Diff Detail

Event Timeline

jhuber6 created this revision.Aug 24 2023, 1:33 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 24 2023, 1:33 PM
jhuber6 requested review of this revision.Aug 24 2023, 1:33 PM
jplehr added inline comments.Aug 24 2023, 2:06 PM
libc/src/__support/arg_list.h
73

Should the argument be const & here and below?

libc/utils/gpu/server/rpc_server.cpp
94

Allocating char here and deleting uint8_t later on.
Is there a particular reason for the type difference?

arsenm added inline comments.Aug 24 2023, 2:15 PM
libc/src/stdio/gpu/parser.h
59

Why all the ()s

libc/utils/gpu/server/rpc_server.cpp
111

use unique_ptr<>?

arsenm added inline comments.Aug 24 2023, 2:17 PM
libc/test/integration/startup/gpu/rpc_printf_test.cpp
17

ASSERT_EQ

arsenm added inline comments.Aug 24 2023, 2:21 PM
libc/src/stdio/gpu/parser.h
90

swap these cases?

jhuber6 updated this revision to Diff 553268.Aug 24 2023, 2:41 PM

Addressing comments

jhuber6 marked 6 inline comments as done.Aug 24 2023, 2:41 PM
jhuber6 updated this revision to Diff 553298.Aug 24 2023, 4:20 PM

Improve test

jhuber6 updated this revision to Diff 553470.Aug 25 2023, 7:35 AM

Some small updates. I verified the implementation on an Nvidia sm_70 and noted
functionality. The test as written does not work because for some reason it
seems that the Nvidia loader cannot call RPC functions inside of the
constructor. Unsure why, I'll either work around it or fix it.

jhuber6 updated this revision to Diff 553974.Aug 28 2023, 9:58 AM

Fix a bug that caused deadlocks if a different number of arguments were passed.
It's very important that the *same* warp / wave mask that opened the port
participates in the sending. So this changes it to always send a size of 0 if
it is not active.

jhuber6 updated this revision to Diff 555871.Sep 5 2023, 8:52 AM

Fix printing pointers not working. This required having the libc parser
explicitly use char * when it gets the next argument so we can detect if we
need to treat the array argument as a pointer to a string or a pointer to a
pointer.

michaelrj added inline comments.Sep 5 2023, 10:25 AM
libc/src/stdio/gpu/parser.h
135

I realize that this was probably for size reasons, but this should probably add back something for parsing flags, widths, and precisions. The current design will accept all valid format specifiers, but won't reject all invalid format specifiers (e.g. %?d).

169

this should be a bit_cast

jhuber6 added inline comments.Sep 5 2023, 10:42 AM
libc/src/stdio/gpu/parser.h
135

I originally had code for that but it added an absurd amount of resource usage that I don't think we really need. The parsing here will definitely pick this up as an argument and copy the argument to the server. We use a real parser there so the actual printing will pick it up, but I guess the number of arguments might be off?

I'm guessing the standard solution is to just ignore it?

michaelrj added inline comments.Sep 5 2023, 1:51 PM
libc/src/stdio/gpu/parser.h
135

in the case of an invalid format specifier (like %?d) the behavior is undefined. The way we generally handle it is by treating the format specifier (in this case %?) as a raw string. The current micro-parser would mistakenly skip past the ? and assume the format name is d, which is just a letter d in this case. This would cause argument misalignment since there would be an extra int argument where there shouldn't be one.

jhuber6 updated this revision to Diff 555944.Sep 5 2023, 3:01 PM

Address comments and enable the printf handling to handle '*' flags and more gracefully handle malformed input.

jhuber6 updated this revision to Diff 556038.Sep 6 2023, 7:55 AM

Update to handle malformed inputs better, we now catch things like %1.1.1f.
This does not have a solution to severely malformed input like %****d, but I'm
unsure if we should sacrifice more complexity to catch this case.

michaelrj added inline comments.Sep 6 2023, 11:17 AM
libc/src/stdio/gpu/file.h
2

nit: formatting

libc/src/stdio/gpu/parser.h
104

This should be

if(format[cur_pos] == '.'){
  ++cur_pos;
  while (format[cur_pos] != '\0' && isdigit(format[cur_pos]))
    ++cur_pos;
}

A solo . is an acceptable precision, but ..1 is not.

143

instead of having all of the flag/width/precision parsing in parse_length_modifier wouldn't it be better to put those out here? That would also let you skip those parsing steps when unfinished is set.

jhuber6 updated this revision to Diff 556067.Sep 6 2023, 12:26 PM

Address comments

michaelrj added inline comments.Sep 6 2023, 1:11 PM
libc/src/stdio/gpu/parser.h
132

to handle the *s you could check if format[cur_pos]=='*' after this loop and if true set unfinished and return a specifier with the int.

If you make unfinished an enum with three states - finished, width, and precision - then you can set it to width here and precision after precision. This lets you skip all the loops before the point you're resuming to, which also makes it more accurate by avoiding consuming flags in invalid positions (e.g. the current parser would accept %-*-d since it loops through the flags again after each *).

jhuber6 added inline comments.Sep 6 2023, 1:16 PM
libc/src/stdio/gpu/parser.h
132

I was messing around with something like that, but a lot of complicated control flow really bloats the register count. The first version of this used like 22 and we're already up to 26 so we're well into the bounds of non-trivial. I'm somewhat apprehensive to increase it further, but I can try to see what it costs since I'm not happy with the big FIXME up there.

michaelrj added inline comments.Sep 6 2023, 1:26 PM
libc/src/stdio/gpu/parser.h
132

is register usage proportional to the number of variables or to the number of branches? I don't know a lot about how GPU code works.

jhuber6 added inline comments.Sep 6 2023, 1:35 PM
libc/src/stdio/gpu/parser.h
135

So it's similar to any other architecture, you have registers that store intermediate values. The way the GPU differs is that we have different kind of registers, and a lot of them. So, because the GPU uses a SIMD architecture at its core, we have a separation between scalar registers and vector registers. These conditions are heavy on the vector registers because they're values that can be divergent across the SIMD group. E.g. we have 32 threads in a "warp" all executing in lock-step. A vector register thus contains 32-values. If we're doing a printf where the format string is different between threads (uncommon but possible) then we'll hit the vector registers harder.

Unrelated, @arsenm, do we have a way to assert that some of these values will be uniform if the format string is uniform? It would probably save us a good amount of vector registers if we knew that this internal state would all be the same.

jhuber6 added inline comments.Sep 6 2023, 1:39 PM
libc/src/stdio/gpu/parser.h
132

I replied to the wrong comment.

Also, would the above description handle a case like %*.**? I think we'd skip past the parsing and then just get the next *.

michaelrj added inline comments.Sep 6 2023, 3:46 PM
libc/src/stdio/gpu/parser.h
132

the way I'm describing would, I believe, parse %*.** as follows:
% - start the format specifier
* - width is a variable (need to send an int)
. - precision exists
* - precision is a variable (need to send an int)
* - format name is *.

The parser treats the first character that doesn't fit into another category as the format name. Since this should be past the width and precision members the only category left that isn't the format name is length modifier, and * is not valid as a length modifier. Since * isn't valid as a format name this whole format specifier gets marked as raw and is printed literally, though the two variables will be consumed. Needless to say this is very undefined and I wouldn't recommend actually doing it.

As a sidenote, I'd be fascinated to know what NVidia's printf does for printf("%*.** %d %d\n",1,2,3,4);. I've gotten a different answer from every libc I've tried it on.

jhuber6 added inline comments.Sep 6 2023, 3:49 PM
libc/src/stdio/gpu/parser.h
132

I'll sate your curiousity, Nvidia's printf on CUDA 12.2 returns the following: %1* 0 2. My glibc version 2.38 prints %1.2* 3 4. I have no clue what ** is even supposed to do since it doesn't seem to be ignored directly.

michaelrj added inline comments.Sep 6 2023, 4:09 PM
libc/src/stdio/gpu/parser.h
132

** should be treated like a * followed by any other invalid format name, equivalent to *?. Evidently not all libc implementations handle it cleanly though, I tried it with MacOS's libc and it returns %d 4. I have no idea how CUDA manages to find a 0 in there. From previous experience I believe that glibc prints a format specifier with all of the arguments included when it fails to parse, meaning it will deduplicate flags and convert integer arguments to strings. It's actually pretty cool, if a bit overkill for such an unusual situation. Our libc returns %*.** 3 4 which, is in my opinion, reasonable.

jhuber6 added inline comments.Sep 6 2023, 4:14 PM
libc/src/stdio/gpu/parser.h
132

The Nvidia implementation of printf is a complete mystery to me. It's proprietary so you can't see how it's implemented, but process of elimination suggests that it's a ring buffer that's flushed by the host runtime at predetermined points. They must do some form of parsing on the GPU similar to here, because it handles non-constant strings and divergent values correctly as far as I'm aware. Pretty much the way it works is if your program has an undefined reference to something called printf it will automatically link in a magic implementation that lives somewhere. What's confusing to me is the resource usage. I'm pretty sure that this printf support is some opaque binary blob that isn't taken into account when checking things like register usage. This is because I attempted to compare register usage with my implementation and theirs. This implementation uses something like 64 (which used to be the entire register budget back in the sm_20 days, but we have a lot more now) but checking the Nvidia implementation it states it's something like 23 register, which just so happens to be the exact same results you'd get if you called an empty external function with the same signature. That's about all I've been able to glean from it, it's basically magic. I think they've improved it a lot since I last checked, they used to ignore a lot of flags.

jhuber6 updated this revision to Diff 556092.Sep 6 2023, 4:31 PM

Addressing malformed star inputs. This used like five more registers in the average case, but they tended to be scalar so it's a bit less of an issue.

jhuber6 added inline comments.Sep 6 2023, 4:32 PM
libc/src/stdio/gpu/parser.h
132

With this most recent patch I get %*.** 3 4 as the output for your provided most-vexing printf parse.

michaelrj added inline comments.Sep 6 2023, 4:40 PM
libc/src/stdio/gpu/parser.h
132

That's very strange. We know they have to parse the format string on-device or else the %s conversion wouldn't work. Maybe they have some proprietary hardware trickery.

I'm okay with this parser having some strange behavior on aggressively undefined inputs, though I would recommend adding tests for the parser specifically.

libc/test/integration/startup/gpu/rpc_printf_test.cpp
18

it's likely not important here since this is a very specific test, but we do have a utility (libc_make_test_file_path) for finding the correct folder to write test data to. It's mostly important for bazel since it complains when files write to unexpected places.

73

is it possible to re-open the file in read mode at the end to check that the output strings are what is expected?

jhuber6 added inline comments.Sep 6 2023, 4:44 PM
libc/test/integration/startup/gpu/rpc_printf_test.cpp
73

It's a little annoying here because I need to test this stuff with multiple threads. Meaning that everything's in a random order.

michaelrj added inline comments.Sep 6 2023, 4:49 PM
libc/src/stdio/gpu/parser.h
132

excellent! The printf parser side looks good to me.

137–141

this check needs to go before the isdigit loop.

michaelrj added inline comments.Sep 6 2023, 4:52 PM
libc/test/integration/startup/gpu/rpc_printf_test.cpp
73

Would it be possible to have each thread open a file with a different name (e.g. thread 1 opens test_data1.txt)?

jhuber6 updated this revision to Diff 556096.Sep 6 2023, 4:55 PM

Moving '*' check above width check.

jhuber6 added inline comments.Sep 6 2023, 4:59 PM
libc/test/integration/startup/gpu/rpc_printf_test.cpp
73

Might be possible in theory. I deliberately chose string sizes that would hopefully be obviously wrong if it wasn't parsed correctly to avoid the extra trouble however. I can try to do it if it's necessary, but I'm wondering if that would be handed by another unit test. The importance of this "integration test" is that the integration tests are the only ones I can run with many threads to make sure it doesn't deadlock or anything.

LGTM from the printf side, I don't know enough about GPU coding to review the RPC stuff

libc/test/integration/startup/gpu/rpc_printf_test.cpp
73

ah, fair enough.

LGTM from the printf side, I don't know enough about GPU coding to review the RPC stuff

I'm actually having second thoughts about this approach. There were basically two options, we either parse it on the CPU and tell the GPU what to do, or we parse it on the GPU directly. I went with the latter because the former requires more cross-device communication (which is the really slow part). But looking at it now, it's basically doubling the amount of time it takes to process an argument (~5us to ~10us) but it does save us a lot of time. It would basically require yet another mock argument parser that goes through the list and encodes some enum for what type to use for next_var() in the va_list. I did a mock-up of this and it's much lighter weight. I would allow us to get rid of the ad-hoc parser I've created though, so you might like that. I could maybe hide some of the latency better... It's a tough choice. I might play around with it some more tomorrow.

I'm actually having second thoughts about this approach. There were basically two options, we either parse it on the CPU and tell the GPU what to do, or we parse it on the GPU directly. I went with the latter because the former requires more cross-device communication (which is the really slow part). But looking at it now, it's basically doubling the amount of time it takes to process an argument (~5us to ~10us) but it does save us a lot of time. It would basically require yet another mock argument parser that goes through the list and encodes some enum for what type to use for next_var() in the va_list. I did a mock-up of this and it's much lighter weight. I would allow us to get rid of the ad-hoc parser I've created though, so you might like that. I could maybe hide some of the latency better... It's a tough choice. I might play around with it some more tomorrow.

Just for reference, tried this and didn't get nearly enough of a decrease in resource usage to make it worthwhile.

arsenm added inline comments.Sep 12 2023, 4:32 AM
libc/src/stdio/gpu/parser.h
55

no else after return

65

no else after return

This revision is now accepted and ready to land.Sep 22 2023, 3:27 PM
libc/src/stdio/generic/CMakeLists.txt