This is an archive of the discontinued LLVM Phabricator instance.

libclc: add printf support on amd target
AcceptedPublic

Authored by EdB on Jul 23 2020, 2:54 AM.

Details

Summary

On AMD target, printf is replaced and call to printf_alloc are inserted
in order to get the buffer offset to write to

Diff Detail

Event Timeline

EdB created this revision.Jul 23 2020, 2:54 AM
EdB removed a project: Restricted Project.Jul 23 2020, 2:55 AM
EdB edited subscribers, added: jvesely, tstellar; removed: nhaehnle, jfb, kerbowa.

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

arsenm added a subscriber: arsenm.Jul 23 2020, 1:27 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

EdB added a comment.Jul 23 2020, 1:57 PM

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

An implementation can be tested here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6040

EdB added a comment.Jul 23 2020, 2:01 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

It's based on what it's expected by the the LLVM AMD gpu target when compiling cl file.
AMD Roc is using a similar implementation.

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

since it uses gcn specific builtin it should be in amdgcn/ rather than amdgpu/

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

EdB added a comment.Jul 23 2020, 4:02 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

the main difference is atomic usage (I use those already in liclc), implict arg pos and asking to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last 2

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

I don't believe so, but I know next to nothing about the runtime so I'm not 100% confident

EdB added a comment.Jul 23 2020, 4:11 PM
In D84392#2170911, @EdB wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

the main difference is atomic usage (I use those already in liclc), implict arg pos and asking to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last 2

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation use a regular device buffer that is read at the end of the execution. Mostly like a clEnqueueReadBuffer

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation use a regular device buffer that is read at the end of the execution. Mostly like a clEnqueueReadBuffer

Is that on ROCm side or clover? will it stay that way on the ROCm side in the future?
The idea of using different ABI (and -mesa3d triple) was that clover is not tied to changes AMD makes for ROCm.
it can make independent decisions, it's not exposed to potential breakage when ROCm does incompatible changes and does not force ROCm ABI on other users of clover (like r600 or nouveau).
It might be OK for printf (let's see what curro says on the mesa side), but I'd expect that attempting to closely follow ROCm will speed up efforts to switch mesa opencl to SPIRV->NIR path.

EdB retitled this revision from libclc: add printf to amd to libclc: add printf support on amd target.Aug 2 2020, 4:30 AM
EdB updated this revision to Diff 282443.Aug 2 2020, 4:34 AM

Add printf declaration to generic include
Move implementation part to amdgcn subdir
Add CL 1.2 flag to support variadic arguments

awatry added a subscriber: awatry.Aug 5 2020, 6:29 AM
awatry added inline comments.
libclc/generic/include/clc/printf/printf.h
2

I think we should probably have an #if guard around this declaration to make sure the CLC language version is 1.2 or higher.

With this patch applied, mesa/clover on a CL1.1 device can no longer build CL kernels.

Have you checked that returning NULL is handled properly?
Should the buffer wrap around instead?

libclc/amdgcn/lib/SOURCES
7

Calling the file printf_alloc or similar would make it clearer that it doesn't actually implement printf

libclc/amdgcn/lib/printf/printf.cl
3

'replace' -> 'replaces'
'call' -> 'calls'
'add' -> 'adds'

10

since the return value is a pointer, NULL would be more appropriate

13

'store' -> 'stores'

pls add a short table of the expected buffer layout

15

This would be more readable with a temp var buffer_offset_ptr since you're using it below as well.

28

return type is a pointer, NULL would be more appropriate

EdB added a comment.Aug 15 2020, 11:42 PM

Have you checked that returning NULL is handled properly?
Should the buffer wrap around instead?

When returning NULL, no data is append to the buffer.
The buffer is seen as full

EdB updated this revision to Diff 285884.Aug 16 2020, 1:24 AM
EdB marked 2 inline comments as done.
EdB marked 5 inline comments as done.Aug 16 2020, 1:26 AM
jvesely accepted this revision.Aug 16 2020, 9:35 AM

Thanks. This LGTM, but you might want to get @tstellar's opinion as well.

This revision is now accepted and ready to land.Aug 16 2020, 9:35 AM

This is fine with me.

EdB added a comment.Aug 23 2020, 3:10 AM

Please push the patch for me since I have no commit rigths

vcosta accepted this revision.EditedApr 29 2021, 4:13 PM
vcosta added a subscriber: vcosta.

Uh... It seems this is needed to add printf support to the AMD OpenCL drivers in Gallium Clover. Which would enable that architecture to have OpenCL 1.2 driver support (finally).
Why wasn't this committed to LLVM yet? I am saying this as an outside observer who was passing by looking at https://mesamatrix.net/ wondering why the AMD cards did not support printf for OpenCL 1.2. Then went down a rabbit hole and got here.
It seems this is a major stopper for that and yet this patch, which was approved by the reviewers some 6 months or more ago, has had no one get around to commit it yet.
Anyone with commit rights please help.
Unless I am missing something here.