Page MenuHomePhabricator

libclc: add printf support on amd target
Needs ReviewPublic

Authored by EdB on Thu, Jul 23, 2:54 AM.

Details

Reviewers
tstellar
jvesely
Summary

On AMD target, printf is replaced and call to printf_alloc are inserted
in order to get the buffer offset to write to

Diff Detail

Event Timeline

EdB created this revision.Thu, Jul 23, 2:54 AM
EdB removed a project: Restricted Project.Thu, Jul 23, 2:55 AM
EdB edited subscribers, added: jvesely, tstellar; removed: nhaehnle, jfb, kerbowa.

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

arsenm added a subscriber: arsenm.Thu, Jul 23, 1:27 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

EdB added a comment.Thu, Jul 23, 1:57 PM

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

An implementation can be tested here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6040

EdB added a comment.Thu, Jul 23, 2:01 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

It's based on what it's expected by the the LLVM AMD gpu target when compiling cl file.
AMD Roc is using a similar implementation.

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

since it uses gcn specific builtin it should be in amdgcn/ rather than amdgpu/

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

EdB added a comment.Thu, Jul 23, 4:02 PM

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

the main difference is atomic usage (I use those already in liclc), implict arg pos and asking to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last 2

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

I don't believe so, but I know next to nothing about the runtime so I'm not 100% confident

EdB added a comment.Thu, Jul 23, 4:11 PM
In D84392#2170911, @EdB wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

the main difference is atomic usage (I use those already in liclc), implict arg pos and asking to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last 2

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation use a regular device buffer that is read at the end of the execution. Mostly like a clEnqueueReadBuffer

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation use a regular device buffer that is read at the end of the execution. Mostly like a clEnqueueReadBuffer

Is that on ROCm side or clover? will it stay that way on the ROCm side in the future?
The idea of using different ABI (and -mesa3d triple) was that clover is not tied to changes AMD makes for ROCm.
it can make independent decisions, it's not exposed to potential breakage when ROCm does incompatible changes and does not force ROCm ABI on other users of clover (like r600 or nouveau).
It might be OK for printf (let's see what curro says on the mesa side), but I'd expect that attempting to closely follow ROCm will speed up efforts to switch mesa opencl to SPIRV->NIR path.

EdB retitled this revision from libclc: add printf to amd to libclc: add printf support on amd target.Sun, Aug 2, 4:30 AM
EdB updated this revision to Diff 282443.Sun, Aug 2, 4:34 AM

Add printf declaration to generic include
Move implementation part to amdgcn subdir
Add CL 1.2 flag to support variadic arguments

awatry added a subscriber: awatry.Wed, Aug 5, 6:29 AM
awatry added inline comments.
libclc/generic/include/clc/printf/printf.h
2

I think we should probably have an #if guard around this declaration to make sure the CLC language version is 1.2 or higher.

With this patch applied, mesa/clover on a CL1.1 device can no longer build CL kernels.

Have you checked that returning NULL is handled properly?
Should the buffer wrap around instead?

libclc/amdgcn/lib/SOURCES
7

Calling the file printf_alloc or similar would make it clearer that it doesn't actually implement printf

libclc/amdgcn/lib/printf/printf.cl
3

'replace' -> 'replaces'
'call' -> 'calls'
'add' -> 'adds'

10

since the return value is a pointer, NULL would be more appropriate

13

'store' -> 'stores'

pls add a short table of the expected buffer layout

15

This would be more readable with a temp var buffer_offset_ptr since you're using it below as well.

28

return type is a pointer, NULL would be more appropriate