This is an archive of the discontinued LLVM Phabricator instance.

Differential D84392

libclc: add printf support on amd target
Accepted (Public)

Authored by EdB on Jul 23 2020, 2:54 AM.

Details

Reviewers

tstellar
jvesely
vcosta

Summary

On the AMD target, printf calls are replaced and calls to printf_alloc are inserted
in order to get the buffer offset to write to.

Event Timeline

EdB created this revision. Jul 23 2020, 2:54 AM

Herald added subscribers: kerbowa, jfb, nhaehnle. Jul 23 2020, 2:54 AM

EdB removed a project: Restricted Project. Jul 23 2020, 2:55 AM

EdB edited subscribers, added: jvesely, tstellar; removed: nhaehnle, jfb, kerbowa.

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

In D84392#2170493, @jvesely wrote:

I don't think there are many benefits in merging this ahead of time without the rest of printf implementation so that it can be tested.

An implementation can be tested here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6040

In D84392#2170601, @arsenm wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

It's based on what is expected by the LLVM AMD GPU target when compiling a cl file.
AMD ROCm uses a similar implementation.

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Since it uses a GCN-specific builtin, it should be in amdgcn/ rather than amdgpu/.

In D84392#2170601, @arsenm wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

In D84392#2170732, @arsenm wrote:

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

In D84392#2170733, @arsenm wrote:

In D84392#2170732, @arsenm wrote:

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

In D84392#2170732, @arsenm wrote:

In D84392#2170601, @arsenm wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

The main differences are atomic usage (I already use those in libclc), the implicit arg position, and asking the runtime to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last two.

In D84392#2170909, @jvesely wrote:

In D84392#2170733, @arsenm wrote:

In D84392#2170732, @arsenm wrote:

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

I don't believe so, but I know next to nothing about the runtime so I'm not 100% confident

In D84392#2170911, @EdB wrote:

In D84392#2170732, @arsenm wrote:

In D84392#2170601, @arsenm wrote:

clang already emits an expansion for printf using the hostcall mechanism. This seems to be doing something different, so I'm not sure where it's going

Really I would prefer if this just cloned rocm-device-libs. The backend doesn't need to maintain two ABIs for dnot

The main differences are atomic usage (I already use those in libclc), the implicit arg position, and asking the runtime to set the offset to 8 at buffer init instead of adding 8 on every call.
I can easily change the last two.

In D84392#2170909, @jvesely wrote:

In D84392#2170733, @arsenm wrote:

In D84392#2170732, @arsenm wrote:

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation uses a regular device buffer that is read at the end of the execution, much like a clEnqueueReadBuffer.
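
A host-side sketch of how such a buffer might be inspected once the kernel has finished, assuming the layout described in the patch's comments (word 0 holds the current offset, word 1 holds the buffer size, and the runtime stores an initial offset of 8). The helper name `printf_payload_bytes` is hypothetical, and the record format of the payload is not specified in this thread, so the sketch only reports how many payload bytes were written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header: two 32-bit words, the current offset followed by the size. */
enum { PRINTF_HEADER_BYTES = 8 };

/*
 * Hypothetical helper: given a host copy of the printf buffer, return the
 * number of payload bytes the kernel appended after the header.
 * Returns 0 when the header is missing or inconsistent.
 */
static size_t printf_payload_bytes(const unsigned char *buf, size_t buf_len)
{
    uint32_t offset, size;

    assert(buf != NULL);
    if (buf_len < PRINTF_HEADER_BYTES)
        return 0;

    memcpy(&offset, buf, sizeof offset);   /* word 0: current offset */
    memcpy(&size, buf + 4, sizeof size);   /* word 1: buffer size in bytes */

    /* Never trust a size larger than what was actually read back. */
    if (size > buf_len)
        size = (uint32_t)buf_len;

    /* The offset starts at 8 and must stay within the buffer. */
    if (offset < PRINTF_HEADER_BYTES || offset > size)
        return 0;

    return offset - PRINTF_HEADER_BYTES;
}
```

A runtime would presumably copy the device buffer to host memory first (for example with a blocking read after the kernel completes) and then decode the payload bytes that follow the header.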

In D84392#2170909, @jvesely wrote:

In D84392#2170733, @arsenm wrote:

In D84392#2170732, @arsenm wrote:

In D84392#2170731, @jvesely wrote:

Does the printf expansion work for all amdgpu triples, including amdgcn--mesa3d?
will it work for non-ROC capable GPUs?

Nothing in the compiler actually depends on roc capable GPUs. The limitations are all runtime/driver related

The exception to this would be places in the libraries that assume flat instructions are available. Theoretically we could handle these in the backend for SI with tagged pointers or something

so there are no hidden assumptions; like the printf buffer being cache coherent with host CPU, or the implementation using s_sendmsg to trigger CPU draining the buffer before the kernel finishes execution?

The implementation uses a regular device buffer that is read at the end of the execution, much like a clEnqueueReadBuffer.

Is that on the ROCm side or in clover? Will it stay that way on the ROCm side in the future?
The idea of using a different ABI (and the -mesa3d triple) was that clover is not tied to changes AMD makes for ROCm.
It can make independent decisions, it's not exposed to potential breakage when ROCm makes incompatible changes, and it does not force the ROCm ABI on other users of clover (like r600 or nouveau).
It might be OK for printf (let's see what curro says on the mesa side), but I'd expect that attempting to closely follow ROCm will speed up efforts to switch mesa OpenCL to the SPIRV->NIR path.

EdB retitled this revision from "libclc: add printf to amd" to "libclc: add printf support on amd target". Aug 2 2020, 4:30 AM

Add printf declaration to generic include
Move implementation part to amdgcn subdir
Add CL 1.2 flag to support variadic arguments

Herald added a subscriber: mgorny. Aug 2 2020, 4:34 AM

awatry added a subscriber: awatry. Aug 5 2020, 6:29 AM

awatry added inline comments.

libclc/generic/include/clc/printf/printf.h
2 ↗	(On Diff #282443)	I think we should probably have an #if guard around this declaration to make sure the CLC language version is 1.2 or higher. With this patch applied, mesa/clover on a CL1.1 device can no longer build CL kernels.
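
A minimal sketch of the kind of guard suggested above. It assumes the compiler exposes the language version through `__OPENCL_C_VERSION__` (120 for OpenCL C 1.2); the `CLC_HAS_PRINTF` macro is hypothetical and only there so that the result of the check is visible. The fallback definition at the top is a stand-in that lets the fragment build with a host C compiler as if it were a CL 1.1 build; it would not be part of a real header.

```c
/* Host stand-in only: pretend this is an OpenCL C 1.1 compilation. */
#ifndef __OPENCL_C_VERSION__
#define __OPENCL_C_VERSION__ 110
#endif

#if __OPENCL_C_VERSION__ >= 120
/* Declaration taken from the patch; only visible for CL 1.2 or newer. */
int printf(__constant const char *st, ...) __attribute__((format(printf, 1, 2)));
#define CLC_HAS_PRINTF 1 /* hypothetical marker macro */
#else
/* CL 1.0/1.1: the variadic declaration is never seen by the compiler. */
#define CLC_HAS_PRINTF 0
#endif
```

With a guard of this shape, kernels built for a CL 1.1 device never see the variadic declaration, which is the build failure the inline comment describes.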

Have you checked that returning NULL is handled properly?
Should the buffer wrap around instead?

libclc/amdgcn/lib/SOURCES
7 ↗	(On Diff #282443)	Calling the file `printf_alloc` or similar would make it clearer that it doesn't actually implement printf
libclc/amdgcn/lib/printf/printf.cl
3 ↗	(On Diff #282443)	'replace' -> 'replaces' 'call' -> 'calls' 'add' -> 'adds'
10 ↗	(On Diff #282443)	since the return value is a pointer, `NULL` would be more appropriate
13 ↗	(On Diff #282443)	'store' -> 'stores' pls add a short table of the expected buffer layout
15 ↗	(On Diff #282443)	This would be more readable with a temp var `buffer_offset_ptr` since you're using it below as well.
28 ↗	(On Diff #282443)	return type is a pointer, `NULL` would be more appropriate

In D84392#2205424, @jvesely wrote:

Have you checked that returning NULL is handled properly?
Should the buffer wrap around instead?

When NULL is returned, no data is appended to the buffer.
The buffer is treated as full.
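
The behaviour described above can be modelled on the host. The following sketch mirrors the control flow of `__printf_alloc` from this diff, with C11 atomics standing in for the OpenCL `atomic_add` and `atomic_sub` calls. The names `struct printf_buffer` and `model_printf_alloc`, and the 56-byte payload area, are assumptions made for illustration; this is not the libclc implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Host model of the buffer: word 0 = current offset, word 1 = total size. */
struct printf_buffer {
    _Atomic uint32_t offset;  /* the runtime is expected to initialise this to 8 */
    uint32_t size;            /* total size in bytes, header included */
    unsigned char data[56];   /* payload area (size chosen arbitrarily here) */
};

/* The model relies on the payload starting right after the 8-byte header. */
_Static_assert(offsetof(struct printf_buffer, data) == 8,
               "header must be exactly two 32-bit words");

/* Reserve `bytes` bytes; returns NULL when the request does not fit. */
static unsigned char *model_printf_alloc(struct printf_buffer *buf, uint32_t bytes)
{
    assert(buf != NULL);

    uint32_t off = atomic_load(&buf->offset);
    for (;;) {
        if (off + bytes > buf->size)
            break;                               /* buffer is seen as full */

        off = atomic_fetch_add(&buf->offset, bytes);
        if (off + bytes <= buf->size)
            return (unsigned char *)buf + off;   /* reservation succeeded */

        /* Another caller got there first: undo our reservation. */
        atomic_fetch_sub(&buf->offset, bytes);
    }
    return NULL;
}
```

In this model a failed request leaves the stored offset unchanged, so a later, smaller request can still succeed; nothing wraps around.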

EdB updated this revision to Diff 285884. Aug 16 2020, 1:24 AM

EdB marked 2 inline comments as done.

EdB marked 5 inline comments as done. Aug 16 2020, 1:26 AM

Thanks. This LGTM, but you might want to get @tstellar's opinion as well.

This revision is now accepted and ready to land. Aug 16 2020, 9:35 AM

This is fine with me.

Please push the patch for me since I have no commit rights.

ping :)

Uh... It seems this is needed to add printf support to the AMD OpenCL drivers in Gallium Clover. Which would enable that architecture to have OpenCL 1.2 driver support (finally).
Why wasn't this committed to LLVM yet? I am saying this as an outside observer who was passing by looking at https://mesamatrix.net/ wondering why the AMD cards did not support printf for OpenCL 1.2. Then went down a rabbit hole and got here.
It seems this is a major stopper for that and yet this patch, which was approved by the reviewers some 6 months or more ago, has had no one get around to commit it yet.
Anyone with commit rights please help.
Unless I am missing something here.

Revision Contents

Path

Size

libclc/

amdgpu/

lib/

SOURCES

1 line

misc/

printf.cl

29 lines

Diff 280052

libclc/amdgpu/lib/SOURCES

Context not available.
  math/half_sqrt.cl
  math/nextafter.cl
  math/sqrt.cl
+ misc/printf.cl
Context not available.

libclc/amdgpu/lib/misc/printf.cl

This file was added.


// AMD target compiler replace printf call
int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));

//reserves space to the printf data buffer and returns a pointer to it
__global char *
__printf_alloc(uint bytes) {
  __global char *ptr = (__global char *)(((__constant size_t *)__builtin_amdgcn_implicitarg_ptr())[2]);
  if (!ptr)
    return NULL;

  // A printf buffer store the current offset and its size using 2 uint
  // It's expected that the runtime store a initial offset of 8 (sizeof(cl_uint) * 2)
  uint buffer_offset = ((__global uint*)ptr)[0];
  uint buffer_size = ((__global uint*)ptr)[1];

  for (;;) {
    if (buffer_offset + bytes > buffer_size)
      break;

    buffer_offset = atomic_add((__global uint*)ptr, bytes);
    if (buffer_offset + bytes <= buffer_size)
      return ptr + buffer_offset;
    else
      atomic_sub((__global uint*)ptr, bytes);
  }

  return NULL;
}