This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Non hostcall printf support for HIP
ClosedPublic

Authored by vikramRH on May 11 2023, 11:32 PM.

Details

Summary

This is an alternative to currently existing hostcall implementation and uses printf buffer similar to OpenCL,
The data stored in the buffer (i.e the data frame) for each printf call are as follows,

  1. Control DWord - contains info regarding stream, format string constness and size of data frame
  2. Hash of the format string (if constant) else the format string itself
  3. Printf arguments (each aligned to 8 byte boundary)

The format string Hash is generated using LLVM's MD5 Message-Digest Algorithm implementation and only low 64 bits are used.
The implementation still uses amdhsa metadata and hash is stored as part of format string itself to ensure
minimal changes in runtime.

Diff Detail

Event Timeline

vikramRH created this revision.May 11 2023, 11:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 11 2023, 11:32 PM
vikramRH requested review of this revision.May 11 2023, 11:32 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 11 2023, 11:32 PM
yaxunl added inline comments.May 12 2023, 9:28 AM
clang/include/clang/Basic/LangOptions.def
274

This should be a target option like https://reviews.llvm.org/D91546 instead of a language option since it is target specific.

clang/lib/Driver/ToolChains/Clang.cpp
4663–4664

You can check whether target triple is amdgpu instead.

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
416

Please add some comments about the transformations done for buffered and non-buffered cases. Better provide a simple example.

yaxunl added inline comments.May 12 2023, 9:34 AM
clang/include/clang/Driver/Options.td
1036

as a target option, usually it starts with -m and does not contain target name, e.g. -mprintf=.

Better describe allowed values and their effect from users point of view, e.g. hostcall outputs immediately during kernel execution but requires some hardware feature.

Where does the runtime implementation of this live? I'm not very familiar with the HIP / hostcall ecosystem.

arsenm added inline comments.May 12 2023, 10:42 AM
clang/include/clang/Basic/LangOptions.def
274

Should be -m option

clang/lib/Driver/ToolChains/Clang.cpp
4673

Capitalize, typo 'pritnf'

4674

Typo 'will should'

Don't really understand the TODO, this should trigger for OpenCL as it is

clang/test/CodeGenHIP/printf_nonhostcall.cpp
5

Do we need a test with -fno-builtin?

69

Doesn't cover the full range of printable types. need some other non-string pointers and different address spaces, some FP promotions, 16 and 64 bit integers

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
259–261

Don't understand the point of the comment, I would assume anything involving analysis of a constant string has ignorable compile time

371–384

Builder.CreateMemCpy

402

Just use the type store size

403

You don't need a SmallVector to push back a single entry

428

Don't need llvm::

430

No auto

458

Do we really need another ockl control variable for this? Why isn't it a parameter? printf=stdout always

473

Don't need llvm::

498

Backwards conditional

vikramRH updated this revision to Diff 522954.EditedMay 17 2023, 1:54 AM

Handled review comments

vikramRH marked 10 inline comments as done.May 17 2023, 1:57 AM

Where does the runtime implementation of this live? I'm not very familiar with the HIP / hostcall ecosystem.

Currently there is a rocclr review up internally for the runtime side support

vikramRH added inline comments.May 17 2023, 2:05 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
458

There are certain HIP API's such as "hip_assert" that output to stderr. currently such API's are supported via hostcalls. Although this implementation does not currently support the API's ,its kept as an option.

sameerds added inline comments.May 17 2023, 3:11 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
212

you can just say "struct StringData"

218–219

Every argument has a default value. Why not simply assign them as member initializers?

238

Should this say 4?

256

move this closer to first use

313

function argument should use SmallVectorImpl, and not SmallVector

321

This "if (...) { ... }" has a really long body. Can it be moved into a separate function for readability?

324

Try to move declarations to the smallest scope that they are used in. For ReadSize, this seems to be the while loop later.

352

typo: bother

438–439

So we cannot use a buffered printf in HIP?

463

FmtStr.empty() gets checked in multiple places. Can this be captured as one or more boolean variables with meaningful names?

499

Probably not a good choice if it shows up where the user can see it. It's also not very informative. A more descriptive string will be better.

arsenm added inline comments.May 17 2023, 3:42 AM
clang/include/clang/Basic/TargetOptions.h
99 ↗(On Diff #522954)

Typo pritnf

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
458

Right but the way to handle that would be a parameter for where to output, not an externally set global

sameerds added inline comments.May 17 2023, 5:25 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
184–186

Seems like the only purpose of this change is to make the new StringRef available outside. It will be less confusing to just move the call to getConstantStringInfo() out of this function, and pass the StringRef as an input parameter instead of output. Also, try not to change existing variable names unless it is really relevant. That reduces the number of lines affected by mostly non-functional changes.

223

Needs consistent capitalization in the argument names. Since the rest of the code seems to prefer TitleCase, maybe alignment should start with "A"?

369

Avoid first person comments.

vikramRH updated this revision to Diff 523344.EditedMay 18 2023, 5:03 AM

Handled Review comments from @sameerds

vikramRH marked 13 inline comments as done.May 18 2023, 5:07 AM
vikramRH added inline comments.May 18 2023, 5:13 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
438–439

I guess hostcall version returns number of bytes printed as the printf return value, This is not done in Buffered case since we do not print stuff in "real time", instead I have decided to stick with OpenCL standard (return 0 on success and -1 on failure)

vikramRH added inline comments.May 18 2023, 5:16 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
458

I am not clear here, you expect additional inputs to device lib function ?

vikramRH marked 5 inline comments as done.May 22 2023, 10:04 PM
sameerds added inline comments.May 22 2023, 11:01 PM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
186

Where is Fmt used in this function? Also, StringRef is passed by value. It's already a lightweight reference, as demonstrated by its name.

212

typedef is not required for this in C++

458

@arsenm, this "control word" written into the buffer. In that sense, it is indeed a parameter passed from device to host as part of the printf packet. It is not yet another global variable.

495–496

else should be on the same line as the closing brace. Please do run clang-format on the entire change.

vikramRH updated this revision to Diff 524590.May 23 2023, 12:48 AM

Handled review comments and rebase

vikramRH marked 3 inline comments as done.May 23 2023, 12:58 AM
arsenm added inline comments.May 23 2023, 1:21 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
428

MathExtras already has alignTo

467

There's a getFalse

492

CreateConstGEP

599

Remove this newline

601

You can use one of the CreateConstGEPs. I also suspect this can be inbounds

607

Remove this newline

arsenm added inline comments.May 23 2023, 3:22 AM
clang/test/CodeGenHIP/printf_nonhostcall.cpp
171–172

Just store the double, don't need the bitcast?

vikramRH updated this revision to Diff 525082.May 24 2023, 2:51 AM

Handled review comments and rebase

vikramRH marked 7 inline comments as done.May 24 2023, 2:54 AM
vikramRH updated this revision to Diff 525946.May 25 2023, 10:51 PM

rebase and ping, apologies for the short frequency.

arsenm added inline comments.May 30 2023, 3:19 AM
clang/include/clang/Driver/Options.td
1030

I'm a bit worried this is introducing a side-ABI option not captured in the triple or at least module flags

clang/lib/Driver/ToolChains/Clang.cpp
4666

Braces

clang/test/CodeGenHIP/printf_nonhostcall.cpp
65

Can directly store the pointer, don't ptrtoint? Should have some other pointer address spaces tested, the spec is quiet on how %p is supposed to work with different sized pointers

229

Need some vector tests, especially 3 x vectors

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
369

You have better alignment info than Align(1)

419

Don't need = ""

420

Move last

432

ArrayRef should be passed by value

443

No auto

443–446

No auto

458

I'm not following why this part requires introducing another call into device libs. It's expanding this not-really defined API. I'd expect this to be a trivial function we can expand inline here and write directly into the parameter buffer?

502

I suppose you can't access the AMDGPU enum from Utils. You could use DL.getDefaultGlobalsAddrSpace

589

Don't see why isDoubleTy would be special cased here and not part of fitArgInto64Bits

603

This should probably be getTypeAllocSize

arsenm added inline comments.May 30 2023, 3:20 AM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
419

Can't you just use the raw StringRef out of getConstantStringInfo

vikramRH updated this revision to Diff 526963.May 31 2023, 2:16 AM

Handled most review comments

vikramRH marked 11 inline comments as done.May 31 2023, 2:35 AM
vikramRH added inline comments.
clang/test/CodeGenHIP/printf_nonhostcall.cpp
65

Added few more address space cases in foo2(), do you think we need more ?

229

vectors are not yet supported as in the hostcall case since the code currently runs for HIP only. I plan to port the OpenCL lowering also here along with hostcall support . we would need to extend hostcall and this implementation to support vectors as part of the porting activity.

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
419

The structure just caches the string calculated from getConstantStringInfo

458

I started the implementation keeping in mind the device side asserts currenlty supported via hostcall. They use explicit calls to such device lib functions in their definitions. I could not support them now but these would be required if I ever decide to. also I never ended up inlining this call. do you think its really necessary ?

589

The redundant cast instructions and this case were due to function "fitArgInto64Bits", which I reused from hostcall Implementation and did not want to modify. This dependency has been removed now and I have inlined the code

vikramRH marked 2 inline comments as done.May 31 2023, 2:43 AM
vikramRH added inline comments.
clang/include/clang/Driver/Options.td
1030

I did not understand the concern here. I agree The usage of option itself is not the most robust solution and it would have been better if we could do this without user intervention. but I do not see ways to do that now (atleast not immediately).

vikramRH updated this revision to Diff 527462.Jun 1 2023, 9:39 AM

Few additional changes,

  1. reflect printf-kind in module flags metadata
  2. Test cases for the change
vikramRH updated this revision to Diff 527472.Jun 1 2023, 9:50 AM

new line at the end of test file

arsenm added inline comments.Jun 1 2023, 10:08 AM
clang/include/clang/Driver/Options.td
1030

This introduces a way to produce binaries that will silently not work correctly depending on the host system. Without somehow tracking this property in the binary, there's no way to diagnose trying to load a binary on an incompatible system.

clang/test/CodeGenHIP/printf_nonhostcall.cpp
229

Something still needs to happen for vectors, it can't just crash. The different types of vectors may appear regardless of language

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
419

But you can just use returned StringRef for that. The actual constant contents persists in the LLVMContext, you don't need to copy it

458

But that implementation is in the library itself? What does the function actually do?

I think we're better off the more defined/documented ABI we have in terms of size-and-offset from base pointer than call into unstable library calls.

Can we get some documentation on the ABI in AMDGPUUsage?

490

Builder.getInt64(Hash.low())

Should also be mentioned in the release notes

vikramRH updated this revision to Diff 528407.Jun 5 2023, 6:37 AM

Handled few more review comments

vikramRH marked 4 inline comments as done.Jun 5 2023, 6:51 AM
vikramRH added inline comments.
clang/test/CodeGenHIP/printf_nonhostcall.cpp
229

I have forced "-Werror=format-invalid-specifier" with printf option which should be a sufficient condition for vectors I guess ? (this causes clang to break out earlier with an error)

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
458

The library function itself just handles creation of controlDWord via minor bit manipulations. nevertheless Inlining is pretty simple and im fine either way. Call has been expanded.

arsenm added inline comments.Jun 5 2023, 4:17 PM
clang/test/CodeGenHIP/printf_nonhostcall.cpp
138

Also half

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
579

I don't follow this alignment logic, but I guess the optimizer will figure it out later anyway so I suppose the Align(1) was fine.

Also should use Align.

581

The source align is known

594–595

isIntegerTy(32).

I also do not understand why only 32-bit integers would be promoted to 64-bit (or why this would be zext). This doesn't match varargs ABI handling, where everything smaller would be promoted to i32.

arsenm added a comment.Jun 5 2023, 4:18 PM

Also should be noted in the release notes

vikramRH updated this revision to Diff 529207.Jun 7 2023, 1:06 AM

Further review comments

vikramRH marked 4 inline comments as done.Jun 7 2023, 1:12 AM
vikramRH added inline comments.
clang/test/CodeGenHIP/printf_nonhostcall.cpp
138

C++ default arg promotions does not seem to list float16 case and consequently clang does not handle it. I have handled this case by promoting to double currenlty. Clang however still emits a warning when used with a %f conv specifier

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
594–595

Right, But default promotions would have happened already, this additional promotion is due to runtime processing requirement

vikramRH updated this revision to Diff 529219.Jun 7 2023, 2:09 AM
vikramRH marked an inline comment as done.

updated formating

arsenm added inline comments.Jun 7 2023, 4:57 PM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
594–595

Is this documented somewhere? If it's promote everything to i64, I'd prefer to handle all types rather than special casing 32

vikramRH added inline comments.Jun 7 2023, 9:43 PM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
594–595

That's precisely what the function "fitArgInto64Bits()" did. We eliminated it due to some redundant instructions generated. int32 was the only case (apart from float16 now) that required this special casting.

sameerds added inline comments.Jun 7 2023, 10:41 PM
llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
594–595

I think the problem is that the check for "isIntergerTy(32)" makes it look like it is some sort of special case. What we really mean is that if the type is an integer with width less than 64 bits, then zext it with padding to fit the 64-bit store into the buffer. We can do that with just "isIntegerTy()".

Separately, we know from auto promotion of varargs that the integer can only be of size 32 or 64. We can assert that instead of assuming it in the code.

vikramRH updated this revision to Diff 529551.Jun 8 2023, 3:41 AM

Reafactored non string arg handling into a seperate function with additional asserts

arsenm added inline comments.Jun 8 2023, 5:32 AM
clang/docs/ReleaseNotes.rst
589 ↗(On Diff #529551)

I thought the default was buffered, such that it would always work. Defaults that always work are better

Also, amdgpu-arch should probably learn to report if pcie atomics work to improve the autodetect

clang/lib/Driver/ToolChains/Clang.cpp
4670–4672

I don't see why we would special case this

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
569

It costs nothing to handle all types < 64. For larger just return the original value

576

Should avoid special casing this too, there's also bfloat to consider at a minimum

579

Safest to just return the original value

621

Range loop, index is unused

arsenm added inline comments.Jun 8 2023, 7:15 AM
clang/test/CodeGenHIP/printf_nonhostcall.cpp
241

Test _BitInt for small and odd types, plus i128

vikramRH updated this revision to Diff 529865.Jun 9 2023, 2:26 AM

Handled last set of review comments from @arsenm, would be willing to handle any new findings/concerns via additional patches

vikramRH marked 5 inline comments as done.Jun 9 2023, 2:27 AM
arsenm accepted this revision.Jun 9 2023, 9:13 AM

lgtm with nit

llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
488

BufSize += std::max(getTypeAllocSize(), 8)

577

maybe add a constrained fp run line to a test

This revision is now accepted and ready to land.Jun 9 2023, 9:13 AM
This revision was landed with ongoing or failed builds.Jun 10 2023, 7:30 AM
This revision was automatically updated to reflect the committed changes.