This is an alternative to the existing hostcall implementation. It uses a printf buffer similar to OpenCL's.
The data stored in the buffer (i.e. the data frame) for each printf call is as follows:
- Control DWord - contains info regarding stream, format string constness and size of data frame
- Hash of the format string if it is constant, otherwise the format string itself
- Printf arguments (each aligned to an 8-byte boundary)
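
The frame layout above can be illustrated with a small host-side sketch. Several details are assumptions made only for illustration, since the description does not spell them out: the control DWord is taken to use bit 0 for the stream (1 for stderr), bit 1 for format string constness, and the remaining bits for the frame size in bytes (including the control DWord itself). Each argument is padded to a multiple of 8 bytes, and the format string is assumed to be constant so that only its 64-bit hash is stored. `Arg`, `packFrame` and the `k...` constants are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed control DWord layout (illustrative, not taken from the patch).
constexpr uint32_t kStreamStderrBit = 1u << 0; // 0 = stdout, 1 = stderr
constexpr uint32_t kConstFormatBit = 1u << 1;  // format string is constant
constexpr unsigned kSizeShift = 2;             // frame size lives in bits 2..31

// One printf argument as raw bytes (hypothetical helper type).
struct Arg {
  const void *data;
  std::size_t size;
};

// Append n bytes, then zero-pad so the appended region is a multiple of 8 bytes.
inline void appendPadded(std::vector<uint8_t> &buf, const void *p, std::size_t n) {
  const uint8_t *b = static_cast<const uint8_t *>(p);
  buf.insert(buf.end(), b, b + n);
  std::size_t padded = (n + 7) & ~static_cast<std::size_t>(7);
  buf.insert(buf.end(), padded - n, 0);
}

// Build one data frame: control DWord, format string hash, padded arguments.
std::vector<uint8_t> packFrame(bool toStderr, uint64_t formatHash,
                               const std::vector<Arg> &args) {
  std::vector<uint8_t> frame(sizeof(uint32_t), 0); // control DWord placeholder
  appendPadded(frame, &formatHash, sizeof formatHash);
  for (const Arg &a : args)
    appendPadded(frame, a.data, a.size);

  // The size must fit in the 30 bits left after the two flag bits.
  assert(frame.size() < (1u << 30));
  uint32_t control = (static_cast<uint32_t>(frame.size()) << kSizeShift) |
                     kConstFormatBit | (toStderr ? kStreamStderrBit : 0u);
  std::memcpy(frame.data(), &control, sizeof control);
  return frame;
}
```

Under these assumptions, a reader of the buffer can take the frame size from the control DWord to step to the next frame without parsing the format string.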
The format string hash is generated using LLVM's MD5 implementation, and only the low 64 bits are used.
The implementation still uses amdhsa metadata, and the hash is stored as part of the format string itself to keep
runtime changes minimal.
I thought the default was buffered, such that it would always work. Defaults that always work are better.
Also, amdgpu-arch should probably learn to report whether PCIe atomics work, to improve the autodetection.