This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add a loader utility for AMDHSA architectures for testing
Closed, Public

Authored by jhuber6 on Dec 12 2022, 6:12 AM.

Details

Summary

This is the first attempt to get some testing support for GPUs in LLVM's
libc. We want to be able to compile for and call generic code while on
the device. This is difficult as most GPU applications also require the
support of large runtimes that may contain their own bugs (e.g. CUDA /
HIP / OpenMP / OpenCL / SYCL). The proposed solution is to provide a
"loader" utility that allows us to execute a "main" function on the GPU.

This patch implements a simple loader utility targeting the AMDHSA
runtime called amdhsa_loader that takes a GPU program as its first
argument. It will then attempt to load a predetermined _start kernel
inside that image and launch execution. The _start symbol is provided
by a start utility function that will be linked alongside the
application. Thus, this should allow us to run arbitrary code on the
user's GPU with the following steps for testing.

clang++ Start.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -ffreestanding -nogpulib -nostdinc -nostdlib -c
clang++ Main.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -nogpulib -nostdinc -nostdlib -c
clang++ Start.o Main.o --target=amdgcn-amd-amdhsa -o image
amdhsa_loader image <args, ...>
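
For illustration, a minimal sketch of what that start glue could look like (this is not the patch's actual Start.cpp; the clang amdgpu_kernel attribute and the atomic combine of the return value are assumptions drawn from the discussion below):

extern "C" int main(int argc, char **argv);

// Kernel the loader launches by name; it forwards argc/argv to the user's
// main and publishes the result into device memory the host reads back.
extern "C" [[clang::amdgpu_kernel]] void _start(int argc, char **argv, int *ret) {
  // Combine results atomically so multiple launched threads can contribute.
  __atomic_fetch_or(ret, main(argc, argv), __ATOMIC_RELAXED);
}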

We determine the -mcpu value using the amdgpu-arch utility provided
either by clang or ROCm. If amdgpu-arch isn't found or returns an
error, we shouldn't run the tests, as the machine does not have a valid
HSA-compatible GPU. Alternatively, we could make this utility in-source
to avoid the external dependency.

This patch provides a single test for this utility that simply checks
whether we can compile an application containing a simple main
function and execute it.

The proposed solution in the future is to create an alternate
implementation of the LibcTest.cpp source that can be compiled and
launched using this utility. This approach should allow us to use the
same test sources as the other applications.

This is primarily a prototype; suggestions for how to better integrate
this with the existing libc infrastructure would be greatly appreciated.
The loader code should also be cleaned up somewhat, and an implementation
for NVPTX will need to be written as well.

Diff Detail

Event Timeline

jhuber6 created this revision. Dec 12 2022, 6:12 AM
Herald added projects: Restricted Project, Restricted Project. Dec 12 2022, 6:12 AM
jhuber6 requested review of this revision. Dec 12 2022, 6:12 AM

The file called start is traditionally called crt, for C runtime. Making it to main across linked ELF files probably works, but it's a bit sketchy; I'd be inclined to compile it to IR instead.

The HSA boilerplate looks ok to me. A lot is common to the openmp plugin; if there's somewhere we can put a HSA.hpp that is reachable from libc and openmp, I think that would be a win. It doesn't need associated object code, just the template wrappers.
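
As a rough illustration of the kind of wrapper such a header could hold (hypothetical code, assuming the standard hsa_iterate_agents C signature; the helper name is made up):

#include <hsa/hsa.h>
#include <type_traits>

// Wraps HSA's C callback-plus-void* pattern so callers can pass any callable,
// e.g. iterate_agents([&](hsa_agent_t a) { /*...*/ return HSA_STATUS_SUCCESS; });
template <typename Callable>
hsa_status_t iterate_agents(Callable &&cb) {
  auto trampoline = [](hsa_agent_t agent, void *data) -> hsa_status_t {
    return (*static_cast<std::remove_reference_t<Callable> *>(data))(agent);
  };
  return hsa_iterate_agents(trampoline, &cb);
}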

libc/utils/DeviceLoader/Main.cpp
26 ↗(On Diff #482094)

If mmap is available it's probably clearer than fopen here, but using only C is also appealing.
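
For reference, a hedged sketch of the mmap route (plain POSIX calls, error handling kept minimal; not code from the patch):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps the whole image file read-only; returns nullptr on any failure.
void *map_image(const char *path, std::size_t &size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) {
    close(fd);
    return nullptr;
  }
  size = st.st_size;
  void *image = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd); // The mapping remains valid after closing the descriptor.
  return image == MAP_FAILED ? nullptr : image;
}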

libc/utils/DeviceLoader/amdgpu/CMakeLists.txt
37 ↗(On Diff #482094)

I think this needs to be emit-llvm for amdgpu

libc/utils/DeviceLoader/amdgpu/Loader.cpp
58 ↗(On Diff #482094)

This is all pretty familiar. Do we have anywhere reasonable to drop a HSA.hpp header that puts these interfaces over the C?

239 ↗(On Diff #482094)

This doesn't work for some of the HSA calls. Group segment size or kernarg segment size, maybe? One of them unconditionally returns zero (the openmp plugin reads it via msgpack because of this). Or at least it used to return zero; I haven't checked recently.

291 ↗(On Diff #482094)

Could probably leave argv in fine-grained memory instead of setting it up on the GPU; that would make this somewhat simpler.

libc/utils/DeviceLoader/amdgpu/Start.cpp
13 ↗(On Diff #482094)

Oh, ret has to be on the GPU if you want to fetch_or into it

The file called start is traditionally called crt, for C runtime. Making it to main across linked ELF files probably works, but it's a bit sketchy; I'd be inclined to compile it to IR instead.

I wasn't sure if I should copy the crt nomenclature exactly, but it might warrant it since it's the same concept. And isn't this linking the same as the handling for the C runtime? Only the "start" code is compiled freestanding, so we should be treating the rest of the compilation like a standard build. Also, for AMD it already is IR.

The HSA boilerplate looks ok to me. A lot is common to the openmp plugin; if there's somewhere we can put a HSA.hpp that is reachable from libc and openmp, I think that would be a win. It doesn't need associated object code, just the template wrappers.

It's possible we could create some "GPUUtils" or something in LLVM, but otherwise I'm not sure if there would be a good place for this to live.

libc/utils/DeviceLoader/Main.cpp
26 ↗(On Diff #482094)

Alternatively, we could link against the LLVM libraries and use their methods, which reduce to mmap; I could probably make the error handling cleaner with access to LLVM as well.

libc/utils/DeviceLoader/amdgpu/CMakeLists.txt
37 ↗(On Diff #482094)

-c emits bitcode for amdgcn-amd-amdhsa since we consider that the relocatable object type.

libc/utils/DeviceLoader/amdgpu/Loader.cpp
58 ↗(On Diff #482094)

Making a common header between projects is a little tough, since it would need to live in LLVM and this probably doesn't qualify for mainline LLVM. Maybe we could encourage the HSA people to provide some C++ bindings.

239 ↗(On Diff #482094)

I remember there was some weird behaviour with KERNARG_SEGMENT_SIZE not including the implicit arguments. But when I checked the ELF notes on all the binaries created, the size was correct, so I'm assuming it will be fine since we're always compiling with upstream clang here. I didn't see any other problems with the group size, but I could double-check.

291 ↗(On Diff #482094)

Can the GPU access fine-grained memory directly? I thought this always required some DMA transfer from the pinned buffer. But setting up the argv data is definitely the most annoying part.

libc/utils/DeviceLoader/amdgpu/Start.cpp
13 ↗(On Diff #482094)

It should be; in Loader.cpp I allocate some coarse-grained memory for it, set it to zero, and then copy it back.

Also, I don't bother freeing any of the memory. I'm not sure how good a practice that is, but I'm mostly just testing stuff right now.

tianshilei1992 added inline comments. Dec 12 2022, 6:55 AM
libc/utils/DeviceLoader/Main.cpp
30 ↗(On Diff #482094)
jhuber6 updated this revision to Diff 482130. Dec 12 2022, 7:56 AM

Changing to amdhsa_crt and addressing other comments.

One problem with the existing test is that it doesn't provide good terminal output on success or failure; is there a good way to have CMake print out a pass/fail for these?

I don't think this is what we should do right now. We should have reused the existing offload runtime with the two simple overlay headers, effectively what we used here: https://tianshilei.me/wp-content/uploads/2022/09/llvm-hpc-22.pdf
We would get support for all backends without having the 3rd! HSA driver in tree. Someone will start to improve the driver, we add a 3rd CUDA one, we start to build APIs, we end up with the plugin interface all over again.
The right choice is to use it and to decouple it rather than to duplicate it.

On the "loader" question: Why do we need a loader executable in the first place? Why not patch the original up so we can just execute that one? Lots of benefits (self contained, can have fallback path and multiple archs, ...)

I don't think this is what we should do right now. We should have reused the existing offload runtime with the two simple overlay headers, effectively what we used here: https://tianshilei.me/wp-content/uploads/2022/09/llvm-hpc-22.pdf
We would get support for all backends without having the 3rd! HSA driver in tree. Someone will start to improve the driver, we add a 3rd CUDA one, we start to build APIs, we end up with the plugin interface all over again.
The right choice is to use it and to decouple it rather than to duplicate it.

I don't think it's a good idea to have an external dependency on the openmp runtime for the purpose of testing this project. The work here is the same concept as the direct GPU compilation work, just with the working parts separated out into a standalone application with minimal dependencies. This does duplicate HSA code, but I think the existence of these utilities is useful in its own right. If we want to write a libc on the GPU it's very useful to have a utility in-source that lets us run a generic source file on the GPU.

On the "loader" question: Why do we need a loader executable in the first place? Why not patch the original up so we can just execute that one? Lots of benefits (self contained, can have fallback path and multiple archs, ...)

I don't think supporting multiple architectures or having a fallback is useful in this scenario; we just want to see if the source runs on the user's chosen GPU. If not, then that's a failure. As for it being standalone, I think it just makes it easier to use as a utility. We could change it to load from some internal pointer and recompile it for each test if that is worthwhile.

jhuber6 updated this revision to Diff 482316. Dec 12 2022, 5:07 PM

Simplify code, as fine-grained memory can be read from the device without needing to be copied explicitly.
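
For illustration, a hypothetical sketch of the simplified argv setup (it assumes a fine-grained hsa_amd_memory_pool_t has already been selected, and the device agent may additionally need hsa_amd_agents_allow_access on these allocations; none of this is verbatim from the diff):

#include <cstring>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Copies argv into host-coherent (fine-grained) memory that the GPU can read
// directly, so no explicit device-side copy of the strings is required.
void *copy_argv_fine_grained(hsa_amd_memory_pool_t pool, int argc, char **argv) {
  void *dev_argv = nullptr;
  if (hsa_amd_memory_pool_allocate(pool, argc * sizeof(char *), /*flags=*/0,
                                   &dev_argv) != HSA_STATUS_SUCCESS)
    return nullptr;
  for (int i = 0; i < argc; ++i) {
    std::size_t size = std::strlen(argv[i]) + 1;
    void *dev_str = nullptr;
    if (hsa_amd_memory_pool_allocate(pool, size, /*flags=*/0, &dev_str) !=
        HSA_STATUS_SUCCESS)
      return nullptr;
    std::memcpy(dev_str, argv[i], size);
    static_cast<void **>(dev_argv)[i] = dev_str;
  }
  return dev_argv;
}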

I have not gone through every line of code and I am likely not the right person to review the GPU side of things. A few high-level comments about organization:

  1. I think what you are proposing here is the ability to add what we call "integration tests" in the libc today. See https://github.com/llvm/llvm-project/tree/main/libc/test/integration. The idea behind an integration test is that the test executable contains only pieces from the libc. So, even startup objects like crt1.o come from the libc. Items like the thread libraries, whose implementation is tightly coupled to the implementation of the startup system, are tested using integration tests. Things which can be tested without pulling in the startup code are still tested using unit tests. If I understand your patch, you are proposing to use special startup code from the libc to run the unit tests as well, which probably implies that you want to run the unit tests as integration tests on the GPU.
  2. The Linux startup code currently is in the directory named loader: https://github.com/llvm/llvm-project/tree/main/libc/loader. I agree that the name "loader" is not appropriate. We have plans to change it to "startup". What you are calling amdhsa_crt should live alongside the Linux startup code and follow the conventions followed by the Linux startup code.

As I see it, the way forward for this patch is this:

  1. I will rename the current loader directory to startup and you can move the GPU startup implementation there. As it stands in this patch, it is a trivial implementation so you can land it separately after I do the renaming.
  2. You can land the DeviceLoader implementation as a next step - GPU/AMD experts should review that. Keeping it under libc/utils sounds good, but maybe it should live in a nested directory, libc/utils/gpu.
  3. As far as running the unit tests on the device, I think what we really want is the ability to run the unit tests as integration tests. We might not be able to run all of the unit tests in that fashion - for example, the math unit tests use MPFR which I would think cannot be made to run on the GPU. So, we can do two different things here: i) Add some integration tests which can be run on the GPU as well as a Linux host. ii) Develop the ability to run any unit test as an integration test. We really want to be explicit in listing targets in the libc project so one approach could be to list integration test targets which point to the unit test sources. And, this can be done for a Linux host and the GPU.
JonChesterfield added a comment. Edited Dec 13 2022, 3:59 AM

Hi sivachandra. Integration tests might be the right terminology here. The idea is to take a self contained program containing int main(), compile it for the GPU, then run it using as little infra as possible to minimise the number of places to look for bugs when it goes wrong. The same test would compile and run unchanged on the host architecture, or through qemu or similar if an emulator is available.

That main() could contain a bunch of unit tests, given a framework that can run without dependencies that we don't have yet (so no ostream, at least until a libc++ on the GPU is a thing), but in the first instance it probably streams data to various libc functions directly and returns non-zero if something unexpected happens.

I'd be especially pleased to have this sort of run main() infra in tree as it would let us add other functional testing to the GPU later, e.g. checking that the backend computes the right values under diff optimisation levels, or driving individual pieces of higher level language runtimes. So in a perfect world the loader machinery might end up in some utility folder outside of libc, but this is a great place to stand it up.

Thanks for the helpful feedback Siva.

I have not gone through every line of code and I am likely not the right person to review the GPU side of things. A few high-level comments about organization:

  1. I think what you are proposing here is the ability to add what we call "integration tests" in the libc today. See https://github.com/llvm/llvm-project/tree/main/libc/test/integration. The idea behind an integration test is that the test executable contains only pieces from the libc. So, even startup objects like crt1.o come from the libc. Items like the thread libraries, whose implementation is tightly coupled to the implementation of the startup system, are tested using integration tests. Things which can be tested without pulling in the startup code are still tested using unit tests. If I understand your patch, you are proposing to use special startup code from the libc to run the unit tests as well, which probably implies that you want to run the unit tests as integration tests on the GPU.

Yes, executing things on the GPU requires some interaction by the host to set up the data and begin execution. The "loader" in this patch provides the minimum amount of infrastructure required to launch a GPU agent's execution, while the "startup" code provides the glue to call main. The current idea is to use these tools to make a separate version of the LibcTest.cpp source that is compatible with the GPU. This way we could compile the unit test source and test file after linking in the "startup" code and then launch it on the GPU using the loader to test.

Another option would be to use an existing offloading runtime like OpenMP to perform the loading, but I'm hesitant to bring a large established runtime into libc and prefer this method for the reasons @JonChesterfield mentioned above.

  2. The Linux startup code currently is in the directory named loader: https://github.com/llvm/llvm-project/tree/main/libc/loader. I agree that the name "loader" is not appropriate. We have plans to change it to "startup". What you are calling amdhsa_crt should live alongside the Linux startup code and follow the conventions followed by the Linux startup code.

Would that cause this to be exported? It's primarily being proposed for testing, but we could probably expose it like any other loader; it would make the GPU behave more like a standard libc target, which would be nice.

As I see it, the way forward for this patch is this:

  1. I will rename the current loader directory to startup and you can move the GPU startup implementation there. As it stands in this patch, it is a trivial implementation so you can land it separately after I do the renaming.

Sounds good.

  2. You can land the DeviceLoader implementation as a next step - GPU/AMD experts should review that. Keeping it under libc/utils sounds good, but maybe it should live in a nested directory, libc/utils/gpu.

@JonChesterfield is the resident HSA expert, we can review that separately but moving the directories sounds good.

  3. As far as running the unit tests on the device, I think what we really want is the ability to run the unit tests as integration tests. We might not be able to run all of the unit tests in that fashion - for example, the math unit tests use MPFR which I would think cannot be made to run on the GPU. So, we can do two different things here: i) Add some integration tests which can be run on the GPU as well as a Linux host. ii) Develop the ability to run any unit test as an integration test. We really want to be explicit in listing targets in the libc project so one approach could be to list integration test targets which point to the unit test sources. And, this can be done for a Linux host and the GPU.

Could you elaborate on the difference in the libc source? I think the latter is the option I was going for; for example, we would take a strcmp test, compile it directly for the GPU architecture along with a modified LibcTest.cpp source, and pass it to the loader to see if it fails. I'd assume this approach might also be useful on Linux to see if we can bootstrap calls to libc routines with the loader.

sivachandra added a comment. Edited Dec 13 2022, 2:09 PM

Hi sivachandra. Integration tests might be the right terminology here. The idea is to take a self contained program containing int main(), compile it for the GPU, then run it using as little infra as possible to minimise the number of places to look for bugs when it goes wrong. The same test would compile and run unchanged on the host architecture, or through qemu or similar if an emulator is available.

This is exactly what the libc integration tests do. They are built as the smallest self-contained exe containing pieces only from LLVM's libc and executed. For each integration test, we build a libc containing only the functions that are required for the test and link the test exe against that minimal libc. Effectively, the integration tests do not link against the entire libc.

That main() could contain a bunch of unit tests, given a framework that can run without dependencies that we don't have yet (so no ostream, at least until a libc++ on the GPU is a thing), but in the first instance it probably streams data to various libc functions directly and returns non-zero if something unexpected happens.

Again, this is exactly what we do in the libc integration tests. Take a look at an example: https://github.com/llvm/llvm-project/blob/main/libc/test/integration/src/stdlib/getenv_test.cpp#L30. We use a TEST_MAIN macro which is just shorthand for extern "C" int main: https://github.com/llvm/llvm-project/blob/main/libc/utils/IntegrationTest/test.h#L69
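
For a sense of the shape, a stripped-down paraphrase of such a test (not a verbatim copy of the linked getenv test; TEST_MAIN and ASSERT_TRUE are the spellings used by the header referenced above):

#include "utils/IntegrationTest/test.h"

TEST_MAIN(int argc, char **argv, char **envp) {
  // Simple truth-value assertions; a failure reports and exits non-zero.
  ASSERT_TRUE(argc >= 1);
  return 0;
}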

For the GPU, one will have to implement the equivalents of __llvm_libc::write_to_stderr and __llvm_libc::quick_exit that are used here: https://github.com/llvm/llvm-project/blob/main/libc/utils/IntegrationTest/test.h#L69. Likely, a gpu directory should be added here and those implementations should live in that directory: https://github.com/llvm/llvm-project/tree/main/libc/src/__support/OSUtil/linux

I'd be especially pleased to have this sort of run main() infra in tree as it would let us add other functional testing to the GPU later, e.g. checking that the backend computes the right values under diff optimisation levels, or driving individual pieces of higher level language runtimes. So in a perfect world the loader machinery might end up in some utility folder outside of libc, but this is a great place to stand it up.

I don't see any problems including the loader system in the libc part of the tree.

The current idea is to use these tools to make a separate version of the LibcTest.cpp source that is compatible for the GPU. This way we could compile the unit test source and test file after linking in the "startup" code and then launch it on the GPU using the loader to test.

I think a simple way to get this going is to make LibcTest.h (or https://github.com/llvm/llvm-project/blob/main/libc/utils/UnitTest/Test.h) short-circuit to libc/utils/IntegrationTest/test.h when building a unit test as an integration test. That way, we don't have to build all the unit test machinery for the GPU, at least to begin with. The integration test machinery is very simple, very much like what @JonChesterfield wanted: a collection of truth-value assertions with some syntactic sugar.
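
One way to picture that short-circuit (purely illustrative; the guard macro name here is invented):

#ifdef LIBC_UNITTEST_AS_INTEGRATION_TEST
// Route the unit test macros to the lightweight integration test assertions.
#include "utils/IntegrationTest/test.h"
#else
// ... the full unit test framework declarations as they exist today ...
#endif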

  2. The Linux startup code currently is in the directory named loader: https://github.com/llvm/llvm-project/tree/main/libc/loader. I agree that the name "loader" is not appropriate. We have plans to change it to "startup". What you are calling amdhsa_crt should live alongside the Linux startup code and follow the conventions followed by the Linux startup code.

Would that cause this to be exported? It's primarily being proposed for testing, but we could probably expose it like any other loader; it would make the GPU behave more like a standard libc target, which would be nice.

I do not know if it will just work out of the box for you. But if we can get the GPU environment to behave more like a normal host environment, it would be easier to maintain and reason about for developers not working on the GPU libc. So, I definitely vote for making the GPU procedures similar to the host procedures.

  3. As far as running the unit tests on the device, I think what we really want is the ability to run the unit tests as integration tests. We might not be able to run all of the unit tests in that fashion - for example, the math unit tests use MPFR which I would think cannot be made to run on the GPU. So, we can do two different things here: i) Add some integration tests which can be run on the GPU as well as a Linux host. ii) Develop the ability to run any unit test as an integration test. We really want to be explicit in listing targets in the libc project so one approach could be to list integration test targets which point to the unit test sources. And, this can be done for a Linux host and the GPU.

Could you elaborate on the difference in the libc source? I think the latter is the option I was going for; for example, we would take a strcmp test, compile it directly for the GPU architecture along with a modified LibcTest.cpp source, and pass it to the loader to see if it fails. I'd assume this approach might also be useful on Linux to see if we can bootstrap calls to libc routines with the loader.

I also vote for the latter. So, for strcmp_test, we should add an integration test in test/integration/src/string which picks up sources from its unit tests in test/src/string/ and runs them as an integration test.

On the general topic of adding dependencies on other projects like openmp, we should strive to keep the libc project as self-contained as possible.

I also vote for the latter. So, for strcmp_test, we should add an integration test in test/integration/src/string which picks up sources from its unit tests in test/src/string/ and runs them as an integration test.

I do not imply that this will just work today. I am happy to work with you to make this happen.

On the general topic of adding dependencies on other projects like openmp, we should strive to keep the libc project as self-contained as possible.

The problem is that now we duplicate (and in the future extend) the "gpu launch" logic for each target.
We are effectively creating yet another offloading runtime. We will invent new abstraction layers, new features will be added, etc.

This patch has 400 lines of HSA magic copied probably from OpenMP. We will need the same for CUDA, OneAPI, maybe OpenCL, ...
Then we need extra logic in all of them to support allocators (via pre-allocation), in all their shapes (bump, free-lists, ...).
Then we need to put RPC logic in them, again per target, and all of it is then not yet tested with OpenMP offload.

A loader using the existing offloading runtime:

  1. does not require clang changes
  2. requires 2 minimal files (shown below)
  3. will work on all supported GPU targets, the MPI target, virtual GPU target, ...
 // main.c
#include <string.h>
extern int __user_main(int, char *[]);

int main(int argc, char *argv[]) {
  #pragma omp target enter data map(to: argv[:argc])
  for (int I = 0; I < argc; ++I) {
    size_t Len = strlen(argv[I]);
    #pragma omp target enter data map(to: argv[I][:Len])
  }

  int Ret;
  #pragma omp target teams num_teams(1) thread_limit(1024) map(from: Ret)
  { Ret = __user_main(argc, argv); }
  return Ret;
}
 // header.h, could be replaced by a clang command line flag
#pragma omp begin declare target device_type(nohost)
int main(int, char *[]) asm("__user_main");
#pragma omp end declare target

On the general topic of adding dependencies on other projects like openmp, we should strive to keep the libc project as self-contained as possible.

The problem is that now we duplicate (and in the future extend) the "gpu launch" logic for each target.
We are effectively creating yet another offloading runtime. We will invent new abstraction layers, new features will be added, etc.

Let me clarify my side: when I say we should strive to keep the libc project self-contained, I mean that the libc business logic and test logic should be self-contained. If you can build a test binary using source code only from the libc directory, that is good enough. The tests and business logic can use compiler intrinsics. I will let the GPU experts here decide the how/where/what with the test binary and the loaders which load and run the test binary on the device.

On the general topic of adding dependencies on other projects like openmp, we should strive to keep the libc project as self-contained as possible.

This is my philosophy as well, I'd prefer to keep this libc implementation on the GPU separate from other GPU runtimes.

The problem is that now we duplicate (and in the future extend) the "gpu launch" logic for each target.
We are effectively creating yet another offloading runtime. We will invent new abstraction layers, new features will be added, etc.

I don't think this is a bad thing; the loader tool here has utility outside of this project if we want to drive the GPU directly without pulling in potential bugs and complexity from the existing (and quite large) offloading runtimes like CUDA and OpenMP. The design I outlined is more in line with a libc on the GPU, as it behaves like simple cross-compilation for testing purposes. My plan is that the additional features are part of a "host" portion of the GPU libc, which means the loader in its current state will be unchanged and simply link against the host runtime, as would CUDA or OpenMP or HIP if they link in the libc.

This patch has 400 lines of HSA magic copied probably from OpenMP. We will need the same for CUDA, OneAPI, maybe OpenCL, ...
Then we need extra logic in all of them to support allocators (via pre-allocation), in all their shapes (bump, free-lists, ...).
Then we need to put RPC logic in them, again per target, and all of it is then not yet tested with OpenMP offload.

A lot of it is boilerplate; we could clean it up with some shared headers, as @JonChesterfield brought up. I do agree there's a non-zero amount of effort required to write these loaders for a new architecture, but standing up a new target in the libc would require runtime support for the future RPC, allocators, etc. as well.

A loader using the existing offloading runtime:

  1. does not require clang changes
  2. requires 2 minimal files (shown below)
  3. will work on all supported GPU targets, the MPI target, virtual GPU target, ...

To be fair, the clang changes are simply to support what I think is a valuable compilation mode for targeting GPUs, one which allows us to treat it as cross-compiling rather than as a distinct offloading system. These changes are pretty minimal for NVPTX and weren't required for AMDGPU.

It's difficult to reach a consensus on these topics; I'm with @sivachandra and @JonChesterfield that a libc should be mostly standalone implementation-wise and then linked into different targets.

Trying to rephrase (and shorten) what @sivachandra said in the last comment (https://reviews.llvm.org/D139839#3993369):
"As long as libc does not depend on outside code, we are fine."

So the "self contained" arguments raised here are not all the same.
My proposed approach will make libc testing "self contained" via 20 lines of code (+CMake).
It will work for all targets (=GPUs and more).
The only downside is that you need to enable the OpenMP offloading runtime for your clang build (as you need to enable libc GPU support explicitly).
I listed sufficient downsides with this approach already but let's add two more:

  • The fact that we are about to modify the AMDGPU kernel issue code due to the "code object v5" switch should be another indication that this approach is not the right one.
  • How do you imagine testing these functions with multiple threads? Implement a mapping layer to translate "get_id", barrier, atomic, ... calls to intrinsics and such? I can already see yet another device runtime.

First of all, I agree with @jdoerfert that it's not a good practice to have a target-dependent "loader" (basically almost target-dependent duplicate code), given the fact that we (LLVM) already provide a target-independent layer, OpenMP. In addition, libc would not be the only project in LLVM depending on another project, but of course the libc community needs to approve it first. However, I can also understand the concern of bringing OpenMP in as a dependence for testing (or even feature development), especially since OpenMP itself is also a fast-evolving project. It's gonna be quite annoying for the libc community to see some false negative test results because of something off in OpenMP.

I'm thinking one potential solution is to enable libc tests in the OpenMP tests (Buildbot) as well. In this way, OpenMP can make sure that patches will at least not break libc, as long as they are not committed directly w/o code review (that's gonna be another story).

enable libc tests in the OpenMP tests (Buildbot)

No matter where we fall, this is a given. Also, beyond the "loader" tests, we want to create libomptarget/tests that include libc_gpu.a and run them via our driver.

we (LLVM) already provides a target independent layer, OpenMP

That is correct but probably misleading. The important part (and the long-term goal) is to have the former (a target-independent offload layer); the fact that OpenMP uses it (and that it lives there for now) is coincidental.

Trying to rephrase (and shorten) what @sivachandra said in the last comment (https://reviews.llvm.org/D139839#3993369):
"As long as libc does not depend on outside code, we are fine."

Yes, but something more: even the test binary should not depend on code outside of the libc directory. The loader, which takes this binary, loads it onto the device, and executes it, can be anywhere.

The problem is that now we duplicate (and in the future extend) the "gpu launch" logic for each target.
We are effectively creating yet another offloading runtime. We will invent new abstraction layers, new features will be added, etc.

There are two separate pieces to libc. One is the library - stuff you get from including headers: type definitions, strlen, printf, malloc, fopen, and so on.
The other is the C runtime. This is so tiny it tends to get forgotten about. For static executables it's walking ctor/dtor pointer arrays and a jump to main.

A GPU libc library has (relatively) wide applicability. There might be some uncertainty over what thread creation means in this context but libm and malloc are pretty clear.

The runtime part is a different game and worth considering separately from the above. An in-tree use case is testing - libc, compiler backends, parts of other language runtimes.

A different use case is compiling code directly to the GPU and executing it without the offloading language models. A persistent kernel that watches the network card using DMA doesn't need to talk to the host.

My professional interest is in having openmp work more efficiently and robustly, and driving more code through that framework will help with that. However my interest in libc is not commercially driven - I want persistent kernels running lua on the GPU because that sounds awesome, and I'd rather not have debugging openmp on the critical path to making things like that work.

This patch has 400 lines of HSA magic copied probably from OpenMP. We will need the same for CUDA, OneAPI, maybe OpenCL, ...
Then we need extra logic in all of them to support allocators (via pre-allocation), in all their shapes (bump, free-lists, ...).
Then we need to put RPC logic in them, again per target, and all of it is then not yet tested with OpenMP offload

The signal to noise ratio induced by the HSA interface might be the worst case. A bunch of this is wrapping the C API in C++ to make the callbacks easier to use, which really should be a hsa.hpp somewhere. Fundamentally though yes - there is plumbing needed behind the syscall part of a C runtime. To host on x64 you need some approximation to the linux kernel providing syscall, on top of which you get a libc loader to handle dlopen et al. To host on nvptx you need all that linux layer plus all of cuda plus some translation shim to put at least one kernel on the GPU. It's not as easy as one might like, but that's the abstraction we've been dealt.

I don't see why you're talking about scope creep though. The C runtime doesn't do very much. I can see it picking up global constructors / destructors when bringing up libc++ on top of it, and it needs 'syscall', but that's about it. If people want to use stuff beyond libc, well that's fine, they get to use non-libc libraries to do so.

RPC shouldn't care about the architecture. It doesn't need to anyway, if the API you use is syscall. Each side of the call sees some opaque target that it sends bytes to.
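
To make the "opaque bytes" point concrete, a hypothetical interface sketch (names invented, not an in-tree API):

#include <cstddef>

// Each end of the syscall/RPC link only needs a way to move raw bytes; the
// architecture on the other side never appears in the interface.
struct ByteChannel {
  virtual ~ByteChannel() = default;
  virtual void send(const void *data, std::size_t size) = 0;
  virtual void recv(void *data, std::size_t size) = 0;
};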

A loader using the existing offloading runtime:

  1. does not require clang changes
  2. requires 2 minimal files (shown below)
  3. will work on all supported GPU targets, the MPI target, virtual GPU target, ...

What clang changes?

We can build a loader out of any GPU language we see fit; clang isn't involved. All it has to do is launch a kernel with one thread and tie up one end of syscall. If the in-tree one is openmp, I'll just put a hsa one on my github so I've got a means of distinguishing bugs in openmp from bugs in libc. If other people want to walk through libomptarget while debugging libc tests, more power to them.

arsenm added a subscriber: arsenm. Dec 14 2022, 5:53 AM
arsenm added inline comments.
libc/utils/DeviceLoader/amdgpu/CMakeLists.txt
37 ↗(On Diff #482094)

-emit-llvm is probably more future proof

  1. I will rename the current loader directory to startup and you can move the GPU startup implementation there. As it stands in this patch, it is a trivial implementation so you can land it separately after I do the renaming.

This is now done.

ii) Develop the ability to run any unit test as an integration test. We really want to be explicit in listing targets in the libc project so one approach could be to list integration test targets which point to the unit test sources. And, this can be done for a Linux host and the GPU.

Once you land the GPU startup code, we can discuss this more.

  1. I will rename the current loader directory to startup and you can move the GPU startup implementation there. As it stands in this patch, it is a trivial implementation so you can land it separately after I do the renaming.

This is now done.

ii) Develop the ability to run any unit test as an integration test. We really want to be explicit in listing targets in the libc project so one approach could be to list integration test targets which point to the unit test sources. And, this can be done for a Linux host and the GPU.

Once you land the GPU startup code, we can discuss this more.

Sounds good, I'll make a separate patch for it.

I propose we rewrite the loader implementation to use OpenMP runtime calls to perform the loading, but keep the general scheme the same. I think this is a good compromise: we can treat the GPU as a standalone system but forego re-implementing kernel launch code that exists in-tree. If we want to avoid the dependency, we can go back and write a dedicated loader for that architecture.

jhuber6 updated this revision to Diff 495846. Feb 8 2023, 7:27 AM

Updating to just present the loader interface. I attempted to rewrite this with
the OpenMP interface, but found it more difficult than just using HSA directly.

I think this direction is good, as this loader will present a very
simplified interface for testing and using the future RPC API.

I've also found some prior work in this direction by Mentor Graphics that
presents a gcn-run executable, which is pretty much exactly what this patch
provides.

Right now my plans for this are as follows:

  1. Write a single integration test that uses this loader interface
  2. Write the NVPTX interface
  3. Implement write_to_stderr and exit_early (Basically a circular buffer the loader will copy back)
  4. Find a way to run the regular unit test suite as an integration test.
  5. Write a full RPC server that essentially acts as the GPU's "syscall"

After this point, adding new features will simply be a matter of implementing them in code.

sivachandra accepted this revision. Feb 8 2023, 10:27 PM

OK for code organization. A GPU expert should OK the actual code.

libc/utils/gpu/loader/amdgpu/CMakeLists.txt
5

How does this get discovered?

This revision is now accepted and ready to land. Feb 8 2023, 10:27 PM
jhuber6 added inline comments. Feb 9 2023, 5:01 AM
libc/utils/gpu/loader/amdgpu/CMakeLists.txt
5
find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
if(hsa-runtime64_FOUND)
  add_subdirectory(amdgpu)
endif()

From this code in the base loader, we'll need to add a check to only enable the tests if the user has a GPU set up and we have the ability to build something to launch it.

JonChesterfield accepted this revision. Feb 13 2023, 6:39 AM

I don't rate the copy&paste but I suppose we can fix that later if we choose to. The implementation is probably alright as it was copied from things that are probably alright, and if it turns out to be buggy, we'll fix it when that arises.