This is an archive of the discontinued LLVM Phabricator instance.

[libc][hdr-gen] Add special offloading handling for the GPU target
AbandonedPublic

Authored by jhuber6 on Jun 27 2023, 10:30 AM.

Details

Summary

The previous patch in D153794 provided us with the necessary hooks to
wrap special handling required for GPU headers. As discussed in
https://discourse.llvm.org/t/rfc-implementing-gpu-headers-in-the-llvm-c-library,
we need special handling for the GPU headers if we want them to be
included from a standard offloading language such as CUDA, HIP, or
OpenMP. This is important if we wish to actually export the GPU libc,
as these are the expected users of the interface. The internal
implementation uses a freestanding C++ mode chosen to keep the interface
common.

The requirements summarized for these offloading headers are as follows:

  1. The same header must be included by both the host and device as they must agree on definitions
  2. We must outline which definitions are provided by the LLVM GPU libc

This means that we need to include the system headers when compiling for
the GPU. Many issues with that are worked around with offloading
language semantics.

It is worth noting that this header mess will not munge the internal
implementation. This is merely a hack required to export these functions
in a way that's compatible with existing offloading languages. The
current use-case for libc is a freestanding mode which can use the
standard headers generated by libc. For users who do not want these
headers, I have added an option to disable them.

Depends on D153794

Diff Detail

Event Timeline

jhuber6 created this revision. Jun 27 2023, 10:30 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript Jun 27 2023, 10:30 AM
jhuber6 requested review of this revision. Jun 27 2023, 10:30 AM

The extent to which we need to work against system headers that are hostile to being included is unfortunate, but I firmly believe this is the only workable solution for exporting these interfaces to the expected users. The only alternative sound solution is to provide a complete libc for both the host system and the GPU and use a unified header that works flawlessly on both. This is a nice goal, but it is most likely many, many years away from being possible and would require the user to change their system configuration. This approach at least has the effect of generating headers that can then replace the ad-hoc ones that exist in clang and can be used as a base for further development.

There are a few ugly points that have not been addressed by this patch, namely

  1. If the system header is not compatible with the implemented ABI: e.g. if div_t is not defined in the order quot; rem. For these cases we can statically assert to at least prevent miscompilations (see the sketch after this list), but I think this will be rare
  2. System definitions we will need to manually override on the GPU: Something like errno is a macro that's defined by the system headers. If we wish to provide this on the GPU we will need a special section to undef it and reset it to ours only for the GPU compilation.
  3. A handful of entrypoints use a type that is not C-standard. E.g. the atexit function uses __atexit_handler_t which is defined internally. We will need to replace these, or define it as a workaround for the GPU.
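To make point 1 concrete, a minimal sketch of such a compile-time guard follows; the placement and exact messages are illustrative assumptions, not part of this patch:

#include <stddef.h> /* offsetof */
#include <stdlib.h> /* the system's div_t */

/* Sketch: verify that the system's div_t matches the layout the GPU
   implementation assumes (quot first, then rem). A mismatch becomes a
   compile-time error instead of a silent miscompilation. */
_Static_assert(offsetof(div_t, quot) == 0,
               "div_t::quot must come first to match the GPU libc ABI");
_Static_assert(offsetof(div_t, rem) == sizeof(int),
               "div_t::rem must follow quot to match the GPU libc ABI");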

Including system headers which know nothing about GPU architectures on the GPU causes a lot of grief in the offloading languages.

How necessary is it to put both our headers and the system one in the same translation unit, given that libc is meant to abstract over architectures? If a fundamental goal is bugward compat with glibc, we're probably better off including the glibc headers in whatever hacky way seems fit and providing an implementation shim that makes this work. E.g. for the div_t example, we swap the fields on the way into and out of the libc implementation.

This setup would be
Glibc headers
Header hacks
Llvm libc translation shim
Llvm libc

While the pure, non-glibc-hybrid thing would just use the libc headers directly on the host and the GPU.
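To make the div_t example concrete, a minimal sketch of such a translation shim is below; the __gpu_div_t layout and the __llvm_libc_div and __shim_div names are hypothetical, chosen only for illustration:

#include <stdlib.h>

/* Hypothetical GPU-side layout with the fields in the opposite order
   from the system's div_t. */
typedef struct {
  int rem;
  int quot;
} __gpu_div_t;

/* Assumed internal GPU entrypoint, for illustration only. */
__gpu_div_t __llvm_libc_div(int, int);

/* Shim: swap the fields on the way out of the libc implementation so
   the caller sees the system header's layout. */
static inline div_t __shim_div(int num, int denom) {
  __gpu_div_t __g = __llvm_libc_div(num, denom);
  return (div_t){.quot = __g.quot, .rem = __g.rem};
}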

Including system headers which know nothing about GPU architectures on the GPU causes a lot of grief in the offloading languages.

Yes, but not including system headers causes an equivalent amount of grief. The main problem is that we don't have nice system headers to use, but that's a far bigger ask than making a shim into what already exists.

How necessary is it to put both our headers and the system one in the same translation unit, given that libc is meant to abstract over architectures? If a fundamental goal is bugward compat with glibc, we're probably better off including the glibc headers in whatever hacky way seems fit and providing an implementation shim that makes this work. E.g. for the div_t example, we swap the fields on the way into and out of the libc implementation.

This setup would be
Glibc headers
Header hacks
Llvm libc translation shim
Llvm libc

While the pure, non-glibc-hybrid thing would just use the libc headers directly on the host and the GPU.

I don't think we could have any other ordering. If we include the libc headers unmodified they will always result in a conflict with the system types, yet we still need them to tell the host / device which values are on the GPU. The approach taken here is to let the system headers define the types and then post-facto declare things on the GPU.

Should probably show what one of these generated headers looks like:

#ifndef LLVM_LIBC_STDIO_H
#define LLVM_LIBC_STDIO_H

#include <llvm-libc-macros/gpu-macros.h>

#ifdef __OFFLOADING
#include_next <stdio.h>
#else
#include <__llvm-libc-common.h>
#include <llvm-libc-macros/file-seek-macros.h>
#include <llvm-libc-macros/stdio-macros.h>
#endif

#ifndef __OFFLOADING
#define EOF -1

#define _IONBF 2

#define _IOLBF 1

#define _IOFBF 0

#include <llvm-libc-types/FILE.h>
#include <llvm-libc-types/size_t.h>

#endif

__BEGIN_OPENMP_DECLS
__BEGIN_C_DECLS

int puts(const char *__restrict) __NOEXCEPT __DEVICE;

int fputs(const char *__restrict, FILE *__restrict) __NOEXCEPT __DEVICE;

extern FILE * stdin __DEVICE;
extern FILE * stdout __DEVICE;
extern FILE * stderr __DEVICE;

__END_C_DECLS
__END_OPENMP_DECLS

#endif // LLVM_LIBC_STDIO_H
jhuber6 added a subscriber: tra. Jun 27 2023, 12:08 PM
jhuber6 updated this revision to Diff 535212. Jun 27 2023, 6:52 PM

So, these headers need the ability to do some potentially ugly hacks. I have
added an extra field to the PublicAPI called ExtraDefn, which is currently a
simple string that we can use to inject some special-case handling for the
wrapper headers. In this case, we needed to define the __atexit_handler_t type
for the atexit function. In general this can be used as a general-purpose shim
to address edge cases.
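As a rough illustration, the injected text for atexit might look something like the following in the generated header (a sketch only; the guard macro mirrors the generated header shown above, and the exact spelling of the injection is an assumption):

/* Sketch of an ExtraDefn injection: the internal handler type is not
   defined by the system headers, so the wrapper supplies it itself
   when compiling for offloading. */
#ifdef __OFFLOADING
typedef void (*__atexit_handler_t)(void);
#endif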

It's not pretty, but as I've stated I think this is the only sound solution to
this problem short of fully implementing the LLVM C library for both the host
and device.

The generated headers still pass the test suite as internally nothing has
changed. Externally, I can now safely include these headers inside OpenMP
without the prior errors.

So, these headers need the ability to do some potentially ugly hacks. I have
added an extra field to the PublicAPI called ExtraDefn, which is currently a
simple string that we can use to inject some special-case handling for the
wrapper headers. In this case, we needed to define the __atexit_handler_t type
for the atexit function. In general this can be used as a general-purpose shim
to address edge cases.

It's not pretty, but as I've stated I think this is the only sound solution to
this problem short of fully implementing the LLVM C library for both the host
and device.

I am still not convinced about the "soundness" of the solution. I am on vacation but will try to spend more time on this tomorrow (late evening/night US west coast time) so that we can move forward on this in whatever way. I think the RFC is missing a comparison of alternatives. If you can describe the potential alternative approaches and their pros and cons, it will become easier for me (and others, I am sure) to understand (and maybe even concur). One of the alternatives that I had mentioned previously is separating out the offloading language use case and the standalone GPU use case. IIRC, you wanted to combine them as a matter of convenience - I think you should describe that in the RFC for others' benefit (and at the least for the sake of the record).

I am still not convinced about the "soundness" of the solution. I am on vacation but will try to spend more time on this tomorrow (late evening/night US west coast time) so that we can move forward on this in whatever way. I think the RFC is missing a comparison of alternatives. If you can describe the potential alternative approaches and their pros and cons, it will become easier for me (and others, I am sure) to understand (and maybe even concur). One of the alternatives that I had mentioned previously is separating out the offloading language use case and the standalone GPU use case. IIRC, you wanted to combine them as a matter of convenience - I think you should describe that in the RFC for others' benefit (and at the least for the sake of the record).

Thanks a lot for looking into this admittedly weird and unfortunate quirk of offloading languages, and sorry to disturb your vacation. Yes, this presents a lot of unique difficulties because we are trying to provide a library for something like a kind-of-hosted-but-not-really interface. I don't think separating the headers is a big distinction; it's the same approach in the end. Whether we have a single header that ifdefs it off, or two headers that we conditionally include, it'll be the same to the preprocessor at the end of the day. I went with a single header because it is easier to implement and use, since the existing workflow only creates one header and I didn't see a compelling reason to separate them given the extra work.

For alternative approaches, I've mentioned before that if we had a unified environment all of this could go away save for the device declarations. That's a solution, but it is a much, much bigger ask at this point in time. Another alternative would be to only export the functions via the LLVM libc and include that in a separate wrapper header, maybe in clang/lib/Headers/__, e.g. like the following

#include_next <stdlib.h>

#define __DEVICE __attribute__((device))

#pragma omp begin declare target
#include <__llvm_libc_declarations/stdlib.h>
#pragma omp end declare target

Where that file could look like

#ifndef __DEVICE
#define __DEVICE
#endif

int abs(int) __DEVICE;
...

This would have the benefit of keeping CUDA / OpenMP weirdness out of libc, and we'd be more or less free to hack our way to victory without anyone in libc smelling it. The downside is that we would still need a few workarounds: e.g. for something like errno we'd need to copy the declaration from libc, and for atexit we would need to manually define the __atexit_handler_t somewhere. Doing this would basically require a special mode in libc-hdrgen that only emits the entrypoints and gets rid of the other stuff, so we might end up still needing the %%begin and %%end mess unless we want to parse it more completely.
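For the errno case, a hedged sketch of what such a workaround might look like; the __llvm_libc_errno name is hypothetical, while __NVPTX__ and __AMDGCN__ are clang's predefined GPU target macros:

/* Sketch: undefine the system's errno macro and point it at a
   GPU-side object for device compilation only. */
#if defined(__NVPTX__) || defined(__AMDGCN__)
#undef errno
extern int __llvm_libc_errno; /* hypothetical GPU-side errno object */
#define errno __llvm_libc_errno
#endif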

The offloading languages do some things with libc which seem broadly reasonable but might not be optimal. The reasoning goes roughly: the host part will probably use glibc, so the GPU part should use the same ABI as glibc, so the GPU should include the glibc headers. Unfortunately the glibc headers don't know anything about the GPU, so there's a cascade of hard-won hacks to make that work.

I don't think this is totally sensible. For one thing, using the glibc ABI on the GPU has rather limited scope, since we can't make normal function calls across that barrier by one arch emitting a jump or call. I think all that matters is agreeing on struct layout in case structs get copied across. A defining characteristic of libc is that we know what is supposed to be in it, and of glibc in particular that the ABI is stable, versioned, and totally documented. So the GPU could use the same struct definitions as glibc without including glibc headers. It seems unimportant to me that the two sides exactly agree on which functions are constexpr or not.
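As a sketch of that idea (illustrative only, not something this patch does): since glibc's documented div_t layout is quot followed by rem, the GPU side could replicate the definition directly instead of including the header that provides it:

/* Replicate glibc's documented div_t layout without including its
   headers; the two sides then agree on struct layout by construction. */
typedef struct {
  int quot; /* quotient */
  int rem;  /* remainder */
} div_t;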

Nevertheless, OpenMP/CUDA/HIP presently include glibc headers directly on the GPU and patch around it. The dev teams involved seem committed to this strategy, so I suggest we leave it in place.

I'd therefore like us to localise whatever header-related hacks the offloading languages currently use where they are at present, i.e. in the overlay stuff in clang/lib/Headers, and have absolutely none of it anywhere in libc. That allows the offloading languages to add or remove complexity, diverge from one another or converge as they see fit. This totally isolates libc from any adverse behaviour that results from those changes.

Nevertheless, OpenMP/CUDA/HIP presently include glibc headers directly on the GPU and patch around it. The dev teams involved seem committed to this strategy, so I suggest we leave it in place.

The history of this is long. If we want to avoid system headers we run into various problems, including but not limited to:
missing extensions, missing/different macros, different function pointer values, and missing includes (e.g. a header on the include path provides a function or an include that the standard does not require it to).
During the time we provided math functions without system headers we had a lot of complaints; lots of people found it pretty unusable.
We might be able to work around those issues with our libc headers, but it is uncharted territory and I foresee problems. I'd be happy to try, but I'd rather have a working fallback first.

I'd therefore like us to localise whatever header-related hacks the offloading languages currently use where they are at present, i.e. in the overlay stuff in clang/lib/Headers, and have absolutely none of it anywhere in libc. That allows the offloading languages to add or remove complexity, diverge from one another or converge as they see fit. This totally isolates libc from any adverse behaviour that results from those changes.

Localizing the hacks seems reasonable. Putting it in clang/lib/Headers might not be the best idea, though I won't fight it. I was hoping to have it self-contained in libc, but again, that is debatable.

While the pure, non-glibc-hybrid thing would just use the libc headers directly on the host and the GPU.

I don't think we could have any other ordering. If we include the libc headers unmodified they will always result in a conflict with the system types, yet we still need them to tell the host / device which values are on the GPU. The approach taken here is to let the system headers define the types and then post-facto declare things on the GPU.

I agree, just libc headers will clash. That is why our current headers are annotated properly to make sure they will work well with host headers.
It is the curse of offloading that you need to keep the user program running the way it was, and make it work in the same build configuration, or people will take away that it does not work.

Localizing the hacks seems reasonable. Putting it in clang/lib/Headers might not be the best idea, though I won't fight it. I was hoping to have it self-contained in libc, but again, that is debatable.

I'll mirror a comment I made on the original RFC about what the alternate approach will look like.

#ifndef __ATTRS
#define __ATTRS
#endif

int fputs(const char *__restrict, FILE *__restrict) __ATTRS;
extern FILE * stderr __ATTRS;

This would then be included by a wrapper header in clang/lib/Headers/__libc_wrappers/stdio.h which looks like the following:

#ifndef __CLANG_LIBC_WRAPPERS_STDIO_H__
#define __CLANG_LIBC_WRAPPERS_STDIO_H__
#include_next <stdio.h>

#if defined(__CUDA__) || defined(__HIP__)
#define __ATTRS __attribute__((device))
#endif

#pragma omp begin declare target
#include <llvm-gpu-none/llvm-libc-declarations/stdio.h>
#pragma omp end declare target

#endif

The downside is that we don't get these nice unified headers, but given the extent of hacks needed for each language, I think it may be better to shuttle that next to where clang is. This would pretty much be a separate mode for libc-hdrgen that just spits out the entrypoints, so it's much less intrusive to libc, and it gives us a hook where we can add whatever workarounds we'll require for the offloading case. I think I'm favoring this approach having thought it through, because in my head I can already imagine a large number of #undefs we'll need to do, and I don't think we should mash those into the libc implementation itself.

I'll try to make a new patch after lunch to illustrate this; hopefully it will be agreeable to everyone involved. Getting this done would be a major milestone in being able to ship the libc, so I'm eager to make progress. Thanks for everyone's continued involvement.

jhuber6 abandoned this revision. Jul 14 2023, 3:59 PM