Adding these pragmas will force all global storage variables
to be emitted with device attribute. This allows a few testsuites
to avoid tagging every global variable with attribute((device))
which may not be feasible or easily upstreamable. These pragmas
may be nest similarly to force_cuda_host_device pragmas.
Details
Diff Detail
Unit Tests
Event Timeline
This allows a few testsuites to avoid tagging every global variable
Can you elaborate on that? Forcing host/device on some functions is needed to make some standard headers work.
I'm not convinced that making all globals a __device__ variables is a good idea. Things may compile, but I have serious doubts that it will be particularly useful in most cases due to the various concurrency issues and the fact that GPU side imposes additional restrictions on the initializers.
Perhaps those few testsuites should be ported to be compileable with CUDA instead?
An example usage is to run a large part of the gdb test suite on the GPU. The tests normally run on the CPU, but can also be made to run on the GPU within a test harness that emulates the necessary environment. For that to work the variables need to be forced to be device. The issues of concurrency do not happen due to the nature of the environment. Porting or modifying the entire test suite is not particularly viable.
Currently a clang plugin is being used to do this, but the hope was to have clang support this directly in a similar way that it already supports something similar for functions. It is up to the user to use appropriately.
Interesting. Once the globals are forced to be __device__, what ends up using them? Is that just for the GDB itself to access them? Or are they used by some code? If so, how is the code forced into being __global__/__device__ functions?
I can see this patch being useful for the former case.
clang/test/SemaCUDA/force-device-globals.cu | ||
---|---|---|
49 | You may also want to test local static vars. I guess those should remain on the host in the explicitly __host__ functions and become __device__ in unattributed functions with the pragma. Another case to test would be implicitly host/device functions. E.g. constexpr ones. I guess those should already place the local static variables on the correct side of the compilation, depending on where we compile them. | |
50–57 | So, technically, only global_before_pragma is a __device__ variable now. Everything else we should not be allowed to read from. At the moment clang does allow reading the variables (or, rather, their shadows), but it should not be the case. At the very least I'd add a comment about that so it's clear that accessing device vars from a host function is not OK. |
I believe the GDB testsuite is being compiled targeting the device/AMDGPU, and the device code accesses these global variables which need to be marked with __device__. This allows many of GDB testcases to be compiled for device unmodified. They are using the existing #pragma clang force_cuda_host_device begin to force functions to become device attributed.
clang/test/SemaCUDA/force-device-globals.cu | ||
---|---|---|
49 | Thanks, I will add checking for local static vars and constexpr vars on host, device, and host/device functions. | |
50–57 | Thanks, I will add a comment so that its clear we cannot access device vars here. |
Sorry, I still don't understand.
GDB testsuite is being compiled targeting the device/AMDGPU
Do you mean that tests are being compiled with --cuda-device-only ? I.e. it's not a regular HIP compilation where we do compile the same source code for both the host and N GPUs?
They are using the existing #pragma clang force_cuda_host_device begin to force functions to become device attributed.
So, the end goal appears to be able to compile a pure C++ source for AMD GPUs.
I wonder if all we need for that is clang++ -target amdgcn-amd-amdhsa -x c++
This appears to produce an AMDGPU binary:
$ echo 'int f () {return 1;}' | bin/clang++ -target amdgcn-amd-amdhsa -x c++ -c -o zzz.o -nogpulib - $ readelf -e zzz.o ELF Header: Magic: 7f 45 4c 46 02 01 01 40 01 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian Version: 1 (current) OS/ABI: <unknown: 40> ABI Version: 1 Type: REL (Relocatable file) Machine: AMD GPU
Hi!
This allows running the whole GDB testsuite in HIP mode, using a custom DejaGnu board file, to test device debugging. With that board file, every C and C++ program in the testsuite is compiled using HIPCC, targeting the device/GPU. Many GDB testcases don't make sense to run against the GPU (e.g., posix threads tests, fork, exec, etc.), but many of the core GDB tests do make sense to run. Breakpoints, watchpoints, listing, backtracing, etc.
The custom DejaGnu board file links some HIP glue code into every GDB testcase. I call that glue code the driver. It contains both the host's main() entry point, and a kernel entry point. The kernel entry point calls the testcase's actual main() (the preexisting main() function that is written in each individual GDB testcase). The host's argc/argv/envp are forwarded from host's main() to the kernel and then to the testcase's main(), now in device code. Some really basic custom C runtime routines missing in the HIP runtime are linked in as well, like puts, (a really dumb) malloc/free, strlen, etc. Using HIP instead of -target=amdgcn C/C++ takes care of all the sordid details of device code linking and loading, plus we can make use of the HIP headers and runtime on the device side, and HIP is actually the language that is actually supported for debugging in GDB anyhow.
The board file overrides the default target compilation procedure to instead compile C/C++ testcase files using 'hipcc' and link in the driver. (Using -fgpu-rdc to allow compiling translation units separately.)
Now, to avoid having to modify the hundreds of actual tests, we also pass "-include" to hipcc to force-include a header into every testcase compiled. That header, among other things, has:
/* Avoid having to write explicit __device__ in all functions throughout. */ #pragma clang force_cuda_host_device begin
This results in code being emitted for both the host and the device. The resulting code will only be run on the device, though the fact that the debug info for the host code is emitted as well is very nice, because it lets the GDB testsuite set breakpoints by file and line number or function name, even before the program is started and the device code is loaded. The testsuite does that _a lot_. (E.g. "b some_test_function; run"). GDB then re-resolves breakpoint locations once the device code is loaded, and the testcase Just Works.
The pre-existing force_cuda_host_device pragma does not force device for global variables, however, it's only for functions. To avoid having to explicitly tag global variables throughout the hundreds of files, I wrote a Clang plugin that implements a pragma similar to the one being proposed, to automatically tag global variables (except system header variables) with device. That force-included header file then also does:
/* The above only works for functions. Global variables must still normally be tagged with __device__. We address that with our plugin. */ #pragma force_cuda_device_globals
It would be much better if a plugin wasn't required though. Hence this proposal.
Plus, it seems to me that if "#pragma clang force_cuda_host_device begin" is useful enough to have in the compiler proper, then a similar feature for variables might be useful for the broader community as well.
I hope that clarifies things.
Thank you for the detailed explanation. I think I understand now.
OK. I think the use case is reasonably useful even beyond testing GDB. This could be used to compile portable code for the GPU in general. E.g. we could conceivably use that to compile a libm implementation without having to manually port it to CUDA/HIP.
LGTM, modulo the previous test comments and the source formatting nit.
clang/lib/Sema/SemaDecl.cpp | ||
---|---|---|
7250 | Please reformat the code as suggested. |
Thank you for the review, please see latest test updates adding tests for static/constexpr local var combinations.
clang/include/clang/Basic/DiagnosticParseKinds.td | ||
---|---|---|
1444–1450 | These could be merged similarly to warn_pragma_force_cuda_bad_arg above. | |
clang/lib/Sema/SemaCUDA.cpp | ||
671–672 | Is there a particular reason not to apply the pragma to the system headers? I think it would be more consistent to apply pragma to everything the user put within its boundaries. This brings another interesting question -- if system includes are affected by this pragma, how will you handle the files pre-included by the compiler? I'm not sure what would be the right way to handle this. @rsmith -- Do you have any suggestions? |
clang/lib/Sema/SemaCUDA.cpp | ||
---|---|---|
671–672 | Hmm, applying the pragma to system headers too would break GDB testsuite use case I detailed, since we issue the #pragma before system includes. We do: clang/hipcc -include force-globals.h unmodified-testcase.c And the #pragma is in force-globals.h. The goal is to not modify unmodified-testcase.c. And that file includes system headers. System headers contain global variable declarations which we can't mess with. |
Merged DiagnosticParseKinds as requested.
Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.
@rsmith suggested that #pragma clang attribute push (__device__, apply_to = variable(is_global)) may already be able to do what this patch is attempting to do. Can you check if you can make it work for your tests?
That sounds quite promising. However, I just gave it a try, and I got two errors:
src/gdb/testsuite/lib/hip-test.h:38:15: error: attribute 'device' can't be applied to 'variable(is_global)' #pragma clang attribute push (__device__, apply_to = variable(is_global)) ^ ~~~~~~~~~~~~~~~~~~~ src/gdb/testsuite/lib/hip-test.h:38:15: error: unterminated '#pragma clang attribute push' at end of file
So three points / questions:
#1 - Is there a reason the __device__ attribute can't be applied with this pragma?
#2 - The unterminated #pragma clang attribute push error is also a blocker for our use case, because as I mentioned, we're putting the #pragma push in a header that is force-included, there's nowhere to put the corresponding pop. This seems like a needless Clang restriction -- note that #pragma clang force_cuda_host_device begin does not error out with a missing corresponding pop, the pragma just ends up being in effect until the end of the translation unit. That's what we need here too.
#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.
It does appear to be somewhat broken. It works with apply_to=variable, but does not work with apply_to=variable(is_global):
https://godbolt.org/z/6K3zMbd33
I don't see any particular reason why it should not work in principle.
#2 - The unterminated #pragma clang attribute push error is also a blocker for our use case, because as I mentioned, we're putting the #pragma push in a header that is force-included, there's nowhere to put the corresponding pop.
This seems like a needless Clang restriction -- note that #pragma clang force_cuda_host_device begin does not error out with a missing corresponding pop, the pragma just ends up being in effect until the end of the translation unit. That's what we need here too.
Allowing pragma push to be unmatched, maybe with an explicit option to enable it, would probably be less controversial than adding a new pragma that duplicates existing functionality.
#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.
This 'do magic on all globals, except if they are in the system headers' still looks questionable to me. It may happen to work for you, but I don't understand why it needs to work that way and whether it's a generally useful behavior for the compiler to implement.
Is there a particular reason not to apply the pragma to the system headers?
Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.
Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers of it's a wrong thing to do that just happened to cover up the real problem somewhere else.
To be clear, the pragma that I mentioned allows unmatching -- #pragma clang force_cuda_host_device begin -- is a preexisting Clang pragma, not the one proposed by this review. It's highly inconsistent for one pragma to error out when unmatched, while the other doesn't. Would you suggest that the pre-existing #pragma clang force_cuda_host_device begin should error out when unmatched?
Here's another pragma that doesn't error out when unbalanced:
#pragma GCC diagnostic push
Would you suggest that this should error out when unbalanced unless you specify an explicit option?
#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.
This 'do magic on all globals, except if they are in the system headers' still looks questionable to me. It may happen to work for you, but I don't understand why it needs to work that way and whether it's a generally useful behavior for the compiler to implement.
Is there a particular reason not to apply the pragma to the system headers?
Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.
Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers of it's a wrong thing to do that just happened to cover up the real problem somewhere else.
But I understand the reason for the failure, and I think I mentioned it before. System headers contain declarations of host globals that should remain host variables and not get the device attribute -- compilation fails.
I don't have an error log to paste here handy, because I blew up my previous ROCm setup (the one using my plugin that implements the proposed pragma) by accident when I upgraded my setup to test your "#pragma clang attribute push (device, apply_to)" suggestion with a more up to date Clang. It will take time to rebuild it. :-/
I think it's reasonable to have a way to skip applying something to system headers because those are beyond a user's control. Other compiler features give special treatment to system headers exactly for the reason for being out of control of the user -- e.g., diagnostics https://clang.llvm.org/docs/UsersManual.html#controlling-diagnostics-in-system-headers -- it's not like there's no precedent.
It appears that '#pragma attribute push' does have a bug (or a design quirk).
It apparently requires the 'apply_to' part to match the set of subjects for the attribute. For device it's only allowed to accept 'variable', which is not very useful as it attempts to apply the attribute to way too many things. This should be fixed to allow applying the attribute to a subset.
No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.
#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.
Is there a particular reason not to apply the pragma to the system headers?
Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.
Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers of it's a wrong thing to do that just happened to cover up the real problem somewhere else.
But I understand the reason for the failure, and I think I mentioned it before. System headers contain declarations of host globals that should remain host variables and not get the device attribute -- compilation fails.
This level of details is not very helpful as "compilation fails" is the end result of a pretty large set of root causes.
What makes the variables in the system headers different from the variables in the user code? Which variables should remain host-only and why? Some of them? All of them?
Illustrating the issue with specific code examples on godbolt.org would also be very helpful.
No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.
#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.
Regarding this point, should the absence of the push keyword apply the pragma on the TU, or should we allow open push without pop?
I've just landed https://reviews.llvm.org/D100136, so apply_to=variables(global) should work now.
If you do need to restrict the scope of the pragma to system headers, it could be implemented as another matcher for , similar
No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.
#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.
Regarding this point, should the absence of the push keyword apply the pragma on the TU, or should we allow open push without pop?
I potentially see few ways to deal with this:
- add an CLI option to ignore mismatched pus/pop. It's a bit of a sledgehammer, but it would do the job if the user needs is.
- allow #pragma clang attribute without a push, which would be equivalent to a push with a specian namespace. Missing pop for such namespace would be ignored.
- same as bove, but require user to use a special namespace 'no_pop' (e.g #pragma clang attribute no_pop.push (...)) and ignore missing pops for that namespace only.
I think no_pop would probably be the easiest to implement and will be consistent with the documented behavior of the pragma.
@arphaman Do you have any any suggestions on what would be the best approach to allow #pragma clang attribute to work from a header injected with -include for which we can't conveniently inject a matching pop at the end of a TU?
As for restricting the scope of the pragma to system headers or user code only, I think we should be able to extend the pragma by adding something like a in_files={system_headers,user_code,main_source,...., file/name/pattern*.h}, with the attribute applied only to constructs satisfying both apply_to and in_files.
@palves -- For some reason your reply didn't make it to the tracker. I guess phabricator does not handle email replies well.
I don't understand the desire to for extra syntax -- no other push/pop pragma requires separate syntax, they simply don't error out
if the pop is missing at the end of the translation unit.
I'd argue that a missing pop is an error ( extra one should be, too), and it's those other pragmas that are not doing the right thing.
That said, I'm not inherently opposed to allowing mismatched push/pop, but I'd prefer not to, if we can.
In general, relaxing error checking, which affects all users, for the sake of a niche use case does not look like a good trade-off to me.
If you require a separate syntax here, then that suggests that the other
push/pop pragmas should gain that no_pop syntax too. It just seems like pointless complication to me. (Moreso since some of those
are implemented (or even originated) in GCC too.)
The fact that some pragma does X does not necessarily imply that all of them must do X the same way. Sometimes it makes sense, sometimes it does not.
Historic precedence is a guideline, not the unreakable law. If we can do better when we're not constrained by having to be quirk-for-quirk compatible with something implemented in GCC few decades ago, I think we should do it. Stricter error checking would be one of those things.
IMO the strict push/pop checking done by #pragma clang attribute does make sense and it's reasonably easy to extend it in a way that allows user to bypass the check if necessary in a way that is both visible in the source and does not affect other users.
On 12/04/21 18:37, Artem Belevich via Phabricator wrote:
tra added a comment.
@palves -- For some reason your reply didn't make it to the tracker. I guess phabricator does not handle email replies well.
I don't understand the desire to for extra syntax -- no other push/pop pragma requires separate syntax, they simply don't error out
if the pop is missing at the end of the translation unit.I'd argue that a missing pop is an error ( extra one should be, too), and it's those other pragmas that are not doing the right thing.
That said, I'm not inherently opposed to allowing mismatched push/pop, but I'd prefer not to, if we can.
We have to entertain the possibility that whether to error out was thought about at the time those other pragmas were invented
(I'd think that it most certainly was thought about), but was decided then that it _was_ the right thing not to error out.
E.g., for the -include use case. Or just to put the #pragma at the top of a .cc file and not bother with the redundant pop at the end.
My mental model is:
#pragma foo push
means "foo" is in effect until the corresponding pop.
If no pop appears at the end of the file, then "foo" remains in effect until the very end, it does does not violate the mental model at
all. It's quite simple to think of, and explain it, this way.
In general, relaxing error checking, which affects all users, for the sake of a niche use case does not look like a good trade-off to me.
I'm looking at this from a consistency angle. Someone modeled the "attribute" push/pop pragma on the other push/pop pragmas, but thought
it a good idea to make it an error for something that likely was determined shouldn't be an error in the other cases. But the pragmas
push/pop ideas are so similar in the abstract (create "scopes"), that it just seems like the behavior is different because it was probably
implemented by different people at different times.
IMO, a better approach would be:
#1 - make unbalanced pop NOT be an error in "#pragma clang attribute push/pop", consistent with other #pragmas.
#2 - make clang WARN about unmatched push/pop, for all the different push/pop #pragmas. Make that controllable with
some new warning flag, like (straw man), say -Wunbalanced-push-pop.
This way, Clang would handle all #pragmas consistently, and, users who would want to catch the imbalance with an error
could use -Werror=unbalanced-push-pop, and that would work for all the different push/pop-style pragmas. If the warning is on
by default, users could disable with -Wno-unbalanced-push-pop.
Wouldn't this be better?
If you require a separate syntax here, then that suggests that the other
push/pop pragmas should gain that no_pop syntax too. It just seems like pointless complication to me. (Moreso since some of those
are implemented (or even originated) in GCC too.)The fact that some pragma does X does not necessarily imply that all of them must do X the same way. Sometimes it makes sense, sometimes it does not.
Historic precedence is a guideline, not the unreakable law. If we can do better when we're not constrained by having to be quirk-for-quirk compatible with something implemented in GCC few decades ago, I think we should do it. Stricter error checking would be one of those things.IMO the strict push/pop checking done by #pragma clang attribute does make sense and it's reasonably easy to extend it in a way that allows user to bypass the check if necessary in a way that is both visible in the source and does not affect other users.
FWIW, I remain unconvinced. It makes as much sense with "#pragma clang attribute" as it does with "#pragma clang force_cuda_host_device". They both basically apply attributes to things.
I don't have particularly strong opinion on this. Warning on mismatched pop at the end of a TU would work for me, too. This should probably be done and discussed in the new patch implementing it.
I think this review tracker has served its purpose and can be closed as we no longer need to add a new pragma.
Closing this revision, I have a patch to add no_pop variant of #pragma clang attribute push.
https://reviews.llvm.org/D100404
Alternatively, we could look into making no pop the default.
These could be merged similarly to warn_pragma_force_cuda_bad_arg above.