This is an archive of the discontinued LLVM Phabricator instance.

Paths

clang/include/clang/Basic/DiagnosticParseKinds.td
clang/include/clang/Parse/Parser.h
clang/include/clang/Sema/Sema.h
clang/include/clang/Serialization/ASTBitCodes.h
clang/include/clang/Serialization/ASTReader.h
clang/lib/Parse/ParsePragma.cpp
clang/lib/Sema/SemaCUDA.cpp
clang/lib/Sema/SemaDecl.cpp
clang/lib/Sema/SemaTemplateInstantiateDecl.cpp
clang/lib/Serialization/ASTReader.cpp
clang/lib/Serialization/ASTWriter.cpp
clang/test/PCH/pragma-cuda-force-device-globals.cu
clang/test/SemaCUDA/force-device-globals.cu

Differential D98201

[CUDA][HIP] Add #pragma clang force_cuda_device_globals {begin,end}
AbandonedPublic

Authored by ashi1 on Mar 8 2021, 10:58 AM.

Details

Reviewers

yaxunl
tra
jlebar
rsmith

Summary

Adding these pragmas will force all global storage variables
to be emitted with the device attribute. This allows a few testsuites
to avoid tagging every global variable with __attribute__((device)),
which may not be feasible or easily upstreamable. These pragmas
may be nested, similarly to the force_cuda_host_device pragmas.
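
As a rough illustration of the intended usage (a sketch, not code from the patch): the pragma was never merged, so existing compilers ignore it as an unknown pragma, possibly with a warning. The variable names and the `sum_globals` helper are invented for the example, and the nesting behaviour shown in the comments assumes the pragma would have mirrored the depth counting of force_cuda_host_device.

```cpp
// NOTE: force_cuda_device_globals was proposed in this revision but never
// landed; compilers treat these lines as unknown pragmas and skip them.
#pragma clang force_cuda_device_globals begin
int outer_global = 1;   // would have been treated as a __device__ variable

#pragma clang force_cuda_device_globals begin   // nested region
int inner_global = 2;   // still forced
#pragma clang force_cuda_device_globals end

int still_forced = 3;   // the outer region is assumed to remain open here
#pragma clang force_cuda_device_globals end

int ordinary_global = 4;  // outside any region: left unchanged

// Host-side helper so the snippet can be exercised with a plain C++ compiler.
int sum_globals() {
  return outer_global + inner_global + still_forced + ordinary_global;
}
```

Built as plain C++, the pragmas have no effect and `sum_globals()` just adds the four initial values.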

Diff Detail

Unit Tests: Failed

	Time	Test
	70 ms	x64 windows > MLIR.Conversion/PDLToPDLInterp::pdl-to-pdl-interp-matcher.mlir

Event Timeline

ashi1 requested review of this revision. Mar 8 2021, 10:58 AM

ashi1 created this revision.

ashi1 added a subscriber: t-tye. Mar 8 2021, 11:04 AM

Harbormaster completed remote builds in B92708: Diff 329072. Mar 8 2021, 4:33 PM

This allows a few testsuites to avoid tagging every global variable

Can you elaborate on that? Forcing host/device on some functions is needed to make some standard headers work.

I'm not convinced that making all globals __device__ variables is a good idea. Things may compile, but I have serious doubts that it will be particularly useful in most cases, given the various concurrency issues and the fact that the GPU side imposes additional restrictions on initializers.

Perhaps those few testsuites should be ported to be compilable with CUDA instead?

An example usage is to run a large part of the gdb test suite on the GPU. The tests normally run on the CPU, but can also be made to run on the GPU within a test harness that emulates the necessary environment. For that to work the variables need to be forced to be device. The issues of concurrency do not happen due to the nature of the environment. Porting or modifying the entire test suite is not particularly viable.

Currently a clang plugin is being used to do this, but the hope was to have clang support this directly, in a similar way to how it already supports something similar for functions. It is up to the user to use it appropriately.

Interesting. Once the globals are forced to be __device__, what ends up using them? Is that just for the GDB itself to access them? Or are they used by some code? If so, how is the code forced into being __global__/__device__ functions?

I can see this patch being useful for the former case.

clang/test/SemaCUDA/force-device-globals.cu
49	You may also want to test local static vars. I guess those should remain on the host in the explicitly `__host__` functions and become `__device__` in unattributed functions with the pragma. Another case to test would be implicitly `host/device` functions. E.g. constexpr ones. I guess those should already place the local static variables on the correct side of the compilation, depending on where we compile them.
50–57	So, technically, only `global_before_pragma` is a `__device__` variable now. Everything else we should not be allowed to read from. At the moment clang does allow reading the variables (or, rather, their shadows), but it should not be the case. At the very least I'd add a comment about that so it's clear that accessing device vars from a host function is not OK.
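
The local-static cases suggested above could look roughly like the sketch below. It is not taken from the patch's tests: the `HOST_FN` macro and the function names are hypothetical, the macro exists only so the snippet also builds as plain C++, and the placement noted in the comments is the reviewer's guess ("I guess those should remain on the host..."), not verified behaviour. The constexpr / implicitly host-device case is left out.

```cpp
// Hypothetical macro: expands to __host__ only in a CUDA/HIP compilation.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define HOST_FN __host__
#else
#define HOST_FN
#endif

// Case 1: explicitly __host__ function. Its local static is expected to
// stay a host variable even while the proposed pragma is in effect.
HOST_FN int host_counter() {
  static int calls = 0;
  return ++calls;
}

// Case 2: unattributed function. With the proposed pragma in effect, its
// local static is expected to become a __device__ variable.
int plain_counter() {
  static int calls = 0;
  return ++calls;
}
```

On a host-only build both functions behave identically, counting their own calls. The interesting difference would only show up in a CUDA/HIP compilation with the pragma active.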

In D98201#2617905, @tra wrote:

Interesting. Once the globals are forced to be __device__, what ends up using them? Is that just for the GDB itself to access them? Or are they used by some code? If so, how is the code forced into being __global__/__device__ functions?

I can see this patch being useful for the former case.

I believe the GDB testsuite is being compiled targeting the device/AMDGPU, and the device code accesses these global variables which need to be marked with __device__. This allows many of GDB testcases to be compiled for device unmodified. They are using the existing #pragma clang force_cuda_host_device begin to force functions to become device attributed.

clang/test/SemaCUDA/force-device-globals.cu
49	Thanks, I will add checking for local static vars and constexpr vars on host, device, and host/device functions.
50–57	Thanks, I will add a comment so that it's clear we cannot access device vars here.

In D98201#2629656, @ashi1 wrote:

In D98201#2617905, @tra wrote:

Interesting. Once the globals are forced to be __device__, what ends up using them? Is that just for the GDB itself to access them? Or are they used by some code? If so, how is the code forced into being __global__/__device__ functions?

I can see this patch being useful for the former case.

I believe the GDB testsuite is being compiled targeting the device/AMDGPU, and the device code accesses these global variables which need to be marked with __device__. This allows many of GDB testcases to be compiled for device unmodified.

Sorry, I still don't understand.

GDB testsuite is being compiled targeting the device/AMDGPU

Do you mean that tests are being compiled with --cuda-device-only ? I.e. it's not a regular HIP compilation where we do compile the same source code for both the host and N GPUs?

They are using the existing #pragma clang force_cuda_host_device begin to force functions to become device attributed.

So, the end goal appears to be able to compile a pure C++ source for AMD GPUs.
I wonder if all we need for that is clang++ -target amdgcn-amd-amdhsa -x c++

This appears to produce an AMDGPU binary:

$ echo 'int f () {return 1;}' | bin/clang++ -target amdgcn-amd-amdhsa -x c++ -c -o zzz.o -nogpulib -
$ readelf -e zzz.o
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 40 01 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            <unknown: 40>
  ABI Version:                       1
  Type:                              REL (Relocatable file)
  Machine:                           AMD GPU

Hi!

In D98201#2629709, @tra wrote:

In D98201#2629656, @ashi1 wrote:

I believe the GDB testsuite is being compiled targeting the device/AMDGPU, and the device code accesses these global variables which need to be marked with __device__. This allows many of GDB testcases to be compiled for device unmodified.

Sorry, I still don't understand.

This allows running the whole GDB testsuite in HIP mode, using a custom DejaGnu board file, to test device debugging. With that board file, every C and C++ program in the testsuite is compiled using HIPCC, targeting the device/GPU. Many GDB testcases don't make sense to run against the GPU (e.g., posix threads tests, fork, exec, etc.), but many of the core GDB tests do make sense to run. Breakpoints, watchpoints, listing, backtracing, etc.

The custom DejaGnu board file links some HIP glue code into every GDB testcase. I call that glue code the driver. It contains both the host's main() entry point, and a kernel entry point. The kernel entry point calls the testcase's actual main() (the preexisting main() function that is written in each individual GDB testcase). The host's argc/argv/envp are forwarded from host's main() to the kernel and then to the testcase's main(), now in device code. Some really basic custom C runtime routines missing in the HIP runtime are linked in as well, like puts, (a really dumb) malloc/free, strlen, etc. Using HIP instead of -target=amdgcn C/C++ takes care of all the sordid details of device code linking and loading, plus we can make use of the HIP headers and runtime on the device side, and HIP is the language that is actually supported for debugging in GDB anyhow.

The board file overrides the default target compilation procedure to instead compile C/C++ testcase files using 'hipcc' and link in the driver. (Using -fgpu-rdc to allow compiling translation units separately.)

Now, to avoid having to modify the hundreds of actual tests, we also pass "-include" to hipcc to force-include a header into every testcase compiled. That header, among other things, has:

/* Avoid having to write explicit __device__ in all functions throughout.  */
#pragma clang force_cuda_host_device begin

This results in code being emitted for both the host and the device. The resulting code will only be run on the device, though the fact that the debug info for the host code is emitted as well is very nice, because it lets the GDB testsuite set breakpoints by file and line number or function name, even before the program is started and the device code is loaded. The testsuite does that _a lot_. (E.g. "b some_test_function; run"). GDB then re-resolves breakpoint locations once the device code is loaded, and the testcase Just Works.

The pre-existing force_cuda_host_device pragma does not force device for global variables, however; it only applies to functions. To avoid having to explicitly tag global variables throughout the hundreds of files, I wrote a Clang plugin that implements a pragma similar to the one being proposed, to automatically tag global variables (except system header variables) with __device__. That force-included header file then also does:

/* The above only works for functions.  Global variables must still normally be tagged with __device__.  We address that with our plugin.  */
#pragma force_cuda_device_globals

It would be much better if a plugin wasn't required though. Hence this proposal.

Plus, it seems to me that if "#pragma clang force_cuda_host_device begin" is useful enough to have in the compiler proper, then a similar feature for variables might be useful for the broader community as well.

I hope that clarifies things.

Thank you for the detailed explanation. I think I understand now.

OK. I think the use case is reasonably useful even beyond testing GDB. This could be used to compile portable code for the GPU in general. E.g. we could conceivably use that to compile a libm implementation without having to manually port it to CUDA/HIP.

LGTM, modulo the previous test comments and the source formatting nit.

clang/lib/Sema/SemaDecl.cpp
7250	Please reformat the code as suggested.

This revision is now accepted and ready to land. Mar 16 2021, 1:31 PM

ashi1 updated this revision to Diff 331110. Mar 16 2021, 2:34 PM

This comment was removed by ashi1.

ashi1 updated this revision to Diff 331112. Mar 16 2021, 2:41 PM

ashi1 marked 3 inline comments as done.

Thank you for the review, please see latest test updates adding tests for static/constexpr local var combinations.

Harbormaster completed remote builds in B94130: Diff 331110. Mar 16 2021, 3:23 PM

Harbormaster completed remote builds in B94132: Diff 331112. Mar 16 2021, 3:51 PM

tra added inline comments. Mar 16 2021, 4:17 PM

clang/include/clang/Basic/DiagnosticParseKinds.td
1444–1450	These could be merged similarly to `warn_pragma_force_cuda_bad_arg` above.
clang/lib/Sema/SemaCUDA.cpp
671–672	Is there a particular reason not to apply the pragma to the system headers? Presumably we do want them to work. E.g. what if the header has a function which relies on a local static var? You may end up using such function in the code that does have the pragma applied, but it will fail to link. It may even fail to compile, as the function and the variable would end up on the different sides. I think it would be more consistent to apply pragma to everything the user put within its boundaries. This brings another interesting question -- if system includes are affected by this pragma, how will you handle the files pre-included by the compiler? I guess it's easy enough to work around with `-nocudainc` and including all relevant headers from within the pragma. I'm not sure what would be the right way to handle this. "pragma affects everything" looks more sound, but is a bit more hassle to use. "pragma does not affect system headers" would have some corner cases as we'll only have a subset of system headers available for the code compiled within pragma boundaries. @rsmith -- Do you have any suggestions?

Let's figure out what to do about the system headers before we land this.

This revision now requires changes to proceed. Mar 16 2021, 4:18 PM

palves added inline comments. Mar 16 2021, 4:35 PM

clang/lib/Sema/SemaCUDA.cpp
671–672	Hmm, applying the pragma to system headers too would break the GDB testsuite use case I detailed, since we issue the #pragma before system includes. We do: clang/hipcc -include force-globals.h unmodified-testcase.c, and the #pragma is in force-globals.h. The goal is to not modify unmodified-testcase.c. And that file includes system headers. System headers contain global variable declarations which we can't mess with.

Merged DiagnosticParseKinds as requested.

Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.

ashi1 marked an inline comment as done. Mar 31 2021, 2:16 PM

Harbormaster completed remote builds in B96592: Diff 334534. Mar 31 2021, 3:03 PM

@rsmith suggested that #pragma clang attribute push (__device__, apply_to = variable(is_global)) may already be able to do what this patch is attempting to do. Can you check if you can make it work for your tests?
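
For readers unfamiliar with the mechanism, here is a minimal sketch of a balanced #pragma clang attribute push/pop region restricted to global variables. It assumes Clang; `annotate("forced")` is only a stand-in for __device__ so that the snippet builds outside CUDA/HIP, and the variable and function names are invented for the example.

```cpp
// The pragma is Clang-specific, so it is guarded for other compilers.
#ifdef __clang__
#pragma clang attribute push (__attribute__((annotate("forced"))), apply_to = variable(is_global))
#endif

int first_global = 10;   // receives the attribute under Clang
int second_global = 32;  // so does this one

#ifdef __clang__
#pragma clang attribute pop  // Clang reports an error if this pop is missing
#endif

int untouched_global = 0;  // declared after the pop: no attribute applied

// Plain helper so the snippet can be exercised on the host.
int read_globals() { return first_global + second_global + untouched_global; }
```

The attribute changes nothing observable at run time here; the point is only the shape of the region and the required matching pop.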

In D98201#2664431, @tra wrote:

@rsmith suggested that #pragma clang attribute push (__device__, apply_to = variable(is_global)) may already be able to do what this patch is attempting to do. Can you check if you can make it work for your tests?

That sounds quite promising. However, I just gave it a try, and I got two errors:

src/gdb/testsuite/lib/hip-test.h:38:15: error: attribute 'device' can't be applied to 'variable(is_global)'
#pragma clang attribute push (__device__, apply_to = variable(is_global))
              ^                                      ~~~~~~~~~~~~~~~~~~~
src/gdb/testsuite/lib/hip-test.h:38:15: error: unterminated '#pragma clang attribute push' at end of file

So three points / questions:

#1 - Is there a reason the __device__ attribute can't be applied with this pragma?

#2 - The unterminated #pragma clang attribute push error is also a blocker for our use case, because as I mentioned, we're putting the #pragma push in a header that is force-included, there's nowhere to put the corresponding pop. This seems like a needless Clang restriction -- note that #pragma clang force_cuda_host_device begin does not error out with a missing corresponding pop, the pragma just ends up being in effect until the end of the translation unit. That's what we need here too.

#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.

In D98201#2673847, @palves wrote:
In D98201#2664431, @tra wrote:

@rsmith suggested that #pragma clang attribute push (__device__, apply_to = variable(is_global)) may already be able to do what this patch is attempting to do. Can you check if you can make it work for your tests?

That sounds quite promising. However, I just gave it a try, and I got two errors:
src/gdb/testsuite/lib/hip-test.h:38:15: error: attribute 'device' can't be applied to 'variable(is_global)'
#pragma clang attribute push (__device__, apply_to = variable(is_global))
              ^                                      ~~~~~~~~~~~~~~~~~~~
src/gdb/testsuite/lib/hip-test.h:38:15: error: unterminated '#pragma clang attribute push' at end of file
So three points / questions:

#1 - Is there a reason the __device__ attribute can't be applied with this pragma?

It does appear to be somewhat broken. It works with apply_to=variable, but does not work with apply_to=variable(is_global):
https://godbolt.org/z/6K3zMbd33

I don't see any particular reason why it should not work in principle.

#2 - The unterminated #pragma clang attribute push error is also a blocker for our use case, because as I mentioned, we're putting the #pragma push in a header that is force-included, there's nowhere to put the corresponding pop.
This seems like a needless Clang restriction -- note that #pragma clang force_cuda_host_device begin does not error out with a missing corresponding pop, the pragma just ends up being in effect until the end of the translation unit. That's what we need here too.

Allowing pragma push to be unmatched, maybe with an explicit option to enable it, would probably be less controversial than adding a new pragma that duplicates existing functionality.

#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.

This 'do magic on all globals, except if they are in the system headers' still looks questionable to me. It may happen to work for you, but I don't understand why it needs to work that way and whether it's a generally useful behavior for the compiler to implement.

Is there a particular reason not to apply the pragma to the system headers?

Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.

Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers or it's a wrong thing to do that just happened to cover up the real problem somewhere else.

In D98201#2674765, @tra wrote:

In D98201#2673847, @palves wrote:

#2 - The unterminated #pragma clang attribute push error is also a blocker for our use case, because as I mentioned, we're putting the #pragma push in a header that is force-included, there's nowhere to put the corresponding pop.
This seems like a needless Clang restriction -- note that #pragma clang force_cuda_host_device begin does not error out with a missing corresponding pop, the pragma just ends up being in effect until the end of the translation unit. That's what we need here too.

Allowing pragma push to be unmatched, maybe with an explicit option to enable it, would probably be less controversial than adding a new pragma that duplicates existing functionality.

To be clear, the pragma that I mentioned allows unmatching -- #pragma clang force_cuda_host_device begin -- is a preexisting Clang pragma, not the one proposed by this review. It's highly inconsistent for one pragma to error out when unmatched, while the other doesn't. Would you suggest that the pre-existing #pragma clang force_cuda_host_device begin should error out when unmatched?

Here's another pragma that doesn't error out when unbalanced:

#pragma GCC diagnostic push

Would you suggest that this should error out when unbalanced unless you specify an explicit option?
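
A minimal illustration of that behaviour, assuming GCC or Clang (other compilers may simply ignore the pragmas). The function name is invented for the example.

```cpp
// Save the current diagnostic state, then silence one warning.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"

int answer() {
  int scratch = 0;  // unused on purpose; the warning above is suppressed
  return 42;
}

// There is deliberately no "#pragma GCC diagnostic pop" here: the setting
// simply stays in effect until the end of the translation unit.
```

This is the pattern a force-included header would rely on: the push is issued once at the top and nothing needs to be appended to each source file.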

#3 - I was going to verify whether #pragma attribute is_global also applies to system header globals (I suspect so). If it does apply to system header globals too, then is there a way to avoid it? I didn't see a predicate for it. Maybe one could be added? Something like "unless(system_header)"? Not sure how the right syntax to combine it with is_global would be, though I'd need it, of course.

This 'do magic on all globals, except if they are in the system headers' still looks questionable to me. It may happen to work for you, but I don't understand why it needs to work that way and whether it's a generally useful behavior for the compiler to implement.

Is there a particular reason not to apply the pragma to the system headers?

Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.

Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers or it's a wrong thing to do that just happened to cover up the real problem somewhere else.

But I understand the reason for the failure, and I think I mentioned it before. System headers contain declarations of host globals that should remain host variables and not get the device attribute -- compilation fails.
I don't have an error log handy to paste here, because I accidentally blew up my previous ROCm setup (the one using my plugin that implements the proposed pragma) when I upgraded it to test your "#pragma clang attribute push (__device__, apply_to)" suggestion with a more up-to-date Clang. It will take time to rebuild it. :-/

I think it's reasonable to have a way to skip applying something to system headers because those are beyond a user's control. Other compiler features give special treatment to system headers exactly for the reason for being out of control of the user -- e.g., diagnostics https://clang.llvm.org/docs/UsersManual.html#controlling-diagnostics-in-system-headers -- it's not like there's no precedent.

It appears that '#pragma attribute push' does have a bug (or a design quirk).
It apparently requires the 'apply_to' part to match the set of subjects for the attribute. For __device__ it's only allowed to accept 'variable', which is not very useful as it attempts to apply the attribute to way too many things. This should be fixed to allow applying the attribute to a subset.

In D98201#2674942, @palves wrote:

Allowing pragma push to be unmatched, maybe with an explicit option to enable it, would probably be less controversial than adding a new pragma that duplicates existing functionality.

To be clear, the pragma that I mentioned allows unmatching -- #pragma clang force_cuda_host_device begin -- is a preexisting Clang pragma, not the one proposed by this review. It's highly inconsistent for one pragma to error out when unmatched, while the other doesn't. Would you suggest that the pre-existing #pragma clang force_cuda_host_device begin should error out when unmatched?

No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.

#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.

Is there a particular reason not to apply the pragma to the system headers?

Regarding system headers, could that work be investigated in a later patch? It does not work for the GDB testsuite to enable the pragma on system headers.

Considering that you propose to add this functionality in this patch, we should figure out the reasons for such exclusion in this review as well. "Does not work for GDB testsuite" is not particularly informative. Until you understand the reason for the failure, we would not know whether there's a good reason for skipping the system headers or it's a wrong thing to do that just happened to cover up the real problem somewhere else.

But I understand the reason for the failure, and I think I mentioned it before. System headers contain declarations of host globals that should remain host variables and not get the device attribute -- compilation fails.

This level of detail is not very helpful, as "compilation fails" is the end result of a pretty large set of root causes.

What makes the variables in the system headers different from the variables in the user code? Which variables should remain host-only and why? Some of them? All of them?
Illustrating the issue with specific code examples on godbolt.org would also be very helpful.

In D98201#2674942, @palves wrote:

Allowing pragma push to be unmatched, maybe with an explicit option to enable it, would probably be less controversial than adding a new pragma that duplicates existing functionality.

To be clear, the pragma that I mentioned allows unmatching -- #pragma clang force_cuda_host_device begin -- is a preexisting Clang pragma, not the one proposed by this review. It's highly inconsistent for one pragma to error out when unmatched, while the other doesn't. Would you suggest that the pre-existing #pragma clang force_cuda_host_device begin should error out when unmatched?

No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.

#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.

Regarding this point, should the absence of the push keyword apply the pragma on the TU, or should we allow open push without pop?

I've just landed https://reviews.llvm.org/D100136, so apply_to=variables(global) should work now.
If you do need to restrict the scope of the pragma to system headers, it could be implemented as another matcher for , similar

No, what I'm saying is that we can allow #pragma clang attribute push to be unbalanced if the user requests it. Injecting it with -include is a reasonable use case, IMO and you've correctly pointed out that there's no easy way to add a matching pop.

#pragma clang attribute appears to be a better and more generic mechanism for tinkering with attributes and I would prefer to use it instead of adding more pragmas that do about the same thing.

Regarding this point, should the absence of the push keyword apply the pragma on the TU, or should we allow open push without pop?

I can see a few potential ways to deal with this:

add a CLI option to ignore mismatched push/pop. It's a bit of a sledgehammer, but it would do the job if the user needs it.
allow #pragma clang attribute without a push, which would be equivalent to a push with a special namespace. A missing pop for such a namespace would be ignored.
same as above, but require the user to use a special namespace 'no_pop' (e.g. #pragma clang attribute no_pop.push (...)) and ignore missing pops for that namespace only.

I think no_pop would probably be the easiest to implement and will be consistent with the documented behavior of the pragma.
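
For context, the "namespace" in these options refers to the existing labelled form of the pragma, in which a push and its pop share a prefix. The sketch below shows that existing, balanced form only; the `no_pop` namespace discussed above is a proposal and is not demonstrated. It assumes Clang, uses `annotate` as a harmless stand-in attribute, and the namespace, variable, and function names are invented for the example.

```cpp
// Clang-specific pragmas, guarded for other compilers.
#ifdef __clang__
#pragma clang attribute outer_ns.push (__attribute__((annotate("outer"))), apply_to = variable(is_global))
#pragma clang attribute inner_ns.push (__attribute__((annotate("inner"))), apply_to = function)
#endif

int tagged_value = 7;                               // matched by outer_ns
int tagged_function() { return tagged_value * 6; }  // matched by inner_ns

#ifdef __clang__
// Each pop names the namespace whose push it closes.
#pragma clang attribute inner_ns.pop
#pragma clang attribute outer_ns.pop
#endif
```

Under the third option, a push in the special namespace would be the only one allowed to reach the end of the translation unit without a matching pop.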

@arphaman Do you have any suggestions on what would be the best approach to allow #pragma clang attribute to work from a header injected with -include for which we can't conveniently inject a matching pop at the end of a TU?

As for restricting the scope of the pragma to system headers or user code only, I think we should be able to extend the pragma by adding something like an in_files={system_headers,user_code,main_source,...., file/name/pattern*.h}, with the attribute applied only to constructs satisfying both apply_to and in_files.

@palves -- For some reason your reply didn't make it to the tracker. I guess phabricator does not handle email replies well.

I don't understand the desire for extra syntax -- no other push/pop pragma requires separate syntax, they simply don't error out
if the pop is missing at the end of the translation unit.

I'd argue that a missing pop is an error (an extra one should be, too), and it's those other pragmas that are not doing the right thing.
That said, I'm not inherently opposed to allowing mismatched push/pop, but I'd prefer not to, if we can.
In general, relaxing error checking, which affects all users, for the sake of a niche use case does not look like a good trade-off to me.

If you require a separate syntax here, then that suggests that the other
push/pop pragmas should gain that no_pop syntax too. It just seems like pointless complication to me. (Moreso since some of those
are implemented (or even originated) in GCC too.)

The fact that some pragma does X does not necessarily imply that all of them must do X the same way. Sometimes it makes sense, sometimes it does not.
Historical precedent is a guideline, not an unbreakable law. If we can do better when we're not constrained by having to be quirk-for-quirk compatible with something implemented in GCC a few decades ago, I think we should do it. Stricter error checking would be one of those things.

IMO the strict push/pop checking done by #pragma clang attribute does make sense, and it's reasonably easy to extend it in a way that allows the user to bypass the check if necessary, in a way that is both visible in the source and does not affect other users.

On 12/04/21 18:37, Artem Belevich via Phabricator wrote:

tra added a comment.

@palves -- For some reason your reply didn't make it to the tracker. I guess phabricator does not handle email replies well.

I don't understand the desire for extra syntax -- no other push/pop pragma requires separate syntax, they simply don't error out
if the pop is missing at the end of the translation unit.

I'd argue that a missing pop is an error (an extra one should be, too), and it's those other pragmas that are not doing the right thing.
That said, I'm not inherently opposed to allowing mismatched push/pop, but I'd prefer not to, if we can.

We have to entertain the possibility that whether to error out was thought about at the time those other pragmas were invented
(I'd think that it most certainly was thought about), but was decided then that it _was_ the right thing not to error out.
E.g., for the -include use case. Or just to put the #pragma at the top of a .cc file and not bother with the redundant pop at the end.

My mental model is:

#pragma foo push

means "foo" is in effect until the corresponding pop.

If no pop appears at the end of the file, then "foo" remains in effect until the very end, and that does not violate the mental model at
all. It's quite simple to think of, and explain, this way.

In general, relaxing error checking, which affects all users, for the sake of a niche use case does not look like a good trade-off to me.

I'm looking at this from a consistency angle. Someone modeled the "attribute" push/pop pragma on the other push/pop pragmas, but thought
it a good idea to make it an error for something that likely was determined shouldn't be an error in the other cases. But the pragmas
push/pop ideas are so similar in the abstract (create "scopes"), that it just seems like the behavior is different because it was probably
implemented by different people at different times.

IMO, a better approach would be:

#1 - make unbalanced pop NOT be an error in "#pragma clang attribute push/pop", consistent with other #pragmas.

#2 - make clang WARN about unmatched push/pop, for all the different push/pop #pragmas. Make that controllable with
some new warning flag, like (straw man), say -Wunbalanced-push-pop.

This way, Clang would handle all #pragmas consistently, and, users who would want to catch the imbalance with an error
could use -Werror=unbalanced-push-pop, and that would work for all the different push/pop-style pragmas. If the warning is on
by default, users could disable with -Wno-unbalanced-push-pop.

Wouldn't this be better?
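
To make the proposal concrete, here is a toy model of how a single severity setting could drive the handling of unbalanced push/pop. It is only a sketch: `PushPopTracker`, `Severity` and `Diagnostic` are hypothetical names, none of this is Clang code, and -Wunbalanced-push-pop is the straw-man flag from the comment above, not an existing option.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the straw-man -Wunbalanced-push-pop flag.
enum class Severity { Ignored, Warning, Error };

struct Diagnostic {
  Severity severity;
  std::string message;
};

class PushPopTracker {
public:
  // `unbalanced` models the flag: Warning by default, Error for
  // -Werror=unbalanced-push-pop, Ignored for -Wno-unbalanced-push-pop.
  explicit PushPopTracker(Severity unbalanced = Severity::Warning)
      : unbalanced_(unbalanced) {}

  void push() { ++depth_; }

  void pop() {
    if (depth_ == 0) {
      report("pop without matching push");
      return;
    }
    --depth_;
  }

  // Called once at the end of the translation unit.
  void finish() {
    if (depth_ > 0)
      report("push without matching pop at end of file");
    depth_ = 0;
  }

  bool hasErrors() const {
    for (const Diagnostic &d : diags_)
      if (d.severity == Severity::Error)
        return true;
    return false;
  }

  const std::vector<Diagnostic> &diagnostics() const { return diags_; }

private:
  // Records a diagnostic unless the (hypothetical) flag is disabled.
  void report(const std::string &message) {
    if (unbalanced_ != Severity::Ignored)
      diags_.push_back({unbalanced_, message});
  }

  Severity unbalanced_;
  unsigned depth_ = 0;
  std::vector<Diagnostic> diags_;
};
```

With the default severity, a push followed directly by `finish()` records one warning and no error; constructing the tracker with `Severity::Error` turns the same situation into an error, and `Severity::Ignored` silences it.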

If you require a separate syntax here, then that suggests that the other
push/pop pragmas should gain that no_pop syntax too. It just seems like pointless complication to me. (More so since some of those
are implemented (or even originated) in GCC too.)

The fact that some pragma does X does not necessarily imply that all of them must do X the same way. Sometimes it makes sense, sometimes it does not.
Historical precedent is a guideline, not an unbreakable law. If we can do better when we're not constrained by having to be quirk-for-quirk compatible with something implemented in GCC a few decades ago, I think we should do it. Stricter error checking would be one of those things.

IMO the strict push/pop checking done by #pragma clang attribute does make sense, and it's reasonably easy to extend it in a way that allows the user to bypass the check if necessary, in a way that is both visible in the source and does not affect other users.

FWIW, I remain unconvinced. It makes as much sense with "#pragma clang attribute" as it does with "#pragma clang force_cuda_host_device". They both basically apply attributes to things.

I don't have a particularly strong opinion on this. Warning on mismatched pop at the end of a TU would work for me, too. This should probably be done and discussed in the new patch implementing it.

I think this review tracker has served its purpose and can be closed as we no longer need to add a new pragma.

Closing this revision. I have a patch to add a no_pop variant of #pragma clang attribute push:
https://reviews.llvm.org/D100404
Alternatively, we could look into making no pop the default.

tra mentioned this in D100404: Add Global support for #pragma clang attributes.Apr 13 2021, 12:36 PM

Revision Contents

Path                                                  Size
clang/include/clang/Basic/DiagnosticParseKinds.td     7 lines
clang/include/clang/Parse/Parser.h                    1 line
clang/include/clang/Sema/Sema.h                       15 lines
clang/include/clang/Serialization/ASTBitCodes.h       4 lines
clang/include/clang/Serialization/ASTReader.h         4 lines
clang/lib/Parse/ParsePragma.cpp                       41 lines
clang/lib/Sema/SemaCUDA.cpp                           33 lines
clang/lib/Sema/SemaDecl.cpp                           4 lines
clang/lib/Sema/SemaTemplateInstantiateDecl.cpp        2 lines
clang/lib/Serialization/ASTReader.cpp                 9 lines
clang/lib/Serialization/ASTWriter.cpp                 5 lines
clang/test/PCH/pragma-cuda-force-device-globals.cu    37 lines
clang/test/SemaCUDA/force-device-globals.cu           94 lines

Diff 331110

clang/include/clang/Basic/DiagnosticParseKinds.td

	[1,432 lines not shown]
	// Pragma unroll support.			// Pragma unroll support.
	def warn_pragma_unroll_cuda_value_in_parens : Warning<			def warn_pragma_unroll_cuda_value_in_parens : Warning<
	"argument to '#pragma unroll' should not be in parentheses in CUDA C/C++">,			"argument to '#pragma unroll' should not be in parentheses in CUDA C/C++">,
	InGroup<CudaCompat>;			InGroup<CudaCompat>;

	def warn_cuda_attr_lambda_position : Warning<			def warn_cuda_attr_lambda_position : Warning<
	"nvcc does not allow '__%0__' to appear after '()' in lambdas">,			"nvcc does not allow '__%0__' to appear after '()' in lambdas">,
	InGroup<CudaCompat>;			InGroup<CudaCompat>;
	def warn_pragma_force_cuda_host_device_bad_arg : Warning<			def warn_pragma_force_cuda_bad_arg : Warning<
	"incorrect use of #pragma clang force_cuda_host_device begin|end">,			"incorrect use of #pragma clang force_cuda_%select{host_device|device_globals}0 begin|end">,
	InGroup<IgnoredPragmas>;			InGroup<IgnoredPragmas>;
	def err_pragma_cannot_end_force_cuda_host_device : Error<			def err_pragma_cannot_end_force_cuda_host_device : Error<
	"force_cuda_host_device end pragma without matching "			"force_cuda_host_device end pragma without matching "
	"force_cuda_host_device begin">;			"force_cuda_host_device begin">;
				def err_pragma_cannot_end_force_cuda_device_globals : Error<
				"force_cuda_device_globals end pragma without matching "
				"force_cuda_device_globals begin">;
	} // end of Parse Issue category.			} // end of Parse Issue category.
				tra: These could be merged similarly to `warn_pragma_force_cuda_bad_arg` above.

	let CategoryName = "Modules Issue" in {			let CategoryName = "Modules Issue" in {
	def err_unexpected_module_decl : Error<			def err_unexpected_module_decl : Error<
	"module declaration can only appear at the top level">;			"module declaration can only appear at the top level">;
	def err_module_expected_ident : Error<			def err_module_expected_ident : Error<
	"expected a module name after '%select{module|import}0'">;			"expected a module name after '%select{module|import}0'">;
	def err_attribute_not_module_attr : Error<			def err_attribute_not_module_attr : Error<
	"%0 attribute cannot be applied to a module">;			"%0 attribute cannot be applied to a module">;
	[59 lines not shown]
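
A note on tra's inline suggestion above: Clang diagnostic strings can pick one of several alternatives with `%select{...|...}N`, which is how the patch already merges the two bad-argument warnings. The sketch below is a toy expander, not Clang's diagnostic formatter, and `expandSelect` is a hypothetical helper; it only illustrates how one definition could yield both "end without begin" messages.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy expansion of every "%select{a|b|...}N" in `text`, choosing the
// alternative at `index`. Simplification: the digit N after the closing
// brace is skipped and the same `index` is applied to every occurrence.
std::string expandSelect(std::string text, std::size_t index) {
  const std::string open = "%select{";
  std::size_t start = 0;
  while ((start = text.find(open, start)) != std::string::npos) {
    std::size_t close = text.find('}', start);
    if (close == std::string::npos)
      break;
    std::size_t bodyBegin = start + open.size();
    std::string body = text.substr(bodyBegin, close - bodyBegin);

    // Walk past `index` separators to reach the requested alternative.
    std::size_t begin = 0;
    bool found = true;
    for (std::size_t i = 0; i < index; ++i) {
      std::size_t bar = body.find('|', begin);
      if (bar == std::string::npos) {
        found = false;
        break;
      }
      begin = bar + 1;
    }
    if (!found) {
      start = close + 1;  // index out of range: leave this one untouched
      continue;
    }
    std::size_t end = body.find('|', begin);
    std::string chosen = body.substr(
        begin, end == std::string::npos ? std::string::npos : end - begin);

    // Drop the argument digit that follows the closing brace, if any.
    std::size_t after = close + 1;
    if (after < text.size() &&
        std::isdigit(static_cast<unsigned char>(text[after])))
      ++after;
    text.replace(start, after - start, chosen);
    start += chosen.size();
  }
  return text;
}
```

Under these assumptions, a hypothetical merged message such as `"force_cuda_%select{host_device|device_globals}0 end pragma without matching force_cuda_%select{host_device|device_globals}0 begin"` expands with index 1 to the device_globals wording used by the new error in the patch.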

clang/include/clang/Parse/Parser.h

[188 lines not shown]	class Parser : public CodeCompletionHandler {
std::unique_ptr<PragmaHandler> MSBSSSeg;		std::unique_ptr<PragmaHandler> MSBSSSeg;
std::unique_ptr<PragmaHandler> MSConstSeg;		std::unique_ptr<PragmaHandler> MSConstSeg;
std::unique_ptr<PragmaHandler> MSCodeSeg;		std::unique_ptr<PragmaHandler> MSCodeSeg;
std::unique_ptr<PragmaHandler> MSSection;		std::unique_ptr<PragmaHandler> MSSection;
std::unique_ptr<PragmaHandler> MSRuntimeChecks;		std::unique_ptr<PragmaHandler> MSRuntimeChecks;
std::unique_ptr<PragmaHandler> MSIntrinsic;		std::unique_ptr<PragmaHandler> MSIntrinsic;
std::unique_ptr<PragmaHandler> MSOptimize;		std::unique_ptr<PragmaHandler> MSOptimize;
std::unique_ptr<PragmaHandler> CUDAForceHostDeviceHandler;		std::unique_ptr<PragmaHandler> CUDAForceHostDeviceHandler;
		std::unique_ptr<PragmaHandler> CUDAForceDeviceGlobalsHandler;
std::unique_ptr<PragmaHandler> OptimizeHandler;		std::unique_ptr<PragmaHandler> OptimizeHandler;
std::unique_ptr<PragmaHandler> LoopHintHandler;		std::unique_ptr<PragmaHandler> LoopHintHandler;
std::unique_ptr<PragmaHandler> UnrollHintHandler;		std::unique_ptr<PragmaHandler> UnrollHintHandler;
std::unique_ptr<PragmaHandler> NoUnrollHintHandler;		std::unique_ptr<PragmaHandler> NoUnrollHintHandler;
std::unique_ptr<PragmaHandler> UnrollAndJamHintHandler;		std::unique_ptr<PragmaHandler> UnrollAndJamHintHandler;
std::unique_ptr<PragmaHandler> NoUnrollAndJamHintHandler;		std::unique_ptr<PragmaHandler> NoUnrollAndJamHintHandler;
std::unique_ptr<PragmaHandler> FPHandler;		std::unique_ptr<PragmaHandler> FPHandler;
std::unique_ptr<PragmaHandler> STDCFenvAccessHandler;		std::unique_ptr<PragmaHandler> STDCFenvAccessHandler;
[3,267 lines not shown]

clang/include/clang/Sema/Sema.h

[11,883 lines not shown]
/// Returns false on success.		/// Returns false on success.
/// Can optionally return whether the bit-field is of width 0		/// Can optionally return whether the bit-field is of width 0
ExprResult VerifyBitField(SourceLocation FieldLoc, IdentifierInfo *FieldName,		ExprResult VerifyBitField(SourceLocation FieldLoc, IdentifierInfo *FieldName,
QualType FieldTy, bool IsMsStruct,		QualType FieldTy, bool IsMsStruct,
Expr *BitWidth, bool *ZeroWidth = nullptr);		Expr *BitWidth, bool *ZeroWidth = nullptr);

private:		private:
unsigned ForceCUDAHostDeviceDepth = 0;		unsigned ForceCUDAHostDeviceDepth = 0;
		unsigned ForceCUDADeviceGlobalsDepth = 0;

public:		public:
/// Increments our count of the number of times we've seen a pragma forcing		/// Increments our count of the number of times we've seen a pragma forcing
/// functions to be __host__ __device__. So long as this count is greater		/// functions to be __host__ __device__. So long as this count is greater
/// than zero, all functions encountered will be __host__ __device__.		/// than zero, all functions encountered will be __host__ __device__.
void PushForceCUDAHostDevice();		void PushForceCUDAHostDevice();

/// Decrements our count of the number of times we've seen a pragma forcing		/// Decrements our count of the number of times we've seen a pragma forcing
/// functions to be __host__ __device__. Returns false if the count is 0		/// functions to be __host__ __device__. Returns false if the count is 0
/// before incrementing, so you can emit an error.		/// before incrementing, so you can emit an error.
bool PopForceCUDAHostDevice();		bool PopForceCUDAHostDevice();

		/// Increments our count of the number of times we've seen a pragma forcing
		/// global variables to be __device__. So long as this count is greater than
		/// zero, all global variables encountered will be __device__.
		void PushForceCUDADeviceGlobals();

		/// Decrements our count of the number of times we've seen a pragma forcing
		/// global variables to be __device__. Returns false if the count is 0 before
		/// decrementing, so we can emit an error.
		bool PopForceCUDADeviceGlobals();

/// Diagnostics that are emitted only if we discover that the given function		/// Diagnostics that are emitted only if we discover that the given function
/// must be codegen'ed. Because handling these correctly adds overhead to		/// must be codegen'ed. Because handling these correctly adds overhead to
/// compilation, this is currently only enabled for CUDA compilations.		/// compilation, this is currently only enabled for CUDA compilations.
llvm::DenseMap<CanonicalDeclPtr<FunctionDecl>,		llvm::DenseMap<CanonicalDeclPtr<FunctionDecl>,
std::vector<PartialDiagnosticAt>>		std::vector<PartialDiagnosticAt>>
DeviceDeferredDiags;		DeviceDeferredDiags;

/// A pair of a canonical FunctionDecl and a SourceLocation. When used as the		/// A pair of a canonical FunctionDecl and a SourceLocation. When used as the
[150 lines not shown]	public:
/// depending on FD and the current compilation settings.		/// depending on FD and the current compilation settings.
void maybeAddCUDAHostDeviceAttrs(FunctionDecl *FD,		void maybeAddCUDAHostDeviceAttrs(FunctionDecl *FD,
const LookupResult &Previous);		const LookupResult &Previous);

/// May add implicit CUDAConstantAttr attribute to VD, depending on VD		/// May add implicit CUDAConstantAttr attribute to VD, depending on VD
/// and current compilation settings.		/// and current compilation settings.
void MaybeAddCUDAConstantAttr(VarDecl *VD);		void MaybeAddCUDAConstantAttr(VarDecl *VD);

		/// May add CUDADeviceAttr attribute to VD depending on pragma pair
		/// force_cuda_device_globals depth.
		void MaybeAddCUDADeviceAttr(VarDecl *VD);

public:		public:
/// Check whether we're allowed to call Callee from the current context.		/// Check whether we're allowed to call Callee from the current context.
///		///
/// - If the call is never allowed in a semantically-correct program		/// - If the call is never allowed in a semantically-correct program
/// (CFP_Never), emits an error and returns false.		/// (CFP_Never), emits an error and returns false.
///		///
/// - If the call is allowed in semantically-correct programs, but only if		/// - If the call is allowed in semantically-correct programs, but only if
/// it's never codegen'ed (CFP_WrongSide), creates a deferred diagnostic to		/// it's never codegen'ed (CFP_WrongSide), creates a deferred diagnostic to
[837 lines not shown]
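
The doc comments on the new Push/Pop functions describe a plain nesting counter. A minimal standalone sketch of those semantics, assuming nothing about Sema beyond what the comments state (`ForceDepthCounter` is a hypothetical stand-in, not the Sema class):

```cpp
// Toy stand-in for the ForceCUDADeviceGlobalsDepth counter in the patch.
class ForceDepthCounter {
public:
  // Models "#pragma clang force_cuda_device_globals begin".
  void push() { ++depth_; }

  // Models "... end". Returns false if there was no matching begin,
  // so the caller can emit an error.
  bool pop() {
    if (depth_ == 0)
      return false;
    --depth_;
    return true;
  }

  // The pragma is in effect as long as at least one begin is open.
  bool active() const { return depth_ > 0; }

private:
  unsigned depth_ = 0;
};
```

Nested begin/end pairs therefore behave like parentheses: the pragma stays active until the outermost end is seen, and an extra end is reported through the false return value.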

clang/include/clang/Serialization/ASTBitCodes.h

[686 lines not shown]	enum ASTRecordTypes {
/// A table of skipped ranges within the preprocessing record.		/// A table of skipped ranges within the preprocessing record.
PPD_SKIPPED_RANGES = 63,		PPD_SKIPPED_RANGES = 63,

/// Record code for the Decls to be checked for deferred diags.		/// Record code for the Decls to be checked for deferred diags.
DECLS_TO_CHECK_FOR_DEFERRED_DIAGS = 64,		DECLS_TO_CHECK_FOR_DEFERRED_DIAGS = 64,

/// Record code for \#pragma float_control options.		/// Record code for \#pragma float_control options.
FLOAT_CONTROL_PRAGMA_OPTIONS = 65,		FLOAT_CONTROL_PRAGMA_OPTIONS = 65,

		/// Number of unmatched #pragma clang force_cuda_device_globals
		/// begin directives we've seen.
		CUDA_PRAGMA_FORCE_DEVICE_GLOBALS_DEPTH = 66,
};		};

/// Record types used within a source manager block.		/// Record types used within a source manager block.
enum SourceManagerRecordTypes {		enum SourceManagerRecordTypes {
/// Describes a source location entry (SLocEntry) for a		/// Describes a source location entry (SLocEntry) for a
/// file.		/// file.
SM_SLOC_FILE_ENTRY = 1,		SM_SLOC_FILE_ENTRY = 1,

[1,422 lines not shown]

clang/include/clang/Serialization/ASTReader.h

[827 lines not shown]	private:
///		///
/// Sema tracks these to emit warnings.		/// Sema tracks these to emit warnings.
SmallVector<uint64_t, 16> UnusedLocalTypedefNameCandidates;		SmallVector<uint64_t, 16> UnusedLocalTypedefNameCandidates;

/// Our current depth in #pragma cuda force_host_device begin/end		/// Our current depth in #pragma cuda force_host_device begin/end
/// macros.		/// macros.
unsigned ForceCUDAHostDeviceDepth = 0;		unsigned ForceCUDAHostDeviceDepth = 0;

		/// Our current depth in #pragma clang force_cuda_device_globals
		/// begin/end macros.
		unsigned ForceCUDADeviceGlobalsDepth = 0;

/// The IDs of the declarations Sema stores directly.		/// The IDs of the declarations Sema stores directly.
///		///
/// Sema tracks a few important decls, such as namespace std, directly.		/// Sema tracks a few important decls, such as namespace std, directly.
SmallVector<uint64_t, 4> SemaDeclRefs;		SmallVector<uint64_t, 4> SemaDeclRefs;

/// The IDs of the types ASTContext stores directly.		/// The IDs of the types ASTContext stores directly.
///		///
/// The AST context tracks a few important types, such as va_list, directly.		/// The AST context tracks a few important types, such as va_list, directly.
[1,465 lines not shown]

clang/lib/Parse/ParsePragma.cpp

[263 lines not shown]	PragmaForceCUDAHostDeviceHandler(Sema &Actions)
: PragmaHandler("force_cuda_host_device"), Actions(Actions) {}		: PragmaHandler("force_cuda_host_device"), Actions(Actions) {}
void HandlePragma(Preprocessor &PP, PragmaIntroducer Introducer,		void HandlePragma(Preprocessor &PP, PragmaIntroducer Introducer,
Token &FirstToken) override;		Token &FirstToken) override;

private:		private:
Sema &Actions;		Sema &Actions;
};		};

		struct PragmaForceCUDADeviceGlobalsHandler : public PragmaHandler {
		PragmaForceCUDADeviceGlobalsHandler(Sema &Actions)
		: PragmaHandler("force_cuda_device_globals"), Actions(Actions) {}
		void HandlePragma(Preprocessor &PP, PragmaIntroducer Introducer,
		Token &FirstToken) override;

		private:
		Sema &Actions;
		};

/// PragmaAttributeHandler - "\#pragma clang attribute ...".		/// PragmaAttributeHandler - "\#pragma clang attribute ...".
struct PragmaAttributeHandler : public PragmaHandler {		struct PragmaAttributeHandler : public PragmaHandler {
PragmaAttributeHandler(AttributeFactory &AttrFactory)		PragmaAttributeHandler(AttributeFactory &AttrFactory)
: PragmaHandler("attribute"), AttributesForPragmaAttribute(AttrFactory) {}		: PragmaHandler("attribute"), AttributesForPragmaAttribute(AttrFactory) {}
void HandlePragma(Preprocessor &PP, PragmaIntroducer Introducer,		void HandlePragma(Preprocessor &PP, PragmaIntroducer Introducer,
Token &FirstToken) override;		Token &FirstToken) override;

/// A pool of attributes that were parsed in \#pragma clang attribute.		/// A pool of attributes that were parsed in \#pragma clang attribute.
[104 lines not shown]	if (getLangOpts().MicrosoftExt) {
MSOptimize = std::make_unique<PragmaMSOptimizeHandler>();		MSOptimize = std::make_unique<PragmaMSOptimizeHandler>();
PP.AddPragmaHandler(MSOptimize.get());		PP.AddPragmaHandler(MSOptimize.get());
}		}

if (getLangOpts().CUDA) {		if (getLangOpts().CUDA) {
CUDAForceHostDeviceHandler =		CUDAForceHostDeviceHandler =
std::make_unique<PragmaForceCUDAHostDeviceHandler>(Actions);		std::make_unique<PragmaForceCUDAHostDeviceHandler>(Actions);
PP.AddPragmaHandler("clang", CUDAForceHostDeviceHandler.get());		PP.AddPragmaHandler("clang", CUDAForceHostDeviceHandler.get());
		CUDAForceDeviceGlobalsHandler =
		std::make_unique<PragmaForceCUDADeviceGlobalsHandler>(Actions);
		PP.AddPragmaHandler("clang", CUDAForceDeviceGlobalsHandler.get());
}		}

OptimizeHandler = std::make_unique<PragmaOptimizeHandler>(Actions);		OptimizeHandler = std::make_unique<PragmaOptimizeHandler>(Actions);
PP.AddPragmaHandler("clang", OptimizeHandler.get());		PP.AddPragmaHandler("clang", OptimizeHandler.get());

LoopHintHandler = std::make_unique<PragmaLoopHintHandler>();		LoopHintHandler = std::make_unique<PragmaLoopHintHandler>();
PP.AddPragmaHandler("clang", LoopHintHandler.get());		PP.AddPragmaHandler("clang", LoopHintHandler.get());

[88 lines not shown]	if (getLangOpts().MicrosoftExt) {
MSIntrinsic.reset();		MSIntrinsic.reset();
PP.RemovePragmaHandler(MSOptimize.get());		PP.RemovePragmaHandler(MSOptimize.get());
MSOptimize.reset();		MSOptimize.reset();
}		}

if (getLangOpts().CUDA) {		if (getLangOpts().CUDA) {
PP.RemovePragmaHandler("clang", CUDAForceHostDeviceHandler.get());		PP.RemovePragmaHandler("clang", CUDAForceHostDeviceHandler.get());
CUDAForceHostDeviceHandler.reset();		CUDAForceHostDeviceHandler.reset();
		PP.RemovePragmaHandler("clang", CUDAForceDeviceGlobalsHandler.get());
		CUDAForceDeviceGlobalsHandler.reset();
}		}

PP.RemovePragmaHandler("STDC", FPContractHandler.get());		PP.RemovePragmaHandler("STDC", FPContractHandler.get());
FPContractHandler.reset();		FPContractHandler.reset();

PP.RemovePragmaHandler("STDC", STDCFenvAccessHandler.get());		PP.RemovePragmaHandler("STDC", STDCFenvAccessHandler.get());
STDCFenvAccessHandler.reset();		STDCFenvAccessHandler.reset();

[2,983 lines not shown]

void PragmaForceCUDAHostDeviceHandler::HandlePragma(		void PragmaForceCUDAHostDeviceHandler::HandlePragma(
Preprocessor &PP, PragmaIntroducer Introducer, Token &Tok) {		Preprocessor &PP, PragmaIntroducer Introducer, Token &Tok) {
Token FirstTok = Tok;		Token FirstTok = Tok;

PP.Lex(Tok);		PP.Lex(Tok);
IdentifierInfo *Info = Tok.getIdentifierInfo();		IdentifierInfo *Info = Tok.getIdentifierInfo();
if (!Info || (!Info->isStr("begin") && !Info->isStr("end"))) {		if (!Info || (!Info->isStr("begin") && !Info->isStr("end"))) {
PP.Diag(FirstTok.getLocation(),		PP.Diag(FirstTok.getLocation(), diag::warn_pragma_force_cuda_bad_arg) << 0;
diag::warn_pragma_force_cuda_host_device_bad_arg);
return;		return;
}		}

if (Info->isStr("begin"))		if (Info->isStr("begin"))
Actions.PushForceCUDAHostDevice();		Actions.PushForceCUDAHostDevice();
else if (!Actions.PopForceCUDAHostDevice())		else if (!Actions.PopForceCUDAHostDevice())
PP.Diag(FirstTok.getLocation(),		PP.Diag(FirstTok.getLocation(),
diag::err_pragma_cannot_end_force_cuda_host_device);		diag::err_pragma_cannot_end_force_cuda_host_device);

PP.Lex(Tok);		PP.Lex(Tok);
if (!Tok.is(tok::eod))		if (!Tok.is(tok::eod))
		PP.Diag(FirstTok.getLocation(), diag::warn_pragma_force_cuda_bad_arg) << 0;
		}

		void PragmaForceCUDADeviceGlobalsHandler::HandlePragma(
		Preprocessor &PP, PragmaIntroducer Introducer, Token &Tok) {
		Token FirstTok = Tok;

		PP.Lex(Tok);
		IdentifierInfo *Info = Tok.getIdentifierInfo();
		if (!Info || (!Info->isStr("begin") && !Info->isStr("end"))) {
		PP.Diag(FirstTok.getLocation(), diag::warn_pragma_force_cuda_bad_arg) << 1;
		return;
		}

		if (Info->isStr("begin"))
		Actions.PushForceCUDADeviceGlobals();
		else if (!Actions.PopForceCUDADeviceGlobals())
PP.Diag(FirstTok.getLocation(),		PP.Diag(FirstTok.getLocation(),
diag::warn_pragma_force_cuda_host_device_bad_arg);		diag::err_pragma_cannot_end_force_cuda_device_globals);

		PP.Lex(Tok);
		if (Tok.isNot(tok::eod))
		PP.Diag(FirstTok.getLocation(), diag::warn_pragma_force_cuda_bad_arg) << 1;
}		}

/// Handle the #pragma clang attribute directive.		/// Handle the #pragma clang attribute directive.
///		///
/// The syntax is:		/// The syntax is:
/// \code		/// \code
/// #pragma clang attribute push (attribute, subject-set)		/// #pragma clang attribute push (attribute, subject-set)
/// #pragma clang attribute push		/// #pragma clang attribute push
[198 lines not shown]

clang/lib/Sema/SemaCUDA.cpp

	[33 lines not shown]
	bool Sema::PopForceCUDAHostDevice() {			bool Sema::PopForceCUDAHostDevice() {
	assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");			assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
	if (ForceCUDAHostDeviceDepth == 0)			if (ForceCUDAHostDeviceDepth == 0)
	return false;			return false;
	ForceCUDAHostDeviceDepth--;			ForceCUDAHostDeviceDepth--;
	return true;			return true;
	}			}

				void Sema::PushForceCUDADeviceGlobals() {
				assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
				ForceCUDADeviceGlobalsDepth++;
				}

				bool Sema::PopForceCUDADeviceGlobals() {
				assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
				if (ForceCUDADeviceGlobalsDepth == 0)
				return false;
				ForceCUDADeviceGlobalsDepth--;
				return true;
				}

	ExprResult Sema::ActOnCUDAExecConfigExpr(Scope *S, SourceLocation LLLLoc,			ExprResult Sema::ActOnCUDAExecConfigExpr(Scope *S, SourceLocation LLLLoc,
	MultiExprArg ExecConfig,			MultiExprArg ExecConfig,
	SourceLocation GGGLoc) {			SourceLocation GGGLoc) {
	FunctionDecl *ConfigDecl = Context.getcudaConfigureCallDecl();			FunctionDecl *ConfigDecl = Context.getcudaConfigureCallDecl();
	if (!ConfigDecl)			if (!ConfigDecl)
	return ExprError(Diag(LLLLoc, diag::err_undeclared_var_use)			return ExprError(Diag(LLLLoc, diag::err_undeclared_var_use)
	<< getCudaConfigureFuncName());			<< getCudaConfigureFuncName());
	QualType ConfigQTy = ConfigDecl->getType();			QualType ConfigQTy = ConfigDecl->getType();
	[587 lines not shown]

	void Sema::MaybeAddCUDAConstantAttr(VarDecl *VD) {			void Sema::MaybeAddCUDAConstantAttr(VarDecl *VD) {
	if (getLangOpts().CUDAIsDevice && VD->isConstexpr() &&			if (getLangOpts().CUDAIsDevice && VD->isConstexpr() &&
	(VD->isFileVarDecl() || VD->isStaticDataMember())) {			(VD->isFileVarDecl() || VD->isStaticDataMember())) {
	VD->addAttr(CUDAConstantAttr::CreateImplicit(getASTContext()));			VD->addAttr(CUDAConstantAttr::CreateImplicit(getASTContext()));
	}			}
	}			}

				/// All global variables are added with __device__ attribute when
				/// ForceCUDADeviceGlobalsDepth > 0 (corresponding to code within a
				/// #pragma clang force_cuda_device_globals begin/end pair).
				void Sema::MaybeAddCUDADeviceAttr(VarDecl *VD) {
				assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
				// Avoid non-global variables.
				if (!VD->hasGlobalStorage())
				return;

				// Avoid system globals.
				ASTContext &Context = getASTContext();
				const SourceManager &SM = Context.getSourceManager();
				FullSourceLoc Loc = Context.getFullLoc(VD->getBeginLoc());
				if (SM.isInSystemHeader(Loc))
				return;
				tra: Is there a particular reason not to apply the pragma to the system headers? Presumably we do want them to work. E.g. what if the header has a function which relies on a local static var? You may end up using such a function in the code that does have the pragma applied, but it will fail to link. It may even fail to compile, as the function and the variable would end up on different sides. I think it would be more consistent to apply the pragma to everything the user put within its boundaries.
				This brings another interesting question -- if system includes are affected by this pragma, how will you handle the files pre-included by the compiler? I guess it's easy enough to work around with `-nocudainc` and including all relevant headers from within the pragma.
				I'm not sure what would be the right way to handle this. "pragma affects everything" looks more sound, but is a bit more hassle to use. "pragma does not affect system headers" would have some corner cases as we'll only have a subset of system headers available for the code compiled within pragma boundaries. @rsmith -- Do you have any suggestions?
				palves: Hmm, applying the pragma to system headers too would break the GDB testsuite use case I detailed, since we issue the #pragma before system includes. We do:
				    clang/hipcc -include force-globals.h unmodified-testcase.c
				And the #pragma is in force-globals.h. The goal is to not modify unmodified-testcase.c. And that file includes system headers. System headers contain global variable declarations which we can't mess with.

				if (ForceCUDADeviceGlobalsDepth > 0)
				VD->addAttr(CUDADeviceAttr::CreateImplicit(VD->getASTContext()));
				}

	Sema::SemaDiagnosticBuilder Sema::CUDADiagIfDeviceCode(SourceLocation Loc,			Sema::SemaDiagnosticBuilder Sema::CUDADiagIfDeviceCode(SourceLocation Loc,
	unsigned DiagID) {			unsigned DiagID) {
	assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");			assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
	SemaDiagnosticBuilder::Kind DiagKind = [&] {			SemaDiagnosticBuilder::Kind DiagKind = [&] {
	if (!isa<FunctionDecl>(CurContext))			if (!isa<FunctionDecl>(CurContext))
	return SemaDiagnosticBuilder::K_Nop;			return SemaDiagnosticBuilder::K_Nop;
	switch (CurrentCUDATarget()) {			switch (CurrentCUDATarget()) {
	case CFT_Global:			case CFT_Global:
	[221 lines not shown]
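
The disagreement between tra and palves in the inline comments above comes down to one check in `MaybeAddCUDADeviceAttr`. The following sketch restates that decision as a pure function so the two policies can be compared side by side. `VarInfo`, `wouldForceDevice` and the `skipSystemHeaders` knob are all hypothetical; the patch as posted has no such option and always skips system headers.

```cpp
// Hypothetical flattened view of the VarDecl properties the patch inspects.
struct VarInfo {
  bool hasGlobalStorage;
  bool inSystemHeader;
};

// Mirrors the order of checks in the patch's MaybeAddCUDADeviceAttr.
// skipSystemHeaders == true models the patch as posted (palves' use case);
// false models the "pragma affects everything" alternative raised by tra.
bool wouldForceDevice(const VarInfo &var, unsigned pragmaDepth,
                      bool skipSystemHeaders = true) {
  if (!var.hasGlobalStorage)
    return false;  // locals are never touched
  if (skipSystemHeaders && var.inSystemHeader)
    return false;  // globals declared in system headers are left alone
  return pragmaDepth > 0;  // only inside a begin/end region
}
```

Under the posted behaviour, a global declared in a system header that is included after `-include force-globals.h` stays a host variable; with the alternative policy it would become `__device__` as well, which is the case palves says the GDB test harness cannot tolerate.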

clang/lib/Sema/SemaDecl.cpp

[7,241 lines not shown]	if (getLangOpts().OpenCL) {
deduceOpenCLAddressSpace(NewVD);		deduceOpenCLAddressSpace(NewVD);

diagnoseOpenCLTypes(S, *this, D, DC, NewVD->getType());		diagnoseOpenCLTypes(S, *this, D, DC, NewVD->getType());
}		}

// Handle attributes prior to checking for duplicates in MergeVarDecl		// Handle attributes prior to checking for duplicates in MergeVarDecl
ProcessDeclAttributes(S, NewVD, D);		ProcessDeclAttributes(S, NewVD, D);

		if (getLangOpts().CUDA) {
		tra: Please reformat the code as suggested.
		MaybeAddCUDADeviceAttr(NewVD);
		}

if (getLangOpts().CUDA || getLangOpts().OpenMPIsDevice ||		if (getLangOpts().CUDA || getLangOpts().OpenMPIsDevice ||
getLangOpts().SYCLIsDevice) {		getLangOpts().SYCLIsDevice) {
if (EmitTLSUnsupportedError &&		if (EmitTLSUnsupportedError &&
((getLangOpts().CUDA && DeclAttrsMatchCUDAMode(getLangOpts(), NewVD)) ||		((getLangOpts().CUDA && DeclAttrsMatchCUDAMode(getLangOpts(), NewVD)) ||
(getLangOpts().OpenMPIsDevice &&		(getLangOpts().OpenMPIsDevice &&
OMPDeclareTargetDeclAttr::isDeclareTargetDeclaration(NewVD))))		OMPDeclareTargetDeclAttr::isDeclareTargetDeclaration(NewVD))))
Diag(D.getDeclSpec().getThreadStorageClassSpecLoc(),		Diag(D.getDeclSpec().getThreadStorageClassSpecLoc(),
diag::err_thread_unsupported);		diag::err_thread_unsupported);
[11,175 lines not shown]

clang/lib/Sema/SemaTemplateInstantiateDecl.cpp

[5,025 lines not shown]	void Sema::BuildVariableInstantiation(
} else if (OldVar->isOutOfLine())		} else if (OldVar->isOutOfLine())
NewVar->setLexicalDeclContext(OldVar->getLexicalDeclContext());		NewVar->setLexicalDeclContext(OldVar->getLexicalDeclContext());
NewVar->setTSCSpec(OldVar->getTSCSpec());		NewVar->setTSCSpec(OldVar->getTSCSpec());
NewVar->setInitStyle(OldVar->getInitStyle());		NewVar->setInitStyle(OldVar->getInitStyle());
NewVar->setCXXForRangeDecl(OldVar->isCXXForRangeDecl());		NewVar->setCXXForRangeDecl(OldVar->isCXXForRangeDecl());
NewVar->setObjCForDecl(OldVar->isObjCForDecl());		NewVar->setObjCForDecl(OldVar->isObjCForDecl());
NewVar->setConstexpr(OldVar->isConstexpr());		NewVar->setConstexpr(OldVar->isConstexpr());
MaybeAddCUDAConstantAttr(NewVar);		MaybeAddCUDAConstantAttr(NewVar);
		if (getLangOpts().CUDA)
		MaybeAddCUDADeviceAttr(NewVar);
NewVar->setInitCapture(OldVar->isInitCapture());		NewVar->setInitCapture(OldVar->isInitCapture());
NewVar->setPreviousDeclInSameBlockScope(		NewVar->setPreviousDeclInSameBlockScope(
OldVar->isPreviousDeclInSameBlockScope());		OldVar->isPreviousDeclInSameBlockScope());
NewVar->setAccess(OldVar->getAccess());		NewVar->setAccess(OldVar->getAccess());

if (!OldVar->isStaticDataMember()) {		if (!OldVar->isStaticDataMember()) {
if (OldVar->isUsed(false))		if (OldVar->isUsed(false))
NewVar->setIsUsed();		NewVar->setIsUsed();
[1,164 lines not shown]

clang/lib/Serialization/ASTReader.cpp

[3,780 lines not shown]	while (true) {
case CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH:		case CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH:
if (Record.size() != 1) {		if (Record.size() != 1) {
Error("invalid cuda pragma options record");		Error("invalid cuda pragma options record");
return Failure;		return Failure;
}		}
ForceCUDAHostDeviceDepth = Record[0];		ForceCUDAHostDeviceDepth = Record[0];
break;		break;

		case CUDA_PRAGMA_FORCE_DEVICE_GLOBALS_DEPTH:
		if (Record.size() != 1) {
		Error("invalid cuda pragma options record");
		return Failure;
		}
		ForceCUDADeviceGlobalsDepth = Record[0];
		break;

case ALIGN_PACK_PRAGMA_OPTIONS: {		case ALIGN_PACK_PRAGMA_OPTIONS: {
if (Record.size() < 3) {		if (Record.size() < 3) {
Error("invalid pragma pack record");		Error("invalid pragma pack record");
return Failure;		return Failure;
}		}
PragmaAlignPackCurrentValue = ReadAlignPackInfo(Record[0]);		PragmaAlignPackCurrentValue = ReadAlignPackInfo(Record[0]);
PragmaAlignPackCurrentLocation = ReadSourceLocation(F, Record[1]);		PragmaAlignPackCurrentLocation = ReadSourceLocation(F, Record[1]);
unsigned NumStackEntries = Record[2];		unsigned NumStackEntries = Record[2];
[4,125 lines not shown]	if (PragmaMSStructState != -1)
SemaObj->ActOnPragmaMSStruct((PragmaMSStructKind)PragmaMSStructState);		SemaObj->ActOnPragmaMSStruct((PragmaMSStructKind)PragmaMSStructState);
if (PointersToMembersPragmaLocation.isValid()) {		if (PointersToMembersPragmaLocation.isValid()) {
SemaObj->ActOnPragmaMSPointersToMembers(		SemaObj->ActOnPragmaMSPointersToMembers(
(LangOptions::PragmaMSPointersToMembersKind)		(LangOptions::PragmaMSPointersToMembersKind)
PragmaMSPointersToMembersState,		PragmaMSPointersToMembersState,
PointersToMembersPragmaLocation);		PointersToMembersPragmaLocation);
}		}
SemaObj->ForceCUDAHostDeviceDepth = ForceCUDAHostDeviceDepth;		SemaObj->ForceCUDAHostDeviceDepth = ForceCUDAHostDeviceDepth;
		SemaObj->ForceCUDADeviceGlobalsDepth = ForceCUDADeviceGlobalsDepth;

if (PragmaAlignPackCurrentValue) {		if (PragmaAlignPackCurrentValue) {
// The bottom of the stack might have a default value. It must be adjusted		// The bottom of the stack might have a default value. It must be adjusted
// to the current value to ensure that the packing state is preserved after		// to the current value to ensure that the packing state is preserved after
// popping entries that were included/imported from a PCH/module.		// popping entries that were included/imported from a PCH/module.
bool DropFirst = false;		bool DropFirst = false;
if (!PragmaAlignPackStack.empty() &&		if (!PragmaAlignPackStack.empty() &&
PragmaAlignPackStack.front().Location.isInvalid()) {		PragmaAlignPackStack.front().Location.isInvalid()) {

clang/lib/Serialization/ASTWriter.cpp

[...]
#define RECORD(X) EmitRecordID(X, #X, Stream, Record)		#define RECORD(X) EmitRecordID(X, #X, Stream, Record)
RECORD(OPTIMIZE_PRAGMA_OPTIONS);		RECORD(OPTIMIZE_PRAGMA_OPTIONS);
RECORD(MSSTRUCT_PRAGMA_OPTIONS);		RECORD(MSSTRUCT_PRAGMA_OPTIONS);
RECORD(POINTERS_TO_MEMBERS_PRAGMA_OPTIONS);		RECORD(POINTERS_TO_MEMBERS_PRAGMA_OPTIONS);
RECORD(UNUSED_LOCAL_TYPEDEF_NAME_CANDIDATES);		RECORD(UNUSED_LOCAL_TYPEDEF_NAME_CANDIDATES);
RECORD(DELETE_EXPRS_TO_ANALYZE);		RECORD(DELETE_EXPRS_TO_ANALYZE);
RECORD(CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH);		RECORD(CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH);
RECORD(PP_CONDITIONAL_STACK);		RECORD(PP_CONDITIONAL_STACK);
RECORD(DECLS_TO_CHECK_FOR_DEFERRED_DIAGS);		RECORD(DECLS_TO_CHECK_FOR_DEFERRED_DIAGS);
		RECORD(CUDA_PRAGMA_FORCE_DEVICE_GLOBALS_DEPTH);

// SourceManager Block.		// SourceManager Block.
BLOCK(SOURCE_MANAGER_BLOCK);		BLOCK(SOURCE_MANAGER_BLOCK);
RECORD(SM_SLOC_FILE_ENTRY);		RECORD(SM_SLOC_FILE_ENTRY);
RECORD(SM_SLOC_BUFFER_ENTRY);		RECORD(SM_SLOC_BUFFER_ENTRY);
RECORD(SM_SLOC_BUFFER_BLOB);		RECORD(SM_SLOC_BUFFER_BLOB);
RECORD(SM_SLOC_BUFFER_BLOB_COMPRESSED);		RECORD(SM_SLOC_BUFFER_BLOB_COMPRESSED);
RECORD(SM_SLOC_EXPANSION_ENTRY);		RECORD(SM_SLOC_EXPANSION_ENTRY);
[...]
void ASTWriter::WriteOpenCLExtensionDecls(Sema &SemaRef) {		void ASTWriter::WriteOpenCLExtensionDecls(Sema &SemaRef) {
Stream.EmitRecord(OPENCL_EXTENSION_DECLS, Record);		Stream.EmitRecord(OPENCL_EXTENSION_DECLS, Record);
}		}

void ASTWriter::WriteCUDAPragmas(Sema &SemaRef) {		void ASTWriter::WriteCUDAPragmas(Sema &SemaRef) {
if (SemaRef.ForceCUDAHostDeviceDepth > 0) {		if (SemaRef.ForceCUDAHostDeviceDepth > 0) {
RecordData::value_type Record[] = {SemaRef.ForceCUDAHostDeviceDepth};		RecordData::value_type Record[] = {SemaRef.ForceCUDAHostDeviceDepth};
Stream.EmitRecord(CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH, Record);		Stream.EmitRecord(CUDA_PRAGMA_FORCE_HOST_DEVICE_DEPTH, Record);
}		}
		if (SemaRef.ForceCUDADeviceGlobalsDepth > 0) {
		RecordData::value_type Record[] = {SemaRef.ForceCUDADeviceGlobalsDepth};
		Stream.EmitRecord(CUDA_PRAGMA_FORCE_DEVICE_GLOBALS_DEPTH, Record);
		}
}		}

void ASTWriter::WriteObjCCategories() {		void ASTWriter::WriteObjCCategories() {
SmallVector<ObjCCategoriesInfo, 2> CategoriesMap;		SmallVector<ObjCCategoriesInfo, 2> CategoriesMap;
RecordData Categories;		RecordData Categories;

for (unsigned I = 0, N = ObjCClassesWithCategories.size(); I != N; ++I) {		for (unsigned I = 0, N = ObjCClassesWithCategories.size(); I != N; ++I) {
unsigned Size = 0;		unsigned Size = 0;

clang/test/PCH/pragma-cuda-force-device-globals.cu

This file was added.

				// RUN: %clang_cc1 -triple nvptx -fcuda-is-device -emit-pch %s -o %t
				// RUN: %clang_cc1 -triple nvptx -verify -verify-ignore-unexpected=note \
				// RUN: -fcuda-is-device -include-pch %t -S -o /dev/null %s
				//
				// This test checks that serialization code maintains push count in PCH.
				// Also, non-zero count at the end of TU is okay.

				#ifndef HEADER
				#define HEADER

				static int global_before_pragma = 1;

				#pragma clang force_cuda_device_globals begin
				#pragma clang force_cuda_device_globals begin
				#pragma clang force_cuda_device_globals end

				static int global1 = 1;
				static const int const_global = 1;

				#else

				static int global2 = 1;

				#pragma clang force_cuda_device_globals end

				static int global_host_only = 1;

				__attribute__((device)) void device() {
				int g = 0;
				g += global_before_pragma; // expected-error {{reference to __host__ variable 'global_before_pragma' in __device__ function}}
				g += global1;
				g += const_global;
				g += global2;
				g += global_host_only; // expected-error {{reference to __host__ variable 'global_host_only' in __device__ function}}
				}

				#endif
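
The counting behaviour this PCH test relies on can be sketched outside of clang. The following is a minimal model, not the patch's implementation: `PragmaDepthModel`, `writeDepthRecord`, and `readDepthRecord` are hypothetical names made up for illustration. It assumes only what the diff and tests show: `begin` increments a depth, `end` decrements it and is diagnosed when there is no matching `begin`, the writer emits a record only for a non-zero depth, and a missing record is read back as depth zero.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the pragma's nesting counter; this is not clang code.
struct PragmaDepthModel {
  unsigned Depth = 0;

  // Models `#pragma clang force_cuda_device_globals begin`.
  void begin() { ++Depth; }

  // Models `#pragma clang force_cuda_device_globals end`.
  // Returns false when there is no matching `begin`, which is the case the
  // SemaCUDA test expects a diagnostic for.
  bool end() {
    if (Depth == 0)
      return false;
    --Depth;
    return true;
  }

  // Globals declared while the depth is non-zero would be forced to device.
  bool active() const { return Depth > 0; }
};

// Writer side of the model: emit a one-element record only if depth > 0.
std::vector<std::uint64_t> writeDepthRecord(const PragmaDepthModel &M) {
  std::vector<std::uint64_t> Record;
  if (M.Depth > 0)
    Record.push_back(M.Depth);
  return Record;
}

// Reader side of the model: an absent record means the depth is zero.
PragmaDepthModel readDepthRecord(const std::vector<std::uint64_t> &Record) {
  PragmaDepthModel M;
  if (!Record.empty())
    M.Depth = static_cast<unsigned>(Record[0]);
  return M;
}
```

In terms of the test above, the header performs `begin`, `begin`, `end`, so a depth of 1 is stored in the PCH. The including file restores that depth, which is why `global2` is still treated as a device variable until the final `end`.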

clang/test/SemaCUDA/force-device-globals.cu

This file was added.

				// RUN: %clang_cc1 -std=c++14 %s -o - -triple nvptx64-nvidia-cuda \
				// RUN: -fcuda-is-device -verify -fsyntax-only
				#include "Inputs/cuda.h"

				static int global_before_pragma = 1;

				#pragma clang force_cuda_device_globals end
				// expected-error@-1 {{force_cuda_device_globals end pragma without matching force_cuda_device_globals begin}}

				#pragma clang force_cuda_device_globals begin

				static int global = 1;
				static const int const_global = 1;

				struct S {
				static const int static_const_field = 1;
				static int static_field;
				};

				namespace NS1 {
				namespace NS2 {
				int ns_global = 1;
				}

				struct S {
				static const int ns1_static_const_field = 1;
				static int ns1_static_field;
				};
				}

				int S::static_field = 1;
				int NS1::S::ns1_static_field = 1;

				#pragma clang force_cuda_device_globals end

				__device__ void device_func(int &a) {
				static constexpr int local_static_constexpr = 1;
				constexpr int local_constexpr = 2;
				static int local_static = 3;
				a += local_static_constexpr;
				a += local_constexpr;
				a += local_static;
				// expected-error@-1 {{reference to __host__ variable 'local_static' in __device__ function}}
				a += global_before_pragma;
				// expected-error@-1 {{reference to __host__ variable 'global_before_pragma' in __device__ function}}
				a += global;
				a += const_global;
				a += S::static_const_field;
				a += S::static_field;
				tra: You may also want to test local static vars. I guess those should remain on the host in the explicitly `__host__` functions and become `__device__` in unattributed functions with the pragma. Another case to test would be implicitly `host/device` functions, e.g. constexpr ones. I guess those should already place the local static variables on the correct side of the compilation, depending on where we compile them.
				ashi1 (author): Thanks, I will add checking for local static vars and constexpr vars on host, device, and host/device functions.
				a += NS1::S::ns1_static_const_field;
				a += NS1::S::ns1_static_field;
				a += NS1::NS2::ns_global;
				}

				void host_func(int &a) {
				static constexpr int local_static_constexpr = 1;
				constexpr int local_constexpr = 1;
				tra: So, technically, only `global_before_pragma` is a `__device__` variable now. Everything else we should not be allowed to read from. At the moment clang does allow reading the variables (or, rather, their shadows), but it should not be the case. At the very least I'd add a comment about that so it's clear that accessing device vars from a host function is not OK.
				ashi1 (author): Thanks, I will add a comment so that it's clear we cannot access device vars here.
				static int local_static = 2;
				a += local_static_constexpr;
				a += local_constexpr;
				a += local_static;
				a += global_before_pragma;
				// Note: variables below should not be allowed to read from host.
				// Clang allows reading their shadows, but that is not OK.
				a += global;
				a += const_global;
				a += S::static_const_field;
				a += S::static_field;
				a += NS1::S::ns1_static_const_field;
				a += NS1::S::ns1_static_field;
				a += NS1::NS2::ns_global;
				}

				__host__ __device__ void host_device_func(int &a) {
				static constexpr int local_static_constexpr = 1;
				constexpr int local_constexpr = 1;
				static int local_static = 2;
				a += local_static_constexpr;
				a += local_constexpr;
				a += local_static;
				// expected-error@-1 {{reference to __host__ variable 'local_static' in __host__ __device__ function}}
				a += global_before_pragma;
				// expected-error@-1 {{reference to __host__ variable 'global_before_pragma' in __host__ __device__ function}}
				a += global;
				a += const_global;
				a += S::static_const_field;
				a += S::static_field;
				a += NS1::S::ns1_static_const_field;
				a += NS1::S::ns1_static_field;
				a += NS1::NS2::ns_global;
				}

				#pragma clang force_cuda_device_globals begin foo
				// expected-warning@-1 {{incorrect use of #pragma clang force_cuda_device_globals begin\|end}}


Diff 331110
