This is an archive of the discontinued LLVM Phabricator instance.

[Cuda] Use fallback method to mangle externalized decls if no CUID given
ClosedPublic

Authored by jhuber6 on May 18 2022, 9:49 AM.

Details

Summary

CUDA requires that static variables be visible to the host when
offloading. However, The standard semantics of a stiatc variable dictate
that it should not be visible outside of the current file. In order to
access it from the host we need to perform "externalization" on the
static variable on the device. This requires generating a semi-unique
name that can be affixed to the variable as to not cause linker errors.

This is currently done using the CUID functionality, an MD5 hash value
set up by the clang driver. This allows us to achieve is mostly unique
ID that is unique even between multiple compilations of the same file.
However, this is not always availible. Instead, this patch uses the
unique ID from the file to generate a unique symbol name. This will
create a unique name that is consistent between the host and device side
compilations without requiring the CUID to be entered by the driver. The
one downside to this is that we are no longer stable under multiple
compilations of the same file. However, this is a very niche use-case
and is not supported by Nvidia's CUDA compiler so it likely to be good
enough.

Diff Detail

Event Timeline

jhuber6 created this revision.May 18 2022, 9:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 18 2022, 9:49 AM
jhuber6 requested review of this revision.May 18 2022, 9:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 18 2022, 9:49 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
tra added a comment.May 23 2022, 2:30 PM

The one downside to this is that we are no longer stable under multiple compilations of the same file.

This is a moderately serious issue. Some users care about the build reproducibility. Recompiling the same sources and getting different results will trigger all sorts of red flags that would need to be investigated in order to make sure the build is not broken.

However, this is a very niche use-case and is not supported by Nvidia's CUDA compiler so it likely to be good enough.

The fact that NVCC didn't always generate the same output was an issue when we were using it for CUDA compilation.
In general, "not supported by NVCC" is not quite applicable here, IMO. The goal here is to make clang work correctly.

This is a moderately serious issue. Some users care about the build reproducibility. Recompiling the same sources and getting different results will trigger all sorts of red flags that would need to be investigated in order to make sure the build is not broken.

I mean this in the context that the following will not work

clang a.c -c -o a-0.o // Has some externalized static variable.
clang a.c -c -o a-1.o
clang a-0.o a-1.o // Redefined symbol error

The build will be perfectly reproducible, the ID we append here is just <var>__static__<file id><device id><line number> which should be the same in a static source tree. Though it might be annoying that the line number may change on white-space changes, so we could do without the line number at the end if that's an issue.

However, this is a very niche use-case and is not supported by Nvidia's CUDA compiler so it likely to be good enough.

The fact that NVCC didn't always generate the same output was an issue when we were using it for CUDA compilation.
In general, "not supported by NVCC" is not quite applicable here, IMO. The goal here is to make clang work correctly.

I feel like linking a file with itself is pretty uncommon, but in order to support that we'd definitely need the CUID method so we can pass it to both the host and device. I'm personally fine with this and the CUID living together so if for whatever reason there's a symbol clash, the user can specify a CUID to make it go away. We also discussed the problem of non-static source trees which neither this nor the current CUID would solve. As far as I can tell, this method would work fine for like 99.99% of codes, but getting that last 0.01% would require something like generating a UUID for each compilation job, which requires intervention from the driver to set up offloading compilation properly. So I'm not sure if it's the best trade-off.

tra added a comment.May 23 2022, 3:25 PM

clang a.c -c -o a-0.o Has some externalized static variable.
clang a.c -c -o a-1.o
clang a-0.o a-1.o
Redefined symbol error

Ah. OK. This is a bit less of a concern. As long as we take compiler options into account it should work.

There are use cases when the same file is recompiled with different -DFOO macros. And that *is* supported by NVCC and, I think it is (or used to be?) broken in clang.
I think we currently end up renaming the source files for each compilation variant.

The fact that NVCC didn't always generate the same output was an issue when we were using it for CUDA compilation.
In general, "not supported by NVCC" is not quite applicable here, IMO. The goal here is to make clang work correctly.

I feel like linking a file with itself is pretty uncommon, but in order to support that we'd definitely need the CUID method so we can pass it to both the host and device. I'm personally fine with this and the CUID living together so if for whatever reason there's a symbol clash, the user can specify a CUID to make it go away.

I agree that compiling and linking the same file is probably not very common and I can't think of practical use case for it.
That said, I would consider compiling the same source with different preprocessor options to be a legitimate use case that we should support.
Explicitly passing cuid would work as a workaround in those cases, so it's not a major issue if we can't make it work out of the box without explicit cuid.

We also discussed the problem of non-static source trees which neither this nor the current CUID would solve. As far as I can tell, this method would work fine for like 99.99% of codes, but getting that last 0.01% would require something like generating a UUID for each compilation job, which requires intervention from the driver to set up offloading compilation properly. So I'm not sure if it's the best trade-off.

Acknowledged. Creating globally-unique, yet consistent across all sub-compilations ID based only the info available to individual subcompilation is probably hard-to-impossible to do.

clang/lib/CodeGen/CodeGenModule.cpp
6845

Considering that the file name may have arbitrary symbols in it, that may result in a symbol name that we can't really use.
I'd print a hash of the (finename+device+line) -- that would guarantee that we know there are no funky characters in the suffix.

I'm also not sure if the line number makes any difference here. There's no need to differentiate between symbols within the same TU within the same compilation, only across different compilations and for that filename+device should be sufficient.

That said, I would consider compiling the same source with different preprocessor options to be a legitimate use case that we should support.
Explicitly passing cuid would work as a workaround in those cases, so it's not a major issue if we can't make it work out of the box without explicit cuid.

I could try to find a way to include the preprocessor options. It might be a bit more difficult to make it stable however (unless we just append the whole string).

clang/lib/CodeGen/CodeGenModule.cpp
6845

What we're using here is basically just the file's integer index on a Unix system, so it'll just be a hex number at the end of the day, the filename will look something like

foo__static__18a43f325834

Yeah, I can remove the line number. I'm not a fan of how it makes the symbol names change if you just add some white-space.

jhuber6 updated this revision to Diff 431666.May 24 2022, 7:11 AM

Removing use of the line number, instead replacing it with an 8 character wide hash of the -D options passed to the front-end. This should make it sufficiently unique for users compiling the same file with different options. The format now looks like <var>__<qualifier>__<file-id><device-id>_<hash>.

yaxunl added inline comments.May 24 2022, 9:00 AM
clang/lib/CodeGen/CodeGenModule.cpp
6840–6841

Are these options always the same for device and host compilation?

If not, this may not work since host and device compilation will have different names for the same variable.

jhuber6 added inline comments.May 24 2022, 9:02 AM
clang/lib/CodeGen/CodeGenModule.cpp
6840–6841

This should only be definitions passed to the front-end via -D. There's a handful of these that the Clang driver will pass in, e.g. -D__GCC_HAVE_DWARF2_CFI_ASM=1, but as far as I know it's consistent between host and device compilations.

yaxunl added inline comments.May 24 2022, 9:20 AM
clang/lib/CodeGen/CodeGenModule.cpp
6840–6841

Can we add a comment here saying that this requires that device offloading toolchain does not add preprocessor arguments to clang? This is a hidden requirement.

jhuber6 added inline comments.May 24 2022, 9:45 AM
clang/lib/CodeGen/CodeGenModule.cpp
6840–6841

Yeah, it's a bit of a hidden requirement. Most toolchains add specific macros via the regular macro builder so this has worked for the configurations I've tested. I'll add a comment for it so it's at least mentioned more explicitly.

jhuber6 updated this revision to Diff 431702.May 24 2022, 9:47 AM

Adding extra commentto mention the hidden requirement that the driver shuold not define a different -D option for the host and device.

tra added inline comments.May 24 2022, 10:59 AM
clang/lib/CodeGen/CodeGenModule.cpp
6831

However, [CUID] is not always availible.

The question is -- when and why is it not available? I'm getting the feeling that we're fixing the consequence here, not the root cause.

Is there a reason we can't make sure that the driver always generates a cuid for offload subcompilations and error out if it's needed but is not provided?
That would make this fallback unnecessary and would be a more robust approach in general.

jhuber6 added inline comments.May 24 2022, 11:29 AM
clang/lib/CodeGen/CodeGenModule.cpp
6831

So, I'm more in favor of this approach because it doesn't require extra intervention from the compiler driver, this makes it less convoluted to do split compilation since we don't have an extra arguments. The way I would prefer it, is that we do this implicitly by default without requiring extra thought from the driver, but if it's not good enough we can support the manual CUID approach to let the user override it. I think this is a cleaner implementation, and is mostly coming from my support for CUDA in the new driver which currently doesn't implement the CUID as we do with the old driver. Generally I'd prefer things to behave independent of the driver, so we can consider host and device compilation more separately.

tra added inline comments.May 24 2022, 1:09 PM
clang/lib/CodeGen/CodeGenModule.cpp
6831

So, I'm more in favor of this approach because it doesn't require extra intervention from the compiler driver

We need the driver intervention for any cc1 compilations anyways, so this does not buy us anything. While you can run a sub-compilation manually with handcrafted cc1 flags, that's not a practical use case. The driver is the ultimate source of cc1 flags.

this makes it less convoluted to do split compilation since we don't have an extra arguments.

For CUDA/HIP sub-compilation should be done with clang --cuda-host-only/--cuda-device-only. Whether the driver supplies yet another cc1 option, --cuid=... makes no difference to the user launching such sub-compilation.

The way I would prefer it, is that we do this implicitly by default without requiring extra thought from the driver, but if it's not good enough we can support the manual CUID approach to let the user override it.

I agree that we can come up with something that will almost always work. Possibly even good enough for all practical purposes. However, if a better solution would take comparable effort, it would make sense to do things right and avoid adding technical debt.

On the other hand, requiring the driver to supply identical cuid to all sub-compilations appears to be a better approach to me:

  • Driver is the best place to do it, functionally. Driver has access to all user-provided inputs and is in position to guarantee that all subcompilations get the same cuid.
  • Calculating CUID in the driver keeps relevant logic in one place. Doing it in the driver *and* in the codegen
  • Figuring out what inputs are relevant for calculation of CUID in cc1 invocation is error prone. E.g. we have to guess which cc1 options are relevant or not and is the driver would pass a macro to one subcompilation but not to another, we would end up generating mismatching CUID and would not have any way to notice that. Even when that's not the case, we would need to guess which flags, supplied by the driver, are relevant. At CC1 level that may be somewhat complicated as top-level options may expand to quite a few more cc1 options. E.g. we'll need to take into account -std=..., --cuda-path=, -include ..., -I (and other include paths)... All of that does not belong to the codegen.

The driver is already doing CUID computation, so I do not see any downsides to just letting it do its job, and I do believe it will be a better, and likely less complicated, solution.

... mostly coming from my support for CUDA in the new driver which currently doesn't implement the CUID as we do with the old driver

Right. That appears to be the key missing piece.

What are the obstacles for having CUID calculation done in the new driver. It should have all the info it needs. What am I missing?

jhuber6 added inline comments.May 24 2022, 2:57 PM
clang/lib/CodeGen/CodeGenModule.cpp
6831

For CUDA/HIP sub-compilation should be done with clang --cuda-host-only/--cuda-device-only. Whether the driver supplies yet another cc1 option, --cuid=... makes no difference to the user launching such sub-compilation.

The problem I have with this is that we use the command line to generate the value, so they aren't going to be the same without the user manually specifying it. I guess we could filter out only "relevant" command line flags, maybe that's an option. I just think it's not intuitive for a name mangling scheme to depend on something external, but there's definitely advantages to doing it that way.

I can see your point for the Driver handling this stuff. Now that I'm thinking about it I don't think looking at the macros or the other arguments is a sound solution in the first place. Even without that it would work for almost all the same cases just using the file's unique ID. Without that, this solution is guaranteed not to conflict with any other file on the same file system at the time of compilation. This, as we discussed, potentially fails for non-static source trees and compiling the same file twice and linking it. The current CUID implementation fails on the former, this method fails on both.

If the CUID didn't exist, the way I would have implemented it would simply be with the File-ID, and have the CUID be a simple marshalling option that lets the user override it to something unique if needed. I personally think that's simpler for 99.99% of cases and has an easy-out in the last 0.01%. Given that it already exists there's some desire to keep it since the work has already been done I understand.

What are the obstacles for having CUID calculation done in the new driver. It should have all the info it needs. What am I missing?

It's less of a difficulty in implementing and more hoping we could make the name mangling more simple and work by default without the driver.
Also, we may need this support for a single case in OpenMP, and I'd prefer not need to generate the CUID for OpenMP offloading when it's unused the vast majority of the time. Generally I'd prefer if compiling for the host / device was conceptually the same to the user without requiring external values. If we're sold on the CUID method I can go forward with that, but from my perspective what it's buying us is the ability to compile the following

static __device__ int a;

#ifdef MACRO
  do_something_with(a);
#else
  do_something_else();
#endif
clang foo.cu -DMACRO -c -o 1.o
clang foo.cu 1.o

This is just a tough problem overall, I don't think there's a single perfect solution. Whatever we choose we'll be trading reproducibility for correctness or whatever. You have more seniority in this space so it's your call what you think I should go forward with.

jhuber6 added a comment.EditedMay 24 2022, 3:08 PM
clang/lib/CodeGen/CodeGenModule.cpp
6831

Also, it's incredibly convoluted, but I can think of a way to break even the current CUID for this.

static __device__ int a;

__device__ int __attribute__((weak)) *a_ref = &a;
$ clang a.cu -c -fgpu-rdc
$ mv a.o b.o
$ clang a.cu -c -fgpu-rdc
$ nvlink a.o b.o -arch=sm_35 -o out.cubin
nvlink error   : Multiple definition of '_ZL1a__static__d041026c8e4167e6' in '1.o', first defined in 'a.o'
nvlink fatal   : merge_elf failed
tra added a comment.May 24 2022, 4:37 PM

I'm still itching to figure out a way to avoid CUID altogether and with the new driver it may be possible.
CUID serves two purposes:
a) avoid name conflicts during device-side linking ("must be globally unique" part)
b) allow host to refer to something in the GPU executable ("stable within TU" part)

My understanding that we already collect the data about all offloading entities and that include those we have to externalize. We also postpone generation of the registration glue to the final linking step.

Let's suppose that we do not externalize those normally-internal symbols. The offloading table would still have entries for them, but there will be no issue with name conflicts during linking, as they do remain internal.
During the final linking, if an an offloading entity uses a pointer w/o a public symbol, we would be in position to generate a unique one, using the pointer value in the offload table entry. Linker can just use a free-running counter for the suffix, or could just generate a completely new symbol. It does not matter.
When we generate the host-side registration glue, we'll use the name of that generated symbol.

In the end linking will work exactly as it would for C++ (modulo having offloading tables) and host/device registration will be ensured by telling host side which symbols to use, instead of assuming that we've happened to generate exactly the same unique suffix on both sides.

@yaxunl -- do you see any holes in this approach?

clang/lib/CodeGen/CodeGenModule.cpp
6831

The problem I have with this is that we use the command line to generate the value, so they aren't going to be the same without the user manually specifying it. I guess we could filter out only "relevant" command line flags, maybe that's an option. I just think it's not intuitive for a name mangling scheme to depend on something external, but there's definitely advantages to doing it that way.

I'm not sure I follow the "they aren't going to be the same without the user manually specifying it." part. Do you mean that CUIDs passed to sub-compilations would not be same?
If so, why would that be the case? If would be up to the driver to pick the same set of inputs to hash into the cuid. We only case about single compilation. Separately compiling host/device with --cuda-host/device-only makes it two different compilations, which we may or may not provide any guarantees about. In case we don't we can document that it would be up to user to ensure consistency between host/device objects by using explicit --cuid argument. Within single top-level compilation the driver should have no problem picking single cuid value and passing it on to all subcompilations.

I don't think looking at the macros or the other arguments is a sound solution in the first place.

They are part of the compilation input set, along with include-related options and, likely, options like -std that also affect the sources seen by compiler.

If we have to generate globally-stable cuid within a cc1 compilation, we have to take as much of the relevant input set for the compilation as practical. I believe preprocessor-related options are relevant to existing use patterns. E.g. compiling the same source with different preprocessor definitions does happen.

We're dealing with more than one issue here.

  • who/where is responsible for CUID generation:
    • driver only
    • CC1 only
    • driver as the primary source of CUID and CC1 as the fallback.
  • how do we guarantee CUID stability within single TU compilation, while ensuring global uniqueness.
    • We can guarantee build-wise uniqueness if we delegate CUID generation to the build system which does know about all compilations and can simply enumerate all of them.
    • We can not generate globally unique CUID strictly within clang, whether by driver or by CC1. In both cases we'll have some chance of collisions and will need a way to deal with them.
    • Driver can guarantee within-compilation stability by generating CUID once and passing it to CC1 instances.
    • Generating CUID within CC1 relies on all CC1 instances producing the same CUID value. It's feasible if we can guarantee that all CC1 instances always operate on identical set of inputs taken into account during CUID generation. That is a dependency on implementation details as those inputs would likely depend on what the driver does. Can we make it work? Probably. But why?

It's less of a difficulty in implementing and more hoping we could make the name mangling more simple and work by default without the driver.

I do not think "without the driver part" (e.g. directly running -cc1) is a good metric for driving compiler development. It's the driver's explicit purpose to hide the complexity of the actual compiler command line.

If you think there's a practical use case of fallback cuid generation in cc1, I would consider it if it were done in parallel with driver-generated CUID during regular top-level compilation. I.e. clang a.cu would run clang -cc1 --cuid.=<driver-provided-cuid>, but if one runs clang -cc1 w/o --cuid, one would be generated for them internally. I would still prefer to see a warning for that, because existence of CUID will be something not obvious to the users and it would likely be very easy to end up with mismatched CUIDs used between the host and device compilations.

I'm still itching to figure out a way to avoid CUID altogether and with the new driver it may be possible.

I would be 100% in favor of working around this if possible, it's proving to be one of the most painful parts of the process.

CUID serves two purposes:
a) avoid name conflicts during device-side linking ("must be globally unique" part)
b) allow host to refer to something in the GPU executable ("stable within TU" part)

My understanding that we already collect the data about all offloading entities and that include those we have to externalize. We also postpone generation of the registration glue to the final linking step.

Yes, we would have all those entries see here. The final linker just gets a pointer to __start_omp_offloading_entries so we can iterate this at runtime.

Let's suppose that we do not externalize those normally-internal symbols. The offloading table would still have entries for them, but there will be no issue with name conflicts during linking, as they do remain internal.

We would also need to make sure that they're used so they don't get optimized out.

During the final linking, if an an offloading entity uses a pointer w/o a public symbol, we would be in position to generate a unique one, using the pointer value in the offload table entry. Linker can just use a free-running counter for the suffix, or could just generate a completely new symbol. It does not matter.

This is the part I'm not sure about, how would we generate new symbols during the linking stage? We can only iterate the offloading entry table after the final linking, which is when we're already supposed to have a fully linked and registered module. We could potentially generate the same kind of table for the device, but I don't think nvlink would perform the same linker magic to merge those entries.

When we generate the host-side registration glue, we'll use the name of that generated symbol.

When we make the registration glue we haven't created the final executable, so I don't think we could modify existing entries, only create new ones.

In the end linking will work exactly as it would for C++ (modulo having offloading tables) and host/device registration will be ensured by telling host side which symbols to use, instead of assuming that we've happened to generate exactly the same unique suffix on both sides.

@yaxunl -- do you see any holes in this approach?

I can't think of a way to generate these new symbols, we'd need to somehow have a list of all the static entries that need new symbols and then modify the object file after its been made. Not sure if this is possible in general considering the vendor linkers might not behave. I'm definitely open to discussion though, I'd love to have a solution for this.

clang/lib/CodeGen/CodeGenModule.cpp
6831

I'm not sure I follow the "they aren't going to be the same without the user manually specifying it." part. Do you mean that CUIDs passed to sub-compilations would not be same?

What I mean is just that if the user does this then it's not going to work.

clang foo.cu --offload-device-only -c // different arguments give different cuid
clang foo.cu --offload-host-only -c

I do not think "without the driver part" (e.g. directly running -cc1) is a good metric for driving compiler development. It's the driver's explicit purpose to hide the complexity of the actual compiler command line.

Fair enough, I just figured this option was more straightforward than passing things in from the command line.

I can't think of a way to generate these new symbols, we'd need to somehow have a list of all the static entries that need new symbols and then modify the object file after its been made. Not sure if this is possible in general considering the vendor linkers might not behave. I'm definitely open to discussion though, I'd love to have a solution for this.

Thinking further, we could theoretically read all the offloading entries via the object files, since we already scan for the .llvm.offloading section, we would just need to look in the same files for all the omp_offload_entries. It would be somewhat difficult to extract the strings, it's theoretically possible. Then we could do a sort of device-side registration where we create a new .ll file containing some new symbols to hook up to those offloading entries. The problem at this point, is even if we had this device glue, how would we register the static variable's pointer? The CUDA runtime looks up symbols by name, so we'd need a symbol with some arbitrary name, whose pointer somehow maps to an existing static variable. Another problem is we'd have duplicate names in the offloading entry, so we'd need to change those somehow to match up with the ones on the device.

Also, for the OpenMP case, we already pass the host-IR as a dependency for the device compilation. So it would be relatively easy for us to just generate these names on the host and then read them from the IR for the device. The problem is that CUDA / HIP doesn't use this approach so it wouldn't be a great solution to have two different ways to do this. So we would either need to make CUDA / HIP take the host IR and use that, or move OpenMP to use the driver. The benefit of passing the IR is that we can much more stably generate some arbitrary string to mangle these names and we're guarunteed to have them match up because we read them from the host. The downside is that it'd be somewhat of a regression because now we have an extra call to Clang for CUDA / HIP when we previously didn't need to.

tra added a comment.May 25 2022, 10:52 AM

Is this patch in its current form blocking any of your other work? no-cuid approach, even if we figure out how to do it, will likely take some time. Do you need an interim solution until then?

Also, for the OpenMP case, we already pass the host-IR as a dependency for the device compilation. So it would be relatively easy for us to just generate these names on the host and then read them from the IR for the device. The problem is that CUDA / HIP doesn't use this approach so it wouldn't be a great solution to have two different ways to do this. So we would either need to make CUDA / HIP take the host IR and use that, or move OpenMP to use the driver. The benefit of passing the IR is that we can much more stably generate some arbitrary string to mangle these names and we're guarunteed to have them match up because we read them from the host. The downside is that it'd be somewhat of a regression because now we have an extra call to Clang for CUDA / HIP when we previously didn't need to.

Yeah. The different compilation flows are a bit of a problem. So is the closeness of NVIDIA's binary format, which limits what we can do with them. E.g. we can't currently modify GPU binary and rename of add new symbols.

I'll need to think about the no-cuid solution. If we can solve it, not deviating from C++ linking would be a valuable benefit and would save us some headaches down the road. Extra clang invocation may be worth it, but it's too early to tell.

Is this patch in its current form blocking any of your other work? no-cuid approach, even if we figure out how to do it, will likely take some time. Do you need an interim solution until then?

Also, for the OpenMP case, we already pass the host-IR as a dependency for the device compilation. So it would be relatively easy for us to just generate these names on the host and then read them from the IR for the device. The problem is that CUDA / HIP doesn't use this approach so it wouldn't be a great solution to have two different ways to do this. So we would either need to make CUDA / HIP take the host IR and use that, or move OpenMP to use the driver. The benefit of passing the IR is that we can much more stably generate some arbitrary string to mangle these names and we're guarunteed to have them match up because we read them from the host. The downside is that it'd be somewhat of a regression because now we have an extra call to Clang for CUDA / HIP when we previously didn't need to.

Yeah. The different compilation flows are a bit of a problem. So is the closeness of NVIDIA's binary format, which limits what we can do with them. E.g. we can't currently modify GPU binary and rename of add new symbols.

I'll need to think about the no-cuid solution. If we can solve it, not deviating from C++ linking would be a valuable benefit and would save us some headaches down the road. Extra clang invocation may be worth it, but it's too early to tell.

It's blocking the new driver from handling static variables correctly because I haven't written the CUID support there yet. I could go ahead and copy over some necessary code to get it to work there, but a no-CUID solution would definitely be ideal. Personally, I think this is fine as a fallback so clang at least generates something that works rather than just leaving it completely blank. The code changed in this patch is pretty minimal. But I will probably get rid of the environment variable check, since I can't verify exactly that it will be the same between the host and device if we were to land this.

tra added a comment.May 25 2022, 11:17 AM

How much work would it take to add cuid generation in the new driver, similar to what the old driver does, using the same logic, however imperfect it is? I'd be OK with that as a possibly permanent solution.

I'm somewhat wary of temporary solutions as they tend to become permanent and age poorly.
That said, I'm OK with someone else tie-breaking us here.
@yaxunl -- Sam, do you have an opinion?

How much work would it take to add cuid generation in the new driver, similar to what the old driver does, using the same logic, however imperfect it is? I'd be OK with that as a possibly permanent solution.

Probably wouldn't be too difficult, primarily just setting up the glue since the rest of the infrastructure is in place. I was hoping it would become unnecessary, but it seems like that's not happening. I'm tempted to have OpenMP handle it on its own do we don't need to port this to the OpenMP case, I think we already do something similar there with the kernel names.

How much work would it take to add cuid generation in the new driver, similar to what the old driver does, using the same logic, however imperfect it is? I'd be OK with that as a possibly permanent solution.

I'm somewhat wary of temporary solutions as they tend to become permanent and age poorly.
That said, I'm OK with someone else tie-breaking us here.
@yaxunl -- Sam, do you have an opinion?

I am OK with this patch.

I doubt it is possible to find a solution for static variable without some form of CUID. Assuming we can rename duplicate symbols in device linker. We still need to let the host compiler know which symbol is for which TU+option, then we still need some CUID to match device and host compilation.

That said, this patch provided a default CUID that do not depend on driver, which is its advantage. It should be OK for most usecases. Driver provided CUID has more robustness, so it can serve as last resort. If the fallback is not sufficient for the new driver then we can revisit this.

tra added a comment.May 25 2022, 3:00 PM

I am OK with this patch.

OK.

That said, this patch provided a default CUID that do not depend on driver, which is its advantage. It should be OK for most usecases.

I agree with this part.

Driver provided CUID has more robustness, so it can serve as last resort.

I'd argue that it should be the default and, as such, the only mechanism for CUID generation. Doing it both in the driver and the codegen looks odd to me.

If the fallback is not sufficient for the new driver then we can revisit this.

It would be great to have some compile-time checks for that, if possible. Otherwise it will only manifest at run-time and the end user will have no clue what's going on.

clang/lib/CodeGen/CodeGenModule.cpp
6850

If we're compiling an already preprocessed file with #line directives, I assume that the source location will be pointing to the file specified by #line, not the input file itself.
If that file is not available (and it does not have to. E.g. I someone has preprocess the original CUDA source and use creduce on the preprocessed code) then we'll get an error when we should not have.

It also results in a different CUIDs being generated for different identifiers, which is different from one CUID for everything generated during the compilation. I can not tell whether it matters in practice. As long as they are in sync betwee the host and the device, it should be OK and it may even have the benefit of allowing commoning things that come from headers, while driver-set common CUID would make all such instances unique.

It would be great to have some compile-time checks for that, if possible. Otherwise it will only manifest at run-time and the end user will have no clue what's going on.

Not sure how we could check it at compile-time, if we knew what it was supposed to be we could just set it properly right?

clang/lib/CodeGen/CodeGenModule.cpp
6850

This should try to use the line directive first, and the current file second if that's not available. The only thing that could change this for the host to device is macros, but we check those currently so it should always be in sync. The downside is if the user somehow only passes a macro to the device side it'll somewhat silently fail when we try to register it.

tra added a comment.May 25 2022, 3:45 PM

It would be great to have some compile-time checks for that, if possible. Otherwise it will only manifest at run-time and the end user will have no clue what's going on.

Not sure how we could check it at compile-time, if we knew what it was supposed to be we could just set it properly right?

We don't need to know the specific values, just that they match between the host and device. The host would need to have all of the expected names used by the registration glue matching the corresponding symbol on the GPU side. Extracting that symbol from the GPU binary might be tricky. Oh, well, we tried.

clang/lib/CodeGen/CodeGenModule.cpp
6850

This should try to use the line directive first, and the current file second if that's not available.

SGTM. Please add a test case with a bogus #line in it to make sure we don't crash on it.

jhuber6 updated this revision to Diff 432146.May 25 2022, 4:22 PM

Add test for #line.

tra accepted this revision.May 25 2022, 4:47 PM
This revision is now accepted and ready to land.May 25 2022, 4:47 PM
This revision was landed with ongoing or failed builds.May 26 2022, 6:18 AM
This revision was automatically updated to reflect the committed changes.