- User Since
- May 4 2020, 11:17 AM (107 w, 4 d)
Thu, May 26
Wed, May 25
Add test for #line.
Not sure how we could check it at compile-time, if we knew what it was supposed to be we could just set it properly right?
Changing to use --offload-link and use -dlink as an alias.
Probably wouldn't be too difficult, primarily just setting up the glue since the rest of the infrastructure is in place. I was hoping it would become unnecessary, but it seems like that's not happening. I'm tempted to have OpenMP handle it on its own do we don't need to port this to the OpenMP case, I think we already do something similar there with the kernel names.
Yeah, I can see your point, --offload-link definitely works but it would be nice to have something less verbose. Maybe could just use -dlink or something.
Where is the code to change the target registration? We need a new initialization runtime call to use and change the old one to reallocate the __tgt_device_image array. Did you work around that some other way?
Also, for the OpenMP case, we already pass the host-IR as a dependency for the device compilation. So it would be relatively easy for us to just generate these names on the host and then read them from the IR for the device. The problem is that CUDA / HIP doesn't use this approach so it wouldn't be a great solution to have two different ways to do this. So we would either need to make CUDA / HIP take the host IR and use that, or move OpenMP to use the driver. The benefit of passing the IR is that we can much more stably generate some arbitrary string to mangle these names and we're guarunteed to have them match up because we read them from the host. The downside is that it'd be somewhat of a regression because now we have an extra call to Clang for CUDA / HIP when we previously didn't need to.
Tue, May 24
I would be 100% in favor of working around this if possible, it's proving to be one of the most painful parts of the process.
CUID serves two purposes:
a) avoid name conflicts during device-side linking ("must be globally unique" part)
b) allow host to refer to something in the GPU executable ("stable within TU" part)
My understanding that we already collect the data about all offloading entities and that include those we have to externalize. We also postpone generation of the registration glue to the final linking step.
Yes, we would have all those entries see here. The final linker just gets a pointer to __start_omp_offloading_entries so we can iterate this at runtime.
Let's suppose that we do not externalize those normally-internal symbols. The offloading table would still have entries for them, but there will be no issue with name conflicts during linking, as they do remain internal.
We would also need to make sure that they're used so they don't get optimized out.
During the final linking, if an an offloading entity uses a pointer w/o a public symbol, we would be in position to generate a unique one, using the pointer value in the offload table entry. Linker can just use a free-running counter for the suffix, or could just generate a completely new symbol. It does not matter.
This is the part I'm not sure about, how would we generate new symbols during the linking stage? We can only iterate the offloading entry table after the final linking, which is when we're already supposed to have a fully linked and registered module. We could potentially generate the same kind of table for the device, but I don't think nvlink would perform the same linker magic to merge those entries.
When we generate the host-side registration glue, we'll use the name of that generated symbol.
When we make the registration glue we haven't created the final executable, so I don't think we could modify existing entries, only create new ones.
In the end linking will work exactly as it would for C++ (modulo having offloading tables) and host/device registration will be ensured by telling host side which symbols to use, instead of assuming that we've happened to generate exactly the same unique suffix on both sides.
@yaxunl -- do you see any holes in this approach?
Adding extra commentto mention the hidden requirement that the driver shuold not define a different -D option for the host and device.
Removing use of the line number, instead replacing it with an 8 character wide hash of the -D options passed to the front-end. This should make it sufficiently unique for users compiling the same file with different options. The format now looks like <var>__<qualifier>__<file-id><device-id>_<hash>.
Mon, May 23
Merging into a single argument and checking if the joined arg is empty.
Go back to old joined method and also change the name to remote _EQ.
I mean this in the context that the following will not work
clang a.c -c -o a-0.o // Has some externalized static variable. clang a.c -c -o a-1.o clang a-0.o a-1.o // Redefined symbol error
The build will be perfectly reproducible, the ID we append here is just <var>__static__<file id><device id><line number> which should be the same in a static source tree. Though it might be annoying that the line number may change on white-space changes, so we could do without the line number at the end if that's an issue.
However, this is a very niche use-case and is not supported by Nvidia's CUDA compiler so it likely to be good enough.
The fact that NVCC didn't always generate the same output was an issue when we were using it for CUDA compilation.
In general, "not supported by NVCC" is not quite applicable here, IMO. The goal here is to make clang work correctly.
I feel like linking a file with itself is pretty uncommon, but in order to support that we'd definitely need the CUID method so we can pass it to both the host and device. I'm personally fine with this and the CUID living together so if for whatever reason there's a symbol clash, the user can specify a CUID to make it go away. We also discussed the problem of non-static source trees which neither this nor the current CUID would solve. As far as I can tell, this method would work fine for like 99.99% of codes, but getting that last 0.01% would require something like generating a UUID for each compilation job, which requires intervention from the driver to set up offloading compilation properly. So I'm not sure if it's the best trade-off.
Updating to use @MaskRay's suggestion.
Changing the -Xoffload-linker= to -Xoffload-linker-.
Fri, May 20
Thu, May 19
Wed, May 18
Tue, May 17
Mon, May 16
@ye-luo Can I land this now?
Remove unrelated change.
Changing the guard to only be for the static library.
The OpenMP offloading runtime was always intended to be built with a newly-built Clang, either through a two-phase build or with -DLLVM_ENABLE_RUNTIMES=openmp. This patch added some code that uses standard CMake to build the library, rather than locating the clang binary directly. I could add some code to skip building this static library, or the device runtime entirely, if the compiler isn't an up-to-date Clang. I'm not sure what the best solution to this is, since we always required that this was to be built with Clang, this patch just made it a more strict requirement.
Sat, May 14
Fri, May 13
Rebase to not change the include directories.
Changing to be directly included by each plugin rather than inherited from elf_common.