Added --gpu-bundle-output to control bundling/unbundling of the output of HIP device compilation.
By default, preprocessor expansion, LLVM bitcode, and assembly output are unbundled; code objects
are bundled.
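For illustration, device-only compilation under these defaults might look roughly like this (a sketch, not output from the actual patch; the architecture names, file names, and multi-output naming are assumptions):

```
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -c foo.hip -o foo.o
# objects are bundled by default: a single foo.o containing both arches
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -S foo.hip
# assembly is unbundled by default: one .s file per arch
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -S --gpu-bundle-output foo.hip -o foo.s
# --gpu-bundle-output forces a single bundled output even for assembly
```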
Differential D101630: [HIP] Add --gpu-bundle-output
Authored by yaxunl on Apr 30 2021, 7:12 AM.
Event Timeline

Comment Actions
CUDA compilation currently errors out if -o is used when more than one output would be produced.

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c -E
#... preprocessed output from host and 2 GPU compilations is printed out
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c -E -o foo.out
clang-13: error: cannot specify -o when generating multiple output files
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c --cuda-device-only -E -o foo.out
clang-13: error: cannot specify -o when generating multiple output files
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c --cuda-device-only -S -o foo.out
clang-13: error: cannot specify -o when generating multiple output files

I think I've borrowed that behavior from some of the macOS-related functionality, so we do have a somewhat established model of how to handle multiple outputs. The question is what would make the most sense. In my experience, most such use cases are intended for manual examination of compiler output, and as such I'd prefer to keep the results immediately usable, without having to unbundle them. In such cases we're already changing command-line options, and adjusting them to produce the output from the specific sub-compilation I want is trivial. Having to unbundle things is more complicated, as the bundler/unbundler tool as it stands is poorly documented and not particularly user-friendly. If it is to become a user-facing tool like ar/nm/objdump, it would need some improvements.

If you do have use cases where you need to bundle intermediate results, are they for human consumption or for tooling? Perhaps we should make the "bundle the outputs" behavior controllable by a flag, and keep enforcing "one output only" as the default.
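For context on the unbundling step being discussed, splitting a bundled device object by hand looks roughly like this (a sketch; the exact flag spellings and target-triple strings vary between LLVM versions, and the triples and file names below are only examples):

```
# Split a bundled object produced by a HIP compilation back into its parts.
# -type=o tells the tool the inputs are object files; -unbundle splits
# instead of merging. The -targets list must match what the driver embedded.
clang-offload-bundler -type=o \
  -targets=host-x86_64-unknown-linux-gnu,hip-amdgcn-amd-amdhsa-gfx906 \
  -inputs=foo.o \
  -outputs=foo-host.o,foo-gfx906.o \
  -unbundle
```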
Comment Actions
We use ccache and need one output for -E with device compilation. There are also use cases that emit bitcode for device compilation and link it later. These use cases require the output to be bundled. If users want the unbundled output, they can use -save-temps. Is that sufficient?

Comment Actions
What will happen with this patch in the following scenarios:
I would expect the first case to produce a plain-text assembly file. With this patch, the second case will produce a bundle. With some build tools, users only add to the various compiler options provided by the system. Depending on whether those system-provided options include an --offload-arch, the format of the output in the first example becomes unstable. So the consistent way would be to always bundle everything, but that breaks (or at least complicates) the normal single-output case and makes it deviate from what users expect from a regular C++ compilation.

This is a good point. I don't think I've ever used ccache on a CUDA compilation, but I see how ccache may get surprised. Considering the scenario above, I think a better way to handle it would be to teach ccache about CUDA/HIP compilation. It's a similar situation to the support for split DWARF, where the compiler does something beyond the expected one-input-to-one-output transformation.
In terms of saving intermediate outputs, yes. In terms of usability, no. Sometimes I want one particular intermediate result saved with an exact filename (or piped to stdout), and saving a bunch of files and then picking one would be a pretty annoying usability regression for me.

Comment Actions
How about an option -fhip-bundle-device-output? If it is on, device output is bundled no matter how many GPU arches there are. By default it is on.

Comment Actions
+1 to the option, but I can't say I'm particularly happy about the default. I'd still prefer the default to be no bundling, plus an error in cases where we'd nominally produce multiple outputs. @jdoerfert: Do you have any thoughts on what would be a sensible default when a user uses -S -o foo.s for compilations that may produce multiple results? I think OpenMP may have to deal with similar issues. On one hand, it would be convenient for ccache to just work with CUDA/HIP compilation out of the box: the compiler always produces one output file, regardless of what it does under the hood, and ccache need not care what's in it. On the other hand, this behavior breaks user expectations; i.e., clang -S is supposed to produce assembly, not an opaque binary bundle blob.
Comment Actions
I chose to emit the bundled output by default since it is the convention for a compiler to have one output. The compilation is like a pipeline: if we break it into stages, users would expect to use the output from one stage as input to the next, which is possible only if there is one output. That is how host compilations and combined device/host compilations behave, and I would find it surprising if it were not true for device compilation. Also, when users do not want the output to be bundled, it is usually for debugging or other special purposes, and they then need to know the naming convention of the multiple outputs. I think it is justifiable to enable that behavior with an option.

Comment Actions
So in the end it's a trade-off between tools like ccache working out of the box vs. an additional option that would need to be used by users who do need a specific intermediate output. Now the question is how to make it work without breaking existing users. There are some tools that rely on clang --cuda-host-only and --cuda-device-only working as if it were a regular C++ compilation, e.g. godbolt.org. How about this: WDYT?

Comment Actions
--cuda-host-only always has one output, so there is no point in bundling it. We only need to decide the proper behavior of --cuda-device-only. How about keeping the original default behavior of not bundling when the user does not specify an output file, and bundling the output when the user does specify one, since specifying an output file indicates the user is requesting a single output? -f[no-]hip-bundle-device-output would override the default behavior.

Comment Actions
It still fits my proposal of requiring a single sub-compilation and not bundling the output.
I think it will make things worse. Compiler output should not change depending on whether -o is used.
I disagree. When the user specifies the output, the intent is to specify the location of the outputs, not their contents or format. Telling the compiler to produce a different output format should not depend on specifying (or not specifying) the output location. I think our options are:
I can see the benefit of always bundling for HIP, but I also believe that keeping things simple, consistent, and predictable is important. Considering that we're tinkering in a relatively obscure niche of the compiler, it probably does not matter all that much, but that should not stop us from trying to figure out the best approach in a principled way. I think we could benefit from a second opinion on which approach would make more sense for clang.

Comment Actions
How does nvcc --genco behave when there are multiple GPU arches? Does it output a fat binary containing multiple ISAs? Also, does it support device-only compilation for intermediate outputs?

Comment Actions
It does not allow multiple outputs for -ptx and -cubin compilations, same as clang behaves now:

$ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
nvcc fatal : Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures

NVCC does allow -E with multiple targets, but it produces output for only *one* of them. NVCC does bundle outputs for multiple GPU variants if -fatbin is used.

Comment Actions
I think for intermediate outputs, e.g. preprocessor expansion, IR, and assembly, it probably makes sense not to bundle by default. However, for the default action (emitting an object), we need to bundle by default, since that was the old behavior and existing HIP apps depend on it. Then we allow -fhip-bundle-device-output to override the default behavior.

Comment Actions
Agreed.
Existing use is a valid point. The final product of device-side sub-compilations is a bundle. The question is: what does "-c" mean? Is it "produce an object file" or "compile to the end of the pipeline"?
OK. Bundling objects for HIP by default looks like a reasonable compromise. Now that we are in agreement on what we want, the next question is *how* we want to do it. It appears that there's a fair bit of similarity between what the proposed -fgpu-bundle flag does and the handful of --emit-... options clang has now. Compilation with "-c" would remain "compile till the end", whatever that happens to mean for a particular language, and --emit-object/bundle would tell the compiler how far we want it to proceed and what kind of output we want. This would probably be easier to explain to users, as they are already familiar with flags like -emit-llvm; only now we are dealing with an extra bundling step in the compilation pipeline. It would also behave consistently across CUDA and HIP, even though they have different defaults for bundling in the device-side compilation. E.g., -c --cuda-device-only --emit-gpu-bundle will always produce a bundle with the object files for both CUDA and HIP, and -c --cuda-device-only --emit-gpu-object will always require a single '-o' output. WDYT? Does it make sense?

Comment Actions
For sure we will need -fgpu-bundle-device-output to control bundling of intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be redundant and can cause confusion. What if users specify -c -fgpu-bundle-device-output -emit-gpu-object or -c -fno-gpu-bundle-device-output -emit-gpu-bundle? To me a single option -fgpu-bundle-device-output controlling all device output seems cleaner.

Comment Actions
The idea is to use -emit-gpu-object and -emit-gpu-bundle instead of -f[no-]gpu-bundle-device-output; otherwise they'd do exactly the same thing. I think -emit-gpu-{object,bundle} has a minor edge over -f[no-]gpu-bundle-device-output, as it's similar to the other -emit options for controlling clang compilation phases (and that's what we want to do here), while -f options are usually for tweaking code generation.
Comment Actions
But how do we control emitting LLVM IR with or without a bundle? -emit-llvm -emit-gpu-object or -emit-llvm -emit-gpu-bundle? -emit-* is usually for specifying a specific file type.

Comment Actions
Hmm. I forgot that HIP can bundle things other than objects. -emit-llvm -emit-gpu-bundle looks reasonable, but -emit-llvm -emit-gpu-object is indeed odd.

Comment Actions
Use --gpu-bundle-output to control bundling/unbundling of the output of HIP device-only compilation.
The TableGen marshalling infrastructure (BoolFOption et al.) is only intended for flags that map to the -cc1 frontend and its CompilerInvocation class. (EmptyKPM, which disables this mapping, shouldn't even exist anymore.)
Since these flags only work on the driver level, use something like this instead:
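The concrete snippet the reviewer attached is not preserved here, but a driver-only flag of this kind is typically declared as a plain Flag record in Options.td without any marshalling annotations, i.e. something along these lines (a sketch of the general shape, not the exact code from the patch; the group and help text are assumptions):

```
def gpu_bundle_output : Flag<["--"], "gpu-bundle-output">, Group<f_Group>,
  HelpText<"Bundle output files of HIP device compilation">;
def no_gpu_bundle_output : Flag<["--"], "no-gpu-bundle-output">, Group<f_Group>,
  HelpText<"Do not bundle output files of HIP device compilation">;
```

The driver would then query the pair with something like Args.hasFlag(OPT_gpu_bundle_output, OPT_no_gpu_bundle_output, default), with no -cc1 round-trip involved.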