This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Add --gpu-bundle-output
ClosedPublic

Authored by yaxunl on Apr 30 2021, 7:12 AM.

Details

Summary

Added --gpu-bundle-output to control bundling/unbundling of the output of HIP device compilation.

By default, preprocessor output, LLVM bitcode, and assembly are not bundled; code objects are
bundled.
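
For illustration, a rough sketch of the intended usage (gfx906/gfx908 are example targets; the negative spelling follows the --[no-]gpu-bundle-output form agreed on below; exact output file naming may differ):

% clang -c --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 --gpu-bundle-output foo.hip -o foo.o
# one foo.o bundling the code objects for both archs

% clang -S --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 --no-gpu-bundle-output foo.hip
# separate assembly files, one per GPU arch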

Diff Detail

Event Timeline

yaxunl requested review of this revision.Apr 30 2021, 7:12 AM
yaxunl created this revision.
tra added a comment.Apr 30 2021, 9:35 AM

CUDA compilation currently errors out if -o is used when more than one output would be produced.
E.g.

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E 
#... preprocessed output from host and 2 GPU compilations is printed out

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E  -o foo.out
clang-13: error: cannot specify -o when generating multiple output files

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -E  -o foo.out
clang-13: error: cannot specify -o when generating multiple output files

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -S  -o foo.out
clang-13: error: cannot specify -o when generating multiple output files

I think I've borrowed that behavior from some of the macOS-related functionality, so we do have a somewhat established model of how to handle multiple outputs.
Wrapping multiple outputs into a single bundle could be an option too.

The question is -- what would make the most sense.
Are bundles useful in cases where the user would use options that give us intermediate compiler outputs?

In my experience, most such use cases are intended for manual examination of compiler output, so I'd prefer to keep the results immediately usable, without having to unbundle them. In such cases we're already changing command-line options, and adjusting them to produce the output from the specific sub-compilation I want is trivial. Having to unbundle things is more complicated: the bundler/unbundler tool, as it stands, is poorly documented and not particularly user-friendly. If it is to become a user-facing tool like ar/nm/objdump, it would need some improvements.

If you do have use cases where you need to bundle intermediate results, are they for human consumption or for tooling? Perhaps we should make the "bundle the outputs" behavior controllable by a flag, and keep enforcing "one output only" as the default.
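
(For reference, a sketch of what manual unbundling with the current tool looks like; the bundle type, target triples, and file names below are illustrative and version-dependent:)

% clang-offload-bundler -type=o -unbundle \
    -targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx906 \
    -inputs=foo.o -outputs=foo-host.o,foo-gfx906.o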

yaxunl added a comment.

We use ccache and need one output for -E with device compilation. Also, there are use cases that emit bitcode for device compilation and link it later. These use cases require the output to be bundled.

If users want to get the unbundled output, they can use -save-temps. Is it sufficient?

tra added a comment.Apr 30 2021, 11:35 AM

What will happen with this patch in the following scenarios:

  • --offload-arch=A -S -o out.s
  • --offload-arch=A --offload-arch=B -S -o out.s

I would expect the first case to produce a plain text assembly file. With this patch the second case will produce a bundle. With some build tools, users can only add to the various compiler options provided by the system. Depending on whether those system-provided options include an --offload-arch, the format of the output in the first example becomes unstable. So the consistent way would be to always bundle everything, but that breaks (or at least complicates) the normal single-output case and makes it deviate from what users expect from a regular C++ compilation.

We use ccache and need one output for -E with device compilation. Also, there are use cases that emit bitcode for device compilation and link it later. These use cases require the output to be bundled.

This is a good point. I don't think I've ever used ccache on a CUDA compilation, but I see how ccache may get surprised.

Considering the scenario above, I think a better way to handle it would be to teach ccache about CUDA/HIP compilation. It's a similar situation to the support for split DWARF, where the compiler does something beyond the expected one-input to one-output transformation.
E.g. we could tell it to use stdout for -E. Or implement the bundle-everything flag in clang and let ccache use it if it needs to have a single output.

If users want to get the unbundled output, they can use -save-temps. Is it sufficient?

In terms of saving intermediate outputs - yes. In terms of usability - no. Sometimes I want one particular intermediate result saved with an exact filename (or piped to stdout), and saving a bunch and then picking one would be a pretty annoying usability regression for me.

yaxunl added a comment.

How about an option -fhip-bundle-device-output? If it is on, device output is bundled no matter how many GPU archs there are. By default it is on.

yaxunl updated this revision to Diff 342182.May 1 2021, 3:32 PM
yaxunl edited the summary of this revision. (Show Details)

added option -fhip-bundle-device-output

tra added a subscriber: jdoerfert.May 3 2021, 10:22 AM

How about an option -fhip-bundle-device-output? If it is on, device output is bundled no matter how many GPU archs there are. By default it is on.

+1 to the option, but I can't say I'm particularly happy about the default. I'd still prefer the default to be no bundling, plus an error in cases where we'd nominally produce multiple outputs.
We could use a third opinion here.

@jdoerfert : Do you have any thoughts on what would be a sensible default when a user uses -S -o foo.s for compilations that may produce multiple results? I think OpenMP may have to deal with similar issues.

On one hand it would be convenient for ccache to just work with the CUDA/HIP compilation out of the box. The compiler always produces one output file regardless of what it does under the hood, and ccache need not care what's in it.

On the other, this behavior breaks user expectations. I.e. clang -S is supposed to produce the assembly, not an opaque binary bundle blob.
Using -S with multiple sub-compilations is also likely an error on the user's end and should be explicitly diagnosed, and that's how it currently works.
Using -fno-hip-bundle-device-output to restore the expected behavior puts the burden on the wrong side, IMO. I believe it should be ccache that uses -fhip-bundle-device-output to deal with CUDA/HIP compilations.

jansvoboda11 added inline comments.
clang/include/clang/Driver/Options.td
989

The TableGen marshalling infrastructure (BoolFOption et. al.) is only intended for flags that map to the -cc1 frontend and its CompilerInvocation class. (EmptyKPM that disables this mapping shouldn't even exist anymore.)

Since these flags only work on the driver level, use something like this instead:

def fhip_bundle_device_output : Flag<["-"], "fhip-bundle-device-output">, Group<f_Group>;
def fno_hip_bundle_device_output : Flag<["-"], "fno-hip-bundle-device-output">, Group<f_Group>, HelpText<"Do not bundle output files of HIP device compilation">;
tra added inline comments.May 4 2021, 9:29 AM
clang/include/clang/Driver/Options.td
989

<offtopic>
It would be great if BoolFOption would at least document that assumption.
It would be even better if it could be used to implement a driver-only option, maybe with a flag.
The class is very useful for handling -fsomething/-fno-something, and restricting it to cc1 only will continue to be a constant source of this kind of confusion. I think we've already seen a handful of them in reviews since the flag tablegen got overhauled.

yaxunl added a comment.May 7 2021, 9:42 AM

I chose to emit the bundled output by default since it is the convention for a compiler to have one output. The compilation is like a pipeline. If we break it into stages, users would expect to use the output from one stage as input for the next stage. This is possible only if there is one output. This happens with host compilations and combined device/host compilations. It would be a surprise if this were not true for device compilation.

Also, when users do not want the output to be bundled, it is usually for debugging or special purposes. Users need to know the naming convention of the multiple outputs. I think it is justifiable to enable this by an option.

tra added a comment.May 10 2021, 11:50 AM

[snip] it is the convention for a compiler to have one output.
The compilation is like a pipeline. If we break it into stages, users would expect to use the output from one stage as input for the next stage. This is possible only if there is one output.
Also, when users do not want the output to be bundled, it is usually for debugging or special purposes. Users need to know the naming convention of the multiple outputs. I think it is justifiable to enable this by an option.

So in the end it's a trade-off between tools like ccache working out of the box vs. an additional option that would need to be used by users who do need a specific intermediate output.
Considering that intermediate outputs already need special handling, I'll agree that bundling by default is likely more useful.

Now the question is how to make it work without breaking existing users.

There are some tools that rely on clang --cuda-host-only and --cuda-device-only to work as if it was a regular C++ compilation. E.g. godbolt.org.
It may be useful to do bundling only if we're dealing with multiple outputs.
On the other hand, it may puzzle users why they get a bundle with clang -S --offload-arch=X --offload-arch=Y but plain text assembly with clang -S --offload-arch=X.

How about this:
If the user explicitly specified --cuda-host-only or --cuda-device-only, then by default only allow producing the natural output format, unless a bundled output is requested by an option. This should keep existing users working.
If the compilation is done without explicitly requested sub-compilation(s), then bundle the output by default. This should keep the GPU-unaware tools like ccache happy as they would always get the single output they expect.

WDYT?
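
(A sketch of how the proposed rules would play out; this is hypothetical behavior with example targets, not what clang does today:)

% clang -c foo.hip -o foo.o
# no sub-compilation requested explicitly: bundled fat object, single output (GPU-unaware tools stay happy)

% clang -c --cuda-device-only --offload-arch=gfx906 foo.hip -o foo.o
# explicit device-only compilation: a plain object for gfx906

% clang -c --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 foo.hip -o foo.o
# explicit device-only compilation with multiple archs: an error, unless bundled output is requested by an option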

yaxunl marked an inline comment as done.May 24 2021, 8:37 AM

--cuda-host-only always has one output, so there is no point in bundling it. We only need to decide the proper behavior of --cuda-device-only.

How about keeping the original default behavior of not bundling if users do not specify an output file, but bundling the output if they do? Specifying an output file indicates that the user is requesting one output. -f[no-]hip-bundle-device-output would override the default behavior.

yaxunl updated this revision to Diff 347400.May 24 2021, 8:38 AM

fixed option. bundle output if users specify output by -o or -E

yaxunl edited the summary of this revision. (Show Details)May 24 2021, 8:39 AM
tra added a subscriber: echristo.May 24 2021, 10:41 AM

How about this:
If the user explicitly specified --cuda-host-only or --cuda-device-only, then by default only allow producing the natural output format, unless a bundled output is requested by an option. This should keep existing users working.
If the compilation is done without explicitly requested sub-compilation(s), then bundle the output by default. This should keep the GPU-unaware tools like ccache happy as they would always get the single output they expect.

WDYT?

--cuda-host-only always has one output, so there is no point in bundling it. We only need to decide the proper behavior of --cuda-device-only.

It still fits my proposal of requiring a single sub-compilation and not bundling the output.
The point was that such behavior is consistent regardless of whether we're compiling CUDA or HIP for the host or for device.

How about keeping the original default behavior of not bundling if users do not specify an output file,
but bundling the output if they do?

I think it will make things worse. Compiler output should not change depending on whether -o is used.

Specifying an output file indicates that the user is requesting one output. -f[no-]hip-bundle-device-output would override the default behavior.

I disagree. When the user specifies the output, the intent is to specify the location of the output, not its contents or format.

Telling the compiler to produce a different output format should not depend on specifying (or not) the output location.

I think our options are:

  • Always bundle --cuda-device-only outputs by default. This is consistent for HIP compilation, but deviates from CUDA, which can't do bundling. Also, single-target sub-compilation deviates from both CUDA and regular C++ compilation, which is what most users would be familiar with and which would probably be the most sensible default for a single sub-compilation. It can be overridden with an option, but that goes against the principle that it's the specialized use case that should need extra options. The most common use case should not need them.
  • Only bundle multiple sub-compilations' output by default. This would preserve the sensible single sub-compilation behavior. The downside is that it makes the output format depend on whether the compiler ends up doing one or many sub-compilations. E.g. --offload-arch=A -S would produce ASM and --offload-arch=A --offload-arch=B -S would produce a bundle. If the user can't control some of the compiler options, such an approach would make the output format unpredictable. E.g. passing --offload-arch=foo to the compiler on godbolt would all of a sudden produce bundled output instead of assembly text or a sensible error message saying that you're trying to produce multiple outputs.
  • Keep the current behavior (insist on single sub-compilation) as the default, allow overriding it for HIP with the flag. IMO that's the most consistent option and I still think it's the one most suitable to keep as the default.

I can see the benefit of always bundling for HIP, but I also believe that keeping things simple, consistent and predictable is important. Considering that we're tinkering in a relatively obscure niche of the compiler, it probably does not matter all that much, but it should not stop us from trying to figure out the best approach in a principled way.

I think we could benefit from a second opinion on which approach would make more sense for clang.
Summoning @jdoerfert and @echristo.

yaxunl added a comment.

How does nvcc --genco behave when there are multiple GPU archs? Does it output a fat binary containing multiple ISAs? Also, does it support device-only compilation for intermediate outputs?

tra added a comment.EditedJun 1 2021, 11:57 AM

How does nvcc --genco behave when there are multiple GPU archs? Does it output a fat binary containing multiple ISAs? Also, does it support device-only compilation for intermediate outputs?

It does not allow multiple outputs for -ptx and -cubin compilations, same as clang behaves now:

$ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
nvcc fatal   : Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures

NVCC does allow -E with multiple targets, but it does produce output for only *one* of them.

NVCC does bundle outputs for multiple GPU variants if -fatbin is used.
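
For example, along the lines of the commands above (assuming CUDA 11.x):

% ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -fatbin foo.cu -o foo.fatbin
# single foo.fatbin containing both GPU variants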

yaxunl added a comment.Jun 1 2021, 1:21 PM

I think for intermediate outputs, e.g. preprocessor expansion, IR, and assembly, it probably makes sense not to bundle by default. However, for the default action (emitting an object), we need to bundle by default since it was the old behavior and existing HIP apps depend on that. Then we allow -fhip-bundle-device-output to override the default behavior.
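
(Under that proposal the defaults would look roughly like this; example targets, and the flag spelling here is the one under review at this point, later renamed to --gpu-bundle-output:)

% clang -c --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 foo.hip -o foo.o
# default action: one bundled object containing both code objects

% clang -S --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 foo.hip
# intermediate output: one assembly file per GPU arch, not bundled

% clang -c --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 -fno-hip-bundle-device-output foo.hip
# override: unbundled objects, one per GPU arch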

tra added a comment.Jun 1 2021, 2:24 PM

I think for intermediate outputs, e.g. preprocessor expansion, IR, and assembly, it probably makes sense not to bundle by default.

Agreed.

However, for the default action (emitting an object), we need to bundle by default since it was the old behavior and existing HIP apps depend on that.

Existing use is a valid point.
As a counterargument, I would suggest that in a compilation pipeline which does include bundling, an object file for one GPU variant *is* an intermediate output, similar to the ones you've listed above.

The final product of device-side sub-compilations is a bundle. The question is: what does "-c" mean? Is it "produce an object file" or "compile till the end of the pipeline"?
For CUDA and HIP compilation it's ambiguous. When we target just one GPU, it would be closer to the former. In general, it would be closer to the latter. NVCC side-steps the issue by using different flags, -cubin/-fatbin, to disambiguate between the two cases and avoid bolting CUDA-related semantics onto compiler flags that were not designed for that.

Then we allow -fhip-bundle-device-output to override the default behavior.

OK. Bundling objects for HIP by default looks like a reasonable compromise.
It would be useful to generalize the flag to -fgpu-bundle..., as it would come in handy if/when we want to produce a fatbin during CUDA compilation. I'd still keep no-bundling as the default for CUDA's objects.

Now that we are in agreement on what we want, the next question is *how* we want to do it.

It appears that there's a fair bit of similarity between what the proposed -fgpu-bundle flag does and the handful of --emit-... options clang has now.
If we were to use something like --emit-gpu-object and --emit-gpu-bundle, it would be similar to NVCC's -cubin/-fatbin, and would decouple the default behavior for -c --cuda-device-only from the user's ability to specify what they want, without burdening -c with additional flags that would have different defaults under different circumstances.

Compilation with "-c" would remain the "compile till the end", whatever that happens to mean for a particular language, and --emit-gpu-object/--emit-gpu-bundle would tell the compiler how far we want it to proceed and what kind of output we want. This would probably be easier to explain to users, as they are already familiar with flags like -emit-llvm; only now we are dealing with an extra bundling step in the compilation pipeline. It would also behave consistently across CUDA and HIP even though they have different defaults for bundling for the device-side compilation. E.g. -c --cuda-device-only --emit-gpu-bundle will always produce a bundle with the object files for both CUDA and HIP, and -c --cuda-device-only --emit-gpu-object will always require a single '-o' output.

WDYT? Does it make sense?
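
(A side-by-side sketch of the hypothetical flags from this proposal; they were never implemented in this form, as the discussion below settles on --gpu-bundle-output instead:)

% clang -c --cuda-device-only --offload-arch=gfx906 --offload-arch=gfx908 --emit-gpu-bundle foo.hip -o foo.o
# one bundle wrapping the per-arch objects, for both CUDA and HIP

% clang -c --cuda-device-only --offload-arch=gfx906 --emit-gpu-object foo.hip -o foo.o
# a single plain object; multiple archs would be diagnosed as multiple outputs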

yaxunl added a comment.Jun 4 2021, 6:36 AM

For sure we will need -fgpu-bundle-device-output to control bundling of intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be redundant and can cause confusion. What if users specify -c -fgpu-bundle-device-output -emit-gpu-object or -c -fno-gpu-bundle-device-output -emit-gpu-bundle? To me a single option -fgpu-bundle-device-output to control all device output seems cleaner.

tra added a comment.Jun 4 2021, 9:57 AM

For sure we will need -fgpu-bundle-device-output to control bundling of intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be redundant and can cause confusion. What if users specify -c -fgpu-bundle-device-output -emit-gpu-object or -c -fno-gpu-bundle-device-output -emit-gpu-bundle? To me a single option -fgpu-bundle-device-output to control all device output seems cleaner.

The idea is to use -emit-gpu-object and -emit-gpu-bundle instead of -f[no-]gpu-bundle-device-output. Otherwise they'd do exactly the same thing.

I think -emit-gpu-{object,bundle} has a minor edge over -f[no-]gpu-bundle-device-output as it's similar to other -emit options for controlling clang compilation phases (and that's what we want to do here), while -f options are usually for tweaking code generation.

yaxunl added a comment.

But how do we control emitting LLVM IR with or without a bundle? -emit-llvm -emit-gpu-object or -emit-llvm -emit-gpu-bundle? -emit-* is usually for specifying a specific file type.

tra added a comment.Jun 4 2021, 10:30 AM

But how do we control emitting LLVM IR with or without a bundle? -emit-llvm -emit-gpu-object or -emit-llvm -emit-gpu-bundle? -emit-* is usually for specifying a specific file type.

Hmm. I forgot that HIP can bundle things other than objects. -emit-llvm -emit-gpu-bundle looks reasonable, but -emit-llvm -emit-gpu-object is indeed odd.
OK. Making it some sort of do/do-not bundle flag makes sense. How about just --[no-]gpu-bundle-output?

yaxunl added a comment.

--[no-]gpu-bundle-output sounds good.

yaxunl updated this revision to Diff 350318.Jun 7 2021, 9:09 AM
yaxunl retitled this revision from [HIP] Fix device-only compilation to [HIP] Add --gpu-bundle-output.
yaxunl edited the summary of this revision. (Show Details)

use --gpu-bundle-output to control bundling/unbundling of the output of HIP device-only compilation

tra accepted this revision.Jun 7 2021, 10:12 AM
tra added inline comments.
clang/lib/Driver/Driver.cpp
2910

We should document the behavior we expect from the --gpu-bundle-output option in one place. This may be a good place for it.

Some day we should add a CUDA/HIP section to the clang manual. We already have tons of options that users do need to know about.

This revision is now accepted and ready to land.Jun 7 2021, 10:12 AM
yaxunl marked an inline comment as done.Jun 7 2021, 12:49 PM
yaxunl added inline comments.
clang/lib/Driver/Driver.cpp
2910

Will fix the comments when committing.

Yes, we should start adding CUDA/HIP documentation to the clang manual in future patches.

This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.
Herald added a project: Restricted Project.Jun 9 2021, 8:54 PM

Got test failures on some bots; fixed by 734213d7b51f9ea22a9d122c0646ca5b69f88ac8.