Added --gpu-bundle-output to control bundling/unbundling of the output of HIP device compilation.
By default, preprocessor expansion, LLVM bitcode, and assembly output are unbundled; code objects
are bundled.
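For illustration, device-only compilation under these defaults might look roughly like this (a sketch, not output from the actual patch; the architecture names, file names, and multi-output naming are assumptions):

```
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -c foo.hip -o foo.o
# objects are bundled by default: a single foo.o containing both arches
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -S foo.hip
# assembly is unbundled by default: one .s file per arch
% clang++ -x hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only -S --gpu-bundle-output foo.hip -o foo.s
# --gpu-bundle-output forces a single bundled output even for assembly
```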
Differential D101630: [HIP] Add --gpu-bundle-output
Authored by yaxunl on Apr 30 2021, 7:12 AM.
Event Timeline

Comment Actions
CUDA compilation currently errors out if -o is used when more than one output would be produced.

% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c -E
#... preprocessed output from host and 2 GPU compilations is printed out
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c -E -o foo.out
clang-13: error: cannot specify -o when generating multiple output files
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c --cuda-device-only -E -o foo.out
clang-13: error: cannot specify -o when generating multiple output files
% bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 --cuda-path=$HOME/local/cuda-10.2 zz.cu -c --cuda-device-only -S -o foo.out
clang-13: error: cannot specify -o when generating multiple output files

I think I've borrowed that behavior from some of the macOS-related functionality, so we do have a somewhat established model of how to handle multiple outputs. The question is what would make the most sense. In my experience, most such use cases are intended for manual examination of compiler output, and as such I'd prefer to keep the results immediately usable, without having to unbundle them. In such cases we're already changing command-line options, and adjusting them to produce the output from the specific sub-compilation I want is trivial. Having to unbundle things is more complicated, as the bundler/unbundler tool as it stands is poorly documented and not particularly user-friendly. If it is to become a user-facing tool like ar/nm/objdump, it would need some improvements.

If you do have use cases where you need to bundle intermediate results, are they for human consumption or for tooling? Perhaps we should make the "bundle the outputs" behavior controllable by a flag, and keep enforcing "one output only" as the default.
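For context on the unbundling step being discussed, splitting a bundled device object by hand looks roughly like this (a sketch; the exact flag spellings and target-triple strings vary between LLVM versions, and the triples and file names below are only examples):

```
# Split a bundled object produced by a HIP compilation back into its parts.
# -type=o tells the tool the inputs are object files; -unbundle splits
# instead of merging. The -targets list must match what the driver embedded.
clang-offload-bundler -type=o \
  -targets=host-x86_64-unknown-linux-gnu,hip-amdgcn-amd-amdhsa-gfx906 \
  -inputs=foo.o \
  -outputs=foo-host.o,foo-gfx906.o \
  -unbundle
```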
Comment Actions
We use ccache and need one output for -E with device compilation. There are also use cases that emit bitcode for device compilation and link it later. These use cases require the output to be bundled. If users want the unbundled output, they can use -save-temps. Is that sufficient?

Comment Actions
What will happen with this patch in the following scenarios:
I would expect the first case to produce a plain-text assembly file. With this patch, the second case will produce a bundle. With some build tools, users only add to the various compiler options provided by the system. Depending on whether those system-provided options include an --offload-arch, the format of the output in the first example becomes unstable. So the consistent way would be to always bundle everything, but that breaks (or at least complicates) the normal single-output case and makes it deviate from what users expect from a regular C++ compilation.

This is a good point. I don't think I've ever used ccache on a CUDA compilation, but I see how ccache may get surprised. Considering the scenario above, I think a better way to handle it would be to teach ccache about CUDA/HIP compilation. It's a similar situation to the support for split DWARF, where the compiler does something beyond the expected one-input-to-one-output transformation.
In terms of saving intermediate outputs, yes. In terms of usability, no. Sometimes I want one particular intermediate result saved with an exact filename (or piped to stdout), and saving a bunch of files and then picking one would be a pretty annoying usability regression for me.

Comment Actions
How about an option -fhip-bundle-device-output? If it is on, device output is bundled no matter how many GPU arches there are. By default it is on.

Comment Actions
+1 to the option, but I can't say I'm particularly happy about the default. I'd still prefer the default to be no bundling, plus an error in cases where we'd nominally produce multiple outputs. @jdoerfert: Do you have any thoughts on what would be a sensible default when a user uses -S -o foo.s for compilations that may produce multiple results? I think OpenMP may have to deal with similar issues. On one hand, it would be convenient for ccache to just work with CUDA/HIP compilation out of the box: the compiler always produces one output file, regardless of what it does under the hood, and ccache need not care what's in it. On the other hand, this behavior breaks user expectations; i.e., clang -S is supposed to produce assembly, not an opaque binary bundle blob.
Comment Actions
I chose to emit the bundled output by default since it is the convention for a compiler to have one output. The compilation is like a pipeline: if we break it into stages, users would expect to use the output from one stage as input to the next, which is possible only if there is one output. That is how host compilations and combined device/host compilations behave, and I would find it surprising if it were not true for device compilation. Also, when users do not want the output to be bundled, it is usually for debugging or other special purposes, and they then need to know the naming convention of the multiple outputs. I think it is justifiable to enable that behavior with an option.

Comment Actions
So in the end it's a trade-off between tools like ccache working out of the box vs. an additional option that would need to be used by users who do need a specific intermediate output. Now the question is how to make it work without breaking existing users. There are some tools that rely on clang --cuda-host-only and --cuda-device-only working as if it were a regular C++ compilation, e.g. godbolt.org. How about this: WDYT?

Comment Actions
--cuda-host-only always has one output, so there is no point in bundling it. We only need to decide the proper behavior of --cuda-device-only. How about keeping the original default behavior of not bundling when the user does not specify an output file, and bundling the output when the user does specify one, since specifying an output file indicates the user is requesting a single output? -f[no-]hip-bundle-device-output would override the default behavior.

Comment Actions
It still fits my proposal of requiring a single sub-compilation and not bundling the output.
I think it will make things worse. Compiler output should not change depending on whether -o is used.
I disagree. When the user specifies the output, the intent is to specify the location of the outputs, not their contents or format. Telling the compiler to produce a different output format should not depend on specifying (or not specifying) the output location. I think our options are:
I can see the benefit of always bundling for HIP, but I also believe that keeping things simple, consistent, and predictable is important. Considering that we're tinkering in a relatively obscure niche of the compiler, it probably does not matter all that much, but that should not stop us from trying to figure out the best approach in a principled way. I think we could benefit from a second opinion on which approach would make more sense for clang.

Comment Actions
How does nvcc --genco behave when there are multiple GPU arches? Does it output a fat binary containing multiple ISAs? Also, does it support device-only compilation for intermediate outputs?

Comment Actions
It does not allow multiple outputs for -ptx and -cubin compilations, same as clang behaves now:

$ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
nvcc fatal : Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures

NVCC does allow -E with multiple targets, but it produces output for only *one* of them. NVCC does bundle outputs for multiple GPU variants if -fatbin is used.

Comment Actions
I think for intermediate outputs, e.g. preprocessor expansion, IR, and assembly, it probably makes sense not to bundle by default. However, for the default action (emitting an object), we need to bundle by default, since that was the old behavior and existing HIP apps depend on it. Then we allow -fhip-bundle-device-output to override the default behavior.

Comment Actions
Agreed.
Existing use is a valid point. The final product of device-side sub-compilations is a bundle. The question is: what does "-c" mean? Is it "produce an object file" or "compile to the end of the pipeline"?
OK. Bundling objects for HIP by default looks like a reasonable compromise. Now that we are in agreement on what we want, the next question is *how* we want to do it. It appears that there's a fair bit of similarity between what the proposed -fgpu-bundle flag does and the handful of --emit-... options clang has now. Compilation with "-c" would remain "compile till the end", whatever that happens to mean for a particular language, and --emit-object/bundle would tell the compiler how far we want it to proceed and what kind of output we want. This would probably be easier to explain to users, as they are already familiar with flags like -emit-llvm; only now we are dealing with an extra bundling step in the compilation pipeline. It would also behave consistently across CUDA and HIP, even though they have different defaults for bundling in the device-side compilation. E.g., -c --cuda-device-only --emit-gpu-bundle will always produce a bundle with the object files for both CUDA and HIP, and -c --cuda-device-only --emit-gpu-object will always require a single '-o' output. WDYT? Does it make sense?

Comment Actions
For sure we will need -fgpu-bundle-device-output to control bundling of intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be redundant and can cause confusion. What if users specify -c -fgpu-bundle-device-output -emit-gpu-object or -c -fno-gpu-bundle-device-output -emit-gpu-bundle? To me a single option -fgpu-bundle-device-output controlling all device output seems cleaner.

Comment Actions
The idea is to use -emit-gpu-object and -emit-gpu-bundle instead of -f[no-]gpu-bundle-device-output; otherwise they'd do exactly the same thing. I think -emit-gpu-{object,bundle} has a minor edge over -f[no-]gpu-bundle-device-output, as it's similar to the other -emit options for controlling clang compilation phases (and that's what we want to do here), while -f options are usually for tweaking code generation.
Comment Actions
But how do we control emitting LLVM IR with or without a bundle? -emit-llvm -emit-gpu-object or -emit-llvm -emit-gpu-bundle? -emit-* is usually for specifying a specific file type.

Comment Actions
Hmm. I forgot that HIP can bundle things other than objects. -emit-llvm -emit-gpu-bundle looks reasonable, but -emit-llvm -emit-gpu-object is indeed odd.

Comment Actions
Use --gpu-bundle-output to control bundling/unbundling of the output of HIP device-only compilation.
The TableGen marshalling infrastructure (BoolFOption et al.) is only intended for flags that map to the -cc1 frontend and its CompilerInvocation class. (EmptyKPM, which disables this mapping, shouldn't even exist anymore.)
Since these flags only work on the driver level, use something like this instead:
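The concrete snippet the reviewer attached is not preserved here, but a driver-only flag of this kind is typically declared as a plain Flag record in Options.td without any marshalling annotations, i.e. something along these lines (a sketch of the general shape, not the exact code from the patch; the group and help text are assumptions):

```
def gpu_bundle_output : Flag<["--"], "gpu-bundle-output">, Group<f_Group>,
  HelpText<"Bundle output files of HIP device compilation">;
def no_gpu_bundle_output : Flag<["--"], "no-gpu-bundle-output">, Group<f_Group>,
  HelpText<"Do not bundle output files of HIP device compilation">;
```

The driver would then query the pair with something like Args.hasFlag(OPT_gpu_bundle_output, OPT_no_gpu_bundle_output, default), with no -cc1 round-trip involved.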