This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][OpenMP] Fix the new driver crashing on multiple device-only outputs
ClosedPublic

Authored by jhuber6 on Aug 19 2022, 8:57 AM.

Details

Summary

The new driver supports device-only compilation for the offloading
device. The way this is handled is a little different from the old
offloading driver. The old driver would put all the outputs in the final
action list, akin to a linker job. The new driver, however, generates
these in the middle of the host's job, so we instead put them all in a
single offloading action. However, we only handled these kinds of
offloading actions correctly when there was a single input. With
multiple inputs we would instead attempt to get the host job, which
didn't exist, and crash.

This patch adds some extra logic to generate the jobs for all
dependencies if there is no host action.
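As a minimal sketch, an invocation of the shape that previously crashed (flag spellings assumed from clang's offloading options; file and architecture names are illustrative):

clang++ -x cuda axpy.cu --offload-new-driver --cuda-device-only --offload-arch=sm_70 --offload-arch=sm_80 -c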

Diff Detail

Event Timeline

jhuber6 created this revision.Aug 19 2022, 8:57 AM
Herald added a project: Restricted Project.Aug 19 2022, 8:57 AM
jhuber6 requested review of this revision.Aug 19 2022, 8:57 AM
jhuber6 updated this revision to Diff 454036.Aug 19 2022, 9:02 AM

Forgot to use the new driver in the test.

tra added a comment.Aug 19 2022, 10:43 AM

> The old driver would put all the outputs in the final action list akin to a linker job.

IIRC that's where HIP and CUDA behaved differently. CUDA compilation does not allow device-only compilation for multiple targets if an output has been explicitly specified. It does produce individual per-GPU .o files if compiled without -o.

bin/clang++ --cuda-path=$HOME/local/cuda-11.7 --offload-arch=sm_80 --offload-arch=sm_86 -x cuda axpy.cu  --cuda-device-only -O3  -c -o axpy.o
clang-15: error: cannot specify -o when generating multiple output files
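
For comparison, the same compile without -o succeeds, producing the per-GPU object files described above (output file names are driver-dependent):

bin/clang++ --cuda-path=$HOME/local/cuda-11.7 --offload-arch=sm_80 --offload-arch=sm_86 -x cuda axpy.cu --cuda-device-only -O3 -c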
clang/test/Driver/cuda-bindings.cu
160

If we've specified -o foo.o, where do those multiple outputs go? The old driver disallowed using -o when compiling for multiple GPUs.

jhuber6 added a comment.

> CUDA compilation does not allow device-only compilation for multiple targets if an output has been explicitly specified.

Is this an architectural limitation? I'd imagine they'd just behave the same way here in my implementation.

clang/test/Driver/cuda-bindings.cu
160

Good catch; right now it'll just write both of them to the same file.

tra added a comment.Aug 19 2022, 11:45 AM

> Is this an architectural limitation? I'd imagine they'd just behave the same way here in my implementation.

The constraint here is that we have to stick with a single output per compiler invocation, and the format of that output should be consistent. E.g. for C++ we'd expect to see an ELF file when we compile with -c, and text assembly when we compile with -S.

We could pack GPU objects into a fat binary, but for consistency it would have to be done for single-target compilations, too. Packing a single object into a fat binary would make little sense, but producing an object file or a fat binary depending on the number of targets would be inconsistent.
Similarly, compilation with -S gets tricky -- do you bundle the text output? That would not be particularly useful, as presumably one would want to examine the assembly. We could concatenate the ASM files, but that would produce an assembly source we can't actually assemble.
On top of that, CUDA compilation has been around for a while, and changing the output format would be somewhat disruptive.

In the end, CUDA stuck with erroring out when -o has been specified but multiple outputs would need to be produced.
HIP grew a --[no-]gpu-bundle-output option to control whether to bundle the outputs of device-only compilation.
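
As an illustrative sketch of the HIP side (the --gpu-bundle-output option is the one named above; the rest of the invocation is assumed), a bundled device-only compile looks roughly like:

clang++ -x hip axpy.hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only --gpu-bundle-output -c -o axpy.o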

jhuber6 added a comment.

Thanks for the background. I'm assuming HIP did this because they use the old clang-offload-bundler, which supports bundling multiple file types, while my new method relies on having some LLVM-IR to embed things in. I wasn't a huge fan of outputting bundles because it meant we couldn't do things like clang -o - | opt or similar. For my implementation I will probably make HIP do what CUDA does, as I feel that is more reasonable, unless someone has a major objection.
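
For example, the kind of pipeline that bundled outputs rule out but single-file outputs allow (a sketch; flags assumed from the discussion above, with a single target so only one output is produced):

clang++ -x cuda axpy.cu --offload-new-driver --cuda-device-only --offload-arch=sm_80 -S -emit-llvm -o - | opt -S -O2 -o -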

jhuber6 updated this revision to Diff 454076.Aug 19 2022, 11:59 AM

Updating to error with -o and multiple files.

tra added a comment.Aug 19 2022, 2:18 PM

I'm OK with that.

@yaxunl -- what are your thoughts on whether this approach would work for HIP? On one hand HIP already has a lot of features that the new driver is intended to provide, so AMD may have no pressure to change to something else. On the other hand, long term it would make sense to unify the driver pipeline across the different offloading mechanisms we have now.

yaxunl added a comment.

I am OK as long as it works for HIP and does not break the old driver.

clang/test/Driver/cuda-bindings.cu
154

Can you add similar tests for HIP? Thanks.

jhuber6 updated this revision to Diff 454839.Aug 23 2022, 7:32 AM

Adding HIP test

yaxunl accepted this revision.Aug 23 2022, 7:51 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Aug 23 2022, 7:51 AM