This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Only emit textual LLVM-IR in device only mode
ClosedPublic

Authored by jhuber6 on Jan 13 2023, 11:34 AM.

Details

Summary

Currently, we embed device code into the host image to perform
multi-architecture linking and handling of device code. If the user
specifies -S -emit-llvm, the embedded output will be textual
LLVM IR. This is a problem because it cannot be consumed by the LTO
backend and it makes the resulting file confusing to read.

This patch changes the behavior to only emit textual device IR in
device-only mode, that is, when the device code is presented
directly to the user rather than being embedded. Otherwise we
always embed device bitcode.
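As a sketch of the two modes described above (the clang flags are real offloading-driver spellings; the input file and offload architecture are hypothetical examples):

```shell
# Host-centric compile: the host module is textual IR, but with this patch
# the device code embedded inside it is bitcode rather than textual IR.
clang -fopenmp --offload-arch=gfx90a -S -emit-llvm foo.c -o foo.ll

# Device-only compile: the device IR is the top-level output, so it is
# emitted as textual IR, matching the usual meaning of -S -emit-llvm.
clang -fopenmp --offload-arch=gfx90a --offload-device-only -S -emit-llvm foo.c
```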

Diff Detail

Event Timeline

jhuber6 created this revision.Jan 13 2023, 11:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2023, 11:34 AM
jhuber6 requested review of this revision.Jan 13 2023, 11:34 AM
tra added a comment.Jan 13 2023, 12:10 PM

Textual output for "-S -emit-llvm" is the canonical behavior, so I would prefer it working that way in as many cases as possible and only override it when necessary.

Would it be possible to enforce binary IR generation in cases you need it? Or to prove that this is equivalent to what the patch does now?

jhuber6 added a comment.EditedJan 13 2023, 12:39 PM

> Textual output for "-S -emit-llvm" is the canonical behavior, so I would prefer it working that way in as many cases as possible and only override it when necessary.

> Would it be possible to enforce binary IR generation in cases you need it? Or to prove that this is equivalent to what the patch does now?

Well, you'll get textual output for the host, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR; it just won't contain some weird global like this:

@llvm.embedded.object = private constant [138032 x i8] c"\10\FF\10\AD\01\00\00\000\1B\02\00\00\00\00\00 \00\00\00\00\00\00\00(\00\00\00\00\00\00\00\02\00\01\00\00\00\00\00H\00\00\00\00\00\00\00\02\00\00\00\00\00\00\00\90\00\00\00\00\00\00\00\9D\1A\02\00\00\00\00\00n\00\00\00\00\00\00\00u\00\00\00\00\00\00\00i\00\00\00\00\00\00\00\87\00\00\00\00\00\00\00\00arch\00triple\00amdgcn-amd-amdhsa\00gfx90a\00\00\00; ModuleID = 'tl2.c'\0Asource_filename = \22tl2.c\22\0Atarget datalayout = \22e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7\22\0Atarget triple = \22amdgcn-amd-amdhsa\22\0A\0A%struct.ident_t = type { i32, i32, i32, i32, ptr }\0A%struct.DeviceEnvironmentTy = type { i32, i32, i32, i32 }\0A%\22struct.ompx::state::TeamStateTy\22 = type { %\22struct.ompx::state::ICVStateTy\22, i32, i32, ptr }\0A%\22struct.ompx::state::ICVStateTy\22 = type { i32, i32, i32, i32, i32, i32 }\0A%\22struct.(anonymous namespace)::SharedMemorySmartStackTy\22 = type { [512 x i8], [1024 x i8] }\0A%\22struct.ompx::state::ThreadStateTy\22 = type { %\22struct.ompx::state::ICVStateTy\22, ptr }\0A\0A@__omp_rtl_assume_teams_oversubscription = weak_odr hidden addrspace(1) constant i32 0\0A@__omp_rtl_assume_threads_oversubscription = weak_odr hidden addrspace(1) constant i32 0\0A@0 = private unnamed_addr constant [23 x i8] c\22;unknown;unknown;0;0;;\\00\22, align 1\0A@1 = private unnamed_addr addrspace(1) constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8\0A@__omp_offloading_16_5d6227e_thread_limit_l2_exec_mode = weak protected addrspace(1) constant i8 1\0A@__omp_offloading_16_5d6227e_thread_limit_l5_exec_mode = weak protected addrspace(1) constant i8 1\0A@__omp_offloading_16_5d6227e_thread_limit_l8_exec_mode = weak protected addrspace(1) constant i8 1\0A@llvm.compiler.used = appending addrspace(1) global [3 x ptr] [ptr addrspacecast (ptr addrspace(1) @__omp_offloading_16_5d6227e_thread_limit_l2_exec_mode to ptr), ptr addrspacecast (ptr addrspace(1) @__omp_offloading_16_5d6227e_

This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.
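For reference, the two forms are easy to tell apart mechanically: LLVM bitcode files start with the 4-byte magic "BC" 0xC0 0xDE, while textual IR is plain text. A minimal sketch (the file names and contents here are fabricated stand-ins, not output from this patch):

```shell
# LLVM bitcode begins with the magic bytes "BC" 0xC0 0xDE (octal \300\336);
# textual IR is plain text whose modules typically start with "; ModuleID".
printf 'BC\300\336' > /tmp/sample.bc
printf '; ModuleID = %s\n' "'tl2.c'" > /tmp/sample.ll

# Check only the leading "BC" pair, which is enough to distinguish the two.
is_bitcode() { [ "$(head -c 2 "$1")" = "BC" ]; }

if is_bitcode /tmp/sample.bc; then echo "sample.bc: bitcode"; fi
if ! is_bitcode /tmp/sample.ll; then echo "sample.ll: textual IR"; fi
```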

tra added a comment.Jan 13 2023, 2:04 PM

> Well you'll get textual output for the host output, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR. It just won't be some weird global like this

> This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.

I understand your use case and I do agree that you do need the embedded IR to be binary.

It appears to me that the real problem here is that "-S" should not have been propagated to the GPU sub-compilation.

The fact that producing binary IR works for you seems to be just a coincidence. What if the user specified "-E" instead? I think the right fix would be to tell the compiler to generate the right kind of output, not to tell it to produce textual assembly and then undo the "textual" part somewhere downstream.

>> Well you'll get textual output for the host output, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR. It just won't be some weird global like this

>> This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.

> I understand your use case and I do agree that you do need the embedded IR to be binary.

> It appears to me that the real problem here is that "-S" should not have been propagated to the GPU sub-compilation.

> The fact that producing binary IR works for you seems to be just a coincidence. What if the user would specify "-E" instead? I think the right fix would be to tell compiler to generate the right kind of output, not try to tell it to produce textual assembly and undo the "textual" part somewhere downstream.

For -E we don't embed anything, so I think we just ignore the device portion unless they used --offload-device-only. And I don't think we have access to any "per-toolchain" options at this point in the compilation phase.

tra added a comment.Jan 13 2023, 3:46 PM

> For -E we don't embed anything,

That was just an exaggerated example of top-level options affecting sub-compilation output where you can't magically tweak it to produce the kind of output your sub-compilation needs.

The fundamental problem I have is that the compiler should not magically fix what the user has specified. "-S -emit-llvm" has a very specific meaning, and it's not "produce binary IR". However, when we're dealing with 'interesting' compilation pipelines like CUDA, things get complicated, as in your example. Presumably the user wants the compiler to produce textual IR for the "top-level" compilation. In the case of CUDA, that is the host-side IR (with an embedded GPU binary, which should still be a *binary*). The fact that in your case that binary happens to be binary LLVM IR is an implementation detail; without the new driver it would be a real GPU fatbin. Hence my argument that what we want is to avoid passing "-S -emit-llvm" (and potentially other output-altering options) to the GPU sub-compilation, which would work regardless of whether the GPU binary is a fatbin or LLVM bitcode.
For the compilation with --cuda-device-only, GPU compilation is the top-level compilation and it will get all the right options to make it produce textual IR and stop after that.

Does it make sense?
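For illustration, the device-only flow described above might look like this (CUDA spelling; the file name and architecture are hypothetical examples):

```shell
# With --cuda-device-only, GPU compilation is the top-level job, so
# -S -emit-llvm applies to it directly and textual IR is the natural output.
clang -x cuda --cuda-device-only --offload-arch=sm_70 -S -emit-llvm foo.cu -o foo-sm_70.ll
```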

>> For -E we don't embed anything,

> That was just an exaggerated example of top-level options affecting sub-compilation output where you can't magically tweak it to produce the kind of output your sub-compilation needs.

> The fundamental problem I have is that compiler should not magically fix what the user has specified. "-S -emit-llvm" has very specific meaning and it's not "produce binary IR". However, when we're dealing with 'interesting' compilation pipelines like CUDA, things get complicated, as in your example. Presumably the user wants the compiler to produce the textual IR for the "top-level" compilation. In the case of CUDA it would be host-side IR (with embedded GPU binary, which should still be a *binary*). The fact that in your case that binary happens to be binary LLVM IR is an implementation detail. Without the new driver it should be the real GPU fatbin. Hence my argument that what we want is to avoid passing "-S -emit-llvm" (and potentially other output-altering options) to the GPU sub-compilation, which would work regardless of whether the GPU binary is a fatbin or LLVM bitcode.
> For the compilation with --cuda-device-only, GPU compilation is the top-level compilation and it will get all the right options to make it produce textual IR and stop after that.

> Does it make sense?

So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

tra added a comment.Jan 13 2023, 5:16 PM

> So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

That would match my expectations and would solve the "it can't be handled by LTO or anything else" issue you've highlighted above. It will give us valid output in the form the user specified, i.e. you will be able to pass the IR further down the compilation pipeline. If we get it for the host compilation, it should be the same IR we'd have captured from -save-temps.

>> So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

> That would match my expectations and would solve the "it can't be handled by LTO or anything else." issue you've highlighted above. It will give us a valid output in the form user specified. I.e. you will be able to pass the IR further down the compilation pipeline. I.e. if we get it for the host compilation it should be the same IR we'd have captured if we'd get it from -save-temps.

Might work, I'll see how difficult it is to override the phases.

I made the phases always go to Assemble but it didn't make a difference. We still get the textual IR here without the exception I added.

The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.
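A sketch of that extract-and-rebundle workflow with clang-offload-bundler (the bundle entry triple and file names are illustrative; check the tool's help for the exact spellings in your toolchain version):

```shell
# Extract the gfx90a entry from the bundled textual-IR output so it can be
# edited directly; the modified file can later be rebundled and passed on.
clang-offload-bundler -type=ll -unbundle \
  -input=foo.ll \
  -targets=hip-amdgcn-amd-amdhsa--gfx90a \
  -output=foo-gfx90a.ll
```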

> The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.

Yeah, this is part of where I'm questioning how we want to treat the compilation pipeline when we perform offloading. When I set up the new driver support, I decided to eschew the idea of perfectly bundled phases and instead decided that if users did not want the full pipeline they should use --offload-device-only and build it themselves. In that vein, I'd prefer that we always generate the full (to Assembler) phase and avoid textual formats, since the output is supposed to be embedded as a binary blob.

Ping, is it possible to resolve this before the fork tomorrow?

> The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.

If you already unbundle, manually or automatically, you can also run llvm-dis. Embedded text is really not that helpful, IMHO.
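That is, once the device entry has been extracted, a bitcode form remains just as inspectable (file names here are hypothetical):

```shell
# Disassemble extracted device bitcode back to textual IR for inspection.
llvm-dis device.bc -o device.ll
```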

yaxunl added a comment.EditedJan 23 2023, 1:34 PM

Can we keep the original behaviour for the old driver for HIP? Only enable the change for the new driver.

> Can we keep the original behaviour for the old driver for HIP? Only enable the change for the new driver.

That's probably fair, because AFAIK that path will still use the clang-offload-bundler, which shows both of them side-by-side. I'll need to check the behavior there.

jhuber6 updated this revision to Diff 491866.Jan 24 2023, 11:50 AM

Exempting HIP using the old driver.

This revision is now accepted and ready to land.Jan 24 2023, 12:04 PM
This revision was landed with ongoing or failed builds.Jan 24 2023, 1:11 PM
This revision was automatically updated to reflect the committed changes.