This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Only emit textual LLVM-IR in device only mode
ClosedPublic

Authored by jhuber6 on Jan 13 2023, 11:34 AM.

Details

Summary

Currently, we embed device code into the host image to perform
multi-architecture linking and handling of device code. If the user
specifies -S -emit-llvm, the embedded output will be textual
LLVM IR. This is a problem because it cannot be consumed by the LTO
backend and it makes the resulting file confusing to read.

This patch changes the behavior to only emit textual device IR in
device-only mode, that is, when the device code is presented
directly to the user rather than being embedded. Otherwise we
always embed device bitcode.
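As a sketch of the two modes described above (the clang flags are real offloading-driver spellings; the input file and offload architecture are hypothetical examples):

```shell
# Host-centric compile: the host module is textual IR, but with this patch
# the device code embedded inside it is bitcode rather than textual IR.
clang -fopenmp --offload-arch=gfx90a -S -emit-llvm foo.c -o foo.ll

# Device-only compile: the device IR is the top-level output, so it is
# emitted as textual IR, matching the usual meaning of -S -emit-llvm.
clang -fopenmp --offload-arch=gfx90a --offload-device-only -S -emit-llvm foo.c
```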

Diff Detail

Event Timeline

jhuber6 created this revision.Jan 13 2023, 11:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2023, 11:34 AM
jhuber6 requested review of this revision.Jan 13 2023, 11:34 AM
tra added a comment.Jan 13 2023, 12:10 PM

Textual output for "-S -emit-llvm" is the canonical behavior, so I would prefer it working that way in as many cases as possible and only override it when necessary.

Would it be possible to enforce binary IR generation in cases you need it? Or to prove that this is equivalent to what the patch does now?

jhuber6 added a comment.EditedJan 13 2023, 12:39 PM

> Textual output for "-S -emit-llvm" is the canonical behavior, so I would prefer it working that way in as many cases as possible and only override it when necessary.

> Would it be possible to enforce binary IR generation in cases you need it? Or to prove that this is equivalent to what the patch does now?

Well, you'll get textual output for the host, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR; it just won't contain some weird global like this:

@llvm.embedded.object = private constant [138032 x i8] c"\10\FF\10\AD\01\00\00\000\1B\02\00\00\00\00\00 \00\00\00\00\00\00\00(\00\00\00\00\00\00\00\02\00\01\00\00\00\00\00H\00\00\00\00\00\00\00\02\00\00\00\00\00\00\00\90\00\00\00\00\00\00\00\9D\1A\02\00\00\00\00\00n\00\00\00\00\00\00\00u\00\00\00\00\00\00\00i\00\00\00\00\00\00\00\87\00\00\00\00\00\00\00\00arch\00triple\00amdgcn-amd-amdhsa\00gfx90a\00\00\00; ModuleID = 'tl2.c'\0Asource_filename = \22tl2.c\22\0Atarget datalayout = \22e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7\22\0Atarget triple = \22amdgcn-amd-amdhsa\22\0A\0A%struct.ident_t = type { i32, i32, i32, i32, ptr }\0A%struct.DeviceEnvironmentTy = type { i32, i32, i32, i32 }\0A%\22struct.ompx::state::TeamStateTy\22 = type { %\22struct.ompx::state::ICVStateTy\22, i32, i32, ptr }\0A%\22struct.ompx::state::ICVStateTy\22 = type { i32, i32, i32, i32, i32, i32 }\0A%\22struct.(anonymous namespace)::SharedMemorySmartStackTy\22 = type { [512 x i8], [1024 x i8] }\0A%\22struct.ompx::state::ThreadStateTy\22 = type { %\22struct.ompx::state::ICVStateTy\22, ptr }\0A\0A@__omp_rtl_assume_teams_oversubscription = weak_odr hidden addrspace(1) constant i32 0\0A@__omp_rtl_assume_threads_oversubscription = weak_odr hidden addrspace(1) constant i32 0\0A@0 = private unnamed_addr constant [23 x i8] c\22;unknown;unknown;0;0;;\\00\22, align 1\0A@1 = private unnamed_addr addrspace(1) constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8\0A@__omp_offloading_16_5d6227e_thread_limit_l2_exec_mode = weak protected addrspace(1) constant i8 1\0A@__omp_offloading_16_5d6227e_thread_limit_l5_exec_mode = weak protected addrspace(1) constant i8 1\0A@__omp_offloading_16_5d6227e_thread_limit_l8_exec_mode = weak protected addrspace(1) constant i8 1\0A@llvm.compiler.used = appending addrspace(1) global [3 x ptr] [ptr addrspacecast (ptr addrspace(1) @__omp_offloading_16_5d6227e_thread_limit_l2_exec_mode to ptr), ptr addrspacecast (ptr addrspace(1) @__omp_offloading_16_5d6227e_

This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.
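For reference, the two forms are easy to tell apart mechanically: LLVM bitcode files start with the 4-byte magic "BC" 0xC0 0xDE, while textual IR is plain text. A minimal sketch (the file names and contents here are fabricated stand-ins, not output from this patch):

```shell
# LLVM bitcode begins with the magic bytes "BC" 0xC0 0xDE (octal \300\336);
# textual IR is plain text whose modules typically start with "; ModuleID".
printf 'BC\300\336' > /tmp/sample.bc
printf '; ModuleID = %s\n' "'tl2.c'" > /tmp/sample.ll

# Check only the leading "BC" pair, which is enough to distinguish the two.
is_bitcode() { [ "$(head -c 2 "$1")" = "BC" ]; }

if is_bitcode /tmp/sample.bc; then echo "sample.bc: bitcode"; fi
if ! is_bitcode /tmp/sample.ll; then echo "sample.ll: textual IR"; fi
```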

tra added a comment.Jan 13 2023, 2:04 PM

> Well you'll get textual output for the host output, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR. It just won't be some weird global like this

> This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.

I understand your use case and I do agree that you do need the embedded IR to be binary.

It appears to me that the real problem here is that "-S" should not have been propagated to the GPU sub-compilation.

The fact that producing binary IR works for you seems to be just a coincidence. What if the user specified "-E" instead? I think the right fix would be to tell the compiler to generate the right kind of output, not to tell it to produce textual assembly and then undo the "textual" part somewhere downstream.

>> Well you'll get textual output for the host output, but the device code embedded in the host module will be bitcode instead. So the final output from the compiler is still textual IR. It just won't be some weird global like this

>> This is bad because it can't be handled by LTO or anything else. It makes the resulting IR file difficult to use for its intended purpose.

> I understand your use case and I do agree that you do need the embedded IR to be binary.

> It appears to me that the real problem here is that "-S" should not have been propagated to the GPU sub-compilation.

> The fact that producing binary IR works for you seems to be just a coincidence. What if the user would specify "-E" instead? I think the right fix would be to tell compiler to generate the right kind of output, not try to tell it to produce textual assembly and undo the "textual" part somewhere downstream.

For -E we don't embed anything, so I think we just ignore the device portion unless they used --offload-device-only. And I don't think we have access to any "per-toolchain" options at this point in the compilation phase.

tra added a comment.Jan 13 2023, 3:46 PM

> For -E we don't embed anything,

That was just an exaggerated example of top-level options affecting sub-compilation output where you can't magically tweak it to produce the kind of output your sub-compilation needs.

The fundamental problem I have is that the compiler should not magically fix what the user has specified. "-S -emit-llvm" has a very specific meaning, and it's not "produce binary IR". However, when we're dealing with 'interesting' compilation pipelines like CUDA, things get complicated, as in your example. Presumably the user wants the compiler to produce textual IR for the "top-level" compilation. In the case of CUDA, that is the host-side IR (with an embedded GPU binary, which should still be a *binary*). The fact that in your case that binary happens to be binary LLVM IR is an implementation detail; without the new driver it would be a real GPU fatbin. Hence my argument that what we want is to avoid passing "-S -emit-llvm" (and potentially other output-altering options) to the GPU sub-compilation, which would work regardless of whether the GPU binary is a fatbin or LLVM bitcode.
For the compilation with --cuda-device-only, GPU compilation is the top-level compilation and it will get all the right options to make it produce textual IR and stop after that.

Does it make sense?
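For illustration, the device-only flow described above might look like this (CUDA spelling; the file name and architecture are hypothetical examples):

```shell
# With --cuda-device-only, GPU compilation is the top-level job, so
# -S -emit-llvm applies to it directly and textual IR is the natural output.
clang -x cuda --cuda-device-only --offload-arch=sm_70 -S -emit-llvm foo.cu -o foo-sm_70.ll
```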

>> For -E we don't embed anything,

> That was just an exaggerated example of top-level options affecting sub-compilation output where you can't magically tweak it to produce the kind of output your sub-compilation needs.

> The fundamental problem I have is that compiler should not magically fix what the user has specified. "-S -emit-llvm" has very specific meaning and it's not "produce binary IR". However, when we're dealing with 'interesting' compilation pipelines like CUDA, things get complicated, as in your example. Presumably the user wants the compiler to produce the textual IR for the "top-level" compilation. In the case of CUDA it would be host-side IR (with embedded GPU binary, which should still be a *binary*). The fact that in your case that binary happens to be binary LLVM IR is an implementation detail. Without the new driver it should be the real GPU fatbin. Hence my argument that what we want is to avoid passing "-S -emit-llvm" (and potentially other output-altering options) to the GPU sub-compilation, which would work regardless of whether the GPU binary is a fatbin or LLVM bitcode.
> For the compilation with --cuda-device-only, GPU compilation is the top-level compilation and it will get all the right options to make it produce textual IR and stop after that.

> Does it make sense?

So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

tra added a comment.Jan 13 2023, 5:16 PM

> So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

That would match my expectations and would solve the "it can't be handled by LTO or anything else" issue you've highlighted above. It will give us valid output in the form the user specified, i.e. you will be able to pass the IR further down the compilation pipeline. If we get it for the host compilation, it should be the same IR we'd have captured from -save-temps.

>> So are you suggesting that we complete the whole pipeline? So -S -emit-llvm gives host IR, but the device will go all the way to object?

> That would match my expectations and would solve the "it can't be handled by LTO or anything else." issue you've highlighted above. It will give us a valid output in the form user specified. I.e. you will be able to pass the IR further down the compilation pipeline. I.e. if we get it for the host compilation it should be the same IR we'd have captured if we'd get it from -save-temps.

Might work, I'll see how difficult it is to override the phases.

I made the phases always go to Assemble but it didn't make a difference. We still get the textual IR here without the exception I added.

The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.
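A sketch of that extract-and-rebundle workflow with clang-offload-bundler (the bundle entry triple and file names are illustrative; check the tool's help for the exact spellings in your toolchain version):

```shell
# Extract the gfx90a entry from the bundled textual-IR output so it can be
# edited directly; the modified file can later be rebundled and passed on.
clang-offload-bundler -type=ll -unbundle \
  -input=foo.ll \
  -targets=hip-amdgcn-amd-amdhsa--gfx90a \
  -output=foo-gfx90a.ll
```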

> The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.

Yeah, this is part of where I'm questioning how we want to treat the compilation pipeline when we perform offloading. When I set up the new driver support, I decided to eschew the idea of perfectly bundled phases and instead decided that if users did not want the full pipeline they should use --offload-device-only and build it themselves. In that vein, I'd prefer that we always generate the full (to Assembler) phase and avoid textual formats, since the output is supposed to be embedded as a binary blob.

Ping, is it possible to resolve this before the fork tomorrow?

> The intention of -emit-llvm -S is usually to get LLVM assembly for all targets for inspection or modification. HIP emits a bundled LLVM assembly in textual format in this case. Users can modify it directly, or extract assembly for each device and bundle them together again. The modified file can then be passed on to the next compilation stage.

If you already unbundle, manually or automatically, you can also run llvm-dis. Embedded text is really not that helpful, IMHO.
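That is, once the device entry has been extracted, a bitcode form remains just as inspectable (file names here are hypothetical):

```shell
# Disassemble extracted device bitcode back to textual IR for inspection.
llvm-dis device.bc -o device.ll
```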

yaxunl added a comment.EditedJan 23 2023, 1:34 PM

Can we keep the original behaviour for the old driver for HIP? Only enable the change for the new driver.

> Can we keep the original behaviour for the old driver for HIP? Only enable the change for the new driver.

That's probably fair, because AFAIK that path will still use the clang-offload-bundler, which shows both of them side-by-side. I'll need to check the behavior there.

jhuber6 updated this revision to Diff 491866.Jan 24 2023, 11:50 AM

Exempting HIP using the old driver.

This revision is now accepted and ready to land.Jan 24 2023, 12:04 PM
This revision was landed with ongoing or failed builds.Jan 24 2023, 1:11 PM
This revision was automatically updated to reflect the committed changes.