This is an archive of the discontinued LLVM Phabricator instance.

[lld][LTO] Add assembly output to LTO save-temps
AbandonedPublic

Authored by Pierre-vh on Nov 23 2022, 3:52 AM.

Details

Summary

When using -save-temps in Clang, we usually get a .s assembly file alongside the .o output.
However, when using LTO, Clang no longer produces those .s files because it "delegates" the responsibility of writing those temporaries to LLD, but LLD doesn't support writing out a .s assembly file alongside the final .o output with LTO enabled.

This patch adds a new "asm" save-temps value and adds the necessary logic to the LTO library to allow it to write out an assembly file alongside an object file when requested.

To implement this, we use the same trick as Clang: Codegen twice but with a different filetype for the second run.
As this is a debug feature, the performance shouldn't matter much

Diff Detail

Event Timeline

Pierre-vh created this revision.Nov 23 2022, 3:52 AM
Herald added a project: Restricted Project. · View Herald Transcript
Pierre-vh requested review of this revision.Nov 23 2022, 3:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 23 2022, 3:52 AM
Pierre-vh updated this revision to Diff 477708.Nov 24 2022, 2:06 AM

Cleanup diff so it can be reviewed

Pierre-vh retitled this revision from [WIP] Add assembly output to LTO save-temps to [lld][LTO] Add assembly output to LTO save-temps.Nov 24 2022, 2:10 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh added reviewers: fhahn, MatzeB, Northbadge.
MaskRay added 1 blocking reviewer(s): MaskRay.Nov 24 2022, 8:43 PM
MaskRay added a comment.EditedNov 24 2022, 8:50 PM

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

Pierre-vh added a comment.EditedNov 25 2022, 12:19 AM

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

With your suggestion, does it also generate the final, linked output file? Or just the .S?
The idea is to be able to get both alongside each other. We'd also like to have comments in the assembly file about things like register usage - the ones you get when you codegen to assembly, not when you disassemble.

Though, I am surprised that you see this as adding significant complexity, I tried to not make it too intrusive. Is it the whole change that's problematic, or just some part of it?
If yes, maybe I can re-do the parts you find problematic to simplify them. The way I see it, this kind of patch should just be a small feature addition that shouldn't increase complexity too much. If you think complexity has increased significantly then I definitely failed somewhere and I can try to correct it before discarding the idea entirely.

yaxunl added a subscriber: yaxunl.Nov 25 2022, 8:31 AM

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

We use lld to do device LTO in the HIP toolchain. clang driver extracts device bitcode and passes them to lld. When -save-temps is used, clang driver adds -save-temps to the options passed to lld. It is not convenient for the users to rerun lld command to get the assembly. It is better for lld to generate the assembly when -save-temps is used because: 1. it avoids redundant work, especially device LTO is usually the most time-consuming part 2. to keep the temporary file names consistent 3. it is a useful feature for lld itself

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

We use lld to do device LTO in the HIP toolchain. clang driver extracts device bitcode and passes them to lld. When -save-temps is used, clang driver adds -save-temps to the options passed to lld. It is not convenient for the users to rerun lld command to get the assembly. It is better for lld to generate the assembly when -save-temps is used because: 1. it avoids redundant work, especially device LTO is usually the most time-consuming part 2. to keep the temporary file names consistent 3. it is a useful feature for lld itself

clang --save-temps does not pass --save-temps to the linker. The user can use -Wl,--save-temps which is more orthogonal. I can use -Wl,--lto-emit-asm today to get assembly output.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

With your suggestion, does it also generate the final, linked output file? Or just the .S?
The idea is to be able to get both alongside each other. We'd also like to have comments in the assembly file about things like register usage - the ones you get when you codegen to assembly, not when you disassemble.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

Though, I am surprised that you see this as adding significant complexity, I tried to not make it too intrusive. Is it the whole change that's problematic, or just some part of it?
If yes, maybe I can re-do the parts you find problematic to simplify them. The way I see it, this kind of patch should just be a small feature addition that shouldn't increase complexity too much. If you think complexity has increased significantly then I definitely failed somewhere and I can try to correct it before discarding the idea entirely.

This is significant complexity because you do codegen twice // Doing codegen twice may seem inefficient, but this is designed as a debug , which you probably would agree as well that this is inelegant.

Well, the real question is when --lto-emit-asm exists, why bother with another mode and duplicating the existing functionality. So far I fail to see a convincing argument that this new mode is useful.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

We use lld to do device LTO in the HIP toolchain. clang driver extracts device bitcode and passes them to lld. When -save-temps is used, clang driver adds -save-temps to the options passed to lld. It is not convenient for the users to rerun lld command to get the assembly. It is better for lld to generate the assembly when -save-temps is used because: 1. it avoids redundant work, especially device LTO is usually the most time-consuming part 2. to keep the temporary file names consistent 3. it is a useful feature for lld itself

clang --save-temps does not pass --save-temps to the linker. The user can use -Wl,--save-temps which is more orthogonal. I can use -Wl,--lto-emit-asm today to get assembly output.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

With your suggestion, does it also generate the final, linked output file? Or just the .S?
The idea is to be able to get both alongside each other. We'd also like to have comments in the assembly file about things like register usage - the ones you get when you codegen to assembly, not when you disassemble.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

Though, I am surprised that you see this as adding significant complexity, I tried to not make it too intrusive. Is it the whole change that's problematic, or just some part of it?
If yes, maybe I can re-do the parts you find problematic to simplify them. The way I see it, this kind of patch should just be a small feature addition that shouldn't increase complexity too much. If you think complexity has increased significantly then I definitely failed somewhere and I can try to correct it before discarding the idea entirely.

This is significant complexity because you do codegen twice // Doing codegen twice may seem inefficient, but this is designed as a debug , which you probably would agree as well that this is inelegant.

Well, the real question is when --lto-emit-asm exists, why bother with another mode and duplicating the existing functionality. So far I fail to see a convincing argument that this new mode is useful.

It does pass it for HIP code when using -fgpu-rdc, see HIPAMD.cpp:151. This would be the primary use case because when building the same HIP code without -fgpu-rdc, we get a .S temporary, but if we pass fgpu-rdc, that temporary is no longer there.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

This is not about build systems at all, it's for debuggability. An empty file is not useful to us, we need the assembly with the comments.

This is significant complexity because you do codegen twice // Doing codegen twice may seem inefficient, but this is designed as a debug , which you probably would agree as well that this is inelegant.

I know it's inefficient but:

  1. Clang already uses the same "trick" to get a .S temporary. It codegens twice. Performance matters less for debug cases so it should be fine.
  2. There is no alternative that I'm aware of. LLVM either emits a .o, a .s, but not both unlike GCC which emits a .s that's then assembled into the .o

I'm also unsure I'd call this "significant complexity", as long as documentation is good (if it's lacking, please tell me where) the idea itself is simple IMO

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

We use lld to do device LTO in the HIP toolchain. clang driver extracts device bitcode and passes them to lld. When -save-temps is used, clang driver adds -save-temps to the options passed to lld. It is not convenient for the users to rerun lld command to get the assembly. It is better for lld to generate the assembly when -save-temps is used because: 1. it avoids redundant work, especially device LTO is usually the most time-consuming part 2. to keep the temporary file names consistent 3. it is a useful feature for lld itself

clang --save-temps does not pass --save-temps to the linker. The user can use -Wl,--save-temps which is more orthogonal. I can use -Wl,--lto-emit-asm today to get assembly output.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

With your suggestion, does it also generate the final, linked output file? Or just the .S?
The idea is to be able to get both alongside each other. We'd also like to have comments in the assembly file about things like register usage - the ones you get when you codegen to assembly, not when you disassemble.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

Though, I am surprised that you see this as adding significant complexity, I tried to not make it too intrusive. Is it the whole change that's problematic, or just some part of it?
If yes, maybe I can re-do the parts you find problematic to simplify them. The way I see it, this kind of patch should just be a small feature addition that shouldn't increase complexity too much. If you think complexity has increased significantly then I definitely failed somewhere and I can try to correct it before discarding the idea entirely.

This is significant complexity because you do codegen twice // Doing codegen twice may seem inefficient, but this is designed as a debug , which you probably would agree as well that this is inelegant.

Well, the real question is when --lto-emit-asm exists, why bother with another mode and duplicating the existing functionality. So far I fail to see a convincing argument that this new mode is useful.

It does pass it for HIP code when using -fgpu-rdc, see HIPAMD.cpp:151. This would be the primary use case because when building the same HIP code without -fgpu-rdc, we get a .S temporary, but if we pass fgpu-rdc, that temporary is no longer there.

It seems that the logic in HIPAMD.cpp:151 can simply be removed.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

This is not about build systems at all, it's for debuggability. An empty file is not useful to us, we need the assembly with the comments.

I think you misunderstand my comment. The assembly output of --lto-emit-asm is of course not empty. The object file output is empty which is fine for your case.
If you want the object file output as well, re-run LTO without --lto-emit-asm. It's similar to what your patch intends to do, but places less burden to libLTO.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

We use lld to do device LTO in the HIP toolchain. clang driver extracts device bitcode and passes them to lld. When -save-temps is used, clang driver adds -save-temps to the options passed to lld. It is not convenient for the users to rerun lld command to get the assembly. It is better for lld to generate the assembly when -save-temps is used because: 1. it avoids redundant work, especially device LTO is usually the most time-consuming part 2. to keep the temporary file names consistent 3. it is a useful feature for lld itself

clang --save-temps does not pass --save-temps to the linker. The user can use -Wl,--save-temps which is more orthogonal. I can use -Wl,--lto-emit-asm today to get assembly output.

I am unsure I am convinced with the motivation. Say you have two bitcode files a.o and b.o, can you just use clang++ -fuse-ld=lld a.o b.o -Wl,--lto-emit-asm? This can be used with -Wl,--save-temps as well. The output is named in term of the -o output file name, instead of *.s, but the slight difference doesn't warrant adding significant more complexity to LTOBackend.cpp as this patch does.

With your suggestion, does it also generate the final, linked output file? Or just the .S?
The idea is to be able to get both alongside each other. We'd also like to have comments in the assembly file about things like register usage - the ones you get when you codegen to assembly, not when you disassemble.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

Though, I am surprised that you see this as adding significant complexity, I tried to not make it too intrusive. Is it the whole change that's problematic, or just some part of it?
If yes, maybe I can re-do the parts you find problematic to simplify them. The way I see it, this kind of patch should just be a small feature addition that shouldn't increase complexity too much. If you think complexity has increased significantly then I definitely failed somewhere and I can try to correct it before discarding the idea entirely.

This is significant complexity because you do codegen twice // Doing codegen twice may seem inefficient, but this is designed as a debug , which you probably would agree as well that this is inelegant.

Well, the real question is when --lto-emit-asm exists, why bother with another mode and duplicating the existing functionality. So far I fail to see a convincing argument that this new mode is useful.

It does pass it for HIP code when using -fgpu-rdc, see HIPAMD.cpp:151. This would be the primary use case because when building the same HIP code without -fgpu-rdc, we get a .S temporary, but if we pass fgpu-rdc, that temporary is no longer there.

It seems that the logic in HIPAMD.cpp:151 can simply be removed.

There is an empty output file to satisfy some build tools' requirement that an output is already emitted. You can find assembly files in ${output}1. As mentioned, they are just not named *.S.

This is not about build systems at all, it's for debuggability. An empty file is not useful to us, we need the assembly with the comments.

I think you misunderstand my comment. The assembly output of --lto-emit-asm is of course not empty. The object file output is empty which is fine for your case.
If you want the object file output as well, re-run LTO without --lto-emit-asm. It's similar to what your patch intends to do, but places less burden to libLTO.

It seems that the logic in HIPAMD.cpp:151 can simply be removed.

Why? How would it solve the issue?

If you want the object file output as well, re-run LTO without --lto-emit-asm. It's similar to what your patch intends to do, but places less burden to libLTO.

Are you suggesting that we do this at the driver level instead? So create two linker jobs there if --save-temps is passed to get both a .o and .s?
Or that in this specific use case we don't make any code change and just work around the issue?

Have you tried experimenting with -Wl,--lto-emit-asm? How does it not fulfill your requirement?

Pierre-vh planned changes to this revision.Dec 6 2022, 1:05 AM

I checked -lto-emit-asm and it doesn't look like it does what I need. Ideally we want to get the .s file without needing to run the command twice, we'd expect -save-temps to always give us a .s file and that -fgpu-rdc wouldn't affect it
I think landing this is the simplest to make the behavior of using LTO/-fgpu-rdc consistent with non-LTO/no -fgpu-rdc builds. The next best thing would be a Driver change but I think it'll end up adding more complexity.

Do you still think this isn't the right way to go?

Pierre-vh requested review of this revision.Dec 12 2022, 7:13 AM
Pierre-vh abandoned this revision.Dec 12 2022, 7:53 AM

After thinking about it a bit more, I think if we need to codegen twice it's probably better to do it at the driver level instead of "silently" running codegen twice

(I was out of town for ~2 weeks.)

Thanks for abandoning this. Running passes twice in LLVMLTO is definitely weird, even if it appears to be the simplest approach satisfying your goal.
If you want both .s and .o, I wish that you think about doing this as whatever tool which wraps clang driver instead of doing it in the clang driver.
Regular non-GPU clang driver code doesn't allow emit both .s and .o in one run, either.
I don't track GPU changes that closely, so if you end up doing that, I might not see or object in the end.