This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Add --thinlto-index=
Needs ReviewPublic

Authored by MaskRay on Jul 21 2022, 1:28 AM.

Details

Summary

--thinlto-index= is designed to replace --thinlto-index-only= for distributed
ThinLTO. Example:

Index files and backend compile files are in the same directory:

echo 'int g() { return 0; }' > a.c
echo 'int f(); int main() { return f(); }' > b.c
echo 'int g(); int f() { return g(); }' > c.c
echo '' > d.c
mkdir -p thin obj

clang -c -flto=thin -O2 b.c -o thin/b.o -fthin-link-bitcode=thin/b.min.o
clang -c -flto=thin -O2 c.c -o thin/c.o -fthin-link-bitcode=thin/c.min.o
clang -c -flto=thin -O2 d.c -o thin/d.o -fthin-link-bitcode=thin/d.min.o
clang -c -O2 a.c -o thin/a.o

clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=obj/exe.map,--thinlto-emit-imports-files,--thinlto-prefix-replace='thin;obj',--thinlto-object-suffix-replace='.min.o;.o' \
  a.o thin/b.min.o -Wl,--start-lib thin/c.min.o thin/d.min.o -Wl,--end-lib

clang -c -O2 -fthinlto-index=obj/b.o.thinlto.bc thin/b.o -o obj/b.o
clang -c -O2 -fthinlto-index=obj/c.o.thinlto.bc thin/c.o -o obj/c.o
clang -c -O2 -fthinlto-index=obj/d.o.thinlto.bc thin/d.o -o obj/d.o

clang -fuse-ld=lld -Wl,--remap-inputs-file=obj/exe.map a.o thin/b.min.o -Wl,--start-lib thin/c.min.o thin/d.min.o -Wl,--end-lib -o obj/exe

Index files and backend compile files are in different directories:

echo 'int g() { return 0; }' > a.c
echo 'int f(); int main() { return f(); }' > b.c
echo 'int g(); int f() { return g(); }' > c.c
echo '' > d.c
mkdir -p thin index obj

clang -c -flto=thin -O2 b.c -o thin/b.o -fthin-link-bitcode=thin/b.min.o
clang -c -flto=thin -O2 c.c -o thin/c.o -fthin-link-bitcode=thin/c.min.o
clang -c -flto=thin -O2 d.c -o thin/d.o -fthin-link-bitcode=thin/d.min.o
clang -c -O2 a.c -o a.o

clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=obj/exe.map,--thinlto-emit-imports-files,--thinlto-prefix-replace='thin;index;obj',--thinlto-object-suffix-replace='.min.o;.o' \
  a.o thin/b.min.o -Wl,--start-lib thin/c.min.o thin/d.min.o -Wl,--end-lib

clang -c -O2 -fthinlto-index=index/b.o.thinlto.bc thin/b.o -o obj/b.o
clang -c -O2 -fthinlto-index=index/c.o.thinlto.bc thin/c.o -o obj/c.o
clang -c -O2 -fthinlto-index=index/d.o.thinlto.bc thin/c.o -o obj/d.o

clang -fuse-ld=lld -Wl,--remap-inputs-file=obj/exe.map a.o thin/b.min.o -Wl,--start-lib thin/c.min.o thin/d.min.o -Wl,--end-lib -o obj/exe

The ThinLTO indexing and the final link have very similar command lines.
Actually we just need to replace -Wl,--thinlto-index= with -Wl,--remap-inputs-file=.
obj/exe.map redirects input minimized bitcode files to backend compilation outputs.


Here is an example of the old way (--thinlto-index-only=).

clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=obj/exe.params,--thinlto-prefix-replace='thin;obj',--thinlto-object-suffix-replace='.min.o;.o' \
  a.o thin/b.min.o -Wl,--start-lib thin/c.min.o thin/d.min.o -Wl,--end-lib
...
clang -fuse-ld=lld @obj/exe.params a.o -o obj/exe

--thinlto-index-only= specifies a response file containing all the backend
compilation produced native object files. The final link command line drops all
bitcode files (the build system recognize these files) and inserts @obj/exe.params
at the beginning. This reordering may cause different symbol resolution results
for ThinLTO indexing and the final link, which may lead to

  • a different --start-lib native object file is picked
  • spurious "undefined symbol": say, the different native object file may call a function defined in a bitcode file. The backend compilation for the bitcode file may not export the definition since it doesn't know it is referenced by this native object file.

Rejected alternative: let the response file include native object files.
This has complication:

  • lld has to serialize all input files (shared objects, archives, linker scripts, etc) as well as --whole-archive/--as-needed states
  • the build system has to recognize input files (including linker script) and remove them
  • -nostdlib is required

Link: https://discourse.llvm.org/t/distributed-thinlto-final-linking-order/63804

Diff Detail

Event Timeline

MaskRay created this revision.Jul 21 2022, 1:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2022, 1:28 AM
MaskRay requested review of this revision.Jul 21 2022, 1:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2022, 1:28 AM

I suppose the rationale for doing things this way is that you can keep the input files in the same ordering on the indexing link and the final native link, to avoid changing the relative order between files that were bitcode and files that were native already during the indexing link.

However, I'm a little confused as to how this will work for symbol resolution. The main reason for doing things the way we were was to ensure the same objects were selected for linking out of static libraries (that use --start-lib/--end-lib), since symbols may be referenced differently after importing. That's why the native objects corresponding to the bitcode files are listed in a flat list in the thinlto-index-only=params file (i.e. not using --start-lib/--end-lib). I'm not sure how the new solution prevents those issues.

E.g. If I change your example to use --start-lib/--end-lib:

# ThinLTO indexing
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=out/exe.map -Wl,--thinlto-prefix-replace='out;lto' \
  -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib

Is the final link also like that?:

# Final link
clang -fuse-ld=lld -Wl,--remapping-file=out/exe.map out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib -o out/exe

I assume the symbol resolution is done on the post-mapping native objects, and not the bitcode objects, right? In more complex examples that could cause issues with which objects are selected, leading to similar problems as those this is trying to address relating to originally native objects.

hoy added a comment.Jul 21 2022, 10:53 AM

However, I'm a little confused as to how this will work for symbol resolution. The main reason for doing things the way we were was to ensure the same objects were selected for linking out of static libraries (that use --start-lib/--end-lib), since symbols may be referenced differently after importing. That's why the native objects corresponding to the bitcode files are listed in a flat list in the thinlto-index-only=params file (i.e. not using --start-lib/--end-lib). I'm not sure how the new solution prevents those issues.

I have the same question. I also have a question about the original lazy native objects (that come with --start-lib/end-lib or in form of archive files), of which the symbol resolution relies on the original bitcode objects.

MaskRay added a comment.EditedJul 21 2022, 11:50 AM

I suppose the rationale for doing things this way is that you can keep the input files in the same ordering on the indexing link and the final native link, to avoid changing the relative order between files that were bitcode and files that were native already during the indexing link.

Yes, and also to avoid -nostdlib.

However, I'm a little confused as to how this will work for symbol resolution. The main reason for doing things the way we were was to ensure the same objects were selected for linking out of static libraries (that use --start-lib/--end-lib), since symbols may be referenced differently after importing. That's why the native objects corresponding to the bitcode files are listed in a flat list in the thinlto-index-only=params file (i.e. not using --start-lib/--end-lib). I'm not sure how the new solution prevents those issues.

E.g. If I change your example to use --start-lib/--end-lib:

# ThinLTO indexing
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=out/exe.map -Wl,--thinlto-prefix-replace='out;lto' \
  -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib

Is the final link also like that?:

Yes for all input files. The final link removes -flto=thin -Wl,--thinlto-index=out/exe.map and adds -Wl,--remapping-file=out/exe.map

--thinlto-index-only= reorders bitcode files and native object files, therefore may change symbol resolution. As a compensation, all backend compiled bitcode files drop --start-lib/--end-lib.
For --thinlto-index=, there is no reordering. I am trying to think of a case where dropping --start-lib is a necessity but so far unable to come up with one.

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

The complexity stems from the chain reaction of archive member extraction (I use the term for lazy object files as well), which is not easily formulated formerly.

# Final link
clang -fuse-ld=lld -Wl,--remapping-file=out/exe.map out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib -o out/exe

I assume the symbol resolution is done on the post-mapping native objects, and not the bitcode objects, right? In more complex examples that could cause issues with which objects are selected, leading to similar problems as those this is trying to address relating to originally native objects.

Yes.

I suppose the rationale for doing things this way is that you can keep the input files in the same ordering on the indexing link and the final native link, to avoid changing the relative order between files that were bitcode and files that were native already during the indexing link.

Yes, and also to avoid -nostdlib.

However, I'm a little confused as to how this will work for symbol resolution. The main reason for doing things the way we were was to ensure the same objects were selected for linking out of static libraries (that use --start-lib/--end-lib), since symbols may be referenced differently after importing. That's why the native objects corresponding to the bitcode files are listed in a flat list in the thinlto-index-only=params file (i.e. not using --start-lib/--end-lib). I'm not sure how the new solution prevents those issues.

E.g. If I change your example to use --start-lib/--end-lib:

# ThinLTO indexing
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=out/exe.map -Wl,--thinlto-prefix-replace='out;lto' \
  -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib

Is the final link also like that?:

Yes for all input files. The final link removes -flto=thin -Wl,--thinlto-index=out/exe.map and adds -Wl,--remapping-file=out/exe.map

--thinlto-index-only= reorders bitcode files and native object files, therefore may change symbol resolution. As a compensation, all backend compiled bitcode files drop --start-lib/--end-lib.
For --thinlto-index=, there is no reordering. I am trying to think of a case where dropping --start-lib is a necessity but so far unable to come up with one.

See discussion and examples linked from https://reviews.llvm.org/D22677

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

The complexity stems from the chain reaction of archive member extraction (I use the term for lazy object files as well), which is not easily formulated formerly.

# Final link
clang -fuse-ld=lld -Wl,--remapping-file=out/exe.map out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib -o out/exe

I assume the symbol resolution is done on the post-mapping native objects, and not the bitcode objects, right? In more complex examples that could cause issues with which objects are selected, leading to similar problems as those this is trying to address relating to originally native objects.

Yes.

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

I agree that making thinlto-index-only= to produce the entire input list, both prebuilt native and bitcode-compiled native, is a less disruptive change to still keep the current cmdline options while making it more general. I think we can probably extend the the index file output to also include --start-lib/--end-lib grouping to that the final link can just use it as is.

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

I agree that making thinlto-index-only= to produce the entire input list, both prebuilt native and bitcode-compiled native, is a less disruptive change to still keep the current cmdline options while making it more general. I think we can probably extend the the index file output to also include --start-lib/--end-lib grouping to that the final link can just use it as is.

Actually we specifically don't want to do that (include the selected objects in --start-lib/--end-lib in the index output), we already know which files the linker chose to include from the archive groupings and we don't want to redo that logic which could theoretically come to a different conclusion.

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

I agree that making thinlto-index-only= to produce the entire input list, both prebuilt native and bitcode-compiled native, is a less disruptive change to still keep the current cmdline options while making it more general. I think we can probably extend the the index file output to also include --start-lib/--end-lib grouping to that the final link can just use it as is.

Actually we specifically don't want to do that (include the selected objects in --start-lib/--end-lib in the index output), we already know which files the linker chose to include from the archive groupings and we don't want to redo that logic which could theoretically come to a different conclusion.

Right. Linker already fetched those lazy objects, both native and bitcode, when building symbol table during thinlink, so they are not lazy anymore. The index file would include those needed for final linking.

I suppose the rationale for doing things this way is that you can keep the input files in the same ordering on the indexing link and the final native link, to avoid changing the relative order between files that were bitcode and files that were native already during the indexing link.

Yes, and also to avoid -nostdlib.

However, I'm a little confused as to how this will work for symbol resolution. The main reason for doing things the way we were was to ensure the same objects were selected for linking out of static libraries (that use --start-lib/--end-lib), since symbols may be referenced differently after importing. That's why the native objects corresponding to the bitcode files are listed in a flat list in the thinlto-index-only=params file (i.e. not using --start-lib/--end-lib). I'm not sure how the new solution prevents those issues.

E.g. If I change your example to use --start-lib/--end-lib:

# ThinLTO indexing
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index=out/exe.map -Wl,--thinlto-prefix-replace='out;lto' \
  -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' out/a.o --start-lib out/b.indexing.o out/c.indexing.o --end-lib

Is the final link also like that?:

Yes for all input files. The final link removes -flto=thin -Wl,--thinlto-index=out/exe.map and adds -Wl,--remapping-file=out/exe.map

--thinlto-index-only= reorders bitcode files and native object files, therefore may change symbol resolution. As a compensation, all backend compiled bitcode files drop --start-lib/--end-lib.
For --thinlto-index=, there is no reordering. I am trying to think of a case where dropping --start-lib is a necessity but so far unable to come up with one.

See discussion and examples linked from https://reviews.llvm.org/D22677

I'll analyze the example later. But I haven't seen evidence that this --thinlto-index= and --remapping-file= won't work.

I have written a script to convert Bazel -Wl,-plugin-opt=thinlto- options to -Wl,--thinlto-index=/tmp/c/lto-final.map and tried the approach with several internal executables.
They do work with a small change that the remapping file ignores all un-extracted minimalized bitcode files.

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

! In D130229#3669926, @weiwang wrote:

! In D130229#3669826, @tejohnson wrote:

! In D130229#3669754, @weiwang wrote:

! In D130229#3669642, @tejohnson wrote:

If there is a case, we probably have to give --remapping-file additional semantics that a specified file ignores the surrounding --start-lib. This makes --remapping-file less generic, but we can probably add the third column to indicate whether --start-lib should be ignored.

Can we just have the file generated by thinlto-index-only= also include native objects?

I agree that making thinlto-index-only= to produce the entire input list, both prebuilt native and bitcode-compiled native, is a less disruptive change to still keep the current cmdline options while making it more general. I think we can probably extend the the index file output to also include --start-lib/--end-lib grouping to that the final link can just use it as is.

Actually we specifically don't want to do that (include the selected objects in --start-lib/--end-lib in the index output), we already know which files the linker chose to include from the archive groupings and we don't want to redo that logic which could theoretically come to a different conclusion.

Right. Linker already fetched those lazy objects, both native and bitcode, when building symbol table during thinlink, so they are not lazy anymore. The index file would include those needed for final linking.

Actually including all native objects will be more intrusive.
We need to consider archives and shared objects as well as ELF relocatable object files and bitcode files.
Archives have the --whole-archive state and shared objects have the --as-needed state.
For a shared object, its soname is dependent on whether it is specified in the -l form.
These state all have to be serialized.

As my first comment says, I have thought about this but I haven't convinced myself that this is a necessity.

Last, right now -nostdlib just controls some -l and startup files. I am slightly concerned whether a toolchain may use -nostdlib to do something differently and make the alternative --thinlto-index-only= (including all native files) more cumbersome to use.

MaskRay updated this revision to Diff 446651.Jul 21 2022, 3:44 PM
MaskRay edited the summary of this revision. (Show Details)

add more comments
add /dev/null
add release note
improve tests

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

This is my concern too. See discussion on https://reviews.llvm.org/D22356 (there's more discussion on llvm-commits that didn't make it into phab somehow), and the test case included there. That approach was abandoned in favor of the params file based on the extensive discussion there.

MaskRay added a comment.EditedJul 22 2022, 11:32 AM

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

This is my concern too. See discussion on https://reviews.llvm.org/D22356 (there's more discussion on llvm-commits that didn't make it into phab somehow), and the test case included there. That approach was abandoned in favor of the params file based on the extensive discussion there.

As mentioned I had this concern but I am not convinced that forcing a non-lazy state is a necessity. With more fiddling I figured that the simple approach as implemented in this patch likely suffices.

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

MaskRay added a comment.EditedJul 22 2022, 11:36 AM

I think lots of concern may possibly be resolved when a build system actually implements this, or may be not.
In the release notes, I state "Experimental `--thinlto-index= and --remapping-file=` are added for distributed ThinLTO."
I am open to a change if this simple approach turns out be insufficient.

The alternative will be very intrusive and have substantial complexity. But for the sake of more stable distributed ThinLTO I am willing to pay the cost.

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

This is my concern too. See discussion on https://reviews.llvm.org/D22356 (there's more discussion on llvm-commits that didn't make it into phab somehow), and the test case included there. That approach was abandoned in favor of the params file based on the extensive discussion there.

As mentioned I had this concern but I am not convinced that forcing a non-lazy state is a necessity. With more fiddling I figured that the simple approach as implemented in this patch likely suffices.

Continuing on the example I mentioned earlier with importing + inlining, the most obvious issue is that dropping b.o on final link will also drop static initializers come with that object. And that will be an observable difference between local vs dist thinlto. This is a real case we ran into. The alternative solution to include selected native obj in thinlto-index-only output doesn't have this (class of) problem.

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

This is my concern too. See discussion on https://reviews.llvm.org/D22356 (there's more discussion on llvm-commits that didn't make it into phab somehow), and the test case included there. That approach was abandoned in favor of the params file based on the extensive discussion there.

As mentioned I had this concern but I am not convinced that forcing a non-lazy state is a necessity. With more fiddling I figured that the simple approach as implemented in this patch likely suffices.

Continuing on the example I mentioned earlier with importing + inlining, the most obvious issue is that dropping b.o on final link will also drop static initializers come with that object. And that will be an observable difference between local vs dist thinlto. This is a real case we ran into. The alternative solution to include selected native obj in thinlto-index-only output doesn't have this (class of) problem.

Can you craft a test case (ideally RUN: form) which manifests the issue?

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

I'm not sure how this solution works because symbol resolution before and after importing aren't guaranteed to produce same result even if input order is identical.

Say we have foo in a.o, which references bar in b.o (lazy object). Assuming a.o imports bar, before importing, b.o gets picked because of that foo->bar reference, but the reference will be gone after importing and we may not pick b.o during final linking. The ripple effect from dropping b.o can disturb other symbol's resolution, making final link and thin link diverge.

Imported functions have available_externally linkage and are not considered definitions in the IR symbol table. There is no symbol resolution difference in your example.

I actually meant importing + inlining. After importing and inlining foo->bar, the reference to bar will be gone in a.o output from thinlto codegen, so the input for final linking won't see the reference for bar, right? Does that not change final linking symbol resolution still?

This is my concern too. See discussion on https://reviews.llvm.org/D22356 (there's more discussion on llvm-commits that didn't make it into phab somehow), and the test case included there. That approach was abandoned in favor of the params file based on the extensive discussion there.

As mentioned I had this concern but I am not convinced that forcing a non-lazy state is a necessity. With more fiddling I figured that the simple approach as implemented in this patch likely suffices.

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

An example I made to show the different results among local thinlto, thinlto with --thinlto-index-only and thinlto with --thinlto-index:

> echo 'int g() { return 0; }' > a.cpp
> echo 'int g() { return 1; } void f() { return; }' > b.cpp
> echo 'void f(); int g(); int main() { [[clang::always_inline]] f(); [[clang::noinline]] return g(); }' > c.cpp

# compile
> clang++ -c -flto=thin -O2 c.cpp -o c.o
> clang++ -c -flto=thin -O2 b.cpp -o b.o
> clang++ -c -O2 a.cpp -o a.o

# local thinlto
> clang++ -flto=thin -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o local
> ./local
> echo $?
1

# dist_thinlto with --thinlto-index-only
> clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=thinlto.index,--thinlto-emit-imports-files -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

> cat thinlto.index
c.o
b.o

> clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
> clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o
> clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib c.native.o b.native.o -o dist1
ld.lld: error: duplicate symbol: g()
>>> defined at a.cpp
>>>            a.o:(g())
>>> defined at b.cpp
>>>            b.native.o:(.text+0x0)

# dist_thinlto with --thinlto-index
> clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

> cat thinlto.map
c.o c.native.o
b.o b.native.o

> clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
> clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

> clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
> ./dist2
> echo $?
0
MaskRay added a comment.EditedJul 23 2022, 10:26 AM

An example I made to show the different results among local thinlto, thinlto with --thinlto-index-only and thinlto with --thinlto-index:

> echo 'int g() { return 0; }' > a.cpp
> echo 'int g() { return 1; } void f() { return; }' > b.cpp
> echo 'void f(); int g(); int main() { [[clang::always_inline]] f(); [[clang::noinline]] return g(); }' > c.cpp

# compile
> clang++ -c -flto=thin -O2 c.cpp -o c.o
> clang++ -c -flto=thin -O2 b.cpp -o b.o
> clang++ -c -O2 a.cpp -o a.o

# local thinlto
> clang++ -flto=thin -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o local
> ./local
> echo $?
1

# dist_thinlto with --thinlto-index-only
> clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=thinlto.index,--thinlto-emit-imports-files -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

> cat thinlto.index
c.o
b.o

> clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
> clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o
> clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib c.native.o b.native.o -o dist1
ld.lld: error: duplicate symbol: g()
>>> defined at a.cpp
>>>            a.o:(g())
>>> defined at b.cpp
>>>            b.native.o:(.text+0x0)

# dist_thinlto with --thinlto-index
> clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

> cat thinlto.map
c.o c.native.o
b.o b.native.o

> clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
> clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

> clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
> ./dist2
> echo $?
0

Thanks for sharing the example. However, I don't consider such an example a reasonable case. I think the --thinlto-index= behavior is ok.

In-process ThinLTO places lto.tmp native object files at the end. This causes sections from bitcode files to be placed after native object files' sections. For certain linker script usage this can be seen as an issue (https://github.com/llvm/llvm-project/issues/38738).
If distributed ThinLTO uses a different order and has a different behavior from in-process ThinLTO, I'll not consider it an issue in distributed ThinLTO, if only due to the object file insertion place.
E.g. --thinlto-index-only= places the backend compiled object files before explicit native input (but after implicit files like crt1.o). This will inevitably have differences in some duplicate definition cases.

In the example, a.o is before b.o but a.o doesn't properly shadow b.o. If we consider a.o an interceptor, it fails the requirement "The defined symbols of the interceptor should be a superset of the archive member being shadowed." for reliable archive member extraction: https://maskray.me/blog/2021-06-20-symbol-processing#relocatable-object-file-suppressing-archive-member-extraction

Corollary: I expect that a distributed ThinLTO user should fix certain duplicate definition issues (many are ODR violation). Some are fine, if the archive member extraction leverage is reasonable.
I am unable to come up a full set of rules: if you satisfy these, your build will be fine. They have to be analyzed case by case.
(Using some options can help guarantee more stable results, e.g. --warn-backrefs https://maskray.me/blog/2021-06-13-dependency-related-linker-options#warn-backrefs
This example can be altered to fix the backward reference but the link is still brittle.)

I think lots of concern may possibly be resolved when a build system actually implements this, or may be not.
In the release notes, I state "Experimental `--thinlto-index= and --remapping-file=` are added for distributed ThinLTO."
I am open to a change if this simple approach turns out be insufficient.

Thorough experiment with coordinated build system changes is expensive. The motivating case was about consistency between thinlink and final link (also between local and dist thinlto) in the presence of native libs, and we know that the consistency has two aspects -- obj selection and ordering .

I think this proposal not only doesn't solve the problem for native lazy objects, it also regresses current handling of lto lazy objects because the selection part done by thinlto-index-only is now missing.

Knowing the selection part is unresolved, I think it's a stretch to ask people to do such experiments. And I'm also not sure if trial and error approach is a good one for this kind of problem..

Corollary: I expect that a distributed ThinLTO user should fix certain duplicate definition issues (many are ODR violation). Some are fine, if the archive member extraction leverage is reasonable.

Wei's example is one possible manifestation of the underlying issue -- doing symbol resolution on the same set of inputs pre-codegen and post-codegen just isn't going to be equivalent. That can lead different results from undefined symbol to multiple definition to initializer discrepancies.

I hear you that there's no strict right vs wrong symbol resolution in some cases. But the consistency between local and dist thinLTO is important. Extending the current thinlto-index-only approach to cover native lazy objects is going to handle that better.

The alternative will be very intrusive and have substantial complexity. But for the sake of more stable distributed ThinLTO I am willing to pay the cost.

For anyone with reasonably sized codebase on local ThinLTO, some stability between local vs dist ThinLTO is critical for the adoption of dist ThinLTO. This is essentially blocking us from adopting dist ThinLTO internally.


There's another aspect that didn't get discussed as much here. If you look through the reviews @tejohnson mentioned, there're back and forth discussions with @mehdi_amini around the design principle -- specifically whether we should let final link redo what's already done by thinlink. I agree with conclusion there that it's unnecessary, and for that reason, the thinlto-index-only approach we use today is also more reasonable. I don't think we should revert back that important and heavily discussed design decision lightly.

Weighing everything, it looks sensible to me to absorb a bit of complexity so we can provide some stability between local and dist thinLTO, and avoid having final link repeat what's done already. At the very least, we should try extending thinlto-index-only and see if the concern around complexity is manageable.

MaskRay added a comment.EditedJul 30 2022, 12:24 AM

I think lots of concern may possibly be resolved when a build system actually implements this, or may be not.
In the release notes, I state "Experimental `--thinlto-index= and --remapping-file=` are added for distributed ThinLTO."
I am open to a change if this simple approach turns out be insufficient.

Thorough experiment with coordinated build system changes is expensive. The motivating case was about consistency between thinlink and final link (also between local and dist thinlto) in the presence of native libs, and we know that the consistency has two aspects -- obj selection and ordering .

I think this proposal not only doesn't solve the problem for native lazy objects, it also regresses current handling of lto lazy objects because the selection part done by thinlto-index-only is now missing.

Knowing the selection part is unresolved, I think it's a stretch to ask people to do such experiments. And I'm also not sure if trial and error approach is a good one for this kind of problem..

I think it is worth having more understanding of how the current --thinlto-index-only= is insufficient and coming up better examples.
You have doubt. I had it, too. But I wasn't able to come up a convincing symbol resolution example that --thinlto-index= is insufficient.
I have verified --thinlto-index= with some very large internal ThinLTO builds.

If you want to back up the idea of serializing the archive member extraction result, please share concrete examples, not just theory that "it also regresses...".

Actually, I am going to argue that --thinlto-index= may solve the "libcall defined in bitcode archive" problem: a backend compile may introduce new libcall references which do not exist in the IR symbol table, and the bitcode archive may not be extracted.
Implicit ThinLTO currently works around it by forcing extraction of bitcode files containing referenced libcall symbols.
If we use --remapping-file=, we can trigger archive member extraction of such bitcode archives, and rely on implicit ThinLTO to compile the very few libcall bitcode archive members.
If we serialize the archive member extraction result and ignore non-extracted bitcode archive members, we'd still need the current workaround.

The current implicit ThinLTO takes the simplest approach by inserting lto.tmp at the end, breaking ordering.
I don't think the scheme is set in stone and changing distributed ThinLTO to behave like it is probably not the right direction.
On the other hand, it's non-trivial to fix its ordering. Therefore, I think making distributed ThinLTO and implicit ThinLTO have very similar symbol resolution behaviors is a stretch goal. If --thinlto-index= doesn't behave like implicit ThinLTO, I don't think it is a design flaw.

Corollary: I expect that a distributed ThinLTO user should fix certain duplicate definition issues (many are ODR violation). Some are fine, if the archive member extraction leverage is reasonable.

Wei's example is one possible manifestation of the underlying issue -- doing symbol resolution on the same set of inputs pre-codegen and post-codegen just isn't going to be equivalent. That can lead different results from undefined symbol to multiple definition to initializer discrepancies.

I hear you that there's no strict right vs wrong symbol resolution in some cases. But the consistency between local and dist thinLTO is important. Extending the current thinlto-index-only approach to cover native lazy objects is going to handle that better.

The alternative will be very intrusive and have substantial complexity. But for the sake of more stable distributed ThinLTO I am willing to pay the cost.

For anyone with reasonably sized codebase on local ThinLTO, some stability between local vs dist ThinLTO is critical for the adoption of dist ThinLTO. This is essentially blocking us from adopting dist ThinLTO internally.

As mentioned above, please share a convincing example.
If I know one, as mentioned, I am happy to take the serialization approach, even if it will be an involved change.


There's another aspect that didn't get discussed as much here. If you look through the reviews @tejohnson mentioned, there're back and forth discussions with @mehdi_amini around the design principle -- specifically whether we should let final link redo what's already done by thinlink. I agree with conclusion there that it's unnecessary, and for that reason, the thinlto-index-only approach we use today is also more reasonable. I don't think we should revert back that important and heavily discussed design decision lightly.

Weighing everything, it looks sensible to me to absorb a bit of complexity so we can provide some stability between local and dist thinLTO, and avoid having final link repeat what's done already. At the very least, we should try extending thinlto-index-only and see if the concern around complexity is manageable.

All replied in previous paragraphs.

The current implicit ThinLTO takes the simplest approach by inserting lto.tmp at the end, breaking ordering.
I don't think the scheme is set in stone and changing distributed ThinLTO to behave like it is probably not the right direction.
On the other hand, it's non-trivial to fix its ordering. Therefore, I think making distributed ThinLTO and implicit ThinLTO have very similar symbol resolution behaviors is a stretch goal. If --thinlto-index= doesn't behave like implicit ThinLTO, I don't think it is a design flaw.

My understanding is that inserting lto.tmp at the end shouldn't matter to the symbol resolution order. Bitcode objects, before going through compilation, have already gone through symbol resolution, so there shouldn't be new symbols introduced when lto.tmp is inserted, except those being promoted or created internally by llvm, but those symbols is unique to the module.

We are currently in the process of deprecating the use of regular archives in favor of thin archives in our build system. This would make it much easier to include both bitcode and prebuilt native objects into the index list. And we'd like to propose another solution just to do that https://reviews.llvm.org/D130975. The included test case shows an example where the current --thinlto-index-only= does not give a clear indication of how to keep the symbol resolution order consistent between thin link and final link.

MTC added a subscriber: MTC.Aug 10 2022, 12:24 AM

Revisiting this as it doesn't look like we have a resolution. I went through the example and discussion here and have some comments and questions below:

Regarding the example from @weiwang, I think the most ideal situation is to have ThinLTO (in process or distributed) match the behavior of non-LTO. For that I see

$ clang++ -c -O2 a.cpp b.cpp c.cpp
$ clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib 
$ a.out
$ echo $?
1

So that would seem to be consistent with what was happening for in-process ThinLTO. I applied the patch and also see 0 printed in the distributed case with the new map approach, although fyi the command line used below is not correct:

dist_thinlto with --thinlto-index

clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

You don't want to use suffix replace, that affects where the importing looks for bitcode files, you'll need to use prefix-replace instead, see my modified commands further below.

cat thinlto.map

c.o c.native.o
b.o b.native.o

clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
./dist2
echo $?

0

Here's what I did to test:

$ clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-prefix-replace=';native_' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ clang++ -O2 -c -x ir b.o -fthinlto-index=native_b.o.thinlto.bc -o native_b.o
$ clang++ -O2 -c -x ir c.o -fthinlto-index=native_c.o.thinlto.bc -o native_c.o
$ clang++ -flto=thin -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ a.out
$ echo $?
0

It looks like the reason for the difference is that f() is inlined in native_c.o in the distributed ThinLTO case, and so instead of looking for f() first, finding it in b.o and then using b.o's def of g(), in this case it only looks for g() and it finds it in a.o. This is a variant of other cases I had looked at which motivated the original solution, although in our case because almost everything is bitcode by default in a ThinLTO build the solution listing bitcode files was sufficient.

I'm wondering if it would it be possible to do a solution that is intermediate between this and D130975. Specifically, in this patch in order to avoid selecting lazy objects that were originally bitcode in the thin link, the remapping file maps them to /dev/null. Can that same solution be done for any lazy native objects from the thin link (list them in the remapping file as mapped to /dev/null)? We wouldn't be able to do this for any native objects in .a archives, but that's already an limitation for bitcode objects in distributed ThinLTO builds (they can't be in .a archives).

Couple other follow up comments/questions:

Actually, I am going to argue that --thinlto-index= may solve the "libcall defined in bitcode archive" problem: a backend compile may introduce new libcall references which do not exist in the IR symbol table, and the bitcode archive may not be extracted.
Implicit ThinLTO currently works around it by forcing extraction of bitcode files containing referenced libcall symbols.
If we use --remapping-file=, we can trigger archive member extraction of such bitcode archives, and rely on implicit ThinLTO to compile the very few libcall bitcode archive members.
If we serialize the archive member extraction result and ignore non-extracted bitcode archive members, we'd still need the current workaround.

My understanding looking at this patch is that non-extracted bitcode archive members get mapped to /dev/null in the mapping file, so I don't see how this patch will help the libcall extraction if they are defined in bitcode as described above? But maybe I am misunderstanding the situation described above, or what the patch does.

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

Yes it looks like D22467 definitely and probably D22356 will both work with the approach in this patch specifically because of the way lld does archive member extraction (back then we were using gold). I believe they would fail with this approach + gold or gnu ld. So the other issue is that with this approach we would have to maintain 2 solutions, since we want distributed ThinLTO to work for other linkers such as gold and gnu ld (at least as well as they are now, in the typical case where we are not linking with both native and ThinLTO lazy objects). To get those linkers to work with this approach, you would need to extend the mapping file to add a flag to the line to say "unconditionally link" - or just change the interpretation so that anything listed in the mapping file that isn't mapped to /dev/null is unconditionally extracted/linked. That may be preferable in any case to guarantee that we always link the same objects with lld too?

I think if this patch was modified as suggested above (map non-selected native lazy objects to /dev/null in the mapping file as well, and possibly doing something as described in the paragraph above to enforce that all objects listed in the mapping file not mapped to /dev/null are always selected and linked) this would end up being somewhat similar to what happens with D130975, but I think it may avoid some of the problems there (i.e. you don't need to modify the rest of the link line to locate and remove the object files, and for any linked in native .a archives I don't think it ends up being worse that the situation today - they can't be listed in the mapping file without some extension to identify .a archive members but the link can work albeit you may get different behavior that in process ThinLTO or non-ThinLTO).

Thoughts?

Revisiting this as it doesn't look like we have a resolution. I went through the example and discussion here and have some comments and questions below:

Thanks for reviving the discussion.

Regarding the example from @weiwang, I think the most ideal situation is to have ThinLTO (in process or distributed) match the behavior of non-LTO. For that I see

$ clang++ -c -O2 a.cpp b.cpp c.cpp
$ clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib 
$ a.out
$ echo $?
1

So that would seem to be consistent with what was happening for in-process ThinLTO. I applied the patch and also see 0 printed in the distributed case with the new map approach, although fyi the command line used below is not correct:

dist_thinlto with --thinlto-index

clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

You don't want to use suffix replace, that affects where the importing looks for bitcode files, you'll need to use prefix-replace instead, see my modified commands further below.

cat thinlto.map

c.o c.native.o
b.o b.native.o

clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
./dist2
echo $?

0

Here's what I did to test:

$ clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-prefix-replace=';native_' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ clang++ -O2 -c -x ir b.o -fthinlto-index=native_b.o.thinlto.bc -o native_b.o
$ clang++ -O2 -c -x ir c.o -fthinlto-index=native_c.o.thinlto.bc -o native_c.o
$ clang++ -flto=thin -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ a.out
$ echo $?
0

It looks like the reason for the difference is that f() is inlined in native_c.o in the distributed ThinLTO case, and so instead of looking for f() first, finding it in b.o and then using b.o's def of g(), in this case it only looks for g() and it finds it in a.o. This is a variant of other cases I had looked at which motivated the original solution, although in our case because almost everything is bitcode by default in a ThinLTO build the solution listing bitcode files was sufficient.

I consider this use case invalid (https://reviews.llvm.org/D130229#3674024) because it does not properly use archive semantics to shadow an archive.
Note: If done properly, using archive semantics is a reasonable extension to allow duplicate definitions.

I'm wondering if it would it be possible to do a solution that is intermediate between this and D130975. Specifically, in this patch in order to avoid selecting lazy objects that were originally bitcode in the thin link, the remapping file maps them to /dev/null. Can that same solution be done for any lazy native objects from the thin link (list them in the remapping file as mapped to /dev/null)? We wouldn't be able to do this for any native objects in .a archives, but that's already an limitation for bitcode objects in distributed ThinLTO builds (they can't be in .a archives).

This patch can be adjusted to use the intermediate behavior, but I'd like to hear a reasonable example to justify the change.
The current approach remaps lazy and non-lazy bitcode files.
The adjusted behavior will additionally remap lazy ELF .o files, which add inconsistency as ELF .a members are not remapped.

Couple other follow up comments/questions:

Actually, I am going to argue that --thinlto-index= may solve the "libcall defined in bitcode archive" problem: a backend compile may introduce new libcall references which do not exist in the IR symbol table, and the bitcode archive may not be extracted.
Implicit ThinLTO currently works around it by forcing extraction of bitcode files containing referenced libcall symbols.
If we use --remapping-file=, we can trigger archive member extraction of such bitcode archives, and rely on implicit ThinLTO to compile the very few libcall bitcode archive members.
If we serialize the archive member extraction result and ignore non-extracted bitcode archive members, we'd still need the current workaround.

My understanding looking at this patch is that non-extracted bitcode archive members get mapped to /dev/null in the mapping file, so I don't see how this patch will help the libcall extraction if they are defined in bitcode as described above? But maybe I am misunderstanding the situation described above, or what the patch does.

I had said before I noticed that /dev/null remapping is needed. I've now forgotten why /dev/null remapping is needed...
Not remapping ELF .a members helps libcall symbols which are not hard coded in lto::LTO::getRuntimeLibcallSymbols().

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

Yes it looks like D22467 definitely and probably D22356 will both work with the approach in this patch specifically because of the way lld does archive member extraction (back then we were using gold). I believe they would fail with this approach + gold or gnu ld. So the other issue is that with this approach we would have to maintain 2 solutions, since we want distributed ThinLTO to work for other linkers such as gold and gnu ld (at least as well as they are now, in the typical case where we are not linking with both native and ThinLTO lazy objects). To get those linkers to work with this approach, you would need to extend the mapping file to add a flag to the line to say "unconditionally link" - or just change the interpretation so that anything listed in the mapping file that isn't mapped to /dev/null is unconditionally extracted/linked. That may be preferable in any case to guarantee that we always link the same objects with lld too?

I certainly like a solution which is more portable and supports more linkers. I care a lot about the GNU ld compatibility as it is the default linker on many Linux distributions. People probably care less about gold now since its default linker state in some groups has been replaced and its upstream development is stagnating (recognized by users and binutils contributors).

Nowadays GNU ld and gold are the only linkers with the odd archive extraction behavior.
Everything else has moved to the VMS (now OpenVMS), Mach-O ld64, Windows link.exe, and ld.lld behavior.
mold uses a variant of such a behavior (I find it difficult to formalize its behavior, but it seems fine in most cases).

If the new option we are proposing does not work with GNU ld, I consider it is ok since (a) its traditional archive member extraction looks problematic (b) I don't know any group wants to use it since it is so slow and confumses so much memory (c) any such user can fall back the --thinlto-index-only=.

I think if this patch was modified as suggested above (map non-selected native lazy objects to /dev/null in the mapping file as well, and possibly doing something as described in the paragraph above to enforce that all objects listed in the mapping file not mapped to /dev/null are always selected and linked) this would end up being somewhat similar to what happens with D130975, but I think it may avoid some of the problems there (i.e. you don't need to modify the rest of the link line to locate and remove the object files, and for any linked in native .a archives I don't think it ends up being worse that the situation today - they can't be listed in the mapping file without some extension to identify .a archive members but the link can work albeit you may get different behavior that in process ThinLTO or non-ThinLTO).

Thoughts?

Hopefully all answered in previous paragraphs.

Revisiting this as it doesn't look like we have a resolution. I went through the example and discussion here and have some comments and questions below:

Thanks for reviving the discussion.

Regarding the example from @weiwang, I think the most ideal situation is to have ThinLTO (in process or distributed) match the behavior of non-LTO. For that I see

$ clang++ -c -O2 a.cpp b.cpp c.cpp
$ clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib 
$ a.out
$ echo $?
1

So that would seem to be consistent with what was happening for in-process ThinLTO. I applied the patch and also see 0 printed in the distributed case with the new map approach, although fyi the command line used below is not correct:

dist_thinlto with --thinlto-index

clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

You don't want to use suffix replace, that affects where the importing looks for bitcode files, you'll need to use prefix-replace instead, see my modified commands further below.

cat thinlto.map

c.o c.native.o
b.o b.native.o

clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
./dist2
echo $?

0

Here's what I did to test:

$ clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-prefix-replace=';native_' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ clang++ -O2 -c -x ir b.o -fthinlto-index=native_b.o.thinlto.bc -o native_b.o
$ clang++ -O2 -c -x ir c.o -fthinlto-index=native_c.o.thinlto.bc -o native_c.o
$ clang++ -flto=thin -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ a.out
$ echo $?
0

It looks like the reason for the difference is that f() is inlined in native_c.o in the distributed ThinLTO case, and so instead of looking for f() first, finding it in b.o and then using b.o's def of g(), in this case it only looks for g() and it finds it in a.o. This is a variant of other cases I had looked at which motivated the original solution, although in our case because almost everything is bitcode by default in a ThinLTO build the solution listing bitcode files was sufficient.

I consider this use case invalid (https://reviews.llvm.org/D130229#3674024) because it does not properly use archive semantics to shadow an archive.
Note: If done properly, using archive semantics is a reasonable extension to allow duplicate definitions.

Does this violate a language standard and/or is it explicitly considered undefined behavior?

I construct a similarly behaving example where g() is a weak symbol if that matters.

I'm wondering if it would it be possible to do a solution that is intermediate between this and D130975. Specifically, in this patch in order to avoid selecting lazy objects that were originally bitcode in the thin link, the remapping file maps them to /dev/null. Can that same solution be done for any lazy native objects from the thin link (list them in the remapping file as mapped to /dev/null)? We wouldn't be able to do this for any native objects in .a archives, but that's already an limitation for bitcode objects in distributed ThinLTO builds (they can't be in .a archives).

This patch can be adjusted to use the intermediate behavior, but I'd like to hear a reasonable example to justify the change.
The current approach remaps lazy and non-lazy bitcode files.
The adjusted behavior will additionally remap lazy ELF .o files, which add inconsistency as ELF .a members are not remapped.

There's an inherent inconsistency for bitcode lazy .o objects vs bitcode .a archives already. Until and unless the remapping file was extended to handle ELF .a members there would be an inconsistency there, but I don't think this is worse or less desirable than the inconsistency vs non-ThinLTO links or in process ThinLTO links.

Couple other follow up comments/questions:

Actually, I am going to argue that --thinlto-index= may solve the "libcall defined in bitcode archive" problem: a backend compile may introduce new libcall references which do not exist in the IR symbol table, and the bitcode archive may not be extracted.
Implicit ThinLTO currently works around it by forcing extraction of bitcode files containing referenced libcall symbols.
If we use --remapping-file=, we can trigger archive member extraction of such bitcode archives, and rely on implicit ThinLTO to compile the very few libcall bitcode archive members.
If we serialize the archive member extraction result and ignore non-extracted bitcode archive members, we'd still need the current workaround.

My understanding looking at this patch is that non-extracted bitcode archive members get mapped to /dev/null in the mapping file, so I don't see how this patch will help the libcall extraction if they are defined in bitcode as described above? But maybe I am misunderstanding the situation described above, or what the patch does.

I had said before I noticed that /dev/null remapping is needed. I've now forgotten why /dev/null remapping is needed...
Not remapping ELF .a members helps libcall symbols which are not hard coded in lto::LTO::getRuntimeLibcallSymbols().

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

Yes it looks like D22467 definitely and probably D22356 will both work with the approach in this patch specifically because of the way lld does archive member extraction (back then we were using gold). I believe they would fail with this approach + gold or gnu ld. So the other issue is that with this approach we would have to maintain 2 solutions, since we want distributed ThinLTO to work for other linkers such as gold and gnu ld (at least as well as they are now, in the typical case where we are not linking with both native and ThinLTO lazy objects). To get those linkers to work with this approach, you would need to extend the mapping file to add a flag to the line to say "unconditionally link" - or just change the interpretation so that anything listed in the mapping file that isn't mapped to /dev/null is unconditionally extracted/linked. That may be preferable in any case to guarantee that we always link the same objects with lld too?

I certainly like a solution which is more portable and supports more linkers. I care a lot about the GNU ld compatibility as it is the default linker on many Linux distributions. People probably care less about gold now since its default linker state in some groups has been replaced and its upstream development is stagnating (recognized by users and binutils contributors).

Nowadays GNU ld and gold are the only linkers with the odd archive extraction behavior.
Everything else has moved to the VMS (now OpenVMS), Mach-O ld64, Windows link.exe, and ld.lld behavior.
mold uses a variant of such a behavior (I find it difficult to formalize its behavior, but it seems fine in most cases).

If the new option we are proposing does not work with GNU ld, I consider it is ok since (a) its traditional archive member extraction looks problematic (b) I don't know any group wants to use it since it is so slow and confumses so much memory (c) any such user can fall back the --thinlto-index-only=.

Nevertheless, it is still being used. So we would need to maintain 2 solutions long term.

Ideally we will replace --thinlto-index-only with a solution that would work better for all linkers. In generally, I think it is dangerous to redo the archive member selection in the final link, since ThinLTO is performing linker resolution based optimizations. Imo we should strive for a solution that allows the information for this selection to be communicated between the 2 links. I think we should have a very compelling reason to go in the other direction from what we already do on this front for bitcode objects.

I think if this patch was modified as suggested above (map non-selected native lazy objects to /dev/null in the mapping file as well, and possibly doing something as described in the paragraph above to enforce that all objects listed in the mapping file not mapped to /dev/null are always selected and linked) this would end up being somewhat similar to what happens with D130975, but I think it may avoid some of the problems there (i.e. you don't need to modify the rest of the link line to locate and remove the object files, and for any linked in native .a archives I don't think it ends up being worse that the situation today - they can't be listed in the mapping file without some extension to identify .a archive members but the link can work albeit you may get different behavior that in process ThinLTO or non-ThinLTO).

Thoughts?

Hopefully all answered in previous paragraphs.

Revisiting this as it doesn't look like we have a resolution. I went through the example and discussion here and have some comments and questions below:

Thanks for reviving the discussion.

Regarding the example from @weiwang, I think the most ideal situation is to have ThinLTO (in process or distributed) match the behavior of non-LTO. For that I see

$ clang++ -c -O2 a.cpp b.cpp c.cpp
$ clang++ -fuse-ld=lld -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib 
$ a.out
$ echo $?
1

So that would seem to be consistent with what was happening for in-process ThinLTO. I applied the patch and also see 0 printed in the distributed case with the new map approach, although fyi the command line used below is not correct:

dist_thinlto with --thinlto-index

clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-object-suffix-replace='.o;.native.o' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib

You don't want to use suffix replace, that affects where the importing looks for bitcode files, you'll need to use prefix-replace instead, see my modified commands further below.

cat thinlto.map

c.o c.native.o
b.o b.native.o

clang++ -O2 -c -x ir b.o -fthinlto-index=b.o.thinlto.bc -o b.native.o
clang++ -O2 -c -x ir c.o -fthinlto-index=c.o.thinlto.bc -o c.native.o

clang++ -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib -o dist2
./dist2
echo $?

0

Here's what I did to test:

$ clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index=thinlto.map,--thinlto-emit-imports-files,--thinlto-prefix-replace=';native_' -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ clang++ -O2 -c -x ir b.o -fthinlto-index=native_b.o.thinlto.bc -o native_b.o
$ clang++ -O2 -c -x ir c.o -fthinlto-index=native_c.o.thinlto.bc -o native_c.o
$ clang++ -flto=thin -fuse-ld=lld -Wl,--remapping-file=thinlto.map -Wl,--start-lib a.o -Wl,--end-lib -Wl,--start-lib b.o -Wl,--end-lib -Wl,--start-lib c.o -Wl,--end-lib
$ a.out
$ echo $?
0

It looks like the reason for the difference is that f() is inlined in native_c.o in the distributed ThinLTO case, and so instead of looking for f() first, finding it in b.o and then using b.o's def of g(), in this case it only looks for g() and it finds it in a.o. This is a variant of other cases I had looked at which motivated the original solution, although in our case because almost everything is bitcode by default in a ThinLTO build the solution listing bitcode files was sufficient.

I consider this use case invalid (https://reviews.llvm.org/D130229#3674024) because it does not properly use archive semantics to shadow an archive.
Note: If done properly, using archive semantics is a reasonable extension to allow duplicate definitions.

Does this violate a language standard and/or is it explicitly considered undefined behavior?

I construct a similarly behaving example where g() is a weak symbol if that matters.

C++ https://en.cppreference.com/w/cpp/language/definition#One_Definition_Rule . In practice this is loosened by allowing some intercepting. I have some arguments in this thread

If we consider a.o an interceptor, it fails the requirement "The defined symbols of the interceptor should be a superset of the archive member being shadowed." for reliable archive member extraction: https://maskray.me/blog/2021-06-20-symbol-processing#relocatable-object-file-suppressing-archive-member-extraction

and in https://reviews.llvm.org/D130975#3694954

I'm wondering if it would it be possible to do a solution that is intermediate between this and D130975. Specifically, in this patch in order to avoid selecting lazy objects that were originally bitcode in the thin link, the remapping file maps them to /dev/null. Can that same solution be done for any lazy native objects from the thin link (list them in the remapping file as mapped to /dev/null)? We wouldn't be able to do this for any native objects in .a archives, but that's already an limitation for bitcode objects in distributed ThinLTO builds (they can't be in .a archives).

This patch can be adjusted to use the intermediate behavior, but I'd like to hear a reasonable example to justify the change.
The current approach remaps lazy and non-lazy bitcode files.
The adjusted behavior will additionally remap lazy ELF .o files, which add inconsistency as ELF .a members are not remapped.

There's an inherent inconsistency for bitcode lazy .o objects vs bitcode .a archives already. Until and unless the remapping file was extended to handle ELF .a members there would be an inconsistency there, but I don't think this is worse or less desirable than the inconsistency vs non-ThinLTO links or in process ThinLTO links.

There is no inherent inconsistency between bitcode lazy .o objects vs bitcode .a archives. We just need a --remapping-file= syntax to support an archive member and the build system needs to extract the archive member and invoke clang with appropriate`-fthinlto-index=`.

Couple other follow up comments/questions:

Actually, I am going to argue that --thinlto-index= may solve the "libcall defined in bitcode archive" problem: a backend compile may introduce new libcall references which do not exist in the IR symbol table, and the bitcode archive may not be extracted.
Implicit ThinLTO currently works around it by forcing extraction of bitcode files containing referenced libcall symbols.
If we use --remapping-file=, we can trigger archive member extraction of such bitcode archives, and rely on implicit ThinLTO to compile the very few libcall bitcode archive members.
If we serialize the archive member extraction result and ignore non-extracted bitcode archive members, we'd still need the current workaround.

My understanding looking at this patch is that non-extracted bitcode archive members get mapped to /dev/null in the mapping file, so I don't see how this patch will help the libcall extraction if they are defined in bitcode as described above? But maybe I am misunderstanding the situation described above, or what the patch does.

I had said before I noticed that /dev/null remapping is needed. I've now forgotten why /dev/null remapping is needed...
Not remapping ELF .a members helps libcall symbols which are not hard coded in lto::LTO::getRuntimeLibcallSymbols().

The D22356 and D22467 examples work with --thinlto-index= without forcing the non-lazy state.

% fld.lld t.o --start-lib t2.o --end-lib --start-lib t3.o --end-lib --start-lib t4.o --end-lib -t -e main -mllvm -import-instr-limit=4 --thinlto-index=a.map
% fld.lld t.o.5.precodegen.bc --start-lib t2.o.5.precodegen.bc --end-lib --start-lib t3.o.5.precodegen.bc --end-lib --start-lib t4.o --end-lib -t --remapping-file=a.map -e main
t.o.5.precodegen.bc
t3.o.5.precodegen.bc
t2.o.5.precodegen.bc

It presumably may not work with gold because GNU ld/gold has a different archive member extraction semantics (see https://lld.llvm.org/ELF/warn_backrefs.html).
lld adopts macOS ld64/link.exe's "first lazy object wins" approach and in practice has more stable behaviors.
I suspect lld just needs to drop --warn-backrefs for the final native link in certain cases, but the sacrifice is suitable to me.
(If we re-generate an order depending on -t output for the final native link, --warn-backrefs will not be useful anyway.)

Yes it looks like D22467 definitely and probably D22356 will both work with the approach in this patch specifically because of the way lld does archive member extraction (back then we were using gold). I believe they would fail with this approach + gold or gnu ld. So the other issue is that with this approach we would have to maintain 2 solutions, since we want distributed ThinLTO to work for other linkers such as gold and gnu ld (at least as well as they are now, in the typical case where we are not linking with both native and ThinLTO lazy objects). To get those linkers to work with this approach, you would need to extend the mapping file to add a flag to the line to say "unconditionally link" - or just change the interpretation so that anything listed in the mapping file that isn't mapped to /dev/null is unconditionally extracted/linked. That may be preferable in any case to guarantee that we always link the same objects with lld too?

I certainly like a solution which is more portable and supports more linkers. I care a lot about the GNU ld compatibility as it is the default linker on many Linux distributions. People probably care less about gold now since its default linker state in some groups has been replaced and its upstream development is stagnating (recognized by users and binutils contributors).

Nowadays GNU ld and gold are the only linkers with the odd archive extraction behavior.
Everything else has moved to the VMS (now OpenVMS), Mach-O ld64, Windows link.exe, and ld.lld behavior.
mold uses a variant of such a behavior (I find it difficult to formalize its behavior, but it seems fine in most cases).

If the new option we are proposing does not work with GNU ld, I consider it is ok since (a) its traditional archive member extraction looks problematic (b) I don't know any group wants to use it since it is so slow and confumses so much memory (c) any such user can fall back the --thinlto-index-only=.

Nevertheless, it is still being used. So we would need to maintain 2 solutions long term.

Ideally we will replace --thinlto-index-only with a solution that would work better for all linkers. In generally, I think it is dangerous to redo the archive member selection in the final link, since ThinLTO is performing linker resolution based optimizations. Imo we should strive for a solution that allows the information for this selection to be communicated between the 2 links. I think we should have a very compelling reason to go in the other direction from what we already do on this front for bitcode objects.

--thinlto-index-only will still be supported. I just recall that GNU ld does not support --start-lib (https://sourceware.org/bugzilla/show_bug.cgi?id=24600), so GNU ld cannot perform distributed ThinLTO with the current scheme.

I was thinking it would be dangerous. As stated, I cannot find a reasonable archive/lazy .o example.
Therefore I prefer using the simple form until a convincing example is raised. I have stated that the approach as implemented in this patch is considered experimental.

If we temporarily ignore the Bazel ThinLTO implementation, we probably need to recognize that native .a files are very very common among other build systems.
We don't provide remapping syntax for archive members yet, so remapping native lazy .o files is incomplete.
As stated elsewhere, if we don't remap native .a members to /dev/null, the final native link can technically trigger implicit ThinLTO, which help some native libcall implementations.

I think if this patch was modified as suggested above (map non-selected native lazy objects to /dev/null in the mapping file as well, and possibly doing something as described in the paragraph above to enforce that all objects listed in the mapping file not mapped to /dev/null are always selected and linked) this would end up being somewhat similar to what happens with D130975, but I think it may avoid some of the problems there (i.e. you don't need to modify the rest of the link line to locate and remove the object files, and for any linked in native .a archives I don't think it ends up being worse that the situation today - they can't be listed in the mapping file without some extension to identify .a archive members but the link can work albeit you may get different behavior that in process ThinLTO or non-ThinLTO).

Thoughts?

Hopefully all answered in previous paragraphs.

MaskRay updated this revision to Diff 517445.Apr 26 2023, 9:33 PM
MaskRay retitled this revision from [ELF] Add --thinlto-index= and --remapping-file= to [ELF] Add --thinlto-index=.
MaskRay edited the summary of this revision. (Show Details)

rebase