This is an archive of the discontinued LLVM Phabricator instance.

[lto] Do not try to internalize symbols with escaped name
ClosedPublic

Authored by serge-sans-paille on Oct 11 2022, 1:35 PM.

Details

Summary

Because of LLVM mangling escape sequence (through '\01' prefix), it is possible
for a single symbols two have two different IR representations.

For instance, consider @symbol and @"\01_symbol". On OSX, because of the system
mangling rules, these two IR names point are converted in the same final symbol
upon linkage.

LTO doesn't model this behavior, which may result in symbols being incorrectly
internalized (if all reference use the escaping sequence while the definition
doesn't).

The proper approach is probably to use the mangled name to compute GUID to
avoid the dual representation, but we can also avoid discarding symbols that are
bound to two different IR names. This is an approximation, but it's less
intrusive on the codebase.

Fix #57864

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptOct 11 2022, 1:35 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
serge-sans-paille requested review of this revision.Oct 11 2022, 1:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2022, 1:35 PM
int3 added a comment.Oct 11 2022, 1:53 PM

My bad, I had a proper look and yeah this seems to do the right thing.

I'm wondering if the test should use llvm-lto instead of ld64.lld (and placed under llvm/test/LTO) since this is technically not an LLD-specific issue. I don't feel strongly about it though.

If we are keeping the test under lld/, I would prefer we combined both test files into one (relying more on split-file).

We call GlobalValue::dropLLVMManglingEscape on the IRName before computing the GUID for both regular and thin LTO in LTO.cpp. Do you know why this isn't handling this case?

int3 added a comment.Oct 11 2022, 3:44 PM

dropLLVMManglingEscape drops the \01 but doesn't drop the _ prefix. I *think* this prefix is only used on Mach-O and not ELF?

Basically in Mach-O, the un-mangled names \01_foo and foo map to the same mangled name _foo. dropLLVMManglingEscape("\01_foo") gives us back _foo which doesn't match foo.

smeenai added a subscriber: smeenai.EditedOct 11 2022, 4:18 PM

dropLLVMManglingEscape drops the \01 but doesn't drop the _ prefix. I *think* this prefix is only used on Mach-O and not ELF?

It's also used on COFF for x86 (but not x86-64 or arm32 or arm64), if that's at all relevant here.

We call GlobalValue::dropLLVMManglingEscape on the IRName before computing the GUID for both regular and thin LTO in LTO.cpp. Do you know why this isn't handling this case?

It's because link time mangling can add various prefixes to symbols, see https://llvm.org/docs/LangRef.html#data-layout , m:<mangling> section. Basically for visible symbols, GlobalValue::dropLLVMManglingEscape acts as un unmangler for ELF symbols. But it's not the case for other manglers.

Ok thanks all for the context. I spent some time poking around this morning and I think this approach is fine for now but have a couple of questions and comments.

First, I'm wondering why this isn't an issue for regular LTO which by default does index based liveness. Specifically, see the following which is for both regular and thin LTO:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/LTO/LTO.cpp#L1006-L1032

Since it sounds like we get separate combined index entries for the 2 names, I'm wondering why one of them (the one that doesn't match the global resolution IRName) isn't being incorrectly marked dead by that analysis and then subsequently removed by regular LTO here:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/LTO/LTO.cpp#L861-L873

Second, I see that the big problem we have is that we compute the GUID during ThinLTO in particular using the static version of getGlobalIdentifier, since we don't have IR, and thus it isn't easy to modify that to take the mangling into account. But it looks like we would just need the triple, which is available in the InputFile object created for all bitcode files during linking. That could be used here in the linker interface to LTO and then perhaps stored in the index alongside the corresponding module path for later use during thinlto. It looks like there is handling in llvm/lib/IR/Mangler.cpp that does this mangling that could be modified to then work off that information instead of accessing it from the GV. If this is correct I wonder if we should document this idea somewhere for future work. It is fairly invasive though so I'm fine with the workaround here for now.

Couple more comments below.

My bad, I had a proper look and yeah this seems to do the right thing.

I'm wondering if the test should use llvm-lto instead of ld64.lld (and placed under llvm/test/LTO) since this is technically not an LLD-specific issue. I don't feel strongly about it though.

Yeah, I think llvm-lto2 (not llvm-lto, which tests the legacy LTO interface) would be a better tool to test this with, since it doesn't seem lld-specific. Either under llvm/test/ThinLTO, or if this is also an issue with regular LTO, then under llvm/test/LTO.

If we are keeping the test under lld/, I would prefer we combined both test files into one (relying more on split-file).

Looks like the tests already use split-file? Maybe that was changed after this comment though.

lld/test/MachO/hidden-escaped-symbols-alt.ll
5 ↗(On Diff #466902)

I know this is just checking that it links correctly without error, but it would be better to add an explicit CHECK for the linked symbol too. Also please add a comment at the top about what is being tested. Ditto for other new test.

llvm/lib/LTO/LTO.cpp
571

Can you put in more details on why (with the MachO mangling as an example).

573

I assume by "dismissal" you mean "internalization" or "dead code elimination". Can you change this to mention which of those is happening incorrectly, since "dismissal" isn't really a term used elsewhere in LTO afaik.

int3 added a comment.Oct 12 2022, 9:33 AM

Looks like the tests already use split-file? Maybe that was changed after this comment though.

I was just trying to suggest that hidden-escaped-symbols-alt.ll and hidden-escaped-symbols.ll be combined into one test file.

Update test cases:

  • use llvm-lto2
  • illustrate behavior on full and thin LTO

Also update comments on the extra condition, as suggested by @tejohnson

Fix typo + add a FIXME to keep track of potential enhancements.

tejohnson accepted this revision.Oct 13 2022, 5:48 AM

lgtm, but please wait to see if @int3 has any more comments.
Suggest adding a comment to the top of the tests about what it is testing (that LTO is properly handling the MachO mangling and treating @hide_me and @"\01_hide_me" as the same symbol).

This revision is now accepted and ready to land.Oct 13 2022, 5:48 AM
int3 accepted this revision.Oct 14 2022, 10:48 AM

lgtm, thanks for doing this!

llvm/lib/LTO/LTO.cpp
571

nit: technically this is macOS + iOS, not just OSX