This is an archive of the discontinued LLVM Phabricator instance.

[Object] Fix updating darwin archives
ClosedPublic

Authored by keith on May 3 2022, 6:51 PM.

Details

Summary

When creating an archive, llvm-ar looks at the host to determine the
archive format to use, on Apple platforms this means it uses the
K_DARWIN format. K_DARWIN is _virtually_ equivalent to K_BSD, expect for
some very slight differences around padding, timestamps in deterministic
mode, and 64 bit formats. When updating an archive using llvm-ar, or
llvm-objcopy, Archive would try to determine the kind, but it was not
possible to get K_DARWIN in the initialization of the archive, because
they're virtually indistinguishable from K_BSD, especially since the
slight differences only apply in very specific cases. This leads to
linker failures when the alignment workaround is not applied to an
archive copied with llvm-objcopy. This change teaches Archive to infer
the K_DARWIN type in the cases where it's possible and the first object
in the archive is a macho object. This avoids using the host triple to
determine this to not affect cross compiling.

Ideally we would eliminate the separate K_DARWIN type entirely since
it's not a truly separate archive type, but then we'd have to force the
macho workarounds on the BSD format generally. This might be acceptable
but then it would be unclear how to handle this case without forcing the
K_DARWIN64 format on all BSD users:

if (LastOffset >= Sym64Threshold) {
  if (Kind == object::Archive::K_DARWIN)
    Kind = object::Archive::K_DARWIN64;
  else
    Kind = object::Archive::K_GNU64;
}

The logic used to determine if the object is macho is derived from the
logic llvm-ar uses.

Previous context:

  • 111cd669e90e5b2132187d36f8b141b11a671a8b
  • 23a76be5adcaa768ba538f8a4514a7afccf61988

Diff Detail

Event Timeline

keith created this revision.May 3 2022, 6:51 PM
keith requested review of this revision.May 3 2022, 6:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2022, 6:51 PM
keith updated this revision to Diff 426899.May 3 2022, 7:19 PM

Fix spelling

keith edited the summary of this revision. (Show Details)May 3 2022, 8:03 PM

I wonder if we can just pay a penalty and simplify this to always just check the first member unconditionally.

I don't think we ought to modify the archive reader like you've done, because we don't need to tell the difference between BSD/DARWIN when reading, only while writing. Thus, going through the trouble of distinguishing them in the reader is wasted work. Instead, please add a separate function, which ar and objcopy can call before constructing an ArchiveWriter, which returns the more "refined" Kind.

Also, while your goal of improving the heuristics seems reasonable, perhaps we ought to have a --archive-format option for objcopy, so that users can be explicit if they desire. (We already have --format for ar, and for the cross-compiling case it's basically required -- there's simply no other way to choose the correct behavior for archives with no files, like llvm-ar rcs x.a).

llvm/test/tools/llvm-objcopy/MachO/Inputs/unaligned_objects.yaml
1 ↗(On Diff #426899)

This file seems to have a lot of unnecessary junk in it (e.g. actual code, lots of symbols, etc). Can it be stripped down?

keith added a comment.May 4 2022, 5:48 PM

I don't think we ought to modify the archive reader like you've done, because we don't need to tell the difference between BSD/DARWIN when reading, only while writing. Thus, going through the trouble of distinguishing them in the reader is wasted work. Instead, please add a separate function, which ar and objcopy can call before constructing an ArchiveWriter, which returns the more "refined" Kind.

Sounds fair, I'll take a stab at this. I naturally gravitated towards doing it as early as possible since I was worried the Format might be read and used somehow, such that changing it later might introduce some incompatibilities. I will try to take this path instead.

Also, while your goal of improving the heuristics seems reasonable, perhaps we ought to have a --archive-format option for objcopy, so that users can be explicit if they desire. (We already have --format for ar, and for the cross-compiling case it's basically required -- there's simply no other way to choose the correct behavior for archives with no files, like llvm-ar rcs x.a).

From a UX perspective I hope we can come to a reasonable patch here instead so folks don't have to know about these details in this case. In the case of no files I don't think the decision really matters for this use case at least, although I get that issue in general.

From this comment I also realized I could probably apply the "is first object macho" heuristic only to the decision about what format is needed for 64 bit archives, and therefore I could potentially take the path of removing K_DARWIN if we felt like applying the other darwin specific workarounds to standard BSD archives was also acceptable, do you have any thoughts there? I believe there wouldn't be any functional downsides, as it worked this way in the past before as well, but it adds some unnecessary behavior for non-macho K_BSD archives.

llvm/test/tools/llvm-objcopy/MachO/Inputs/unaligned_objects.yaml
1 ↗(On Diff #426899)

Fair enough, I can try to pair these down, I took the real world use case I hit this with since it wasn't enormous

keith updated this revision to Diff 427168.May 4 2022, 6:04 PM

Update to only check new Kind when writing

jhenderson added inline comments.May 5 2022, 4:50 AM
llvm/lib/Object/ArchiveWriter.cpp
154

I feel like this function really belongs as a free function in either the BinaryFormat or Object libraries. If I'm not mistaken, the fact that the object is an archive member is irrelevant to this function?

Also, while your goal of improving the heuristics seems reasonable, perhaps we ought to have a --archive-format option for objcopy, so that users can be explicit if they desire. (We already have --format for ar, and for the cross-compiling case it's basically required -- there's simply no other way to choose the correct behavior for archives with no files, like llvm-ar rcs x.a).

From a UX perspective I hope we can come to a reasonable patch here instead so folks don't have to know about these details in this case. In the case of no files I don't think the decision really matters for this use case at least, although I get that issue in general.

K_DARWIN definitely matters for an empty archive -- ld64 obnoxiously refuses to accept an empty archive unless it has a symbol table. Try the test case: llvm-ar cr --format=darwin x.a; llvm-objcopy x.a y.a.

Of course, if ld64 is ever removed in favor of lld, we could stop worrying about most of these issues, and just write normal BSD (or even GNU/GNU64!) archives...except for one bit. The workaround which would need to remain is the duplicate-filename+timestamp avoidance. That's necessitated by the format used for "N_OSO" entries in a linked binary, which is needed so that lldb and dsymutil can lookup archive members to get debug info (as explained in a long comment in ArchiveWriter.cpp.)

From this comment I also realized I could probably apply the "is first object macho" heuristic only to the decision about what format is needed for 64 bit archives, and therefore I could potentially take the path of removing K_DARWIN if we felt like applying the other darwin specific workarounds to standard BSD archives was also acceptable, do you have any thoughts there? I believe there wouldn't be any functional downsides, as it worked this way in the past before as well, but it adds some unnecessary behavior for non-macho K_BSD archives.

I'd rather not.

llvm/lib/Object/ArchiveWriter.cpp
666–667

I don't think this function should be doing this -- I'd like for the two callers in llvm/lib/ObjCopy/Archive.cpp (where it does deepWriteArchive(..., Ar.kind(), ...)) and llvm/tools/llvm-ar/llvm-ar.cpp (where it does Kind = OldArchive->kind();) to pass a different Kind (computed using some common function placed somewhere appropriate).

keith added a comment.May 5 2022, 9:14 AM

Of course, if ld64 is ever removed in favor of lld, we could stop worrying about most of these issues

I won't hold my breath :(

and just write normal BSD (or even GNU/GNU64!) archives...except for one bit. The workaround which would need to remain is the duplicate-filename+timestamp avoidance. That's necessitated by the format used for "N_OSO" entries in a linked binary, which is needed so that lldb and dsymutil can lookup archive members to get debug info (as explained in a long comment in ArchiveWriter.cpp.)

My assumption if it did come down to this is we could safely apply this behavior to all BSD archives, since it's still deterministic and I don't think would have significant downsides? The only thing I wonder is how many places do something like timestamp == 0 to mean something special?

llvm/lib/Object/ArchiveWriter.cpp
666–667

What do you think about the trade off of new places not automatically given this behavior if we do that?

keith updated this revision to Diff 427452.May 5 2022, 2:25 PM
keith marked an inline comment as done.

Move format decision to callers

keith marked an inline comment as not done.May 5 2022, 2:28 PM
keith added inline comments.
llvm/lib/Object/ArchiveWriter.cpp
154

Yes any buffer could be used for this function. I added it here thinking that if there wasn't already a use case for this logic out there it might make sense to keep it localized until it's needed elsewhere, also now since it's called from 2 external users instead of only here. I'm happy to move it to a free function in the Object library if you think I should!

jhenderson added inline comments.May 5 2022, 11:05 PM
llvm/lib/Object/ArchiveWriter.cpp
154

I think I'm okay with it being a function contained within this file, if it's not useful outside the scope of archive writing, but I think it makes more sense sitting somewhere alongside Mach-O code rather than archive code.

keith updated this revision to Diff 427878.May 7 2022, 10:46 AM
keith marked 2 inline comments as done.

Move isMacho to Object library

keith updated this revision to Diff 427879.May 7 2022, 10:47 AM

Remove unused import

keith updated this revision to Diff 427884.May 7 2022, 11:36 AM
keith marked an inline comment as done.

Change test strategy to test for timestamps instead of alignment. This gives us the same result knowing that the darwin specific logic is hit, and it's easier to test for than crafting an unaligned binary.

keith updated this revision to Diff 427887.May 7 2022, 12:11 PM

Add test for universal object archive copy

keith updated this revision to Diff 427974.May 8 2022, 8:55 PM

Improve tests

jhenderson added inline comments.May 9 2022, 1:21 AM
llvm/test/tools/llvm-ar/macho-edit.test
2

I don't think you need or want this.

4

Also below: comments should end with a full stop.

5
8

Incomplete sentence?

12

Actually, I'm pretty sure it's not ignored: r is replace, so it actively replaces the existing file (with an identical one). I could be wrong though - it's been a while since I've looked in depth at the archive file writing internals.

llvm/test/tools/llvm-objcopy/MachO/archive-format.test
4
5

&& is the more common pattern for linking these commands in tests from what I've seen. I think it's better too.

keith updated this revision to Diff 428106.May 9 2022, 9:08 AM
keith marked 7 inline comments as done.

Improve tests

keith updated this revision to Diff 428123.May 9 2022, 10:17 AM

Add required target to archive-format.test

Thanks for the updates! Unfortunately, I don't know anything about the specifics of Darwin archives, so I think it's best if one of the other reviewers signs off on the logic of this patch.

Couple more comments, but looks pretty good.

llvm/lib/ObjCopy/MachO/MachOObjcopy.cpp
487–490

This ought to specify K_DARWIN unconditionally. An archive inside a Mach-O universal binary wrapper has no reason to be in other formats.

llvm/tools/llvm-ar/llvm-ar.cpp
944–951

The special-casing only needs to apply in the "else if (OldArchive)" under "case Default" above.

  1. If "--format=bsd" was explicitly specified on the command-line, we shouldn't be overruling with a detected format.
  2. In the other cases in the "if" chain within that, we've computed Kind from getKindFromMember, which already has logic that does something very similar (possibly even the same?

Actually, not quite sure, but maybe getKindFromMember's detection logic could be unified with the detection logic for isMachO?

keith updated this revision to Diff 428734.May 11 2022, 11:52 AM
keith marked an inline comment as done.

Extract previous llvm-ar logic instead of using a new function

llvm/lib/ObjCopy/MachO/MachOObjcopy.cpp
487–490

Yea I actually had that originally, but it broke some tests where it manually specified --flavor=gnu, I looked at the history of that test and that wasn't the intent of the change in the test, but I figured I'd preserve the behavior. The only case I was potentially worried about with that is if folks were intentionally using thin archives (it would crash in that case at least, instead of doing the entirely wrong thing). I've simplified the logic here to not check the members to preserve the thin case, but I'm happy to just change it to always pass darwin if you think that would still be preferred.

llvm/tools/llvm-ar/llvm-ar.cpp
944–951

Ah yes good point, moved this up there.

I originally went down the path of extracting getKindFromMember for this so it could be used in objcopy as well, but I wasn't sure if we'd want to override the format in the other cases we hit here. This diff has changed a lot though so I pushed that approach here by moving this function to ArchiveMember, let me know what you think!

jhenderson added inline comments.May 11 2022, 11:47 PM
llvm/lib/Object/ArchiveWriter.cpp
62
keith updated this revision to Diff 428989.May 12 2022, 10:11 AM
keith marked an inline comment as done.

Fix comment formatting

friendly ping!

jyknight accepted this revision.May 18 2022, 11:56 AM

LGTM, modulo final comments.

llvm/include/llvm/Object/ArchiveWriter.h
29

Please add a comment describing what this does.

llvm/lib/Object/ArchiveWriter.cpp
50

This isn't a great name because it sounds like a simple accessor, but it's not at all. Maybe detectKindFromObjectFormat()?

This revision is now accepted and ready to land.May 18 2022, 11:56 AM
keith updated this revision to Diff 430724.May 19 2022, 9:50 AM
keith marked 3 inline comments as done.

Improve name and add comment

This revision was automatically updated to reflect the committed changes.