This is an archive of the discontinued LLVM Phabricator instance.

[lld/mac] Use xxhash instead of MD5 for computing the UUID
ClosedPublic

Authored by thakis on Dec 6 2020, 11:15 AM.

Details

Summary

15% faster for linking Chromium's base_unittests.txt, according to ministat:

    N           Min           Max        Median           Avg        Stddev
x  10      0.650213    0.69287586    0.65793395    0.66127126   0.012365407
+  10    0.54993701    0.59006906    0.55885506    0.56146643   0.013215349
Difference at 95.0% confidence
        -0.0998048 +/- 0.0120244
        -15.0929% +/- 1.81838%
        (Student's t, pooled s = 0.0127974)

And matches what we do on the other ports.

Diff Detail

Event Timeline

thakis requested review of this revision.Dec 6 2020, 11:15 AM
thakis created this revision.

As discussed in D89418 this means that we no longer adhere to RFC 4122. I'm not sure it matters? I only have done perf testing for this -- what's your favorite test case to check that debugging still works?

I'll try to get some automated perf testing set up so that we can find regressions faster.

smeenai added subscribers: clayborg, smeenai.

I'll let @clayborg weigh in on the implications of this for LLDB.

int3 added a subscriber: int3.Dec 7 2020, 3:15 AM

So the ld64 source that implemented RFC 4122 referenced an internal rdar:// task, which makes me think that some other tool (which may or may not be debugging-related) depends on this behavior. But I'm fine with leaving it out for now and waiting to see if anything chokes on it down the line

MaskRay added a subscriber: MaskRay.Dec 7 2020, 9:52 AM

xxhash is a pretty arbitrary choice (https://bugs.llvm.org/show_bug.cgi?id=37029). If there are a better one, we can use that, too.

xxhash is a pretty arbitrary choice (https://bugs.llvm.org/show_bug.cgi?id=37029). If there are a better one, we can use that, too.

Sure. But for now it's what the other ports use, so for now let's go with this :) (...assuming lldb can handle it.)

If md5 is important for some other apple-internal tool, I guess we can add a flag for picking the hash style, like ELF has.

So the UUID that ld64 writes out is the MD5 hash of only parts of the binary that do not have debug info paths in them. So you can build the same binary a /tmp/a or in ~/sources/a and get the exact same UUID for the resulting binary.

The nice thing about this is if you use the same compiler and runtime libraries to compile the same sources, you will end up with a consistent UUID from ld64 that is only hashing the .text and .data and other appropriate sections, and it avoids all STAB symbols (which contains the debug map which has source paths and object file paths) that change with where the binary is built. Apple relies on this for their build system so that dSYM files can be uniqued, and if 5 consecutive builds have the same sources and are built with the same compiler, they will not re-upload a new dSYM to their symbol server, they will re-use the same dSYM.

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

LLDB only uses the UUID to find the matching dSYM when loading debug symbols, so nothing will affect LLDB's ability to use the UUID no matter how it is generated. The main question is do we want to mimic what ld64 is doing or not by having a consistent UUID that depends on the contents of the file that won't change if the binary is built in different source roots.

int3 added a comment.Dec 7 2020, 2:56 PM

Thanks for the info @clayborg! I wasn't aware of that. I guess I should update LLD's STABS implementation to write out N_OSO only after the hash is calculated.

thakis added a comment.Dec 7 2020, 3:40 PM

Thanks! There's two mostly independent questions here.

  1. Choice of hash function. The patch changes the hash function but doesn't modify what parts of the binary get hashed:

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

That's not what this patch is doing. It changes only which hash function we use, not which ranges of the binary we hash.

  1. Output binary determinism.

I agree that this is an important goal. It is independent of this patch.

You can get the benefit of build-dir-independence without hacking up the linker, see "Getting to local determinism" on https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html , in particular the -fdebug-compilation-dir bits. That's important to get .o files that are independent of the build dir as well, which in turn is important for distcc-like systems that can reuse .o caches across users / bots, independent of build dir.

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

Thanks! There's two mostly independent questions here.

  1. Choice of hash function. The patch changes the hash function but doesn't modify what parts of the binary get hashed:

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

That's not what this patch is doing. It changes only which hash function we use, not which ranges of the binary we hash.

Gotcha. So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

  1. Output binary determinism.

I agree that this is an important goal. It is independent of this patch.

You can get the benefit of build-dir-independence without hacking up the linker, see "Getting to local determinism" on https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html , in particular the -fdebug-compilation-dir bits. That's important to get .o files that are independent of the build dir as well, which in turn is important for distcc-like systems that can reuse .o caches across users / bots, independent of build dir.

Using this means debuggers need to have source remapping settings set correctly or we can't show sources correctly. So it does fix determinism but at the cost of debugging not working without user intervention. Most people will try and set a breakpoint, it won't work, then they turn the printf pro. But this kind of thing is needed for caches across users / bots like you mentioned.

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary. The UUID in the file (LC_UUID for mach-o, ELF build ID for others) is what we try to use. When these don't match, we end up not associating the symbol file with the stripped executable and users don't get debug info in their debug session. We have scripts that know how to re-write the ELF build ID that we use at Facebook to make things match up for these cases, but this again requires the user to know why this is happening (most don't) and how to work around the issue (run a script to update the ELF build ID). Many debugger users turn to debugging with non-stripped binaries to get things working, which means the binaries that are cached and sent around and up being huge with all of the debug info.

int3 added a comment.Dec 8 2020, 5:18 AM

So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

No, we're hashing the object paths in the debug info too. But that should be addressed in a separate diff...

RFC 4122 has nothing to do with what gets hashed. It's merely about the format of the UUID string, which stores metadata about how the UUID was generated.

thakis added a comment.Dec 8 2020, 6:07 AM

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary.

This is where I don't follow. My mental model is:

  1. Build with symbols, upload that binary to some symbol server
  2. Run strip
  3. Ship the stripped binary to production (users, data centers, whatever)

strip doesn't change the UUID, it leaves it in place unchanged (...right?). So stripped and non-stripped binary always have a matching UUID in this case. Having a strip-invariant hash algorithm matters if you do something like:

  1. Build with symbols, upload that binary to some symbol server
  2. Independently build the same binary at the same revision, but without symbols
  3. Ship the binary from 2 to production

In that case, if you want the binaries in 1 and 2 to have the same UUID you do need a hashing algorithm like the one you describe. Is the second workflow what you're doing? Or is there something else I'm missing?

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary.

This is where I don't follow. My mental model is:

  1. Build with symbols, upload that binary to some symbol server
  2. Run strip
  3. Ship the stripped binary to production (users, data centers, whatever)

strip doesn't change the UUID, it leaves it in place unchanged (...right?). So stripped and non-stripped binary always have a matching UUID in this case. Having a strip-invariant hash algorithm matters if you do something like:

  1. Build with symbols, upload that binary to some symbol server
  2. Independently build the same binary at the same revision, but without symbols
  3. Ship the binary from 2 to production

In that case, if you want the binaries in 1 and 2 to have the same UUID you do need a hashing algorithm like the one you describe. Is the second workflow what you're doing? Or is there something else I'm missing?

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID. This isn't the case for ELF files as the UUID is based off of the binary contents at build time. ld64 takes only the parts that don't change and makes a hash of those bits only.

So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

No, we're hashing the object paths in the debug info too. But that should be addressed in a separate diff...

Gotcha.

RFC 4122 has nothing to do with what gets hashed. It's merely about the format of the UUID string, which stores metadata about how the UUID was generated.

The main question is if ld64 is doing this and if we want to replicate what it does. Might be a handy option to add in case compatibility is desired: --ld64 which would enable the slower MD5 hashing of only the bits that ld64 does (and possibly other things), without this lld can pick its own hashing algorithm for speed?

int3 added a comment.EditedDec 8 2020, 7:34 PM

I'd rather we not add compatibility options unless we discover it's actually necessary. I'm okay going with the xxhash for now, though personally I think it wouldn't hurt to set the bits to pretend that it's an MD5 hash-based UUID (as opposed to e.g. a timestamp UUID). I could see some tools caring about whether it's a timestamp vs a hash, but I think it's less likely that they would care about the exact hashing algorithm used. (RFC 4122 only specifies representations for MD5 and SHA1 hashes, so we can't represent xxhash.)

thakis added a comment.Dec 9 2020, 8:00 AM

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID.

I understand that part. I'm wondering what the use case is where this is important. In my experience, people build with debug info and then strip (which gives you binaries with and without debug info and with matching UUID even with the elf scheme), instead of building once with debug info and once without. So my question above is: Do you do two separate builds, one with debug info and one without, instead of using strip? If so, why?

thakis updated this revision to Diff 310545.Dec 9 2020, 8:26 AM

Restore rfc 4122 stuff.

thakis added a comment.Dec 9 2020, 8:26 AM

I'd rather we not add compatibility options unless we discover it's actually necessary. I'm okay going with the xxhash for now, though personally I think it wouldn't hurt to set the bits to pretend that it's an MD5 hash-based UUID (as opposed to e.g. a timestamp UUID). I could see some tools caring about whether it's a timestamp vs a hash, but I think it's less likely that they would care about the exact hashing algorithm used. (RFC 4122 only specifies representations for MD5 and SHA1 hashes, so we can't represent xxhash.)

Makes sense. Put back the bits to claim that this is MD5. Please take another look :)

int3 accepted this revision.Dec 9 2020, 2:21 PM

lgtm. Thanks for the detailed comments!

The determinism discussion is interesting but seems orthogonal to this diff, so I'm accepting it...

This revision is now accepted and ready to land.Dec 9 2020, 2:21 PM

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID.

I understand that part. I'm wondering what the use case is where this is important. In my experience, people build with debug info and then strip (which gives you binaries with and without debug info and with matching UUID even with the elf scheme), instead of building once with debug info and once without. So my question above is: Do you do two separate builds, one with debug info and one without, instead of using strip? If so, why?

I build once and strip. I never run into issues on Darwin builds that use ld64. But I have run into issues with android, and in those cases, I believe they did two builds. I am not worried about the contents of the patch, mostly just trying to let you guys know what an existing Darwin ld64 user currently gets so you guys know what the current system provides and how it might affect users that are used to ld64.

clayborg accepted this revision.Dec 9 2020, 4:18 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2020, 6:06 PM