This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lld/MachO/
-
MachO/
-
Writer.cpp

Differential D92736

[lld/mac] Use xxhash instead of MD5 for computing the UUID
ClosedPublic

Authored by thakis on Dec 6 2020, 11:15 AM.

Download Raw Diff

Details

Reviewers

clayborg
int3

Group Reviewers

Restricted Project

Commits

rG9d6177c2a526: [lld/mac] Use xxhash instead of MD5 for computing the UUID

Summary

15% faster for linking Chromium's base_unittests.txt, according to ministat:

    N           Min           Max        Median           Avg        Stddev
x  10      0.650213    0.69287586    0.65793395    0.66127126   0.012365407
+  10    0.54993701    0.59006906    0.55885506    0.56146643   0.013215349
Difference at 95.0% confidence
        -0.0998048 +/- 0.0120244
        -15.0929% +/- 1.81838%
        (Student's t, pooled s = 0.0127974)

And matches what we do on the other ports.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

As discussed in D89418 this means that we no longer adhere to RFC 4122. I'm not sure it matters? I only have done perf testing for this -- what's your favorite test case to check that debugging still works?

I'll try to get some automated perf testing set up so that we can find regressions faster.

I'll let @clayborg weigh in on the implications of this for LLDB.

So the ld64 source that implemented RFC 4122 referenced an internal rdar:// task, which makes me think that some other tool (which may or may not be debugging-related) depends on this behavior. But I'm fine with leaving it out for now and waiting to see if anything chokes on it down the line

xxhash is a pretty arbitrary choice (https://bugs.llvm.org/show_bug.cgi?id=37029). If there are a better one, we can use that, too.

In D92736#2437447, @MaskRay wrote:

xxhash is a pretty arbitrary choice (https://bugs.llvm.org/show_bug.cgi?id=37029). If there are a better one, we can use that, too.

Sure. But for now it's what the other ports use, so for now let's go with this :) (...assuming lldb can handle it.)

If md5 is important for some other apple-internal tool, I guess we can add a flag for picking the hash style, like ELF has.

So the UUID that ld64 writes out is the MD5 hash of only parts of the binary that do not have debug info paths in them. So you can build the same binary a /tmp/a or in ~/sources/a and get the exact same UUID for the resulting binary.

The nice thing about this is if you use the same compiler and runtime libraries to compile the same sources, you will end up with a consistent UUID from ld64 that is only hashing the .text and .data and other appropriate sections, and it avoids all STAB symbols (which contains the debug map which has source paths and object file paths) that change with where the binary is built. Apple relies on this for their build system so that dSYM files can be uniqued, and if 5 consecutive builds have the same sources and are built with the same compiler, they will not re-upload a new dSYM to their symbol server, they will re-use the same dSYM.

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

LLDB only uses the UUID to find the matching dSYM when loading debug symbols, so nothing will affect LLDB's ability to use the UUID no matter how it is generated. The main question is do we want to mimic what ld64 is doing or not by having a consistent UUID that depends on the contents of the file that won't change if the binary is built in different source roots.

Thanks for the info @clayborg! I wasn't aware of that. I guess I should update LLD's STABS implementation to write out N_OSO only after the hash is calculated.

clayborg mentioned this in D92537: [lld-macho] Implement -object_path_lto.Dec 7 2020, 3:09 PM

Thanks! There's two mostly independent questions here.

Choice of hash function. The patch changes the hash function but doesn't modify what parts of the binary get hashed:

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

That's not what this patch is doing. It changes only which hash function we use, not which ranges of the binary we hash.

Output binary determinism.

I agree that this is an important goal. It is independent of this patch.

You can get the benefit of build-dir-independence without hacking up the linker, see "Getting to local determinism" on https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html , in particular the -fdebug-compilation-dir bits. That's important to get .o files that are independent of the build dir as well, which in turn is important for distcc-like systems that can reuse .o caches across users / bots, independent of build dir.

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

In D92736#2438432, @thakis wrote:

Thanks! There's two mostly independent questions here.

Choice of hash function. The patch changes the hash function but doesn't modify what parts of the binary get hashed:

So if we change the UUID to be unique on each build, we can probably guarantee that Apple won't ever use lld for their builds. Not sure how important this is. This also means that if a UUID is generated by ld64 for a stripped or non-stripped binary, it will end up being the same, which is nice.

That's not what this patch is doing. It changes only which hash function we use, not which ranges of the binary we hash.

Gotcha. So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

Output binary determinism.

I agree that this is an important goal. It is independent of this patch.

You can get the benefit of build-dir-independence without hacking up the linker, see "Getting to local determinism" on https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html , in particular the -fdebug-compilation-dir bits. That's important to get .o files that are independent of the build dir as well, which in turn is important for distcc-like systems that can reuse .o caches across users / bots, independent of build dir.

Using this means debuggers need to have source remapping settings set correctly or we can't show sources correctly. So it does fix determinism but at the cost of debugging not working without user intervention. Most people will try and set a breakpoint, it won't work, then they turn the printf pro. But this kind of thing is needed for caches across users / bots like you mentioned.

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary. The UUID in the file (LC_UUID for mach-o, ELF build ID for others) is what we try to use. When these don't match, we end up not associating the symbol file with the stripped executable and users don't get debug info in their debug session. We have scripts that know how to re-write the ELF build ID that we use at Facebook to make things match up for these cases, but this again requires the user to know why this is happening (most don't) and how to work around the issue (run a script to update the ELF build ID). Many debugger users turn to debugging with non-stripped binaries to get things working, which means the binaries that are cached and sent around and up being huge with all of the debug info.

So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

No, we're hashing the object paths in the debug info too. But that should be addressed in a separate diff...

RFC 4122 has nothing to do with what gets hashed. It's merely about the format of the UUID string, which stores metadata about how the UUID was generated.

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary.

This is where I don't follow. My mental model is:

Build with symbols, upload that binary to some symbol server
Run strip
Ship the stripped binary to production (users, data centers, whatever)

strip doesn't change the UUID, it leaves it in place unchanged (...right?). So stripped and non-stripped binary always have a matching UUID in this case. Having a strip-invariant hash algorithm matters if you do something like:

Build with symbols, upload that binary to some symbol server
Independently build the same binary at the same revision, but without symbols
Ship the binary from 2 to production

In that case, if you want the binaries in 1 and 2 to have the same UUID you do need a hashing algorithm like the one you describe. Is the second workflow what you're doing? Or is there something else I'm missing?

In D92736#2439612, @thakis wrote:

Is stripped builds and non-stripped builds having the same UUID something that's useful in practice? I would've expected that everyone builds a debug binary with symbols, and then strips it after building. What's the use case for links with and without debug info and wanting the same UUID?

We have cases where the ELF build ID changes on us for binaries when we have stripped debug info. Debuggers really want to match a stripped binary to a non stripped binary.

This is where I don't follow. My mental model is:

Build with symbols, upload that binary to some symbol server

Run strip

Ship the stripped binary to production (users, data centers, whatever)

strip doesn't change the UUID, it leaves it in place unchanged (...right?). So stripped and non-stripped binary always have a matching UUID in this case. Having a strip-invariant hash algorithm matters if you do something like:

Build with symbols, upload that binary to some symbol server

Independently build the same binary at the same revision, but without symbols

Ship the binary from 2 to production

In that case, if you want the binaries in 1 and 2 to have the same UUID you do need a hashing algorithm like the one you describe. Is the second workflow what you're doing? Or is there something else I'm missing?

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID. This isn't the case for ELF files as the UUID is based off of the binary contents at build time. ld64 takes only the parts that don't change and makes a hash of those bits only.

In D92736#2439511, @int3 wrote:

So we are hashing only what ld64 is hashing? Do the RFC 4122 changes make us differ from ld64 if we don't switch hashing algorithms?

No, we're hashing the object paths in the debug info too. But that should be addressed in a separate diff...

Gotcha.

RFC 4122 has nothing to do with what gets hashed. It's merely about the format of the UUID string, which stores metadata about how the UUID was generated.

The main question is if ld64 is doing this and if we want to replicate what it does. Might be a handy option to add in case compatibility is desired: --ld64 which would enable the slower MD5 hashing of only the bits that ld64 does (and possibly other things), without this lld can pick its own hashing algorithm for speed?

I'd rather we not add compatibility options unless we discover it's actually necessary. I'm okay going with the xxhash for now, though personally I think it wouldn't hurt to set the bits to pretend that it's an MD5 hash-based UUID (as opposed to e.g. a timestamp UUID). I could see some tools caring about whether it's a timestamp vs a hash, but I think it's less likely that they would care about the exact hashing algorithm used. (RFC 4122 only specifies representations for MD5 and SHA1 hashes, so we can't represent xxhash.)

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID.

I understand that part. I'm wondering what the use case is where this is important. In my experience, people build with debug info and then strip (which gives you binaries with and without debug info and with matching UUID even with the elf scheme), instead of building once with debug info and once without. So my question above is: Do you do two separate builds, one with debug info and one without, instead of using strip? If so, why?

Restore rfc 4122 stuff.

In D92736#2441709, @int3 wrote:

I'd rather we not add compatibility options unless we discover it's actually necessary. I'm okay going with the xxhash for now, though personally I think it wouldn't hurt to set the bits to pretend that it's an MD5 hash-based UUID (as opposed to e.g. a timestamp UUID). I could see some tools caring about whether it's a timestamp vs a hash, but I think it's less likely that they would care about the exact hashing algorithm used. (RFC 4122 only specifies representations for MD5 and SHA1 hashes, so we can't represent xxhash.)

Makes sense. Put back the bits to claim that this is MD5. Please take another look :)

lgtm. Thanks for the detailed comments!

The determinism discussion is interesting but seems orthogonal to this diff, so I'm accepting it...

This revision is now accepted and ready to land.Dec 9 2020, 2:21 PM

In D92736#2442831, @thakis wrote:

The point is if you produce a binary with ld64 with or without local symbols, with or without debug info (STABS debug map) -- at link time -- it will produce the same UUID.

I understand that part. I'm wondering what the use case is where this is important. In my experience, people build with debug info and then strip (which gives you binaries with and without debug info and with matching UUID even with the elf scheme), instead of building once with debug info and once without. So my question above is: Do you do two separate builds, one with debug info and one without, instead of using strip? If so, why?

I build once and strip. I never run into issues on Darwin builds that use ld64. But I have run into issues with android, and in those cases, I believe they did two builds. I am not worried about the contents of the patch, mostly just trying to let you guys know what an existing Darwin ld64 user currently gets so you guys know what the current system provides and how it might affect users that are used to ld64.

clayborg accepted this revision.Dec 9 2020, 4:18 PM

Closed by commit rG9d6177c2a526: [lld/mac] Use xxhash instead of MD5 for computing the UUID (authored by thakis). · Explain WhyDec 9 2020, 6:06 PM

This revision was automatically updated to reflect the committed changes.

thakis added a commit: rG9d6177c2a526: [lld/mac] Use xxhash instead of MD5 for computing the UUID.

Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2020, 6:06 PM

Revision Contents

Path

Size

lld/

MachO/

Writer.cpp

36 lines

Diff 310738

lld/MachO/Writer.cpp

Show All 18 Lines
#include "Target.h"		#include "Target.h"
#include "UnwindInfoSection.h"		#include "UnwindInfoSection.h"

#include "lld/Common/ErrorHandler.h"		#include "lld/Common/ErrorHandler.h"
#include "lld/Common/Memory.h"		#include "lld/Common/Memory.h"
#include "llvm/BinaryFormat/MachO.h"		#include "llvm/BinaryFormat/MachO.h"
#include "llvm/Config/llvm-config.h"		#include "llvm/Config/llvm-config.h"
#include "llvm/Support/LEB128.h"		#include "llvm/Support/LEB128.h"
#include "llvm/Support/MD5.h"
#include "llvm/Support/MathExtras.h"		#include "llvm/Support/MathExtras.h"
#include "llvm/Support/Path.h"		#include "llvm/Support/Path.h"
		#include "llvm/Support/xxhash.h"

#include <algorithm>		#include <algorithm>

using namespace llvm;		using namespace llvm;
using namespace llvm::MachO;		using namespace llvm::MachO;
using namespace lld;		using namespace lld;
using namespace lld::macho;		using namespace lld::macho;

▲ Show 20 Lines • Show All 325 Lines • ▼ Show 20 Lines	public:

void writeTo(uint8_t *buf) const override {		void writeTo(uint8_t *buf) const override {
auto c = reinterpret_cast<uuid_command >(buf);		auto c = reinterpret_cast<uuid_command >(buf);
c->cmd = LC_UUID;		c->cmd = LC_UUID;
c->cmdsize = getSize();		c->cmdsize = getSize();
uuidBuf = c->uuid;		uuidBuf = c->uuid;
}		}

void writeUuid(const std::array<uint8_t, 16> &uuid) const {		void writeUuid(uint64_t digest) const {
memcpy(uuidBuf, uuid.data(), uuid.size());		// xxhash only gives us 8 bytes, so put some fixed data in the other half.
		static_assert(sizeof(uuid_command::uuid) == 16, "unexpected uuid size");
		memcpy(uuidBuf, "LLD\xa1UU1D", 8);
		memcpy(uuidBuf + 8, &digest, 8);

		// RFC 4122 conformance. We need to fix 4 bits in byte 6 and 2 bits in
		// byte 8. Byte 6 is already fine due to the fixed data we put in. We don't
		// want to lose bits of the digest in byte 8, so swap that with a byte of
		// fixed data that happens to have the right bits set.
		std::swap(uuidBuf[3], uuidBuf[8]);

		// Claim that this is an MD5-based hash. It isn't, but this signals that
		// this is not a time-based and not a random hash. MD5 seems like the least
		// bad lie we can put here.
		assert((uuidBuf[6] & 0xf0) == 0x30 && "See RFC 4122 Sections 4.2.2, 4.1.3");
		assert((uuidBuf[8] & 0xc0) == 0x80 && "See RFC 4122 Section 4.2.2");
}		}

mutable uint8_t *uuidBuf;		mutable uint8_t *uuidBuf;
};		};

} // namespace		} // namespace

void Writer::scanRelocations() {		void Writer::scanRelocations() {
▲ Show 20 Lines • Show All 276 Lines • ▼ Show 20 Lines
void Writer::writeSections() {		void Writer::writeSections() {
uint8_t *buf = buffer->getBufferStart();		uint8_t *buf = buffer->getBufferStart();
for (OutputSegment *seg : outputSegments)		for (OutputSegment *seg : outputSegments)
for (OutputSection *osec : seg->getSections())		for (OutputSection *osec : seg->getSections())
osec->writeTo(buf + osec->fileOff);		osec->writeTo(buf + osec->fileOff);
}		}

void Writer::writeUuid() {		void Writer::writeUuid() {
MD5 hash;		uint64_t digest =
const auto bufStart = reinterpret_cast<char >(buffer->getBufferStart());		xxHash64({buffer->getBufferStart(), buffer->getBufferEnd()});
const auto bufEnd = reinterpret_cast<char >(buffer->getBufferEnd());		uuidCommand->writeUuid(digest);
hash.update(StringRef(bufStart, bufEnd - bufStart));
MD5::MD5Result result;
hash.final(result);
// Conform to UUID version 4 & 5 as specified in RFC 4122:
// 1. Set the version field to indicate that this is an MD5-based UUID.
result.Bytes[6] = (result.Bytes[6] & 0xf) \| 0x30;
// 2. Set the two MSBs of uuid_t::clock_seq_hi_and_reserved to zero and one.
result.Bytes[8] = (result.Bytes[8] & 0x3f) \| 0x80;
uuidCommand->writeUuid(result.Bytes);
}		}

void Writer::run() {		void Writer::run() {
// dyld requires __LINKEDIT segment to always exist (even if empty).		// dyld requires __LINKEDIT segment to always exist (even if empty).
OutputSegment *linkEditSegment =		OutputSegment *linkEditSegment =
getOrCreateOutputSegment(segment_names::linkEdit);		getOrCreateOutputSegment(segment_names::linkEdit);

prepareBranchTarget(config->entry);		prepareBranchTarget(config->entry);
▲ Show 20 Lines • Show All 66 Lines • Show Last 20 Lines