This is an archive of the discontinued LLVM Phabricator instance.

ELF: Implement --build-id.
AbandonedPublic

Authored by MaskRay on Mar 10 2016, 11:23 AM.

Details

Summary

This is not ready for submission.

This patch implements --build-id in a naive way. After the linker creates
an output file in the memory buffer, it computes the md5sum of the resulting
file and set the hash to the .note section as a build-id.

The problem is that its performance impact is too large. Computing a secure
hash is a computational intensive task (our md5sum function seems a bit too
slow, though). Here are some numbers (milliseconds to link):

LLD: 713.78 -> 883.44 (+23.8%)
Scylla: 5005.64 -> 5437.84 (+8.6%)

Even if we replace MD5 with a faster hash function, it cannot be within
an "okay" range, which is, say, +3%.

Do you guys have any opnion on this? I have a few ideas.

  • Use CRC32 instead of MD5. Non-secure hash should be much faster.
  • Make clang to calculate a secure hash for each section. This is basically moving the workload from the linker to the compiler, but we can use the hash for ICF in the linker, so it might be a overall win.
  • Compute a build-id from input files' timestamps. This makes builds non-reproducible if you touch a file, so I don't think we want this.
  • Build-id is not needed for program execution in general. So we may be able to let the linker exit as soon as it's done with linking, and backfill the hash value in a background process which is kicked in by the linker. (It's probably unrealistic plan, though.)
  • Do not support build-id and let the user pass --build-id=<hash value>.

Diff Detail

Event Timeline

ruiu updated this revision to Diff 50317.Mar 10 2016, 11:23 AM
ruiu retitled this revision from to ELF: Implement --build-id..
ruiu updated this object.
ruiu added reviewers: rafael, emaste.
ruiu added a subscriber: llvm-commits.
emaste edited edge metadata.Mar 10 2016, 11:54 AM

I think crc32 is a nonstarter: we probably don't need all of the attributes of a cryptographically secure hash, we do want to avoid collisions and I don't think crc32 is sufficient. Off hand I'd suggest we might want to try blake2b (https://blake2.net/). Timestamps are no good due to collisions and nonreproducibility.

Make clang to calculate a secure hash for each section. This is basically moving the workload from the linker to the compiler, but we can use the hash for ICF in the linker, so it might be a overall win.

This is a pretty neat idea, and we can easily combine those hashes into the output build-id value. We'd have to come up with an appropriate way to store those hashes though.

I was thinking about an approach of post-filling build-id in FreeBSD, because we'd like --build-id and it's not supported in GNU ld 2.17.50 that we have in the base system. We'd provide a dummy space for a build-id note e.g. in csu, and give objcopy the ability to fill that in. But I agree that it doesn't make a lot of sense in the context of a feature to add to lld.

Do not support build-id and let the user pass --build-id=<hash value>.

bfd and gold both allow a hash value to be specified directly and it seems reasonable to support that mode even when some other hashing is implemented.

All of that said, I think something like blake2 is not too far off -- my 4 year old desktop uses 1.6 CPU seconds to hash a 1GB random file. We can also look at a threaded model for hash calculation.

In any case, a straightforward (but optimized) approach in lld is going to be no worse than invoking a separate tool to calculate and store the hash, and for my use case that's fine. Is it that unreasonable to pay the cost if the user specifies --build-id, and it's off by default for the regular edit-compile-test cycle?

Do you have a sense of how other linkers implement this feature and/or how
much it costs in performance/link time? You mention what an "okay" cost for
this would be, but it's not clear to me how you determined that cost.

rafael edited edge metadata.Mar 10 2016, 12:17 PM
rafael added a subscriber: rafael.

BTW, have you tried this in combination with the patch for using
posix_memalign? Since this basically reads back the output of the
linker I would be curious if that patch makes any difference is the
results.

  • Use CRC32 instead of MD5. Non-secure hash should be much faster.

That is a bit too weak I think, but we can block this on getting a
better MD5 or use another strong but fast hash.

  • Make clang to calculate a secure hash for each section. This is basically moving the workload from the linker to the compiler, but we can use the hash for ICF in the linker, so it might be a overall win.

Seems like something we should do in the future, but should not depend
on it. That is, having the hash available speeds up build-id, but
build-id should work without it.

  • Compute a build-id from input files' timestamps. This makes builds non-reproducible if you touch a file, so I don't think we want this.

Not this. O use of build-id is for checking for reproducible builds I think.

  • Build-id is not needed for program execution in general. So we may be able to let the linker exit as soon as it's done with linking, and backfill the hash value in a background process which is kicked in by the linker. (It's probably unrealistic plan, though.)

A bit dangerous I would say.

Cheers,
Rafael

silvas added a subscriber: silvas.Mar 10 2016, 2:16 PM

One thing that is missing from the data you provided in the OP is the actual amount of data hashed, and in what pattern (lots of small, or a couple large?). How fast does the hashing have to be to be "acceptable"? E.g. ADT/Hashing.h claims 6.5GB/s hashing for large keys on Nehalem (which is pretty old at this point, modern CPU's likely to better). That is 6.5MB/ms. If we need to hash 100MB then the overhead should be about 15ms.

For LLD, you quoted a number 713.78ms link time. So for 1% overhead we need to spend <7ms. This is enough for 45MB hashed with ADT/Hashing.h (which is similar to LLD text+data size), assuming that we can get max memory bandwidth (like Rafael said, we can do this while we would be copying data otherwise, so I don't see a reason we can't hit the full performance of ADT/Hashing.h).

ruiu added a subscriber: ruiu.Mar 10 2016, 2:52 PM

For LLD, it hashed 75,421,320 bytes. For Scylla, it hashed 207,703,010
bytes. So the throughput of MD5 is 444 MB/s and 480 MB/s, respectively.

BLAKE2 claims that it is about 1.6x faster than MD5, but it is probably the
lower bound for a secure hash function.

ismail added a subscriber: ismail.Mar 11 2016, 12:59 AM
This comment was removed by ismail.
ruiu added a comment.Mar 11 2016, 6:15 PM

I implemented --build-id=md5 in this patch (http://reviews.llvm.org/D18055),
but since this patch adds more knobs to the linker, I'm not going to submit
this until someone comes and convinces us that we really need this.

emaste added a comment.Sep 8 2016, 7:13 AM

This review should be closed or abandoned as --build-id is implemented

espindola edited reviewers, added: espindola; removed: rafael.Mar 15 2018, 10:52 AM
emaste resigned from this revision.Sep 7 2023, 7:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2023, 7:07 AM
MaskRay commandeered this revision.Sep 7 2023, 10:19 AM
MaskRay abandoned this revision.
MaskRay added a reviewer: ruiu.