This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Implement --build-id={md5,sha1} with truncated BLAKE3
ClosedPublic

Authored by MaskRay on Mar 12 2022, 11:37 AM.

Details

Summary

--build-id was introduced as an "approximation of true uniqueness across all
binaries that might be used by overlapping sets of people". It does not require
the resistance properties mentioned below. In practice, people just use
--build-id=md5 for a 16-byte build ID and --build-id=sha1 for a 20-byte build ID.

BLAKE3 has a 256-bit key length, which provides 128-bit security against
(second-)preimage, collision, and differentiability attacks. Its portable
implementation is fast, and it additionally provides Arm Neon/AVX2/AVX-512
implementations. So just implement --build-id={md5,sha1} with truncated BLAKE3.
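
The truncation idea can be sketched in a few lines. This is only an illustration, not the code in the patch: Python's standard library has no BLAKE3, so hashlib.blake2b is used here as a stand-in, and the names build_id and BUILD_ID_SIZES are hypothetical.

```python
import hashlib

# Build ID sizes in bytes for the two option values named in the summary.
BUILD_ID_SIZES = {"md5": 16, "sha1": 20}


def build_id(image: bytes, kind: str) -> bytes:
    """Return a build ID whose length matches the requested kind.

    ASSUMPTION: hashlib.blake2b stands in for BLAKE3, which is not part of
    the Python standard library. Only the truncation step is illustrated.
    """
    size = BUILD_ID_SIZES[kind]
    full = hashlib.blake2b(image).digest()  # full-length digest (64 bytes)
    return full[:size]  # keep only the leading `size` bytes


if __name__ == "__main__":
    image = b"contents of the output file"
    print(build_id(image, "md5").hex())   # 16 bytes, 32 hex digits
    print(build_id(image, "sha1").hex())  # 20 bytes, 40 hex digits
```

In this sketch the option value only selects the output length; the 16-byte ID is a prefix of the 20-byte one because both come from the same digest.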

Linking clang 14 RelWithDebInfo with --threads=8 on a Skylake CPU:

  • 1.13x as fast with --build-id=md5
  • 1.15x as fast with --build-id=sha1

With --threads=4 on an Apple M1:

  • 1.25x as fast with --build-id=md5
  • 1.17x as fast with --build-id=sha1

Event Timeline

MaskRay created this revision. Mar 12 2022, 11:37 AM
MaskRay requested review of this revision. Mar 12 2022, 11:37 AM
Herald added a project: Restricted Project. Mar 12 2022, 11:37 AM
MaskRay updated this revision to Diff 414873. Mar 12 2022, 12:40 PM

Remove unused headers. Add a comment

MaskRay edited the summary of this revision. Mar 12 2022, 1:42 PM
tschuett added inline comments.
lld/ELF/Writer.cpp:2927

Do you have a link?

MaskRay marked an inline comment as done. Mar 12 2022, 2:28 PM
MaskRay added inline comments.
lld/ELF/Writer.cpp:2927

It's a deleted page https://fedoraproject.org/w/index.php?title=RolandMcGrath/BuildID&oldid=16098
I do not want to put the URI in the comment.

joerg added a subscriber: joerg. Mar 12 2022, 4:18 PM

I would find it quite surprising to find two binaries produced with the same command line now resulting in different build-ids, if that is the only difference. Especially when the cryptographic hash function is explicitly set.

For reference, I did some tests a while ago in the context of Mercurial's SHA1 replacement and got the following numbers on a Threadripper using a large file:

Algorithm    Implementation   Time
BLAKE2s256   asm              13.8s
SHA2-256     asm               4.5s
SHA2-256     C                28.0s
SHA3-256     asm              16.7s
SHA3-256     C                19.8s
K12          asm               5.9s
K12          C                 9.2s
BLAKE3       asm               4.1s
BLAKE3       C                10.1s
BLAKE3*      asm               5.5s
BLAKE3*      C                13.8s

I've included variants of BLAKE3 with the same number of rounds as BLAKE2 (the BLAKE3* rows) to show that much of the gain actually comes from the weaker security. There are arguments in favor of K12, especially that we are likely to see some form of hardware support in the future, given that it uses the sponge function of SHA3.
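
To put a number on the round-count effect, the BLAKE3 and BLAKE3* timings above can be compared directly. The snippet below only does arithmetic on the figures quoted in this comment; it assumes the BLAKE3* rows are the variants with BLAKE2's round count, and the names timings and slowdown are hypothetical.

```python
# Timings in seconds, copied from the figures quoted above.
# ASSUMPTION: "BLAKE3*" is the BLAKE3 variant using BLAKE2's round count.
timings = {
    ("BLAKE3", "asm"): 4.1,
    ("BLAKE3", "C"): 10.1,
    ("BLAKE3*", "asm"): 5.5,
    ("BLAKE3*", "C"): 13.8,
}


def slowdown(impl: str) -> float:
    """Ratio of the full-round variant's time to stock BLAKE3's time."""
    return timings[("BLAKE3*", impl)] / timings[("BLAKE3", impl)]


if __name__ == "__main__":
    for impl in ("asm", "C"):
        print(f"{impl}: BLAKE3* takes {slowdown(impl):.2f}x as long as BLAKE3")
```

On these figures the extra rounds cost roughly a third more time in both the asm and the C builds.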

MaskRay marked an inline comment as done. Edited Mar 12 2022, 4:33 PM

> I would find it quite surprising to find two binaries produced with the same command line now resulting in different build-ids, if that is the only difference. Especially when the cryptographic hash function is explicitly set.

lld builds from different commits don't guarantee that a binary gets hashed in the same way.
When tree-hash-style parallelism was added, the hash obviously changed.
The stability/determinism only holds for lld built at the same commit.

This is similar to how we don't guarantee that the output is the same across lld builds from different commits.
The hash changes very infrequently, though.
If lld X and lld X+1 use different layouts or generate some synthetic section differently, I don't see how "same content => same build ID" is very useful, since the content is very unlikely to be the same across different lld versions.
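
A rough sketch of why tree-hash-style parallelism changes the resulting hash: the data is split into chunks, each chunk is hashed independently (possibly in parallel), and the chunk digests are hashed again. This is an illustration under stated assumptions, not lld's code: hashlib.blake2b stands in for the real hash function, and CHUNK_SIZE, flat_hash, and tree_hash are hypothetical names and values.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size, chosen for illustration


def _digest(chunk: bytes) -> bytes:
    # ASSUMPTION: blake2b is a stand-in for whichever hash the linker uses.
    return hashlib.blake2b(chunk).digest()


def flat_hash(data: bytes) -> bytes:
    """Hash the whole buffer in one sequential pass."""
    return _digest(data)


def tree_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Hash fixed-size chunks independently, then hash the joined digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # treat empty input as a single empty chunk
    with ThreadPoolExecutor() as pool:
        leaves = list(pool.map(_digest, chunks))  # order is preserved
    return _digest(b"".join(leaves))


if __name__ == "__main__":
    data = bytes(range(256)) * 4096
    print(flat_hash(data).hex() == tree_hash(data).hex())  # content is the same, hashes differ
```

Both functions are deterministic for a fixed implementation, but they disagree with each other, and tree_hash also depends on the chunk size. Changing any of these details between linker versions changes the build ID even when the output bytes are identical.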

ikudrin accepted this revision. Mar 21 2022, 10:15 AM

I like the idea and don't see real drawbacks, so LGTM.

This revision is now accepted and ready to land. Mar 21 2022, 10:15 AM
This revision was landed with ongoing or failed builds. Mar 24 2022, 11:31 AM
This revision was automatically updated to reflect the committed changes.