This is an archive of the discontinued LLVM Phabricator instance.

lld-link: Set PDB GUID to hash of PDB contents instead of to a random byte sequence.
ClosedPublic

Authored by thakis on Sep 11 2018, 4:20 PM.

Details

Summary

Previously, lld-link would use a random byte sequence as the PDB GUID. Instead, use a hash of the PDB file contents.

Naively computing the hash after the PDB data has been generated is in practice as fast as other approaches I tried. I also tried online-computing the hash as parts of the PDB were written out (https://reviews.llvm.org/D51887; that's also where all the measuring data is) and computing the hash in parallel (https://reviews.llvm.org/D51957). This approach here is simplest, without being slower.

To not disturb llvm-pdbutil pdb2yaml, make the hash generation an opt-in feature on InfoStreamBuilder and let ldb/COFF/PDB.cpp always set it.

Since writing the PDB computes this ID which also goes in the exe, the PDB writing code now must be called before writeBuildId(). writeBuildId() for that reason is no longer included in the "Code Layout" timer.

Since the PDB GUID is now a function of the PDB contents, the PDB Age is always set to 1. There was a long comment above loadExistingBuildId (now gone) about how not changing the GUID and only incrementing the age was important, but according to the discussion in PR35914 that comment was incorrect.

Diff Detail

Event Timeline

thakis created this revision.Sep 11 2018, 4:20 PM
thakis retitled this revision from https://reviews.llvm.org/D51951, naive sequential hash variant to lld-link: Set PDB GUID to hash of PDB contents instead of to a random byte sequence..Sep 13 2018, 10:52 AM
thakis edited the summary of this revision. (Show Details)
thakis added a reviewer: zturner.
thakis added subscribers: ruiu, llvm-commits.
thakis updated this revision to Diff 165515.Sep 14 2018, 8:11 AM

rebase, minor cleanups

Please take a look!

ruiu added inline comments.Sep 14 2018, 9:48 AM
lld/COFF/PDB.cpp
128–129

It's return type is void. I'd change the comment or change the return type.

llvm/lib/DebugInfo/PDB/Native/PDBFileBuilder.cpp
28

As long as you are using a non-crypto hash function, there is a risk of generating the same build id, and the probability is not negligible if you have a lot of executables due to the birthday problem. Is this okay?

thakis added inline comments.Sep 14 2018, 10:27 AM
lld/COFF/PDB.cpp
128–129

Will do.

llvm/lib/DebugInfo/PDB/Native/PDBFileBuilder.cpp
28

The 8 byte hash still gives decent hash collision resistance for up to 2**32 different pdb files, and since pdbs are keyed by executable name on the symbol server that's per binary. Projects tend to have far fewer revisions than 4 billion. Does that make sense?

thakis updated this revision to Diff 165533.Sep 14 2018, 10:38 AM
thakis marked an inline comment as done.

improve comment

ruiu added inline comments.Sep 14 2018, 10:39 AM
llvm/lib/DebugInfo/PDB/Native/PDBFileBuilder.cpp
28

Maybe it is safe. But what could happen if two executables have the same hash? Since xxhash is not cryptographically-safe, you could easily generate two executables having the same ID. Is there any security risks or something caused by that possibility? If the probability is small and the result of hash collision is not that bad, xxhash is probably okay.

thakis added inline comments.Sep 14 2018, 11:25 AM
llvm/lib/DebugInfo/PDB/Native/PDBFileBuilder.cpp
28

The main use case for this guid is to an executable to its pdb file. The common workflow is that a build server builds an executable and its pdb, then uploads both to a symbol server (under the namespace of the exe, the exe in a subdir containing the exe's pe timestamp and size, and the pdb under the guid). If the executable crashes, it produces a minidump. From the minidump, crash infrastructure can obtain the full executable and the pdb.

Since nothing guarantees that the pdb guid is a hash of the pdb data, I can't think of anything where being able to produce a pdb with a given uuid that is an xxhash buys you anything: Since nothing forces the guid to be a hash, you can just produce a pdb and set its guid field to whatever you want anyways.

The symptoms of a collision are just going to be you can’t debug the
program, so not very severe imo, especially since it would almost certainly
be resolved on the next incremental build

The symptoms of a collision are just going to be you can’t debug the
program, so not very severe imo, especially since it would almost certainly
be resolved on the next incremental build

Can you explain how it would lead to you not being able to debug the program?

Do you mean for local builds? If so, if two back-to-back builds end up with the same pdb guid in the exe and pdb by chance even though the pdb changes, the debugger should still load the new pdb off disk fine (?)

Do you mean if a build server produces PDBs with the same guid for different builds? If so, that would probably produce an error during pdb upload and make the build fail, not debugging (?)

If you’re uploading to build server, i don’t think it would be an error, it
would either overwrite or not. If it does overwrite, debugging the exe
matching the pdb that was there before wouldn’t work, and if it did not
overwrite debugging the new exe would fail.

That said, my point was mainly that the probability of this being an issue
in practice is negligible

ruiu accepted this revision.Sep 14 2018, 1:18 PM

LGTM

It sounds like I worried a bit too much about hash collisions.

This revision is now accepted and ready to land.Sep 14 2018, 1:18 PM
This revision was automatically updated to reflect the committed changes.