This is an archive of the discontinued LLVM Phabricator instance.

[LLD COFF / PDB] Incrementally update the BuildId when writing a PDB.
ClosedPublic

Authored by zturner on Aug 15 2017, 11:04 AM.

Details

Summary

Previously, our algorithm to compute a build id involved hashing the executable and storing that as the GUID in the CV Debug Record chunk, and setting the age to 1.

This breaks down in one very obvious case: a user adds some newlines to a file, rebuilds, but changes nothing else. This causes new line information and new file checksums to get written to the PDB, meaning that the debug info is different, but the generated code would be the same, so we would write the same build over again with an age of 1.

Anyone using a symbol cache would have a problem now, because the debugger would open the executable, look at the age and guid, find a matching PDB in the symbol cache and then load it. It would never copy the new PDB to the symbol cache.

There are various other scenarios where this can arise as well, but this one is both the easiest to describe and is what people will run into the most often.

PE files are ultimately matched against PDBs using 3 pieces of information:

  1. A 128-bit GUID. When an executable is being written for the first time, a new GUID is generated, otherwise the existing GUID is re-used.
  2. When an executable is being written for the first time, the Age is set to 1, otherwise the age is incremented.
  3. Every time an executable is written, the time stamp field is updated.

Then, the debugger goes through these steps when looking for a PDB:

  1. Is there a PDB with a matching GUID? If not, there's no debug info, otherwise go to step 2.
  2. Does the PDB have a matching age? If not, warn the user that the debug info might be stale, otherwise go to step 3.
  3. Does the PDB have a matching timestamp? If not, warn the user that the debug info might be stale, otherwise we're good to go.

This patch implements the first two of these fields. We're still not writing a timestamp, but this is at least better than before. Unfortunately, this hurts reproducibility, but if we want PDBs to work correctly, we don't have much of a choice but to do this. We can still get reproducibility through additional flags, or via a tool that runs as a post-processing step to strip out un-reproducible pieces.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.Aug 15 2017, 11:04 AM
ruiu edited edge metadata.Aug 15 2017, 11:17 AM

I think it is more straightforward if you change Writer::writeBuildId to compute a hash using not only an executable but also a pdb file. If you do that, every time debug info is different, you'll get a different build id, which I think solves the issue.

That way, we can still maintain build reproducibility. I don't want to use a random value as a build id.

In D36758#842342, @ruiu wrote:

I think it is more straightforward if you change Writer::writeBuildId to compute a hash using not only an executable but also a pdb file. If you do that, every time debug info is different, you'll get a different build id, which I think solves the issue.

That way, we can still maintain build reproducibility. I don't want to use a random value as a build id.

I actually think that would be rather difficult. First we would have to emit the entire executable and PDB, then we would have to hash them, then we would have to modify them and write them again.

Even if we did do it that way, there is another problem. Now every time the debug info changes, a new PDB is generated, which means you get a new entry in your symbol cache / symbol server. Over time, your symbol server will grow indefinitely in size because stale entries aren't overwritten, they are copied to a new location (it computes the folder to store the symbol file in based on the GUID).

I realize this is not what any of us wants, but if we want to be compatible with MS symbol servers, I believe this is the way it has to be done.

rnk edited edge metadata.Aug 15 2017, 1:23 PM
In D36758#842342, @ruiu wrote:

I think it is more straightforward if you change Writer::writeBuildId to compute a hash using not only an executable but also a pdb file. If you do that, every time debug info is different, you'll get a different build id, which I think solves the issue.

That way, we can still maintain build reproducibility. I don't want to use a random value as a build id.

PDB files are too big. I don't want to MD5 4GB of data just to avoid a random number.

My suggestion for people that want determinstic builds is that we ask them to pass a flag like /PDBGUID:XXXXXX-XXX-XX. Then it's completely out of our hands.

rnk added a comment.Aug 15 2017, 1:26 PM
In D36758#842342, @ruiu wrote:

I think it is more straightforward if you change Writer::writeBuildId to compute a hash using not only an executable but also a pdb file. If you do that, every time debug info is different, you'll get a different build id, which I think solves the issue.

That way, we can still maintain build reproducibility. I don't want to use a random value as a build id.

I actually think that would be rather difficult. First we would have to emit the entire executable and PDB, then we would have to hash them, then we would have to modify them and write them again.

Even if we did do it that way, there is another problem. Now every time the debug info changes, a new PDB is generated, which means you get a new entry in your symbol cache / symbol server. Over time, your symbol server will grow indefinitely in size because stale entries aren't overwritten, they are copied to a new location (it computes the folder to store the symbol file in based on the GUID).

I realize this is not what any of us wants, but if we want to be compatible with MS symbol servers, I believe this is the way it has to be done.

This is also true. We have to try to reuse the PDB GUID if we can. This means our output is not a pure function of the input objects, unless you consider the output image as an additional input. Once we accept that this will be the default operating model, adding a random number isn't that big of a deal. We can always offer an escape hatch.

ruiu added inline comments.Aug 15 2017, 1:33 PM
lld/COFF/PDB.cpp
827 ↗(On Diff #111208)

I believe you can remove {}.

831 ↗(On Diff #111208)

If Age is assigned only once, I'd remove that variable and use BuildId.PDB70.Age directly instead.

834 ↗(On Diff #111208)

Does this mean both Uuid and uuid are visible in this function? That is a bit confusing.

lld/COFF/Writer.cpp
786 ↗(On Diff #111208)

Looks like this function does not depend on Writer class. Can you make this a static non-member function that returns a const codeview::DebugInfo * (or a nullptr if no previous build id is available)?

Also I think this function needs comment as to why we want to keep existing build id if available.

zturner added inline comments.Aug 15 2017, 1:37 PM
lld/COFF/Writer.cpp
786 ↗(On Diff #111208)

It would have to return an Optional<codeview::DebugInfo>. It can't return by pointer because after the function returns the file will be closed and the memory the pointer points to will be detroyed.

But yea I can do that.

zturner updated this revision to Diff 111248.Aug 15 2017, 1:50 PM

Updated from feedback.

compnerd edited edge metadata.Aug 15 2017, 1:51 PM

Nice! Im looking forward to this! It was one of the bits that I hadn't gotten around to implementing just yet, so thanks for taking it on.

lld/COFF/PDB.cpp
827 ↗(On Diff #111208)

Where is this used?

831 ↗(On Diff #111208)

+1 on the inline the variable use.

lld/COFF/Writer.cpp
318 ↗(On Diff #111208)

I think that this is not particularly correct. In the CV case, no PDB is supposed to be generated, and we should be creating a COFF group entry instead. I realize that its unlikely that we will support the /Z7 mode, but, the comment is slightly misleading.

792 ↗(On Diff #111208)

Can you change this to ExpectedOutput? It could be a dll.

799 ↗(On Diff #111208)

When would this be false? I thought that link is never used as a resource compiler, and you use resgen for that.

809 ↗(On Diff #111208)

Is the bitsex enough? Shouldn't we check the machine and the bitsex?

820 ↗(On Diff #111208)

I think that this deserves a comment. The only reason we need this is because we don't support the other versions rather than it is not the one that we want. I think that we should be checking that the existing CVSignature matches the one that we are going to write out this time (which ATM, just happens to be this constant).

zturner added inline comments.Aug 15 2017, 1:56 PM
lld/COFF/Writer.cpp
799 ↗(On Diff #111208)

If the output file exists on disk but is just a random text file, for example.

809 ↗(On Diff #111208)

Ahh yes, I suppose you could be writing an intel x86-64 binary and the existing binary on disk is arm64 or something. We shouldn't re-use the build id in that case either.

zturner updated this revision to Diff 111252.Aug 15 2017, 2:08 PM

Fixes suggested by compnerd@

ruiu accepted this revision.Aug 15 2017, 2:18 PM

LGTM

lld/COFF/Writer.cpp
270 ↗(On Diff #111252)

remove llvm::

This revision is now accepted and ready to land.Aug 15 2017, 2:18 PM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Aug 16 2017, 2:19 PM
thakis added inline comments.
lld/trunk/COFF/Writer.cpp
237

Could this talk about build determinism a bit? rnk mentioned that upthread.

Is there a list of all sources of build indeterminism in lld? Deterministic builds is one of the motivations for the whole clang-cl thing, so it'd be good if we had a list of things needed to get there :-)

ruiu added a comment.Aug 16 2017, 2:23 PM

Until this patch, lld behavior was deterministic. It generated the same output as long as your input files and command line options are the same.

We can definitely make a list. There are quite a few in PDB that we will need to work out. However, we're kind of at the mercy of the ecosystem we're trying to support. determinism was one goal, but making sure developers can debug their code (and making sure we can debug crashes generated in the wild) was another goal. And those two goals conflict with each other in some ways.

I don't dispute the utility of reproducible builds, but at least right now, debug info is what's blocking us. As far as I can tell there is no way to have both at the same time unless you have a post-processing step that strips out the non-deterministic portions of the debug info. Which seems like something we'll have to look into when we get to that bridge, but at least immediately, it's not what's blocking us.

The line number case described earlier is one example. People need to be able to continue to debug their code if they add a blank line (PDB needs to change even though EXE doesn't).

And to work with symbol and source server, EXEs need to re-use the same GUID. We can still get determinism by adding flags such as /PDBGUID:{...} and /PDBAGE:n, but the default-out-of-the-box experience really needs to err on the side of actually working IMO.