This is an archive of the discontinued LLVM Phabricator instance.

[lld][COFF] Fix TypeServerSource lookup on GUID collisions
ClosedPublic

Authored by thieta on Mar 24 2022, 1:26 AM.

Details

Summary

Microsoft shipped a bunch of PDB files with broken/invalid GUIDs
which lead lld to use 0xFF as the key for these files in an internal
cache. When multiple files have this key it will lead to collisions
and confused symbol lookup.

Several approaches to fix this was considered. Including making the key
the path to the PDB file, but this requires some filesystem operations
in order to normalize the file path.

Since this only happens with malformatted PDB files and we haven't
seen this before they malformatted files where shipped with visual
studio we probably shouldn't optimize for this use-case.

Instead we now just don't insert files with Guid == 0xFF into the
cache map and warn if we get collisions so similar problems can be
found in the future instead of being silent.

Discussion about the root issue and the approach to this fix can be found on Github: https://github.com/llvm/llvm-project/issues/54487

Diff Detail

Event Timeline

thieta created this revision.Mar 24 2022, 1:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2022, 1:26 AM
Herald added a subscriber: StephenFan. · View Herald Transcript
thieta requested review of this revision.Mar 24 2022, 1:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2022, 1:26 AM
thieta edited the summary of this revision. (Show Details)Mar 24 2022, 1:34 AM
saudi added inline comments.Mar 24 2022, 5:48 AM
lld/COFF/DebugTypes.cpp
71

You can remove this line, now that it is always used

Can you please craft a small test that would showcase the problem? The test shall fail before your change, and should succeed after.

Can you please craft a small test that would showcase the problem? The test shall fail before your change, and should succeed after.

Yeah absolutely - the discussion got a little split between here and Github I think - but I said that I will add a test if we can agree on the approach for how to handle the root cause. The test is kind of hard to get right as well - since it requires we load two PDB files with the same GUID and then we need to make sure they end up loaded in a order that messes with the functiontype table - so I am not sure exactly how that is going to work.

rnk added a comment.Mar 25 2022, 3:31 PM

I think this is a reasonable approach that avoids FS access and path canonicalization issues, +1 to this.

For a test, I don't think it's necessary to go so far as to construct an input that would generate a warning. I'd start with this code, and then YAML-ify the 2 objects and 2 PDB files:

// foo.cpp
struct Foo { int x; };
Foo foo_gv;
// bar.cpp
struct Bar { int x; };
Bar bar_gv;

Compile these with MSVC -Zi, obj2yaml the objs, llvm-pdbutil pdb2yaml the PDBs, edit the GUIDs to be invalid. The test case should make objs from them, link them together, dump the symbols, and confirm that bar_gv and foo_gv have the right types.

thieta updated this revision to Diff 418590.Mar 28 2022, 8:00 AM

Added test-cases for both valid and invalid Guid's

thieta marked an inline comment as done.Mar 28 2022, 8:02 AM

@rnk thanks for your suggestions - I was almost there, but got hung up on trying to replicate the warning message and check for that instead of inspecting the actual contents. That was much easier.

I also added a test case for the duplicate-but-valid Guid case, and commited a small fix for that case as well so that it passes.

The inputs are a bit big - but I am unsure on how to trim them down efficiently.

Otherwise I think this is ready to land.

The inputs are a bit big - but I am unsure on how to trim them down efficiently.

Not sure we need to add new files, the following repro triggers the crash, with existing files in the monorepo:

> yaml2obj.exe -o=a.obj lld\test\COFF\Inputs\pdb-type-server-simple-a.yaml
> sed -e s/ts.pdb/bs.pdb/ lld\test\COFF\Inputs\pdb-type-server-simple-b.yaml | yaml2obj.exe > b.obj
> llvm-pdbutil.exe yaml2pdb lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml -pdb ts.pdb
> sed -e s/{41414141-4141-4141-4141-414141414141}/{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}/ lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml > bs.yaml
> llvm-pdbutil.exe yaml2pdb bs.yaml -pdb bs.pdb
> lld-link.exe a.obj b.obj /debug
(crash)

You can probably do the same for the "valid" testcase? I suppose you can also play with /debug:noghash and /debug:ghash (default value) to cover both cases (with and without parallel type merging).

lld/COFF/DebugTypes.cpp
68

Since the issue is in system libs, I wonder if we can detect that and only warn on non-system PDBs? MSVC link.exe doesn't warn. How many warnings do you actually get when linking a production executable?

Reduced further:

> yaml2obj.exe -o=a.obj lld\test\COFF\Inputs\pdb-type-server-simple-a.yaml
> sed -e s/ts.pdb/bs.pdb/ lld\test\COFF\Inputs\pdb-type-server-simple-b.yaml | yaml2obj.exe > b.obj
> llvm-pdbutil.exe yaml2pdb lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml -pdb ts.pdb
> llvm-pdbutil.exe yaml2pdb lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml -pdb bs.pdb
> lld-link.exe a.obj b.obj /debug
(crash)
aganea added inline comments.Mar 28 2022, 7:39 PM
lld/COFF/DebugTypes.cpp
68

I wonder if log() instead of warn() wouldn't be more appropriate here.

Reduced further:

> yaml2obj.exe -o=a.obj lld\test\COFF\Inputs\pdb-type-server-simple-a.yaml
> sed -e s/ts.pdb/bs.pdb/ lld\test\COFF\Inputs\pdb-type-server-simple-b.yaml | yaml2obj.exe > b.obj
> llvm-pdbutil.exe yaml2pdb lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml -pdb ts.pdb
> llvm-pdbutil.exe yaml2pdb lld\test\COFF\Inputs\pdb-type-server-simple-ts.yaml -pdb bs.pdb
> lld-link.exe a.obj b.obj /debug
(crash)

The crash is just part of the problem here - the crash is because of the assertion. If you build without asserts - you don't get a crash instead you get mixed up debug information where the symbols get the wrong type. That's why the actual check is looking to confirm that the type matches the source.

That said - I think I don't need to add both inputs since I can use sed to change the invalid guid to the valid one between test cases. I'll update with that later.

thieta updated this revision to Diff 418797.Mar 28 2022, 11:21 PM

Reuse test-inputs and SED the Guid instead

Consolidated the test-inputs and used SED instead. I tried using available input but couldn't get it to collide the same way this one did.

Consolidated the test-inputs and used SED instead. I tried using available input but couldn't get it to collide the same way this one did.

Oh yeah to fully repro the bug, you'd probably need to insert and extra record in the second PDB to shift the offsets to the LF_FUNC_ID. Seems good otherwise, just two (and a half) comments.

lld/COFF/DebugTypes.cpp
64

I'm starting to wonder if we really need this if. The fact that the bug is raised with a FF FF... guid is just an implementation artefact. The issue would still happen if two PDBs with different paths but same GUIDs (like in my example) are loaded during a link session.

68

This wasn't addressed in the last update. Any advice either way?

lld/test/COFF/Inputs/pdb-type-server-guid-collision-a-pdb.yaml
3

Optionally, and if you can, you could manually trim down a bit the file. It also helps understanding the relationships between the OBJ and the PDB and in-between records. As an example, the MSF part here isn't needed, it will be re-generated. Same with the UDTs and the build records below, it's probably not needed for you test. Up to you! :)

saudi added inline comments.Mar 29 2022, 2:45 PM
lld/COFF/DebugTypes.cpp
64

With this approach, the idea is that we still trust the unicity of Guids with just one exception for the special case encountered in Microsoft's system PDBs; I tend to think it's valid.

Except for that bug in those PDBs, I think that it's still safe to assume that Guids are really unique, and that if we find two PDBs with different paths but same Guid, it would be safe to assume it was just a pdb that was copied, which can very well happen.

thieta added inline comments.Mar 30 2022, 3:08 AM
lld/COFF/DebugTypes.cpp
64

We could do this instead:

auto it = ctx.typeServerSourceMappings.emplace(Guid, this);
if (!it.second) {
  TypeServerSource *tSrc = (TypeServerSource *)it.first->second;
  if (Guid != InvalidGuid)
  {
    warn("GUID collision between " + file.getFilePath() + " and " +
          tSrc->pdbInputFile->session->getPDBFile().getFilePath());
  }
  ctx.typeServerSourceMappings.erase(Guid);

This only handle the special case of logging when the PDB is invalid. This cuts down the error messages.

Check if it's system libraries seems complicated and adds a lot of code here or even filesystem checks.

thieta added inline comments.Mar 30 2022, 4:04 AM
lld/COFF/DebugTypes.cpp
68

The reason I put it as a warning was that before I added the line below that erases it from the map it would actually lead to bad things happening in the PDB, getting the wrong types etc.

Now that this case is actually handled it could most likely be a Log() call instead. I don't have a strong opinion.

lld/test/COFF/Inputs/pdb-type-server-guid-collision-a-pdb.yaml
3

Good idea - I find it a bit hard to know what parts are needed to be there without a lot of trial-and-error. But I guess it will come from experience in adding more tests like this.

thieta updated this revision to Diff 419099.Mar 30 2022, 4:24 AM

Simplify the logic, change warn() to log(), reduced the test cases Input a bit

Ok after toying around with the code again I realize that you have a good point here @aganea - this simplified version is probably better and enough. You have to turn on verbose logging in lld to see the collisions, BUT they don't lead to anything bad anymore - so it's just for performance reasons you would like to see if it collides.

I also reduced the inputs to the test cases a bit.

How are feeling about this revision? Also @rnk - any input before this lands?

aganea accepted this revision.Mar 30 2022, 5:39 AM

LGTM, thanks @thieta !

This revision is now accepted and ready to land.Mar 30 2022, 5:39 AM
thieta updated this revision to Diff 419335.Mar 30 2022, 11:37 PM
thieta marked 4 inline comments as done.

clang-format

Ping @rnk - any objections?

This revision was automatically updated to reflect the committed changes.
aganea added inline comments.Apr 5 2022, 5:02 PM
lld/COFF/DebugTypes.cpp
66

I'm having some doubts regarding this line (L66). If we have an odd number of PDBs queued with bad GUIDs, the first one will ctx.typeServerSourceMappings.emplace(), the second will do ctx.typeServerSourceMappings.erase() and the third will ctx.typeServerSourceMappings.emplace() again. I think we need a different strategy here to handle an arbitrary number of PDBs with bad (FFFF..) GUIDs. Any suggestions? I would be tempted to keep the entry in the map and not erase it -- that could help later with multithreading, if we ever do that. What about doing it.first->second = nullptr; when there's a collision, and then handle this case later in UseTypeServerSource::getTypeServerSource() to fallback to the lookup by path?

thieta added inline comments.Apr 5 2022, 11:22 PM
lld/COFF/DebugTypes.cpp
66

Yeah this is much nicer - I actually had the thought before and thought it didn't matter that much - but it's definitely the better solution. Posted a follow-up diff here: https://reviews.llvm.org/D123185

aganea added inline comments.Apr 6 2022, 6:17 AM
lld/COFF/DebugTypes.cpp
66

I was thinking in term of correctness. This current patch as it stands could still exhibit the issue if a third "bad" PDB is inserted again in typeServerSourceMappings? The code in UseTypeServerSource::getTypeServerSource() will not fallback to path lookup, and all OBJs with 0xFFFF.. LF_TYPESERVER2 will fetch the wrong PDB, leading most likely to PR54500? Could you please confirm @thieta ?

If we want to backport this (D122372), we'll have to also backport D123185. I'm just wondering if you should just re-open this current patch, and re-land everything atomically?

thieta added inline comments.Apr 6 2022, 6:35 AM
lld/COFF/DebugTypes.cpp
66

Ah I think you are correct. I think I need to change the testing here a bit to catch this.

I don't know about reverting and re-adding - is that normally how things are handled?

aganea added a subscriber: tstellar.Apr 6 2022, 6:38 AM
aganea added inline comments.
lld/COFF/DebugTypes.cpp
66

@tstellar Regarding https://github.com/llvm/llvm-project/issues/54487#issuecomment-1089361869, please see above, we'll need to land another patch (D123185) first and backport both. Should we do that, or revert this current patch, and reland it by including D123185 ?

@aganea I think it's OK to land D123185 and then backport both patches.

thieta added a comment.Apr 6 2022, 7:05 AM

@aganea I think it's OK to land D123185 and then backport both patches.

I'll just add another test to 123185 and then we can land that.