This is an archive of the discontinued LLVM Phabricator instance.

[PECOFF] Implementation of ObjectFilePECOFF:: GetUUID()
ClosedPublic

Authored by asmith on Jan 2 2019, 4:04 PM.

Details

Summary

Provide an implementation of GetUUID() for remote debugging scenarios.

Return a PDB's GUID (or PDB70's Signature) as the UUID.

Diff Detail

Event Timeline

asmith created this revision.Jan 2 2019, 4:04 PM

Should we use the GUID from the COFF Debug Directory instead? It certainly seems more appropriate, if it's there. The UUID's purpose is to match symbol to the executable, so if you use a hash of the path it might solve this one problem, but won't solve the general case.

labath added a subscriber: labath.Jan 3 2019, 2:22 AM

I would definitely encourage using something better than the file checksum as UUID, if at all possible.

source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
139

I think this should be fromData. The Optional is there to give special meaning to an all-zero checksum, but you don't need that here. Just because the md5 checksum comes out as all-zero, it doesn't mean it is not valid.

PS: I am responsible for the existence of this function, so I am to blame for any confusion. If you have any idea, how to make this api more clear, I'd like to hear it.

Hui added a subscriber: Hui.Jan 4 2019, 1:33 PM

Not quite sure but correct me if i am wrong.

(1) I think the Debug Directory is optional for COFF if it does have debug information and pdb to match with.

(2) The Debug Directory does not contain COFF timestamp.

Using md5 seems very tentative. Please elaborate how to leverage both COFF contents and the existing GUID mentioned?

In D56229#1346869, @Hui wrote:

Not quite sure but correct me if i am wrong.

(1) I think the Debug Directory is optional for COFF if it does have debug information and pdb to match with.

(2) The Debug Directory does not contain COFF timestamp.

Using md5 seems very tentative. Please elaborate how to leverage both COFF contents and the existing GUID mentioned?

Well, I guess I would ask what you want to do with the GUID? If you want to match it to a debug information file, then the Debug Directory is the correct way to do that, and using a hash of the file path will not even be helpful.

Another option would be to check for a debug directory of type IMAGE_DEBUG_TYPE_REPRO, and if that exists, then it means that the COFF timestamp is a hash of the binary, so it should be stable.

If neither of these is present, then I think we should simply return false from this function and not mislead the caller. The caller might wish to use special logic if the function returns false that says "if I couldn't get a UUID from the file, then try hashing the path and doing some kind of lookup based on that", but I don't think that should be part of this function.

The UUID that is used in ELF and Mach-o is designed to be something that is stable in a binary after it has been linked and should be the same before and after any kind of post production (stripping symbols, stripping section content to make a stand alone symbol file, etc). When someone types "target symbols add /path/to/symbol/file/a.out" we will grab its UUID and try to match it up with an existing object file.

Well, I guess I would ask what you want to do with the GUID? If you want to match it to a debug information file, then the Debug Directory is the correct way to do that, and using a hash of the file path will not even be helpful.

Another option would be to check for a debug directory of type IMAGE_DEBUG_TYPE_REPRO, and if that exists, then it means that the COFF timestamp is a hash of the binary, so it should be stable.

If neither of these is present, then I think we should simply return false from this function and not mislead the caller. The caller might wish to use special logic if the function returns false that says "if I couldn't get a UUID from the file, then try hashing the path and doing some kind of lookup based on that", but I don't think that should be part of this function.

I agree with the above statement. A very long time ago (for reasons which are not very important now), we decided to return a crc checksum as the UUID for ELF files without a build-id, and that's a decision that's been haunting us ever since. I think it would be good to not repeat the same mistake for COFF files. The problems with the crc elf checksum are:

  • it does not help (in fact, it gets in the way) of matching build-id-less elf files, because both files appear to have a valid-but-different UUID
  • checksumming whole files is slow, and we regularly get bug reports about lldb being slow from people whose default config does not include build-ids

So I'd propose to have this function return just the debug directory GUID, if it is there. Possibly the coff timestamp could be used as a fallback, but I don't know enough about windows/coff to say for sure. The case when we need to verify whether we have a local copy of a module for the remote debugging scenario can be handled at a different level. E.g., it already looks like the qModuleInfo packet we use does differentiate to some level the "uuid" and "md5" case https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationClient.cpp#L3539. Right now they are both used to fill in the UUID field of the module spec, but we could change that and treat each field slightly differently. For matching exes, pdbs and minidumps we would just use the UUID field, while for checking object identity we would also fall back to the md5 checksum if the UUID field is not present.

Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2019, 12:41 AM
asmith updated this revision to Diff 195282.Apr 15 2019, 6:01 PM
asmith edited the summary of this revision. (Show Details)
labath edited reviewers, added: amccarth, labath; removed: zturner, llvm-commits.Apr 16 2019, 12:18 AM
labath added a subscriber: amccarth.

s/@zturner/@amccarth, as Zach probably won't have time to review this

lit/Modules/PECOFF/export-dllfunc.yaml
11–12

Since the U(G)UID needs to be stable and match the value computed from other sources, it would be good to have a test where we check that a file has some exact UUID.

Is there any way to use yaml2obj to generate such a file? For instance, if we drop the lld-link step and yamlify the resulting dll file instead. Would that work?

source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
60

Can getDebugPDBInfo succeed and still return a null pdb_info? If not, can we delete the second part?

Instead I believe you should check the CVSignature field of the returned struct to see that it indeed contains a PDB70 record.

866

ObjectFilePECOFF already has a llvm::object::Binary created for the underlying file. I think it's super-wasteful (and potentially racy, etc.) to create a second one just to read out it's GUID. If you make a second version of this function, taking a Binary (and have the FileSpec version delegate to that), then you can avoid this.

labath added inline comments.Apr 16 2019, 5:04 AM
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
61

Is there a specific reason you used this particular encoding of the UUID, or did you do that just because it was the easiest? I am asking because I have a reason to have this use a somewhat different encoding. :)

Let me elaborate: I think there are two things we can want from the UUID here: The first one is for it to match the UUID encoding we get from other sources (so that they can agree on whether they are talking about the same binary). The second one is for the uuid encoding to match the "native" UUID format of the given platform.

Right now, this implementation achieves neither. :) It fails the second criterion because the UUID strings comes out in different endianness than what the windows native tools produce (I'm using dumpbin as reference here.). And it also fails the first one, because e.g. minidump reading code parses the UUID differently <D60501>.

Now, for windows, these two criteria are slightly at odds with one another. In order to fully match the dumpbin format, we would need to have some kind of a special field for the "age" bit. But lldb has no such concept, and there doesn't seem to be much need to introduce it. However, including the "age" field in the "uuid" seems like the right thing to do, as two files with different "ages" should be considered different for debug info matching purposes (at least, that's what my limited understanding of pdbs tells me. if some of this is wrong, please let me know). So, in <D60501> I made a somewhat arbitrary decision to attach the age field to the UUID in big endian.
That's the format that made most sense to me, though that can certainly be changed (the most important thing is for these things to stay in sync).

So, if you have a reason to use a different encoding, please let me know, so we can agree on a consistent implementation. Otherwise, could you change this to use the same UUID format as the minidump parser?

Hui added inline comments.Apr 16 2019, 7:54 AM
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
60

If the exe/dll is built without any debug info, the function succeeds and still returns null pdb_info.

61

You are right. The encoding of MS struct GUID and the PDB70DebugInfo::Signature are different. Can UUID format and the method to yield it from minidump parser be available in class COFFObjectFile?

866

In addition, it is possible to simplify ObjectFilePECOFF ::GetModuleSpecifications API with such Binary. In this sense, none of the arguments, like data_sp, data_offset will be used.

labath added inline comments.Apr 16 2019, 8:24 AM
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
60

Ah, ok. Thanks for explaining.

61

I don't think you can put that in the COFFObjectFile, as it lives in llvm, and the UUID class is an lldb concept. It might be possible to put some utility function into llvm to help with that, but it's not clear to me how exactly that would look (and that would need to be a separate patch with a separate review).

What would make kind of sense is to add another factory function to the UUID class in lldb (UUID::fromCvRecord ?), which both ObjectFilePECOFF and ProcessMinidump could call into. However, the problem with that is that the definition of the CvRecord is in llvm/Object, and it seems silly to have lldb/Utility depend on that just to pull the single struct. I think it would make sense to move this struct into llvm/BinaryFormat (which lldb/Utility already depends on) and then everything would be fine. If you want to try that out, then go ahead, but I don't think that's really necessary here. (swapping the bytes around should be just a couple of LOC).

866

Yeah, I noticed that too, but I didn't want to throw too many things into this patch. However, if you feel like trying it out, then please go ahead.

asmith updated this revision to Diff 195504.Apr 16 2019, 7:02 PM
asmith edited the summary of this revision. (Show Details)
asmith marked 10 inline comments as done.Apr 16 2019, 8:22 PM
asmith marked an inline comment as done.

Thanks. I have a couple of small comments, but I think this is basically done.

source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
43

llvm style is to only use the anonymous namespaces for class declarations (and use the static keyword for functions). What you've done here is particularly confusing, as you've combined this with using namespace llvm, which gives off the impression that the using declaration is somehow local to the anonymous namespace (which it isn't).

In this case, I'd probably just get rid of the anonymous namespace and move everything (the struct definition and using declarations, if you really want it) into the now-static GetCoffUUID function.

45

typedef struct is very C-like. Just use plain struct.

117

You should keep the MagicBytesMatch call (if you want to llvm-ize it, you could replace that with a call to llvm::identify_magic).

All of the GetModuleSpecifications will be called each time lldb does anything with an object file (any object file), and the idea is to avoid reading the whole file until we are reasonably certain that it is the file we are looking for. That's the reason this function gets the initial bytes of the file in the data_sp member. This way, all of the object file plugins can quickly inspect that data to see if they care about the file, and only one of them will attempt an actual parse.

891–892

I don't think this is necessary as CreateInstance will refuse to return the ObjectFile instance if the creation of the coff binary object failed. (You could theoretically assert that the binary is really there if you want extra security).

Hui added inline comments.Apr 17 2019, 7:49 AM
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
891–892

There is no cached binary for memory instance (by CreateMemoryInstance). Is there any chance that any JIT-ed codes will call module or UUID related API?

labath added inline comments.Apr 17 2019, 7:55 AM
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
891–892

Ah, interesting. Yes, if you manage to create a memory instance of ObjectFilePECOFF, then there's a very high chance that someone will call GetUUID on it. I strongly doubt that anyone is creating memory instances, or that they even work in the first place, but I suppose we can leave this check just in case.

asmith updated this revision to Diff 196874.Apr 26 2019, 9:53 AM
asmith marked 5 inline comments as done.Apr 26 2019, 9:55 AM
labath accepted this revision.Apr 28 2019, 11:56 PM
labath added inline comments.
source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
81

this cast is unneeded as the function takes a void*

176–177

If you change the dyn_cast into a plain cast then you can drop the assert (as it will do the asserting for you).

This revision is now accepted and ready to land.Apr 28 2019, 11:56 PM
asmith updated this revision to Diff 197239.Apr 29 2019, 6:38 PM
asmith closed this revision.Apr 29 2019, 6:39 PM