Pass MD5 checksums through from IR to assembly/object files.
After this, getting Clang to compute the MD5 should be the last step
to supporting MD5 in the DWARF v5 line table header.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
This was not immediately obvious to me: Why can we now have empty DIFiles?
LGTM otherwise.
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp | ||
---|---|---|
277 ↗ | (On Diff #129371) | do we need error handling here or is it okay to silently drop malformed input? |
The possibly empty DIFiles were previously accommodated by various getFilename() methods, for example in DebugInfoMetadata.h:
StringRef DIScope::getFilename() const { if (auto *F = getFile()) return F->getFilename(); return ""; }
I am also unclear why they can happen, but I found out the hard way that they do. Possibly only in tests, because LLParser is willing to allow many attributes to be omitted, but I got crashes before I made the same accommodation.
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp | ||
---|---|---|
277 ↗ | (On Diff #129371) | It would have to be malformed metadata, which is how likely? I should point out that fromHex() will actually assert if the string has non-hex digits in it, so there's a limit to how conscientious it's worth being here. |
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp | ||
---|---|---|
277 ↗ | (On Diff #129371) | I think I'm confused by the fact that we are converting a hex string in AsmPrinter. I would have expected that conversion to happen in the LLVM IR parser. From what sources can we get a Hex *string* here? |
Wait. I think I can answer my own question. We store the MD5 sum as a string in DIFile. I vaguely remember the discussion happening back then.
So I think what we need is to make sure the. the Verifier rejects invalid MD5 strings on DIFiles.
Could you perhaps add a check that the string matches [0-9a-z]+ (with the right amount of characters) here?
void Verifier::visitDIFile(const DIFile &N) { AssertDI(N.getTag() == dwarf::DW_TAG_file_type, "invalid tag", &N); AssertDI((N.getChecksumKind() != DIFile::CSK_None || N.getChecksum().empty()), "invalid checksum kind", &N); }
So, the checksum will come in from the front-end as a string in DIFile. It wants to be binary in the .o file. What happens in-between is maybe not optimal, but has to address a couple of considerations.
First, where is it stored and who owns the memory? As a StringRef it's not clear, but as an MD5Result it really ought to be owned by MCContext (the CodeView path does the same thing). AsmPrinter does this allocation and conversion.
Second, from AsmPrinter it goes to "the streamer" which might be emitting an object file or might be emitting a text file, but AsmPrinter (despite its name) doesn't know which. For type and memory safety I chose to have the MC API take an MD5Result* instead of yet another StringRef. For emitting a text file, MCAsmStreamer converts it back to text, and then the AsmParser reads the text and converts it back to binary before passing it to "the (object-file) streamer." AsmParser needs to validate the string when it does this conversion. Then the object-file streamer has binary data and just emits it to the object file without any further fuss. For emitting an object-file directly, AsmPrinter passes it along as binary data and there are no additional conversions in the path.
If we have the Verifier verify the DIFile's copy of the checksum, then the initial conversion in AsmPrinter can assume it's a hex string of the correct length and avoid a bit of checking. AsmParser still needs to validate, but the direct-to-object path doesn't.
I hope that all made sense. I have to re-train my brain every time I think about the different paths through here (and it gets even more convoluted if you have to get down into the nitty-gritty of object files, which fortunately we don't have to here).
If we have the Verifier verify the DIFile's copy of the checksum, then the initial conversion in AsmPrinter can assume it's a hex string of the correct length and avoid a bit of checking. AsmParser still needs to validate, but the direct-to-object path doesn't.
Agreed.
One more nitpick, otherwise LGTM!
llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp | ||
---|---|---|
103 ↗ | (On Diff #129555) | Should this be an Optional<unsigned> parameter instead of using 0? |
Thanks! I'll commit this tomorrow.
llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp | ||
---|---|---|
103 ↗ | (On Diff #129555) | Assembler text doesn't have a way to describe line tables for multiple compilation units (like you could get with LTO), so the line tables all get mooshed into CU 0. And this parameter gets used as an array index, so it really does want to be 0 in that case. |