This is an archive of the discontinued LLVM Phabricator instance.

[DWARFv5] CodeGen support for MD5 checksum
ClosedPublic

Authored by probinson on Jan 10 2018, 4:49 PM.

Details

Summary

Pass MD5 checksums through from IR to assembly/object files.
After this, getting Clang to compute the MD5 should be the last step
to supporting MD5 in the DWARF v5 line table header.

Diff Detail

Repository
rL LLVM

Event Timeline

probinson created this revision.Jan 10 2018, 4:49 PM

This was not immediately obvious to me: Why can we now have empty DIFiles?
LGTM otherwise.

llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
277 ↗(On Diff #129371)

do we need error handling here or is it okay to silently drop malformed input?

This was not immediately obvious to me: Why can we now have empty DIFiles?
LGTM otherwise.

The possibly empty DIFiles were previously accommodated by various getFilename() methods, for example in DebugInfoMetadata.h:

StringRef DIScope::getFilename() const {
  if (auto *F = getFile())
    return F->getFilename();
  return "";
}

I am also unclear why they can happen, but I found out the hard way that they do. Possibly only in tests, because LLParser is willing to allow many attributes to be omitted, but I got crashes before I made the same accommodation.

probinson added inline comments.Jan 10 2018, 5:33 PM
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
277 ↗(On Diff #129371)

It would have to be malformed metadata, which is how likely? I should point out that fromHex() will actually assert if the string has non-hex digits in it, so there's a limit to how conscientious it's worth being here.

aprantl added inline comments.Jan 11 2018, 9:01 AM
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
277 ↗(On Diff #129371)

I think I'm confused by the fact that we are converting a hex string in AsmPrinter. I would have expected that conversion to happen in the LLVM IR parser. From what sources can we get a Hex *string* here?

Wait. I think I can answer my own question. We store the MD5 sum as a string in DIFile. I vaguely remember the discussion happening back then.

So I think what we need is to make sure the. the Verifier rejects invalid MD5 strings on DIFiles.

Could you perhaps add a check that the string matches [0-9a-z]+ (with the right amount of characters) here?

void Verifier::visitDIFile(const DIFile &N) {
  AssertDI(N.getTag() == dwarf::DW_TAG_file_type, "invalid tag", &N);
  AssertDI((N.getChecksumKind() != DIFile::CSK_None ||
            N.getChecksum().empty()), "invalid checksum kind", &N);
}

So, the checksum will come in from the front-end as a string in DIFile. It wants to be binary in the .o file. What happens in-between is maybe not optimal, but has to address a couple of considerations.
First, where is it stored and who owns the memory? As a StringRef it's not clear, but as an MD5Result it really ought to be owned by MCContext (the CodeView path does the same thing). AsmPrinter does this allocation and conversion.
Second, from AsmPrinter it goes to "the streamer" which might be emitting an object file or might be emitting a text file, but AsmPrinter (despite its name) doesn't know which. For type and memory safety I chose to have the MC API take an MD5Result* instead of yet another StringRef. For emitting a text file, MCAsmStreamer converts it back to text, and then the AsmParser reads the text and converts it back to binary before passing it to "the (object-file) streamer." AsmParser needs to validate the string when it does this conversion. Then the object-file streamer has binary data and just emits it to the object file without any further fuss. For emitting an object-file directly, AsmPrinter passes it along as binary data and there are no additional conversions in the path.

If we have the Verifier verify the DIFile's copy of the checksum, then the initial conversion in AsmPrinter can assume it's a hex string of the correct length and avoid a bit of checking. AsmParser still needs to validate, but the direct-to-object path doesn't.

I hope that all made sense. I have to re-train my brain every time I think about the different paths through here (and it gets even more convoluted if you have to get down into the nitty-gritty of object files, which fortunately we don't have to here).

If we have the Verifier verify the DIFile's copy of the checksum, then the initial conversion in AsmPrinter can assume it's a hex string of the correct length and avoid a bit of checking. AsmParser still needs to validate, but the direct-to-object path doesn't.

Agreed.

See D41965 for verifier change.

probinson updated this revision to Diff 129555.Jan 11 2018, 4:31 PM

Simplify a bit, making assumptions the verifier will check for us.

aprantl accepted this revision.Jan 11 2018, 5:13 PM

One more nitpick, otherwise LGTM!

llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp
103 ↗(On Diff #129555)

Should this be an Optional<unsigned> parameter instead of using 0?

This revision is now accepted and ready to land.Jan 11 2018, 5:13 PM

Thanks! I'll commit this tomorrow.

llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp
103 ↗(On Diff #129555)

Assembler text doesn't have a way to describe line tables for multiple compilation units (like you could get with LTO), so the line tables all get mooshed into CU 0. And this parameter gets used as an array index, so it really does want to be 0 in that case.

This revision was automatically updated to reflect the committed changes.