This is an archive of the discontinued LLVM Phabricator instance.

DebugInfo: Added support for Checksum debug info feature (LLVM IR part)
ClosedPublic

Authored by aaboud on Dec 9 2016, 4:47 PM.

Details

Summary

Resolve Bug 30978.

This patch contains changes in LLVM IR to support Checksum debug info feature.

Diff Detail

Event Timeline

aaboud updated this revision to Diff 80976.Dec 9 2016, 4:47 PM
aaboud retitled this revision from to DebugInfo: Added support for Checksum debug info feature (LLVM IR part).
aaboud updated this object.
aaboud added reviewers: rnk, probinson.
aaboud added a subscriber: llvm-commits.

I think we try to make as much as possible metadata to be "distinct", CC Duncan to advise.

docs/LangRef.rst
4029

What are the valid list of "type"? What type is the "type" field? Why the prefix ChecksumType_ and not just MD5?
Why not a simpler approach where the type is an MDString? (That already how you parse it in DIChecksum::getChecksumType anyway)

include/llvm/IR/DebugInfoMetadata.h
430

Why there? Where is this API used?

490

What is the reason to have the "None"?

1281

Intended?

I think we try to make as much as possible metadata to be "distinct", CC Duncan to advise.

Not sure about that, so far we have only DICompileUnit and DISubprogram defined as distinct.
If I am not mistaken, distinct will prevent us from sharing Metadata, and for checksum there is no harm of sharing.

docs/LangRef.rst
4029

Right now the valid list of types are:
ChecksumType_MD5
ChecksumType_SHA1

I will add this to the document.

Answering your other questions:

  1. The current solution does not handle the type as string, but as an enumeration with a known token (see LLexer.cpp)
  2. Omitting the prefix ChecksumType_ will not allow us to recognize the token.
  3. This is similar to what is done with DIFlag...
  4. Allowing the user to write the type as string "MD5" is easier to implement, however, checking invalid types will not be elegant as this solution.

Bottom line, I am ready to go with either solutions.
Let's hear what others think.

include/llvm/IR/DebugInfoMetadata.h
430

We need to getChecksum from DICompileUnit, however, this information is stored in the DIScope.

490

We need to pass the checksum data and type to the DICompileUnit together with the filename and directory. However, not always we have a checksum data to send, and that is why we need an indicator that there is no checksum available.

If you try to create a checksum with None type Checksum::get() will return a nullptr.
This is the best way I could find to deal with the issue.
Please, let me know if you have better way to do that.

1281

I should have removed this line.
I am not sure why we have the getFilename() and getDirectory() for DILocation, thought that I might need the getChecksum for same reason, however, I guess it is not needed.

If for any reason we end up need this function we will add it then.
For now I will remove it.

+Adrian.

Note that I'm not worried about using 'distinct' here, since DICheckSum doesn't look like it can participate in cycles. (Otherwise, I haven't looked deeply at the patch.)

rnk edited edge metadata.Dec 12 2016, 8:49 AM

Any objections to folding this into DIFile? It would substantially reduce the size of this patch, IMO.

docs/LangRef.rst
4029

I think we could fold DIChecksum into DIFile. It would look like:

!DIFile(filename: ..., checksumkind: ChecksumType_MD5, checksum: !"deadbeef")

This adds a dead integer to DIFile when checksums aren't being used, but it's probably more efficient and less work than having a separate node.

include/llvm/IR/DebugInfoMetadata.h
430

I don't believe we really need it. We can probably do CU->getFile()->getChecksum().

lib/IR/DIBuilder.cpp
124–125

Instead of having createCompileUnit take Filename, Directory, ChecksumType, and Checksum, have it take a DIFIle.

In D27642#619933, @rnk wrote:

Any objections to folding this into DIFile? It would substantially reduce the size of this patch, IMO.

I have no objection going this way, if everybody agree on that I will implement the change and upload a new patch.

docs/LangRef.rst
4029

Fine with me.

include/llvm/IR/DebugInfoMetadata.h
430

OK, can do.

lib/IR/DIBuilder.cpp
124–125

It would be much better solution for sure.
I will give it a try (in different patch), unless you want to do it yourself :) ?

aprantl added inline comments.Dec 12 2016, 9:36 AM
docs/LangRef.rst
4029

I'm in favor of rolling it into DIFile. As Paul noted, DWARF 5 also allows for an MD5 checksum on each file entry in the line table section so I'm expecting this to become the normal case eventually.

What's the problem with recognizing MD5 as a context-sensitive token in the parser?

aaboud added inline comments.Dec 12 2016, 9:45 AM
docs/LangRef.rst
4029

It is not only MD5, right? should the LLexer recognize all possible types of checksum, i.e. MD5, SHA1, etc.?
Or like it is implemented in this patch, it only needs to recognize the checksum type prefix (whatever it will be).

So, how do you prefer to implement this?

  1. Prefix - (if yes, what should be the prefix?)
  2. Recognize all types
  3. Have a wild string
mehdi_amini added inline comments.Dec 12 2016, 10:50 AM
docs/LangRef.rst
4029

I rather have a much simpler implementation, the validation can be done in the verifier easily and it would handle similarly in-memory validation, bitcode and textual IR (the first two are the most important by the way, textual IR is not what we should optimize for).

With the change that @rnk suggested that would make it quite a smaller patch I believe.

4029

Alternative representation:

!DIFile(filename: ..., checksum: !"MD5:deadbeef")

lib/Bitcode/Reader/BitcodeReader.cpp
2759 ↗(On Diff #80976)

What if the value in Record[1] is unexpected?

lib/IR/Verifier.cpp
969

The check could be more strict: what if the bitcode contained an invalid ID?

Thanks for all the reviews and suggestions.
I just need to hear that all of you agree to one of the suggested solutions.

docs/LangRef.rst
4029

Once again, I do not mind how we implement it, I just need to hear an agreement from most of the participants in this review.
however, I think that with this approach it might complicate the code in CodeViewDebug, as it needs to parse the checksum string to get the type.

I would like to hear Reid opinion on this suggestion before I go and implement it.

lib/Bitcode/Reader/BitcodeReader.cpp
2759 ↗(On Diff #80976)

Do you suggest to do this instead?

static_cast<DIChecksum::ChecksumType>(Record[1])
lib/IR/Verifier.cpp
969

I am not sure if you are asking about the getTag() or the getType(), so I will answer for both.

  1. If we fold the checksum data into the DIFile metadata, then there will be no need for the tag here.
  2. Do you suggest that we check for valid range of check sum types? something like this:
(N.getType() >= DIChecksum::FirstValidType && N,getType() <= DIChecksum::LastValidType)
mehdi_amini added inline comments.Dec 12 2016, 5:41 PM
docs/LangRef.rst
4029

Is there already a patch up that shows the CodeViewDebug, so that I can see how it gets more complicated there?
I'd expect it to be a StringSwitch on the prefix in the string instead of a switch on the enum value and that's it.

lib/Bitcode/Reader/BitcodeReader.cpp
2759 ↗(On Diff #80976)

Not really, this is identical. What I mean is that the value in Record[1] can be totally invalid.

You were concerned about the error checking in the LLParser while it is the least of my concern: the textual IR is more of a "debugging" representation while In Memory and Bitcode representation of the IR are both on the production path.

lib/IR/Verifier.cpp
969

Yes, basically N.getType() could be an integer value that is neither None neither one of the valid types.

aaboud added inline comments.Dec 12 2016, 6:00 PM
docs/LangRef.rst
4029

Sorry, I do not have this patch, and I am not planing to.
I hope that Reid can implement this in CodeViewDebug once I commit this patch.
This is why I would like to hear his agreement to your proposal before I go this direction.

lib/Bitcode/Reader/BitcodeReader.cpp
2759 ↗(On Diff #80976)
  1. I implemented the DIChecksumm::CheckSumType similar to DINode::DIFlags.
  2. Do you think that if we use String in Record[1], we will be able to validate it here?

I think that it is fare enough to validate these values in the IR, after all LLexer and BitcodeReader are limited on how much they can validate.

mehdi_amini added inline comments.Dec 12 2016, 6:30 PM
lib/Bitcode/Reader/BitcodeReader.cpp
2759 ↗(On Diff #80976)

I'm fine validating this in the IR itself, I'd be even fine leaving this field up to the backend which can only ignore and warn if it can't handle the type of the checksum (i.e. the IR verifier would accept null or any string for this field).

I raised it here because you put some effort into having a "robust" textual IR path while the bitcode was not well validated, which seems somehow backward.

aprantl added inline comments.Dec 13 2016, 9:02 AM
docs/LangRef.rst
4029

I was skeptical at first about using a string to represent the checksum, but I came around to agree that this format is likely the most straight-forward way to add this to the IR assembler language.

!DIFile(filename: ..., checksum: !"MD5:deadbeef") // with checksum being an optional nullable field.

We probably still want a DIChecksum wrapper around the string that converts the hex string into an llvm::MD5, etc, and/or stores it as a byte string for a more efficient representation.

include/llvm/IR/DebugInfoMetadata.h
490

Why not allow for a nullptr to be passed in that interface?

rnk added inline comments.Dec 13 2016, 9:23 AM
docs/LangRef.rst
4029

I don't think we want a DIChecksum node. If we're concerned about efficiency, the real gain is eliminating the extra allocation.

I'd rather separate the checksum kind out and store it in DIFile::SubclassData32. It's consistent with what we do for DW_VIRTUALITY_*. It only adds minor complexity to the AsmParser, which is not on any critical path, and avoids a StringSwitch in actual code that will run on every compilation.

mehdi_amini added inline comments.Dec 13 2016, 9:32 AM
docs/LangRef.rst
4029

Are you really motivated by the efficiency of a single StringSwitch per DIFile at Codegen time?

The complexity is not much in the AsmParser than in the bitcode (backward compatibility, etc.), the explicit "None" instead of just ignoring the field (null), and the validation in the verifier.

aprantl added inline comments.Dec 13 2016, 9:39 AM
docs/LangRef.rst
4029

@rnk: To make sure I understand where you're going with your comment: You said you would like to separate out the storage of the checksum kind, but how do you propose to store the checksum data?

aprantl edited edge metadata.Dec 13 2016, 9:57 AM

About the discussion on distinct: The checksums (and also the DIFile nodes) should be uniqued in an LTO compilation and thus should not be distinct.

rnk added inline comments.Dec 13 2016, 10:02 AM
docs/LangRef.rst
4029

@aprantl I'd store it in an MDString. Different checksums have different lengths, so we'll need a separate allocation for that.

@mehdi_amini What I really care about is that the code to generate this stuff and move it through LLVM is simple. Remember when Duncan encoded as much debug info as strings as possible before moving to the current DI* class hierarchy? Doing stringly-typed things like !"MD5:deadbeef" feels like that, and given a choice, I'd rather avoid that.

Anyway, @aaboud asked for my opinion, so I figured I should spell it out. If you both feel that !"MD5:deadbeef" is better, I don't feel strongly about this and can definitely live with the checksum kind in the string.

aprantl added inline comments.Dec 13 2016, 11:41 AM
docs/LangRef.rst
4029

Would you prefer to store the hexadecimal ASCII encoding in the MDString or the raw bytes?

I would also like to point out that the assembler syntax also doesn't necessarily need to match the in-memory representation. We could use the easy-to-parse !"MD5:deadbeef" syntax while still having a different internal representation.

mehdi_amini added inline comments.Dec 13 2016, 11:43 AM
docs/LangRef.rst
4029

@mehdi_amini What I really care about is that the code to generate this stuff and move it through LLVM is simple

That's also the motivation behind my comment: my impression was that using a MDString is reusing as much of the existing infrastructure and requires changing less places, making it less intrusive for something that likely needs only to be understood manipulated "late" and can stay opaque most of the time. It also makes it handling the checksum as a single unit rather than when it is split.
Not sure if it makes much difference in practice though, that was more my impression of it.

I thought the main motivation for Duncan's work on DI* class was the efficiency of the representation (LTO memory consumption especially). I'd have to ask him though, that is just me speculating.

aaboud updated this revision to Diff 81639.Dec 15 2016, 12:58 PM
aaboud edited edge metadata.
aaboud added a subscriber: bwyma.

Folded Checksum type and data into DIFile metadata.

rnk accepted this revision.Dec 15 2016, 1:32 PM
rnk edited edge metadata.
rnk added inline comments.
docs/LangRef.rst
4029

I just wanted to echo what Duncan said here about type safety. Ultimately the point of our representation is to have a coherent object model, and I think separating the kind out from the string helps achieve that. The textual IR and bitcode exist to support and serialize our object model.

lib/AsmParser/LLLexer.cpp
812

This prefix works for me, but we should ask others to stamp it.

lib/IR/DebugInfoMetadata.cpp
373

This assert doesn't look right. The checksum string should be non-empty or null, which is what the isCanonical check does. We shouldn't have an exception for CSK_None.

This revision is now accepted and ready to land.Dec 15 2016, 1:32 PM
mehdi_amini added inline comments.Dec 15 2016, 1:39 PM
docs/LangRef.rst
4029

I don't see what can break the "coherency" of the "object model" on a leaf of the graph.

majnemer added inline comments.
include/llvm/IR/DebugInfoMetadata.h
478

I think it'd be nicer to instead do CSK_Last = CSK_SHA1 and use CSK_Last + 1 instead of CSK_Num. It makes it nicer to switch over the ChecksumKind type.

aaboud updated this revision to Diff 81961.Dec 19 2016, 9:07 AM
aaboud edited edge metadata.
aaboud marked 2 inline comments as done.

Addressed comments by Reid and David.

This revision was automatically updated to reflect the committed changes.