Page MenuHomePhabricator

Add AST_SIGNATURE record to unhashed control block of PCM files
ClosedPublic

Authored by dang on May 21 2020, 9:17 AM.

Details

Summary

This record is constructed by hashing the bytes of the AST block in a similiar
fashion to the SIGNATURE record. This new signature only means anything if the
AST block is fully relocatable, i.e. it does not embed absolute offsets within
the PCM file. This change ensure this does not happen by repalcing these offsets
with offsets relative to the start of the AST block.

Diff Detail

Event Timeline

dang created this revision.May 21 2020, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 21 2020, 9:17 AM
dang updated this revision to Diff 265530.May 21 2020, 9:53 AM

Formatting fixes

dang updated this revision to Diff 265535.May 21 2020, 10:11 AM

Accidently deleted a line in the previous update

dang updated this revision to Diff 265848.May 23 2020, 3:39 AM

This should address the Windows test failure as well as some offline feedback

  • Ensure that DeclOffset instances can only be read by providing an offset to the AST block
  • Simplify ASTSignature test by using -fdisable-module-hash
dexonsmith requested changes to this revision.Jun 1 2020, 3:03 PM

Thanks for working on this! I have a number of suggestions inline.

clang/include/clang/Serialization/ASTBitCodes.h
44

VERSION_MAJOR should be bumped.

244–245

Please update the first sentence to be accurate.

246

I don't think you'll need this comment anymore, after you fix the above. But if you keep it, please add a period at the end of the sentence.

397–403

These names and descriptions hard hard to differentiate. Is there another way of naming these that will be more clear?

(One idea I had is to create CONTROL_BLOCK_HASH and AST_BLOCK_HASH and then SIGNATURE could just be their hash-combine, but maybe you have another idea.)

clang/lib/Serialization/ASTReader.cpp
2931

Please submit this NFC change separately.

clang/lib/Serialization/ASTWriter.cpp
1047

Alternately, you could change ASTFileSignature to have three fields:

  • ControlBlockHash
  • ASTBlockHash
  • Signature

(You could construct the signature by taking the SHA1 of the two block hashes.)

1048–1052

Rather than duplicating this code, would it be simpler to start with a patch that changes ASTFileSignature to not require this massaging? I suggest either adding a constructor that takes Hash directly, or just changing it to have the same type as Hash (and updating the serialization to expect that).

1064–1073

The default abbreviation (7-VBR) isn't great for hashes. Given that we're going to have two of these records, I suggest specifying a custom abbreviation.

If you were to split out a prep commit as suggested above to change SIGNATURE to be an array of 8-bit values (instead of the current array of 32-bit values), that would be a good opportunity to add an abbreviation.

2066

Can this be relative to the start of SOURCE_MANAGER_BLOCK_ID instead?

4718–4719

Is there ever a reason to use M.FileName here, or is that always redundant with what's in the control block? I wonder if we can just remove this complexity.

clang/lib/Serialization/ASTWriterDecl.cpp
2435

I suggest making WriteDecls relative to the start of the DECLTYPES_BLOCK_ID. That will make the block itself (more) self-contained, and also make the offsets smaller (and therefore cheaper to store in bitcode).

I also don't think you need to actually emit that integer, it's just an implicit contract between the reader and writer.

2440

Same here, this should be relative to the start of DECLTYPES_BLOCK_ID.

This revision now requires changes to proceed.Jun 1 2020, 3:03 PM
dang added a comment.Jun 2 2020, 5:55 AM

I made the decision to make all relative offsets relative to the start of the AST block rather than their own sub-block or a neighboring block, in order to prevent adding complexity in ASTReader. My reasoning that it would be best if the reader didn't have to keep track of a bunch of offsets inside the bitcode at all times, but that it could just remember the offset of the AST block once and always use that for relative offsets. I agree that conceptually making the sub-blocks relocatable too would have been nice, but I figured it wasn't worth the extra complexity in the reader.

clang/lib/Serialization/ASTWriter.cpp
1064–1073

I think encoding the signature as a blob makes the most sense if we make it an array of 20 8-bit values, which is what the hasher gives us. However it is not quite so simple, as it would need a big change in the way direct imports are serialized. The way it is currently done embeds multiple strings and signatures within a single unabbreviated record. I can replace the unabbreviated record with a sub-block that keeps track of a string table and a record for each import containing offsets inside the string table and blobs for the signature.

4718–4719

The control block only tracks this information for direct imports although it can maybe be extended to do keep track of this for all dependencies. This bit of the table could then become and index inside the IMPORTS record in the control block.

clang/lib/Serialization/ASTWriterDecl.cpp
2435

Not sure what you mean by I don't actually need to emit that integer? If you mean ASTBlockRange.first then I don't emit it but rather pass it to the constructor of DeclOffset so that it performs the subtraction there. Since it is a self contained type, it forces the Reader to provide a value for it, making it harder to read the raw relative offset.

dang updated this revision to Diff 267914.Jun 2 2020, 9:57 AM

Address some of the code review comments

dang marked 4 inline comments as done.Jun 2 2020, 10:07 AM
dang added inline comments.
clang/include/clang/Serialization/ASTBitCodes.h
397–403

I kept the same hasher when computing both of these which mitigates the cost. I don't see the need for also emitting a hash for the control block, there are some optional records that are not in both the AST block and the control block anyway.

dang marked an inline comment as done.Jun 2 2020, 10:44 AM
dang added inline comments.
clang/lib/Serialization/ASTWriter.cpp
1064–1073

But I do think it is outside the scope of this change to do this.

dang marked an inline comment as done.Jun 2 2020, 10:52 AM
dang added inline comments.
clang/include/clang/Serialization/ASTBitCodes.h
397–403

I also think that the AST_BLOCK_HASH and the SIGNATURE are enough information already. In most cases you can deduce if the control block was different by just checking if the signatures were different and the ASTs the same.

I made the decision to make all relative offsets relative to the start of the AST block rather than their own sub-block or a neighboring block, in order to prevent adding complexity in ASTReader. My reasoning that it would be best if the reader didn't have to keep track of a bunch of offsets inside the bitcode at all times, but that it could just remember the offset of the AST block once and always use that for relative offsets. I agree that conceptually making the sub-blocks relocatable too would have been nice, but I figured it wasn't worth the extra complexity in the reader.

I think the extra complexity is small and worth it. The default abbreviation is 7-VBR, which is cheaper for small offsets. I doubt we'll bother to come back and fix this later, so might as well get it right now.

clang/include/clang/Basic/Module.h
60 ↗(On Diff #267914)

Can you make this create? I think new code we prefer lowerCamelCase for method names.

66–72 ↗(On Diff #267914)

This code looks correct, and it's better factored out like this, but I still don't understand why we convert to 32-bit numbers. Have you looked at the code simplification from storing this as an array of 8-bit numbers?

clang/include/clang/Serialization/ASTBitCodes.h
397–403

Thanks for updating to AST_BLOCK_HASH, I think that addressed the concern I had.

However, having thought further about it, I don't think it makes sense to create a new record type. I suggest just having a convention of writing the hashes out in a particular order.

  • First record is the overall signature.
  • Second record is the AST block hash.
  • In the future, we can add a third record for the control block hash.

In which case, the (single) record name should be something generic like SIGNATURE or SHA1_HASH.

Note: I doubt there would be additional cost to computing the AST and control block hashes separately, but I agree there isn't a compelling use for the control block hash currently (although it would be nice to confirm properties like if the CONTROL_BLOCK_HASH were typically stable when sources changed).

Note: I don't think the content of the "hashed" control block vs. unhashed control block have been carefully considered from first principles. At some point we might do that. I expect there are currently things that change the control block that would not invalidate it for importers.

clang/lib/Serialization/ASTWriter.cpp
1047

Have you considered changing ASTFileSignature to have multiple fields, one for the signature and another for the AST block hash? Working with pairs and tuples is always awkward.

1064–1073

However it is not quite so simple, as it would need a big change in the way direct imports are serialized. The way it is currently done embeds multiple strings and signatures within a single unabbreviated record.

I don't understand why switching to 8-bit values would affect this. You can just write each 8-bit value as a separate field in the record, just like it currently does for 32-bit values. I.e., the current code seems to work just as well:

for (auto I : M.Signature)
  Record.push_back(I);

That said, having a single IMPORTS record is kind of ridiculous. It would be nice to clean that up (either as a follow-up or prep commit). I think your suggestion of having an IMPORTS sub-block is reasonable. A simpler approach would be just to emit IMPORT records. In the reader you can figure out implicitly how many there are (is the next record another IMPORT? If so keep reading). If you did this, I would suggest abbreviating the IMPORT record since it'll be well-structured.

For SIGNATURE record abbreviation, I wasn't think a blob, just something simple like:

auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
HashAbbrev->Add(BitCodeAbbrevOp(Signature));
for (int I = 0, E = 5; I != E; ++I)
  HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 32));

If you switch to 8-bit:

auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
HashAbbrev->Add(BitCodeAbbrevOp(Signature));
for (int I = 0, E = 20; I != E; ++I)
  HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 8));
4718–4719

Okay. That might be a nice follow-up or prep commit. If you're not interested in that tangent though I suggest adding an accessor ModuleFile::isModule that you can use here.

clang/lib/Serialization/ASTWriterDecl.cpp
2435

Ah! I misread the code.

dang updated this revision to Diff 269155.Mon, Jun 8, 3:38 AM
dang marked an inline comment as done.

Address some code review feedback

dang marked an inline comment as done.Mon, Jun 8, 3:45 AM

See D81347 for the parent revision that makes ASTFileSignature a std::array<uint8_t, 20>. Adding the abbreviation doesn't make that much sense at the moment because of how the IMPORTS record is structured as discussed offline.

dang updated this revision to Diff 269854.Wed, Jun 10, 7:53 AM
dang marked an inline comment as done.

Address @dexonsmith feedback about making the relative offsets in the AST block relative to the nearest relevant subblock.

clang/include/clang/Serialization/ASTBitCodes.h
397–403

I agree with the fact that it might be worth maintaining a control block hash, but I do think this is not the purpose of this change. I can do it as a follow up when I have some time. I am somewhat opposed to keeping them all in the same record for now, because it makes it harder to grep for one of the signatures in the output of llvm-bcanalyzer which is currently the only available tooling for looking inside the unhashed control block.

dexonsmith accepted this revision.Wed, Jun 10, 12:33 PM

LGTM, with one more comment inline.

clang/lib/Serialization/ASTWriter.cpp
2345

It looks to me like this field is ignored in the reader. If so we should just strip it. Can you confirm?

This revision is now accepted and ready to land.Wed, Jun 10, 12:33 PM
dang updated this revision to Diff 270077.Thu, Jun 11, 2:36 AM
dang marked 3 inline comments as done.

Rebase on top of latest master that has needed NFC changes to calm down clang-format.

dang added inline comments.Thu, Jun 11, 6:06 AM
clang/lib/Serialization/ASTReader.cpp
3706

@dexonsmith the field you asked about is used here and the MacroOffsetsBase is used in other places to compute offsets

clang/lib/Serialization/ASTWriter.cpp
2345

No, look at line 3706 of the Reader in the updated version. I added a comment on the relevant line.

This revision was automatically updated to reflect the committed changes.