This is an archive of the discontinued LLVM Phabricator instance.

[AST] Store the string data in StringLiteral in a trailing array of chars
ClosedPublic

Authored by riccibruno on Nov 6 2018, 11:14 AM.

Details

Summary

Use the newly available space in the bit-fields of Stmt and store the
string data in a trailing array of chars after the trailing array
of SourceLocation. This cuts the size of StringLiteral by 2 pointers.

Also refactor slightly StringLiteral::Create and StringLiteral::CreateEmpty
so that StringLiteral::Create is just responsible for the allocation, and the
constructor is responsible for doing all the initialization. This match what
is done for the other classes in general.

Two points I am wondering about:

The original version used type punning through an union. Here I am just
reinterpret_casting back and forth between char * and uint16_t */uint32_t *.

There is a FIXME in ASTWriterStmt.cpp (see the end of the diff) which says that
storing the string past the StringLiteral would cause problems with abbreviations.
The note dates from 2009 and everything *seems* to be working just fine. However
I am not familiar with this and it is possible I am missing something here.

Diff Detail

Repository
rL LLVM

Event Timeline

riccibruno created this revision.Nov 6 2018, 11:14 AM
dblaikie added inline comments.
include/clang/AST/Expr.h
1700–1701 ↗(On Diff #172795)

Seems like changes like this are unrelated refactoring? Or are they in some way needed for this change?

(also the isXXX functions above - might be neat/tidy to pull those out separately (as an NFC change) so that then this change (that modifies the implementation of getKind) is just that, without having to refactor all the other bits too? But not a big deal either way there, really)

@dblaikie Thanks for looking at this patch !

I have a set of patches shrinking the other statements/expressions.
Can I add you to review some of these too ? Most of them consists of just moving
some data. I added rsmith but I think that he is busy with the standard.

include/clang/AST/Expr.h
1701 ↗(On Diff #172795)

They are not needed, I just could not resist.
I can factor these changes out of this patch and submit it as an NFC.

@dblaikie Thanks for looking at this patch !

I have a set of patches shrinking the other statements/expressions.
Can I add you to review some of these too ? Most of them consists of just moving
some data. I added rsmith but I think that he is busy with the standard.

Sure thing! do pull out as many bits of cleanup, renaming, refactoring, etc, into separate patches (no worries if you have to also send those for review (not sure if you have commit access/what level of stuff you want to review-after-commit, etc) - still much easier to review in little pieces than all together)

Factored out some NFSs which I will submit separately.

riccibruno marked 2 inline comments as done.Nov 14 2018, 7:02 AM
riccibruno added inline comments.
include/clang/AST/Expr.h
1615 ↗(On Diff #174032)

Note that the trailing array of chars is aligned to 4 bytes
since it is after the array of SourceLocation.
Therefore I believe that the uint16_t * and uint32_t *
point to properly aligned memory. However I can add an
assertion here if needed.

IIRC, abbreviations just silently don't take effect if the record doesn't conform; so things will appear to work, but the size on disk will be bigger.

include/clang/AST/Expr.h
1615 ↗(On Diff #174032)

I think it's fine.

riccibruno marked 2 inline comments as done.Nov 14 2018, 3:23 PM

IIRC, abbreviations just silently don't take effect if the record doesn't conform; so things will appear to work, but the size on disk will be bigger.

I looked at where the abbreviations are defined, and it seems that the only abbreviations for
statements/expressions are for DeclRefExpr, IntegerLiteral, CharacterLiteral
and ImplicitCastExpr (grep for EmitAbbrev in Serialization/,
for some reasons they are emitted in WriteDeclAbbrev()...).

And indeed changing the serialization format of CharacterLiteral triggers various assertions
because of the abbreviation. Therefore unless I am missing something no other statement/expression
has currently an abbreviation. I suspect therefore that someone could go wild and cut the on-disk
size of the serialization format significantly here.

I looked at the size of the generated pch for all of Boost, and I am only seeing an increase of
about 8k, which is entirely attributable to the fact that I am adding one field to the serialization
format. I can rework it to remove this additional field if needed.

IIRC, abbreviations just silently don't take effect if the record doesn't conform; so things will appear to work, but the size on disk will be bigger.

I looked at where the abbreviations are defined, and it seems that the only abbreviations for
statements/expressions are for DeclRefExpr, IntegerLiteral, CharacterLiteral
and ImplicitCastExpr (grep for EmitAbbrev in Serialization/,
for some reasons they are emitted in WriteDeclAbbrev()...).

And indeed changing the serialization format of CharacterLiteral triggers various assertions
because of the abbreviation. Therefore unless I am missing something no other statement/expression
has currently an abbreviation. I suspect therefore that someone could go wild and cut the on-disk
size of the serialization format significantly here.

I looked at the size of the generated pch for all of Boost, and I am only seeing an increase of
about 8k, which is entirely attributable to the fact that I am adding one field to the serialization
format. I can rework it to remove this additional field if needed.

If you're generally interested in improving build times, working on the serialization format would probably be really valuable. We don't generally insist on a zero-regressions policy in other commits, though.

rjmccall accepted this revision.Nov 14 2018, 4:33 PM

LGTM.

This revision is now accepted and ready to land.Nov 14 2018, 4:33 PM

IIRC, abbreviations just silently don't take effect if the record doesn't conform; so things will appear to work, but the size on disk will be bigger.

I looked at where the abbreviations are defined, and it seems that the only abbreviations for
statements/expressions are for DeclRefExpr, IntegerLiteral, CharacterLiteral
and ImplicitCastExpr (grep for EmitAbbrev in Serialization/,
for some reasons they are emitted in WriteDeclAbbrev()...).

And indeed changing the serialization format of CharacterLiteral triggers various assertions
because of the abbreviation. Therefore unless I am missing something no other statement/expression
has currently an abbreviation. I suspect therefore that someone could go wild and cut the on-disk
size of the serialization format significantly here.

I looked at the size of the generated pch for all of Boost, and I am only seeing an increase of
about 8k, which is entirely attributable to the fact that I am adding one field to the serialization
format. I can rework it to remove this additional field if needed.

If you're generally interested in improving build times, working on the serialization format would probably be really valuable. We don't generally insist on a zero-regressions policy in other commits, though.

I was planning to finish packing the other expression classes to minimize the in-memory footprint
of the AST first. I will try to take a look at serialization if I have some time but no promise for now.
Thanks for the review !

This revision was automatically updated to reflect the committed changes.