This is an archive of the discontinued LLVM Phabricator instance.

[clang] Use string tables for static diagnostic descriptions
ClosedPublic

Authored by froydnj on Jun 15 2020, 12:24 PM.

Details

Summary

Using a pointer for the description string in StaticDiagInfoRec causes several problems:

  1. We don't need to use a whole pointer to represent the string;
  2. The use of pointers incurs runtime relocations for those pointers; the relocations take up space on disk and represent runtime overhead;
  3. The need to relocate data implies that, on some platforms, the entire array containing StaticDiagInfoRecs cannot be shared between processes.

This patch changes the storage scheme for the diagnostic descriptions to
avoid these problems. We instead generate (effectively) one large
string and then StaticDiagInfoRec conceptually holds offsets into the
string. We elected to also move the storage of those offsets into a
separate array to further reduce the space required.

On x86-64 Linux, this change removes about 120KB of relocations and
moves about 60KB from the non-shareable .data.rel.ro section to
shareable .rodata. (The array is about 80KB before this, but we
eliminated 4 bytes/entry by using offsets rather than pointers.) We
actually reap this benefit twice, because these tables show up in both
libclang.so and libclang-cpp.so and we get the reduction in both places.

Diff Detail

Event Timeline

froydnj created this revision.Jun 15 2020, 12:24 PM

I'd like to apply the same sort of techniques to the builtin tables, which are enormous, but this is baby steps.

froydnj changed the repository for this revision from rC Clang to rG LLVM Github Monorepo.Jun 23 2020, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2020, 8:27 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Add some reviewers based on git diff --name-only | xargs -n 1 git blame --porcelain | grep "^author " | sort | uniq -c | sort -nr | head -30.

riccibruno retitled this revision from use string tables for static diagnostic descriptions to [clang] Use string tables for static diagnostic descriptions.Jul 27 2020, 7:27 AM
bkramer accepted this revision.Jul 27 2020, 8:18 AM

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

This revision is now accepted and ready to land.Jul 27 2020, 8:18 AM

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

Is there somebody who should review that particular bit? My understanding is that it is OK, but my understanding of the C++ standard is not necessarily complete, and I'd like to get a second opinion.

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

Is there somebody who should review that particular bit? My understanding is that it is OK, but my understanding of the C++ standard is not necessarily complete, and I'd like to get a second opinion.

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

Thanks for the explanation. Is the language of "writing" applicable here, given that this is a constant blob of storage? (I suppose the compiler is permitted to designate a particular member as having been "written"?)

Would it be more palatable to write:

struct StaticDiagInfoDescriptionStringTable {
  // members as char[] for each diagnostic
};

const StaticDiagInfoDescriptionStringTable StaticDiagInfoDescriptions = {
  // define all the members
};

...

struct StaticDiagInfoRec {
  ...
  StringRef getDescription() const {
    size_t MyIndex = this - &StaticDiagInfo[0];
    uint32_t StringOffset = StaticDiagInfoDescriptionOffsets[MyIndex];
    // Defined as a pointer to the first member, and (presumably) there is no internal padding.
    const char *StringTable = reinterpret_cast<const char*>(&StaticDiagInfoDescriptions);
    return StringRef(&StringTable[StringOffset], DescriptionLen);
};

and then we don't have to care about how the host compiler interprets access to different members of unions?

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

Thanks for the explanation. Is the language of "writing" applicable here, given that this is a constant blob of storage? (I suppose the compiler is permitted to designate a particular member as having been "written"?)

I /believe/ it is applicable, though I could be wrong.

Ah, cppreference's example supports that theory at least: S s = {0x12345678}; // initializes the first member, s.n is now the active member - that the initialization itself does set the active member of the union.

(oh, and a later example on cppreference, which is lifted from the C++ spec (though not sure exactly which version) says similarly: "Y y = { { 1, 2 } }; // OK, y.x is active union member (9.2)")

Would it be more palatable to write:

struct StaticDiagInfoDescriptionStringTable {
  // members as char[] for each diagnostic
};

const StaticDiagInfoDescriptionStringTable StaticDiagInfoDescriptions = {
  // define all the members
};

...

struct StaticDiagInfoRec {
  ...
  StringRef getDescription() const {
    size_t MyIndex = this - &StaticDiagInfo[0];
    uint32_t StringOffset = StaticDiagInfoDescriptionOffsets[MyIndex];
    // Defined as a pointer to the first member, and (presumably) there is no internal padding.
    const char *StringTable = reinterpret_cast<const char*>(&StaticDiagInfoDescriptions);
    return StringRef(&StringTable[StringOffset], DescriptionLen);
};

and then we don't have to care about how the host compiler interprets access to different members of unions?

I think so? I guess that's essentially the point of offsetof.

The comment about padding probably isn't needed, I think? Even if there was padding, the StringOffset comes from "offsetof" so it describes the offset including any padding involved?

Be great if @rsmith got a chance to weigh in here.

Ping for an opinion on the recent discussion of how to move forward with this patch.

Ping for an opinion on the recent discussion of how to move forward with this patch.

froydnj updated this revision to Diff 288628.Aug 28 2020, 9:09 AM

Updated to use a struct instead of a union for the actual table in an effort to avoid UB.

froydnj requested review of this revision.Aug 28 2020, 9:10 AM

Throwing this back into the review queue.

dblaikie accepted this revision.Aug 28 2020, 12:58 PM

Sounds good to me - thanks!

This revision is now accepted and ready to land.Aug 28 2020, 12:58 PM

@froydnj The committed version rG31a3c5fb45b78bdaa78d94ffcc9258e839002016 appears to be very different from the review. I guess next time your probably can upload the diff again if it is very diffierent

@froydnj The committed version rG31a3c5fb45b78bdaa78d94ffcc9258e839002016 appears to be very different from the review. I guess next time your probably can upload the diff again if it is very diffierent

Judging by a cursory glance at Phab's view of the delta ( https://reviews.llvm.org/rG4b64ce7428b66cacfe74dbd9dbc29aff6dfb84af ) it /looks/ like it wasn't too different. Mostly picking up upstream changes that added "DEFERRABLE"? (I think Phab uses light green for "this changed, but only because of upstream changes" and dark green is the actual patch changes?)

@froydnj The committed version rG31a3c5fb45b78bdaa78d94ffcc9258e839002016 appears to be very different from the review. I guess next time your probably can upload the diff again if it is very diffierent

Judging by a cursory glance at Phab's view of the delta ( https://reviews.llvm.org/rG4b64ce7428b66cacfe74dbd9dbc29aff6dfb84af ) it /looks/ like it wasn't too different. Mostly picking up upstream changes that added "DEFERRABLE"? (I think Phab uses light green for "this changed, but only because of upstream changes" and dark green is the actual patch changes?)

Sorry for the noise. What I saw previously was a mere difference in the DIAG macro and the new isDeferable... Maybe Phab presented the diff between two Diffs to me. The updated view seems good.

@froydnj The committed version rG31a3c5fb45b78bdaa78d94ffcc9258e839002016 appears to be very different from the review. I guess next time your probably can upload the diff again if it is very diffierent

Judging by a cursory glance at Phab's view of the delta ( https://reviews.llvm.org/rG4b64ce7428b66cacfe74dbd9dbc29aff6dfb84af ) it /looks/ like it wasn't too different. Mostly picking up upstream changes that added "DEFERRABLE"? (I think Phab uses light green for "this changed, but only because of upstream changes" and dark green is the actual patch changes?)

Sorry for the noise. What I saw previously was a mere difference in the DIAG macro and the new isDeferable... Maybe Phab presented the diff between two Diffs to me. The updated view seems good.

I assumed that "add another parameter to a macro due to rebasing" was not a significant enough change to warrant reposting...but as this is the first patch I was committing myself, I probably should have been a bit more explicit in what I was committing (even re-asking for review? I'm not sure of the norms around rebasing in the LLVM project). My mistake!

@froydnj The committed version rG31a3c5fb45b78bdaa78d94ffcc9258e839002016 appears to be very different from the review. I guess next time your probably can upload the diff again if it is very diffierent

Judging by a cursory glance at Phab's view of the delta ( https://reviews.llvm.org/rG4b64ce7428b66cacfe74dbd9dbc29aff6dfb84af ) it /looks/ like it wasn't too different. Mostly picking up upstream changes that added "DEFERRABLE"? (I think Phab uses light green for "this changed, but only because of upstream changes" and dark green is the actual patch changes?)

Sorry for the noise. What I saw previously was a mere difference in the DIAG macro and the new isDeferable... Maybe Phab presented the diff between two Diffs to me. The updated view seems good.

I assumed that "add another parameter to a macro due to rebasing" was not a significant enough change to warrant reposting...but as this is the first patch I was committing myself, I probably should have been a bit more explicit in what I was committing (even re-asking for review? I'm not sure of the norms around rebasing in the LLVM project). My mistake!

I think that is fine. It seems that you just rebased on top of Recommit "[CUDA][HIP] Defer overloading resolution diagnostics for host device functions" (which actually added isDeferable). I somewhat read the diff between two Diffs and found it different from the commit and commented here. Sorry!