Page MenuHomePhabricator

[clang] Use string tables for static diagnostic descriptions
AcceptedPublic

Authored by froydnj on Jun 15 2020, 12:24 PM.

Details

Summary

Using a pointer for the description string in StaticDiagInfoRec causes several problems:

  1. We don't need to use a whole pointer to represent the string;
  2. The use of pointers incurs runtime relocations for those pointers; the relocations take up space on disk and represent runtime overhead;
  3. The need to relocate data implies that, on some platforms, the entire array containing StaticDiagInfoRecs cannot be shared between processes.

This patch changes the storage scheme for the diagnostic descriptions to
avoid these problems. We instead generate (effectively) one large
string and then StaticDiagInfoRec conceptually holds offsets into the
string. We elected to also move the storage of those offsets into a
separate array to further reduce the space required.

On x86-64 Linux, this change removes about 120KB of relocations and
moves about 60KB from the non-shareable .data.rel.ro section to
shareable .rodata. (The array is about 80KB before this, but we
eliminated 4 bytes/entry by using offsets rather than pointers.) We
actually reap this benefit twice, because these tables show up in both
libclang.so and libclang-cpp.so and we get the reduction in both places.

Diff Detail

Event Timeline

froydnj created this revision.Jun 15 2020, 12:24 PM

I'd like to apply the same sort of techniques to the builtin tables, which are enormous, but this is baby steps.

froydnj changed the repository for this revision from rC Clang to rG LLVM Github Monorepo.Jun 23 2020, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2020, 8:27 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Add some reviewers based on git diff --name-only | xargs -n 1 git blame --porcelain | grep "^author " | sort | uniq -c | sort -nr | head -30.

riccibruno retitled this revision from use string tables for static diagnostic descriptions to [clang] Use string tables for static diagnostic descriptions.Mon, Jul 27, 7:27 AM
bkramer accepted this revision.Mon, Jul 27, 8:18 AM

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

This revision is now accepted and ready to land.Mon, Jul 27, 8:18 AM

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

Is there somebody who should review that particular bit? My understanding is that it is OK, but my understanding of the C++ standard is not necessarily complete, and I'd like to get a second opinion.

Nice, those relocations have annoyed me for years. I'm worried about whether the way you're accessing StaticDiagInfoDescriptionStringTable might be undefined behavior. I won't block this change on that though.

Is there somebody who should review that particular bit? My understanding is that it is OK, but my understanding of the C++ standard is not necessarily complete, and I'd like to get a second opinion.

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

Thanks for the explanation. Is the language of "writing" applicable here, given that this is a constant blob of storage? (I suppose the compiler is permitted to designate a particular member as having been "written"?)

Would it be more palatable to write:

struct StaticDiagInfoDescriptionStringTable {
  // members as char[] for each diagnostic
};

const StaticDiagInfoDescriptionStringTable StaticDiagInfoDescriptions = {
  // define all the members
};

...

struct StaticDiagInfoRec {
  ...
  StringRef getDescription() const {
    size_t MyIndex = this - &StaticDiagInfo[0];
    uint32_t StringOffset = StaticDiagInfoDescriptionOffsets[MyIndex];
    // Defined as a pointer to the first member, and (presumably) there is no internal padding.
    const char *StringTable = reinterpret_cast<const char*>(&StaticDiagInfoDescriptions);
    return StringRef(&StringTable[StringOffset], DescriptionLen);
};

and then we don't have to care about how the host compiler interprets access to different members of unions?

I believe this falls under the (using cppreference ( https://en.cppreference.com/w/cpp/language/union ) , since it's a bit easier to read) UB clause: " It's undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union."

Last member written to was the "StringTable" member, but then it's read from the "String" member, so that'd be UB. Commonly supported, but UB - not sure if we have a general statement that we depend on this behavior in LLVM, even though it's non-standard (but it's possible that we do make such an assumption about the compiler that's building LLVM). It'd be nice to avoid that, though - and not too difficult, I think - I /believe/ it's valid to take a pointer to an object, cast it to char*, compute a pointer to some specific member and then cast it back to the right type and access. But I could be wrong there. @rsmith would be the person to give an authoritative answer.

Thanks for the explanation. Is the language of "writing" applicable here, given that this is a constant blob of storage? (I suppose the compiler is permitted to designate a particular member as having been "written"?)

I /believe/ it is applicable, though I could be wrong.

Ah, cppreference's example supports that theory at least: S s = {0x12345678}; // initializes the first member, s.n is now the active member - that the initialization itself does set the active member of the union.

(oh, and a later example on cppreference, which is lifted from the C++ spec (though not sure exactly which version) says similarly: "Y y = { { 1, 2 } }; // OK, y.x is active union member (9.2)")

Would it be more palatable to write:

struct StaticDiagInfoDescriptionStringTable {
  // members as char[] for each diagnostic
};

const StaticDiagInfoDescriptionStringTable StaticDiagInfoDescriptions = {
  // define all the members
};

...

struct StaticDiagInfoRec {
  ...
  StringRef getDescription() const {
    size_t MyIndex = this - &StaticDiagInfo[0];
    uint32_t StringOffset = StaticDiagInfoDescriptionOffsets[MyIndex];
    // Defined as a pointer to the first member, and (presumably) there is no internal padding.
    const char *StringTable = reinterpret_cast<const char*>(&StaticDiagInfoDescriptions);
    return StringRef(&StringTable[StringOffset], DescriptionLen);
};

and then we don't have to care about how the host compiler interprets access to different members of unions?

I think so? I guess that's essentially the point of offsetof.

The comment about padding probably isn't needed, I think? Even if there was padding, the StringOffset comes from "offsetof" so it describes the offset including any padding involved?

Be great if @rsmith got a chance to weigh in here.

Ping for an opinion on the recent discussion of how to move forward with this patch.

Ping for an opinion on the recent discussion of how to move forward with this patch.