This is an archive of the discontinued LLVM Phabricator instance.

[wip] Bitcode: Write the irsymtab to disk.
AbandonedPublic

Authored by pcc on Apr 13 2017, 4:18 PM.

Details

Summary

Fixes PR27551.

This is just to show where I'm going with the irsymtab changes. With this and
its dependencies, no-op incremental ThinLTO link time for Chromium decreases
to about 35s and total .bc file size increases to 1225420580 bytes (versus [0]).

This needs more tests and needs to be split into multiple patches. I also
want to try to shrink the size of the irsymtab further.

[0] http://lists.llvm.org/pipermail/llvm-dev/2017-April/111828.html

Depends on D31838

Depends on D31921

Depends on D31922

Event Timeline

pcc created this revision.Apr 13 2017, 4:18 PM
tejohnson added inline comments.Apr 18 2017, 9:37 AM
llvm/include/llvm/Object/IRSymtab.h
131

Does this mean that the symbol table will be thrown away every time the llvm revision changes? Can we avoid that by having a symbol table version number?

llvm/lib/Object/IRSymtab.cpp
332

I guess the disadvantage of having a fully upgradable symbol table is that we can't avoid duplication between the module bitcode and the symbol table. I.e. vs reading the symbol table first, and getting visibility and other symbol info from there and using it in the Module. Although the format if the symbol table is more compressed by not keeping it as structured bitcode records, so it is a tradeoff.

(As an aside - is the linkage type going to be added to the symbol table so that we don't need to parse IR when building the summaries in the bitcode reader?)

I guess the approach of storing the symbol table in a custom string blob was already signed off in D31364, but I'm hoping for another pair of eyes on this. Will add Rafael to the review in case Mehdi doesn't have the bandwidth right now. Especially since we chatted about moving the ThinLTO summaries into this format.

pcc added inline comments.Apr 18 2017, 10:44 AM
llvm/include/llvm/Object/IRSymtab.h
131

That would be tricky to get right. There are many things that could change the symbol table, even without changing the format.

For example, a change such as D31443 would re-order the symbol table, despite not touching the bitcode reader at all.

As another example, suppose that we decide to remove an intrinsic from LLVM, and we upgrade existing IR by removing the intrinsic declaration (and changing calls to do something else). That by itself would change the symbol table.

So I think the safest option would be to accept only symbol tables from the same revision of LLVM.

llvm/lib/Object/IRSymtab.cpp
332

I guess the disadvantage of having a fully upgradable symbol table is that we can't avoid duplication between the module bitcode and the symbol table. I.e. vs reading the symbol table first, and getting visibility and other symbol info from there and using it in the Module. Although the format if the symbol table is more compressed by not keeping it as structured bitcode records, so it is a tradeoff.

Yes, although to some extent the duplication is unavoidable because we have to be able to store module inline asm symbols here as well.

(As an aside - is the linkage type going to be added to the symbol table so that we don't need to parse IR when building the summaries in the bitcode reader?)

I believe that the only use of the linkage when building the summary is to determine whether a global has external linkage. There is already a symbol flag FB_global that gives us that information.

tejohnson added inline comments.Apr 18 2017, 11:50 AM
llvm/include/llvm/Object/IRSymtab.h
131

For example, a change such as D31443 would re-order the symbol table, despite not touching the bitcode reader at all.

I guess this makes a different for regularLTO where we need to correlate the symbol resolutions built from the symbol table with the symbols in the module. Presumably it shouldn't matter for something like ThinLTO though, where in the thin link we eventually shouldn't have to look at the module ir at all. Not that this helps much since we don't know the use case. Unfortunate, but I can't think of a way to avoid it offhand, without adding some other way to correlate the module symbols to the symbol table symbols.

llvm/lib/Object/IRSymtab.cpp
332

I believe that the only use of the linkage when building the summary is to determine whether a global has external linkage. There is already a symbol flag FB_global that gives us that information.

Not just that - we need to know the linkage type during the thin link to do weak resolution.

pcc added inline comments.Apr 18 2017, 12:04 PM
llvm/lib/Object/IRSymtab.cpp
332

You are quite right, I forgot about that. I will take a closer look at that as part of the summary overhaul to see if it can be refactored away.

mehdi_amini added inline comments.Apr 18 2017, 4:32 PM
llvm/include/llvm/Object/IRSymtab.h
131

So I think the safest option would be to accept only symbol tables from the same revision of LLVM.

But that's a design decision. The symbol table could be stored as bitcode with the same exact restriction as all the usual compatibility, and avoid any duplication with the rest of the module (i.e. not being able to regenerate the symbol table from the rest of the bitcode is *not* a bad thing to me, quite the opposite).
I'm not sure what makes a symbol table "special" here.

pcc added inline comments.Apr 18 2017, 4:44 PM
llvm/include/llvm/Object/IRSymtab.h
131

Even with a bitcode design that somehow avoided all duplication between the module and the symbol table, I don't see why we wouldn't still have the same set of issues when upgrading old bitcode.

mehdi_amini added inline comments.Apr 18 2017, 4:59 PM
llvm/include/llvm/Object/IRSymtab.h
131

I don't follow.

pcc added inline comments.Apr 18 2017, 6:12 PM
llvm/include/llvm/Object/IRSymtab.h
131

The ModuleSymbolTable is the "source of truth" for a module's symbol table, and the irsymtab is effectively a "cache" of the ModuleSymbolTable (an irsymtab is derived from a ModuleSymbolTable). As previously mentioned, a ModuleSymbolTable can change as a result of arbitrary changes to the compiler. The situation we want to avoid is where irsymtab::read gives us a different set of symbols than the ModuleSymbolTable, which could happen if the irsymtab was created using an old version of LLVM. We are robust against such situations by checking the revision number embedded in the bitcode file, and rebuilding the irsymtab if it does not match.

One possible way of being resistant to changes in the symbol table suggests itself: we could discard resolutions for symbols that do not appear in the ModuleSymbolTable, and make up resolutions for symbols that did not appear in the irsymtab. But that doesn't seem particularly sound, because the linker still needs to see the correct set of symbols at symbol resolution time for correctness. To go back to the example where we remove an intrinsic, if the intrinsic is upgraded by replacing it with a call to a runtime library function, this would result in a new undefined symbol in the symbol table. Although we could invent a resolution for it (non-prevailing + non-local + external would do), this new undefined symbol may require the linker to have different behaviour at symbol resolution time. For example, the new undefined symbol may require additional archive members to be loaded, which the linker may not be prepared to do once it has completed LTO. The simplest way to handle that situation is by rebuilding the irsymtab, which would ensure that the undefined symbol is available at symbol resolution time.

mehdi_amini added inline comments.Apr 18 2017, 6:35 PM
llvm/include/llvm/Object/IRSymtab.h
131

The ModuleSymbolTable is the "source of truth" for a module's symbol table, and the irsymtab is effectively a "cache" of the ModuleSymbolTable (an irsymtab is derived from a ModuleSymbolTable)

Well that is exactly what I qualified right above:

But that's a design decision.

It does not *have to be* this way.

pcc added inline comments.Apr 18 2017, 6:44 PM
llvm/include/llvm/Object/IRSymtab.h
131

I mean sure, it may be possible to redesign how the bitcode reader works so that the symbol table is implicitly "upgraded" as it is read, and the module bitcode reader uses IR entities derived from the "upgraded" symbol table. That may be something to consider exploring if we ever redesign the bitcode format. My gut feeling, though, is that it would be too complicated to be worth it.

tejohnson added inline comments.Apr 20 2017, 1:55 PM
llvm/include/llvm/Object/IRSymtab.h
131

Talking through this with pcc this morning, I now tend to agree that it doesn't make sense to optimize for the case where the llvm revision has changed. If the format of the symbol table changes so that fields like linkage, visibility etc are shared between the irsymtab and the IR, then of course it must be kept in sync with the IR at all times. But at least with the format we have in D31364, where the irsymtab is essentially a cache of the symbol information in the module, this seems fine.

Mehdi, do you still have concerns?

mehdi_amini added inline comments.Apr 20 2017, 3:48 PM
llvm/include/llvm/Object/IRSymtab.h
131

Mehdi, do you still have concerns?

Basically the logic I'm reading here is "the design choice implemented here does not optimize for use-case A so use-case A does not make sense".

This seems completely flawed to me, so I clearly don't understand the reasoning.

pcc added inline comments.Apr 20 2017, 5:14 PM
llvm/include/llvm/Object/IRSymtab.h
131

The choice being made here is not set in stone. We may decide on a different approach later. But it would be best to consider that orthogonally.

Indeed, if we start with the design that you appear to be advocating, it would not only make the code more complicated, but make it that much more difficult to back out of that decision if it turns out to be wrong, due to the need to continuously support upgrading from the old design. It is not clear to me that a design that avoids duplication like you appear to be proposing would be better because the amount of information being duplicated is, frankly, not large (at least compared to names, which are already being shared) and it is unclear how such a design would address module inline asm symbols.

On the other hand, if the design you are advocating turns out to be better, we can move from this design to that one without the need for upgrade support.

Normally I would expect the consumer's revision to be the same as the producer's, so the need to optimize upgrades does not seem strong. That said, I would be prepared to acknowledge such a need. But there are other ways of optimizing upgrades that do not involve creating compatibility constraints. For example, you can imagine writing upgraded bitcode files (or a marker indicating that the file does not require an upgrade) to the ThinLTO cache directory. So I do not want to make any decisions in this space prematurely.

pcc updated this revision to Diff 96631.Apr 25 2017, 2:01 PM
  • Add a missing build dependency
  • Add a version number
pcc abandoned this revision.Jun 6 2017, 6:03 PM

I have split this change into 6 patches with tests: D33969 D33970 D33971 D33972 D33973 D33974