Page MenuHomePhabricator

[GlobalDCE] Dead Virtual Function Elimination
ClosedPublic

Authored by ostannard on Jun 28 2019, 7:37 AM.

Details

Summary

Currently, it is hard for the compiler to remove unused C++ virtual
functions, because they are all referenced from vtables, which are referenced
by constructors. This means that if the constructor is called from any live
code, then we keep every virtual function in the final link, even if there
are no call sites which can use it.

This patch allows unused virtual functions to be removed during LTO (and
regular compilation in limited circumstances) by using type metadata to match
virtual function call sites to the vtable slots they might load from. This
information can then be used in the global dead code elimination pass instead
of the references from vtables to virtual functions, to more accurately
determine which functions are reachable.

To make this transformation safe, I have changed clang's code-generation to
always load virtual function pointers using the llvm.type.checked.load
intrinsic, instead of regular load instructions. I originally tried writing
this using clang's existing code-generation, which uses the llvm.type.test
and llvm.assume intrinsics after doing a normal load. However, it is possible
for optimisations to obscure the relationship between the GEP, load and
llvm.type.test, causing GlobalDCE to fail to find virtual function call
sites.

The existing linkage and visibility types don't accurately describe the scope
in which a virtual call could be made which uses a given vtable. This is
wider than the visibility of the type itself, because a virtual function call
could be made using a more-visible base class. I've added a new
!vcall_visibility metadata type to represent this, described in
TypeMetadata.rst. The internalization pass and libLTO have been updated to
change this metadata when linking is performed.

This doesn't currently work with ThinLTO, because it needs to see every call
to llvm.type.checked.load in the linkage unit. It might be possible to
extend this optimisation to be able to use the ThinLTO summary, as was done
for devirtualization, but until then that combination is rejected in the
clang driver.

To test this, I've written a fuzzer which generates random C++ programs with
complex class inheritance graphs, and virtual functions called through object
and function pointers of different types. The programs are spread across
multiple translation units and DSOs to test the different visibility
restrictions.

I've also tried doing bootstrap builds of LLVM to test this. This isn't
ideal, because only classes in anonymous namespaces can be optimised with
-fvisibility=default, and some parts of LLVM (plugins and bugpoint) do not
work correctly with -fvisibility=hidden. However, there are only 12 test
failures when building with -fvisibility=hidden (and an unmodified compiler),
and this change does not cause any new failures for either value of
-fvisibility.

On the 7 C++ sub-benchmarks of SPEC2006, this gives a geomean code-size
reduction of ~6%, over a baseline compiled with "-O2 -flto
-fvisibility=hidden -fwhole-program-vtables". The best cases are reductions
of ~14% in 450.soplex and 483.xalancbmk, and there are no code size
increases.

I've also run this on a set of 8 mbed-os examples compiled for Armv7M, which
show a geomean size reduction of ~3%, again with no size increases.

I had hoped that this would have no effect on performance, which would allow
it to awlays be enabled (when using -fwhole-program-vtables). However, the
changes in clang to use the llvm.type.checked.load intrinsic are causing ~1%
performance regression in the C++ parts of SPEC2006. It should be possible to
recover some of this perf loss by teaching optimisations about the
llvm.type.checked.load intrinsic, which would make it worth turning this on
by default (though it's still dependent on -fwhole-program-vtables).

Diff Detail

Event Timeline

ostannard created this revision.Jun 28 2019, 7:37 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 28 2019, 7:37 AM
To make this transformation safe, I have changed clang's code-generation to
always load virtual function pointers using the llvm.type.checked.load
intrinsic, instead of regular load instructions. I originally tried writing
this using clang's existing code-generation, which uses the llvm.type.test
and llvm.assume intrinsics after doing a normal load. However, it is possible
for optimisations to obscure the relationship between the GEP, load and
llvm.type.test, causing GlobalDCE to fail to find virtual function call
sites.

Can you please clarify, that is saying that without switching to
llvm.type.checked.load the optimization would be less efficient
since it would trigger less, not that the optimization would trigger
in some cases where it shouldn't trigger, thus miscompiling the code,
correct ?

Without switching to llvm.type.checked.load, the optimisation would trigger in some cases where it shouldn't trigger, thus miscompiling the code. This is because it needs to know about _every_ virtual call site, if we lose the information about some of them then it will incorrectly think a virtual function is unused.

miyuki added a subscriber: miyuki.Jul 1 2019, 7:20 AM
ormris added a subscriber: ormris.Jul 1 2019, 3:42 PM
pcc added a comment.Jul 10 2019, 8:50 AM

Hi Oliver, thanks for working on this. This is something that I've always wanted to do with the type metadata but never had time to work on. I'll try to take a more in depth look later this week.

phosek added a subscriber: phosek.Jul 18 2019, 3:49 PM
pcc added a comment.Jul 29 2019, 8:09 PM

It doesn't seem sound to create the distinction between translation unit and linkage unit visibility and then allow passes to relax visibility. For example, imagine that you have an executable with (compiled with -fvisibility=default):

struct PluginInterface {
  virtual void foo() = 0;
}

struct Plugin1 : PluginInterface {
  virtual void foo();
};

void Plugin1::foo() {
 // ...
}

And a DSO loaded with dlopen:

void call_foo(PluginInterface *plugin) {
  plugin->foo();
}

Note that the lack of attribute means that the vtables have public LTO visibility. Depending on the contents of the DSO, the PluginInterface and Plugin1 vtable symbols may not be referenced at all by the DSO. If the DSO is loaded via dlopen it will definitely not be referenced at link time. If we allow LTO to relax PluginInterface and Plugin1 vtables to what you're calling VCallVisibilityTranslationUnit in the main executable and there are no calls to foo() in the main executable, we will incorrectly drop the definition of Plugin1::foo().

The approach that I can see working involves basically the same structure as CFI and WPD: the LTO visibility is decided by the frontend and not relaxed by passes or LTO. There is only one "flavour" of !vcall_visibility metadata: the one that means the entire hierarchy below the vtable has hidden LTO visibility.

However, the
changes in clang to use the llvm.type.checked.load intrinsic are causing ~1%
performance regression in the C++ parts of SPEC2006.

Have you checked the assembly to see whether an unused CFI check is being emitted?

clang/lib/CodeGen/CGClass.cpp
2826

There should be no need to emit this llvm.assume intrinsic. We can rely on the unspecified behaviour in the case where the second element returned by llvm.type.checked.load is false.

llvm/docs/TypeMetadata.rst
240

s/"public"/"default"/

llvm/lib/Transforms/IPO/GlobalDCE.cpp
256

Not necessarily, at least in this implementation, because "vtable symbol can be internalized" doesn't imply "all vcalls are visible". See main response.

In that example, with everything having default ELF visibility, all of the vtables will get vcall_visibility public, which can't be optimised by VFE, and won't ever be relaxed to one of the tighter visibilities.

The cases which can be optimised are (assuming "visibility" means visibility of the most-visible dynamic base class):

  • Classes in an anonymous namespace, which aren't visible outside their translation unit. Probably doesn't happen very often, but can be optimised without LTO.
  • Classes visible outside the translation unit, but not outside the LTO unit. This generally means hidden ELF visibility, unless the lto_visibility_public attribute is used. These can't be optimised without LTO, but can be with it.

I implemented the second case by adding the LinkageUnit visibility, which can't be optimised by VFE, but can be reduced to TranslationUnit when LTO internalisation happens. This could also have been done by not changing the visibility at LTO time, and instead leting GlobalDCE know if it is running post-LTO. Both of these should give the same results, but the way I used represents the visibility fully in the IR, without having the extra state of "are we doing LTO?".

I also don't think it matters here whether the DSO is loaded by dlopen or by being linked against it. Is there something I've missed here, or was dlopen not relevant to your example.

Have you checked the assembly to see whether an unused CFI check is being emitted?

Not yet, I wanted to make sure that this optimisation was valid and correctly implemented before tracking down the performance regressions.

llvm/lib/Transforms/IPO/GlobalDCE.cpp
256

This is checking the vcall_visibility, which will only be VCallVisibilityTranslationUnit if all base classes which contain virtual functions are private to this translation unit, which does imply "all vcalls are visible".

ostannard updated this revision to Diff 212370.Jul 30 2019, 9:45 AM
ostannard marked 2 inline comments as done.
  • Rebase
  • Don't emit llvm.assume when not necessary (we already weren't checking for it's presence in GlobalDCE)
  • s/"public"/"default"/ in IR docs
pcc added a comment.Jul 30 2019, 10:24 AM

In that example, with everything having default ELF visibility, all of the vtables will get vcall_visibility public, which can't be optimised by VFE, and won't ever be relaxed to one of the tighter visibilities.

The cases which can be optimised are (assuming "visibility" means visibility of the most-visible dynamic base class):

  • Classes in an anonymous namespace, which aren't visible outside their translation unit. Probably doesn't happen very often, but can be optimised without LTO.
  • Classes visible outside the translation unit, but not outside the LTO unit. This generally means hidden ELF visibility, unless the lto_visibility_public attribute is used. These can't be optimised without LTO, but can be with it.

    I implemented the second case by adding the LinkageUnit visibility, which can't be optimised by VFE, but can be reduced to TranslationUnit when LTO internalisation happens. This could also have been done by not changing the visibility at LTO time, and instead leting GlobalDCE know if it is running post-LTO. Both of these should give the same results, but the way I used represents the visibility fully in the IR, without having the extra state of "are we doing LTO?".

Okay, I somehow missed that the relaxation to TranslationUnit was guarded on it being LinkageUnit to start with. So at least from that perspective I don't see any soundness issues. It still doesn't seem like a good idea to do the relaxation in LTO though, since the visibility-outside-of-LTO of the vtable symbols is not a reliable indicator whether the type is used externally (I think that's what confused me to begin with). Here are a couple of cases where things might go wrong:

  • Take the example from my earlier message, give the "main executable" and "DSO" hidden visibility, build the "main executable" with LTO and the "DSO" without LTO, and statically link them both into the same executable. We run into the same problem where the Plugin1 vtable is potentially not referenced and thus misoptimised. Yes, this is a violation of the LTO visibility rules, but the example shows that we only detect it sometimes. I think that if we did want to detect cases where the LTO visibility rules are clearly being violated, the outcome should be to issue a diagnostic and not to silently proceed with optimizations disabled, since the violation might be masking other undetected issues. That really seems orthogonal to this patch, though.
  • Imagine linking part of a program with ld -r with LTO -- all symbols including vtable symbols will appear to be externally visible, which is not necessarily a violation of the LTO visibility rules because they may not be used externally in the final link. So we end up pessimising unnecessarily.

I gave this some more thought and it seems that the frontend has all of the information required to make a determination about whether to allow VFE, without needing to teach LTO to relax visibility. We can make the frontend do this:

  • If -flto was passed (see CodeGenOptions::PrepareForLTO), attach "allow VFE" metadata if class has hidden LTO visibility.
  • Otherwise, attach "allow VFE" metadata if class is not isExternallyVisible.

Now the behaviour of GlobalDCE can be made conditional on the "allow VFE" metadata alone. This also has the advantage of simplifying the IR by making "allow VFE" a boolean rather than a tri-state as it is now.

llvm/lib/Transforms/IPO/GlobalDCE.cpp
256

Ack.

  • Take the example from my earlier message, give the "main executable" and "DSO" hidden visibility, build the "main executable" with LTO and the "DSO" without LTO, and statically link them both into the same executable. We run into the same problem where the Plugin1 vtable is potentially not referenced and thus misoptimised. Yes, this is a violation of the LTO visibility rules, but the example shows that we only detect it sometimes. I think that if we did want to detect cases where the LTO visibility rules are clearly being violated, the outcome should be to issue a diagnostic and not to silently proceed with optimizations disabled, since the violation might be masking other undetected issues. That really seems orthogonal to this patch, though.

As you say, this is a violation of the LTO visibility rules, so [[clang::lto_visibility_public]] must be used. I've gated this optimisation behind -fwhole-program-vtables, which if I've understood https://clang.llvm.org/docs/LTOVisibility.html correctly, is effectively the user asserting to the compiler that they've followed the LTO visility rules correctly.

  • Imagine linking part of a program with ld -r with LTO -- all symbols including vtable symbols will appear to be externally visible, which is not necessarily a violation of the LTO visibility rules because they may not be used externally in the final link. So we end up pessimising unnecessarily.

I'm not really familiar with partial linking, is there any good documentation on how it is supposed to interact with LTO?

I've tried to make the changed to vcall_visibility as similar to the normal LTO internalisation process as possible, it happens at the same time and under the same conditions. If partial linking prevents normal internalisation, which will in turn prevent most of the existing optimisations done by GlobalDCE, then surely it should also prevent internalisation of vcall_visibility?

I gave this some more thought and it seems that the frontend has all of the information required to make a determination about whether to allow VFE, without needing to teach LTO to relax visibility. We can make the frontend do this:

  • If -flto was passed (see CodeGenOptions::PrepareForLTO), attach "allow VFE" metadata if class has hidden LTO visibility.
  • Otherwise, attach "allow VFE" metadata if class is not isExternallyVisible.

    Now the behaviour of GlobalDCE can be made conditional on the "allow VFE" metadata alone. This also has the advantage of simplifying the IR by making "allow VFE" a boolean rather than a tri-state as it is now.

The problem with that is that GlobalDCE is still run in the pre-linking pipeline when emiting an LTO object, because it can remove code/data which already has internal linkage. To make this work, we'd have to do one of:

  • Tell GlobalDCE whether it is running in LTO post-linking or not, so it can avoid doing VFE before internalisation. This breaks the idea that the IR contains all the information needed to determine if an optimisation is legal or not.
  • Remove the pre-linking runs of GlobalDCE when doing LTO. This will result in more known-dead code reaching the LTO link stage, and results in a pass which is only valid at certain points in the pipeline.
  • Split VFE out of GlobalDCE. This would result in worse optimisation, because chains of dependencies involving both virtual calls and regular references won't be followed fully.
pcc added a comment.Jul 31 2019, 10:58 AM
  • Take the example from my earlier message, give the "main executable" and "DSO" hidden visibility, build the "main executable" with LTO and the "DSO" without LTO, and statically link them both into the same executable. We run into the same problem where the Plugin1 vtable is potentially not referenced and thus misoptimised. Yes, this is a violation of the LTO visibility rules, but the example shows that we only detect it sometimes. I think that if we did want to detect cases where the LTO visibility rules are clearly being violated, the outcome should be to issue a diagnostic and not to silently proceed with optimizations disabled, since the violation might be masking other undetected issues. That really seems orthogonal to this patch, though.

As you say, this is a violation of the LTO visibility rules, so [[clang::lto_visibility_public]] must be used. I've gated this optimisation behind -fwhole-program-vtables, which if I've understood https://clang.llvm.org/docs/LTOVisibility.html correctly, is effectively the user asserting to the compiler that they've followed the LTO visility rules correctly.

Yes, that's correct.

  • Imagine linking part of a program with ld -r with LTO -- all symbols including vtable symbols will appear to be externally visible, which is not necessarily a violation of the LTO visibility rules because they may not be used externally in the final link. So we end up pessimising unnecessarily.

I'm not really familiar with partial linking, is there any good documentation on how it is supposed to interact with LTO?

I'm not aware of any. The way that I think about it is that a partial link operation with LTO inputs can still produce an LTO unit, but that LTO unit may not be combined with other LTO units, either by passing bitcode files directly to the final link or by adding another ld -r output with an LTO unit to the final link.

I've tried to make the changed to vcall_visibility as similar to the normal LTO internalisation process as possible, it happens at the same time and under the same conditions. If partial linking prevents normal internalisation, which will in turn prevent most of the existing optimisations done by GlobalDCE, then surely it should also prevent internalisation of vcall_visibility?

Partial linking will indeed prevent dropping the virtual functions, but it should not prevent clearing the pointer to the virtual function in the vtable. The linker should then be able to drop the virtual function body as part of --gc-sections during the final link.

I gave this some more thought and it seems that the frontend has all of the information required to make a determination about whether to allow VFE, without needing to teach LTO to relax visibility. We can make the frontend do this:

  • If -flto was passed (see CodeGenOptions::PrepareForLTO), attach "allow VFE" metadata if class has hidden LTO visibility.
  • Otherwise, attach "allow VFE" metadata if class is not isExternallyVisible.

    Now the behaviour of GlobalDCE can be made conditional on the "allow VFE" metadata alone. This also has the advantage of simplifying the IR by making "allow VFE" a boolean rather than a tri-state as it is now.

The problem with that is that GlobalDCE is still run in the pre-linking pipeline when emiting an LTO object, because it can remove code/data which already has internal linkage.

Right, of course. Sorry, I forgot about that. It seems that if we want to allow VFE on internal linkage types without LTO, we would still need the tri-state.

To make this work, we'd have to do one of:

  • Tell GlobalDCE whether it is running in LTO post-linking or not, so it can avoid doing VFE before internalisation. This breaks the idea that the IR contains all the information needed to determine if an optimisation is legal or not.
  • Remove the pre-linking runs of GlobalDCE when doing LTO. This will result in more known-dead code reaching the LTO link stage, and results in a pass which is only valid at certain points in the pipeline.
  • Split VFE out of GlobalDCE. This would result in worse optimisation, because chains of dependencies involving both virtual calls and regular references won't be followed fully.

I think I would favour something closer to your first suggestion, but instead of telling GlobalDCE specifically about this, we represent the "post-link" flag in the IR (e.g. as a module flag) in order to keep the IR self-contained. LTO would then be taught to add this flag at the right time, and the logic inside GlobalDCE would be:

  • If post-link flag not set, allow VFE if linkage <= TranslationUnit.
  • If post-link flag set, allow VFE if linkage <= LinkageUnit.

This would also help address a correctness issue with the CFI and WPD passes, which is that it is currently unsound to run them at compile time. If we let them run at compile time, we would in principle be able to do CFI and WPD on internal linkage types without LTO.

Partial linking will indeed prevent dropping the virtual functions, but it should not prevent clearing the pointer to the virtual function in the vtable. The linker should then be able to drop the virtual function body as part of --gc-sections during the final link.

If partial linking isn't doing internalisation, I'd expect that to prevent a lot of other LTO optimisations, not just VFE. Is this a common use-case which we need to care about?

I think I would favour something closer to your first suggestion, but instead of telling GlobalDCE specifically about this, we represent the "post-link" flag in the IR (e.g. as a module flag) in order to keep the IR self-contained. LTO would then be taught to add this flag at the right time, and the logic inside GlobalDCE would be:

  • If post-link flag not set, allow VFE if linkage <= TranslationUnit.
  • If post-link flag set, allow VFE if linkage <= LinkageUnit.

    This would also help address a correctness issue with the CFI and WPD passes, which is that it is currently unsound to run them at compile time. If we let them run at compile time, we would in principle be able to do CFI and WPD on internal linkage types without LTO.

Ok, sounds reasonable, though I suspect WPD and CFI will need a slightly different definition of type visibility - they care about the possibility of unseen code adding new derived classes, so the visibility of base classes isn't important, and they might be able to make use of the final specifier.

ostannard updated this revision to Diff 212833.Aug 1 2019, 9:05 AM
  • Add LTOPostLink metadata, instead of internalising vcall visibility at LTO time
pcc added a comment.Aug 19 2019, 4:49 PM

Partial linking will indeed prevent dropping the virtual functions, but it should not prevent clearing the pointer to the virtual function in the vtable. The linker should then be able to drop the virtual function body as part of --gc-sections during the final link.

If partial linking isn't doing internalisation, I'd expect that to prevent a lot of other LTO optimisations, not just VFE. Is this a common use-case which we need to care about?

It isn't that common, but it seems worth doing if it can be done easily.

That said, I note that it does appear that your implementation will end up preserving the pointer in the vtable in this case because you're relying on the use list to make decisions about what to GC. So it doesn't seem easy to do at this point, but if for example we made this compatible with ThinLTO at some point we would probably not be able to rely on the use list, and the resulting changes to this feature might make this easier to do.

I think I would favour something closer to your first suggestion, but instead of telling GlobalDCE specifically about this, we represent the "post-link" flag in the IR (e.g. as a module flag) in order to keep the IR self-contained. LTO would then be taught to add this flag at the right time, and the logic inside GlobalDCE would be:

  • If post-link flag not set, allow VFE if linkage <= TranslationUnit.
  • If post-link flag set, allow VFE if linkage <= LinkageUnit.

    This would also help address a correctness issue with the CFI and WPD passes, which is that it is currently unsound to run them at compile time. If we let them run at compile time, we would in principle be able to do CFI and WPD on internal linkage types without LTO.

Ok, sounds reasonable, though I suspect WPD and CFI will need a slightly different definition of type visibility - they care about the possibility of unseen code adding new derived classes, so the visibility of base classes isn't important, and they might be able to make use of the final specifier.

Internal linkage types use distinct metadata (i.e. metadata that can't be merged with other TUs), so there's no possibility of a new derived class being added. I suspect that we could make these passes check the distinctness of metadata if the post-link flag is not set, but that probably deserves more thought.

llvm/lib/Transforms/IPO/GlobalDCE.cpp
409

Could be just ConstantPointerNull::get(F->getType()).

llvm/test/LTO/ARM/lto-linking-metadata.ll
3

Could you add a test for the new LTO API as well as the legacy one?

llvm/test/Transforms/GlobalDCE/virtual-functions-derived-pointer-call.ll
31

Remove word "or".

llvm/test/Transforms/GlobalDCE/virtual-functions-visibility-post-lto.ll
14

Could you simplify this test (and others) by removing the RTTI, please? We should also have a test showing that RTTI is preserved.

ostannard marked 4 inline comments as done.Sep 2 2019, 7:16 AM

It isn't that common, but it seems worth doing if it can be done easily.

That said, I note that it does appear that your implementation will end up preserving the pointer in the vtable in this case because you're relying on the use list to make decisions about what to GC. So it doesn't seem easy to do at this point, but if for example we made this compatible with ThinLTO at some point we would probably not be able to rely on the use list, and the resulting changes to this feature might make this easier to do.

Ok, I think that it makes sense to leave this for a separate patch, as long as we currently generate correct code. I've added partial linking of the LTO unit to my fuzzer, and haven't found any problems.

ostannard updated this revision to Diff 218363.Sep 2 2019, 7:17 AM

we represent the "post-link" flag in the IR (e.g. as a module flag) in order to keep the IR self-contained.

This makes the IR self-contained, which is good, but it also make the interpretation of the IR modal, which isn't great. It means that suddenly the rules of interpretation of what is valid to do or not changes according to this module flag.

This makes the IR self-contained, which is good, but it also make the interpretation of the IR modal, which isn't great. It means that suddenly the rules of interpretation of what is valid to do or not changes according to this module flag.

I think the original version was better from that perspective - most compiler passes only need to check for one value of the attribute, and only the linker needed to care about the difference between link-unit and public visibility. Do you think we should go back to that design, or do you have a different idea?

This makes the IR self-contained, which is good, but it also make the interpretation of the IR modal, which isn't great. It means that suddenly the rules of interpretation of what is valid to do or not changes according to this module flag.

I think the original version was better from that perspective - most compiler passes only need to check for one value of the attribute, and only the linker needed to care about the difference between link-unit and public visibility.

The *linker* needed to care about the difference is fine, but that's different from a *compiler pass* during LTO. In general it is an important design point in LLVM so far to encode these information in the IR (for instance the linkage type of globals would be changed and not the interpretation of these with a flag).

Do you think we should go back to that design, or do you have a different idea?

I didn't review the original version, but pcc@ did apparently, so I'm willing to trust them on what's the best way forward here.

pcc accepted this revision.Oct 10 2019, 4:01 PM

LGTM

This revision is now accepted and ready to land.Oct 10 2019, 4:01 PM
This revision was automatically updated to reflect the committed changes.

@ostannard Maybe you could add that to the release notes? Thanks