Page MenuHomePhabricator

[libc++] Simplify how we pick the typeinfo comparison
ClosedPublic

Authored by ldionne on Nov 16 2020, 3:20 PM.

Details

Reviewers
EricWF
Group Reviewers
Restricted Project
Commits
rGbe00e8893fdb: [libc++] Clarify how we pick the typeinfo comparison
Summary

Selecting the right implementation for comparing typeinfos is not something
that we should configure in the config site. It's specific to the ABI in
use, and it should be detected automatically by the library. It has been
a source of confusion and bugs ever since we've introduced that.

Furthermore, it's really weird to set the comparison implementation
explicitly, yet for that choice not to be honored on arm64/Apple.

This commit removes the ability to explicitly select the implementation
of typeinfo comparison, and instead relies on the existing logic to select
the right one based on the ABI.

Diff Detail

Event Timeline

ldionne created this revision.Nov 16 2020, 3:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2020, 3:20 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
ldionne requested review of this revision.Nov 16 2020, 3:20 PM

We have a large internal codebase that I migrated from libstdc++ to libc++. libstdc++ uses relaxed typeinfo comparisons, and I found that just naively switching to libc++ produced hundreds of exception-related issues. Configuring libc++ to use the non-unique implementation was by far the easiest way to get this codebase working. I'm aware that in most cases, there is a proper fix for the exception issues, but (a) it's really hard to justify the large time investment required to fix all the different issues, and (b) there's some cases involving dlopen and RTLD_LOCAL that I believe are impossible to get right with the unique implementation (and I think that was the motivation for libstdc++ to switch to relaxed typeinfo comparisons).

Getting rid of the configurability seems problematic at least for ELF, where there's good reasons to want to use either the unique or non-unique implementations.

We have a large internal codebase that I migrated from libstdc++ to libc++. libstdc++ uses relaxed typeinfo comparisons, and I found that just naively switching to libc++ produced hundreds of exception-related issues. Configuring libc++ to use the non-unique implementation was by far the easiest way to get this codebase working. I'm aware that in most cases, there is a proper fix for the exception issues, but (a) it's really hard to justify the large time investment required to fix all the different issues, and (b) there's some cases involving dlopen and RTLD_LOCAL that I believe are impossible to get right with the unique implementation (and I think that was the motivation for libstdc++ to switch to relaxed typeinfo comparisons).

Getting rid of the configurability seems problematic at least for ELF, where there's good reasons to want to use either the unique or non-unique implementations.

Interesting, thanks for the information. This is the sort of feedback I was looking for when putting up this review.

In that case, what I'll do instead is clarify that LIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION allows unconditionally overriding the implementation of typeinfo used, and have the non unique ARM implementation be _LIBCPP_TYPEINFO_COMPARISON_IMPLEMENTATION == 3 like in this patch. It'll get rid of most of the confusing-ness while still letting you override it.

We have a large internal codebase that I migrated from libstdc++ to libc++. libstdc++ uses relaxed typeinfo comparisons, and I found that just naively switching to libc++ produced hundreds of exception-related issues. Configuring libc++ to use the non-unique implementation was by far the easiest way to get this codebase working. I'm aware that in most cases, there is a proper fix for the exception issues, but (a) it's really hard to justify the large time investment required to fix all the different issues, and (b) there's some cases involving dlopen and RTLD_LOCAL that I believe are impossible to get right with the unique implementation (and I think that was the motivation for libstdc++ to switch to relaxed typeinfo comparisons).

Getting rid of the configurability seems problematic at least for ELF, where there's good reasons to want to use either the unique or non-unique implementations.

Interesting, thanks for the information. This is the sort of feedback I was looking for when putting up this review.

In that case, what I'll do instead is clarify that LIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION allows unconditionally overriding the implementation of typeinfo used, and have the non unique ARM implementation be _LIBCPP_TYPEINFO_COMPARISON_IMPLEMENTATION == 3 like in this patch. It'll get rid of most of the confusing-ness while still letting you override it.

Sounds great, thank you!

ldionne updated this revision to Diff 305838.Nov 17 2020, 9:58 AM

Still allow customizing the implementation

Thanks!

libcxx/docs/BuildingLibcxx.rst
381

Do you still wanna be removing this documentation?

ldionne added inline comments.Nov 17 2020, 1:25 PM
libcxx/docs/BuildingLibcxx.rst
381

That's a tough one. I kinda do.

I've been trying to avoid duplicate information in libc++, since that gets out of sync, etc. The set of options documented here does not reflect the real set of options provided in the CMake files, so I thought we might as well just have a single source of truth. For that reason, I tried to beef up the documentation of the option in libcxx/CMakeLists.txt itself.

Do you have an opinion about that?

smeenai added inline comments.Nov 17 2020, 1:28 PM
libcxx/docs/BuildingLibcxx.rst
381

That makes sense to me; the documentation in the CMakeLists.txt and the header looks good.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 18 2020, 1:58 PM
This revision was automatically updated to reflect the committed changes.
shafik added a subscriber: shafik.Nov 18 2020, 2:34 PM

I think your CMakeLists.txt broke the green dragon build: http://lab.llvm.org:8080/green/view/LLDB/job/lldb-cmake/24893/consoleFull#349140555e9a0fee5-ebcc-4238-a641-c5aa112c323e

It is not obvious to me what the problem is though.

I think your CMakeLists.txt broke the green dragon build: http://lab.llvm.org:8080/green/view/LLDB/job/lldb-cmake/24893/consoleFull#349140555e9a0fee5-ebcc-4238-a641-c5aa112c323e

It is not obvious to me what the problem is though.

I am experiencing the same error.

You need to remove your build directory before running CMake. The issue is that CMake cache variables don't get overwritten with new values unless the cache is changed manually. If you don't remove the cache, it gets stale, and in this case, it's wrong.

Generally speaking, any CI job that doesn't either remove the whole build directory and start from scratch, or at least remove the CMake cache, is bound to be flaky and break in this way. This is a problem of the build job, not of this change.

Note that if you're trying to setup an incremental build and save on full rebuilds, I think it's better to use something like ccache than trying to keep the build directory around.

So if you want to fix the Green Dragon issue, just remove the build directory. But I would advise fixing the root cause because this sort of failure may happen again, either with libcxx changes or any other change that falls in the same pattern.

This seems to have broken no-op rebuild w/ existing build dir:

CMake Error at /repositories/llvm-project/libcxx/CMakeLists.txt:187 (message):
  Value '' is not a valid value for

                         LIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION


-- Configuring incomplete, errors occurred!

It seems to work after i manually specify -DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION=default, but still..

fhahn added a subscriber: fhahn.Nov 19 2020, 1:52 AM

This seems to have broken no-op rebuild w/ existing build dir:

CMake Error at /repositories/llvm-project/libcxx/CMakeLists.txt:187 (message):
  Value '' is not a valid value for

                         LIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION


-- Configuring incomplete, errors occurred!

It seems to work after i manually specify -DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION=default, but still..

Same for me.

@ldionne would it be possible to adjust the default so that it doesn't break re-builds & works without explicitly having to set DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION?

This seems to have broken no-op rebuild w/ existing build dir:

CMake Error at /repositories/llvm-project/libcxx/CMakeLists.txt:187 (message):
  Value '' is not a valid value for

                         LIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION


-- Configuring incomplete, errors occurred!

It seems to work after i manually specify -DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION=default, but still..

Same for me.

@ldionne would it be possible to adjust the default so that it doesn't break re-builds & works without explicitly having to set DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION?

Sorry, but no. The whole point of this patch is partly to make the default explicit. We've had bugs related to defining this as an empty string (which it was previously), which is what would appear to "satisfy" rebuilds.

I already said the bug was that these bots are not cleaning their build directory (or at least their build cache), and I stand by that.

This seems to have broken no-op rebuild w/ existing build dir:

Please read two comments above yours. I explained the issue.

You need to remove your build directory before running CMake. The issue is that CMake cache variables don't get overwritten with new values unless the cache is changed manually. If you don't remove the cache, it gets stale, and in this case, it's wrong.

Generally speaking, any CI job that doesn't either remove the whole build directory and start from scratch, or at least remove the CMake cache, is bound to be flaky and break in this way. This is a problem of the build job, not of this change.

Note that if you're trying to setup an incremental build and save on full rebuilds, I think it's better to use something like ccache than trying to keep the build directory around.

LLVM has a bunch of incremental bots and in addition to trying to be be fast, they also mirror what developers do when they pull. I can only speak for myself, but I certainly don't remove the CMake cache every time I do that. I've been involved in a bunch of CMake-related code reviews and as this is a pretty common problem, which I've seen people address by introducing a temporary CMake variable to let the bots (and developers) catch up. Is there any reason that couldn't have worked here?

FWIW if you disagree with how the bots deal with incremental builds, you're of course free to change that, but in the meantime the burden is still on the author of the patch and anyone could've reverted this in accordance with LLVM's "revert-to-green" policy.

[...]

LLVM has a bunch of incremental bots and in addition to trying to be be fast, they also mirror what developers do when they pull. I can only speak for myself, but I certainly don't remove the CMake cache every time I do that. I've been involved in a bunch of CMake-related code reviews and as this is a pretty common problem, which I've seen people address by introducing a temporary CMake variable to let the bots (and developers) catch up. Is there any reason that couldn't have worked here?

The only trick I'm aware of is to temporarily use FORCE in the CMake set() command to force the cache to be overwritten. I've used that trick in the past to avoid exactly this kind of breakage, however in this case that would also break any build that explicitly sets that CMake variable (i.e. -DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION=XXX). So the reason why this broke bots is not that I didn't care about it, it's that I thought it was inevitable.

Now that you mention a temporary variable, I think I know where you're going. Do you mean:

  1. Introduce a new cache variable with a different name, and make the code use it. That new variable should be initialized to the old one if it's set, to accommodate bots that set the old name explicitly.
  2. set(FORCE) the old variable name to whatever you want.
  3. Wait for the bots to catch up (like 1 day or so).
  4. Move the code back to the original variable name, which will now have the correct default set by the set(FORCE).
  5. Delete the temporary variable name, which is not useful anymore.

That sounds like it *might* work, however it's kind of complicated too, no? It seems to me like it would be better to acknowledge that CMake caches are invalidated from time to time, and make sure our infrastructure is robust against that.

FWIW if you disagree with how the bots deal with incremental builds, you're of course free to change that, but in the meantime the burden is still on the author of the patch and anyone could've reverted this in accordance with LLVM's "revert-to-green" policy.

I don't disagree with how they deal with incremental builds -- I just think we should be fine with a bit of human intervention from time to time if we're going to assume the cache never gets invalidated. And at the same time, I suggested a technical approach (full build + ccache) to make the bots more robust so that these issues don't happen again in the future.

I don't disagree with how they deal with incremental builds -- I just think we should be fine with a bit of human intervention from time to time if we're going to assume the cache never gets invalidated. And at the same time, I suggested a technical approach (full build + ccache) to make the bots more robust so that these issues don't happen again in the future.

If you foresee the need for possible human intervention maybe the discussion about that should have happened before the change was landed? Then at least folks would have been prepared to fix the problem right after the it landed, assuming they agreed on that approach.

I don't disagree with how they deal with incremental builds -- I just think we should be fine with a bit of human intervention from time to time if we're going to assume the cache never gets invalidated. And at the same time, I suggested a technical approach (full build + ccache) to make the bots more robust so that these issues don't happen again in the future.

If you foresee the need for possible human intervention maybe the discussion about that should have happened before the change was landed? Then at least folks would have been prepared to fix the problem right after the it landed, assuming they agreed on that approach.

TBH, I didn't know which bots, if any, would break. We've moved some of the incremental builds to ccache + full build in the past, and I don't have a full mental map of all the bots in LLVM (I'm sure you don't, either). What I meant above is that I didn't purposefully break the build because I didn't care -- I thought it was inevitable (and it's still not clear to me it's not), and I wasn't sure whether it would actually break bots. When asked whether there was a technical way to unbreak the cache, I replied "no", which is the answer I thought was correct at the time (@JDevlieghere can LMK if I got his suggestion right).

Is it possible we're making this way bigger than it needs to be? If there's an agreed-upon way of making these CMake cache-affecting changes, then let's document it and be done with it. I'll be happy to do it in the future. If there's no such technical way, let's fix the bots to make sure they're robust to these changes, and somehow find a way to avoid breaking people's build as well when they git pull.

[...]

LLVM has a bunch of incremental bots and in addition to trying to be be fast, they also mirror what developers do when they pull. I can only speak for myself, but I certainly don't remove the CMake cache every time I do that. I've been involved in a bunch of CMake-related code reviews and as this is a pretty common problem, which I've seen people address by introducing a temporary CMake variable to let the bots (and developers) catch up. Is there any reason that couldn't have worked here?

The only trick I'm aware of is to temporarily use FORCE in the CMake set() command to force the cache to be overwritten. I've used that trick in the past to avoid exactly this kind of breakage, however in this case that would also break any build that explicitly sets that CMake variable (i.e. -DLIBCXX_TYPEINFO_COMPARISON_IMPLEMENTATION=XXX). So the reason why this broke bots is not that I didn't care about it, it's that I thought it was inevitable.

Now that you mention a temporary variable, I think I know where you're going. Do you mean:

  1. Introduce a new cache variable with a different name, and make the code use it. That new variable should be initialized to the old one if it's set, to accommodate bots that set the old name explicitly.
  2. set(FORCE) the old variable name to whatever you want.
  3. Wait for the bots to catch up (like 1 day or so).
  4. Move the code back to the original variable name, which will now have the correct default set by the set(FORCE).
  5. Delete the temporary variable name, which is not useful anymore.

That sounds like it *might* work, however it's kind of complicated too, no?

Yep, at least that's how we dealt with this for LLDB in the past.

It seems to me like it would be better to acknowledge that CMake caches are invalidated from time to time, and make sure our infrastructure is robust against that.

I totally agree. Some of the bots allow you to specify a variable (which admittedly is totally obscure) to request a clean build which is something else I've used in the past (when I had already landed something without the workaround and wanted to avoid reverting).

FWIW if you disagree with how the bots deal with incremental builds, you're of course free to change that, but in the meantime the burden is still on the author of the patch and anyone could've reverted this in accordance with LLVM's "revert-to-green" policy.

I don't disagree with how they deal with incremental builds -- I just think we should be fine with a bit of human intervention from time to time if we're going to assume the cache never gets invalidated. And at the same time, I suggested a technical approach (full build + ccache) to make the bots more robust so that these issues don't happen again in the future.

I think that's fair, but if you expect something like this to break, it might be nice to land it at a time where you can work with the bot owners to get it resolved quickly.

I don't disagree with how they deal with incremental builds -- I just think we should be fine with a bit of human intervention from time to time if we're going to assume the cache never gets invalidated. And at the same time, I suggested a technical approach (full build + ccache) to make the bots more robust so that these issues don't happen again in the future.

If you foresee the need for possible human intervention maybe the discussion about that should have happened before the change was landed? Then at least folks would have been prepared to fix the problem right after the it landed, assuming they agreed on that approach.

TBH, I didn't know which bots, if any, would break. We've moved some of the incremental builds to ccache + full build in the past, and I don't have a full mental map of all the bots in LLVM (I'm sure you don't, either). What I meant above is that I didn't purposefully break the build because I didn't care -- I thought it was inevitable (and it's still not clear to me it's not), and I wasn't sure whether it would actually break bots. When asked whether there was a technical way to unbreak the cache, I replied "no", which is the answer I thought was correct at the time (@JDevlieghere can LMK if I got his suggestion right).

Not that I know of, other than reverting and/or introducing the differently named (temporary) variable.

Is it possible we're making this way bigger than it needs to be? If there's an agreed-upon way of making these CMake cache-affecting changes, then let's document it and be done with it. I'll be happy to do it in the future. If there's no such technical way, let's fix the bots to make sure they're robust to these changes, and somehow find a way to avoid breaking people's build as well when they git pull.

It was not my intention to assign blame, I'm sorry if it came across that way. My goal was to make some suggestions on how this could've been avoided or at least less obtrusive in the future. I think documenting this might be worthwhile, although it doesn't happen that often, it can be annoying for a bunch of people.