This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ][z/OS] ASCII/EBCDIC support for libc++
AbandonedPublic

Authored by muiez on Oct 7 2021, 9:29 AM.

Details

Summary

The aim of this patch is to provide ASCII/EBCDIC support for libc++ on z/OS and start a discussion about the approach. The approach taken involves using different inline namespaces for the ASCII/EBCDIC versions of the functions, and having implementations based on whether __NATIVE_ASCII_F is defined. More details are in the design document.

It should be noted that this patch does not include the script to build libc++ on z/OS in both 32 and 64-bit mode (and combining the A/E in a shared library). This will be in a separate patch.

Nonetheless, this patch cannot yet land because of a dependency on an underlying compiler feature (pragma map), which is in the works of being upstreamed.

Diff Detail

Event Timeline

muiez created this revision.Oct 7 2021, 9:29 AM
muiez requested review of this revision.Oct 7 2021, 9:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2021, 9:29 AM
Herald added 1 blocking reviewer(s): Restricted Project. · View Herald Transcript

This is above my pay grade, but FWIW, this looks to me like a huge unmaintainable complication. It feels bad to say "you must take on the huge burden of maintaining this out-of-tree," but IMHO it would be worse for libc++ to say "libc++ should take on the burden of maintaining this in-tree."
If @ldionne feels the same, then perhaps a compromise approach would be to propose merely granularizing the src/ files enough to segregate "code affected by this patch" from "code that isn't." Right now the PR is a maze of #ifdefs; I assume it would be easier for you to maintain an out-of-tree patch if it was just "swap out these few specific files."

SeanP added a comment.Oct 8 2021, 12:17 PM

@Quuxplusone what parts of this do you feel are unmaintainable? We are open to ideas on how to address any concerns you have.

I see the changes break down into these groups:

  • adding the namespace wrappers - these are simple additions that shouldn't impact anyone outside of z/OS
  • platform specializations - these have #if's but always (except for two table definitions) occur within a block handling specialization for other platforms too
  • code wrapped in _LIBCPP_ADDITIONAL_CHAR_ENCODING - I like the idea of moving these to separate files. It was one alterative we considered
  • error messages - yes these add a bunch of #if's. This sounds like the code that alarmed you. We have a better solution being worked on. That solution replaces the string literals with passing an enum to a macro. We can remove these changes and promote the new solution when ready. The new solution gets rid of all of those #if's.

I really don't want to create a separate copy of the library as that will just be forking it. We would rather point customers to one place to get their C++ library.

SeanP added inline comments.Oct 8 2021, 12:25 PM
libcxx/docs/DesignDocs/MultipleCharEncodings.rst
90

We have a better way of doing this that is being worked on now.

In a nutshell code like:

return string("The associated promise has been destructed prior "
                      "to the associated state becoming ready.");

will be replaced with:

return string(getMessage(_BROKEN_PROMISE));

and getMessage will expand out to an array of the strings index.

This will eliminate all of the #if's. I also see a potential benefit that this could be the ground work for enabling internationalization of the runtime messages.

muiez updated this revision to Diff 382352.Oct 26 2021, 8:49 AM

Improved solution for error messages, eliminate #ifs.

muiez updated this revision to Diff 382366.Oct 26 2021, 9:23 AM
muiez marked an inline comment as done.

Update design doc to account for alternate solution to translate libcxx err messages based on char-mode.

muiez added inline comments.Oct 26 2021, 9:25 AM
libcxx/docs/DesignDocs/MultipleCharEncodings.rst
90

Updated the diff to include this better way of translating error messages.

libcxx/include/filesystem
910

My comment about "merely granularizing", concretely, would be something like "pull out lines 910–1518 into a new header, libcxx/include/__filesystem/somemeaningfulname.h," so that we didn't have headers that flipped back and forth between CES and non-CES "modes"; every header would be either all-under-CES or never-mention-CES.

I'm not saying that you clearly should do that granularizing. If ldionne or whoever didn't like that direction, or didn't like the PR regardless, then I admit it would be wasted work on your part. However, if you have the cycles to spare, I think it would be at least, interesting to see that approach proof-of-concepted.

Also, if you did that, then I bet you could make combined macros _LIBCPP_BEGIN_NAMESPACE_STD_CES and _LIBCPP_END_NAMESPACE_STD_CES, because you'd never be opening the CES namespace without also opening _VSTD. Right?

muiez added a comment.Nov 5 2021, 10:15 AM

@ldionne, any thoughts on the approach and/or discussion so far? Your insight would be appreciated.

Thanks for the detailed design document, this was really helpful in understanding what this patch is about.

However, I share @Quuxplusone 's concerns. This is adding a lot of complexity, especially the part about deciding whether a function should be in an encoding-dependent namespace or not. This means that whenever we add new code, we'll need to ask ourselves the question "oh, should this be in a CES namespace or not"? This is on the same level of annoyance as the _LIBCPP_INLINE_VISIBILITY macro we have everywhere (but wait, not quite everywhere, which is why it's so evil), which is an enormous mess and a barrier to contributions, a source of bugs, and a source of confusion all the time. I am concerned by the approach taken here, as I feel it's adding just another _LIBCPP_INLINE_VISIBILITY.

I suspect there are alternate solutions to supporting multiple encodings without mangling the library like this. For starters, it seems like the namespacing is primarily aimed at reducing code duplication when you link two libc++s built with different character encodings, instead of using the existing knob for tweaking the namespace, i.e. defining _LIBCPP_ABI_NAMESPACE. I believe you could build the whole library twice, once with _LIBCPP_ABI_NAMESPACE=__ascii and once with _LIBCPP_ABI_NAMESPACE=__ebcdic instead. You'd end up with more code duplication, however I believe it would be vastly more beneficial and general to solve this problem by working on enabling better code de-duplication at the optimizer level. I know there has been a lot of work on outlining in LLVM, for example, perhaps that could be investigated? IOW: I think libc++ already gives you the knob you need for the namespace changes you're making here -- they might not be granular enough, however mangling the library isn't the right way to get what you need.

Regarding the string literals, I don't have as big an issue with those -- it looks roughly like an attempt to localize those error messages, which is not convenient but at least it is systematic. However, I don't understand why the compiler isn't performing those translations itself, since it knows which encoding we're compiling for.

To make progress on this:

  1. Have you tried using _LIBCPP_ABI_NAMESPACE?
  2. Is it possible for the compiler to perform the encoding translation for string literals? It would not only be simpler, but it would also solve issues more generally for your users who want to use alternate encodings, not only in libc++/libc++abi.

TL;DR: These changes are quite intrusive and mangling the library like this doesn't strike me as the right approach, especially since I think we already have general knobs to solve some of these issues. It adds a lot of complexity and room for mistakes for the benefit of a single platform. I would like to investigate alternate ways to solve your problems, with some possible avenues to explore laid out above.

Thanks @ldionne for your detailed feedback and suggestions.

To make progress on this:

  1. Have you tried using _LIBCPP_ABI_NAMESPACE?

I prototyped with your suggestion to build the whole library twice, once with _LIBCPP_ABI_NAMESPACE=__ascii and once with _LIBCPP_ABI_NAMESPACE=__ebcdic; however, some symbols (with this approach) should not end up having __ascii / __ebcdic variations. In particular, we need to make sure there is only one variation generated, outside the A/E namespace. For example, we want the *error_category to be independent of A/E. These can be passed and thrown around. We want a system_error_category to be handled the same regardless whether it came from A or E. With that being said, would it be okay to move these instances to the c++abi (like std::exception was)? Or should we exclude them from the _LIBCPP_ABI_NAMESPACE part, for example using:

namespace std  // purposefully not using versioning namespace
{
...
}

We can gather a list of candidates to cross check with you; however, the first ones that come to mind are the exception derived classes and mutex's.

muiez updated this revision to Diff 390758.Nov 30 2021, 11:00 AM

Resolve merge conflict with main

muiez added a comment.Feb 3 2022, 9:57 AM

Closing since this patch is divided into the following (with comments addressed and better approach):

muiez abandoned this revision.Feb 3 2022, 9:58 AM