Page MenuHomePhabricator

[SystemZ][z/OS] Missing wchar functions libc++
ClosedPublic

Authored by muiez on Mar 8 2021, 11:48 AM.

Details

Summary

The aim is to add the missing z/OS specific implementations for mbsnrtowcs and wcsnrtombs, as part of libc++.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

[...]

Thanks for the notice. After looking through the LLVM developer policy, there is the following statement:

The scope of relicensing is all code that is considered part of the LLVM project, including the main LLVM repository, runtime libraries (compiler_rt, OpenMP, etc), Polly, and all other subprojects. There are a few exceptions:

  • Code imported from other projects (e.g. Google Test, Autoconf, etc) will remain as it is. This code isn’t developed as part of the LLVM project, it is used by LLVM.

The copyright in this patch (and in solaris) mentions that the code is from the FreeBSD libc project. Maybe this falls into the bullet above? If not, are there any suggestions/alternatives to move forward?

Indeed, but I read that as talking about third-party projects that we check-in directly into the LLVM monorepo but we don't ever modify (for example Google Benchmark). This is not the same thing here, as arguably libcxx/src/support/ibm/wcsnrtombs.inc is now considered a libc++ file after this patch and we'd modify it if there was a bug in it, for example.

I am not a lawyer, I'm just trying to make sure we're following the LLVM contribution policy. I know @chandlerc has done a lot of work related to licensing, so perhaps he can provide more information here.

Ping. Do you see a solution or should I revert this?

muiez added a comment.Mar 22 2021, 1:55 PM

Ping. Do you see a solution or should I revert this?

I was hoping for @chandlerc to provide insight; however, if that is not the case, it could be reverted. I will try to rewrite the functions.

Thanks. Reverted in:

commit 1e337b1dd90369763ebe6aa4eb3e563c61ddb213
Author: Louis Dionne <ldionne.2@gmail.com>
Date:   Mon Mar 22 17:17:58 2021 -0400

    [libc++] Revert "[SystemZ][z/OS] Missing wchar functions libc++"

    This reverts commit febbf68b423b14d55a14980d2ba3ec37ef1e31dc because it
    added files that were not under the LLVM license.

    See https://reviews.llvm.org/D98207 for details.

FWIW, what I care about is just that the contribution is made under an acceptable license for LLVM, I otherwise have no problems with the patch itself.

ldionne requested changes to this revision.Mar 23 2021, 10:27 AM

I'm going to request changes on this so it shows up properly in the review queue.

This revision now requires changes to proceed.Mar 23 2021, 10:27 AM
muiez updated this revision to Diff 337499.Apr 14 2021, 11:14 AM

Added missing functions under LLVM license (FYI: implementations are similar to those in libcxx/src/support/win32/support.cpp).

Ping (ready for new round of review, FYI @ldionne)

muiez updated this revision to Diff 337514.Apr 14 2021, 11:59 AM

Clang tidy

zibi added a comment.Apr 26 2021, 9:28 AM

Sorry for the nit picking and I know this is coming from existing Windows implementation but we have the opportunity to clean it up.
Most comments refer to both functions.

libcxx/src/support/ibm/mbsnrtowcs.cpp
2

Can you replace first 2 lines with

===----------------------- mbsnrtowcs.cpp -------------------------------===

This will make consistent with majority of other source files.

18

Can you clean up this description so it's easer to read?

For example:
and no "out" parameters are updated. Does it refer to dst or src? Also if you refer to the parameter name please annotate it to make it stand out.

27

Any reason to keep this commented out line around?

45

Apparently comments need to be proper sentences. Also fix the comment so it actually refer to char_size and not result.

It looks like result will never be set to >0 value..

muiez updated this revision to Diff 340595.Apr 26 2021, 11:34 AM

Improved readability of comments and documentation.

muiez marked 4 inline comments as done.Apr 26 2021, 11:34 AM
zibi added a comment.Apr 26 2021, 12:27 PM

LGTM, just wait for libc++ approval.

libcxx/src/support/ibm/wcsnrtombs.cpp
44

Can you fix this as well?

muiez updated this revision to Diff 340637.Apr 26 2021, 1:58 PM

Addressed comment.

muiez marked an inline comment as done.Apr 26 2021, 1:58 PM
muiez updated this revision to Diff 340844.Apr 27 2021, 7:42 AM

No changes. Used to kick-off CI again.

muiez added a comment.May 14 2021, 7:18 AM

Ping, patch is ready for new round of review (@ldionne). Thanks

DanielMcIntosh-IBM requested changes to this revision.Jun 9 2021, 10:49 AM

If this is taken from windows, I will point out that, on windows wide characters are only 16 bits (on most systems, wchar_t is 32 bits). This may result in some assumptions in the implementation which aren't true for z/OS. I'm particularly concerned about line 39 potentially writing more than dest_remaining characters (TBH, I'm a little surprised this part works correctly on windows). This could result in writing past the end of dst (i.e. a buffer overflow). This would then be made even worse by dest_remaining -= char_size not setting dest to 0 at that point, and thus the ! dest_remaining check not triggering. Assuming it has the correct license, using the z/OS implementation of wcsrtombs as a reference might be better (unlike wcsnrtombs which is a POSIX extention, wcsrtombs is a part of the C and C++ standards).

If you don't end up re-writing this based on the z/OS implementation of wcsrtombs, I've made several comments regarding how to clean up the control flow. Most are small, but together I think they will result in a significant improvement to readability.

libcxx/src/support/ibm/wcsnrtombs.cpp
36–37

Any reason this can't just be a part of the loop condition? i.e. while (source_converted != max_source_chars && dest_remaining != 0)

46–56

As per LLVM coding standards we should probably invert this condition so it is an early-exit. i.e. if (char_size <= 0) { have_result = true; break; }

49

This one takes a little bit more work, but I think it's cleaner to return here instead of break. (Also, return is an even earlier exit than break, so I think that would be preferred by the LLVM coding standards).

We know control will jump to line 59, and the if (terminator_found) on line 60 is only true if we got to line 59 from this break. Thus, we can replace line 47 and 48 with

if (dst)
    *src = NULL;
return dest_converted;

Then, get rid of the terminator_found state variable, and simplify lines 59-64.
I think this will also probably enable further simplification of the control flow in a few places, but at this point I've recommended so many changes it's hard to keep track.

67–69

If I'm not mistaken, when char_size < 0, have_result is guaranteed to be true (if the loop doesn't run then char_size == 0, and if char_size <= 0 in the loop, then we set have_result = true and immediately exit). Thus, we can get rid of the have_result state variable.

This revision now requires changes to proceed.Jun 9 2021, 10:49 AM

Immediately after hitting submit I notice a few more things. Added them as comments too

libcxx/src/support/ibm/mbsnrtowcs.cpp
40–41

This could also probably be made part of the loop condition. (Don't forget to invert it properly when you do)

48

again, char_size is unsigned, so this won't behave as you've described in the error case.

48

As in wcsnrtombs, this should also probably be inverted to convert it to an early-exit

51

Probably easier to increment dst directly here. You'll have to do so behind an extra if (dst), but in exchange you can get rid of the ternary operator on line 45, and get rid of a state variable.

59

I think that we may be able to get rid of have_result here too, but I'm not as sure about this one, since result is initialized to 0, and terminated_sequence == static_cast<size_t>(0).

libcxx/src/support/ibm/wcsnrtombs.cpp
45–46

char_size is actually not -1 but rather (size_t)-1, and size_t is an unsigned type, so this test won't behave as you've described in the error case.

54

Since dst is a pointer and not a pointer-to-a-pointer, wouldn't it be easier to just add char_size directly to dst and do away with dest_converted? You don't even need to add an extra if (dst) since you have one on line 51

libcxx/src/support/ibm/wcsnrtombs.cpp
54

Never mind this part, I forgot we return dest_converted

muiez updated this revision to Diff 370400.Thu, Sep 2, 2:11 PM
muiez marked 8 inline comments as done.

Updated implementation to fix error handling, account for dst buffer overflow and refactoring.

muiez updated this revision to Diff 370407.Thu, Sep 2, 2:14 PM
muiez marked an inline comment as not done.

removed todo comment

muiez marked 5 inline comments as done.Thu, Sep 2, 2:26 PM
muiez added inline comments.
libcxx/src/support/ibm/mbsnrtowcs.cpp
51

I'd prefer to not add an extra conditional within the loop due to a performance tradeoff. Also, as mentioned below, dest_converted is returned so we need that state variable either way.

59

Yup, by using result directly within the loop (since it gets updated), I was able to remove have_result.

ldionne accepted this revision.Fri, Sep 3, 12:15 PM

I'm going to get out of your way and remove this from the review queue. The bits in libcxx/include LGTM, and I'll let you folks review the libcxx/src/support/ibm between yourselves cause it's your platform anyways. Once you're all happy with it, you can ship it assuming:

  1. The CI is green (BTW I'm going to request for the nth time that we add a CI bot on z/OS)
  2. You don't make additional modifications outside of libcxx/src/support/ibm without pinging me for a quick look

Thanks!

muiez updated this revision to Diff 371334.Wed, Sep 8, 7:38 AM
muiez marked an inline comment as done.

Minor refactor

muiez updated this revision to Diff 371435.Wed, Sep 8, 1:53 PM

Check for ps nullptr

DanielMcIntosh-IBM requested changes to this revision.Wed, Sep 8, 2:35 PM

As discussed over slack, there is an edge case that needs to be addressed (the case where ps == nullptr on line 55 of wcsnrtombs and/or on line 61 of mbsnrtowcs).
Once that's fixed, LGTM. I've left several comments for a few more ways it could be tidied up, but none of them are functional changes.

libcxx/src/support/ibm/mbsnrtowcs.cpp
30–36

Similar to my comments for wcsnrtombs.cpp, we might want to move some of these so they have a narrower scope. (buff to line 60, mbstate_tmp to line 61, and source_remaining is discussed on line 42)

43

By tweaking this condition a bit we can get rid of source_remaining as a state variable: source_converted < src_size_bytes && (!dst || dest_converted < max_dest_chars)

You may still want to keep source_remaining around, but declare it inside the loop so it's no longer used as a state variable (size_t source_remaining = src_size_bytes - source_converted;).

This also allows you to make it obvious that source_converted and dest_converted are only declared outside the for loop so that we can read from them after: After moving all the declarations except terminated_sequence, invalid_sequence, incomplete_sequence, source_converted, and dest_converted, the only non-const variables you've declared before the loop are source_converted, and dest_converted. If you move the zero-initialization to the init statement of the for loop, this signals that all writes to these two variables happen within the for loop, and after we only read from them (even better, we'd actually be doing all the writing within the init and increment of the loop statement, showing the reader right away what the pre-conditions and post-conditions are for each loop iteration).

49

See my comment in wcsnrtombs.cpp.

81

If you move source_remaining into a for loop init statement, this should probably become the increment statement of the for loop. (probably along with the next 2 lines)

libcxx/src/support/ibm/wcsnrtombs.cpp
28–34

Is there some reason we need to preserve all of these across loop iterations? For example, could we move mbstate_tmp's declaration to line 55 to limit the scope, and thereby reduce the number of variables the reader needs to keep track of. What about moving the declaration of char_size to line 42, and buff to line 54?

38–41

If you change this around so it's more like mbsnrtowcs, you can get rid of dest_remaining as a state variable: instead of dest_remaining, you set dst_size_bytes = static_cast<size_t>(-1), then the loop condition can be changed to source_converted < max_source_chars && (!dst || dest_converted < dst_size_bytes).

You may still want to keep dest_remaining around, but declare it inside the loop so it's no longer used as a state variable (size_t dest_remaining = dst_size_bytes - dest_converted;). This also gets rid of line 87 and 88, allowing you to cleanly turn line 86 and 89 into the increment statement in a for loop.

If you move the declarations as I suggested, then you can go one step further and move the zero-initialization of source_converted and dest_converted to the init statement of the for loop for all the reasons I explained in my comments on line 42 of mbsnrtowcs.

44

I think we could make things a bit more clear here by inverting this if and reducing the level of nesting a little. i.e.

if (dst == nullptr) {
  char_size = wcrtomb(nullptr, c, ps);
} else if (dest_remaining >= static_cast<size_t>(MB_CUR_MAX)) {
  char_size = wcrtomb(dst + dest_converted, c, ps);
} else {
  //...
}
This revision now requires changes to proceed.Wed, Sep 8, 2:35 PM
muiez updated this revision to Diff 371715.Thu, Sep 9, 1:41 PM
muiez marked 7 inline comments as done.

More refactoring [NFC]

libcxx/src/support/ibm/mbsnrtowcs.cpp
43

Yup, I tweaked the condition to remove the source_remaining state variable.

On the other hand, I attempted to use a for loop to include the init, condition & increment; however, this turned out to be tedious (i.e lots of variables, conditions, increments, etc.). For example we'd atleast have the following:

for( size_t source_converted = 0;
    source_converted < src_size_bytes && (!dst || dest_converted < max_dest_chars);
    source_converted += result,
    ++dest_converted) {
      
    }

Nonetheless, since we return dest_converted and must declare it out of the loop, then it might be more concise to initialize it upon declaration too. Also, since writing (*src) is conditional, a while loop is more appropriate.

This revision is now accepted and ready to land.Fri, Sep 10, 12:52 PM
muiez updated this revision to Diff 371999.Fri, Sep 10, 1:15 PM

Refactor to use for loop

This revision was automatically updated to reflect the committed changes.