The aim is to add the missing z/OS specific implementations for mbsnrtowcs and wcsnrtombs, as part of libc++.
Details
- Reviewers
- curdeius - ldionne - Jonathan.Crowther - DanielMcIntosh-IBM - • Quuxplusone 
- Group Reviewers
- Restricted Project 
- Commits
- rG4247381e26dd: [SystemZ][z/OS] Missing wchar functions libc++
 rGfebbf68b423b: [SystemZ][z/OS] Missing wchar functions libc++
Diff Detail
Event Timeline
If I understand correctly, these are almost the same as the files in support/solaris/ but:
- without the _l suffix, and hence without the loc parameter
- calling functions without _l and without loc
- not calling FIX_LOCALE(loc;)
| libcxx/include/wchar.h | ||
|---|---|---|
| 178–179 | Could you please fix this comment while here? | |
Peanut gallery says: What's the significance of the .inc file extension (as opposed to .cpp)?
(Either way, I'm definitely not qualified to mark this as accepted, neither on the z/OS front nor on the is-this-license-acceptable front.)
FYI, this has a precedence in support/solaris. The licence is the same. The files differ only by _l suffix and so the use (or not) of locale paremeter.
Sorry for not having said that before, but for the next time please wait for a libc++ group approval before landing.
Thanks for the note. I saw 'This revision is now accepted and ready to land' and landed after that. I'll keep this in mind for next time.
Sorry, but we can't take this contribution unless it is made under the LLVM license. I'm going to have to revert this. I'll give you 24h to reply to this before reverting.
@ldionne, then something needs to be done with support/solaris that has almost the same license, no?
https://github.com/llvm/llvm-project/tree/main/libcxx/src/support/solaris
Yes indeed, I believe those would need to be LLVM licensed. Those support files were added back in 2012 and there's a README explaining it (https://github.com/llvm/llvm-project/blob/main/libcxx/src/support/solaris/README), so I'm not sure what the thought was at the time.
However, regarding new contributions at least, the LLVM developer policy is pretty clear about what licensing is acceptable: https://llvm.org/docs/DeveloperPolicy.html#copyright-license-and-patents.
Thanks for the notice. After looking through the LLVM developer policy, there is the following statement:
The scope of relicensing is all code that is considered part of the LLVM project, including the main LLVM repository, runtime libraries (compiler_rt, OpenMP, etc), Polly, and all other subprojects. There are a few exceptions:
- Code imported from other projects (e.g. Google Test, Autoconf, etc) will remain as it is. This code isn’t developed as part of the LLVM project, it is used by LLVM.
The copyright in this patch (and in solaris) mentions that the code is from the FreeBSD libc project. Maybe this falls into the bullet above? If not, are there any suggestions/alternatives to move forward?
Indeed, but I read that as talking about third-party projects that we check-in directly into the LLVM monorepo but we don't ever modify (for example Google Benchmark). This is not the same thing here, as arguably libcxx/src/support/ibm/wcsnrtombs.inc is now considered a libc++ file after this patch and we'd modify it if there was a bug in it, for example.
I am not a lawyer, I'm just trying to make sure we're following the LLVM contribution policy. I know @chandlerc has done a lot of work related to licensing, so perhaps he can provide more information here.
I was hoping for @chandlerc to provide insight; however, if that is not the case, it could be reverted. I will try to rewrite the functions.
Thanks. Reverted in:
commit 1e337b1dd90369763ebe6aa4eb3e563c61ddb213
Author: Louis Dionne <ldionne.2@gmail.com>
Date:   Mon Mar 22 17:17:58 2021 -0400
    [libc++] Revert "[SystemZ][z/OS] Missing wchar functions libc++"
    This reverts commit febbf68b423b14d55a14980d2ba3ec37ef1e31dc because it
    added files that were not under the LLVM license.
    See https://reviews.llvm.org/D98207 for details.FWIW, what I care about is just that the contribution is made under an acceptable license for LLVM, I otherwise have no problems with the patch itself.
Added missing functions under LLVM license (FYI: implementations are similar to those in libcxx/src/support/win32/support.cpp).
Sorry for the nit picking and I know this is coming from existing Windows implementation but we have the opportunity to clean it up.
Most comments refer to both functions.
| libcxx/src/support/ibm/mbsnrtowcs.cpp | ||
|---|---|---|
| 2 | Can you replace first 2 lines with ===----------------------- mbsnrtowcs.cpp -------------------------------=== This will make consistent with majority of other source files. | |
| 18 | Can you clean up this description so it's easer to read? For example: | |
| 27 | Any reason to keep this commented out line around? | |
| 45 | Apparently comments need to be proper sentences. Also fix the comment so it actually refer to char_size and not result. It looks like result will never be set to >0 value.. | |
LGTM, just wait for libc++ approval.
| libcxx/src/support/ibm/wcsnrtombs.cpp | ||
|---|---|---|
| 44 | Can you fix this as well? | |
If this is taken from windows, I will point out that, on windows wide characters are only 16 bits (on most systems, wchar_t is 32 bits). This may result in some assumptions in the implementation which aren't true for z/OS. I'm particularly concerned about line 39 potentially writing more than dest_remaining characters (TBH, I'm a little surprised this part works correctly on windows). This could result in writing past the end of dst (i.e. a buffer overflow). This would then be made even worse by dest_remaining -= char_size not setting dest to 0 at that point, and thus the ! dest_remaining check not triggering. Assuming it has the correct license, using the z/OS implementation of wcsrtombs as a reference might be better (unlike wcsnrtombs which is a POSIX extention, wcsrtombs is a part of the C and C++ standards).
If you don't end up re-writing this based on the z/OS implementation of wcsrtombs, I've made several comments regarding how to clean up the control flow. Most are small, but together I think they will result in a significant improvement to readability.
| libcxx/src/support/ibm/wcsnrtombs.cpp | ||
|---|---|---|
| 36–37 | Any reason this can't just be a part of the loop condition? i.e. while (source_converted != max_source_chars && dest_remaining != 0) | |
| 46–56 | As per LLVM coding standards we should probably invert this condition so it is an early-exit. i.e. if (char_size <= 0) { have_result = true; break; } | |
| 49 | This one takes a little bit more work, but I think it's cleaner to return here instead of break. (Also, return is an even earlier exit than break, so I think that would be preferred by the LLVM coding standards). We know control will jump to line 59, and the if (terminator_found) on line 60 is only true if we got to line 59 from this break. Thus, we can replace line 47 and 48 with if (dst)
    *src = NULL;
return dest_converted;Then, get rid of the terminator_found state variable, and simplify lines 59-64. | |
| 67–69 | If I'm not mistaken, when char_size < 0, have_result is guaranteed to be true (if the loop doesn't run then char_size == 0, and if char_size <= 0 in the loop, then we set have_result = true and immediately exit). Thus, we can get rid of the have_result state variable. | |
Immediately after hitting submit I notice a few more things. Added them as comments too
| libcxx/src/support/ibm/mbsnrtowcs.cpp | ||
|---|---|---|
| 40–41 | This could also probably be made part of the loop condition. (Don't forget to invert it properly when you do) | |
| 48 | again, char_size is unsigned, so this won't behave as you've described in the error case. | |
| 48 | As in wcsnrtombs, this should also probably be inverted to convert it to an early-exit | |
| 51 | Probably easier to increment dst directly here. You'll have to do so behind an extra if (dst), but in exchange you can get rid of the ternary operator on line 45, and get rid of a state variable. | |
| 59 | I think that we may be able to get rid of have_result here too, but I'm not as sure about this one, since result is initialized to 0, and terminated_sequence == static_cast<size_t>(0). | |
| libcxx/src/support/ibm/wcsnrtombs.cpp | ||
| 45–46 | char_size is actually not -1 but rather (size_t)-1, and size_t is an unsigned type, so this test won't behave as you've described in the error case. | |
| 54 | Since dst is a pointer and not a pointer-to-a-pointer, wouldn't it be easier to just add char_size directly to dst and do away with dest_converted? You don't even need to add an extra if (dst) since you have one on line 51 | |
| libcxx/src/support/ibm/wcsnrtombs.cpp | ||
|---|---|---|
| 54 | Never mind this part, I forgot we return dest_converted | |
Updated implementation to fix error handling, account for dst buffer overflow and refactoring.
| libcxx/src/support/ibm/mbsnrtowcs.cpp | ||
|---|---|---|
| 51 | I'd prefer to not add an extra conditional within the loop due to a performance tradeoff. Also, as mentioned below, dest_converted is returned so we need that state variable either way. | |
| 59 | Yup, by using result directly within the loop (since it gets updated), I was able to remove have_result. | |
I'm going to get out of your way and remove this from the review queue. The bits in libcxx/include LGTM, and I'll let you folks review the libcxx/src/support/ibm between yourselves cause it's your platform anyways. Once you're all happy with it, you can ship it assuming:
- The CI is green (BTW I'm going to request for the nth time that we add a CI bot on z/OS)
- You don't make additional modifications outside of libcxx/src/support/ibm without pinging me for a quick look
Thanks!
As discussed over slack, there is an edge case that needs to be addressed (the case where ps == nullptr on line 55 of wcsnrtombs and/or on line 61 of mbsnrtowcs).
Once that's fixed, LGTM. I've left several comments for a few more ways it could be tidied up, but none of them are functional changes.
| libcxx/src/support/ibm/mbsnrtowcs.cpp | ||
|---|---|---|
| 29–35 | Similar to my comments for wcsnrtombs.cpp, we might want to move some of these so they have a narrower scope. (buff to line 60, mbstate_tmp to line 61, and source_remaining is discussed on line 42) | |
| 42 | By tweaking this condition a bit we can get rid of source_remaining as a state variable: source_converted < src_size_bytes && (!dst || dest_converted < max_dest_chars) You may still want to keep source_remaining around, but declare it inside the loop so it's no longer used as a state variable (size_t source_remaining = src_size_bytes - source_converted;). This also allows you to make it obvious that source_converted and dest_converted are only declared outside the for loop so that we can read from them after: After moving all the declarations except terminated_sequence, invalid_sequence, incomplete_sequence, source_converted, and dest_converted, the only non-const variables you've declared before the loop are source_converted, and dest_converted. If you move the zero-initialization to the init statement of the for loop, this signals that all writes to these two variables happen within the for loop, and after we only read from them (even better, we'd actually be doing all the writing within the init and increment of the loop statement, showing the reader right away what the pre-conditions and post-conditions are for each loop iteration). | |
| 48 | See my comment in wcsnrtombs.cpp. | |
| 80 | If you move source_remaining into a for loop init statement, this should probably become the increment statement of the for loop. (probably along with the next 2 lines) | |
| libcxx/src/support/ibm/wcsnrtombs.cpp | ||
| 27–33 | Is there some reason we need to preserve all of these across loop iterations? For example, could we move mbstate_tmp's declaration to line 55 to limit the scope, and thereby reduce the number of variables the reader needs to keep track of. What about moving the declaration of char_size to line 42, and buff to line 54? | |
| 37–40 | If you change this around so it's more like mbsnrtowcs, you can get rid of dest_remaining as a state variable: instead of dest_remaining, you set dst_size_bytes = static_cast<size_t>(-1), then the loop condition can be changed to source_converted < max_source_chars && (!dst || dest_converted < dst_size_bytes). You may still want to keep dest_remaining around, but declare it inside the loop so it's no longer used as a state variable (size_t dest_remaining = dst_size_bytes - dest_converted;). This also gets rid of line 87 and 88, allowing you to cleanly turn line 86 and 89 into the increment statement in a for loop. If you move the declarations as I suggested, then you can go one step further and move the zero-initialization of source_converted and dest_converted to the init statement of the for loop for all the reasons I explained in my comments on line 42 of mbsnrtowcs. | |
| 43 | I think we could make things a bit more clear here by inverting this if and reducing the level of nesting a little. i.e. if (dst == nullptr) {
  char_size = wcrtomb(nullptr, c, ps);
} else if (dest_remaining >= static_cast<size_t>(MB_CUR_MAX)) {
  char_size = wcrtomb(dst + dest_converted, c, ps);
} else {
  //...
} | |
More refactoring [NFC]
| libcxx/src/support/ibm/mbsnrtowcs.cpp | ||
|---|---|---|
| 42 | Yup, I tweaked the condition to remove the source_remaining state variable. On the other hand, I attempted to use a for loop to include the init, condition & increment; however, this turned out to be tedious (i.e lots of variables, conditions, increments, etc.). For example we'd atleast have the following: for( size_t source_converted = 0;
    source_converted < src_size_bytes && (!dst || dest_converted < max_dest_chars);
    source_converted += result,
    ++dest_converted) {
      
    }Nonetheless, since we return dest_converted and must declare it out of the loop, then it might be more concise to initialize it upon declaration too. Also, since writing (*src) is conditional, a while loop is more appropriate. | |
Could you please fix this comment while here?