llvm-ar's mri-utf8.test test relies on the en_US.UTF-8 locale being
installed for its last RUN line to work. If that locale is not installed,
the Unicode string gets encoded (interpreted) as ASCII, which fails since
the most significant byte is non-zero. This commit changes the test to
rely only on the system being able to encode the pound sign in its
default encoding (e.g. UTF-16 on Microsoft Windows) by always opening
the file via input/output redirection. This avoids forcing a given
locale to be present and supported. A Byte Order Mark is also added to
help recognize the encoding of the file and its endianness.
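For illustration only, a minimal sketch of the redirection-based pattern the description refers to; the £.txt name, the use of %t, and the CHECK text are assumptions here, and the archive creation/extraction steps of the real test are omitted:

  # Write the non-ASCII file through output redirection and read it back
  # through input redirection, so both sides go through the system's
  # default encoding and no particular locale needs to be installed.
  RUN: rm -rf %t && mkdir -p %t
  RUN: echo contents > %t/£.txt
  RUN: FileCheck --match-full-lines %s < %t/£.txt

  CHECK: contents

The Byte Order Mark mentioned above would live in the test file itself so tools can recognize its encoding; it does not appear in the RUN lines.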
Event Timeline
There's an en_US.UTF-8 on "my" AIX box and an en_US.utf8 on "my" RHEL 7 box. There's no "C" UTF-8 locale anywhere in sight.
I believe that's because it is a built-in locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu Docker image I tested this on, yet C.UTF-8 did work there while en_US.UTF-8 did not.
I mean that calling setlocale with "C.UTF-8" gets me a null pointer, while "en_US.UTF-8" gets me a non-null string, as the following program shows:
  #include <locale.h>

  extern int printf(const char *, ...);

  void trylocale(const char *locale) {
    const char *ret = setlocale(LC_ALL, locale);
    printf("setlocale(\"%s\") returned \"%s\".\n", locale, ret ? ret : "(null)");
  }

  int main(void) {
    trylocale("C.UTF-8");
    trylocale("en_US.UTF-8");
  }
On AIX:
setlocale("C.UTF-8") returned "(null)". setlocale("en_US.UTF-8") returned "en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8".
On RHEL 7:
setlocale("C.UTF-8") returned "(null)". setlocale("en_US.UTF-8") returned "en_US.UTF-8".
llvm/test/tools/llvm-ar/mri-utf8.test:1
Just delete the comments and avoid Python.

  RUN: FileCheck --input-file £.txt --match-full-lines %s

  CHECK: contents
llvm/test/tools/llvm-ar/mri-utf8.test:1
As it is, the file contains nothing aside from this last RUN line and its associated comment block that indicates that U+00A3 is the intended interpretation of the bytes \xC2\xA3. Note: There is no BOM in the file. In addition to making the intent clear, I believe that the current approach has more of an ability to detect cases where the instances of \xC2\xA3 in the file are misinterpreted. That said, if the file redirection to create the file works, then FileCheck can be invoked with use of file redirection:

  RUN: FileCheck <£.txt --match-full-lines %s
Mmh so much for C.UTF-8 then. Good thing Fangrui Song came up with a better idea.
llvm/test/tools/llvm-ar/mri-utf8.test:1
Is UTF-8 encoding really the desired behavior, or just non-ASCII? I know the test is named mri-utf8, but the first comment says "Test non-ascii archive members". Besides, as I mentioned in the patch description, Windows encodes it in UTF-16, so UTF-8 is already not possible there. I do like the approach of using FileCheck with an input redirection. It is consistent with the echo line above, so if one works the other one will as well. I feel ashamed I didn't think of good old FileCheck. I'll revise the patch accordingly.
llvm/test/tools/llvm-ar/mri-nonascii.test:6
I am not particularly thrilled with having a file containing non-ASCII characters that are ambiguous with regards to their interpretation. Is this £, Â£, or something else? Is there an objection to adding a BOM?

llvm/test/tools/llvm-ar/mri-nonascii.test:15
Minor nit: s/processess/processes/;
llvm/test/tools/llvm-ar/mri-utf8.test:1
The description in the Windows case indicates that the file that ends up on the filesystem is named, in terms of what a user might see in a directory listing, £.txt (as opposed to something else).
llvm/test/tools/llvm-ar/mri-nonascii.test:15
What problems do you work around? The POSIX.1-2017 3.282 Portable Filename Character Set consists of the classical Latin alphabet, 0-9, <period>, <underscore>, and <hyphen-minus>. A filename consisting of the UTF-8 byte sequence 0xc2 0xa3 (£) may be disallowed by some implementations, but it is unlikely that the implementation can arbitrarily reinterpret the byte sequence and cause the test to fail. I suggest deleting the comment.
llvm/test/tools/llvm-ar/mri-nonascii.test:15
The original message is not mine, so I'm not sure what it referred to. It might be that arguments are passed down to the program being invoked without interpretation, in which case the filename would be UTF-8 encoded since that is what mri-utf8.test is encoded in. This would fail on Windows, where filenames must be UTF-16 and the output redirection on the earlier line would have created a filename in UTF-16. I'll let Owen confirm.
Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?
I think the test file name should reflect what is being tested, since that's the test identifier (i.e. when a test fails, lit prints the relative file path), so the fact that the file is encoded in UTF-8 is irrelevant. Here the test is about llvm-ar handling non-ASCII filenames, as the first comment explains. A name reflecting how the <pound sign>.txt file is encoded would make a bit more sense, but then, as I mentioned, AFAIK the filename is encoded in UTF-16 on Windows anyway. In summary, I think the renaming is warranted.
llvm/test/tools/llvm-ar/mri-nonascii.test:15
Sorry that the comment was not clear. The issue I had was specifically with behaviour differences between Python versions and OSes causing strings not to be encoded in the right format and failing to open the file in question. Originally I used Python instead of FileCheck so as to be explicit about the expected characters and encoding. However, if everyone is happier with MaskRay's suggestion and it functions as expected, I'm not sure I can argue. Avoiding the dependency on the locale would be great.
Remove the comment about why file redirection is used and add a new one
explaining the encoding/decoding steps performed on the file, since that
is key to why the test works across OSes.
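Purely as an illustration, a hypothetical sketch of the kind of comment block this revision describes; the wording below is not the actual text added to the test:

  ## The pound-sign file is created by the shell via output redirection and
  ## read back by FileCheck via input redirection. lit (Python) decodes this
  ## test file and re-encodes arguments and filenames in the system's default
  ## encoding (e.g. UTF-16 filenames on Windows), so no specific locale needs
  ## to be installed for the round trip to work.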
llvm/test/tools/llvm-ar/mri-nonascii.test:7
OK, the fact that lit (Python) does some decoding/encoding makes this tricky... Please let someone with a Windows machine verify this works.
llvm/test/tools/llvm-ar/mri-nonascii.test:7
Commit :)
This fails on macOS: http://45.33.8.238/mac/1350/step_10.txt
Relying on the system default locale seems a lot more brittle than relying on UTF-8. Requiring UTF-8 to be able to run LLVM's tests seems like an OK requirement to me.
But in any case, please take a look at the breakage on Mac, and if it takes a while to fix please revert while you investigate.
As mentioned in the commit message, Windows requires UTF-16, so it would need to be treated specially. Besides, even on Linux the problem is that the test relies on a specific locale rather than on UTF-8. Anyway, I've reverted the commit for now. @gbreynoo, the test appears to be passing on Mac OS with this commit. Is llvm-ar really expected to fail dealing with non-ASCII characters on Mac OS X?
I added the XFAIL to 3 llvm-ar tests that I had added earlier in the year because they were failing on Darwin systems. See below:
After the limited investigation I could do without a Darwin machine, I created the bug below:
https://bugs.llvm.org/show_bug.cgi?id=42562
I believed the 3 failures were due to this Darwin format bug, in which whitespace was added to the extracted files used in each test. If this test now passes on Darwin, there is no reason for the XFAIL.
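For context, the marker in question is a single lit directive; the exact feature name used in those three llvm-ar tests is an assumption here, so treat this as a sketch:

  # Removing a line like the following re-enables the test on Darwin hosts.
  # XFAIL: system-darwin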