This is an archive of the discontinued LLVM Phabricator instance.

[test] Use system locale for mri-utf8.test
ClosedPublic

Authored by thopre on Oct 4 2019, 10:18 AM.

Details

Summary

llvm-ar's mri-utf8.test test relies on the en_US.UTF-8 locale to be
installed for its last RUN line to work. If not installed, the unicode
string gets encoded (interpreted) as ascii which fails since the most
significant byte is non zero. This commit changes the test to only rely
on the system being able to encode the pound sign in its default
encoding (e.g. UTF-16 for Microsoft Windows) by always opening the file
via input/output redirection. This avoids forcing a given locale to be
present and supported. A Byte Order Mark is also added to help
recognizing the encoding of the file and its endianness.

Event Timeline

thopre created this revision.Oct 4 2019, 10:18 AM

There's an en_US.UTF-8 on "my" AIX box and an en_US.utf8 on "my" RHEL 7 box. There's no "C" UTF-8 locale anywhere in sight.

There's an en_US.UTF-8 on "my" AIX box and an en_US.utf8 on "my" RHEL 7 box. There's no "C" UTF-8 locale anywhere in sight.

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I've tested this but it did work while trying with en_US.UTF-8 did not.

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I've tested this but it did work while trying with en_US.UTF-8 did not.

I mean that using "C.UTF-8" with setlocale gets me a null pointer, and using "en_US.UTF-8" gets me a string with the following:

#include <locale.h>
extern int printf(const char *, ...);
void trylocale(const char *locale) {
  const char *ret = setlocale(LC_ALL, locale);
  printf("setlocale(\"%s\") returned \"%s\".\n", locale, ret ? ret : "(null)");
}
int main(void) {
  trylocale("C.UTF-8");
  trylocale("en_US.UTF-8");
}

On AIX:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8".

On RHEL 7:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8".
MaskRay added inline comments.Oct 4 2019, 6:50 PM
llvm/test/tools/llvm-ar/mri-utf8.test
1

Just delete the comments and avoid python.

RUN: FileCheck --input-file £.txt --match-full-lines
CHECK: contents
llvm/test/tools/llvm-ar/mri-utf8.test
1

As it is, the file contains nothing aside from this last RUN line and its associated comment block that indicates that U+00A3 is the intended interpretation of the bytes \xC2\xA3. Note: There is no BOM in the file.

In addition to making the intent clear, I believe that the current approach has more of an ability to detect cases where the instances of \xC2\xA3 in the file are misinterpreted.

That said, if the file redirection to create the file works, then FileCheck can be invoked with use of file redirection:

RUN: FileCheck <£.txt --match-full-lines %s
thopre marked an inline comment as done.Oct 7 2019, 2:17 AM

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I've tested this but it did work while trying with en_US.UTF-8 did not.

I mean that using "C.UTF-8" with setlocale gets me a null pointer, and using "en_US.UTF-8" gets me a string with the following:

#include <locale.h>
extern int printf(const char *, ...);
void trylocale(const char *locale) {
  const char *ret = setlocale(LC_ALL, locale);
  printf("setlocale(\"%s\") returned \"%s\".\n", locale, ret ? ret : "(null)");
}
int main(void) {
  trylocale("C.UTF-8");
  trylocale("en_US.UTF-8");
}

On AIX:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8".

On RHEL 7:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8".

Mmh so much for C.UTF-8 then. Good thing Fangrui Song came up with a better idea.

llvm/test/tools/llvm-ar/mri-utf8.test
1

Is UTF-8 encoding really the desired behavior or just non ascii? I know the test is named mri-utf8 but the first comment says "Test non-ascii archive members". Besides as I mentioned in the patch description Windows encodes it in UTF-16 so UTF-8 is already not possible there.

I do like the approach of using FileCheck with an input redirection. It is consistent with the echo line above so if one works the other one will as well. I feel ashamed I didn't think of that good old FileCheck. I'll revise the patch accordingly.

thopre updated this revision to Diff 223532.Oct 7 2019, 5:49 AM

Use file redirection + FileCheck to test file content

thopre edited the summary of this revision. (Show Details)Oct 7 2019, 5:50 AM

Use file redirection + FileCheck to test file content

Can people with mac & Windows test this new version works for them?

llvm/test/tools/llvm-ar/mri-nonascii.test
7

I am not particularly thrilled with having a file containing non-ASCII characters that are ambiguous with regards to their interpretation. Is this £, Β£, or something else? Is there an objection to adding a BOM?

16

Minor nit: s/processess/processes/;

llvm/test/tools/llvm-ar/mri-utf8.test
1

The description in the Windows case indicates that the file that ends up on the filesystem is named, in terms of what a user might see in a directory listing, £.txt (as opposed to something else).

MaskRay added inline comments.Oct 7 2019, 6:50 AM
llvm/test/tools/llvm-ar/mri-nonascii.test
16

What problems do you work around? POSIX.1-2017 3.282 Portable Filename Character Set consists of the classical Latin alphabet, 0~9, <period>, <underscore>, and <hyphen-minus>. a filename consisting of the UTF-8 byte sequence 0xc2 0xa3 (£) may be disallowed by some implementations but it is unlikely that the implementation can arbitrarily reinterpret the byte sequence and cause the test to fail.

I suggest deleting the comment.

thopre updated this revision to Diff 223591.Oct 7 2019, 7:16 AM
thopre marked an inline comment as done.

Fix typo and add BOM

thopre marked an inline comment as done.Oct 7 2019, 7:17 AM
thopre added inline comments.
llvm/test/tools/llvm-ar/mri-nonascii.test
16

The original message is not mine so I'm not sure what it referred to it might be that arguments are passed down the the program being invoked without interpretation, thus the filename would be UTF-8 encoded since that is what mri-utf8.test is encoded in. This would fail on Windows where filename must be UTF-16 and the output redirection of the earlier line would have created a filename in UTF-16.

I'll let Owen confirm.

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

thopre added a comment.Oct 7 2019, 2:10 PM

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

I think the testfile name should reflect what is being tested since that's the test identifier (ie. when a test fails lit prints the relative filepath) so the fact that the file is encoded in UTF-8 is irrelevant. Here the test is about llvm-ar handling non ascii filename, as the first comment explains it. How is the <pound sign>.txt file encoded would make a bit more sense as a name but then as I mentioned AFAIK the filename is encoded in UTF-16 on Windows anywat. In summary, I think the renaming is warranted.

Harbormaster completed remote builds in B39085: Diff 223591.
gbreynoo added inline comments.Oct 8 2019, 8:34 AM
llvm/test/tools/llvm-ar/mri-nonascii.test
16

Sorry that the comment was not clear. The issue I had was explicitly with the behaviour differences between python versions and OS causing strings not being encoded to the right format and failing to open the file in question.

Originally I used python as opposed to filecheck as to be explicit with the expected characters and encoding. However if everyone is happier with MaskRays suggestion and it functions as expected, I'm not sure I can argue. Avoiding the dependency on the locale would be great.

This functions on Windows fine.

gbreynoo added a comment.EditedOct 8 2019, 8:39 AM

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

I think the testfile name should reflect what is being tested since that's the test identifier (ie. when a test fails lit prints the relative filepath) so the fact that the file is encoded in UTF-8 is irrelevant. Here the test is about llvm-ar handling non ascii filename, as the first comment explains it. How is the <pound sign>.txt file encoded would make a bit more sense as a name but then as I mentioned AFAIK the filename is encoded in UTF-16 on Windows anywat. In summary, I think the renaming is warranted.

I agree that the new name makes more sense.
LGTM once the comment is fixed.

thopre updated this revision to Diff 223889.Oct 8 2019, 9:39 AM

Remove comment about why file redirection and add a new one explaining
the encoding/decoding steps performed to the file since it is key to
how why the test work accross OSes.

thopre retitled this revision from [test] Depend on C.UTF-8 dependency for mri-utf8.test to [test] Use system locale for mri-utf8.test.Oct 8 2019, 9:40 AM
thopre edited the summary of this revision. (Show Details)
MaskRay accepted this revision.Oct 10 2019, 3:31 AM
MaskRay added inline comments.
llvm/test/tools/llvm-ar/mri-nonascii.test
6

OK, the fact that lit (Python) does some decoding/encoding make this tricky... Please let someone with a Windows machine verify this works.

This revision is now accepted and ready to land.Oct 10 2019, 3:31 AM
thopre marked 5 inline comments as done.Oct 10 2019, 4:03 AM
llvm/test/tools/llvm-ar/mri-nonascii.test
6

It does according to Owen's comment:

This functions on Windows fine.

MaskRay added inline comments.Oct 10 2019, 4:37 AM
llvm/test/tools/llvm-ar/mri-nonascii.test
6

Commit :)

thopre closed this revision.Oct 10 2019, 4:48 AM
thopre marked an inline comment as done.
thakis added a subscriber: thakis.Oct 10 2019, 5:29 AM

This fails on macOS: http://45.33.8.238/mac/1350/step_10.txt

Relying on the system default locale seems a lot more brittle than relying on utf8. Requiring utf8 to be able to run llvm's tests seems like an ok requirement to me.

But in any case, please take a look at the breakage on Mac, and if it takes a while to fix please revert while you investigate.

This fails on macOS: http://45.33.8.238/mac/1350/step_10.txt

Relying on the system default locale seems a lot more brittle than relying on utf8. Requiring utf8 to be able to run llvm's tests seems like an ok requirement to me.

But in any case, please take a look at the breakage on Mac, and if it takes a while to fix please revert while you investigate.

As mentioned in the commit message, Windows requires UTF-16 so would need to be treated specially. Besides, even on Linux the problem is that it relies on a specific locale rather than UTF-8. Anyway, I've reverted the commit for now. @gbreynoo the test appears to be passing on Mac OS with this commit. Is llvm-ar really expected to fail dealing with nonascii characters on Mac OS X?

I added the XFAIL to 3 llvm-ar tests I added earlier in the year, due to them failing on Darwin systems. See below:

D64802

After the limited investigation I could do without a Darwin machine I created the bug below:

https://bugs.llvm.org/show_bug.cgi?id=42562

I believed the 3 failures were due to this Darwin format bug in which white space was added to the extracted files used in each test. If this test now passes on Darwin there is no reason for the XFAIL.

thopre added a comment.Nov 4 2019, 7:27 AM

My apologies for the commit message, I forgot to mention it's a recommit.