This is an archive of the discontinued LLVM Phabricator instance.

Differential D68472

[test] Use system locale for mri-utf8.test
ClosedPublic

Authored by thopre on Oct 4 2019, 10:18 AM.

Details

Reviewers

gbreynoo
MaskRay
rupprecht
JamesNagurne
jfb

Commits

rG0bab0538d8cc: [test] Use system locale for mri-utf8.test
rGb6f1d1fa0e3e: [test] Use system locale for mri-utf8.test
rL374318: [test] Use system locale for mri-utf8.test

Summary

llvm-ar's mri-utf8.test relies on the en_US.UTF-8 locale being
installed for its last RUN line to work. If it is not installed, the Unicode
string gets encoded (interpreted) as ASCII, which fails since the pound
sign lies outside the 7-bit ASCII range. This commit changes the test to rely only
on the system being able to encode the pound sign in its default
encoding (e.g. UTF-16 for Microsoft Windows) by always opening the file
via input/output redirection. This avoids forcing a given locale to be
present and supported. A Byte Order Mark is also added to help
recognize the encoding of the file and its endianness.
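
A minimal Python 3 sketch of the encoding issue described in the summary. The helper name `encode_or_none` is hypothetical and not part of the patch; it only illustrates that the pound sign (U+00A3) cannot be encoded as ASCII, while UTF-8 and UTF-16 each give it a different byte representation.

```python
import codecs

POUND = "\u00a3"  # the pound sign used in the test's file name


def encode_or_none(text, encoding):
    """Return text encoded with the given codec, or None if it cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# ASCII only covers code points 0-127, so U+00A3 cannot be represented.
print(encode_or_none(POUND, "ascii"))
# UTF-8 uses two bytes; UTF-16 (little-endian, no BOM) uses one 16-bit unit.
print(encode_or_none(POUND, "utf-8"))
print(encode_or_none(POUND, "utf-16-le"))
# The UTF-8 byte order mark that a file may start with.
print(codecs.BOM_UTF8)
```

The same character therefore ends up as different bytes depending on the codec in use, which is why a check that hard-codes one locale is fragile.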

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 39178
Build 39189: arc lint + arc unit

Event Timeline

thopre created this revision. Oct 4 2019, 10:18 AM

Herald added a subscriber: dexonsmith. Oct 4 2019, 10:18 AM

Harbormaster completed remote builds in B39005: Diff 223244. Oct 4 2019, 10:19 AM

There's an en_US.UTF-8 on "my" AIX box and an en_US.utf8 on "my" RHEL 7 box. There's no "C" UTF-8 locale anywhere in sight.

In D68472#1695093, @hubert.reinterpretcast wrote:

There's an en_US.UTF-8 on "my" AIX box and an en_US.utf8 on "my" RHEL 7 box. There's no "C" UTF-8 locale anywhere in sight.

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I tested this on, but it did work, whereas trying en_US.UTF-8 did not.

In D68472#1695156, @thopre wrote:

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I tested this on, but it did work, whereas trying en_US.UTF-8 did not.

I mean that using "C.UTF-8" with setlocale gets me a null pointer, and using "en_US.UTF-8" gets me a string with the following:

#include <locale.h>
#include <stdio.h>

/* Try to select the named locale and report what setlocale returns. */
static void trylocale(const char *locale) {
  const char *ret = setlocale(LC_ALL, locale);
  printf("setlocale(\"%s\") returned \"%s\".\n", locale, ret ? ret : "(null)");
}

int main(void) {
  trylocale("C.UTF-8");
  trylocale("en_US.UTF-8");
  return 0;
}

On AIX:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8".

On RHEL 7:

setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8".
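
The same probe can be sketched in Python 3, which is closer to how lit runs on a given machine. This is only an illustration: `try_locale` is a hypothetical helper, and the result depends entirely on which locales the host provides.

```python
import locale


def try_locale(name):
    """Return the string setlocale reports for name, or None if it is unsupported."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


for candidate in ("C.UTF-8", "en_US.UTF-8"):
    print("setlocale(%r) returned %r" % (candidate, try_locale(candidate)))
```

A `None` result corresponds to the null pointer returned by the C call above.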

MaskRay added inline comments. Oct 4 2019, 6:50 PM

llvm/test/tools/llvm-ar/mri-utf8.test
1	Just delete the comments and avoid python.

RUN: FileCheck --input-file £.txt --match-full-lines
CHECK: contents

hubert.reinterpretcast added inline comments. Oct 5 2019, 7:56 AM

llvm/test/tools/llvm-ar/mri-utf8.test
1	As it is, the file contains nothing aside from this last RUN line and its associated comment block that indicates that U+00A3 is the intended interpretation of the bytes `\xC2\xA3`. Note: There is no BOM in the file.

In addition to making the intent clear, I believe that the current approach has more of an ability to detect cases where the instances of `\xC2\xA3` in the file are misinterpreted.

That said, if the file redirection to create the file works, then `FileCheck` can be invoked with use of file redirection:

RUN: FileCheck <£.txt --match-full-lines %s

In D68472#1695743, @hubert.reinterpretcast wrote:
In D68472#1695156, @thopre wrote:

I believe that's because it is a builtin locale (much like the C locale). There wasn't one in /usr/share/locale on the Ubuntu docker image I tested this on, but it did work, whereas trying en_US.UTF-8 did not.

I mean that using "C.UTF-8" with setlocale gets me a null pointer, and using "en_US.UTF-8" gets me a string with the following:
#include <locale.h>
#include <stdio.h>

/* Try to select the named locale and report what setlocale returns. */
static void trylocale(const char *locale) {
  const char *ret = setlocale(LC_ALL, locale);
  printf("setlocale(\"%s\") returned \"%s\".\n", locale, ret ? ret : "(null)");
}

int main(void) {
  trylocale("C.UTF-8");
  trylocale("en_US.UTF-8");
  return 0;
}
On AIX:
setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8 en_US.UTF-8".
On RHEL 7:
setlocale("C.UTF-8") returned "(null)".
setlocale("en_US.UTF-8") returned "en_US.UTF-8".

Mmh so much for C.UTF-8 then. Good thing Fangrui Song came up with a better idea.

llvm/test/tools/llvm-ar/mri-utf8.test
1	Is UTF-8 encoding really the desired behavior, or just non-ASCII? I know the test is named mri-utf8, but the first comment says "Test non-ascii archive members". Besides, as I mentioned in the patch description, Windows encodes it in UTF-16, so UTF-8 is already not possible there.

I do like the approach of using FileCheck with an input redirection. It is consistent with the echo line above, so if one works the other one will as well. I feel ashamed I didn't think of that good old FileCheck. I'll revise the patch accordingly.

Use file redirection + FileCheck to test file content

In D68472#1697316, @thopre wrote:

Use file redirection + FileCheck to test file content

Can people with Mac & Windows test that this new version works for them?

hubert.reinterpretcast added inline comments. Oct 7 2019, 6:47 AM

llvm/test/tools/llvm-ar/mri-nonascii.test
7	I am not particularly thrilled with having a file containing non-ASCII characters that are ambiguous with regards to their interpretation. Is this `Â£`, `Β£`, or something else? Is there an objection to adding a BOM?
16	Minor nit: s/processess/processes/;
llvm/test/tools/llvm-ar/mri-utf8.test
1	The description in the Windows case indicates that the file that ends up on the filesystem is named, in terms of what a user might see in a directory listing, `£.txt` (as opposed to something else).
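
The ambiguity raised for line 7 above can be illustrated with a short Python 3 sketch. The choice of codecs is an assumption made for illustration: `latin-1` and `iso8859_7` are presumably the interpretations behind the two garbled renderings quoted in the comment, and the helper names are hypothetical.

```python
import codecs

RAW = b"\xc2\xa3"  # the two bytes that are meant to be U+00A3 in UTF-8


def interpretations(data):
    """Decode the same bytes under several single- and multi-byte codecs."""
    return {enc: data.decode(enc) for enc in ("utf-8", "latin-1", "iso8859_7")}


def starts_with_utf8_bom(data):
    """Report whether the data begins with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


for encoding, text in interpretations(RAW).items():
    print(encoding, ascii(text))

# With a BOM in front, a reader can recognise the file as UTF-8 before decoding;
# the "utf-8-sig" codec strips the mark while decoding.
with_bom = codecs.BOM_UTF8 + RAW
print(starts_with_utf8_bom(with_bom), ascii(with_bom.decode("utf-8-sig")))
```

Without the mark, nothing in the bytes themselves says which of the three readings was intended.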

MaskRay added inline comments. Oct 7 2019, 6:50 AM

llvm/test/tools/llvm-ar/mri-nonascii.test
16	What problems do you work around? POSIX.1-2017 3.282 Portable Filename Character Set consists of the classical Latin alphabet, 0~9, <period>, <underscore>, and <hyphen-minus>. A filename consisting of the UTF-8 byte sequence 0xc2 0xa3 (£) may be disallowed by some implementations, but it is unlikely that the implementation can arbitrarily reinterpret the byte sequence and cause the test to fail. I suggest deleting the comment.
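
As a rough illustration of the character set named in this comment, the following Python 3 sketch checks a name against it. `is_portable_filename` is a hypothetical helper; it only checks the character set and ignores other POSIX guidance, such as avoiding a leading hyphen.

```python
import re

# Letters, digits, period, underscore and hyphen-minus, as listed in the comment.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name):
    """Return True if every character of name is in the portable set."""
    return _PORTABLE.fullmatch(name) is not None


print(is_portable_filename("mri-nonascii.test"))
print(is_portable_filename("\u00a3.txt"))
```

The test's file name falls outside the portable set, which is the point of the test: it exercises a name that implementations are allowed, but not required, to accept.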

Fix typo and add BOM

thopre marked an inline comment as done. Oct 7 2019, 7:17 AM

thopre added inline comments.

llvm/test/tools/llvm-ar/mri-nonascii.test
16	The original message is not mine, so I'm not sure what it referred to. It might be that arguments are passed down to the program being invoked without interpretation; thus the filename would be UTF-8 encoded, since that is what mri-utf8.test is encoded in. This would fail on Windows, where filenames must be UTF-16 and the output redirection of the earlier line would have created a filename in UTF-16. I'll let Owen confirm.

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

In D68472#1697691, @hubert.reinterpretcast wrote:

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

I think the test file name should reflect what is being tested since that's the test identifier (i.e. when a test fails, lit prints the relative file path), so the fact that the file is encoded in UTF-8 is irrelevant. Here the test is about llvm-ar handling non-ASCII filenames, as the first comment explains. How the <pound sign>.txt file is encoded would make a bit more sense as a name, but then, as I mentioned, AFAIK the filename is encoded in UTF-16 on Windows anyway. In summary, I think the renaming is warranted.

Harbormaster completed remote builds in B39078: Diff 223532. Oct 7 2019, 10:04 PM

Harbormaster completed remote builds in B39085: Diff 223591.

gbreynoo added inline comments. Oct 8 2019, 8:34 AM

llvm/test/tools/llvm-ar/mri-nonascii.test
16	Sorry that the comment was not clear. The issue I had was explicitly with the behaviour differences between Python versions and OSes causing strings not to be encoded in the right format, so that opening the file in question failed. Originally I used Python rather than FileCheck so as to be explicit about the expected characters and encoding. However, if everyone is happier with MaskRay's suggestion and it functions as expected, I'm not sure I can argue. Avoiding the dependency on the locale would be great.

This functions on Windows fine.

In D68472#1698154, @thopre wrote:

In D68472#1697691, @hubert.reinterpretcast wrote:

Thanks for adding the BOM. With the BOM, would it make sense to leave mri-utf8.test as the name of the file?

I think the test file name should reflect what is being tested since that's the test identifier (i.e. when a test fails, lit prints the relative file path), so the fact that the file is encoded in UTF-8 is irrelevant. Here the test is about llvm-ar handling non-ASCII filenames, as the first comment explains. How the <pound sign>.txt file is encoded would make a bit more sense as a name, but then, as I mentioned, AFAIK the filename is encoded in UTF-16 on Windows anyway. In summary, I think the renaming is warranted.

I agree that the new name makes more sense.
LGTM once the comment is fixed.

Remove comment about why file redirection is used and add a new one explaining
the encoding/decoding steps performed on the file, since they are key to
why the test works across OSes.

thopre retitled this revision from [test] Depend on C.UTF-8 dependency for mri-utf8.test to [test] Use system locale for mri-utf8.test. Oct 8 2019, 9:40 AM

thopre edited the summary of this revision.

Harbormaster completed remote builds in B39178: Diff 223889. Oct 8 2019, 9:47 AM

MaskRay accepted this revision. Oct 10 2019, 3:31 AM

MaskRay added inline comments.

llvm/test/tools/llvm-ar/mri-nonascii.test
6	OK, the fact that lit (Python) does some decoding/encoding makes this tricky... Please let someone with a Windows machine verify this works.

This revision is now accepted and ready to land. Oct 10 2019, 3:31 AM

llvm/test/tools/llvm-ar/mri-nonascii.test
6	It does according to Owen's comment:

In D68472#1699815, @gbreynoo wrote:

This functions on Windows fine.

MaskRay added inline comments. Oct 10 2019, 4:37 AM

llvm/test/tools/llvm-ar/mri-nonascii.test
6	Commit :)

thopre closed this revision. Oct 10 2019, 4:48 AM

thopre marked an inline comment as done.

This fails on macOS: http://45.33.8.238/mac/1350/step_10.txt

Relying on the system default locale seems a lot more brittle than relying on utf8. Requiring utf8 to be able to run llvm's tests seems like an ok requirement to me.

But in any case, please take a look at the breakage on Mac, and if it takes a while to fix please revert while you investigate.

In D68472#1703619, @thakis wrote:

This fails on macOS: http://45.33.8.238/mac/1350/step_10.txt

Relying on the system default locale seems a lot more brittle than relying on utf8. Requiring utf8 to be able to run llvm's tests seems like an ok requirement to me.

But in any case, please take a look at the breakage on Mac, and if it takes a while to fix please revert while you investigate.

As mentioned in the commit message, Windows requires UTF-16 so would need to be treated specially. Besides, even on Linux the problem is that it relies on a specific locale rather than UTF-8. Anyway, I've reverted the commit for now. @gbreynoo the test appears to be passing on Mac OS with this commit. Is llvm-ar really expected to fail dealing with nonascii characters on Mac OS X?

I added the XFAIL to 3 llvm-ar tests I added earlier in the year, due to them failing on Darwin systems. See below:

D64802

After the limited investigation I could do without a Darwin machine I created the bug below:

https://bugs.llvm.org/show_bug.cgi?id=42562

I believed the 3 failures were due to this Darwin format bug in which white space was added to the extracted files used in each test. If this test now passes on Darwin there is no reason for the XFAIL.

MaskRay mentioned this in D69665: [llvm-ar] Fix llvm-ar response file reading on Windows. Oct 31 2019, 10:36 AM

My apologies for the commit message, I forgot to mention it's a recommit.

Revision Contents

Path                                        Size
llvm/test/tools/llvm-ar/mri-nonascii.test   22 lines
llvm/test/tools/llvm-ar/mri-utf8.test

Diff 223889

llvm/test/tools/llvm-ar/mri-nonascii.test

This file was added.

				# Test non-ascii archive members
				# XFAIL: system-darwin

				RUN: rm -rf %t && mkdir -p %t/extracted

				# Note: lit's Python will read this UTF-8 encoded mri-nonascii.txt file,
				# decode it to unicode. The filename in the redirection below will then
				# be encoded in the system's filename encoding (e.g. UTF-16 for
				# Microsoft Windows).
				RUN: echo "contents" > %t/£.txt

				RUN: echo "CREATE %t/mri.ar" > %t/script.mri
				RUN: echo "ADDMOD %t/£.txt" >> %t/script.mri
				RUN: echo "SAVE" >> %t/script.mri

				RUN: llvm-ar -M < %t/script.mri
				RUN: cd %t/extracted && llvm-ar x %t/mri.ar

				# Same as above.
				RUN: FileCheck --strict-whitespace %s <£.txt
				CHECK:{{^}}
				CHECK-SAME:{{^}}contents{{$}}
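
The note in the added test about the system's filename encoding can be explored with a small Python 3 sketch. It is an illustration only: `filename_bytes` is a hypothetical helper, and the bytes it prints depend on the host. On Windows, Python hands `str` file names to the wide-character APIs, so the bytes shown there are not what reaches the filesystem.

```python
import os
import sys


def filename_bytes(name):
    """Encode a file name the way Python does for byte-oriented OS interfaces.

    Returns None when the filesystem encoding cannot represent the name.
    """
    try:
        return os.fsencode(name)
    except UnicodeEncodeError:
        return None


print(sys.getfilesystemencoding())
print(filename_bytes("\u00a3.txt"))
```

On a host whose filesystem encoding is UTF-8, the pound sign is expected to appear as the two bytes 0xC2 0xA3 discussed in the review.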

llvm/test/tools/llvm-ar/mri-utf8.test

This file was deleted.

	# Test non-ascii archive members
	# XFAIL: system-darwin

	RUN: rm -rf %t && mkdir -p %t/extracted

	RUN: echo "contents" > %t/£.txt

	RUN: echo "CREATE %t/mri.ar" > %t/script.mri
	RUN: echo "ADDMOD %t/£.txt" >> %t/script.mri
	RUN: echo "SAVE" >> %t/script.mri

	RUN: llvm-ar -M < %t/script.mri
	RUN: cd %t/extracted && llvm-ar x %t/mri.ar

	# This works around problems launching processess that
	# include arguments with non-ascii characters.
	# Python on Linux defaults to ASCII encoding unless the
	# environment specifies otherwise, so it is explicitly set.
	# The reliance the test has on this locale is not ideal,
	# however alternate solutions have been difficult due to
	# behaviour differences with python 2 vs python 3,
	# and linux vs windows.
	RUN: env LANG=en_US.UTF-8 %python -c "assert open(u'\U000000A3.txt', 'rb').read() == b'contents\n'"
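
For comparison, the check performed by the deleted RUN line can be written as a standalone Python 3 sketch that does not set `LANG`. This is a hedged illustration rather than what the test ran: `roundtrip` is a hypothetical helper, and it assumes the host filesystem and Python's filesystem encoding can represent the pound sign.

```python
import os
import tempfile


def roundtrip(name, payload):
    """Create a file with the given name, then list the directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        listed = os.listdir(tmp)
        with open(path, "rb") as handle:
            return listed, handle.read()


names, data = roundtrip("\u00a3.txt", b"contents\n")
print(names, data)
```

Opening the file in binary mode keeps the content comparison independent of any text encoding, so only the file name exercises the non-ASCII handling.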