This is an archive of the discontinued LLVM Phabricator instance.

[test] Remove locale dependency for mri-utf8.test
ClosedPublic

Authored by thopre on Oct 3 2019, 1:35 PM.

Details

Summary

llvm-ar's mri-utf8.test test relies on the en_US.UTF-8 locale to be
installed for its last RUN line to work. If not installed, the unicode
string gets encoded (interpreted) as ascii which fails since the most
significant byte is non zero. This commit changes the call to open to
use a binary literal of the UTF-8 encoding for the pound sign instead,
thus bypassing the encoding step.

Note that the echo to create the <pound sign>.txt file will work
regardless of the locale because both the shell and the echo (in case
it's not a builtin of the shell concerned) only care about ascii
character to operate. Indeed, the mri-utf8.test file (and in particular
the pound sign) is encoded in UTF-8 and UTF-8 guarantees only ascii
characters can create bytes that can be interpreted as ascii characters
(i.e. bytes with the most significant bit null).

So the process to break down the filename in the line goes something
along:

  • find an ascii chevron '>'
  • find beginning of the filename by removing ascii space-like characters
  • find ascii newline character indicating the end of the redirection (no semicolon ';', closing curly bracket '}' or parenthesis ')' or the like
  • create a file whose name is made of all the bytes in between beginning and end of filename *without interpretting them*

Event Timeline

thopre created this revision.Oct 3 2019, 1:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2019, 1:35 PM
thopre added a reviewer: jfb.Oct 3 2019, 1:37 PM
rupprecht accepted this revision.Oct 3 2019, 5:50 PM

Out of curiosity, what environment does this currently fail in?

llvm/test/tools/llvm-ar/mri-utf8.test
14–15

This seems to be mostly describing env LANG=en_US.UTF-8 which is now removed, so the comment can go too.

This revision is now accepted and ready to land.Oct 3 2019, 5:50 PM
thopre updated this revision to Diff 223151.Oct 4 2019, 12:09 AM

Remove now irrelevant comment

thopre marked 2 inline comments as done.Oct 4 2019, 12:11 AM
thopre added inline comments.
llvm/test/tools/llvm-ar/mri-utf8.test
14–15

I'm not sure but I believe Ubuntu docker images for instance don't come with en_US.UTF-8 by default.

This revision was automatically updated to reflect the committed changes.
thopre marked an inline comment as done.

You are correct that the locale is required to pass on linux. I had some trouble with this test as the behaviour of python in this area differs between linux / windows and python 2 / python 3. For example this fix appears to be fine for linux, however Windows with python 2 fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '\xc2\xa3.txt'

I do not like the reliance on the locale however the upstream buildbots all appear to have it installed. Maybe the test should be split into a windows test and a linux test?

I reverted this change due to the clang-x64-windows-msvc build bot failure.

thopre added a comment.Oct 4 2019, 5:35 AM

You are correct that the locale is required to pass on linux. I had some trouble with this test as the behaviour of python in this area differs between linux / windows and python 2 / python 3. For example this fix appears to be fine for linux, however Windows with python 2 fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '\xc2\xa3.txt'

I do not like the reliance on the locale however the upstream buildbots all appear to have it installed. Maybe the test should be split into a windows test and a linux test?

Mmmh, I need a Windows system to try more then. I guess the current code will output the pound sign into whatever encoding Windows use (which I guess is not UTF-8 but then how does the echo few lines above creates the file correct). Thanks for reverting the commit and sorry for the breakage.

thopre added a comment.Oct 4 2019, 6:29 AM

You are correct that the locale is required to pass on linux. I had some trouble with this test as the behaviour of python in this area differs between linux / windows and python 2 / python 3. For example this fix appears to be fine for linux, however Windows with python 2 fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '\xc2\xa3.txt'

I do not like the reliance on the locale however the upstream buildbots all appear to have it installed. Maybe the test should be split into a windows test and a linux test?

Mmmh, I need a Windows system to try more then. I guess the current code will output the pound sign into whatever encoding Windows use (which I guess is not UTF-8 but then how does the echo few lines above creates the file correct). Thanks for reverting the commit and sorry for the breakage.

So it's lit that processes the redirection and I guess the file will be read as a UTF-8, decoded into unicode and then on Windows this will output a filename in UTF-16.

Does llvm-ar espects UTF-8 specifically in archive members or can it be anything (e.g. UTF-16)?

thopre added a comment.Oct 4 2019, 6:44 AM

You are correct that the locale is required to pass on linux. I had some trouble with this test as the behaviour of python in this area differs between linux / windows and python 2 / python 3. For example this fix appears to be fine for linux, however Windows with python 2 fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '\xc2\xa3.txt'

I do not like the reliance on the locale however the upstream buildbots all appear to have it installed. Maybe the test should be split into a windows test and a linux test?

Mmmh, I need a Windows system to try more then. I guess the current code will output the pound sign into whatever encoding Windows use (which I guess is not UTF-8 but then how does the echo few lines above creates the file correct). Thanks for reverting the commit and sorry for the breakage.

So it's lit that processes the redirection and I guess the file will be read as a UTF-8, decoded into unicode and then on Windows this will output a filename in UTF-16.

Does llvm-ar espects UTF-8 specifically in archive members or can it be anything (e.g. UTF-16)?

Could you try this change on Windows instead:

RUN: env LANG=C.UTF-8 %python -c "assert open(u'\xA3.txt', 'rb').read() == b'contents\n'"