Page MenuHomePhabricator

[lit] Fix UnicodeEncodeError when test commands contain non-ASCII chars

Authored by mgorny on Jun 13 2019, 5:13 AM.



Ensure that the bash script written by lit TestRunner is open with UTF-8
encoding when using Python 3. Otherwise, attempt to write non-ASCII
characters causes UnicodeEncodeError. This happened e.g. with
the following LLD test:

UNRESOLVED: lld :: ELF/format-binary-non-ascii.s (657 of 2119)
******************** TEST 'lld :: ELF/format-binary-non-ascii.s' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "/home/mgorny/llvm-project/llvm/utils/lit/lit/", line 63, in _execute_test
    result = test.config.test_format.execute(test, lit_config)
  File "/home/mgorny/llvm-project/llvm/utils/lit/lit/formats/", line 25, in execute
  File "/home/mgorny/llvm-project/llvm/utils/lit/lit/", line 1644, in executeShTest
    res = _runShTest(test, litConfig, useExternalSh, script, tmpBase)
  File "/home/mgorny/llvm-project/llvm/utils/lit/lit/", line 1590, in _runShTest
    res = executeScript(test, litConfig, tmpBase, script, execdir)
  File "/home/mgorny/llvm-project/llvm/utils/lit/lit/", line 1157, in executeScript
    f.write('{ ' + '; } &&\n{ '.join(commands) + '; }')
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 274: ordinal not in range(128)

Diff Detail


Event Timeline

mgorny created this revision.Jun 13 2019, 5:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2019, 5:13 AM
Herald added a subscriber: delcypher. · View Herald Transcript

LGTM. Can you add a test case in lit/tests for that?

mgorny updated this revision to Diff 204557.Jun 13 2019, 8:43 AM

Here's a proposed test case. I've verified that it failed (became unresolved) without my patch.

rnk accepted this revision.Jun 13 2019, 10:23 AM


1139 ↗(On Diff #204557)

I was thinking perhaps this should be if not elif, but I've run the ELF LLD test suite successfully on Windows with Python 3 and it works fine as is.

This revision is now accepted and ready to land.Jun 13 2019, 10:23 AM
mgorny marked an inline comment as done.Jun 13 2019, 11:36 AM
mgorny added inline comments.
1139 ↗(On Diff #204557)

This is elif because encoding only applies when mode is w and not wb.

ruiu accepted this revision.Jun 14 2019, 6:17 AM


1137–1141 ↗(On Diff #204557)

I'd perhaps write this this way,

f = None
if litConfig.isWindows and not isWin32CMDEXE:
  f = open(script, 'wb')
elif sys.version_info > (3,0):
  f = open(script, 'w', encoding='utf-8')
  f = open(script, 'w')

so that it is clear what values are passed to open, but that's probably my personal preference.

mgorny marked an inline comment as done.Jun 14 2019, 6:23 AM
mgorny added inline comments.
1137–1141 ↗(On Diff #204557)

Sure but the code below relies on mode being defined :-(. I personally think the whole function here is a horrible hack but I currently don't have a good idea how to write it better. I mean, how does it make sense to have two copies of everything with different string types in order to control endlines as a side effect? Even conditionally putting \r\n would be cleaner than that. In fact, maybe I'll rewrite it to do just that…

This revision was automatically updated to reflect the committed changes.

Anyway, committed as-is to resolve the issue at hand and I'll think how to rewrite it best.

@rnk, do you happen to know whether we are using UTF-8 on Windows these days or some system-native encoding?