This is an archive of the discontinued LLVM Phabricator instance.

[lit] Fix problem in how Python versions open files with different encodings
ClosedPublic

Authored by asmith on Feb 10 2018, 12:38 PM.

Details

Summary

This issue was found when running the clang unit test on Windows. Python 3.x cannot open some of the files that the tests are using with a simple open because of their encoding. Python 2.7+ and Python 3.x both support io.open which allows for an encoding to be specified.

This change will determine whether two files being compared should be opened (and then compared) as text or binary and whether to use utf-8 or the default encoding before proceeding with a line-by-line comparison.

Patch by Stella Stamenova!

Diff Detail

Repository
rL LLVM

Event Timeline

asmith created this revision.Feb 10 2018, 12:38 PM

Using utf-8 instead of latin-1 causes errors with some tests. Here's an example.

UNRESOLVED: Clang :: Driver/clang-offload-bundler.c (4201 of 11896)

  • TEST 'Clang :: Driver/clang-offload-bundler.c' FAILED ****

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte

An alternative to this change is to try and open the file and if it fails, retry with Latin-1.

The issue is tests are generating binary data then opening it in python and encodings are different on Windows.

Ex. clang -c foo.c; open foo.obj

zturner accepted this revision.Feb 14 2018, 10:39 AM
zturner added a reviewer: rnk.

I don't have a strong opinion, but this seems fine. Feel free to commit and watch the bots to see if any other platforms break.

This revision is now accepted and ready to land.Feb 14 2018, 10:39 AM
MatzeB added a subscriber: MatzeB.Feb 14 2018, 10:47 AM

'Latin-1' seems completely arbitrary and wrong to me. You should rather try to get a stream of bytes rather than have python try to decode the file.

'Latin-1' seems completely arbitrary and wrong to me. You should rather try to get a stream of bytes rather than have python try to decode the file.

Have you tried open(file, 'rb') or does that not work with difflib?

A couple of things:

  1. Aaron actually forgot to update the change, the final revision should first try to open as before and then fall back to Latin-1 on a decode exception.
  2. This is not a Windows specific failure - it is a python 3 issue, so running the tests with python 3 on any platform fails to open the file when it's a binary file with just 'r'
  3. And yes, 'rb' causes failures in difflib later because it expects a string and throws an exception when it is given an array of bytes

Hi MatzeB,

We don't have any other ideas for how to fix this and I'm wondering if you are okay with committing this so that the lldb unit tests will pass on Windows (with a few more changes)?

I can commit this change, or I can commit a different change that catches the exception when open fails and reopens in 'Latin-1', or we can try something else like marking the tests as XFAIL on Windows and letting the original authors fix them.

Aaron

Reading python docu and playing a bit with a prototype I believe this is the way to do things:

import sys
import difflib
f = open(sys.argv[1], 'rb').readlines()
f2 = open(sys.argv[2], 'rb').readlines()
if hasattr(difflib, 'diff_bytes'):
    # python 3.5 or newer
    gen = difflib.diff_bytes(difflib.unified_diff, f, f2)
    sys.stdout.buffer.writelines(gen)
else:
    # python 2.7
    gen = difflib.unified_diff(f, f2)
    sys.stdout.writelines(gen)

(python 3.0-3.4 difflib before diff_bytes appears to be broken indeed for inconsistent/invalid encoded inputs, but I hope we can just ignore old 3.x versions...)

Reading python docu and playing a bit with a prototype I believe this is the way to do things:

import sys
import difflib
f = open(sys.argv[1], 'rb').readlines()
f2 = open(sys.argv[2], 'rb').readlines()
if hasattr(difflib, 'diff_bytes'):
    # python 3.5 or newer
    gen = difflib.diff_bytes(difflib.unified_diff, f, f2)
    sys.stdout.buffer.writelines(gen)
else:
    # python 2.7
    gen = difflib.unified_diff(f, f2)
    sys.stdout.writelines(gen)

(python 3.0-3.4 difflib before diff_bytes appears to be broken indeed for inconsistent/invalid encoded inputs, but I hope we can just ignore old 3.x versions...)

Given that you're aware of this issue I don't think we should ignore this problem. I don't want weird diff failures because a machine I happen to be using is using a old Python 3 version. This would be a serious PITA to debug.

If python >= 3.5 is required when using python 3 this seems like a reasonable requirement given that this only matters for the internal shell which is used on Windows (where no python is shipped by default so upgrading python should be fairly painless).

If we're doing a binary diff on Python 3.0-3.4 I think we should just emit an error explaining the problem and fail the currently running test.

@asmith Could you please add a test case to lit's test suite for this problem?

Reading python docu and playing a bit with a prototype I believe this is the way to do things:

import sys
import difflib
f = open(sys.argv[1], 'rb').readlines()
f2 = open(sys.argv[2], 'rb').readlines()
if hasattr(difflib, 'diff_bytes'):
    # python 3.5 or newer
    gen = difflib.diff_bytes(difflib.unified_diff, f, f2)
    sys.stdout.buffer.writelines(gen)
else:
    # python 2.7
    gen = difflib.unified_diff(f, f2)
    sys.stdout.writelines(gen)

(python 3.0-3.4 difflib before diff_bytes appears to be broken indeed for inconsistent/invalid encoded inputs, but I hope we can just ignore old 3.x versions...)

This actually has another issue (other than python 3.0 - 3.4). After we read the lines from the file, the code assumes that the lines are all strings for the purpose of ignoring whitespace.

So we have three options, none of which are 100% safe if we want to keep the change localized to here (options are below).

There's a fourth option which is to re-write the tests to explicitly compare binary files (I think somebody already suggested that). This would mean adding a function here which compares binary files (or at least an option to the existing function) and changing the tests. Some of the tests right now compare all the files in two folders, rather than individual files, so the tests would need changes to compare all files EXCEPT the binary files and then explicitly the binary files. This is not a trivial change either and we would have to make sure that we find all such tests in LLVM, LLD, Clang and LLDB.

Here are the options if we want to make a change just to the TestRunner. I've not run ALL the tests across LLVM, LLDB, Clang and LLD with any of these changes yet.

  1. Open as binary, then decode, do whitespace operations, encode again and do the comparison with diff_bytes or unified_diff. We are assuming we can safely remove whitespace characters from binary files and we have an issue with python 3.0 - 3.4. The decode is using the same encoding as diff_bytes.
with open(file, 'rb') as f:
    filelines.append(f.readlines())
for idx, lines in enumerate(filelines):
    filelines[idx]= [line.decode('ascii', 'surrogateescape') for line in lines]

# do white space operations

for idx, lines in enumerate(filelines):
    filelines[idx]= [line.encode('ascii', 'surrogateescape') for line in lines]

if hasattr(difflib, 'diff_bytes'):
    # python 3.5 or newer
    diffs = difflib.diff_bytes(difflib.unified_diff, filelines[0], filelines[1], filepaths[0].encode(), filepaths[1].encode())
    diffs = [diff.decode() for diff in diffs]
else:
    # python 2.7
    func = difflib.unified_diff if unified_diff else difflib.context_diff
    diffs = func(filelines[0], filelines[1], filepaths[0], filepaths[1])
  1. Open as binary, then decode and proceed as usual. We are assuming we can safely remove whitespace characters from binary files and that comparing the decoded strings will produce the desired results. The decode is using the same encoding as diff_bytes.
with open(file, 'rb') as f:
    filelines.append(f.readlines())
for idx, lines in enumerate(filelines):
    filelines[idx]= [line.decode('ascii', 'surrogateescape') for line in lines]
  1. Try to open the file assuming it's text and if we fail, open as binary. Then compare as binary or string depending on how the file was opened. This is more complicated and we still have an issue for python versions between 3.0 and 3.4 BUT it preserves the original logic as much as possible and we're not removing whitespace characters from binary files.
compare_bytes = False
for file in filepaths:
    if compare_bytes is False:
        # Assume every file is a text file
        try:
            with open(file, 'r') as f:
                filelines.append(f.readlines())
        except UnicodeDecodeError:
            with open(file, 'rb') as f:
                filelines.append(f.readlines())
                compare_bytes = True
    else:
        # Since we are opening two files to compare against each other, ensure that both files are opened as binary if one is
        with open(file, 'rb') as f:
            filelines.append(f.readlines())

if compare_bytes is False:
    # perform logic to ignore whitespace

if compare_bytes is True:
    if hasattr(difflib, 'diff_bytes'):
        # python 3.5 or newer, we will use diff_bytes
        compare_bytes = True
    else:
        # python 2.7, we can compare the bytes as strings
        compare_bytes = False

if compare_bytes:
    # python 3.5 or newer
    diffs = difflib.diff_bytes(difflib.unified_diff, filelines[0], filelines[1], filepaths[0].encode(), filepaths[1].encode())
    # To be able to print the diffs below, we need to decode them to strings
    diffs = [diff.decode() for diff in diffs]
else:
    # python 2.7
    func = difflib.unified_diff if unified_diff else difflib.context_diff
    diffs = func(filelines[0], filelines[1], filepaths[0], filepaths[1])

I checked the code and this happens indirectly inside of a call to compareDirectoryTrees. So indeed, it would be hard to explicitly request a binary diff. Instead, could we detect if a file is binary, and if it is fall back to a binary diff? For example, using code like this. We could put this in a function named lit.util.is_binary(file) and inside of compareDirectoryTrees we could call this function and then branch to the appropriate diff function.

Thoughts?

I checked the code and this happens indirectly inside of a call to compareDirectoryTrees. So indeed, it would be hard to explicitly request a binary diff. Instead, could we detect if a file is binary, and if it is fall back to a binary diff? For example, using code like this. We could put this in a function named lit.util.is_binary(file) and inside of compareDirectoryTrees we could call this function and then branch to the appropriate diff function.

Thoughts?

There's also mimetypes.guess_type(file) which might be helpful here.

I checked the code and this happens indirectly inside of a call to compareDirectoryTrees. So indeed, it would be hard to explicitly request a binary diff. Instead, could we detect if a file is binary, and if it is fall back to a binary diff? For example, using code like this. We could put this in a function named lit.util.is_binary(file) and inside of compareDirectoryTrees we could call this function and then branch to the appropriate diff function.

Thoughts?

There's also mimetypes.guess_type(file) which might be helpful here.

Another option, what if we compared all files as binary? We aren't actually interested in the specific places where files differ, we only ever care that they do differ. This wouldn't work if we compare generated files against checked in files, because in that case we'd need line ending translation, but if we're only ever comparing generated files against generated files, it seems like binary should be sufficient?

Thoughts?

I don't think we should compare all files as binary. The code has input options to specify how whitespace should be treated and if we compared all files as binary, we would explicitly be making a change which breaks that functionality.

I like the idea of deciding in compareDirectoryTrees whether to compare the files as binary or text - we would have to add a second comparison function, but the code would be much easier to read. Since mimetypes.guess_type(file) decides the type of file based on the extension, I don't think we can rely on it as a lot of the files we are comparing have interesting extensions.

We could do something like http://code.activestate.com/recipes/173220/ as Zachary suggested or we could try to open the file as text (open(file, 'r'), then read lines) and if that fails, treat the file as binary. The first has the benefit of not having to open each file an extra time but I am not sure how reliable the logic is, the second will always tell us whether in our version of python we can treat the file as text, but it does require us to open the file and read it an extra time.

How about this? Open the file as text without specifying an encoding (this should use the default encoding). If that fails, open the file as UTF-8. If that fails, open the file as binary.

The approach in the ActiveState post doesn't handle UTF properly, but I think "either ASCII or UTF8" handles 99.99% of what types of encodings we want to treat as text.

asmith updated this revision to Diff 140661.Apr 2 2018, 12:20 PM
asmith edited the summary of this revision. (Show Details)

This is changed based on the reviewers suggestions. First open as usual and on failure reopen as UTF8 and when that fails reopen as binary. Verified on Windows and Linux.

asmith edited the summary of this revision. (Show Details)Apr 2 2018, 12:54 PM
asmith updated this revision to Diff 140685.Apr 2 2018, 1:50 PM
asmith updated this revision to Diff 140688.Apr 2 2018, 1:59 PM
This revision was automatically updated to reflect the committed changes.
rnk added inline comments.Apr 2 2018, 2:28 PM
llvm/trunk/utils/lit/lit/TestRunner.py
434 ↗(On Diff #140690)

I'm getting this Python error now:

  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 545, in executeBuiltinDiff
    exitCode = compareTwoFiles(filepaths)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 408, in compareTwoFiles
    return compareTwoTextFiles(filepaths, encoding)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 434, in compareTwoTextFiles
    with open(file, 'r', encoding=encoding) as f:
TypeError: 'encoding' is an invalid keyword argument for this function

This doesn't seem Py2/3 compatible. =/

asmith added a subscriber: zturner.Apr 2 2018, 2:53 PM

I committed a one line fix before noticing you reverted the change.
Did you happen to check if the fix works for you?

rnk added a comment.Apr 2 2018, 3:25 PM

I committed a one line fix before noticing you reverted the change.
Did you happen to check if the fix works for you?

I had do do a few more fixes, and I still have these test failures:

diff --git a/llvm/utils/lit/lit/TestRunner.py b/llvm/utils/lit/lit/TestRunner.py
index 02745e62dcf..d0c7417c248 100644
--- a/llvm/utils/lit/lit/TestRunner.py
+++ b/llvm/utils/lit/lit/TestRunner.py
@@ -12,6 +12,7 @@ import shutil
 import tempfile
 import threading

+import io
 try:
     from StringIO import StringIO
 except ImportError:
@@ -388,6 +389,8 @@ def executeBuiltinDiff(cmd, cmd_shenv):
     def compareTwoFiles(filepaths):
         filelines = []
         for file in filepaths:
+            compare_bytes = False
+            encoding = None
             try:
                 with open(file, 'r') as f:
                     filelines.append(f.readlines())
...

Unresolved Tests (5):
    LLVM :: ThinLTO/X86/distributed_import.ll
    LLVM :: Transforms/ThinLTOBitcodeWriter/no-type-md.ll
    LLVM :: Transforms/ThinLTOBitcodeWriter/split.ll
    LLVM :: Transforms/ThinLTOBitcodeWriter/unsplittable.ll
    LLVM :: tools/llvm-objcopy/drawf-fission.test

The failures look like:

$ ./bin/llvm-lit.py  -v ../llvm/test/ThinLTO/X86/distributed_import.ll
-- Testing: 1 tests, 1 threads --
UNRESOLVED: LLVM :: ThinLTO/X86/distributed_import.ll (1 of 1)
******************** TEST 'LLVM :: ThinLTO/X86/distributed_import.ll' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "C:/src/llvm-project/llvm\utils\lit\lit\run.py", line 202, in _execute_test_impl
    result = test.config.test_format.execute(test, lit_config)
  File "C:/src/llvm-project/llvm\utils\lit\lit\formats\shtest.py", line 25, in execute
    self.execute_external)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 1534, in executeShTest
    res = _runShTest(test, litConfig, useExternalSh, script, tmpBase)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 1482, in _runShTest
    res = executeScriptInternal(test, litConfig, tmpBase, script, execdir)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 994, in executeScriptInternal
    exitCode, timeoutInfo = executeShCmd(cmd, shenv, results, timeout=litConfig.maxIndividualTestTime)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 151, in executeShCmd
    finalExitCode = _executeShCmd(cmd, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 717, in _executeShCmd
    res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 722, in _executeShCmd
    res = _executeShCmd(cmd.rhs, shenv, results, timeoutHelper)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 775, in _executeShCmd
    cmdResult = executeBuiltinDiff(cmd.commands[0], shenv)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 545, in executeBuiltinDiff
    exitCode = compareTwoFiles(filepaths)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 408, in compareTwoFiles
    return compareTwoTextFiles(filepaths, encoding)
  File "C:/src/llvm-project/llvm\utils\lit\lit\TestRunner.py", line 435, in compareTwoTextFiles
    filelines.append(f.readlines())
  File "C:\Python27\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 42: character maps to <undefined>

Some kind of binary file comparison?

asmith added inline comments.Apr 2 2018, 3:38 PM
llvm/trunk/utils/lit/lit/TestRunner.py
434 ↗(On Diff #140690)

That's a merge issue on my part that I just fixed.