With python3 there is a difference between the length of the string and the length of the utf-8 encoded bytes array. To not cut off characters at the end when the string contains multi-byte characters, the length of file contents that gets passed to clang needs to be calculated from their bytes representation.
I also added a test case that catches this. I needed to add the coding line at the top of the test unit to make python2 work with the embedded Unicode character. Alternatively we could replace the character with /uXXXX, but then there would be other problems with python2.