This is an archive of the discontinued LLVM Phabricator instance.

[lit] Remove ANSI control character from xunit output
ClosedPublic

Authored by arichardson on Jul 21 2020, 5:20 AM.

Details

Summary

Failing test output sometimes contains control characters like \x1b (e.g.
if there was some -fcolor-diagnostics output) which are not allowed inside
XML files. This causes problems with CI systems: for example, the Jenkins
JUnit XML will throw an exception when ecountering those characters and
similar problems also occur with GitLab CI.

Diff Detail

Event Timeline

arichardson created this revision.Jul 21 2020, 5:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2020, 5:20 AM
yln added a comment.Jul 22 2020, 12:17 PM

Thanks for adding a test, we definitely want to keep it!

Implementation: do you think that this is the best approach to solving this issue? Have you considered other options?
Is this a new issue or was this caused by my "report generators" refactoring?

yln added inline comments.Jul 22 2020, 12:21 PM
llvm/utils/lit/lit/reports.py
132–137

I just spotted a bug in this existing code, decode() returns a new string. It should be:

output = output.decode('utf-8', 'ignore')

Can you check if this would resolve your issue?

arichardson marked an inline comment as done.Jul 23 2020, 3:01 AM
In D84233#2167650, @yln wrote:

Thanks for adding a test, we definitely want to keep it!

Implementation: do you think that this is the best approach to solving this issue? Have you considered other options?
Is this a new issue or was this caused by my "report generators" refactoring?

No this issue has been around for a while, I originally added a workaround to our fork in 2018 (https://github.com/CTSRD-CHERI/llvm-project/commit/3179f32c07f224c96d33f485747f508a34106c91).
I'm currently trying to reduce our diff to upstream and cleaning up local workarounds for upstreaming.

llvm/utils/lit/lit/reports.py
132–137

Good catch. But I don't believe this fixes the issue since those characters are valid utf-8 characters just not ones that the Jenkins JUnit XML parser accepts.

Fix missing assignment

yln added a subscriber: jdenny.Jul 23 2020, 11:43 AM

@arichardson: can you double-check that this workaround is still needed?
Do we understand the semantics of CDATA blocks? I was under the impression we use it here to avoid problems like this.

Anyways, I am fine with this. Adding Joel as a second reviewer to get his feedback before accepting.

@jdenny:
Do you have opinions on this? Is llvm-lit responsible for this (or should the fix be in Jenkins/Jenkin's config)?
Any other reviewers we should include?

yln added a reviewer: jdenny.Jul 23 2020, 11:43 AM
In D84233#2170305, @yln wrote:

@arichardson: can you double-check that this workaround is still needed?
Do we understand the semantics of CDATA blocks? I was under the impression we use it here to avoid problems like this.

Anyways, I am fine with this. Adding Joel as a second reviewer to get his feedback before accepting.

I believe CDATA just avoids the need for escape XML special characters. However, characters 0-0x20 (with the exception of \t \r an \n are not valid anywhere in the document according to the XML spec (https://www.w3.org/TR/xml/#charsets):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

XML 1.1 seems to relax that an allow everything except NUL: https://www.w3.org/TR/xml11/#charsets

Maybe specifying version 1.1 for the XUnit output would make the Java parsers happy again, but escaping ANSI control characters might also be useful if you open the report XML file in a text editor.

Thanks for working on this. I've been wanting to see this fixed too. I agree that lit is the culprit.

Maybe specifying version 1.1 for the XUnit output would make the Java parsers happy again, but escaping ANSI control characters might also be useful if you open the report XML file in a text editor.

I'm not sure which fix is best:

  • The current patch: If I understand correctly, the current patch's escaping for XML 1.0 is not reversible: the escape sequences are indistinguishable from any existing occurrences of the two-character sequence \x.
  • Drop control characters instead: That would make the output more readable. It loses even more information, but that information's value is questionable.
  • Use a reversible escape sequence: This doesn't seem worth it and might make simple output harder to read.
  • XML 1.1: It looks like that would avoid the above issues (I think it's best to just drop nul bytes). Does anyone know how widely supported XML 1.1 is today? A quick google search turned up info from a decade ago saying it was not widely supported then.

If we cannot come to a conclusion soon, I think the current patch is an improvement, so we should commit it and then consider possible improvements.

llvm/utils/lit/lit/reports.py
72

Please add a comment here citing the relevant part of the XML 1.0 spec.

136

I had similar trouble with Gitlab CI around the end of last year. Maybe the main point this comment should make is that the XML is invalid without this escaping, and then the comment can mention CI failures as a practical consequence.

llvm/utils/lit/tests/shtest-format.py
35

Is this line intentional?

  • Improve comments
  • Remove unnecessary line from test
jdenny accepted this revision.Jul 27 2020, 10:27 AM

Other than the nit I just added about comment duplication, this LGTM. Thanks. However, let's a wait a couple of days in case anyone else has an opinion.

I still wonder whether it would be more helpful to drop control characters rather than irreversibly escape them. Dropping them makes the output more readable. Reversibly escaping them makes it possible to feed the text to a tool that can render the control codes. Irreversibly escaping them doesn't achieve either of those advantages, and I'm not sure if it serves a real use case. Any comments on this question?

llvm/utils/lit/lit/reports.py
142

You indeed did what I asked, but I didn't consider that we would end up with duplication of some comments at the caller and callee.

I think it would be better to consolidate these comments with the ones you just inserted into the escape_invalid_xml_chars definition, and place the consolidated version as header comments on escape_invalid_xml_chars and nowhere else.

This revision is now accepted and ready to land.Jul 27 2020, 10:27 AM
yln added a comment.Jul 28 2020, 10:17 AM

I still wonder whether it would be more helpful to drop control characters rather than irreversibly escape them. Dropping them makes the output more readable. Reversibly escaping them makes it possible to feed the text to a tool that can render the control codes. Irreversibly escaping them doesn't achieve either of those advantages, and I'm not sure if it serves a real use case. Any comments on this question?

I agree with Joel here and would like to see this done as part of this patch.

yln added inline comments.Jul 28 2020, 10:18 AM
llvm/utils/lit/lit/reports.py
76

Should this be a global to avoid building the dict on every call to this function?

arichardson added inline comments.Jul 28 2020, 2:07 PM
llvm/utils/lit/lit/reports.py
76

Yes, this could be called many times so that definitely makes sense.

Will update the patch tomorrow.

  • Remove instead of escaping. This avoids a py2 -> py3 difference
  • fix comments
arichardson retitled this revision from [lit] Escape ANSI control character in xunit output to [lit] Remove ANSI control character from xunit output.Aug 3 2020, 3:57 AM
arichardson edited the summary of this revision. (Show Details)

fix python2 issue

arichardson marked 3 inline comments as done.

remove debug code

yln accepted this revision.Aug 3 2020, 10:57 AM

LGTM with two small nits. Thanks!

llvm/utils/lit/lit/reports.py
71

This is now an _invalid_xml_chars_dict

83

Why reassign s? Just return.

arichardson added inline comments.Aug 3 2020, 11:11 AM
llvm/utils/lit/lit/reports.py
83

Leftover from when I had debug statements there. Will remove before committing.

arichardson marked 5 inline comments as done.

fix variable name and remove temporary

This revision was landed with ongoing or failed builds.Aug 6 2020, 1:17 AM
This revision was automatically updated to reflect the committed changes.