Failing test output sometimes contains control characters like \x1b (e.g.
if there was some -fcolor-diagnostics output) which are not allowed inside
XML files. This causes problems with CI systems: for example, the Jenkins
JUnit XML will throw an exception when ecountering those characters and
similar problems also occur with GitLab CI.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for adding a test, we definitely want to keep it!
Implementation: do you think that this is the best approach to solving this issue? Have you considered other options?
Is this a new issue or was this caused by my "report generators" refactoring?
llvm/utils/lit/lit/reports.py | ||
---|---|---|
132–137 | I just spotted a bug in this existing code, decode() returns a new string. It should be: output = output.decode('utf-8', 'ignore') Can you check if this would resolve your issue? |
No this issue has been around for a while, I originally added a workaround to our fork in 2018 (https://github.com/CTSRD-CHERI/llvm-project/commit/3179f32c07f224c96d33f485747f508a34106c91).
I'm currently trying to reduce our diff to upstream and cleaning up local workarounds for upstreaming.
llvm/utils/lit/lit/reports.py | ||
---|---|---|
132–137 | Good catch. But I don't believe this fixes the issue since those characters are valid utf-8 characters just not ones that the Jenkins JUnit XML parser accepts. |
@arichardson: can you double-check that this workaround is still needed?
Do we understand the semantics of CDATA blocks? I was under the impression we use it here to avoid problems like this.
Anyways, I am fine with this. Adding Joel as a second reviewer to get his feedback before accepting.
@jdenny:
Do you have opinions on this? Is llvm-lit responsible for this (or should the fix be in Jenkins/Jenkin's config)?
Any other reviewers we should include?
I believe CDATA just avoids the need for escape XML special characters. However, characters 0-0x20 (with the exception of \t \r an \n are not valid anywhere in the document according to the XML spec (https://www.w3.org/TR/xml/#charsets):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
XML 1.1 seems to relax that an allow everything except NUL: https://www.w3.org/TR/xml11/#charsets
Maybe specifying version 1.1 for the XUnit output would make the Java parsers happy again, but escaping ANSI control characters might also be useful if you open the report XML file in a text editor.
Thanks for working on this. I've been wanting to see this fixed too. I agree that lit is the culprit.
I'm not sure which fix is best:
- The current patch: If I understand correctly, the current patch's escaping for XML 1.0 is not reversible: the escape sequences are indistinguishable from any existing occurrences of the two-character sequence \x.
- Drop control characters instead: That would make the output more readable. It loses even more information, but that information's value is questionable.
- Use a reversible escape sequence: This doesn't seem worth it and might make simple output harder to read.
- XML 1.1: It looks like that would avoid the above issues (I think it's best to just drop nul bytes). Does anyone know how widely supported XML 1.1 is today? A quick google search turned up info from a decade ago saying it was not widely supported then.
If we cannot come to a conclusion soon, I think the current patch is an improvement, so we should commit it and then consider possible improvements.
llvm/utils/lit/lit/reports.py | ||
---|---|---|
72 | Please add a comment here citing the relevant part of the XML 1.0 spec. | |
136 | I had similar trouble with Gitlab CI around the end of last year. Maybe the main point this comment should make is that the XML is invalid without this escaping, and then the comment can mention CI failures as a practical consequence. | |
llvm/utils/lit/tests/shtest-format.py | ||
35 | Is this line intentional? |
Other than the nit I just added about comment duplication, this LGTM. Thanks. However, let's a wait a couple of days in case anyone else has an opinion.
I still wonder whether it would be more helpful to drop control characters rather than irreversibly escape them. Dropping them makes the output more readable. Reversibly escaping them makes it possible to feed the text to a tool that can render the control codes. Irreversibly escaping them doesn't achieve either of those advantages, and I'm not sure if it serves a real use case. Any comments on this question?
llvm/utils/lit/lit/reports.py | ||
---|---|---|
142 | You indeed did what I asked, but I didn't consider that we would end up with duplication of some comments at the caller and callee. I think it would be better to consolidate these comments with the ones you just inserted into the escape_invalid_xml_chars definition, and place the consolidated version as header comments on escape_invalid_xml_chars and nowhere else. |
llvm/utils/lit/lit/reports.py | ||
---|---|---|
76 | Should this be a global to avoid building the dict on every call to this function? |
llvm/utils/lit/lit/reports.py | ||
---|---|---|
76 | Yes, this could be called many times so that definitely makes sense. Will update the patch tomorrow. |
llvm/utils/lit/lit/reports.py | ||
---|---|---|
83 | Leftover from when I had debug statements there. Will remove before committing. |
This is now an _invalid_xml_chars_dict