This patch implements detection of incomplete bidirectional sequence withing comments and string literals within clang-tidy.
It detects the bidi part of https://www.trojansource.codes/trojan-source.pdf
Differential D112913
Misleading bidirectional detection serge-sans-paille on Nov 1 2021, 2:19 AM. Authored by
Details
This patch implements detection of incomplete bidirectional sequence withing comments and string literals within clang-tidy. It detects the bidi part of https://www.trojansource.codes/trojan-source.pdf
Diff Detail
Event TimelineComment Actions Nits and a suggested approach for invalid code sequences that's probably important to handle better. Please fix the clang-tidy findings too. Otherwise, LGTM.
Comment Actions Nice addition! Please add this check to the documentation (list of available checks + individual page with the documentation for this check), plus mention in the clang-tidy release notes. The same applies to the other related patches.
Comment Actions
If you don't use arc diff 'HEAD^' to upload a Diff, please use -U99999 https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface
Comment Actions
Comment Actions I'm not familiar with LLVM / Clang codebase. I was asked by @MaskRay to help review the bidi-related part of this patch. Generally I believe the algorithm here is too naive that it produces both false positives and false negatives. I'm not sure how important those edge cases are considered, but anyway... detailed comments below:
Comment Actions @upsuper I've not added extra test cases yet, but does it looks it's heading in the right direction? Comment Actions I think the core algorithm looks correct now. I'll leave the code review to LLVM reviewers. Thanks. Comment Actions Has this been tested against any large code bases that use bidirectional characters to see what the false positive rate is? Also, have you seen the latest Unicode guidance on this topic: https://unicode.org/L2/L2022/22007-avoiding-spoof.pdf to make sure we're following along? Comment Actions @aaron.ballman unfortunately I don't know any of those. If I recall correctly we found no software in the RedHat collection actually using those control characters :-/ Comment Actions I'd like to clarify that what I think is correct now is the algorithm to detect unclosed explicit formatting scopes in a given string. I haven't been following very closely with the whole spoofing issue, so I can't say that there is no other ways to construct a spoof that this algorithm is not designed to detect. As you have found, RLM, and ALM can be used to confuse code reader, but they are not much different than a string with other strong RTL characters inside, and I don't quite see how that can be linted without hurting potentially legitimate code. Maybe if the compiler supports treating LRM as whitespace (I'm not sure whether Clang does), a lint may be added to ask wrapping any string with outermost strong characters being RTL in the form of {LRM}"string"{LRM} so that the RTL characters don't affect outside. Other than that, I don't think there is anyway to lint against such a confusion. Comment Actions Thanks for confirming. This check only detects unterminated bidi sequence within comments and string literals. Its scope limits to that aspect.
Agreed. FYI we already have a check for RTL characters ending identfiers, and a pending one for confusable identifiers
I agree that allowing LRM is a good step forward, and that's part of the official recommendation, but orthogonal to that review.
|
Nit: please keep alphabetical order in the list of jobs.