This patch implements detection of confusable identifiers within clang-tidy. It only considers identifiers whose declaration scope may conflict with.
It detects the homoglyph part of https://www.trojansource.codes/trojan-source.pdf
Differential D112916
[clang-tidy] Confusable identifiers detection serge-sans-paille on Nov 1 2021, 2:24 AM. Authored by
Details This patch implements detection of confusable identifiers within clang-tidy. It only considers identifiers whose declaration scope may conflict with. It detects the homoglyph part of https://www.trojansource.codes/trojan-source.pdf
Diff Detail
Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes Comment Actions Personally, I'm uncomfortable with this check because it's not really covering homoglyphs in general, it's covering homoglyphs outside of the usual Latin character set. For example, there's no attempt to catch other problematic homoglyph attack vectors like 1 vs l vs I or O vs 0. The result is: this comes across feeling like it targets non-Latin character scripts as being dangerous but Latin character scripts are fine when that's not really accurate. That's not a path I feel comfortable walking down because it treats programmers differently depending on what their native language is (at least, as used within source code). Adding some folks from the WG21 Unicode study group (SG16) as they may also have interesting input and advice. Comment Actions @aaron.ballman Thanks for the ping. One one hand, I agree with you, on the other hand, this tries to stick to TR39, and I think we should stick with that. It might be worth checking with the Unicode consortium what they think of i/l as confusable. I think this is a rather good approach as a clang-tidy plugin (and we can gain experience from it to iterate). I'm however rather concern that we are not following the spec in that there is no NFD decomposition happening (and I am aware this increases the workload a lot for the author).
Comment Actions I have no concerns once the licensing question is resolved and the other reviewers are happy. I agree with Corentin that we should be exactly following the Unicode Consortium's recommendations. I think that while this doesn't fully address Aaron's concern, it at least gives us a rationale for our treatment of different scripts and seems more defensible than our coming up with some additional rules of our own. Comment Actions Actually @aaron.ballman, Unicode does consider these confusables already from https://www.unicode.org/Public/security/14.0.0/confusables.txt 0031 ; 006C ; MA # ( 1 → l ) DIGIT ONE → LATIN SMALL LETTER L # 0049 ; 006C ; MA # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L # 0030 ; 004F ; MA # ( 0 → O ) DIGIT ZERO → LATIN CAPITAL LETTER O # So ASCII is already taken care of. No issue here :) Comment Actions Oh, so it is, good catch! I looked for it before, but missed it. FWIW, I agree with sticking with what Unicode defines here. (I am worried about the results in practice and whether enough people will use this check to warrant its inclusion in clang-tidy, but not enough to block this patch.)
Comment Actions @tstellar, is there an update from the LLVM Board of Directors regarding using Unicode data? Comment Actions I haven't checked, but this probably doesn't do the right thing in cross builds (eg building a Mac/arm binary on a Mac/Intel machine - i think the gen_confusables binary will likely end up being an arm binary then, and it won't be able to run during the build). See https://reviews.llvm.org/D126397 and discussion there for how to handle that.
Comment Actions This also makes clang-tidy crash for fairly normal inputs, see below.
Comment Actions
Comment Actions
Add [clang-tidy]
Comment Actions @tstellar -- Do we have any license requirements we need to follow? @rsmith pointed out:
which sounds like we need to update one of our license files. (I'll assume the same answer applies for the Unicode named escape sequences patch happening in Clang currently.)
Comment Actions Because of the stability issues related to getName()-like constructs I'm putting a temporary ❌ on this (so it doesn't show up as faux accept). However, I have to emphasise that I do like the idea of the check!
Comment Actions So that this doesn't get lost, I'm opposed to this change because it does not actually follow the TR39 guidance or algorithm. For an input string X, define skeleton(X) to be the following transformation on the string: 1. Convert X to NFD format, as described in [UAX15]. 2. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters. 3. Reapply NFD. Step 1 and 3 not being performed will lead to false negative. That being said, given that it is for clang tidy and that adding decomposition to llvm is a bit involved (I could look into it if there is interest), this is not a strong opposition, but i wanted to make sure people are aware of the limitation.
Comment Actions Address most reviewers comment:
Comment Actions I remain concerned about the utility of this check. As I understand it, the more complete Unicode guidance is still under active discussion and hasn't been finalized (please correct me if I'm wrong @tahonermann and @cor3ntin) so it feels a bit premature to land this. It also doesn't follow the guidance from the Unicode consortium and expected to have more false positives. That said, I don't have a strong enough technical reason to block this check, so it gets my disgaree-but-LGTM. |