This patch implements detection of confusable identifiers within clang-tidy. It only considers identifiers whose declaration scope may conflict with.
It detects the homoglyph part of https://www.trojansource.codes/trojan-source.pdf
Paths
| Differential D112916
[clang-tidy] Confusable identifiers detection ClosedPublic Authored by serge-sans-paille on Nov 1 2021, 2:24 AM.
Details Summary This patch implements detection of confusable identifiers within clang-tidy. It only considers identifiers whose declaration scope may conflict with. It detects the homoglyph part of https://www.trojansource.codes/trojan-source.pdf
Diff Detail
Event TimelineHerald added subscribers: carlosgalvezp, mgrang, mgorny. · View Herald TranscriptNov 1 2021, 2:24 AM rsmith added inline comments.
Comment Actions @tonic / @tstellar as members of the LLVM fundation board, can you tell us if it's okay to ship the confusables.txt file from https://www.unicode.org/Public/security/latest/confusables.txt and what are the steps to do so? Alternatively, I can ship the derived header and provide an optional rule that fetches confisables.txt from its latest version and updates the header file. Comment Actions
Side note: llvm/include/llvm/Support/ConvertUTF.h and llvm/lib/Support/ConvertUTF.cpp are already licensed under Copyright 2001-2004 Unicode, Inc. Comment Actions
Comment Actions The LLVM Board of Directors will do a legal review of this change. We will give an update in 4-6 weeks. This revision now requires changes to proceed.Nov 5 2021, 7:11 PM Comment Actions @rsmith : once the licensing issue has been fixed, can we merge that patch or do you have any other thought? Comment Actions Personally, I'm uncomfortable with this check because it's not really covering homoglyphs in general, it's covering homoglyphs outside of the usual Latin character set. For example, there's no attempt to catch other problematic homoglyph attack vectors like 1 vs l vs I or O vs 0. The result is: this comes across feeling like it targets non-Latin character scripts as being dangerous but Latin character scripts are fine when that's not really accurate. That's not a path I feel comfortable walking down because it treats programmers differently depending on what their native language is (at least, as used within source code). Adding some folks from the WG21 Unicode study group (SG16) as they may also have interesting input and advice. Comment Actions @aaron.ballman Thanks for the ping. One one hand, I agree with you, on the other hand, this tries to stick to TR39, and I think we should stick with that. It might be worth checking with the Unicode consortium what they think of i/l as confusable. I think this is a rather good approach as a clang-tidy plugin (and we can gain experience from it to iterate). I'm however rather concern that we are not following the spec in that there is no NFD decomposition happening (and I am aware this increases the workload a lot for the author).
Comment Actions
I have no concerns once the licensing question is resolved and the other reviewers are happy. I agree with Corentin that we should be exactly following the Unicode Consortium's recommendations. I think that while this doesn't fully address Aaron's concern, it at least gives us a rationale for our treatment of different scripts and seems more defensible than our coming up with some additional rules of our own. Comment Actions
Actually @aaron.ballman, Unicode does consider these confusables already from https://www.unicode.org/Public/security/14.0.0/confusables.txt 0031 ; 006C ; MA # ( 1 → l ) DIGIT ONE → LATIN SMALL LETTER L # 0049 ; 006C ; MA # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L # 0030 ; 004F ; MA # ( 0 → O ) DIGIT ZERO → LATIN CAPITAL LETTER O # So ASCII is already taken care of. No issue here :) Comment Actions
Oh, so it is, good catch! I looked for it before, but missed it. FWIW, I agree with sticking with what Unicode defines here. (I am worried about the results in practice and whether enough people will use this check to warrant its inclusion in clang-tidy, but not enough to block this patch.)
Comment Actions
@tstellar, is there an update from the LLVM Board of Directors regarding using Unicode data? Comment Actions
Yes, it's fine to include the unicode data. This revision is now accepted and ready to land.Jun 2 2022, 12:12 PM Closed by commit rGb94db7ed7eaf: [clang-tidy] Confusable identifiers detection (authored by serge-sans-paille). · Explain WhyJun 3 2022, 3:20 AM This revision was automatically updated to reflect the committed changes. Comment Actions I haven't checked, but this probably doesn't do the right thing in cross builds (eg building a Mac/arm binary on a Mac/Intel machine - i think the gen_confusables binary will likely end up being an arm binary then, and it won't be able to run during the build). See https://reviews.llvm.org/D126397 and discussion there for how to handle that.
Comment Actions This also makes clang-tidy crash for fairly normal inputs, see below.
thakis added a reverting change: rG371e6f8b7fb9: Revert "[clang-tidy] Confusable identifiers detection".Jun 3 2022, 6:30 AM Comment Actions
This revision is now accepted and ready to land.Jun 13 2022, 8:15 AM serge-sans-paille retitled this revision from Confusable identifiers detection to [clang-tidy] Confusable identifiers detection. Comment ActionsAddress reviews This revision is now accepted and ready to land.Jun 13 2022, 1:02 PM Comment Actions
@tstellar -- Do we have any license requirements we need to follow? @rsmith pointed out:
which sounds like we need to update one of our license files. (I'll assume the same answer applies for the Unicode named escape sequences patch happening in Clang currently.)
serge-sans-paille marked an inline comment as done. Comment ActionsTake into account @aaron.ballman review
Comment Actions Because of the stability issues related to getName()-like constructs I'm putting a temporary ❌ on this (so it doesn't show up as faux accept). However, I have to emphasise that I do like the idea of the check!
This revision now requires changes to proceed.Jun 20 2022, 12:26 AM Comment Actions So that this doesn't get lost, I'm opposed to this change because it does not actually follow the TR39 guidance or algorithm. For an input string X, define skeleton(X) to be the following transformation on the string: 1. Convert X to NFD format, as described in [UAX15]. 2. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters. 3. Reapply NFD. Step 1 and 3 not being performed will lead to false negative. That being said, given that it is for clang tidy and that adding decomposition to llvm is a bit involved (I could look into it if there is interest), this is not a strong opposition, but i wanted to make sure people are aware of the limitation.
Comment Actions Address most reviewers comment:
Comment Actions I remain concerned about the utility of this check. As I understand it, the more complete Unicode guidance is still under active discussion and hasn't been finalized (please correct me if I'm wrong @tahonermann and @cor3ntin) so it feels a bit premature to land this. It also doesn't follow the guidance from the Unicode consortium and expected to have more false positives. That said, I don't have a strong enough technical reason to block this check, so it gets my disgaree-but-LGTM. This revision is now accepted and ready to land.Jun 22 2022, 2:04 AM This revision was landed with ongoing or failed builds.Jun 22 2022, 7:19 AM Closed by commit rGc3574ef739fb: [clang-tidy] Confusable identifiers detection (authored by serge-sans-paille). · Explain Why This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 433991 clang-tools-extra/clang-tidy/misc/CMakeLists.txt
clang-tools-extra/clang-tidy/misc/ConfusableTable/CMakeLists.txt
clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
clang-tools-extra/clang-tidy/misc/ConfusableTable/confusables.txt
clang-tools-extra/clang-tidy/misc/Homoglyph.h
clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp
clang-tools-extra/docs/ReleaseNotes.rst
clang-tools-extra/docs/clang-tidy/checks/list.rst
clang-tools-extra/docs/clang-tidy/checks/misc-homoglyph.rst
clang-tools-extra/test/clang-tidy/checkers/misc-homoglyph.cpp
|
Please generate a filename that follows our normal conventions (Confusables.inc).