This is an archive of the discontinued LLVM Phabricator instance.

[clang-tidy] Confusable identifiers detection
ClosedPublic

Authored by serge-sans-paille on Nov 1 2021, 2:24 AM.

Details

Summary

This patch implements detection of confusable identifiers within clang-tidy. It only considers identifiers whose declaration scope may conflict with.

It detects the homoglyph part of https://www.trojansource.codes/trojan-source.pdf

Diff Detail

Event Timeline

serge-sans-paille requested review of this revision.Nov 1 2021, 2:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 1 2021, 2:24 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
rsmith added a subscriber: rsmith.Nov 1 2021, 3:14 PM
rsmith added inline comments.
clang-tools-extra/clang-tidy/misc/CMakeLists.txt
9

Please generate a filename that follows our normal conventions (Confusables.inc).

clang-tools-extra/clang-tidy/misc/ConfusableTable/confusables.txt
6

From the terms of use we find this: https://www.unicode.org/license.html and in particular:

[...] provided that either
(a) this copyright and permission notice appear with all copies
of the Data Files or Software, or
(b) this copyright and permission notice appear in associated
Documentation.

... so it looks like we might need to include a copy of at least the Unicode license. I'd like someone from the LLVM foundation to check for licensing concerns here before this lands.

clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
1–2 ↗(On Diff #383749)

Nit: fix the formatting here.

89–90 ↗(On Diff #383749)

I think we also need to check for the reverse case (OND->getDeclContext()->isDeclInLexicalTraversal(ND)) here (assuming a and A are homoglyphs etc):

int a;
namespace N {
  int A; // new is in context of old but not vice versa
  int b;
}
int B; // old is in context of new but not vice versa
namespace N {
  int c = a + B; // finds  ::a and ::B
}
clang-tools-extra/clang-tidy/misc/Homoglyph.h
1 ↗(On Diff #383749)

Nit: add more -s to pad to 80 characters.

MaskRay added a subscriber: MaskRay.Nov 1 2021, 5:40 PM
MaskRay added inline comments.
clang-tools-extra/test/clang-tidy/checkers/misc-homoglyph.cpp
4 ↗(On Diff #383749)

[[@LINE-1]] is deprecated lit feature. Use [[#@LINE-1]]

@tonic / @tstellar as members of the LLVM fundation board, can you tell us if it's okay to ship the confusables.txt file from https://www.unicode.org/Public/security/latest/confusables.txt and what are the steps to do so? Alternatively, I can ship the derived header and provide an optional rule that fetches confisables.txt from its latest version and updates the header file.

@tonic / @tstellar as members of the LLVM fundation board, can you tell us if it's okay to ship the confusables.txt file from https://www.unicode.org/Public/security/latest/confusables.txt and what are the steps to do so? Alternatively, I can ship the derived header and provide an optional rule that fetches confisables.txt from its latest version and updates the header file.

Side note: llvm/include/llvm/Support/ConvertUTF.h and llvm/lib/Support/ConvertUTF.cpp are already licensed under Copyright 2001-2004 Unicode, Inc.

  • Add documentation and changelog entry
  • Formatting / renaming nits
  • Some extra test cases
  • Check scope intersection in both orders
uabelho added a subscriber: uabelho.Nov 4 2021, 2:55 AM

Portable confusable.txt parsing

tstellar requested changes to this revision.Nov 5 2021, 7:11 PM

The LLVM Board of Directors will do a legal review of this change. We will give an update in 4-6 weeks.

This revision now requires changes to proceed.Nov 5 2021, 7:11 PM

@rsmith : once the licensing issue has been fixed, can we merge that patch or do you have any other thought?

aaron.ballman added a subscriber: aaron.ballman.

Personally, I'm uncomfortable with this check because it's not really covering homoglyphs in general, it's covering homoglyphs outside of the usual Latin character set. For example, there's no attempt to catch other problematic homoglyph attack vectors like 1 vs l vs I or O vs 0. The result is: this comes across feeling like it targets non-Latin character scripts as being dangerous but Latin character scripts are fine when that's not really accurate. That's not a path I feel comfortable walking down because it treats programmers differently depending on what their native language is (at least, as used within source code).

Adding some folks from the WG21 Unicode study group (SG16) as they may also have interesting input and advice.

@aaron.ballman Thanks for the ping.

One one hand, I agree with you, on the other hand, this tries to stick to TR39, and I think we should stick with that. It might be worth checking with the Unicode consortium what they think of i/l as confusable.

I think this is a rather good approach as a clang-tidy plugin (and we can gain experience from it to iterate).

I'm however rather concern that we are not following the spec in that there is no NFD decomposition happening (and I am aware this increases the workload a lot for the author).
But as it stands, composed characters may not be confusable even if the decomposed form would be, and the prescribed algorithm actually do 2 passed of decomposition, which is certainly necessary (I would need to think more about it)
The reasonable approach is certainly to stick to the algorithm to the letter, unless we have good motivation not to.

clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
35 ↗(On Diff #384854)

This does not seem to do NFD decomposition, so at the very least it is not consistent with TR39
Identifiers in C++ are mandated to be NFC.

This means that in the presence of composed characters ( for example digraphs ), the algorithms will simply not work.

38 ↗(On Diff #384854)

It's probably more efficient to keep Skeleton as a sequence of char32_t, to avoid multiple round of conversions

82 ↗(On Diff #384854)

I think i like this idea of only complaining if there similar-looking identifiers

@rsmith : once the licensing issue has been fixed, can we merge that patch or do you have any other thought?

I have no concerns once the licensing question is resolved and the other reviewers are happy. I agree with Corentin that we should be exactly following the Unicode Consortium's recommendations. I think that while this doesn't fully address Aaron's concern, it at least gives us a rationale for our treatment of different scripts and seems more defensible than our coming up with some additional rules of our own.

cor3ntin added a comment.EditedFeb 18 2022, 1:52 AM

@aaron.ballman Thanks for the ping.

One one hand, I agree with you, on the other hand, this tries to stick to TR39, and I think we should stick with that. It might be worth checking with the Unicode consortium what they think of i/l as confusable.

Actually @aaron.ballman, Unicode does consider these confusables already

from https://www.unicode.org/Public/security/14.0.0/confusables.txt

0031 ;	006C ;	MA	# ( 1 → l ) DIGIT ONE → LATIN SMALL LETTER L	# 
0049 ;	006C ;	MA	# ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L	# 
0030 ;	004F ;	MA	# ( 0 → O ) DIGIT ZERO → LATIN CAPITAL LETTER O	#

So ASCII is already taken care of. No issue here :)

@aaron.ballman Thanks for the ping.

One one hand, I agree with you, on the other hand, this tries to stick to TR39, and I think we should stick with that. It might be worth checking with the Unicode consortium what they think of i/l as confusable.

Actually @aaron.ballman, Unicode does consider these confusables already

from https://www.unicode.org/Public/security/14.0.0/confusables.txt

0031 ;	006C ;	MA	# ( 1 → l ) DIGIT ONE → LATIN SMALL LETTER L	# 
0049 ;	006C ;	MA	# ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L	# 
0030 ;	004F ;	MA	# ( 0 → O ) DIGIT ZERO → LATIN CAPITAL LETTER O	#

So ASCII is already taken care of. No issue here :)

Oh, so it is, good catch! I looked for it before, but missed it.

FWIW, I agree with sticking with what Unicode defines here. (I am worried about the results in practice and whether enough people will use this check to warrant its inclusion in clang-tidy, but not enough to block this patch.)

rurban added a subscriber: rurban.Feb 20 2022, 7:10 AM
aaron.ballman added inline comments.Feb 22 2022, 4:22 AM
clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
22 ↗(On Diff #384854)

We're testing this functionality in our downstream and noticed that this causes an issue when the repo is cloned with CRLF line endings. Can you make this a bit more resilient to the platform line endings?

Rebased on main branch and (should) fix @aaron.ballman portability issue.

The LLVM Board of Directors will do a legal review of this change. We will give an update in 4-6 weeks.

@tstellar, is there an update from the LLVM Board of Directors regarding using Unicode data?
I'm implementing extended grapheme clustering in libc++ which requires using Unicode data.
Note that MSVC STL (which is using the same license as the LLVM project) already uses Unicode data in the following generated header
https://github.com/microsoft/STL/blob/main/stl/inc/__msvc_format_ucd_tables.hpp

Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2022, 12:04 PM
tstellar accepted this revision.Jun 2 2022, 12:12 PM

The LLVM Board of Directors will do a legal review of this change. We will give an update in 4-6 weeks.

@tstellar, is there an update from the LLVM Board of Directors regarding using Unicode data?
I'm implementing extended grapheme clustering in libc++ which requires using Unicode data.
Note that MSVC STL (which is using the same license as the LLVM project) already uses Unicode data in the following generated header
https://github.com/microsoft/STL/blob/main/stl/inc/__msvc_format_ucd_tables.hpp

Yes, it's fine to include the unicode data.

This revision is now accepted and ready to land.Jun 2 2022, 12:12 PM

Thanks for the update!

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jun 3 2022, 3:46 AM

I haven't checked, but this probably doesn't do the right thing in cross builds (eg building a Mac/arm binary on a Mac/Intel machine - i think the gen_confusables binary will likely end up being an arm binary then, and it won't be able to run during the build). See https://reviews.llvm.org/D126397 and discussion there for how to handle that.

thakis added inline comments.Jun 3 2022, 4:48 AM
clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
55 ↗(On Diff #433991)

Could we not print this line? The build is supposed to be silent if nothing's broken.

thakis added a comment.Jun 3 2022, 6:26 AM

This also makes clang-tidy crash for fairly normal inputs, see below.

clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
84 ↗(On Diff #433991)

This call is not correct, it's only valid for simple identifiers:

% cat t.cc
namespace ns { struct Foo {}; }
auto f = ns::Foo();
% ../llvm-build-2/bin/clang-tidy t.cc  -fix '--checks=-*,misc-homoglyph'  -config={} --
argh Foo
Assertion failed: (Name.isIdentifier() && "Name is not a simple identifier"), function getName, file Decl.h, line 278.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.

This crash breaks check-clangd in some build configs (t.cc is reduced form clang-tools-extra/clangd/test/check.test).

Thanks @thakis for the post-commit review. I'll give it another try next week.

  • quiet output of the table conversion program when everything goes well
  • cross-compilation support (untested)
  • fix identifier retrieving
This revision is now accepted and ready to land.Jun 13 2022, 8:15 AM
serge-sans-paille requested review of this revision.Jun 13 2022, 8:27 AM

Subject: Confusable identifiers detection

Add [clang-tidy]

clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
30 ↗(On Diff #436399)

Place SmallVector outside the loop and use clear. It is more efficient than repeatedly constructing/destroying a SmallVector.

60 ↗(On Diff #436399)

const auto

65 ↗(On Diff #436399)

drop braces

serge-sans-paille retitled this revision from Confusable identifiers detection to [clang-tidy] Confusable identifiers detection.

Address reviews

MaskRay accepted this revision.Jun 13 2022, 1:02 PM

LGTM. Added some clang-tidy reviewers.

This revision is now accepted and ready to land.Jun 13 2022, 1:02 PM

The LLVM Board of Directors will do a legal review of this change. We will give an update in 4-6 weeks.

@tstellar, is there an update from the LLVM Board of Directors regarding using Unicode data?
I'm implementing extended grapheme clustering in libc++ which requires using Unicode data.
Note that MSVC STL (which is using the same license as the LLVM project) already uses Unicode data in the following generated header
https://github.com/microsoft/STL/blob/main/stl/inc/__msvc_format_ucd_tables.hpp

Yes, it's fine to include the unicode data.

@tstellar -- Do we have any license requirements we need to follow? @rsmith pointed out:

(b) this copyright and permission notice appear in associated Documentation.

which sounds like we need to update one of our license files.

(I'll assume the same answer applies for the Unicode named escape sequences patch happening in Clang currently.)

clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
84 ↗(On Diff #436506)

(Formatting -- might just want to run clang-format over the patch.)

clang-tools-extra/docs/clang-tidy/checks/misc-homoglyph.rst
6–10 ↗(On Diff #436506)
14–15 ↗(On Diff #436506)
serge-sans-paille marked an inline comment as done.

Take into account @aaron.ballman review

whisperity added inline comments.Jun 20 2022, 12:22 AM
clang-tools-extra/clang-tidy/misc/Homoglyph.cpp
32–35 ↗(On Diff #436739)

/* -> // or /// (latter is for documentation comments)

41–42 ↗(On Diff #436739)

(Nit: in LLVM we use "const west")

48 ↗(On Diff #436739)

These casts are weird when reading the code. The problem with C-style casts is that the compiler will try to resolve it to some kind of cast, which makes them less "explicit" than would be good for type safety. Could you replace these with static_casts? Or are these, in fact, reinterpret_casts?

64–68 ↗(On Diff #436739)

(Nit: consider adding a using namespace llvm; inside the function and then you could reduce the length of these types.)

84 ↗(On Diff #436739)

(Nit: formatting is broken here.)

86–88 ↗(On Diff #436739)

(Nit: we tend to use auto if and only if the type of the variable is either reasonably impossible to spell out (e.g. an iterator into a container with 3784723 parameters) or immediately obvious from the initialiser expression (e.g. the init is a cast).)

93 ↗(On Diff #436739)

Perhaps you could elaborate on "confusable" here a bit for the user?

94 ↗(On Diff #436739)

Careful: getName() might assert away if the identifier isn't "trivially nameable", e.g. operator==. Is this prevented somewhere? (Otherwise, could we add a negative test case just to ensure no crashes start happening?) Operators are also NamedDecls.

95 ↗(On Diff #436739)

Definition or declaration? "Entity"?

clang-tools-extra/clang-tidy/misc/Homoglyph.h
18 ↗(On Diff #436739)

This is the "top-level" check that is the addition in this patch, right?

In that case, this class is breaking naming conventions (usually the symbol name ends with Check) and the "usual" documentation comment from the class is also missing.


Also, maybe a full rename to ConfusableIdentifierCheck would be worth it. If you'd like to stick to the current summary of the patch.

clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp
43

The comment about coming up with a better name for the check I think applies here too. I would be comfortable with misc-confusable-identifiers. The problem with homoglyph is that it requires a specific understanding of English and (natural) language studies that we should not hard-require from our users. If I have to look up on Wikipedia what the name of the rule I'm using means then something is wrong.

clang-tools-extra/docs/ReleaseNotes.rst
144

Unicode? Isn't it a trademark?

clang-tools-extra/docs/clang-tidy/checks/misc-homoglyph.rst
8 ↗(On Diff #436739)

Is the associated research paper published anywhere? I can only come up with arXiv preprints. However, adding a link to arXiv would be more "future proof" than the (agreeably fancy) website itself. The site mentions one and the research paper mentions two CVE numbers for the attack vector... perhaps we should just link the CVE database.

clang-tools-extra/test/clang-tidy/checkers/misc-homoglyph.cpp
16–19 ↗(On Diff #436739)

(Nit: If we keep the message blocks together they will be more resilient to accidental modifications. You can have any kind of offset in the test comment, anyway.)

whisperity requested changes to this revision.Jun 20 2022, 12:26 AM

Because of the stability issues related to getName()-like constructs I'm putting a temporary ❌ on this (so it doesn't show up as faux accept). However, I have to emphasise that I do like the idea of the check!

clang-tools-extra/clang-tidy/misc/CMakeLists.txt
50

gen_confusable_glyph_list?

clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
1 ↗(On Diff #436739)

Why does this file have snake_case name?

This revision now requires changes to proceed.Jun 20 2022, 12:26 AM

So that this doesn't get lost, I'm opposed to this change because it does not actually follow the TR39 guidance or algorithm.

For an input string X, define skeleton(X) to be the following transformation on the string:

1. Convert X to NFD format, as described in [UAX15].
2. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
3. Reapply NFD.

Step 1 and 3 not being performed will lead to false negative.

That being said, given that it is for clang tidy and that adding decomposition to llvm is a bit involved (I could look into it if there is interest), this is not a strong opposition, but i wanted to make sure people are aware of the limitation.
I think we should have a fixme in the code so that we don't forget.

clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp
59 ↗(On Diff #436739)

This is terribly menory inneficient - most confusables are 1-3 code point longs.
Consider

  • Replacing the hard coded 32 by the actual maximum length that you can extract when buikding the table
    • Having a table for the common case of 1-3 codepoints for the common case and another one for the long sequence.

Address most reviewers comment:

  • formatting style
  • reduced memory consumption
  • be clear about TR39 divergence
  • class and option renaming
  • getName() usage
aaron.ballman accepted this revision.Jun 21 2022, 12:33 PM

I remain concerned about the utility of this check. As I understand it, the more complete Unicode guidance is still under active discussion and hasn't been finalized (please correct me if I'm wrong @tahonermann and @cor3ntin) so it feels a bit premature to land this. It also doesn't follow the guidance from the Unicode consortium and expected to have more false positives.

That said, I don't have a strong enough technical reason to block this check, so it gets my disgaree-but-LGTM.

whisperity accepted this revision.Jun 22 2022, 2:04 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Jun 22 2022, 2:04 AM
This revision was landed with ongoing or failed builds.Jun 22 2022, 7:19 AM
This revision was automatically updated to reflect the committed changes.