This is an archive of the discontinued LLVM Phabricator instance.

Misleading bidirectional detection
ClosedPublic

Authored by serge-sans-paille on Nov 1 2021, 2:19 AM.

Details

Summary

This patch implements detection of incomplete bidirectional sequence withing comments and string literals within clang-tidy.

It detects the bidi part of https://www.trojansource.codes/trojan-source.pdf

Diff Detail

Event Timeline

serge-sans-paille requested review of this revision.Nov 1 2021, 2:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 1 2021, 2:19 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
rsmith accepted this revision.Nov 1 2021, 3:27 PM
rsmith added a subscriber: rsmith.

Nits and a suggested approach for invalid code sequences that's probably important to handle better. Please fix the clang-tidy findings too. Otherwise, LGTM.

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
39

Nit: please use // comments per our normal coding style.

50
61–63

Can we scan forward looking for the next non-continuation byte? (Skip while c & 0b1100_0000 == 0b1000_0000)

75
This revision is now accepted and ready to land.Nov 1 2021, 3:27 PM
carlosgalvezp requested changes to this revision.Nov 1 2021, 3:41 PM

Nice addition! Please add this check to the documentation (list of available checks + individual page with the documentation for this check), plus mention in the clang-tidy release notes. The same applies to the other related patches.

This revision now requires changes to proceed.Nov 1 2021, 3:41 PM
carlosgalvezp added inline comments.Nov 1 2021, 3:45 PM
clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp
61

Nit: please keep alphabetical order in the list of jobs.

MaskRay added a subscriber: MaskRay.Nov 1 2021, 5:17 PM

Context not available.

If you don't use arc diff 'HEAD^' to upload a Diff, please use -U99999 https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
16
MaskRay requested changes to this revision.Nov 1 2021, 5:33 PM
MaskRay added inline comments.
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
20

functionName

20

clang-format bool HonorLineBreaks=true

21
48
52

no brace

99

delete blank line

clang-tools-extra/test/clang-tidy/check_clang_tidy.py
85 ↗(On Diff #383745)

Prefer single quotes

MaskRay added inline comments.
clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp
1

The test misses many interesting cases.

  • recover from failed utf8 decoding
  • doc and release note updated
  • clang-formatting
  • more examples / testing
rsmith added inline comments.Nov 2 2021, 2:27 PM
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
60

Is there a guarantee that convertUTF8Sequence doesn't update CurPtr on error? I'm concerned we might increment *past* the end in the case where CurPtr points to the end, below, which would at least formally be UB.

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
60

According to the doc "If the conversion succeeds, this pointer will be updated to point to the byte just past the end of the converted sequence". My understanding of the implementation confirms that statements, and it's used in a similar manner in clang Lexer.

61–63

I'm not quite sure. There's a risk we end up starting over from the second part of a unicode character, and that could mess up with the decoding. I'll investigate how it's done in other libraries.

uabelho added a subscriber: uabelho.Nov 4 2021, 2:56 AM
carlosgalvezp added inline comments.Nov 5 2021, 7:45 AM
clang-tools-extra/docs/ReleaseNotes.rst
137

Nit: Inspects

Rebase on main branch

Patch rebased on main, all comments addressed. Looks good?

MaskRay added inline comments.Nov 23 2021, 6:41 PM
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
50

/*next line=*/0x85 is more common

67
69

More common style is A = A ? A - 1 : 0; which avoids unsigned wraparound.

That said, if the state is currently 0, should an attempt to decrease it be reported as an error as well?

71

ditto

129

If you using MisleadingBidirectionalCheck (or clang::tidy::misc), then you can use

void MisleadingBidirectionalCheck::registerMatchers

clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp
6

[[@LINE-1]] is a deprecated FileCheck feature.

Use [[#@LINE-1]]

upsuper added a subscriber: upsuper.EditedNov 24 2021, 4:08 AM

I'm not familiar with LLVM / Clang codebase. I was asked by @MaskRay to help review the bidi-related part of this patch.

Generally I believe the algorithm here is too naive that it produces both false positives and false negatives. I'm not sure how important those edge cases are considered, but anyway... detailed comments below:

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
49–50

UAX 14 is probably not the right document to look here.

According to UAX 9 step L1, the embedding level is reset to the paragraph embedding level when hitting segment separator and paragraph separator, which are defined in table 4 as type B and S, and in UCD DerivedBidiClass.txt you can see type B includes U+000A, U+000D, U+001C..001E, U+0085, U+2029, and type S includes U+0009, U+000B, and U+001F.

You have U+000C and U+2028 here which are counted as type WS which doesn't affect embedding level, so you may be resetting the counter prematurely. And you may want to reset the counter for all other characters above.

65–73

The bidi algorithm is more complicated than having two simple counters. Basically, a PDF cancels a override / embedding character only when they match, and PDI cancels all override / embedding between it and its matching isolate character.

I was thinking whether the counters would be enough in the sense that we may accept some false positive for edge cases but definitely no false negative. But I'm convinced it's not the case. As an example, a sequence of RLO LRI PDF PDI will yield no embedding and no isolate in this counting model, but the PDF here actually has no effect, as shown in step X7, if it does not match an embedding initiator, it is ignored, so this is effectively leaving a dangling RLO.

Update bidi algorithm

@upsuper I've not added extra test cases yet, but does it looks it's heading in the right direction?

Fix some parts of the bidi algorithm, and add extra test cases

I think the core algorithm looks correct now. I'll leave the code review to LLVM reviewers. Thanks.

rebased on main branch

Has this been tested against any large code bases that use bidirectional characters to see what the false positive rate is? Also, have you seen the latest Unicode guidance on this topic: https://unicode.org/L2/L2022/22007-avoiding-spoof.pdf to make sure we're following along?

@aaron.ballman unfortunately I don't know any of those. If I recall correctly we found no software in the RedHat collection actually using those control characters :-/
My understanding is that we are inline with the document you mention, except the fact that 1. We don't allow LRM character in plain text and 2. we do not check if a string / comment sequence ends in RTL state, but that actually requires another algorithm :-/

Note that if there's a consensus on it, I can implement LRM character support.

@MaskRay any blocker on that new version now that it recieved a green light from @upsuper?

I'd like to clarify that what I think is correct now is the algorithm to detect unclosed explicit formatting scopes in a given string.

I haven't been following very closely with the whole spoofing issue, so I can't say that there is no other ways to construct a spoof that this algorithm is not designed to detect.

As you have found, RLM, and ALM can be used to confuse code reader, but they are not much different than a string with other strong RTL characters inside, and I don't quite see how that can be linted without hurting potentially legitimate code. Maybe if the compiler supports treating LRM as whitespace (I'm not sure whether Clang does), a lint may be added to ask wrapping any string with outermost strong characters being RTL in the form of {LRM}"string"{LRM} so that the RTL characters don't affect outside. Other than that, I don't think there is anyway to lint against such a confusion.

I'd like to clarify that what I think is correct now is the algorithm to detect unclosed explicit formatting scopes in a given string.

Thanks for confirming. This check only detects unterminated bidi sequence within comments and string literals. Its scope limits to that aspect.

I haven't been following very closely with the whole spoofing issue, so I can't say that there is no other ways to construct a spoof that this algorithm is not designed to detect.

Agreed. FYI we already have a check for RTL characters ending identfiers, and a pending one for confusable identifiers

As you have found, RLM, and ALM can be used to confuse code reader, but they are not much different than a string with other strong RTL characters inside, and I don't quite see how that can be linted without hurting potentially legitimate code. Maybe if the compiler supports treating LRM as whitespace (I'm not sure whether Clang does), a lint may be added to ask wrapping any string with outermost strong characters being RTL in the form of {LRM}"string"{LRM} so that the RTL characters don't affect outside. Other than that, I don't think there is anyway to lint against such a confusion.

I agree that allowing LRM is a good step forward, and that's part of the official recommendation, but orthogonal to that review.

MaskRay accepted this revision.Jan 11 2022, 7:07 PM
MaskRay added inline comments.
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
17

Consider using clang::tidy::misc::MisleadingBidirectionalCheck to avoid specifying the long name repeatedly.

44

i is unused

49

You may save the conditions to two booleans to avoid repeated BidiContexts.clear();

67

A sentence should end with a period.

This revision is now accepted and ready to land.Jan 11 2022, 7:07 PM
This revision was landed with ongoing or failed builds.Jan 12 2022, 2:39 AM
This revision was automatically updated to reflect the committed changes.