This is an archive of the discontinued LLVM Phabricator instance.

Nits and a suggested approach for invalid code sequences that's probably important to handle better. Please fix the clang-tidy findings too. Otherwise, LGTM.

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
39	Nit: please use `//` comments per our normal coding style.
50
61–63	Can we scan forward looking for the next non-continuation byte? (Skip while `c & 0b1100_0000 == 0b1000_0000`)
75

This revision is now accepted and ready to land. Nov 1 2021, 3:27 PM

Nice addition! Please add this check to the documentation (list of available checks + individual page with the documentation for this check), plus mention in the clang-tidy release notes. The same applies to the other related patches.

This revision now requires changes to proceed. Nov 1 2021, 3:41 PM

carlosgalvezp added inline comments. Nov 1 2021, 3:45 PM

clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp
61	Nit: please keep alphabetical order in the list of jobs.

Context not available.

If you don't use arc diff 'HEAD^' to upload a Diff, please use -U99999 https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
16	See https://llvm.org/docs/CodingStandards.html#use-namespace-qualifiers-to-implement-previously-declared-functions

MaskRay requested changes to this revision. Nov 1 2021, 5:33 PM

MaskRay added inline comments.

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
20	`functionName`
20	clang-format `bool HonorLineBreaks=true`
21
48
52	no brace
99	delete blank line
clang-tools-extra/test/clang-tidy/check_clang_tidy.py
85 ↗	(On Diff #383745)	Prefer single quotes

MaskRay added a reviewer: aaron.ballman. Nov 1 2021, 5:37 PM

MaskRay added inline comments.

clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp
1	The test misses many interesting cases.

recover from failed utf8 decoding
doc and release note updated
clang-formatting
more examples / testing

Harbormaster completed remote builds in B131982: Diff 384111. Nov 2 2021, 8:53 AM

rsmith added inline comments. Nov 2 2021, 2:27 PM

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
60	Is there a guarantee that `convertUTF8Sequence` doesn't update `CurPtr` on error? I'm concerned we might increment past the end in the case where `CurPtr` points to the end, below, which would at least formally be UB.

serge-sans-paille added inline comments. Nov 3 2021, 1:39 PM

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
60	According to the doc, "If the conversion succeeds, this pointer will be updated to point to the byte just past the end of the converted sequence". My understanding of the implementation confirms that statement, and it's used in a similar manner in the clang Lexer.
61–63	I'm not quite sure. There's a risk we end up starting over from the second part of a Unicode character, and that could mess up the decoding. I'll investigate how it's done in other libraries.
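For illustration only, the "skip to the next non-continuation byte" idea discussed for lines 60–63 could look roughly like the sketch below. The helper names `isUTF8ContinuationByte` and `skipInvalidUTF8` are hypothetical and are not part of the patch. Note that in C++ the mask test needs parentheses (`==` binds tighter than `&`) and the digit separator is `'`, not `_`. The bounds checks keep the pointer from moving past the end of the buffer, which is the concern raised about formal UB.

```cpp
// True for UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.
inline bool isUTF8ContinuationByte(unsigned char C) {
  return (C & 0b1100'0000) == 0b1000'0000;
}

// Hypothetical recovery helper: CurPtr points at a byte where decoding
// failed. Drop that byte, then skip any continuation bytes that follow, so
// decoding never restarts in the middle of a multi-byte sequence. The result
// never exceeds End.
inline const char *skipInvalidUTF8(const char *CurPtr, const char *End) {
  if (CurPtr < End)
    ++CurPtr; // Drop the offending byte.
  while (CurPtr < End &&
         isUTF8ContinuationByte(static_cast<unsigned char>(*CurPtr)))
    ++CurPtr; // Skip the tail of the broken sequence.
  return CurPtr;
}
```

With this shape, a truncated three-byte sequence followed by an ASCII letter resumes at the letter rather than at the stray continuation byte.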

uabelho added a subscriber: uabelho. Nov 4 2021, 2:56 AM

carlosgalvezp added inline comments. Nov 5 2021, 7:45 AM

clang-tools-extra/docs/ReleaseNotes.rst
137	Nit: Inspects

Rebase on main branch

Patch rebased on main, all comments addressed. Looks good?

Nits

Harbormaster completed remote builds in B135067: Diff 388425. Nov 19 2021, 2:00 AM

carlosgalvezp removed a reviewer: carlosgalvezp. Nov 22 2021, 8:47 AM

Gentle ping @MaskRay and/or @rsmith

MaskRay added inline comments. Nov 23 2021, 6:41 PM

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
50	`/*next line=*/0x85` is more common
67
69	More common style is `A = A ? A - 1 : 0;` which avoids unsigned wraparound. That said, if the state is currently 0, should an attempt to decrease it be reported as an error as well?
71	ditto
129	If you `using` MisleadingBidirectionalCheck (or `clang::tidy::misc`), then you can use `void MisleadingBidirectionalCheck::registerMatchers`
clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp
6	`[[@LINE-1]]` is a deprecated FileCheck feature. Use `[[#@LINE-1]]`
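As a small illustration of the `A = A ? A - 1 : 0;` suggestion on line 69: both forms below clamp an unsigned counter at zero, but the first does so without relying on unsigned wraparound. The function names are hypothetical. The `std::min` form mirrors the `Isolate = std::min(Isolate - 1, Isolate);` line quoted in an earlier revision of the patch.

```cpp
#include <algorithm>

// Explicit clamp: no wraparound is ever computed.
inline unsigned saturatingDecrement(unsigned A) { return A ? A - 1 : 0; }

// Same result, but A - 1 wraps to UINT_MAX when A == 0 and std::min then
// picks A (0) again. Correct, but harder to read.
inline unsigned saturatingDecrementViaMin(unsigned A) {
  return std::min(A - 1, A);
}
```

Neither form reports an error when the counter is already zero, which is the open question raised in the same comment.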

I'm not familiar with LLVM / Clang codebase. I was asked by @MaskRay to help review the bidi-related part of this patch.

Generally, I believe the algorithm here is too naive: it produces both false positives and false negatives. I'm not sure how important those edge cases are considered to be, but anyway... detailed comments below:

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
49–50	UAX 14 is probably not the right document to look at here. According to UAX 9 step L1, the embedding level is reset to the paragraph embedding level when hitting a segment separator or paragraph separator (defined in table 4 as types S and B respectively), and in UCD DerivedBidiClass.txt you can see type B includes U+000A, U+000D, U+001C..001E, U+0085, U+2029, and type S includes U+0009, U+000B, and U+001F. You have U+000C and U+2028 here, which are counted as type WS, which doesn't affect the embedding level, so you may be resetting the counter prematurely. And you may want to reset the counter for all the other characters above.
65–73	The bidi algorithm is more complicated than having two simple counters. Basically, a `PDF` cancels an override / embedding character only when they match, and a `PDI` cancels all overrides / embeddings between it and its matching isolate character. I was wondering whether the counters would be enough, in the sense that we may accept some false positives for edge cases but definitely no false negatives. But I'm convinced that's not the case. As an example, a sequence of `RLO LRI PDF PDI` will yield no embedding and no isolate in this counting model, but the `PDF` here actually has no effect: as shown in step X7, a `PDF` that does not match an embedding initiator is ignored, so this effectively leaves a dangling `RLO`.
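The two comments above can be sketched as a stack of expected terminators rather than two counters. This is a simplified illustration working on already-decoded code points, not the patch itself. The names `resetsBidiState` and `hasUnclosedBidiScope` are hypothetical, and the separator list is taken from the type B / type S code points quoted above.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Explicit formatting characters (UAX 9).
constexpr char32_t LRE = 0x202A, RLE = 0x202B, PDF = 0x202C, LRO = 0x202D,
                   RLO = 0x202E, LRI = 0x2066, RLI = 0x2067, FSI = 0x2068,
                   PDI = 0x2069;

// Type B (paragraph separator) and type S (segment separator) code points,
// as listed in the review comment. U+000C and U+2028 are deliberately absent.
inline bool resetsBidiState(char32_t C) {
  bool IsParagraphSep = C == 0xA || C == 0xD || (0x1C <= C && C <= 0x1E) ||
                        C == 0x85 || C == 0x2029;
  bool IsSegmentSep = C == 0x9 || C == 0xB || C == 0x1F;
  return IsParagraphSep || IsSegmentSep;
}

// Returns true if Text ends with an explicit formatting scope still open.
inline bool hasUnclosedBidiScope(const std::vector<char32_t> &Text) {
  std::vector<char32_t> Open; // Terminator expected by each open scope.
  for (char32_t C : Text) {
    if (resetsBidiState(C)) {
      Open.clear();
    } else if (C == RLO || C == RLE || C == LRO || C == LRE) {
      Open.push_back(PDF);
    } else if (C == RLI || C == LRI || C == FSI) {
      Open.push_back(PDI);
    } else if (C == PDF) {
      // A PDF only closes the innermost scope, and only if it matches.
      if (!Open.empty() && Open.back() == PDF)
        Open.pop_back();
    } else if (C == PDI) {
      // A PDI closes its matching isolate and everything opened inside it.
      auto R = std::find(Open.rbegin(), Open.rend(), PDI);
      if (R != Open.rend())
        Open.erase(std::prev(R.base()), Open.end());
    }
  }
  return !Open.empty();
}
```

Under this model `RLO LRI PDF PDI` leaves the `RLO` scope open, so it is flagged, whereas two independent counters would both return to zero and miss it.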

Update bidi algorithm

@upsuper I've not added extra test cases yet, but does it look like it's heading in the right direction?

Harbormaster completed remote builds in B141691: Diff 397597. Jan 5 2022, 9:15 AM

Fix some parts of the bidi algorithm, and add extra test cases

Harbormaster completed remote builds in B141850: Diff 397813. Jan 6 2022, 1:21 AM

I think the core algorithm looks correct now. I'll leave the code review to LLVM reviewers. Thanks.

rebased on main branch

Thanks @upsuper!

Has this been tested against any large code bases that use bidirectional characters to see what the false positive rate is? Also, have you seen the latest Unicode guidance on this topic: https://unicode.org/L2/L2022/22007-avoiding-spoof.pdf to make sure we're following along?

Harbormaster completed remote builds in B141903: Diff 397885. Jan 6 2022, 7:46 AM

@aaron.ballman unfortunately I don't know of any. If I recall correctly we found no software in the RedHat collection actually using those control characters :-/
My understanding is that we are in line with the document you mention, except that (1) we don't allow the LRM character in plain text and (2) we do not check whether a string / comment sequence ends in RTL state, but that actually requires another algorithm :-/

Note that if there's a consensus on it, I can implement LRM character support.

rebased

Harbormaster completed remote builds in B142124: Diff 398195. Jan 7 2022, 11:10 AM

jrheath added a subscriber: jrheath. Jan 10 2022, 5:29 PM

@MaskRay any blocker on that new version now that it has received a green light from @upsuper?

I'd like to clarify that what I think is correct now is the algorithm to detect unclosed explicit formatting scopes in a given string.

I haven't been following the whole spoofing issue very closely, so I can't say that there are no other ways to construct a spoof that this algorithm is not designed to detect.

As you have found, RLM and ALM can be used to confuse a code reader, but they are not much different from a string with other strong RTL characters inside, and I don't quite see how that can be linted without hurting potentially legitimate code. Maybe if the compiler supports treating LRM as whitespace (I'm not sure whether Clang does), a lint may be added to ask for wrapping any string whose outermost strong characters are RTL in the form {LRM}"string"{LRM}, so that the RTL characters don't affect the outside. Other than that, I don't think there is any way to lint against such confusion.

In D112913#3233699, @upsuper wrote:

I'd like to clarify that what I think is correct now is the algorithm to detect unclosed explicit formatting scopes in a given string.

Thanks for confirming. This check only detects unterminated bidi sequences within comments and string literals. Its scope is limited to that aspect.

I haven't been following the whole spoofing issue very closely, so I can't say that there are no other ways to construct a spoof that this algorithm is not designed to detect.

Agreed. FYI we already have a check for RTL characters ending identifiers, and a pending one for confusable identifiers.

As you have found, RLM and ALM can be used to confuse a code reader, but they are not much different from a string with other strong RTL characters inside, and I don't quite see how that can be linted without hurting potentially legitimate code. Maybe if the compiler supports treating LRM as whitespace (I'm not sure whether Clang does), a lint may be added to ask for wrapping any string whose outermost strong characters are RTL in the form {LRM}"string"{LRM}, so that the RTL characters don't affect the outside. Other than that, I don't think there is any way to lint against such confusion.

I agree that allowing LRM is a good step forward, and that's part of the official recommendation, but orthogonal to that review.

MaskRay accepted this revision. Jan 11 2022, 7:07 PM

MaskRay added inline comments.

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp
17	Consider `using clang::tidy::misc::MisleadingBidirectionalCheck` to avoid specifying the long name repeatedly.
44	`i` is unused
49	You may save the conditions to two booleans to avoid repeated `BidiContexts.clear();`
67	A sentence should end with a period.

This revision is now accepted and ready to land. Jan 11 2022, 7:07 PM

This revision was landed with ongoing or failed builds. Jan 12 2022, 2:39 AM

Closed by commit rG35cca45b09b8: Misleading bidirectional detection (authored by serge-sans-paille).

This revision was automatically updated to reflect the committed changes.

serge-sans-paille added a commit: rG35cca45b09b8: Misleading bidirectional detection.

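Revision Contents

Path	Size
clang-tools-extra/clang-tidy/misc/CMakeLists.txt	1 line
clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp	3 lines
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.h	38 lines
clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp	139 lines
clang-tools-extra/docs/ReleaseNotes.rst	4 lines
clang-tools-extra/docs/clang-tidy/checks/list.rst	3 lines
clang-tools-extra/docs/clang-tidy/checks/misc-misleading-bidirectional.rst	21 lines
clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp	(not listed)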

Diff 399273

clang-tools-extra/clang-tidy/misc/CMakeLists.txt

  set(LLVM_LINK_COMPONENTS
    FrontendOpenMP
    Support
    )

  add_clang_library(clangTidyMiscModule
    DefinitionsInHeadersCheck.cpp
    MiscTidyModule.cpp
+   MisleadingBidirectional.cpp
    MisleadingIdentifier.cpp
    MisplacedConstCheck.cpp
    NewDeleteOverloadsCheck.cpp
    NoRecursionCheck.cpp
    NonCopyableObjects.cpp
    NonPrivateMemberVariablesInClassesCheck.cpp
    RedundantExpressionCheck.cpp
    StaticAssertCheck.cpp
  (25 more lines not shown)

clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp

  //===--- MiscTidyModule.cpp - clang-tidy ----------------------------------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//

  #include "../ClangTidy.h"
  #include "../ClangTidyModule.h"
  #include "../ClangTidyModuleRegistry.h"
  #include "DefinitionsInHeadersCheck.h"
+ #include "MisleadingBidirectional.h"
  #include "MisleadingIdentifier.h"
  #include "MisplacedConstCheck.h"
  #include "NewDeleteOverloadsCheck.h"
  #include "NoRecursionCheck.h"
  #include "NonCopyableObjects.h"
  #include "NonPrivateMemberVariablesInClassesCheck.h"
  #include "RedundantExpressionCheck.h"
  #include "StaticAssertCheck.h"
  #include "ThrowByValueCatchByReferenceCheck.h"
  #include "UnconventionalAssignOperatorCheck.h"
  #include "UniqueptrResetReleaseCheck.h"
  #include "UnusedAliasDeclsCheck.h"
  #include "UnusedParametersCheck.h"
  #include "UnusedUsingDeclsCheck.h"

  namespace clang {
  namespace tidy {
  namespace misc {

  class MiscModule : public ClangTidyModule {
  public:
    void addCheckFactories(ClangTidyCheckFactories &CheckFactories) override {
      CheckFactories.registerCheck<DefinitionsInHeadersCheck>(
          "misc-definitions-in-headers");
+     CheckFactories.registerCheck<MisleadingBidirectionalCheck>(
+         "misc-misleading-bidirectional");
      CheckFactories.registerCheck<MisleadingIdentifierCheck>(
          "misc-misleading-identifier");
      CheckFactories.registerCheck<MisplacedConstCheck>("misc-misplaced-const");
      CheckFactories.registerCheck<NewDeleteOverloadsCheck>(
          "misc-new-delete-overloads");
      CheckFactories.registerCheck<NoRecursionCheck>("misc-no-recursion");
      CheckFactories.registerCheck<NonCopyableObjectsCheck>(
          "misc-non-copyable-objects");
      CheckFactories.registerCheck<NonPrivateMemberVariablesInClassesCheck>(
          "misc-non-private-member-variables-in-classes");
      CheckFactories.registerCheck<RedundantExpressionCheck>(
          "misc-redundant-expression");
      CheckFactories.registerCheck<StaticAssertCheck>("misc-static-assert");
      CheckFactories.registerCheck<ThrowByValueCatchByReferenceCheck>(
          "misc-throw-by-value-catch-by-reference");
      CheckFactories.registerCheck<UnconventionalAssignOperatorCheck>(
          "misc-unconventional-assign-operator");
      CheckFactories.registerCheck<UniqueptrResetReleaseCheck>(
          "misc-uniqueptr-reset-release");
      CheckFactories.registerCheck<UnusedAliasDeclsCheck>(
          "misc-unused-alias-decls");
      CheckFactories.registerCheck<UnusedParametersCheck>(
[inline] carlosgalvezp (Not Done): Nit: please keep alphabetical order in the list of jobs.
          "misc-unused-parameters");
      CheckFactories.registerCheck<UnusedUsingDeclsCheck>(
          "misc-unused-using-decls");
    }
  };

  } // namespace misc

  (10 more lines not shown)

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.h

This file was added.

//===--- MisleadingBidirectionalCheck.h - clang-tidy ------------- C++ --===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANG_TIDY_MISC_MISLEADINGBIDIRECTIONALCHECK_H
#define LLVM_CLANG_TOOLS_EXTRA_CLANG_TIDY_MISC_MISLEADINGBIDIRECTIONALCHECK_H

#include "../ClangTidyCheck.h"

namespace clang {
namespace tidy {
namespace misc {

class MisleadingBidirectionalCheck : public ClangTidyCheck {
public:
  MisleadingBidirectionalCheck(StringRef Name, ClangTidyContext *Context);
  ~MisleadingBidirectionalCheck();

  void registerPPCallbacks(const SourceManager &SM, Preprocessor *PP,
                           Preprocessor *ModuleExpanderPP) override;

  void registerMatchers(ast_matchers::MatchFinder *Finder) override;
  void check(const ast_matchers::MatchFinder::MatchResult &Result) override;

private:
  class MisleadingBidirectionalHandler;
  std::unique_ptr<MisleadingBidirectionalHandler> Handler;
};

} // namespace misc
} // namespace tidy
} // namespace clang

#endif // LLVM_CLANG_TOOLS_EXTRA_CLANG_TIDY_MISC_MISLEADINGBIDIRECTIONALCHECK_H

clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp

This file was added.

//===--- MisleadingBidirectional.cpp - clang-tidy -------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "MisleadingBidirectional.h"

#include "clang/Frontend/CompilerInstance.h"
#include "clang/Lex/Preprocessor.h"
#include "llvm/Support/ConvertUTF.h"

using namespace clang;
using namespace clang::tidy::misc;
[inline] MaskRay (Not Done): See https://llvm.org/docs/CodingStandards.html#use-namespace-qualifiers-to-implement-previously-declared-functions
[inline] MaskRay (Not Done): Consider `using clang::tidy::misc::MisleadingBidirectionalCheck` to avoid specifying the long name repeatedly.

static bool containsMisleadingBidi(StringRef Buffer,
                                   bool HonorLineBreaks = true) {
  const char *CurPtr = Buffer.begin();
[inline] MaskRay (Not Done): `functionName`
[inline] MaskRay (Not Done): clang-format `bool HonorLineBreaks=true`
[inline] MaskRay (Not Done), suggested change:
    static bool ContainsMisleadingBidi(StringRef Buffer, bool HonorLineBreaks=true) {
  - const char* CurPtr = Buffer.begin();
  + const char *CurPtr = Buffer.begin();
    unsigned EmbeddingOverride = 0, Isolate = 0;

  enum BidiChar {
    PS = 0x2029,
    RLO = 0x202E,
    RLE = 0x202B,
    LRO = 0x202D,
    LRE = 0x202A,
    PDF = 0x202C,
    RLI = 0x2067,
    LRI = 0x2066,
    FSI = 0x2068,
    PDI = 0x2069
  };

  SmallVector<BidiChar> BidiContexts;

  // Scan each character while maintaining a stack of opened bidi context.
  // RLO/RLE/LRO/LRE all are closed by PDF while RLI LRI and FSI are closed by
  // PDI. New lines reset the context count. Extra PDF / PDI are ignored.
[inline] rsmith (Not Done): Nit: please use `//` comments per our normal coding style.
  // Warn if we end up with an unclosed context.
  while (CurPtr < Buffer.end()) {
    unsigned char C = *CurPtr;
    if (isASCII(C)) {
[inline] MaskRay (Not Done): `i` is unused
      ++CurPtr;
      bool IsParagrapSep =
          (C == 0xA || C == 0xD || (0x1C <= C && C <= 0x1E) || C == 0x85);
      bool IsSegmentSep = (C == 0x9 || C == 0xB || C == 0x1F);
[inline] MaskRay (Not Done), suggested change:
    unsigned char C = *CurPtr;
  - if(isASCII(C)) {
  + if (isASCII(C)) {
    ++CurPtr;
      if (IsParagrapSep || IsSegmentSep)
[inline] MaskRay (Not Done): You may save the conditions to two booleans to avoid repeated `BidiContexts.clear();`
        BidiContexts.clear();
[inline] rsmith (Not Done), suggested change:
    ++CurPtr;
  - // line break: https://www.unicode.org/reports/tr14/tr14-32.html
  + // Line break: https://www.unicode.org/reports/tr14/tr14-32.html
    if(C == '\n' || C == '\r' || C == '\f' || C == '\v' || C == 0x85 /*next line*/) {
[inline] MaskRay (Not Done): `/*next line=*/0x85` is more common
[inline] upsuper (Not Done): UAX 14 is probably not the right document to look here. According to UAX 9 step L1, the embedding level is reset to the paragraph embedding level when hitting segment separator and paragraph separator, which are defined in table 4 as type B and S, and in UCD DerivedBidiClass.txt you can see type B includes U+000A, U+000D, U+001C..001E, U+0085, U+2029, and type S includes U+0009, U+000B, and U+001F. You have U+000C and U+2028 here which are counted as type WS which doesn't affect embedding level, so you may be resetting the counter prematurely. And you may want to reset the counter for all other characters above.
      continue;
    }
[inline] MaskRay (Not Done): no brace
    llvm::UTF32 CodePoint;
    llvm::ConversionResult Result = llvm::convertUTF8Sequence(
        (const llvm::UTF8 **)&CurPtr, (const llvm::UTF8 *)Buffer.end(),
        &CodePoint, llvm::strictConversion);

    // If conversion fails, utf-8 is designed so that we can just try next char.
    if (Result != llvm::conversionOK) {
      ++CurPtr;
[inline] rsmith (Not Done): Is there a guarantee that `convertUTF8Sequence` doesn't update `CurPtr` on error? I'm concerned we might increment *past* the end in the case where `CurPtr` points to the end, below, which would at least formally be UB.
[inline] serge-sans-paille (Author, Done): According to the doc "If the conversion succeeds, this pointer will be updated to point to the byte just past the end of the converted sequence". My understanding of the implementation confirms that statements, and it's used in a similar manner in clang Lexer.
      continue;
    }
[inline] rsmith (Not Done): Can we scan forward looking for the next non-continuation byte? (Skip while `c & 0b1100_0000 == 0b1000_0000`)
[inline] serge-sans-paille (Author, Done): I'm not quite sure. There's a risk we end up starting over from the second part of a unicode character, and that could mess up with the decoding. I'll investigate how it's done in other libraries.

    // Open a PDF context.
    if (CodePoint == RLO || CodePoint == RLE || CodePoint == LRO ||
        CodePoint == LRE)
      BidiContexts.push_back(PDF);
[inline] MaskRay (Not Done), suggested change:
    CodePoint == LRE)
  - EmbeddingOverride += 1;
  + ++EmbeddingOverride;
    else if (CodePoint == PDF)
[inline] MaskRay (Not Done): A sentence should end with a period.
    // Close PDF Context.
    else if (CodePoint == PDF) {
[inline] MaskRay (Not Done): More common style is `A = A ? A - 1 : 0;` which avoids unsigned wraparound. That said, if the state is currently 0, should an attempt to decrease it be reported as an error as well?
      if (!BidiContexts.empty() && BidiContexts.back() == PDF)
        BidiContexts.pop_back();
[inline] MaskRay (Not Done): ditto
    }
    // Open a PDI Context.
[inline] upsuper (Not Done): The bidi algorithm is more complicated than having two simple counters. Basically, a `PDF` cancels a override / embedding character only when they match, and `PDI` cancels all override / embedding between it and its matching isolate character. I was thinking whether the counters would be enough in the sense that we may accept some false positive for edge cases but definitely no false negative. But I'm convinced it's not the case. As an example, a sequence of `RLO LRI PDF PDI` will yield no embedding and no isolate in this counting model, but the `PDF` here actually has no effect, as shown in step X7, if it does not match an embedding initiator, it is ignored, so this is effectively leaving a dangling `RLO`.
    else if (CodePoint == RLI || CodePoint == LRI || CodePoint == FSI)
      BidiContexts.push_back(PDI);
[inline] rsmith (Not Done), suggested change:
    Isolate = std::min(Isolate - 1, Isolate);
  - // line break: https://www.unicode.org/reports/tr14/tr14-32.html
  + // Line break: https://www.unicode.org/reports/tr14/tr14-32.html
    else if (CodePoint == LS || CodePoint == PS)
    // Close a PDI Context.
    else if (CodePoint == PDI) {
      auto R = std::find(BidiContexts.rbegin(), BidiContexts.rend(), PDI);
      if (R != BidiContexts.rend())
        BidiContexts.resize(BidiContexts.rend() - R - 1);
    }
    // Line break or equivalent
    else if (CodePoint == PS)
      BidiContexts.clear();
  }
  return !BidiContexts.empty();
}

class MisleadingBidirectionalCheck::MisleadingBidirectionalHandler
    : public CommentHandler {
public:
  MisleadingBidirectionalHandler(MisleadingBidirectionalCheck &Check,
                                 llvm::Optional<std::string> User)
      : Check(Check) {}

  bool HandleComment(Preprocessor &PP, SourceRange Range) override {
    // FIXME: check that we are in a /* */ comment
    StringRef Text =
        Lexer::getSourceText(CharSourceRange::getCharRange(Range),
[inline] MaskRay (Not Done): delete blank line
                             PP.getSourceManager(), PP.getLangOpts());

    if (containsMisleadingBidi(Text, true))
      Check.diag(
          Range.getBegin(),
          "comment contains misleading bidirectional Unicode characters");
    return false;
  }

private:
  MisleadingBidirectionalCheck &Check;
};

MisleadingBidirectionalCheck::MisleadingBidirectionalCheck(
    StringRef Name, ClangTidyContext *Context)
    : ClangTidyCheck(Name, Context),
      Handler(std::make_unique<MisleadingBidirectionalHandler>(
          *this, Context->getOptions().User)) {}

MisleadingBidirectionalCheck::~MisleadingBidirectionalCheck() = default;

void MisleadingBidirectionalCheck::registerPPCallbacks(
    const SourceManager &SM, Preprocessor *PP, Preprocessor *ModuleExpanderPP) {
  PP->addCommentHandler(Handler.get());
}

void MisleadingBidirectionalCheck::check(
    const ast_matchers::MatchFinder::MatchResult &Result) {
  if (const auto *SL = Result.Nodes.getNodeAs<StringLiteral>("strlit")) {
    StringRef Literal = SL->getBytes();
[inline] MaskRay (Not Done): If you `using` MisleadingBidirectionalCheck (or `clang::tidy::misc`), then you can use `void MisleadingBidirectionalCheck::registerMatchers`
    if (containsMisleadingBidi(Literal, false))
      diag(SL->getBeginLoc(), "string literal contains misleading "
                              "bidirectional Unicode characters");
  }
}

void MisleadingBidirectionalCheck::registerMatchers(
    ast_matchers::MatchFinder *Finder) {
  Finder->addMatcher(ast_matchers::stringLiteral().bind("strlit"), this);
}

clang-tools-extra/docs/ReleaseNotes.rst

  (121 lines not shown; hunk context: - New :doc:`readability-container-data-pointer)
    element at index 0 in a container.

  - New :doc:`readability-identifier-length
    <clang-tidy/checks/readability-identifier-length>` check.

    Reports identifiers whose names are too short. Currently checks local
    variables and function parameters only.

+ - New :doc:`misc-misleading-bidirectional <clang-tidy/checks/misc-misleading-bidirectional>` check.

+   Inspects string literal and comments for unterminated bidirectional Unicode
+   characters.

  New check aliases
  ^^^^^^^^^^^^^^^^^

[inline] carlosgalvezp (Not Done): Nit: Inspects
  - New alias :doc:`cert-err33-c
    <clang-tidy/checks/cert-err33-c>` to
    :doc:`bugprone-unused-return-value
    <clang-tidy/checks/bugprone-unused-return-value>` was added.

  - New alias :doc:`cert-exp42-c
    <clang-tidy/checks/cert-exp42-c>` to
    :doc:`bugprone-suspicious-memory-comparison
  (55 more lines not shown)

clang-tools-extra/docs/clang-tidy/checks/list.rst

  (206 lines not shown; hunk context: .. csv-table::)
  `llvm-namespace-comment <llvm-namespace-comment.html>`_,
  `llvm-prefer-isa-or-dyn-cast-in-conditionals <llvm-prefer-isa-or-dyn-cast-in-conditionals.html>`_, "Yes"
  `llvm-prefer-register-over-unsigned <llvm-prefer-register-over-unsigned.html>`_, "Yes"
  `llvm-twine-local <llvm-twine-local.html>`_, "Yes"
  `llvmlibc-callee-namespace <llvmlibc-callee-namespace.html>`_,
  `llvmlibc-implementation-in-namespace <llvmlibc-implementation-in-namespace.html>`_,
  `llvmlibc-restrict-system-libc-headers <llvmlibc-restrict-system-libc-headers.html>`_, "Yes"
  `misc-definitions-in-headers <misc-definitions-in-headers.html>`_, "Yes"
- `misc-misleading-identifier <misc-misleading-identifier.html>`_,
+ `misc-misleading-bidirectional <misc-misleading-bidirectional.html>`_,
+ `misc-misleading-identifier <misc-mileading-identifier.html>`_,
  `misc-misplaced-const <misc-misplaced-const.html>`_,
  `misc-new-delete-overloads <misc-new-delete-overloads.html>`_,
  `misc-no-recursion <misc-no-recursion.html>`_,
  `misc-non-copyable-objects <misc-non-copyable-objects.html>`_,
  `misc-non-private-member-variables-in-classes <misc-non-private-member-variables-in-classes.html>`_,
  `misc-redundant-expression <misc-redundant-expression.html>`_, "Yes"
  `misc-static-assert <misc-static-assert.html>`_, "Yes"
  `misc-throw-by-value-catch-by-reference <misc-throw-by-value-catch-by-reference.html>`_,
  (232 more lines not shown)

clang-tools-extra/docs/clang-tidy/checks/misc-misleading-bidirectional.rst

This file was added.

.. title:: clang-tidy - misc-misleading-bidirectional

misc-misleading-bidirectional
=============================

Warn about unterminated bidirectional unicode sequence, detecting potential attack
as described in the `Trojan Source <https://www.trojansource.codes>`_ attack.

Example:

.. code-block:: c++

    #include <iostream>

    int main() {
        bool isAdmin = false;
        /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
            std::cout << "You are an admin.\n";
        /* end admins only ‮ { ⁦*/
        return 0;
    }

clang-tools-extra/test/clang-tidy/checkers/misc-misleading-bidirectional.cpp

This binary file was added.

Misleading bidirectional detection
ClosedPublic