This is an archive of the discontinued LLVM Phabricator instance.

Differential D78346

Fix Windows command line bug when last token in response file is ""
Closed, Public

Authored by neildhar on Apr 16 2020, 11:22 PM.

Details

Reviewers

Bigcheese
amccarth

Commits

rG2d068e534f16: Fix Windows command line bug when last token in response file is ""

Summary

The current state machine for parsing tokens from response files on Windows does not correctly handle the case where the last token is "". The current implementation handles the last token by adding it only if it is not empty; however, this does not cover the case where the last token is meant to be the empty string. We can cover this case by checking whether the state machine was last in the UNQUOTED state, which indicates that the last character of the input was a non-whitespace character.
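The difference between the two end-of-input checks can be seen in a much-simplified tokenizer. The sketch below is not the LLVM implementation: `tokenizeSketch`, `FinalCheck`, and `isSpace` are hypothetical names, backslash handling and the doubled-quote rule are omitted, and only the three states named in the summary are modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Which end-of-input rule to apply (hypothetical switch, for illustration).
enum class FinalCheck { TokenNotEmpty, StateIsUnquoted };

static bool isSpace(char C) { return C == ' ' || C == '\t' || C == '\n'; }

// Simplified tokenizer: whitespace separates tokens and double quotes group
// characters. Backslashes are treated as ordinary characters here.
std::vector<std::string> tokenizeSketch(const std::string &Src,
                                        FinalCheck Check) {
  enum { INIT, UNQUOTED, QUOTED } State = INIT;
  std::vector<std::string> Tokens;
  std::string Token;
  for (char C : Src) {
    switch (State) {
    case INIT:
      if (isSpace(C))
        break; // Skip whitespace between tokens.
      if (C == '"') {
        State = QUOTED; // A quote opens a (possibly empty) token.
        break;
      }
      Token.push_back(C);
      State = UNQUOTED;
      break;
    case UNQUOTED:
      if (isSpace(C)) {
        Tokens.push_back(Token); // Whitespace ends the token, even if empty.
        Token.clear();
        State = INIT;
        break;
      }
      if (C == '"') {
        State = QUOTED;
        break;
      }
      Token.push_back(C);
      break;
    case QUOTED:
      if (C == '"') {
        State = UNQUOTED; // Closing quote: still inside a token.
        break;
      }
      Token.push_back(C);
      break;
    }
  }
  // The two candidate rules for the token that is pending at end of input.
  const bool Flush = Check == FinalCheck::TokenNotEmpty ? !Token.empty()
                                                        : State == UNQUOTED;
  if (Flush)
    Tokens.push_back(Token);
  return Tokens;
}
```

With the input `a ""`, `FinalCheck::TokenNotEmpty` yields only `a`, while `FinalCheck::StateIsUnquoted` yields `a` followed by an empty token. An input ending in a lone `"` leaves this sketch in QUOTED with an empty token, so neither rule adds a trailing token.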

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

neildhar created this revision. Apr 16 2020, 11:22 PM

Herald added a project: Restricted Project. Apr 16 2020, 11:22 PM

Herald added subscribers: llvm-commits, hiraditya.

Harbormaster failed remote builds in B53678: Diff 258236! Apr 16 2020, 11:25 PM

neildhar edited the summary of this revision. Apr 16 2020, 11:27 PM

neildhar added a reviewer: Bigcheese.

Herald added a subscriber: dexonsmith. Apr 16 2020, 11:27 PM

Harbormaster completed remote builds in B53678: Diff 258236. Apr 20 2020, 6:28 PM

Ping

neildhar added a reviewer: rnk. May 11 2020, 5:06 PM

Hi @rnk and @Bigcheese, I was wondering if you would be willing to review this diff.

Thank you!

I think the fix looks good, but I'd like to see a bit more in the test.

Thanks for taking on this bug.

llvm/unittests/Support/CommandLineTest.cpp
261	It would make the test easier to read if we could use raw string literals for these kinds of inputs. Does LLVM style permit that?
264	I'm mildly concerned that in the future someone may come in and append another argument to the end in order to test some other case, and thus invalidate this test. Can you add a comment that this test is (among other things) explicitly testing the case of an empty quoted string at the end of the command? Or maybe even make this a separate test for specifically that (e.g., `TokenizeWindowsCommandLineQuotedLastArgument`). Are there other cases we should test to make sure this doesn't change things? What if the last argument is `"` or `"""` or `""""`? What if there's white space after the last quotation mark? A quick look at the code suggests these might all work with this change, but it looks like they'd hit some other code paths.
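On the raw string literal question: the two spellings of the `TokenizeWindowsCommandLine1` input that appear in this revision denote the same bytes, which a small standalone check can confirm. The array names `Escaped` and `Raw` are hypothetical; the string contents are copied from the test.

```cpp
#include <cassert>
#include <cstring>

// Escaped form: every backslash and double quote needs its own escape.
const char Escaped[] =
    "a\\b c\\\\d e\\\\\"f g\" h\\\"i j\\\\\\\"k \"lmn\" o pqr "
    "\"st \\\"u\" \\v";

// Raw form: the characters between R"( and )" are taken literally.
const char Raw[] = R"(a\b c\\d e\\"f g" h\"i j\\\"k "lmn" o pqr "st \"u" \v)";
```

Comparing the two with `std::strcmp` returns 0, so switching the test input to the raw form should not change what the tokenizer sees.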

Thanks for the fix, I agree with Adrian's feedback. @amccarth, can you take over the review? I will remove myself, thanks.

rnk removed a reviewer: rnk. May 19 2020, 11:35 AM

I'm happy to see this through.

Update tests

Harbormaster failed remote builds in B57770: Diff 265970! May 25 2020, 1:34 AM

Thanks for taking a look @amccarth! I've updated the tests to include one for `"` as well. I think the other versions (`"""`, `""""`) will trigger the same code paths, although I can add them if that is preferred.

Unfortunately, I've used the wrong linter here and will fix that and update.

Fix linting

Harbormaster completed remote builds in B57805: Diff 266050. May 25 2020, 12:52 PM

Thanks for extending the tests. Those will certainly make your fix more robust against regressions.

I'm concerned about the extra complexity added to the state machine and the introduction of an unused API. The new version has more complicated flow control for no obvious benefit, and it seems unrelated to the relatively simple fix this patch started out to be. If there are good reasons for these additional changes, perhaps it would be best to break them into a separate follow-on patch. I'd be happy to review that as well.

llvm/lib/Support/CommandLine.cpp

928

Looks like we have some scope-creep: This patch is now about more than solving the problem with arguments at the end of the line.

I'm not convinced that it's important this be inlined. That's ultimately up to the compiler.

I think the comment about the user entry points doesn't tell us anything that we cannot see in the function signature.

935

I don't understand what I'm supposed to take away from this comment. Either delete it or explain _why_ it's important to do as much work as possible.

I'm not even sure what you mean by "work." It seems this version of the state machine does the same amount of work as the old version. Is this about the fact that the states now have their own internal loops in addition to the outer loop?

945

I'm concerned that the index is advanced here and in the loop header. I'm not saying it's wrong. It's just harder to reason about state machine behavior when it's a hybrid of a classic loop surrounding state handlers that have their own internal state machines. Now we have a bunch of checks throughout the body to see if we've hit (or exceeded?!) the end. Is there an advantage to this that I'm not seeing?

959

This seems like a behavior change: AddToken will now be called _after_ MarkEOL. Maybe that doesn't matter, but it seems nonsensical to report the delimiter before the token it delimits.
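The ordering concern can be made concrete with a toy scanner that records callback invocations in the order a consumer would see them. This is only a sketch under stated assumptions: `scanSketch` is a hypothetical name, `"<EOL>"` stands in for the `nullptr` marker that `TokenizeWindowsCommandLine` pushes when `MarkEOLs` is set, and quoting is ignored entirely.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits Src on spaces and newlines, recording tokens and end-of-line markers
// in the order the callbacks fire. EolBeforeToken selects which callback runs
// first when a newline terminates a token.
std::vector<std::string> scanSketch(const std::string &Src,
                                    bool EolBeforeToken) {
  std::vector<std::string> Events;
  auto AddToken = [&](const std::string &T) { Events.push_back(T); };
  auto MarkEOL = [&]() { Events.push_back("<EOL>"); };

  std::string Token;
  for (char C : Src) {
    if (C != ' ' && C != '\n') {
      Token.push_back(C);
      continue;
    }
    const bool IsNewline = C == '\n';
    if (Token.empty()) {
      if (IsNewline)
        MarkEOL(); // Newline with no pending token.
      continue;
    }
    if (EolBeforeToken) {
      if (IsNewline)
        MarkEOL(); // Marker is reported before the token it terminates.
      AddToken(Token);
    } else {
      AddToken(Token);
      if (IsNewline)
        MarkEOL(); // Marker follows the token it terminates.
    }
    Token.clear();
  }
  if (!Token.empty())
    AddToken(Token);
  return Events;
}
```

For the input `a b\nc`, calling `MarkEOL` last gives `a`, `b`, `<EOL>`, `c`, whereas calling it first gives `a`, `<EOL>`, `b`, `c`, which places the marker ahead of the last token on its line.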

960

Consistent with LLVM style (https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code), the old version used continue to avoid cascades of else if statements. Please follow that pattern here.
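As a small illustration of the early-exit pattern being requested, the loop below handles each special character with its own `if` followed by `continue` rather than an `else if` cascade. The `Counts` struct and `classify` function are hypothetical and unrelated to the patch; only the control-flow shape is the point.

```cpp
#include <cassert>
#include <string>

// Tallies of the character classes the Windows tokenizer treats specially.
struct Counts {
  int Spaces = 0;
  int Quotes = 0;
  int Backslashes = 0;
  int Other = 0;
};

Counts classify(const std::string &Src) {
  Counts Result;
  for (char C : Src) {
    // Each case finishes with continue, so later checks need no else.
    if (C == ' ') {
      ++Result.Spaces;
      continue;
    }
    if (C == '"') {
      ++Result.Quotes;
      continue;
    }
    if (C == '\\') {
      ++Result.Backslashes;
      continue;
    }
    ++Result.Other; // Reached only for ordinary characters.
  }
  return Result;
}
```

Each branch stays at one indentation level, and the fall-through case at the bottom is visibly the default.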

1028

Nobody is calling this new function, not even in the tests. Is the intent to replace the original function with this new one to reduce copying? Do we really need the flexibility to parse it both ways, or can we keep things a bit simpler?

1611

Did you intentionally change J to start at 0 rather than 1? This mostly looks like you were just trying to make the style consistent by capitalizing these variable names.

This revision now requires changes to proceed. May 26 2020, 11:28 AM

Hi @amccarth, thanks for taking another look! I'm not sure how those changes render for you; on my screen they don't appear as part of this diff. They were picked up when I rebased; the original diff for those is https://reviews.llvm.org/D79262.

neildhar requested review of this revision. May 26 2020, 8:19 PM

Ah, I guess the diff was based off the old head rather than the rebased one. Phabricator highlighted all that code as part of this patch.

Sorry about that. I'll have to redirect most of those comments to @rnk. That'll be awkward.

Now that Phabricator is showing just the fix and test improvements, it looks great.

This revision is now accepted and ready to land. May 27 2020, 8:52 AM

No worries, thanks! This is my first patch to LLVM so I do not have commit access, could I trouble you to commit it?

No problem. I'll land it this afternoon (Pacific time). Thanks!

Closed by commit rG2d068e534f16: Fix Windows command line bug when last token in response file is "" (authored by amccarth). May 27 2020, 3:16 PM

This revision was automatically updated to reflect the committed changes.

simon_tatham mentioned this in D122914: [Windows] Fix handling of \" in program name on cmd line. Apr 4 2022, 1:28 AM

Revision Contents

Path                                          Size
llvm/lib/Support/CommandLine.cpp              2 lines
llvm/unittests/Support/CommandLineTest.cpp    15 lines

Diff 266682

llvm/lib/Support/CommandLine.cpp

[919 lines not shown]

 }

 // Windows treats whitespace, double quotes, and backslashes specially.
 static bool isWindowsSpecialChar(char C) {
   return isWhitespaceOrNull(C) || C == '\\' || C == '\"';
 }

 // Windows tokenization implementation. The implementation is designed to be
 // inlined and specialized for the two user entry points.
[inline comment] amccarth: Looks like we have some scope-creep: This patch is now about more than solving the problem…
 static inline void
 tokenizeWindowsCommandLineImpl(StringRef Src, StringSaver &Saver,
                                function_ref<void(StringRef)> AddToken,
                                bool AlwaysCopy, function_ref<void()> MarkEOL) {
   SmallString<128> Token;

   // Try to do as much work inside the state machine as possible.
[inline comment] amccarth: I don't understand what I'm supposed to take away from this comment. Either delete it or…
   enum { INIT, UNQUOTED, QUOTED } State = INIT;
   for (size_t I = 0, E = Src.size(); I < E; ++I) {
     switch (State) {
     case INIT: {
       assert(Token.empty() && "token should be empty in initial state");
       // Eat whitespace before a token.
       while (I < E && isWhitespaceOrNull(Src[I])) {
         if (Src[I] == '\n')
           MarkEOL();
         ++I;
[inline comment] amccarth: I'm concerned that the index is advanced here and in the loop header. I'm not saying it's…
       }
       // Stop if this was trailing whitespace.
       if (I >= E)
         break;
       size_t Start = I;
       while (I < E && !isWindowsSpecialChar(Src[I]))
         ++I;
       StringRef NormalChars = Src.slice(Start, I);
       if (I >= E || isWhitespaceOrNull(Src[I])) {
         if (I < E && Src[I] == '\n')
           MarkEOL();
         // No special characters: slice out the substring and start the next
         // token. Copy the string if the caller asks us to.
         AddToken(AlwaysCopy ? Saver.save(NormalChars) : NormalChars);
[inline comment] amccarth: This seems like a behavior change: AddToken will now be called _after_ MarkEOL. Maybe that…
       } else if (Src[I] == '\"') {
[inline comment] amccarth: Consistent with [[ https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to…
         Token += NormalChars;
         State = QUOTED;
       } else if (Src[I] == '\\') {
         Token += NormalChars;
         I = parseBackslash(Src, I, Token);
         State = UNQUOTED;
       } else {
         llvm_unreachable("unexpected special character");

[35 lines not shown; context: case QUOTED:]

         I = parseBackslash(Src, I, Token);
       } else {
         Token.push_back(Src[I]);
       }
       break;
     }
   }

-  if (!Token.empty())
+  if (State == UNQUOTED)
     AddToken(Saver.save(Token.str()));
 }

 void cl::TokenizeWindowsCommandLine(StringRef Src, StringSaver &Saver,
                                     SmallVectorImpl<const char *> &NewArgv,
                                     bool MarkEOLs) {
   auto AddToken = [&](StringRef Tok) { NewArgv.push_back(Tok.data()); };
   auto OnEOL = [&]() {
     if (MarkEOLs)
       NewArgv.push_back(nullptr);
   };
   tokenizeWindowsCommandLineImpl(Src, Saver, AddToken,
                                  /*AlwaysCopy=*/true, OnEOL);
 }

 void cl::TokenizeWindowsCommandLineNoCopy(StringRef Src, StringSaver &Saver,
[inline comment] amccarth: Nobody is calling this new function, not even in the tests. Is the intent to replace the…
                                           SmallVectorImpl<StringRef> &NewArgv) {
   auto AddToken = [&](StringRef Tok) { NewArgv.push_back(Tok); };
   auto OnEOL = []() {};
   tokenizeWindowsCommandLineImpl(Src, Saver, AddToken, /*AlwaysCopy=*/false,
                                  OnEOL);
 }

 void cl::tokenizeConfigFile(StringRef Source, StringSaver &Saver,

[566 lines not shown; context: for (size_t i = 0, e = PositionalOpts.size(); i != e; ++i) {]

         llvm_unreachable("Internal error, unexpected NumOccurrences flag in "
                          "positional argument processing!");
       }
   } else {
     assert(ConsumeAfterOpt && NumPositionalRequired <= PositionalVals.size());
     unsigned ValNo = 0;
     for (size_t J = 0, E = PositionalOpts.size(); J != E; ++J)
[inline comment] amccarth: Did you intentionally change `J` to start at 0 rather than 1? This mostly looks like you were…
       if (RequiresValue(PositionalOpts[J])) {
         ErrorParsing |= ProvidePositionalOption(PositionalOpts[J],
                                                 PositionalVals[ValNo].first,
                                                 PositionalVals[ValNo].second);
         ValNo++;
       }

     // Handle the case where there is just one positional option, and it's

[998 lines not shown]

llvm/unittests/Support/CommandLineTest.cpp

[247 lines not shown; context: TEST(CommandLineTest, TokenizeGNUCommandLine) {]

   const char *const Output[] = {
       "foo bar", "foo bar", "foo bar", "foo\\bar",
       "-DFOO=bar()", "foobarbaz", "C:\\src\\foo.cpp", "C:srcfoo.cpp"};
   testCommandLineTokenizer(cl::TokenizeGNUCommandLine, Input, Output,
                            array_lengthof(Output));
 }

 TEST(CommandLineTest, TokenizeWindowsCommandLine1) {
-  const char Input[] = "a\\b c\\\\d e\\\\\"f g\" h\\\"i j\\\\\\\"k \"lmn\" o pqr "
-                       "\"st \\\"u\" \\v";
+  const char Input[] =
+      R"(a\b c\\d e\\"f g" h\"i j\\\"k "lmn" o pqr "st \"u" \v)";
   const char *const Output[] = { "a\\b", "c\\\\d", "e\\f g", "h\"i", "j\\\"k",
                                  "lmn", "o", "pqr", "st \"u", "\\v" };
   testCommandLineTokenizer(cl::TokenizeWindowsCommandLine, Input, Output,
                            array_lengthof(Output));
[inline comment] amccarth: It would make the test easier to read if we could use raw string literals for these kinds of…
 }

 TEST(CommandLineTest, TokenizeWindowsCommandLine2) {
[inline comment] amccarth: I'm mildly concerned that in the future someone may come in and append another argument to the…
   const char Input[] = "clang -c -DFOO=\"\"\"ABC\"\"\" x.cpp";
   const char *const Output[] = { "clang", "-c", "-DFOO=\"ABC\"", "x.cpp"};
   testCommandLineTokenizer(cl::TokenizeWindowsCommandLine, Input, Output,
                            array_lengthof(Output));
 }

+TEST(CommandLineTest, TokenizeWindowsCommandLineQuotedLastArgument) {
+  const char Input1[] = R"(a b c d "")";
+  const char *const Output1[] = {"a", "b", "c", "d", ""};
+  testCommandLineTokenizer(cl::TokenizeWindowsCommandLine, Input1, Output1,
+                           array_lengthof(Output1));
+  const char Input2[] = R"(a b c d ")";
+  const char *const Output2[] = {"a", "b", "c", "d"};
+  testCommandLineTokenizer(cl::TokenizeWindowsCommandLine, Input2, Output2,
+                           array_lengthof(Output2));
+}
+
 TEST(CommandLineTest, TokenizeConfigFile1) {
   const char *Input = "\\";
   const char *const Output[] = { "\\" };
   testCommandLineTokenizer(cl::tokenizeConfigFile, Input, Output,
                            array_lengthof(Output));
 }

 TEST(CommandLineTest, TokenizeConfigFile2) {

[1,563 lines not shown]
