This is an archive of the discontinued LLVM Phabricator instance.

[Windows] Fix handling of \" in program name on cmd line.
ClosedPublic

Authored by simon_tatham on Apr 1 2022, 9:23 AM.

Details

Summary

Bugzilla #47579: if you invoke clang on Windows via a pathname in
which a quoted section closes just after a backslash, e.g.

"C:\Program Files\Whatever\"clang.exe

then cmd.exe and CreateProcess will correctly find the binary, because
when they parse the program name at the start of the command line,
they don't regard the \ before the " as having any kind of escaping
effect. This is different from the behaviour of the Windows standard C
library when it parses the rest of the command line, which would
consider that \" not to close the quoted string.

But this confuses windows::GetCommandLineArguments, because the
Windows API function GetCommandLineW() will return a command line
containing that \" sequence, and cl::TokenizeWindowsCommandLine will
tokenize the whole string according to the C library's rules. So it
will misidentify where the program name stops and the arguments start.

To fix this, I've introduced a new variant function
cl::TokenizeWindowsCommandLineFull(), intended to be applied to the
string returned from GetCommandLineW(). It parses the first word of
the command line according to CreateProcess's rules, considering \ to
never be an escaping character; thereafter, it switches over to the C
library rules for the rest of the command line.

In the case where there are no further " on the command line, the
previous code was also managing to return no arguments at all, causing
an assertion failure later on in GetCommandLineArguments when it tried
to refer to Args[0]. That occurred because the tokenizer was throwing
away the final token if it contained a quoted string still unclosed at
the end of the command line, which also doesn't match the Windows CRT
handling. So I've fixed that too, and for extra safety, included a
final check before dereferencing Args[0], so that just in case we
still somehow get a zero-length argument vector, we'll at least
not crash.

Diff Detail

Event Timeline

simon_tatham created this revision.Apr 1 2022, 9:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2022, 9:23 AM
simon_tatham requested review of this revision.Apr 1 2022, 9:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2022, 9:23 AM
rnk added a comment.Apr 1 2022, 1:19 PM

I would really like to avoid adding another tokenization variant function. We tried to write this code exactly following the rules the CRT uses to produce the argv[] array supplied to main. How does the CRT startup code handle that? Does it have special handling for the first token to match CreateProcess here?

Is the idea that Windows paths are not allowed to contain quotes, so backslash is not allowed to have any escaping power? In other words, double quotes in the first token are treated similarly to bash single quote, they have no escape mechanism.

llvm/lib/Support/CommandLine.cpp
1038–1039

This seems like it could be a bug when the command line ends with an open quoted string. We should match the CRT logic. In the original example, the entire command line would become argv[0].

llvm/lib/Support/Windows/Process.inc
258

Seems reasonable.

We tried to write this code exactly following the rules the CRT uses to produce the argv[] array supplied to main. How does the CRT startup code handle that? Does it have special handling for the first token to match CreateProcess here?

I haven't seen the source of the real Windows CRT startup code, I'm afraid. But observing it as a black box, yes, it really does seem to handle the first token specially. I compiled and ran this trivial test program with VS2019, which prints both the GetCommandLine raw string and the argv received by main:

#include <stdio.h>
#include <windows.h>
int main(int argc, char **argv) {
  for (int i = 0; i < argc; i++) printf("argv[%d] = [%s]\n", i, argv[i]);
  printf("cl = [%s]\n", GetCommandLineA());
}

Output with a test string consisting of the exact same syntax twice:

C:\Users\statham>".\"args ".\"args
argv[0] = [.\args]
argv[1] = [."args]
cl = [".\"args  ".\"args]

You can see that GetCommandLine has received the same string I passed to cmd.exe, without any pre-transformation by cmd or CreateProcess to make it more palatable to the CRT (which was another hypothesis I considered). But then the C library argv splitter has treated the two copies of the same string differently: in argv[0], the quotes have been discarded and the backslash kept, whereas in argv[1] the backslash has escaped the quote.

(And therefore you'd expect this command line to have ended in mid-quoted string – and running again with further text after the second copy of ".\"args, it's easy to see that that is indeed the case ...

C:\Users\statham>".\"args ".\"args one " two
argv[0] = [.\args]
argv[1] = [."args one ]
argv[2] = [two]
cl = [".\"args  ".\"args one " two]

... which supports your other comment that we ought not to be throwing away non-empty unclosed quoted arguments at the end of the command line.)

Is the idea that Windows paths are not allowed to contain quotes, so backslash is not allowed to have any escaping power? In other words, double quotes in the first token are treated similarly to bash single quote, they have no escape mechanism.

As I say, I don't know for 100% sure how the CRT really thinks. But what you say is exactly my interpretation (including the rationale about " not being a legal filename character).

I would really like to avoid adding another tokenization variant function.

I hear you. But the problem is that the command-line tokenizer in llvm/Support is used both for full command lines as received from GetCommandLine and for snippets in things like response files, which only contain arguments intended to come after the command name. And surely one of those needs this special-casing of the first token, and the other one does not, if they're to be parsed correctly. So I don't see how to avoid having two functions, or one function with a boolean flag, or some way to trigger two different kinds of parsing for two contexts.

llvm/lib/Support/CommandLine.cpp
1038–1039

I thought the same, and I considered fixing it as part of the same commit – I would have expected that we'd add the current token if it had accumulated any text, whether the quotes had been closed or not. But the git history tells me that D78346 changed it from that to this, so I didn't want to unilaterally revert a previous deliberate change.

However, now I look more closely at that previous change, it looks as if it's not intentionally discarding any actual text – it's _preventing_ the discarding of an explicit empty-string token. So perhaps I can fix that after all without reverting the intended change. I think we ought to be testing State != INIT.

simon_tatham edited the summary of this revision. (Show Details)

Updated to fix the handling of a trailing unclosed quoted string, as suggested. It turns out (checking with that same 6-line test program) that even an empty unclosed quoted string still turns into an argv word, contrary to one of the existing unit tests.

Fixed a CI failure (unreachable in tokenizeWindowsCommandLineImpl was reached because I forgot to correctly handle an unquoted initial backslash in CommandName mode), and added an extra unit test to ensure the triggering case stays working.

Gentle ping: @rnk, are you satisfied with my changes and explanations from last week?

rnk edited reviewers, added: hans; removed: amccarth, ruiu.Apr 14 2022, 4:18 PM

Thanks for running the experiments, I think I'm convinced we need to do this. @hans, can you review the implementation? I removed Adrian and Rui, who are no longer regular contributors.

hans accepted this revision.Apr 19 2022, 4:56 AM

Looks good to me.

llvm/lib/Support/CommandLine.cpp
1038–1039

Perhaps this should be done in a separate commit?

This revision is now accepted and ready to land.Apr 19 2022, 4:56 AM
This revision was landed with ongoing or failed builds.May 3 2022, 3:58 AM
This revision was automatically updated to reflect the committed changes.
simon_tatham added inline comments.May 3 2022, 3:59 AM
llvm/lib/Support/CommandLine.cpp
1038–1039

OK, I split it up at the last minute, and committed rG1be024ee450f2d3cb07086f6141d50f291c1910b for the problem with a trailing unclosed quoted string, and rG32814df442690d4673759296d850804773a7ea5b for the problem with \" in the command name.

Thanks for the review!