Current state machine for parsing tokens from response files in Windows does not correctly handle the case where the last token is "". The current implementation handles the last token by only adding it if it is not empty, however this does not cover the case where the last token is meant to be the empty string. We can cover this case by checking whether the state machine was last in the UNQUOTED state, which indicates that the last character of the input was a non-whitespace character.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi @rnk and @Bigcheese, I was wondering if you would be willing to review this diff.
Thank you!
I think the fix looks good, but I'd like to see a bit more in the test.
Thanks for taking on this bug.
llvm/unittests/Support/CommandLineTest.cpp | ||
---|---|---|
254 | It would make the test easier to read if we could use raw string literals for these kinds of inputs. Does LLVM style permit that? | |
257 | I'm mildly concerned that in the future someone may come in and append another argument to the end in order to test some other case, and thus invalidate this test. Can you add a comment that this test is (among other things) explicitly testing the case of an empty quoted string at the end of the command? Or maybe even make this a separate test for specifically that (e.g., TokenizeWindowsCommandLineQuotedLastArgument). Are there other cases we should test to make sure this doesn't change things? What if the last argument is " or """ or """"? What if there's white space after the last quotation mark? A quick look at the code suggests these might all work with this change, but it looks like they'd hit some other code paths. |
Thanks for the fix, I agree with Adrian's feedback. @amccarth, can you take over the review? I will remove myself, thanks.
Thanks for taking a look @amccarth! I've updated the tests to include one for " as well, I think the other versions (""", """") will trigger the same code paths although I can add them in if that is preferred.
Unfortunately, I've used the wrong linter here and will fix that and update.
Thanks for extending the tests. Those will certainly make your fix more robust against regressions.
I'm concerned about the extra complexity added to the state machine and the introduction of an unused API. The new version has more complicated flow control for no obvious benefit, and it seems unrelated to the relatively simple fix this patch started out to be. If there are good reasons for these additional changes, perhaps it would be best to break them into a separate follow-on patch. I'd be happy to review that as well.
llvm/lib/Support/CommandLine.cpp | |||
---|---|---|---|
927 | Looks like we have some scope-creep: This patch is now about more than solving the problem with arguments at the end of the line. I'm not convinced that it's important this be inlined. That's ultimately up to the compiler. I think the comment about the user entry points doesn't tell us anything that we cannot see in the function signature. | ||
934 | I don't understand what I'm supposed to take away from this comment. Either delete it or explain _why_ it's important to do as much work as possible. I'm not even sure what you mean by "work." It seems this version of the state machine does the same amount of work as the old version. Is this about the fact that the states now have their own internal loops in addition to the outer loop? | ||
944 | I'm concerned that the index is advanced here and and in the loop header. I'm not saying it's wrong. It's just harder to reason about state machine behavior when it's a hybrid of a classic loop surrounding state handlers that have their own internal state machines. Now we have a bunch of checks throughout the body to see if we've hit (or exceeded?!) the end. Is there an advantage to this that I'm not seeing? | ||
958 | This seems like a behavior change: AddToken will now be called _after_ MarkEOL. Maybe that doesn't matter, but it seems nonsensical to report the delimiter before the token it delimits. | ||
959 | Consistent with [[ https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code
| ||
1027 | Nobody is calling this new function, not even in the tests. Is the intent to replace the original function with this new one to reduce copying? Do we really need the flexibility to parse it both ways, or can we keep things a bit simpler? | ||
1613 | Did you intentionally change J to start at 0 rather than 1? This mostly looks like you were just trying to make the style consistent by capitalizing these variable names. |
Hi @amccarth, thanks for taking another look! I'm not sure how those changes render for you, on my screen they don't appear as part of this diff. They were picked up when I rebased, the original diff for those is https://reviews.llvm.org/D79262.
Ah, I guess the diff was based off the old head rather than the rebased one. Phabricator highlighted all that code as part of this patch.
Sorry about that. I'll have to redirect most of those comments @rnk . That'll be awkward.
Now that Phabricator is showing just seeing fix and test improvements, it looks great.
No worries, thanks! This is my first patch to LLVM so I do not have commit access, could I trouble you to commit it?
Looks like we have some scope-creep: This patch is now about more than solving the problem with arguments at the end of the line.
I'm not convinced that it's important this be inlined. That's ultimately up to the compiler.
I think the comment about the user entry points doesn't tell us anything that we cannot see in the function signature.