This is an archive of the discontinued LLVM Phabricator instance.

Demonstrate proposed new Args API
ClosedPublic

Authored by zturner on Nov 18 2016, 8:26 PM.

Details

Summary

The purpose of this patch is to demonstrate my proposed new API to the Args class. There are a couple of things I'm trying to demonstrate here:

  1. range-based for loops by iterating directly over the Args instance itself. This should be the preferred way to iterate over arguments and should always be used unless index-based iteration is required.
  1. range-based for loops by iterating over args.entries(). This should be used when you only want to iterate a subset of the arguments, like if you want to skip the first one (e.g. for (const auto &entry : args.entries().drop_front()). This prevents many types of bugs, such as those posted to the list recently in which we repeatedly access args[0] instead of args[i].
  1. Index-based iteration through the use of operator[]. Should only be used when you need to do look ahead or look-behind, or a command takes positional arguments. Prefer range-based for otherwise. This method supersedes GetArgumentAtIndex and the long term plan is to delete GetArgumentAtIndex after all call-sites have been converted.
  1. Direct access to StringRef through the use of entry.ref. Should always be used unless you're printing or passing to a C-api.
  1. Direct access to the quote char through the use of entry.quote. Supersedes GetArgumentQuoteCharAtIndex.
  1. Access to a null-terminated c-string. Should only be used when printf'ing the value or passing to a C-api (which, btw, should almost never be done for any kind of string manipulation operation, as StringRef provides everything you need). Long term I plan to delete this accessor, the only thing blocking this is the heavy use of printf-style formatting (which I have a solution for in mind). In any case, it should not be relied on for any kind of algorithmic access to the underlying string.
  1. A couple of simple examples of how this proposed new API can trivially reduce string copies and improve efficiency.

All dependent calls have already been updated to take StringRef, so the "phase 2", so to speak, will be to begin eliminating calls to GetArgumentAtIndex using the methods described above. Once they are all gone, the plan is to delete GetArgumentAtIndex and GetQuoteCharAtIndex.

In the more distant future, if I can make my Printf alternative a reality, I plan to delete c-style string access entirely by removing the c_str() method from ArgEntry.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner updated this revision to Diff 78617.Nov 18 2016, 8:26 PM
zturner retitled this revision from to Demonstrate proposed new Args API.
zturner updated this object.
zturner added reviewers: jingham, beanz, clayborg, labath.
zturner added a subscriber: lldb-commits.
labath edited edge metadata.Nov 19 2016, 5:23 AM

I don't know how deep do you want this refactor to be, but there is one issue I would like us to consider, if only to decide it is out of scope of this change. I am talking about the quote_char thingy. The main problem for me is that I don't think it's possible to sanely define the meaning of that field. According to POSIX quoting rules (which our command line more-or-less follows) a single argument can be quoted in a great many ways, using various combinations of quote characters. For example, these are all valid ways to represent the argument asdf in a POSIX shell:

asdf
"asdf"
'asdf'
a"sd"f
"as"df
"as""df"
"as"'df'
"a"s'd'"f"
... (you get my point)

I don't think there is a self-consistent way to define what the quote_char field will be for each of these options. Moreover, I don't see why one would ever need to use that field. It can only encourage someone to try to "quote" the argument by doing quote_char+value+quote_char, which is absolutely wrong if you ever want that result to be machine parsable.(*) For proper quoting I think we should just have a free-standing std::string quote_for_posix_shell(llvm::StringRef) function (and maybe quote_for_windows_cmd, and whatever else quoting scheme we need), and then the user can decide which one to use based on who is going to be consuming it. Then we can just kill the quote field. The only thing is... I have no idea how much work that will be (but I am ready to chip in to make it happen).

So, yea, if we decide not to do that, then I think the interface looks great. Otherwise, I think we can design a slightly simpler (and more consistent) one.

(*) Bonus question: Try to start an executable under lldb, so that in enters main() with argc=2 and argv[1]="'" I.e., as if it had been started this way via bash:

$ /bin/cat \'

Assuming we do that, what interface do you think would be simpler? We still
need easy access to both a StringRef and a c_str(), since StringRef::data
is not guaranteed to be null terminated, so the entry thing is still nice.

I was assuming (possibly incorrectly, I did not look at that code much) that the main user of the null-terminated string version was execve(2), which needs an entire list of strings, not just a single item, in which case we could have the iteration simply be over StringRefs. That said, it probably does not matter now, as judging by your comments, you're looking for incremental changes, not one grand final design (btw, I didn't mean to suggest that this should all be done in one patch). In that case, this looks fine as far as I am concerned. We can always revise this later, and it doesn't look like it will require any major rewrite, just a bit of syntax-twiddling. You can then read my comments as "the direction I would like to move us in".

One idea might be to have the entry contain 2 StringRefs. str and
quoted_str. This way you never get access to the underlying quote char,
just the full arg, either quoted or unquoted (although doing this would
still be better done orthogonally to this patch)

I don't think this is a good idea, as I don't see a need to be able to access the original quoted string this way. Also, when you construct the args vector programmatically, you would have to invent a quoted representation without knowing if it will ever be used. I'd prefer to have a standalone quote function instead, as then it can be used in other contexts as well (separate algorithms from data).

jingham edited edge metadata.Nov 21 2016, 12:45 PM

I don't know how deep do you want this refactor to be, but there is one issue I would like us to consider, if only to decide it is out of scope of this change. I am talking about the quote_char thingy. The main problem for me is that I don't think it's possible to sanely define the meaning of that field. According to POSIX quoting rules (which our command line more-or-less follows) a single argument can be quoted in a great many ways, using various combinations of quote characters. For example, these are all valid ways to represent the argument asdf in a POSIX shell:

asdf
"asdf"
'asdf'
a"sd"f
"as"df
"as""df"
"as"'df'
"a"s'd'"f"
... (you get my point)

I don't think there is a self-consistent way to define what the quote_char field will be for each of these options. Moreover, I don't see why one would ever need to use that field. It can only encourage someone to try to "quote" the argument by doing quote_char+value+quote_char, which is absolutely wrong if you ever want that result to be machine parsable.(*) For proper quoting I think we should just have a free-standing std::string quote_for_posix_shell(llvm::StringRef) function (and maybe quote_for_windows_cmd, and whatever else quoting scheme we need), and then the user can decide which one to use based on who is going to be consuming it. Then we can just kill the quote field. The only thing is... I have no idea how much work that will be (but I am ready to chip in to make it happen).

So, yea, if we decide not to do that, then I think the interface looks great. Otherwise, I think we can design a slightly simpler (and more consistent) one.

(*) Bonus question: Try to start an executable under lldb, so that in enters main() with argc=2 and argv[1]="'" I.e., as if it had been started this way via bash:

$ /bin/cat \'

The outermost quote character is syntactically significant in the lldb command language. If you say:

memory read -c count_var 0x123345

then lldb evaluates the expression in the backticks, replacing the argument with the result of the expression. So you can't get rid of the quote character altogether. The other use is in completion to figure out what an unterminated quoted string should be completed with.

Since only the outermost quoting character is important, I think the problem is more tractable than your examples would suggest.

Instead of storing the quote char, I think we could just store two StringRefs. str and quoted_str, where str = quoted_str.trim(quote_char);, and if there is no quote char, then they are equal. I'd rather do that in a followup since this ArgEntry class has already been here for some time and is kind of orthogonal to what's going on here, but it seems like a sane approach, as then you don't have to worry about anyone reconstructing the quoted version from the unquoted version, it's just already there.

clayborg accepted this revision.Nov 21 2016, 2:41 PM
clayborg edited edge metadata.
This revision is now accepted and ready to land.Nov 21 2016, 2:41 PM
This revision was automatically updated to reflect the committed changes.

The outermost quote character is syntactically significant in the lldb command language. If you say:

memory read -c count_var 0x123345

then lldb evaluates the expression in the backticks, replacing the argument with the result of the expression. So you can't get rid of the quote character altogether. The other use is in completion to figure out what an unterminated quoted string should be completed with.

Since only the outermost quoting character is important, I think the problem is more tractable than your examples would suggest.

I didn't mention backticks, as (to use bash parlor) command substitution and word splitting are two separate passes. backticks can be embedded into a single word, which also contains other kinds of quote characters, so storing the quote character alongside the word-split argument is not going to help you implement command substitution. LLDB actually handles this mostly correctly now (notice that in each case the input gets parsed as a single word):

(lldb) `1+1`
error: '2' is not a valid command.
error: Unrecognized command '2'.
(lldb) a`1+1`a
error: 'a2a' is not a valid command.
error: Unrecognized command 'a2a'.
(lldb) `1+1`a`2+2`
error: '2a4' is not a valid command.
error: Unrecognized command '2a4'.
(lldb) "a a"`1+1`"b a"
error: unknown command shorthand suffix: ' a'
error: Unrecognized command 'a a2b a'.

The reason this works is because the command substitution is done in a different place in the code, so I still hold my view that the Args class does not need to differentiate between the different quote characters as it is just a holder for a word-split input.

That said, none of this should affect this patch. If the quote char field has been there until now, it can stay a little longer.