This is an archive of the discontinued LLVM Phabricator instance.

Add StringRef::scan_between
Needs ReviewPublic

Authored by zturner on Aug 28 2016, 9:31 PM.

Details

Summary

A common pattern is something like this:

// S is of the form "foo[bar]baz", get bar out.
size_t open = S.find("[");
if (open != npos) {
  size_t close = S.find("]", open);
  if (close != npos) {
    StringRef middle = S.slice(open, close);
    // Do something
  }
}

I want to simplify the above by turning it into one line:

StringRef middle = S.scan_between("[", "]");

This patch addresses this.

Diff Detail

Event Timeline

zturner updated this revision to Diff 69529.Aug 28 2016, 9:31 PM
zturner retitled this revision from to Add StringRef::scan_between.
zturner updated this object.
zturner added a subscriber: llvm-commits.

Ahh, forgot that I also added contains in this patch. I'd like to get that function in as well.

Adding a few people for comments.

amccarth added inline comments.Aug 29 2016, 1:18 PM
include/llvm/ADT/StringRef.h
501

This description is not 100% accurate. See my next comment.

unittests/ADT/StringRefTest.cpp
382

The expectation here surprised me. Given that "ABCDE" doesn't exist in the string, I'd expect there to be no match (as with the next example). But there's an asymmetry here.

I'm not necessarily opposed to this, but I think it's worth discussing since surprising behavior from an API leads to misuse of the API. At a minimum, the description of the method needs to be corrected to make this behavior explicit.

Consider Foo.scan_between("(", ")"). Would you really want a non-empty result if Foo did not have a matched pair of parentheses?

zturner added inline comments.Aug 29 2016, 1:26 PM
include/llvm/ADT/StringRef.h
501

The description could probably be improved.

unittests/ADT/StringRefTest.cpp
382

When I came up with these semantics, I decided to treat "I didn't find the string" not as an error, but as an "I made it to the end, keep going".

This roughly matches the semantics of split(), where if if A.split(B) doesn't find B, it returns (A, "").

amccarth added inline comments.Aug 29 2016, 2:42 PM
unittests/ADT/StringRefTest.cpp
382

split doesn't document what it does if you pass an empty string as a separator. I'd expect an empty string to match at the first position, which is what happens with std::string::find. If scan_between were to use the same semantics as split, I'd expect A.scan_between("", "") to return an empty string rather than A.

zturner added inline comments.Aug 29 2016, 2:53 PM
unittests/ADT/StringRefTest.cpp
382

The API should be able to handle the case where you say "I know the substring foo is in here, return me everything after foo". Likewise, it should be able to handle the case of "I know the substring foo is in here, return me everything before foo". With the current API, those would be expressed as A.scan_between("foo", "") and A.scan_between("", "foo") respectively. If Right.empty() is not treated as matching the end of the string, then there would be no way to implement the first case at all, which seems kind of fundamental for an api such as this.

We could add an overload that takes a size_t for the second argument, allowing you to pass npos, but I'm not sure if anyone would ever use that api for any other purpose. Alternatively, we could add a single argument overload, but it might be confusing to have A.scan_between("foo") return something different from A.scan_between("foo", "")

amccarth added inline comments.Aug 29 2016, 3:05 PM
unittests/ADT/StringRefTest.cpp
382

The API should be able to handle the case where you say "I know the substring foo is in here, return me everything after foo".

Isn't that what split is for? Why would someone choose scan_between for that?

I think this API behaves in a manner that's surprising enough that it will eventually confuse somebody into using it incorrectly. But that's only for corner cases, so it's probably not important enough. I'm willing to let it go if the behavior is more accurately document in the comments on the method.

TBH, this seems like a rather specialized use case of StringRef... I don't think it is appropriate to grow this sort of parsing machinery here.

I'd avoid making it a member of the general interface and just sink this method into your particular use case as a function.

TBH, this seems like a rather specialized use case of StringRef... I don't think it is appropriate to grow this sort of parsing machinery here.

I'd avoid making it a member of the general interface and just sink this method into your particular use case as a function.

I think searching for a substring and dropping everything before or after the substring is very common. Perhaps scanning for a section *between* two substrings is specialized, but I don't agree about the single delimeter case. So what about scan_before() and scan_after()? This way scan_between(X, Y) becomes scan_after(X).scan_before(Y). And perhaps the edge cases are easier to hash out with the simpler parameter list.

Actually I guess split(X)[0] or split(X)[1] could be used for that purpose.

David, surely contains is less controversial and can go in? :)

David, surely contains is less controversial and can go in? :)

Yeah, that LGTM :)