This is an archive of the discontinued LLVM Phabricator instance.

Use StringRef in Option library instead of raw pointers (NFC)
AbandonedPublic

Authored by mehdi_amini on Oct 4 2016, 3:29 PM.

Details

Reviewers
zturner
rnk
Summary

Try to convert llvm/lib/Option library away from raw
const char * pointers to handle string. Also propagating to
the uses of the API in clang itself.

Event Timeline

mehdi_amini updated this revision to Diff 73567.Oct 4 2016, 3:29 PM
mehdi_amini retitled this revision from to Use StringRef in Option library instead of raw pointers (NFC).
mehdi_amini updated this object.
mehdi_amini added a reviewer: zturner.
mehdi_amini added a subscriber: llvm-commits.
zturner edited edge metadata.Oct 4 2016, 4:13 PM

I can review this in more detail later. But my hope is that we can move to StringRef and improve code at the same time, so that it's more than just churn. One thing I've found in doing this with LLDB is that it's easy to introduce bugs, because once you get StringRef and you decide to port some C-style string manipulation, the resulting StringRef code is more elegant, but since it is re-written it has the potential to introduce new bugs. As a result, I've found smaller CLs to be much more workable.

Did you find that it was impossible to do this in a smaller CL? For example, the first change I see is changing this:

const char *addTempFile(const char *Name)

to this:

llvm::StringRef addTempFile(llvm::StringRef Name)

If instead you started from a blank slate and changed it to this:

llvm::StringRef addTempFile(const char *Name)

You would get a lot of immediate compilation failures, which you could fix up and (theoretically) end up with a CL much smaller than what you have here. Did you try that and find it impractical for some reason?

but since it is re-written it has the potential to introduce new bugs.

Yes, I spent 2 days debugging this one because of subtleties...

As a result, I've found smaller CLs to be much more workable.

This is one patch in a ~40 patches queue, I let you imagine how much smaller it is from the main diff :)

Did you find that it was impossible to do this in a smaller CL?

It is possible, I have to go through intermediate stages where I have a lot of conversion through .data() for instance. So when you have something working it is quite hard to go back.
Also the story of this patch is that I have a queue of patches to update all the LLVM APIs, and now I update the other subprojects (clang, lld, lldb) before committing these individual patches. So the question is "how do I proceed?". What depth do I propagate the change?

For example, the first change I see is changing this:

const char *addTempFile(const char *Name)

to this:

llvm::StringRef addTempFile(llvm::StringRef Name)

If instead you started from a blank slate and changed it to this:

llvm::StringRef addTempFile(const char *Name)

You would get a lot of immediate compilation failures, which you could fix up and (theoretically) end up with a CL much smaller than what you have here. Did you try that and find it impractical for some reason?

I started from a patch ready in LLVM that convert the Option library. So the issue is as I update the clang codebase to adapt for the change, there a some minor API like the one you're mentioning that come up. I can "stash" my current patch and deal with this API, before coming back to the current patch. However that's not practical: I have to rebuild the full codebase every time and it takes *long* (I only have a laptop for that, not a large workstation).

As mentioned above, if you think it is better, I can go through intermediate stages where I just add a bunch of .data() conversion. But this intermediate stage will not look pretty.

rnk added a reviewer: rnk.Oct 5 2016, 9:44 AM
rnk added inline comments.
clang/include/clang/Driver/Job.h
76

I don't like declaring a template in the header without defining it in the header. Let's just keep it const char *. It calls llvm::sys::ExecuteAndWait, which takes const char * because it calls execv/posix_spawn, which take C strings. The only other use is to implement Command::Print, which can suffer the const char *'s.

clang/lib/Driver/Compilation.cpp
101–102

You can change CleanupFile to take StringRef, and not do the dangerous data call.

clang/lib/Driver/Driver.cpp
97

Hang on, nullptr is different from an empty string in the response file parsing logic. We have to be careful not to affect that.

llvm/include/llvm/Support/CommandLine.h
1806–1811

I don't think this is the same for MarkEOLs = true. Command lines can totally contain empty string arguments that do not mark response file line endings.

Probably the right way to do this is to have a second optional output vector that has the indices of all the response file line endings.

I guess my comment applies everywhere there is a call to .data(). Suppose you shelve this CL, then go change some of the places where you are using .data() in this CL to accept StringRef arguments, then rebase this CL on top of those. Since those functions have been left alone in this CL, I would guess that many of them are independent of the changes in this CL and could be converted to StringRef first (and in much smaller CLs), then when you come back to this one you won't have to use .data() so much.

As mentioned in the first comment though, whenever .data() is necessary, I would tend to favor .str().c_str() (the fact that it is inefficient compared to .data() could serve as more incentive to go fix the other places first so that you don't have to do either one)

clang/lib/Driver/Compilation.cpp
101–102

In general, I prefer it->str().c_str() over data() almost everywhere. Yes it's slower, but we shouldn't be making any assumptions about the underlying StringRef. And anyway, in this particular case we're dealing with the filesystem. An extra alloc isn't goign to hurt.

clang/lib/Driver/Driver.cpp
618

Can you change Command::printArg to take a StringRef? If it's too hard to change it outright, you could add an overload.

796

Can setResponseFile take a StringRef?

1168

Can BindArchAction take a StringRef?

1310

Can LookupTypeForExtension take a StringRef?

1350

Same as above.

2314

Same here.

Attached is a sample patch that converts some of the simpler functions to accepting StringRef. This should get rid of some of the data() calls. There's more functions you can do this too as well, I didn't want to go too deep though. I think you should do this for as many functions as possible, then rebase on top of that and remove all those data calls. We can look at the rest after. In general though I'm against using data() anywhere if at all possible.

clang/lib/Driver/Tools.cpp
137

In this case, there's no reason for data at all. You should change the for loop to use a range based for over ArgStr.

mehdi_amini added inline comments.Oct 5 2016, 4:14 PM
clang/lib/Driver/Driver.cpp
97

Good to know, during a recent discussion about StringRef we were wondering about examples where we'd like to differentiate between empty and null.

It seems we don't have any test that rely on this different right now, and I don't know about the "response file" (what are they, how to use them), I so don't feel I can come up easily with a test without guidance.

llvm/include/llvm/Support/CommandLine.h
1806–1811

This is the same underlying issue as above right?
We "could" actually look through .data() to differentiate empty string from nullptr, but I'm not fan of this...

I just had an idea. Why not add an llvm::StringPool to the Compilation. This would pretty much solve all ownership problems.

I'm not aware of any ownership issue here, so I don’t know what you’re referring to?

I'm not aware of any ownership issue here, so I don’t know what you’re referring to?

The ownership problem I'm referring to is that if you return a str.str().c_str(), the resulting c_str() will die immediately, whereas the caller might need the pointer to remain alive. A StringSaver would solve that, so that you could return Saver.save(str).data() anywhere you would normally want to return str.data()

The ownership problem I'm referring to is that if you return a str.str().c_str(), the resulting c_str() will die immediately, whereas the caller might need the pointer to remain alive. A StringSaver would solve that, so that you could return Saver.save(str).data() anywhere you would normally want to return str.data()

I see: you're referring to an ownership issue that don't exist in the code but that you want to introduce by forcing allocations where we don't need to.

mehdi_amini updated this revision to Diff 74055.Oct 8 2016, 4:12 PM
mehdi_amini marked an inline comment as done.

Update more APIs to take StringRef to remove calls to .data()

mehdi_amini added inline comments.Oct 8 2016, 4:37 PM
clang/lib/Driver/Driver.cpp
97

@rnk ping

Drive-by comment regarding the process of creating changes like this:
In our somewhat larger codebase we've found that the best way to create this type of CL is by writing a clang tool:

  • the clang tool can be reviewed on its own, and behavior can be bikeshed on the tests
  • it's easy to re-run a clang tool on a subset of the interfaces to produce any size CL you want
  • if you can make it (or convert it into) a clang-tidy check with a fixit, we prevent regressions

That said, if others are fine reviewing this change I'm fine with it; this is mainly an idea for the future :)

mehdi_amini abandoned this revision.Oct 27 2016, 2:18 PM

Abandon after D25639 went dead.

rnk edited edge metadata.Oct 27 2016, 2:47 PM

hm, this comment has been sitting unsubmitted for a while.

clang/lib/Driver/Driver.cpp
97

We mark EOLs in the response file mostly so that we can be compatible with MSVC, which has arguments that consume everything following them. Given this response file, the compilation command should get all the -D flags, and the link should get 3 -debug flags:

$ cat foo.rsp
-DFOO -link -debug
-DBAR -link -debug
-DBAZ -link -debug

$ clang-cl t.cpp @foo.rsp -###
...
"clang-cl" "-cc1" ... "-D" "FOO" "-D" "BAR" "-D" "BAZ"  ...
"link.exe" "-out:t.exe" "-nologo" "-debug" "-debug" "-debug" ...

So, we need a way to distinguish between empty strings and EOLs.

I mentioned one way to do this:

Probably the right way to do this is to have a second optional output vector that has the indices of all the response file line endings.