This is an archive of the discontinued LLVM Phabricator instance.

[clang][cli] Teach CompilerInvocation to allocate strings on its own
AbandonedPublic

Authored by jansvoboda11 on Jan 27 2021, 4:05 AM.

Details

Summary

Some members of CompilerInvocation store StringRefs pointing to the command line arguments.

When round-tripping, those StringRefs will reference strings we generated ourselves. To ensure they live long enough, this patch creates a string storage inside CompilerInvocation itself.

To prevent iterator invalidation, std::forward_list was used.

Diff Detail

Event Timeline

jansvoboda11 requested review of this revision.Jan 27 2021, 4:05 AM
jansvoboda11 created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2021, 4:05 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Can you include the use in the patch? I’m not sure having the AllocateString API totally makes sense, vs moving in a string pool that was independently used during generation, but maybe in context?

I also wonder what command-line option is forcing this. Ideally we could arrange the generated -cc1 command line to only use generated strings for numbers / enums, which will not need long-term storage as stringrefs. String arguments could reuse the original char*. It seems better to drop frivolous = than carrying this around.

Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.

clang/include/clang/Frontend/CompilerInvocation.h
144

I suggest using llvm::StringPool instead, which also uniques the strings and is used elsewhere for this kind of thing. Probably in an Optional to make it more clear it’s not always used.

Maybe the name RoundTrippedStrings?

Can you include the use in the patch?
I also wonder what command-line option is forcing this.

The use-case that requires it is here: https://github.com/llvm/llvm-project/blob/f3449ed6073cac58efd9b62d0eb285affa650238/clang/lib/Frontend/CompilerInvocation.cpp#L734.
Clang accepts analyzer arguments in form of -analyzer-config key=value. Reference to the key sub-string is stored as an index in a StringMap (AnalyzerOptions::Config) and values are held by std::string. More entries get stored into the map in parseAnalyzerConfigs.

String arguments could reuse the original char*.

Interesting idea. We could avoid allocation (and the need for owning strings) by grabbing the map key substring, scanning past its end until we hit \0 and putting the whole thing into generated arguments. Keys in the StringMap that are not followed by =value most likely only come from parseAnalyzerConfigs, but we don't need to generate those.
This could work, but I'm worried it's somewhat brittle.

It seems better to drop frivolous = than carrying this around.

Cool, this could save us allocations in some cases. Not sure dropping = between key-value pairs would result in good user experience though.

I’m not sure having the AllocateString API totally makes sense, vs moving in a string pool that was independently used during generation, but maybe in context?

What exactly do you mean here?

Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.

Ah, that's right. So if we don't go with the allocation-less approach, we'd need to take StringAllocator from the client that also ensures proper lifetimes.

jansvoboda11 added inline comments.Jan 27 2021, 8:35 AM
clang/include/clang/Frontend/CompilerInvocation.h
144

From the documentation of StringPool, it seems like we would need to store PooledStringPtrs in order to keep them alive. How would that work with the const char * API we have?

dexonsmith added a comment.EditedJan 27 2021, 11:52 AM

Can you include the use in the patch?
I also wonder what command-line option is forcing this.

The use-case that requires it is here: https://github.com/llvm/llvm-project/blob/f3449ed6073cac58efd9b62d0eb285affa650238/clang/lib/Frontend/CompilerInvocation.cpp#L734.
Clang accepts analyzer arguments in form of -analyzer-config key=value. Reference to the key sub-string is stored as an index in a StringMap (AnalyzerOptions::Config) and values are held by std::string. More entries get stored into the map in parseAnalyzerConfigs.

String arguments could reuse the original char*.

Interesting idea. We could avoid allocation (and the need for owning strings) by grabbing the map key substring, scanning past its end until we hit \0 and putting the whole thing into generated arguments. Keys in the StringMap that are not followed by =value most likely only come from parseAnalyzerConfigs, but we don't need to generate those.
This could work, but I'm worried it's somewhat brittle.

I think (?) running the tests with ASan on would catch any bugs, and there are bots that do that. Might be worth double-checking though.

You'd also want to assert(!Arg.end()[0] && "reused command-line arg not null-terminated?")...

It seems better to drop frivolous = than carrying this around.

Cool, this could save us allocations in some cases. Not sure dropping = between key-value pairs would result in good user experience though.

Well, the compiler developers are the users, since -cc1 isn't for end-users. I acknowledge it's a tradeoff. IMO, without actually seeing it in practice, it seems a bit nicer to design this away entirely. But maybe having a = is more valuable than I think.

Here's another option. Assume we have a broken-up version of UniqueStringSaver that lets us pre-load the existing strings and move away the new ones for later storage, something like:

BumpPtrAllocator Storage;
DenseSet<StringRef> ExistingStrings;
StringSaver NewStrings(Storage);

auto saveString = [&](const Twine &T) {
  SmallVector<char, 128> Storage;
  StringRef S = T.toStringRef(Storage);
  auto I = ExistingStrings.find(S);
  if (I != ExistingStrings.end())
    return *I;
  S = NewStrings.save(S);
  ExistingStrings.insert(S);
  return S;
};

Strings.ExistingStrings.reserve(CommandLineArgs.size());
for (StringRef Arg : CommandLineArgs)
  Strings.ExistingStrings.insert(Arg);

For string options:

  • If -cc1 accepts only -opt string, no need for allocation.
  • If -cc1 accepts only -opt=string, the fully-spelled option is on the original command-line. ExistingStrings.find() should turn it up when calling Strings.save().
  • If -cc1 accepts both -opt=string and -opt string canonicalize to -opt string when generating. No need for allocation.

For numeric / enum options we'll need to allocate, but the allocation doesn't have to live past running the parser.

The only problem would be if string gets canonicalized / rewritten somehow during parse + generate, or if an enum value is additionally saved as a string value.

I’m not sure having the AllocateString API totally makes sense, vs moving in a string pool that was independently used during generation, but maybe in context?

What exactly do you mean here?

The intent in this patch seemed to be to call CompilerInvocation::AllocateString when generating arguments. My intuition would be to continue using the StringMap we have now (or StringSaver, or whatever), and then std::move() that into a CompilerInvocation that's parsed from it. In which case there's no need for CompilerInvocation::AllocateString.

In fact, it might be even better if it was type-erased, just a payload that had to be kept alive...

class CompilerInvocation {
  class RoundTrippedCommandLineStorage {
    std::shared_ptr<void> Storage;
  public:
    void store(BumpPtrAllocator &&Strings) {
      assert(!Storage);
      Storage = std::make_shared<BumpPtrAllocator>(std::move(Strings));
    }
  } RoundTrip;
};

Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.

Ah, that's right. So if we don't go with the allocation-less approach, we'd need to take StringAllocator from the client that also ensures proper lifetimes.

Maybe it wouldn't be terrible to have a RoundTrip arg on each options class. When you assign to the CompilerInvocation, it copies the shared ptr into each of the options classes, so that the pool stays alive as long as one of them has it.

clang/include/clang/Frontend/CompilerInvocation.h
144

Oops; I meant StringSet<> / StringMap<NoneType> / etc., which is what I've always used in the past. This is essentially a StringMap<void>; I just always call the variable StringPool, hence my error. It has the properties you need (stable, kept alive indefinitely, null-terminated).

Looking around, there are a couple of options designed for this:

  • StringSaver, which is an optimized version of your list (uses a bumpptrallocator, no uniquing).
  • UniqueStringSaver, which is essentially equivalent to StringSet<>; might be slightly more optimized, and if it's not, we should fix it.

I suggest using one of those since they exactly match what you need.

Well, the compiler developers are the users, since -cc1 isn't for end-users. I acknowledge it's a tradeoff. IMO, without actually seeing it in practice, it seems a bit nicer to design this away entirely. But maybe having a = is more valuable than I think.

For string options:

  • If -cc1 accepts only -opt string, no need for allocation.
  • If -cc1 accepts only -opt=string, the fully-spelled option is on the original command-line. ExistingStrings.find() should turn it up when calling Strings.save().
  • If -cc1 accepts both -opt=string and -opt string canonicalize to -opt string when generating. No need for allocation.

For numeric / enum options we'll need to allocate, but the allocation doesn't have to live past running the parser.

The only problem would be if string gets canonicalized / rewritten somehow during parse + generate, or if an enum value is additionally saved as a string value.

We can definitely ask around how would people feel about that.

Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.

Ah, that's right. So if we don't go with the allocation-less approach, we'd need to take StringAllocator from the client that also ensures proper lifetimes.

Maybe it wouldn't be terrible to have a RoundTrip arg on each options class. When you assign to the CompilerInvocation, it copies the shared ptr into each of the options classes, so that the pool stays alive as long as one of them has it.

I think this is the most robust solution.

Moving BumpPtrAllocator/StringSaver into CompilerInvocation will become problematic once we start round-tripping more option classes.

I've looked through all option classes in CompilerInvocation, and AnalyzerOptions::Config seems to be the only member that stores non-owning references to command line arguments.
I think the best path forward is to change AnalyzerOptions to take ownership and avoid dealing with complicated lifetimes. We can look into removing allocations later, as an optional optimization.
What do you think?

clang/include/clang/Frontend/CompilerInvocation.h
144

Thanks for mentioning StringSavers, they fit our needs perfectly (now used in D94472).

Well, the compiler developers are the users, since -cc1 isn't for end-users. I acknowledge it's a tradeoff. IMO, without actually seeing it in practice, it seems a bit nicer to design this away entirely. But maybe having a = is more valuable than I think.

For string options:

  • If -cc1 accepts only -opt string, no need for allocation.
  • If -cc1 accepts only -opt=string, the fully-spelled option is on the original command-line. ExistingStrings.find() should turn it up when calling Strings.save().
  • If -cc1 accepts both -opt=string and -opt string canonicalize to -opt string when generating. No need for allocation.

For numeric / enum options we'll need to allocate, but the allocation doesn't have to live past running the parser.

The only problem would be if string gets canonicalized / rewritten somehow during parse + generate, or if an enum value is additionally saved as a string value.

We can definitely ask around how would people feel about that.

Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.

Ah, that's right. So if we don't go with the allocation-less approach, we'd need to take StringAllocator from the client that also ensures proper lifetimes.

Maybe it wouldn't be terrible to have a RoundTrip arg on each options class. When you assign to the CompilerInvocation, it copies the shared ptr into each of the options classes, so that the pool stays alive as long as one of them has it.

I think this is the most robust solution.

Moving BumpPtrAllocator/StringSaver into CompilerInvocation will become problematic once we start round-tripping more option classes.

I've looked through all option classes in CompilerInvocation, and AnalyzerOptions::Config seems to be the only member that stores non-owning references to command line arguments.
I think the best path forward is to change AnalyzerOptions to take ownership and avoid dealing with complicated lifetimes. We can look into removing allocations later, as an optional optimization.
What do you think?

I think that makes sense given the rest of CompilerInvocation is owning.

I've looked through all option classes in CompilerInvocation, and AnalyzerOptions::Config seems to be the only member that stores non-owning references to command line arguments.
I think the best path forward is to change AnalyzerOptions to take ownership and avoid dealing with complicated lifetimes. We can look into removing allocations later, as an optional optimization.
What do you think?

I think that makes sense given the rest of CompilerInvocation is owning.

+1, let's just simplify this.

jansvoboda11 abandoned this revision.Jan 29 2021, 12:51 AM

Closing this in favor of D95629.