This is an archive of the discontinued LLVM Phabricator instance.

Add Twine support for std::string_view.
ClosedPublic

Authored by saugustine on Jun 8 2021, 5:03 PM.

Details

Summary

With Twine now ubiquitous after rG92a79dbe91413f685ab19295fc7a6297dbd6c824,
it needs support for string_view when using llvm as a library under C++ standards.

This is similar to how StringRef is handled.

Diff Detail

Event Timeline

saugustine created this revision.Jun 8 2021, 5:03 PM
saugustine requested review of this revision.Jun 8 2021, 5:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2021, 5:03 PM
saugustine edited the summary of this revision. (Show Details)Jun 8 2021, 5:05 PM
saugustine added reviewers: lattner, rriddle.
saugustine added a reviewer: jyknight.
craig.topper added inline comments.
llvm/lib/Support/Twine.cpp
48

Is string_view guaranteed to be null terminated?

saugustine updated this revision to Diff 350743.Jun 8 2021, 5:10 PM

Small fix for prototype.

saugustine added inline comments.Jun 8 2021, 5:15 PM
llvm/lib/Support/Twine.cpp
48

A string_view is not guaranteed to be null terminated, so it includes the size field, which is filled in when the string_view at construction time, possibly by the constructor searching for a null in the string or array of characters, but perhaps just by passing in a length.

craig.topper added inline comments.Jun 8 2021, 5:19 PM
llvm/lib/Support/Twine.cpp
48

That's what I thought. This function is expecting to return a String that has a null terminator following it isn't it? StringRef isn't directly handled here so I wouldn't expect string_view to be.

lattner accepted this revision.Jun 8 2021, 5:51 PM

LGTM, thanks!

This revision is now accepted and ready to land.Jun 8 2021, 5:51 PM
saugustine updated this revision to Diff 350765.Jun 8 2021, 7:51 PM

This addresses the upstream comments.

It also adds a string_view overload for raw_ostream, as now that
a string_view can be converted to both a Twine and a StringRef,
those calls would be ambiguous.

saugustine added inline comments.Jun 8 2021, 8:18 PM
llvm/lib/Support/Twine.cpp
48

I see what you are saying now. Fixed.

This revision was landed with ongoing or failed builds.Jun 8 2021, 8:30 PM
This revision was automatically updated to reflect the committed changes.

This patch causes segmentation faults when you have an application compiled with -std=c++17 (The current gcc default) and libLLVM.so compiled with c++14 (The current LLVM project default). Is this a valid use case or do we require that all library users use the same c++ version as the library.

This patch causes segmentation faults when you have an application compiled with -std=c++17 (The current gcc default) and libLLVM.so compiled with c++14 (The current LLVM project default). Is this a valid use case or do we require that all library users use the same c++ version as the library.

The most immediate reason is the enum value StdStringViewKind has an (unnecessary) #ifdef around it. Making that unconditional seems entirely reasonable, and is easy enough.

But, even with that, compiling with different standards versions will result in an ODR violation in getSingleStringRef(), which may or may not cause a problem, depending on whether it's inlined, and whether the C++14-compiled version of it wins the link choice. Also, if llvm itself is compiled with c++14 while the consumer is compiled with c++17, the various print methods defined in the cpp file will not work either.

I'm not sure removing the ifdef guards around the enum value is better.

That would also mean moving the #ifdef guards in all the switch statements so the enum value would be present but unhandled by doing nothing (or asserting or similar) which seems pretty wrong.

This patch causes segmentation faults when you have an application compiled with -std=c++17 (The current gcc default) and libLLVM.so compiled with c++14 (The current LLVM project default). Is this a valid use case or do we require that all library users use the same c++ version as the library.

Seems like valid/important use case to me!

I wonder if the Twine part of the patch should wait for LLVM to drop support for C++14.

llvm/include/llvm/ADT/Twine.h
299–303

Is this constructor necessary? I would have expected this to get picked up as a conversion by the StringRef constructor... but maybe there are two candidates?

@dblaikie / @lhames , maybe one of you has ideas for how to add Twine support for std::string_view in C++17 and later without changing Twine's ABI?

I think the following won't work, because the lifetime of the StringRef wouldn't be extended to match the lifetime of the calling expression:

Twine(const std::string_view &view) : Twine(StringRef(view)) {}

Seems like valid/important use case to me!

The ODR violation issues would need to be resolved to support this case.

I wonder if the Twine part of the patch should wait for LLVM to drop support for C++14.

This patch follows on another one that enables Twine support everywhere over std::string. Before that, because strings are trivially convertable to string_views, there was no ambiguity. With that change there is ambiguity in many places, and causes similar problems to the present one for downstream C++ 17 users. Not sure which case should win.

llvm/include/llvm/ADT/Twine.h
299–303

There are indeed two candidates, so we need this one to disambiguate.

Seems like valid/important use case to me!

The ODR violation issues would need to be resolved to support this case.

Well, there's also the option of reverting this patch.

This patch follows on another one that enables Twine support everywhere over std::string. Before that, because strings are trivially convertable to string_views, there was no ambiguity. With that change there is ambiguity in many places, and causes similar problems to the present one for downstream C++ 17 users. Not sure which case should win.

I took a quick look at the patch mentioned in the description, rG92a79dbe91413f685ab19295fc7a6297dbd6c824, which seems to be just updating MLIR. I imagine there's a way to make those APIs easier to use without breaking LLVM. IMO, a compile-time failure (problem before this patch) is much easier to reason about and fix for a library user than a miscompile (problem after this patch).

What is the ambiguity between? Twine also has a constructor directly from std::string so it's not obvious to me how that patch introduces ambiguity. If the problem is an LLVM client that's using std::string_view instead of StringRef, it seems reasonable for the client to explicitly convert to a StringRef() to use the APIs. Alternatively, specific APIs can add std::string_view wrapper overloads that defer to the Twine versions.

Reverting this patch might help in the instant case, but the odds of something similar happening, where the data structure is different across different versions of the standard is very hard to guarantee. And also hard to test for.

Reverting this patch might help in the instant case, but the odds of something similar happening, where the data structure is different across different versions of the standard is very hard to guarantee. And also hard to test for.

When things are hard to test for, we usually rely on vendors to find and report problems during qualification/deployment (like @tstellar did) and then revert/refine/fix the patch that broke them.

Not sure why you see it as likely that a data structure would vary across different versions of the standard. Seems to me that'd only happen when someone has code (like this patch) that explicitly asks for it, assuming your toolchain is ABI compatible between standards versions, like clang and libc++ are.

Yeah, the current state of this is not acceptable - compatibility across different standard versions is necessary.

@dblaikie / @lhames , maybe one of you has ideas for how to add Twine support for std::string_view in C++17 and later without changing Twine's ABI?

I think the following won't work, because the lifetime of the StringRef wouldn't be extended to match the lifetime of the calling expression:

Twine(const std::string_view &view) : Twine(StringRef(view)) {}

Yep.

Would it be so bad if we made "Child" larger by storing a StringRef (or equivalent - anything involving a pointer+length) in it (so Child would be two words rather than one)? Then we could collapse several cases into that (StringRef, string_view, std::string, SmallString, (could even do char as well)).

MaskRay added a subscriber: MaskRay.EditedJul 16 2021, 10:10 AM

llvm/include has such __cplusplus dispatches in a few other places, e.g. llvm/include/llvm/ADT/StringRef.h L82 strLen. So those (-std=c++14 libLLVM.so + -std=c++17 application using llvm/include headers) users are already relying on benign ODR violation.

LLVM has -fvisibility-inlines-hidden, and an additional -Bsymbolic-functions for safeguard, after linking into libLLVM.so, the ODR violation should be benign.

Ensuring that struct/enum sizes are the same for different -std= modes is probably a sufficient fix to regular users (and a sufficiently good workaround for libLLVM.so with mixed -std= users).


This thread does raise the question about what degree of support we want to provide for -DLLVM_LINK_LLVM_DYLIB=on users. Before previously shared object users were explicitly not supported. Now it may change as many Linux distributions have switched to -DLLVM_LINK_LLVM_DYLIB=on for their llvm/clang packages for size. With shared objects, we have to care about all sorts of compiler options which can cause non-benign ODR issues.

llvm/include has such __cplusplus dispatches in a few other places, e.g. llvm/include/llvm/ADT/StringRef.h L82 strLen. So those (-std=c++14 libLLVM.so + -std=c++17 application using llvm/include headers) users are already relying on benign ODR violation.

LLVM has -fvisibility-inlines-hidden, and an additional -Bsymbolic-functions for safeguard, after linking into libLLVM.so, the ODR violation should be benign.

Ensuring that struct/enum sizes are the same for different -std= modes is probably a sufficient fix to regular users (and a sufficiently good workaround for libLLVM.so with mixed -std= users).


This thread does raise the question about what degree of support we want to provide for -DLLVM_LINK_LLVM_DYLIB=on users. Before previously shared object users were explicitly not supported. Now it may change as many Linux distributions have switched to -DLLVM_LINK_LLVM_DYLIB=on for their llvm/clang packages for size. With shared objects, we have to care about all sorts of compiler options which can cause non-benign ODR issues.

I think this will be an issue for static or dynamic linking, due to inlining. Importantly - if LLVM was built withoun string_view support but a caller was built with it and that caller tries to pass a string_view into some LLVM API then even if the layout and enum elements were consistent, we'd still hit UB/assert/etc because the Twine passed in would have that unexpected enum value which wouldn't be covered by the Twine code inside LLVM.

Having the implementation be the same (by using some form of char*+length representation which would always be supported), having an extra ctor that takes string_view but lowers to that same representation should be safe (yes, a benign ODR violation - the debug info will be a bit weird/missing that extra function/ctor - but codewise, I can't think of any way it wouldn't work).

I will implement dblaikie's idea, which should unblock in the short term, but I think a larger conversation about the level of support that the project wants for this would be good to have.

bogner added a subscriber: bogner.Jul 21 2021, 7:15 PM

Just ran into this issue (we were a bit behind on integrating upstream) - glad to see it's already fixed. Regarding the following:

Reverting this patch might help in the instant case, but the odds of something similar happening, where the data structure is different across different versions of the standard is very hard to guarantee. And also hard to test for.

This particular case isn't a particularly egregious case of issues being hard to detect, given that it's an ABI change based on an ifdef - LLVM has a long history of avoiding doing this kind of thing (either through being compiler agnostic with Compiler.h or using LLVM_ENABLE_ABI_BREAKING_CHECKS)