This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Switch RegularExpression from llvm::Regex to std::regex
Needs RevisionPublic

Authored by kastiglione on Aug 20 2022, 11:17 AM.

Details

Summary

Use std::regex as the backing implementation of RegularExpression.

This provides common and expected regex features, from basic features like
\w and \d character classes, to more advanced features such as lookahead.

This change used to be gated on the minimum GCC version, because some versions
of GCC had incomplete support for std::regex, but the GCC required by llvm
for the past couple years should be sufficient, according to docs. At this time
GCC >=7.1 is required.

Diff Detail

Event Timeline

kastiglione created this revision.Aug 20 2022, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2022, 11:17 AM
Herald added a subscriber: mgorny. · View Herald Transcript
kastiglione requested review of this revision.Aug 20 2022, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2022, 11:17 AM
kastiglione added inline comments.Aug 20 2022, 11:24 AM
lldb/source/Utility/CMakeLists.txt
29

the std::regex constructor throws std::regex_error if the pattern is invalid. For this reason, exceptions are enabled for this one file.

lldb/source/Utility/RegularExpression.cpp
51

llvm::Regex considers the empty string to be an invalid pattern, while std::regex does not.

58

llvm::Regex considers the empty string to be an invalid pattern, while std::regex does not.

lldb/test/API/commands/breakpoint/set/func-regex/TestBreakpointRegexError.py
11–22

the error messages have changed.

lldb/test/API/functionalities/breakpoint/source_regexp/TestSourceRegexBreakpoints.py
37

exercise the features that aren't supported by llvm::Regex

69

exercise the features that aren't supported by llvm::Regex

lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/multiset/TestDataFormatterGenericMultiSet.py
132

previously, there was a ., which llvm::Regex matched to a newline, while std::regex does not, resulting in no match.

JDevlieghere added inline comments.Aug 20 2022, 1:08 PM
lldb/include/lldb/Utility/RegularExpression.h
93

There's an ongoing discussion on the forums [1] to replace llvm::Optional with std::optional. Until that's resolved we should stick with llvm's variant for consistency.

[1] https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716/10

lldb/source/Utility/CMakeLists.txt
29

What happens when exceptions are disabled? What does it mean to have this enabled for a single file? I don't know if it's part of the LLVM developer guide, but LLVM is supposed to build without RTTI and without exceptions.

kastiglione added inline comments.Aug 20 2022, 1:32 PM
lldb/include/lldb/Utility/RegularExpression.h
93

thanks for the details, I was actually planning to do this change, so that it would be easy to cherry pick into apple/llvm-project.

lldb/source/Utility/CMakeLists.txt
29

What happens when exceptions are disabled?

This cmake config enables exceptions for this one file, independent of LLVM_ENABLE_EH. No other source files will be allowed to catch or throw exceptions.

What does it mean to have this enabled for a single file?

It means this file can compile with a try/catch, and that inside this file, exceptions can be caught.

I don't know if it's part of the LLVM developer guide, but LLVM is supposed to build without RTTI and without exceptions.

llvm has LLVM_ENABLE_EH which allows llvm to be built with exceptions support enabled. Similarly, LLVM_ENABLE_RTTI allows RTTI to be enabled. It seems that both are disabled as a default, but not as a hard requirement.

I wondered if enabling RTTI would needed for exceptions, but at least for this code, the answer is no. The catch (const std::regex_error &e) block is exercised by TestBreakpointRegexError.py, so we know this code can and does catch an exception of that type, and accesses the error's member functions.

Switch to llvm::Optional

kastiglione marked an inline comment as done.Aug 20 2022, 1:37 PM
kastiglione added inline comments.Aug 20 2022, 2:02 PM
lldb/source/Utility/CMakeLists.txt
29

What makes me believe this use of exceptions is ok is that the API boundary continues to be exception free, callers continue to be exception disabled. Only the internal implementation, knows about exceptions.

labath requested changes to this revision.Aug 22 2022, 12:17 AM
labath added a subscriber: labath.
labath added inline comments.
lldb/source/Utility/CMakeLists.txt
29

That is definitely not ok. ODR madness awaits therein. The standard library is full of functions that effectively do

#ifdef EXCEPTIONS
  something();
#else
  something_else();
#endif

Compiling just one file with exceptions enabled can cause two different versions of those functions to appear. It is sufficient that this file instantiates one function whose behavior is dependent on exceptions being available. When linking, the linker has to choose one of the two versions, and there's no telling which one will it use. This means that exceptions can creep in into the supposedly exception-free code, or vice-versa.

The only way this would be safe is if this code was in a shared library, and we took care to restrict the visibility of all symbols except the exported api of that library. And I don't think that's worth it.

This revision now requires changes to proceed.Aug 22 2022, 12:17 AM
kastiglione added inline comments.Aug 22 2022, 1:19 PM
lldb/source/Utility/CMakeLists.txt
29

@labath thanks for pointing that out. I had incomplete understanding and thought it could still work, but I see now that it could be an issue.

The only way this would be safe is if this code was in a shared library, and we took care to restrict the visibility of all symbols except the exported api of that library. And I don't think that's worth it.

what are your concerns with this approach?

btw I spoke to Louis Dionne of libc++ and he said that it could be possible to include whether -fno-exceptions is used in the libc++ ABI tag, which would allow mixing files built with and without exceptions.

mib added a comment.Aug 22 2022, 1:46 PM

This would be pretty neat, if we're able to make it work without creating side-effects to other component.

labath added inline comments.Aug 23 2022, 5:05 AM
lldb/source/Utility/CMakeLists.txt
29

The only way this would be safe is if this code was in a shared library, and we took care to restrict the visibility of all symbols except the exported api of that library. And I don't think that's worth it.

what are your concerns with this approach?

Well.. I don't even know where to start.

For one, since a shared library is a shippable artifact, it requires adjustment to all distribution methods up- and down-stream. I know this would definitely be an issue for us, since google's internal build system has a strong distaste for shared libraries. This is more on the extreme end, but I expect that most of the other build/deployment systems would need to be adjusted as well.

Now you should shrug that off as a "downstream problem", but there's also something to be said about consistency. Even as things stand now, LLDB is pretty much the only llvm component, which _requires_ to be built with shared libraries. For this case, we at least have a fairly good reason for it -- python extension modules must be shared libraries (although one could still imagine building a extension-free lldb in static mode). Everything else in llvm works both in static and in dynamic modes (for better or worse, there are two dynamic modes).

For this feature, I think the reasoning is pretty weak, because we're essentially doing it to work around _another_ llvm policy. And unlike the shared library thing, this one is explicitly spelled out, in no uncertain terms. So we would immediately become outliers, not in one, but two ways. Actually, three, if you count the fact that we would be using a different regex implementation than the rest of llvm.

Why don't you try proposing an llvm-wide llvm::Regex->std::regex replacement? I expect the answer to be a resounding NO, but who knows, maybe I'll be surprised, and someone will even come up with a way to do this safely...

btw I spoke to Louis Dionne of libc++ and he said that it could be possible to include whether -fno-exceptions is used in the libc++ ABI tag, which would allow mixing files built with and without exceptions.

I think that would be a very cool feature of libc++, but given that libc++ is not the only host c++ library we use with lldb/llvm, I don't think it will help in this case.

However, I think it might be tractable to excise the regex component out of libc++ and pull it into llvm as a separate, independent library. There is already a precedent for that -- the demangler library. The reason we're doing that is because we don't want to maintain two demangler libraries (one for the runtime and one for the tools), but that is precisely the situation that we're in with llvm::Regex and libc++ std::regex (admittedly the demangler is a more central component of llvm than the regex engine). And since this internal library would not need to conform to any specific API, we could make it usable even without exceptions. I'm not saying this would be a walk in the park, but I think you could gather enough support to pull this off.