This is an archive of the discontinued LLVM Phabricator instance.

Implements multiline regex support
AbandonedPublic

Authored by Mordante on Jun 9 2019, 7:29 AM.

Details

Reviewers
mclow.lists
ldionne
Group Reviewers
Restricted Project
Summary

It implements the multiline support as described at https://wg21.link/LWG2503

Diff Detail

Event Timeline

Mordante created this revision.Jun 9 2019, 7:29 AM
mclow.lists added a comment.EditedJun 26 2019, 8:50 AM

To keep the code easier to read, I would be tempted to remove almost all of the #ifdefs here, and just make __use_multiline and the constants depend on the language level.
[ Note that I haven't tried this; just looking to make the code simpler. ]

mclow.lists added inline comments.Jun 26 2019, 8:53 AM
libcxx/include/regex
2003

I think that this is going to have to go behind an "ABI break" flag, because it changes the size of __l_anchor. I'll look into it some more.

mclow.lists requested changes to this revision.Jun 26 2019, 12:12 PM

Note: Since this was resolved by an LWG issue (rather than a paper), we should provide this capability back to C++11.
(So you can get rid of all the #ifdefs

This revision now requires changes to proceed.Jun 26 2019, 12:12 PM

Thanks for the review.

Once it's known whether or not the __l_anchor changes are an ABI break I'll update the patch.
Is there a way to test what is and what is not an ABI break?

@mclow.lists, @ldionne is the change of the size of __l_anchor an ABI break?

@mclow.lists, @ldionne is the change of the size of __l_anchor an ABI break?

Yes, I think it is. Imagine a program compiled with an old version of libc++ (which assumes that __l_anchor doesn't have a bool) which creates a regex with a __l_anchor node in it, and then passes that regex to a shared library that was compiled with the new headers (which assumes __l_anchor has a bool). The new library will try to read the field which doesn't exist. This can be a problem because the <regex> code is free to be inlined into users programs and libraries too.

If we added a new derived class instead, I think this could solve our problem. Indeed, __l_anchor would stay unchanged, however we could have something like a __multiline_l_anchor?

The other option is that we just take that ABI break. Do we have a way of quantifying how bad it would be? What regexes would be impacted? Is it regexes with ^ or $ used as anchors, passed through ABI boundaries? If we want to go down this route, we could also try to see what the extent of the ABI break is, i.e. if we tend to crash it's better than if we silently run and do something weird.

ldionne requested changes to this revision.Mar 11 2020, 11:28 AM

Thanks for the confirmation it's an ABI break. I like your suggestion to use a derived class, so I'll try that solution first.

Herald added a reviewer: Restricted Project. · View Herald TranscriptMar 12 2020, 2:35 PM

Thanks for the confirmation it's an ABI break. I like your suggestion to use a derived class, so I'll try that solution first.

I'm not 100% sure that solution addresses the ABI break problem, though. I'd have to think about it harder and run it through at least another person to be confident.

Can you comment on that:

Do we have a way of quantifying how bad it would be? What regexes would be impacted? Is it regexes with ^ or $ used as anchors, passed through ABI boundaries?

If that's an ABI break we can afford to make, it would be great cause it would avoid increasing the code complexity.

Thanks for the confirmation it's an ABI break. I like your suggestion to use a derived class, so I'll try that solution first.

I'm not 100% sure that solution addresses the ABI break problem, though. I'd have to think about it harder and run it through at least another person to be confident.

I'll update the diff with this approach in a sec.

Can you comment on that:

Do we have a way of quantifying how bad it would be? What regexes would be impacted? Is it regexes with ^ or $ used as anchors, passed through ABI boundaries?

Yes it requires a regex with a ^ or $ anchor. It should only break if created with the old version (no bool in the struct) and executed by the new version (with a bool in the struct)

If that's an ABI break we can afford to make, it would be great cause it would avoid increasing the code complexity.

Googling/looking in GitHub is hard since it ignores special characters. I searched on https://codesearch.debian.net
and found 318 hits looking for std::regex("^ 205 were in the llvm projects so about 100 in other projects.
Looking at std::regex\(".*\$ found 309 matches with 177 in the llvm projects. This regex hit some false positives.

So at least in Debian it is not used that often. So we have 3 options:

  1. Use the version with inheritance if ABI safe. Pro no #ifdefs, Con adds some 'dead code' for ABI compatibility
  2. Replace #if _LIBCPP_STD_VER > 14 of the current patch with an ABI flag. Pro ABI compatible, Con adds a lot if #ifdefs
  3. Break the ABI, Pro cleanest code, Con an ABI breakage

If 1 is ABI safe I would prefer that option, else I would slightly prefer to avoid an ABI breakage.

What's your opinion?

Mordante updated this revision to Diff 250856.Mar 17 2020, 12:23 PM

Implements the multiline using a derived class.

ldionne requested changes to this revision.Mar 20 2020, 8:36 AM
ldionne added inline comments.
libcxx/include/regex
2104

To avoid code duplication, we could create a helper function that takes a bool __multiline parameter. Then, __r_anchor_multiline<_CharT>::__exec would call that helper function with __multiline as an argument, and __r_anchor<_CharT>::__exec would call the helper function with false.

4771

Do we ever create a __l_anchor anymore? If not, then we could in theory delete that code. Actually, now that I think of it, I believe we could simply add the functionality to __l_anchor itself, but rename the class to something else. The idea is:

  • No new code needs to use __l_anchor, it can all start using the new class
  • Existing code using __l_anchor already has a vtable for __l_anchor, and that contains the old implementation of execute they need to keep using to avoid and ABI break.

I believe the ABI break only happens if we change __l_anchor::execute without changing the name of the class, since that's when two libraries compiled at different times (and hence with different versions of __l_anchor) can disagree on the layout of __l_anchor. If we change the name of the class, I think it should all work out.

WDYT?

This revision now requires changes to proceed.Mar 20 2020, 8:36 AM
Mordante marked 4 inline comments as done.Mar 21 2020, 9:20 AM
Mordante added inline comments.
libcxx/include/regex
2104

That's indeed possible. I can create an updated patch with that approach.
(Before updating I want to look what approach we take regarding your comments on __l_anchor.)

4771

No after this change we'll never create a __l_anchor. I expected removing the __l_anchor class from the header would be an ABI break. But I think you're right since this is a templated class. So that means I:

  • Let the __l_anchor_multiline directly inherit from __owns_one_state<_CharT>
  • Remove the __l_anchor class
  • Do the same for the 'right' part and thus don't need the helper function as suggested before

Agreed?

Mordante marked 2 inline comments as done.Nov 1 2020, 2:18 AM
Mordante added inline comments.
libcxx/include/regex
4771

@ldionne Do you agree with the proposed approach?

ldionne added inline comments.Nov 10 2020, 6:08 AM
libcxx/include/regex
4771

Sorry this got dropped. Yes, we agree on the approach.

Regarding

No after this change we'll never create a l_anchor. I expected removing the l_anchor class from the header would be an ABI break. But I think you're right since this is a templated class. [...]

My reasoning is that since it's a class template that isn't explicitly instantiated in and exported from the dylib, any code that references __l_anchor already has a weak definition for the vtable. It needs to, because that vtable isn't defined anywhere else the linker can tell upfront.

So, just removing that class from the headers isn't going to break anything, because

  1. already-compiled code already has the definitions it needs, and
  2. we're not changing an existing definition to do something incompatible (since we're introducing a new one).

ABIs are tricky, so I hope I'm not missing something. But if my reasoning holds, your approach LGTM. Let's go with that.

Mordante updated this revision to Diff 305843.Nov 17 2020, 10:19 AM
  • Removed __[rl]_anchor
  • Changed __is_eol to use _LIBCPP_INLINE_VISIBILITY
Mordante marked an inline comment as done.Nov 17 2020, 10:20 AM
Mordante added inline comments.
libcxx/include/regex
4771

Thanks for the explanation.

ldionne accepted this revision.Nov 17 2020, 1:37 PM

LGTM, thanks a lot for this! Consider my comment about the comments you left behind, but feel free to commit without implementing my suggestion.

I don't understand why CI isn't running -- it seems to be doing that with a few patches. If you commit this, can you please keep an eye on https://buildkite.com/llvm-project/libcxx-ci/builds?branch=master for the next build?

libcxx/include/regex
2001

I don't think this comment is relevant, since __l_anchor doesn't exist anymore. I feel like questions of this nature can be answered best via git blame and following the link back to this review. If you feel strongly that it adds value, please feel free to leave it in.

2040

Ditto.

Mordante marked an inline comment as done.Nov 18 2020, 9:14 AM

Thanks for the review!

LGTM, thanks a lot for this! Consider my comment about the comments you left behind, but feel free to commit without implementing my suggestion.

I'll remove them.

I don't understand why CI isn't running -- it seems to be doing that with a few patches. If you commit this, can you please keep an eye on https://buildkite.com/llvm-project/libcxx-ci/builds?branch=master for the next build?

I'll keep an eye on the build bot.
I update the patches in the review using Phabricator's web-interface. Can the fact that I don't use arc be the reason the CI builds aren't triggered?

I'll keep an eye on the build bot.
I update the patches in the review using Phabricator's web-interface. Can the fact that I don't use arc be the reason the CI builds aren't triggered?

I think that's most likely it. It seems like CI isn't triggered when the repository isn't set on the review. I think Mikhail is looking into it. Don't worry about it.

Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 11:02 AM
Mordante abandoned this revision.Aug 31 2023, 11:02 AM