This is an archive of the discontinued LLVM Phabricator instance.

[Support] unsafe pointer arithmetic in llvm_regcomp()
ClosedPublic

Authored by brad on Feb 20 2021, 3:46 PM.

Details

Summary

llvm/lib/Support/regcomp.c is borrowed from OpenBSD, to which the following issue has been reported and fixed. (report and patch in https://marc.info/?l=openbsd-tech&m=160923823113340&w=2 )

regcomp.c uses the "start + count < end" idiom to check that there are "count" bytes available in an array of char "start" and "end" both point to.

This is fine, unless "start + count" goes beyond the last element of the array. In this case, pedantic interpretation of the C standard makes the comparison of such a pointer against "end" undefined, and optimizers from hell will happily remove as much code as possible because of this.

An example of this occurs in regcomp.c's bothcases(), which defines bracket[3], sets "next" to "bracket" and "end" to "bracket + 2". Then it invokes p_bracket(), which starts with "if (p->next + 5 < p->end)"...

Because bothcases() and p_bracket() are static functions in regcomp.c, there is a real risk of miscompilation if aggressive inlining happens. The following diff rewrites the "start + count < end" constructs into "end - start > count". Assuming "end" and "start" are always pointing in the array (such as "bracket[3]" above), "end - start" is well-defined and can be compared without trouble.

As a bonus, MORE2() implies MORE() therefore SEETWO() can be simplified a bit.

From bug report: https://bugs.llvm.org/show_bug.cgi?id=48649

Diff Detail

Event Timeline

brad created this revision.Feb 20 2021, 3:46 PM
brad requested review of this revision.Feb 20 2021, 3:46 PM
brad added a comment.Oct 1 2021, 6:44 PM

ping ping.

Can you please clang format according to premerge checks?

llvm/lib/Support/regcomp.c
252

I understand all but this line. I guess it had no UB there.

MaskRay accepted this revision.EditedJan 25 2022, 5:44 PM

Thanks for porting the fix.

The clang-format diagnostic is because of TAB. The OpenBSD version (https://github.com/openbsd/src/blob/master/lib/libc/regex/regcomp.c) uses TAB and our copy probably should just match.

I verified that this does port the OpenBSD change, but please make sure @vitalybuka is happy as well.

llvm/lib/Support/regcomp.c
252

I think this clause applies
https://www.iso-9899.info/n1570.html#6.5.6p8 "... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."

There was a UB.

This revision is now accepted and ready to land.Jan 25 2022, 5:44 PM

Context not available.

In the future please include full context in the patch. See https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

brad updated this revision to Diff 403380.Jan 26 2022, 12:53 PM

Updated with full context.

MaskRay accepted this revision.Jan 26 2022, 1:23 PM
vitalybuka added inline comments.Jan 26 2022, 2:03 PM
llvm/lib/Support/regcomp.c
252

I am aware of this rule, but I assume that end and next always "point to elements of the same array object, or one past the last element"
So old MORE2 violates the rule, but MORE does not.

vitalybuka accepted this revision.Jan 26 2022, 2:07 PM
vitalybuka added inline comments.
llvm/lib/Support/regcomp.c
252

I am fine if it 's done just for consistency. But I would appreciate explanation if there is UB in MORE() as well.

I already LGTMed. What did you want me to do ? :)

brad added a comment.Feb 1 2022, 11:57 PM

I already LGTMed. What did you want me to do ? :)

@vitalybuka asked a question. Just wondering if there should be further discussion or should I go ahead?

This revision was landed with ongoing or failed builds.Feb 3 2022, 4:59 PM
This revision was automatically updated to reflect the committed changes.