This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold strtoul and strtoull and avoid PR #56293
ClosedPublic

Authored by msebor on Jul 6 2022, 1:31 PM.

Details

Summary

PR #56293 points out a minor portability issue with the atoi,strtol, and strtoll folders as a result of relying on a nonportable aspect of strtoull for empty sequences. The issue is easy to avoid for just the limited subset of supported folders, but relying on srtroull turns out be problematic also due to overflow handling.

Rather than working around just the root cause of PR #56293, this patch replaces the use of strtoull with a hand-written conversion algorithm that doesn't suffer from either of these issues. Besides resolving the problem in the PR it enhances the libcall simplifier to handle the rest of the standard integer conversion functions.

In the PR @efriedma suggests using llvm::consumeSignedInteger. I ran into a couple of problems when trying that:

  • The function recognizes other base prefixes besides 0 for octal and 0x for hex (0b for binary and 0o for octal), and these interfere with the rules of strtol. The prefixes would need to be disabled for the function to be usable in this context (e.g., by adding another argument).
  • Like the current convertStrToNumber helper, the function doesn't make it possible to reliably detect whether or not the parsed number is representable in the destination type (which may be narrower than uint64_t). This too would require a change to the API to handle.

Rather than making intrusive changes to the existing "public" API I decided to instead modify the static helper function just for the libcall simplifier.

Besides the two problems noted above, llvm::consumeSignedInteger also doesn't strip leading space and succeeds even when the subject sequence isn't at the end of the string. Unlike those above, both of these could be handled by the caller.

Tested by running make check-all on x86_64-linux.

Diff Detail

Event Timeline

msebor created this revision.Jul 6 2022, 1:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2022, 1:31 PM
msebor requested review of this revision.Jul 6 2022, 1:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2022, 1:31 PM
xbolva00 added inline comments.Jul 6 2022, 2:28 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2579

This should go to BuildLibcalls as it is added unconditionally

efriedma added inline comments.Jul 6 2022, 2:35 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
151

uint64_t Max = AsSigned && Negate is clever, but confusing at first glance; maybe just write out the conditional?

157

Maybe this would be easier to read using the SaturatingMultiplyAdd helper?

170

This assumes the string is null-terminated? I mean, we don't have to guarantee any particular behavior if it isn't, but we shouldn't read past the end of the string.

172

Please use llvm::isDigit etc. The C library routines interact with the current locale.

msebor marked 2 inline comments as done.Jul 7 2022, 8:27 AM
msebor added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
157

Sure. (It requires making sure the first three arguments all have the same type, hence the further changes.)

170

Right. I forgot that getConstantStringInfo actually returns an unterminated array. Thanks for pointing it out! Let me fix that, add some tests, and update the comment above the function to make it clear.

172

Sure, I didn't notice the capitalization in existing code.

(My understanding was that the compiler sets the C locale but using the llvm routines makes sense to me, especially if there's any expectation to support EBCDIC in -finput-charset= at some point.)

2579

Where in BuildCalls would you suggest to add it?

For what it's worth, llvm::inferNonMandatoryLibFuncAttrs already handles LibFunc_atoi but it's not invoked for calls to the function already in the source. (I was under the impression is that BuildCalls is used when library call is being synthesized by the middle end, but I have very little experience in this area.)

msebor updated this revision to Diff 442932.Jul 7 2022, 8:32 AM
msebor marked an inline comment as done.

Revision 2 of the patch addressing reviewer suggestions:

  • Use a conditional instead of conversion from bool.
  • Avoid assuming atoi etc. argument is a nul-terminated string.
  • Use llvm::isDigit et al. instead of the C standard functions.
  • Use SaturatingMultiplyAdd to detect overflow.
  • Update getConstantStringInfo to make it clear it need not return a nul-terminated string.
  • Add tests.
yurai007 added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
91

Couldn't it be simplified a bit to something like: Str = Str.drop_while([](char c) { return isSpace(c); }); ?

151

Couldn't it be simplified a bit to something like:

if (Str.startswith_insensitive("0x") {
Base = 16;
Str = Str.drop_front(2);
} else {...}

?

nikic added inline comments.Jul 8 2022, 5:22 AM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2579

inferNonMandatoryLibFuncAttrs() is called for existing function declarations. The attribute is placed on the called function, not on the call.

nikic added inline comments.Jul 8 2022, 5:31 AM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
153

NBytes -> NBits?

163

Can avoid the separate cases via AsSigned ? maxUIntN(NBits) : maxIntN(NBits)?

msebor marked an inline comment as done.Jul 8 2022, 2:11 PM
msebor added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
91

Yes, it could be rewritten like that. I'm not sure everyone will find the more compact form as readable. Not every compiler also inlines the lambda invocation (Clang does but GCC doesn't, likely due to exceeding some inlining limit), so there may be a runtime cost to pay. On balance I'm not sure it matters enough one way or the other so unless there's a strong preference for the lambda I'd just as soon leave it as is.

151

Thanks for the suggestion. The code tests for '0' first, before testing for 'x', and I'd rather keep it that way than repeating the former if the "0x" test fails.

163

Yes, that's simpler, thanks.

msebor updated this revision to Diff 443349.Jul 8 2022, 2:11 PM

Revision 3 with one change from previous:

  • Use maxUIntN(unsigned) instead of a case/switch statement.
efriedma added inline comments.Jul 8 2022, 3:34 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
99

I think the check for the empty string should come after the check for a +/- sign?

efriedma added inline comments.Jul 8 2022, 3:44 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
117

The standard requires that a "0x" prefix be followed by a hexadecimal digit. Similarly, the standard requires that a "0" prefix be followed by an octal digit. If there isn't an appropriate digit after, the subject sequence (the part that's actually parsed as a number) is just a decimal "0". And the rest is treated as "unrecognized characters".

The standard allows a "0x" prefix if the base is explicitly specified to be 16.

msebor updated this revision to Diff 443996.Jul 12 2022, 9:58 AM

Revision 4 with the following changes:

  • correct base autodetection to consume the 0x prefix also in base 16,
  • reject more variants of empty subject sequences including sole "+" and "0x",
  • add tests for the above.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
99

Sure. I don't have access to Darwin but based on the dated Apple strtol.c source I found online it looks like it also sets errno to EINVAL for calls like strtol("+", 0, 0) and also strtol("0x", 0, 0) that Glibc succeeds for so I adjusted the code to fail for those sequences as well.

117

Whoops, yes, we need to consume the hex prefix even in base 16. Thanks for pointing out this mistake!

I'm aware of what the C and POSIX standards require but I'm not sure I understand the first part of your comment, but just for clarity: 1) a lone zero is valid in any base (i.e., it doesn't need to be followed by other digits), and 2) the function fails if the whole subject sequence isn't consumed (i.e., has "unrecognized characters").

efriedma added inline comments.Jul 12 2022, 10:26 AM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
117

I think strictly interpreting the standard, if you have something like "0xg", or "09", they should also parse to "0". But if we're rejecting them anyway, I guess that's fine... might want to add a comment somewhere to clarify why we're rejecting strings with unrecognized trailing characters.

msebor updated this revision to Diff 444033.Jul 12 2022, 11:53 AM

Revision 5 with an expanded comment explaining why sequences that end parsing prior to the terminating nul are not accepted.

llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
117

That's right. The rationale for rejecting valid sequences with trailing characters is already in one of the new tests but I can repeat it in the function.

(In practice it would probably be safe to accept sequences terminated by most unrecognizable characters except a comma or a space but I'm not sure it's worth it since it's possible to create a wacky locale with a thousands separator set to pretty much any nonnumeric character.)

barannikov88 added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
155

This comment is not true for the case when AsSigned and Negate are both true. I.e. 0x80 is not the maximum possible representable value in i8 type. Something like "maximum possible value of the magnitude of a number which could be encoded" might be better.
Excuse my English

159

Honestly, I'd prefer more verbose code to this bit manipulation (it took me two minutes to figure out that the added 1 deals with two's complement representation). E.g.

if (AsSigned)
  Max = Negate ? -(uint64_t)minIntN(NBits) : maxIntN(NBits)
else
  Max = maxUIntN(NBits);

or, if you prefer one-liners:
Max = AsSigned ? Negate ? -(uint64_t)minIntN(NBits) : maxIntN(NBits) : maxUIntN(NBits)

Ping: @efriedma, please let me know if you have any further comments or suggestions, or if the last revisions is good enough to commit. Thanks!

llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
155

I agree that magnitude might be more accurate, although to me the suggested rewording sounds too cumbersome to be an overall improvement. I'll see if I can work the word into the comment without making it harder to understand. Thanks!

159

I don't see either of the suggested forms as an improvement, although (obviously) others might feel differently. Since it's a matter of taste I'm going to leave it as is.

This revision is now accepted and ready to land.Jul 18 2022, 3:15 PM
nikic added inline comments.Jul 19 2022, 2:51 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2577

This call should still get dropped (the other one is fine, as it is conditional).

2592

That sounds like the parameter can still be readonly, just not the function.

This revision was landed with ongoing or failed builds.Jul 26 2022, 1:12 PM
This revision was automatically updated to reflect the committed changes.
msebor added inline comments.Jul 26 2022, 1:14 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2577

I'm not familiar with the guidelines for and against applying attributes to calls (or if there are any), but I added this one because I noticed that nocapture was missing from calls to atol while it was present in calls to strtol. That seemed unexpected.

It sounds like you're suggesting that an attribute should only be applied to an argument of a call when it's not implied either by the explicit declaration of the called function in the IL, or by whatever might be implicitly added to it by the middle end.

I would find that subtle and surprising. It would mean that an attribute could be used for codegen decisions even though it didn't actually appear anywhere in the IL. From what I've seen, the IL is quite verbose and commonly redundant (e.g., the noundef and nonnull annotations added to calls to string function like strlen even when they are implied by accesses to declared objects that are readily apparent in the IL).

So with that I prefer to leave this code in place for this change. That said, in my testing not all calls with the same access semantics are annotated with the same attributes (e.g., a call to atoi is annotated differently than the corresponding call to strlen). So I think there's room for improvement here and I'll try to look into it in a followup.

2592

I'm not sure I understand what you're suggesting here but the first argument is of course read-only regardless of whether or not the second one is null. It seems that it could be annotated Attribute::ReadOnly unconditionally, as could many others in this file. I just copied this bit from LibCallSimplifier::optimizeStrTo without thinking about what other attributes might be appropriate to add and under what conditions. As there are no uses of Attribute::ReadOnly in this file I'm not sure that adding it just in this one case would be appropriate. But as in the other case, I'd be happy to look in a followup into what other attributes these functions might benefit from.

(As an aside, annotating just the argument readonly doesn't have the effect I'd expect: calls to strtol are assumed to modify the object the first argument points to. This is in contrast to calls to atol etc., where the whole function is annotated readonly.)

nikic added inline comments.Jul 26 2022, 1:58 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2577

Attribute queries on calls work by first checking whether the attribute exists on the call, and if it doesn't, checking whether it exists on the called function. If the called function has the attribute, then that implies that all calls have it as well.

The policy for libcall attributes is that all attributes that are unconditional are inferred on the declaration using inferNonMandatoryLibFuncAttrs(). This happens as part of the InferFunctionAttrs pass. Attributes that are conditional and cannot be inferred on the declaration, are placed by SimplifyLibCalls instead.

This is why the responsibility for inferring nocapture on atol falls to InferFunctionAttrs (it holds for all atol calls), while inferring nocapture on strtol falls to SimplifyLibCalls (it holds only for some strtol calls). If you run just InstCombine, it may look like nocapture doesn't get inferred, but it will get inferred in a complete optimization pipeline.

While inferring unconditional attributes in SimplifyLibCalls is fine from a correctness perspective, it is also unnecessary. We do not need (or want) to duplicate the functionality already present in inferNonMandatoryLibFuncAttrs() inside SimplifyLibCalls as well.

2592

I just checked, and the readonly attribute will already get added by InferFunctionAttrs, so there is nothing to do here.

As an aside, annotating just the argument readonly doesn't have the effect I'd expect: calls to strtol are assumed to modify the object the first argument points to.

Do you have an example for this handy? I would expect this to work as long as the object is non-escaping. If it is potentially captured (say a function argument), then a non-argmemonly call might modify it through the capture. (LLVM doesn't model errno specially).

msebor added inline comments.Jul 26 2022, 3:08 PM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
2592

This might be easier to discuss outside of a review so I created PR #56744 with the details of both of the issues I was referring to here.

alexfh added a subscriber: alexfh.Aug 1 2022, 4:14 AM

There's a problem after this commit: the endptr pointer is not set correctly in some cases. https://gcc.godbolt.org/z/dvcPWcz1T

alexfh added a subscriber: tstellar.Aug 1 2022, 4:17 AM

There's a problem after this commit: the endptr pointer is not set correctly in some cases. https://gcc.godbolt.org/z/dvcPWcz1T

Apologies, hit "enter" too early. Here's an example:

#include <stdlib.h>
#include <stdio.h>

int main(int, char**) {
   char *endptr = nullptr;
   const char *src = "-9223372036854775808";
   long long result = strtoll(src, &endptr, 0);
   printf("source: <%s>, result: %lld, endptr: <%s>\n", src, result, endptr);
}

When compiled with -O1 or higher optimization level, it now outputs source: <-9223372036854775808>, result: -9223372036854775808, endptr: <8>. For -O0 the output is correct: source: <-9223372036854775808>, result: -9223372036854775808, endptr: <>.

Do you see an obvious fix?

(and btw, the fix will need to be cherrypicked to the release: @tstellar, fyi)

msebor added a comment.Aug 1 2022, 8:08 AM

Thanks for the test case! The bug is in skipping over the leading part of the string up to the first digit and not accounting for it in the end pointer. It should be straightforward to fix. I'll take care of it today.