Page MenuHomePhabricator

[SimplifyLibCalls] Transform memchr(STR, C, N) to chain of ORs
Needs ReviewPublic

Authored by xbolva00 on Jun 16 2022, 2:51 PM.

Details

Summary

Motivation:

#include <string_view>

size_t findFirst_ABCDEF(std::string_view sv) {
  return sv.find_first_of("ABCDEF");
}

memchr("ABCDEF", C, 6) != NULL -> (C == 'A' || C == 'B' || C == 'C' || C == 'D' || C == 'E' || C == 'F') != 0

Diff Detail

Event Timeline

xbolva00 created this revision.Jun 16 2022, 2:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2022, 2:51 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
xbolva00 requested review of this revision.Jun 16 2022, 2:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2022, 2:51 PM
xbolva00 retitled this revision from [SimplifyLibCalls] Transform memchr(STR C, N) to chain of ORs to [SimplifyLibCalls] Transform memchr(STR, C, N) to chain of ORs.Jun 17 2022, 5:15 AM
nikic added a comment.Jun 17 2022, 6:14 AM

This looks quite nice for the examples -- but can't this create a very large chain of comparisons in some cases? Like if you check every second character you'd end up with 128 comparisons? (Or maybe you need every third, I can imagine that we manage to fold every second character to am mask).

xbolva00 updated this revision to Diff 437870.Jun 17 2022, 6:26 AM

Added PhaseOrdering test to ensure instcombine + simplifycfg produces optimal switch code.

xbolva00 added a comment.EditedJun 17 2022, 6:32 AM

This looks quite nice for the examples -- but can't this create a very large chain of comparisons in some cases? Like if you check every second character you'd end up with 128 comparisons? (Or maybe you need every third, I can imagine that we manage to fold every second character to am mask).

Maybe we could introduce some limit (feel free to suggest some), but on the other hand, we often end up with nice switch or very simple code.

int g(char a)
{
   if (__builtin_memchr ("abcdefghijklmnopqrstuvwxyz", a, 27) != 0)
        return foo();
   else return -1;
}
define dso_local noundef i32 @_Z1gc(i8 noundef signext %a) local_unnamed_addr #1 {
entry:
  %0 = icmp ne i8 %a, 0
  %1 = add i8 %a, -123
  %2 = icmp ult i8 %1, -26
  %cmp.not = and i1 %0, %2
  br i1 %cmp.not, label %return, label %if.then

if.then:                                          ; preds = %entry
  %call1 = tail call noundef i32 @_Z3foov()
  br label %return

return:                                           ; preds = %entry, %if.then
  %retval.0 = phi i32 [ %call1, %if.then ], [ -1, %entry ]
  ret i32 %retval.0
}

or

int g(char a)
{
   if (__builtin_memchr ("acegikmoqsuwyz13456789", a, 22) != 0)
        return foo();
   else return -1;
}
define dso_local noundef i32 @_Z1gc(i8 noundef signext %a) local_unnamed_addr #1 {
entry:
  switch i8 %a, label %return [
    i8 122, label %if.then
    i8 121, label %if.then
    i8 119, label %if.then
    i8 117, label %if.then
    i8 115, label %if.then
    i8 113, label %if.then
    i8 111, label %if.then
    i8 109, label %if.then
    i8 107, label %if.then
    i8 105, label %if.then
    i8 103, label %if.then
    i8 101, label %if.then
    i8 99, label %if.then
    i8 97, label %if.then
    i8 57, label %if.then
    i8 56, label %if.then
    i8 55, label %if.then
    i8 54, label %if.then
    i8 53, label %if.then
    i8 52, label %if.then
    i8 51, label %if.then
    i8 49, label %if.then
  ]

...

X86: https://pastebin.com/m5gMEwGZ

An analogous optimization is applicable to other string functions besides memchr (and strchr), including for example strlen or even memcmp and strcmp (with one short constant array/string). With this one accepted as a precedent the other handlers can be enhanced as follow ups. It would be nice if the algorithm could be made general enough to let them all share it. It does seem that the expansion in all of them should be capped at some small number.

For what it's worth, a prototype of the corresponding patch for GCC is attached to PR 103798. It looks like it has a hardwired limit of 8 bytes.

xbolva00 added a comment.EditedJun 17 2022, 8:13 AM

strchr

strchr is transformed to memchr, if possible. See test strchr_to_memchr_n_equals_len.

It looks like it has a hardwired limit of 8 bytes.

Very small, imho, We usually end up with one switch instruction, not a big deal. But I am fine with 32 I think.

An analogous optimization is applicable to other string functions ... strlen or even memcmp and strcmp

I would not overcomplicate things for know, currently just 5 lines of code. Any possible followup could create create a new helper function.

For what it's worth, a prototype of the corresponding patch for GCC is attached to PR 103798.

Yes, I saw it. Oh, that expanded implementation in match.pd = D

xbolva00 updated this revision to Diff 438196.Jun 19 2022, 8:03 AM

Update memchr.ll

It seems important to put a limit on the length of the chain to avoid pathological expansion (that might come up in generated code, but not only there). I expect the optimal maximum will vary across different targets but a suitably low minimum should be target-independent. I don't have a good sense of what the ideal lower bound might be but I would hesitate to go over the proposed GCC limit of 8 unless the ORs could be assumed to collapse to something much simpler (like in memchr-7.ll). Instrumenting the compiler and building a few projects might help dial it in.

Having said that, I don't think it's a guaranteed win to set the limit to zero (or to disable the subsequent folding) when optimizing for size. But whatever the effect of -Os might be here, it should be exercised by the test suite if only to make the decision explicit.

I think for the limit, the important thing to consider is comparisons that collapse into range checks. We don't want to limit the total number of characters being checked, but rather the number of of non-contiguous ranges. The motivating cases reduce down to just one or two range checks, which is certainly profitable. The cases that would require a switch for decent lowering are less clearly profitable, especially when we take into account that such a switch is not being reliably produced (as your phase ordering test shows).

We don't want to limit the total number of characters being checked, but rather the number of of non-contiguous ranges.

I believe there is no easy StringRef API, right? Not sure if I want to work on such new API.

For now, I would go with GCC's limit of 8.

msebor added a comment.EditedJun 22 2022, 12:50 PM

Something like this seems to do the right thing for a sorted list (s is a container like StringRef):

int last = 0;
int n = std::count_if (s.begin(), s.end(),
                       [&last](char c) {
                         bool ret = (unsigned char)c > (unsigned char)last + 1;
                         last = c;
                         return ret; });

(It would be a bit more involved if it had to handle different charsets between the host and the target.)