Page MenuHomePhabricator

[InstCombine] snprintf (d, size, "%s", s) -> memccpy (d, s, '\0', size - 1), d[size - 1] = 0
Needs ReviewPublic

Authored by xbolva00 on Sep 24 2019, 2:25 PM.

Details

Summary

snprintf (d, size, "%s", s)
->
memccpy (d, s, '\0', size - 1),
d[size - 1] = 0

memccpy is much faster than snprintf my microbenchmark

time ./snprintf.out 1000000

real 0m0,057s
user 0m0,057s
sys 0m0,000s

time ./memccpy.out 1000000

real 0m0,021s
user 0m0,021s
sys 0m0,000s

Diff Detail

Event Timeline

xbolva00 created this revision.Sep 24 2019, 2:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 24 2019, 2:25 PM

Do we have accurate TLI information for memccpy? Not sure exactly how widely available it is.

If we're going to start generating memccpy calls, do we need to implement optimizations for it? For example, memccpy of a constant string can be transformed to memcpy.

xbolva00 added a comment.EditedSep 24 2019, 4:48 PM

There is a comment in TLI:

     // Win32 does not support these functions, but
     // they are generally available on POSIX-compliant systems.
...
         TLI.setUnavailable(LibFunc_memccpy);

If we're going to start generating memccpy calls, do we need to implement optimizations for it? For example, memccpy of a constant string can be transformed to memcpy.

Yeah, as follow up patch.
snprintf (d, size, "%s", "constant") is handled currently, so we should not regress the current "snprintf" calls with this transform..

If we're going to start generating memccpy calls, do we need to implement optimizations for it? For example, memccpy of a constant string can be transformed to memcpy.

Ok, done: https://reviews.llvm.org/D68089

Wrote a simple microbenchmark - memccpy is 4-5x faster than sprintf in this case.

Oh, I think the proposed transformation in that paper is incorrect.

It should rather be:
memccpy(d,s,0, n-1)
d[n-1] = 0

Since "A terminating null character is automatically appended after the content written." (snprintf)

Yes, that's right. Thanks for spotting that.

xbolva00 updated this revision to Diff 222493.Sep 30 2019, 1:53 PM
xbolva00 retitled this revision from [InstCombine] snprintf (d, size, "%s", s) -> memccpy (d, s, '\0', size). to [InstCombine] snprintf (d, size, "%s", s) -> memccpy (d, s, '\0', size - 1), d[size - 1] = 0.
xbolva00 edited the summary of this revision. (Show Details)

Updated. Current tranformations should be correct now.

xbolva00 updated this revision to Diff 222616.Oct 1 2019, 7:46 AM

Rebased + added one new test

xbolva00 edited the summary of this revision. (Show Details)Oct 2 2019, 8:45 AM
xbolva00 added a reviewer: nickdesaulniers.
xbolva00 edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.
lib/Transforms/Utils/SimplifyLibCalls.cpp
2581

Where did we ask TLI about the existence of memccpy?

2583

It likely should be CreateInBoundsGEP().

xbolva00 marked 2 inline comments as done.Oct 2 2019, 9:56 AM
xbolva00 added inline comments.
lib/Transforms/Utils/SimplifyLibCalls.cpp
2581

emitMemCCpy calls emitLibcall which checks it - otherwise returns nullptr.

Ah, and we should not emit store in that case, if emitmemccpy failed. Thanks!

2583

I was wondering about this too, then code above is broken too :) I will fix it.

xbolva00 updated this revision to Diff 222857.Oct 2 2019, 10:07 AM

Addressed review comments

xbolva00 marked an inline comment as done.Oct 2 2019, 10:09 AM
xbolva00 added inline comments.
test/Transforms/InstCombine/snprintf-memccpy.ll
47

hmm...

Should I call DecSize->eraseFromParent() ?

It sounds like memccpy is part of C20. Can we not do this transform unless LangOpt says we're C20 or greater? Otherwise the Linux kernel doesn't implement this routine.

It sounds like memccpy is part of C20. Can we not do this transform unless LangOpt says we're C20 or greater? Otherwise the Linux kernel doesn't implement this routine.

LLVM middle-end does not and should not care what (if any) language standard the frontend was targeting when producing the initial IR.
Only the presence (as per TargetLoweringInfo::has()) of the replacement builtin matters.

xbolva00 added a comment.EditedOct 2 2019, 10:40 AM

The kernel should use same workaround as for bcmp transformation in this case.

Otherwise the Linux kernel doesn't implement this routine.

Original paper is from a GCC dev, so I think GCC will implement memccpy like optimizations for GCC 10 too.

Can the kernel just implement it this easy way? memccpy is not very known but I believe there are places in the kernel where memccpy could be used efficiently by kernel devs.

https://code.woboq.org/userspace/glibc/string/memccpy.c.html

It sounds like memccpy is part of C20. Can we not do this transform unless LangOpt says we're C20 or greater? Otherwise the Linux kernel doesn't implement this routine.

Doesn't the kernel use -fno-builtin? That should be enough to suppress this optimization.

In general, we assume that TargetLibraryInfo is providing an accurate picture of what functions are available. If it isn't, we should fix that in some general way.

xbolva00 added a comment.EditedOct 2 2019, 1:31 PM

but -fno-builtin is maybe too aggressive? somebody should benchmark kernel with/without it.

Anyway, I thought -fno-builtin-memccpy also could disable it, but no.. -fno-builtin-snprintf / -ffreestanding works.

Oh, hmm... actually, just realized something; is the new formulation actually equivalent? Specifically, is it okay to write to "d[size - 1]" if the string is short? Say, for example, you have something like snprintf(buf, 10, "%s", "x"). That normally writes to two elements of the array: 0 and 1. The rewritten version writes to three elements: 0, 1, and 9.

-fno-builtin-memccpy doesn't work because memccpy isn't listed in include/clang/Basic/Builtins.def. We should probably fix that.

Oh, hmm... actually, just realized something; is the new formulation actually equivalent? Specifically, is it okay to write to "d[size - 1]" if the string is short? Say, for example, you have something like snprintf(buf, 10, "%s", "x"). That normally writes to two elements of the array: 0 and 1. The rewritten version writes to three elements: 0, 1, and 9.

Sounds like this should be guarded to only work for simple (non-volatile, non-atomic) pointers?

xbolva00 added a comment.EditedOct 2 2019, 2:11 PM

-fno-builtin-memccpy doesn't work because memccpy isn't listed in include/clang/Basic/Builtins.def. We should probably fix that.

Just add new record:
LIBBUILTIN(memccpy, "v*v*vC*iz", "f", "string.h", ALL_LANGUAGES)

?

Than great, then linux kernel would just use -fno-builtin-memccpy and we are fine here.

Sounds like this should be guarded to only work for simple (non-volatile, non-atomic) pointers?

Good catch, yeah, I will do it. 'Dst' must be simple ptr.

xbolva00 added a comment.EditedOct 2 2019, 2:17 PM

Say, for example, you have something like snprintf(buf, 10, "%s", "x"). That normally writes to two elements of the array: 0 and 1. The rewritten version writes to three elements: 0, 1, and 9.

Hm, this is right. We overwrite d[size -1] always -> bad. I have no more ideas, probably we can do nothing, just abandon this transformation.

You could do something like if (!memccpy (d, s, '\0', size - 1)) d[size - 1] = 0;, I guess. That's a few more instructions that I'd really like, but I don't see a better alternative.

xbolva00 abandoned this revision.Oct 2 2019, 3:58 PM

You could do something like if (!memccpy (d, s, '\0', size - 1)) d[size - 1] = 0;, I guess. That's a few more instructions that I'd really like, but I don't see a better alternative.

Yeah, but probably InstCombine is not best place to create such ranches/phis, select would be better.

Maybe this could work?
char *ret = memccpy (d, s, '\0', n - 1);
char * ptr = ret ? ret - 1 : &d[n - 1];
*ptr = 0;

xbolva00 reclaimed this revision.Oct 2 2019, 3:59 PM

Missclicked, sorry.

And yes, with D68377 -fno-builtin-memccpy works -> Ideal fix for the linux kernel.

joerg added a subscriber: joerg.Oct 3 2019, 4:23 PM

I'd make the argument that it should be strlcpy if present. That covers a lot of existing systems too. I'm not sure this optimisation should be done for freestanding mode though.

I'm not sure this optimisation should be done for freestanding mode though.

I said it above, -ffreestanding disables this transformation too.

I'd make the argument that it should be strlcpy if present. That covers a lot of existing systems too.

strlcpy is good idea and it is faster than memccpy. but.. I need to link with -lbsd on Ubuntu.

I dont use BSD systems, so I am not gonna to do your suggested transformation since I cant test it. But patches are welcome, I will be happy to look at it if you send one here.

Oh, I think the proposed transformation in that paper is incorrect.

It should rather be:
memccpy(d,s,0, n-1)
d[n-1] = 0

Since "A terminating null character is automatically appended after the content written." (snprintf)

@efriedma, what do you think about this sequence?

MaskRay added a subscriber: MaskRay.Oct 9 2019, 7:33 PM

This transformation seems to increase code size significantly. Is the snprintf "%s" pattern common enough? I suspect most projects have already used memccpy, stpncpy, strscpy, or strlcpy. For the few that don't, the performance probably does not matter.

xbolva00 abandoned this revision.Oct 9 2019, 11:07 PM

This transformation seems to increase code size significantly. Is the snprintf "%s" pattern common enough? I suspect most projects have already used memccpy, stpncpy, strscpy, or strlcpy. For the few that don't, the performance probably does not matter.

Yes, quite common. But okay, if you dont want it, let's just abandon it.

This transformation seems to increase code size significantly. Is the snprintf "%s" pattern common enough? I suspect most projects have already used memccpy, stpncpy, strscpy, or strlcpy. For the few that don't, the performance probably does not matter.

Yes, quite common. But okay, if you dont want it, let's just abandon it.

I wouldn't have quit on this so easily, but OK.

xbolva00 reclaimed this revision.EditedOct 10 2019, 10:38 AM

BTW, nowdays we transform snprintf(d, s, "%s" , ...) into two calls - memcpy + strlen - and nobody is concerned about code size increase anyway.

So I dont think code size is problem here (if so, !optForSize). Various InstCombine transformations produce two calls. (here it is just call + select..).

I will continue with this patch.

This transformation seems to increase code size significantly. Is the snprintf "%s" pattern common enough? I suspect most projects have already used memccpy, stpncpy, strscpy, or strlcpy. For the few that don't, the performance probably does not matter.

Sounds like then maybe this optimization should conditionally occur based on optimization level/goals? For instance, maybe it's not appropriate at -Os, but is at -O2?

xbolva00 added a comment.EditedOct 10 2019, 1:39 PM

I am fine with your suggestion to restrict it like you said.

(Generally, I think more transforms in instcombine should be restricted this way)

What is the status for this one?

I will update this soon.