This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Transform memset + malloc --> calloc (PR25892)
ClosedPublic

Authored by yurai007 on May 24 2021, 3:04 AM.

Diff Detail

Event Timeline

yurai007 created this revision.May 24 2021, 3:04 AM
yurai007 requested review of this revision.May 24 2021, 3:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2021, 3:04 AM
yurai007 edited the summary of this revision. (Show Details)May 24 2021, 3:19 AM
yurai007 updated this revision to Diff 347368.May 24 2021, 6:25 AM
yurai007 retitled this revision from [DSE] Eliminate store after calloc (PR50143) to [DSE] Eliminate memset after malloc.
yurai007 edited the summary of this revision. (Show Details)

Rebased with proper clang-formatting and updated title to be more accurate.

This looks good to me, but please wait for the reviewers' comments from the previous patch version.

I'd be curious how much the decision to not ignore allocas will affect compile times in (generated) IRs with many locals.

And remove transformation from instcombine?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1844

We should not copy attributes from malloc.

And remove transformation from instcombine?

Yes. IIRC it's already dead because malloc has always more than one use. Could be done in separate change.

This looks good to me, but please wait for the reviewers' comments from the previous patch version.

I'd be curious how much the decision to not ignore allocas will affect compile times in (generated) IRs with many locals.

If this is about mallocs and callocs then impact on compile time is rather insignificant. I wouldn't expect anything more than noise but of course we can try to measure it to make sure.
Probably most time consuming part is in case of malloc - one AA check (AA.getModRefInfo) coming from memoryIsNotModifiedBetween and for calloc case that would be one getClobberingMemoryAccess call from MemorySSA (a little bit heavier then getModRefInfo).
So it's more or less about one extra AA check for every pair of m(c)alloc-memset in block.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1835

That's root cause of some crashes. Cannot assume that isMallocLikeFn imply malloc. Will fix.

1844

Thanks for pointing this out! It saves me time of investigating current backend error from CI.
My understanding is that emitCalloc merge input attribute list with default calloc attributes.
If so then passing empty list should be fine.

yurai007 updated this revision to Diff 348451.May 28 2021, 12:05 AM

Addressed comments.

Since instcombine is currently only user of emitCalloc, so after you commit this patch, please delete AttrsList parameter from emitCalloc (and adjust this new callsite in DSE) together with removal of this transformation from instcombine.

Otherwise, I think this looks good.

Sure. When I fix tests and commit is ready to land I will prepare appropriate emitCalloc/instcombine change.

yurai007 added a comment.EditedMay 28 2021, 8:42 AM

Both AddressSanitizer tests and libFuzzer tests fail from same reason so there are only 2 different error causes.
It has something to do rather with way asan_allocator works (used in all failing UT) than broken middle-end transformations.
I will check it further and will back with more details.

xbolva00 added inline comments.May 28 2021, 8:51 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1835

Regarding OutOfMemoryTest: https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/fuzzer/OutOfMemoryTest.cpp failure explanation.
Test has silent assumption that in every loop iteration we request 268MB virtual memory and get 268MB physical memory allocated.
This is the case for malloc+memset combo (page faults trigger actual physical allocation in kernel) but not for calloc (physical memory allocation is deferred - it's faster, which is the whole point of our calloc transformation).
In consequence after my patch there is no OOM in OutOfMemoryTest so UT fails.
It can be easily fixed by saying "we _really_ want allocate and touch physical memory, not only virtual one" which end up in new volatile char[] + std::fill duo.

yurai007 added inline comments.May 31 2021, 4:51 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1835

Ok, maybe it would be more accurate than dyn_cast. I need to double check.

Just ignore my previous comment about test. It doesn't explain how we moved from non-inlined new[] to calloc. Indeed better explanation was suggested by @xbolva00.

yurai007 updated this revision to Diff 348788.May 31 2021, 6:26 AM

Ok, this looks fine now I think.

nikic added a comment.May 31 2021, 6:59 AM

I think this should check for just LibFunc_malloc and nothing else. MallocLike just means an allocation function that return null on failure, and the nothrow new variants are considered MallocLike and not OpNewLike.

I think this should check for just LibFunc_malloc and nothing else. MallocLike just means an allocation function that return null on failure, and the nothrow new variants are considered MallocLike and not OpNewLike.

Good point! Yeah.

Please update the patch subject to be more reflective of "replacement" as opposed to "removal".

For example, "Use calloc for memset+malloc".

yurai007 updated this revision to Diff 348800.May 31 2021, 7:54 AM
yurai007 retitled this revision from [DSE] Eliminate memset after malloc to [DSE] Use calloc for memset+malloc.

Thanks. Updated patch with (hopefully) proper LibFunc_malloc detection. Extra checks are stolen from LibCallSimplifier.

yurai007 marked 2 inline comments as done.May 31 2021, 11:29 AM

@xbolva00 @nikic @hubert.reinterpretcast: I think now after addressing all comments and passing on CI it's in good shape.

nikic added inline comments.Jun 9 2021, 1:34 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
514

I'm a bit surprised that MemoryLocation::get() doesn't work for MemSetInst, as it only has a single memory location, so there's no ambiguity here. @asbirlea @fhahn Do you think it would make sense to adjust the API?

1821–1859

Which part here requires the const cast?

1845

Is this cast needed? Shouldn't it be an implicit upcast?

1850

As you already have an IRBuilder, why not IRB.getIntPtrTy(DL)?

1856

Looks like this doesn't preserve MemorySSA? Try -passes='dse,verify<memoryssa>' in the test.

llvm/test/Transforms/DeadStoreElimination/noop-stores.ll
340

As it was the motivating case, I'd also expect a test where the memset is in a different block.

I also don't think that the dominates() condition in your implementation is exercised by tests. Some other conditions aren'T either, for example malloc and memset having different sizes.

yurai007 updated this revision to Diff 351131.Jun 10 2021, 4:41 AM
yurai007 marked 2 inline comments as done.Jun 10 2021, 4:47 AM
yurai007 added inline comments.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1821–1859

Malloc, more precisely starting from this line: IRBuilder<> IRB(Malloc);
We can const_cast later, at time of Malloc definition but we cannot remove it completely - it's still required for Builder and replaceAllUsesWith/eraseFromParent.
Anyway, I moved it to Malloc definition as it's more appropriate place.

1856

Will check.

llvm/test/Transforms/DeadStoreElimination/noop-stores.ll
340

I also don't think that the dominates() condition in your implementation is exercised by tests. Some other conditions aren'T either, for example malloc and memset having different sizes.

Sure, I can add much more tests to cover more conditions.

As it was the motivating case, I'd also expect a test where the memset is in a different block.

Currently with this patch DSE cannot perform such transformation. Consider original pr25892 from https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/InstCombine/memset-1.ll.
IIRC the reason is that malloc block - entry doesn't have related local store to malloced memory so DSE cannot find malloc-memset pair as candidate for elimination. I can provide more details if you want but when I checked it last time I simply concluded that more effort in DSE would be needed to make it work across blocks.

yurai007 updated this revision to Diff 351393.Jun 11 2021, 3:42 AM

More tests.

yurai007 added inline comments.Jun 11 2021, 3:48 AM
llvm/test/Transforms/DeadStoreElimination/noop-stores.ll
340

Currently with this patch DSE cannot perform such transformation.(...)

After adding original PR unit test I noticed actually now DSE can perform transformation across blocks. When I checked it before it couldn't, apparently meantime changes unlocked it. Well, that's good :)

yurai007 added inline comments.Jun 11 2021, 5:32 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1856

Looks like this doesn't preserve MemorySSA? Try -passes='dse,verify<memoryssa>' in the test.

Right, missed MemorySSAUpdater. I'm submitting fix + related UT right now.

yurai007 updated this revision to Diff 351417.Jun 11 2021, 5:36 AM
yurai007 retitled this revision from [DSE] Use calloc for memset+malloc to [DSE] Transform memset + malloc --> calloc (PR25892).
yurai007 edited the summary of this revision. (Show Details)
yurai007 marked 3 inline comments as done.Jun 11 2021, 6:22 AM

@nikic: All comments were addressed.

xbolva00 added inline comments.Jun 15 2021, 3:08 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

We dont need this check, do we?

p = malloc(20)
memset(p, 0, 10)

Reading p between 10 and 20 is UB, so with calloc we would have 0s in this area so fine.

And reverse case is UB too.

yurai007 added inline comments.Jun 15 2021, 3:46 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

If we permitted to "calloc more than we memset" wouldn't we hide UB in some cases?
Like if we would really read unitinitialized memory much later after memset?
The second thing is that GCC doesn't transform malloc to calloc when we memset less memory than malloc allocates: https://godbolt.org/z/Ef94je4KP I'm not saying we should blindly follow them, I'm just not sure what was rationale behind that.

nikic added inline comments.Jun 15 2021, 3:56 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

If the memset size is different, profitability is also unclear. Converting a malloc into a calloc may be always legal, but if you have malloc(10000) followed by memset(10) it's probably not profitable to actually do it.

xbolva00 added inline comments.Jun 15 2021, 3:59 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

There is no rule that compiler cannot “hide” UB. Compiler is free to assume that UB never happens.

xbolva00 added inline comments.Jun 15 2021, 4:04 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

Yeah.

Maybe if we know both sizes and memset is unlikely to be expanded (there is some limit defined samowhere), we should still prefer calloc (1 call) than 2 libcalls?

xbolva00 added inline comments.Jun 15 2021, 4:05 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

But for now, let’s start with that condition you already have.

Good enough for this patch.

yurai007 added inline comments.Jun 15 2021, 4:32 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1843

If the memset size is different, profitability is also unclear. Converting a malloc into a calloc may be always legal, but if you have malloc(10000) followed by memset(10) it's probably not profitable to actually do it.

Right.

There is no rule that compiler cannot “hide” UB. Compiler is free to assume that UB never happens.

I'm aware that compiler assume UB never happens. My point was that if we permit to "calloc more than we memset" then uninitialized access may become initialized and _maybe_ sanitizers/memcheck/other tools won't detect it. If it's unreal then fine, I agree we shouldn't care.

I played a little bit with patch and checked how it performs with GCC unit tests added together with similar change in GCC on 2014.

In first case: https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c malloc is transformed to calloc like for GCC which is good.
In LLVM case on assembly level register pressure is very similar to GCC (especially for f() function) which means that issue mentioned by Haneef in PR discussion
(https://bugs.llvm.org/show_bug.cgi?id=25892#c1) seems to be mitigated now.

However the second C++-like case: https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/g++.dg/tree-ssa/calloc.C is not transformed.
Here CFG is more complex and I guess that DSE simply doesn't detect malloc - memset pair because they aren't in neighboring blocks.
Apparently there is still room for improvements but let's leave it for future patches :)

yurai007 marked 5 inline comments as done.Jun 17 2021, 3:30 AM

@nikic @xbolva00: All comments addressed.

No comments from me. Someone more familiar with DSE should LGTM it.

@xbolva00: Ok. Added more reviewers.

nikic accepted this revision.Jun 27 2021, 1:25 PM

LGTM

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1844

It would be clearer to use Malloc instead of DefUOInst here, to match the memoryIsNotModifiedBetween() call.

llvm/test/Transforms/DeadStoreElimination/noop-stores.ll
3

Please change this RUN line to use -passes='dse,verify<memoryssa>'.

This revision is now accepted and ready to land.Jun 27 2021, 1:25 PM
yurai007 updated this revision to Diff 354868.Jun 28 2021, 6:02 AM
yurai007 marked 2 inline comments as done.

Rebased and addressed yesterday comments.

yurai007 updated this revision to Diff 354987.Jun 28 2021, 12:24 PM

@nikic: Thanks for your review and LGTM it. Looks like I cannot push it myself to main: "Permission to llvm/llvm-project.git denied to yurai007".

This revision was automatically updated to reflect the committed changes.
This revision is now accepted and ready to land.Jul 9 2021, 6:28 AM

Hello @kcc @eugenis @pgousseau, sorry for bothering you. I added you to this review because transformation introduced in this change breaks msan_test (memcpy_unaligned/TestUnalignedMemcpy unit test).
I'm quite convinced that after my change Clang does what GCC would do if compliled msan_test.cpp file with -Ofast: https://godbolt.org/z/f7s81hjaM
Therefore I'm pretty sure that transformation works correctly (as on GCC) but it simply doesn't play well with MSan.
Since I'm not MSan expert it would be great if you could take a look on this and confirm whether or not my understanding of issue is correct.

Hello @kcc @eugenis @pgousseau, sorry for bothering you. I added you to this review because transformation introduced in this change breaks msan_test (memcpy_unaligned/TestUnalignedMemcpy unit test).
I'm quite convinced that after my change Clang does what GCC would do if compliled msan_test.cpp file with -Ofast: https://godbolt.org/z/f7s81hjaM
Therefore I'm pretty sure that transformation works correctly (as on GCC) but it simply doesn't play well with MSan.
Since I'm not MSan expert it would be great if you could take a look on this and confirm whether or not my understanding of issue is correct.

Right, so this change replaces malloc with calloc in

if (src_is_poisoned)
  src_origin = __msan_get_origin(src);
else
  memset(src, 0, sz);

because the other branch contains UB.
The test can be fixed by adding __msan_allocated_memory(ptr, size) before the call to __msan_get_origin, but I'd rather disable this optimization in functions with sanitize_memory attribute because it could make us miss bugs.

If possible, it's OK to disable only the CFG-aware part of the opt. I.e. malloc + memset in linear code (or even same BB) is fair game.

yurai007 updated this revision to Diff 359297.Jul 16 2021, 5:57 AM

Fixed Msan.

Thanks. I updated patch and now optimization is disabled for functions with sanitize_memory attribute.
Detecting linear pattern would be possible but I'm not convinced it's worth effort. Anyway now it should work with MSan.

yurai007 updated this revision to Diff 360028.Jul 20 2021, 12:10 AM
This revision was landed with ongoing or failed builds.Jul 20 2021, 2:51 AM
This revision was automatically updated to reflect the committed changes.

@yurai007 Shouldn't we detect that we are implementing 'calloc' and bail out if so ? Just like we do for 'memset' ?
(See https://www.godbolt.org/z/xnMa9bj4r )

@jeroen.dobbelaere: Yes, I think DSE should special case calloc to avoid infinite recursion (like it does for memset) in libc++. Thanks for catching this. I'm reverting change to fix.

*Meant libc, not libc++ but nevermind.

delcypher added subscribers: yuri, delcypher.EditedJul 23 2021, 1:05 PM

@yurai007 I think this patch may have broken the compiler-rt/test/asan/TestCases/heap-overflow.cpp test. This test is failing on our internal bots. The test fails because it expects to see malloc at -O2 in the stacktrace but we are seeing calloc() instead.

@delcypher: Thanks for letting me know. Now patch is reverted, Before submitting again I will fix heap-overflow.cpp test.

yurai007 updated this revision to Diff 361985.Jul 27 2021, 5:13 AM

Don't transform when function is calloc or it's instrumented with ASan.

yurai007 updated this revision to Diff 362312.Jul 28 2021, 2:45 AM

Disabled transformation on HWASan as well. Maybe it's too paranoid but I don't have AAarch64 hardware to verify.

alexfh added a subscriber: alexfh.Aug 19 2021, 5:32 AM

This commit seems to cause memory usage (rss) increase in MariaDB's mysqld by a factor of 4. Looking back into the history, I found that the previous commit here caused the same regression, but we quickly picked up the revert and moved on. Now I'm trying to isolate the problem, but it's taking time.
So far, my sole hypothesis is that malloc + memset of a smaller size can still be converted to calloc. But I have no evidence so far. Naive attempts to synthetically reproduce this didn't work. Maybe this only happens when some UB is in place, but again, I have nothing to back this up.

Given this is a quite serious regression, can we roll this back while I'm investigating?

This commit seems to cause memory usage (rss) increase in MariaDB's mysqld by a factor of 4. Looking back into the history, I found that the previous commit here caused the same regression, but we quickly picked up the revert and moved on. Now I'm trying to isolate the problem, but it's taking time.
So far, my sole hypothesis is that malloc + memset of a smaller size can still be converted to calloc. But I have no evidence so far. Naive attempts to synthetically reproduce this didn't work. Maybe this only happens when some UB is in place, but again, I have nothing to back this up.

Given this is a quite serious regression, can we roll this back while I'm investigating?

You should be able to use flag -fno-builtin-calloc to disable this transformation.

This transformation is correct and valid; GCC does it as well. No reason to revert, not justified. You should check asm diffs w and w/o patch for any surprises.

This commit seems to cause memory usage (rss) increase in MariaDB's mysqld by a factor of 4. Looking back into the history, I found that the previous commit here caused the same regression, but we quickly picked up the revert and moved on. Now I'm trying to isolate the problem, but it's taking time.
So far, my sole hypothesis is that malloc + memset of a smaller size can still be converted to calloc. But I have no evidence so far. Naive attempts to synthetically reproduce this didn't work. Maybe this only happens when some UB is in place, but again, I have nothing to back this up.

Given this is a quite serious regression, can we roll this back while I'm investigating?

You should be able to use flag -fno-builtin-calloc to disable this transformation.

This transformation is correct and valid; GCC does it as well. No reason to revert, not justified. You should check asm diffs w and w/o patch for any surprises.

I found and reduced a test case that shows a too eager replacement of malloc with calloc:

https://gcc.godbolt.org/z/dTjonof74

$ cat test.cc
#include <stdlib.h>
#include <string.h>

void *my_malloc(size_t size, int my_flags)
{
  void* point = malloc(size);
  if (my_flags & 32) memset(point, 0, size);
  return point;
}

$ clang -O2 -o test.o -save-temps -c test.cc  && cat test.s
        .text
        .file   "test.cc"
        .globl  _Z9my_mallocmi                  # -- Begin function _Z9my_mallocmi
        .p2align        4, 0x90
        .type   _Z9my_mallocmi,@function
_Z9my_mallocmi:                         # @_Z9my_mallocmi
        .cfi_startproc
# %bb.0:
        pushq   %rax
        .cfi_def_cfa_offset 16
        movq    %rdi, %rsi
        movl    $1, %edi
        callq   calloc@PLT
        popq    %rcx
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   _Z9my_mallocmi, .Lfunc_end0-_Z9my_mallocmi
        .cfi_endproc
                                        # -- End function
        .ident  "clang version trunk (45ac5f5441818afa1b0ee4a3734583c8cc915a79)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

This is quite a serious problem. Would you please revert while working on a fix?

lebedev.ri added a comment.EditedAug 19 2021, 3:11 PM

I think i see where this is going - the just-malloced, but never touched memory
doesn't need to be actually backed by an actual pages (see overcommit),
while i guess calloc doesn't just mark the pages as zeroed-out,
but actually marks them dirty and needed to be allocated,
at least not unless you happen to allocate in multiples of page size?

I guess the easy fix here is to require that memset post-dominates the malloc,
but i guess we also need some langref blurb about this,
because the transformation is correct, just-malloced memory is filed with undef,
which we can always define into zeros: https://alive2.llvm.org/ce/z/C4vWH2

This commit seems to cause memory usage (rss) increase in MariaDB's mysqld by a factor of 4. Looking back into the history, I found that the previous commit here caused the same regression, but we quickly picked up the revert and moved on. Now I'm trying to isolate the problem, but it's taking time.
So far, my sole hypothesis is that malloc + memset of a smaller size can still be converted to calloc. But I have no evidence so far. Naive attempts to synthetically reproduce this didn't work. Maybe this only happens when some UB is in place, but again, I have nothing to back this up.

Given this is a quite serious regression, can we roll this back while I'm investigating?

You should be able to use flag -fno-builtin-calloc to disable this transformation.

This transformation is correct and valid; GCC does it as well. No reason to revert, not justified. You should check asm diffs w and w/o patch for any surprises.

I found and reduced a test case that shows a too eager replacement of malloc with calloc:

https://gcc.godbolt.org/z/dTjonof74

$ cat test.cc
#include <stdlib.h>
#include <string.h>

void *my_malloc(size_t size, int my_flags)
{
  void* point = malloc(size);
  if (my_flags & 32) memset(point, 0, size);
  return point;
}

$ clang -O2 -o test.o -save-temps -c test.cc  && cat test.s
        .text
        .file   "test.cc"
        .globl  _Z9my_mallocmi                  # -- Begin function _Z9my_mallocmi
        .p2align        4, 0x90
        .type   _Z9my_mallocmi,@function
_Z9my_mallocmi:                         # @_Z9my_mallocmi
        .cfi_startproc
# %bb.0:
        pushq   %rax
        .cfi_def_cfa_offset 16
        movq    %rdi, %rsi
        movl    $1, %edi
        callq   calloc@PLT
        popq    %rcx
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   _Z9my_mallocmi, .Lfunc_end0-_Z9my_mallocmi
        .cfi_endproc
                                        # -- End function
        .ident  "clang version trunk (45ac5f5441818afa1b0ee4a3734583c8cc915a79)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

This is quite a serious problem. Would you please revert while working on a fix?

For now, use -fno-builtin-calloc, it works fine. Also another solution:

void *my_malloc(size_t size, int my_flags) __attribute__((no_builtin("calloc")))
{
  void* point = malloc(size);
  if (my_flags & 32) memset(point, 0, size);
  return point;
}

I guess the easy fix here is to require that memset post-dominates the malloc,

Yeah, but I am not sure, this is quite very specific case, I would prefer ' attribute((no_builtin("calloc")))' solution here, instead of messing with malloc's internals and their impact on LLVM & Langref.

It would be great not to break much more common pattern:

void* point = malloc(size);
  if (point) memset(point, 0, size);

I think i see where this is going - the just-malloced, but never touched memory
doesn't need to be actually backed by an actual pages (see overcommit),
while i guess calloc doesn't just mark the pages as zeroed-out,
but actually marks them dirty and needed to be allocated,
at least not unless you happen to allocate in multiples of page size?

I found this problem in mysql compiled with tcmalloc. Mysqld (at least in the somewhat older version I'm looking at) speculatively allocates a potentially large (depending on the configuration parameters) block of memory on start, which is normally used only partially. With malloc the memory is lazily given to the process when it starts using it. With calloc (and tcmalloc) the process actually tries to get all the pages immediately, which increases RSS (and thus, real memory usage). I guess, it may affect performance as well due to the unnecessary filling with zeroes, when user code calls my_malloc without MY_ZEROFILL.

For the context: https://fossies.org/linux/mariadb/mysys/my_malloc.c (this version seems functionally close to what I'm looking at).

I guess the easy fix here is to require that memset post-dominates the malloc,
but i guess we also need some langref blurb about this,
because the transformation is correct, just-malloced memory is filed with undef,
which we can always define into zeros: https://alive2.llvm.org/ce/z/C4vWH2

Yep, sounds like a revert to me.

To be honest i'm not really sure why this transform is worth it/beneficial,
so i'm not sure how much trouble the implementation should go into.

Well, I am not sure, your code compiled with GCC will have same issue. Overallocating is also questionable - your code relies on some many things to work luckily together - not a very ideal state.

This transformation was in simplifylibcalls for very long time anyway so if anything, then postdominating check….

Well, I am not sure, your code compiled with GCC will have same issue.

I'm not sure "gcc also has this problem" is a good excuse ;)

Overallocating is also questionable - your code relies on some many things to work luckily together - not a very ideal state.

Well, it's not my code, it's mysql, which is probably Oracle's code, I suppose ;) But jokes aside, the function I provided as a test case looks totally reasonable to me and it shouldn't zero-fill the allocated memory, if it wasn't asked to. It's good there is a workaround (-fno-builtin-calloc) and it seems like some time ago mysql started using this flag in its gcc builds, because this specific optimization caused substantial performance regressions: http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html

I guess, mysql is not the only software using its wrappers around malloc + memset with a similar logic.

This transformation was in simplifylibcalls for very long time anyway

It wasn't this transformation. It was targeting specifically memset(malloc(size), 0, size) pattern, which is much more specific and safe to replace to calloc.

so if anything, then postdominating check….

Yes, please. More strictly, you could try to check that the memset is on all paths from malloc to where the result of malloc can be used (in any way except for checking it for zero maybe). Not sure whether it can be efficiently implemented though.

I found and reduced a test case that shows a too eager replacement of malloc with calloc: https://gcc.godbolt.org/z/dTjonof74
(...) This is quite a serious problem. Would you please revert while working on a fix?

First of all just to make it clear - in this particular case GCC does exactly same: https://gcc.godbolt.org/z/qbE6Kxdnv unless you pass different flags.

I found and reduced a test case that shows a too eager replacement of malloc with calloc: https://gcc.godbolt.org/z/dTjonof74
(...) This is quite a serious problem. Would you please revert while working on a fix?

First of all just to make it clear - in this particular case GCC does exactly same: https://gcc.godbolt.org/z/qbE6Kxdnv unless you pass different flags.

I will be blunt: i do not understand what are the profitability heuristics for this transform are?
When is calloc better than a malloc immediately followed by the memset?

I will be blunt: i do not understand what are the profitability heuristics for this transform are?
When is calloc better than a malloc immediately followed by the memset?

IIRC the main rationale behind this transformation is here: https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc.
I think I could prepare benchmark showing benefits of calloc over malloc+memset, I already observed reduced RSS and less page faults after transformation on some tests.
However if non-standard malloc is used it may behave differently.

xbolva00 added a comment.EditedAug 20 2021, 2:34 AM

I think more codes with “common pattern” see improvements so few ones which are problematic should use flag or attribute.

Also unroller sometimes goes crazy and degrades performance but here nobody talks about revert/removal/turning off.

yurai007 added a comment.EditedAug 20 2021, 3:04 AM

I think more codes with “common pattern” see improvements so few ones which are problematic should use flag or attribute.
Also unroller sometimes goes crazy and degrades performance but here nobody talks about revert/removal/turning off.

+1 Totally agree.

I found and reduced a test case that shows a too eager replacement of malloc with calloc: https://gcc.godbolt.org/z/dTjonof74
(...) This is quite a serious problem. Would you please revert while working on a fix?

First of all just to make it clear - in this particular case GCC does exactly same: https://gcc.godbolt.org/z/qbE6Kxdnv unless you pass different flags.

As I said, "GCC also has this problem" is not a good excuse.

I think more codes with “common pattern” see improvements so few ones which are problematic should use flag or attribute.

Also unroller sometimes goes crazy and degrades performance but here nobody talks about revert/removal/turning off.

Revert is the safest thing to do when a real problem has been root caused to the commit and there is no quick and obvious fix. If there's no consensus on introducing a postdominance check with the purpose of making this transformation more conservative, I'd insist on reverting until there's a clear path forward.
I'd probably be fine with using a workaround (-fno-builtin-calloc) for now, but it also disables the completely reasonable and uncontroversial transformation of memset(malloc(n), 0, n) to calloc(1, n), which is a net regression from the state before this commit: https://gcc.godbolt.org/z/Ev64KPs85

And needless to say, "there's a problem elsewhere, why should we do better here?" is not a productive approach ;) If you see a change in unroller pessimizing use cases you care about, it *may* be reasonable to speak up, ask for a revert and figure out how to solve the problem while keeping trunk in a better shape.

I will be blunt: i do not understand what are the profitability heuristics for this transform are?
When is calloc better than a malloc immediately followed by the memset?

IIRC the main rationale behind this transformation is here: https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc.
I think I could prepare benchmark showing benefits of calloc over malloc+memset, I already observed reduced RSS and less page faults after transformation on some tests.

Yes, please. A benchmark measuring the benefits of this transformation would be nice. Specifically, it would be interesting to see the differences between the too eager (current) version of this transform vs a more conservative version (to be implemented) with a "memset postdominating malloc" check. For comparison I have a benchmark that shows a 4+x increase in RSS with this transformation in a real production code running in a rather realistic configuration.

However if non-standard malloc is used it may behave differently.

In the post I linked to above (http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html) there were measurements of performance impacts of this transformation (when done by GCC) with three different malloc implementations - glibc, tcmalloc, jemalloc. glibc malloc suffered the most (>30% drop in QPS).

xbolva00 added a comment.EditedAug 20 2021, 8:07 AM

there is no quick and obvious fix

There is a flag/attribute.

but it also disables the completely reasonable and uncontroversial transformation of memset(malloc(n), 0, n) to calloc(1, n)

  1. this never worked with llvm.memset.
  2. this never worked if ptr from malloc was only checked for null.

p = memset(malloc(..), 0, ..) is horrible pattern to see in real world code. Rather to have nothing than optimize this horrible pattern.

impacts of this transformation (when done by GCC)

And so many years passed and gcc still has it. :) it does not look like there is a storm on gcc bugzilla about this “issue” either.

Well I think users have some ways how to avoid this optimization - what we could do is to write some info into release notes.

xbolva00 added a comment.EditedAug 20 2021, 8:17 AM

And small change like:

if (ZERO_FILL)

point =  calloc(1, size);

else point = malloc(size);

Looks even better. The fastest.

I am fine with revert if you post LangRef change related to this issue “definition of undef vs malloc”.

MaskRay added a subscriber: MaskRay.EditedAug 20 2021, 11:11 AM

I agree with alexfh that this is questionable.
55 lines of complexity and compile time hurt in return of an "optimization naturally performed by humans, deoptimization/surprise at times, with other hidden costs"?

Other hidden costs: new references to msan/asan/hwasan exclusion. (asan/hwasan are perhaps redundant.) When a new sanitizer is introduced, developers may go over the list of reference sites and adjust accordingly. We either lose test coverage or add some low-value tests.

Justification in real applications will help. I am not sure the benefits can be justified (we will need an additional cost: a post-dominance test and its accompanying test).

+1 to revert.

I'm not sure what is the LangRef change requested. This is a valid transformation, sure. It just makes the code run slower and use more memory.

It goes against the user intent in the code that is explicitly trying to avoid zero-initializing the allocated memory.

IMHO this transformation should only apply if memset can be proven to postdominate malloc *for the entire allocation size* (or at least a large part of it).

Anyway, we have some requests for this transformation:

https://bugs.llvm.org/show_bug.cgi?id=25892
https://bugs.llvm.org/show_bug.cgi?id=47159

(Maybe even more..)

So it would be good to find out solution which can be fine for both “sides” :)

Sent out https://reviews.llvm.org/D108485 in an attempt to mitigate this without reverting the full transformation.
Please add self to reviewers.