After this change DSE can eliminate a malloc + memset pair and emit calloc instead.
It's a follow-up to https://reviews.llvm.org/D101440.
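The pattern involved can be sketched as follows (a minimal illustration with made-up function names, not the compiler's internal code): a malloc whose whole allocation is immediately zeroed by a memset is equivalent to a single calloc, which may hand back lazily-zeroed pages instead of touching the memory up front.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// What the source code writes: malloc immediately followed by a
// full-size memset of zero.
void *zeroed_buffer(std::size_t n) {
  void *p = std::malloc(n);
  if (p)
    std::memset(p, 0, n);
  return p;
}

// What DSE can now effectively rewrite it into: one calloc call.
void *zeroed_buffer_after_dse(std::size_t n) {
  return std::calloc(1, n);
}
```

Both functions return a block of n zero bytes (or null on failure), which is what makes the rewrite legal.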
This looks good to me, but please wait for the reviewers' comments from the previous patch version.
I'd be curious how much the decision to not ignore allocas will affect compile times in (generated) IRs with many locals.
And remove the transformation from instcombine?
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1852
We should not copy attributes from malloc.
> And remove the transformation from instcombine?
Yes. IIRC it's already dead there anyway, because the malloc always has more than one use at that point. It could be done in a separate change.
If this is only about mallocs and callocs, then the impact on compile time is rather insignificant. I wouldn't expect anything more than noise, but of course we can try to measure it to make sure.
The most time-consuming part is probably, in the malloc case, one AA check (AA.getModRefInfo) coming from memoryIsNotModifiedBetween, and in the calloc case one getClobberingMemoryAccess call from MemorySSA (a little heavier than getModRefInfo).
So it's more or less one extra AA check for every malloc/calloc + memset pair in a block.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1843
That's the root cause of some crashes. We cannot assume that isMallocLikeFn implies malloc. Will fix.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1852
Thanks for pointing this out! It saves me the time of investigating the current backend error from the CI.
Since instcombine is currently the only user of emitCalloc, after you commit this patch please delete the AttrsList parameter from emitCalloc (and adjust the new call site in DSE) together with the removal of this transformation from instcombine.
Otherwise, I think this looks good.
Sure. Once I fix the tests and the commit is ready to land, I will prepare the appropriate emitCalloc/instcombine change.
Both the AddressSanitizer tests and the libFuzzer tests fail for the same reason, so there are only two distinct error causes.
It seems to be related to the way the asan_allocator works (it is used in all failing unit tests) rather than to a broken middle-end transformation.
I will check it further and come back with more details.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1843
https://llvm.org/doxygen/MemoryBuiltins_8cpp_source.html
So: isMallocLikeFn && !isOpNewLikeFn
Regarding OutOfMemoryTest, here is an explanation of the https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/fuzzer/OutOfMemoryTest.cpp failure.
The test silently assumes that in every loop iteration we request 268 MB of virtual memory and get 268 MB of physical memory allocated.
This holds for the malloc+memset combo (page faults trigger the actual physical allocation in the kernel) but not for calloc (the physical memory allocation is deferred; it's faster, which is the whole point of our calloc transformation).
As a consequence, after my patch there is no OOM in OutOfMemoryTest, so the unit test fails.
It can easily be fixed by saying "we _really_ want to allocate and touch physical memory, not only virtual memory", which ends up as a new volatile char[] + std::fill pair.
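A sketch of that fix (the helper name is hypothetical; the point is that writing through a volatile pointer forces every byte to actually be stored, so the kernel must commit physical pages, which a lazily-zeroed calloc'd region would defer):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Allocate `size` bytes and touch every one of them through a volatile
// pointer, defeating both calloc's lazy zero pages and dead-store
// elimination of the fill.
std::size_t allocate_and_touch(std::size_t size) {
  volatile char *buf = new volatile char[size];
  std::fill(buf, buf + size, static_cast<char>(1));
  std::size_t touched = 0;
  for (std::size_t i = 0; i < size; ++i)
    touched += static_cast<unsigned char>(buf[i]);
  delete[] buf;
  return touched; // equals size when every byte was really written
}
```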
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1843
OK, maybe that would be more accurate than dyn_cast. I need to double-check.
Just ignore my previous comment about the test. It doesn't explain how we moved from a non-inlined new[] to calloc. Indeed, a better explanation was suggested by @xbolva00.
I think this should check for just LibFunc_malloc and nothing else. MallocLike just means an allocation function that returns null on failure, and the nothrow new variants are considered MallocLike and not OpNewLike.
Please update the patch subject to be more reflective of "replacement" as opposed to "removal".
For example, "Use calloc for memset+malloc".
Thanks. Updated the patch with (hopefully) proper LibFunc_malloc detection. The extra checks are borrowed from LibCallSimplifier.
@xbolva00 @nikic @hubert.reinterpretcast: I think that after addressing all comments and passing the CI it's now in good shape.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1829-1880
Which part here requires the const cast?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1853
Is this cast needed? Shouldn't it be an implicit upcast?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1858
As you already have an IRBuilder, why not IRB.getIntPtrTy(DL)?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1864
Looks like this doesn't preserve MemorySSA? Try -passes='dse,verify<memoryssa>' in the test.

llvm/test/Transforms/DeadStoreElimination/noop-stores.ll:339
As it was the motivating case, I'd also expect a test where the memset is in a different block. I also don't think that the dominates() condition in your implementation is exercised by the tests. Some other conditions aren't either, for example malloc and memset having different sizes.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1829-1880
Malloc, more precisely starting from this line: IRBuilder<> IRB(Malloc);

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1864
Will check.

llvm/test/Transforms/DeadStoreElimination/noop-stores.ll:339
Sure, I can add many more tests to cover more conditions. Currently, with this patch, DSE cannot perform such a transformation. Consider the original pr25892 from https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/InstCombine/memset-1.ll.
llvm/test/Transforms/DeadStoreElimination/noop-stores.ll:339
After adding the original PR unit test I noticed that DSE can actually perform the transformation across blocks now. When I checked before it couldn't; apparently changes made in the meantime unlocked it. Well, that's good :)
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1864
Right, I missed the MemorySSAUpdater. I'm submitting the fix + a related unit test right now.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
We don't need this check, do we? Take p = malloc(20) followed by a memset of the first 10 bytes: reading p between 10 and 20 is UB, so with calloc we would have zeros in that area, which is fine. And the reverse case is UB too.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
If we permitted "calloc more than we memset", wouldn't we hide UB in some cases?
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
If the memset size is different, profitability is also unclear. Converting a malloc into a calloc may always be legal, but if you have malloc(10000) followed by memset(10) it's probably not profitable to actually do it.
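The size-mismatch case discussed here can be sketched as follows (illustrative function, not from the patch): converting this malloc to calloc(1, 10000) would be legal, since the untouched tail is undef and may be defined to zero, but it would zero 10000 bytes to save a 10-byte memset.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Only a small header of the allocation is zeroed; the remaining
// 9990 bytes are deliberately left uninitialized. A calloc rewrite
// would zero the whole block, which is likely a pessimization.
void *header_only_zeroed() {
  void *p = std::malloc(10000);
  if (p)
    std::memset(p, 0, 10); // zero only the first 10 bytes
  return p;
}
```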
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
There is no rule that the compiler cannot "hide" UB. The compiler is free to assume that UB never happens.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
Yeah. Maybe if we know both sizes and the memset is unlikely to be expanded inline (there is a limit defined somewhere), we should still prefer calloc (one call) over two libcalls?
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
But for now, let's start with the condition you already have. Good enough for this patch.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:1851
Right. I'm aware that the compiler assumes UB never happens. My point was that if we permit "calloc more than we memset", then an uninitialized access may become initialized and _maybe_ sanitizers/memcheck/other tools won't detect it. If that's unrealistic, then fine, I agree we shouldn't care.
I played a little with the patch and checked how it performs on the GCC unit tests added together with the similar change made in GCC in 2014.
In the first case, https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c, malloc is transformed to calloc just as in GCC, which is good.
At the assembly level, LLVM's register pressure is very similar to GCC's (especially for the f() function), which means the issue mentioned by Haneef in the PR discussion
(https://bugs.llvm.org/show_bug.cgi?id=25892#c1) seems to be mitigated now.
However, the second, C++-like case, https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/g++.dg/tree-ssa/calloc.C, is not transformed.
Here the CFG is more complex and I guess DSE simply doesn't detect the malloc + memset pair because they aren't in neighboring blocks.
Apparently there is still room for improvement, but let's leave that for future patches :)
@nikic: Thanks for your review and the LGTM. It looks like I cannot push it to main myself: "Permission to llvm/llvm-project.git denied to yurai007".
@yurai007 I would suggest applying for commit access, see https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access.
Reverted in rG4e332cd41acb2befa85e20ec1f28413ea4adbb50 - check-msan broke.
https://lab.llvm.org/buildbot/#/builders/18/builds/1934
Hello @kcc @eugenis @pgousseau, sorry for bothering you. I added you to this review because the transformation introduced in this change breaks msan_test (the memcpy_unaligned/TestUnalignedMemcpy unit test).
I'm quite convinced that after my change Clang does what GCC would do if it compiled the msan_test.cpp file with -Ofast: https://godbolt.org/z/f7s81hjaM
Therefore I'm pretty sure that the transformation works correctly (as in GCC), but it simply doesn't play well with MSan.
Since I'm not an MSan expert, it would be great if you could take a look at this and confirm whether or not my understanding of the issue is correct.
Right, so this change replaces malloc with calloc in

if (src_is_poisoned)
  src_origin = __msan_get_origin(src);
else
  memset(src, 0, sz);

because the other branch contains UB.
The test can be fixed by adding __msan_allocated_memory(ptr, size) before the call to __msan_get_origin, but I'd rather disable this optimization in functions with sanitize_memory attribute because it could make us miss bugs.
If possible, it's OK to disable only the CFG-aware part of the optimization. I.e. malloc + memset in straight-line code (or even the same BB) is fair game.
Thanks. I updated the patch, and the optimization is now disabled for functions with the sanitize_memory attribute.
Detecting the straight-line pattern would be possible, but I'm not convinced it's worth the effort. Anyway, it should now work with MSan.
@yurai007 Shouldn't we detect that we are implementing 'calloc' and bail out if so? Just like we do for 'memset'?
(See https://www.godbolt.org/z/xnMa9bj4r )
@jeroen.dobbelaere: Yes, I think DSE should special-case calloc to avoid infinite recursion (like it does for memset) in libc++. Thanks for catching this. I'm reverting the change to fix it.
@yurai007 I think this patch may have broken the compiler-rt/test/asan/TestCases/heap-overflow.cpp test. This test is failing on our internal bots because it expects to see malloc at -O2 in the stack trace, but we are seeing calloc() instead.
@delcypher: Thanks for letting me know. The patch is now reverted. Before submitting again I will fix the heap-overflow.cpp test.
Disabled the transformation under HWASan as well. Maybe it's too paranoid, but I don't have AArch64 hardware to verify.
This commit seems to cause a memory usage (RSS) increase in MariaDB's mysqld by a factor of 4. Looking back into the history, I found that the previous commit here caused the same regression, but we quickly picked up the revert and moved on. Now I'm trying to isolate the problem, but it's taking time.
So far, my sole hypothesis is that a malloc + memset of a smaller size can still be converted to calloc. But I have no evidence so far; naive attempts to reproduce this synthetically didn't work. Maybe this only happens when some UB is involved, but again, I have nothing to back this up.
Given this is a quite serious regression, can we roll this back while I'm investigating?
You should be able to use flag -fno-builtin-calloc to disable this transformation.
This transformation is correct and valid; GCC does it as well. No reason to revert, not justified. You should check asm diffs w and w/o patch for any surprises.
I found and reduced a test case that shows a too eager replacement of malloc with calloc:
https://gcc.godbolt.org/z/dTjonof74
$ cat test.cc
#include <stdlib.h>
#include <string.h>

void *my_malloc(size_t size, int my_flags) {
  void *point = malloc(size);
  if (my_flags & 32)
    memset(point, 0, size);
  return point;
}

$ clang -O2 -o test.o -save-temps -c test.cc && cat test.s
	.text
	.file	"test.cc"
	.globl	_Z9my_mallocmi                  # -- Begin function _Z9my_mallocmi
	.p2align	4, 0x90
	.type	_Z9my_mallocmi,@function
_Z9my_mallocmi:                         # @_Z9my_mallocmi
	.cfi_startproc
# %bb.0:
	pushq	%rax
	.cfi_def_cfa_offset 16
	movq	%rdi, %rsi
	movl	$1, %edi
	callq	calloc@PLT
	popq	%rcx
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end0:
	.size	_Z9my_mallocmi, .Lfunc_end0-_Z9my_mallocmi
	.cfi_endproc
                                        # -- End function
	.ident	"clang version trunk (45ac5f5441818afa1b0ee4a3734583c8cc915a79)"
	.section	".note.GNU-stack","",@progbits
	.addrsig
This is quite a serious problem. Would you please revert while working on a fix?
I think I see where this is going: the just-malloc'ed but never touched memory doesn't need to be backed by actual pages (see overcommit), while I guess calloc doesn't just mark the pages as zeroed-out but actually marks them dirty and needing to be allocated, at least unless you happen to allocate in multiples of the page size?
I guess the easy fix here is to require that the memset post-dominates the malloc, but I guess we also need some LangRef blurb about this, because the transformation is correct: just-malloc'ed memory is filled with undef, which we can always define into zeros: https://alive2.llvm.org/ce/z/C4vWH2
For now, use -fno-builtin-calloc; it works fine. Also another solution:

void *my_malloc(size_t size, int my_flags) __attribute__((no_builtin("calloc"))) {
  void *point = malloc(size);
  if (my_flags & 32)
    memset(point, 0, size);
  return point;
}
> I guess the easy fix here is to require that the memset post-dominates the malloc,
Yeah, but I am not sure; this is quite a specific case. I would prefer the __attribute__((no_builtin("calloc"))) solution here, instead of messing with malloc's internals and their impact on LLVM and the LangRef.
It would be great not to break the much more common pattern:

void *point = malloc(size);
if (point)
  memset(point, 0, size);
I found this problem in mysql compiled with tcmalloc. mysqld (at least in the somewhat older version I'm looking at) speculatively allocates a potentially large (depending on the configuration parameters) block of memory on start, which is normally only partially used. With malloc the memory is lazily given to the process as it starts using it. With calloc (and tcmalloc) the process actually tries to get all the pages immediately, which increases RSS (and thus real memory usage). I guess it may affect performance as well, due to the unnecessary filling with zeroes when user code calls my_malloc without MY_ZEROFILL.
For the context: https://fossies.org/linux/mariadb/mysys/my_malloc.c (this version seems functionally close to what I'm looking at).
> I guess the easy fix here is to require that the memset post-dominates the malloc, but I guess we also need some LangRef blurb about this, because the transformation is correct: just-malloc'ed memory is filled with undef, which we can always define into zeros: https://alive2.llvm.org/ce/z/C4vWH2
Yep, sounds like a revert to me.
To be honest, I'm not really sure why this transform is worth it/beneficial, so I'm not sure how much trouble the implementation should go to.
Well, I am not sure; your code compiled with GCC will have the same issue. Overallocating is also questionable: your code relies on so many things luckily working together, which is not a very ideal state.
This transformation was in simplifylibcalls for a very long time anyway, so if anything, then a post-dominating check….
I'm not sure "gcc also has this problem" is a good excuse ;)
> Overallocating is also questionable: your code relies on so many things luckily working together, which is not a very ideal state.
Well, it's not my code, it's mysql, which is probably Oracle's code, I suppose ;) But jokes aside, the function I provided as a test case looks totally reasonable to me, and it shouldn't zero-fill the allocated memory if it wasn't asked to. It's good there is a workaround (-fno-builtin-calloc), and it seems that some time ago mysql started using this flag in its GCC builds, because this specific optimization caused substantial performance regressions: http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html
I guess mysql is not the only software using wrappers around malloc + memset with similar logic.
> This transformation was in simplifylibcalls for a very long time anyway
It wasn't this transformation. That one targeted specifically the memset(malloc(size), 0, size) pattern, which is much more specific and safe to replace with calloc.
> so if anything, then a post-dominating check….
Yes, please. More strictly, you could try to check that the memset is on all paths from the malloc to wherever the result of the malloc can be used (in any way except maybe checking it for null). I'm not sure whether that can be implemented efficiently, though.
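The "memset on all paths" condition suggested here can be sketched as a small graph check (illustrative only; a real implementation would use LLVM's PostDominatorTree rather than a toy adjacency list):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy CFG check: given block successors as an adjacency list, return
// true iff every path from the malloc's block to an exit block (a block
// with no successors) passes through the memset's block.
bool memsetOnAllPaths(const std::vector<std::vector<int>> &succ,
                      int mallocBB, int memsetBB) {
  // DFS from mallocBB that refuses to enter memsetBB; if it can still
  // reach an exit block, some path skips the memset.
  std::vector<bool> seen(succ.size(), false);
  std::function<bool(int)> escapes = [&](int bb) -> bool {
    if (bb == memsetBB || seen[bb])
      return false;
    seen[bb] = true;
    if (succ[bb].empty())
      return true; // reached an exit without seeing the memset
    for (int s : succ[bb])
      if (escapes(s))
        return true;
    return false;
  };
  return !escapes(mallocBB);
}
```

For the my_malloc test case above (malloc in the entry block, memset behind the flag check), one path reaches the return without the memset, so the check correctly rejects the transformation.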
> I found and reduced a test case that shows a too eager replacement of malloc with calloc: https://gcc.godbolt.org/z/dTjonof74
> (...) This is quite a serious problem. Would you please revert while working on a fix?
First of all, just to make it clear: in this particular case GCC does exactly the same, https://gcc.godbolt.org/z/qbE6Kxdnv, unless you pass different flags.
I will be blunt: I do not understand what the profitability heuristics for this transform are.
When is calloc better than a malloc immediately followed by a memset?
It's worth reading about it:
https://www.google.com/amp/s/willnewton.name/2013/06/17/calloc-versus-malloc-and-memset/amp/
https://vorpus.org/blog/why-does-calloc-exist/
+ some Stack Overflow pages.
> I will be blunt: I do not understand what the profitability heuristics for this transform are.
> When is calloc better than a malloc immediately followed by a memset?
IIRC the main rationale behind this transformation is here: https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc.
I think I could prepare a benchmark showing the benefits of calloc over malloc+memset; I have already observed reduced RSS and fewer page faults after the transformation on some tests.
However, if a non-standard malloc is used, it may behave differently.
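A micro-benchmark along those lines could be sketched as follows (a hypothetical harness, not the benchmark promised above; results depend heavily on the allocator and the OS overcommit policy, since calloc can hand back already-zero pages lazily while malloc+memset must touch every page):

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <cstring>

// Time a single allocation of n zeroed bytes, either via calloc or via
// malloc followed by an explicit memset of the whole block.
double seconds_to_allocate(bool use_calloc, std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  void *p;
  if (use_calloc) {
    p = std::calloc(1, n);
  } else {
    p = std::malloc(n);
    if (p)
      std::memset(p, 0, n); // forces physical pages to be committed
  }
  auto t1 = std::chrono::steady_clock::now();
  std::free(p);
  return std::chrono::duration<double>(t1 - t0).count();
}
```

Comparing the two variants for large n (and watching RSS alongside wall time) is what would make the trade-off discussed in this thread concrete.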
I think more code with the "common pattern" sees improvements, so the few problematic cases should use the flag or attribute.
Also, the unroller sometimes goes crazy and degrades performance, but nobody talks about reverting/removing/turning it off.
> I think more code with the "common pattern" sees improvements, so the few problematic cases should use the flag or attribute.
> Also, the unroller sometimes goes crazy and degrades performance, but nobody talks about reverting/removing/turning it off.
+1, totally agree.
As I said, "GCC also has this problem" is not a good excuse.
A revert is the safest thing to do when a real problem has been root-caused to the commit and there is no quick and obvious fix. If there's no consensus on introducing a post-dominance check to make this transformation more conservative, I'd insist on reverting until there's a clear path forward.
I'd probably be fine with using a workaround (-fno-builtin-calloc) for now, but it also disables the completely reasonable and uncontroversial transformation of memset(malloc(n), 0, n) to calloc(1, n), which is a net regression from the state before this commit: https://gcc.godbolt.org/z/Ev64KPs85
And needless to say, "there's a problem elsewhere, why should we do better here?" is not a productive approach ;) If you see a change in the unroller pessimizing use cases you care about, it *may* be reasonable to speak up, ask for a revert, and figure out how to solve the problem while keeping trunk in better shape.
Yes, please. A benchmark measuring the benefits of this transformation would be nice. Specifically, it would be interesting to see the difference between the too eager (current) version of this transform and a more conservative version (to be implemented) with a "memset post-dominating malloc" check. For comparison, I have a benchmark that shows a 4+x increase in RSS with this transformation in real production code running in a rather realistic configuration.
> However, if a non-standard malloc is used, it may behave differently.
In the post I linked to above (http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html) there are measurements of the performance impact of this transformation (when done by GCC) with three different malloc implementations: glibc, tcmalloc, and jemalloc. glibc malloc suffered the most (>30% drop in QPS).
> there is no quick and obvious fix
There is a flag/attribute.
> but it also disables the completely reasonable and uncontroversial transformation of memset(malloc(n), 0, n) to calloc(1, n)
- this never worked with llvm.memset.
- this never worked if the ptr from malloc was only checked for null.
p = memset(malloc(..), 0, ..) is a horrible pattern to see in real-world code. Better to have nothing than to optimize this horrible pattern.
> impacts of this transformation (when done by GCC)
And so many years have passed and GCC still has it :) It does not look like there is a storm on the GCC bugzilla about this "issue" either.
Well, I think users have some ways to avoid this optimization; what we could do is write some info in the release notes.
And a small change like:

if (ZERO_FILL)
  point = calloc(1, size);
else
  point = malloc(size);

looks even better. The fastest.
I am fine with the revert if you post a LangRef change related to this "definition of undef vs malloc" issue.
I agree with alexfh that this is questionable.
55 lines of complexity and compile-time cost in return for an "optimization naturally performed by humans, a deoptimization/surprise at times, with other hidden costs"?
Other hidden costs: new references to the msan/asan/hwasan exclusions. (asan/hwasan are perhaps redundant.) When a new sanitizer is introduced, developers may have to go over the list of reference sites and adjust accordingly. We either lose test coverage or add some low-value tests.
Justification in real applications would help. I am not sure the benefits can be justified (we would need an additional cost: a post-dominance test and its accompanying tests).
+1 to revert.
I'm not sure what LangRef change is being requested. This is a valid transformation, sure; it just makes the code run slower and use more memory.
It goes against the user's intent in code that is explicitly trying to avoid zero-initializing the allocated memory.
IMHO this transformation should only apply if the memset can be proven to post-dominate the malloc *for the entire allocation size* (or at least a large part of it).
Anyway, we have some requests for this transformation:
https://bugs.llvm.org/show_bug.cgi?id=25892
https://bugs.llvm.org/show_bug.cgi?id=47159
(Maybe even more.)
So it would be good to find a solution that works for both "sides" :)
Sent out https://reviews.llvm.org/D108485 in an attempt to mitigate this without reverting the full transformation.
Please add yourself to the reviewers.
I'm a bit surprised that MemoryLocation::get() doesn't work for MemSetInst, as it has only a single memory location, so there's no ambiguity here. @asbirlea @fhahn Do you think it would make sense to adjust the API?