This is an archive of the discontinued LLVM Phabricator instance.

Move "auto-init" instructions to the dominator of their users
ClosedPublic

Authored by serge-sans-paille on Nov 9 2022, 5:11 AM.

Details

Summary

As a result of -ftrivial-auto-var-init, clang generates instructions to set alloca'd memory to a given pattern, right after the allocation site. In some cases, this (somehow costly) operation could be delayed, leading to conditional execution in some cases.

This is not an uncommon situation: it happens ~550 times on the cPython code base, and much more on the LLVM codebase. The benefit greatly varies on the execution path, but it should not regress on performance.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
nickdesaulniers requested changes to this revision.Nov 10 2022, 11:47 AM

No comment on the approach. Minor drive by comments on style. But I'd like to see IR tests added for this pass. I haven't run it yet on the Linux kernel but can do so to provide measurements.

llvm/include/llvm/InitializePasses.h
201 ↗(On Diff #474231)

drop this hunk

llvm/include/llvm/LinkAllPasses.h
116 ↗(On Diff #474231)

drop this hunk

139 ↗(On Diff #474231)

add space before cast

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
50

delete

73

auto

81

delete.

See comment below about ternaries, which also applies here.

126

delete

130

delete

151–155

Please use

if (x)
  y()
else
  z()

rather than

if (!x)
  z()
else
  y()

Additionally, since we're assigning to the same variable, consider using a ternary statement.

DominatingPredecessor = (DominatingPredecessor ? DT.find... : Pred)
This revision now requires changes to proceed.Nov 10 2022, 11:47 AM

This looks overly specific.
Is this not a yet another manifestation of the lack of a generic sinking pass?

This looks overly specific.
Is this not a yet another manifestation of the lack of a generic sinking pass?

I'd appreciate pointer to the literature on "sinking pass". That being said, it's pretty clear to me that this pass generalizes far beyond "auto-init". There's an advantage in auto-init though: we can work under the assumption of "no aliasing" which helps a lot. I'm interested in making that pass more generic, probably as a second step though.

No comment on the approach. Minor drive by comments on style. But I'd like to see IR tests added for this pass. I haven't run it yet on the Linux kernel but can do so to provide measurements.

+1 for the tests and thanks for the review. On Firefox codebase, we get decent speedups compated to raw -ftrivial-auto-var-init. Some more data here: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=77549e4eef51c70336d2ca9e3d086bf7767f8196&newProject=try&newRevision=b9fc0eb0e1e29c37364d69a9da203c00e87df5b2&page=1&framework=13&showOnlyConfident=1 where "Base is without this commit and New with it" The benchmarks are to be taken with a grain of salt though, they are not terribly stable.

serge-sans-paille marked 9 inline comments as done.

Fix nits and address reviews
Fix bug when looking for unique predecessor
Add basic tests

@nickdesaulniers I'm really curious about the impact on ClangBuiltLinux

I tested D137707 Diff 475770 against an x86_64 defconfig build of the Linux kernel (commit 81ac25651a62 ("Merge tag 'nfsd-6.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")), which sets CONFIG_INIT_STACK_ALL_ZERO=y. Booted fine.

$ du -h vmlinux.orig vmlinux.D137707.475770
63M	vmlinux.orig
63M	vmlinux.D137707.475770
$ bloaty vmlinux.D137707.475770 -- vmlinux.orig
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +0.0%    +336  [ = ]       0    .rela.orc_unwind_ip
  +0.0%    +144  [ = ]       0    .rela.text
  +2.6%    +141  +2.6%    +141    [LOAD #3 [RWX]]
  +0.0%     +84  +0.0%     +84    .orc_unwind
  +0.0%     +56  +0.0%     +56    .orc_unwind_ip
  +0.0%     +24  [ = ]       0    .rela.smp_locks
  +0.3%      +3  [ = ]       0    .shstrtab
  -0.0%     -11  [ = ]       0    .strtab
  -0.0%     -24  [ = ]       0    .rela.return_sites
  -0.0%     -24  [ = ]       0    .symtab
  -8.8%    -140  -8.8%    -140    [LOAD #1 [RW]]
  -0.0%    -141  -0.0%    -141    .init.text
  +0.0%    +448  [ = ]       0    TOTAL
$ llvm-readelf -S vmlinux.orig
...
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [ 1] .text             PROGBITS        ffffffff81000000 200000 10032cb 00  AX  0   0 4096
...
$ llvm-readelf -S vmlinux.D137707.475770
...
  [ 1] .text             PROGBITS        ffffffff81000000 200000 10032cb 00  AX  0   0 4096

So it appears that this patch made no difference to the size of the .text section. I triple checked this w/ and w/o D137707 applied.

When I build with make LLVM=1 -j128 KCFLAGS="-mllvm -stats", then process the stats from move-auto-init, I observe 49810 moves! Feel free to add that measurement into the commit description.

$ make LLVM=1 -j128 KCFLAGS="-mllvm -stats" &> log.txt
$ grep move-auto-init log.txt | tr -s ' ' | cut -d ' ' -f 2 | python3 -c "import sys; print(sum((float(l) for l in sys.stdin)))"
49810.0
# Triple check with sed+bc rather than python3
$ grep move-auto-init log.txt | tr -s ' ' | cut -d ' ' -f 2 | sed ':a;N;s/\n/+/;ta' |bc    
49810

Please let me know if there's any other measurements you'd like me to make.

Also, it might be nice if the tests demonstrated diffs against existing (or precommitted changes) to better demonstrate how this pass changes the generated code. Mind breaking the newly added tests into 2 patches:

  1. child patch that adds them BEFORE this patch
  2. rebase this patch on that to demonstrate how this patch differs?
llvm/lib/Passes/PassBuilderPipelines.cpp
981

Is it possible to skip this pass if -ftrivial-auto-var-init=zero wasn't specified?

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
93

is there a faster way to do this rather than having to scan the metadata of every operand?

I'd appreciate pointer to the literature on "sinking pass".

Look for PRE (partial redundancy elimination) of stores or partial dead store elimination. Maybe start with https://dl.acm.org/doi/10.1145/277650.277659 . See also D29865 .

So it appears that this patch made no difference to the size of the .text section.

This isn't fundamentally surprising; you have the same number of stores before and after sinking.

I'd appreciate pointer to the literature on "sinking pass".

Look for PRE (partial redundancy elimination) of stores or partial dead store elimination. Maybe start with https://dl.acm.org/doi/10.1145/277650.277659 . See also D29865 .

Thanks.

So it appears that this patch made no difference to the size of the .text section.

This isn't fundamentally surprising; you have the same number of stores before and after sinking.

That's my understanding too. @nickdesaulniers does this patch has an impact on some kernel benchmark, maybe one where -ftrivial-auto-var-init brought some slowdown?

llvm/lib/Passes/PassBuilderPipelines.cpp
981

It would be possible through a module flag or something, but

  1. if we're going to extend that pass beyond auto-var-init, that's going to be useless
  2. that would mean a greater coupling with the front end

The following shows no big impact on compile time when the flag is not used, so I don't think it's worth the extra step

https://llvm-compile-time-tracker.com/compare.php?from=f71d32a0eea47b3d2bb43d6be15cf09d47ef6971&to=6ea450ef673666ed6b8a8257ca56f093aae7469c&stat=instructions:u

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
93

Well, there's an early exit on the instruction metadata, I guess that's enough?

aeubanks added inline comments.
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
204

any reason for all the legacy pass infra? if it's going in the optimization pipeline then no need for a legacy pass

Remove legacy pass manager support
Minor nits

@nickdesaulniers I don't think it makes sense to update the test cases : as they only run the considered pass, the diff before/after is pretty clear based on the CHECK: lines.

I personnaly consider this ready to land :-) I'm obviously open to changes.

serge-sans-paille retitled this revision from [WIP] Move "auto-init" instructions to the dominator of their users to Move "auto-init" instructions to the dominator of their users.Nov 22 2022, 2:00 PM
jfb added a comment.Nov 22 2022, 4:16 PM

Are there any worries about moving stores that are on atomic/volatile "objects", either because the auto-init store itself is atomic/volatile, or because a subsequent access is marked as such? I don't think there's a worry because auto-init stores are technically outside the abstract machine (they "don't exist"), but I'm not 100% convinced.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
120

Typo "want"

Are there any worries about moving stores that are on atomic/volatile "objects", either because the auto-init store itself is atomic/volatile, or because a subsequent access is marked as such? I don't think there's a worry because auto-init stores are technically outside the abstract machine (they "don't exist"), but I'm not 100% convinced.

The thing I'd be concerned about is sinking the store past an atomic fence.

Allocating/initializing an object is never atomic. The requirement of the abstract machine is that the allocation/initialization of the variable has to have a happens-before relationship with any access to it. Consider something like the following:

int x = 3;
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y, &x, memory_order_relaxed);

After the atomic store, it's legal for another thread to access the value "3" through the pointer y, or to write another value to x. (The operations on x don't need to be atomic, as long as the operations on y form a happens-before edge.) So it's illegal to sink the store of "3" past the fence.

If instead x is uninitialized, you run into basically the same thing with implicit zero init: it's illegal to sink the implicit zero-init of x past the fence, because it could race with stores on another thread.

Thanks @efriedma for the (frightening) answer.

If instead x is uninitialized, you run into basically the same thing with implicit zero init: it's illegal to sink the implicit zero-init of x past the fence, because it could race with stores on another thread.

As x is a stack allocated variable, in order to have it shared with another thread (and potentially generating race), we need to *use* it, which implies the initialization already happens. Why does it matter it's done before or after the fence?

If I understand correctly, the fence is global, which means basically any function call we don't have strong guarantee on could imply a memory fence, and prevent the sink, correct?

As x is a stack allocated variable, in order to have it shared with another thread (and potentially generating race), we need to *use* it, which implies the initialization already happens.

To share a variable with another thread, you need to use its address. That isn't directly connected to the memory that contains the variable itself. Without the fence, neither the compiler nor the CPU itself are aware that the initialization of the variable has to be visible to other CPUs before the address of the variable.

If I understand correctly, the fence is global, which means basically any function call we don't have strong guarantee on could imply a memory fence, and prevent the sink, correct?

I think so? I don't see any way to avoid that conclusion.

I guess in theory, you could insert a fence after the zero-initialization to solve the issue: all fences are equivalent, so you don't need the original fence if you insert another one. Not completely sure that works, though, and not sure what effects the extra fence would have.

If you can prove the address doesn't escape, then the whole multi-threaded thing becomes irrelevant, but I guess we SROA most of those cases anyway, so not sure how helpful that is.

Thanks @efriedma for the clarification. I understand your arguments and I think they apply to generic code motion, but I also think we have strong enough hypothesis for this particular pass to not be concerned by these issue.

If you can prove the address doesn't escape, then the whole multi-threaded thing becomes irrelevant, but I guess we SROA most of those cases anyway, so not sure how helpful that is.

I think we are in that situation

I think that the arguments are:

  1. we only move initializer marked as auto-init and whose argument is an alloca.
  2. we compute the dominator of the sets of the uses of this alloca (excluding the memory initialization marked as "auto-init")
  3. we move the memory initialization marked as "auto-init" before that dominator

As a consequence of 2. and 3. there cannot be sharing of the initialized memory before it is initialized, so we can safely cross a memory barrier (and we're not breaking an existing semantic as the auto-init is inserted by the compiler).

Of course all this would collapse without this strong no-sharing assumptions.

Am I missing something?

we move the memory initialization marked as "auto-init" before that dominator

The problem with this is that "before" in the sense of dominators isn't sufficient; that only dictates ordering on the CPU the code is executing on. You need happens-before, i.e. the ordering visible to all CPUs, to ensure you don't have a race.

The user is responsible for ensuring there's a barrier between the allocation of the variable and the use, but if the auto-init is in a different place, you can't depend on the barrier the user wrote.

Take into account memory fences, call sites etc so that the instruction move respects memory order enforced by memory barriers.

This is achieved through MemorySSA, which makes the pass effortlessly more generic.

There a re still a few test to be done on my side, but I'd love to have feedback from @efriedma on the approach first.

The approach seems reasonable.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

Can we avoid iterating over the function if it doesn't have any auto-init?

llvm/test/Other/new-pm-defaults.ll
109

It would be good to avoid introducing an additional run of MemorySSA.

fhahn added a subscriber: fhahn.Dec 5 2022, 3:15 PM
fhahn added inline comments.
llvm/lib/Passes/PassBuilderPipelines.cpp
983

What's the rational behind placing it here and not closer to other memory optimizations like DSE/MemCpyOpt?

983

(moving it closer to those will also allow re-using existing memorySSA)

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

It's not clear to me from the FIXME why we would want to limit this to auto-init instructions. Shouldn't this be beneficial for all memory instructions?

183

This pass moves only memory instructions, so could it be marked as preserving all CFG analyses?

This should ideally also preserve MemorySSA.

llvm/test/Transforms/MoveAutoInit/branch.ll
6

does this depend on the triple? If it does, this needs REQUIRES:... otherwise it will fail if the X86 backend is not built. (same for all tests)

it'd be good to put this through llvm-compile-time-tracker. a branch with a commit that turns on -ftrivial-auto-var-init, then this commit, to see how much this adds to compile time when using -ftrivial-auto-var-init

it'd be interesting to remove the "auto-init" restriction and see what effects on compile time this has in general, and maybe benchmark results

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
25–26

can remove these now that there's no legacy pass

104–105

this sentence doesn't make sense to me

113

want

llvm/test/Transforms/MoveAutoInit/branch.ll
10

test nit: could you remove dso_local/noundef

llvm/lib/Passes/PassBuilderPipelines.cpp
983

+1 for that. The previous implementation was only using Dominator trees so it made sense to activate it early in the pipeline. This version is sensibly more costly, so it will propably appear at higher optimization level.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

Can we avoid iterating over the function if it doesn't have any auto-init?

I could try a module pass and walk through the users of the auto-init metadata?

94

The algorithm beyond auto-init requires more care (it probably needs a reverse transversal of the CFG, and some optimization to avoid rebuilding the dominator of the users for each instruction). I'd rather start small with that pass and then upgrade it to a full-fledge version.

fhahn added inline comments.Dec 6 2022, 2:19 AM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

AFAIK the !annotation metadata at the moment is information-only and doesn't provide any semantic guarantees. So relying on it for correctness at the moment doesn't sound ideal. From the documentation in the pass, it is not clear what property you are relying on. Could you make this clear in both the documentation in the pass and the commit description?

  • Move the pass to the O2 pipeline and update test accordingly
  • Declare preserved analyses, although I'm not sure of the validity for MemorySSA

The new pass catches much less case than the previous one (~150 on the cPython codebase) but at least it's valid now.

serge-sans-paille marked 5 inline comments as done.Dec 15 2022, 8:56 AM

@nickdesaulniers could you give a shot at this new version? It's sensibly different from the first one.

Extra numbers: this is the result of compiling the codes from compile-time-tracker with -ftrivial-auto-var-init=pattern:

-ftrivial-auto-var-init=none vs -ftrivial-auto-var-init=pattern:

https://llvm-compile-time-tracker.com/compare.php?from=50a1c9b1073d7842ef687e486dc842ffea39457c&to=0cc74fe5d7f455e8dd2a34c4cfd9c276aae9ee57&stat=instructions:u

Adding the transformation from this patch on top of -ftrivial-auto-var-init=pattern adds a very slight overhead:

-ftrivial-auto-var-init=pattern vs -ftrivial-auto-var-init=pattern + this pass:

https://llvm-compile-time-tracker.com/compare.php?from=0cc74fe5d7f455e8dd2a34c4cfd9c276aae9ee57&to=67cd4b64768d74a2335d5268967951558bca3226&stat=instructions:u

Basically as is, this pass doesn't add much overhead in compilation time.

Sorry for the delay in testing this. Do you mind rebasing it? arc patch D137707 is having a hard time applying it to ToT.

The test cases LGTM but I'd like @efriedma to review the memory ordering stuff; I don't understand that stuff as much as I would like to.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
162

Should we take a reference to the std::pair here? auto &Job?

Address review + rebase patch

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
24

are you still using this header? I didn't see any references to any intrinsics. Can you recheck these? DebugInfo.h I would think is also unused.

45

Mind adding a comment for this function?

52–56

You might be able to eliminate this for loop by using llvm::make_pointer_range (llvm/include/llvm/ADT/iterator.h) to initialize the Worklist. You'd probably need to move the cast to the below loop though.

64

Instruction

73–74

Are you able to use a range-for here with MemoryPhi::incoming_values()?

104–105

I think you can just pass successors(UsersDominator)?

134–135

This sentence could be reworded. Looks like it was partially updated at some point?

Take into account @nickdesaulniers review.

serge-sans-paille marked 15 inline comments as done.Feb 2 2023, 6:03 AM

@efriedma : any comment / opinion on this now that it's based on MemorySSA?

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

Unfortunately not, SmallVector::append doesn't accept iterator_range. I'll implement that.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

I was just thinking

for (MemoryAccess *MA : M->incoming_values())
  Worklist.push_back(MA);
serge-sans-paille marked an inline comment as done.Feb 2 2023, 4:28 PM
serge-sans-paille added inline comments.
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

Yeah, but incoming_values iterates over Use* so I'd need a cast, so I'd rather keep current implementation.

@efriedma : gentle ping :-) Any thoughts on this version, now based on MemorySSA?

efriedma added inline comments.Feb 14 2023, 1:20 PM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
72

I'm not sure I understand the PHI handling here... the incoming values of a PHI are essentially original store itself, and other stores that store to the same location. I'm not sure how analyzing the other stores helps here. You want the store to dominate the incoming edge of the PHI, not the other stores.

Fix MemoryPhi handling

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
72

Indeed, I had it the other way around. Fixed in latest revision of the patch.

efriedma added inline comments.Feb 14 2023, 3:09 PM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
66

Is it possible to have a MemoryUseOrDef without an associated instruction? What would that represent?

72

I'm not confident that looking through PHIs like this is actually what you want. A MemoryPHI implies there's a loop that modifies the value in question. Maybe the cycle protection code protects you from running into issues with that, though?

I haven't really worked with MemorySSA much; if someone wants to jump in to review this aspect, that would be welcome. I'm generally happy with the approach using MemorySSA, just not sure about the finer details.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
72

Is there someone who would have worked more with MemorySSA that can provide such review? It's certainly not me.

No need to check if getMemoryInst returns nullptr or not.

serge-sans-paille marked an inline comment as done.Feb 23 2023, 6:50 AM

I'd like to move forward with this, and try to implement a more general approach that considers any instruction, and not only the instruction generated for auto init. But I need to unlock that review first :-)

nikic added a comment.Mar 2 2023, 7:40 AM

High level notes:

  • I'm not sure I fully understand the motivation for using MemorySSA here. Unless I'm missing something, It seems like when it comes to optimizing allocas, it would be sufficient to look at non-transparent users of the alloca (by which I mean, in first approximation, look through bitcast, gep, count everything else). The review mentions fences, but wouldn't these be covered as long as you consider pointer captures as "uses"?
  • This is missing some tests where there are other memory instructions beyond the auto-init stores. You should find that, as implemented, these are going to block the transform even if they are "obviously" unrelated. If they are MemoryUses, you can recover this by requesting optimized use form. However, MemoryDefs always depend on the preceding MemoryDef, even if they are non-clobbering. This means that your optimization will always stop at the next MemoryDef, even if it is to a different alloca. This is a big limitation. You could do alias checks to skip those, but that's also the point where this becomes an expensive transform.
  • There is a pending patch to add sinking support to DSE (https://reviews.llvm.org/D136218). I haven't really looked at it yet, but I expect it does the MSSA-based transform in the right way (i.e. skipping non-clobbering instructions). This brings me back to the first point...
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
37

Use static instead of anon namespace (see https://llvm.org/docs/CodingStandards.html#anonymous-namespaces).

Also please prefer the more typical const Instruction &I spelling.

67

The lifetime intrinsic check is missing test coverage.

76

append_range(WorkList, UsersAsMemoryAccesses)

94

Assert or check that use_empty(), the transform relies on it.

I'd also assert or check that the instruction is not volatile. (Not sure if auto-init can be volatile...)

156

Is it possible for this to be a catchswitch block?

165

Drop braces.

169

verifyMemorySSA() here -- pretty sure there will be verification errors. E.g. you might be moving past unrelated MemoryInsts.

llvm/test/Transforms/MoveAutoInit/branch.ll
5

Shouldn't be relevant.

31

Zero-index GEP is redundant.

llvm/test/Transforms/MoveAutoInit/fence.ll
25

To make this test more meaningful, you probably want to capture %val before the fence? Otherwise the initialization is unobservable.

High level notes:

  • I'm not sure I fully understand the motivation for using MemorySSA here. Unless I'm missing something, It seems like when it comes to optimizing allocas, it would be sufficient to look at non-transparent users of the alloca (by which I mean, in first approximation, look through bitcast, gep, count everything else). The review mentions fences, but wouldn't these be covered as long as you consider pointer captures as "uses"?

I could cowardly let @efriedma answer there, because he motivated this change. You can refer to https://reviews.llvm.org/D137707#3945415 for his initial thoughts.

  • This is missing some tests where there are other memory instructions beyond the auto-init stores. You should find that, as implemented, these are going to block the transform even if they are "obviously" unrelated. If they are MemoryUses, you can recover this by requesting optimized use form. However, MemoryDefs always depend on the preceding MemoryDef, even if they are non-clobbering. This means that your optimization will always stop at the next MemoryDef, even if it is to a different alloca. This is a big limitation. You could do alias checks to skip those, but that's also the point where this becomes an expensive transform.

Yeah, test coverage is currently not sufficient, I'll add some and get a better understanding of that particular part.

  • There is a pending patch to add sinking support to DSE (https://reviews.llvm.org/D136218). I haven't really looked at it yet, but I expect it does the MSSA-based transform in the right way (i.e. skipping non-clobbering instructions). This brings me back to the first point...

Thanks for sharing!

Use Alias Analysis to filter-out non clobbering memory accesses, as suggested by @nikic . Also added a test case (clobber.ll) to ensure this works as expected.

@nikic : I've not addressed your inline comments yet, but I've improved the approach through alias analysis, this indeed looks better now, at least to me. What do you think?

serge-sans-paille marked 12 inline comments as done.

Take into account all @nikic comments.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

Assert or check that use_empty(), the transform relies on it

I'm not quite sure which use you're referring to, can you explain?

serge-sans-paille marked an inline comment as done.
serge-sans-paille edited the summary of this revision. (Show Details)

Update test case according to @nikic review.

Extra round of tests on my side: current version leads to 550 instruction moved when compiling cpython with -ftrivial-auto-var-init=pattern. @nikic I think the patch is ready for you to look at it again :-)

nickdesaulniers accepted this revision.Mar 28 2023, 11:47 AM

I redid quick Linux kernel (mainline, x86_64 and arm64 defconfig, arm64 thinlto) build+boot tests with this version. Clearing my previous -1. Thanks for the work on this patch! You may want to see if @efriedma or @nikic have any last minute thoughts before landing though (within a reasonable delay; we can always fix things after they've landed).

This revision is now accepted and ready to land.Mar 28 2023, 11:47 AM
MaskRay added inline comments.
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
61

To assert non-null, just use a reference.

73

For a work list, we should add elements to Visited before pushing elements to WorkList, otherwise WorkList may contain duplicates.

104

unneeded blank line

138

unneeded blank line

150

unneeded blank line

llvm/test/Transforms/MoveAutoInit/clobber.ll
1 ↗(On Diff #507292)

Add a file-level comment what this test is about.

MaskRay added inline comments.Mar 28 2023, 12:49 PM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
47

delete blank line.

// If I exists and is deeper ...

What does deeper mean? In the loop body, if (AA.getModRefInfo(MI, ML) != ModRefInfo::NoModRef && !MI->isLifetimeStartOrEnd() && MI != I) { does something unclear but does not have a comment.

54

For auto-init instructions, MemIntrinsic is sufficient and AnyMemTransferInst is unneeded.

105

The pass name is MoveAutoInit. Handling just auto-init instructions is what the pass is designed for, so I don't think not generalizing it deserves a FIXME. You can drop FIXME.

Address reviews

nikic added inline comments.Mar 31 2023, 4:14 AM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
59
69
if (Visited.size() > 128)
  return nullptr;

for some value of 128 :)

77

Doesn't this mean we may potentially move the initialization past the lifetime.end?

138

This will miss cases where the dominator is not the cycle header, right? Possible improvement for the future.

172

The common dominator of all the predecessors is probably just the idom?

194

Could you please also add -verify-memoryssa to the tests?

serge-sans-paille marked 4 inline comments as done.

Add threshold option + minor nits

This revision was landed with ongoing or failed builds.Apr 3 2023, 6:28 AM
This revision was automatically updated to reflect the committed changes.

Hi. I think the reland broke our mac builder at https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.x64-host_test_only-mac-subbuild/b8784753709492580801/overview with:

Assertion failed: (!I.isVolatile() && "auto init instructions cannot be volatile."), function runMoveAutoInit, file llvm/lib/Transforms/Utils/MoveAutoInit.cpp, line 112.

Would you be able to take a look and send out a fix or revert?

Hi. I think the reland broke our mac builder at https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.x64-host_test_only-mac-subbuild/b8784753709492580801/overview with:

Assertion failed: (!I.isVolatile() && "auto init instructions cannot be volatile."), function runMoveAutoInit, file llvm/lib/Transforms/Utils/MoveAutoInit.cpp, line 112.

Thanks for reporting—I thought this situation wasn't possible— I'll send a fix that just skips these instructions for now.

bjope added a subscriber: bjope.Apr 4 2023, 12:42 PM
ayzhao added a subscriber: ayzhao.Apr 4 2023, 3:43 PM

Hi. I think the reland broke our mac builder at https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.x64-host_test_only-mac-subbuild/b8784753709492580801/overview with:

Assertion failed: (!I.isVolatile() && "auto init instructions cannot be volatile."), function runMoveAutoInit, file llvm/lib/Transforms/Utils/MoveAutoInit.cpp, line 112.

Thanks for reporting—I thought this situation wasn't possible— I'll send a fix that just skips these instructions for now.

FYI, Chrome also saw this crash as well: https://crbug.com/1430570. I verified that your commit rGad9ad3735c4821ff4651fab7537a75b8f0bb60f8 fixes it.

uabelho added a subscriber: uabelho.Apr 4 2023, 9:58 PM
ayzhao added a comment.EditedApr 8 2023, 9:50 AM

This is also causing another regression in Chrome that is *not* fixed by rGad9ad3735c48: https://crbug.com/1431366#c3

This is also causing another regression in Chrome that is *not* fixed by rGad9ad3735c48: https://crbug.com/1431366#c3

Can you provide a minimal reproducer?

This is also causing another regression in Chrome that is *not* fixed by rGad9ad3735c48: https://crbug.com/1431366#c3

Can you provide a minimal reproducer?

Currently working on it, but I'd like to point out that this is also causing another (unrelated?) failure in Chrome/V8: https://crbug.com/1431489

This is also causing another regression in Chrome that is *not* fixed by rGad9ad3735c48: https://crbug.com/1431366#c3

Can you provide a minimal reproducer?

Currently working on it, but I'd like to point out that this is also causing another (unrelated?) failure in Chrome/V8: https://crbug.com/1431489

I now have a reproducible (but non-reduced) testcase: https://crbug.com/1431366#c5

This looks like a miscompile; the return parameter is not being initialized if we don't take the branch.

ayzhao added a subscriber: hans.Apr 11 2023, 4:26 PM
hans added a comment.Apr 12 2023, 3:56 AM

I now have a reproducible (but non-reduced) testcase: https://crbug.com/1431366#c5

This looks like a miscompile; the return parameter is not being initialized if we don't take the branch.

Here's a small repro based on that:

$ cat /tmp/a.cc
struct S {
  unsigned long long x;  
};

S g();

S f(int a) {
  S ret;
  if (a == 42)
    ret = g();
  return ret;
}

$ build/bin/clang.bad -target i686-linux-gnu -c -ftrivial-auto-var-init=pattern -O2 /tmp/a.cc -S -emit-llvm -o -
[...]
define dso_local void @_Z1fi(ptr noalias nocapture writeonly sret(%struct.S) align 4 %agg.result, i32 noundef %a) local_unnamed_addr #0 {
entry:
  %ref.tmp = alloca %struct.S, align 8
  %cmp = icmp eq i32 %a, 42
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  store i64 -1, ptr %agg.result, align 4, !annotation !6        <------ This used to be in the %entry block.
  call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp) #3
  call void @_Z1gv(ptr nonnull sret(%struct.S) align 4 %ref.tmp)
  %0 = load i64, ptr %ref.tmp, align 8, !tbaa !7
  store i64 %0, ptr %agg.result, align 4, !tbaa !7
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp) #3
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret void
}

This patch moved the store i64 -1, ptr %agg.result instruction from the %entry block to %if.then, meaning the return value doesn't always get initialized.

nikic added a comment.Apr 12 2023, 4:07 AM

The assumption of this patch was that auto-init only occurs on stores to allocas, but apparently that assumption isn't right. We of course shouldn't move a store to an sret argument.

hans added a comment.Apr 12 2023, 4:37 AM

It's actually worse than my repro above showed. Since "real" initialization of the sret arg can get folded in with the "auto" initialization, this patch can also cause an actually initialized return value to become uninitialized: https://bugs.chromium.org/p/chromium/issues/detail?id=1431366#c8

I'll revert this for now.

It's actually worse than my repro above showed. Since "real" initialization of the sret arg can get folded in with the "auto" initialization, this patch can also cause an actually initialized return value to become uninitialized: https://bugs.chromium.org/p/chromium/issues/detail?id=1431366#c8

I'll revert this for now.

This is from clang --target=i686-linux-gnu. @serge-sans-paille Testing -m32 applications will create a lot of sret opportunities (structs are unions are indirect, even if small) and sort out such bugs.