Page MenuHomePhabricator

Move "auto-init" instructions to the dominator of their users
Needs ReviewPublic

Authored by serge-sans-paille on Nov 9 2022, 5:11 AM.

Details

Summary

As a result of -ftrivial-auto-var-init, clang generates instructions to set alloca'd memory to a given pattern, right after the allocation site. In some cases, this (somehow costly) operation could be delayed, leading to conditional execution in some cases.

This is not an uncommon situation: it happens ~600 times on the cPython code base, and much more on the LLVM codebase. The benefit greatly varies on the execution path, but it should not regress on performance.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
serge-sans-paille retitled this revision from [WIP] Move "auto-init" instruction to the dominator of their users to [WIP] Move "auto-init" instructions to the dominator of their users.Nov 9 2022, 5:14 AM
nickdesaulniers requested changes to this revision.Nov 10 2022, 11:47 AM

No comment on the approach. Minor drive by comments on style. But I'd like to see IR tests added for this pass. I haven't run it yet on the Linux kernel but can do so to provide measurements.

llvm/include/llvm/InitializePasses.h
201 ↗(On Diff #474231)

drop this hunk

llvm/include/llvm/LinkAllPasses.h
116 ↗(On Diff #474231)

drop this hunk

139 ↗(On Diff #474231)

add space before cast

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
50

delete

73

auto

81

delete.

See comment below about ternaries, which also applies here.

126

delete

130

delete

151–155

Please use

if (x)
  y()
else
  z()

rather than

if (!x)
  z()
else
  y()

Additionally, since we're assigning to the same variable, consider using a ternary statement.

DominatingPredecessor = (DominatingPredecessor ? DT.find... : Pred)
This revision now requires changes to proceed.Nov 10 2022, 11:47 AM

This looks overly specific.
Is this not a yet another manifestation of the lack of a generic sinking pass?

This looks overly specific.
Is this not a yet another manifestation of the lack of a generic sinking pass?

I'd appreciate pointer to the literature on "sinking pass". That being said, it's pretty clear to me that this pass generalizes far beyond "auto-init". There's an advantage in auto-init though: we can work under the assumption of "no aliasing" which helps a lot. I'm interested in making that pass more generic, probably as a second step though.

No comment on the approach. Minor drive by comments on style. But I'd like to see IR tests added for this pass. I haven't run it yet on the Linux kernel but can do so to provide measurements.

+1 for the tests and thanks for the review. On Firefox codebase, we get decent speedups compated to raw -ftrivial-auto-var-init. Some more data here: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=77549e4eef51c70336d2ca9e3d086bf7767f8196&newProject=try&newRevision=b9fc0eb0e1e29c37364d69a9da203c00e87df5b2&page=1&framework=13&showOnlyConfident=1 where "Base is without this commit and New with it" The benchmarks are to be taken with a grain of salt though, they are not terribly stable.

serge-sans-paille marked 9 inline comments as done.

Fix nits and address reviews
Fix bug when looking for unique predecessor
Add basic tests

@nickdesaulniers I'm really curious about the impact on ClangBuiltLinux

I tested D137707 Diff 475770 against an x86_64 defconfig build of the Linux kernel (commit 81ac25651a62 ("Merge tag 'nfsd-6.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")), which sets CONFIG_INIT_STACK_ALL_ZERO=y. Booted fine.

$ du -h vmlinux.orig vmlinux.D137707.475770
63M	vmlinux.orig
63M	vmlinux.D137707.475770
$ bloaty vmlinux.D137707.475770 -- vmlinux.orig
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +0.0%    +336  [ = ]       0    .rela.orc_unwind_ip
  +0.0%    +144  [ = ]       0    .rela.text
  +2.6%    +141  +2.6%    +141    [LOAD #3 [RWX]]
  +0.0%     +84  +0.0%     +84    .orc_unwind
  +0.0%     +56  +0.0%     +56    .orc_unwind_ip
  +0.0%     +24  [ = ]       0    .rela.smp_locks
  +0.3%      +3  [ = ]       0    .shstrtab
  -0.0%     -11  [ = ]       0    .strtab
  -0.0%     -24  [ = ]       0    .rela.return_sites
  -0.0%     -24  [ = ]       0    .symtab
  -8.8%    -140  -8.8%    -140    [LOAD #1 [RW]]
  -0.0%    -141  -0.0%    -141    .init.text
  +0.0%    +448  [ = ]       0    TOTAL
$ llvm-readelf -S vmlinux.orig
...
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [ 1] .text             PROGBITS        ffffffff81000000 200000 10032cb 00  AX  0   0 4096
...
$ llvm-readelf -S vmlinux.D137707.475770
...
  [ 1] .text             PROGBITS        ffffffff81000000 200000 10032cb 00  AX  0   0 4096

So it appears that this patch made no difference to the size of the .text section. I triple checked this w/ and w/o D137707 applied.

When I build with make LLVM=1 -j128 KCFLAGS="-mllvm -stats", then process the stats from move-auto-init, I observe 49810 moves! Feel free to add that measurement into the commit description.

$ make LLVM=1 -j128 KCFLAGS="-mllvm -stats" &> log.txt
$ grep move-auto-init log.txt | tr -s ' ' | cut -d ' ' -f 2 | python3 -c "import sys; print(sum((float(l) for l in sys.stdin)))"
49810.0
# Triple check with sed+bc rather than python3
$ grep move-auto-init log.txt | tr -s ' ' | cut -d ' ' -f 2 | sed ':a;N;s/\n/+/;ta' |bc    
49810

Please let me know if there's any other measurements you'd like me to make.

Also, it might be nice if the tests demonstrated diffs against existing (or precommitted changes) to better demonstrate how this pass changes the generated code. Mind breaking the newly added tests into 2 patches:

  1. child patch that adds them BEFORE this patch
  2. rebase this patch on that to demonstrate how this patch differs?
llvm/lib/Passes/PassBuilderPipelines.cpp
981

Is it possible to skip this pass if -ftrivial-auto-var-init=zero wasn't specified?

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
93

is there a faster way to do this rather than having to scan the metadata of every operand?

I'd appreciate pointer to the literature on "sinking pass".

Look for PRE (partial redundancy elimination) of stores or partial dead store elimination. Maybe start with https://dl.acm.org/doi/10.1145/277650.277659 . See also D29865 .

So it appears that this patch made no difference to the size of the .text section.

This isn't fundamentally surprising; you have the same number of stores before and after sinking.

I'd appreciate pointer to the literature on "sinking pass".

Look for PRE (partial redundancy elimination) of stores or partial dead store elimination. Maybe start with https://dl.acm.org/doi/10.1145/277650.277659 . See also D29865 .

Thanks.

So it appears that this patch made no difference to the size of the .text section.

This isn't fundamentally surprising; you have the same number of stores before and after sinking.

That's my understanding too. @nickdesaulniers does this patch has an impact on some kernel benchmark, maybe one where -ftrivial-auto-var-init brought some slowdown?

llvm/lib/Passes/PassBuilderPipelines.cpp
981

It would be possible through a module flag or something, but

  1. if we're going to extend that pass beyond auto-var-init, that's going to be useless
  2. that would mean a greater coupling with the front end

The following shows no big impact on compile time when the flag is not used, so I don't think it's worth the extra step

https://llvm-compile-time-tracker.com/compare.php?from=f71d32a0eea47b3d2bb43d6be15cf09d47ef6971&to=6ea450ef673666ed6b8a8257ca56f093aae7469c&stat=instructions:u

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
93

Well, there's an early exit on the instruction metadata, I guess that's enough?

aeubanks added inline comments.
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
204

any reason for all the legacy pass infra? if it's going in the optimization pipeline then no need for a legacy pass

Remove legacy pass manager support
Minor nits

@nickdesaulniers I don't think it makes sense to update the test cases : as they only run the considered pass, the diff before/after is pretty clear based on the CHECK: lines.

I personnaly consider this ready to land :-) I'm obviously open to changes.

serge-sans-paille retitled this revision from [WIP] Move "auto-init" instructions to the dominator of their users to Move "auto-init" instructions to the dominator of their users.Nov 22 2022, 2:00 PM
jfb added a comment.Nov 22 2022, 4:16 PM

Are there any worries about moving stores that are on atomic/volatile "objects", either because the auto-init store itself is atomic/volatile, or because a subsequent access is marked as such? I don't think there's a worry because auto-init stores are technically outside the abstract machine (they "don't exist"), but I'm not 100% convinced.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
120

Typo "want"

Are there any worries about moving stores that are on atomic/volatile "objects", either because the auto-init store itself is atomic/volatile, or because a subsequent access is marked as such? I don't think there's a worry because auto-init stores are technically outside the abstract machine (they "don't exist"), but I'm not 100% convinced.

The thing I'd be concerned about is sinking the store past an atomic fence.

Allocating/initializing an object is never atomic. The requirement of the abstract machine is that the allocation/initialization of the variable has to have a happens-before relationship with any access to it. Consider something like the following:

int x = 3;
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y, &x, memory_order_relaxed);

After the atomic store, it's legal for another thread to access the value "3" through the pointer y, or to write another value to x. (The operations on x don't need to be atomic, as long as the operations on y form a happens-before edge.) So it's illegal to sink the store of "3" past the fence.

If instead x is uninitialized, you run into basically the same thing with implicit zero init: it's illegal to sink the implicit zero-init of x past the fence, because it could race with stores on another thread.

Thanks @efriedma for the (frightening) answer.

If instead x is uninitialized, you run into basically the same thing with implicit zero init: it's illegal to sink the implicit zero-init of x past the fence, because it could race with stores on another thread.

As x is a stack allocated variable, in order to have it shared with another thread (and potentially generating race), we need to *use* it, which implies the initialization already happens. Why does it matter it's done before or after the fence?

If I understand correctly, the fence is global, which means basically any function call we don't have strong guarantee on could imply a memory fence, and prevent the sink, correct?

As x is a stack allocated variable, in order to have it shared with another thread (and potentially generating race), we need to *use* it, which implies the initialization already happens.

To share a variable with another thread, you need to use its address. That isn't directly connected to the memory that contains the variable itself. Without the fence, neither the compiler nor the CPU itself are aware that the initialization of the variable has to be visible to other CPUs before the address of the variable.

If I understand correctly, the fence is global, which means basically any function call we don't have strong guarantee on could imply a memory fence, and prevent the sink, correct?

I think so? I don't see any way to avoid that conclusion.

I guess in theory, you could insert a fence after the zero-initialization to solve the issue: all fences are equivalent, so you don't need the original fence if you insert another one. Not completely sure that works, though, and not sure what effects the extra fence would have.

If you can prove the address doesn't escape, then the whole multi-threaded thing becomes irrelevant, but I guess we SROA most of those cases anyway, so not sure how helpful that is.

Thanks @efriedma for the clarification. I understand your arguments and I think they apply to generic code motion, but I also think we have strong enough hypothesis for this particular pass to not be concerned by these issue.

If you can prove the address doesn't escape, then the whole multi-threaded thing becomes irrelevant, but I guess we SROA most of those cases anyway, so not sure how helpful that is.

I think we are in that situation

I think that the arguments are:

  1. we only move initializer marked as auto-init and whose argument is an alloca.
  2. we compute the dominator of the sets of the uses of this alloca (excluding the memory initialization marked as "auto-init")
  3. we move the memory initialization marked as "auto-init" before that dominator

As a consequence of 2. and 3. there cannot be sharing of the initialized memory before it is initialized, so we can safely cross a memory barrier (and we're not breaking an existing semantic as the auto-init is inserted by the compiler).

Of course all this would collapse without this strong no-sharing assumptions.

Am I missing something?

we move the memory initialization marked as "auto-init" before that dominator

The problem with this is that "before" in the sense of dominators isn't sufficient; that only dictates ordering on the CPU the code is executing on. You need happens-before, i.e. the ordering visible to all CPUs, to ensure you don't have a race.

The user is responsible for ensuring there's a barrier between the allocation of the variable and the use, but if the auto-init is in a different place, you can't depend on the barrier the user wrote.

Take into account memory fences, call sites etc so that the instruction move respects memory order enforced by memory barriers.

This is achieved through MemorySSA, which makes the pass effortlessly more generic.

There a re still a few test to be done on my side, but I'd love to have feedback from @efriedma on the approach first.

The approach seems reasonable.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

Can we avoid iterating over the function if it doesn't have any auto-init?

llvm/test/Other/new-pm-defaults.ll
109

It would be good to avoid introducing an additional run of MemorySSA.

fhahn added a subscriber: fhahn.Dec 5 2022, 3:15 PM
fhahn added inline comments.
llvm/lib/Passes/PassBuilderPipelines.cpp
983

What's the rational behind placing it here and not closer to other memory optimizations like DSE/MemCpyOpt?

983

(moving it closer to those will also allow re-using existing memorySSA)

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

It's not clear to me from the FIXME why we would want to limit this to auto-init instructions. Shouldn't this be beneficial for all memory instructions?

183

This pass moves only memory instructions, so could it be marked as preserving all CFG analyses?

This should ideally also preserve MemorySSA.

llvm/test/Transforms/MoveAutoInit/branch.ll
6

does this depend on the triple? If it does, this needs REQUIRES:... otherwise it will fail if the X86 backend is not built. (same for all tests)

it'd be good to put this through llvm-compile-time-tracker. a branch with a commit that turns on -ftrivial-auto-var-init, then this commit, to see how much this adds to compile time when using -ftrivial-auto-var-init

it'd be interesting to remove the "auto-init" restriction and see what effects on compile time this has in general, and maybe benchmark results

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
25–26

can remove these now that there's no legacy pass

104–105

this sentence doesn't make sense to me

113

want

llvm/test/Transforms/MoveAutoInit/branch.ll
10

test nit: could you remove dso_local/noundef

llvm/lib/Passes/PassBuilderPipelines.cpp
983

+1 for that. The previous implementation was only using Dominator trees so it made sense to activate it early in the pipeline. This version is sensibly more costly, so it will propably appear at higher optimization level.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

Can we avoid iterating over the function if it doesn't have any auto-init?

I could try a module pass and walk through the users of the auto-init metadata?

94

The algorithm beyond auto-init requires more care (it probably needs a reverse transversal of the CFG, and some optimization to avoid rebuilding the dominator of the users for each instruction). I'd rather start small with that pass and then upgrade it to a full-fledge version.

fhahn added inline comments.Dec 6 2022, 2:19 AM
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
94

AFAIK the !annotation metadata at the moment is information-only and doesn't provide any semantic guarantees. So relying on it for correctness at the moment doesn't sound ideal. From the documentation in the pass, it is not clear what property you are relying on. Could you make this clear in both the documentation in the pass and the commit description?

  • Move the pass to the O2 pipeline and update test accordingly
  • Declare preserved analyses, although I'm not sure of the validity for MemorySSA

The new pass catches much less case than the previous one (~150 on the cPython codebase) but at least it's valid now.

serge-sans-paille marked 5 inline comments as done.Dec 15 2022, 8:56 AM

@nickdesaulniers could you give a shot at this new version? It's sensibly different from the first one.

Extra numbers: this is the result of compiling the codes from compile-time-tracker with -ftrivial-auto-var-init=pattern:

-ftrivial-auto-var-init=none vs -ftrivial-auto-var-init=pattern:

https://llvm-compile-time-tracker.com/compare.php?from=50a1c9b1073d7842ef687e486dc842ffea39457c&to=0cc74fe5d7f455e8dd2a34c4cfd9c276aae9ee57&stat=instructions:u

Adding the transformation from this patch on top of -ftrivial-auto-var-init=pattern adds a very slight overhead:

-ftrivial-auto-var-init=pattern vs -ftrivial-auto-var-init=pattern + this pass:

https://llvm-compile-time-tracker.com/compare.php?from=0cc74fe5d7f455e8dd2a34c4cfd9c276aae9ee57&to=67cd4b64768d74a2335d5268967951558bca3226&stat=instructions:u

Basically as is, this pass doesn't add much overhead in compilation time.

Sorry for the delay in testing this. Do you mind rebasing it? arc patch D137707 is having a hard time applying it to ToT.

The test cases LGTM but I'd like @efriedma to review the memory ordering stuff; I don't understand that stuff as much as I would like to.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
162

Should we take a reference to the std::pair here? auto &Job?

Address review + rebase patch

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
24

are you still using this header? I didn't see any references to any intrinsics. Can you recheck these? DebugInfo.h I would think is also unused.

45

Mind adding a comment for this function?

52–56

You might be able to eliminate this for loop by using llvm::make_pointer_range (llvm/include/llvm/ADT/iterator.h) to initialize the Worklist. You'd probably need to move the cast to the below loop though.

64

Instruction

73–74

Are you able to use a range-for here with MemoryPhi::incoming_values()?

104–105

I think you can just pass successors(UsersDominator)?

134–135

This sentence could be reworded. Looks like it was partially updated at some point?

Take into account @nickdesaulniers review.

serge-sans-paille marked 15 inline comments as done.Thu, Feb 2, 6:03 AM

@efriedma : any comment / opinion on this now that it's based on MemorySSA?

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

Unfortunately not, SmallVector::append doesn't accept iterator_range. I'll implement that.

llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

I was just thinking

for (MemoryAccess *MA : M->incoming_values())
  Worklist.push_back(MA);
serge-sans-paille marked an inline comment as done.Thu, Feb 2, 4:28 PM
serge-sans-paille added inline comments.
llvm/lib/Transforms/Utils/MoveAutoInit.cpp
73–74

Yeah, but incoming_values iterates over Use* so I'd need a cast, so I'd rather keep current implementation.