
[SimplifyCFGPass] Tail-merging function-terminating blocks
Changes Planned · Public

Authored by lebedev.ri on Jun 17 2021, 3:07 AM.

Details

Summary

Currently, we only tail-merge `ret` function terminators,
which seems somewhat overly pessimistic.

This change is meant to be a building block towards sinking more
common code from function terminators (and yes, I'm interested
specifically in doing that for function terminators, not in common
block folding in general), potentially de-penalizing inlining.
So this is expected to have a negative impact on compile time,
and potentially on code size, but as we established in D101468,
we consider that to be a win.
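To make the intent concrete, here is a minimal illustrative sketch (not taken from the patch) of the kind of transform being discussed: two identical function-terminating blocks, here ending in `unreachable`, folded into one.

```llvm
; Illustrative only. Before: two blocks both call a noreturn function
; and end in unreachable.
define void @before(i1 %c) {
entry:
  br i1 %c, label %a, label %b
a:
  call void @abort()
  unreachable
b:
  call void @abort()
  unreachable
}

; After tail-merging the identical function-terminating blocks:
define void @after(i1 %c) {
entry:
  br i1 %c, label %a, label %b
a:
  br label %common
b:
  br label %common
common:
  call void @abort()
  unreachable
}

declare void @abort()
```

The existing behavior only performs this kind of merge for blocks ending in `ret`; the discussion below is about extending it to the other function terminators.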

I haven't updated all the affected codegen tests,
because they are manually written, and updating them would be
a somewhat painful and long process.

What do people think about this?
One thought I have is that perhaps we may actually want to do
an inverse fold at the end of SimplifyCFG; I have that change locally,
but I had to post some change first :)

Diff Detail

Unit Tests: Failed

Time    Test
40 ms   x64 debian > LLVM.CodeGen/AArch64::arm64-instruction-mix-remarks.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=arm64-apple-ios7.0 -pass-remarks-output=/var/lib/buildkite-agent/builds/llvm-project/build/test/CodeGen/AArch64/Output/arm64-instruction-mix-remarks.ll.tmp -pass-remarks=asm-printer -o - /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/arm64-instruction-mix-remarks.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/arm64-instruction-mix-remarks.ll
50 ms   x64 debian > LLVM.CodeGen/AArch64::br-cond-not-merge.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=aarch64 -verify-machineinstrs < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/br-cond-not-merge.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK --check-prefix=OPT /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/br-cond-not-merge.ll
30 ms   x64 debian > LLVM.CodeGen/AArch64::branch-relax-alignment.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=aarch64-apple-darwin -aarch64-bcc-offset-bits=4 -align-all-nofallthru-blocks=4 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-alignment.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-alignment.ll
40 ms   x64 debian > LLVM.CodeGen/AArch64::branch-relax-asm.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=aarch64-apple-ios7.0 -disable-block-placement -aarch64-tbz-offset-bits=4 -o - /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-asm.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-asm.ll
40 ms   x64 debian > LLVM.CodeGen/AArch64::branch-relax-bcc.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=aarch64-apple-darwin -aarch64-bcc-offset-bits=3 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-bcc.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/branch-relax-bcc.ll
View Full Test Results (33 Failed)

Event Timeline

lebedev.ri created this revision.Jun 17 2021, 3:07 AM
lebedev.ri requested review of this revision.Jun 17 2021, 3:07 AM

I don't have any real basis to say whether this step is good. It would help to point to, or show, an example where we eventually win from this change plus the inverse, plus a difference in inlining.
This seems like a potentially large-scope change, so it may be worth posting to llvm-dev to see if anyone else has experience/opinions.

xbolva00 added inline comments.
llvm/test/CodeGen/AArch64/use-cr-result-of-dom-icmp-st.ll
22

Does not look very profitable.

lebedev.ri added subscribers: hans, rnk.

Oh, I forgot to CC the people who tried this last :D
@rnk @hans - ping. This is tracking towards the very same end goal as your D29428, but in a more generic way.

rnk added a comment.Jun 18 2021, 10:30 AM

I stopped pursuing D29428 for a few reasons:

  • it makes IR smaller, which causes more inlining, which regresses code size, the opposite of my goal =P
  • it removes debug info
  • it makes it harder to form regions later in the backend

Merging blocks that end in unreachable can make things like stack variable lifetime analysis harder. This example surprised me and didn't illustrate my point, but it's still worth thinking about. Consider:

$ cat t.cpp 
struct LargeObject {
  int a[1024];
  void escapeObjectA();
  void escapeObjectB();
};
bool cond();
void __attribute__((noreturn)) assertFail();
void foo() {
  if (cond()) {
    LargeObject o1;
    o1.escapeObjectA();
    if (cond())
      assertFail();
  } else {
    LargeObject o2;
    o2.escapeObjectB();
    if (cond())
      assertFail();
  }
}

If you merge together the calls to assertFail, that creates BBs where o1 and o2 are both live. This could confuse the stack coloring pass. Somehow, however, it seems to make no difference. I modified the IR to merge the blocks, and we still get the same stack size:

$ clang -S  -O2 t.cpp  -o - |& grep sub.*rsp
        subq    $4104, %rsp                     # imm = 0x1008

$ clang -S -emit-llvm -O2 t.cpp  -o t.ll

# Manually merge noreturn blocks in t.ll

$ llc t.ll  -o - | grep sub.*rsp
        subq    $4104, %rsp                     # imm = 0x1008

So, maybe it is fine. We merge the MBBs at the MI layer anyway, but that could happen after assigning stack offsets.
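The manual merge described above might look roughly like the following IR fragment. This is an illustrative sketch with made-up block names, not the exact IR clang emits for t.cpp.

```llvm
; Before the edit, each branch of foo() has its own assertFail block.
; After merging, a single block is reachable from both sides:
if.then.fail:                ; o1 is live here
  br label %fail
if.else.fail:                ; o2 is live here
  br label %fail
fail:                        ; reachable while both o1 and o2 may be live
  call void @assertFail()
  unreachable
```

The concern is that the shared `%fail` block creates a program point where both objects' lifetimes overlap, which could in principle defeat stack coloring, even though in this particular experiment the stack size was unchanged.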

GPU targets, wasm, and Windows EH preparation may need to undo this type of region-breaking transform late in the backend, and if they do, the only thing this transform is doing is throwing away debug info.

If you limit this change to return blocks, that should completely address most regional concerns. I would also like to see this change for noreturn / unreachable blocks, but I think it's a longer path from here to there.

Thank you for taking a look.
Thinking about it, this change should be split up into NFC refactor,
removal of "block must be empty" check,
and several patches to enable each terminator opcode.

I stopped pursuing D29428 for a few reasons:

  • it makes IR smaller, which causes more inlining, which regresses code size, the opposite of my goal =P

I have mentioned that in the description - as we've now established, that is expected and is an improvement.

  • it removes debug info

This change certainly doesn't remove debug info, and I'm not sure whether the generic sinking code is debug-info-lossy.

  • it makes it harder to form regions later in the backend

GPU targets, wasm, and Windows EH preparation may need to undo this type of region-breaking transform late in the backend, and if they do, the only thing this transform is doing is throwing away debug info.

Yeah, no idea what any of this means, so i can't really comment on this point.

If you limit this change to return blocks, that should completely address most regional concerns.

By "return blocks", you literally mean blocks that end with ret, but neither resume nor unreachable, correct?

I would also like to see this change for noreturn / unreachable blocks, but I think it's a longer path from here to there.

Could you please be more specific about that path? I think i'm most interested in resume, actually.

I'm asking because I would like to see this change happen, and I don't believe it to be a bad change in general, but the comments so far have seemed rather dismissive, and it looks like this might end up following in the footsteps of rG13ec913, where a single platform penalizes/dictates behavior for all other platforms.

lebedev.ri planned changes to this revision.Jun 19 2021, 2:27 PM

That being said, let's try this in smaller steps, D104598/D104597 being the first one, handling only the ret.

rnk added a comment.Jun 21 2021, 12:54 PM

Thank you for taking a look. Thinking about it, this change should be split up into an NFC refactor, removal of the "block must be empty" check, and several patches to enable each terminator opcode.

Sounds good, I am supportive of the long term direction.

  • it removes debug info

This change certainly doesn't remove debug info, and I'm not sure whether the generic sinking code is debug-info-lossy.

I don't see how it is possible for this transform to preserve source location information. See this example, where the call to foo ends up with a "line 0" location, presumably because of simplifycfg:
https://gcc.godbolt.org/z/foqfT7aP4
We don't have a way to represent a "merged" source location, despite the name of DILocation::getMergedLocation.
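In IR terms, the loss looks roughly like this (an illustrative fragment, not the exact output from the godbolt link): the merged call carries an artificial "line 0" `DILocation` because neither original location can be kept.

```llvm
; Illustrative only. Two calls at source lines 5 and 9 are merged into
; one block; the surviving call cannot keep either location, so it gets
; a line-0 location at the nearest common scope:
common:
  call void @assertFail(), !dbg !42
  unreachable

; !42 = !DILocation(line: 0, scope: !...)
```

A debugger stepping onto the merged call can no longer tell which of the two original source lines is executing.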

  • it makes it harder to form regions later in the backend

GPU targets, wasm, and Windows EH preparation may need to undo this type of region-breaking transform late in the backend, and if they do, the only thing this transform is doing is throwing away debug info.

Yeah, no idea what any of this means, so i can't really comment on this point.

I think the easiest and most relevant example is WebAssembly (wasm). This virtual target does not have anything like goto, so all control flow must be restructured into standard high level control flow constructs. I believe this is mostly implemented in this pass:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp
Maybe the paper abstract describes it better:
https://dl.acm.org/doi/10.1145/2048147.2048224

GPU targets have similar requirements, but GPU programs don't usually contain assertions, so this is less interesting.

Windows exception handling (WinEH) has similar requirements. This is the code that clones blocks to ensure that every block is only reachable from one function or funclet entry block:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/WinEHPrepare.cpp#L744

I think what I'm saying can be summarized as "the middle end shouldn't perform transforms which the backend must heroically undo", and there needs to be some kind of TTI hook to control this.

If you limit this change to return blocks, that should completely address most regional concerns.

By "return blocks", you literally mean blocks that end with ret, but neither resume nor unreachable, correct?

Yes.

I would also like to see this change for noreturn / unreachable blocks, but I think it's a longer path from here to there.

Could you please be more specific about that path? I think i'm most interested in resume, actually.

I think we could extend this to resume without much issue. The DwarfEHPrepare pass already rewrites them into branches to a single block that calls _Unwind_Resume.
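For reference, the rewrite DwarfEHPrepare performs has roughly this shape (an illustrative sketch, not the pass's exact output):

```llvm
; Illustrative: multiple resume instructions become branches to one
; shared block that extracts the exception pointer and calls
; _Unwind_Resume.
lpad1:
  %e1 = landingpad { ptr, i32 } cleanup
  br label %unwind.resume
lpad2:
  %e2 = landingpad { ptr, i32 } cleanup
  br label %unwind.resume
unwind.resume:
  %exn = phi { ptr, i32 } [ %e1, %lpad1 ], [ %e2, %lpad2 ]
  %exn.ptr = extractvalue { ptr, i32 } %exn, 0
  call void @_Unwind_Resume(ptr %exn.ptr)
  unreachable
```

Since the backend already funnels all `resume`s through one block, merging them earlier in SimplifyCFG would not break anything the backend relies on.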

That being said, let's try this in smaller steps, D104598/D104597 being the first one, handling only the ret.

Sounds good.

ormris removed a subscriber: ormris.Jun 23 2021, 9:38 AM
arsenm added a subscriber: arsenm.Jun 24 2021, 12:23 PM

GPU targets, wasm, and Windows EH preparation may need to undo this type of region-breaking transform late in the backend, and if they do, the only thing this transform is doing is throwing away debug info.

For AMDGPU we generally need to converge to a single return block anyway. We end up trying to merge divergent return-like things into a single exit point (e.g. AMDGPUUnifyDivergentExitNodes), much like this patch is doing.
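That unification has roughly this shape (an illustrative sketch; the unified block name follows the convention used by LLVM's exit-unification passes, but the fragment is not the pass's literal output):

```llvm
; Illustrative: divergent returns funneled through one exit block.
; Before:
;   a: ret void
;   b: ret void
; After:
a:
  br label %UnifiedReturnBlock
b:
  br label %UnifiedReturnBlock
UnifiedReturnBlock:
  ret void
```

So for AMDGPU, tail-merging function-terminating blocks in the middle end moves in the same direction the backend already has to go.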