This is an archive of the discontinued LLVM Phabricator instance.

X86: Fold tail calls into conditional branches where possible (PR26302)
ClosedPublic

Authored by hans on Aug 31 2016, 3:59 PM.

Details

Summary

When branching to a block that immediately tail calls, it is possible to fold the call directly into the branch if the call is direct and there is no stack adjustment, saving one byte.

Example:

define void @f(i32 %x, i32 %y) {
entry:
  %p = icmp eq i32 %x, %y
  br i1 %p, label %bb1, label %bb2
bb1:
  tail call void @foo()
  ret void
bb2:
  tail call void @bar()
  ret void
}

before:

f:
        movl    4(%esp), %eax
        cmpl    8(%esp), %eax
        jne     .LBB0_2
        jmp     foo
.LBB0_2:
        jmp     bar

after:

f:
        movl    4(%esp), %eax
        cmpl    8(%esp), %eax
        jne     bar
.LBB0_1:
        jmp     foo

I don't expect any significant size savings from this (on a Clang bootstrap I saw 288 bytes), but it does make the code a little tighter.

This patch only handles 32-bit; 64-bit would work similarly.

Diff Detail

Repository
rL LLVM

Event Timeline

hans updated this revision to Diff 69917.Aug 31 2016, 3:59 PM
hans retitled this revision from to X86: Fold tail calls into conditional branches where possible (PR26302).
hans updated this object.
hans added reviewers: mkuper, rnk.
hans added a reviewer: maksfb.
hans added a subscriber: llvm-commits.

+Maksim, who I think mentioned at Euro-LLVM that he'd looked into this.

hans updated this revision to Diff 69918.Aug 31 2016, 4:10 PM

forgot about the debug info

joerg added a subscriber: joerg.Aug 31 2016, 5:49 PM

If there are multiple users of the tail call (e.g. via different incoming edges), can't this actually create a size regression by replacing a byte-immediate jump with a 32-bit jump?

hans added a comment.Aug 31 2016, 5:56 PM

If there are multiple users of the tail call (e.g. via different incoming edges), can't this actually create a size regression by replacing a byte-immediate jump with a 32-bit jump?

The transformation only occurs when there's a single edge to the tail call.
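As a minimal sketch, with a hypothetical helper name (the actual check happens in the generic branch-folding code):

// Only fold when the tail-call block has exactly one incoming edge; any
// other predecessor would otherwise lose its short jump to the shared block.
static bool hasSingleIncomingEdge(const MachineBasicBlock &Succ) {
  return Succ.pred_size() == 1;
}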

maksfb edited edge metadata.Sep 1 2016, 2:05 PM

@hans: I think it is safe to do this optimization when optimizing for size. If you do it by default, it could be a performance regression if you reverse the direction of the conditional branch; that's what we found while testing on a micro-benchmark. I believe it's related to the way the BTB works on Intel's CPUs. The problem is that unless you are doing LTO or working at the binary level, you don't always know the new direction of the branch.

mkuper edited edge metadata.Sep 1 2016, 3:47 PM

Don't know a lot about tail call handling, so take the review with a grain of salt.
(The large number of new instructions seems unfortunate, but, as above, I don't know enough about this to say whether it's avoidable...)

@hans: I think it is safe to do this optimization when optimizing for size. If you do it by default, it could be a performance regression if you reverse the direction of the conditional branch; that's what we found while testing on a micro-benchmark. I believe it's related to the way the BTB works on Intel's CPUs. The problem is that unless you are doing LTO or working at the binary level, you don't always know the new direction of the branch.

Is there a reason to believe the change in direction will cause worse performance, systematically?
Or is this just a random effect on that particular microbenchmark, and we may gain performance in other cases?

include/llvm/Target/TargetInstrInfo.h
1103 ↗(On Diff #69917)

Perhaps add an assert that we never get here?
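Something like this sketch of the default hook (assuming the final signature; exact parameter types may differ):

virtual void replaceBranchWithTailCall(MachineBasicBlock &MBB,
                                       SmallVectorImpl<MachineOperand> &Cond,
                                       const MachineInstr &TailCall) const {
  // Targets that return true from canMakeTailCallConditional() must
  // override this; reaching the default implementation is a bug.
  llvm_unreachable("target didn't implement replaceBranchWithTailCall!");
}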

lib/Target/X86/X86InstrInfo.cpp
3964 ↗(On Diff #69918)

What do you think about:

if (BranchCond[0].getImm() > X86::LAST_VALID_COND)
  return false;

Are there any valid condition codes you don't support?
Admittedly, the current version is safer, but I'm not sure there's a point to future-proofing this against X86 adding condition codes.

3986 ↗(On Diff #69918)

Maybe check this before checking the condition? (This would seem to be the most common reason for failure)

4015 ↗(On Diff #69918)

Shouldn't I be contained in BranchCond?

4016 ↗(On Diff #69918)

Can you explain what guarantees this?
I didn't see a check in canMakeTailCallConditional().

test/CodeGen/X86/conditional-tailcall.ll
20 ↗(On Diff #69917)

Why do we need an encoding check?
(Probably better document this in the test itself, too.)

maksfb added a comment.Sep 1 2016, 4:55 PM

Don't know a lot about tail call handling, so take the review with a grain of salt.
(The large number of new instructions seems unfortunate, but, as above, I don't know enough about this to say whether it's avoidable...)

@hans: I think it is safe to do this optimization when optimizing for size. If you do it by default, it could be a performance regression if you reverse the direction of the conditional branch; that's what we found while testing on a micro-benchmark. I believe it's related to the way the BTB works on Intel's CPUs. The problem is that unless you are doing LTO or working at the binary level, you don't always know the new direction of the branch.

Is there a reason to believe the change in direction will cause worse performance, systematically?
Or is this just a random effect on that particular microbenchmark, and we may gain performance in other cases?

It's hard to say; it will depend on the application. If we assume that the original branch direction was selected intentionally (based on compiler hints, profile, or otherwise), then we probably shouldn't expect a gain from reversing the direction. And in the micro-benchmark, the negative effect of the reversal severely outweighed the benefit from fewer instructions executed when the direction was the same. I can't speculate beyond the micro-benchmark, as we weren't able to measure the effect on real-world applications beyond applying it to our jitted code in HHVM, but there we could guarantee the branch direction stayed the same.

hans updated this revision to Diff 70480.Sep 6 2016, 3:40 PM
hans edited edge metadata.
hans marked 2 inline comments as done.

Addressing comments, doing optsize only and avoiding adding so many instructions.

I'm actually a little confused by the multiple levels of pseudo-instructions here. We have TCRETURNdi, which is expanded in X86ExpandPseudo to a TAILJMPd, but that in turn is replaced by a JMP_1 when it comes to MC lowering. IIUC, this is because we want to put different flags and such on TAILJMPd and JMP_1, but there can't be two real instructions with the same encoding, so the former is marked isCodeGenOnly.
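As a sketch, the chain looks like this:

// TCRETURNdi  - SelectionDAG-level pseudo; carries the stack adjustment.
//     |   expanded by X86ExpandPseudo
//     v
// TAILJMPd    - codegen-only instruction (isCodeGenOnly), so it can carry
//     |         different flags without clashing with JMP_1's encoding.
//     |   lowered by X86MCInstLower
//     v
// JMP_1       - the MCInst that is actually encoded and emitted.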

Anyway, I think this version of the patch is a little less messy than the first.

hans added a comment.Sep 6 2016, 3:40 PM

Thanks very much for the review. New version uploaded.

lib/Target/X86/X86InstrInfo.cpp
3964 ↗(On Diff #69918)

Thanks, that works great.

I was worried about special condition codes like COND_NE_OR_P, but LAST_VALID_COND handles that.
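For reference, a sketch of the relevant part of X86::CondCode (member ordering as I remember it from the backend; treat the exact list as illustrative):

enum CondCode {
  COND_A, COND_AE, COND_B, COND_BE, COND_E, COND_G, COND_GE, COND_L,
  COND_LE, COND_NE, COND_NO, COND_NP, COND_NS, COND_O, COND_P, COND_S,
  LAST_VALID_COND = COND_S,
  // Artificial conditions used only during analysis; not directly encodable:
  COND_NE_OR_P,
  COND_NP_OR_E,
  COND_INVALID
};

so the single comparison against LAST_VALID_COND rejects all the artificial values.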

4015 ↗(On Diff #69918)

Hmm, I'm not sure I'm following.

BranchCond is what we got from analyzeBranch() before. We don't really have a handle to the actual branch instruction, which is why we're searching for it here.
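Roughly like this, as a sketch (local names illustrative):

// Walk the instructions backwards; skip non-branches and the trailing
// unconditional jump, and stop at the conditional branch whose operands
// analyzeBranch() copied into BranchCond.
for (auto I = MBB.rbegin(); I != MBB.rend(); ++I) {
  if (!I->isBranch() || I->isUnconditionalBranch())
    continue;
  // I is the branch the tail call gets folded into.
  break;
}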

4016 ↗(On Diff #69918)

BranchCond originally comes from X86InstrInfo::analyzeBranch(), and that one only puts one element in it.

I'll add the same assert to canMakeTailCallConditional().
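i.e., something like:

assert(BranchCond.size() == 1 &&
       "X86's analyzeBranch() should produce a single condition operand");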

mkuper added a comment.Sep 6 2016, 5:48 PM

Thanks Hans, this looks much nicer!

lib/Target/X86/X86InstrInfo.cpp
4079 ↗(On Diff #70480)

Sorry, I got confused.
X86InstrInfo::AnalyzeBranchImpl() also returns a vector of MachineInstrs, but the analyzeBranch() interface doesn't expose that, only the MachineOperands. How inconvenient.
Anything we can do about this, or do you think it would be better not to touch this?
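For context, a sketch of the hook as it stands; only the operands come back:

virtual bool analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
                           MachineBasicBlock *&FBB,
                           SmallVectorImpl<MachineOperand> &Cond,
                           bool AllowModify = false) const;
// Cond receives only MachineOperands; the MachineInstrs that
// AnalyzeBranchImpl() collects internally never reach the caller.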

lib/Target/X86/X86MCInstLower.cpp
509 ↗(On Diff #70480)

Can you use X86::GetCondBranchFromCond()?
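i.e., roughly (a sketch; which operand carries the condition code is an assumption here):

// Map the condition-code immediate on the pseudo back to the concrete
// conditional-jump opcode, e.g. COND_NE -> X86::JNE_1.
X86::CondCode CC = static_cast<X86::CondCode>(MI->getOperand(1).getImm());
OutMI.setOpcode(X86::GetCondBranchFromCond(CC));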

hans updated this revision to Diff 70544.Sep 7 2016, 9:01 AM

Addressing comments.

lib/Target/X86/X86InstrInfo.cpp
4079 ↗(On Diff #70480)

We could change analyzeBranch() I suppose, but it would probably break some out-of-tree backends, and I'm not sure it's worth it.

lib/Target/X86/X86MCInstLower.cpp
509 ↗(On Diff #70480)

Ah yes, much nicer.

test/CodeGen/X86/conditional-tailcall.ll
21 ↗(On Diff #70480)

(Forgot to reply to this the first time.)

When I worked on the patch, I initially forgot to change X86MCInstLower::Lower, which meant the printed assembly looked correct, but the binary instruction wasn't correct, so I wanted to test that.

I'll add a comment to the test.

mkuper accepted this revision.Sep 7 2016, 10:29 AM
mkuper edited edge metadata.

LGTM

lib/Target/X86/X86InstrInfo.cpp
4079 ↗(On Diff #70544)

Yeah, you're right, I can't think of a really nice way to handle this, since the analyzeBranch() call is in a generic part of this patch, not X86-specific. And the search here should normally be really short anyway.

But could you please leave a comment documenting this, in case someone decides to refactor this later.

test/CodeGen/X86/conditional-tailcall.ll
22 ↗(On Diff #70544)

I'm a bit confused about this. Without the change to X86MCInstLower, I'd expect you to get complete nonsense, not a poorly encoded jmp.
Anyway, that's a problem with my understanding of this, not your patch. :-)

This revision is now accepted and ready to land.Sep 7 2016, 10:29 AM
hans added inline comments.Sep 7 2016, 11:00 AM
lib/Target/X86/X86InstrInfo.cpp
4079 ↗(On Diff #70544)

Adding a comment.

test/CodeGen/X86/conditional-tailcall.ll
22 ↗(On Diff #70544)

Yes, the bits were garbage, but with the original version of my patch it still got printed as "jne" in the assembly. That wouldn't happen with the current version, but it still seems like a good idea to do a quick check of the encoding.

This revision was automatically updated to reflect the committed changes.