This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Relax branches by fusing compare with conditional branch when we can infer that source register is zero/non-zero.
AbandonedPublic

Authored by bmakam on Mar 29 2016, 1:41 PM.

Details

Summary
eg..:
        tbnz    x8, #63, .LBB0_5
        cmp             x8, #1
        b.lt    .LBB0_2
->to:
        tbnz    x8, #63, .LBB0_5
        cbz     .LBB0_2

Diff Detail

Event Timeline

bmakam updated this revision to Diff 51975.Mar 29 2016, 1:41 PM
bmakam retitled this revision from to [AArch64] Relax branches by fusing compare with conditional branch when we can infer that source register is zero/non-zero..
bmakam updated this object.
t.p.northover added inline comments.Mar 29 2016, 4:25 PM
lib/Target/AArch64/AArch64BranchRelaxation.cpp
483

What about "cmp reg, #0"/"b.le ..."?

491

Operand 0 is a destination, isn't it? Usually XZR/WZR in a true compare. Actually, I think we want to make sure it's <dead> if we're going to be removing this definition.

510

This seems wrong on 2 levels. First, if there are multiple predecessors then the fact that one of them ends in a TBZ says nothing about the contents of SrcReg2 coming from any others.

But even if that wasn't the case we really wouldn't want the logic to rely on the order MBB's predecessors happen to be returned in.

568

Related to two other comments about this loop: this seems like an immediate continue condition. In general, it looks likely that once you prune out the extra logic CompleteNZCVUsers will be unnecessary.

580

break? It doesn't seem like you could ever really do more in this BB.

583–585

Isn't this also an immediate break condition? We've hit something that defines NZCV but isn't a compare. Even if there's a "cmp reg, #1" above we mustn't optimize it.

Thanks for the comments Tim. Please see my replies inline.

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483

This could be turned into a TBNZ but isn't AArch64ConditionOptimizer a better place to handle this? Although this is very similar we only fuse a compare with branch in this patch i.e. (a>0 && a<1) -> a == 0 or (a>0 && a>= 1) -> a != 0 and so it depends on branchfolding to shape the CFG.

491

You are correct. I agree, will update in my next patch.

510

I see your point now. I actually wanted to get the immediate dominator but the dom tree is not available in this pass without reconstructing. I will refactor this.

568

I will fix this.

gberry added a subscriber: gberry.Mar 31 2016, 8:45 AM

Hi Balaram,

I'm not sure I understand why this is being done so late in the pipeline. It seems to me like we could catch these cases much earlier.

This also seems related to the patch http://reviews.llvm.org/D7708, which I believe would allow this case to be caught much earlier in instsimplify (perhaps requiring a bit of work in instsimplify as well). This code was recently removed for lack of evidence of its benefit, but it might be worthwhile to evaluate it in the context of this particular optimization.

Geoff, there are 2 reasons for doing this too late. First, for the benchmark I looked at which is mcf, these patterns occur very late only after branchfolding. My initial implementation to do this at AArch64ConditionOptimizer did not catch all the interesting cases. Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

This comment was removed by bmakam.

Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

What gave you that idea? They both seem to allow imm19*4.

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483

The case I'm talking about is analogous though: "a > 0 && a <= 0 -> a != 0".

bmakam added inline comments.Mar 31 2016, 10:48 AM
lib/Target/AArch64/AArch64BranchRelaxation.cpp
510

I am having difficulty in using the MachineDominatorTree in this pass even after reconstructing. When I call the DomTree->getNode API it crashes. The last user of MachineDomTree is MachineBlockPlacement pass. Is the DomTree invalid after MachineBlockPlacement? If this is the only way to get the immediate dominator then I am thinking of moving just this logic to an earlier pass. The interesting cases are found only after branchfolding so I am thinking this should be done in PreSched2 stage. Is it reasonable to do it in a separate pass in PreSched2?

I think I agree with Geoff here. This pass may be the most convenient place to add the peep-hole optimization, but even the IR doesn't look particularly obscure (well, apart from being massively and horribly undefined now that I look at it: the conditions are only related at all because of a quirk in the register allocator!).

If you change the operands of the icmps to be a function parameter instead of undefs I think it makes for a better example/test.

If you were able to do this at the IR level, I believe the relevant transformation would be to convert the second icmp (the sgt 0 one) into an icmp ne 0

Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

What gave you that idea? They both seem to allow imm19*4.

This was based on my observation on our hardware where I found ~4% performance improvement when we changed the following assembly

tbnz             x15, #63, L13
cmp              x8, #1
b.lt             L12
cmp              w16, #2         
b.ne             L12
b                L14

into:

tbnz             x15, #63, L13
cbz              x15, L12
cmp              w16, #2         
b.ne             L12
b                L14

I might be wrong in attributing this as a result of branch displacement.

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483

Ah I see, this is a valid case too. I am sorry I misread this as b.lt and was thinking a < 0 should be turned into TBNZ.

bmakam abandoned this revision.Mar 31 2016, 11:36 AM

If you change the operands of the icmps to be a function parameter instead of undefs I think it makes for a better example/test.

If you were able to do this at the IR level, I believe the relevant transformation would be to convert the second icmp (the sgt 0 one) into an icmp ne 0

Thanks guys, you convinced me that this should be done at the IR level. I am thinking of doing it in SimplifyCmpInst and hope it catches all the interesting cases that I am looking for.