e.g.:

    tbnz x8, #63, .LBB0_5
    cmp  x8, #1
    b.lt .LBB0_2

into:

    tbnz x8, #63, .LBB0_5
    cbz  x8, .LBB0_2
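The rewrite is sound because the `tbnz` on bit 63 only falls through when `x8` is non-negative, and for a non-negative value `x < 1` holds exactly when `x == 0`. A quick sanity check of that equivalence (illustrative only, not part of the patch):

```python
# tbnz x8, #63 falls through only when the sign bit is clear, i.e. x8 >= 0.
# For any non-negative x, the b.lt condition after "cmp x, #1" (x < 1)
# is equivalent to the cbz condition (x == 0).
for x in range(1 << 16):  # sample of the non-negative range
    assert (x < 1) == (x == 0)
print("equivalence holds for all sampled non-negative values")
```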
Reviewers: t.p.northover, llvm-commits, mcrosier
lib/Target/AArch64/AArch64BranchRelaxation.cpp

| Line | Comment |
|---|---|
| 483 | What about `cmp reg, #0`/`b.le ...`? |
| 491 | Operand 0 is a destination, isn't it? Usually XZR/WZR in a true compare. Actually, I think we want to make sure it's `<dead>` if we're going to be removing this definition. |
| 510 | This seems wrong on two levels. First, if there are multiple predecessors, then the fact that one of them ends in a TBZ says nothing about the contents of SrcReg2 coming from any of the others. But even if that weren't the case, we really wouldn't want the logic to rely on the order the MBB's predecessors happen to be returned in. |
| 568 | Related to two other comments about this loop: this seems like an immediate `continue` condition. In general, it looks likely that once you prune out the extra logic, CompleteNZCVUsers will be unnecessary. |
| 580 | `break`? It doesn't seem like you could ever really do more in this BB. |
| 583–585 | Isn't this also an immediate `break` condition? We've hit something that defines NZCV but isn't a compare. Even if there's a `cmp reg, #1` above, we mustn't optimize it. |
Thanks for the comments Tim. Please see my replies inline.
lib/Target/AArch64/AArch64BranchRelaxation.cpp

| Line | Reply |
|---|---|
| 483 | This could be turned into a TBNZ, but isn't AArch64ConditionOptimizer a better place to handle it? Although this is very similar, we only fuse a compare with a branch in this patch, i.e. `(a > 0 && a < 1) -> a == 0` or `(a > 0 && a >= 1) -> a != 0`, and so it depends on branchfolding to shape the CFG. |
| 491 | You are correct. I agree; I will update this in my next patch. |
| 510 | I see your point now. I actually wanted to get the immediate dominator, but the dom tree is not available in this pass without reconstructing it. I will refactor this. |
| 568 | I will fix this. |
Hi Balaram,
I'm not sure I understand why this is being done so late in the pipeline. It seems to me like we could catch these cases much earlier.
This also seems related to the patch http://reviews.llvm.org/D7708, which I believe would allow this case to be caught much earlier in instsimplify (perhaps requiring a bit of work in instsimplify as well). This code was recently removed for lack of evidence of its benefit, but it might be worthwhile to evaluate it in the context of this particular optimization.
Geoff, there are two reasons for doing this so late. First, for the benchmark I looked at (mcf), these patterns occur very late, only after branchfolding; my initial implementation in AArch64ConditionOptimizer did not catch all the interesting cases. Second, this is a type of branch relaxation optimization, because the branch displacement of cbz is better than that of a conditional branch, IMHO.
What gave you that idea? They both seem to allow imm19*4.
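For reference, B.cond and CBZ/CBNZ do share the same encoding budget: both carry a signed 19-bit word offset, so their reachable displacement is identical. A small sketch of the arithmetic:

```python
# Both B.cond and CBZ/CBNZ encode a signed imm19 scaled by 4 bytes,
# giving the same +/- 1 MiB branch range for either form.
IMM19_MIN = -(1 << 18)
IMM19_MAX = (1 << 18) - 1
print(IMM19_MIN * 4)  # -1048576 bytes (-1 MiB backward)
print(IMM19_MAX * 4)  # 1048572 bytes (just under +1 MiB forward)
```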
lib/Target/AArch64/AArch64BranchRelaxation.cpp

| Line | Comment |
|---|---|
| 483 | The case I'm talking about is analogous, though: `a > 0 && a <= 0 -> a != 0`. |
lib/Target/AArch64/AArch64BranchRelaxation.cpp

| Line | Reply |
|---|---|
| 510 | I am having difficulty using the MachineDominatorTree in this pass even after reconstructing it: when I call the `DomTree->getNode` API it crashes. The last user of MachineDominatorTree is the MachineBlockPlacement pass. Is the dom tree invalid after MachineBlockPlacement? If this is the only way to get the immediate dominator, then I am thinking of moving just this logic to an earlier pass. The interesting cases are found only after branchfolding, so I am thinking this should be done in the PreSched2 stage. Is it reasonable to do it as a separate pass in PreSched2? |
I think I agree with Geoff here. This pass may be the most convenient place to add the peephole optimization, but even the IR doesn't look particularly obscure (well, apart from being massively and horribly undefined now that I look at it: the conditions are only related at all because of a quirk in the register allocator!).
If you change the operands of the icmps to be a function parameter instead of undefs I think it makes for a better example/test.
If you were able to do this at the IR level, I believe the relevant transformation would be to convert the second icmp (the sgt 0 one) into an icmp ne 0
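Given the dominating `a >= 0` fact established by the first branch, rewriting the second icmp from `sgt 0` to `ne 0` preserves its value, which is what makes an instsimplify-level transform legal here. A quick check of that claim (illustrative only, not the actual instsimplify code):

```python
# Under the dominating condition a >= 0, "icmp sgt a, 0" and
# "icmp ne a, 0" compute the same predicate.
for a in range(1 << 16):  # all values satisfying a >= 0 (sampled)
    assert (a > 0) == (a != 0)
print("sgt-0 and ne-0 agree for all sampled non-negative values")
```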
This was based on my observation on our hardware, where I found a ~4% performance improvement when we changed the following assembly:

    tbnz x15, #63, L13
    cmp  x8, #1
    b.lt L12
    cmp  w16, #2
    b.ne L12
    b    L14

into:

    tbnz x15, #63, L13
    cbz  x15, L12
    cmp  w16, #2
    b.ne L12
    b    L14
I might be wrong in attributing this as a result of branch displacement.
lib/Target/AArch64/AArch64BranchRelaxation.cpp

| Line | Reply |
|---|---|
| 483 | Ah, I see; this is a valid case too. I'm sorry, I misread it as `b.lt` and was thinking `a < 0` should be turned into TBNZ. |
Thanks guys, you have convinced me that this should be done at the IR level. I am thinking of doing it in SimplifyCmpInst and hope it catches all the interesting cases I am looking for.