This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Generate tbz/tbnz when comparing against zero.
ClosedPublic

Authored by mcrosier on Jul 9 2014, 1:07 PM.

Details

Summary

This patch generates tbz/tbnz when comparing against zero. The tbz/tbnz checks the sign bit to convert

op w1, w1, w10
cmp w1, #0
b.lt .LBB0_0

to

op w1, w1, w10
tbnz w1, #31, .LBB0_0

Please have a look.

Chad

Diff Detail

Repository
rL LLVM

Event Timeline

mcrosier updated this revision to Diff 11215.Jul 9 2014, 1:07 PM
mcrosier retitled this revision from to [AArch64] Generate tbz/tbnz when comparing against zero..
mcrosier updated this object.
mcrosier edited the test plan for this revision. (Show Details)
mcrosier added reviewers: t.p.northover, jmolloy.
mcrosier added subscribers: Unknown Object (MLST), grosbach, Jiangning, HaoLiu.

The patch looks good, and I am researching an issue which is a bit like this one. I think there should be more test case, for example

cmp      w0, #1                 // =1
b.lt    .LBB

I have tried your patch and test.Here, I doubt whether it has something to do with add/sub or adds/subs, and I have also tried some test case written by myself.

void foo();
void gt(int tmp) {
  if (tmp >= 0)
    foo();
}

and the asm is below:

  ge:                                     // @ge 
// BB#0:                                // %entry 
        cmp      w0, #0                 // =0 
        b.lt    .LBB1_2                           
// BB#1:                                // %if.then 
        b       foo 
.LBB1_2:                                // %if.end 
        ret 
.Ltmp3:
      .size   ge, .Ltmp3-ge

I think the cmp and b.lt above can combine to tbz/tbnz. I think your patch should cover this case. When I chang the c code to (tmp>0), it generates

gt:                                     // @gt
// BB#0:                                // %entry
       cmp      w0, #1                 // =1
       b.lt    .LBB0_2
// BB#1:                                // %if.then
       b       foo
.LBB0_2:                                // %if.end
       ret

I think this case may not be easy to generate TBZ/TBNZ, what is your opinion?

Thanks
-David

In D4440#7, @xjwwm_cat wrote:

I have tried your patch and test.Here, I doubt whether it has something to do with add/sub or adds/subs, and I have also tried some test case written by myself.

The difference between add/sub and adds/subs is that the latter sets the NZCV bits.

void foo();
void gt(int tmp) {
  if (tmp >= 0)
    foo();
}

and the asm is below:

  ge:                                     // @ge 
// BB#0:                                // %entry 
        cmp      w0, #0                 // =0 
        b.lt    .LBB1_2                           
// BB#1:                                // %if.then 
        b       foo 
.LBB1_2:                                // %if.end 
        ret 
.Ltmp3:
      .size   ge, .Ltmp3-ge

I think the cmp and b.lt above can combine to tbz/tbnz. I think your patch should cover this case. When I chang the c code to (tmp>0), it generates

AFAICT, that case can be handled by a similar combine, but it shouldn't block this patch. Feel free to submit a patch of your own.

gt:                                     // @gt
// BB#0:                                // %entry
       cmp      w0, #1                 // =1
       b.lt    .LBB0_2
// BB#1:                                // %if.then
       b       foo
.LBB0_2:                                // %if.end
       ret

I think this case may not be easy to generate TBZ/TBNZ, what is your opinion?

I don't think that this is possible as you're not strictly checking in sign bit. You would need two checks, to see if the value is zero and another to check if it's negative.

Thanks
-David

Chad

Hi Chad,

2014-07-10 4:08 GMT+08:00 Chad Rosier <mcrosier@codeaurora.org>:

Hi t.p.northover, jmolloy,

This patch generates tbz/tbnz when comparing against zero. The tbz/tbnz
checks the sign bit to convert

add/sub w1, w1, w10
cmp w1, #0
b.lt .LBB0_0

to

adds/subs w1, w1, w10
tbnz w1, #31, .LBB0_0

I think there are two cases around this,

  1. LHS can support flags update like ADDS/SUBS, and we can cover more like

ANDS, ORRS and others etc.

  1. LHS can't support flags update and BR_CC is the only user. At

present performBRCONDCombine can cover this scenario.

The patch would be more complete if you can add more instructions for case

For case 2, it can be a separate patch, I think.

Thanks,
-Jiangning

Please have a look.

On an A53 processor this improves the following benchmarks:
office_suite/OAv2mark +3%
office_suite/rotatev2data* +6-10%

I also saw improvements in spec2000 parser, vortex, and vpr, but just
barely above noise, so I wouldn't bank on them. These numbers were based
on the community mainline.

Chad

http://reviews.llvm.org/D4440

Files:

lib/Target/AArch64/AArch64ISelLowering.cpp
test/CodeGen/AArch64/tbz-tbnz.ll

Hi Jiangning,
AFAICT, emitComparison only generates ADDS, SUBS, and ANDS. I didn't handle the ANDS case because I didn't want to break test8 (extracted from logical_shifted_reg.ll) and it wasn't applicable to the code I was trying to target. I'll need to investigate this case further before I enable the combine for ANDS.

According to the documentation I have, the only other scalar operations that set the condition flags are ADCS, BICS, NEGS, NGCS, and SCBS. I'll look into adding these operations as well, but I'm not sure we'll hit these case very often, if at all.

I'll investigate adding an implementation in InstCombine. Thanks for the suggestion!

Chad

mcrosier updated this revision to Diff 11979.Jul 29 2014, 7:41 AM
mcrosier updated this object.

Jiangning,
I've updated the patch according to your feedback. Rather than enable the combine for just ADD/SUB it is now enabled in all cases except when the LHS operand is an AND. The emitComparison converts the AND to an ANDS and the test bit and branch instruction becomes redundant. It also increases register pressure because the ANDS instruction must write to a register (rather than WZR/XZR), which is consumed by the test bit and branch instruction. test8 provides examples of why we don't want to combine ANDs.

The performance numbers are slightly better, but still less than noise. All, please take a look.

Chad

In D4440#12, @mcrosier wrote:

The performance numbers are slightly better, but still less than noise. All, please take a look.

To be clear, the numbers are slightly better relative to the ADD/SUB only combine. Overall, we still see a large improvement in eembc/OAv2.

Hi Chad,

I'm happy with this, and your new patch looks good to me if only you can
add more comments in the code around excluding AND, because at the first
look it's strange.

Thanks,
-Jiangning

2014-07-30 0:02 GMT+08:00 Chad Rosier <mcrosier@codeaurora.org>:

In D4440#12, @mcrosier wrote:

The performance numbers are slightly better, but still less than noise.

All, please take a look.

To be clear, the numbers are slightly better relative to the ADD/SUB only
combine. Overall, we still see a large improvement in eembc/OAv2.

http://reviews.llvm.org/D4440

mcrosier updated this revision to Diff 12065.Jul 31 2014, 8:22 AM
mcrosier added a reviewer: Jiangning.

Revised patch based on Jiangning's feedback. Please take a look.

Jiangning edited edge metadata.Jul 31 2014, 7:32 PM

Hi Chad,

LGTM now!

Thanks,
-Jiangning

2014-07-31 23:23 GMT+08:00 Chad Rosier <mcrosier@codeaurora.org>:

Revised patch based on Jiangning's feedback. Please take a look.

http://reviews.llvm.org/D4440

Files:

lib/Target/AArch64/AArch64ISelLowering.cpp
test/CodeGen/AArch64/tbz-tbnz.ll
mcrosier closed this revision.Aug 1 2014, 7:58 AM
mcrosier updated this revision to Diff 12106.

Closed by commit rL214518 (authored by @mcrosier).