Page MenuHomePhabricator

[AArch64] Prefer B.cond to CBZ/CBNZ/TBZ/TBNZ when NZCV flags can be set for "free"
ClosedPublic

Authored by mcrosier on Jun 14 2017, 1:46 PM.

Details

Summary

This patch contains a pass that transforms CBZ/CBNZ/TBZ/TBNZ instructions into a conditional branch (B.cond), when the NZCV flags can be set for "free". This is preferred on targets that have more flexibility when scheduling B.cond instructions as compared to CBZ/CBNZ/TBZ/TBNZ (assuming all other variables are equal). This can also reduce register pressure.

A few examples:

add w8, w0, w1  -> cmn w0, w1             ; CMN is an alias of ADDS.
cbz w8, .LBB_2  -> b.eq .LBB0_2

add w8, w0, w1  -> adds w8, w0, w1        ; w8 has multiple uses.
cbz w8, .LBB1_2 -> b.eq .LBB1_2

sub w8, w0, w1       -> subs w8, w0, w1   ; w8 has multiple uses.
tbz w8, #31, .LBB6_2 -> b.ge .LBB6_2

The pass is run after the Machine Combiner because I saw a few cases where converting from 'ADD' to 'ADDS' prevented fusion. I also noticed the AArch64 Conditional Compares pass (i.e., CCMP formation) doesn't handle 'ANDS' and 'BICS', but since this pass is later the formation of CCMP instructions isn't negatively impacted.

No correctness issues across SPEC2000, SPEC2006 or the llvm-test-suite. When running on Falkor, this improves SPEC2006/libquantum by 8.5%. I also saw a few other small improvements in SPEC200[0|6] of ~1-2%. Otherwise, mostly small improvements within noise and no regressions above noise. I saw a few minor improvements on Kryo and didn't test any other subtargets as none are readily available. FWIW, I'm okay with predicating this with a Feature flag, if preferred.

PTAL,
Chad

Diff Detail

Event Timeline

mcrosier created this revision.Jun 14 2017, 1:46 PM

RE: the feature flag statement. Unless we know that this is a win all supported sub-targets (micro-architectures), then it seems like a feature flag would be the way to go, right?

gberry added inline comments.Jun 15 2017, 11:06 AM
lib/Target/AArch64/AArch64CondBrTuning.cpp
77

You should be able to at least do AU.setPreservesCFG() here, otherwise there's no need to override this to just call the parent version.

82

The check for MO.isReg() seems unnecessary. MO.getReg() will assert if it isn't a reg, which it always should be.
You could simplify this a bit to be:

if (!isvirt(MO.getReg())
  return null;
return getUniqueVRegDef(MO.getReg())
191

You should probably do this after the opcode check below to save compile time.

RE: the feature flag statement. Unless we know that this is a win all supported sub-targets (micro-architectures), then it seems like a feature flag would be the way to go, right?

Yes, I think this is reasonable. I left it off initially assuming it would be easier for other sub-target owners to test, but if the consensus is this should be sub-target specific, I'll add the feature flag accordingly.

MatzeB edited edge metadata.Jun 15 2017, 11:46 AM
  • The transformation make sense to me.
  • Does this need to be a separate pass? Glancing at the code, it seems that performBRCONDCombine() in AArch64ISelLowering.cpp is the only place creating CBZ instructions (except for FastISel). So maybe the adjustment can be performed there?
  • The code has a lot of opcode "tables" (in the form of switch/case). It's a judgement call each time, but generally I think it looks better if we have most opcode based tables in AArch64InstrInfo. At least the part mapping "S" opcodes to the non-"S" opcodes looks like a good candidate.
lib/Target/AArch64/AArch64CondBrTuning.cpp
11–26

You should use /// \file at the beginning and continue with /// so that doxygen can pick it up.

335

Writing MachineBasicBlock & instead of auto & would be friendlier to the reader.

mcrosier updated this revision to Diff 102702.Jun 15 2017, 12:06 PM

Address Geoff's feedback.

  • The transformation make sense to me.

Great! :D

  • Does this need to be a separate pass? Glancing at the code, it seems that performBRCONDCombine() in AArch64ISelLowering.cpp is the only place creating CBZ instructions (except for FastISel). So maybe the adjustment can be performed there?

Initially, I attempted this during ISel but ran into several problems. First, I found myself duplicating a lot of code from AArch64ISelDAGtoDAG.cpp to avoid this transformation when the 'AND' could be folded into a bitfield insert/extract operation (e.g., BFM). It also leads to the aforementioned problems (i.e., the ConditionalCompares pass would need to be able to support ANDS, BICS and we might miss MADD fusion opportunities because we can't fuse MUL+ADDS).

  • The code has a lot of opcode "tables" (in the form of switch/case). It's a judgement call each time, but generally I think it looks better if we have most opcode based tables in AArch64InstrInfo. At least the part mapping "S" opcodes to the non-"S" opcodes looks like a good candidate.

I'm happy to move the large opcode table to switch between the non-flag-setting to flag-setting opcodes. No problem.

mcrosier marked 3 inline comments as done.Jun 15 2017, 12:14 PM
mcrosier added inline comments.
lib/Target/AArch64/AArch64CondBrTuning.cpp
77

I also noticed MachineTraceMetrics was being clobbered, so I've marked that as preserved as well. Please let me know if that's not correct.

gberry added inline comments.Jun 15 2017, 12:18 PM
lib/Target/AArch64/AArch64CondBrTuning.cpp
77

I don't think that is correct since you are potentially changing opcodes and latencies.

mcrosier updated this revision to Diff 102708.Jun 15 2017, 12:58 PM
mcrosier marked 4 inline comments as done.

Address Matthias' and Geoff's feedback.

lib/Target/AArch64/AArch64CondBrTuning.cpp
77

Oh, right. I guess we'll just have to recompute in that case.

mcrosier updated this revision to Diff 102710.Jun 15 2017, 1:02 PM

Remove an accidental change to AArch64ISelLowering.cpp from a different WIP.

MatzeB accepted this revision.Jun 15 2017, 2:32 PM

I'm fine with coding style and transformation.

I'm always a bit sad to see whole new passes for simple/local patterns. There is already other passes like MachineCombiner and MachinePeephole, DeadMachineInstructionElim that also just go through all instruction looking for a pattern and don't need any analysis upfront. Unfortunately none of them is extensible and they are at slightly different positions in the pipeline. I guess we do not need to solve/improve this situation with this patch. Though I reserve the right to ask for a revert and replanning if the bots measure slowdowns in CTMark after committing this.

So LGTM.

This revision is now accepted and ready to land.Jun 15 2017, 2:32 PM

I'm fine with coding style and transformation.

I'm always a bit sad to see whole new passes for simple/local patterns. There is already other passes like MachineCombiner and MachinePeephole, DeadMachineInstructionElim that also just go through all instruction looking for a pattern and don't need any analysis upfront. Unfortunately none of them is extensible and they are at slightly different positions in the pipeline. I guess we do not need to solve/improve this situation with this patch. Though I reserve the right to ask for a revert and replanning if the bots measure slowdowns in CTMark after committing this.

So LGTM.

Thanks, Matthias.

mcrosier added a comment.EditedJun 16 2017, 7:40 AM

I went ahead and looked at each of the sub-target machine descriptions to determine if this patch is good, bad, or neutral. In short, this transformation is either neutral or positive for all sub-targets.

The default WriteRes mappings are neutral when considering the non-flag-setting vs. flag-setting instructions. Falkor and Kryo are the only targets that don't use the default WriteRes values, but converting from non-flag-setting to flag-setting is still neutral for these targets.

These targets appear to be neutral: A53, A57, and Cyclone (per Matthias), ThunderX, ThunderX2T99.
These targets appear to benefit from this tranformation: Falkor, Kryo, M1

For M1, TBZ/TBNZ consume M1WriteC1 and M1WriteA2 for 2 cycles as compared to Bcc which consumes M1WriteB1 for 1 cycle. That seems to be a clear win. Both CBZ/CBNZ and Bcc execute in a single cycle, but consume different resources (M1WriteC1 vs. M1WriteB1, respectively). This isn't clearly a win or loss, but overall this transform appears to be general goodness.

At this point I'm thinking this should be enabled by default because it appears either neutral or good for all sub-targets. FWIW, this is also the default behavior for GCC. However, I'll add a few more sub-target owners to see if they have any comments.

joelkevinjones accepted this revision.Jun 22 2017, 12:24 PM

I concur with the analysis for ThunderXT99.

I concur with the analysis for ThunderXT99.

Thanks, Joel!

Note this should be an improvement for ThunderX (though maybe it is not modeled correctly). Not ThunderX2T99 though (only because fusion happens even for add/cbz cases too).

mcrosier closed this revision.Jun 26 2017, 7:44 AM

Note this should be an improvement for ThunderX (though maybe it is not modeled correctly). Not ThunderX2T99 though (only because fusion happens even for add/cbz cases too).

Glad to her, Andrew!!

This was committed in r306144.