This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Materialise some boolean values to avoid a branch
ClosedPublic

Authored by rogfer01 on Jun 22 2017, 8:41 AM.

Details

Summary

This patch combines some cases of ARMISD::CMOV for integers that arise in comparisons of the form

a != b ? x : 0
a == b ? 0 : x

and that currently (e.g. in Thumb1) are emitted as branches.

Diff Detail

Event Timeline

rogfer01 created this revision.Jun 22 2017, 8:41 AM

A lot of these combines look very similar to the work done for ISD::ADDCARRY/SUBCARRY; see https://reviews.llvm.org/D29872 . Can we reuse that work rather than add a bunch of ARM-specific code?

Hi @efriedma,

thanks for pointing me to the ADDCARRY/SUBCARRY nodes. I tried to use them but I ran into serious problems (maybe due to some inexperience on my side working with SelectionDAG). First if I want to use them I assume I have to make them legal for the ARM back end, this means that I have to lower them somehow. After the short discussion in the list about the semantics of SUBCARRY I made ISD::ADDCARRY to lower to ARMISD::ADDC and ISD::SUBCARRY to lower to ARMISD::SUBC plus doing an extra (ISD::ADD 1, c) to the carry result, to change from borrow semantics (ISD::SUBCARRY semantics) to ARM carry semantics.

One side effect of doing this is that many operations that were previously legalized using ISD::ADD and ISD::ADDC like i64 add/sub are now legalized using ISD::ADDCARRY. And one generic combiner for ISD::ADDCARRY when it has a zero carry transforms it into ISD::UADDO (similarly ISD::USUBO for ISD::SUBCARRY without borrow). This, unfortunately, leads to very poor (and probably wrong) code involving msr / mrs instructions.

I tried to fix some of these issues but still it is very easy that sometimes the carry value is moved to some general purpose register and the backend decides to use mrs. For instance a case like this

define i64 @f3(i32 %al, i32 %bl) {
entry:
    ; unsigned wide add
    %aw = zext i32 %al to i64
    %bw = zext i32 %bl to i64
    %cw = add i64 %aw, %bw
    ; ch == carry bit
    %ch = lshr i64 %cw, 32
	%dw = add i64 %ch, %bw
	ret i64 %dw
}

leads to something like this before instruction selection

SelectionDAG has 13 nodes:
  t0: ch = EntryToken
  t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1
  t17: ch,glue = CopyToReg t0, Register:i32 %R0, t28
  t19: ch,glue = CopyToReg t17, Register:i32 %R1, t28:1, t17:1
      t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
    t29: i32,i32 = ARMISD::ADDC t2, t4
  t28: i32,i32 = ARMISD::ADDE t4, Constant:i32<0>, t29:1
  t20: ch = ARMISD::RET_FLAG t19, Register:i32 %R0, Register:i32 %R1, t19:1

but the t19: ch,glue = CopyToReg t17, Register:i32 %R1, t28:1, t17:1 ends being emitted as mrs r1, apsr, below is the final code generated by llc -mtriple=armv6t2-eabi which I think is really wrong as the apsr register has more bits than the carry.

f3:
        adds    r0, r0, r1
        adcs    r0, r1, #0
        mrs     r1, apsr
        bx      lr

So my understanding is that using these nodes, while something that has to be done in the future, is not an easy addition to this change.

The upside is that it is possible to express the original change of this patch using the ADDCARRY and SUBCARRY nodes, so this may still be worth the effort.

Do you agree with my analysis or I am missing something very obvious here. I can upload a phab with the current change as it is if it helps.

Thank you very much,

If you're seeing references to apsr, you aren't lowering ADDCARRY correctly.

ARMISD::ADDE has three operands; two integers, and apsr. It has two results: an integer, and the resulting apsr. If you're lowering ADDCARRY to ARMISD::ADDE, you actually need three operations: one to convert the boolean carry input to an apsr, one ARMISD::ADDE to perform the actual operation, and one operation to convert the result apsr to a boolean. See LowerADDSUBCARRY for how the x86 backend does this.

(Afterwards, you use a target dagcombine to eliminate the redundant operations.)

Hi Eli,

to make this more manageable I opened D35192 in which I only use ADDCARRY and SUBCARRY. Once we're happy with that one I will update this change to use the new nodes.

Regards,

rogfer01 updated this revision to Diff 116191.Sep 21 2017, 8:21 AM

ChangeLog:

  • Updated the patch to use ADDCARRY / SUBCARRY nodes.
rogfer01 added inline comments.Sep 21 2017, 8:24 AM
lib/Target/ARM/ARMISelLowering.cpp
10010โ€“10020

I think this may make the ADDC combiner above redundant as ADDC x, -1 will usually become SUBC x, 1. I'll try to see if I can remove the ADDC one.

Friendly ping :)

RKSimon edited reviewers, added: efriedma; removed: eli.friedman.Sep 28 2017, 7:32 AM
samparker added inline comments.Sep 28 2017, 8:02 AM
lib/Target/ARM/ARMISelLowering.cpp
12231

early exit when not integer instead?

12261

I don't follow how this sequence better? What makes this more efficient than a cmp and conditional move? On a Thumb1Only target, even a cmp and branch will only be 3 cycles, I think.

12310

Same as above, I guess I'm missing something... could you explain why this is better please?

rogfer01 added inline comments.Sep 28 2017, 9:38 AM
lib/Target/ARM/ARMISelLowering.cpp
12231

Sadly there is some code after this block that does not require integer (if I'm reading it correctly).

12261

Consider a case like

int test(int a, int b) {
  return a != b;
}

ToT is currently generating (--target=arm -mcpu=cortex-m0 -O2)

test:
	.fnstart
@ BB#0:                                 @ %entry
	mov	r2, r0
	movs	r0, #1
	movs	r3, #0
	cmp	r2, r1
	bne	.LBB0_2
@ BB#1:                                 @ %entry
	mov	r0, r3
.LBB0_2:                                @ %entry
	bx	lr

with this combiner we can generate

test:
	.fnstart
@ BB#0:                                 @ %entry
	subs	r0, r0, r1
	subs	r1, r0, #1
	sbcs	r0, r1
	bx	lr

So, not counting bx, we go from a sequence of 5/6 to 3.

12310

Consider a case like

int test(int a, int b) {
  return a == b;
}

ToT is currently generating (--target=arm -mcpu=cortex-m0 -O2)

test:
	.fnstart
@ BB#0:                                 @ %entry
	mov	r2, r0
	movs	r0, #1
	movs	r3, #0
	cmp	r2, r1
	beq	.LBB0_2
@ BB#1:                                 @ %entry
	mov	r0, r3
.LBB0_2:                                @ %entry
	bx	lr

But the code above is in practice like

int test(int a, int b) {
  return a == b ? 1 : 0;
}

where k = 0 (2^k = 1).

So with this change we can generate

test:
	.fnstart
@ BB#0:                                 @ %entry
	subs	r1, r0, r1
	movs	r0, #0
	subs	r0, r0, r1
	adcs	r0, r1
	bx	lr

We go from 5/6 instructions to always 4 (not counting the bx).

When k != 0 (e.g. return a == b ? 0 : k) the improvement is more modest from 5/6 to always 5, due to the LSL

12352โ€“12355

Ignore these comments, this ended here accidentally. I'll remove them in the next update.

samparker added inline comments.Sep 29 2017, 2:08 AM
lib/Target/ARM/ARMISelLowering.cpp
12231

Couldn't you move the original combiner to live above this? Doesn't look like the known bits will be operating on floats.

12261

aaah, awesome, thanks!

12310

thanks for the great explanation!

test/CodeGen/ARM/atomic-cmpxchg.ll
29

Sorry, I really should of checked the tested before I asked you to explain... so thanks again for taking the time.

test/CodeGen/ARM/cmn.ll
11

Why change the input?

test/CodeGen/ARM/cmpxchg-O0.ll
23

Is this still beneficial when conditional moves are available? This test makes it look like an extra instruction is used.

test/CodeGen/ARM/select-imm.ll
155

I'd say that it's also worth checking that branches aren't generated for these tests.

test/CodeGen/Thumb/branchless-cmp.ll
3

Same here, it would be nice to ensure that there isn't a cmp and a br generated.

test/CodeGen/Thumb2/thumb2-cmn.ll
9โ€“10

Why the input change for these tests?

rogfer01 updated this revision to Diff 117834.Oct 5 2017, 9:07 AM
rogfer01 marked 5 inline comments as done.

ChangeLog:

  • Reordered combiner and added early exit for non-integers.
  • Constrained the CLZ combiner to T32 because it is not clear whether it is profitable in A32 (where conditional moves are more accessible)
  • Added negative tests to make sure no branches are emitted where we don't want them.
test/CodeGen/ARM/cmn.ll
11

By doing an artificial select we avoid triggering the new combiner while forcing the previous cmn to appear.

test/CodeGen/ARM/cmpxchg-O0.ll
23

I've just reduced the scope of this change to only Thumb1 for this case as it is less obvious for Arm that this is an improvement.

efriedma added inline comments.Oct 5 2017, 12:53 PM
lib/Target/ARM/ARMISelLowering.cpp
12252

What are you trying to do here? With Thumb2, the relevant instructions are basically identical in ARM mode vs. Thumb mode, so doing this transform exclusively in Thumb mode doesn't really make sense.

rogfer01 added inline comments.Oct 6 2017, 9:50 AM
lib/Target/ARM/ARMISelLowering.cpp
12252

Mmmh, I was trying to emit this in ARM because it is not always beneficial in cmpxcg-O0 below. We're going from 2 instructions to 3.

For Thumb 2 I understand there is an IT block (which the test does not check) so we stay in 3 instructions.

(I'm aware that the metric "number of instructions" is a bit poor)

I may be missing something here.

rogfer01 added inline comments.Oct 6 2017, 10:06 AM
lib/Target/ARM/ARMISelLowering.cpp
12252

Also I noticed that if I'm not doing this in ARM any more checking for V5TOps is unnecessary.

efriedma edited edge metadata.Oct 6 2017, 11:56 AM

"it" is more like an instruction prefix than an actual instruction from the perspective of the CPU; it gets decoded into the same thing as an ARM mode conditional instruction. It's not an improvement to replace "it" with an actual instruction.

@efriedma Ah I see. If I get you right, the initial change was more sensible. Also Sam's concerns were on a file that is explicitly marked as -O0.

I will go back the original change.

rogfer01 updated this revision to Diff 118770.Oct 12 2017, 4:05 AM

ChangeLog:

  • Reinstate the transformation for ARM as well.

This patch would be easier to review if you would commit the changes to add new tests and RUN lines separately.

test/CodeGen/Thumb/long-setcc.ll
25โ€“26

Could we fix this test to include some higher-quality CHECK lines, so it's clear what we're actually generating?

rogfer01 updated this revision to Diff 126575.Dec 12 2017, 10:27 AM

ChangeLog

  • Split tests in D41122
  • Improve test long-setcc.ll
rogfer01 updated this revision to Diff 126578.Dec 12 2017, 10:33 AM

ChangeLog:

  • I forgot to update the Thumb counterpart of long-setcc.ll in the last update
rogfer01 marked an inline comment as done.Dec 12 2017, 10:34 AM
rogfer01 updated this revision to Diff 126983.Dec 14 2017, 9:27 AM
rogfer01 removed a child revision: D41122: Tests for D34515.
rogfer01 added a parent revision: D41122: Tests for D34515.

ChangeLog

  • New tests in D41122 are now updated in this change to show the change in codegen

Ping. Modulo the bugs we may find with ADDCARRY / SUBCARRY further thoughts on this?

The use of clz is very similar to what the PowerPC backend does; I wonder if we can pick up any additional tricks from there.

lib/Target/ARM/ARMISelLowering.cpp
7443

When do we hit this case? I would expect that normally DAGCombiner::visitADDCARRY would combine this away.

10010โ€“10020

Did you end up checking this?

12245

Do we generate ARMISD::CMOV for values which aren't integers?

rogfer01 added inline comments.Jan 22 2018, 10:33 AM
lib/Target/ARM/ARMISelLowering.cpp
7443

In a testcase like

define i32 @test1a(i32 %a, i32 %b) {
entry:
  %cmp = icmp ne i32 %a, %b
  %cond = zext i1 %cmp to i32
  ret i32 %cond
}

without this we generate

	subs	r0, r0, r1
	movs	r1, #1
	subs	r2, r1, #1
	mov	r2, r0
	sbcs	r2, r1
	sbcs	r0, r2
	bx	lr

with it we can generate

	subs	r0, r0, r1
	subs	r1, r0, #1
	sbcs	r0, r1
	bx	lr

Alternatively I could queue a combine to the newly created ISD::SUBCARRY in line 12270 below. I think DCI.CombineTo should be able to do that. Does this sound right?

10010โ€“10020

Yep, neither is redundant apparently, but I will check with more detail why it happens.

12245

In cases like this

float a;
int b;
a = (b & 0x1) != 0;

there is a moment where the DAG contains this node

t19: f32 = ARMISD::CMOV ConstantFP:f32<0.000000e+00>, ConstantFP:f32<1.000000e+00>, Constant:i32<1>, Register:i32 %cpsr, t18
efriedma added inline comments.Jan 22 2018, 1:28 PM
lib/Target/ARM/ARMISelLowering.cpp
7443

If we're not correctly adding new nodes to the DAGCombiner queue in a custom DAGCombine, we should fix that.

12245

Huh.

I guess we're getting lucky that the code below to create an AssertZext doesn't crash.

rogfer01 updated this revision to Diff 131272.Jan 24 2018, 8:28 AM

ChangeLog:

  • Change ConvertBooleanCarryToCarryFlag to use ARMISD::SUBC x, 1 instead of ARMISD::ADDC x, ~0, this way a single combiner in PerformAddcSubcCombine is enough.
  • Do not use ISD::SUBCARRY if the third operand is a zero. While the generic combiner can lower this to ISD::USUBO it may not have a chance to run before we lower that ISD::SUBCARRY.
rogfer01 updated this revision to Diff 131293.Jan 24 2018, 9:28 AM

ChangeLog:

  • Remove whitespace and unused variables in LowerADDSUBCARRY
rogfer01 marked 2 inline comments as done.Jan 31 2018, 11:52 AM

@efriedma @samparker any further comments on this change? We can add more clz-like tricks in later changes if needed.

chrib added a subscriber: chrib.Feb 7 2018, 11:47 PM
samparker added inline comments.Feb 8 2018, 2:00 AM
test/CodeGen/ARM/select-imm.ll
235

Why do we have a sxtb for T1 and not T2?

rogfer01 added inline comments.Feb 15 2018, 5:47 AM
test/CodeGen/ARM/select-imm.ll
235

This comes from the load byte + sext in that testcase.

In Thumb2 we can coalesce the load byte + sext in a single ldrsb.w Rt, [Rn, #0] while in Thumb1 we can't do that (at least directly) because there ldsrb is of the form ldsrb Rt, [Rn, Rm] so a ldrb Rt, [Rn, #0] and then a sxtb Rt, Rn are used instead.

samparker accepted this revision.Feb 15 2018, 6:35 AM

No other points from me, lets get this monster in. (fingers crossed)

test/CodeGen/ARM/select-imm.ll
235

Ok cheers, I didn't realise we didn't have ldrsb in thumb-1. But still odd this doesn't get combined away, guess it must be because the load has two users.

This revision is now accepted and ready to land.Feb 15 2018, 6:35 AM
rogfer01 updated this revision to Diff 134563.Feb 16 2018, 12:55 AM

ChangeLog:

  • Rebase with ToT
This revision was automatically updated to reflect the committed changes.

Let's see how this one goes :)

Thanks @samparker and @efriedma