This is an archive of the discontinued LLVM Phabricator instance.

[DAG] optimize negation of bool
ClosedPublic

Authored by spatel on Oct 11 2016, 2:03 PM.

Details

Summary

Use mask and and negate for legalization of i1 source type with SIGN_EXTEND_INREG.

Diff Detail

Event Timeline

spatel updated this revision to Diff 74289.Oct 11 2016, 2:03 PM
spatel retitled this revision from to [x86] use 'neg' for negation of bool.
spatel updated this object.
spatel added reviewers: hfinkel, zvi, delena, mkuper.
spatel added a subscriber: llvm-commits.
zvi edited edge metadata.Oct 12 2016, 12:29 PM

Sanjay, can you please a RUN: and CHECK:'s for a 32-bit target?

Can we do this in SelectionDAGLegalize::ExpandNode instead? I suppose in theory some platform might prefer to lower this using shifts, but I can't think of any off the top of my head.

lib/Target/X86/X86ISelLowering.cpp
16343

Given "SIGN_EXTEND_INREG(X, i1)", you can transform it to "-(X&1)". But you can't assume the input is zero-extended. I think your patch misbehaves for the following testcase:

define void @_Z1fbi(i1 zeroext %a, i32 %b, i8* %p) local_unnamed_addr #0 {
entry:
  %conv = trunc i32 %b to i1
  %or = or i1 %a, %conv
  %s = sext i1 %or to i8
  store i8 %s, i8* %p
  ret void
}

Can we do this in SelectionDAGLegalize::ExpandNode instead? I suppose in theory some platform might prefer to lower this using shifts, but I can't think of any off the top of my head.

Would we still need to do the 'and' masking op though? In that case, there's probably no win?

lib/Target/X86/X86ISelLowering.cpp
16343

Ah, I misunderstood the meaning/guarantee of the source value type of ISD::SIGN_EXTEND_INREG.
So for the given example, we get:

orb   %dil, %sil
shlb   $7, %sil
sarb   $7, %sil
movb   %sil, (%rdx)

But with this patch:

orb   %dil, %sil
negb   %sil
movb   %sil, (%rdx)

And that's wrong if any of the higher bits of %b / %sil are set.

DAGCombine can eliminate the mask instruction in a lot of cases (if the value is in fact zero-extended). Also, the mask+neg is probably slightly more efficient than two shifts on most processors.

In D25485#568525, @zvi wrote:

Sanjay, can you please a RUN: and CHECK:'s for a 32-bit target?

Sure, that's:
https://reviews.llvm.org/rL284124

DAGCombine can eliminate the mask instruction in a lot of cases (if the value is in fact zero-extended). Also, the mask+neg is probably slightly more efficient than two shifts on most processors.

Yes, you're correct - thanks!

So I tried the SelectionDAGLegalize::ExpandNode suggestion, and I see one problem case: micromips.
I don't know micromips (cc'ing @sdardis and @dsanders), but it doesn't appear to have a negate instruction. If we legalize to and+negate, the code grows from something like:

sll	$1, $1, 31
jr	$ra
sra	$2, $1, 31

To:

andi16	$2, $2, 1
li16	$3, 0
subu16	$2, $3, $2
jrc	$ra

Given this potential regression, I'd like to proceed with the x86-only solution for now (I will add a TODO comment about making it more general). As noted in the initial summary, I have filed bugs for the PPC and ARM folks and linked to this patch, so they are aware of what is needed to pursue the common solution.

spatel updated this revision to Diff 74517.Oct 13 2016, 8:10 AM
spatel edited edge metadata.

Patch updated:
This is still an x86-only solution, but now we correctly mask the operand before negation. As Eli noted, the mask is optimized away when we know the operand's top bits are already zero (via zeroext on the input parameter in the test cases).

spatel added a subscriber: amehsan.Oct 13 2016, 8:18 AM

We don't have a negate instruction, we use "nor $dst, $src, $zero" instead. "nor" present in all our ISAs except mips16 which has an actual negate instruction.

Did that regression hit general MIPS code or just microMIPS? It's possible that we have a missing pattern for microMIPS in that case. I'll give this patch a whirl.

We don't have a negate instruction, we use "nor $dst, $src, $zero" instead. "nor" present in all our ISAs except mips16 which has an actual negate instruction.

Did that regression hit general MIPS code or just microMIPS? It's possible that we have a missing pattern for microMIPS in that case. I'll give this patch a whirl.

I only saw the regression for a RUN with the micromips attribute specified. Here are the tests that were affected by the LegalizeDAG patch which I'll attach here if you'd like to try it:

LLVM :: CodeGen/Mips/llvm-ir/add.ll
 LLVM :: CodeGen/Mips/llvm-ir/mul.ll
 LLVM :: CodeGen/Mips/llvm-ir/sdiv.ll
 LLVM :: CodeGen/Mips/llvm-ir/srem.ll
 LLVM :: CodeGen/Mips/llvm-ir/sub.ll
 LLVM :: CodeGen/Mips/llvm-ir/urem.ll
 LLVM :: CodeGen/Mips/select.ll
 LLVM :: CodeGen/SystemZ/branch-07.ll
 LLVM :: CodeGen/SystemZ/risbg-01.ll
 LLVM :: CodeGen/SystemZ/shift-10.ll
 LLVM :: CodeGen/X86/negate-i1.ll

patch for LegalizeDAG

We don't have a negate instruction, we use "nor $dst, $src, $zero" instead. "nor" present in all our ISAs except mips16 which has an actual negate instruction.

Did that regression hit general MIPS code or just microMIPS? It's possible that we have a missing pattern for microMIPS in that case. I'll give this patch a whirl.

Arithmetic and bitwise negation are equivalent in this case but just to mention it. The arithmetic negation is 'sub $dst, $zero, $src'. IIRC, GAS has a 'neg' alias for this but I don't think it's implemented in LLVM yet.

There are some neg<-> aliases in LLVM for MIPS which is what you're seeing in the produced assembly. The main reason for the different is it's not possible to use $zero with the 16 bit instructions.

As Daniel pointed out, you can use subtraction in that case. The produced assembly from that selection dag patch is actually better than what we're currently getting in terms of size as li16, subu16 are 16 bits each, the two constant shifts are 32bits each. The optimal case would be to use a 32bit microMIPS instruction with the register $zero (better than current in terms of size and # instructions). The new instruction pattern for microMIPS should be easy to fold away.

As Daniel pointed out, you can use subtraction in that case. The produced assembly from that selection dag patch is actually better than what we're currently getting in terms of size as li16, subu16 are 16 bits each, the two constant shifts are 32bits each. The optimal case would be to use a 32bit microMIPS instruction with the register $zero (better than current in terms of size and # instructions). The new instruction pattern for microMIPS should be easy to fold away.

Ah, so if the LegalizeDAG patch is an improvement anyway (if not always optimal) for MIPS, then I'll update this patch to use that along with all of the regression test updates.

Looking closer at the SystemZ regression test differences, we have more instructions with the LegalizeDAG patch, so I think that target is missing a pattern ( cc'ing @uweigand and @jonpa ).

Example - the current code for test 'f7' in test/CodeGen/SystemZ/branch-07.ll is:

cr  %r2, %r3
ipm %r0
afi %r0, -268435456
sra %r0, 31
srlg  %r1, %r3, 32
srlg  %r2, %r2, 32
cr  %r2, %r1
ipm %r1
afi %r1, -268435456
sra %r1, 31
sllg  %r2, %r1, 32
lr  %r2, %r0
br  %r14

With the LegalizeDAG patch, it becomes:

cr  %r2, %r3
ipm %r0
afi %r0, -268435456
srl %r0, 31
lcr %r0, %r0
srlg  %r1, %r3, 32
srlg  %r2, %r2, 32
cr  %r2, %r1
ipm %r1
afi %r1, -268435456
srl %r1, 31
lcr %r1, %r1
sllg  %r2, %r1, 32
lr  %r2, %r0
br  %r14

Looking closer at the SystemZ regression test differences, we have more instructions with the LegalizeDAG patch, so I think that target is missing a pattern ( cc'ing @uweigand and @jonpa ).

On 2nd thought, this is a missing combine for all targets:

define i32 @topbit(i32 %x) {
  %sra = ashr i32 %x, 31
  %neg = sub i32 0, %sra
  ret i32 %neg
}

Should be simplified to:

%neg = lshr i32 %x, 31

A patch for the missing DAGCombines of shifts was committed here:
rL284239
and improved:
rL284269

So now I'm trying for the universal (not just x86) fix and added tests for PR30660/30661:
rL284279
rL284280

spatel retitled this revision from [x86] use 'neg' for negation of bool to [DAG] optimize negation of bool.Oct 14 2016, 2:24 PM
spatel updated this object.
spatel updated this revision to Diff 74742.Oct 14 2016, 2:32 PM

Patch upated:
Use mask+negate universally via LegalizeDAG.

There are improvements in the tests for x86, PPC, ARM, and MIPS.
SystemZ looks neutral to me, but I have no experience with that target. I assume it would see the same improvement (optimize away the mask) if we added tests for that.
There's an extra 'negu' in the MIPS select test. I didn't check to see what is going on there.
Based on the earlier comments, I think we're ok with an added instruction in the microMIPS cases as long as there's a reduction in the code size.

zvi added inline comments.Oct 15 2016, 10:49 PM
test/CodeGen/X86/negate-i1.ll
72

This SAR is redundant. Does DAGCombine know that SAR(all_ones/allzeros) is redundant?

Yes, the SystemZ changes now look fine to me.

spatel added inline comments.Oct 17 2016, 8:13 AM
test/CodeGen/X86/negate-i1.ll
72

It knows sometimes, but of course it missed this one. I'll work on that patch now.

spatel updated this revision to Diff 74870.Oct 17 2016, 10:45 AM

Patch updated:
Rebased after rL284395 so we no longer have the unnecessary 'sar' in the x86-32 test as noted by Zvi.
The vector side of that fold needs more work and is currently up for review with D25685.

There's an extra 'negu' in the MIPS select test. I didn't check to see what is going on there.

This isn't extra. The existing CHECK lines don't include the sll/sra pair, so this is actually a win, not a regression.

Before:

mtc1	$5, $f1
mtc1	$6, $f2
sltu	$1, $zero, $4
sll	$1, $1, 31
sra	$1, $1, 31
mtc1	$1, $f0
jr	 $ra
sel.s	$f0, $f2, $f1

After:

sltu	$1, $zero, $4
negu	 $1, $1   <--- the 'and' mask was folded away, so we saved an instruction
mtc1	$5, $f1
mtc1	$6, $f2
mtc1	$1, $f0
jr	 $ra
sel.s	$f0, $f2, $f1

Based on the earlier comments, I think we're ok with an added instruction in the microMIPS cases as long as there's a reduction in the code size.

@sdardis / @dsanders : do you see any common folds that are missing based on the MIPS diffs? I don't think you want me trying any MIPS-specific hacks, so if there's a net win already, we should be ok to proceed?

zvi added a comment.Oct 17 2016, 12:56 PM

The changes to the legalizer + X86 tests LGTM. Thanks!

sdardis accepted this revision.Oct 17 2016, 2:37 PM
sdardis added a reviewer: sdardis.

LGTM, unless ARM/PPC backend maintainers want to jump in. Two comments inlined.

do you see any common folds that are missing based on the MIPS diffs?

I'm happy with codesize reduction in the microMIPS case, the missing fold/optimization case is subtraction by zero but that shouldn't hold this patch up. I'll deal with/work on the instruction count reduction change later.

Thanks,
Simon

test/CodeGen/Mips/llvm-ir/add.ll
48–49 ↗(On Diff #74870)

Add a comment above here along the lines of:

; FIXME: This code sequence is inefficient as it should be 'subu $[[T0]], $zero, $[[T0]'. This sequence is even better as it's a single instruction. See D25485 for the rest of the cases where this sequence occurs.
test/CodeGen/Mips/llvm-ir/mul.ll
23–26 ↗(On Diff #74742)

Unnecessary change.

This revision is now accepted and ready to land.Oct 17 2016, 2:37 PM

Thanks, Simon. I'll make the suggested changes and get this in.

Note that the MIPS tests for sdiv/urem/srem with i1 values *are* universal folds (can't divide by zero, so these disappear?), but I'm wondering how those patterns would appear in the backend. We do have IR-level folds for those in InstSimplify.

Thanks.

Looking at the existing test output for division by i1 leads me to believe they're correct but perhaps not optimal in the sense of "division by zero => undefined behaviour". Taking that view, division of an i1 by an i1 should yield the numerator always, as for MIPS division by zero yields an undefined result and hence division by an i1 should be folded away (unless -mcheck-zero-division is active).

I'll investigate that issue in a bit, but it's an optimization issue, not a correctness issue.

Thanks,
Simon

amehsan added inline comments.Oct 17 2016, 4:11 PM
test/CodeGen/PowerPC/negate-i1.ll
2–4 ↗(On Diff #74870)

please add -verify-machineinstrs to RUN command line. We make sure that we add it to all tests in our backend. Also -mtriple=powerpc64le-unknown-linux-gnu is a more common triple.

hfinkel accepted this revision.Oct 17 2016, 4:40 PM
hfinkel edited edge metadata.

LGTM, unless ARM/PPC backend maintainers want to jump in. Two comments inlined.

LGTM too.

amehsan added inline comments.Oct 17 2016, 5:03 PM
test/CodeGen/PowerPC/negate-i1.ll
2–4 ↗(On Diff #74870)

(I believe you just created and committed this in r284279. That is why I am asking to make a change in a place that is not modified in this patch :)

spatel added inline comments.Oct 17 2016, 5:21 PM
test/CodeGen/PowerPC/negate-i1.ll
2–4 ↗(On Diff #74870)

Correct - I added the test, so we could close PR30661 when this patch is committed.

Note that the comment on line 1 gives away why I used the apple-darwin triple; I thought that would be an easier regex hack while adapting the script that we use for x86 auto-generation of CHECK lines. :)

Thank you for pointing out the improvements - I'll fix these up.

spatel updated this revision to Diff 75063.Oct 18 2016, 1:09 PM
spatel edited edge metadata.
spatel marked 3 inline comments as done.

Patch updated:

  1. Added FIXME comment to Mips test file to note optimization opportunity.
  2. Removed inadvertent changes to RUN lines in Mips test file.
  3. Added -verify-machineinstrs option for PPC test.
  4. Changed PPC test triple to powerpc64le-unknown-linux-gnu.

I'll let this sit a bit in case there is any feedback from the ARM coders.

This revision was automatically updated to reflect the committed changes.