Use mask and and negate for legalization of i1 source type with SIGN_EXTEND_INREG.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Can we do this in SelectionDAGLegalize::ExpandNode instead? I suppose in theory some platform might prefer to lower this using shifts, but I can't think of any off the top of my head.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
16343 โ | (On Diff #74289) | Given "SIGN_EXTEND_INREG(X, i1)", you can transform it to "-(X&1)". But you can't assume the input is zero-extended. I think your patch misbehaves for the following testcase: define void @_Z1fbi(i1 zeroext %a, i32 %b, i8* %p) local_unnamed_addr #0 { entry: %conv = trunc i32 %b to i1 %or = or i1 %a, %conv %s = sext i1 %or to i8 store i8 %s, i8* %p ret void } |
Would we still need to do the 'and' masking op though? In that case, there's probably no win?
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
16343 โ | (On Diff #74289) | Ah, I misunderstood the meaning/guarantee of the source value type of ISD::SIGN_EXTEND_INREG. orb %dil, %sil shlb $7, %sil sarb $7, %sil movb %sil, (%rdx) But with this patch: orb %dil, %sil negb %sil movb %sil, (%rdx) And that's wrong if any of the higher bits of %b / %sil are set. |
DAGCombine can eliminate the mask instruction in a lot of cases (if the value is in fact zero-extended). Also, the mask+neg is probably slightly more efficient than two shifts on most processors.
Yes, you're correct - thanks!
So I tried the SelectionDAGLegalize::ExpandNode suggestion, and I see one problem case: micromips.
I don't know micromips (cc'ing @sdardis and @dsanders), but it doesn't appear to have a negate instruction. If we legalize to and+negate, the code grows from something like:
sll $1, $1, 31 jr $ra sra $2, $1, 31
To:
andi16 $2, $2, 1 li16 $3, 0 subu16 $2, $3, $2 jrc $ra
Given this potential regression, I'd like to proceed with the x86-only solution for now (I will add a TODO comment about making it more general). As noted in the initial summary, I have filed bugs for the PPC and ARM folks and linked to this patch, so they are aware of what is needed to pursue the common solution.
Patch updated:
This is still an x86-only solution, but now we correctly mask the operand before negation. As Eli noted, the mask is optimized away when we know the operand's top bits are already zero (via zeroext on the input parameter in the test cases).
We don't have a negate instruction, we use "nor $dst, $src, $zero" instead. "nor" present in all our ISAs except mips16 which has an actual negate instruction.
Did that regression hit general MIPS code or just microMIPS? It's possible that we have a missing pattern for microMIPS in that case. I'll give this patch a whirl.
I only saw the regression for a RUN with the micromips attribute specified. Here are the tests that were affected by the LegalizeDAG patch which I'll attach here if you'd like to try it:
LLVM :: CodeGen/Mips/llvm-ir/add.ll LLVM :: CodeGen/Mips/llvm-ir/mul.ll LLVM :: CodeGen/Mips/llvm-ir/sdiv.ll LLVM :: CodeGen/Mips/llvm-ir/srem.ll LLVM :: CodeGen/Mips/llvm-ir/sub.ll LLVM :: CodeGen/Mips/llvm-ir/urem.ll LLVM :: CodeGen/Mips/select.ll LLVM :: CodeGen/SystemZ/branch-07.ll LLVM :: CodeGen/SystemZ/risbg-01.ll LLVM :: CodeGen/SystemZ/shift-10.ll LLVM :: CodeGen/X86/negate-i1.ll
Arithmetic and bitwise negation are equivalent in this case but just to mention it. The arithmetic negation is 'sub $dst, $zero, $src'. IIRC, GAS has a 'neg' alias for this but I don't think it's implemented in LLVM yet.
There are some neg<-> aliases in LLVM for MIPS which is what you're seeing in the produced assembly. The main reason for the different is it's not possible to use $zero with the 16 bit instructions.
As Daniel pointed out, you can use subtraction in that case. The produced assembly from that selection dag patch is actually better than what we're currently getting in terms of size as li16, subu16 are 16 bits each, the two constant shifts are 32bits each. The optimal case would be to use a 32bit microMIPS instruction with the register $zero (better than current in terms of size and # instructions). The new instruction pattern for microMIPS should be easy to fold away.
Ah, so if the LegalizeDAG patch is an improvement anyway (if not always optimal) for MIPS, then I'll update this patch to use that along with all of the regression test updates.
Looking closer at the SystemZ regression test differences, we have more instructions with the LegalizeDAG patch, so I think that target is missing a pattern ( cc'ing @uweigand and @jonpa ).
Example - the current code for test 'f7' in test/CodeGen/SystemZ/branch-07.ll is:
cr %r2, %r3 ipm %r0 afi %r0, -268435456 sra %r0, 31 srlg %r1, %r3, 32 srlg %r2, %r2, 32 cr %r2, %r1 ipm %r1 afi %r1, -268435456 sra %r1, 31 sllg %r2, %r1, 32 lr %r2, %r0 br %r14
With the LegalizeDAG patch, it becomes:
cr %r2, %r3 ipm %r0 afi %r0, -268435456 srl %r0, 31 lcr %r0, %r0 srlg %r1, %r3, 32 srlg %r2, %r2, 32 cr %r2, %r1 ipm %r1 afi %r1, -268435456 srl %r1, 31 lcr %r1, %r1 sllg %r2, %r1, 32 lr %r2, %r0 br %r14
On 2nd thought, this is a missing combine for all targets:
define i32 @topbit(i32 %x) { %sra = ashr i32 %x, 31 %neg = sub i32 0, %sra ret i32 %neg }
Should be simplified to:
%neg = lshr i32 %x, 31
Patch upated:
Use mask+negate universally via LegalizeDAG.
There are improvements in the tests for x86, PPC, ARM, and MIPS.
SystemZ looks neutral to me, but I have no experience with that target. I assume it would see the same improvement (optimize away the mask) if we added tests for that.
There's an extra 'negu' in the MIPS select test. I didn't check to see what is going on there.
Based on the earlier comments, I think we're ok with an added instruction in the microMIPS cases as long as there's a reduction in the code size.
test/CodeGen/X86/negate-i1.ll | ||
---|---|---|
131 โ | (On Diff #74742) | This SAR is redundant. Does DAGCombine know that SAR(all_ones/allzeros) is redundant? |
test/CodeGen/X86/negate-i1.ll | ||
---|---|---|
131 โ | (On Diff #74742) | It knows sometimes, but of course it missed this one. I'll work on that patch now. |
This isn't extra. The existing CHECK lines don't include the sll/sra pair, so this is actually a win, not a regression.
Before:
mtc1 $5, $f1 mtc1 $6, $f2 sltu $1, $zero, $4 sll $1, $1, 31 sra $1, $1, 31 mtc1 $1, $f0 jr $ra sel.s $f0, $f2, $f1
After:
sltu $1, $zero, $4 negu $1, $1 <--- the 'and' mask was folded away, so we saved an instruction mtc1 $5, $f1 mtc1 $6, $f2 mtc1 $1, $f0 jr $ra sel.s $f0, $f2, $f1
Based on the earlier comments, I think we're ok with an added instruction in the microMIPS cases as long as there's a reduction in the code size.
@sdardis / @dsanders : do you see any common folds that are missing based on the MIPS diffs? I don't think you want me trying any MIPS-specific hacks, so if there's a net win already, we should be ok to proceed?
LGTM, unless ARM/PPC backend maintainers want to jump in. Two comments inlined.
do you see any common folds that are missing based on the MIPS diffs?
I'm happy with codesize reduction in the microMIPS case, the missing fold/optimization case is subtraction by zero but that shouldn't hold this patch up. I'll deal with/work on the instruction count reduction change later.
Thanks,
Simon
test/CodeGen/Mips/llvm-ir/add.ll | ||
---|---|---|
48โ49 โ | (On Diff #74870) | Add a comment above here along the lines of: ; FIXME: This code sequence is inefficient as it should be 'subu $[[T0]], $zero, $[[T0]'. This sequence is even better as it's a single instruction. See D25485 for the rest of the cases where this sequence occurs. |
test/CodeGen/Mips/llvm-ir/mul.ll | ||
23โ26 โ | (On Diff #74742) | Unnecessary change. |
Thanks, Simon. I'll make the suggested changes and get this in.
Note that the MIPS tests for sdiv/urem/srem with i1 values *are* universal folds (can't divide by zero, so these disappear?), but I'm wondering how those patterns would appear in the backend. We do have IR-level folds for those in InstSimplify.
Thanks.
Looking at the existing test output for division by i1 leads me to believe they're correct but perhaps not optimal in the sense of "division by zero => undefined behaviour". Taking that view, division of an i1 by an i1 should yield the numerator always, as for MIPS division by zero yields an undefined result and hence division by an i1 should be folded away (unless -mcheck-zero-division is active).
I'll investigate that issue in a bit, but it's an optimization issue, not a correctness issue.
Thanks,
Simon
test/CodeGen/PowerPC/negate-i1.ll | ||
---|---|---|
2โ4 โ | (On Diff #74870) | please add -verify-machineinstrs to RUN command line. We make sure that we add it to all tests in our backend. Also -mtriple=powerpc64le-unknown-linux-gnu is a more common triple. |
LGTM, unless ARM/PPC backend maintainers want to jump in. Two comments inlined.
LGTM too.
test/CodeGen/PowerPC/negate-i1.ll | ||
---|---|---|
2โ4 โ | (On Diff #74870) | (I believe you just created and committed this in r284279. That is why I am asking to make a change in a place that is not modified in this patch :) |
test/CodeGen/PowerPC/negate-i1.ll | ||
---|---|---|
2โ4 โ | (On Diff #74870) | Correct - I added the test, so we could close PR30661 when this patch is committed. Note that the comment on line 1 gives away why I used the apple-darwin triple; I thought that would be an easier regex hack while adapting the script that we use for x86 auto-generation of CHECK lines. :) Thank you for pointing out the improvements - I'll fix these up. |
Patch updated:
- Added FIXME comment to Mips test file to note optimization opportunity.
- Removed inadvertent changes to RUN lines in Mips test file.
- Added -verify-machineinstrs option for PPC test.
- Changed PPC test triple to powerpc64le-unknown-linux-gnu.
I'll let this sit a bit in case there is any feedback from the ARM coders.