This is an archive of the discontinued LLVM Phabricator instance.

X86: More efficient codegen for 64-bit compares on 32-bit target
ClosedPublic

Authored by hans on Nov 9 2015, 4:11 AM.

Details

Summary

This patch changes the lowering of 64-bit integer compares on 32-bit x86 to be more efficient.

Example:

define i32 @test_slt(i64 %a, i64 %b) {
entry:
  %cmp = icmp slt i64 %a, %b
  br i1 %cmp, label %bb1, label %bb2
bb1:
  ret i32 1
bb2:
  ret i32 2
}

Before this patch:

test_slt:
        movl    4(%esp), %eax
        movl    8(%esp), %ecx
        cmpl    12(%esp), %eax
        setae   %al
        cmpl    16(%esp), %ecx
        setge   %cl
        je      .LBB2_2
        movb    %cl, %al
.LBB2_2:
        testb   %al, %al
        jne     .LBB2_4
        movl    $1, %eax
        retl
.LBB2_4:
        movl    $2, %eax
        retl

After this patch:

test_slt:
        movl    4(%esp), %eax
        movl    8(%esp), %ecx
        cmpl    12(%esp), %eax
        sbbl    16(%esp), %ecx
        jge     .LBB1_2
        movl    $1, %eax
        retl
.LBB1_2:
        movl    $2, %eax
        retl

On a 32-bit Clang bootstrap, this results in 19 KB binary size reduction.

Diff Detail

Event Timeline

hans updated this revision to Diff 39675.Nov 9 2015, 4:11 AM
hans retitled this revision from to X86: More efficient codegen for 64-bit compare-and-branch .
hans updated this object.
hans added reviewers: mkuper, majnemer.
hans added subscribers: llvm-commits, hansw.
mkuper edited edge metadata.Nov 9 2015, 5:36 AM

Hi Hans,

For the eq version, I find it a bit surprising that the new code is faster, but if benchmarksing says it is, who am I to argue. Adding Dave to the review for another opinion.

For the lt version, the new code definitely looks better than what we had before. It looks like there's another option though, which is more similar in spirit to the old code than to the new, but looks much nicer. This is what ICC produces:

test(long long, long long):
        movl      4(%esp), %eax 
        subl      12(%esp), %eax
        movl      8(%esp), %edx 
        sbbl      16(%esp), %edx
        jge       ..B1.3        
        movl      $1, %eax      
        ret                     
..B1.3:                         
        movl      $2, %eax      
        ret

Do you think it may be worth lowering the new pseudo to that, instead of the proposed sequence?

Michael

lib/Target/X86/X86ISelLowering.cpp
15235

I'm not a huge fan of this, but producing the pseudo in a target-specific pre-legalization DAG combine sounds like it may cause too many problems due to making the comparison opaque to other early combines.

21275

Why is this guaranteed?

hans added a comment.Nov 9 2015, 7:00 AM

Thanks for the quick reply!

test(long long, long long):
        movl      4(%esp), %eax 
        subl      12(%esp), %eax
        movl      8(%esp), %edx 
        sbbl      16(%esp), %edx
        jge       ..B1.3        
        movl      $1, %eax      
        ret                     
..B1.3:                         
        movl      $2, %eax      
        ret

Do you think it may be worth lowering the new pseudo to that, instead of the proposed sequence?

Ooh, that looks very nice indeed. Thinking about cmp as a subtraction, this wide subtraction seems like a natural way to do it. I wonder why MSVC doesn't do it this way..

By the way, is there a reason ICC is using %edx instead of %eax for the second move and sbb? It seems to me that the register pressure here should be the same as for my proposed sequence, i.e. only one register needed.

Maybe we could also use this for the equality comparisons?

I'm mainly coming at this from a binary size perspective. Your suggestion is 4 bytes shorter than my test_slt. It's also 2 bytes shorter than my test_eq, so I like it :-)

mkuper added a comment.Nov 9 2015, 8:00 AM

ICC doesn't do this for equality comparisons.
In fact, it produces code that's very close to what we currently have in clang (before your patch).

As to eax/edx - I can't see a good reason for that. Doesn't necessarily mean there isn't one, of course, just that I can't see one...

DavidKreitzer edited edge metadata.Nov 9 2015, 9:40 AM

(1) The subl/sbbl sequence cannot distinguish between greater-than and equal-to, so it doesn't work for ==/!= without an additional instruction at which point, it's no better than the current sequence. This also means you have to be careful about how you order the operands depending on the condition. For (a < b), you'd generate what Michael showed:

test(long long, long long):
        movl      4(%esp), %eax 
        subl      12(%esp), %eax
        movl      8(%esp), %edx 
        sbbl      16(%esp), %edx
        jge       ..B1.3        
        movl      $1, %eax      
        ret                     
..B1.3:                         
        movl      $2, %eax      
        ret

For (a >= b), you can simply replace the "jge" with "jl". For (a > b), you have to reverse the sense of the subtraction like this:

test(long long, long long):
        movl     12(%esp), %eax 
        subl      4(%esp), %eax
        movl      16(%esp), %edx 
        sbbl      8(%esp), %edx
        jge       ..B1.3        
        movl      $1, %eax      
        ret                     
..B1.3:                         
        movl      $2, %eax      
        ret

(a <= b) would have the same operand order but with "jl" instead of "jge".

(2) I think there is no clear winner between the current 1-branch implementation of ==/!= and the proposed 2-branch implementation. The branch prediction effect will be data dependent, and the context will determine whether the extra branch or the longer dependence chain of the current implementation is more harmful. For example, one situation where the 1-branch implementation will shine is when the compare operands almost always compare unequal, but the lower bits often compare equal.

(3) ICC uses both eax and edx to give the post-RA scheduler more flexibility, which is sometimes useful, sometimes not. (In Michael's code snippet, it clearly didn't accomplish anything.)

hans added a comment.Nov 10 2015, 8:50 AM

David, Michael: thank you very much for your input. Not only is the code you suggest better, but it should be easier to generate too since it doesn't change the cfg.

I've been trying to make this work today, but am not sure exactly where to put it. The approach of pattern-matching for "(setcc (or (xor hi1 hi2) (xor lo1 lo2)) 0 {eq,ne})" is brittle because some nodes can get "simplified" e.g .(setcc a 0 ugt) will become (setcc a 0 ne). We could of course try to unsimplify it but I wonder if there's a better way.

Michael mentioned maybe we could do a target-specific DAG-combine. But that would need to be done before legalization, so the types would be illegal. Is it OK to legalize them in a combine? I don't think the combine has access to the logic for expanding integers etc?

Can I make this a target-specific legalization somehow? Where would I do that?

Or, could we make this a generic thing? We have SUBE and SUBC that would let us do subtract with borrow, but I don't think we can pick up the flags from that. Is there some way we could represent this in the generic selection dag?

I'll keep working on this tomorrow, just wanted to put down my notes in case you have any comments. The backend is still new territory for me.

hans updated this revision to Diff 40020.Nov 12 2015, 3:13 AM
hans retitled this revision from X86: More efficient codegen for 64-bit compare-and-branch to X86: More efficient codegen for 64-bit compares on 32-bit target .
hans updated this object.
hans edited edge metadata.

Here is a new version of the patch, using the SUB-SBB method.

I'm trying a new approach, adding a custom SETCC_PARTS node. This avoids the trouble of pattern-matching, allows this to fire in more situations (e.g. not just branches, but all setcc's), and would makes it easier to do the same thing for other targets.

I did try to come up with a way to make this more generic, using ISD::SUBC and SUBE, but couldn't figure out how to detect the outcome based on the result from SUBE. For unsigned comparisons we can just compare it against zero (a comparison that will get folded away nicely), but what to do for unsigned?

I've removed the change to equality comparisons. It might still be worth looking at some more since it lowers the register pressure (maybe use when optimizing for size?).

DavidKreitzer added inline comments.Nov 12 2015, 3:19 PM
include/llvm/CodeGen/ISDOpcodes.h
375 ↗(On Diff #40020)

wit --> with

378 ↗(On Diff #40020)

I think the operand ordering ought to be [LHSLo, LHSHi, RHSLo, RHSHi] to be consistent with SHL_PARTS, et al.

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2685 ↗(On Diff #40020)

Should probably delete the extra blank line.

2749 ↗(On Diff #40020)

This looks reasonable to me.

I am just wondering whether it would be better to refactor this such that we always generate SETCC_PARTS here regardless of whether the target has a custom lowering and then implement a default lowering that does most of what this routine is currently doing.

lib/Target/X86/X86ISelLowering.cpp
14545

Your operand ordering for SETCC_PARTS doesn't match your comment in ISDOpcodes.h. The two should at least be in sync, and I already gave my opinion on what the operand order ought to be. :-)

14575

The code sequences you are generating here look great!

test/CodeGen/X86/wide-integer-cmp.ll
39

I think you meant LHSHi here.

61

Again, this looks odd. Did you mean RHSHi here and at line 62?

80

Also RHSHi here and at line 81.

hans marked 5 inline comments as done.Nov 13 2015, 12:59 AM
hans added inline comments.
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2749 ↗(On Diff #40020)

Can you provide an example of how always generating SETCC_PARTS and having a default lowering would help? One problem I see is that we'd need to legalize it too.

For example a 128-bit SETCC could be legalized to 64-bit SETCC_PARTS, and on a 32-bit target we'd need to legalize that too. With the current patch, we only legalize SETCC to SETCC_PARTS for types the target has said it can efficiently lower.

lib/Target/X86/X86ISelLowering.cpp
14545

Oops. Thanks for catching this. So much for trying to document things :-) I've updated the code to use the order you suggested.

14575

Thanks to you and Michael for suggesting it :-)

test/CodeGen/X86/wide-integer-cmp.ll
80

Thanks! Not sure how I managed to get them so backwards.

hans updated this revision to Diff 40115.Nov 13 2015, 12:59 AM

Addressing David's comments.

DavidKreitzer added inline comments.Nov 13 2015, 12:43 PM
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2749 ↗(On Diff #40020)

I didn't have any particular example in mind. I was just thinking about alternative design choices. The current SETCC_PARTS design seems like it is putting a little too much target-specific knowledge in this target-independent piece of type legalization.

Another option you might like that is in line with your earlier thinking, i.e.

I did try to come up with a way to make this more generic, using ISD::SUBC and SUBE

is to define a new SETCCE opcode that behaves just like SETCC except that it has an additional borrow operand. So the sequence you'd generate during type legalization is

SUBC LHSLo, RHSLo
SETCCE LHSHi, RHSHi, <SUBC borrow>, <cond>

And then the SETCCE would be implemented using SBB+SETcc rather than CMP+SETcc as SETCC is currently implemented.

It is probably easier and/or more natural to extend SETCCE to other targets than SETCC_PARTS.

What do you think?

DavidKreitzer added inline comments.Nov 13 2015, 4:59 PM
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2748 ↗(On Diff #40115)

Another advantage of the SETCCE approach is that it can be used to implement arbitrary precision integer compares. For example, an __int128 compare on x86 could be legalized to

SUBC
SUBE
SUBE
SETCCE

hans updated this revision to Diff 40453.Nov 17 2015, 4:46 PM

Apologies for the delay.

Here is a version using the proposed SETCCE approach. I think this turned out really well, and of course the code for e.g. 128-bit compares is now very nice.

The only downside I can think of is that this gives targets less flexibility in lowering. For example, with SETCC_PARTS a target could choose to lower with a three-way branch on the high part if it wanted too. On the other hand, this flexibility meant more work; the SETCCE node should be very straight-forward for targets to handle, and it handles the cases with more than two halves nicely so I think it's the better option.

Please take a look and let me know what you think.

I really like the way this turned out! Thanks so much for doing the rework. I just had a couple minor additional suggestions.

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2772 ↗(On Diff #40453)

I would expect every target to be able to support this (at which point, much of the code in this routine can be removed). Maybe it would be a good idea to add a FIXME comment to that effect.

lib/Target/X86/X86ISelLowering.cpp
14548

I would recommend against this restriction. The SETCCE operation itself makes perfect sense for other conditions. (For example, SETCCE "eq" computes op0 - op1 - carry == 0.) For the purposes of large integer compare lowering, it happens to only be useful for < and >=, but we might find other good uses for it. (I'm imagining some fancy DAG combine optimizations ...)

test/CodeGen/X86/wide-integer-cmp.ll
87

Nice! This is SO much better than the current code!

hans marked an inline comment as done.Nov 18 2015, 9:34 AM
hans added inline comments.
lib/Target/X86/X86ISelLowering.cpp
14548

Sounds good to me.

hans updated this revision to Diff 40526.Nov 18 2015, 9:34 AM

Addressing review comments.

I also ran into a problem when the carry value gets folded to CARRY_FALSE. Added code to handle that and a test.

Please take a look.

DavidKreitzer added inline comments.Nov 19 2015, 7:32 AM
lib/Target/X86/X86ISelLowering.cpp
14547

I like the idea of leveraging TranslateX86CC, but I believe these two optimizations at the beginning of TranslateX86CC are invalid for SETCCE and somehow need to be avoided for it:

// X > -1   -> X == 0, jump !sign.
// X < 0   -> X == 0, jump on sign.

They are invalid in the one special case where X is 0x80000000 and the carry is true. FWIW, this optimization *is* valid:

// X < 1   -> X <= 0
hans added inline comments.Nov 19 2015, 7:40 AM
lib/Target/X86/X86ISelLowering.cpp
14547

Thanks for catching that!

hans updated this revision to Diff 40647.Nov 19 2015, 7:41 AM

Breaking out the actual operation translation part from TranslateX86CC and use that instead.

I like it! Everything looks good to me now.

This revision was automatically updated to reflect the committed changes.