This is an archive of the discontinued LLVM Phabricator instance.

[x86] use SETCC_CARRY instead of SBB node for select lowering
ClosedPublic

Authored by spatel on Jan 7 2022, 5:28 AM.

Details

Summary

This is a suggested follow-up to D116765 (and diffs are on top of that patch). This removes a clear of the register operand, so it is better for code size, but it does potentially create a false register dependency on surrounding code.

The asm results match what I was expecting, but I'm not exactly sure how this works. Given that the node is called SETCC_CARRY and documented as:

// Same as SETCC except it's materialized with a sbb and the value is all
// one's or all zero's.
SETCC_CARRY, // R = carry_bit ? ~0 : 0

...why does it require a parameter for X86::COND_B? I changed that parameter in an experimental patch, and it didn't alter the asm.

Diff Detail

Event Timeline

spatel created this revision.Jan 7 2022, 5:28 AM
spatel requested review of this revision.Jan 7 2022, 5:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 7 2022, 5:28 AM

why does it require a parameter for X86::COND_B? I changed that parameter in an experimental patch, and it didn't alter the asm.

I believe there is some code in lowering that uses it. I think it made it easier to handle it along with X86ISD::SETCC if it has a condition code operand in the same place.

Are we concerned about the false dependency on Intel CPUs? I tried checking uops.info to see if I could tell what CPUs have it, but I couldn’t tell.

spatel added a comment.Jan 7 2022, 9:17 AM

Are we concerned about the false dependency on Intel CPUs? I tried checking uops.info to see if I could tell what CPUs have it, but I couldn’t tell.

IMO, this is the right transform at this stage of compiling, but I don't have a strong opinion either way. The bug report ( https://github.com/llvm/llvm-project/issues/53006 ) doesn't seem to expect a zero op.

This revision is now accepted and ready to land.Jan 7 2022, 9:26 AM
This revision was landed with ongoing or failed builds.Jan 9 2022, 4:00 AM
This revision was automatically updated to reflect the committed changes.

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, we noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

According to Agner Fog's manual (https://www.agner.org/optimize/microarchitecture.pdf), none of the Intel architectures recognize sbb with same operands as dependence breaking.
So, it seems that the false dependence issue already discussed in the previous comments might not be worth the code reduction gained by removing the clearing of the register, at least for Intel architectures.

RKSimon added a comment.EditedFeb 1 2022, 3:06 AM

It'd be good to at least keep this on AMD targets (bulldozer/zen/bobcat/jaguar) - these recognise the SBB(x,x) dependency-breaking behaviour. It could be done either as a tuning flag, or via the scheduler model's isDependencyBreaking() check if there's a way to access it.

spatel added a comment.Feb 1 2022, 7:52 AM

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

Do you have the IR for the function where that appears?
I haven't looked at the mechanics of BreakFalseDeps and X86InstrInfo::getUndefRegClearance() in a long time, and I'm not sure if they would handle a pattern like that (the load of %rax may already be providing the expected clearance?).
If we're going to need a tuning flag similar to TuningPOPCNTFalseDep, then we could just use that as a predicate hack for the code that was changed in this patch (so don't take chances and always create a real SBB with zero operand if the flag is set).

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

Do you have the IR for the function where that appears?

Here is a source code example with clang-trunk generated IR and assembly (https://godbolt.org/z/v66TM8W7e) that resembles the affected code and reproduces the aforementioned sequence of instructions.
Note the dependence chain, which is not broken on Intel targets, through the following instructions: callq foo1(long*, long*, y_s*) -> movq 8(%rbx), %rax -> sbbq %rax, %rax -> orq %rdx, %rax -> callq foo2(int*, int, int, int, int, int, y_s*, long, long)

If we're going to need a tuning flag similar to TuningPOPCNTFalseDep, then we could just use that as a predicate hack for the code that was changed in this patch (so don't take chances and always create a real SBB with zero operand if the flag is set).

This sounds okay to me.

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

Do you have the IR for the function where that appears?

Here is a source code example with clang-trunk generated IR and assembly (https://godbolt.org/z/v66TM8W7e) that resembles the affected code and reproduces the aforementioned sequence of instructions.
Notice the non-broken (for Intel targets) dependence chain including the following instructions: callq foo1(long*, long*, y_s*) -> movq 8(%rbx), %rax -> sbbq %rax, %rax -> orq %rdx, %rax -> callq foo2(int*, int, int, int, int, int, y_s*, long, long)

If we're going to need a tuning flag similar to TuningPOPCNTFalseDep, then we could just use that as a predicate hack for the code that was changed in this patch (so don't take chances and always create a real SBB with zero operand if the flag is set).

This sounds okay to me.

Rather than modifying the code touched by this patch with a new tuning flag, should we do it in the X86ISelDAGToDAG.cpp where SETCC_CARRY is selected?

spatel added a comment.Feb 2 2022, 8:00 AM

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

Do you have the IR for the function where that appears?

Here is a source code example with clang-trunk generated IR and assembly (https://godbolt.org/z/v66TM8W7e) that resembles the affected code and reproduces the aforementioned sequence of instructions.
Notice the non-broken (for Intel targets) dependence chain including the following instructions: callq foo1(long*, long*, y_s*) -> movq 8(%rbx), %rax -> sbbq %rax, %rax -> orq %rdx, %rax -> callq foo2(int*, int, int, int, int, int, y_s*, long, long)

If we're going to need a tuning flag similar to TuningPOPCNTFalseDep, then we could just use that as a predicate hack for the code that was changed in this patch (so don't take chances and always create a real SBB with zero operand if the flag is set).

This sounds okay to me.

Rather than modifying the code touched by this patch with a new tuning flag, should we do it in the X86ISelDAGToDAG.cpp where SETCC_CARRY is selected?

Yes, good point. I suspect it would be hard to come up with a test where there's a difference, but the later we make the conversion, the better.
I did look at BreakFalseDeps a bit, and I don't think it can work for this case because we convert the X86ISD::SETCC_CARRY into a X86::SETB_C32r pseudo op, and that has no general register operands for BreakFalseDeps to detect.

We noticed a 15% performance regression for llvm_test_suite/MultiSource/Benchmarks/MallocBench/gs with this patch.
Looking at the assembly, noticed a relatively long dependence chain including the following sequence:

mov   (%rbx),%rax 
sbb   %rax,%rax
or    %rdx,%rax

Do you have the IR for the function where that appears?

Here is a source code example with clang-trunk generated IR and assembly (https://godbolt.org/z/v66TM8W7e) that resembles the affected code and reproduces the aforementioned sequence of instructions.
Notice the non-broken (for Intel targets) dependence chain including the following instructions: callq foo1(long*, long*, y_s*) -> movq 8(%rbx), %rax -> sbbq %rax, %rax -> orq %rdx, %rax -> callq foo2(int*, int, int, int, int, int, y_s*, long, long)

If we're going to need a tuning flag similar to TuningPOPCNTFalseDep, then we could just use that as a predicate hack for the code that was changed in this patch (so don't take chances and always create a real SBB with zero operand if the flag is set).

This sounds okay to me.

Rather than modifying the code touched by this patch with a new tuning flag, should we do it in the X86ISelDAGToDAG.cpp where SETCC_CARRY is selected?

Yes, good point. I suspect it would be hard to come up with a test where there's a difference, but the later we make the conversion, the better.
I did look at BreakFalseDeps a bit, and I don't think it can work for this case because we convert the X86ISD::SETCC_CARRY into a X86::SETB_C32r pseudo op, and that has no general register operands for BreakFalseDeps to detect.

Won't SETB_C32r be turned into SBB32rr by expandPostRAPseudo before we get to BreakFalseDeps? But I don't think BreakFalseDeps is prepared to deal with a tied dest/src that's also used by another source.

spatel added a comment.Feb 2 2022, 8:37 AM

Won't SETB_C32r be turned into SBB32rr by expandPostRAPseudo before we get to BreakFalseDeps? But I don't think BreakFalseDeps is prepared to deal with a tied dest/src that's also used by another source.

Ah, yes - it does become SBB before we run BreakFalseDeps. I had my breakpoints mixed up and missed that. But a quick hack shows the limitation you're suggesting; we insert the xor to clear the reg between the cmp and the sbb user, so that can't work.

skan added a subscriber: skan.Mar 9 2022, 6:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2022, 6:03 AM