
[X86] Transform setcc + movzbl into xorl + setcc
ClosedPublic

Authored by mkuper on Jun 27 2016, 4:50 PM.

Details

Summary

xorl + setcc is generally the preferred sequence, because setcc + movzbl suffers from a partial register stall. The xorl + setcc encoding is also one byte smaller. llvm.org/PR28146 and the other associated PRs have more details on this.
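For illustration (not taken from the patch itself), the two sequences for materializing a flag into a 32-bit register look like this:

```
# Old sequence: setcc writes only the low byte, so the movzbl reads a
# partially-written register and can suffer a partial-register stall:
	sete	%al
	movzbl	%al, %eax

# Preferred sequence: zero the full register first (xorl is a
# dependency-breaking zeroing idiom), then write the low byte. Note the
# xorl must be placed before the flags-setting instruction, since it
# clobbers EFLAGS:
	xorl	%eax, %eax
	sete	%al
```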

Unfortunately, this cannot be handled in DAG ISel, because of how X86 SETCC is modeled. And changing the way SETCC is modeled does not seem too attractive. So, we need to clean this up post-ISel. Dave Kreitzer suggested using pseudos to represent a zexted setcc, which is cleaner in some sense, but dirtier in others; on balance, I think I prefer this patch. Dave, if you (or anyone else) feel strongly that we should be using pseudos - or some other solution - instead, we can continue the discussion here.

Note that this is not a win in 100% of cases - for example, some pcmpstri test cases below get pessimized. This happens because the register allocator is over-constrained by the extended value being forced into %eax. We should not, however, see this sort of thing inside hot loops. Suggestions on how to resolve this are welcome, although I don't believe it should block this patch.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 62039.Jun 27 2016, 4:50 PM
mkuper retitled this revision from to [X86] Transform setcc + movzbl into xorl + setcc.
mkuper updated this object.
mkuper added a subscriber: llvm-commits.
delena added inline comments.Jun 27 2016, 11:02 PM
lib/Target/X86/X86FixupSetCC.cpp
119 ↗(On Diff #62039)

When you go backward, you should ensure that MI.getOperand(0).getReg() is not used between FlagsDefMI and the setcc, including by FlagsDefMI itself.

for example:

test %eax, %eax
setcc %al

the inserted xor could kill %eax before the "test".

ab added a subscriber: ab.Jun 28 2016, 8:37 AM

Have you considered doing this in DAGToDAG? I'm not sure you'll be able to model the imp-use of EFLAGS (IIRC that's also a longstanding deficiency of pattern matching: we can't turn explicit into implicit), but seems worth a try?

lib/Target/X86/CMakeLists.txt
20 ↗(On Diff #62039)

Go ahead and reorder separately?

hans added a subscriber: hans.Jun 28 2016, 9:21 AM

Thanks for working on this!

lib/Target/X86/X86FixupSetCC.cpp
84 ↗(On Diff #62039)

nit: I'd use braces here to help the reader

115 ↗(On Diff #62039)

First I thought this was a "break" from the loop, but it's for the switch of course.

Maybe this would be easier to read if the switch part was broken out to an "isSetCC" helper function, then the loop could be

for (auto &MI : MBB) {
  if (!isSetCC(MI))
    continue;

  // do stuff, with no confusion about 'break'
}

Thanks Elena, Ahmed!

In D21774#468880, @ab wrote:

Have you considered doing this in DAGToDAG? I'm not sure you'll be able to model the imp-use of EFLAGS (IIRC that's also a longstanding deficiency of pattern matching: we can't turn explicit into implicit), but seems worth a try?

Yes, that's what I started out with. That really would have been simpler, but I couldn't find a way to make it work.

lib/Target/X86/CMakeLists.txt
20 ↗(On Diff #62039)

Will do.

lib/Target/X86/X86FixupSetCC.cpp
119 ↗(On Diff #62039)

I'm not sure I understood the scenario. This is pre-RA, the instruction I'm adding writes into a new vreg, so it can't clobber any other register.
But it made me notice a different edge case. I'm not sure this happens in practice, but in theory, FlagsDefMI may also imp-use eflags, in which case the transformation is invalid. I'll update the patch.

Thanks, Hans.

lib/Target/X86/X86FixupSetCC.cpp
84 ↗(On Diff #62039)

Ack.

115 ↗(On Diff #62039)

You're right, it'll look nicer.

mkuper updated this revision to Diff 62107.Jun 28 2016, 10:21 AM
RKSimon added inline comments.Jun 28 2016, 10:36 AM
test/CodeGen/X86/sse42-intrinsics-x86.ll
1 ↗(On Diff #62107)

Regenerate this first so that we can see the diffs?

mkuper added inline comments.Jun 28 2016, 10:43 AM
test/CodeGen/X86/sse42-intrinsics-x86.ll
1 ↗(On Diff #62107)

Will do.

But it's basically the same as in avx-intrinsics-x86.ll - some of the pcmpestr tests get pessimized, because we can't setcc directly into %al (since we can't xor %eax, which holds a live input). So we use an additional register, and get the additional mov and pop (since it's the register's only use).
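For the curious, a hypothetical sketch of this kind of pessimization (register choices and the immediate are illustrative, not taken from the actual test output): pcmpestri implicitly reads %eax and %edx, so the zeroing xor has to target some other register, and an extra mov is needed to place the result in %eax:

```
	xorl	%ebx, %ebx		# can't zero %eax: it's a live input below;
					# xor goes before the flags setter
	pcmpestri	$7, %xmm1, %xmm0	# implicitly reads %eax/%edx, sets EFLAGS
	seta	%bl
	movl	%ebx, %eax		# extra mov to produce the result in %eax
```

Since %ebx is callee-saved, using it also drags in the push/pop pair mentioned above when it has no other use.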

mkuper updated this revision to Diff 62111.Jun 28 2016, 10:53 AM

Updated SSE4.2 test.

mkuper updated this revision to Diff 62124.Jun 28 2016, 12:04 PM

I managed to confuse myself with the order of iteration - erasing the zext immediately actually isn't safe, even though I don't have a case where it fails.

DavidKreitzer edited edge metadata.Jun 28 2016, 3:09 PM

Hi Michael,

Thanks for working on this! Ultimately I am okay with this approach, but I have a few high level comments.

(1) The advantage of the pseudo-SETcc instruction approach over this approach is that the dependence of the SETcc on the XOR is explicit. In your transformed code, I think there is nothing to prevent this:

v1 = MOV32r0
<CC setter>
v2 = SETcc
v3 = INSERT_SUBREG(v1, v2)

from being transformed into this

<CC setter>
v2 = SETcc
v1 = MOV32r0
v3 = INSERT_SUBREG(v1, v2)

I'd expect this to be a reasonably common occurrence, since the RA might choose to regenerate the MOV32r0 at its use to save on register pressure. (Regenerating the MOV32r0 eliminates the register conflicts with the operands of the CC setter.) And this would, ironically, have the effect of using an extra register & an extra movb instruction since v1 and v2 could no longer be assigned the same register.

(2) Do you have any performance data on the patch?

(3) There is some advantage to generating the XOR/SETcc idiom even for SETcc instructions that don't get zero extended just for the purpose of eliminating the false dependence. I know that carries a code size cost and has some performance risk, but it is at least worth experimenting with. That's something for later, though. We should go after the "easy" cases first like you've done in this patch. We might also choose to handle these harder cases as part of the ExecutionDepsFix pass and not in this new pass.
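For illustration of point (3), a minimal sketch of the false dependence in question: setcc writes only the low byte, so the full-width register keeps a dependence on its previous writer even when no zero extension follows:

```
	cmpl	%esi, %edi
	sete	%al			# writes only %al; the upper bits of %eax
					# are preserved, so this carries a false
					# dependence on the last writer of %eax

# With the idiom applied (the xorl must precede the compare, since it
# clobbers EFLAGS), the dependence is broken at a small code-size cost:
	xorl	%eax, %eax
	cmpl	%esi, %edi
	sete	%al
```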

Thanks,
Dave

lib/Target/X86/X86FixupSetCC.cpp
1 ↗(On Diff #62124)

Please fix the cut & paste error here.

test/CodeGen/X86/fp128-compare.ll
11 ↗(On Diff #62124)

This looks like a problem ...

Thanks for the feedback, Dave!

Hi Michael,

Thanks for working on this! Ultimately I am okay with this approach, but I have a few high level comments.

(1) The advantage of the pseudo-SETcc instruction approach over this approach is that the dependence of the SETcc on the XOR is explicit.

Right, I agree it would be better to have an explicit dependence. It's just that I'm not a fan of either of the two ways we currently have to achieve this (modifying SETCC or introducing a pseudo).

In your transformed code, I think there is nothing to prevent this:

v1 = MOV32r0
<CC setter>
v2 = SETcc
v3 = INSERT_SUBREG(v1, v2)

from being transformed into this

<CC setter>
v2 = SETcc
v1 = MOV32r0
v3 = INSERT_SUBREG(v1, v2)

I'd expect this to be a reasonably common occurrence, since the RA might choose to regenerate the MOV32r0 at its use to save on register pressure. (Regenerating the MOV32r0 eliminates the register conflicts with the operands of the CC setter.) And this would, ironically, have the effect of using an extra register & an extra movb instruction since v1 and v2 could no longer be assigned the same register.

I actually don't think this happens often, but you're right, conceptually it's not something we should be relying on, and it may be fatal.

(2) Do you have any performance data on the patch?

No, I've verified it fixes the simple cases, but didn't really run comprehensive performance checks, as I assumed generating these xors is pure goodness. :-)
Will do, either for this patch or for a different one, depending on whether we go with this or with pseudos.

lib/Target/X86/X86FixupSetCC.cpp
1 ↗(On Diff #62124)

Right, thanks.

test/CodeGen/X86/fp128-compare.ll
11 ↗(On Diff #62124)

It's the same issue as pcmpestr - we are constrained because both the input of the instruction defining eflags and the eventual output of the setcc must be %eax.

The full code is:

# BB#0:                                 # %entry
	pushq	%rax
.Ltmp1:
	.cfi_def_cfa_offset 16
	callq	__getf2
	xorl	%ecx, %ecx
	testl	%eax, %eax
	setns	%cl
	movl	%ecx, %eax
	popq	%rcx
	retq

I should regenerate the test with the update script.

ab added a subscriber: MatzeB.Jun 28 2016, 4:25 PM

+1 to David's comments about the weirdness around the dependencies. I'm not sure how much of a problem that is in practice though, since that sounds like a pessimization that goes against register coalescing regardless of the specifics of this patch.

Speaking of which; I looked a little closer at the ISel approach, and, with Matthias's and Quentin's help, I have a patch that looks like it works? Can you give it a try: D21822

In D21774#469389, @ab wrote:

+1 to David's comments about the weirdness around the dependencies. I'm not sure how much of a problem that is in practice though, since that sounds like a pessimization that goes against register coalescing regardless of the specifics of this patch.
Speaking of which; I looked a little closer at the ISel approach, and, with Matthias's and Quentin's help, I have a patch that looks like it works? Can you give it a try: D21822

Ooh, this looks nice!
I actually tried to do it as a target-specific DAGCombine for zext and not in DAGToDAG, with something like:

if (N0.getOpcode() == X86ISD::SETCC) {
  return DAG.getTargetInsertSubreg(
      X86::sub_8bit, dl, VT,
      SDValue(DAG.getMachineNode(X86::MOV32r0, dl, VT), 0), N0);
}

But got stuck with the same kind of sub-optimal code, and couldn't see a way out.

I'll take a closer look at D21822, thanks!

D21822 looks really nice. It's missing some small things (anyext and zext to i64 need to be handled explicitly - on the MI level, they have already been lowered to a zext to i32), and the call to Select worries me (is it strictly necessary? We're supposed to always select bottom-up, right?), but other than that, I believe it works.

It does however expose some of the weirdness Dave was talking about, simply by virtue of happening early and thus having less predictable code at RA time.
For CodeGen/X86/legalize-shift-64.ll we will get:

...
	movb	$32, %cl
	testb	%cl, %cl
	sete	%bl
	movl	$0, %ecx
	movb	%bl, %cl
...

The block the sete lives in happens to have two different MOV32r0 instructions, one for the sete, and another for an unrelated reason. One of them gets MachineCSE'd, and then regalloc has to remat a 0, and happens to remat it like this.

I played around with it a bit more, and it seems like doing this in DAGToDAG is too early.
This just doesn't play well with higher-level optimizations. In addition to the legalize-shift-64.ll case above, the DAGToDAG patch doesn't even fix the original case PR28146 was reduced from. There, the MOV32r0 gets hoisted up by MachineLICM, with the result being a similar setcc + mov 0 + movb pattern.

Ahmed, are you OK with keeping this a late (after SSA optimizations) pre-RA pass? Or do you prefer to try again with pseudos, like Dave proposed? Any other suggestions?

Dave, regarding performance: this patch (the machine IR pass) looks performance-neutral on SPEC2006.
And, of course, it improves the performance of the workload PR28146 was reduced from by the expected amount. :-)

igorb added a subscriber: igorb.Jun 30 2016, 6:08 AM
ab added a comment.Jun 30 2016, 7:16 AM

Eh, if Dave has no objections, I'm also fine with the pass; I was hoping we'd find a better solution, sorry for the noise!

The pseudo does seem like the least brittle approach; but the pre-RA pass seems good enough?
So... have you considered doing it post-RA? Seems like that lets you avoid the PCMPESTR problem, and I can't think of obvious drawbacks. It's a tad trickier but should be very close to this patch.

the call to Select worries me (is it strictly necessary? We're supposed to always select bottom-up, right?)

You're right, I don't think it's necessary.

It does however expose some of the weirdness Dave was talking about, simply by virtue of happening early and thus having less predictable code at RA time.
For CodeGen/X86/legalize-shift-64.ll we will get:

...
	movb	$32, %cl
	testb	%cl, %cl
	sete	%bl
	movl	$0, %ecx
	movb	%bl, %cl
...

The block the sete lives in happens to have two different MOV32r0 instructions, one for the sete, and another for an unrelated reason. One of them gets MachineCSE'd, and then regalloc has to remat a 0, and happens to remat it like this.

For the curious: I investigated this some more. We do have the ability to remat using MOV32r0, but we don't because EFLAGS is live (I swear that liveness up-and-down loop is duplicated, like, a dozen times). Immediately after that code, there's a jne, so EFLAGS really is live across the rematerialization point, which doesn't sound common.

But that's a separate issue from the INSERT_SUBREG becoming a real copy. We looked into it with Matthias: on 32-bit, only the ABCD registers are available as sub_8bit, so the INSERT_SUBREG ends up being constrained:

  MOV32mi <fi#0>, 1, %noreg, 0, %noreg, 1; mem:ST4[%x]
  MOV32mi <fi#1>, 1, %noreg, 4, %noreg, 0; mem:ST4[%t+4]
  MOV32mi <fi#1>, 1, %noreg, 0, %noreg, 1; mem:ST4[%t](align=8)
  %vreg0<def> = MOV32ri 1; GR32:%vreg0
  %vreg1<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg1
  %vreg2<def,tied1> = SHLD32rri8 %vreg1<tied0>, %vreg0, 32, %EFLAGS<imp-def,dead>; GR32:%vreg2,%vreg1,%vreg0
  %vreg3<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg3
  %vreg4<def> = MOV8ri 32; GR8:%vreg4
  TEST8rr %vreg4, %vreg4, %EFLAGS<imp-def>; GR8:%vreg4
  %vreg5<def> = SETEr %EFLAGS<imp-use>; GR8:%vreg5
## GR32_ABCD:
  %vreg6<def,tied1> = INSERT_SUBREG %vreg3<tied0>, %vreg5<kill>, sub_8bit; GR32_ABCD:%vreg6 GR32:%vreg3 GR8:%vreg5
  %vreg7<def> = CMOV_GR32 %vreg2<kill>, %vreg0, 9, %EFLAGS<imp-use>; GR32:%vreg7,%vreg2,%vreg0
  %vreg8<def,tied1> = XOR32ri8 %vreg7<tied0>, 1, %EFLAGS<imp-def,dead>; GR32:%vreg8,%vreg7
  %vreg9<def,tied1> = OR32rr %vreg6<tied0>, %vreg8<kill>, %EFLAGS<imp-def>; GR32:%vreg9,%vreg8 GR32_ABCD:%vreg6
  JE_1 <BB#2>, %EFLAGS<imp-use>
  JMP_1 <BB#1>

Once the MOV32r0 is CSE'd, we end up not reusing the (unconstrained GR32) vreg, because of that constraint; we end up with a copy after 2-addr:

  MOV32mi <fi#0>, 1, %noreg, 0, %noreg, 1; mem:ST4[%x]
  MOV32mi <fi#1>, 1, %noreg, 4, %noreg, 0; mem:ST4[%t+4]
  MOV32mi <fi#1>, 1, %noreg, 0, %noreg, 1; mem:ST4[%t](align=8)
  %vreg0<def> = MOV32ri 1; GR32:%vreg0
  %vreg1<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg1
  %vreg4<def> = MOV8ri 32; GR8:%vreg4
  TEST8rr %vreg4<kill>, %vreg4, %EFLAGS<imp-def>; GR8:%vreg4
  %vreg5<def> = SETEr %EFLAGS<imp-use>; GR8:%vreg5
## COPY to GR32_ABCD:
  %vreg6<def> = COPY %vreg1; GR32_ABCD:%vreg6 GR32:%vreg1
  %vreg6:sub_8bit<def> = COPY %vreg5<kill>; GR32_ABCD:%vreg6 GR8:%vreg5
  JE_1 <BB#1>, %EFLAGS<imp-use,kill>

And that's how we get the worst-case sequence.
So, we thought it was fine if it was more likely only on 32-bit, but now we have one 64-bit example in PR28146. I'd be interested to investigate that too, but that's only for curiosity; let's forget about the ISel approach.

spatel added a subscriber: spatel.Jun 30 2016, 8:56 AM
spatel added inline comments.
test/CodeGen/X86/avx-intrinsics-x86.ll
1946–1953 ↗(On Diff #62124)

Use the 'nounwind' attribute on this and other tests to eliminate some of the diff noise?

In D21774#471231, @ab wrote:

Eh, if Dave has no objections, I'm also fine with the pass; I was hoping we'd find a better solution, sorry for the noise!

It's not noise at all, I was - and still am - hoping we'd find a better solution as well.

The pseudo does seem like the least brittle approach; but the pre-RA pass seems good enough?

I have two main concerns about pseudos:

  1. Opaqueness: the pseudos are at least somewhat opaque to machine IR passes. I'm not sure how much of a concern this really is in practice, but it bothers me.
  2. Cleanliness: this will mean introducing ~15 new pseudos (or a pseudo with a parametric CC code that only gets resolved at expand time, which I think is even worse, since it'll then look very different from the normal x86 setcc).

But if "public opinion" is strongly towards pseudos, I can go with that. Dave has a prototype patch, I'll see if I run into any gotchas with it.

So... have you considered doing it post-RA? Seems like that lets you avoid the PCMPESTR problem, and I can't think of obvious drawbacks. It's a tad trickier but should be very close to this patch.

What bothers me about post-RA is cases where the flags-setting instruction and the setcc use the same register, e.g.

testb %al, %al
seta %al
movzbl %al, %eax

It may be the case that this never has a read stall on the seta, because if there were one, we would always stall on the instruction that defs the flags (or on some instruction between the test and the setcc, in more complicated cases). I haven't been able to convince myself of this, though.

And that's how we get the worst-case sequence.
So, we thought it was fine if it was more likely only on 32-bit, but now we have one 64-bit example in PR28146. I'd be interested to investigate that too, but that's only for curiosity; let's forget about the ISel approach.

Yes, I'm curious about it too. I'll try to take a look at it separately and see if I can make sense of it. But, I'm, by a long shot, not as familiar with regalloc as I'd like...

test/CodeGen/X86/avx-intrinsics-x86.ll
1946–1953 ↗(On Diff #62124)

Sure, will do.

I have no objection to this solution. I think it is robust in a functional sense. And we can always change it later if we discover that the worst case MOV32r0 sinking scenario is more common than we think. It would give me the warm fuzzies if we had some experimental evidence to confirm the suspicion that this is a rare case. Do you already have that? If not, maybe it would be a good idea to write a late pass that looks for this kind of pattern and count the number of occurrences on, say, cpu2006?

SETcc r8
xor r32, r32 (or mov $0, r32)
movb r32b, r8

And thanks for the performance data! Once this gets committed, I'll have someone run testing on a broader set of workloads.

test/CodeGen/X86/fp128-compare.ll
11 ↗(On Diff #62124)

Ah yes, of course. You can ignore my comment.

mkuper added a comment.Jul 6 2016, 2:08 PM

After looking at the pseudos a bit more, I really prefer to commit this.

Of course, if we have empirical evidence that this misses cases, I'm open to moving to anything that works better - be it a post-RA pass or pseudos.
Dave, feel free to provide such empirical evidence in the form of PRs assigned to me. :-)

This revision was automatically updated to reflect the committed changes.