This is an archive of the discontinued LLVM Phabricator instance.

[X86] Only apply setcc fixup if GR32_ABCDs are free.
Needs ReviewPublic

Authored by bryant on Jul 11 2016, 10:50 AM.

Details

Reviewers
compnerd
mkuper
Summary

X86FixupSetCC transforms setcc zero-ext sequences of the form (intel syntax),

eflags_def_instr
setcc gr8
movzbl gr32, gr8

into:

xor gr32, gr32
eflags_def_instr
setcc gr8

On x86 targets, it's possible for eflags_def_instr to use or def all available GR32 registers (e.g., cmpxchg8b implicitly uses all of E{ABCD}X). Under such circumstances, this transformation should not occur.

Diff Detail

Repository
rL LLVM

Event Timeline

bryant updated this revision to Diff 63535.Jul 11 2016, 10:50 AM
bryant retitled this revision from to [X86] Only apply setcc fixup if GR32_ABCDs are free..
bryant updated this object.
bryant added a reviewer: mkuper.
bryant set the repository for this revision to rL LLVM.
bryant updated this object.
bryant updated this object.
mkuper edited edge metadata.Jul 11 2016, 11:10 AM

Thanks, bryant!

I'm not entirely sure this is the right fix for PR28489, for two reasons:

  1. I'm not convinced this is a bug on the pass site and not the regalloc site. Sure, this is "fast" regalloc, but is it really supposed to just run out of registers here instead of spilling the zero?
  2. Even if so, I may have been two restrict with requiring GR32_ABCD for both ZeroReg and InsertReg. InsertReg certainly must be GR32_ABCD, but ZeroReg may be a regular GR32. Sure, we'd like to map ZeroReg and InsertReg to the same register (using a sane RA, not fast). But a sane RA would put them in the same register unless it can't, anyway. And relaxing this constraint should help FastRA handle this issue.

Regardless, I'm thinking about turning this whole pass off for CodeGenOpt::None. It's an optimization pass, and has no benefit with fast RA anyway, since it won't properly merge the registers...

bryant added a comment.EditedJul 11 2016, 12:23 PM

I have a couple of questions.

  1. Under -O3, which uses the sane RA, the test case I've provided generates:
cmpxchg8b
movl gr32, 0
setcc gr8

Is this what you meant by "spilling the zero"? Because the binary encoding for
this sequence,

0:   0f c7 0e                cmpxchg8b (%esi)
3:   a1 00 00 00 00          mov    0x0,%eax
8:   0f 94 c0                sete   %al

is two bytes longer than the result otherwise gotten without the pass:

 b:   0f c7 0e                cmpxchg8b (%esi)
 e:   0f 94 c0                sete   %al
11:   0f b6 c0                movzbl %al,%eax

which defeats one of the original intended benefits of this pass. So relying on a spill produces a less-optimal result.

  1. What would the final machine output under i686 for a INSERT_SUBREG with a non-GR32_ABCD operand, since only the ABCDs have an 8-bit sub-register?

I have a couple of questions.

  1. Under -O3, which uses the sane RA, the test case I've provided generates:
cmpxchg8b
movl gr32, 0
setcc gr8

Is this what you meant by "spilling the zero"? Because the binary encoding for
this sequence,

0:   0f c7 0e                cmpxchg8b (%esi)
3:   a1 00 00 00 00          mov    0x0,%eax
8:   0f 94 c0                sete   %al

is two bytes longer than the result otherwise gotten without the pass:

 b:   0f c7 0e                cmpxchg8b (%esi)
 e:   0f 94 c0                sete   %al
11:   0f b6 c0                movzbl %al,%eax

which defeats one of the original intended benefits of this pass. So relying on a spill produces a less-optimal result.

Yes, I agree this is a pessimization. But it's probably fairly rare (cmpxchg on i686, and the pcmpstr issue we've already run into, which is slightly different and won't be solved by this patch). And in any case, if we want to prevent it, I'm not sure this is the right patch. More generally, we probably just want to know which physregs are live at the insertion point.

  1. What would the final machine output under i686 for a INSERT_SUBREG with a non-GR32_ABCD operand, since only the ABCDs have an 8-bit sub-register?

It will be broken. :-)
What I meant is to INSERT_SUBREG into a GR32_ABCD, but zero a GR32. I think this leaves even fast RA with enough freedom to use a non-ABCD register for the MOV32r0 and then copy it into an appropriate register for the insert.

I'm not familiar with the pcmpstr issue. Would you mind providing a reference or example?

More generally, we probably just want to know which physregs are live at the insertion point.

What if we moved the pass from pre-reg alloc to, say, post-reg alloc or even pre-rewrite and scavenged for available GR32_ABCDs before the EFLAGS-defining insertion point?

As I said on IRC - the post-RA option was suggested on D21774, and I'm not sure it's superior - it may miss other cases.
I don't object to doing this post-RA on principle, I'm just not sure it's better.

In any case, there are three completely separate issues here:

  1. Fast RA should not break because it runs out of registers. We can keep tracking this as PR28489.
  2. FixupSetCC should not run at -O0. I went ahead and disabled it (r275099), so we should no longer see (1) in practice, but it still should be fixed at some point.
  3. We should not pessimize cmpxchg in optimized code. Could you please file a separate bug for this? Maybe moving the pass to post-RA is the right answer - although I'm still not convinced.