This is an archive of the discontinued LLVM Phabricator instance.

[X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`
Needs ReviewPublic

Authored by bryant on Aug 7 2016, 9:02 PM.

Details

Summary

As indicated by the title, this post-register allocation pre-rewrite pass
generalizes D21774 by matching patterns of the form,

gr8<def> = instr_defining_gr8  # that may or may not use eflags
gr32<def> = movzx gr8

into,

gr32 = mov32r0 eflags<imp-def>  # carefully avoids clobbering eflags
...
gr8<def> = instr_defining_gr8

with the goal of reducing read stalls, partial register stalls, micro-ops, and
overall binary size.
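
For illustration, here is a minimal sketch of the intended rewrite in terms of
LLVM's MachineInstr API. It is not the patch itself: the actual pass runs on
virtual registers before the rewriter and must also verify that EFLAGS is dead
at the insertion point and that a suitable GR32 is available. Only the X86
opcode and sub-register names below are real; the helper and its callers are
illustrative.

cpp
// Sketch only: rewrite "gr8 = def; gr32 = movzx gr8" into
// "gr32 = xor; gr8 = def", shown here on physical registers for simplicity.
static void foldZeroExtendIntoXor(MachineInstr &Def, MachineInstr &Zext,
                                  const X86InstrInfo &TII,
                                  const TargetRegisterInfo &TRI) {
  unsigned Dst32 = Zext.getOperand(0).getReg();          // e.g. EAX
  unsigned Dst8 = TRI.getSubReg(Dst32, X86::sub_8bit);   // e.g. AL

  // Zero the full 32-bit register before the gr8-defining instruction.
  // MOV32r0 lowers to "xor reg, reg" and implicitly defines EFLAGS, so the
  // caller must have checked that EFLAGS is not live here.
  BuildMI(*Def.getParent(), Def, Def.getDebugLoc(), TII.get(X86::MOV32r0),
          Dst32);

  // Retarget the gr8 def at the low byte of the zeroed register and drop the
  // now-redundant zero extension.
  Def.getOperand(0).setReg(Dst8);
  Zext.eraseFromParent();
}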

Except for a few rare cases, it never performs worse than D21774, and in some
cases it does surprisingly better. IACA-annotated assembly output can be found
at https://reviews.llvm.org/P7213 (for x86-64) and at
https://reviews.llvm.org/P7214 (for x86-32).

Not all of the tests have been updated; this is still a work in progress.

EDIT: grammar.

Diff Detail

Repository
rL LLVM

Event Timeline

bryant updated this revision to Diff 67116.Aug 7 2016, 9:02 PM
bryant retitled this revision from to [X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`.
bryant updated this object.
bryant added reviewers: llvm-commits, mkuper.
bryant set the repository for this revision to rL LLVM.
bryant updated this object.
bryant added a comment.Aug 7 2016, 9:06 PM

I should add that the benchmark data in the two aforementioned links was generated by running llc -march={x86,x86-64} -mattr=+sse,+sse4.2 -O3 over the entire test set under test/CodeGen/X86. An appreciable number of cases do not compile under these flags (or simply produce assembly output that gas cannot assemble) and have been omitted by my annotator.

bryant added a comment.Aug 7 2016, 9:52 PM

I would also like to note that because this pass works by register
re-allocation, it _never_ "pessimizes" the way D21774 does. For instance, under
-march=x86 (diff of assembler output of
test/CodeGen/X86/sse42-intrinsics-x86.ll, with "-" indicating D21774 and "+"
indicating this pass):

 test_x86_sse42_pcmpestria128:
-# Throughput: 2.25; Uops: 8; Latency: 7; Size: 36
+# Throughput: 1.55; Uops: 6; Latency: 3; Size: 33
-    pushl    %ebx
     movl    $7, %eax
     movl    $7, %edx
-    xorl    %ebx, %ebx
     pcmpestri    $7, %xmm1, %xmm0
-    seta    %bl
-    movl    %ebx, %eax
-    popl    %ebx
+    seta    %al
+    movzbl    %al, %eax
 retl

Or perhaps less obviously (from cmpxchg-i1.ll under x86-64):

 cmpxchg_zext:
-# Throughput: 2.05; Uops: 6; Latency: 10; Size: 24
+# Throughput: 1.85; Uops: 7; Latency: 10; Size: 23
-    xorl    %ecx, %ecx
     movl    %esi, %eax
     lock        cmpxchgl    %edx, (%rdi)
-    sete    %cl
-    movl    %ecx, %eax
+    sete    %al
+    movzbl    %al, %eax
 retq

On the other hand, because it matches on any gr8-defining instruction (not just
setccs), it can do things like (from pre-ra-sched.ll under x86-64):

 test:
-# Throughput: 11.05; Uops: 27; Latency: 14; Size: 77
+# Throughput: 6.55; Uops: 24; Latency: 14; Size: 77
     movzbl    1(%rdi), %r9d
-    movb    2(%rdi), %al
-    xorb    %r9b, %al
+    xorl    %r10d, %r10d
+    movb    2(%rdi), %r10b
+    xorb    %r9b, %r10b
     movzbl    3(%rdi), %esi
-    movb    4(%rdi), %cl
-    xorb    %sil, %cl
+    xorl    %eax, %eax
+    movb    4(%rdi), %al
+    xorb    %sil, %al
     movzbl    5(%rdi), %r8d
-    movb    6(%rdi), %dl
-    xorb    %r8b, %dl
+    xorl    %ecx, %ecx
+    movb    6(%rdi), %cl
+    xorb    %r8b, %cl
     cmpb    $0, (%rdi)
-    movzbl    %al, %edi
-    cmovnel    %r9d, %edi
-    movzbl    %cl, %eax
+    cmovnel    %r9d, %r10d
     cmovnel    %esi, %eax
-    movzbl    %dl, %ecx
     cmovnel    %r8d, %ecx
     testl    %r9d, %r9d
-    cmovnel    %edi, %eax
+    cmovnel    %r10d, %eax
     testl    %esi, %esi
     cmovel    %ecx, %eax
     retq

which, if IACA is to be believed, is a 40% throughput improvement for free.

Also, it should be noted that this differential depends on the zip iterator
introduced in D23252.

mkuper edited edge metadata.Aug 8 2016, 12:35 AM

Hi bryant,

Thanks a lot for working on this! The improvements look really nice, and it's great that this fixes the pessimizations from rL274692.

I haven't really looked at the patch yet, only skimmed it briefly - so I still don't have even the slightest opinion on the logic.
So, for now, some very general comments/questions:

  • Have you considered adding logic to ExecutionDepsFix or X86FixupBWInsts, as opposed to a new pass? I haven't thought about this too much myself, but those are post-RA passes with somewhat similar goals, and it may make sense to share some of the code.
  • How generic is this? E.g. does this handle PR28442?
  • Do you know why this is failing on cases that we catch with rL274692? If the improvement is large enough we don't have to be strictly better than existing code (rL274692 itself wasn't :) ), but it'd be good to understand what's going on so we can at least guess whether it can happen in "interesting" non-toy cases.
  • This is a fairly large patch. It's probably fairly self-contained, and I don't immediately see a way to break it into meaningful functional pieces. However, that makes it a priori harder to review. So it'd be really nice to make reviewers' life easier by providing copious documentation. :)
  • There's a decent amount of generic code that doesn't look like it should live in the pass. If it's generally useful, then it should live in one of the utility headers. If not, and it's only ever instantiated for a single type, then we can get rid of the extra template complexity. (Also, beware of the lack of compiler support. We need to build LLVM with some fairly old or odd compilers - MSVC 2013, first and foremost. If you haven't seen a pattern used in LLVM, it may be because we still want to be built with compilers that don't support it)
  • A couple of pervasive differences from the LLVM coding conventions I've noticed from skimming:
    • We generally use camel case for identifiers, with variables starting with an uppercase letter, and functions with lowercase. Common acronyms (like TRI) are all-caps. There are some exceptions ("standard" constructs like "int i", lower case with underscores where we want to fit STL style, changes to old code that uses a different convention), but those are fairly rare.
    • We don't usually explicitly compare to nullptr.
bryant updated this revision to Diff 67362.Aug 9 2016, 9:49 AM
bryant added a parent revision: D23252: [ADT] Zip range adapter.
bryant edited edge metadata.

Fixed compile error caused by lack of "const" in return type.

bryant updated this revision to Diff 67980.Aug 14 2016, 11:23 AM
  • Rely on RAII to clean up unused MOV32r0
  • Outsource register allocation hints to VirtRegAuxInfo::copyHint
  • Add regmask constraints if no hints are found
bryant updated this revision to Diff 67981.Aug 14 2016, 11:37 AM

Conform to camel-case naming style for functions.

bryant updated this revision to Diff 67982.Aug 14 2016, 1:00 PM

Conform to variable name casing convention.

bryant updated this revision to Diff 67985.Aug 14 2016, 1:10 PM

Convert explicit checks for nullptr to operator bool.

Hi bryant,

Thanks a lot for working on this! The improvements look really nice, and it's
great that this fixes the pessimizations from rL274692.

I haven't really looked at the patch yet, only skimmed it briefly - so I still
don't have even the slightest opinion on the logic. So, for now, some very
general comments/questions:

Please, please review. I've put a great deal of thinking into this.

  • Have you considered adding logic to ExecutionDepsFix or X86FixupBWInsts, as
opposed to a new pass? I haven't thought about this too much myself, but those
are post-RA passes with somewhat similar goals, and it may make sense to share
some of the code.

How do you propose doing this in either of those two passes? And in any case,
the set of heuristics relies on virtual register liveness data that is rewritten
away long before pre-sched/pre-emit passes execute.

  • How generic is this? E.g. does this handle PR28442?

It matches on any GR8-defining instruction, so I'd say...as generic as can be?
And yes, of course it handles PR28442:

bash
19:54:19 ~/3rd/llvm> cat > pr28442.c
int foo(int a, int b, int c) {
  return (a > 0 && b > 0 && c > 0);
}
20:15:40 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o setccfixup.o -setcc-fixup
20:16:07 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o zextfixup.o
20:16:34 ~/3rd/llvm> diff -u <(python iacapipe.py setccfixup.o) <(python iacapipe.py zextfixup.o)
--- /dev/fd/63  2016-08-14 20:16:48.477885771 +0000
+++ /dev/fd/62  2016-08-14 20:16:48.477885771 +0000
@@ -1,13 +1,13 @@
-# Throughput: 2.65; Uops: 10; Latency: 5; Size: 23
+# Throughput: 2.65; Uops: 9; Latency: 5; Size: 22
 foo:
     test   %edi,%edi
     setg   %al
     test   %esi,%esi
     setg   %cl
     and    %al,%cl
+    xor    %eax,%eax
     test   %edx,%edx
     setg   %al
     and    %cl,%al
-    movzbl %al,%eax
     retq
  • Do you know why this is failing on cases that we catch with rL274692? If the
improvement is large enough we don't have to be strictly better than existing
code (rL274692 itself wasn't :) ), but it'd be good to understand what's going
on so we can at least guess whether it can happen in "interesting" non-toy
cases.

From my tests on CodeGen/X86/*.ll, it only ever fails on x86-32 for two
reasons:

  1. The x86-32 allocation order for CSRs prioritizes ESI and EDI over EBX.
Because this pass never touches unused CSRs, it is possible for a function to
allocate E{AX,CX,DX,SI,DI} but not EBX, thus reducing the pool of available
GR32_with_sub8bit registers by one. This is illustrated by 2008-09-11-CoalescerBug.ll:

bash
20:22:18 ~/3rd/llvm> diff -u --unified=9999999 <(python iacapipe.py obj/kuper-x86/2008-09-11-CoalescerBug.o) <(python iacapipe.py obj/control-x86/2008-09-11-CoalescerBug.o)
--- /dev/fd/63  2016-08-14 20:23:00.352808087 +0000
+++ /dev/fd/62  2016-08-14 20:23:00.352808087 +0000
@@ -1,35 +1,34 @@
-# Throughput: 9.60; Uops: 43; Latency: 20; Size: 98
+# Throughput: 10.10; Uops: 45; Latency: 19; Size: 98
 func_3:
-    push   %ebx
+    push   %edi
     push   %esi
     push   %eax
     movzwl 0x0,%esi
     and    $0x1,%esi
     movl   $0x1,(%esp)
     call   15 <func_3+0x15>
     xor    %ecx,%ecx
     cmp    $0x2,%eax
     setl   %cl
-    xor    %ebx,%ebx
     cmp    %ecx,%esi
-    setge  %bl
-    xor    %eax,%eax
+    setge  %al
+    movzbl %al,%esi
     cmpw   $0x0,0x0
     sete   %al
-    mov    %eax,%esi
+    movzbl %al,%edi
     movl   $0x1,(%esp)
     call   3f <func_3+0x3f>
     xor    %ecx,%ecx
-    cmp    %eax,%esi
+    cmp    %eax,%edi
     setge  %cl
-    sub    %ecx,%ebx
-    cmp    $0x2,%ebx
+    sub    %ecx,%esi
+    cmp    $0x2,%esi
     sbb    %eax,%eax
     and    $0x1,%eax
     mov    %eax,(%esp)
     call   58 <func_3+0x58>
     add    $0x4,%esp
     pop    %esi
-    pop    %ebx
+    pop    %edi
     ret
  2. It's also possible for RA to spill the result of the GR8-defining
instruction, preventing this pass from recognizing the pattern. This is seen in
2009-08-23-SubRegReuseUndo.ll:

bash
--- annot/kuper-x86/2009-08-23-SubRegReuseUndo.o	2016-08-14 20:27:55.032325645 +0000
+++ annot/control-x86/2009-08-23-SubRegReuseUndo.o	2016-08-14 20:27:07.079072090 +0000
@@ -1,88 +1,88 @@
-# Throughput: 31.00; Uops: 113; Latency: 32; Size: 285
+# Throughput: 32.00; Uops: 116; Latency: 32; Size: 290
 uint80:
     push   %ebp
     push   %ebx
     push   %edi
     push   %esi
     sub    $0x1c,%esp
     movsbl 0x30(%esp),%ecx
-    xor    %eax,%eax
     test   %cx,%cx
-    setne  %al
-    mov    %eax,%esi
+    setne  0x1b(%esp)
     movzwl %cx,%eax
-    mov    %ecx,%edi
+    mov    %ecx,%esi
     mov    %eax,(%esp)
     mov    $0x0,%eax
     movsbl %al,%eax
     mov    %eax,0x4(%esp)
     call   29 <uint80+0x29>
     mov    %eax,%ebx
     or     $0x1,%bl
     movl   $0x1,(%esp)
     call   3c <uint80+0x3c>
     movl   $0x0,0x4(%esp)
     movl   $0x0,(%esp)
     call   50 <uint80+0x50>
-    mov    %edi,%ecx
+    mov    %esi,%ecx
+    mov    %ecx,%edi
     xor    %cl,%al
     xor    %bl,%al
     mov    %eax,%ebp
-    mov    %esi,0x4(%esp)
+    movzbl 0x1b(%esp),%eax
+    mov    %eax,0x4(%esp)
     mov    $0x0,%eax
     movzwl %ax,%eax
     mov    %eax,(%esp)
     call   6c <uint80+0x6c>
     mov    %eax,%esi
     movl   $0x1,0x4(%esp)
     movl   $0x0,(%esp)
     call   82 <uint80+0x82>
     xor    %eax,%eax
     test   %al,%al
     jne .L0
     mov    $0x1,%eax
     xor    %ecx,%ecx
     test   %cl,%cl
     jne .L1
 .L0:
     xor    %eax,%eax
 .L1:
     xor    %ebx,%ebx
     cmp    %eax,%esi
     setne  %bl
     movl   $0xfffffffe,(%esp)
     call   ad <uint80+0xad>
     mov    %ebp,%eax
     movsbl %al,%eax
     mov    %eax,0xc(%esp)
     mov    %ebx,0x8(%esp)
     movl   $0x1,0x4(%esp)
     movl   $0x0,(%esp)
     call   ce <uint80+0xce>
     xor    %eax,%eax
     test   %al,%al
     jne .L2
     mov    0x0,%eax
 .L2:
     mov    0x0,%eax
     movl   $0x1,0x4(%esp)
     movl   $0x0,(%esp)
     call   f2 <uint80+0xf2>
     xor    %eax,%eax
     test   %al,%al
     je .L3
     add    $0x1c,%esp
     pop    %esi
     pop    %edi
     pop    %ebx
     pop    %ebp
     ret
 .L3:
     mov    %edi,%eax
     movsbl %al,%eax
     mov    0x0,%ecx
     mov    %eax,(%esp)
     movl   $0x1,0x4(%esp)
     call   11b <uint80+0x11b>
     sub    $0x8,%esp
  • This is a fairly large patch. It's probably fairly self-contained, and I
don't immediately see a way to break it into meaningful functional pieces.
However, that makes it a priori harder to review. So it'd be really nice to
make reviewers' life easier by providing copious documentation. :)

Right, I'll sprinkle in some comments.

  • There's a decent amount of generic code that doesn't look like it should
live in the pass. If it's generally useful, then it should live in one of the
utility headers. If not, and it's only ever instantiated for a single type,
then we can get rid of the extra template complexity. (Also, beware of the
lack of compiler support. We need to build LLVM with some fairly old or odd
compilers - MSVC 2013, first and foremost. If you haven't seen a pattern used
in LLVM, it may be because we still want to be built with compilers that don't
support it)

.

  • A couple of pervasive differences from the LLVM coding conventions I've
noticed from skimming:

.

  • We generally use camel case for identifiers, with variables starting with an
uppercase letter, and functions with lowercase. Common acronyms (like TRI) are
all-caps. There are some exceptions ("standard" constructs like "int i", lower
case with underscores where we want to fit STL style, changes to old code
that uses a different convention), but those are fairly rare.

  • We don't usually explicitly compare to nullptr.

Fixed.

bryant added inline comments.Aug 14 2016, 1:42 PM
lib/CodeGen/CalcSpillWeights.cpp
46 ↗(On Diff #67985)

All this does is expose copyHint to the general public. Please advise if this belongs in a separate differential.

bryant updated this revision to Diff 68284.Aug 16 2016, 4:42 PM

Fix nullptr deref caused by swapped cases.

bryant updated this revision to Diff 68285.Aug 16 2016, 4:44 PM

upload the right patch.

Hi,

What is the performance impact (both runtime and compile time) of the pass?
You list results for test/CodeGen/X86, which is not usually what we use for performance testing. Could you try on the LLVM test suite?

Moreover, like Michael said, we should try to merge this pass with something else. A possible approach would be a peephole-like optimization pass into which we can plug more patterns.

Cheers,
-Quentin

lib/CodeGen/CalcSpillWeights.cpp
46–48 ↗(On Diff #68285)

It seems wrong to export this interface.
External users should rely on MachineRegisterInfo::getSimpleHint.

What is the problem with using MachineRegisterInfo::getSimpleHint?

Hi,

Thanks for actually reviewing this.

What is the performance impact (both runtime and compile time) of the pass?

According to IACA, the run-time impact would be 1% to 40% faster (pre-ra-sched.ll being the 40% case), and never worse. I'll measure the compile-time performance tonight. Do you have any suggestions on the preferred way to do this?

You list results for test/CodeGen/X86, which is not usually what we use for perform testing. Could you try on the LLVM test suite?

My measurements have thus far centered around the performance of generated code. Since this pass only ever runs when -march=x86/-64, there would be no difference in the rest of the arches, improvements or otherwise. If you really want, I could run the rest of the suite but the result would be a bunch of passes and xfails. What sort of results do you really want me to record here?

Moreover, like Michael said, we should try to merge this pass with something else. A possible approach would be a peephole like optimization where we can plug more patterns.

As I've mentioned before, this depends on virtual register LiveInterval information that is lost after the rewrite pass.

Cheers,
-Quentin

lib/CodeGen/CalcSpillWeights.cpp
46–48 ↗(On Diff #68285)

In cases where there is more than one hint, getSimpleHint only returns one and misses the rest.

Hi bryant,

I haven't found time to review this patch in detail yet, but here are some initial comments/questions.

Given some of your sample generated-code changes, I expected to see lots of changes in test/CodeGen/X86. Are you planning to add those changes to this patch later?

Your output code examples have a few instances of this:

+ xorl %r10d, %r10d
+ movb 2(%rdi), %r10b

Rather than insert an xor here, you'd prefer to convert the movb to movzbl.

Converting movb to movzbl (and movw to movzwl) is essentially what FixupBWInstPass does. The author of that pass was deliberately aggressive about converting movw to movzwl but a bit more conservative about converting movb to movzbl. Here are the relevant comments:

// Only replace 8 bit loads with the zero extending versions if
// in an inner most loop and not optimizing for size. This takes
// an extra byte to encode, and provides limited performance upside.

// Always try to replace 16 bit load with 32 bit zero extending.
// Code size is the same, and there is sometimes a perf advantage
// from eliminating a false dependence on the upper portion of
// the register.

This leads to 2 questions for me.

(1) What is the code size impact of this new pass?
(2) How does the behavior of this new pass compare to simply changing X86FixupBWInsts to always optimize the 8-bit case, i.e. not check for innermost loops?

Thanks,
Dave

Hi bryant,

I haven't found time to review this patch in detail yet, but here are some initial comments/questions.

Given some of your sample generated-code changes, I expected to see lots of changes in test/CodeGen/X86. Are you planning to add those changes to this patch later?

Yes, definitely. Will include the updates to test/*.ll in my next diff
release.

Your output code examples have a few instances of this:

+ xorl %r10d, %r10d
+ movb 2(%rdi), %r10b

Rather than insert an xor here, you'd prefer to convert the movb to movzbl.

Just to clarify, the optimal result is movzbl 2(%rdi), %r10d, not

movb 2(%rdi), %al; movzbl %al, %r10d (which is the status quo)

nor

xorl %r10d, %r10d; movb 2(%rdi), %r10b (which is what this patch does)

Correct?

Converting movb to movzbl (and movw to movzwl) is essentially what FixupBWInstPass does. The author of that pass was deliberately aggressive about converting movw to movzwl but a bit more conservative about converting movb to movzbl. Here are the relevant comments:

// Only replace 8 bit loads with the zero extending versions if
// in an inner most loop and not optimizing for size. This takes
// an extra byte to encode, and provides limited performance upside.

I find this comment strange. The stats on x86-64 are:

21:38:41 ~/3rd/llvm> python iacapipe.py --arch 64 a.out
# Throughput: 0.50; Uops: 2; Latency: 6; Size: 7
kuper:
    mov    0x2(%rdi),%al
    movzbl %al,%r10d

# Throughput: 0.50; Uops: 1; Latency: 5; Size: 7
this_patch:
    xor    %r10d,%r10d
    mov    0x2(%rdi),%r10b

# Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
optimal:
    movzbl 0x2(%rdi),%r10d

And on x86-32:

21:39:32 ~/3rd/llvm> python iacapipe.py --arch 32 a.out
# Throughput: 0.50; Uops: 2; Latency: 6; Size: 6
kuper:
    mov    0x2(%edi),%al
    movzbl %al,%eax

# Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
this_patch:
    xor    %eax,%eax
    mov    0x2(%edi),%al

# Throughput: 0.50; Uops: 1; Latency: 5; Size: 4
optimal:
    movzbl 0x2(%edi),%eax

So the movb converted to movzbl is smaller and uses fewer micro-ops. Am I
missing something obvious?

// Always try to replace 16 bit load with 32 bit zero extending.
// Code size is the same, and there is sometimes a perf advantage
// from eliminating a false dependence on the upper portion of
// the register.

This leads to 2 questions for me.

(1) What is the code size impact of this new pass?

I'll compile a nice histogram from the diff data in
https://reviews.llvm.org/P7213 and https://reviews.llvm.org/P7214 .

(2) How does the behavior of this new pass compare to simply changing X86FixupBWInsts to always optimize the 8-bit case, i.e. not check for innermost loops?

If the above results from IACA are to be trusted, it would be better to always
convert "movb; movzbl" to plain old movzbl. Currently, this pass would pre-empt
X86FixupBWInsts by transforming it into "xor; movb" before the later pass has a
chance to see the movb-movzbl pattern. It would be easy to add special-case
patterns that this pass should leave untouched; one possible check is sketched
below.
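
Purely as a sketch of that kind of carve-out (not part of the current diff;
MOV8rm and MOVZX32rm8 are the real X86 opcodes, the surrounding matcher loop
and the Def variable are assumed):

cpp
// Hypothetical early-out in the pattern matcher: a plain byte load is better
// left alone so that X86FixupBWInsts can fold the zero extension into the
// load as a single MOVZX32rm8.
if (Def->getOpcode() == X86::MOV8rm)
  continue; // keep the movb/movzbl pair for the later pass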

Thanks,
Dave

Please let me know if additional clarification is needed.

bryant added inline comments.Aug 22 2016, 7:16 PM
lib/CodeGen/CalcSpillWeights.cpp
46–48 ↗(On Diff #68285)

Here is an example:

c
#include <stdint.h>

typedef struct {
  uint32_t low;
  uint32_t high;
} u32pair;

extern void use8(uint8_t);

uint32_t f(uint64_t *m, uint32_t a, uint32_t b) {
  u32pair src = {a > 0, b > 0};
  use8(a > 0);
  u32pair rv;
  asm volatile("cmpxchg8b %2"
               : "+a"(rv.low), "+d"(rv.high), "+m"(m)
               : "b"(src.low), "c"(src.high)
               : "flags");
  return rv.low > 0 && rv.high > 0;
}

pre-regalloc machineinstrs:

0B      BB#0: derived from LLVM BB %3
            Live Ins: %RDI %ESI %EDX
16B             %vreg2<def> = COPY %EDX; GR32:%vreg2
32B             %vreg1<def> = COPY %ESI; GR32:%vreg1
48B             %vreg0<def> = COPY %RDI; GR64:%vreg0
64B             MOV64mr <fi#0>, 1, %noreg, 0, %noreg, %vreg0; mem:ST8[%4](tbaa=!2) GR64:%vreg0
80B             TEST32rr %vreg1, %vreg1, %EFLAGS<imp-def>; GR32:%vreg1
96B             %vreg5<def> = SETNEr %EFLAGS<imp-use,kill>; GR8:%vreg5
112B            %vreg6<def> = MOVZX32rr8 %vreg5; GR32:%vreg6 GR8:%vreg5
128B            TEST32rr %vreg2, %vreg2, %EFLAGS<imp-def>; GR32:%vreg2
144B            %vreg7<def> = SETNEr %EFLAGS<imp-use,kill>; GR8:%vreg7
160B            %vreg8<def> = MOVZX32rr8 %vreg7; GR32:%vreg8 GR8:%vreg7
176B            ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
192B   ==>      %EDI<def> = COPY %vreg6; GR32:%vreg6
208B            CALL64pcrel32 <ga:@use8>, <regmask %BH %BL %BP %BPL %BX %EBP %EBX %RBP %RBX %R12 %R13 %R14 %R15 %R12B %R13B %R14B %R15B %R12D %R13D %R14D %R15D %R12W %R13W %R14W %R15W>, %RSP<imp-use>, %EDI<imp-use>, %RSP<imp-def>
224B            ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
240B   ==>      %EBX<def> = COPY %vreg6; GR32:%vreg6
256B            %ECX<def> = COPY %vreg8; GR32:%vreg8
272B            INLINEASM <es:cmpxchg8b $2> [sideeffect] [mayload] [maystore] [attdialect], $0:[regdef], %EAX<imp-def,tied>, $1:[regdef], %EDX<imp-def,tied>, $2:[mem:m], <fi#0>, 1, %noreg, 0, %noreg, $3:[reguse], %EBX<kill>, $4:[reguse], %ECX<kill>, $5:[reguse tiedto:$0], %EAX<undef,tied3>, $6:[reguse tiedto:$1], %EDX<undef,tied5>, $7:[mem:m], <fi#0>, 1, %noreg, 0, %noreg, $8:[clobber], %EFLAGS<earlyclobber,imp-def,dead>, $9:[clobber], %EFLAGS<earlyclobber,imp-def,dead>, <!5>
276B            %vreg12<def> = COPY %EDX; GR32:%vreg12
280B            %vreg11<def> = COPY %EAX; GR32:%vreg11
320B            TEST32rr %vreg11, %vreg11, %EFLAGS<imp-def>; GR32:%vreg11
336B            %vreg13<def> = SETNEr %EFLAGS<imp-use,kill>; GR8:%vreg13
352B            TEST32rr %vreg12, %vreg12, %EFLAGS<imp-def>; GR32:%vreg12
368B            %vreg15<def> = SETNEr %EFLAGS<imp-use,kill>; GR8:%vreg15
400B            %vreg15<def,tied1> = AND8rr %vreg15<tied0>, %vreg13, %EFLAGS<imp-def,dead>; GR8:%vreg15,%vreg13
416B            %vreg16<def> = MOVZX32rr8 %vreg15; GR32:%vreg16 GR8:%vreg15
432B            %EAX<def> = COPY %vreg16; GR32:%vreg16
448B            RET 0, %EAX

%vreg6 has two hints: EDI and EBX. However, getSimpleHint shows that
MachineRegisterInfo only records EDI:

selectOrSplit GR32:%vreg6 [112r,240r:0)  0@112r w=5.738636e-03
hints: %EDI
missed hint %EDI
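
For reference, here is a rough sketch of how all copy-derived hints of a
virtual register could be enumerated. The patch itself reuses
VirtRegAuxInfo::copyHint; this fragment (with assumed MRI and VReg variables)
is only meant to show why a single-hint query is insufficient:

cpp
// Collect every physical register that VReg is copied to or from. For
// %vreg6 above this yields both EDI and EBX, whereas getSimpleHint reports
// only one of them.
SmallVector<unsigned, 4> Hints;
for (MachineInstr &MI : MRI.reg_instructions(VReg)) {
  if (!MI.isCopy())
    continue;
  unsigned Dst = MI.getOperand(0).getReg();
  unsigned Src = MI.getOperand(1).getReg();
  unsigned Other = (Dst == VReg) ? Src : Dst;
  if (TargetRegisterInfo::isPhysicalRegister(Other))
    Hints.push_back(Other);
}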

Just to clarify, the optimal result is movzbl 2(%rdi), %r10d, not

movb 2(%rdi), %al; movzbl %al, %r10d (which is the status quo)

nor

xorl %r10d, %r10d; movb 2(%rdi), %r10b (which is what this patch does)

Correct?

Yes, exactly.

So the movb converted to movzbl is smaller and uses fewer micro-ops. Am I
missing something obvious?

AHA! Yes, you are right that this conversion is always profitable:

mov    0x2(%edi),%al
movzbl %al,%eax

TO

movzbl 0x2(%edi),%eax

but the X86FixupBWInsts pass will also do this:

mov    0x2(%edi),%al

TO

movzbl 0x2(%edi),%eax

In other words, it will convert the movb to movzbl even if the result of the movb is never zero extended. Doing so is often a win for performance due to the partial register dependence of the movb, but the movzbl encoding is larger.

I assume you are saying that your new pass will only transform 8-bit moves that are subsequently zero extended? In that case, you can disregard my code size concern for now.

Ultimately, we will want to conditionally optimize the 8-bit instructions that do not feed into a zero extend. (That applies both to the movb --> movzbl transformation and to inserting xor before instructions like SETcc.) It is just a bit harder, due to the tradeoff that needs to be made. I made the same comment on Michael's FixupSetCC pass - just haven't gotten around to experimenting with it yet.

I'll compile a nice histogram from the diff data in
https://reviews.llvm.org/P7213 and https://reviews.llvm.org/P7214 .

That would be very nice! Like I said, though, I am less worried about the code size impact if you are only optimizing instructions that are already being zero extended.

This does mean, however, that this pass will not be an adequate replacement for FixupBWInsts until we add the raw movb --> movzbl transform.

Thanks,
Dave