This is an archive of the discontinued LLVM Phabricator instance.

[X86] Only match BMI (BLSR, BLSI, BLSMSK) if the add/sub op is single use
ClosedPublic

Authored by goldstein.w.n on Jan 6 2023, 7:40 PM.

Details

Summary

If the add/sub is not single use, it will need to be materialized
later, in which case using the BMI instruction is a de-optimization in
terms of code-size and throughput.

i.e:

// Good
leal -1(%rdi), %eax
andl %eax, %eax
xorl %eax, %esi
...
// Unecessary BMI (lower throughput, larger code size)
leal -1(%rdi), %eax
blsr %edi, %eax
xorl %eax, %esi
...

Note, this may cause more mov instructions to be emitted sometimes
because BMI instructions only have 1 src and write-only to dst. A
better approach may be to only avoid BMI for (and/xor X, (add/sub
0/-1, X)) if this is the last use of X but NOT the last use of
(add/sub 0/-1, X).

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jan 6 2023, 7:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2023, 7:41 PM
goldstein.w.n requested review of this revision.Jan 6 2023, 7:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2023, 7:41 PM

NOTE/Request for advise: This regresses the tests in GlobalISel. I think its because of the single-use defs in the td, as even if I made them just return true the tests the BMI ops still werent getting matched. BMI is still does work (see the bmi-out-of-order.ll tests and bmi.ll is still fine).

Can anyone give me some guidance about how to get the patterns to use custom defs and work for GlobalISel?

NOTE/Request for advise: This regresses the tests in GlobalISel. I think its because of the single-use defs in the td, as even if I made them just return true the tests the BMI ops still werent getting matched. BMI is still does work (see the bmi-out-of-order.ll tests and bmi.ll is still fine).

Can anyone give me some guidance about how to get the patterns to use custom defs and work for GlobalISel?

I wouldn't worry about GlobalISel. It's not actively being used or developed as far as I know.

craig.topper added inline comments.Jan 6 2023, 9:25 PM
llvm/lib/Target/X86/X86InstrInfo.td
1266

Maybe we should have

`class binop_oneuse<SDPatternOperator operator>
    : PatFrag<(ops node:$A, node:$B),
              (operator node:$A, node:$B), [{
  return N->hasOneUse();
}]>;`

def add_su : binop_oneuse<add>;
def and_su : binop_oneuse<and>;
def srl_su : binop_oneuse<srl>;

class unop_oneuse<SDPatternOperator operator>
    : PatFrag<(ops node:$A),
              (operator node:$A), [{
  return N->hasOneUse();
}]>;

def ineg_su : unop_oneuse<ineg>;
def trunc_su : unop_oneuse<trunc>;
goldstein.w.n added inline comments.Jan 6 2023, 9:37 PM
llvm/lib/Target/X86/X86InstrInfo.td
1266

Maybe we should have

`class binop_oneuse<SDPatternOperator operator>
    : PatFrag<(ops node:$A, node:$B),
              (operator node:$A, node:$B), [{
  return N->hasOneUse();
}]>;`

def add_su : binop_oneuse<add>;
def and_su : binop_oneuse<and>;
def srl_su : binop_oneuse<srl>;

class unop_oneuse<SDPatternOperator operator>
    : PatFrag<(ops node:$A),
              (operator node:$A), [{
  return N->hasOneUse();
}]>;

def ineg_su : unop_oneuse<ineg>;
def trunc_su : unop_oneuse<trunc>;

Makes sense, will do for V2.

Any thoughts on:
"""
Note, this may cause more mov instructions to be emitted sometimes
because BMI instructions only have 1 src and write-only to dst. A
better approach may be to only avoid BMI for (and/xor X, (add/sub
0/-1, X)) if this is the last use of X but NOT the last use of
(add/sub 0/-1, X).
"""
?

pengfei added inline comments.Jan 7 2023, 7:07 AM
llvm/lib/Target/X86/X86InstrInfo.td
1266

Microarchitecture usually can eliminate mov instructions by register renaming, so we can consider no performance lost except for code size.

goldstein.w.n added inline comments.Jan 7 2023, 9:38 AM
llvm/lib/Target/X86/X86InstrInfo.td
1266

Microarchitecture usually can eliminate mov instructions by register renaming, so we can consider no performance lost except for code size.

IIRC like 80% success rate and some targets like ICX have to disabled. Also still decode and presumably RR overhead. So unless its backend bound work think it will show up.

Note its 1 extra byte for 64-bit, 1 less byte for 32-bit.

I'll leave it as a todo for now I guess.

goldstein.w.n marked an inline comment as done.Jan 7 2023, 10:05 AM

Factor out binop/unop patterns

RKSimon accepted this revision.Jan 18 2023, 2:33 AM

LGTM - for ryzen at least the moves tend to always be free.

This revision is now accepted and ready to land.Jan 18 2023, 2:33 AM

LGTM - for ryzen at least the moves tend to always be free.

can you also checkout D141179 or should I split this?

This revision was landed with ongoing or failed builds.Feb 6 2023, 12:16 PM
This revision was automatically updated to reflect the committed changes.