This is an archive of the discontinued LLVM Phabricator instance.

[X86] Disable BMI BEXTR in X86DAGToDAGISel::matchBEXTRFromAnd unless we're compiling for a CPU with single uop BEXTR
ClosedPublic

Authored by craig.topper on Sep 26 2018, 1:14 PM.

Details

Summary

This function turns (X >> C1) & C2 into a BMI BEXTR or TBM BEXTRI instruction. For BMI BEXTR we have to materialize an immediate into a register to feed to the BEXTR instruction.
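To make the discussion easier to follow, here is a minimal sketch in plain C++ (not LLVM code; the helper names are made up for illustration) of the semantics this transform relies on:

#include <cstdint>

// What BEXTR computes: extract 'len' bits of 'src' starting at bit 'start'.
// (Simplified: assumes 0 < len < 64 and start + len <= 64.)
uint64_t bextr(uint64_t src, unsigned start, unsigned len) {
  return (src >> start) & ((uint64_t(1) << len) - 1);
}

// (X >> C1) & C2, with C2 a contiguous low mask of MaskLen bits, maps to
// BEXTR with control = C1 | (MaskLen << 8): bits [7:0] hold the start bit,
// bits [15:8] hold the length. For BMI BEXTR this control value must first
// be moved into a register; TBM BEXTRI encodes it as an immediate.
uint32_t makeBEXTRControl(unsigned Shift, unsigned MaskLen) {
  return Shift | (MaskLen << 8);
}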

The BMI BEXTR instruction is 2 uops on Intel CPUs. It looks like on SKL it's one port 0/6 uop and one port 1/5 uop, despite what Agner's tables say. I know one of the uops is a regular shift uop, so it would have to go through the port 0/6 shifter unit. So execution-wise that's the same or worse than the shift+and, which is one 0/6 uop and one 0/1/5/6 uop. The move immediate into a register is an additional 0/1/5/6 uop.

For now I've limited this transform to AMD CPUs, which have a single-uop BEXTR. It may also make sense if we can fold a load, or if the and immediate is larger than 32 bits and can't be encoded as a sign-extended 32-bit value, or if LICM or CSE can hoist the move immediate and share it. But we'd need to look more carefully at that. In the regression I looked at, it doesn't look like load folding or large immediates were occurring, so the regression isn't caused by the loss of those. We could try to be smarter here if we find a compelling case.
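As one hypothetical illustration of the large-immediate case mentioned above (constants chosen only for the example): once the mask no longer fits a sign-extended 32-bit immediate, shift+and needs its own move of the mask into a register, so the extra move immediate that BEXTR needs is no longer a net cost.

#include <cstdint>

// Hypothetical example: a 36-bit field starting at bit 8. The mask
// 0xFFFFFFFFF can't be encoded as a sign-extended 32-bit immediate, so the
// shift+and form needs a movabs of the mask plus an and anyway, which is
// roughly comparable to the mov-control+bextr sequence.
uint64_t extractWideField(uint64_t x) {
  return (x >> 8) & 0xFFFFFFFFFULL;
}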

Diff Detail

Event Timeline

craig.topper created this revision.Sep 26 2018, 1:14 PM

The code that led to PR38938 saw a perf improvement by moving to BEXTR on btver2 - both from the load folding and from hoisting the control constant out of the loop.

I'd really like to avoid yet another fast/slow feature flag but there might not be a better way right now if you want to get this in - @andreadb any thoughts?

https://godbolt.org/z/zdktz_

Forgot to mention that BEXTR breaks the two-addressness of the shift+and pattern to avoid a copy, which can also be beneficial. Unfortunately isel can't generally determine that a copy will be needed.
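To make the two-addressness point concrete, a hypothetical C++ example: when the source value stays live after the extract, the destructive shift+and sequence needs a copy of the source first, whereas BEXTR writes its result to a separate destination register.

#include <cstdint>

// 'x' is still used after the extract, so a two-address shift+and sequence
// must first copy x to another register before clobbering it; BEXTR can put
// the extracted field straight into a fresh register.
uint32_t extractAndKeep(uint32_t x) {
  uint32_t Field = (x >> 4) & 0xFFF; // the extract
  return Field + x;                  // x remains live here
}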

I agree, I don't really want to add a new fast/slow flag either.

Can we come up with slightly better logic than just haveTBM()?
Otherwise, I'm not sure how to fix D52293 / PR38938.
Can we perhaps also allow cases when we are shifting a load?

lib/Target/X86/X86ISelDAGToDAG.cpp
2598

I think we are also missing a check that NVT is not i64 if we are in 32-bit mode.

Forgot to mention that BEXTR breaks the two-addressness of the shift+and pattern to avoid a copy, which can also be beneficial. Unfortunately isel can't generally determine that a copy will be needed.

I agree, I don't really want to add a new fast/slow flag either.

As Simon already wrote, BEXTR is very fast on AMD.
The throughput reported by llvm-mca matches what I see with perf.

What if, at the ISel stage, we just select BEXTR and then have a later (peephole) pass that expands it based on the subtarget and the properties of the machine instruction?

Basically, for BEXTR we could do something similar to what is done for LEA. We could have a "fixup" pass (or custom patterns matched by the MachineCombiner) to expand BEXTR when the shift+and sequence is more convenient. This would still require a subtarget hook.

If the decision only depends on the subtarget, and (maybe) properties of the instructions, then we could have tablegen automatically expand a TargetSubtarget hook for us.

Something like this (just an example...):

Index: lib/Target/X86/X86SchedPredicates.td
===================================================================
--- lib/Target/X86/X86SchedPredicates.td        (revision 343198)
+++ lib/Target/X86/X86SchedPredicates.td        (working copy)
@@ -54,3 +54,5 @@
// X86GenInstrInfo.
def IsThreeOperandsLEAFn :
    TIIPredicate<"isThreeOperandsLEA", IsThreeOperandsLEABody>;
+
+def ShouldExpandBEXTRDecl : STIPredicateDecl<"shouldExpandBEXTR", TruePred, /* overrides */ 0, /* expandForMC */ 0>;
Index: lib/Target/X86/X86ScheduleBtVer2.td
===================================================================
--- lib/Target/X86/X86ScheduleBtVer2.td (revision 343198)
+++ lib/Target/X86/X86ScheduleBtVer2.td (working copy)
@@ -772,4 +772,9 @@
  ], ZeroIdiomPredicate>
]>;

+def : STIPredicate<
+  ShouldExpandBEXTRDecl,
+  [ InstructionEquivalenceClass<[BEXTR32rr, BEXTR64rr], FalsePred> ]
+>;
+
} // SchedModel

..that generates this:

bool X86GenSubtargetInfo::shouldExpandBEXTR(const MachineInstr *MI) const {
  unsigned ProcessorID = getSchedModel().getProcessorID();
  switch(MI->getOpcode()) {
  default:
    break;
  case X86::BEXTR32rr:
  case X86::BEXTR64rr:
    if (ProcessorID == 4) {
      return false;
    }
    break;
  }

  return true;
} // X86GenSubtargetInfo::shouldExpandBEXTR

As I wrote, this is just an idea...

Not sure if that helps.
-Andrea

The time for X86InstrInfo::getMachineCombinerPatterns might be upon us.....

Can we come up with slightly better logic than just haveTBM()?
Otherwise, I'm not sure how to fix D52293 / PR38938.
Can we perhaps also allow cases when we are shifting a load?

Folding a load isn't clearly a perf win here either. It doesn't look like load-op fusion occurs on Intel CPUs, so BEXTR with a load sends 3 uops out of the frontend. That's the same number of uops as load+shift+and. If I've added up the sizes correctly, I think the move immediate + bextr with load is one byte longer than load+shift+and, at least for 32-bit. 64-bit would incur a REX prefix on the shift, but we'd probably still have a 32-bit and, so that's probably the same size.

lib/Target/X86/X86ISelDAGToDAG.cpp
2598

We're so far past type legalization here that i64 isn't possible on a 32-bit target.

In the short term I think a fast feature flag (for bdver/btver/znver amd targets) is the way to go - this will solve your immediate regression and give us time to get a proper scheduler model driven solution in place (I have my eye on BEXTR, SHLD/SHRD and LEA feature flags to be replaced by this).

In the short term I think a fast feature flag (for bdver/btver/znver amd targets) is the way to go

Hm, but those *are* all the AMD targets that support BMI.
Perhaps this can be simplified down to isAMD() ?

In the short term I think a fast feature flag (for bdver/btver/znver amd targets) is the way to go

Hm, but those *are* all the AMD targets that support BMI.
Perhaps this can be simplified down to isAMD() ?

No - it is never a good idea to map a uarch feature/bug onto a model/vendor predicate. It usually becomes an inaccurate description at some point because companies and chips change. We've experienced this problem in x86 already with "isAtom()".

In the short term I think a fast feature flag (for bdver/btver/znver amd targets) is the way to go

Hm, but those *are* all the AMD targets that support BMI.
Perhaps this can be simplified down to isAMD() ?

Please don't - we need to be getting rid of X86ProcFamilyEnum not adding yet more complications to it - handling all those enums is even worse than feature flags.

Add feature flag and put it on AMD CPUs.

I removed some of the unused check prefixes from bmi.ll so I didn't have to add even more variations of them.
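For readers following along, the kind of gate such a feature flag enables presumably looks something like the sketch below. This is hand-written for illustration, not the committed code, and the names (e.g. HasFastBEXTR for a "fast-bextr" feature) are assumptions based on the discussion.

// Hedged sketch of the new gating condition, not the actual patch.
// TBM BEXTRI keeps the transform because its control value is an immediate;
// BMI BEXTR is only formed when the CPU reports a fast (single-uop) BEXTR.
struct X86SubtargetView {
  bool HasTBM = false;
  bool HasBMI = false;
  bool HasFastBEXTR = false; // assumed name for the proposed feature flag
};

bool shouldFormBEXTR(const X86SubtargetView &ST) {
  return ST.HasTBM || (ST.HasBMI && ST.HasFastBEXTR);
}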

So the target feature was the way forward, great!

test/CodeGen/X86/extract-bits.ll
1–11

Can you please add +fast-bextr to all these?

<pedantic>This phab's title and summary don't match the contents of the patch any more

craig.topper retitled this revision from [X86] Don't generate BMI BEXTR from X86DAGToDAGISel::matchBEXTRFromAnd to [X86] Disable BMI BEXTR in X86DAGToDAGISel::matchBEXTRFromAnd unless we're compiling for a CPU with single uop BEXTR.Sep 29 2018, 9:17 AM
craig.topper edited the summary of this revision. (Show Details)

Add +fast-bextr to extract-bits.ll

RKSimon accepted this revision.Sep 29 2018, 10:07 AM

LGTM with one minor cleanup of the CHECKS in bmi.ll

test/CodeGen/X86/bmi.ll
447

cleanup?

This revision is now accepted and ready to land.Sep 29 2018, 10:07 AM
This revision was automatically updated to reflect the committed changes.