This is an archive of the discontinued LLVM Phabricator instance.

[X86] Generate BT instrutions a bit more agressively
AbandonedPublic

Authored by deadalnix on Jan 27 2018, 11:34 AM.

Details

Summary

Right now, BT isntruction are inly generated when the constant cannot be materialized for a test instruction, and when the source is a (srl (and X, 1), N). This is fragile as any transform that get rid of the srl cause bt to not be materialized anymore. For instance anything (and X, 1 << N) will nto see the bt instruction used.

Because there are numerous pattern that match (and X, 1), bt is only generated when the bit position is greater than 0.

Diff Detail

Event Timeline

deadalnix created this revision.Jan 27 2018, 11:34 AM

BT has lower throughput than and/test and can't be macrofused. This probably isn't a win. See D37418 for some more discussion on this.

craig.topper added inline comments.Jan 27 2018, 11:45 AM
test/CodeGen/X86/testb-je-fusion.ll
11

I will say this is kinda stupid that we forced a copy just so we could do a high register trick.

BT has lower throughput than and/test and can't be macrofused. This probably isn't a win. See D37418 for some more discussion on this.

What might be of use is investigating whether this can be instead performed in the MC like @avt77 is trying to do for SHLD/SHRD on D40602.

@craig.topper I had no idea bt has lower throughput than test, I assumed it was the same. If that's the case, then this approach doesn't make much sense.

We should try to get rid of the high register trick that creates extra copies. I'm not sure what is creating this, any suggestions?

test/CodeGen/X86/testb-je-fusion.ll
11

Do you have nay idea what can be causing this? There are numerous test cases where this happens.

The h-register trick is coming from here in X86ISelDAGToDAG.cpp

// For example, "testl %eax, $2048" to "testb %ah, $8".
if (isShiftedUInt<8, 8>(Mask) &&
    (!(Mask & 0x8000) || hasNoSignedComparisonUses(Node))) {
  // Shift the immediate right by 8 bits.
  SDValue ShiftedImm = CurDAG->getTargetConstant(Mask >> 8, dl, MVT::i8);
  SDValue Reg = N0.getOperand(0);

  // Extract the h-register.
  SDValue Subreg = CurDAG->getTargetExtractSubreg(X86::sub_8bit_hi, dl,
                                                  MVT::i8, Reg);

  // Emit a testb.  The EXTRACT_SUBREG becomes a COPY that can only
  // target GR8_NOREX registers, so make sure the register class is
  // forced.
  SDNode *NewNode = CurDAG->getMachineNode(X86::TEST8ri_NOREX, dl,
                                           MVT::i32, Subreg, ShiftedImm);
  // Replace SUB|CMP with TEST, since SUB has two outputs while TEST has
  // one, do not call ReplaceAllUsesWith.
  ReplaceUses(SDValue(Node, (Opcode == X86ISD::SUB ? 1 : 0)),
              SDValue(NewNode, 0));
  CurDAG->RemoveDeadNode(Node);
  return;
}

Looks like this end up being a problem when the value is in EDI due to calling convention. So a few question come to mind:
1/ Shouldn't this optimization be done only after register allocation, if the selected register allows for it ? This would cause it to fail once in a while because the register allocator do not chose the proper register, but it's probably preferable to the extra copies.
2/ Is that possible to hint the register allocator that something is desirable ? For instance, that we would like this value to be in a GR8_NOREX register, but if that's not the case, don't create a copy for it ?

On a second look, when disabling this trick, I only get improvement in various test cases. I'm not sure what's the impact in real source code is, but I'm not convinced this trick is worth doing at all at this stage.

nhaehnle resigned from this revision.Jan 29 2018, 6:53 AM
deadalnix abandoned this revision.Jan 30 2018, 1:44 PM

Ok sounds like this isn't the right approach, closing this one.