This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add a pass to change byte and word instructions to zero-extending versions.
ClosedPublic

Authored by kbsmith1 on Feb 9 2016, 9:16 AM.

Details

Summary

This change adds a pass (which will be currently turned off by default) that can change
byte and word instructions into zero extending instructions. This is an attempt to improve
performance by eliminating some penalties associated with the byte and word versions
of load instructions only at this time. My vision, however, is that this should be able to
be extended over time to provide much more optimization of byte and word instructions
into more efficient forms.
This is intended to eventually solve: https://llvm.org/bugs/show_bug.cgi?id=23155
but won't until turned on by default.

Diff Detail

Repository
rL LLVM

Event Timeline

kbsmith1 updated this revision to Diff 47322.Feb 9 2016, 9:16 AM
kbsmith1 retitled this revision from to [X86] Add a pass to change byte and word instructions to zero-extending versions..
kbsmith1 updated this object.
kbsmith1 added a subscriber: llvm-commits.
spatel edited edge metadata.Feb 10 2016, 1:42 PM
spatel added subscribers: silvas, chandlerc, grosbach and 5 others.

Hi Kevin -

Thanks for working on this! Given that the motivation / direction has already been discussed and that this is off by default, I see no reason not to commit this ASAP. We can test / benchmark / evolve it faster in trunk vs. privately.

I'm cc'ing some of the people from the bug report discussions just so they are aware of this pass.

Quick question: Is this going to get rid of some of the partial reg stall code that we've already got in the source base?

Sanjay - Thanks for the encouraging words. I'll be happy to commit whenever folks think that is appropriate.

Eric - My expectation is this could be used both to get rid or partial stall code in other places, but I really don't know where those other places are at this time. If you have suggestions for places to look at for that, I'd like to look into fixing them using this. It may be a bit before it can fix all issues. The legality proving needs some improvement, not live really isn't quite the right proof, but it does catch important cases, and it uses existing LLVM infrastructure.

Eric - My expectation is this could be used both to get rid or partial stall code in other places, but I really don't know where those other places are at this time. If you have suggestions for places to look at for that, I'd like to look into fixing them using this.

There are a few spots that I know of:

  1. X86TargetLowering::EmitCmp() has the change introduced with http://reviews.llvm.org/rL195496 .
  2. The ExecutionDepsFix pass uses getPartialRegUpdateClearance() and getUndefRegClearance() to insert xor insts ( https://llvm.org/bugs/show_bug.cgi?id=22024 ).
  3. PerformTargetShuffleCombine() has a comment about movsd vs. blendpd.

There are probably other cases in X86ISelLowering. It would be good to consolidate those kinds of transforms in a machine pass. So it may be worth renaming this to something more general (FixupPartialRegStalls?) before checking it in just to save on some naming churn.

Eric - My expectation is this could be used both to get rid or partial stall code in other places, but I really don't know where those other places are at this time. If you have suggestions for places to look at for that, I'd like to look into fixing them using this.

There are a few spots that I know of:

  1. X86TargetLowering::EmitCmp() has the change introduced with http://reviews.llvm.org/rL195496 .
  2. The ExecutionDepsFix pass uses getPartialRegUpdateClearance() and getUndefRegClearance() to insert xor insts ( https://llvm.org/bugs/show_bug.cgi?id=22024 ).
  3. PerformTargetShuffleCombine() has a comment about movsd vs. blendpd.

There are probably other cases in X86ISelLowering. It would be good to consolidate those kinds of transforms in a machine pass. So it may be worth renaming this to something more general (FixupPartialRegStalls?) before checking it in just to save on some naming churn.

I named it X86FixupBWInsts because the analysis needed to prove safety was going to be specific to the byte and word insts; that is to prove that only the specific byte/word reg that was the destination of the instruction was going to be "live" out of the instruction, so that these could be changed in any way that was useful. One example would be to change addw reg16,small_immediate into add reg32,small_immediate, as this could provide a code size saving. And I hadn't intended for it to be able to resolve partial updates of XMM or other SIMD types instructions, as the analysis needed for that is different. Given those considerations, I thought X86FixupBWInsts was a more appropriate name. But if after hearing this explanation, you (or others) feel strongly about using a different file name, I will change it.

Given those considerations, I thought X86FixupBWInsts was a more appropriate name. But if after hearing this explanation, you (or others) feel strongly about using a different file name, I will change it.

I think it's fine as-is. If and when we add other transforms, we can change it.

Naming is hard and shouldn't block anything here. That said, what do you think about using this to get rid of some of the reg stall code that was listed? :)

Naming is hard and shouldn't block anything here. That said, what do you think about using this to get rid of some of the reg stall code that was listed? :)

I am in favor of it. I think the staging should be:

  • Get this in.
  • Get it turned on by default.
  • Pull the other changes out, 1 by 1, making sure that this catches those cases, improving the safety analysis, or heuristics as necessary to make sure that performance is maintained/improved.

Each of those stages seems like it ought to be a separate change, following on to this.

echristo accepted this revision.Feb 10 2016, 4:48 PM
echristo added a reviewer: echristo.

Seems reasonable to me. Note that I haven't looked at it much more than the few inline comments I just gave, but it seems good to me - especially as it isn't turned on by default.

lib/Target/X86/X86FixupBWInsts.cpp
137–138 ↗(On Diff #47322)

Can you elaborate on this as far as what kind of interface you'd need?

184 ↗(On Diff #47322)

Comment here as to why this is a sufficient check please? I've seen the comment above, but...

189 ↗(On Diff #47322)

Extra whitespace.

205 ↗(On Diff #47322)

Delete.

216 ↗(On Diff #47322)

Delete.

This revision is now accepted and ready to land.Feb 10 2016, 4:48 PM
qcolombet added inline comments.Feb 10 2016, 4:55 PM
lib/Target/X86/X86FixupBWInsts.cpp
10 ↗(On Diff #47322)

The explanation should use doxygen style comments.

18 ↗(On Diff #47322)

Please explain why only those instructions can be affected.

29 ↗(On Diff #47322)

I have to admit I don’t remember what the documentation says, but when we write a 16-bit register we may already have partial register dependencies, right?

What I want to point out is that this comment seems to go against one of the initial statement that this pass will avoid false-dependencies on the upper portions of the registers. I think this deserves more explanation like when we do kill those false-dependencies and when we might introduce new ones. No need to be fancy here, I believe something simply stating that some instructions set the upper bits and some don’t and thus have false-dependencies would be sufficient.

73 ↗(On Diff #47322)

\p SuerDestReg

77 ↗(On Diff #47322)

I am assuming we want something like that as well
/// \pre OrigDestSize is a 8 or 16-bit.

90 ↗(On Diff #47322)

Either don’t put this comment or explain why we need it.

101 ↗(On Diff #47322)

We usually write the comment with doxygen style and before the field.

132 ↗(On Diff #47322)

Don’t put doxygen comments on both the declaration and definition.
I think doxygen does not like it and also it means we have two spots to update when we need to change the documentation.

197 ↗(On Diff #47322)

We need to copy the implicit operands as well.
Look for:
MachineInstr::copyImplicitOps

Oh, and don't take my approval as enough. This touches enough that a cursory glance is insufficient, please wait for someone like Quentin to look at it.

Thanks.

Thanks for the comments. I'll work on fixing all those, and have a new version early tomorrow.

kbsmith1 updated this revision to Diff 47670.Feb 11 2016, 9:33 AM
kbsmith1 edited edge metadata.

New revision should address most of the comments.

kbsmith1 marked 13 inline comments as done.Feb 11 2016, 9:38 AM

Addressed inline comments.

lib/Target/X86/X86FixupBWInsts.cpp
198 ↗(On Diff #47670)

I think the loop in lines 217-219 already gets all the implicit ops. I looked at the code for copyImplicitOps, which looks like this:
for (unsigned i = MI->getDesc().getNumOperands(), e = MI->getNumOperands();

   i != e; ++i) {
const MachineOperand &MO = MI->getOperand(i);
if ((MO.isReg() && MO.isImplicit()) || MO.isRegMask())
  addOperand(MF, MO);

}
Loop in 217-219 also ends at MI->getNumOoperands(), and is copying every single operand, so I think that loop is already getting all the implicit operands as well. Let me know if you don't think that is true.

qcolombet accepted this revision.Feb 11 2016, 9:53 AM
qcolombet edited edge metadata.

LGTM.

Thanks,
-Quentin

lib/Target/X86/X86FixupBWInsts.cpp
198 ↗(On Diff #47670)

Thanks for double checking.
I wasn’t sure getNumOperand also included the implicit operands and looking at the doxygen told me it was not… which is wrong :)
http://llvm.org/doxygen/classllvm_1_1MachineInstr.html#a7b5fe96d88954efc855e6c466207e535

This revision was automatically updated to reflect the committed changes.