This is an archive of the discontinued LLVM Phabricator instance.

Avoid false dependencies of undef machine operands
ClosedPublic

Authored by myatsina on Jul 18 2016, 9:30 AM.

Details

Summary

This patch helps avoid false dependencies on undef registers by updating the machine instructions' undef operand to use a register that the instruction is truly dependent on, or use a register with the largest clearance.

Pseudo example:

loop:
xmm0 = ...
xmm1 = vcvtsi2sdl eax, xmm0<undef>
... = inst xmm0
jmp loop

In this example, selecting xmm0 as the undef register creates false dependency between loop iterations.
This false dependency cannot be solved by inserting an xor before vcvtsi2sdl because xmm0 is alive at the point of the vcvtsi2sdl instruction.
Selecting a different register instead of xmm0, especially a register that is not used in the loop, will eliminate this problem.

Diff Detail

Repository
rL LLVM

Event Timeline

myatsina updated this revision to Diff 64335.Jul 18 2016, 9:30 AM
myatsina retitled this revision from to Avoid false dependencies of undef machine operands.
myatsina updated this object.
myatsina added reviewers: atrick, danielcdh, mkuper, spatel.
myatsina set the repository for this revision to rL LLVM.
myatsina added subscribers: llvm-commits, DavidKreitzer.
myatsina retitled this revision from Avoid false dependencies of undef machine operands to Avoid false dependencies of undef machine operands + fix bug in clearance calculation.Jul 18 2016, 9:33 AM
myatsina updated this object.
myatsina added inline comments.
lib/CodeGen/ExecutionDepsFix.cpp
523

This is a fix to bug - the clearance calculation did not take into account registers defined as outputs or clobbers in inline assembly machine instructions because these register defs are implicit.

mkuper added inline comments.Jul 19 2016, 5:28 PM
lib/CodeGen/ExecutionDepsFix.cpp
486

I'm a bit confused about this condition. Could you explain?

492

Why not a range for over MI->operands()?

503

This happens only when the real input operand is in memory, right?
If so, is this better than undoing the load folding, and using the loaded register as undef? Or does that depend on the clearance / being able to xor the register?

(This patch looks like a a net improvement regardless, I'm wondering about a possible TODO/follow-up patch)

511

Can we break if we've passed the "minimum required clearance" (that is, Pref)?

523

Any reason not to commit this separately?

myatsina marked 2 inline comments as done.Jul 21 2016, 9:48 AM
myatsina added inline comments.
lib/CodeGen/ExecutionDepsFix.cpp
486

Commit 224447 by Matthias Braun added some ARM specific fix that allows one register to be mapped to several registers from the register class (for ARM a wider register like Q0 can be mapped to several consecutive D registers).
Before his commit the Q register was mapped to the last D register of the sequence.
I want to avoid these cases where one register can be mapped to several registers because in these cases I cannot replace the original register in the instruction.

503

This catches cases like this:
vcvtsi2ssl %eax, %xmm0, %xmm1
This instruction does not have a form with xmm as a source.
"CVTDQ2PS %xmm0, %xmm1" is the packed version of the instruction.
Are you suggesting that if I see something like:
vpextrl $1, %xmm0, %eax
vcvtsi2ssl %eax, %xmm1, %xmm1
I should replace it with
CVTDQ2PS %xmm0, %xmm1 ?

BTW, I think one possible (and probably rare) follow up is handling cases where the register with the highest clearance is alive (and therefore we cannot add an xor), while we have another register, with lower clearance, that is not alive. In this case we should make a smarter decision regarding which one to choose.

511

I've measured both options, and choosing a register with max clearance gives slightly better results than choosing a register with clearance > Pref.

523
myatsina updated this revision to Diff 64950.Jul 21 2016, 1:40 PM
myatsina retitled this revision from Avoid false dependencies of undef machine operands + fix bug in clearance calculation to Avoid false dependencies of undef machine operands.
myatsina updated this object.
myatsina marked an inline comment as done.
mkuper added inline comments.Jul 21 2016, 5:35 PM
lib/CodeGen/ExecutionDepsFix.cpp
486

Ok. I'm still not sure I understand exactly what's going on, but it's ok to be conservative - especially since right now we should never hit this anyway. We only exercise this code path for X86 (I believe it's the only in-tree backend that actually implements getUndefRegClearance()), and if this is ARM-specific, then it doesn't really matter...

503

No, not what I meant, and in any case, it was not a good idea, ignore my comment. :-)

511

We could treat that as an indication that Pref isn't high enough - see the discussion on D21560.
If you have experimental evidence that increasing Pref to, say, 96 or 128, increases performance (above the noise level), then we should do that. Otherwise, we should probably bail on Pref here as well. I don't think it makes sense to be inconsistent about it.

myatsina updated this revision to Diff 66280.Aug 1 2016, 12:11 AM

Update code using clearance > Pref..
Updated Pref to 128. My runs show it's slightly better than 64.

myatsina marked an inline comment as done.Aug 1 2016, 12:13 AM
mkuper accepted this revision.Aug 1 2016, 11:00 AM
mkuper edited edge metadata.

LGTM, after you add (or document) the possibly missing test.

lib/CodeGen/ExecutionDepsFix.cpp
499

Do we have a test-case where we expect this to happen?
If we don't, can you add one? If we do, can you clearly document it as such?

This revision is now accepted and ready to land.Aug 1 2016, 11:00 AM
myatsina updated this revision to Diff 67351.Aug 9 2016, 8:19 AM
myatsina edited edge metadata.
  1. Added a test for the case of true dependency.
  2. Fixed a bug in my code exposed by Craig Topper's commit #276393 - some of my lit tests started to fail after his change because I was using the register class that was passed in ExeDepsFix's constructor instead of using the register class of the actual machine instruction operand I want to replace.
  3. Fixed additional tests that were added I the meantime.
myatsina marked an inline comment as done.Aug 9 2016, 8:21 AM
myatsina added inline comments.
lib/CodeGen/ExecutionDepsFix.cpp
491

This is a bug fixed exposed by Craig Topper's commit #276393.
Now using the register class of the actual machine instruction operand I want to replace.

mkuper added inline comments.Aug 9 2016, 9:57 AM
lib/CodeGen/ExecutionDepsFix.cpp
491

This looks a bit weird.
It would probably be better to use MRI->getRegClass(OriginalReg).
(We already query MachineRegisterInfo in runOnFunction, you can convert that to a member)

myatsina added inline comments.Aug 10 2016, 2:20 AM
lib/CodeGen/ExecutionDepsFix.cpp
491

MRI->getRegClass() receives a virtual reg, but I do not have access to virtual regs in this pass.
How about "TII->getRegClass(MI->getDesc(), OpIdx, TRI, *MF)"?
This seems to work as well.

mkuper added inline comments.Aug 10 2016, 9:57 AM
lib/CodeGen/ExecutionDepsFix.cpp
491

Gah, you're right, sorry.
Anyway, yes, I think I'd prefer TII->getRegClass.

This revision was automatically updated to reflect the committed changes.
MatzeB added a subscriber: MatzeB.Oct 28 2016, 2:45 PM
MatzeB added inline comments.
llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
492–494 ↗(On Diff #67658)

getRegClass() can return nullptr if there is not register class declared/defined in MCInstrDesc (happens mostly on generic instructions like COPY or INLINEASM). The code looks like it would crash in this case.

499 ↗(On Diff #67658)

I would use !CurrMO.readsReg() instead of isUndef(), that way bundle internal reads are ignored as well.

512 ↗(On Diff #67658)

This looks bogus if OpRC is ever != ExeDepsFix::RC. Maybe you shouldn't even query OpRC and use ExeDepsFix::RC? Or can there be situations in which an operand has a restricted register class?

523–525 ↗(On Diff #67658)

If multiple operands use the same register, they all need to be updated at the same time (MachineOperand documents that even for undef inputs with the same register read the same value).