This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] exclude INLINEASM from schedule when it would increase register pressure
AbandonedPublic

Authored by nickdesaulniers on Apr 22 2022, 3:22 PM.

Details

Summary

INLINEASM with a large number of operands requires delicate
pre-register-allocation scheduling, so as not to exhaust the number of
physical registers allocatable to the INLINEASM.

Sinking a COPY of a PhysReg to a VirtReg closer to its use is
problematic when sunk past an INLINEASM that requires Physreg's of the
same register class. Doing so extends the LiveRange of the Physregs in a
way that register allocation may fail to allocate enough registers for
the inline asm, resulting in compile time failures for inline asm
statements that have many operands.

When we encounter and INLINEASM whose number of operands of any
particular TargetRegisterClass would be above the register pressure
limit of a given MachineFunction, split the scheduling boundary at the
INLINEASM.

Fixes: https://github.com/llvm/llvm-project/issues/41914

Diff Detail

Event Timeline

nickdesaulniers requested review of this revision.Apr 22 2022, 3:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2022, 3:22 PM

Note to reviewers, I'd like to land this AND https://reviews.llvm.org/D122350 (either of which both resolve my issue observed in this MIR test case).

One alternative approach I can think of is rather than shrink the schedule region which is pretty pessimistic (the INLINEASM could have lots of operands of a particular TargetRegisterClass, but the only COPYs the would benefit from rescheduling are of a different TargetRegisterClass); I guess during moving of an instruction, we could check that we weren't moving COPY of a PhysReg past an INLINEASM with an operand of the same TargetRegisterClass would not be permitted if the number of live phyregs at the INLINEASM minus the physregs consumed by the INLINEASM.

I'm not sure yet if what's the cleanest place to handle such an exceptional case and if that's even the best approach.

pengfei accepted this revision.Apr 23 2022, 4:11 AM

The change LG, though I'm not sure if changing getRegPressureLimit affects existing code.

llvm/lib/Target/X86/X86RegisterInfo.cpp
266–271

Should we change them to (Is64Bit ? 12 : 4) - FPDiff ?

275

I wonder why return 4 for it.

This revision is now accepted and ready to land.Apr 23 2022, 4:11 AM
craig.topper added inline comments.Apr 23 2022, 1:29 PM
llvm/lib/CodeGen/MachineScheduler.cpp
473

reschduling -> rescheduling

llvm/lib/Target/X86/X86RegisterInfo.cpp
266–271

Yeah, I think so. Though I'm curious about the use of 4 in the first place. i386 has:

  • eax
  • ebx
  • ecx
  • edx
  • esi
  • edi
  • ebp
  • esp
  • eip

As GPRs. eip and esp aren't allocatable, and ebp is the condition FPDiff here. Why 4? I would have guessed 6 for the first 6 (for -m32)?

Perhaps 4 corresponds to GR32_ABCD, for some reason? So should the cases used be GR32_ABCDRegClassID rather than GR8RegClassID? Or should it be GR8RegClassID then use 6 - FPDiff?

For 64b, I'd have guessed those six plus r8d to r15d, for a total of 14, not 12. Am I missing something (perhaps about esi+edi)?

275

Good question. Some comments could be added to this function to help retain more context as to WHY these magic constants have these values. I'm happy to do so in these commits, if we can document _why_.

I'm running git log -S X86TargetLowering::getRegPressureLimit to see if I can find additional context.

llvm/lib/Target/X86/X86RegisterInfo.cpp
275

This code was added by commit:

commit 37b740c4bfbb ("Add an ILP scheduler.)
Author: Evan Cheng <evan.cheng@apple.com>
AuthorDate: Sat Jul 24 00:39:05 2010 +0000

llvm/lib/Target/X86/X86RegisterInfo.cpp
266–271

Also, if we extend this from 4 to 6 on 32b, then the test case will no longer be fixed.The inline asm from the test case uses 1 GR8 and 4 GR32.

pengfei added inline comments.Apr 26 2022, 12:05 AM
llvm/lib/Target/X86/X86RegisterInfo.cpp
266–271

I'm guessing it's a heuristic number. It might consider an average of the usage of other special registers e.g., eax for returning, ecx for counter, esi and edi for data moving etc.
If the assumption is ture, then it makes sense to 64b using 12 (4 + 8), since the r8~r15 are pure general ones.

275

Oops, I made a mistake. I thought the V means virtual...
The V actually means vector. And the VR64 are MMX registers which have the same size 8 on both 32b and 64b.
Why we use 4 for VR64 and VR128 on 32b while 10 on 64b. A wild guess is it's related to calling conversion.
Anyway, I think we should keep such value before we can make sure the origin.
OTOH, I think maybe we can more register classes (maybe a seperate patch), e.g.

case X86::VR128RegClassID:
case X86::VR256RegClassID:
case X86::VR512RegClassID:
  return Is64Bit ? 10 : 4;
case X86::VR128XRegClassID:
case X86::VR256XRegClassID:
case X86::VR512XRegClassID:
  return Is64Bit ? 26 : 4;

Hi,

I see what you are trying to solve, but at the same time, in theory any instruction could have this kind of register pressure problems. (Though, we probably don't ever create too many not spillable live-ranges around regular instructions and regular instructions have a reasonable number of operands.)

Ultimately, I feel that telling the users that their inline asm is using too many registers is not necessarily a bad thing.

Making inline asm instructions scheduling boundaries when they use a "lot of registers" is a pretty big hammer. Also technically, inline asm instructions with only a few operands could also become scheduling barrier if the surrounding pressure is already high.

I'm on the fence with that patch. I expect it fixes the cases where we may run out of registers but it also over-constrains the scheduling problem.

I guess this is acceptable because when users use inline asm, they essentially are telling the compiler "I know what I'm doing". Thus, I'm leaning toward "this patch is okay" but I could be convinced otherwise.

What do others think?

Cheers,
-Quentin

arsenm added inline comments.May 2 2022, 2:14 PM
llvm/lib/CodeGen/MachineScheduler.cpp
459

This is implicitly relying on register
Class exact matches when register classes are really overlapping sets

473

Remove the new line and INLINEASM since that’s included with the instruction print (or Indent the rest)

My understanding is the inline asm is in the transition zone between a "regular instruction" and "function call", because the content of inline asm is vary largely. It can be as simple as a "nop" or more complicated as modifing memory or even call a function.
I think setting scheduling barrier based on register pressure is compromised solution. It's also much like we prepare/preserve registers for function call when we use many registers in inline asm.
One more thought, we may also need to consider the clobber registers and memory?

nickdesaulniers abandoned this revision.May 3 2022, 11:38 AM

I'm not comfortable pursuing this approach.

Say we had an example where there were two instructions that consumed physregs, producing virt regs, then inlineasm with lots of operands, then consumers of the virt regs, and we could only reschedule one of the two virt reg produces below the asm.

This approach will pessimistically allow neither to be moved.

I will have to think more about this, but am super busy with other things ATM. So I plan to at least land https://reviews.llvm.org/D122350 for now.

MatzeB added a comment.EditedMay 3 2022, 4:06 PM

Do you have a feeling for how well this heuristic works for non-X86? I wonder if it would be better to be conservative and only have it in X86InstrInfo::isSchedulingBoundary, that way you could also hardcode some values instead of fiddeling with getRegPressureLimit.

I also have a feeling like you only need to consider GPR registers on x86 where constraints are plentiful while for XMM/YMM it probably doesn't matter...