This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement shouldCoalesce() to help regalloc to avoid running out of registers with GR128 regs
ClosedPublic

Authored by jonpa on Sep 15 2017, 4:27 AM.

Download Raw Diff

Details

Reviewers

uweigand
qcolombet
MatzeB

Summary

Fix for https://bugs.llvm.org/show_bug.cgi?id=34610

Regalloc runs out of registers in a test case with GR128 register. This can be remedied by not allowing coalescing copys of subregs from GR128 regs in shouldCoalesce().

Diff Detail

Event Timeline

jonpa created this revision.Sep 15 2017, 4:27 AM

Thanks for trying this suggestion! See a couple of inline comments.

Also, I'm wondering doing less coalescing here may lead to reduced performance somewhere? It would be good to verify at least some benchmarks to make sure we aren't too restrictive now.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
164	While turning subreg liveness on fixes the simpler test case you have here, the even more restricted test case I've added to the bugzilla actually still fails even then. So I'm not sure we should mention subreg liveness here as the reason why this is needed.
167	Should this be symmetrical? I.e. something like if (NewRC->hasSubClassEq(&SystemZ::GR128BitRegClass) && (getRegSizeInBits(SrcRC) <= 64 \|\| getRegSizeInBits(DstRC) <= 64)) return false;
test/CodeGen/SystemZ/regalloc-GR128.ll
7	Wouldn't just running llc and checking its exit code already test that there is no LLVM ERROR?
10	Maybe better to use the more restricted test from the bugzilla.

jonpa marked 3 inline comments as done.Sep 19 2017, 12:05 AM

jonpa added inline comments.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
164	I see - removed.
167	Do we need a test case for this also?
test/CodeGen/SystemZ/regalloc-GR128.ll
7	OK - I thought it was a nice "comment", but I guess we already have a comment at the top of the file.
10	yes.

Updated per review.

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

In D37899#875075, @jonpa wrote:

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

Huh. This seems more extreme than I expected, it looks like just about *every* load of a 128-bit value from memory now reloads into another set of registers, even in those quite simple test cases ... Is this just the coalescing or is there something else going on?

In D37899#875173, @uweigand wrote:

In D37899#875075, @jonpa wrote:

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

Huh. This seems more extreme than I expected, it looks like just about *every* load of a 128-bit value from memory now reloads into another set of registers, even in those quite simple test cases ... Is this just the coalescing or is there something else going on?

Yes, this is indeed more than needed.

For atomic-load-05.ll, I see during isel:

Optimized lowered selection DAG: BB#0 'f1:'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t7: i128,ch = AtomicLoad<Volatile LD16[%src]> t6, t4
      t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
    t11: ch = store<ST16[<unknown>](align=8)> t7:1, t7, t8, undef:i64
  t12: ch = SystemZISD::RET_FLAG t11


Type-legalized selection DAG: BB#0 'f1:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
  t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t13: Untyped,ch = SystemZISD::ATOMIC_LOAD_128<Volatile LD16[%src]> t6, t4
        t15: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<2>
      t19: ch = store<ST8[<unknown>]> t13:1, t15, t8, undef:i64
        t17: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<7>
        t21: i64 = add t8, Constant:i64<8>
      t22: ch = store<ST8[<unknown>]> t13:1, t17, t21, undef:i64
    t23: ch = TokenFactor t19, t22
  t12: ch = SystemZISD::RET_FLAG t23

It looks to me that SystemZ does not have i128 as a legal type, and therefore the 128 bit store gets expanded. If the coalescing is then turned off, one of the subregs will need an extra COPY as the GR128 reg is live past the first store.

Maybe this would be a good time to try and make i128 legal as we have discussed earlier?

untyped -> i128
Add pseudos that expands after register allocation, so that the coalescing problems we see here go away.
Make sure that all the i128 operations that actually does not have support still gets expanded per current behavior.

anything else? Any hacks relating to untyped that we know should go away then?

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I've been thinking about alternative solutions to this problem. One way might be to prevent coalescing (as this patch does) and then try to *encourage* (but not enforce) use of the register pair earlier, e.g. by providing register allocator hints for those registers that would have been coalesced with the pair if we had not prevented this - sort of like the ARM register pair hints. But it turns out this is no so easy, because the allocator will assume those hints conflict with the original register pair (unless we switch on enableSubRegLiveness, but this seems to have other problems, it looks like it causes a number of internal compiler errors in the test suite).

Another option might be to not *completely* prevent coalescing a 128-bit pair with its component registers, but allow it in cases where there is no excessive register pressure. But I'm not sure exactly how to get at this information at the time we make the coalescing decision ...

Patch updated.

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

Another option might be to not *completely* prevent coalescing a 128-bit pair with its component registers, but allow it in cases where there is no excessive register pressure. But I'm not sure exactly how to get at this information at the time we make the coalescing decision ...

There is a current discussion relating to the mischeduler register pressure tracking about the fact that there is no real support to query global liveness yet. It may be that this will get added soon, and perhaps then some cheap register pressure estimate might be available.

Until then, I thought it might serve well to at least allow coalescing in cases of tiny, local live-ranges, e.g. loading and then storing 128 bits.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

On SPEC, I see

Current patch:

               master                 patched
lg             : 320549               318515                -2034
stg            : 125311               124230                -1081
lr             :  25927                26632                 +705
lgr            : 343901               343602                 -299
llill          :    982                  869                 -113
c              :  10989                11061                  +72

"Spill|Reload"   168062               165145                -2917

Herald added subscribers: javed.absar, nhaehnle, arsenm. · View Herald TranscriptSep 25 2017, 4:55 AM

In D37899#879892, @jonpa wrote:

Patch updated.

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

It turns out this doesn't work. E.g. in the case of a 128-bit add, this is split into a sequence of 64-bit add-with-carry operations by the *type legalization* code, not the normal expander. If you declare i128 a legal type, the type legalizer doesn't do this any more. But then if you mark the ADD as Expand, the normal expander doesn't know what to do with it either, and -I think- just falls back to a libcall.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

My point was for a 128-bit load feeding into a 128-bit store, both the load and the store gets expanded to two 64-bit loads / stores anyway. So we end up with two completely independent 64-bit data flow paths, which require 64-bit registers --- but there is absolutely no benefit to require that these registers form an even/odd pair, they might as well be arbitrary registers (it might even be the same one if the sequence ends up scheduled as LG / STG / LG / STG). Actually the same applies to most of the other cases where the 128-bit operation ends up split into 64-bit operations by the type legalizer.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

This looks promising. Not sure if anybody would object to changing the interface to pass in the LiveIntervals, but it seems reasonable to me.

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

It turns out this doesn't work. E.g. in the case of a 128-bit add, this is split into a sequence of 64-bit add-with-carry operations by the *type legalization* code, not the normal expander. If you declare i128 a legal type, the type legalizer doesn't do this any more. But then if you mark the ADD as Expand, the normal expander doesn't know what to do with it either, and -I think- just falls back to a libcall.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

My point was for a 128-bit load feeding into a 128-bit store, both the load and the store gets expanded to two 64-bit loads / stores anyway. So we end up with two completely independent 64-bit data flow paths, which require 64-bit registers --- but there is absolutely no benefit to require that these registers form an even/odd pair, they might as well be arbitrary registers (it might even be the same one if the sequence ends up scheduled as LG / STG / LG / STG). Actually the same applies to most of the other cases where the 128-bit operation ends up split into 64-bit operations by the type legalizer.

I see.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

This looks promising. Not sure if anybody would object to changing the interface to pass in the LiveIntervals, but it seems reasonable to me.

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

I don't know, I just thought that we basically wanted to at least handle the type of case where there is a GR128 op and then some subreg COPY(s) very close. My reason for having the instruction limit is that we can't know the register pressure, but it seems safe to think that tiny live ranges checked in this way could be coalesced.

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

Checking for GPRs is something I seem to have missed, but however the different reg check is something I actually took away after first using a SmallSet. I thought that if we would to this properly, we should in addition also check aliases after finding the set of different regs. I then changed my mind since we are merely trying to do an estimate, and we don't really know for sure what the right limit is, etc and may rather just keep it simple (BTW, I also see that it should do +=2 for a GR128 reg).

I guess it depends on the code and if we are trying to handle "any" type of mixed code with an as-good-as-possible estimate, or are we merely trying to keep coalescing the simple and small sequences?

Taking this a step further, if regalloc can only run out of registers when enough physreg defs overlap the GR128 live range so that there is no single 128-bit register pair to be allocated, then we should basically allow coalescing if there is at least one non-clobbered i128 register pair in the region, right?

Even if there were many such GR128 virtregs overlapping, one would hope that regalloc would be able to split them so that one such free pair would be enough. Would we want to stop coalescing at that some point in order to reduce spilling? We could perhaps try to by looking for GR128 virtual register operands in the region.

In D37899#880029, @jonpa wrote:

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

I don't know, I just thought that we basically wanted to at least handle the type of case where there is a GR128 op and then some subreg COPY(s) very close. My reason for having the instruction limit is that we can't know the register pressure, but it seems safe to think that tiny live ranges checked in this way could be coalesced.

Ah, I misread the code -- if we have more than 10 instructions, you default to *not* coalese (I somehow thought it was the opposite). That seems more reasonable. The interesting question will then be how this interacts with the scheduler, i.e. how frequently it happens that unrelated code is scheduled in between, and thus breaking the coalescing. Not sure if that is a real concern ...

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

Checking for GPRs is something I seem to have missed, but however the different reg check is something I actually took away after first using a SmallSet. I thought that if we would to this properly, we should in addition also check aliases after finding the set of different regs. I then changed my mind since we are merely trying to do an estimate, and we don't really know for sure what the right limit is, etc and may rather just keep it simple (BTW, I also see that it should do +=2 for a GR128 reg).

Yes, that's why I was thinking of a simple bitmap of the 16 actual GPRs, and just set a bit if this GPR is touched (no matter via which super/subreg or alias).

Taking this a step further, if regalloc can only run out of registers when enough physreg defs overlap the GR128 live range so that there is no single 128-bit register pair to be allocated, then we should basically allow coalescing if there is at least one non-clobbered i128 register pair in the region, right?

Even if there were many such GR128 virtregs overlapping, one would hope that regalloc would be able to split them so that one such free pair would be enough. Would we want to stop coalescing at that some point in order to reduce spilling? We could perhaps try to by looking for GR128 virtual register operands in the region.

Strictly speaking, yes. But I think it also doesn't matter to avoid coalescing even if the register pressure isn't *that* extreme -- it is true that as long as at least one GPR pair is available, the allocator should be able to fix up anything via spilling, but in general spilling is more expensive than extra copying due to non-coalescing would have been ...

(B.t.w. if you do go for the one-pair check, it needs to check for one pair except 0/1, because 0/1 isn't usable for ADDR128.)

Improved to accurately keep track of which registers are actually clobbered in the region by means of a BitVector.

Ah, I misread the code -- if we have more than 10 instructions, you default to *not* coalese (I somehow thought it was the opposite). That seems more reasonable. The interesting question will then be how this interacts with the scheduler, i.e. how frequently it happens that unrelated code is scheduled in between, and thus breaking the coalescing. Not sure if that is a real concern ...

See table below.

Yes, that's why I was thinking of a simple bitmap of the 16 actual GPRs, and just set a bit if this GPR is touched (no matter via which super/subreg or alias).

I did this for the 128 bit regs instead since that is what seems mostly relevant in the end.

(B.t.w. if you do go for the one-pair check, it needs to check for one pair except 0/1, because 0/1 isn't usable for ADDR128.)

I am hoping that by using NewRC (which should be either GR128 or ADDR128), this should work as expected.

Considering other types of code with more heavy use of GR128, would we perhaps want to also give a count for a virtual register definition of a GR128? That should avoid being too optimistic when several of these overlap.

SPEC

                 master               patched
lg             : 320549               318463                -2086
stg            : 125311               124234                -1077
lr             :  25927                26482                 +555
lgr            : 343901               343499                 -402
llill          :    982                  870                 -112
st             :  51272                51343                  +71
stfh           :   2037                 2095                  +58
j              : 100186               100129                  -57
c              :  10989                11045                  +56
"Spill|Reload"   168062               165106                -2956

Number of regions / GR128 pairs clobbered

6070 PhysClob: 0
 206 PhysClob: 1
 187 PhysClob: 2
  17 PhysClob: 3
   8 PhysClob: 4

(Never more than 4 clobbered GR128 phys reg pairs. I therefore took "3" out of the blue as a margin.)

Number of regions / MI count

1305 MIs: 3
2489 MIs: 4
 628 MIs: 5
 581 MIs: 6
 258 MIs: 7
 119 MIs: 8
 103 MIs: 9
  94 MIs: 10
  61 MIs: 11
 129 MIs: 12
 220 MIs: 13
  47 MIs: 14
  49 MIs: 15
  31 MIs: 16
  39 MIs: 17
  23 MIs: 18
  42 MIs: 19
  12 MIs: 20
  46 MIs: 21
  55 MIs: 22
   6 MIs: 23
  13 MIs: 24
  14 MIs: 25
  14 MIs: 26
   3 MIs: 27
   6 MIs: 28
   7 MIs: 29
   7 MIs: 30
   1 MIs: 31
  41 MIs: 32
   3 MIs: 33
   3 MIs: 34
   2 MIs: 36
   1 MIs: 37
   1 MIs: 38
   1 MIs: 40
   1 MIs: 41
   1 MIs: 42
   3 MIs: 44
   4 MIs: 45
   2 MIs: 46
   2 MIs: 47
   2 MIs: 48
   2 MIs: 49
   2 MIs: 50
   1 MIs: 51
   1 MIs: 52
   1 MIs: 55
   2 MIs: 56
   1 MIs: 63
   1 MIs: 65
   2 MIs: 67
   1 MIs: 69
   1 MIs: 70
   1 MIs: 71
   1 MIs: 73
   1 MIs: 78
   1 MIs: 103

Great majority of these cases are then as expected small regions, so I am thinking there's no real need for an instruction limit - at least not on SPEC

@Quentin, Matthias: This patch introduces an extra parameter to shouldCoalesce() (LiveIntervals*). Does this look acceptable?

In D37899#881218, @jonpa wrote:

Considering other types of code with more heavy use of GR128, would we perhaps want to also give a count for a virtual register definition of a GR128? That should avoid being too optimistic when several of these overlap.

Not sure, hard to say if this is going to be overall beneficial or not. I'd say to just do the physreg check for now.

Great majority of these cases are then as expected small regions, so I am thinking there's no real need for an instruction limit - at least not on SPEC

OK, makes sense to me.

The code now looks reasonable to me, just a few inline comments.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
166	Hmm, does this still hit if we get ADDR128BitRegClass? Probably needs to check for that instead of GR128BitRegClass here ... We then ideally should also have a test case that uses ADDR128BitRegClass (one way to achieve this is to add a use of "remainder + 1", which gets selected to an LA instruction).
199	Unused now?
209	Is this really necessary / useful? Why not do the whole test just via the SuperRegIterator loop (using IncludeSelf = true)?

Updated per review. Not much change on SPEC compared to last version of the patch:

lg             :               318552               318542      -10
st             :                51343                51348       +5
cgijlh         :                17962                17958       -4
...
Spill|Reload   :               165200               165200       +0

Test case for ADDR128 still missing.

This patch introduces an extra parameter to shouldCoalesce() (LiveIntervals*). Does this look acceptable?

Looks fine to me.

I would use a reference if we don't allow it to be null or update the comment to say that this may be null.

I would use a reference if we don't allow it to be null or update the comment to say that this may be null.

Patch updated to use reference instead of pointer for LiveIntervals argument.

jonpa added inline comments.Sep 28 2017, 3:01 AM

lib/Target/SystemZ/SystemZRegisterInfo.cpp
166	It should be NewRC->hasSuperClassEq(), which I believe returns true only for ADDR128 and GR128. I verified that this works as expected in gdb.
199	removed
209	removed

Add test that exposes an ADDR128 virtual register.

Given that:

Quentin approved the interface change
The SystemZ parts now look good to me
We have performance runs confirming this does not introduce regressions

this patch now LGTM.

This revision is now accepted and ready to land.Sep 29 2017, 5:22 AM

Thanks for review.
r314516

Revision Contents

Path

Size

include/

llvm/

Target/

TargetRegisterInfo.h

4 lines

lib/

CodeGen/

RegisterCoalescer.cpp

2 lines

Target/

AMDGPU/

SIRegisterInfo.h

4 lines

SIRegisterInfo.cpp

3 lines

ARM/

ARMBaseRegisterInfo.h

5 lines

ARMBaseRegisterInfo.cpp

3 lines

SystemZ/

SystemZRegisterInfo.h

12 lines

SystemZRegisterInfo.cpp

67 lines

test/

CodeGen/

SystemZ/

regalloc-GR128.ll

18 lines

Diff 116961

include/llvm/Target/TargetRegisterInfo.h

Show All 34 Lines
namespace llvm {		namespace llvm {

class BitVector;		class BitVector;
class LiveRegMatrix;		class LiveRegMatrix;
class MachineFunction;		class MachineFunction;
class MachineInstr;		class MachineInstr;
class RegScavenger;		class RegScavenger;
class VirtRegMap;		class VirtRegMap;
		class LiveIntervals;

class TargetRegisterClass {		class TargetRegisterClass {
public:		public:
using iterator = const MCPhysReg *;		using iterator = const MCPhysReg *;
using const_iterator = const MCPhysReg *;		using const_iterator = const MCPhysReg *;
using sc_iterator = const TargetRegisterClass* const *;		using sc_iterator = const TargetRegisterClass* const *;

// Instance variables filled by tablegen, do not use!		// Instance variables filled by tablegen, do not use!
▲ Show 20 Lines • Show All 903 Lines • ▼ Show 20 Lines	public:
/// Subtarget Hooks		/// Subtarget Hooks

/// \brief SrcRC and DstRC will be morphed into NewRC if this returns true.		/// \brief SrcRC and DstRC will be morphed into NewRC if this returns true.
virtual bool shouldCoalesce(MachineInstr *MI,		virtual bool shouldCoalesce(MachineInstr *MI,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SubReg,		unsigned SubReg,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
unsigned DstSubReg,		unsigned DstSubReg,
const TargetRegisterClass *NewRC) const		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const
{ return true; }		{ return true; }

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
/// Debug information queries.		/// Debug information queries.

/// getFrameRegister - This method should return the register used as a base		/// getFrameRegister - This method should return the register used as a base
/// for values allocated in the current stack frame.		/// for values allocated in the current stack frame.
virtual unsigned getFrameRegister(const MachineFunction &MF) const = 0;		virtual unsigned getFrameRegister(const MachineFunction &MF) const = 0;
▲ Show 20 Lines • Show All 193 Lines • Show Last 20 Lines

lib/CodeGen/RegisterCoalescer.cpp

Show First 20 Lines • Show All 1,577 Lines • ▼ Show 20 Lines	if (CP.getNewRC()) {
auto DstRC = MRI->getRegClass(CP.getDstReg());		auto DstRC = MRI->getRegClass(CP.getDstReg());
unsigned SrcIdx = CP.getSrcIdx();		unsigned SrcIdx = CP.getSrcIdx();
unsigned DstIdx = CP.getDstIdx();		unsigned DstIdx = CP.getDstIdx();
if (CP.isFlipped()) {		if (CP.isFlipped()) {
std::swap(SrcIdx, DstIdx);		std::swap(SrcIdx, DstIdx);
std::swap(SrcRC, DstRC);		std::swap(SrcRC, DstRC);
}		}
if (!TRI->shouldCoalesce(CopyMI, SrcRC, SrcIdx, DstRC, DstIdx,		if (!TRI->shouldCoalesce(CopyMI, SrcRC, SrcIdx, DstRC, DstIdx,
CP.getNewRC())) {		CP.getNewRC(), *LIS)) {
DEBUG(dbgs() << "\tSubtarget bailed on coalescing.\n");		DEBUG(dbgs() << "\tSubtarget bailed on coalescing.\n");
return false;		return false;
}		}
}		}

// Dead code elimination. This really should be handled by MachineDCE, but		// Dead code elimination. This really should be handled by MachineDCE, but
// sometimes dead copies slip through, and we can't generate invalid live		// sometimes dead copies slip through, and we can't generate invalid live
// ranges.		// ranges.
▲ Show 20 Lines • Show All 1,813 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIRegisterInfo.h

Show All 16 Lines

#include "AMDGPURegisterInfo.h"		#include "AMDGPURegisterInfo.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"		#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIDefines.h"		#include "SIDefines.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"

namespace llvm {		namespace llvm {

		class LiveIntervals;
class MachineRegisterInfo;		class MachineRegisterInfo;
class SISubtarget;		class SISubtarget;
class SIMachineFunctionInfo;		class SIMachineFunctionInfo;

class SIRegisterInfo final : public AMDGPURegisterInfo {		class SIRegisterInfo final : public AMDGPURegisterInfo {
private:		private:
unsigned SGPRSetID;		unsigned SGPRSetID;
unsigned VGPRSetID;		unsigned VGPRSetID;
▲ Show 20 Lines • Show All 174 Lines • ▼ Show 20 Lines	public:
ArrayRef<int16_t> getRegSplitParts(const TargetRegisterClass *RC,		ArrayRef<int16_t> getRegSplitParts(const TargetRegisterClass *RC,
unsigned EltSize) const;		unsigned EltSize) const;

bool shouldCoalesce(MachineInstr *MI,		bool shouldCoalesce(MachineInstr *MI,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SubReg,		unsigned SubReg,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
unsigned DstSubReg,		unsigned DstSubReg,
const TargetRegisterClass *NewRC) const override;		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const override;

unsigned getRegPressureLimit(const TargetRegisterClass *RC,		unsigned getRegPressureLimit(const TargetRegisterClass *RC,
MachineFunction &MF) const override;		MachineFunction &MF) const override;

unsigned getRegPressureSetLimit(const MachineFunction &MF,		unsigned getRegPressureSetLimit(const MachineFunction &MF,
unsigned Idx) const override;		unsigned Idx) const override;

const int *getRegUnitPressureSets(unsigned RegUnit) const override;		const int *getRegUnitPressureSets(unsigned RegUnit) const override;
Show All 22 Lines

lib/Target/AMDGPU/SIRegisterInfo.cpp

Show First 20 Lines • Show All 1,468 Lines • ▼ Show 20 Lines	bool SIRegisterInfo::isVGPR(const MachineRegisterInfo &MRI,
return hasVGPRs(getRegClassForReg(MRI, Reg));		return hasVGPRs(getRegClassForReg(MRI, Reg));
}		}

bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,		bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SubReg,		unsigned SubReg,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
unsigned DstSubReg,		unsigned DstSubReg,
const TargetRegisterClass *NewRC) const {		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const {
unsigned SrcSize = getRegSizeInBits(*SrcRC);		unsigned SrcSize = getRegSizeInBits(*SrcRC);
unsigned DstSize = getRegSizeInBits(*DstRC);		unsigned DstSize = getRegSizeInBits(*DstRC);
unsigned NewSize = getRegSizeInBits(*NewRC);		unsigned NewSize = getRegSizeInBits(*NewRC);

// Do not increase size of registers beyond dword, we would need to allocate		// Do not increase size of registers beyond dword, we would need to allocate
// adjacent registers and constraint regalloc more than needed.		// adjacent registers and constraint regalloc more than needed.

// Always allow dword coalescing.		// Always allow dword coalescing.
▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

lib/Target/ARM/ARMBaseRegisterInfo.h

Show All 21 Lines
#include "llvm/Target/TargetRegisterInfo.h"		#include "llvm/Target/TargetRegisterInfo.h"
#include <cstdint>		#include <cstdint>

#define GET_REGINFO_HEADER		#define GET_REGINFO_HEADER
#include "ARMGenRegisterInfo.inc"		#include "ARMGenRegisterInfo.inc"

namespace llvm {		namespace llvm {

		class LiveIntervals;

/// Register allocation hints.		/// Register allocation hints.
namespace ARMRI {		namespace ARMRI {

enum {		enum {
RegPairOdd = 1,		RegPairOdd = 1,
RegPairEven = 2		RegPairEven = 2
};		};

▲ Show 20 Lines • Show All 161 Lines • ▼ Show 20 Lines	void eliminateFrameIndex(MachineBasicBlock::iterator II,
RegScavenger *RS = nullptr) const override;		RegScavenger *RS = nullptr) const override;

/// \brief SrcRC and DstRC will be morphed into NewRC if this returns true		/// \brief SrcRC and DstRC will be morphed into NewRC if this returns true
bool shouldCoalesce(MachineInstr *MI,		bool shouldCoalesce(MachineInstr *MI,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SubReg,		unsigned SubReg,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
unsigned DstSubReg,		unsigned DstSubReg,
const TargetRegisterClass *NewRC) const override;		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const override;
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_ARM_ARMBASEREGISTERINFO_H		#endif // LLVM_LIB_TARGET_ARM_ARMBASEREGISTERINFO_H

lib/Target/ARM/ARMBaseRegisterInfo.cpp

Show First 20 Lines • Show All 801 Lines • ▼ Show 20 Lines	#endif // NDEBUG
}		}
}		}

bool ARMBaseRegisterInfo::shouldCoalesce(MachineInstr *MI,		bool ARMBaseRegisterInfo::shouldCoalesce(MachineInstr *MI,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SubReg,		unsigned SubReg,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
unsigned DstSubReg,		unsigned DstSubReg,
const TargetRegisterClass *NewRC) const {		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const {
auto MBB = MI->getParent();		auto MBB = MI->getParent();
auto MF = MBB->getParent();		auto MF = MBB->getParent();
const MachineRegisterInfo &MRI = MF->getRegInfo();		const MachineRegisterInfo &MRI = MF->getRegInfo();
// If not copying into a sub-register this should be ok because we shouldn't		// If not copying into a sub-register this should be ok because we shouldn't
// need to split the reg.		// need to split the reg.
if (!DstSubReg)		if (!DstSubReg)
return true;		return true;
// Small registers don't frequently cause a problem, so we can coalesce them.		// Small registers don't frequently cause a problem, so we can coalesce them.
▲ Show 20 Lines • Show All 43 Lines • Show Last 20 Lines

lib/Target/SystemZ/SystemZRegisterInfo.h

Show All 12 Lines
#include "SystemZ.h"		#include "SystemZ.h"
#include "llvm/Target/TargetRegisterInfo.h"		#include "llvm/Target/TargetRegisterInfo.h"

#define GET_REGINFO_HEADER		#define GET_REGINFO_HEADER
#include "SystemZGenRegisterInfo.inc"		#include "SystemZGenRegisterInfo.inc"

namespace llvm {		namespace llvm {

		class LiveIntervals;

namespace SystemZ {		namespace SystemZ {
// Return the subreg to use for referring to the even and odd registers		// Return the subreg to use for referring to the even and odd registers
// in a GR128 pair. Is32Bit says whether we want a GR32 or GR64.		// in a GR128 pair. Is32Bit says whether we want a GR32 or GR64.
inline unsigned even128(bool Is32bit) {		inline unsigned even128(bool Is32bit) {
return Is32bit ? subreg_hl32 : subreg_h64;		return Is32bit ? subreg_hl32 : subreg_h64;
}		}
inline unsigned odd128(bool Is32bit) {		inline unsigned odd128(bool Is32bit) {
return Is32bit ? subreg_l32 : subreg_l64;		return Is32bit ? subreg_l32 : subreg_l64;
Show All 25 Lines	public:
}		}
const MCPhysReg getCalleeSavedRegs(const MachineFunction MF) const override;		const MCPhysReg getCalleeSavedRegs(const MachineFunction MF) const override;
const uint32_t *getCallPreservedMask(const MachineFunction &MF,		const uint32_t *getCallPreservedMask(const MachineFunction &MF,
CallingConv::ID CC) const override;		CallingConv::ID CC) const override;
BitVector getReservedRegs(const MachineFunction &MF) const override;		BitVector getReservedRegs(const MachineFunction &MF) const override;
void eliminateFrameIndex(MachineBasicBlock::iterator MI,		void eliminateFrameIndex(MachineBasicBlock::iterator MI,
int SPAdj, unsigned FIOperandNum,		int SPAdj, unsigned FIOperandNum,
RegScavenger *RS) const override;		RegScavenger *RS) const override;

		/// \brief SrcRC and DstRC will be morphed into NewRC if this returns true.
		bool shouldCoalesce(MachineInstr *MI,
		const TargetRegisterClass *SrcRC,
		unsigned SubReg,
		const TargetRegisterClass *DstRC,
		unsigned DstSubReg,
		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const override;

unsigned getFrameRegister(const MachineFunction &MF) const override;		unsigned getFrameRegister(const MachineFunction &MF) const override;
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

lib/Target/SystemZ/SystemZRegisterInfo.cpp

//===-- SystemZRegisterInfo.cpp - SystemZ register information ------------===//		//===-- SystemZRegisterInfo.cpp - SystemZ register information ------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "SystemZRegisterInfo.h"		#include "SystemZRegisterInfo.h"
#include "SystemZInstrInfo.h"		#include "SystemZInstrInfo.h"
#include "SystemZSubtarget.h"		#include "SystemZSubtarget.h"
		#include "llvm/CodeGen/LiveIntervalAnalysis.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Target/TargetFrameLowering.h"		#include "llvm/Target/TargetFrameLowering.h"

using namespace llvm;		using namespace llvm;

#define GET_REGINFO_TARGET_DESC		#define GET_REGINFO_TARGET_DESC
#include "SystemZGenRegisterInfo.inc"		#include "SystemZGenRegisterInfo.inc"
▲ Show 20 Lines • Show All 126 Lines • ▼ Show 20 Lines	if (MI->getDesc().TSFlags & SystemZII::HasIndex
MI->getOperand(FIOperandNum).ChangeToRegister(ScratchReg,		MI->getOperand(FIOperandNum).ChangeToRegister(ScratchReg,
false, false, true);		false, false, true);
}		}
}		}
MI->setDesc(TII->get(OpcodeForOffset));		MI->setDesc(TII->get(OpcodeForOffset));
MI->getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);		MI->getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
}		}

		bool SystemZRegisterInfo::shouldCoalesce(MachineInstr *MI,
		const TargetRegisterClass *SrcRC,
		unsigned SubReg,
		const TargetRegisterClass *DstRC,
		unsigned DstSubReg,
		const TargetRegisterClass *NewRC,
		LiveIntervals &LIS) const {
		assert (MI->isCopy() && "Only expecting COPY instructions");

		uweigandUnsubmitted Done Reply Inline Actions While turning subreg liveness on fixes the simpler test case you have here, the even more restricted test case I've added to the bugzilla actually still fails even then. So I'm not sure we should mention subreg liveness here as the reason why this is needed. uweigand: While turning subreg liveness on fixes the simpler test case you have here, the even more…
		jonpaAuthorUnsubmitted Not Done Reply Inline Actions I see - removed. jonpa: I see - removed.
		// Coalesce anything which is not a COPY involving a subreg to/from GR128.
		if (!(NewRC->hasSuperClassEq(&SystemZ::GR128BitRegClass) &&
		uweigandUnsubmitted Not Done Reply Inline Actions Hmm, does this still hit if we get ADDR128BitRegClass? Probably needs to check for that instead of GR128BitRegClass here ... We then ideally should also have a test case that uses ADDR128BitRegClass (one way to achieve this is to add a use of "remainder + 1", which gets selected to an LA instruction). uweigand: Hmm, does this still hit if we get ADDR128BitRegClass? Probably needs to check for that…
		jonpaAuthorUnsubmitted Not Done Reply Inline Actions It should be NewRC->hasSuperClassEq(), which I believe returns true only for ADDR128 and GR128. I verified that this works as expected in gdb. jonpa: It should be NewRC->hasSuperClassEq(), which I believe returns true only for ADDR128 and GR128.
		(getRegSizeInBits(SrcRC) <= 64 \|\| getRegSizeInBits(DstRC) <= 64)))
		uweigandUnsubmitted Not Done Reply Inline Actions Should this be symmetrical? I.e. something like if (NewRC->hasSubClassEq(&SystemZ::GR128BitRegClass) && (getRegSizeInBits(SrcRC) <= 64 \|\| getRegSizeInBits(DstRC) <= 64)) return false; uweigand: Should this be symmetrical? I.e. something like ``` if (NewRC->hasSubClassEq(&SystemZ…
		jonpaAuthorUnsubmitted Not Done Reply Inline Actions Do we need a test case for this also? jonpa: Do we need a test case for this also?
		return true;

		// Allow coalescing of a GR128 subreg COPY only if the live ranges are small
		// and local to one MBB with not too much interferring registers. Otherwise
		// regalloc may run out of registers.

		unsigned WideOpNo = (getRegSizeInBits(*SrcRC) == 128 ? 1 : 0);
		unsigned GR128Reg = MI->getOperand(WideOpNo).getReg();
		unsigned GRNarReg = MI->getOperand((WideOpNo == 1) ? 0 : 1).getReg();
		LiveInterval &IntGR128 = LIS.getInterval(GR128Reg);
		LiveInterval &IntGRNar = LIS.getInterval(GRNarReg);

		// Check that the two virtual registers are local to MBB.
		MachineBasicBlock *MBB = MI->getParent();
		if (LIS.isLiveInToMBB(IntGR128, MBB) \|\| LIS.isLiveOutOfMBB(IntGR128, MBB) \|\|
		LIS.isLiveInToMBB(IntGRNar, MBB) \|\| LIS.isLiveOutOfMBB(IntGRNar, MBB))
		return false;

		// Find the first and last MIs of the registers.
		MachineInstr FirstMI = nullptr, LastMI = nullptr;
		if (WideOpNo == 1) {
		FirstMI = LIS.getInstructionFromIndex(IntGR128.beginIndex());
		LastMI = LIS.getInstructionFromIndex(IntGRNar.endIndex());
		} else {
		FirstMI = LIS.getInstructionFromIndex(IntGRNar.beginIndex());
		LastMI = LIS.getInstructionFromIndex(IntGR128.endIndex());
		}
		assert (FirstMI && LastMI && "No instruction from index?");

		// Check if coalescing seems safe by finding the set of clobbered physreg
		// pairs in the region.
		BitVector PhysClobbered(getNumRegs());
		uweigandUnsubmitted Done Reply Inline Actions Unused now? uweigand: Unused now?
		jonpaAuthorUnsubmitted Not Done Reply Inline Actions removed jonpa: removed
		MachineBasicBlock::iterator MII = FirstMI, MEE = LastMI;
		MEE++;
		for (; MII != MEE; ++MII) {
		for (const MachineOperand &MO : MII->operands())
		if (MO.isReg() && isPhysicalRegister(MO.getReg())) {
		for (MCSuperRegIterator SI(MO.getReg(), this, true/IncludeSelf/);
		SI.isValid(); ++SI)
		if (NewRC->contains(*SI)) {
		PhysClobbered.set(*SI);
		break;
		uweigandUnsubmitted Done Reply Inline Actions Is this really necessary / useful? Why not do the whole test just via the SuperRegIterator loop (using IncludeSelf = true)? uweigand: Is this really necessary / useful? Why not do the whole test just via the SuperRegIterator…
		jonpaAuthorUnsubmitted Not Done Reply Inline Actions removed jonpa: removed
		}
		}
		}

		// Demand an arbitrary margin of free regs.
		unsigned const DemandedFreeGR128 = 3;
		if (PhysClobbered.count() > (NewRC->getNumRegs() - DemandedFreeGR128))
		return false;

		return true;
		}

unsigned		unsigned
SystemZRegisterInfo::getFrameRegister(const MachineFunction &MF) const {		SystemZRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
const SystemZFrameLowering *TFI = getFrameLowering(MF);		const SystemZFrameLowering *TFI = getFrameLowering(MF);
return TFI->hasFP(MF) ? SystemZ::R11D : SystemZ::R15D;		return TFI->hasFP(MF) ? SystemZ::R11D : SystemZ::R15D;
}		}

test/CodeGen/SystemZ/regalloc-GR128.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 -O3 -o /dev/null
				;
				; Test that regalloc does not run out of registers

				; This test will include a GR128 virtual reg.
				define void @test0(i64 %dividend, i64 %divisor) {
				%rem = urem i64 %dividend, %divisor
				uweigandUnsubmitted Done Reply Inline Actions Wouldn't just running llc and checking its exit code already test that there is no LLVM ERROR? uweigand: Wouldn't just running llc and checking its exit code already test that there is no LLVM ERROR?
				jonpaAuthorUnsubmitted Not Done Reply Inline Actions OK - I thought it was a nice "comment", but I guess we already have a comment at the top of the file. jonpa: OK - I thought it was a nice "comment", but I guess we already have a comment at the top of the…
				call void asm sideeffect "", "{r0},{r1},{r2},{r3},{r4},{r5},{r6},{r7},{r8},{r9},{r10},{r11},{r12},{r13},{r14}"(i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 %rem)
				ret void
				}
				uweigandUnsubmitted Done Reply Inline Actions Maybe better to use the more restricted test from the bugzilla. uweigand: Maybe better to use the more restricted test from the bugzilla.
				jonpaAuthorUnsubmitted Not Done Reply Inline Actions yes. jonpa: yes.

				; This test will include an ADDR128 virtual reg.
				define i64 @test1(i64 %dividend, i64 %divisor) {
				%rem = urem i64 %dividend, %divisor
				call void asm sideeffect "", "{r2},{r3},{r4},{r5},{r6},{r7},{r8},{r9},{r10},{r11},{r12},{r13},{r14}"(i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 %rem)
				%ret = add i64 %rem, 1
				ret i64 %ret
				}