This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement shouldCoalesce() to help regalloc to avoid running out of registers with GR128 regs
ClosedPublic

Authored by jonpa on Sep 15 2017, 4:27 AM.

Details

Summary

Fix for https://bugs.llvm.org/show_bug.cgi?id=34610

Regalloc runs out of registers in a test case with GR128 register. This can be remedied by not allowing coalescing copys of subregs from GR128 regs in shouldCoalesce().

Diff Detail

Event Timeline

jonpa created this revision.Sep 15 2017, 4:27 AM
uweigand edited edge metadata.Sep 15 2017, 9:30 AM

Thanks for trying this suggestion! See a couple of inline comments.

Also, I'm wondering doing less coalescing here may lead to reduced performance somewhere? It would be good to verify at least some benchmarks to make sure we aren't too restrictive now.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
163

While turning subreg liveness on fixes the simpler test case you have here, the even more restricted test case I've added to the bugzilla actually still fails even then. So I'm not sure we should mention subreg liveness here as the reason why this is needed.

166

Should this be symmetrical? I.e. something like

if (NewRC->hasSubClassEq(&SystemZ::GR128BitRegClass) &&
    (getRegSizeInBits(*SrcRC) <= 64 || getRegSizeInBits(*DstRC) <= 64))
  return false;
test/CodeGen/SystemZ/regalloc-GR128.ll
7

Wouldn't just running llc and checking its exit code already test that there is no LLVM ERROR?

10

Maybe better to use the more restricted test from the bugzilla.

jonpa marked 3 inline comments as done.Sep 19 2017, 12:05 AM
jonpa added inline comments.
lib/Target/SystemZ/SystemZRegisterInfo.cpp
163

I see - removed.

166

Do we need a test case for this also?

test/CodeGen/SystemZ/regalloc-GR128.ll
7

OK - I thought it was a nice "comment", but I guess we already have a comment at the top of the file.

10

yes.

jonpa updated this revision to Diff 115794.Sep 19 2017, 12:08 AM
jonpa marked 3 inline comments as done.

Updated per review.

jonpa updated this revision to Diff 115832.Sep 19 2017, 6:17 AM

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

Huh. This seems more extreme than I expected, it looks like just about *every* load of a 128-bit value from memory now reloads into another set of registers, even in those quite simple test cases ... Is this just the coalescing or is there something else going on?

jonpa added a comment.Sep 22 2017, 6:00 AM

Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch.

Huh. This seems more extreme than I expected, it looks like just about *every* load of a 128-bit value from memory now reloads into another set of registers, even in those quite simple test cases ... Is this just the coalescing or is there something else going on?

Yes, this is indeed more than needed.

For atomic-load-05.ll, I see during isel:

Optimized lowered selection DAG: BB#0 'f1:'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t7: i128,ch = AtomicLoad<Volatile LD16[%src]> t6, t4
      t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
    t11: ch = store<ST16[<unknown>](align=8)> t7:1, t7, t8, undef:i64
  t12: ch = SystemZISD::RET_FLAG t11


Type-legalized selection DAG: BB#0 'f1:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
  t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t13: Untyped,ch = SystemZISD::ATOMIC_LOAD_128<Volatile LD16[%src]> t6, t4
        t15: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<2>
      t19: ch = store<ST8[<unknown>]> t13:1, t15, t8, undef:i64
        t17: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<7>
        t21: i64 = add t8, Constant:i64<8>
      t22: ch = store<ST8[<unknown>]> t13:1, t17, t21, undef:i64
    t23: ch = TokenFactor t19, t22
  t12: ch = SystemZISD::RET_FLAG t23

It looks to me that SystemZ does not have i128 as a legal type, and therefore the 128 bit store gets expanded. If the coalescing is then turned off, one of the subregs will need an extra COPY as the GR128 reg is live past the first store.

Maybe this would be a good time to try and make i128 legal as we have discussed earlier?

  1. untyped -> i128
  2. Add pseudos that expands after register allocation, so that the coalescing problems we see here go away.
  3. Make sure that all the i128 operations that actually does not have support still gets expanded per current behavior.

anything else? Any hacks relating to untyped that we know should go away then?

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I've been thinking about alternative solutions to this problem. One way might be to prevent coalescing (as this patch does) and then try to *encourage* (but not enforce) use of the register pair earlier, e.g. by providing register allocator hints for those registers that would have been coalesced with the pair if we had not prevented this - sort of like the ARM register pair hints. But it turns out this is no so easy, because the allocator will assume those hints conflict with the original register pair (unless we switch on enableSubRegLiveness, but this seems to have other problems, it looks like it causes a number of internal compiler errors in the test suite).

Another option might be to not *completely* prevent coalescing a 128-bit pair with its component registers, but allow it in cases where there is no excessive register pressure. But I'm not sure exactly how to get at this information at the time we make the coalescing decision ...

jonpa updated this revision to Diff 116522.Sep 25 2017, 4:55 AM

Patch updated.

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

Another option might be to not *completely* prevent coalescing a 128-bit pair with its component registers, but allow it in cases where there is no excessive register pressure. But I'm not sure exactly how to get at this information at the time we make the coalescing decision ...

There is a current discussion relating to the mischeduler register pressure tracking about the fact that there is no real support to query global liveness yet. It may be that this will get added soon, and perhaps then some cheap register pressure estimate might be available.

Until then, I thought it might serve well to at least allow coalescing in cases of tiny, local live-ranges, e.g. loading and then storing 128 bits.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

On SPEC, I see

Current patch:

               master                 patched
lg             : 320549               318515                -2034
stg            : 125311               124230                -1081
lr             :  25927                26632                 +705
lgr            : 343901               343602                 -299
llill          :    982                  869                 -113
c              :  10989                11061                  +72

"Spill|Reload"   168062               165145                -2917

Patch updated.

I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it.

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

It turns out this doesn't work. E.g. in the case of a 128-bit add, this is split into a sequence of 64-bit add-with-carry operations by the *type legalization* code, not the normal expander. If you declare i128 a legal type, the type legalizer doesn't do this any more. But then if you mark the ADD as Expand, the normal expander doesn't know what to do with it either, and -I think- just falls back to a libcall.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

My point was for a 128-bit load feeding into a 128-bit store, both the load and the store gets expanded to two 64-bit loads / stores anyway. So we end up with two completely independent 64-bit data flow paths, which require 64-bit registers --- but there is absolutely no benefit to require that these registers form an even/odd pair, they might as well be arbitrary registers (it might even be the same one if the sequence ends up scheduled as LG / STG / LG / STG). Actually the same applies to most of the other cases where the 128-bit operation ends up split into 64-bit operations by the type legalizer.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

This looks promising. Not sure if anybody would object to changing the interface to pass in the LiveIntervals, but it seems reasonable to me.

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

jonpa added a comment.Sep 25 2017, 6:25 AM

I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support...

It turns out this doesn't work. E.g. in the case of a 128-bit add, this is split into a sequence of 64-bit add-with-carry operations by the *type legalization* code, not the normal expander. If you declare i128 a legal type, the type legalizer doesn't do this any more. But then if you mark the ADD as Expand, the normal expander doesn't know what to do with it either, and -I think- just falls back to a libcall.

i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here.

I would have expected that for a tiny live range, it is always better to avoid extra COPYs.

My point was for a 128-bit load feeding into a 128-bit store, both the load and the store gets expanded to two 64-bit loads / stores anyway. So we end up with two completely independent 64-bit data flow paths, which require 64-bit registers --- but there is absolutely no benefit to require that these registers form an even/odd pair, they might as well be arbitrary registers (it might even be the same one if the sequence ends up scheduled as LG / STG / LG / STG). Actually the same applies to most of the other cases where the 128-bit operation ends up split into 64-bit operations by the type legalizer.

I see.

The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well.

I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case.

This looks promising. Not sure if anybody would object to changing the interface to pass in the LiveIntervals, but it seems reasonable to me.

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

I don't know, I just thought that we basically wanted to at least handle the type of case where there is a GR128 op and then some subreg COPY(s) very close. My reason for having the instruction limit is that we can't know the register pressure, but it seems safe to think that tiny live ranges checked in this way could be coalesced.

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

Checking for GPRs is something I seem to have missed, but however the different reg check is something I actually took away after first using a SmallSet. I thought that if we would to this properly, we should in addition also check aliases after finding the set of different regs. I then changed my mind since we are merely trying to do an estimate, and we don't really know for sure what the right limit is, etc and may rather just keep it simple (BTW, I also see that it should do +=2 for a GR128 reg).

I guess it depends on the code and if we are trying to handle "any" type of mixed code with an as-good-as-possible estimate, or are we merely trying to keep coalescing the simple and small sequences?

Taking this a step further, if regalloc can only run out of registers when enough physreg defs overlap the GR128 live range so that there is no single 128-bit register pair to be allocated, then we should basically allow coalescing if there is at least one non-clobbered i128 register pair in the region, right?

Even if there were many such GR128 virtregs overlapping, one would hope that regalloc would be able to split them so that one such free pair would be enough. Would we want to stop coalescing at that some point in order to reduce spilling? We could perhaps try to by looking for GR128 virtual register operands in the region.

Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range?

I don't know, I just thought that we basically wanted to at least handle the type of case where there is a GR128 op and then some subreg COPY(s) very close. My reason for having the instruction limit is that we can't know the register pressure, but it seems safe to think that tiny live ranges checked in this way could be coalesced.

Ah, I misread the code -- if we have more than 10 instructions, you default to *not* coalese (I somehow thought it was the opposite). That seems more reasonable. The interesting question will then be how this interacts with the scheduler, i.e. how frequently it happens that unrelated code is scheduled in between, and thus breaking the coalescing. Not sure if that is a real concern ...

It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register.

Checking for GPRs is something I seem to have missed, but however the different reg check is something I actually took away after first using a SmallSet. I thought that if we would to this properly, we should in addition also check aliases after finding the set of different regs. I then changed my mind since we are merely trying to do an estimate, and we don't really know for sure what the right limit is, etc and may rather just keep it simple (BTW, I also see that it should do +=2 for a GR128 reg).

Yes, that's why I was thinking of a simple bitmap of the 16 actual GPRs, and just set a bit if this GPR is touched (no matter via which super/subreg or alias).

Taking this a step further, if regalloc can only run out of registers when enough physreg defs overlap the GR128 live range so that there is no single 128-bit register pair to be allocated, then we should basically allow coalescing if there is at least one non-clobbered i128 register pair in the region, right?

Even if there were many such GR128 virtregs overlapping, one would hope that regalloc would be able to split them so that one such free pair would be enough. Would we want to stop coalescing at that some point in order to reduce spilling? We could perhaps try to by looking for GR128 virtual register operands in the region.

Strictly speaking, yes. But I think it also doesn't matter to avoid coalescing even if the register pressure isn't *that* extreme -- it is true that as long as at least one GPR pair is available, the allocator should be able to fix up anything via spilling, but in general spilling is more expensive than extra copying due to non-coalescing would have been ...

(B.t.w. if you do go for the one-pair check, it needs to check for one pair except 0/1, because 0/1 isn't usable for ADDR128.)

jonpa updated this revision to Diff 116675.Sep 26 2017, 9:26 AM

Improved to accurately keep track of which registers are actually clobbered in the region by means of a BitVector.

Ah, I misread the code -- if we have more than 10 instructions, you default to *not* coalese (I somehow thought it was the opposite). That seems more reasonable. The interesting question will then be how this interacts with the scheduler, i.e. how frequently it happens that unrelated code is scheduled in between, and thus breaking the coalescing. Not sure if that is a real concern ...

See table below.

Yes, that's why I was thinking of a simple bitmap of the 16 actual GPRs, and just set a bit if this GPR is touched (no matter via which super/subreg or alias).

I did this for the 128 bit regs instead since that is what seems mostly relevant in the end.

(B.t.w. if you do go for the one-pair check, it needs to check for one pair except 0/1, because 0/1 isn't usable for ADDR128.)

I am hoping that by using NewRC (which should be either GR128 or ADDR128), this should work as expected.

Considering other types of code with more heavy use of GR128, would we perhaps want to also give a count for a virtual register definition of a GR128? That should avoid being too optimistic when several of these overlap.

SPEC

                 master               patched
lg             : 320549               318463                -2086
stg            : 125311               124234                -1077
lr             :  25927                26482                 +555
lgr            : 343901               343499                 -402
llill          :    982                  870                 -112
st             :  51272                51343                  +71
stfh           :   2037                 2095                  +58
j              : 100186               100129                  -57
c              :  10989                11045                  +56
"Spill|Reload"   168062               165106                -2956

Number of regions / GR128 pairs clobbered

6070 PhysClob: 0
 206 PhysClob: 1
 187 PhysClob: 2
  17 PhysClob: 3
   8 PhysClob: 4

(Never more than 4 clobbered GR128 phys reg pairs. I therefore took "3" out of the blue as a margin.)

Number of regions / MI count

1305 MIs: 3
2489 MIs: 4
 628 MIs: 5
 581 MIs: 6
 258 MIs: 7
 119 MIs: 8
 103 MIs: 9
  94 MIs: 10
  61 MIs: 11
 129 MIs: 12
 220 MIs: 13
  47 MIs: 14
  49 MIs: 15
  31 MIs: 16
  39 MIs: 17
  23 MIs: 18
  42 MIs: 19
  12 MIs: 20
  46 MIs: 21
  55 MIs: 22
   6 MIs: 23
  13 MIs: 24
  14 MIs: 25
  14 MIs: 26
   3 MIs: 27
   6 MIs: 28
   7 MIs: 29
   7 MIs: 30
   1 MIs: 31
  41 MIs: 32
   3 MIs: 33
   3 MIs: 34
   2 MIs: 36
   1 MIs: 37
   1 MIs: 38
   1 MIs: 40
   1 MIs: 41
   1 MIs: 42
   3 MIs: 44
   4 MIs: 45
   2 MIs: 46
   2 MIs: 47
   2 MIs: 48
   2 MIs: 49
   2 MIs: 50
   1 MIs: 51
   1 MIs: 52
   1 MIs: 55
   2 MIs: 56
   1 MIs: 63
   1 MIs: 65
   2 MIs: 67
   1 MIs: 69
   1 MIs: 70
   1 MIs: 71
   1 MIs: 73
   1 MIs: 78
   1 MIs: 103

Great majority of these cases are then as expected small regions, so I am thinking there's no real need for an instruction limit - at least not on SPEC

jonpa added a comment.Sep 26 2017, 9:28 AM

@Quentin, Matthias: This patch introduces an extra parameter to shouldCoalesce() (LiveIntervals*). Does this look acceptable?

Considering other types of code with more heavy use of GR128, would we perhaps want to also give a count for a virtual register definition of a GR128? That should avoid being too optimistic when several of these overlap.

Not sure, hard to say if this is going to be overall beneficial or not. I'd say to just do the physreg check for now.

Great majority of these cases are then as expected small regions, so I am thinking there's no real need for an instruction limit - at least not on SPEC

OK, makes sense to me.

The code now looks reasonable to me, just a few inline comments.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
165

Hmm, does this still hit if we get ADDR128BitRegClass? Probably needs to check for that instead of GR128BitRegClass here ...

We then ideally should also have a test case that uses ADDR128BitRegClass (one way to achieve this is to add a use of "remainder + 1", which gets selected to an LA instruction).

198

Unused now?

208

Is this really necessary / useful? Why not do the whole test just via the SuperRegIterator loop (using IncludeSelf = true)?

jonpa updated this revision to Diff 116811.Sep 27 2017, 7:43 AM
jonpa marked 2 inline comments as done.

Updated per review. Not much change on SPEC compared to last version of the patch:

lg             :               318552               318542      -10
st             :                51343                51348       +5
cgijlh         :                17962                17958       -4
...
Spill|Reload   :               165200               165200       +0

Test case for ADDR128 still missing.

qcolombet edited edge metadata.Sep 27 2017, 1:52 PM

This patch introduces an extra parameter to shouldCoalesce() (LiveIntervals*). Does this look acceptable?

Looks fine to me.

I would use a reference if we don't allow it to be null or update the comment to say that this may be null.

jonpa updated this revision to Diff 116937.Sep 28 2017, 12:51 AM

I would use a reference if we don't allow it to be null or update the comment to say that this may be null.

Patch updated to use reference instead of pointer for LiveIntervals argument.

jonpa added inline comments.Sep 28 2017, 3:01 AM
lib/Target/SystemZ/SystemZRegisterInfo.cpp
165

It should be NewRC->hasSuperClassEq(), which I believe returns true only for ADDR128 and GR128. I verified that this works as expected in gdb.

198

removed

208

removed

jonpa updated this revision to Diff 116961.Sep 28 2017, 3:28 AM

Add test that exposes an ADDR128 virtual register.

uweigand accepted this revision.Sep 29 2017, 5:22 AM

Given that:

  • Quentin approved the interface change
  • The SystemZ parts now look good to me
  • We have performance runs confirming this does not introduce regressions

this patch now LGTM.

This revision is now accepted and ready to land.Sep 29 2017, 5:22 AM
jonpa closed this revision.Sep 29 2017, 7:35 AM

Thanks for review.
r314516