Fix for https://bugs.llvm.org/show_bug.cgi?id=34610
Regalloc runs out of registers in a test case with GR128 register. This can be remedied by not allowing coalescing copys of subregs from GR128 regs in shouldCoalesce().
|  Differential  D37899  
[SystemZ] Implement shouldCoalesce() to help regalloc to avoid running out of registers with GR128 regs Authored by jonpa on Sep 15 2017, 4:27 AM. 
Details Fix for https://bugs.llvm.org/show_bug.cgi?id=34610 Regalloc runs out of registers in a test case with GR128 register. This can be remedied by not allowing coalescing copys of subregs from GR128 regs in shouldCoalesce(). 
Diff Detail Event TimelineComment Actions Thanks for trying this suggestion! See a couple of inline comments. Also, I'm wondering doing less coalescing here may lead to reduced performance somewhere? It would be good to verify at least some benchmarks to make sure we aren't too restrictive now. 
 Comment Actions Test cases updated. As expected, there are now more register moves whenever the coalescer is disabled by this patch. Comment Actions Huh. This seems more extreme than I expected, it looks like just about *every* load of a 128-bit value from memory now reloads into another set of registers, even in those quite simple test cases ... Is this just the coalescing or is there something else going on? Comment Actions Yes, this is indeed more than needed. For atomic-load-05.ll, I see during isel: Optimized lowered selection DAG: BB#0 'f1:'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t7: i128,ch = AtomicLoad<Volatile LD16[%src]> t6, t4
      t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
    t11: ch = store<ST16[<unknown>](align=8)> t7:1, t7, t8, undef:i64
  t12: ch = SystemZISD::RET_FLAG t11
Type-legalized selection DAG: BB#0 'f1:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
  t8: i64,ch = CopyFromReg t0, Register:i64 %vreg2
      t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
    t6: ch = CopyToReg t0, Register:i64 %vreg2, t2
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t13: Untyped,ch = SystemZISD::ATOMIC_LOAD_128<Volatile LD16[%src]> t6, t4
        t15: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<2>
      t19: ch = store<ST8[<unknown>]> t13:1, t15, t8, undef:i64
        t17: i64 = EXTRACT_SUBREG t13, TargetConstant:i32<7>
        t21: i64 = add t8, Constant:i64<8>
      t22: ch = store<ST8[<unknown>]> t13:1, t17, t21, undef:i64
    t23: ch = TokenFactor t19, t22
  t12: ch = SystemZISD::RET_FLAG t23It looks to me that SystemZ does not have i128 as a legal type, and therefore the 128 bit store gets expanded. If the coalescing is then turned off, one of the subregs will need an extra COPY as the GR128 reg is live past the first store. Maybe this would be a good time to try and make i128 legal as we have discussed earlier? 
 anything else? Any hacks relating to untyped that we know should go away then? Comment Actions I looked into making i128 legal when I recently added the 128-bit atomics, but in the end decided against it. Not only would it be a lot of work (since you basically would have to reimplement all the operations on i128 that are currently handled by common code), but it is not clear that we actually always want it. i128 lives in a even/odd register pair, but this is really only needed for the few special instructions that operate on those pairs. If you had just a normal load followed by a normal store of a i128 value, you *want* to break it into two independent i64 copies -- there is no reason at all to constrain the register allocator to use a even/odd pair here. I've been thinking about alternative solutions to this problem. One way might be to prevent coalescing (as this patch does) and then try to *encourage* (but not enforce) use of the register pair earlier, e.g. by providing register allocator hints for those registers that would have been coalesced with the pair if we had not prevented this - sort of like the ARM register pair hints. But it turns out this is no so easy, because the allocator will assume those hints conflict with the original register pair (unless we switch on enableSubRegLiveness, but this seems to have other problems, it looks like it causes a number of internal compiler errors in the test suite). Another option might be to not *completely* prevent coalescing a 128-bit pair with its component registers, but allow it in cases where there is no excessive register pressure. But I'm not sure exactly how to get at this information at the time we make the coalescing decision ... Comment Actions Patch updated. 
 I thought one could get away with adding calling setOperationAction(..., i128, Expand) on all operations that we do not support... 
 I would have expected that for a tiny live range, it is always better to avoid extra COPYs. 
 There is a current discussion relating to the mischeduler register pressure tracking about the fact that there is no real support to query global liveness yet. It may be that this will get added soon, and perhaps then some cheap register pressure estimate might be available. Until then, I thought it might serve well to at least allow coalescing in cases of tiny, local live-ranges, e.g. loading and then storing 128 bits. The updated patch uses LiveIntervals to quickly check if the two connected vregs are local to MBB and what the first and last instructions are (it may be possible to do this without LIS if we are willing to scan the whole MBB). I merely tried some values of 10 instructions and 8 phys-reg operands in order to detect a non-safe live-range. It may be that we want to check for vreg operands as well. I don't have adequate tests to tune this further, but this might hopefully be a good starting point. At least all the regressions we saw before are now gone, while still handling your test case. On SPEC, I see Current patch: master patched lg : 320549 318515 -2034 stg : 125311 124230 -1081 lr : 25927 26632 +705 lgr : 343901 343602 -299 llill : 982 869 -113 c : 10989 11061 +72 "Spill|Reload" 168062 165145 -2917 Comment Actions It turns out this doesn't work. E.g. in the case of a 128-bit add, this is split into a sequence of 64-bit add-with-carry operations by the *type legalization* code, not the normal expander. If you declare i128 a legal type, the type legalizer doesn't do this any more. But then if you mark the ADD as Expand, the normal expander doesn't know what to do with it either, and -I think- just falls back to a libcall. 
 My point was for a 128-bit load feeding into a 128-bit store, both the load and the store gets expanded to two 64-bit loads / stores anyway. So we end up with two completely independent 64-bit data flow paths, which require 64-bit registers --- but there is absolutely no benefit to require that these registers form an even/odd pair, they might as well be arbitrary registers (it might even be the same one if the sequence ends up scheduled as LG / STG / LG / STG). Actually the same applies to most of the other cases where the 128-bit operation ends up split into 64-bit operations by the type legalizer. 
 This looks promising. Not sure if anybody would object to changing the interface to pass in the LiveIntervals, but it seems reasonable to me. Not sure about the magic numbers -- does it matter whether we stop after 10 MIs? What kind of compile-time performance impact would it have to scan the whole range? It looks like your check for > 8 occurrences of physical registers would trigger even if it is always the same register. Should we instead keep a bitmap of the used registers, so we know how many *different* ones are used? Also, it should probably only check for GPRs, not any random physical register. Comment Actions 
 I see. 
 I don't know, I just thought that we basically wanted to at least handle the type of case where there is a GR128 op and then some subreg COPY(s) very close. My reason for having the instruction limit is that we can't know the register pressure, but it seems safe to think that tiny live ranges checked in this way could be coalesced. 
 Checking for GPRs is something I seem to have missed, but however the different reg check is something I actually took away after first using a SmallSet. I thought that if we would to this properly, we should in addition also check aliases after finding the set of different regs. I then changed my mind since we are merely trying to do an estimate, and we don't really know for sure what the right limit is, etc and may rather just keep it simple (BTW, I also see that it should do +=2 for a GR128 reg). I guess it depends on the code and if we are trying to handle "any" type of mixed code with an as-good-as-possible estimate, or are we merely trying to keep coalescing the simple and small sequences? Taking this a step further, if regalloc can only run out of registers when enough physreg defs overlap the GR128 live range so that there is no single 128-bit register pair to be allocated, then we should basically allow coalescing if there is at least one non-clobbered i128 register pair in the region, right? Even if there were many such GR128 virtregs overlapping, one would hope that regalloc would be able to split them so that one such free pair would be enough. Would we want to stop coalescing at that some point in order to reduce spilling? We could perhaps try to by looking for GR128 virtual register operands in the region. Comment Actions Ah, I misread the code -- if we have more than 10 instructions, you default to *not* coalese (I somehow thought it was the opposite). That seems more reasonable. The interesting question will then be how this interacts with the scheduler, i.e. how frequently it happens that unrelated code is scheduled in between, and thus breaking the coalescing. Not sure if that is a real concern ... 
 Yes, that's why I was thinking of a simple bitmap of the 16 actual GPRs, and just set a bit if this GPR is touched (no matter via which super/subreg or alias). 
 Strictly speaking, yes. But I think it also doesn't matter to avoid coalescing even if the register pressure isn't *that* extreme -- it is true that as long as at least one GPR pair is available, the allocator should be able to fix up anything via spilling, but in general spilling is more expensive than extra copying due to non-coalescing would have been ... (B.t.w. if you do go for the one-pair check, it needs to check for one pair except 0/1, because 0/1 isn't usable for ADDR128.) Comment Actions Improved to accurately keep track of which registers are actually clobbered in the region by means of a BitVector. 
 See table below. 
 I did this for the 128 bit regs instead since that is what seems mostly relevant in the end. 
 I am hoping that by using NewRC (which should be either GR128 or ADDR128), this should work as expected. Considering other types of code with more heavy use of GR128, would we perhaps want to also give a count for a virtual register definition of a GR128? That should avoid being too optimistic when several of these overlap. SPECmaster patched lg : 320549 318463 -2086 stg : 125311 124234 -1077 lr : 25927 26482 +555 lgr : 343901 343499 -402 llill : 982 870 -112 st : 51272 51343 +71 stfh : 2037 2095 +58 j : 100186 100129 -57 c : 10989 11045 +56 "Spill|Reload" 168062 165106 -2956 Number of regions / GR128 pairs clobbered 6070 PhysClob: 0 206 PhysClob: 1 187 PhysClob: 2 17 PhysClob: 3 8 PhysClob: 4 (Never more than 4 clobbered GR128 phys reg pairs. I therefore took "3" out of the blue as a margin.) Number of regions / MI count 1305 MIs: 3 2489 MIs: 4 628 MIs: 5 581 MIs: 6 258 MIs: 7 119 MIs: 8 103 MIs: 9 94 MIs: 10 61 MIs: 11 129 MIs: 12 220 MIs: 13 47 MIs: 14 49 MIs: 15 31 MIs: 16 39 MIs: 17 23 MIs: 18 42 MIs: 19 12 MIs: 20 46 MIs: 21 55 MIs: 22 6 MIs: 23 13 MIs: 24 14 MIs: 25 14 MIs: 26 3 MIs: 27 6 MIs: 28 7 MIs: 29 7 MIs: 30 1 MIs: 31 41 MIs: 32 3 MIs: 33 3 MIs: 34 2 MIs: 36 1 MIs: 37 1 MIs: 38 1 MIs: 40 1 MIs: 41 1 MIs: 42 3 MIs: 44 4 MIs: 45 2 MIs: 46 2 MIs: 47 2 MIs: 48 2 MIs: 49 2 MIs: 50 1 MIs: 51 1 MIs: 52 1 MIs: 55 2 MIs: 56 1 MIs: 63 1 MIs: 65 2 MIs: 67 1 MIs: 69 1 MIs: 70 1 MIs: 71 1 MIs: 73 1 MIs: 78 1 MIs: 103 Great majority of these cases are then as expected small regions, so I am thinking there's no real need for an instruction limit - at least not on SPEC Comment Actions @Quentin, Matthias: This patch introduces an extra parameter to shouldCoalesce() (LiveIntervals*). Does this look acceptable? Comment Actions Not sure, hard to say if this is going to be overall beneficial or not. I'd say to just do the physreg check for now. 
 OK, makes sense to me. The code now looks reasonable to me, just a few inline comments. 
 Comment Actions Updated per review. Not much change on SPEC compared to last version of the patch: lg : 318552 318542 -10 st : 51343 51348 +5 cgijlh : 17962 17958 -4 ... Spill|Reload : 165200 165200 +0 Test case for ADDR128 still missing. Comment Actions 
 Looks fine to me. I would use a reference if we don't allow it to be null or update the comment to say that this may be null. Comment Actions 
 Patch updated to use reference instead of pointer for LiveIntervals argument. Comment Actions Given that: 
 this patch now LGTM. | ||||||||||||||||||||||||||||||
While turning subreg liveness on fixes the simpler test case you have here, the even more restricted test case I've added to the bugzilla actually still fails even then. So I'm not sure we should mention subreg liveness here as the reason why this is needed.