This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Increase number of LOCRs emitted by passing regalloc hints
ClosedPublic

Authored by jonpa on Aug 16 2017, 8:03 AM.

Details

Summary

TargetRegisterInfo::getRegAllocationHints() is implemented for SystemZ with an increase in number of LOCRs.

Diff Detail

Event Timeline

jonpa created this revision.Aug 16 2017, 8:03 AM
jonpa added a comment.Aug 16 2017, 8:13 AM

Below is a little table with different number of LOCRs and spill/reload instructions during experiments explained as follows:

Using just the getRegAllocationHints() implementation above, the number of locr type instructions increases by 171 on the SPEC, which means an increase from 97.4% to 99.1% (6333 -> 6504). This is the percentage of LOCRMux converted to an LOCR type instruction. (B)

As you can see, I have also done some further experiments that involves constraining the GRX32 reg class during isel if the other register class is high or low in emitSelect(). Using just this (and not giving any hints), gives an increase of 50 LOCRs (C). However, if all operands are GRX32, and they are constrained alternatingly between high and low (D), a similar result to (B) is achieved.

The most LOCRs I have gotten so far is with (E), although (F) is close with less spill.

This however turned out to be not quite true, because looking at the code size of loops, it is clear that LOCRXALTERNATE is making loops bigger. My guess is that somehow this lesser freedom forces suboptimal decisions somewhere else.

However, loops are getting better with LOCRCONSTRAIN (E), so looking at just the output, this looks valuable in addition to (B).

So, next is to remove the LOCRXALTERNATE, and then go for (B) or (E) I guess. Unless you have some input that might improve things further...?

Happy for any comments!

BTW, I don't quite understand the purpose of updateRegAllocHint(). It seems to be run during coalescing, but that's before getRegAllocationHints() has ever been called. So what is being updated?

/Jonas

Branch                                          Number of LOCRs         Number of "Spill|Reload" comments

A master                                        6333                    164102
B just reg-alloc hints:                         6504                    164169
C no hints:  LOCRCONSTRAIN                      6383                    164157
D no hints: LOCRCONSTRAIN + LOCRXALTERNATE      6506                    164020
E hints and LOCRCONSTRAIN                       6541                    164224
F hints and LOCRCONSTRAIN + LOCRXALTERNATE      6534                    164056
uweigand edited edge metadata.Aug 17 2017, 7:28 AM

The getRegAllocationHints implementation makes sense to me. However, I'm wondering if we shouldn't also check for the *destination* register -- you only force one source register to the same class as the other source register, but I think we should check whether *any* of the three registers is already allocated, and then always force the other two to the same class.

I think if we do that consistently, we should be able to *guarantee* that we end up with a register assignment that is legal for the instruction, and can therefore completely remove the pass that creates a branch again.

Already constraining the the regclass in emitSelect may make sense as well, but we should verify that it still makes any difference once we've implemented the change above. I agree that the LOCRXALTERNATE variant doesn't look very useful.

Oh, and just comment on this:

BTW, I don't quite understand the purpose of updateRegAllocHint(). It seems to be run during coalescing, but that's before getRegAllocationHints() has ever been called. So what is being updated?

I understand this is about something else: maintaining the MachineRegisterInfo set/getRegAllocationHint data. It seems the ARM target sets up the MRI hints in a special pass ahead of time, and then *uses* the MRI hints in its implementation of getRegAllocationHints. However, for this to be useful the MRI hints need to be updated for the effects of register coalescing; the updateRegAllocHint allows the target to do just that.

Since your implementation of getRegAllocationHints doesn't actually make any use of the MRI hints, you don't need to implement updateRegAllocHint either.

Oh, and just comment on this:

BTW, I don't quite understand the purpose of updateRegAllocHint(). It seems to be run during coalescing, but that's before getRegAllocationHints() has ever been called. So what is being updated?

I understand this is about something else: maintaining the MachineRegisterInfo set/getRegAllocationHint data. It seems the ARM target sets up the MRI hints in a special pass ahead of time, and then *uses* the MRI hints in its implementation of getRegAllocationHints. However, for this to be useful the MRI hints need to be updated for the effects of register coalescing; the updateRegAllocHint allows the target to do just that.

Since your implementation of getRegAllocationHints doesn't actually make any use of the MRI hints, you don't need to implement updateRegAllocHint either.

Ah, I see - thanks for explaining.

jonpa updated this revision to Diff 111636.Aug 18 2017, 2:09 AM

The getRegAllocationHints implementation makes sense to me. However, I'm wondering if we shouldn't also check for the *destination* register -- you only force one source register to the same class as the other source register, but I think we should check whether *any* of the three registers is already allocated, and then always force the other two to the same class.

I tried this (see updated patch below), but found that it gave the identical results (with or without the DestMO lines which I commented out). My guess is that the the TwoAddress pass is already doing this job in processTiedPairs() by

const TargetRegisterClass *RC = MRI->getRegClass(RegB);
...
    if (TargetRegisterInfo::isVirtualRegister(RegA) &&
        TargetRegisterInfo::isVirtualRegister(RegB))
      MRI->constrainRegClass(RegA, RC);

I think if we do that consistently, we should be able to *guarantee* that we end up with a register assignment that is legal for the instruction, and can therefore completely remove the pass that creates a branch again.

Giving hints is not enough if VirtReg is GRX32, and there could still be cases where the source registers are high / low. The question is how/when should we constrain the reg classes? It doesn't seem right to do this in getRegAllocationHints(), since e.g. already Order is passed. It could be done in emitSelect(), but a bit naively (complicated cases suboptimally). It would be nice to do this after coalescing, but not sure how.

Already constraining the the regclass in emitSelect may make sense as well, but we should verify that it still makes any difference once we've implemented the change above.

I don't know exactly why, but doing only this (without hints) give some improved benchmarks without regressions. Doing only the hints give similar improvements, but with regressions... Also, a regression with the regsplit around loops patch applied could only be handled with this, so this part seems valuable until we actually manage to handle the same cases with hints.

I agree that the LOCRXALTERNATE variant doesn't look very useful.

removed

The getRegAllocationHints implementation makes sense to me. However, I'm wondering if we shouldn't also check for the *destination* register -- you only force one source register to the same class as the other source register, but I think we should check whether *any* of the three registers is already allocated, and then always force the other two to the same class.

I tried this (see updated patch below), but found that it gave the identical results (with or without the DestMO lines which I commented out). My guess is that the the TwoAddress pass is already doing this job in processTiedPairs() by

Ah, of course. I had forgotten that these are two-operand instructions anyway, so my suggestion doesn't actually make sense. (I must have been thinking of three-operand instructions like AHHHR.)

I think if we do that consistently, we should be able to *guarantee* that we end up with a register assignment that is legal for the instruction, and can therefore completely remove the pass that creates a branch again.

Giving hints is not enough if VirtReg is GRX32, and there could still be cases where the source registers are high / low. The question is how/when should we constrain the reg classes? It doesn't seem right to do this in getRegAllocationHints(), since e.g. already Order is passed. It could be done in emitSelect(), but a bit naively (complicated cases suboptimally). It would be nice to do this after coalescing, but not sure how.

So my thought that if both registers are still GRX32, then getRegAllocationHints for the first register would return both high and low options. Then, once regalloc chooses one of them, and getRegAllocationHints is later called on the *other* register, we know it must be in the same class, so we return only the low or only the high options. Does it not work this way?

Already constraining the the regclass in emitSelect may make sense as well, but we should verify that it still makes any difference once we've implemented the change above.

I don't know exactly why, but doing only this (without hints) give some improved benchmarks without regressions. Doing only the hints give similar improvements, but with regressions... Also, a regression with the regsplit around loops patch applied could only be handled with this, so this part seems valuable until we actually manage to handle the same cases with hints.

OK. In any case, constraining the regclass in emitSelect, in cases where we already know about restrictions, seems a good idea.

jonpa added a comment.Aug 18 2017, 5:10 AM

So my thought that if both registers are still GRX32, then getRegAllocationHints for the first register would return both high and low options. Then, once regalloc chooses one of them, and getRegAllocationHints is later called on the *other* register, we know it must be in the same class, so we return only the low or only the high options. Does it not work this way?

My understanding is that once one of the registers are allocated, we can only try to pass *hints* for the other register. Regalloc will then first try to use any hint, but if all of them are allocated it is still free to use other registers in GRX32. So I think if we want to make a guarantee, we must constrain the reg-class of the register. Or could it perhaps be possible to add a method to AllocationOrder such as "hardenHints()", which would actually remove non-hinted registers from the allocation order?

I experimented a bit and found that by forcing the AllocationOrder to only return hints ("hard hints"), resulted in only 3 jump/sequence expansions on SPEC. It seems that those remaining cases are due to complex cases like:

1 LOCRMux  V0_GRX32, V1_GR32
2 LOCRMux  VO_GRX32, V2_GRX32

V2 gest assigned to GRH32
...

The reason this is not handled by the regclass constraining in emitSelect() is that the reg-coalescer introduced the V1_GR32, while it was originally a GRX32 during isel.
This is also what would happen if V2 was GRH32, although that is practically not happening currently (3 cases in total during isel).

I think that to handle those cases we would have to constrain regclasses somehow after coalescing. And maybe better than giving hard hints would be to immediately after one out of two GRX32 regs gets allocated constrain the other virtreg.

I am not convinced still that making this guarantee generally is possible (without a target pre-ra pass to do this), especially not for all different kind of register allocators that are around / may appear. It seems that some kind of broader construct is needed in order to always be sure this never goes wrong. Maybe a property of a register class somehow that all operands of any MI must belong to one out of two sub regclasses...? :-/

Since there are more GRX32 constructs to implement in the SystemZ backend, this may still be worthwhile, or?

I think that to handle those cases we would have to constrain regclasses somehow after coalescing.

This could be done in TRI->updateRegAllocHint, possibly?

And maybe better than giving hard hints would be to immediately after one out of two GRX32 regs gets allocated constrain the other virtreg.

Not sure if there is currently any place where this can be done. Does the register allocator (all of them) even go through registers one-by-one and assigns them, or is the algorithm more complex?

I am not convinced still that making this guarantee generally is possible (without a target pre-ra pass to do this), especially not for all different kind of register allocators that are around / may appear. It seems that some kind of broader construct is needed in order to always be sure this never goes wrong. Maybe a property of a register class somehow that all operands of any MI must belong to one out of two sub regclasses...? :-/

Note that this, while appropriate for LOCRMux, would be too restrictive for certain other operations. For example, for comparisons, we can allow high-high, low-low, and also high-low compares, but not low-high compares, and similarly for add and subtract. (For comparison, the alternatives / constraints mechanism in GCC allows targets to exactly describe the valid combinations for each instruction, and the register allocator will chose any of those as appropriate.)

jonpa updated this revision to Diff 112120.Aug 22 2017, 1:19 AM

I think that to handle those cases we would have to constrain regclasses somehow after coalescing.

This could be done in TRI->updateRegAllocHint, possibly?

Not sure if this is ok (although it should be), but tried it and seemed to improve things a bit.

And maybe better than giving hard hints would be to immediately after one out of two GRX32 regs gets allocated constrain the other virtreg.

Not sure if there is currently any place where this can be done. Does the register allocator (all of them) even go through registers one-by-one and assigns them, or is the algorithm more complex?

IIRC, I think this is more complex with RegAllocGreedy, since it seems it does not only allocate a physical register once to an interval, but it could also later evict (cancel) that assignment to give that physreg to another interval instead.

I am not convinced still that making this guarantee generally is possible (without a target pre-ra pass to do this), especially not for all different kind of register allocators that are around / may appear. It seems that some kind of broader construct is needed in order to always be sure this never goes wrong. Maybe a property of a register class somehow that all operands of any MI must belong to one out of two sub regclasses...? :-/

Note that this, while appropriate for LOCRMux, would be too restrictive for certain other operations. For example, for comparisons, we can allow high-high, low-low, and also high-low compares, but not low-high compares, and similarly for add and subtract. (For comparison, the alternatives / constraints mechanism in GCC allows targets to exactly describe the valid combinations for each instruction, and the register allocator will chose any of those as appropriate.)

It would be nice to be able to do something like that. I suppose if we could constrain regclasses during reg-allocation we could afford to do recursive searches since we are only doing them once (in contrast to giving hints). If we are lucky this would turn out well, although we probably can't "start over" after constraining GRX32, at the point of eviction of an assignment. Well, perhaps that could be added as well. Anyway, this seems not really needed at the moment, but perhaps later if we start to see a lot more GRH32 registers.

I updated the patch and made some more builds of SPEC in order to see the effects of the various ways to tackle this:

Branch \ Statistic:                     LOCRs_lo        LOCRs_hi     RISBs   "Number of spilled live ranges"
Master                                  6382            4            225     48939
LOCRHINTS                               6523            28           60      48947
LOCRCONSTRAIN                           6429            3            179     48957
LOCRCONSTRAIN + UPDATEHINT              6464            3            144     48965
LOCRHINTS + LOCRCONSTRAIN               6558            28           25      48958
MOREMUX + LOCRCONSTRAIN                 6561		28           22      48959
HARDHINTS                               6580            27           4       48965
HARDHINTS + LOCRCONSTRAIN               6581            27           3       48971
HARDHINTS + LOCRCONSTRAIN + UPDATEHINT  6581            27           3       48974
HARDHINTS + UPDATEHINT                  6581            27           3       48970
HARDHINTS + MOREMUX                     6584		27           0       48966   :-)

It seemed that setting the RegClass in updateRegAllocHint() per your suggestion did help a bit, but didn't solve those tricky cases for some reason. I then tried to just do a bit more searching for LOCRMuxes (without recursing), and found that this did actually handle the rest (per last line in table).

Still not sure of all the implications of "hard hints" or constraining regclass in updateRegAllocHint(), just know that "it doesn't crash", and gives "good results" ;-) Any thoughts, anyone? Quentin?

HARDHINTS are currently looking very promising on preliminary benchmark results. (Have not yet tried MOREMUX, but it should also be good since it very similar).

jonpa edited the summary of this revision. (Show Details)Aug 22 2017, 5:32 AM
jonpa added reviewers: MatzeB, hfinkel.
jonpa updated this revision to Diff 112161.Aug 22 2017, 6:08 AM

updateRegAllocHint() updated per off-line discussion, which indeed gave better results than before:

New runs after improving updateRegAllocHint() to do a better job regarding physregs. (Basically it was unnecessary to check if the old reg was a physreg; all we need to know is if the NewReg is a low/high regclass...)

BUILD                                  Number of LOCRs_lo	Number of LOCRs_hi	Number of RISBs         Number of spilled live ranges
LOCRCONSTRAIN + UPDATEHINT	       6532			 3			76			48970
LOCRHINTS + LOCRCONSTRAIN + UPDATEHINT 6576			26			 9			48972
jonpa added a comment.Sep 5 2017, 8:45 AM

Ping!

This is my original post on llvm-dev:

Hi,

I am curious if it would be possible to use reg-alloc hints to improve code generation for SystemZ. The background is that I ran into a regression which seems to relate to code generation for conditional register moves.

The SystemZ backend uses a GRX32 register class for a LOCRMux pseudo instroction (Load On Condition of Register), in order to utilize all 32bit registers optimally. However, depending on the register assignment this pseudo will become a single load-on-condition instruction only if both source and dest registers are either low or high parts. Otherwise, LOCRMux will expand to a compare/jump sequence, which is of course less desirable. LOCR can only handle two low-parts, and LOCFHR can only handle two high-parts (GRX32 is the union of these two reg classes).

In order to increase the number of LOCR/LOCFHRs generated, I would like to tell regalloc something like "If src reg of an LOCRMux is high, try to make dst reg high", and similarly for the case where one register is in the low part.

I am not sure if it is possible to do anything about this in a simple manner, and would appreciate any help.

Thanks,

Jonas

I have now experimented a while with this, and found some different ways of handling this, which I need some feedback on. In particular, it seems promising to use "hard hints", by which I mean changing the AllocationOrder so that only hinted registers are returned. This seems to work, but it is a good idea?

If this is not acceptable, another option might be to use the updateRegAllocHints() hook by *constraining the regclass* of the virtreg directly. This is not the documented purpose of this hook, wo I wonder if it would make sense generally?

The goal is to get rid of the RISB (rotate and insert emitted by copyPhysReg(), and the results can be found in the table above, indicating the effectiveness of the various methods.

jonpa added a comment.Sep 7 2017, 11:39 PM

ping!

Quentin, you gave me the first advice regarding this, and it would now be very useful to hear your (or someone you know) opinion on the common-code changes involved here: "hard hints" in AllocationOrder, and setting the regclass in updateRegAllocHint().

thanks / Jonas

qcolombet edited edge metadata.Sep 8 2017, 12:48 PM

The generic change looks fine though it sounds surprising that anyone would want that.

The problem here is that copies are super expensive when those hints are not fulfill, right?

jonpa added a comment.Sep 11 2017, 1:23 AM

The generic change looks fine though it sounds surprising that anyone would want that.

The problem here is that copies are super expensive when those hints are not fulfill, right?

Thanks for review. Well, instead of a conditional move implemented with one instruction, we would get a jump sequence over a block containing just one move instruction. This *may* be very bad in an inner loop. In particular, this is what happened when I applied Weis reg-split patch and got a significant regression.

SystemZ has BTW more instructions which could be implemented the same way: a "mux" pseudo of the GRX32 regclass which would then be expanded into an instruction using either high or low parts, depending on the choice of registers RA did. I don't think all those instructions waiting to be implemented in the backend would necessarily suffer as much as the load-on-condition, so it is a matter of judgment if using the "hard hints" is better or worse. If one were to judge this based on the number of spilled live ranges, this seems to work fine for the LOCR case, as can be seen in the table above. Do you think that looking at the number of spilled live ranges is a good way to consider the trade-off for using hard hints, or do you perhaps have anything to add?

It would have been nice to *guarantee* that the pseudo will get either high or low parts, like gcc will let you specify combinations of legal register operands. If that could be done, SystemZ could even remove its custom pass that handles any rare cases of mixed operands. But I suppose this would not be easy to do, or?

qcolombet accepted this revision.Sep 11 2017, 9:53 AM

Do you think that looking at the number of spilled live ranges is a good way to consider the trade-off for using hard hints, or do you perhaps have anything to add?

To me, it sounds like we should tell RA that the non-hard-hints will be expensive to expand. Right now, copies are expected to be cheap regardless of what, where and how they are done and this already does not play very nice with shrink-wrapping for instance. I believe this is yet another instance of that problem.

Short term the hard hints sound fine. Long term, I don't know, I haven't thought about it.

It would have been nice to *guarantee* that the pseudo will get either high or low parts, like gcc will let you specify combinations of legal register operands. If that could be done, SystemZ could even remove its custom pass that handles any rare cases of mixed operands. But I suppose this would not be easy to do, or?

I didn't get what you would like to do.

This revision is now accepted and ready to land.Sep 11 2017, 9:53 AM

It would have been nice to *guarantee* that the pseudo will get either high or low parts, like gcc will let you specify combinations of legal register operands. If that could be done, SystemZ could even remove its custom pass that handles any rare cases of mixed operands. But I suppose this would not be easy to do, or?

I didn't get what you would like to do.

I believe that in gcc it is possible to specify legal combinations in the .md file, something like (source-reg "GR32, GRH32"), (dst-reg "GR32, GRH32"). This would then be used during regalloc so that one of the two combinations would take effect (GR32/GR32 or GRH32/GRH32). This would be very nice in LLVM in this case, since regalloc would know what to do without hints, and it could also put a guarantee so that the target did not have to catch any bad cases after the fact. I guess I am just asking if there has been any effort or plan in this direction previously or what the best way of achieving this in LLVM might be.

BTW, in this particular case, it would be enough to say that "all operands of a single MI which are all GRX32 must be allocated to the same subclass: GR32 (low) or GRH32 (high)". Not sure if this is general enough to be considered for implementing - the gcc method seems more ideal.

Your suggestion of making non-hard-hints considered expensive (via a cost function for each register in the AllocationOrder?), makes sense to me though, especially if it would help in other contexts as well. This might make for an even better final result than only allowing the hard hints.

jonpa updated this revision to Diff 115203.Sep 14 2017, 5:39 AM
jonpa edited the summary of this revision. (Show Details)

Patch rewritten towards using hard regalloc hints with a worklist search over LOCRMux GRX32 registers.

jonpa updated this revision to Diff 115834.Sep 19 2017, 6:41 AM

Experiments with mixed GRX32 allocation order (%R0-%R5) as well as trying to get more high-high LOCRMux allocations whenever possible have been removed. Benchmarking has shown that the first version (without these two extras) is still the best version, it seems. The idea was to get more high-high allocations.

There is a great unbalance between low-low and high-high allocations (nearly all become low-low), but the important thing is that all the mixed allocations have been eliminated.

No test regressions.

Looks basically good to me, just a couple of cosmetic comments inline.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
30

No, it isn't :-) Either the comment or the code is wrong.

120

I think it might simplify the code to just inline getHintRC_LOCRMux here ... it uses so many of the local variables defined here, and doesn't really serve any separate purpose, so it doesn't make much sense to have it as a separate function.

In fact, maybe even getConstrainedRC32_LOCRMux should be inlined as well.

134

Maybe better "return TargetRegisterInfo:: ..." in case the default is ever changed to return hard hints in some cases.

jonpa updated this revision to Diff 117308.Oct 2 2017, 1:27 AM
jonpa marked an inline comment as done.

Minor updates of SystemZ parts per review, but one point left under discussion.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
30

Aah, the comment was rotten.

120

The reason for having the LOCRMux handling separate, is that I had in mind the other Mux cases as well, that we might want to investigate.

So, it is better to have it inlined until we actually use more Mux instructions as opposed to have the code somewhat ready?

Or is it better to inline even with the other instructions handlings added?

134

aah, yes.

jonpa updated this revision to Diff 117314.Oct 2 2017, 2:31 AM

Added test case. This at least fails on trunk and passes with this patch, even though I guess it may perhaps randomly pass on different revisions over time also without this patch...

jonpa updated this revision to Diff 122208.Nov 9 2017, 1:26 AM

Functions inlined per review.

patch impact on risb/locr instruction counts on SPEC as of latest performance measurement:

risblg         :                 8624                 8442     -182
locrhe         :                  926                  999      +73
risbhg         :                 2057                 1995      -62
locrl          :                 1626                 1658      +32
locrle         :                  453                  482      +29
locre          :                 1039                 1064      +25
...
Spill|Reload   :               165135               165218      +83

--stats:

1 systemz-II                   - Number of LOCRMux jump-sequences (lower is better)

(one LOCRMux not handled with patch, I think this is due to a regalloc eviction, which would be more complex to handle)

jonpa marked 3 inline comments as done.Nov 9 2017, 1:28 AM

See two minor inline comments. Otherwise, this now LGTM. Thanks!

lib/Target/SystemZ/SystemZRegisterInfo.cpp
38

Should this also use hasSubClassEq, just for symmetry with the case above?

103

Should we now pass Matrix also back to the default implementation?

jonpa added inline comments.Nov 9 2017, 7:20 AM
lib/Target/SystemZ/SystemZRegisterInfo.cpp
38

I take it that while GR32BitRegClass has the ADDR32Bit sub class, which GRH32BitRegClass does not, you prefer to keep this general for the future and so on?

uweigand added inline comments.Nov 9 2017, 7:39 AM
lib/Target/SystemZ/SystemZRegisterInfo.cpp
38

Yes, exactly ... just in case there will be subclasses later.

jonpa updated this revision to Diff 122400.Nov 10 2017, 12:44 AM
jonpa marked 4 inline comments as done.

Fixed per review.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
38

Right. Fixed.

103

Oops. Fixed with NFC.

jonpa closed this revision.Nov 10 2017, 12:48 AM

Thanks for review. Commited as r317879.