This is an archive of the discontinued LLVM Phabricator instance.

MIScheduler improved handling of copied physregs
Needs ReviewPublic

Authored by jonpa on Sep 28 2017, 4:45 AM.

Details

Summary

When trying to enable the mischeduler on SystemZ, a regression was discovered that was due to how instructions using copies of physregs were scheduled.

The attached test case is the input to the mischeduler. The observation here is that %R3D is copied into %vreg1, while %R3D is also written to later from %vreg7. We would want the ADJDYNALLOCs using %vreg1 to be scheduled before the one defining %vreg7, so that regalloc will be able to give %r3d to %1 and later to %7 without overlap.

The isel scheduler handled this by seeing, when comparing the ADJDYNALLOCs, that the ones using %1 have one more live range, which it wants to minimize.

This patch adds a simple heuristic to tryCandidate() which handles this test case. It also seems to generally reduce the number of COPYs across benchmarks. Basically, it extends the handling of physregs by looking at virtual registers that are defined by a single COPY of a physreg (%vreg1 in this example). A preference is then given to the SU that uses such a vreg, in order to minimize that live range.

I have tried different positions for this new heuristic in tryCandidate(). Anywhere above the RPDelta.CurrentMax check seems bad, as it increases spilling. After that point it seems to make nearly no difference where it is placed, so I put it as the last heuristic, just before original instruction order.

A lot of test cases fail with this patch:
X86: 321
AMDGPU: 36
AArch64: 13
PowerPC, SPARC: 3 each
Lanai: 2
ARM: 1

So before going any further, I would like feedback on the feasibility of inserting this heuristic in the mischeduler.

Is this the right handling of this test case in mischeduler (I assume it is a scheduling problem)?

Diff Detail

Event Timeline

jonpa created this revision.Sep 28 2017, 4:45 AM
fhahn added a comment.Sep 28 2017, 5:35 AM

Interesting, I'll give it a try on ARM and AArch64.

test/CodeGen/SystemZ/args-11.mir
6

Indent in the comment off by 1?

29

You could use -debug-only=machine-scheduler and check the schedule emitted there, which would map more directly to the MachineInstrs used in the test case.

MatzeB edited edge metadata.Sep 28 2017, 6:52 PM

Hmm interesting observation. So to recap (and for further comments in the sourcecode :) you seem to deal with this situation:

%1 = OP %0
// ...
%physregX = COPY %1

and moving OP downwards increases the chance that %1 doesn't interfere with physregX and the COPY can be coalesced. Or in the inverse situation

%0 = COPY %physregX
// ...
%1 = OP %0

moving OP upwards minimizes the chance of %0 interfering with physregX and increases the chance of the COPY being coalesced.

  • If I'm reading this correctly you only implement the first case, but the 2nd case should work as well.
  • You should exclude reserved registers and registers not allocatable to the vreg in question from the rule.
lib/CodeGen/MachineScheduler.cpp
2909

This is a non-obvious rule and needs more comments and a better function name!

2913–2922

I would perform the check on the ScheduleDAG rather than the MIs/operands. That should catch more cases and also gives you a sensible way to implement the reverse pattern.

2915

Please use MachineOperand & instead of auto.

test/CodeGen/SystemZ/args-11.mir
29

or -run-pass=machine-scheduler for that matter, so we only run the one pass we are interested in.

jonpa marked 3 inline comments as done.Oct 3 2017, 2:26 AM
jonpa added inline comments.
lib/CodeGen/MachineScheduler.cpp
2913–2922

I tried

-  for (const MachineOperand &MO : MI->uses()) {
-    if (!MO.isReg())
+  for (const SDep &Pred : SU->Preds) {
+    if (Pred.getKind() != SDep::Data)
       continue;
-    MachineInstr *DefMI = MRI->getUniqueVRegDef(MO.getReg());
+    unsigned Reg = Pred.getReg();
+    MachineInstr *DefMI = MRI->getUniqueVRegDef(Reg);

This seemed to give more copy coalescing, but for some reason also a bit more spilling, which is why it doesn't look as good. Why would this be better? Basically, checking SU->Preds (which is what I guessed you meant by "on the ScheduleDAG") should only check dependencies within the scheduling region. Since coalescing works globally, wouldn't it be better to find the defining MI even if it is in another block? What more cases are you thinking of?

test/CodeGen/SystemZ/args-11.mir
29

I am not sure - wouldn't it be better to confirm that the output in the end is actually good? I mean, that way it tests that both mischeduler *and* regalloc are smart, and no matter what, we know this test case should not have that extra copy. I am just curious about your motivation for changing this...

jonpa updated this revision to Diff 117486.Oct 3 2017, 2:33 AM

Patch updated with experiments as suggested in review response.

If I'm reading this correctly you only implement the first case, but the 2nd case should work as well.

Actually, I was checking for the use-ops of MI that come from a PReg copy, so this would be case 2, I'd say.

You should exclude reserved registers and registers not allocatable to the vreg in question from the rule.

Checks added. These cause nearly no change currently, but I still think it is good to limit this heuristic if it doesn't hurt, in case it gets moved around in tryCandidate(). The downside is that it adds a few more lines of code...

Adding the inverse check per your suggestion did not help as much as hoped - no change in spilling, and just a few fewer sign-extending COPYs. Note that SystemZ will initially (as during these experiments) only do bottom-up scheduling.

I tried to compare the effect of activating the second heuristic during bidirectional scheduling, and found that it looked to be about the same difference (merely enabling bidirectional scheduling currently gives a notable increase in spilling on SystemZ).

To summarize:

  • doing the inverse check seems to help only very marginally if at all on SystemZ.
  • checking for reserved regs / reg-class of copied to/from virtreg has very little effect.

Note this is for SystemZ. I updated the patch to include the new experiments so that someone else could try this on other targets to see if we e.g. need the inverse check. For SystemZ, I think it is not needed, and would be happy to remove it and simplify the patch. Also, the check for reserved/regclass could be removed, leaving the patch as it was in the previous version...

MatzeB added inline comments.Oct 11 2017, 4:35 PM
lib/CodeGen/MachineScheduler.cpp
2912

isCoalescablePRegCopy? Use references for parameters that cannot be nullptr.

2913–2922

Hmm right, scheduling regions are limiting as well.

The case I was thinking of is when you have multiple definitions of a vreg in different basic blocks; then MachineRegisterInfo::getUniqueVRegDef() will return null, while the scheduling region gets you a reaching definition for free. But indeed this thinking is only true if the definition is actually part of the same scheduling region.

Sounds like the region limits are a bigger problem than multiple definitions so keep the code as is.

2928
  • Could use MCPhysReg instead of unsigned here.
  • Subregister indexes are only used for virtual registers; getSubReg() is always 0 when a physreg is assigned.
2968–2971
  • This would benefit from some refactoring to avoid the repeated getOperand(0) expressions.
  • isImplicit() should only be set on uses and I don't see why it matters here. (Technically you could see the flag being set on a subregister definition which also acts as a use, but again I don't see why it matters)
2973–2977
  • I think we should be conservative here and restrict the check to cases where the vreg has one use only. I could imagine this showing strange effects and less benefits when there are multiple uses or the use is in a different block. This is another case where it may or may not be better to look at the schedule graph instead...
  • It's enough to check use_nodbg_operands()
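
The conservative variant suggested here might look roughly like this (pseudocode over the LLVM API of that era, not the actual diff; the surrounding loop over MI's use operands and the operand name MO are assumptions):

```cpp
// Per the review suggestion: only apply the rule when the copied vreg
// has exactly one non-debug use. hasOneNonDBGUse() matches the
// "it's enough to check use_nodbg_operands()" remark, since it ignores
// DBG_VALUE users when counting.
if (!MRI->hasOneNonDBGUse(MO.getReg()))
  continue; // multiple uses (possibly in other blocks): be conservative
```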
2980

No need for braces.

test/CodeGen/SystemZ/args-11.mir
29

There is a place for end-to-end testing and for directed testing of exactly the feature/change you implemented. We tend to use llvm/test for the latter and the llvm test-suite for the former. Admittedly the test-suite is only checked for performance and not for generated assembly at the moment, so there may be room for some middle ground, but I don't see why we should start that here and with this commit...

jonpa updated this revision to Diff 118792.Oct 12 2017, 8:20 AM
jonpa marked 7 inline comments as done.

Thanks for review!
Updated.

lib/CodeGen/MachineScheduler.cpp
2913–2922

OK.

It also makes sense to me, as it is handling cases where the vreg is defined by a COPY of a physreg, and I think this should typically mean that the vreg only has one definition.

2928

right

3148

Is it OK to use the same CandReason (PhysRegCopy) as biasPhysRegCopy(), or should it be different, e.g. PhysRegCopyUses or PhysRegCopy2 ?

test/CodeGen/SystemZ/args-11.mir
29

I see.

fhahn added a comment.Oct 30 2017, 6:22 AM

Do you have any benchmark results on SystemZ? I can provide some benchmark numbers on Cortex-A57, AArch64: test-suite & SPEC2000 show a 0.79% geomean speedup with this patch. There is one bigger regression (reeBench/analyzer/analyzer, +7% runtime).

jonpa added a comment.Oct 31 2017, 1:17 AM

Do you have any benchmark results on SystemZ? I can provide some benchmark numbers on Cortex-A57, AArch64: test-suite & SPEC2000 show a 0.79% geomean speedup with this patch. There is one bigger regression (reeBench/analyzer/analyzer, +7% runtime).

That sounds promising.

I have not yet made a serious benchmarking effort since the patch may still change (some preliminary results show some modest improvements).

Is there anything left that might change at this point?

jonpa updated this revision to Diff 122825.Nov 14 2017, 5:23 AM

Ping!

While waiting for review, I updated the patch to do this only if the COPY is in the same MBB, to check for the difference. This actually seemed to make the total number of COPYs even smaller, and also seemed to have fewer other side effects. At least currently, this looks best to me. Florian, how does this seem to you? Matthias?

BTW, this question seems still unanswered: "Is it OK to use the same CandReason (PhysRegCopy) as biasPhysRegCopy(), or should it be different, e.g. PhysRegCopyUses or PhysRegCopy2 ?"

Are there any volunteers for reviewing test changes if I update them all?

MatzeB accepted this revision.Nov 15 2017, 9:42 AM

LGTM with nitpicks.

Ping!

While waiting for review, I updated the patch to do this only if the COPY is in the same MBB, to check for the difference. This actually seemed to make the total number of COPYs even smaller, and also seemed to have fewer other side effects. At least currently, this looks best to me. Florian, how does this seem to you? Matthias?

Seems reasonable as that is closer to what a scheduling region covers and the scheduler can influence.

BTW, this question seems still unanswered: "Is it OK to use the same CandReason (PhysRegCopy) as biasPhysRegCopy(), or should it be different, e.g. PhysRegCopyUses or PhysRegCopy2 ?"

I'd introduce a new name.

Are there any volunteers for reviewing test changes if I update them all?

I think generally you can use best judgement and just update tests.
If you are unsure, you could upload an update here so that people's phabricator/herald rules match, and see if someone complains within a few days.
If you worry about a specific test then it's usually best to look at the version control history of the test and CC the person.

lib/CodeGen/MachineScheduler.cpp
2912

Could also do MachineInstr &CopyMI.

2944

Could use const SUnit &

3148

Introducing a new enum entry here for this is cheap and helps debugging.

This revision is now accepted and ready to land.Nov 15 2017, 9:42 AM
jonpa updated this revision to Diff 124129.Nov 24 2017, 12:43 AM
jonpa marked 3 inline comments as done.

Unfortunately, I realized there were still some issues with this patch, which I have now corrected:

  • There were some regressions that wouldn't go away, and I found that by making the patch less aggressive those regressions disappeared, and I even noticed a slight overall improvement. I added the check for "one use" in addition to "same MBB".
  • Even with good overall results, the SystemZ test args-11.mir now failed, since it does have multiple uses of the copied preg. To handle this I had to add a special check, which this test case actually reflects. If the two candidates specifically are connected to the same preg, where one is using it and the other is defining it, then the ordering should be based on that. This saves some additional COPYs, and also gives some overall improvement compared to without it (but with "one use" per above).
  • I also had to avoid moving COPYs of subregs of a GR128 away from the defining instruction. There was a case where regalloc ran out of registers when this was done.

Please review again:

  • MachineScheduler patch.
  • SystemZ tests.

(Other test updates are still needed, but I'll wait with that for now).

jonpa requested review of this revision.Nov 24 2017, 12:45 AM
jonpa edited edge metadata.
lib/CodeGen/MachineScheduler.cpp
3148

All right - I added a new entry for this with the name PhysRegCp2 / PREG-CP-2 (feel free to modify), with the next-lowest priority, just before NodeOrder. The priority change is NFC on SystemZ (since it is bottom-up, I presume).

jonpa added a comment.Dec 4 2017, 2:23 AM

ping

Matthias, do the changes to mischeduler look good to you still with the latest additions?

Would it be acceptable to add a temporary target hook for this until all targets have enabled this and gone through their tests? I have found in a similar situation (Copy hints patch) that even when I took the time to manually update 100+ tests, there was very little hope of getting all those test changes reviewed. I would rather not update tests without a proper review, so this seems like a sensible solution to me. Of course, I would prefer to update tests for one or two targets if there were reviewer(s) anticipating this, since this is a good way to make sure that this works well generally.

Would it be acceptable to add a temporary target hook for this until all targets have enabled this and gone through their tests? I have found in a similar situation (Copy hints patch) that even when I took the time to manually update 100+ tests, there was very little hope of getting all those test changes reviewed. I would rather not update tests without a proper review, so this seems like a sensible solution to me. Of course, I would prefer to update tests for one or two targets if there were reviewer(s) anticipating this, since this is a good way to make sure that this works well generally.

Generally I would prefer not to; the "temporary" often becomes months/years/never. I'd rather see you updating the test cases; target maintainers should have phabricator herald rules set up to catch the review (or email filters or ...) and then react to the changes. If there are no complaints in time I'd move ahead. Testcases should not be a reason to stop progress. (Of course this all assumes none of the maintainers actively rejects or comments on the changes, and that you don't see an obvious problem doing the changes yourself.)

lib/CodeGen/MachineScheduler.cpp
2912

I assume CopyMI can be const?

2915

No space after assert.

2917

Split into two lines.

2931

We don't align equal signs in llvm (unfortunately IMO).

2934

No braces necessary around the whole returned expression.

2935–2936

Should we restrict the subregister case to VRegMO->isUndef()? It is hard to know whether the coalescing can actually succeed in the non-undef case; it could if all other defs copy the relevant subregisters as well, but I wonder how typical that case is, given that it corresponds to simply copying the whole register in a single instruction.

2950

Could use the \p SU doxygen syntax to reference parameters.

2969

Could use if (DefMI == nullptr || !DefMI->isCopy()) continue; etc. to reduce indentation of the following code.

2983–3002

I think this would look better if it used more early exit (i.e. `if (!getNumOperands()) return; if (skipWideCopy) return; ...`). Maybe you can rewrite the code that changes NumCopiedPhysRegUses to `PRegs.NumCopied += isTop ? 1 : -1` (or `-1 : 1`) to avoid the need for the last line.
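
Written out, the suggested shape might look roughly like this (a pseudocode sketch over the names used in this comment; PRegs, IsTop, SkipWideCopy and the surrounding function are all assumptions, not the actual patch):

```cpp
// Early exits instead of nested conditions:
if (!MI->getNumOperands())
  return;
if (SkipWideCopy)
  return;
// ... main loop over operands ...
// Folding the scheduling direction into each update means no final
// "negate the count when scheduling bottom-up" line is needed:
PRegs.NumCopied += IsTop ? 1 : -1;
```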

2984–2988

Restricting this to 128bit COPYs is artifical. Would it also work when skipping all subregister COPYs?

Also unnecessary outer brace.

2990

unnecessary brace outer brace around expression.

3012

No space after assert.

3016

Split into two lines.

3020–3047

Shouldn't the NumCopied comparisons already pull the %2 = OP upwards and the %1 = OP downwards in the example? If there is a reason to handle this case separately anyway, explain it in the comment.

jonpa marked 14 inline comments as done.Dec 6 2017, 9:38 AM
jonpa added inline comments.
lib/CodeGen/MachineScheduler.cpp
2935–2936

I added it and as you suspected, it gives very little change. On spec, just ~5 instructions in total got rescheduled...

And oh yeah, there is also actually 1 more COPY now... ;-) That copy is from a vector register into a floating point argument register. This is a bit special: on SystemZ an FP reg lives in the low part of any of the 32 vector registers. So the FP register is a subreg of the vector register, but it is the only subreg that is interesting here.

No change at all in spilling or the like, so this doesn't matter. It doesn't seem to save much compile time either, I think.

Due to the vector/fp registers on SystemZ, I will leave it as it was, just for good manners' sake, unless you insist on it.

2984–2988

I was surprised to find that removing the check for the width of the super-register was a complete NFC :-) Removed.

3020–3047

While developing this patch, I noticed regressions that I could only (at least so far) get rid of by being less aggressive, which made sense: "keep the good and get rid of the bad". One thing that seemed to do the trick then was to add

if (MRI->hasOneUse(MO.getReg()))
  // Make this a bit less aggressive by checking for one use.
  NumCopiedPhysRegUses++;

So the reason that this general heuristic fails, at least in this test case, is that the use of %0 is ignored.

This seems unfortunate, but it is still true that I so far cannot get rid of regressions without limiting the general heuristic... At the same time, this case really should be handled.

jonpa updated this revision to Diff 125754.Dec 6 2017, 9:40 AM
jonpa marked 3 inline comments as done.

Thank you for review, Matthias!

I tried to follow and evaluate your suggestions - see the inline comments for the conclusions. What do you think?

This is NFC to the previous version.

jonpa updated this revision to Diff 132987.Feb 6 2018, 6:53 AM

Patch reworked into a state with experimental options. Only SystemZ tests
are passing (updated), for now.

I experimented further with the new phys-reg heuristic, trying to get rid of
the lines that do the explicit check on def/use regs (UsedPReg ==
DefedPReg). As Matthias pointed out, this should not be necessary, but
was added to catch specific and beneficial cases while keeping the general
heuristic less aggressive (to avoid regressions). I wanted to find another,
simpler way of achieving this.

Stepping back a bit to before I added the 'UsedPReg == DefedPReg' checks: I got
regressions and reasoned that this heuristic should only handle the simple
cases we were seeing, so I wanted to make it less aggressive. I found that
limiting the number of users of the COPYed phys-reg to 1 did the trick quite
well. This however didn't handle the new SystemZ test case (args-11.mir). Then
I added the checks (which I am now trying to remove...), which did the trick
and also seemed clearly beneficial in the cases they did catch.

There is no real problem with this, except that it adds more code to an
otherwise simple heuristic. So I experimented further with many variations
while checking the impact on the number of register moves and spills on SPEC
(see the tables below for numbers).

Also, a secondary goal of this patch is to handle two SystemZ test cases that
have been identified and marked as temporary failing:

  • The COPY regression that slipped in with the guessInstructionProperties=0 (risbg-01.ll):
%bb.0: derived from LLVM BB %0
    Live Ins: %r2l %r3d
        %1<def> = COPY %r3d; ADDR64Bit:%1
        %3:subreg_l32<def,read-undef> = COPY %r2l; GR64Bit:%3
(2)     %2<def> = COPY %3:subreg_l32; GR32Bit:%2 GR64Bit:%3
        %2<def,tied1> = SRA %2<tied0>, %noreg, 28, %cc<imp-def,dead>; GR32Bit:%2
        ST %2, %1, 0, %noreg; mem:ST4[%dest] GR32Bit:%2 ADDR64Bit:%1
(5)     %5<def,tied1> = RISBG %5<undef,tied0>, %3, 60, 190, 36, %cc<imp-def,dead>; GR64Bit:%5,%3
        %r2l<def> = COPY %5:subreg_l32; GR64Bit:%5
        Return %r2l<imp-use>

Before fixing the instruction flags and setting guessInstructionProperties to
0 (in a completely different patch), the RISBG was incorrectly treated as a
global memory object and a new barrier chain. When the instruction flag was
*fixed*, the CopyConstrain DAG mutator now adds a weak edge:

Constraining copy SU(2)
  Local use SU(5) -> SU(2)

This forces the RISBG to be scheduled before the COPY, which is unfortunate,
as then the RISBG cannot write to the connected %r2l phys-reg. The
regression becomes:

f21:                                    f21:
        .cfi_startproc                          .cfi_startproc
# %bb.0:                                # %bb.0:

        lr      %r0, %r2              |         risbg   %r0, %r2, 60, 190, 36
        sra     %r0, 28               |         lr      %r1, %r2
        st      %r0, 0(%r3)           |         sra     %r1, 28
        risbg   %r2, %r2, 60, 190, 36 |         st      %r1, 0(%r3)
                                      |         lr      %r2, %r0
        br      %r14                            br      %r14
.Lfunc_end0:                            .Lfunc_end0:

One remedy for this is to run the new tryPhysRegCopies2() heuristic before the
weak edges check. I am a bit reluctant to do that, as it seems safest to keep
the new heuristic as late as possible, after the last reg-pressure check etc.

Alternatively, CopyConstrain could be fixed to consider the phys-reg deps,
since this seems natural, at least in this case. For this I first tried to add
checks inside the loops that add the edges in constrainLocalCopy(), but then
changed this to a general check at the beginning of the loop instead. This
seemed to work fairly well, and is part of the new patch.

  • args-11.mir:

This test case has proven tricky to handle without making the heuristic too aggressive.

  • One way is to add extra checks for the actual used / def:ed reg(s). This is (1) below, from Dec 6.
  • This has the users in two regions, due to an SP-def. I tried to redefine the SystemZ scheduling boundary to ignore SP-defs pre-RA so as to be able to require all users to be in the region. It turned out, however, that it was enough to require all users to be local to the MBB, which is simpler.
  • I experimented with other ways to handle this without causing regressions, tuning for the numbers in the tables. The current patch handles it without the extra specific checks used before in the first attempt.

While experimenting with variations on the cost function
(findConnectedPhysRegs), I took guidance from the number of COPYs (reg moves)
and the number of Spill|Reload instructions in the output. These are compared
to master (unpatched) in the tables below:

(1) "patch Dec 6": This is the previous version of the patch, which was
limited to "one use" and also had the extra lengthy checks involving the
UsedPReg and DefedPReg comparisons. This does not handle risbg-01.ll.

Reg moves                   unpatched            patch Dec 6
403.gcc        :                62353                  62238     -115
435.gromacs    :                14381                  14484     +103
481.wrf        :                 7252                   7342      +90
445.gobmk      :                14556                  14495      -61
447.dealII     :                60812                  60767      -45
464.h264ref    :                 7006                   6972      -34
454.calculix   :                18640                  18666      +26
453.povray     :                14028                  14007      -21
450.soplex     :                 6807                   6820      +13
483.xalancbmk  :                97079                  97070       -9
471.omnetpp    :                18051                  18042       -9
482.sphinx3    :                 2931                   2937       +6
... (<= 5)
Sum            :               371149                 371101      -48

Spill|Reload                unpatched            patch Dec 6
435.gromacs    :                13085                  13024      -61
445.gobmk      :                 6370                   6344      -26
400.perlbench  :                 5396                   5370      -26
481.wrf        :                 2534                   2550      +16
403.gcc        :                15789                  15777      -12
453.povray     :                 7223                   7212      -11
454.calculix   :                19859                  19851       -8
483.xalancbmk  :                 9598                   9604       +6
464.h264ref    :                12263                  12269       +6
... (<= 5)
Sum            :               165013                 164898     -115

(2) Current patch without any options:

  • Do the check of connected phys regs in CopyConstrain.
  • Run tryPhysRegCopies2() last in tryCandidate(). In particular:
    • If there is any use other than a coalescable COPY to a phys-reg, don't do anything. If there are one or more such COPYs, add 1 to the cost function.
    • For any use of a register that is defined by a local coalescable COPY and that also only has local users, add 1 to the cost function.

Reg moves                   unpatched          current patch
403.gcc        :                62353                  62079     -274
483.xalancbmk  :                97079                  96879     -200
445.gobmk      :                14556                  14399     -157
435.gromacs    :                14381                  14481     +100
481.wrf        :                 7252                   7343      +91
464.h264ref    :                 7006                   6964      -42
400.perlbench  :                18428                  18389      -39
454.calculix   :                18640                  18667      +27
436.cactusADM  :                10367                  10348      -19
471.omnetpp    :                18051                  18035      -16
450.soplex     :                 6807                   6823      +16
456.hmmer      :                 5122                   5132      +10
453.povray     :                14028                  14020       -8
447.dealII     :                60812                  60806       -6
... (<= 5)
Sum            :               371149                 370625     -524

Spill|Reload                unpatched          current patch
435.gromacs    :                13085                  13024      -61
400.perlbench  :                 5396                   5364      -32
445.gobmk      :                 6370                   6344      -26
403.gcc        :                15789                  15764      -25
447.dealII     :                33182                  33200      +18
464.h264ref    :                12263                  12247      -16
481.wrf        :                 2534                   2550      +16
483.xalancbmk  :                 9598                   9606       +8
454.calculix   :                19859                  19852       -7
... (<= 5)
Sum            :               165013                 164876     -137

(3) Current patch with COPY_CONSTRAIN_CHECK = false. Same as (2), but without
the check in CopyConstrain. (does not handle risbg-01.ll)

Reg moves                   unpatched  !COPY_CONSTRAIN_CHECK
403.gcc        :                62353                  62076     -277
483.xalancbmk  :                97079                  96876     -203
445.gobmk      :                14556                  14410     -146
435.gromacs    :                14381                  14478      +97
481.wrf        :                 7252                   7343      +91
464.h264ref    :                 7006                   6964      -42
400.perlbench  :                18428                  18388      -40
454.calculix   :                18640                  18667      +27
436.cactusADM  :                10367                  10348      -19
471.omnetpp    :                18051                  18035      -16
450.soplex     :                 6807                   6821      +14
453.povray     :                14028                  14017      -11
456.hmmer      :                 5122                   5132      +10
447.dealII     :                60812                  60804       -8
... (<= 5)
Sum            :               371149                 370619     -530

Spill|Reload                unpatched  !COPY_CONSTRAIN_CHECK
435.gromacs    :                13085                  13024      -61
400.perlbench  :                 5396                   5364      -32
445.gobmk      :                 6370                   6344      -26
403.gcc        :                15789                  15764      -25
447.dealII     :                33182                  33200      +18
481.wrf        :                 2534                   2550      +16
464.h264ref    :                12263                  12247      -16
483.xalancbmk  :                 9598                   9606       +8
454.calculix   :                19859                  19852       -7
... (<= 5)
Sum            :               165013                 164874     -139

(4) Current patch with COPY_CONSTRAIN_CHECK = false and BEFORE_WEAK =
true. Same as (2), but without the check in CopyConstrain, and with the
tryPhysRegCopies2() run before checking the weak edges instead.

Reg moves                   unpatched   !COPY_CONSTRAIN_CHECK + BEFORE_WEAK
403.gcc        :                62353                  62076     -277
483.xalancbmk  :                97079                  96880     -199
445.gobmk      :                14556                  14411     -145
435.gromacs    :                14381                  14482     +101
481.wrf        :                 7252                   7343      +91
464.h264ref    :                 7006                   6965      -41
400.perlbench  :                18428                  18389      -39
454.calculix   :                18640                  18667      +27
436.cactusADM  :                10367                  10348      -19
450.soplex     :                 6807                   6822      +15
471.omnetpp    :                18051                  18036      -15
453.povray     :                14028                  14017      -11
456.hmmer      :                 5122                   5132      +10
447.dealII     :                60812                  60806       -6
... (<= 5)
Sum            :               371149                 370635     -514

Spill|Reload                unpatched   !COPY_CONSTRAIN_CHECK + BEFORE_WEAK
435.gromacs    :                13085                  13024      -61
400.perlbench  :                 5396                   5364      -32
445.gobmk      :                 6370                   6344      -26
403.gcc        :                15789                  15764      -25
447.dealII     :                33182                  33200      +18
481.wrf        :                 2534                   2550      +16
464.h264ref    :                12263                  12248      -15
483.xalancbmk  :                 9598                   9606       +8
454.calculix   :                19859                  19852       -7
... (<= 5)
Sum            :               165013                 164875     -138

(5) Current patch with COUNT_DEF = false. This ignores the def (inverse)
case, which seems to have an interesting effect on the number of register
moves, while the improvement in spilling disappears.

Reg moves                   unpatched             !COUNT_DEF
445.gobmk      :                14556                  14279     -277
403.gcc        :                62353                  62114     -239
483.xalancbmk  :                97079                  96908     -171
400.perlbench  :                18428                  18397      -31
436.cactusADM  :                10367                  10344      -23
481.wrf        :                 7252                   7234      -18
456.hmmer      :                 5122                   5112      -10
453.povray     :                14028                  14019       -9
433.milc       :                 2402                   2393       -9
447.dealII     :                60812                  60820       +8
471.omnetpp    :                18051                  18044       -7
450.soplex     :                 6807                   6801       -6
... (<= 5)
Sum            :               371149                 370351     -798

Spill|Reload                unpatched             !COUNT_DEF
481.wrf        :                 2534                   2546      +12
445.gobmk      :                 6370                   6378       +8
403.gcc        :                15789                  15782       -7
464.h264ref    :                12263                  12256       -7
400.perlbench  :                 5396                   5390       -6
... (<= 5)
Sum            :               165013                 165025      +12

(6) Current patch with COUNT_DEF = false and COPY_CONSTRAIN_CHECK = false and
BEFORE_WEAK. Same as (3), but instead of doing the check in CopyConstrain,
tryPhysRegCopies2() is run before checking the weak edges.

Reg moves                   unpatched !COUNT_DEF + !COPY_CONSTRAIN_CHECK + BEFORE_WEAK
445.gobmk      :                14556                  14283     -273
403.gcc        :                62353                  62111     -242
483.xalancbmk  :                97079                  96905     -174
400.perlbench  :                18428                  18396      -32
436.cactusADM  :                10367                  10344      -23
481.wrf        :                 7252                   7234      -18
456.hmmer      :                 5122                   5112      -10
450.soplex     :                 6807                   6799       -8
471.omnetpp    :                18051                  18044       -7
447.dealII     :                60812                  60819       +7
433.milc       :                 2402                   2395       -7
453.povray     :                14028                  14021       -7
... (<= 5)
Sum            :               371149                 370345     -804

Spill|Reload                unpatched !COUNT_DEF + !COPY_CONSTRAIN_CHECK + BEFORE_WEAK
481.wrf        :                 2534                   2546      +12
445.gobmk      :                 6370                   6378       +8
403.gcc        :                15789                  15782       -7
464.h264ref    :                12263                  12257       -6
400.perlbench  :                 5396                   5390       -6
... (<= 5)
Sum            :               165013                 165024      +11

(7) Trying just the CopyConstrain check, *without* tryPhysRegCopies2. Not much change --
without tryPhysRegCopies2 running after it, there seems to be no use for this.

Reg moves                   unpatched              spec-llvm
... (<= 5)
Sum            :               371149                 371164      +15

Spill|Reload                unpatched              spec-llvm
... (<= 5)
Sum            :               165013                 165015       +2

These are the variants which seem to improve. Other variants, which applied
just one of the cases from before, i.e. "just the uses" or "just the def" ("inverse case"),
gave worse numbers, somewhat to my surprise.

I admit that small changes to the patch influence the numbers, and there is
no real guarantee that this version of the patch will remain the best over
time. Also, the number of eliminated COPYs / spills is quite marginal.

Looking at SPEC, I have benchmarked this a few times overnight and find
that while last week there seemed to be a slight improvement, this week it
is mostly minor regressions on average. These effects seem to be quite
random, as I have not been able to find any obvious explanations.

Today, it looks like (1) is showing about unchanged average impact over
benchmarks, while (3), (4) and (5) show 0.15 % regression on average. (2)
and (6) show 0.3 % / 0.5 % regression on average.

Again, those regressions do not really make sense, and I would guess that
they might very well be reversed next week or so, given that the number of
COPYs and spills is reduced on average.

I have no idea how this affects other targets; I can only hope for the better.
It would be nice if someone could confirm this.

The patch might be finalized into one of the settings above, or even reverted
back to the previous version, if that's better for some reason.

At this point I would appreciate some feedback from the reviewers!

jonpa added a comment.Feb 8 2018, 8:53 AM

At least for the last two days, it seems this patch is a slight improvement on SystemZ.

I also cross-built to check impact on other targets. The results look acceptable to me, even though not as good as on SystemZ:

-target x86_64-linux-gnu (no cpu specified):

Reg moves                      master                patched
454.calculix   :                21958                  22000      +42
483.xalancbmk  :                70109                  70074      -35
435.gromacs    :                14019                  13985      -34
403.gcc        :                71829                  71860      +31
436.cactusADM  :                13496                  13516      +20
456.hmmer      :                 5857                   5865       +8
433.milc       :                 3079                   3086       +7
445.gobmk      :                17615                  17609       -6
... (<= 5)
Sum            :               264267                 264300      +33

Spill|Reload                   master                patched
403.gcc        :                21588                  21529      -59
483.xalancbmk  :                 8248                   8267      +19
454.calculix   :                16587                  16569      -18
456.hmmer      :                 3858                   3842      -16
433.milc       :                 1766                   1781      +15
435.gromacs    :                12109                  12119      +10
... (<= 5)
Sum            :               115627                 115572      -55
-target aarch64-linux-gnu (no cpu specified):

Reg moves                      master                patched
483.xalancbmk  :                73855                  73904      +49
403.gcc        :                82487                  82462      -25
454.calculix   :                22894                  22880      -14
436.cactusADM  :                14834                  14844      +10
435.gromacs    :                16489                  16498       +9
... (<= 5)
Sum            :               295182                 295204      +22

Spill|Reload                   master                patched
403.gcc        :                43066                  43005      -61
454.calculix   :                14818                  14810       -8
400.perlbench  :                14536                  14543       +7
445.gobmk      :                17120                  17126       +6
... (<= 5)
Sum            :               183350                 183288      -62

Matthias, can I proceed with removing experimental options and moving on to test updates for the other targets?

jonpa added a comment.Feb 20 2018, 9:41 AM

PING!

Does anyone have any comments on this?

Thanks for CC'ing me, but I can't comment on whether this is the right heuristic since I haven't done this level of tuning in a long time. Deferring to @MatzeB.

I tried this patch on aarch64 A72 firefly linux on a set of benchmarks.
Overall the performance degraded by 35% cumulatively (sum of all speedups and slowdowns).
There were 5 benchmarks that sped up by more than 1% and 12 that slowed down by >1%.
One benchmark slowed down by >10% and three by >5%.
I will investigate these slowdowns.

I just collected more data on the slowdowns and they are all within noise level.
When taking the best scores over 5 or more runs, the patch shows a small speedup overall.
Green light to commit from my side.

jonpa added a comment.Mar 19 2018, 8:43 AM

Thanks for checking this! Good to know that this also works on your target.