This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve side steering of FPd unit and FXU registers.
Needs Review · Public

Authored by jonpa on Mar 5 2018, 7:46 AM.

Details

Reviewers
uweigand
Summary
  • Add side-steering of FXU registers.
  • Improve and implement a general side steering utility.

Add side-steering of FXU registers.

Candidate gets a new BypassCost member.

bypassCost() computes the BypassCost for Candidate. It tries to match the FXU register uses with the previous def of the same register. It also tries not to place a def in the last slot of a decoder group if it has a successor which is only waiting for that reg.
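As a rough, self-contained sketch of the idea (the function shape and the exact cost values are assumptions based on this description and the statistics below):

    // Sketch: a use that decodes on the same processor side as its def
    // can take the bypass (good, negative cost); a cross-side use cannot.
    // Slots 0-2 of a decoder cycle form the "left" side, 3-5 the "right".
    int bypassCost(unsigned UseCycleIdx, unsigned DefCycleIdx) {
      bool UseLeft = (UseCycleIdx % 6) < 3;
      bool DefLeft = (DefCycleIdx % 6) < 3;
      return (UseLeft == DefLeft) ? -2 : 2;
    }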

Improve the side steering and also get better results for the FPd unit.

On trunk, the only side-steering heuristic is to check for the exact distance of 3, since only then is it certain that two FPd instructions end up on opposite processor sides. This is the simplest possible version of side-steering, and it can be improved.

Given that the function starts with a taken branch, it should always be possible to know which *possible* groupings each basic block starts with. If a block has multiple predecessors and one falls through into the current block, an alternate grouping is possible (unless the linear predecessor ends with a complete group).

Branch probabilities and defs in previous blocks are ignored. The only knowledge added at the beginning of a block compared to trunk is the set of possible alternate decoder groupings, which enterMBB() is extended to compute (see the sketch below).

These are modelled by GroupOffsets. GroupOffset[0] is always true, since it simply reflects the current scheduler state regardless of any alternate groupings. If GroupOffset[1] is true, there is possibly already one instruction in the current decoder group, and similarly for GroupOffset[2].
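A minimal sketch of how enterMBB() could seed these offsets (the parameterization is an assumption, not the patch code):

    // A taken-branch edge always starts a fresh decoder group, which
    // GroupOffsets[0] represents. A fall-through predecessor may leave
    // 1 or 2 instructions in an open group, adding an alternate offset.
    void enterMBB(bool HasFallThroughPred, unsigned PredOpenGroupSize,
                  bool GroupOffsets[3]) {
      GroupOffsets[0] = true;
      GroupOffsets[1] = GroupOffsets[2] = false;
      if (HasFallThroughPred && PredOpenGroupSize >= 1 && PredOpenGroupSize <= 2)
        GroupOffsets[PredOpenGroupSize] = true;
    }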

There are now three alternative situations:

  1. No group offsets. Side steering means looking at the cycle indexes of the two SUs and directly comparing whether they are both high or low. Indexes 0-2 are low and 3-5 are high, corresponding to the "left" / "right" processor sides.
  2. One offset. There are then smaller groups (of two slots) that hold in both alternatives, and these are checked instead of the full groups.
  3. Two offsets. Group limits could be anywhere, so only the distance-of-3 heuristic is sure to work.

SystemZHazardRecognizer is extended with

  • SideSteerIndexes: A map that records the decoder cycle index at the point of emitting an SU, for the relevant side steering resource, e.g. the FPd unit, or a defined FXU register.
  • checkSide(): Implements points 1-3 above, either to check for the same or the opposite side (see the sketch after this list).
  • Some extra care has to be taken when emitting a non-taken branch, or when a block has multiple predecessors. If there are then any group offsets, they can and must be recomputed. See emitInstruction() and normalize().
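A self-contained sketch of the checkSide() cases (invented names; case 2's exact two-slot subgroup comparison is patch-specific and only summarized in a comment):

    // True if cycle indexes IdxA and IdxB (IdxB >= IdxA) provably decode
    // on the same processor side, given the possible group offsets.
    bool provablySameSide(unsigned IdxA, unsigned IdxB,
                          const bool GroupOffsets[3]) {
      if (!GroupOffsets[1] && !GroupOffsets[2])
        // Case 1: slots 0-2 are "left", slots 3-5 are "right".
        return ((IdxA % 6) < 3) == ((IdxB % 6) < 3);
      // Case 2 (one offset) compares the two-slot subgroups that hold in
      // both alternatives; that patch-specific check is omitted here.
      // Case 3: group limits could be anywhere, so only the exact
      // distance rule is safe: same side iff the distance is 0 modulo 6.
      return (IdxB - IdxA) % 6 == 0;
    }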

Evaluation on SPEC:

It is interesting to note that 88% of the SUs, at the point they are emitted, are in a state without any grouping offsets. 8% have offset 1, 3% have offset 2, and 0.75% have both offsets 1 and 2. A strong alternative would therefore be to simply always ignore the alternate groupings. This would simplify the patch greatly, as most of the complexity lies in keeping track of the offsets. It also seems to work about as well in preliminary runs.

Using this patch seems to give perhaps 0.2-0.3% improvement on average over the benchmarks.

It is curious to note that the bypassing heuristic makes the scheduler run out of alternative SUs more often. This seems to mean that there are a few more instances where a cracked instruction breaks a group early, etc. The more aggressive the FXU heuristic is, the more this happens, although it is quite marginal to begin with. See the attached table:

C is master (unmodified). E adds just the improved FPd side steering. G, I, K and M additionally use FXU side steering with different cutoffs of the height (M has no cutoff -> most aggressive). The columns show how much a more aggressive FXU side steering influences the other statistics. E shows some improved FPd scheduling. BypassCost has 'Known' values -2 (good), and 1 and 2 (bad). The "rest" are all the cases where the scheduler "does not know". Similarly for GroupingCosts.

Compile time:

Since the noCost() method now looks for a -2 bypass cost, many more (x7!) candidates are evaluated:

                                          master       lim5    lim10000
Number of sched candidates evaluated:     272177    2077046     2201446
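A hedged guess at the mechanism (sketch only; the field names are invented): noCost() decides when the scan over the Available set can stop early, and it now only stops once a full bypass has also been found, so many more candidates get evaluated:

    struct Candidate { int GroupingCost, ResourcesCost, BypassCost; };

    // Stop evaluating further candidates only for a "perfect" one, i.e.
    // one that also has the best possible bypass cost (-2).
    bool noCost(const Candidate &C) {
      return C.GroupingCost <= 0 && C.ResourcesCost <= 0 &&
             C.BypassCost == -2;
    }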

This is also somewhat indicated by --time-passes. Average post-RA scheduler pass percentage of compile time:

master
User 1.39%   | System 1.26%  |  User+Sys 1.4%  | Wall 1.49%
User 1.21%   | System 1.07%  |  User+Sys 1.19%  | Wall 1.48%

lim5
User 1.46%   | System 1.28%  |  User+Sys 1.47%  | Wall 1.69%
User 1.4%    | System 1.21%  |  User+Sys 1.39%  | Wall 1.66%

lim10000
User 1.38%  | System 1.19%  |  User+Sys 1.36%  | Wall 1.65%
User 1.4%   | System 1.10%  |  User+Sys 1.38%  | Wall 1.65%

This is not that much, and if it is an issue it can probably be improved further.

Experimental options:

SIDESTEERING_FXU: enables the FXU side steering. Without it, only FPd side steering is performed.

FXU_HEIGHTDIFF: Sets a cutoff for when to stop looking for an FXU bypass in the Available set. If Best is this much higher than the last tried candidate, it is accepted without a bypass. This adjusts the aggressiveness of the bypass heuristic.

DOGROUPS: Always track the groups as if there are no alternative groupings.
NOSIDESTEERRESET: Don't reset side-steering. If used with DOGROUPS, this gives the behaviour of "ignoring groups" (the simplified version of the patch).
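Later diffs expose these as command-line flags (e.g. -sidesteer-fxu below). For reference, a typical way to declare such experimental switches in LLVM looks like this (the flag names and defaults here are assumptions):

    #include "llvm/Support/CommandLine.h"
    using namespace llvm;

    static cl::opt<bool> SideSteerFXU(
        "sidesteer-fxu", cl::init(false),
        cl::desc("Side-steer FXU (B2B) register defs and uses"));

    static cl::opt<int> FXUHeightDiff(
        "fxu-heightdiff", cl::init(5),
        cl::desc("Height cutoff for the FXU bypass search"));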

Diff Detail

Event Timeline

jonpa created this revision.Mar 5 2018, 7:46 AM
jonpa added a comment.Mar 5 2018, 7:47 AM

This is the table mentioned in the Summary.

jonpa updated this revision to Diff 315875.Jan 11 2021, 11:56 AM
jonpa edited subscribers, added: Andreas-Krebbel; removed: MatzeB.

This patch has been improved to make use of B2B information. B2BW, B2BR, and B2BRW FUs have been added to the SchedModel so that instructions can be modeled to use them. B2BRW is not really needed, but I tried using it for readability. This is one way of keeping track of which instructions can read and/or write B2B. A disadvantage is that the enum for the ProcResources is not available from TableGen, so it has been added locally instead for now. It looked like there was enough irregularity among the opcodes to motivate this approach, although the differences between subtargets were very small.
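A sketch of the locally added enum (the numeric values are placeholders and must match what TableGen actually generates for the subtarget's scheduling model):

    // TableGen does not emit named constants for ProcResource indexes,
    // so the B2B units are mirrored locally. Values are assumptions.
    enum B2BProcResIdx : unsigned {
      B2BW  = 1,  // writes a back-to-back forwardable result
      B2BR  = 2,  // can read a back-to-back forwarded value
      B2BRW = 3   // both; not strictly needed, kept for readability
    };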

One open question is what happens if one instruction defines the high part of a register on one side, and another instruction defines the low part on the other side. Should these subregs be tracked separately, or is it always the full (64-bit) register that is written? (What about 128-bit?)

I revisited the question from before of whether the assumption that the first instruction in the MBB begins a new decoder group actually holds. A naive estimate can be made by categorizing the types of incoming edges and their relative frequencies (this ignores probabilities):

  • In these cases, the assumption is correct (the MBBs are scheduled in linear order, so an edge from a not yet scheduled pred must also be a taken branch):
Multiple predecessors, incoming Taken Branch: 29%
Multiple predecessors, block not scheduled: 10%
Single predecessor, sched-state known, taken branch: 9%
Multiple predecessors, linear pred ends group: 8%
Entry-blocks: 3%
Single predecessor, not scheduled : 2%
  • These edges (blocks) simply continue as before, right or wrong:
Single predecessor, sched-state known, linear pred: 31%
  • These edges mean that the scheduler is wrong:
Multiple predecessors, linear pred has 1 in group: 6%
Multiple predecessors, linear pred has 2 in group: 2%

In summary,
61% of the incoming edges are known to lead to correct scheduling.
8% of the incoming edges are known to lead to an unmodelled group offset in the scheduler.
31% continue from before, which means they should not change the ratio; among the edges that decide it, the split is then 61/(61+8), i.e. roughly 88% vs 12%.

As soon as a cracked instruction is scheduled, the scheduler is right again after that point, so the above might be seen as a bit pessimistic.

So it seems that in 9 out of 10 cases the scheduler's assumption is generally right, even though this does not take into account the actual hotness of the edges. And even if the grouping is off, there is still a chance for the bypass if the instructions are scheduled next to each other:

[x _ x]  90%
[_ x x]  97%
[x x _]  93%
[x _ _][_ _ _]
[x _ _] 100%

The next question is how effective the scheduler is at actually producing a schedule that puts B2B reads on the same side as their B2B writes, under the assumption that it can track the current decoder slot. Without any particular heuristic this should by chance be 50/50: with a random schedule, half of the reads end up on the right side.

Without B2B side-steering enabled during node selection:

B2B reads with good schedule: ~8%   (58% ratio)
B2B reads with bad schedule : ~6%

With side-steering of B2B reads only (-sidesteer-fxu):

B2B reads with good schedule: ~9%   (71% ratio)
B2B reads with bad schedule : ~4%

With side-steering of B2B reads and writes (-sidesteer-fxu -sidesteer-lastslot):

B2B reads with good schedule: ~10%  (75% ratio)
B2B reads with bad schedule : ~3%

This shows an improvement with a higher ratio of good B2BR scheduled nodes.

The bypass heuristic is used with a lower priority than grouping or resources, but those costs were present in only 2% of the cases where there was a bypass cost.
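In other words, the candidate comparison presumably considers the costs in this order (a sketch with invented names, not the patch's actual code):

    struct Cand { int GroupingCost, ResourcesCost, BypassCost; };

    // Grouping and resource costs dominate; the bypass cost only breaks
    // the remaining ties (lower is better, -2 being a sure bypass).
    bool betterThan(const Cand &A, const Cand &B) {
      if (A.GroupingCost != B.GroupingCost)
        return A.GroupingCost < B.GroupingCost;
      if (A.ResourcesCost != B.ResourcesCost)
        return A.ResourcesCost < B.ResourcesCost;
      return A.BypassCost < B.BypassCost;
    }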

When any node with a B2B (write or read) cost was scheduled, there were generally not many nodes available to choose from:

1 available: 50%
2 available: 27%
3 available: 11%
6 or more available: 3%

For the B2BW nodes which did not get handled, in 92% of the cases it was the only node available.
For the B2BR nodes which ended up on the wrong side, 88% of them were the only node available.

So it seems that the potential of this patch is limited by the fact that the starting point is not "0%", but rather around 50%, and also because the availability of alternate nodes is typically low.

With -sidesteer-exact, the only placement aimed for is the same slot in a following decoder group (modulo 6 instructions), which would be immune to incorrect tracking of decoder groups (linear predecessor fall-through).
This gave far fewer known beneficially scheduled reads, which should be due to the low number of available instructions.

Possible improvements / ideas:

  • If an instruction uses two registers that are both defined with a B2BW, one could try to put both definitions on the same side.
  • It has been a while since I checked, but it may be worth looking into "breaking anti-dependencies" before post-RA sched, to perhaps make more instructions available.

Benchmarks:

I compared master to "-sidesteer-fxu -sidesteer-lastslot" (1), "-sidesteer-fxu -sidesteer-exact" (2), and "-sidesteer-fxu -newfpd-sides" (3).

  1. This gave small mixed results during the first run of SPEC-17. A few benchmarks were then rerun in "full" mode, and out of these namd and xalancbmk improved ~1%. Namd is not an integer benchmark, but maybe this was related to some induction variable in some loop?
  2. This gave, in the "full" run, a 2% improvement on xalancbmk, but also a 2% regression on omnetpp and a 1% regression on lbm.
  3. This used the tracked groups for FPd ops, but it did not seem to improve benchmarks this time around either.
Herald added a project: Restricted Project.Jan 11 2021, 11:56 AM
jonpa updated this revision to Diff 319313.Jan 26 2021, 9:01 AM

Latest improvements - still with ongoing experiments.

  • Instead of keeping the side of the last cycle index of defined GPRs, just remember whether the B2BW SU went left or right.
  • B2BEdge() looks at the DAG and determines whether two instructions are B2B (see the sketch below).
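A rough, self-contained sketch of what B2BEdge() checks (the types are minimal stand-ins for the scheduler's DAG classes, not the real LLVM ones):

    #include <vector>

    struct SU {
      bool WritesB2B = false;            // modeled via the B2BW/B2BRW units
      bool ReadsB2B = false;             // modeled via the B2BR/B2BRW units
      std::vector<const SU *> DataPreds; // true data dependences only
    };

    // Def and Use form a B2B pair if Use really consumes a value from
    // Def and both ends are modeled with the B2B forwarding units.
    bool B2BEdge(const SU *Def, const SU *Use) {
      if (!Def->WritesB2B || !Use->ReadsB2B)
        return false;
      for (const SU *P : Use->DataPreds)
        if (P == Def)
          return true;
      return false;
    }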

Currently trying two different approaches: "heuristic" and "look-ahead". The look-ahead should hopefully eventually provide a near-optimal result, while the heuristic might get close while staying simpler...

jonpa updated this revision to Diff 320228.Jan 29 2021, 3:41 PM

Patch updated with latest improvements (still experimental).

I tried refining the look-ahead technique while looking into single cases that could be improved relative to the heuristic approach; however, this typically seemed to fix a particular issue while making overall results worse:

  • checking against the number of other available instructions, in order to let the "next" side begin on the "current" side when it had to (as well as letting the current side spill over into the next side when it did not fit in the current one).
  • handling so that the "other pred", which is also available as a separate candidate, did not get scheduled if the algorithm had decided that the optimal sequence was SU, "other pred", SU.

The look-ahead algorithm could handle tricky overlapping cases with cost functions, but it got too complicated with the 3-instruction case ("other pred"). This is rather simple to state as a heuristic: "If SU has a B2B user that is also waiting for another predecessor, schedule all three of them if on the first decoder slot." But making that work in a more comprehensive context was not at all as simple...

This version shows the look-ahead algorithm that gave the best numbers (see below), which was the one without the two points above. Overall, it was just slightly better than the heuristic approach. The first section below lists the number of instructions that had a particular number of Good/Bad B2BR operands as they were scheduled. For example, 138k instructions with a single B2BR operand were now scheduled on the good side. There were fewer instructions with both operands on the bad side; some of them got both on the good side, some became 1/1.

master <> heuristical approach
Good: 0  Bad 1   264532   125894  -138638
Good: 0  Bad 2    12455     6691    -5764
Good: 1  Bad 0   506788   645426   138638
Good: 1  Bad 1     9410    11460     2050
Good: 1  Bad 2       71      101       30
Good: 2  Bad 0     7617    11331     3714
Good: 2  Bad 1       52       51       -1

Sum edges LHS: Good:  531626 Bad:  299163
Sum edges RHS: Good:  679755 Bad:  151034
                     +148129      -148129
heuristical <> look-ahead
Good: 0  Bad 1   125894   121273    -4621
Good: 0  Bad 2     6691     7295      604
Good: 1  Bad 0   645426   650047     4621
Good: 1  Bad 1    11460     9803    -1657
Good: 1  Bad 2      101      104        3
Good: 2  Bad 0    11331    12384     1053
Good: 2  Bad 1       51       50       -1
Good: 2  Bad 2        0        1        1

Sum edges LHS: Good:  679755 Bad:  151034
Sum edges RHS: Good:  684827 Bad:  145962
                      +5072        -5072

The look-ahead moved a few more instructions to the good side.

jonpa updated this revision to Diff 322006.Feb 7 2021, 4:57 PM

I started to simplify the patch and handled one minor regression, and then realized something... (see below :-)

Patch simplified towards getting ready to commit:

  • The look-ahead algorithm removed.
  • Minor regression on imagick handled by having a "height cutoff": I figured that the only way this patch could make things worse was by pulling lower nodes up in the schedule so much that it would slow down the critical path. I found that by using a height limit cutoff that takes the higher node regardless of the B2B cost, I could get nearly all of the lbm improvement while also eliminating the imagick regression (2% slowdown); see the sketch after this list. With the best value (-2), I now see +100k more "Good" edges instead of +147k, but benchmarking seemed to say this was better even though the total improvement number is lower.
  • B2BWrites: The "OtherOpSides" handling checked whether the B2BR of an *unscheduled* B2BW has *another* B2BR operand whose definition is already scheduled on the other ("wrong") side. If so, a positive cost was returned in the hope of scheduling the B2BW later on the other side. This gave merely <500 more "Good" operands. 96% of the B2B readers have only one B2B operand, and this handling gives only an extra 0.3% improvement, so it was removed.
  • The "OtherPred" handling is meant to handle the case where an available (unscheduled) B2BW has a B2BR successor, but the B2BR also has one other unscheduled predecessor. This gave ~7k more "Good" operands (without the height cutoff). It is still here, but it could perhaps be removed. The "triangle" cases (where that other B2BR predecessor is the B2BW SU itself) did not seem to give any benefit on benchmarks (about 900 more "Good" instructions), so they were removed for the sake of simplicity.
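A minimal sketch of the height cutoff from the first point above (names invented; the patch's details may differ):

    // Take the taller candidate outright, ignoring the B2B cost, when it
    // is high enough relative to the bypass candidate; with the best
    // cutoff value of -2, even a slightly lower node wins, which keeps
    // the critical path from being stretched.
    bool takeByHeight(int BestHeight, int CandHeight, int Cutoff = -2) {
      return BestHeight > CandHeight + Cutoff;
    }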

At this point I was happy to have a 10% improvement on LBM, with the imagick slowdown I had seen for a while now eliminated. There were actually 1-2 other 2-3% improvements as well with the reduced benchmarking, but with -F there was only LBM.

Since only a single benchmark improved notably, I wanted to see if this was really B2B related or not. LBM contains many FP divides, and it might also be related to grouping (different schedules might make a cracked instruction available in the right slot, etc.).

Experiments showed that rescheduling MBB3 *alone* in lbm.s (LBM_performStreamCollide_TRT) with this patch applied gives the 8-9% speedup! And it did indeed seem to be due to B2B side steering: with the patch, the LA of R9 and 5 of its 6 users are on the same side, and all the AGFIs are on the right side, which seems to be near ideal, with the data flow happening on the same side:

patched:

        jhe     .LBB8_8             // NT branch on first slot
.LBB8_3:                                # %for.body
					# =>This Inner Loop Header: Depth=1
	la      %r9, 0(%r1,%r3)
	l       %r0, 152(%r1,%r2)
     ld      %f1, 0(%r1,%r2)
     pfd     1, 1432(%r1,%r2)
     pfd     2, 1280(%r1,%r3)
	lgr     %r5, %r9            // R9 B2B
	lgr     %r8, %r9
	lgr     %r7, %r9
     pfd     2, -14704(%r1,%r3)
     pfd     2, 17288(%r1,%r3)
     lgr     %r6, %r9              // R9 read on wrong side just once
	lgr     %r4, %r9           // R9 B2B
	lgr     %r10, %r9
	agfi    %r5, 1617368
     agfi    %r6, -1614608         // R6 B2B
     pfd     2, 0(%r6)
     pfd     2, 0(%r5)
	agfi    %r8, -1582624      // R8, R7, R4 B2B
	agfi    %r7, 1585384
	agfi    %r4, 1601320
     pfd     2, 0(%r4)
     pfd     2, 0(%r7)
     pfd     2, 0(%r8)
	agfi    %r10, -1598672     // R10 B2B
	tmll    %r0, 1
	pfd     2, 0(%r10)
     jne     .LBB8_1

unpatched:

        jhe     .LBB8_8             // NT branch on first slot
.LBB8_3:                                # %for.body
					# =>This Inner Loop Header: Depth=1
	l       %r0, 152(%r1,%r2)
	ld      %f1, 0(%r1,%r2)
      la      %r9, 0(%r1,%r3)
      lgr     %r5, %r9
      lgr     %r8, %r9
	lgr     %r7, %r9          // 3 R9 reads on wrong side
	lgr     %r6, %r9
	lgr     %r4, %r9
      lgr     %r10, %r9
      agfi    %r5, 1617368
      agfi    %r8, -1582624
	agfi    %r7, 1585384
	agfi    %r6, -1614608
	pfd     1, 1432(%r1,%r2)
      agfi    %r4, 1601320        // R4 wrong side
      pfd     2, 1280(%r1,%r3)
      pfd     2, -14704(%r1,%r3)
	agfi    %r10, -1598672    // R10 wrong side
	tmll    %r0, 1
	pfd     2, 17288(%r1,%r3)
      pfd     2, 0(%r10)
      pfd     2, 0(%r4)
      pfd     2, 0(%r6)
	pfd     2, 0(%r7)
	pfd     2, 0(%r8)
	pfd     2, 0(%r5)
       jne     .LBB8_1

The trunk (unpatched) schedule is not entirely bad, since it randomly puts several reads B2B. It seems that in this case, with many LGRs from the same LA which in turn are used by AGFIs, it is clearly beneficial to have all of that happening on one side. Since this involves a lot of register moves, maybe the z15 machine has better register renaming support, which would explain why there is no further benchmark improvement on that machine?

So this was not due to FP divides or cracked instructions - it was quite likely benefiting from the side-steering, which is what I wanted to see. BUT...

That code itself is only intended to compute the addresses for the prefetches, and it is far from optimal: a single LA, with multiple LGRs and AGFIs then used by PFDs without any displacement. So I decided to try to hand-code that into better assembly:

jhe     .LBB8_8
.LBB8_3:                                # %for.body
					# =>This Inner Loop Header: Depth=1
	l       %r0, 152(%r1,%r2)
	ld      %f1, 0(%r1,%r2)
	la      %r9, 0(%r1,%r3)     // r9 is live out
	la      %r7, 0(%r1,%r3)     // the other regs (r4-r8 seemed to be local only)
	la      %r8, 0(%r1,%r3)     // (la seems just as fast as lgr per table...)
	agfi    %r7, 1585384
	agfi    %r8, -1614608
	pfd     1, 1432(%r1,%r2)
	pfd     2, 1280(%r1,%r3)
	pfd     2, -14704(%r1,%r3)
	tmll    %r0, 1
	pfd     2, 17288(%r1,%r3)
	pfd     2, 15936(%r8)
	pfd     2, 15936(%r7)
	pfd     2, 0(%r8)
	pfd     2, 0(%r7)
	pfd     2, 31984(%r7)
	pfd     2, 31984(%r8)

Would doing this alone *without* this B2B patch provide the same benefit? Yes! And even more than before: 12%!!

I reran LBM with -F:

clang unpatched                    : 335s
Handcoded per above                : 295s   88%
GCC (ffp-contract=off)             : 362s  108%
clang, disabling extra prefetching : 479s  143%

It looks like clang benefited heavily (and still does) from the increased prefetching, but there was still a further improvement to be had in the generation of those addresses for the prefetches.

It seems to me now that the address computations should be handled first, and then this patch could possibly be revisited (I tried putting the two LAs/AGFIs I used on the *wrong* sides, and it did not seem to matter...).