- Add side-steering of FXU registers.
- Improve and implement a general side steering utility.
Add side-steering of FXU registers.
Candidate gets a new BypassCost member.
bypassCost() computes the BypassCost for Candidate. It tries to match the FXU register uses with the previous def of the same register. It also tries not to place a def in the last slot of a decoder group if it has a successor which is only waiting for that reg.
Improve the side steering and also get better results for FPd unit.
On trunk, the only side-steering heuristic checks for an exact distance of 3, since only then is it certain that two FPd instructions end up on opposite processor sides. This is the very simplest version of side-steering, and can be improved.
Given that the function starts with a taken branch, it should always be possible to know which *possible* groupings each basic block starts with. If a block has multiple predecessors and one of them falls through into the current block, an alternate grouping is possible (unless the linear predecessor ends with a complete group).
Branch probabilities and defs in previous blocks are ignored. The only knowledge added at the beginning of a block, compared to trunk, is the set of possible alternate decoder groupings, which enterMBB() is extended to compute.
These are modelled by GroupOffsets. GroupOffset[0] is always true, since this is simply the current scheduler state regardless of any alternate groupings. If GroupOffset[1] is true, there may already be one instruction in the current decoder group, and similarly for GroupOffset[2] with two instructions.
There are now three alternative situations:
- No group offsets. Side steering means looking at the cycle indexes of the two SUs, and directly comparing if they are both high or low. Indexes 0-2 are low and 3-5 are high, or "left" / "right" processor sides.
- One offset. This means that there are smaller groups (of two slots) that hold in both alternatives; these are checked instead of the full groups.
- Two offsets. Group limits could be anywhere, so only the distance-of-3 heuristic is sure to work.
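The three cases above can be illustrated with one generic check. This is only a sketch and not the checkSide() from the patch: it assumes that decoder slot indexes run 0-5, that an active offset shifts an index modulo 6, and that a relation counts as known only if it holds under every active offset. SideRelation, sideOf() and relateSides() are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type, not taken from the patch.
enum class SideRelation { Same, Opposite, Unknown };

// Side of a decoder slot under a given group offset. Assumed model: six
// slots per cycle, and an offset shifts the slot index modulo 6.
// Returns 0 for low ("left") and 1 for high ("right").
static int sideOf(unsigned Idx, unsigned Offset) {
  return ((Idx + Offset) % 6) / 3;
}

// Relates the sides of two slot indexes. A relation is reported only if it
// holds under every active offset, otherwise the answer is Unknown.
SideRelation relateSides(unsigned IdxA, unsigned IdxB,
                         const std::array<bool, 3> &GroupOffset) {
  assert(GroupOffset[0] && "offset 0 is the current scheduler state");
  bool AllSame = true, AllOpposite = true;
  for (unsigned Off = 0; Off < 3; ++Off) {
    if (!GroupOffset[Off])
      continue;
    bool Same = sideOf(IdxA, Off) == sideOf(IdxB, Off);
    AllSame = AllSame && Same;
    AllOpposite = AllOpposite && !Same;
  }
  if (AllSame)
    return SideRelation::Same;
  if (AllOpposite)
    return SideRelation::Opposite;
  return SideRelation::Unknown;
}
```

Under this model, indexes 0 and 1 are known to be on the same side with only offset 1 active (a two-slot group), while with both offsets active only pairs at distance 3, such as 0 and 3, still give a known (opposite) answer.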
SystemZHazardRecognizer is extended with
- SideSteerIndexes: A map that records the decoder cycle index at the point of emitting an SU, for the relevant side steering resource, e.g. the FPd unit, or a defined FXU register.
- checkSide(): Implements the three cases above, either to check for the same or the opposite side.
- Some extra care has to be taken when emitting a non-taken branch, or when a block has multiple predecessors. If there are any group offsets at that point, they can and must be recomputed. See emitInstruction() and normalize().
Evaluation on SPEC:
It is interesting to note that 88% of the SUs are, at the point they are emitted, in a state without any grouping offsets. 8% have offset:1, 3% have offset:2, and 0.75% have both offset 1 and 2. Simply always ignoring the alternate groupings therefore seems like a strong alternative. This would simplify the patch greatly, as most of the complexity lies in keeping track of the offsets. It also seems to work about as well in preliminary runs.
This patch seems to give improvements of perhaps 0.2-0.3% on average across the benchmarks.
It is curious to note that the bypassing heuristic makes the scheduler more often run out of alternative SUs. This seems to mean that there are a few more instances of when a cracked instruction breaks a group early etc. The more aggressive the FXU heuristic is, the more this happens, although it is quite marginal to begin with. See attached table:
C is master (unmodified). E is just improved FPd side steering. G, I, K and M additionally use FXU side steering with different cutoffs of the height (M has no cutoff -> most aggressive). The columns show how much a more aggressive FXU side steering influences the other statistics. E shows some improved FPd scheduling. BypassCost has 'Known' values -2 (good), 1 and 2 (bad). The "rest" are all the cases where the scheduler "does not know". Similarly for GroupingCosts.
Compile time:
Since the noCost() method now looks for a bypass cost of -2, many more (about 7x) candidates are evaluated:
                                        master     lim5  lim10000
Number of sched candidates evaluated:   272177  2077046   2201446
This also seems to be somewhat visible using --time-passes. Average post-RA scheduler pass percentage of compile time:
master    User 1.39% | System 1.26% | User+Sys 1.4%  | Wall 1.49%
          User 1.21% | System 1.07% | User+Sys 1.19% | Wall 1.48%
lim5      User 1.46% | System 1.28% | User+Sys 1.47% | Wall 1.69%
          User 1.4%  | System 1.21% | User+Sys 1.39% | Wall 1.66%
lim10000  User 1.38% | System 1.19% | User+Sys 1.36% | Wall 1.65%
          User 1.4%  | System 1.10% | User+Sys 1.38% | Wall 1.65%
This is not that much, and if it is an issue it can probably be improved further.
Experimental options:
SIDESTEERING_FXU: enables the FXU side steering. Without it, only FPd side steering is affected.
FXU_HEIGHTDIFF: Sets a cutoff as to when to stop looking for an FXU bypass in the Available set. If Best is this much higher than the last tried candidate, it is accepted without a bypass. This adjusts the aggressiveness of the bypass heuristic.
DOGROUPS: Always do the groups as if there are no alternative groupings.
NOSIDESTEERRESET: Don't reset side-steering. If used with DOGROUPS, it then gives the behaviour of "ignoring groups" (simplified version of patch).
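One possible reading of the FXU_HEIGHTDIFF cutoff is sketched below. It is not the code from the patch. It assumes that Available is visited in decreasing height order, so that the first candidate is Best, that a candidate with a bypass cost of -2 is taken as soon as it is seen, and that the search stops once the candidate just tried is at least HeightDiff below Best. Cand and pickWithHeightCutoff() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified candidate. The real Candidate has more members.
struct Cand {
  unsigned Height; // scheduling height of the SU
  int BypassCost;  // -2 = known good bypass, 1 or 2 = known bad, 0 = unknown
};

// Returns the index of the chosen candidate. Assumes Available is ordered
// by decreasing height, so index 0 is the best candidate by height.
std::size_t pickWithHeightCutoff(const std::vector<Cand> &Available,
                                 unsigned HeightDiff) {
  assert(!Available.empty() && "need at least one candidate");
  const std::size_t Best = 0;
  for (std::size_t I = 0; I < Available.size(); ++I) {
    const Cand &C = Available[I];
    // A known good bypass ends the search immediately.
    if (C.BypassCost == -2)
      return I;
    // Best is HeightDiff or more above the candidate just tried:
    // stop looking and accept Best without a bypass.
    if (Available[Best].Height >= C.Height + HeightDiff)
      return Best;
  }
  return Best;
}
```

In this model a small HeightDiff gives up early and keeps the highest candidate, while a very large value (such as the lim10000 setting) searches the whole Available set for a bypass.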