In bidirectional scheduling this gives more stable results than just comparing the "reason" fields of the top and bottom candidates, because the reason recorded on a candidate depends on which other nodes happen to be in its queue and can therefore come out higher for reasons unrelated to the candidate itself.
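As a hedged illustration of that instability, here is a minimal, hypothetical sketch (the enum, struct, and `whyBetter` helper are made up for this example and are not LLVM's tryCandidate): the reason stored on a winner reflects which competitor it happened to beat.

```c++
#include <cstdio>

// Hypothetical sketch, not LLVM's tryCandidate: the "reason" stored on a
// winning candidate is whatever rule distinguished it from the particular
// competitor it was compared against, so it reflects the queue contents as
// much as the node itself.
enum Reason { RegExcess = 0, Cluster = 1, NodeOrder = 2 };

struct Cand { int PressureDelta; bool InCluster; };

Reason whyBetter(const Cand &Winner, const Cand &Loser) {
  if (Winner.PressureDelta < Loser.PressureDelta) return RegExcess;
  if (Winner.InCluster && !Loser.InCluster) return Cluster;
  return NodeOrder;
}

int main() {
  Cand X{/*PressureDelta=*/0, /*InCluster=*/false};
  Cand HighPressure{2, false}, Tie{0, false};
  // The same node X wins for RegExcess against one competitor but only by
  // NodeOrder against another: prints "0 vs 2".
  std::printf("%d vs %d\n", whyBetter(X, HighPressure), whyBetter(X, Tie));
  return 0;
}
```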
Diff Detail
- Repository: rL LLVM
Event Timeline
Note that this is an early review. While the change improves many situations, and a bunch of the testcases I fixed even show fewer spills/reloads than before, I cannot commit this yet: we systematically get many AMDGPU tests "wrong" because latency in the bottom schedule boundary trumps everything else (the nodes are in the pending queue and are not subject to tryCandidate). This leads to unnecessary violation of weak edges where one of the two linked nodes is still in the pending queue. Previously we would accidentally get this right because we would schedule the weak edges correctly from the top...
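For readers unfamiliar with the pending queue, a simplified model of the problem (hypothetical types, not LLVM's SchedBoundary): a node whose ready cycle has not been reached stays in Pending and is never offered to the candidate comparison, so a weak edge to it cannot influence the pick.

```c++
#include <vector>

// Simplified, hypothetical model of a scheduling boundary with an Available
// and a Pending queue; not LLVM's actual SchedBoundary.
struct SUnit { unsigned ReadyCycle; bool HasWeakEdgeToAvailableNode; };

struct Boundary {
  unsigned CurrCycle = 0;
  std::vector<SUnit *> Available, Pending;

  // Only Available nodes become candidates; Pending ones stay invisible to
  // the heuristic even if respecting their weak edges would be profitable.
  void releasePending() {
    for (auto It = Pending.begin(); It != Pending.end();) {
      if ((*It)->ReadyCycle <= CurrCycle) {
        Available.push_back(*It);
        It = Pending.erase(It);
      } else {
        ++It;
      }
    }
  }
};

int main() {
  SUnit A{/*ReadyCycle=*/3, /*HasWeakEdgeToAvailableNode=*/true};
  Boundary Bot;
  Bot.Pending.push_back(&A);
  Bot.releasePending(); // CurrCycle == 0: A stays pending, so its weak edge
                        // cannot be honored by the candidate comparison.
  return Bot.Available.empty() ? 0 : 1;
}
```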
I'm going to test this patch to gather some register usage statistics, and I'll see if I can fix the regressing tests.
test/CodeGen/AMDGPU/ds_read2_offset_order.ll | ||
---|---|---|
11–13 ↗ | (On Diff #54605) | This change doesn't look right to me. I need to look at this test more closely. |
test/CodeGen/AMDGPU/s_addk_i32.ll | ||
---|---|---|
6–7 ↗ | (On Diff #54605) | This is a regression. |
31–50 ↗ | (On Diff #54605) | These are all regressions. |
test/CodeGen/AMDGPU/shl_add_constant.ll | ||
---|---|---|
61–62 ↗ | (On Diff #54605) | This is an improvement. |
77–78 ↗ | (On Diff #54605) | This is a regression. |
I would like to move this patch forward. I analyzed some of the AMDGPU regressions; they are accidentally caused by shortcomings in the MachineScheduler ReadyCycle computation, which calculates suboptimal values when an instruction requires 0 cycles. Fixing this properly is a task for another day, but the obvious fix for AMDGPU is to not define COPY as taking 0 cycles (which most likely was not intentional anyway). Tom or Matt, could you take a look at http://reviews.llvm.org/D21540?
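As a toy model of the kind of ReadyCycle computation meant here (an assumption for illustration, not the MachineScheduler's actual code): with a 0-cycle COPY, the COPY's user becomes "ready" in the very same cycle as the COPY itself, so the latency signal between them is lost.

```c++
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy DAG node: predecessors by index plus a latency in cycles.
struct Node { std::vector<int> Preds; unsigned Latency; };

// ReadyCycle(SU) = max over preds of ReadyCycle(pred) + latency(pred).
unsigned readyCycle(const std::vector<Node> &DAG, std::vector<unsigned> &Cache,
                    int Idx) {
  if (Cache[Idx] != ~0u) return Cache[Idx];
  unsigned Cycle = 0;
  for (int P : DAG[Idx].Preds)
    Cycle = std::max(Cycle, readyCycle(DAG, Cache, P) + DAG[P].Latency);
  return Cache[Idx] = Cycle;
}

int main() {
  // A tiny chain: def (4 cycles) -> COPY (0 cycles) -> use (1 cycle).
  std::vector<Node> DAG = {{{}, 4}, {{0}, 0}, {{1}, 1}};
  std::vector<unsigned> Cache(DAG.size(), ~0u);
  // With the COPY modeled as free, the COPY and its use both become ready
  // in cycle 4; prints "copy ready at 4, use ready at 4".
  std::printf("copy ready at %u, use ready at %u\n",
              readyCycle(DAG, Cache, 1), readyCycle(DAG, Cache, 2));
  return 0;
}
```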
test/CodeGen/AMDGPU/s_addk_i32.ll | ||
---|---|---|
6–7 ↗ | (On Diff #54605) | This is avoided with http://reviews.llvm.org/D21540 applied. |
31–50 ↗ | (On Diff #54605) | No regression with http://reviews.llvm.org/D21540 applied. |
test/CodeGen/AMDGPU/shl_add_constant.ll | ||
---|---|---|
77–78 ↗ | (On Diff #54605) | Are you sure this is actually a regression? The different ordering in the function also gives me: |
I would like to take a quick look at the performance stats with these updates before this is committed. This shouldn't take me too long.
test/CodeGen/AMDGPU/shl_add_constant.ll | ||
---|---|---|
77–78 ↗ | (On Diff #54605) | Using 0-48 SGPRs allows for maximum occupancy, so saving 2 SGPRs doesn't impact performance here. I think this is probably uncovering a bug somewhere else in the backend. |
As far as the scheduling algorithms are concerned, fewer registers are better; I don't think the heuristic knows about the threshold at 48 registers.
Tweaked the heuristic to get the reg-usage.ll test that Tom provided under control. The problem exposed by that case showed up when both the top and the bottom boundary only had pressure-increasing choices available; comparing the magnitudes and then scheduling some top and some bottom nodes was a bad idea:
In some situations we would schedule a top node early (because it happened to increase the pressure least across both boundaries) only to find out a few nodes later that it would have decreased the pressure at the bottom. I tweaked the heuristic to always pick a bottom node when all available choices at the top and bottom increase register pressure.
In reg-usage.ll I no longer see a regression in NumVgprs. There is still a 38 -> 46 regression for SGPRs, which I will investigate, but it would probably be worth re-running the benchmarks with this updated version.
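A rough sketch of that pick-bottom rule (made-up names, with the convention that positive deltas mean the candidate increases pressure; not the patch's actual code):

```c++
// Hypothetical sketch: when every available candidate at both boundaries can
// only increase register pressure, skip the cross-boundary magnitude
// comparison and take the bottom candidate.
struct Pick { bool FromTop; };

Pick pickBoundary(int TopBestDelta, int BotBestDelta,
                  bool AllTopIncrease, bool AllBotIncrease) {
  // Comparing "who hurts less" across boundaries is misleading here: a top
  // node that looks cheap now may be exactly the node that would later have
  // reduced pressure at the bottom.
  if (AllTopIncrease && AllBotIncrease)
    return {/*FromTop=*/false};
  // Otherwise fall back to comparing the two best candidates' deltas.
  return {TopBestDelta < BotBestDelta};
}

int main() {
  Pick P = pickBoundary(/*TopBestDelta=*/1, /*BotBestDelta=*/3,
                        /*AllTopIncrease=*/true, /*AllBotIncrease=*/true);
  return P.FromTop ? 1 : 0; // bottom wins despite the smaller top delta
}
```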
Investigating the scheduler some more, I came to the realization that when scheduling bidirectionally we often have situations in which there is no clearly good choice at the top or bottom boundary. Picking something from the top boundary because some heuristic rule of a "tie-breaking" nature tells us to is often counterproductive, as we frequently get into a situation later where the node would have been useful to reduce register pressure at the bottom boundary.
I tweaked the heuristic so that we only pick something from the top boundary if it is a good choice in itself (it is the only choice, register pressure improves, cluster edges are respected, physreg copies are kept early). In cases where we only have "tie-breaking" rules (register pressure is increased less than with another node, weak edges are not fulfilled, nodes are in program order, ...) we stay with the choice from the bottom boundary (see the sketch below).
This restores the quality of the reg-usage.ll example from Tom and seems to be beneficial in general. On AArch64 (the only other target with bidirectional scheduling enabled) I measure a 2% improvement in 252.eon, 1% in 401.bzip2, and improvements in a handful of smaller benchmarks, without any regressions (outside of noise).
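A hedged sketch of that gating; the reason names and the strong/tie-breaking split below are assumptions modeled loosely on a GenericScheduler-style candidate reason, not the patch's actual code:

```c++
// Hypothetical sketch: a top-boundary candidate only overrides the bottom
// choice when it won for a reason that is good in itself; pure tie-breakers
// leave the bottom boundary's pick in place.
enum CandReason {
  OnlyChoice, PhysRegCopy, RegExcess, RegCritical, Cluster, // strong reasons
  Weak, RegMax, ResourceReduce, NodeOrder                   // tie-breakers
};

// Assumption: everything up to Cluster counts as a decision worth trusting
// the top boundary with; everything after is merely tie-breaking.
bool isStrongReason(CandReason R) { return R <= Cluster; }

// Prefer the bottom candidate unless the top candidate is a clear win in
// itself (single choice, pressure improvement, cluster edge, physreg copy).
bool shouldPickTop(CandReason TopWinningReason) {
  return isStrongReason(TopWinningReason);
}

int main() {
  // A top candidate that only won on node order does not override the bottom.
  return shouldPickTop(NodeOrder) ? 1 : 0;
}
```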
The updated branch improves register usage when compared to trunk:
SGPRS: 318600 -> 313856 (-1.49 %)
VGPRS: 206385 -> 203085 (-1.60 %)
Code Size: 7199132 -> 7191340 (-0.11 %) bytes
Max Waves: 49246 -> 49626 (+0.77 %)
I have no objections to this change, thanks for working on this.