This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Favor instructions that do not increase pressure.
Needs RevisionPublic

Authored by fhahn on Sep 22 2017, 12:22 AM.

Details

Reviewers
MatzeB
qcolombet
Summary

I think all other things being equal, it could be beneficial to prioritize
candidates that do not increase register pressure, as this may uncover
new candidates that reduce register pressure.

I have been experimenting with using the MachineScheduler on ARM with
Cortex-A57, and this change helps to reduce spilling in some cases.
For the LLVM test-suite and SPEC2k, it improves the geomean of the
execution times compared to the old scheduler from -0.82% (MachineScheduler
for ARM on Cortex-A57) to -1.03% (same as before, but with this patch) and
eliminates a few big outliers. For SPEC2k6, it improves the geomean of
execution times from -0.24% to -0.29%.

For AArch64 I did some benchmark runs comparing master against this
patch.
Cortex-A57:

  • -0.62% geomean exec time for test-suite and SPEC2k
  • -0.29% for SPEC2k6
  • +0.19% SPEC2017 score

Cortex-A72:

  • -0.29% geomean exec time for test-suite and SPEC2k

The impact of this change on the unit tests is quite small. The
modified test case in this patch now uses one register less, with the
order of instructions slightly changed. The earliest fcvt is still
scheduled after 2 instructions, so latency-wise the schedule should not
have regressed.

The following AMDGPU tests also fail:

  • CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll (with -mtriple=amdgcn--amdhsa -mcpu=fiji -amdgpu-spill-sgpr-to-smem=1) Expected: NumSGPRsForWavesPerEU: 9 Actual: NumSGPRsForWavesPerEU: 14
  • CodeGen/AMDGPU/schedule-regpressure-limit2.ll (with -march=amdgcn -mcpu=fiji -misched=gcn-max-occupancy-experimental) Expected: NumSgprs: {{[1-5][0-9]$}} Actual: NumSGPRsForWavesPerEU: 78

I do not know anything about AMDGPU, so I cannot really say whether that
is good or bad.

@jonpa this might be interesting to you too, as I gathered that you are
working on switching SystemZ to the MachineScheduler.

Diff Detail

Event Timeline

fhahn created this revision.Sep 22 2017, 12:22 AM
fhahn added a subscriber: arsenm.Sep 22 2017, 12:24 AM

cc Matt, maybe you know whether the change in the AMDGPU tests is good or bad?

MatzeB edited edge metadata.Sep 22 2017, 11:57 AM
  • I vaguely remember trying something like this and having some crypto benchmarks produce bad schedules; I'll see if I can remember/find it
  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.
  • Do you have a specific example where this helps?
  • I vaguely remember trying something like this and having some crypto benchmarks produce bad schedules; I'll see if I can remember/find it

Thanks for the pointer, I have a few crypto benchmarks I can run too.

  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.

Thanks, I'll look into that.

  • Do you have a specific example where this helps?

The runtime of float-mm from the LLVM test-suite increased by around 15% because of additional spilling on ARM with the MachineScheduler enabled (on Cortex-A57). I can also have a look at some improvements on AArch64.

I cannot find that particular test, in the test-suite what's the exact name/path (or is it an internal one)?

lib/CodeGen/MachineScheduler.cpp
2777–2788

Why isn't this using tryGreater(), and isn't this the same as just changing getUnitInc() < 0 to getUnitInc() <= 0 in the earlier check?

MatzeB added a subscriber: atrick.Sep 25 2017, 8:28 PM

I think the crypto code I was looking at back then was John the Ripper (openbenchmarking.org / pts/john-the-ripper).

Just tried with/without your changes, and I see the descrypt part improving and the LM part regressing. So whether this is an improvement or a regression may depend on the situation, and we should find and look at examples.

I assume Andy designed the current behavior to give more freedom to the latency-minimizing code? So an out-of-order Cortex-A57 may not be the best test case for it; then again, if it improves the spilling situation and leads to a win on out-of-order cores, we could switch the behavior based on out-of-order vs. in-order.

fhahn added a comment.Sep 26 2017, 2:11 PM
  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.

I tried only favoring instructions that do not increase pressure in each context separately. When using the current max, the results are very similar to the initial version of the patch and the AMDGPU failures go away. Using it in the other contexts yields smaller improvements on the set of benchmarks mentioned above.

I think the cryptocode I was looking at back then was john the ripper (openbenchmarking.org / pts/john-the-ripper).

Just tried with/without your changes and I see the descrypt part improving and the LM part regressing. So whether this is an improvement or regression may depend on the situation and we should find/look at examples.

I'll try to collect a list of examples where it makes things better/worse.

I assume Andy designed the current behavior to give more freedom to the latency minimizing code? So an out of order Cortex-A57 may not be the best test case for it; then again if it improves the spilling situation and leads to a win on out of order cores we could switch the behavior based on out-of-order/in-order.

I'll also run the benchmarks on an in-order core. I was under the impression that the scheduler prioritizes register needs over latency, as the latency of candidates is one of the last things checked in tryCandidate.

lib/CodeGen/MachineScheduler.cpp
2777–2788

Initially I used tryGreater above with getUnitInc() <= 0, but that version does not prioritize TryCand over Cand if TryCand decreases pressure and Cand does not change pressure, I think.

jonpa added a comment.Sep 28 2017, 4:46 AM

@jonpa this might be interesting to you too, as I gathered that you are working on switching SystemZ to the MachineScheduler.

Thanks for putting me on cc! I have a particular test case I am working with now which is kind of related. I tried your patch, but it didn't help in this particular case. See my proposed patch: https://reviews.llvm.org/D38351

MatzeB requested changes to this revision.Oct 11 2017, 4:47 PM
MatzeB added inline comments.
lib/CodeGen/MachineScheduler.cpp
2777–2788

I just looked at this patch again:
There actually already is tryLess(TryP.getUnitInc(), CandP.getUnitInc(), TryCand, Cand, Reason); below, and the important part here is that it is guarded by Cand.AtTop != TryCand.AtTop. Last time I meddled with comparing candidates across the top/bottom boundary, I found that we tend to produce terrible schedules for long basic blocks (well, Tom complained because of AMDGPU regressions, and as it turns out GPUs are more likely to have big scheduling regions). The reason is that it is easy to make decisions on one boundary that run contrary to the decisions being made at the other boundary. With that in mind, I'd rather reject this patch; to convince me otherwise, I would like to see some concrete situations or test cases.

This revision now requires changes to proceed.Oct 11 2017, 4:47 PM