This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Fix the TopDepth/BotHeightReduce latency heuristics
ClosedPublic

Authored by foad on Jan 8 2020, 3:50 AM.

Details

Summary

tryLatency compares two sched candidates. For the top zone it prefers
the one with lesser depth, but only if that depth is greater than the
total latency of the instructions we've already scheduled -- otherwise
its latency would be hidden and there would be no stall.

Unfortunately it only tests the depth of one of the candidates. This can
lead to situations where the TopDepthReduce heuristic does not kick in,
but a lower priority heuristic chooses the other candidate, whose depth
*is* greater than the already scheduled latency, which causes a stall.

The fix is to apply the heuristic if the depth of *either* candidate is
greater than the already scheduled latency.

All this also applies to the BotHeightReduce heuristic in the bottom
zone.

Diff Detail

Event Timeline

foad created this revision.Jan 8 2020, 3:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2020, 3:50 AM

Unit tests: pass. 61306 tests passed, 0 failed and 736 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

fhahn added a subscriber: fhahn.Jan 9 2020, 2:58 AM

There's a lot of small test changes (expected), but IMO I would be good to have an isolated test only tests the new behavior and shows an improvement (probably best as MIR machine-scheduler only test).

Also, do you have any performance/codesize numbers for impacted targets?

foad added a comment.Jan 9 2020, 3:57 AM

The diff in llvm/test/CodeGen/AArch64/arm64-zero-cycle-zeroing.ll is:

        adrp    x8, .LCPI0_0
-       ldr     h0, [x8, :lo12:.LCPI0_0]
        movi    v1.2d, #0000000000000000
        movi    v2.2d, #0000000000000000
+       ldr     h0, [x8, :lo12:.LCPI0_0]
        movi    v3.2d, #0000000000000000

The scheduler prefers not to put the ldr immediately after the adrp because of the register dependency on x8 with latency 1.

foad added a comment.Jan 9 2020, 3:59 AM

The diff in llvm/test/CodeGen/X86/testb-je-fusion.ll is:

        movl    %edi, %eax
-       addl    $-512, %eax             # imm = 0xFE00
        movb    $1, (%rsi)
+       addl    $-512, %eax             # imm = 0xFE00
        je      .LBB2_2
...
        movl    %edi, %eax
-       decl    %eax
        movb    $1, (%rsi)
+       decl    %eax
        je      .LBB3_2

The scheduler prefers not to put the addl/decl immediately after the first movl because of the register dependency on eax with latency 1.

foad added a comment.Jan 9 2020, 4:10 AM

Also, do you have any performance/codesize numbers for impacted targets?

I did some performance testing of graphics shaders on the AMDGPU target. Out of ~350 shaders the average performance changed by less than 0.01%, only 5 shaders changed by more than 1%, and the biggest swings were -4%/+3%. So I think there is no significant overall trend.

I would welcome any help with performance testing on AArch64/PowerPC/X86.

fhahn added a comment.Jan 9 2020, 4:12 AM

The diff in llvm/test/CodeGen/X86/testb-je-fusion.ll is:

        movl    %edi, %eax
-       addl    $-512, %eax             # imm = 0xFE00
        movb    $1, (%rsi)
+       addl    $-512, %eax             # imm = 0xFE00
        je      .LBB2_2
...
        movl    %edi, %eax
-       decl    %eax
        movb    $1, (%rsi)
+       decl    %eax
        je      .LBB3_2

The scheduler prefers not to put the addl/decl immediately after the first movl because of the register dependency on eax with latency 1.

Great. My main point about the tests was that I think it would be good to have a dedicated small test that just checks on rod the improved cases, with a clear explanation, to guard against regressions by other scheduler changes with a lot of test changes

foad updated this revision to Diff 237332.Jan 10 2020, 7:58 AM

New tests for this specific fix:
test/CodeGen/PowerPC/botheightreduce.mir
test/CodeGen/PowerPC/topdepthreduce-postra.mir
test/CodeGen/X86/topdepthreduce-postra.mir

Unit tests: pass. 61747 tests passed, 0 failed and 780 were skipped.

clang-tidy: pass.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

foad updated this revision to Diff 238215.Jan 15 2020, 3:58 AM

Rebase.

foad added a reviewer: fhahn.Jan 15 2020, 3:59 AM

Unit tests: pass. 61887 tests passed, 0 failed and 782 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

foad added a comment.Feb 11 2020, 11:23 AM

Ping! @fhahn do you have any further comments?

arsenm accepted this revision.Jul 16 2020, 8:56 AM
This revision is now accepted and ready to land.Jul 16 2020, 8:56 AM
fhahn accepted this revision.Jul 16 2020, 9:03 AM

Looks reasonable, thanks for adding the MIR test as well. Sorry for the delay, it dropped off my radar.

llvm/lib/CodeGen/MachineScheduler.cpp
2685

It might be good to add a comment here explaining what we are checking here. (same as below).

This revision was automatically updated to reflect the committed changes.