Page MenuHomePhabricator

[Scheduling] Improve memory ops cluster preparation
Needs ReviewPublic

Authored by qiucf on Feb 12 2020, 9:47 PM.

Details

Reviewers
fhahn
evandro
arsenm
foad
Group Reviewers
Restricted Project
Summary

SUnits in ScheduleDAGInstrs will be divided into several groups by their ctrl pred node number. Scheduler tries to build possible cluster edges inside each group. However, there're some units with no ctrl preds.

This patch

  1. Add these preds to each group before clusterNeighboringMemOps, to catch more cluster opportunities.
  2. Add a bitmap to mark units already clustered, remove them before clusterNeighboringMemOps

This impacts several test cases, by comparing scheduling log, we can find more cluster edges added and they are scheduled together.

Diff Detail

Unit TestsFailed

TimeTest
1,290 msMLIR.mlir-cpu-runner::bare_ptr_call_conv.mlir
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/bare_ptr_call_conv.mlir -convert-loop-to-std -convert-std-to-llvm='use-bare-ptr-memref-call-conv=1' | mlir-cpu-runner -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so -entry-point-result=void | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/bare_ptr_call_conv.mlir
1,150 msMLIR.mlir-cpu-runner::linalg_integration_test.mlir
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/linalg_integration_test.mlir -convert-linalg-to-llvm | mlir-cpu-runner -e dot -entry-point-result=f32 -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas.so,/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas_interface.so | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/linalg_integration_test.mlir
1,080 msMLIR.mlir-cpu-runner::simple.mlir
Script: -- : 'RUN: at line 1'; mlir-cpu-runner /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/simple.mlir | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/simple.mlir
1,280 msMLIR.mlir-cpu-runner::unranked_memref.mlir
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/unranked_memref.mlir -convert-linalg-to-loops -convert-linalg-to-llvm -convert-std-to-llvm | mlir-cpu-runner -e main -entry-point-result=void -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so,/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas.so,/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas_interface.so | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/unranked_memref.mlir
1,290 msMLIR.mlir-cpu-runner::utils.mlir
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/utils.mlir -convert-linalg-to-loops -convert-linalg-to-llvm -convert-std-to-llvm | mlir-cpu-runner -e print_0d -entry-point-result=void -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/utils.mlir --check-prefix=PRINT-0D

Event Timeline

qiucf created this revision.Feb 12 2020, 9:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 12 2020, 9:47 PM
qiucf edited the summary of this revision. (Show Details)Feb 12 2020, 9:54 PM
qiucf updated this revision to Diff 244339.Feb 12 2020, 10:33 PM

Fix build failure

rampitec added inline comments.Feb 12 2020, 11:11 PM
llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
494

This is a regression I guess. A memory operation should always go before an independent ALU as it has higher latency.

524

And then loads are preferably go before stores. Loads have higher latency and their results needs to be consumed by some other instruction. So it looks like a regression to me either.

llvm/test/CodeGen/AMDGPU/captured-frame-index.ll
55

And then this is a progression, as two stores are scheduled together. It would be nice to understand if they were clustered or is it a coincidence.

71

Same here. It seems to help store clustering.

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
64

Stores are clustered again, here and below, which is nice improvement.

foad added inline comments.Feb 13 2020, 1:24 AM
llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll
12

This looks suspicious. Why has it stopped clustering the two loads from x0, and the two loads from x1?

foad added a comment.Feb 13 2020, 1:30 AM

Thanks for working on this. I have to admit I have never understood the store chain stuff in BaseMemOpClusterMutation::apply. Can you point me at any documentation that explains what a store chain is, and why it's based on control dependence, and why it's useful for clustering?

Is this approach generally sound? I worry that you may get circular dependencies, e.g.: A has no control dependency, but B has a control dependency on A (possibly indirectly). Clustering introduces new artificial dependencies, which could potentially lead to A becoming dependent on an SUnit C that depends on B, creating a circular dependency.

qiucf marked 2 inline comments as done.Feb 16 2020, 7:06 PM
qiucf added inline comments.
llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll
12

AArch64's shouldClusterMemOps requires the offset should be N and N+1. Here is 2 and 6. They are not clustered in the log.

But the scheduler really has some issues about clustering. It doesn't know SU(13) and SU(15) can't be scheduled together (SU(14) must be in). This can be fixed by not creating this edge when SUa and SUb can't be scheduled together. We can do it in future patches.

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
494

Yes. This can and another regression in this file will be eliminated if we prevent cluster edge from being created when they can't be scheduled together, as above says.

qiucf added a subscriber: atrick.Feb 19 2020, 6:49 AM

Thanks for working on this. I have to admit I have never understood the store chain stuff in BaseMemOpClusterMutation::apply. Can you point me at any documentation that explains what a store chain is, and why it's based on control dependence, and why it's useful for clustering?

Per my understanding (maybe not correct), ‘chains’ are used to describe some kinds of dependency other than use-def, such as memory operations. (This discussion is a good reference) So it’s natural for the chains to become anti, output, or order dependencies.

About why it's useful for clustering, I guess what apply does is a cheap but not perfect way to eliminate 'impossible' cluster pairs (like SU(3) clustered with SU(5) but SU(4) is a barrier), since if two units has the same pred-dep, they're likely to be able to neighbor. Actually, as I roughly change if (Pred.isCtrl() && !Pred.isArtificial()) to if (Pred.isNormalMemoryOrBarrier()), all check-llvm tests passed. Since this method was written by @atrick several years ago and its core logic haven't changed, it's sometimes also confusing to me :) There're some other issue with this implementation, nevertheless, they may save compiling time.

Is this approach generally sound? I worry that you may get circular dependencies, e.g.: A has no control dependency, but B has a control dependency on A (possibly indirectly). Clustering introduces new artificial dependencies, which could potentially lead to A becoming dependent on an SUnit C that depends on B, creating a circular dependency.

Currently (before D72031 lands), after cluster edges created, only succ-deps (assume that's C) of A will depend on B. If we want circular dependency to happen, B has to directly depends on C or depends on D which depends on C. But we calls Topo.IsReachable before adding edges and that would be detected. (A can't depends on C because C depends on A) This seems not related to whether A or B has control dependencies. Am I right?

+->B+---->A<-+
|  +         |
|  +>D+      |
|     |      |
|     v      |
+----+C+-----+
qiucf updated this revision to Diff 245570.Feb 19 2020, 8:25 PM

Update to rebase and reflect new test changes.