Page MenuHomePhabricator

[AMDGPU] Do not raise wave priority beyond a specific number of VALU instructions.
Needs ReviewPublic

Authored by kosarev on Apr 29 2022, 6:14 AM.

Details

Summary

It has been discovered that lowering the priority after VMEM loads
following long sequences of VALU instructions does not improve
performance.

Diff Detail

Unit TestsFailed

TimeTest
60,120 msx64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -shared-libasan -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxDynamicConfig/TestCases/Output/scariness_score_test.cpp.tmp
60,110 msx64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/scariness_score_test.cpp.tmp

Event Timeline

kosarev created this revision.Apr 29 2022, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 6:14 AM
kosarev requested review of this revision.Apr 29 2022, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 6:14 AM
tsymalla added inline comments.Apr 29 2022, 7:30 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
119

You could early opt-out when VALUInstsThreshold == 0 (at the beginning), is that correct?

arsenm added inline comments.Apr 29 2022, 10:58 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

Should we be counting cycles instead of instructions?

124

You're counting VALU instructions here and above?

191–192

Why not construct directly at the insert point?

kosarev updated this revision to Diff 426646.May 3 2022, 4:40 AM

Updated as suggested.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

We don't know it yet, I'm afraid. For the couple real use cases that we have for the issue that this pass tries to address counting instructions looks sufficient. This being coupled with that proper cycle counting might be not very trivial, we may be at the risk of over-engineering here.

119

We could, but I'm not sure I know how that might be useful in practice.

124

Yes, that's not good. Combined the two new loops into one and simplified related code. Thanks.

191–192

Well, the instruction we create here is not just an auxiliary value that we coincidentally happen to use in both the cases, if that's what you mean. That instruction must be spent in all cases and we do want it be the same instruction.

arsenm added inline comments.May 17 2022, 2:46 PM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
191–192

But you can construct it in the right place rather than constructing the instruction and then inserting after. You have the insert point you want, you can just construct it there? It's unusual to need insert/insertAfter

kosarev added inline comments.May 18 2022, 1:01 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
191–192

I'm not sure I see how this answers the point. Granted that constructing the instruction at the points of insertion is possible, but as I said in this case we have reasons to create it in a single place.

foad added a comment.May 18 2022, 2:45 AM

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
111

If you're ignoring loops, can you described succinctly what you *are* counting? Is it something like the minimum number of VALU instructions along any path from the start of the function to the VMEM load in question?

foad added inline comments.May 18 2022, 2:50 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
191–192

I agree with Matt that it would be more normal to construct-and-insert at the same time, though I don't feel very strongly about it. I think I suggested before that you could pass the insertion point into BuildSetprioMI. In this case the insertion point would be something like MBBInfos[MBB].LastVMEMLoad ? std::next(MBBInfos[MBB].LastVMEMLoad) : MBB->begin().

kosarev updated this revision to Diff 430382.May 18 2022, 7:54 AM

Reworked as suggested.

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

Right, that's the hardest part of it, isn't it. Because we don't have much use cases provided, the idea was to start with replicating more or less the same counting logic we have in the other compiler, and then adjust things using feedback from people who can give it some proper testing.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
111

Good point. Amended the comment.

191–192

OK, changed.