This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Speedup GCNDownwardRPTracker::advanceBeforeNext
ClosedPublic

Authored by vpykhtin on Oct 19 2022, 9:59 AM.

Details

Summary

The function makes liveness tests for the entire live register set for every instruction it passes by.
This becomes very slow on high RP regions such as ASAN enabled code.

Instead only uses of last tracked instruction should be tested and this greatly improves compilation time.

This patch revealed few bugs in SIFormMemoryClauses and PreRARematStage::sinkTriviallyRematInsts which should
be fixed first.

Diff Detail

Event Timeline

vpykhtin created this revision.Oct 19 2022, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2022, 9:59 AM
vpykhtin requested review of this revision.Oct 19 2022, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2022, 9:59 AM
arsenm added inline comments.Oct 19 2022, 10:11 AM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
331

Can you defer this until after skipDebugInstructionsForward?

363

I don't understand why LiveRegs is a map in the first place

vpykhtin added inline comments.Oct 19 2022, 11:44 AM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
363

The map consists of pairs register:lanemask

arsenm added inline comments.Oct 20 2022, 2:29 PM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
331

I guess that doesn't make sense.

363

Yes, but tracking it this way doesn't really make sense to me. Does this really need to be erased?

vpykhtin added inline comments.Oct 20 2022, 7:26 PM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
331

Well it can save some time at the end of the block

363

I'm not sure: do you mean leaving zero lanemask instead of erasing?

arsenm added inline comments.Oct 20 2022, 8:06 PM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
363

Yes (although I guess you aren't really changing the behavior here)

vpykhtin marked an inline comment as done.Oct 20 2022, 11:55 PM
vpykhtin added inline comments.
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
363

It would require checking all dependent code and I don't think it makes big impact as a register is deleted by the iterator

arsenm accepted this revision.Oct 21 2022, 8:27 AM

LGTM (still not sure about the debug instr handling)

This revision is now accepted and ready to land.Oct 21 2022, 8:27 AM
vpykhtin updated this revision to Diff 479345.Dec 1 2022, 10:16 AM
vpykhtin marked an inline comment as done.
  • Rebased.
  • Replaced collectVirtualRegUses with operand walk.
  • Replaced asserts with llvm_unreachable.
arsenm added inline comments.Dec 1 2022, 10:18 AM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
345–346

Might as well use SmallSetVector?

vpykhtin added inline comments.Dec 1 2022, 10:36 AM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
345–346

It's only used to avoid visiting already seen registers, so this is more like a set, not vector. I'm just not sure if it worth to use some kind of hashing sets here because the number of registers is usually small and it seems it may be faster to use just plain vector. I don't have strong opinion here, however.

arsenm added inline comments.Dec 1 2022, 10:37 AM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
345–346

Then SmallSet or SmallSet vector. Both do the linear scan when the set is small

arsenm accepted this revision.Dec 1 2022, 10:37 AM
vpykhtin updated this revision to Diff 479372.Dec 1 2022, 11:34 AM
  • Replaced SmallVector with SmallSet.
This revision was automatically updated to reflect the committed changes.
foad added a comment.Dec 2 2022, 2:24 AM

I'm now hitting an UNREACHABLE "register isn't live" in this somewhat reduced test case. Can you investigate/fix/revert as appropriate please? Thanks!

; RUN: llc -march=amdgcn -mcpu=gfx900 < %s

declare void @llvm.amdgcn.kill(i1)
declare <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 immarg, float, float, <8 x i32>, <4 x i32>, i1 immarg, i32 immarg, i32 immarg)
declare <2 x half> @llvm.amdgcn.cvt.pkrtz(float, float)
declare void @llvm.amdgcn.exp.compr.v2f16(i32 immarg, i32 immarg, <2 x half>, <2 x half>, i1 immarg, i1 immarg)

define amdgpu_ps void @_amdgpu_ps_main(float %arg) {
bb:
  %i = fcmp olt float %arg, 0.000000e+00
  br i1 %i, label %bb5, label %bb1

bb1:
  %i2 = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float 0.000000e+00, float %arg, <8 x i32> zeroinitializer, <4 x i32> zeroinitializer, i1 false, i32 0, i32 0)
  %i3 = extractelement <4 x float> %i2, i64 1
  %i4 = extractelement <4 x float> %i2, i64 0
  br label %bb6

bb5:
  call void @llvm.amdgcn.kill(i1 false)
  br label %bb6

bb6:
  %i7 = phi float [ 0.000000e+00, %bb5 ], [ %i3, %bb1 ]
  %i8 = phi float [ 0.000000e+00, %bb5 ], [ 1.000000e+00, %bb1 ]
  %i9 = phi float [ undef, %bb5 ], [ %i4, %bb1 ]
  %i10 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float 0.000000e+00, float %i7)
  %i11 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %i8, float %i9)
  call void @llvm.amdgcn.exp.compr.v2f16(i32 0, i32 0, <2 x half> %i10, <2 x half> %i11, i1 false, i1 false)
  ret void
}
vpykhtin reopened this revision.Mar 2 2023, 9:30 AM

This patch depends on https://reviews.llvm.org/D136918 fix.

This revision is now accepted and ready to land.Mar 2 2023, 9:30 AM
vpykhtin updated this revision to Diff 503732.Mar 9 2023, 4:33 AM
  • added regression test.
  • rebased after the fix a999669982d0: [AMDGPU] Scheduler: fix RP calculation for a MBB with one successor
arsenm added inline comments.Mar 9 2023, 4:42 AM
llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll
3

Include generated checks?

arsenm accepted this revision.Mar 9 2023, 4:53 AM

LGTM with test checks

vpykhtin updated this revision to Diff 503737.Mar 9 2023, 5:20 AM
  • added generated checks.
vpykhtin marked an inline comment as done.Mar 9 2023, 5:21 AM
This revision was automatically updated to reflect the committed changes.