This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Reduce the duration of whole-quad-mode
ClosedPublic

Authored by nhaehnle on Jul 7 2016, 7:34 AM.

Details

Summary

This contains two changes that reduce the time spent in WQM, with the
intention of reducing bandwidth required by VMEM loads:

  1. Sampling instructions by themselves don't need to run in WQM, only their coordinate inputs need it (unless of course there is a dependent sampling instruction). The initial scanInstructions step is modified accordingly.
  1. When switching back from WQM to Exact, switch back as soon as possible. This affects the logic in processBlock.

This should always be a win or at best neutral.

There are also some cleanups (e.g. remove unused ExecExports) and some new
debugging output.

Diff Detail

Event Timeline

nhaehnle updated this revision to Diff 63071.Jul 7 2016, 7:34 AM
nhaehnle retitled this revision from to AMDGPU: Reduce the duration of whole-quad-mode.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
nhaehnle updated this revision to Diff 64245.Jul 17 2016, 3:10 AM

Rebase on top of recent changes.

nhaehnle updated this revision to Diff 65493.Jul 26 2016, 4:16 AM

Rebase on top of D22673 and D22675, to make it easier for those fixes to
still go into the 3.9 release branch.

arsenm added inline comments.Jul 26 2016, 4:37 PM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
156–168

I think it would be simpler to just write to the ostream directly instead of building a string and printing that

172–173

Can be combined into one statement

182–183

Ditto

184

Single quotes

228–231

This isn't actually true in the case of SCC. It's rare but possible to come up with test cases that break this. A uniform branch with a use of the same i1 value in a later block should do it if you want to try to break things

nhaehnle added inline comments.Jul 27 2016, 7:14 AM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
228–231

Hmm, I was going off some comments that I saw in one of the Live*.h analyses.

Do you have a sample of that? I tried a naive

define amdgpu_ps float @test(i32 inreg %a) nounwind {
entry:
  %cc = icmp ugt i32 %a, 5
  br i1 %cc, label %if, label %next
if:
  br label %next
next:
  %v = phi float [ 2.0, %if ], [ 0.0, %entry ]
  br i1 %cc, label %if2, label %end
if2:
  br label %end
end:
  %r = phi float [ 3.0, %if2 ], [ %v, %next ]
  ret float %r
}

and I don't get SCC reuse:

test:                                   ; @test
; BB#0:                                 ; %entry
      v_mov_b32_e32 v0, 0
      s_cmp_lt_u32 s0, 6
      s_cbranch_scc1 BB0_2
; BB#1:                                 ; %if
      v_mov_b32_e32 v0, 2.0
BB0_2:                                  ; %next
      s_cmp_lt_u32 s0, 6
      s_cbranch_scc1 BB0_4
; BB#3:                                 ; %if2
      v_mov_b32_e32 v0, 0x40400000
BB0_4:                                  ; %end
      ; return
nhaehnle updated this revision to Diff 66780.Aug 4 2016, 3:02 AM
nhaehnle marked 4 inline comments as done.

Rebase and address most review comments.

Any ideas about the SCC question?

nhaehnle updated this revision to Diff 66781.Aug 4 2016, 3:05 AM

Forgot to update the skip-if-dead.ll test.

arsenm accepted this revision.Sep 1 2016, 10:46 AM
arsenm edited edge metadata.

LGTM

lib/Target/AMDGPU/SIWholeQuadMode.cpp
228–231

I thought I had an example similar to this a long time ago. Is duplicating the s_cmp an optimization that started working relatively recently? I guess you can ignore this for now, maybe you can do something with inline asm but it seems to be pretty broken with scc right now

478

This looks like a contextless instruction printing which probably isn't helpful

This revision is now accepted and ready to land.Sep 1 2016, 10:46 AM
This revision was automatically updated to reflect the committed changes.