This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] New Waitcnt Insertion Pass
ClosedPublic

Authored by kanarayan on Mar 20 2017, 4:49 PM.

Details

Summary

This pass implements the algorithm deployed by our internal compiler for inserting waitcnt instructions. The pass performs cross basic-block analysis and tracks individual registers, and provides predicted performance improvements over the current implementation.

There are further improvements forthcoming, including relaxing overtly conservative assumptions about LDS access, integration of memory model pass, and more targeted tests for the corners.

Diff Detail

Event Timeline

kanarayan created this revision.Mar 20 2017, 4:49 PM
t-tye added a subscriber: t-tye.Mar 22 2017, 6:40 PM
tony-tye removed a subscriber: tony-tye.Mar 22 2017, 6:47 PM
kzhuravl edited edge metadata.Mar 23 2017, 11:17 AM

@rampitec wrote:
Please add test with s_barrier.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
75

@rampitec wrote:
Should not we get these numbers from TD files for target which we already have?

1018

@rampitec wrote:
For a barrier it will always insert strongest:

s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

regardless of what was an argument of the barrier.

This also seems to completely ignore atomic fences inserted around the barrier from the library, which shall be a real source of wait argument. Note, that semantics of needWaitcntBeforeBarrier() is not that we always need to insert wait with barrier, but that we may need to insert it.

Also note that existing pass does not seem to do it for a barrier.

1614

@rampitec wrote:
Need to check for XNACK support.

rampitec edited edge metadata.Mar 23 2017, 11:19 AM

Please add test with s_barrier.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
75

Should not we get these numbers from TD files for target which we already have?

1018

For a barrier it will always insert strongest:

s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

regardless of what was an argument of the barrier.

This also seems to completely ignore atomic fences inserted around the barrier from the library, which shall be a real source of wait argument. Note, that semantics of needWaitcntBeforeBarrier() is not that we always need to insert wait with barrier, but that we may need to insert it.

Also note that existing pass does not seem to do it for a barrier.

1614

Need to check for XNACK support.

kzhuravl added inline comments.Mar 23 2017, 11:23 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1018

Atomics are currently handled in a separate pass, which we determined to be conservatively correct. We plan to integrate atomics into waitcnt insertion but later and in a separate patch.

The *needWaitcntBeforeBarrier* tells you whether you need a waitcnt before the barrier or not. For >=GFX9, waitcnt is automatically inserted before the barrier, so we do not need to generate it, and *needWaitcntBeforeBarrier* returns false for >=GFX9.

Also note that existing pass does not seem to do it for a barrier.

https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AMDGPU/SIInsertWaits.cpp#L633

How about waitcnt operands in that test?

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1018

Barrier itself does not need fence. OpenCL barrier needs fence, but these are generated in the library.

kanarayan updated this revision to Diff 92883.Mar 23 2017, 5:28 PM

This addresses two issues:

  1. SQ_MAX_PGM_VGPRS and other constants are used to map the llvm register map to a data structure internal to this algorithm. For the register ampping used by the algorithm see the comments before the enum. They are maximum values across all targets. It is ideal to use dynamically sized arrays that fit the particular architecture target. As an interim step, I have changed these constants to enum values, asserted that no target has a larger register file in the main entry to this pass. (In response to comments from Tony and Konstantin) I have also updated the getRegInterval call. Please notice that the previous version was identical to the pass used by the old pass except I do the necessary adjustments for the mapping used by this algorithm. In a next step, I will remove the assert and the associated code.
  2. Barriers no longer force a zero waitcnt. For GFX9 and above, barrier needs no additional waitcnt. For lesser targets, waitcnts are added only if needed. The following tests already test barrier: LLVM :: CodeGen/AMDGPU/addrspacecast.ll LLVM :: CodeGen/AMDGPU/array-ptr-calc-i32.ll LLVM :: CodeGen/AMDGPU/ds-negative-offset-addressing-mode-loop.ll LLVM :: CodeGen/AMDGPU/ds_read2.ll LLVM :: CodeGen/AMDGPU/indirect-private-64.ll LLVM :: CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll LLVM :: CodeGen/AMDGPU/local-memory.amdgcn.ll LLVM :: CodeGen/AMDGPU/merge-stores.ll LLVM :: CodeGen/AMDGPU/schedule-vs-if-nested-loop-failure.ll LLVM :: CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll LLVM :: CodeGen/AMDGPU/store-barrier.ll LLVM :: CodeGen/AMDGPU/wait.ll

There is also a CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll

This patch does not yet address the XNACK changes in https://reviews.llvm.org/D30302

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1006

Not needed anymore.

rampitec added inline comments.Mar 24 2017, 3:08 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
819

I believe there is no SI_RETURN anymore, it is renamed.

kanarayan updated this revision to Diff 93059.Mar 25 2017, 5:28 PM
  1. Rename SI_RETURN to SI_RETURN_TO_EPILOG to reflect recent change.
  2. Condition the nop insertion to break soft clauses on isXNACKEnabled(). Add a note to also condition this code on hasSoftClauses when that code is put back.
  3. Fix tests added since last patch to reflect the new pass.
  4. Remove experimental code.
kanarayan updated this revision to Diff 93060.Mar 25 2017, 6:22 PM

Fix the last patch that inadvertently deleted "amdgpu_kernel" added recently.

  1. Rename SI_RETURN to SI_RETURN_TO_EPILOG to reflect recent change.
  2. Condition the nop insertion to break soft clauses on isXNACKEnabled(). Add a note to also condition this code on hasSoftClauses when that code is put back.
  3. Fix tests added since last patch to reflect the new pass.
  4. Remove experimental code.
kzhuravl requested changes to this revision.Mar 28 2017, 10:13 AM
  • Your pass does not take into account encoding changes for gfx9.
  • New pass name is confusing.
  • Make a note somewhere that the new pass is on by default, and you can switch it off by such and such option.
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1070–1072

Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)

1099–1100

Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)

This revision now requires changes to proceed.Mar 28 2017, 10:13 AM
kanarayan updated this revision to Diff 93299.Mar 28 2017, 2:25 PM
kanarayan edited edge metadata.

Use the architecture neutral interfaces to decode wait counts.

In response to the other comments, please note:

  1. The old pass will be removed shortly. I agree the pass names are close enough to cause some confusion in the interim, but the names are apt.
  2. Please refer to the comment here below (from AMDGPUTargetMachine.cpp) that indicates the new pass is the default and with -enable-si-insert-waitcnts=0 one can revert to the old pass.

// Option to enable new waitcnt insertion pass.
static cl::opt<bool> EnableSIInsertWaitcntsPass(

"enable-si-insert-waitcnts",
cl::desc("Use new waitcnt insertion pass"),
cl::init(true));
arsenm added inline comments.Mar 28 2017, 2:39 PM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
3

Extra blank line

89–92

This should be replaced with an iterator range or simply removed

511–516

Can't you just get the ::data operand and check if it is a use or def? Same for the other places checking getAtomicNoRetOp

761–763

C++ style comments

780

nullptr

786

Linemapping->LineMapping

787

MI.isDebugValue()

930

I'm concerned by relying on the memoperands since it's possible they were dropped. The uses LDS memory check could at least be factored into a predicate function

969–972

Ditto

1141–1147

This should check the gds operand, not the mem operands. This also needs MIR tests since we don't emit GDS operations now

1194

getNamedOperand

1609–1610

Braces around the block and addImm on next line

1688

Use BuildMI rather than the low level instruction creation APIs

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

It can be in the internal repo, but we need these tests to be sure we are producing right fences for OpenCL barriers.

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

It can be in the internal repo, but we need these tests to be sure we are producing right fences for OpenCL barriers.

that is exactly what i meant by my previous comment (tests should go to amd-common branch).

kanarayan updated this revision to Diff 93410.Mar 29 2017, 3:16 PM

In response to review comments from arsenm, the following updates/responses:

The following updates to the code:

  1. Removed Extra blank line.
  2. Removed macro defining iterator on enum and replace it with inline code.
  3. Replace with C++ style comments.
  4. Use nullptr.
  5. Replace Linemapping with LineMapping.
  6. Use MI.isDebugValue()
  7. Use getNamedOperand
  8. Braces around the block (but clang-format likes addImm on same line).

The following will be addressed in a subsequent put-back.

  1. LDS access check & GDS tests: The code can be simplified further as suggested. The generated code is overly conservative but correct.
  2. Uses of getAtomicNoRetOp: The code is looking for atomic operations that do not return. The atomic operations that do return are covered by the code that look for stores. Again, this is conservatively correct as is written, but needs to be fixed.

CreateMachineInstr - BuildMI was unusable in this context. Besides, X86 and some other code written for our target also use CreateMachineInstr.

kanarayan updated this revision to Diff 94310.Apr 5 2017, 6:20 PM
kanarayan marked 19 inline comments as done.

The update includes the following after rebasing:

Re-organize the getRegInterval as follows:

  1. Support the registers covered under isSGPRReg that are allocatable.
  2. Assert that the register numbers are within bounds.

Check the GDS bit and avoid the loop on the memory operands. For DS operations, additionally check they are either load or a store.

Fix the LGKM bit for ds_bpermute, ds_permute operations that do not access DS.

Add comments on TODO items to the code. The following are some of the other items:

  1. We currently list all !VGPR registers under isSGPRReg that causes confusion.
  2. There are waintcnt flags/bits which do not appear to be consistently set (example: ds_swizzle). I need to verify the flag bits are set property and then could use the bits in the new pass.
kanarayan updated this revision to Diff 94311.Apr 5 2017, 6:29 PM

clang-format

t-tye added inline comments.Apr 5 2017, 7:57 PM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
424

Could M0, EXEC, etc. also potentially be a source or dest for SMEM and VMEM load/store?

Perhaps add an assert to ensure they never happen.

arsenm added inline comments.Apr 5 2017, 8:00 PM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
424

Those can't even be encoded for VMEM. M0 isn't in an allocatable class, and exec is reserved so neither one will appear there either. They are also disallowed by the operand constraints so will be a verifier error

rampitec requested changes to this revision.Apr 7 2017, 6:48 PM

According to the showed ISA for barrier with global and local memory fence this pass produces incorrect code. Please fix.

This revision now requires changes to proceed.Apr 7 2017, 6:48 PM

When applied to current trunk, this code can get into an infinite loop. Please see the sample LLVM IR at https://paste.debian.net/926533/

When compiled with llc -march=amdgcn -mcpu=bonaire, the new pass gets into an infinite loop. When compiled for Tonga, no infinite loop is encountered.

kanarayan updated this revision to Diff 94775.Apr 10 2017, 8:36 PM
kanarayan edited edge metadata.

Fix the issue related to barrier.

kanarayan updated this revision to Diff 94868.Apr 11 2017, 12:20 PM

Fix the looping issue found with the GL input. The bug only triggers for some targets under some conditions. This is due to an earlier fix for a VCCZ bug; that code needs to go away once the scheduler is fixed to handle this scenario. I need to add a lit test.

kanarayan updated this revision to Diff 94929.Apr 11 2017, 6:59 PM

This update includes all the fixes so far, but keeps the old pass the default. We would like to get broader test coverage under the option first, and then add the tests and turn the new pass on by default.

t-tye accepted this revision.Apr 11 2017, 7:02 PM

LGTM

rampitec accepted this revision.Apr 11 2017, 9:49 PM
rampitec resigned from this revision.Apr 24 2017, 11:37 AM
This revision now requires changes to proceed.Apr 24 2017, 11:37 AM
kzhuravl resigned from this revision.Apr 24 2017, 12:33 PM
This revision is now accepted and ready to land.Apr 24 2017, 12:33 PM
arsenm closed this revision.Jan 27 2020, 12:10 PM
arsenm added inline comments.
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
569–580

This can be deleted