This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions
ClosedPublic

Authored by tstellarAMD on Oct 26 2016, 9:53 AM.

Details

Summary

Flat instruction can return out of order, so we need always need to wait
for all the outstanding flat operations.

Diff Detail

Repository
rL LLVM

Event Timeline

tstellarAMD retitled this revision from to AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions.
tstellarAMD updated this object.
tstellarAMD added reviewers: arsenm, tony-tye.
tstellarAMD added a subscriber: llvm-commits.
arsenm added inline comments.Oct 26 2016, 12:23 PM
lib/Target/AMDGPU/SIInsertWaits.cpp
300–301 ↗(On Diff #75913)

Is this too strict? I know the manual says something like the only sensible value to use is 0 ,but from the reasoning before it it sounds like that's only if accessing a generic address. We could check the MMO and see if it is really global which is the common case

Only treat flat operations as unordered if they access the flat address space.

arsenm accepted this revision.Oct 27 2016, 5:07 PM
arsenm edited edge metadata.

LGTM with test fixed

test/CodeGen/MIR/AMDGPU/waitcnt.mir
50 ↗(On Diff #76033)

This looks like it has a mem operand although the comment on the check line says it doesn't

This revision is now accepted and ready to land.Oct 27 2016, 5:07 PM
test/CodeGen/MIR/AMDGPU/waitcnt.mir
50 ↗(On Diff #76033)

The first load is the one without the mem operand, I can clarify this in the comment.

This revision was automatically updated to reflect the committed changes.
tony-tye edited edge metadata.Nov 2 2016, 12:01 AM

Feedback on overall pass.

lib/Target/AMDGPU/SIInsertWaits.cpp
54 ↗(On Diff #75913)

Is there a named constant for the maximum register number rather than using 512?

88 ↗(On Diff #75913)

Would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE and perhaps have an enumeration that is used.

Would need to add 4 for GDS when supported.

137 ↗(On Diff #75913)

instrucitons -> instructions

197 ↗(On Diff #75913)

Only GFX6 uses exp_cnt for stores.

Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at lease one instruction between the store and the next write of the register.

So EXP_CNT property should only be put on M*BUF instructions for GFX6 and not later.

300–301 ↗(On Diff #75913)

If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used in 64 bit to access the global address space. So wonder if should also check the address space of the operation and only do this if the address space is FLAT (and not when GLOBAL)?

316 ↗(On Diff #75913)

Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported.

320–325 ↗(On Diff #75913)

This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere.

The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order.

326–331 ↗(On Diff #75913)

Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is being supported? We now have a subtarget feature to indicate that so should that be used here instead of checking the generation? So should this NOP insertion only be done when the XNACK feature is enabled?

377–378 ↗(On Diff #75913)

This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that are to generic address space, and record in the DefinedRegs of the destination as maxint. That would allow non-0 vmcnt for using registers produced by non flat instructions (it would be conservative as the value would assume the flat may have completed early), and 0 for the result of the flat instruction.

381 ↗(On Diff #75913)

Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions?

97 ↗(On Diff #76033)

Given that this is only for flat instructions that can complete early, not all flat, should this be renamed?

193 ↗(On Diff #76033)

Are BUFFER_CACHE_INV* marked as updating vmcnt?
Are FLAT* marked as updating vmcnt?
Are GDS instructions marked as lgkmcnt and expcnt?
GDS needs waitcnt 0 before EXEC can be updated.

199 ↗(On Diff #76033)

// LGKM counters may be incremented by more than 1.

200 ↗(On Diff #76033)

Are S_DCACHE_* marked as updating lgkmcnt?
Are FLAT* marked as updating lgkmcnt?

202 ↗(On Diff #76033)

This check should also apply to scalar writes.

216 ↗(On Diff #76033)

The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter.

224 ↗(On Diff #76033)

Why is this needed as Result is initialized to 0?

241 ↗(On Diff #76033)

also GDS

246 ↗(On Diff #76033)

Are any source operands that are registers with a counter value that has not yet been satisfied (ie counter < value already waitedOn)?

268–275 ↗(On Diff #76033)

Only GFX6 requires to use the expcnt to determine if the input value is InUse.
There is also a hardware hazard if input is larger than N dwords which requires M instructions before register can be used as a destination (is that hazard checked?).

333 ↗(On Diff #76033)

Should this also include scalar writes?

333–337 ↗(On Diff #76033)

Why is this if nested inside the enclosing if? Seems tracking the lastOpcodeType should be done regardless of breaking the soft clauses for consistency.

341 ↗(On Diff #76033)

Add another bit for GDS.
Exports are kept in order only within each export type (color/null, position, parameter cache) so need separate bits.

359 ↗(On Diff #76033)

For UsedRegs only the expcnt needs to be waited on before the register is available. For VMEM store in GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkm willbe present in Limit. So should this just update the expcnt?

382 ↗(On Diff #76033)

Should this be != 3? If both are seen then it will be 3, so 3 means it is NOT ordered, not that it IS ordered?
If adding other bits for GDS and export types then better to use BitCount(ExpInstrTypesSeen ) == 1

385 ↗(On Diff #76033)

Currently the LGKM counter is always assumed unordered but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT) and then can use non-0 waitcnt when only a single class of instructions have been seen since the last waitcnt for LGKM. This would potentially benefit the DS_* instructions greatly.

401 ↗(On Diff #76033)

If Required is 0 then no wait is needed on this counter so Value should be set to Hardware limit. Only if Required is non 0 does it mean that there is an instruction in this BB that we must wait on.

459 ↗(On Diff #76033)

Need delayed waitcnt if Counts is trying to wait on an instruction after the WaitedOn. So this should be:

if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i])

487 ↗(On Diff #76033)

Only the expcnt should be considered. When UseRegs is set it includes all counters and we do not need to wait for a GFX6 store to complete before being able to use the source register. That only has to be waited for before using the destination register.

501 ↗(On Diff #76033)

Should this be a target feature?

558 ↗(On Diff #76033)

Seem better if this was a target feature that was tested.