Flat instruction can return out of order, so we need always need to wait
for all the outstanding flat operations.
Details
Diff Detail
- Build Status
Buildable 800 Build 800: arc lint + arc unit
Event Timeline
lib/Target/AMDGPU/SIInsertWaits.cpp | ||
---|---|---|
300–301 | Is this too strict? I know the manual says something like the only sensible value to use is 0 ,but from the reasoning before it it sounds like that's only if accessing a generic address. We could check the MMO and see if it is really global which is the common case |
LGTM with test fixed
test/CodeGen/MIR/AMDGPU/waitcnt.mir | ||
---|---|---|
51 | This looks like it has a mem operand although the comment on the check line says it doesn't |
test/CodeGen/MIR/AMDGPU/waitcnt.mir | ||
---|---|---|
51 | The first load is the one without the mem operand, I can clarify this in the comment. |
Feedback on overall pass.
lib/Target/AMDGPU/SIInsertWaits.cpp | ||
---|---|---|
54 | Is there a named constant for the maximum register number rather than using 512? | |
88 | Would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE and perhaps have an enumeration that is used. Would need to add 4 for GDS when supported. | |
97 | Given that this is only for flat instructions that can complete early, not all flat, should this be renamed? | |
137 | instrucitons -> instructions | |
193 | Are BUFFER_CACHE_INV* marked as updating vmcnt? | |
197 | Only GFX6 uses exp_cnt for stores. Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at lease one instruction between the store and the next write of the register. So EXP_CNT property should only be put on M*BUF instructions for GFX6 and not later. | |
199 | // LGKM counters may be incremented by more than 1. | |
200 | Are S_DCACHE_* marked as updating lgkmcnt? | |
202 | This check should also apply to scalar writes. | |
216 | The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter. | |
224 | Why is this needed as Result is initialized to 0? | |
241 | also GDS | |
246 | Are any source operands that are registers with a counter value that has not yet been satisfied (ie counter < value already waitedOn)? | |
268–275 | Only GFX6 requires to use the expcnt to determine if the input value is InUse. | |
300–301 | If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used in 64 bit to access the global address space. So wonder if should also check the address space of the operation and only do this if the address space is FLAT (and not when GLOBAL)? | |
316 | Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported. | |
320–325 | This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere. The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order. | |
326–331 | Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is being supported? We now have a subtarget feature to indicate that so should that be used here instead of checking the generation? So should this NOP insertion only be done when the XNACK feature is enabled? | |
333 | Should this also include scalar writes? | |
333–337 | Why is this if nested inside the enclosing if? Seems tracking the lastOpcodeType should be done regardless of breaking the soft clauses for consistency. | |
341 | Add another bit for GDS. | |
359 | For UsedRegs only the expcnt needs to be waited on before the register is available. For VMEM store in GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkm willbe present in Limit. So should this just update the expcnt? | |
377–378 | This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that are to generic address space, and record in the DefinedRegs of the destination as maxint. That would allow non-0 vmcnt for using registers produced by non flat instructions (it would be conservative as the value would assume the flat may have completed early), and 0 for the result of the flat instruction. | |
381 | Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions? | |
382 | Should this be != 3? If both are seen then it will be 3, so 3 means it is NOT ordered, not that it IS ordered? | |
385 | Currently the LGKM counter is always assumed unordered but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT) and then can use non-0 waitcnt when only a single class of instructions have been seen since the last waitcnt for LGKM. This would potentially benefit the DS_* instructions greatly. | |
401 | If Required is 0 then no wait is needed on this counter so Value should be set to Hardware limit. Only if Required is non 0 does it mean that there is an instruction in this BB that we must wait on. | |
459 | Need delayed waitcnt if Counts is trying to wait on an instruction after the WaitedOn. So this should be: if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i]) | |
487 | Only the expcnt should be considered. When UseRegs is set it includes all counters and we do not need to wait for a GFX6 store to complete before being able to use the source register. That only has to be waited for before using the destination register. | |
501 | Should this be a target feature? | |
558 | Seem better if this was a target feature that was tested. |
Is there a named constant for the maximum register number rather than using 512?