Details
- Reviewers
• tstellarAMD
Diff Detail
Event Timeline
test/CodeGen/AMDGPU/readcyclecounter.ll | ||
---|---|---|
8 | That would surprise me if it does need it since it isn't actually a load. This should probably be special cased to not require a waitcnt |
NAK on that. I agree that it feels odd, but...
The documentation explicitly mentions that "s_memtime counts the same as an s_load_dwordx2" (in the publicly released AMD_GCN3_Instruction_Set_Architecture.pdf, page 39 of the PDF in section 4.4, where LGKM_CNT is discussed).
OK, I didn't find this in the SI/CI manual (which seems to used old names: "smrd_fetch_time counts the same as an smrd_fetch_2.")
My main question now is still whether this should map to s_memtime, or match the behavior of the HSAIL clock instruction w.r.t s_memrealtime on VI.
Another question would be where the waitcnt should be inserted, and if this should have one inserted right after it to be more precise. Right now with the attempt to reduce waits, the s_memtime ends up being the first instruction, and has a waitcnt shared with the kernel argument load in the testcase. This probably isn't what you want
I don't have a strong opinion s_memtime vs. s_memrealtime. Probably the least surprise is to mirror HSAIL, but I don't really care.
I am against forcing there to be a waitcnt immediately after the s_memtime, because that might cause unnecessary stalls. However, we should prevent the s_memtime itself from getting reordered. GL_ARB_shader_clock explicitly states that the clock functions are code motion barriers. Isn't the intrinsic marked as having side effects?
Yes, the intrinsic already won't be reordered with other side effects / memory writes
Ah, now I understand. The load after the s_memtime logically belongs to the store that follows it, so there is no reordering going on. Combining the waits is exactly what we want: when s_memtime is executed, the shader hardware immediately sends the corresponding request to some other hardware block (LDS? I'm not sure). The returned cycle count will be the correct one, no matter how late the corresponding s_waitcnt is scheduled.
So with a check in readcyclecounter.ll to ensure that lgkmcnt is present after s_memtime, this revision looks good to me (don't know if somebody else wants to chime in on the s_memtime vs. s_memrealtime issue).
Make it have the same behavior as hsail clock. Add separate s_memtime and s_memrealtime intrinsic.s
FWIW, OpenGL (ARB_shader_clock) doesn't really care.