This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Implement readcyclecounter
ClosedPublic

Authored by arsenm on Jan 4 2016, 2:28 PM.

Details

Reviewers
tstellarAMD

Diff Detail

Event Timeline

arsenm updated this revision to Diff 43924.Jan 4 2016, 2:28 PM
arsenm retitled this revision from to AMDGPU: Implement readcyclecounter.
arsenm updated this object.
arsenm added a reviewer: tstellarAMD.
arsenm added a subscriber: llvm-commits.

Looks good to me.

lib/Target/AMDGPU/SIInstructions.td
91–100

FWIW, OpenGL (ARB_shader_clock) doesn't really care.

test/CodeGen/AMDGPU/readcyclecounter.ll
8

s_memtime needs a s_waitcnt, right? Makes sense to check for that here, but not a big deal either way.

arsenm added inline comments.Jan 4 2016, 3:38 PM
test/CodeGen/AMDGPU/readcyclecounter.ll
8

That would surprise me if it does need it since it isn't actually a load. This should probably be special cased to not require a waitcnt

arsenm updated this revision to Diff 43960.Jan 4 2016, 8:26 PM

Disable waitcnt on s_memtime

NAK on that. I agree that it feels odd, but...

The documentation explicitly mentions that "s_memtime counts the same as an s_load_dwordx2" (in the publicly released AMD_GCN3_Instruction_Set_Architecture.pdf, page 39 of the PDF in section 4.4, where LGKM_CNT is discussed).

arsenm added a comment.Jan 5 2016, 6:47 AM

NAK on that. I agree that it feels odd, but...

The documentation explicitly mentions that "s_memtime counts the same as an s_load_dwordx2" (in the publicly released AMD_GCN3_Instruction_Set_Architecture.pdf, page 39 of the PDF in section 4.4, where LGKM_CNT is discussed).

OK, I didn't find this in the SI/CI manual (which seems to used old names: "smrd_fetch_time counts the same as an smrd_fetch_2.")

arsenm added a comment.Jan 5 2016, 6:48 AM

My main question now is still whether this should map to s_memtime, or match the behavior of the HSAIL clock instruction w.r.t s_memrealtime on VI.

arsenm added a comment.Jan 5 2016, 6:53 AM

NAK on that. I agree that it feels odd, but...

The documentation explicitly mentions that "s_memtime counts the same as an s_load_dwordx2" (in the publicly released AMD_GCN3_Instruction_Set_Architecture.pdf, page 39 of the PDF in section 4.4, where LGKM_CNT is discussed).

OK, I didn't find this in the SI/CI manual (which seems to used old names: "smrd_fetch_time counts the same as an smrd_fetch_2.")

Another question would be where the waitcnt should be inserted, and if this should have one inserted right after it to be more precise. Right now with the attempt to reduce waits, the s_memtime ends up being the first instruction, and has a waitcnt shared with the kernel argument load in the testcase. This probably isn't what you want

I don't have a strong opinion s_memtime vs. s_memrealtime. Probably the least surprise is to mirror HSAIL, but I don't really care.

I am against forcing there to be a waitcnt immediately after the s_memtime, because that might cause unnecessary stalls. However, we should prevent the s_memtime itself from getting reordered. GL_ARB_shader_clock explicitly states that the clock functions are code motion barriers. Isn't the intrinsic marked as having side effects?

arsenm added a comment.Jan 5 2016, 7:27 AM

I don't have a strong opinion s_memtime vs. s_memrealtime. Probably the least surprise is to mirror HSAIL, but I don't really care.

I am against forcing there to be a waitcnt immediately after the s_memtime, because that might cause unnecessary stalls. However, we should prevent the s_memtime itself from getting reordered. GL_ARB_shader_clock explicitly states that the clock functions are code motion barriers. Isn't the intrinsic marked as having side effects?

Yes, the intrinsic already won't be reordered with other side effects / memory writes

Ah, now I understand. The load after the s_memtime logically belongs to the store that follows it, so there is no reordering going on. Combining the waits is exactly what we want: when s_memtime is executed, the shader hardware immediately sends the corresponding request to some other hardware block (LDS? I'm not sure). The returned cycle count will be the correct one, no matter how late the corresponding s_waitcnt is scheduled.

So with a check in readcyclecounter.ll to ensure that lgkmcnt is present after s_memtime, this revision looks good to me (don't know if somebody else wants to chime in on the s_memtime vs. s_memrealtime issue).

arsenm updated this revision to Diff 48696.Feb 22 2016, 9:31 AM

Make it have the same behavior as hsail clock. Add separate s_memtime and s_memrealtime intrinsic.s

tstellarAMD accepted this revision.Feb 24 2016, 7:00 PM
tstellarAMD edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Feb 24 2016, 7:00 PM
arsenm closed this revision.Feb 27 2016, 1:04 AM

r262119