This is an archive of the discontinued LLVM Phabricator instance.

[libc] Initial support for microbenchmarking GPU code
Needs ReviewPublic

Authored by jhuber6 on Aug 18 2023, 2:59 PM.

Details

Summary

This is the initial attempt at microbenchmarking GPU code. It uses
several compiler hacks to ensure that only the code we want to test is
between these profiling instructions. I tested this on both NVPTX and
AMDGPu architecture. AMDGPU seems to work quite well and matches what I
expect from llvm-mca when checking the assembly via llvm-objump -D
on the binary. NVPTX on the other hand requires -Xcuda-ptxas -O0 to
get consistent results, otherwise it will reorder the operations and end
up getting noise.

This is difficult because if there is a single load or store inside of
the timing region it well completely drown out any latency. A single
load / store is probably more costly than most primitive match
functions so it drowns out everything else.

I'm putting this up as a stand-in that can hopefully be refined further
in the future, as such there are no users currently.

Diff Detail

Event Timeline

jhuber6 created this revision.Aug 18 2023, 2:59 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 18 2023, 2:59 PM
jhuber6 requested review of this revision.Aug 18 2023, 2:59 PM

You want memory fences to keep the operations inside the profiled region, the asm won't do that unless it has a memory clobber. Inline asm is likely to mess up codegen too.

libc/utils/gpu/timing/amdgpu/timing.h
37

Simulate? Delicate?

You want memory fences to keep the operations inside the profiled region, the asm won't do that unless it has a memory clobber. Inline asm is likely to mess up codegen too.

I messed around with fences but didn't notice any difference when I was messing around with this. The noinline and ordering seems to handle that for me.

arsenm added inline comments.Aug 18 2023, 3:30 PM
libc/utils/gpu/timing/amdgpu/timing.h
42

guarntee

46

Don't use r constraint

51

the post wait-for-result should be handled for you

jhuber6 updated this revision to Diff 551668.Aug 18 2023, 4:08 PM

I'm checking with https://godbolt.org/z/1n96MG7Mh and using `v changes the codegen to put unwanted things in the profile section.

I'm checking with https://godbolt.org/z/1n96MG7Mh and using `v changes the codegen to put unwanted things in the profile section.

I think this is just discovering ways that "r" is buggy

I'm checking with https://godbolt.org/z/1n96MG7Mh and using `v changes the codegen to put unwanted things in the profile section.

I think this is just discovering ways that "r" is buggy

It seems s works here. Is that functional? https://godbolt.org/z/j9TvP5hff.

I'm checking with https://godbolt.org/z/1n96MG7Mh and using `v changes the codegen to put unwanted things in the profile section.

I think this is just discovering ways that "r" is buggy

It seems s works here. Is that functional? https://godbolt.org/z/j9TvP5hff.

Yes, you want s especially when the source is a direct s output intrinsic

jhuber6 updated this revision to Diff 552161.Aug 21 2023, 4:15 PM

Add fence and move to "s"

arsenm added inline comments.Aug 23 2023, 4:58 PM
libc/utils/gpu/timing/amdgpu/timing.h
48

either the fence or the waitcnt, bot hare redundant

52

you shouldn't need this one, the waitcnt insertion has to do this for you to produce the result

jhuber6 updated this revision to Diff 553580.Aug 25 2023, 12:41 PM

Address comments. Also add __syncthreads() to the NVPTX implementation. It
seems to succeed in preventing optimizations when I check the SASS of the
produced binary.

tra added a comment.Sep 6 2023, 10:37 AM

LGTM for NVPTX side.

libc/utils/gpu/timing/nvptx/timing.h
56–57

This arrangement still seems to be a bit fragile.
sync_threads will confine clock reading to happen between them, but withing that range they may still be moved around by LLVM or ptxas.
asm volatile will probably restrict that on IR level, but it would not do anything on ptxas level. We can hope that ptxas would not move sreg reads around much, but I don't think it's guaranteed.

This example happens to work, but I would not be surprised that we'll run into issues trying to bench more complicated code.

I'd wrap each clock read between sync_threads() to make sure that ptxas can't move those reads.