This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add basic utility support for timing functions on the GPU
ClosedPublic

Authored by jhuber6 on Jun 20 2023, 5:37 PM.

Details

Summary

This patch adds the utilities for the clocks on the GPU. This is done
prior to exporting it via some other interface and is mainly just done
so they are availible if we wish to do internal testing.

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 20 2023, 5:37 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 20 2023, 5:37 PM
jhuber6 requested review of this revision.Jun 20 2023, 5:37 PM

It's not obvious to me that this is a good thing overall.

First off benchmarking is difficult to do meaningfully and adding easy to access timing functions makes it more likely that they'll be used to make superficial comparisons when changing implementations.

Secondly picking timers that run at different frequencies on different hardware will interact badly with people comparing performance between different machines.

What's the benefit side to this like?

It's not obvious to me that this is a good thing overall.

First off benchmarking is difficult to do meaningfully and adding easy to access timing functions makes it more likely that they'll be used to make superficial comparisons when changing implementations.

Secondly picking timers that run at different frequencies on different hardware will interact badly with people comparing performance between different machines.

What's the benefit side to this like?

This is internal to libc and currently won't be exported as is. We could potentially put some of this into some time.h like functionality but that's tangential. The frequencies are different so we could potentially wrap them into different flags. Another option is to get the loader to look up the frequency and write it to some global so we can measure the nanosecond timer consistently between both architectures. It's useful because we want to be able to measure time elapsed on a much finger-grain than nvprof or rocprof would provide. This makes sure we can embed this analysis into some common tester and check the cycles.

arsenm added inline comments.Jun 20 2023, 6:55 PM
libc/src/__support/GPU/amdgpu/utils.h
159

Does __has_builtin cover subtarget feature restrictions?

jhuber6 updated this revision to Diff 533101.Jun 20 2023, 7:06 PM

Using __has_builtin

needs some tests targeting the different targets, I don't know if __has_builtin will really work this way

needs some tests targeting the different targets, I don't know if __has_builtin will really work this way

Not sure how to stand up a test just yet. In the future I would like to add a global to normalize the timestamp to a nanosecond resolution. I checked briefly via https://godbolt.org/z/4xjxjTovb and figured it worked.

sivachandra added inline comments.Jun 20 2023, 7:12 PM
libc/src/__support/GPU/amdgpu/utils.h
155
jhuber6 updated this revision to Diff 533103.Jun 20 2023, 7:14 PM

-> LIBC_HAS_BUILTIN

arsenm added inline comments.Jun 21 2023, 4:55 AM
libc/src/__support/GPU/amdgpu/utils.h
159

Should check for has s_memtime in case future hardware decides 4th time's the charm on timing instructions

jhuber6 updated this revision to Diff 533229.Jun 21 2023, 6:00 AM

Add another check that uses the s_memtime if it's availible. This is because
on RDNA2 we still have s_memtime but also have the read cycle counter. The
read cycle counter implementation is 20-bits so we should favor the 64-bit one
if availible.

Test could be C to IR I think, check the ASM comes out for ptx and that some known architectures pick the right value for gcn.
I'm not sure we can do much with the generic case.

@arsenm what do you think of adding a built-in that expands to one of the above? This sort of "try to get the time" isn't libc specific and the switch over different builtins is basically architecture mess that ideally wouldn't get reimplemented in lot of different library code.

Test could be C to IR I think, check the ASM comes out for ptx and that some known architectures pick the right value for gcn.
I'm not sure we can do much with the generic case.

And IR test wouldn't fit in libc, that'd probably be more of a clang thing to check that we specify the built-ins properly.

@arsenm what do you think of adding a built-in that expands to one of the above? This sort of "try to get the time" isn't libc specific and the switch over different builtins is basically architecture mess that ideally wouldn't get reimplemented in lot of different library code.

It would definitely allow us to remove another cause of divergence for codes that want to be more generic w.r.t. the architecture, but they do slightly different things.

@arsenm what do you think of adding a built-in that expands to one of the above? This sort of "try to get the time" isn't libc specific and the switch over different builtins is basically architecture mess that ideally wouldn't get reimplemented in lot of different library code.

That's kind of what I thought __builtin_readcyclecounter was for. I don't really understand the difference between the different time operations (e.g what is the difference between clock and time?)

@arsenm what do you think of adding a built-in that expands to one of the above? This sort of "try to get the time" isn't libc specific and the switch over different builtins is basically architecture mess that ideally wouldn't get reimplemented in lot of different library code.

That's kind of what I thought __builtin_readcyclecounter was for. I don't really understand the difference between the different time operations (e.g what is the difference between clock and time?)

The boards have a fixed-frequency clock and a cycle counter. The fixed-frequency clock from time is used for measuring absolute time elapsed because it is not matched to any frequency throttling or stalling the processor might do. The cycle counter from clock is the variable frequency clock that corresponds to the actual number of processor cycles that have elapsed since you last called it. In this patch I specifically choose to use memtime over readcyclecounter if available (As it is in RDNA2) because the former is 64-bits while the latter is only 20-bits. Getting the real time from the fixed-frequency clock requires querying the driver unfortunately.

Let's go very explicit on naming then. fixed_frequency_timer and guess_current_time_in_arbitrary_units or whatever these things actually do.

Let's go very explicit on naming then. fixed_frequency_timer and guess_current_time_in_arbitrary_units or whatever these things actually do.

Sure, how about processor_clock and fixed_frequency_clock?

Is that the right basis set? I wonder if we want cycle_count, frequency, width as the controls. It seems like PTX wants to do ns granularity which is useful and GCN needs the loader to write to a global to get the frequency value, and the type wraparound could be anywhere from 20 to 64

Is that the right basis set? I wonder if we want cycle_count, frequency, width as the controls. It seems like PTX wants to do ns granularity which is useful and GCN needs the loader to write to a global to get the frequency value, and the type wraparound could be anywhere from 20 to 64

It describes what the underlying counters are. We could mention that the fixed frequency is 1E9 on PTX and some runtime value on AMDGCN. Relative timings should be fine in either case. The 20 bits is unfortunate but only affects gfx11xx architectures since we still have s_memtime on gfx10xx.

jhuber6 updated this revision to Diff 533272.Jun 21 2023, 8:25 AM

Change names.

arsenm added inline comments.Jun 21 2023, 8:27 AM
libc/src/__support/GPU/amdgpu/utils.h
167

__builtin_amdgcn_s_memfixed_frequrency_clock doesn't exist, and is a typo. If we want to introduce a new readcyclecounter alternative it should be a fully generic intrinsic like readcyclecounter

jhuber6 updated this revision to Diff 533274.Jun 21 2023, 8:29 AM

Fix the dangers of using sed to do your edits.

lntue added inline comments.Jun 21 2023, 8:45 AM
libc/src/__support/GPU/nvptx/utils.h
140

nit: Can you change this to the internal macro LIBC_INLINE_ASM?

jhuber6 updated this revision to Diff 533283.Jun 21 2023, 8:46 AM

-> LIBC_INLINE_ASM

lntue accepted this revision.Jun 21 2023, 9:01 AM

LGTM for me, but I'll let others to approve the GPU builtin / inline assembly usage.

This revision is now accepted and ready to land.Jun 21 2023, 9:01 AM