This patch adds the utilities for the clocks on the GPU. This is done
prior to exporting it via some other interface and is mainly just done
so they are availible if we wish to do internal testing.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
It's not obvious to me that this is a good thing overall.
First off benchmarking is difficult to do meaningfully and adding easy to access timing functions makes it more likely that they'll be used to make superficial comparisons when changing implementations.
Secondly picking timers that run at different frequencies on different hardware will interact badly with people comparing performance between different machines.
What's the benefit side to this like?
This is internal to libc and currently won't be exported as is. We could potentially put some of this into some time.h like functionality but that's tangential. The frequencies are different so we could potentially wrap them into different flags. Another option is to get the loader to look up the frequency and write it to some global so we can measure the nanosecond timer consistently between both architectures. It's useful because we want to be able to measure time elapsed on a much finger-grain than nvprof or rocprof would provide. This makes sure we can embed this analysis into some common tester and check the cycles.
libc/src/__support/GPU/amdgpu/utils.h | ||
---|---|---|
159 | Does __has_builtin cover subtarget feature restrictions? |
needs some tests targeting the different targets, I don't know if __has_builtin will really work this way
Not sure how to stand up a test just yet. In the future I would like to add a global to normalize the timestamp to a nanosecond resolution. I checked briefly via https://godbolt.org/z/4xjxjTovb and figured it worked.
libc/src/__support/GPU/amdgpu/utils.h | ||
---|---|---|
155 | For consistency, use LIBC_HAS_BUILTIN: https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/config.h#L29 |
libc/src/__support/GPU/amdgpu/utils.h | ||
---|---|---|
159 | Should check for has s_memtime in case future hardware decides 4th time's the charm on timing instructions |
Add another check that uses the s_memtime if it's availible. This is because
on RDNA2 we still have s_memtime but also have the read cycle counter. The
read cycle counter implementation is 20-bits so we should favor the 64-bit one
if availible.
Test could be C to IR I think, check the ASM comes out for ptx and that some known architectures pick the right value for gcn.
I'm not sure we can do much with the generic case.
@arsenm what do you think of adding a built-in that expands to one of the above? This sort of "try to get the time" isn't libc specific and the switch over different builtins is basically architecture mess that ideally wouldn't get reimplemented in lot of different library code.
And IR test wouldn't fit in libc, that'd probably be more of a clang thing to check that we specify the built-ins properly.
It would definitely allow us to remove another cause of divergence for codes that want to be more generic w.r.t. the architecture, but they do slightly different things.
That's kind of what I thought __builtin_readcyclecounter was for. I don't really understand the difference between the different time operations (e.g what is the difference between clock and time?)
The boards have a fixed-frequency clock and a cycle counter. The fixed-frequency clock from time is used for measuring absolute time elapsed because it is not matched to any frequency throttling or stalling the processor might do. The cycle counter from clock is the variable frequency clock that corresponds to the actual number of processor cycles that have elapsed since you last called it. In this patch I specifically choose to use memtime over readcyclecounter if available (As it is in RDNA2) because the former is 64-bits while the latter is only 20-bits. Getting the real time from the fixed-frequency clock requires querying the driver unfortunately.
Let's go very explicit on naming then. fixed_frequency_timer and guess_current_time_in_arbitrary_units or whatever these things actually do.
Is that the right basis set? I wonder if we want cycle_count, frequency, width as the controls. It seems like PTX wants to do ns granularity which is useful and GCN needs the loader to write to a global to get the frequency value, and the type wraparound could be anywhere from 20 to 64
It describes what the underlying counters are. We could mention that the fixed frequency is 1E9 on PTX and some runtime value on AMDGCN. Relative timings should be fine in either case. The 20 bits is unfortunate but only affects gfx11xx architectures since we still have s_memtime on gfx10xx.
libc/src/__support/GPU/amdgpu/utils.h | ||
---|---|---|
167 | __builtin_amdgcn_s_memfixed_frequrency_clock doesn't exist, and is a typo. If we want to introduce a new readcyclecounter alternative it should be a fully generic intrinsic like readcyclecounter |
libc/src/__support/GPU/nvptx/utils.h | ||
---|---|---|
140 | nit: Can you change this to the internal macro LIBC_INLINE_ASM? |
For consistency, use LIBC_HAS_BUILTIN: https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/config.h#L29