This is an archive of the discontinued LLVM Phabricator instance.


Differential D158320

[libc] Initial support for microbenchmarking GPU code
Needs Review · Public

Authored by jhuber6 on Aug 18 2023, 2:59 PM.


Details

Reviewers

tra
arsenm
sivachandra
lntue
michaelrj
JonChesterfield

Summary

This is the initial attempt at microbenchmarking GPU code. It uses
several compiler hacks to ensure that only the code we want to test is
between the profiling instructions. I tested this on both the NVPTX and
AMDGPU architectures. AMDGPU seems to work quite well and matches what I
expect from llvm-mca when checking the assembly via llvm-objdump -D
on the binary. NVPTX, on the other hand, requires -Xcuda-ptxas -O0 to
get consistent results; otherwise ptxas will reorder the operations and
the measurement ends up as noise.

This is difficult because a single load or store inside the timing
region will completely drown out the latency we are trying to measure. A
single load / store is probably more costly than most primitive math
functions.

I'm putting this up as a stand-in that can hopefully be refined further
in the future. As such, there are no users currently.
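
For readers without a GPU at hand, the general shape of the trick described in the summary can be sketched on the host. This is only an analogue under stated assumptions: it uses std::chrono::steady_clock instead of the GPU cycle counter, and the names keep, Sample and latency_sketch are hypothetical and not part of the patch. It assumes GCC or Clang, since it relies on GNU-style inline asm.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the patch): tells the compiler that `value`
// is used and that memory may have changed, so the computation producing it
// cannot be deleted or moved across this point. Emits no instructions.
template <typename T> inline void keep(T const &value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

struct Sample {
  uint32_t value;      // result of the function under test
  int64_t nanoseconds; // elapsed wall-clock time of the timed region
};

// Host-side analogue of the patch's latency(): launder the input through a
// volatile, pin it before the first clock read, and pin the result before
// the second clock read.
template <typename F>
[[gnu::noinline]] Sample latency_sketch(F f, uint32_t t) {
  volatile uint32_t storage = t; // defeats constant propagation of `t`
  uint32_t arg = storage;
  keep(arg);

  auto start = std::chrono::steady_clock::now();
  uint32_t result = f(arg);
  keep(result);
  auto stop = std::chrono::steady_clock::now();

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  return {result, static_cast<int64_t>(ns.count())};
}
```

A call such as latency_sketch([](uint32_t x) { return x * 3u + 1u; }, 7u) returns both the computed value and the elapsed time. On a host the clock read itself dominates for a function this small, which is the same drowning-out effect the summary describes for loads and stores.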

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Aug 18 2023, 2:59 PM

Herald added projects: Restricted Project, Restricted Project. Aug 18 2023, 2:59 PM

Herald added subscribers: libc-commits, mattd, asavonic and 3 others.

jhuber6 requested review of this revision. Aug 18 2023, 2:59 PM

Herald added subscribers: wangpc, wdng. Aug 18 2023, 2:59 PM

Harbormaster completed remote builds in B253590: Diff 551655. Aug 18 2023, 3:05 PM

You want memory fences to keep the operations inside the profiled region, the asm won't do that unless it has a memory clobber. Inline asm is likely to mess up codegen too.

libc/utils/gpu/timing/amdgpu/timing.h
37	Simulate? Delicate?

In D158320#4600380, @JonChesterfield wrote:

You want memory fences to keep the operations inside the profiled region, the asm won't do that unless it has a memory clobber. Inline asm is likely to mess up codegen too.

I messed around with fences but didn't notice any difference. The noinline and ordering seem to handle that for me.
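
To make the memory-clobber point concrete, here is a minimal host-side sketch, assuming GCC or Clang. The names compiler_barrier and store_inside_region are hypothetical. It only shows the compiler-level effect of the "memory" clobber; it says nothing about hardware fences or about what the AMDGPU or NVPTX backends do.

```cpp
#include <cstdint>

// Compiler-only barrier: the "memory" clobber tells the compiler that the
// statement may read or write any memory, so it must not move ordinary loads
// and stores across it. It emits no instructions and is not a hardware fence.
inline void compiler_barrier() { asm volatile("" ::: "memory"); }

// Hypothetical timed region: with the barriers in place the compiler has to
// perform the store to *slot between them instead of sinking or hoisting it.
inline uint32_t store_inside_region(uint32_t *slot, uint32_t v) {
  compiler_barrier(); // region begins
  *slot = v + 1;
  compiler_barrier(); // region ends
  return *slot;
}
```

Without the clobber, an asm volatile statement is still kept and ordered against other volatile operations, but the compiler remains free to move non-volatile memory accesses across it.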

arsenm added inline comments. Aug 18 2023, 3:30 PM

libc/utils/gpu/timing/amdgpu/timing.h
42	guarntee
46	Don't use r constraint
51	the post wait-for-result should be handled for you

I'm checking with https://godbolt.org/z/1n96MG7Mh and using "v" changes the codegen to put unwanted things in the profile section.

Harbormaster completed remote builds in B253600: Diff 551668. Aug 18 2023, 4:14 PM

In D158320#4600488, @jhuber6 wrote:

I'm checking with https://godbolt.org/z/1n96MG7Mh and using "v" changes the codegen to put unwanted things in the profile section.

I think this is just discovering ways that "r" is buggy

In D158320#4604894, @arsenm wrote:

I think this is just discovering ways that "r" is buggy

It seems "s" works here. Is that functional? https://godbolt.org/z/j9TvP5hff.

In D158320#4604914, @jhuber6 wrote:

It seems "s" works here. Is that functional? https://godbolt.org/z/j9TvP5hff.

Yes, you want "s", especially when the source is a direct "s" output intrinsic

Add fence and move to "s"

Harbormaster completed remote builds in B253947: Diff 552161. Aug 21 2023, 4:20 PM

arsenm added inline comments. Aug 23 2023, 4:58 PM

libc/utils/gpu/timing/amdgpu/timing.h
48	either the fence or the waitcnt, both are redundant
52	you shouldn't need this one, the waitcnt insertion has to do this for you to produce the result

Address comments. Also add __syncthreads() to the NVPTX implementation. It
seems to succeed in preventing optimizations when I check the SASS of the
produced binary.

Harbormaster completed remote builds in B254966: Diff 553580. Aug 25 2023, 2:17 PM

ping

LGTM for NVPTX side.

libc/utils/gpu/timing/nvptx/timing.h
55–56	This arrangement still seems to be a bit fragile. sync_threads will confine the clock reads to happen between them, but within that range they may still be moved around by LLVM or ptxas. `asm volatile` will probably restrict that at the IR level, but it would not do anything at the ptxas level. We can hope that ptxas would not move sreg reads around much, but I don't think it's guaranteed. This example happens to work, but I would not be surprised if we run into issues trying to bench more complicated code. I'd wrap each clock read between sync_threads() to make sure that ptxas can't move those reads.
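
One possible reading of the suggestion above, sketched with host stand-ins so that it compiles anywhere with GCC or Clang. Everything in namespace sketch is hypothetical: in the patch the corresponding calls are gpu::sync_threads() and gpu::processor_clock(), and whether this arrangement actually constrains ptxas is not verified here.

```cpp
#include <chrono>
#include <cstdint>

namespace sketch {
// Host stand-ins (assumptions, not the libc GPU utilities).
static int barrier_count = 0;

inline void sync_threads() {
  asm volatile("" ::: "memory"); // compiler-level barrier only
  ++barrier_count;
}

inline uint64_t processor_clock() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
          .count());
}

// Each clock read sits between its own pair of barriers.
inline uint64_t fenced_clock() {
  sync_threads();
  uint64_t t = processor_clock();
  sync_threads();
  return t;
}

template <typename F, typename T> uint64_t latency_fenced(F f, T t) {
  volatile T storage = t; // defeats constant propagation of `t`
  T arg = storage;
  uint64_t start = fenced_clock();
  auto result = f(arg);
  volatile T output = result; // keeps the result alive
  uint64_t stop = fenced_clock();
  (void)output;
  return stop - start;
}
} // namespace sketch
```

Compared with the diff under review this doubles the number of barriers per measurement, so the value returned by overhead() would need to be measured with the same arrangement.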

Revision Contents

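Path	Size
libc/src/__support/GPU/nvptx/utils.h	4 lines
libc/utils/gpu/CMakeLists.txt	1 line
libc/utils/gpu/timing/CMakeLists.txt	16 lines
libc/utils/gpu/timing/amdgpu/CMakeLists.txt	7 lines
libc/utils/gpu/timing/amdgpu/timing.h	73 lines
libc/utils/gpu/timing/nvptx/CMakeLists.txt	7 lines
libc/utils/gpu/timing/nvptx/timing.h	82 lines
libc/utils/gpu/timing/timing.h	22 lines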

Diff 553580

libc/src/__support/GPU/nvptx/utils.h


	/// Waits for all threads in the warp to reconverge for independent scheduling.			/// Waits for all threads in the warp to reconverge for independent scheduling.
	[[clang::convergent]] LIBC_INLINE void sync_lane(uint64_t mask) {			[[clang::convergent]] LIBC_INLINE void sync_lane(uint64_t mask) {
	__nvvm_bar_warp_sync(static_cast<uint32_t>(mask));			__nvvm_bar_warp_sync(static_cast<uint32_t>(mask));
	}			}

	/// Returns the current value of the GPU's processor clock.			/// Returns the current value of the GPU's processor clock.
	LIBC_INLINE uint64_t processor_clock() {			LIBC_INLINE uint64_t processor_clock() {
	uint64_t timestamp;			return __nvvm_read_ptx_sreg_clock64();
	LIBC_INLINE_ASM("mov.u64 %0, %%clock64;" : "=l"(timestamp));
	return timestamp;
	}			}

	/// Returns a global fixed-frequency timer at nanosecond frequency.			/// Returns a global fixed-frequency timer at nanosecond frequency.
	LIBC_INLINE uint64_t fixed_frequency_clock() {			LIBC_INLINE uint64_t fixed_frequency_clock() {
	uint64_t nsecs;			uint64_t nsecs;
	LIBC_INLINE_ASM("mov.u64 %0, %%globaltimer;" : "=l"(nsecs));			LIBC_INLINE_ASM("mov.u64 %0, %%globaltimer;" : "=l"(nsecs));
	return nsecs;			return nsecs;
	}			}

	} // namespace gpu			} // namespace gpu
	} // namespace __llvm_libc			} // namespace __llvm_libc

	#endif			#endif

libc/utils/gpu/CMakeLists.txt

	add_subdirectory(server)			add_subdirectory(server)
	add_subdirectory(loader)			add_subdirectory(loader)
				add_subdirectory(timing)

libc/utils/gpu/timing/CMakeLists.txt

This file was added.

				if(NOT LIBC_TARGET_ARCHITECTURE_IS_GPU)
				return()
				endif()

				foreach(target nvptx amdgpu)
				add_subdirectory(${target})
				list(APPEND target_gpu_timing libc.utils.gpu.timing.${target}.${target}_timing)
				endforeach()

				add_header_library(
				timing
				HDRS
				timing.h
				DEPENDS
				${target_gpu_timing}
				)

libc/utils/gpu/timing/amdgpu/CMakeLists.txt

This file was added.

				add_header_library(
				amdgpu_timing
				HDRS
				timing.h
				DEPENDS
				libc.src.__support.common
				)

libc/utils/gpu/timing/amdgpu/timing.h

This file was added.

				//===------------- AMDGPU implementation of timing utils --------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
				#define LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU

				#include "src/__support/GPU/utils.h"
				#include "src/__support/common.h"
				#include "src/__support/macros/attributes.h"
				#include "src/__support/macros/config.h"

				#include <stdint.h>

				namespace __llvm_libc {

				// Returns the overhead associated with calling the profiling region. This
				// allows us to subtract the constant-time overhead from the latency to
				// obtain a true result. This can vary with system load.
				[[gnu::noinline]] static LIBC_INLINE uint64_t overhead() {
				__builtin_amdgcn_s_waitcnt(0);
				uint64_t start = gpu::processor_clock();
				uint32_t result = 0.0;
				asm volatile("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
				asm volatile("" ::"s"(start));
				uint64_t stop = gpu::processor_clock();
				return stop - start;
				}

				// Profile a simple function and obtain its latency in clock cycles on the
				// system. This function cannot be inlined or else it will disturb the very
				// delicate balance of hard-coded dependencies.
				template <typename F, typename T>
				[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
				// We need to store the input somewhere to guarantee that the compiler will
				// not constant propagate it and remove the profiling region.
				volatile uint32_t storage = t;
				float arg = storage;
				asm volatile("" ::"s"(arg));

				// The AMDGPU architecture needs to wait on pending results.
				__builtin_amdgcn_s_waitcnt(0);
				// Get the current timestamp from the clock.
				uint64_t start = gpu::processor_clock();

				// This forces the compiler to load the input argument and run the clock cycle
				// counter before the profiling region.
				asm volatile("" ::"s"(arg), "s"(start));

				// Run the function under test and return its value.
				auto result = f(arg);

				// This inline assembly performs a no-op which forces the result to both be
				// used and prevents us from exiting this region before it's complete.
				asm volatile("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);

				// Obtain the current timestamp after running the calculation and force
				// ordering.
				uint64_t stop = gpu::processor_clock();
				asm volatile("" ::"s"(stop));
				__builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "workgroup");

				// Return the time elapsed.
				return stop - start;
				}

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU

libc/utils/gpu/timing/nvptx/CMakeLists.txt

This file was added.

				add_header_library(
				nvptx_timing
				HDRS
				timing.h
				DEPENDS
				libc.src.__support.common
				)

libc/utils/gpu/timing/nvptx/timing.h

This file was added.

				//===------------- NVPTX implementation of timing utils ---------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_UTILS_GPU_TIMING_NVPTX
				#define LLVM_LIBC_UTILS_GPU_TIMING_NVPTX

				#include "src/__support/GPU/utils.h"
				#include "src/__support/common.h"
				#include "src/__support/macros/attributes.h"
				#include "src/__support/macros/config.h"

				#include <stdint.h>

				namespace __llvm_libc {

				// Returns the overhead associated with calling the profiling region. This
				// allows us to subtract the constant-time overhead from the latency to
				// obtain a true result. This can vary with system load.
				[[gnu::noinline]] static uint64_t overhead() {
				volatile uint32_t x = 1;
				uint32_t y = x;
				gpu::sync_threads();
				uint64_t start = gpu::processor_clock();
				asm volatile("" ::"r"(y), "r"(start));
				uint32_t result = y;
				asm volatile("or.b32 %[v_reg], %[v_reg], 0;" ::[v_reg] "r"(result) :);
				uint64_t stop = gpu::processor_clock();
				gpu::sync_threads();
				volatile auto storage = result;
				return stop - start;
				}

				// Profile a simple function and obtain its latency in clock cycles on the
				// system. This function cannot be inlined or else it will disturb the very
				// delicate balance of hard-coded dependencies.
				//
				// FIXME: This does not work in general on NVPTX because of further
				// optimizations ptxas performs. The only way to get consistent results is to
				// pass an extra "SHELL:-Xcuda-ptxas -O0" to CMake's compiler flags. This
				// negatively impacts performance but it is at least stable.
				template <typename F, typename T>
				[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
				// We need to store the input somewhere to guarantee that the compiler will
				// not constant propagate it and remove the profiling region.
				volatile T storage = t;
				T arg = storage;
				asm volatile("" ::"r"(arg));

				// Get the current timestamp from the clock.
				gpu::sync_threads();
				uint64_t start = gpu::processor_clock();

				// This forces the compiler to load the input argument and run the clock cycle
				// counter before the profiling region.
				asm volatile("" ::"r"(arg), "r"(start));

				// Run the function under test and return its value.
				auto result = f(arg);

				// This inline assembly performs a no-op which forces the result to both be
				// used and prevents us from exiting this region before it's complete.
				asm volatile("or.b32 %[v_reg], %[v_reg], 0;" ::[v_reg] "r"(result) :);

				// Obtain the current timestamp after running the calculation and force
				// ordering.
				uint64_t stop = gpu::processor_clock();
				gpu::sync_threads();
				asm volatile("" ::"r"(stop));
				volatile T output = result;

				// Return the time elapsed.
				return stop - start;
				}

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_UTILS_GPU_TIMING_NVPTX

libc/utils/gpu/timing/timing.h

This file was added.

				//===------------- Implementation of GPU timing utils ------------ C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_UTILS_GPU_TIMING_H
				#define LLVM_LIBC_UTILS_GPU_TIMING_H

				#include "src/__support/macros/properties/architectures.h"

				#if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
				#include "amdgpu/timing.h"
				#elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
				#include "nvptx/timing.h"
				#else
				#error "unsupported platform"
				#endif

				#endif // LLVM_LIBC_UTILS_GPU_TIMING_H
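
Since the patch has no users yet, the following is only a guess at how a caller might combine the two entry points. Both latency() and overhead() return uint64_t, and the overhead "can vary with system load", so a plain subtraction could wrap around when a noisy overhead reading exceeds the latency sample. The helper name net_cycles is hypothetical and not part of the patch.

```cpp
#include <cstdint>

// Hypothetical helper: subtract the measured overhead from a latency sample,
// clamping at zero so a noisy overhead reading cannot wrap the unsigned result.
constexpr uint64_t net_cycles(uint64_t latency_sample,
                              uint64_t overhead_sample) {
  return latency_sample > overhead_sample ? latency_sample - overhead_sample
                                          : 0;
}
```

On the GPU the inputs would presumably come from latency(f, t) and overhead() in the headers above; the helper itself is plain C++ and can be checked on the host.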
