This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Configure the RPC port count from the plugin
Closed, Public

Authored by jhuber6 on Jul 20 2023, 5:41 PM.


Details

Reviewers

JonChesterfield
tianshilei1992
jdoerfert
jplehr

Commits

rG06adac8c4e26: [Libomptarget] Configure the RPC port count from the plugin

Summary

This patch lets each plugin configure the RPC port count to match the
parallelism the specific card wants. For AMDGPU we need as many ports as
the maximum hardware parallelism to avoid deadlocks. NVPTX does not have
this problem because of its friendlier scheduler, so we use the number of
warps active on an SM times the number of SMs as a good guess.

Note that the current maximum port count is smaller than these numbers.
That will be improved in the future.
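
As a rough illustration of the sizing described in the summary, the sketch below computes both guesses and clamps them to a port limit. The function names and the `kMaxPorts` value are assumptions made for illustration (64 is only the approximate limit mentioned later in the discussion); none of them come from the patch itself.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed stand-in for the server's port limit; not the real constant.
constexpr uint64_t kMaxPorts = 64;

// AMDGPU guess: compute units times the maximum waves per compute unit.
constexpr uint64_t amdgpuParallelism(uint32_t ComputeUnits,
                                     uint32_t WavesPerCU) {
  return static_cast<uint64_t>(ComputeUnits) * WavesPerCU;
}

// NVPTX guess: SM count times the warps that fit on one SM.
// Returns 0 for a zero warp size instead of dividing by zero.
constexpr uint64_t nvptxParallelism(uint32_t NumSMs, uint32_t MaxThreadsPerSM,
                                    uint32_t WarpSize) {
  return WarpSize == 0
             ? 0
             : static_cast<uint64_t>(NumSMs) * (MaxThreadsPerSM / WarpSize);
}

// Clamp the plugin's request to what the server can currently provide.
constexpr uint64_t clampPorts(uint64_t Requested) {
  return std::min(Requested, kMaxPorts);
}
```

For a hypothetical device with 104 compute units and 40 waves per compute unit, `amdgpuParallelism` gives 4160, which `clampPorts` reduces to 64 under the assumed limit.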

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Jul 20 2023, 5:41 PM

Herald added a project: Restricted Project. Jul 20 2023, 5:41 PM

Herald added subscribers: kerbowa, tpr, jvesely.

jhuber6 requested review of this revision. Jul 20 2023, 5:41 PM

Herald added a project: Restricted Project. Jul 20 2023, 5:41 PM

Herald added subscribers: openmp-commits, wangpc.

Harbormaster completed remote builds in B247074: Diff 542728. Jul 20 2023, 5:44 PM

Herald added a subscriber: sstefan1. Jul 20 2023, 5:44 PM

JonChesterfield added inline comments. Jul 20 2023, 5:50 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1792	This is valid if a wave opens at most one port at a time. I think an argument could be made that a wave could try to open one port per thread. Likely to be something we should add to the libc tests. Async also compromises the sizing, though I currently think we should limit openmp to synchronous calls.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
788	Typos. Also isn't num_teams the wrong constant here? Should be whatever openmp calls warps
openmp/libomptarget/plugins-nextgen/common/PluginInterface/RPC.cpp
64	This should probably be a hard error if the plugin wants more ports than it can have, especially on platforms where that implies deadlock risk
openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
384	It'll affect performance. Also I'm not totally confident Nvidia has a fair scheduler on SMs, that would be a good thing to check. Amdgpu does not have a fair scheduler on CUs
900	Typo. Also doesn't seem to match the comment
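
To make the concern about one port per thread concrete, here is a minimal sketch of how the required port count would scale with the number of ports a single wave may hold at once. `portsNeeded` and its parameters are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Hypothetical helper: ports required if every resident wave may hold
// PortsPerWave ports at the same time.
constexpr uint64_t portsNeeded(uint64_t ResidentWaves, uint32_t PortsPerWave) {
  return ResidentWaves * PortsPerWave;
}
```

With one port per wave the requirement equals the number of resident waves. If each of 64 threads in a wave could open its own port, the requirement would grow by a factor of 64.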

jdoerfert added inline comments. Jul 20 2023, 6:59 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2593	In a follow up, can you please expose this via ompx_get_hardware_num_processing_elements(int device) (or whatever the non device version is spelled)?

jhuber6 added inline comments. Jul 20 2023, 8:21 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2593	Yeah I'm not sure how to calculate this on CUDA unfortunately.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/RPC.cpp
64	Was putting that off considering that the limit is currently like 64, which is far below the like 2000 that most platforms will want.
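
The two positions in this exchange (fail hard versus clamp silently) can be sketched as a single decision function. This is only an illustration under assumed names: `PortStatus`, `PortChoice`, `choosePortCount`, and the `DeadlockIfShort` flag are all hypothetical and do not appear in the patch.

```cpp
#include <cstdint>

// Hypothetical result type; none of these names come from the patch.
enum class PortStatus { Ok, Clamped, Error };

struct PortChoice {
  uint64_t NumPorts;
  PortStatus Status;
};

// Pick a port count from the plugin's request and the server's maximum.
// DeadlockIfShort marks platforms where running with fewer ports than
// requested risks deadlock, so the shortfall is reported, not clamped.
constexpr PortChoice choosePortCount(uint64_t Requested, uint64_t Maximum,
                                     bool DeadlockIfShort) {
  if (Requested <= Maximum)
    return {Requested, PortStatus::Ok};
  if (DeadlockIfShort)
    return {0, PortStatus::Error};
  return {Maximum, PortStatus::Clamped};
}
```

With a maximum of 64 and a request of 2000, this returns an error on a platform flagged as deadlock-prone and clamps to 64 otherwise, which is what the `std::min` in the patch does unconditionally.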

Updating

Harbormaster completed remote builds in B248636: Diff 544853. Jul 27 2023, 10:43 AM

ping

LG, fix comments.

This revision is now accepted and ready to land. Aug 11 2023, 10:51 AM

Closed by commit rG06adac8c4e26: [Libomptarget] Configure the RPC port count from the plugin (authored by jhuber6). Aug 11 2023, 10:55 AM

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG06adac8c4e26: [Libomptarget] Configure the RPC port count from the plugin.

Revision Contents

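Path	Size
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp	15 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h	14 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/RPC.cpp	5 lines
openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp	23 lines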

Diff 549461

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

@@ ... @@ Error initImpl(GenericPluginTy &Plugin) override {

     // Compute the default number of teams.
     uint32_t ComputeUnits = 0;
     if (auto Err =
             getDeviceAttr(HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT, ComputeUnits))
       return Err;
     GridValues.GV_Default_Num_Teams = ComputeUnits * OMPX_DefaultTeamsPerCU;

+    uint32_t WavesPerCU = 0;
+    if (auto Err =
+            getDeviceAttr(HSA_AMD_AGENT_INFO_MAX_WAVES_PER_CU, WavesPerCU))
+      return Err;
+    HardwareParallelism = ComputeUnits * WavesPerCU;
+
     // Get maximum size of any device queues and maximum number of queues.
     uint32_t MaxQueueSize;
     if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))
       return Err;

     uint32_t MaxQueues;
     if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))
       return Err;
@@ ... @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   Error setContext() override { return Plugin::success(); }

   /// We want to set up the RPC server for host services to the GPU if it is
   /// availible.
   bool shouldSetupRPCServer() const override {
     return libomptargetSupportsRPC();
   }

+  /// AMDGPU returns the product of the number of compute units and the waves
+  /// per compute unit.
+  uint64_t requestedRPCPortCount() const override {
+    return HardwareParallelism;
+  }
+
   /// Get the stream of the asynchronous info sructure or get a new one.
   Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper,
                   AMDGPUStreamTy *&Stream) {
     // Get the stream (if any) from the async info.
     Stream = AsyncInfoWrapper.getQueueAs<AMDGPUStreamTy *>();
     if (!Stream) {
       // There was no stream; get an idle one.
       if (auto Err = AMDGPUStreamManager.getResource(Stream))
@@ ... @@ private:
   hsa_agent_t Agent;

   /// The GPU architecture.
   std::string ComputeUnitKind;

   /// The frequency of the steady clock inside the device.
   uint64_t ClockFrequency;

+  /// The total number of concurrent work items that can be running on the GPU.
+  uint64_t HardwareParallelism;
+
   /// Reference to the host device.
   AMDHostDeviceTy &HostDevice;
 };

 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
   hsa_status_t Status;
   Status = hsa_code_object_deserialize(getStart(), getSize(), "", &CodeObject);
   if (auto Err =

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h

@@ ... @@ struct GenericDeviceTy : public DeviceAllocatorTy {
   /// @see OMPX_MinThreadsForLowTripCount
   virtual uint32_t getMinThreadsForLowTripCountLoop() {
     return OMPX_MinThreadsForLowTripCount;
   }

   /// Get the RPC server running on this device.
   RPCServerTy *getRPCServer() const { return RPCServer; }

+  /// The number of parallel RPC ports to use on the device. In general, this
+  /// should be roughly equivalent to the amount of hardware parallelism the
+  /// device can support. This is because GPUs in general do not have forward
+  /// progress guarantees, so we minimize thread level dependencies by
+  /// allocating enough space such that each device thread can have a port. This
+  /// is likely overly pessimistic in the average case, but guarantees no
+  /// deadlocks at the cost of memory. This must be overloaded by targets
+  /// expecting to use the RPC server.
+  virtual uint64_t requestedRPCPortCount() const {
+    assert(!shouldSetupRPCServer() && "Default implementation cannot be used");
+    return 0;
+  }
+
 private:
   /// Register offload entry for global variable.
   Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,
                                    const __tgt_offload_entry &GlobalEntry,
                                    __tgt_offload_entry &DeviceEntry);

   /// Register offload entry for kernel function.
   Error registerKernelOffloadEntry(DeviceImageTy &DeviceImage,
@@ ... @@ #define defineOmptCallback(Name, Type, Code) Name##_t Name##_fn = nullptr;
   FOREACH_OMPT_DEVICE_EVENT(defineOmptCallback)
 #undef defineOmptCallback

   /// Internal representation for OMPT device (initialize & finalize)
   std::atomic<bool> OmptInitialized;
 #endif

 private:

   /// Return the kernel environment object for kernel \p Name.
   Expected<KernelEnvironmentTy>
   getKernelEnvironmentForKernel(StringRef Name, DeviceImageTy &Image);
 };

 /// Class implementing common functionalities of offload plugins. Each plugin
 /// should define the specific plugin class, derive from this generic one, and
 /// implement the necessary virtual function members.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/RPC.cpp

@@ ... @@ Error RPCServerTy::initDevice(plugin::GenericDeviceTy &Device,
                                plugin::DeviceImageTy &Image) {
 #ifdef LIBOMPTARGET_RPC_SUPPORT
   uint32_t DeviceId = Device.getDeviceId();
   auto Alloc = [](uint64_t Size, void *Data) {
     plugin::GenericDeviceTy &Device =
         *reinterpret_cast<plugin::GenericDeviceTy *>(Data);
     return Device.allocate(Size, nullptr, TARGET_ALLOC_HOST);
   };
-  // TODO: Allow the device to declare its requested port count.
-  if (rpc_status_t Err = rpc_server_init(DeviceId, RPC_MAXIMUM_PORT_COUNT,
+  uint64_t NumPorts =
+      std::min(Device.requestedRPCPortCount(), RPC_MAXIMUM_PORT_COUNT);
+  if (rpc_status_t Err = rpc_server_init(DeviceId, NumPorts,
                                          Device.getWarpSize(), Alloc, &Device))
     return plugin::Plugin::error(
         "Failed to initialize RPC server for device %d: %d", DeviceId, Err);

   // Register a custom opcode handler to perform plugin specific allocation.
   // FIXME: We need to make sure this uses asynchronous allocations on CUDA.
   auto MallocHandler = [](rpc_port_t Port, void *Data) {
     rpc_recv_and_send(

openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp

@@ ... @@ Error initImpl(GenericPluginTy &Plugin) override {
     if (auto Err = getDeviceAttr(CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                                  ComputeCapability.Major))
       return Err;

     if (auto Err = getDeviceAttr(CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                                  ComputeCapability.Minor))
       return Err;

+    uint32_t NumMuliprocessors = 0;
+    uint32_t MaxThreadsPerSM = 0;
+    uint32_t WarpSize = 0;
+    if (auto Err = getDeviceAttr(CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,
+                                 NumMuliprocessors))
+      return Err;
+    if (auto Err = getDeviceAttr(CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
+                                 MaxThreadsPerSM))
+      return Err;
+    if (auto Err = getDeviceAttr(CU_DEVICE_ATTRIBUTE_WARP_SIZE, WarpSize))
+      return Err;
+    HardwareParallelism = NumMuliprocessors * (MaxThreadsPerSM / WarpSize);
+
     return Plugin::success();
   }

   /// Deinitialize the device and release its resources.
   Error deinitImpl() override {
     if (Context) {
       if (auto Err = setContext())
         return Err;
@@ ... @@ struct CUDADeviceTy : public GenericDeviceTy {
   }

   /// We want to set up the RPC server for host services to the GPU if it is
   /// availible.
   bool shouldSetupRPCServer() const override {
     return libomptargetSupportsRPC();
   }

+  /// NVIDIA returns the product of the SM count and the number of warps that
+  /// fit if the maximum number of threads were scheduled on each SM.
+  uint64_t requestedRPCPortCount() const override {
+    return HardwareParallelism;
+  }
+
   /// Get the stream of the asynchronous info sructure or get a new one.
   Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper, CUstream &Stream) {
     // Get the stream (if any) from the async info.
     Stream = AsyncInfoWrapper.getQueueAs<CUstream>();
     if (!Stream) {
       // There was no stream; get an idle one.
       if (auto Err = CUDAStreamManager.getResource(Stream))
         return Err;
@@ ... @@ private:
   /// The compute capability of the corresponding CUDA device.
   struct ComputeCapabilityTy {
     uint32_t Major;
     uint32_t Minor;
     std::string str() const {
       return "sm_" + std::to_string(Major * 10 + Minor);
     }
   } ComputeCapability;
+
+  /// The maximum number of warps that can be resident on all the SMs
+  /// simultaneously.
+  uint32_t HardwareParallelism = 0;
 };

 Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                                uint32_t NumThreads, uint64_t NumBlocks,
                                KernelArgsTy &KernelArgs, void *Args,
                                AsyncInfoWrapperTy &AsyncInfoWrapper) const {
   CUDADeviceTy &CUDADevice = static_cast<CUDADeviceTy &>(GenericDevice);
