This is an archive of the discontinued LLVM Phabricator instance.


Differential D140720

[OpenMP][JIT] Introduce support for AMDGPU
Closed, Public

Authored by jdoerfert on Dec 27 2022, 11:24 PM.


Details

Reviewers

tianshilei1992
jhuber6
kevinsala
JonChesterfield
ronlieb

Commits

rGc63dced93b46: [OpenMP][JIT] Introduce support for AMDGPU

Summary

To JIT kernels for AMDGPUs we need to provide the architecture, the
triple, and a post-link callback. The first two are simple; the last one
is a little more complicated since we need to invoke `lld`. There is a
library interface, but it requires the lld libraries, which are not
generally available, so we go with the executable for now. Either way
we need to materialize the (amdgcn) object file on disk and read the
output from another file. We should try to avoid that in the future.
The options for `lld` are copied from the way clang invokes it.
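
A minimal sketch of how the linker command line described above could be assembled, written against the C++ standard library only so that it stands alone. The helper name `buildLldArgs` and its parameters are hypothetical and not part of the patch; the flags are the ones that appear in the diff.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of the patch): builds the argument vector for
// linking one amdgcn object file into a shared object with the generic `lld`
// driver. The flags mirror the `Args` array in the diff.
std::vector<std::string> buildLldArgs(const std::string &LLDPath,
                                      const std::string &Arch,
                                      const std::string &OutputPath,
                                      const std::string &InputPath) {
  assert(!Arch.empty() && "expected an architecture name such as gfx90a");
  return {LLDPath,
          "-flavor", "gnu",           // select the ELF (GNU) linker flavor
          "--no-undefined",           // report unresolved symbols as errors
          "-shared",                  // emit a shared object
          "-plugin-opt=mcpu=" + Arch, // target processor, e.g. gfx90a
          "-o", OutputPath,           // file the linker writes
          InputPath};                 // file holding the JITed object
}
```

Calling `buildLldArgs("/usr/bin/lld", "gfx90a", "/tmp/a.out", "/tmp/a.o")` yields nine arguments, with the input file last. The paths in that call are placeholders.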

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Dec 27 2022, 11:24 PM

Herald added a project: Restricted Project. Dec 27 2022, 11:24 PM

Herald added subscribers: kosarev, kerbowa, guansong and 6 others.

jdoerfert requested review of this revision. Dec 27 2022, 11:24 PM

Herald added subscribers: sstefan1, wdng. Dec 27 2022, 11:24 PM

jdoerfert added a subscriber: arsenm. Dec 27 2022, 11:25 PM

Harbormaster completed remote builds in B205052: Diff 485463. Dec 27 2022, 11:27 PM

Instead of invoking lld and copying the flags, could you invoke clang instead? The flags may change over time. Rust calls cc for linking.

arsenm added inline comments. Dec 28 2022, 4:38 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1654	SmallString
1656	Don’t need llvm::
1673	Also need to get ecc and xnack target features?
1679	Do we still need this flag with real LTO
1688	ExecuteAndWait should switch to std::error_code
1696	Return StringRef
2329	Why not return a complete triple

tianshilei1992 added inline comments. Dec 28 2022, 6:27 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1696	It is not always good to return a `StringRef` here. In the NVPTX implementation, the arch string is built on the fly (in the return statement).
2329	Because we don't know the other components of a complete triple, though the assumption here is if the arch is same, the module should be compatible.
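
The ownership question in the thread on line 1696 can be sketched as follows. This is an illustration only: it uses `std::string_view` as a stand-in for `llvm::StringRef`, and all type names (`GenericDevice`, `StoredArchDevice`, `ComputedArchDevice`) and the `archMatches` helper are hypothetical, not taken from the plugins.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical base interface: one signature has to serve every plugin.
struct GenericDevice {
  virtual ~GenericDevice() = default;
  virtual std::string getArch() const = 0;
};

// The name is stored as a member, so a non-owning view would also be safe.
struct StoredArchDevice : GenericDevice {
  std::string Arch = "gfx90a";
  std::string getArch() const override { return Arch; }
  std::string_view getArchView() const { return Arch; }
};

// The name is assembled in the return statement. Returning a view here would
// refer to a temporary that is destroyed when the function returns.
struct ComputedArchDevice : GenericDevice {
  int ComputeCapability = 86;
  std::string getArch() const override {
    return "sm_" + std::to_string(ComputeCapability);
  }
};

// Works for both kinds of device because the result is owned by the caller.
bool archMatches(const GenericDevice &Device, std::string_view Wanted) {
  assert(!Wanted.empty() && "expected a non-empty architecture name");
  const std::string Arch = Device.getArch();
  return std::string_view(Arch) == Wanted;
}
```

Under these assumptions, returning by value costs a copy for the stored case but keeps the overridden signature valid for the computed case.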

arsenm added inline comments. Dec 28 2022, 7:26 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1696	That doesn’t mean it should be done there
2329	That defeats the point of the other triple fields
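
To make the disagreement on line 2329 concrete, here is a small sketch of comparing only the architecture versus comparing all triple components. It deliberately does not use `llvm::Triple`; `SimpleTriple`, `parseTriple`, `sameArch`, and `sameTriple` are hypothetical names, and only the first three components are handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a target triple: arch-vendor-os.
struct SimpleTriple {
  std::string Arch, Vendor, OS;
};

// Splits "arch-vendor-os" on '-'. Missing components are left empty.
SimpleTriple parseTriple(const std::string &Text) {
  assert(!Text.empty() && "expected a triple string");
  SimpleTriple T;
  const std::string::size_type First = Text.find('-');
  T.Arch = Text.substr(0, First);
  if (First == std::string::npos)
    return T;
  const std::string::size_type Second = Text.find('-', First + 1);
  T.Vendor = Text.substr(First + 1, Second == std::string::npos
                                        ? std::string::npos
                                        : Second - First - 1);
  if (Second == std::string::npos)
    return T;
  T.OS = Text.substr(Second + 1);
  return T;
}

// Arch-only check, in the spirit of what `getTripleArch()` allows.
bool sameArch(const SimpleTriple &A, const SimpleTriple &B) {
  return A.Arch == B.Arch;
}

// Stricter check that also looks at the vendor and OS fields.
bool sameTriple(const SimpleTriple &A, const SimpleTriple &B) {
  return A.Arch == B.Arch && A.Vendor == B.Vendor && A.OS == B.OS;
}
```

With this sketch, `amdgcn-amd-amdhsa` and `amdgcn-amd-amdpal` pass the arch-only check but fail the full comparison, which is the distinction the reviewers are debating.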

jhuber6 added inline comments. Dec 28 2022, 7:33 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1679	I don't think we need it since we should be linking only the object files to a shared object. Probably copied from the `LinkerWrapper`, I think it should probably be removed there as well.

tianshilei1992 added inline comments. Dec 28 2022, 8:17 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1696	this is override function

tianshilei1992 added inline comments. Dec 28 2022, 8:22 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2329	yes, and then when we compare, we have to get arch again, which I don’t see the point.

In D140720#4018224, @tschuett wrote:

Instead of invoking lld and copying the flags, could you invoke clang instead? The flags may change over time. Rust calls cc for linking.

I think invoking lld directly should be fine. It's the only linker that can link AMDGPU images and we already call it directly in the clang-linker-wrapper tool. To use clang you would need to pass --target=amdgcn-amd-amdhsa, which most likely isn't compatible between different user cc's, so we can't use that trick. If you do use clang it just gives this invocation, which probably won't change much in the future:

"ld.lld" "device.o" "-shared" "-o" "a.out"

@arsenm Not sure if/how we should deal with the feature flags (xnack, etc.). I'll leave that up to AMD people to look into. This should give us basic support, tested with OpenMC on a gfx90a.

Rust does not really care whether cc is clang or gcc. My point is that it is safer to invoke clang and let it manage the link step.

I will update the nits I marked as done but didn't comment on. The ones I commented on are kept for "follow ups", assuming people don't object. Anything else?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1673	Let's keep that question for a follow up. I honestly don't know. Basic support should work w/o all of this. AMD folks will look into it (and how to propagate the information).
1688	Not my department.
1696	Can't call a StringRef w/o refactoring stuff. Once we JIT per kernel, not per module, this might become interesting, for now, it's less of an issue.
2329	Similar answer as to above. Once we look at more than the arch + sub-arch, e.g., features, we might want to restructure this. I'll assume AMD (@jhuber6) will take a look how to do that.

In D140720#4018590, @tschuett wrote:

Rust does not really care whether cc is clang or gcc. My point is that it is safer to invoke clang and let it manage the link step.

FWIW, what we really want is to use JITLink or lld as a library (neither of which we can really do right now). If people think using clang is better than lld directly, I can change it for sure.

Halide is apparently using lld as a library: https://reviews.llvm.org/D140726

In D140720#4018603, @tschuett wrote:

Halide is apparently using lld as a library: https://reviews.llvm.org/D140726

I even started using that interface (`lld::elf::link`) but:

It doesn't solve our problem of materializing the two files for the link step, and
It requires the lld libraries, which we don't build by default and there is no guarantee the system lld (e.g., distributed with rocm) provides those.
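
For the two files mentioned above, the patch creates a temporary output file ending in `.out` and derives the input file name by dropping the last two characters (`pop_back_n(2)`), which leaves a name ending in `.o`. A standalone sketch of that derivation, using `std::string` instead of `SmallString`; the function name `deriveLinkerInputPath` is hypothetical:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the `pop_back_n(2)` step in the diff:
// "<name>.out" minus its last two characters is "<name>.o".
std::string deriveLinkerInputPath(std::string OutputPath) {
  const std::string Suffix = ".out";
  // The trick only works if the output path really ends in ".out".
  assert(OutputPath.size() >= Suffix.size() &&
         OutputPath.compare(OutputPath.size() - Suffix.size(), Suffix.size(),
                            Suffix) == 0 &&
         "expected an output path ending in .out");
  OutputPath.erase(OutputPath.size() - 2);
  return OutputPath;
}
```

Note that in this scheme only the output file is created by the temporary-file call; the derived input name is not reserved up front, so a collision is possible in principle.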

In D140720#4018576, @jdoerfert wrote:

@arsenm Not sure if/how we should deal with the feature flags (xnack, etc.). I'll leave that up to AMD people to look into. This should give us basic support, tested with OpenMC on a gfx90a.

Feature flags are not needed if you don't need portability. I do have a piece of code copied from somewhere that I don't remember: https://github.com/shiltian/llvm-project/commit/02bc7effccc6ff2f5ab3fe5218336094c0485766#diff-321c2038035972ad4994ff9d85b29950ba72c08a79891db5048b8f5d46915314R432.

Addressed comments, I think.

Harbormaster completed remote builds in B205581: Diff 486142. Jan 3 2023, 6:54 PM

Pick the right patch file.

Harbormaster completed remote builds in B205582: Diff 486143. Jan 3 2023, 6:58 PM

JIT part looks good to me. Not sure whether invoking an external tool will raise more concerns.

This revision is now accepted and ready to land. Jan 3 2023, 7:56 PM

Closed by commit rGc63dced93b46: [OpenMP][JIT] Introduce support for AMDGPU (authored by jdoerfert). Jan 4 2023, 10:15 AM

This revision was automatically updated to reflect the committed changes.

jdoerfert added a commit: rGc63dced93b46: [OpenMP][JIT] Introduce support for AMDGPU.

Herald added a project: Restricted Project. Jan 4 2023, 10:15 AM

Herald added a subscriber: openmp-commits.

Revision Contents

Path: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
Size: 73 lines

Diff 486335

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

[12 lines not shown]
 #include <atomic>
 #include <cassert>
 #include <cstddef>
 #include <deque>
 #include <hsa.h>
 #include <hsa_ext_amd.h>
 #include <mutex>
 #include <string>
+#include <system_error>
 #include <unistd.h>
 #include <unordered_map>

 #include "Debug.h"
 #include "DeviceEnvironment.h"
 #include "GlobalHandler.h"
 #include "PluginInterface.h"
 #include "Utilities.h"
 #include "UtilitiesRTL.h"

+#include "llvm/ADT/SmallString.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/Frontend/OpenMP/OMPConstants.h"
 #include "llvm/Frontend/OpenMP/OMPGridValues.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Support/Program.h"
+#include "llvm/Support/raw_ostream.h"

 namespace llvm {
 namespace omp {
 namespace target {
 namespace plugin {

 /// Forward declarations for all specialized data structures.
 struct AMDGPUKernelTy;
[1,471 lines not shown; context: struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {]
   ~AMDGPUDeviceTy() {}

   /// Initialize the device, its resources and get its properties.
   Error initImpl(GenericPluginTy &Plugin) override {
     // First setup all the memory pools.
     if (auto Err = initMemoryPools())
       return Err;

+    char GPUName[64];
+    if (auto Err = getDeviceAttr(HSA_AGENT_INFO_NAME, GPUName))
+      return Err;
+    Arch = GPUName;
+
     // Get the wavefront size.
     uint32_t WavefrontSize = 0;
     if (auto Err = getDeviceAttr(HSA_AGENT_INFO_WAVEFRONT_SIZE, WavefrontSize))
       return Err;
     GridValues.GV_Warp_Size = WavefrontSize;

     // Load the grid values dependending on the wavefront.
     if (WavefrontSize == 32)
[91 lines not shown; context: Error deinitImpl() override {]
     }

     // Invalidate agent reference.
     Agent = {0};

     return Plugin::success();
   }

+  Expected<std::unique_ptr<MemoryBuffer>>
+  doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const override {
+
+    // TODO: We should try to avoid materialization but there seems to be no
+    // good linker interface w/o file i/o.
+    SmallString<128> LinkerOutputFilePath;
+    std::error_code EC = sys::fs::createTemporaryFile(
+        "amdgpu-pre-link-jit", ".out", LinkerOutputFilePath);
+    if (EC)
+      return createStringError(EC,
+                               "Failed to create temporary file for linker");
+
+    SmallString<128> LinkerInputFilePath = LinkerOutputFilePath;
+    LinkerInputFilePath.pop_back_n(2);
+
+    auto FD = raw_fd_ostream(LinkerInputFilePath.data(), EC);
+    if (EC)
+      return createStringError(EC, "Failed to open temporary file for linker");
+    FD.write(MB->getBufferStart(), MB->getBufferSize());
+    FD.close();
+
+    const auto &ErrorOrPath = sys::findProgramByName("lld");
+    if (!ErrorOrPath)
+      return createStringError(inconvertibleErrorCode(),
+                               "Failed to find `lld` on the PATH.");
+
+    std::string LLDPath = ErrorOrPath.get();
+    INFO(OMP_INFOTYPE_PLUGIN_KERNEL, getDeviceId(),
+         "Using `%s` to link JITed amdgcn ouput.", LLDPath.c_str());
+
+    std::string MCPU = "-plugin-opt=mcpu=" + getArch();
+
+    StringRef Args[] = {LLDPath,
+                        "-flavor",
+                        "gnu",
+                        "--no-undefined",
+                        "-shared",
+                        MCPU,
+                        "-o",
+                        LinkerOutputFilePath.data(),
+                        LinkerInputFilePath.data()};
+
+    std::string Error;
+    int RC = sys::ExecuteAndWait(LLDPath, Args, std::nullopt, {}, 0, 0, &Error);
+    if (RC)
+      return createStringError(inconvertibleErrorCode(),
+                               "Linking optimized bitcode failed: %s",
+                               Error.c_str());
+
+    return std::move(
+        MemoryBuffer::getFileOrSTDIN(LinkerOutputFilePath.data()).get());
+  }
+
+  std::string getArch() const override { return Arch; }
+
   /// Allocate and construct an AMDGPU kernel.
   Expected<GenericKernelTy *>
   constructKernelEntry(const __tgt_offload_entry &KernelEntry,
                        DeviceImageTy &Image) override {

     Expected<OMPTgtExecModeFlags> ExecModeOrErr =
         getExecutionModeForKernel(KernelEntry.name, Image);
     if (!ExecModeOrErr)
[385 lines not shown; context: private:]
   AMDGPUEventManagerTy AMDGPUEventManager;

   /// Signal manager for AMDGPU signals.
   AMDGPUSignalManagerTy AMDGPUSignalManager;

   /// The agent handler corresponding to the device.
   hsa_agent_t Agent;

+  /// The GPU architecture.
+  std::string Arch;
+
   /// Reference to the host device.
   AMDHostDeviceTy &HostDevice;

   /// List of device packet queues.
   std::vector<AMDGPUQueueTy> Queues;
 };

 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
[212 lines not shown; context: Error deinitImpl() override {]
     if (auto Err = HostDevice->deinit())
       return Err;

     // Finalize the HSA runtime.
     hsa_status_t Status = hsa_shut_down();
     return Plugin::check(Status, "Error in hsa_shut_down: %s");
   }

+  Triple::ArchType getTripleArch() const override { return Triple::amdgcn; }
+
   /// Get the ELF code for recognizing the compatible image binary.
   uint16_t getMagicElfBits() const override { return ELF::EM_AMDGPU; }

   /// Check whether the image is compatible with an AMDGPU device.
   Expected<bool> isImageCompatible(__tgt_image_info *Info) const override {
     for (hsa_agent_t Agent : KernelAgents) {
       std::string Target;
       auto Err = utils::iterateAgentISAs(Agent, [&](hsa_isa_t ISA) {
[243 lines not shown]