This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add amdgpu-promote-pointer-kernargs pass
Abandoned · Public

Authored by hliao on Oct 31 2019, 1:25 PM.

Details

Reviewers

yaxunl
rampitec
arsenm
rjmccall
tra

Summary

Add a pass that promotes generic (flat) pointer kernel arguments to global ones, and enable it before the InferAddressSpaces pass.

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 40359
Build 40466: arc lint + arc unit

Event Timeline

hliao created this revision. Oct 31 2019, 1:25 PM

Herald added a project: Restricted Project. Oct 31 2019, 1:25 PM

Herald added subscribers: llvm-commits, hiraditya, t-tye and 8 others.

Harbormaster completed remote builds in B40359: Diff 227329. Oct 31 2019, 1:26 PM

Clang should just be directly emitting the arguments with the global address space in the first place. It already has support for coercing argument types per calling convention and this is no different

This revision now requires changes to proceed. Oct 31 2019, 1:28 PM

yaxunl added reviewers: rjmccall, tra. Oct 31 2019, 2:18 PM

In D69679#1729204, @arsenm wrote:

Clang should just be directly emitting the arguments with the global address space in the first place. It already has support for coercing argument types per calling convention and this is no different

In the CUDA/HIP languages, all pointer-type kernel arguments are in the default address space, so it is reasonable to emit them as default-address-space pointers in IR. Translating them to the global address space in Clang codegen is not necessary; it is better done in the backend, as NVPTX does:

https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp#L30

Adding @tra and @rjmccall for comments on where this should be implemented.

In D69679#1729328, @yaxunl wrote:

In the CUDA/HIP languages, all pointer-type kernel arguments are in the default address space, so it is reasonable to emit them as default-address-space pointers in IR. Translating them to the global address space in Clang codegen is not necessary; it is better done in the backend, as NVPTX does:

https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp#L30

Adding @tra and @rjmccall for comments on where this should be implemented.

The language address space doesn't need to match the calling convention argument type. You can coerce them to global for codegen purposes

@arsenm has a point. We can do it in clang, and it seems to be a better long-term solution compared to patching-up the inputs' AS that we've done for NVPTX and, now, AMDGPU.

On the other hand I wonder whether the benefit is worth the effort. It moves responsibility of coercing pointers' AS from LLVM to the end-users with not much to show for it. I think the only real issue I see with the status quo and this patch is that this is the second instance of a trivial pass that does this kind of job. Perhaps we can make it into a generic IR pass which would be able to coerce the pointers to the right address space and use it for both AMDGPU and NVPTX.

If we do want to change the IR-level calling convention for the kernels, clang would not be the only place that would need to adapt to that change. We will need to think about transitioning existing external users, too (e.g. XLA in TensorFlow & JAX, julia). We may need to keep the promote-pointers-to-global-AS pass around for a while until the LLVM users have a chance to change their code to pass pointers using correct address space.
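The idea of a single pass shared by AMDGPU and NVPTX can be sketched with a toy model. This is not LLVM code: `Signature`, `TargetPolicy`, `promotedAddrSpaces`, `amdgpuPolicy` and `nvptxPolicy` are hypothetical names invented for illustration, and kernel detection is reduced to a calling-convention string for AMDGPU and a boolean standing in for NVPTX's kernel annotation. The address-space numbers (0 for generic/flat, 1 for global) match both targets.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical, simplified view of a function signature.
struct Signature {
  std::string CallingConv;           // e.g. "amdgpu_kernel"
  bool HasKernelAnnotation;          // stand-in for NVPTX kernel metadata
  std::vector<int> PtrArgAddrSpaces; // address space of each pointer argument
};

// What a target would have to tell a shared promotion pass.
struct TargetPolicy {
  int GenericAS;
  int GlobalAS;
  std::function<bool(const Signature &)> IsKernel;
};

// Returns the address spaces the pointer arguments may be treated as:
// generic becomes global for kernels, everything else is left alone.
std::vector<int> promotedAddrSpaces(const Signature &S,
                                    const TargetPolicy &T) {
  std::vector<int> Result = S.PtrArgAddrSpaces;
  if (!T.IsKernel(S))
    return Result; // Non-kernel functions can receive any pointer.
  for (int &AS : Result)
    if (AS == T.GenericAS)
      AS = T.GlobalAS;
  return Result;
}

// Assumed policies: both targets use 0 for generic and 1 for global.
TargetPolicy amdgpuPolicy() {
  return {0, 1, [](const Signature &S) {
            return S.CallingConv == "amdgpu_kernel";
          }};
}

TargetPolicy nvptxPolicy() {
  return {0, 1, [](const Signature &S) { return S.HasKernelAnnotation; }};
}

} // namespace sketch
```

In this model the only per-target difference is how a kernel is recognized, which suggests why a shared pass looks feasible; a real pass would also have to rewrite the uses of each argument.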

Another issue I'm concerned about is the interaction with the (still out of tree), hacky pass to handle kernels calling other kernels as a normal function for OpenCL. This is another area that clang needs codegen work to really emit the right IR/ABI for kernels. If we blindly rewrite arguments with this hack in place, we risk incorrectly rewriting the callable function form

In D69679#1729474, @arsenm wrote:

Another issue I'm concerned about is the interaction with the (still out of tree), hacky pass to handle kernels calling other kernels as a normal function for OpenCL. This is another area that clang needs codegen work to really emit the right IR/ABI for kernels. If we blindly rewrite arguments with this hack in place, we risk incorrectly rewriting the callable function form

I think OpenCL still disallows generic pointer arguments to kernels, although I wouldn't like to rely on this fact for correctness

In D69679#1729466, @tra wrote:

@arsenm has a point. We can do it in clang, and it seems to be a better long-term solution compared to patching-up the inputs' AS that we've done for NVPTX and, now, AMDGPU.

If we do want to change the IR-level calling convention for the kernels, clang would not be the only place that would need to adapt to that change. We will need to think about transitioning existing external users, too (e.g. XLA in TensorFlow & JAX, julia). We may need to keep the promote-pointers-to-global-AS pass around for a while until the LLVM users have a chance to change their code to pass pointers using correct address space.

The AMDGPU backend can already handle pointer-type kernel arguments in the default address space; this pass is mainly a performance optimization.

In D69679#1729684, @yaxunl wrote:

The AMDGPU backend can already handle pointer-type kernel arguments in the default address space; this pass is mainly a performance optimization.

Yes, but it’s ultimately a workaround for not producing the correct pointer type in the first place. It adds more passes and intermediate instructions that could be avoided

A different approach (D69826) is taken to address the same issue.

Revision Contents

Path                                                              Size
llvm/lib/Target/AMDGPU/AMDGPU.h                                   3 lines
llvm/lib/Target/AMDGPU/AMDGPUPromotePointerKernArgsToGlobal.cpp   72 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp                    4 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt                             1 line
llvm/test/CodeGen/AMDGPU/promote-pointer-kernargs.ll              13 lines

Diff 227329

llvm/lib/Target/AMDGPU/AMDGPU.h

 [...]
 extern char &AMDGPUOpenCLEnqueuedBlockLoweringID;

 void initializeGCNRegBankReassignPass(PassRegistry &);
 extern char &GCNRegBankReassignID;

 void initializeGCNNSAReassignPass(PassRegistry &);
 extern char &GCNNSAReassignID;

+FunctionPass *createAMDGPUPromotePointerKernArgsToGlobalPass();
+void initializeAMDGPUPromotePointerKernArgsToGlobalPass(PassRegistry &);
+
 namespace AMDGPU {
 enum TargetIndex {
   TI_CONSTDATA_START,
   TI_SCRATCH_RSRC_DWORD0,
   TI_SCRATCH_RSRC_DWORD1,
   TI_SCRATCH_RSRC_DWORD2,
   TI_SCRATCH_RSRC_DWORD3
 };
 [...]

llvm/lib/Target/AMDGPU/AMDGPUPromotePointerKernArgsToGlobal.cpp

This file was added.

//===-- AMDGPUPromotePointerKernArgsToGlobal.cpp - Promote pointer args ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
/// \file
/// Generic pointer kernel arguments need promoting to global ones.
//
//===----------------------------------------------------------------------===//

#include "AMDGPU.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Pass.h"

using namespace llvm;

#define DEBUG_TYPE "amdgpu-promote-pointer-kernargs"

namespace {

class AMDGPUPromotePointerKernArgsToGlobal : public FunctionPass {
public:
  static char ID;

  AMDGPUPromotePointerKernArgsToGlobal() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override;
};

} // End anonymous namespace

char AMDGPUPromotePointerKernArgsToGlobal::ID = 0;

INITIALIZE_PASS(AMDGPUPromotePointerKernArgsToGlobal, DEBUG_TYPE,
                "Promote pointer kernel arguments to global", false, false)

bool AMDGPUPromotePointerKernArgsToGlobal::runOnFunction(Function &F) {
  // Skip non-entry functions.
  if (F.getCallingConv() != CallingConv::AMDGPU_KERNEL)
    return false;

  auto &Entry = F.getEntryBlock();
  IRBuilder<> IRB(&Entry, Entry.begin());

  bool Changed = false;
  for (auto &Arg : F.args()) {
    auto *PtrTy = dyn_cast<PointerType>(Arg.getType());
    if (!PtrTy || PtrTy->getPointerAddressSpace() != AMDGPUAS::FLAT_ADDRESS)
      continue;

    auto *GlobalPtr =
        IRB.CreateAddrSpaceCast(&Arg,
                                PointerType::get(PtrTy->getPointerElementType(),
                                                 AMDGPUAS::GLOBAL_ADDRESS),
                                Arg.getName());
    auto *NewFlatPtr = IRB.CreateAddrSpaceCast(GlobalPtr, PtrTy, Arg.getName());
    Arg.replaceAllUsesWith(NewFlatPtr);
    // replaceAllUsesWith also rewrote GlobalPtr's operand; restore it.
    cast<Instruction>(GlobalPtr)->setOperand(0, &Arg);
    Changed = true;
  }

  return Changed;
}

FunctionPass *llvm::createAMDGPUPromotePointerKernArgsToGlobalPass() {
  return new AMDGPUPromotePointerKernArgsToGlobal();
}
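The rewrite in `runOnFunction` above has one subtle step: `replaceAllUsesWith` also redirects the operand of the freshly created flat-to-global cast, so that operand has to be pointed back at the argument afterwards. The following toy model, which is not LLVM code, reproduces this with plain pointers. Every name in it (`toy::Value`, `toy::Function`, `promoteKernelArgs`, the `.global`/`.flat` suffixes) is invented for illustration, and each value is assumed to have at most one operand.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace toy {

constexpr unsigned FlatAS = 0;   // mirrors AMDGPUAS::FLAT_ADDRESS
constexpr unsigned GlobalAS = 1; // mirrors AMDGPUAS::GLOBAL_ADDRESS

// A value with at most one operand; arguments have none.
struct Value {
  std::string Name;
  unsigned AddrSpace;
  Value *Operand;
};

struct Function {
  bool IsKernel = false;
  std::vector<std::unique_ptr<Value>> Args;
  std::vector<std::unique_ptr<Value>> Insts;

  Value *addArg(std::string Name, unsigned AS) {
    Args.push_back(
        std::make_unique<Value>(Value{std::move(Name), AS, nullptr}));
    return Args.back().get();
  }

  Value *addInst(std::string Name, unsigned AS, Value *Op) {
    Insts.push_back(std::make_unique<Value>(Value{std::move(Name), AS, Op}));
    return Insts.back().get();
  }

  // Redirect every instruction that uses Old so that it uses New.
  void replaceAllUsesWith(Value *Old, Value *New) {
    for (auto &I : Insts)
      if (I->Operand == Old)
        I->Operand = New;
  }
};

// Wrap each flat kernel argument in a flat -> global -> flat cast pair.
bool promoteKernelArgs(Function &F) {
  if (!F.IsKernel)
    return false;
  bool Changed = false;
  for (auto &Arg : F.Args) {
    if (Arg->AddrSpace != FlatAS)
      continue;
    Value *Global = F.addInst(Arg->Name + ".global", GlobalAS, Arg.get());
    Value *Flat = F.addInst(Arg->Name + ".flat", FlatAS, Global);
    F.replaceAllUsesWith(Arg.get(), Flat);
    // The call above also rewired Global to use Flat (a cycle); undo that.
    Global->Operand = Arg.get();
    Changed = true;
  }
  return Changed;
}

} // namespace toy
```

After the rewrite, each original user reaches the argument through the two casts, which is the shape a later address-space inference step can fold down to direct global accesses.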

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

 extern "C" void LLVMInitializeAMDGPUTarget() {
 [...]
   initializeAMDGPUAnnotateUniformValuesPass(*PR);
   initializeAMDGPUArgumentUsageInfoPass(*PR);
   initializeAMDGPUAtomicOptimizerPass(*PR);
   initializeAMDGPULowerKernelArgumentsPass(*PR);
   initializeAMDGPULowerKernelAttributesPass(*PR);
   initializeAMDGPULowerIntrinsicsPass(*PR);
   initializeAMDGPUOpenCLEnqueuedBlockLoweringPass(*PR);
   initializeAMDGPUPromoteAllocaPass(*PR);
+  initializeAMDGPUPromotePointerKernArgsToGlobalPass(*PR);
   initializeAMDGPUCodeGenPreparePass(*PR);
   initializeAMDGPUPropagateAttributesEarlyPass(*PR);
   initializeAMDGPUPropagateAttributesLatePass(*PR);
   initializeAMDGPURewriteOutArgumentsPass(*PR);
   initializeAMDGPUUnifyMetadataPass(*PR);
   initializeSIAnnotateControlFlowPass(*PR);
   initializeSIInsertWaitcntsPass(*PR);
   initializeSIModeRegisterPass(*PR);
 [...]
       [AMDGPUAA, LibCallSimplify, &Opt, this](const PassManagerBuilder &,
 [...]
         PM.add(llvm::createAMDGPUUseNativeCallsPass());
         if (LibCallSimplify)
           PM.add(llvm::createAMDGPUSimplifyLibCallsPass(Opt, this));
       });

   Builder.addExtension(
       PassManagerBuilder::EP_CGSCCOptimizerLate,
       [](const PassManagerBuilder &, legacy::PassManagerBase &PM) {
+        // Promote generic pointer kernel arguments to global ones.
+        PM.add(llvm::createAMDGPUPromotePointerKernArgsToGlobalPass());
+
         // Add infer address spaces pass to the opt pipeline after inlining
         // but before SROA to increase SROA opportunities.
         PM.add(createInferAddressSpacesPass());

         // This should run after inlining to have any chance of doing anything,
         // and before other cleanup optimizations.
         PM.add(createAMDGPULowerKernelAttributesPass());
       });
 [...]

llvm/lib/Target/AMDGPU/CMakeLists.txt

 add_llvm_target(AMDGPUCodeGen
 [...]
   AMDGPULowerKernelAttributes.cpp
   AMDGPUMachineCFGStructurizer.cpp
   AMDGPUMachineFunction.cpp
   AMDGPUMachineModuleInfo.cpp
   AMDGPUMacroFusion.cpp
   AMDGPUMCInstLower.cpp
   AMDGPUOpenCLEnqueuedBlockLowering.cpp
   AMDGPUPromoteAlloca.cpp
+  AMDGPUPromotePointerKernArgsToGlobal.cpp
   AMDGPUPropagateAttributes.cpp
   AMDGPURegisterBankInfo.cpp
   AMDGPURegisterInfo.cpp
   AMDGPURewriteOutArguments.cpp
   AMDGPUSubtarget.cpp
   AMDGPUTargetMachine.cpp
   AMDGPUTargetObjectFile.cpp
   AMDGPUTargetTransformInfo.cpp
 [...]

llvm/test/CodeGen/AMDGPU/promote-pointer-kernargs.ll

This file was added.

; RUN: opt -O1 -S -o - -mtriple=amdgcn %s | FileCheck %s

; CHECK-LABEL: promote_pointer_kernargs
; CHECK-NEXT: addrspacecast i32* %{{.*}} to i32 addrspace(1)*
; CHECK-NEXT: addrspacecast i32* %{{.*}} to i32 addrspace(1)*
; CHECK-NEXT: load i32, i32 addrspace(1)*
; CHECK-NEXT: store i32 %{{.*}}, i32 addrspace(1)*
; CHECK-NEXT: ret void
define amdgpu_kernel void @promote_pointer_kernargs(i32* %out, i32* %in) {
  %v = load i32, i32* %in
  store i32 %v, i32* %out
  ret void
}