This is an archive of the discontinued LLVM Phabricator instance.


Differential D38978

[OpenMP] Enable the lowering of implicitly shared variables in OpenMP GPU-offloaded target regions to the GPU shared memory
Abandoned · Public

Authored by gtbercea on Oct 16 2017, 2:29 PM.

Details

Reviewers

hfinkel
carlo.bertolli
arpith-jacob
ABataev
caomhin

Summary

This patch is part of the development effort to add support in the current OpenMP GPU offloading implementation for implicitly sharing variables between a target region executed by the team master thread and the worker threads within that team.

This patch is the second of three required for successfully performing the implicit sharing of master thread variables with the worker threads within a team:
- Patch D38976 extends the Clang code generation with code that handles shared variables.
- Patch (coming soon) extends the functionality of libomptarget to maintain a list of references to shared variables.

This patch adds a shared memory stack to the prolog of the kernel function representing the device-offloaded OpenMP target region. The new passes, along with the changes to existing ones, ensure that any OpenMP variable which needs to be shared across several threads is allocated in this new stack, in the shared memory of the device. This patch covers the case of sharing variables from the master thread to the worker threads:

#pragma omp target
{
   // master thread only
   int v;
   #pragma omp parallel
   {
      // worker threads
      // use v
   }
}
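As a rough host-side illustration of what "allocating shared variables in a shared memory stack" could mean, here is a minimal sketch of a fixed-size depot with a bump pointer. The `SharedDepot` type, its capacity, and the `allocate` helper are hypothetical names chosen for this example; they are not taken from the patch, which does this work in the NVPTX backend rather than in user code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a per-kernel shared depot: a fixed buffer plus a
// bump pointer. Names and the capacity are illustrative assumptions.
struct SharedDepot {
  static constexpr std::size_t Capacity = 256;
  alignas(16) unsigned char Storage[Capacity];
  std::size_t Top = 0;

  // Returns a pointer to Size bytes aligned to Align (a power of two),
  // or nullptr when the request does not fit in the depot.
  void *allocate(std::size_t Size, std::size_t Align) {
    assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two");
    std::size_t Start = (Top + Align - 1) & ~(Align - 1);
    if (Start > Capacity || Size > Capacity - Start)
      return nullptr;
    Top = Start + Size;
    return Storage + Start;
  }
};
```

In this model, a variable such as `v` in the snippet above would get its storage from `allocate(sizeof(int), alignof(int))` so that every thread in the team can reach it through the same address.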

Diff Detail

Repository: rL LLVM

Event Timeline

gtbercea created this revision. · Oct 16 2017, 2:29 PM

Herald added subscribers: mgorny, jholewinski. · Oct 16 2017, 2:29 PM

gtbercea mentioned this in D38976: [OpenMP] Add implicit data sharing support when offloading to NVIDIA GPUs using OpenMP device offloading. · Oct 16 2017, 2:30 PM

Please add tests for the cases where such local->shared conversion should and should not happen.
I would appreciate if you could add details on what exactly your passes are supposed to move to shared memory.

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e. can ptxas successfully optimize unused shared memory away?

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1752	Nit: the name should end with `S`, as the L in `SPL` was for the 'local' address space, which then gets converted to generic AS. In your case it will be in shared space, hence S would be more appropriate.
lib/Target/NVPTX/NVPTXAssignValidGlobalNames.cpp
68 ↗	(On Diff #119210)	The name cleanup changes in this file should probably be committed by themselves as they have nothing to do with the rest of the patch.
lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
10	Please add details about what the pass is supposed to do.

Eliminate variable and function name clean-up. That has been moved into a separate patch: D39005

gtbercea marked an inline comment as done. · Oct 17 2017, 8:28 AM

Diffusion mentioned this in rL318773: [OpenMP] Add implicit data sharing support when offloading to NVIDIA GPUs using…. · Nov 21 2017, 7:55 AM

Add regression tests and allow for shared memory lowering to be disabled at function level.

gtbercea marked 2 inline comments as done. · Nov 24 2017, 3:24 PM

Hahnfeld edited subscribers, added: llvm-commits; removed: cfe-commits. · Nov 26 2017, 6:45 AM

ping

@tra @hfinkel

ping

guansong added a subscriber: guansong. · Dec 6 2017, 2:59 PM

yaxunl added a subscriber: yaxunl. · Dec 6 2017, 5:14 PM

Here is a question: do we require the alloca size to be a compile-time constant?

hfinkel added inline comments. · Dec 11 2017, 7:53 PM

lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1737	Line too long.
lib/Target/NVPTX/NVPTXFrameLowering.cpp
71	In other places in this patch you refer explicitly to OpenMP, so it probably makes sense to say "the OpenMP runtime" here as well (but just saying "the runtime" seems potentially confusing).
85	Line too long.
lib/Target/NVPTX/NVPTXLowerSharedFrameIndicesPass.cpp
12	Can you be more specific? I believe that we fixed PEI to handle virtual registers, so if that's the only motivation, can we use the regular PEI now?
lib/Target/NVPTX/NVPTXRegisterInfo.cpp
134	Line too long.
lib/Target/NVPTX/NVPTXUtilities.cpp
321	Can't you use PointerMayBeCaptured (include/llvm/Analysis/CaptureTracking.h) instead of this function? If so, please do.

Use LLVM function for checking if pointer is stored.

gtbercea marked 5 inline comments as done. · Dec 19 2017, 4:23 AM

gtbercea marked an inline comment as done. · Dec 19 2017, 9:31 AM

ping

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e. can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add few comments in the tests explaining what they do.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with has-nvptx-shared-depot attribute. Works for me.

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
99	Nit: `return false` would match the intent better.
lib/Target/NVPTX/NVPTXRegisterInfo.td
75	Line too long.
test/CodeGen/NVPTX/insert-shared-depot.ll
5–6	You could put common checks under the same label (e.g. `CHECK`) and run tests with `-check-prefixes=PTX32,CHECK`
30	'LABEL' is not a check-prefix and `@linsert_shared_depot` is not this function's name, so I'm puzzled what this line is supposed to do. Did you intend `<prefix>-LABEL: @kernel` ? This appears in all the test cases in the patch.

Address comments.

Harbormaster completed remote builds in B13540: Diff 128725. · Jan 5 2018, 2:54 AM

In D38978#967485, @tra wrote:

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e. can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add few comments in the tests explaining what they do.

Done.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with has-nvptx-shared-depot attribute. Works for me.

That's right.

test/CodeGen/NVPTX/insert-shared-depot.ll
30	This is modeled after the lower-alloca.ll test which has a similar label. The label is always equal to the name of the test file. In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"

In D38978#968222, @gtbercea wrote:

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

Good luck with that. IMO if your kernel requires all shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet is that you will eventually end up limiting shared memory use to a fairly small fraction of it.

Given that impact is limited to explicitly annotated functions only, this lack of tune-ability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

test/CodeGen/NVPTX/insert-shared-depot.ll
30	> This is modeled after the lower-alloca.ll test which has a similar label.
lower-alloca.ll indeed has the same problem.
> The label is always equal to the name of the test file.
I don't think FileCheck has such a feature. Nor do I see anything matching this description in the FileCheck documentation. Nor does it work. See below.
> In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"
The line does not check anything right now. In this test FileCheck only pays attention to lines that have CHECK or PTX64/PTX32. This line contains neither and is ignored. You can do an experiment -- replace the line with `; LABEL: this should never match` and run the test. I've tried that on lower-alloca.ll and the test, as expected, passes regardless of the nonsense I put after the `LABEL:`.
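To make the point about ignored lines concrete, here is a deliberately simplified model of how a checker could decide whether a line is a directive for the active prefixes. It is not FileCheck's actual implementation: the `isDirective` helper is hypothetical, it knows only a subset of the directive suffixes, and it ignores details such as what character precedes the prefix.

```cpp
#include <string>
#include <vector>

// Simplified model, not FileCheck's real logic: a line is treated as a
// directive only if it contains an active prefix followed by one of a few
// known suffixes and a colon, e.g. "CHECK:" or "PTX64-LABEL:".
bool isDirective(const std::string &Line,
                 const std::vector<std::string> &Prefixes) {
  static const char *const Suffixes[] = {"",     "-NEXT", "-SAME",
                                         "-NOT", "-DAG",  "-LABEL"};
  for (const std::string &P : Prefixes)
    for (const char *S : Suffixes)
      if (Line.find(P + S + ":") != std::string::npos)
        return true;
  return false;
}
```

Under this model, with prefixes `CHECK`, `PTX32` and `PTX64`, a line such as `; LABEL: @linsert_shared_depot` contains no active prefix and is skipped, whereas `; CHECK-LABEL: @kernel` is picked up.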

In D38978#968565, @tra wrote:

In D38978#968222, @gtbercea wrote:

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

Good luck with that. IMO if your kernel requires all shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet is that you will eventually end up limiting shared memory use to a fairly small fraction of it.

I completely agree, this scheme will be efficient only when modest amounts of shared memory are required, for larger memory footprints, a global memory scheme will be used instead.

Given that impact is limited to explicitly annotated functions only, this lack of tune-ability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

I'll choose a sensible default for the cut-off point/condition and make it tune-able by the user once we have the global memory scheme in place.
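The cut-off discussed in this exchange could look roughly like the sketch below. The `DepotSpace` enum, the `chooseDepotSpace` helper and the byte limits are hypothetical, made up for illustration; the patch itself does not implement a global memory fallback or a tunable threshold.

```cpp
#include <cstddef>

// Hypothetical address-space choice for the depot.
enum class DepotSpace { Shared, Global };

// Hypothetical policy: keep the depot in shared memory only while it fits
// under a user-tunable limit; otherwise fall back to global memory.
constexpr DepotSpace chooseDepotSpace(std::size_t DepotBytes,
                                      std::size_t SharedLimitBytes) {
  return DepotBytes <= SharedLimitBytes ? DepotSpace::Shared
                                        : DepotSpace::Global;
}
```

A small limit keeps more shared memory free for occupancy, which is the concern raised above; the right default would have to be measured rather than guessed.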

Remove LABEL from tests and add TODO comment for shared memory limit.

Not my area of expertise

Herald added a subscriber: hintonda. · Jan 28 2018, 2:20 AM

ping

Alternative solution was implemented.

Herald added a subscriber: jdoerfert. · Jun 12 2019, 10:13 AM

Revision Contents

Path	Size
include/llvm/CodeGen/TargetPassConfig.h	4 lines
lib/CodeGen/TargetPassConfig.cpp	5 lines
lib/Target/NVPTX/CMakeLists.txt	2 lines
lib/Target/NVPTX/NVPTX.h	2 lines
lib/Target/NVPTX/NVPTXAsmPrinter.cpp	19 lines
lib/Target/NVPTX/NVPTXFrameLowering.cpp	25 lines
lib/Target/NVPTX/NVPTXFunctionDataSharing.h	37 lines
lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp	127 lines
lib/Target/NVPTX/NVPTXInstrInfo.td	4 lines
lib/Target/NVPTX/NVPTXLowerAlloca.cpp	44 lines
lib/Target/NVPTX/NVPTXLowerSharedFrameIndicesPass.cpp	291 lines
lib/Target/NVPTX/NVPTXRegisterInfo.h	2 lines
lib/Target/NVPTX/NVPTXRegisterInfo.cpp	4 lines
lib/Target/NVPTX/NVPTXRegisterInfo.td	5 lines
lib/Target/NVPTX/NVPTXTargetMachine.cpp	19 lines
lib/Target/NVPTX/NVPTXUtilities.h	4 lines
lib/Target/NVPTX/NVPTXUtilities.cpp	48 lines
test/CodeGen/NVPTX/insert-shared-depot.ll	42 lines
test/CodeGen/NVPTX/lower-alloca-shared.ll	31 lines
test/CodeGen/NVPTX/no-shared-depot.ll	40 lines
test/CodeGen/NVPTX/nvptx-function-data-sharing.ll	31 lines

Diff 124243

include/llvm/CodeGen/TargetPassConfig.h

[... 349 lines not shown ...] protected:
   virtual bool addPreISel() {
     return true;
   }

   /// addMachineSSAOptimization - Add standard passes that optimize machine
   /// instructions in SSA form.
   virtual void addMachineSSAOptimization();

+  /// Add passes that lower variables to a
+  /// particular memory type.
+  virtual void addMachineSSALowering() {}
+
   /// Add passes that optimize instruction level parallelism for out-of-order
   /// targets. These passes are run while the machine code is still in SSA
   /// form, so they can use MachineTraceMetrics to control their heuristics.
   ///
   /// All passes added here should preserve the MachineDominatorTree,
   /// MachineLoopInfo, and MachineTraceMetrics analyses.
   virtual bool addILPOpts() {
     return false;
[... 82 lines not shown ...]

lib/CodeGen/TargetPassConfig.cpp

[... 809 lines not shown ...] void TargetPassConfig::addMachinePasses() {

   // Expand pseudo-instructions emitted by ISel.
   addPass(&ExpandISelPseudosID);

   // Add passes that optimize machine instructions in SSA form.
   if (getOptLevel() != CodeGenOpt::None) {
     addMachineSSAOptimization();
   } else {
+    // Ensure lowering to the appropriate memory type occurs even when no
+    // optimizations are enabled. This type of lowering is required for
+    // correctness by the NVPTX backend.
+    addMachineSSALowering();
+
     // If the target requests it, assign local variables to stack slots relative
     // to one another and simplify frame index references where possible.
     addPass(&LocalStackSlotAllocationID, false);
   }

   if (TM->Options.EnableIPRA)
     addPass(createRegUsageInfoPropPass());

[... 318 lines not shown ...]
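The hook added in this hunk follows the usual pattern of a no-op virtual in the base class that a target may override. The toy model below reproduces only the control flow visible in the hunk; `PassConfigModel`, `NVPTXLikeConfig` and the pass-name strings are hypothetical stand-ins, and which pass the real NVPTX override registers is an assumption, since NVPTXTargetMachine.cpp is not shown here.

```cpp
#include <string>
#include <vector>

// Toy stand-in for TargetPassConfig; pass names are illustrative labels.
class PassConfigModel {
public:
  explicit PassConfigModel(bool Optimize) : Optimize(Optimize) {}
  virtual ~PassConfigModel() = default;

  // Mirrors the branch in addMachinePasses(): optimized builds run the SSA
  // optimizations, unoptimized builds run the lowering hook instead.
  std::vector<std::string> buildMachinePasses() {
    Passes.clear();
    Passes.push_back("expand-isel-pseudos");
    if (Optimize) {
      addMachineSSAOptimization();
    } else {
      addMachineSSALowering();
      Passes.push_back("local-stack-slot-allocation");
    }
    return Passes;
  }

protected:
  virtual void addMachineSSAOptimization() {
    Passes.push_back("machine-ssa-opts");
  }
  virtual void addMachineSSALowering() {} // No-op unless a target overrides.
  std::vector<std::string> Passes;

private:
  bool Optimize;
};

// Hypothetical target that needs the lowering for correctness at -O0.
class NVPTXLikeConfig : public PassConfigModel {
public:
  using PassConfigModel::PassConfigModel;

protected:
  void addMachineSSALowering() override {
    Passes.push_back("lower-shared-frame-indices");
  }
};
```

Targets that do not override the hook see no change in their pipeline, which is why the default body is empty.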

lib/Target/NVPTX/CMakeLists.txt

[... 18 lines not shown ...] set(NVPTXCodeGen_sources
   NVPTXImageOptimizer.cpp
   NVPTXInstrInfo.cpp
   NVPTXLowerAggrCopies.cpp
   NVPTXLowerArgs.cpp
   NVPTXLowerAlloca.cpp
   NVPTXPeephole.cpp
   NVPTXMCExpr.cpp
   NVPTXPrologEpilogPass.cpp
+  NVPTXLowerSharedFrameIndicesPass.cpp
   NVPTXRegisterInfo.cpp
   NVPTXReplaceImageHandles.cpp
   NVPTXSubtarget.cpp
   NVPTXTargetMachine.cpp
   NVPTXTargetTransformInfo.cpp
+  NVPTXFunctionDataSharing.cpp
   NVPTXUtilities.cpp
   NVVMIntrRange.cpp
   NVVMReflect.cpp
   )

 add_llvm_target(NVPTXCodeGen ${NVPTXCodeGen_sources})

 add_subdirectory(TargetInfo)
 add_subdirectory(InstPrinter)
 add_subdirectory(MCTargetDesc)

lib/Target/NVPTX/NVPTX.h

[... 42 lines not shown ...]

 FunctionPass *createNVPTXISelDag(NVPTXTargetMachine &TM,
                                  llvm::CodeGenOpt::Level OptLevel);
 ModulePass *createNVPTXAssignValidGlobalNamesPass();
 ModulePass *createGenericToNVVMPass();
 FunctionPass *createNVVMIntrRangePass(unsigned int SmVersion);
 FunctionPass *createNVVMReflectPass();
 MachineFunctionPass *createNVPTXPrologEpilogPass();
+MachineFunctionPass *createNVPTXLowerSharedFrameIndicesPass();
 MachineFunctionPass *createNVPTXReplaceImageHandlesPass();
 FunctionPass *createNVPTXImageOptimizerPass();
 FunctionPass *createNVPTXLowerArgsPass(const NVPTXTargetMachine *TM);
 BasicBlockPass *createNVPTXLowerAllocaPass();
+FunctionPass *createNVPTXFunctionDataSharingPass(const NVPTXTargetMachine *TM);
 MachineFunctionPass *createNVPTXPeephole();

 Target &getTheNVPTXTarget32();
 Target &getTheNVPTXTarget64();

 namespace NVPTX {
 enum DrvInterface {
   NVCL,
[... 115 lines not shown ...]

lib/Target/NVPTX/NVPTXAsmPrinter.cpp

[... 86 lines not shown ...]
 #include <sstream>
 #include <string>
 #include <utility>
 #include <vector>

 using namespace llvm;

 #define DEPOTNAME "__local_depot"
+#define SHARED_DEPOTNAME "__shared_depot"

 static cl::opt<bool>
 EmitLineNumbers("nvptx-emit-line-numbers", cl::Hidden,
                 cl::desc("NVPTX Specific: Emit Line numbers even without -G"),
                 cl::init(true));

 static cl::opt<bool>
 InterleaveSrc("nvptx-emit-src", cl::ZeroOrMore, cl::Hidden,
[... 1,613 lines not shown ...] void NVPTXAsmPrinter::setAndEmitFunctionVirtualRegisters(
     const MachineFunction &MF) {
   SmallString<128> Str;
   raw_svector_ostream O(Str);

   // Map the global virtual register number to a register class specific
   // virtual register number starting from 1 with that class.
   const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
   //unsigned numRegClasses = TRI->getNumRegClasses();
+  bool IsKernelFunction = isKernelFunction(*MF.getFunction());
+
+  bool GenerateSharedDepot =
+      MF.getFunction()->hasFnAttribute("has-nvptx-shared-depot");

   // Emit the Fake Stack Object
   const MachineFrameInfo &MFI = MF.getFrameInfo();
   int NumBytes = (int) MFI.getStackSize();
   if (NumBytes) {
     O << "\t.local .align " << MFI.getMaxAlignment() << " .b8 \t" << DEPOTNAME
       << getFunctionNumber() << "[" << NumBytes << "];\n";
+    if (IsKernelFunction && GenerateSharedDepot) {
+      O << "\t.shared .align " << MFI.getMaxAlignment() << " .b8 \t" << SHARED_DEPOTNAME
[inline comment by hfinkel: Line too long.]
+        << getFunctionNumber() << "[" << NumBytes << "];\n";
+    }
     if (static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit()) {
       O << "\t.reg .b64 \t%SP;\n";
       O << "\t.reg .b64 \t%SPL;\n";
+      if (IsKernelFunction && GenerateSharedDepot) {
+        O << "\t.reg .b64 \t%SPS;\n";
+        O << "\t.reg .b64 \t%SPSH;\n";
+      }
     } else {
       O << "\t.reg .b32 \t%SP;\n";
       O << "\t.reg .b32 \t%SPL;\n";
+      if (IsKernelFunction && GenerateSharedDepot) {
+        O << "\t.reg .b32 \t%SPS;\n";
+        O << "\t.reg .b32 \t%SPSH;\n";
[inline comment by tra: Nit: the name should end with `S` as the L in `SPL` was for 'local' address space. which then gets converted to generic AS. In your case it will be in shared space, hence S would be more appropriate.]
+      }
     }
   }

   // Go through all virtual registers to establish the mapping between the
   // global virtual
   // register number and the per class virtual register number.
   // We use the per class virtual register number in the ptx output.
   unsigned int numVRs = MRI->getNumVirtRegs();
[... 612 lines not shown ...]
 void NVPTXAsmPrinter::printOperand(const MachineInstr *MI, int opNum,
                                    raw_ostream &O, const char *Modifier) {
   const MachineOperand &MO = MI->getOperand(opNum);
   switch (MO.getType()) {
   case MachineOperand::MO_Register:
     if (TargetRegisterInfo::isPhysicalRegister(MO.getReg())) {
       if (MO.getReg() == NVPTX::VRDepot)
         O << DEPOTNAME << getFunctionNumber();
+      else if (MO.getReg() == NVPTX::VRSharedDepot)
+        O << SHARED_DEPOTNAME << getFunctionNumber();
       else
         O << NVPTXInstPrinter::getRegisterName(MO.getReg());
     } else {
       emitVirtualRegister(MO.getReg(), O);
     }
     return;

   case MachineOperand::MO_Immediate:
[... 85 lines not shown ...]
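For readers who want to see what the new `.shared` declaration in the hunk above expands to, the following standalone sketch builds the same string with a standard stream. The `sharedDepotDecl` helper and its parameters are hypothetical; in the patch the values come from `MachineFrameInfo` and `getFunctionNumber()`.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper that mirrors the shape of the string emitted for the
// shared depot; the real code streams into the asm printer's buffer.
std::string sharedDepotDecl(unsigned Align, unsigned FnNumber, int NumBytes) {
  std::ostringstream O;
  O << "\t.shared .align " << Align << " .b8 \t" << "__shared_depot"
    << FnNumber << "[" << NumBytes << "];\n";
  return O.str();
}
```

Note that the hunk sizes the shared depot with the same `NumBytes` as the local depot, so in this version both declarations have identical extents.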

lib/Target/NVPTX/NVPTXFrameLowering.cpp

[... 10 lines not shown ...]
 //
 //===----------------------------------------------------------------------===//

 #include "NVPTXFrameLowering.h"
 #include "NVPTX.h"
 #include "NVPTXRegisterInfo.h"
 #include "NVPTXSubtarget.h"
 #include "NVPTXTargetMachine.h"
+#include "NVPTXUtilities.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/CodeGen/TargetInstrInfo.h"
 #include "llvm/MC/MachineLocation.h"

 using namespace llvm;
[... 29 lines not shown ...] if (!MR.use_empty(NVPTX::VRFrame)) {
       MI = BuildMI(MBB, MI, dl,
                    MF.getSubtarget().getInstrInfo()->get(CvtaLocalOpcode),
                    NVPTX::VRFrame)
                .addReg(NVPTX::VRFrameLocal);
     }
     BuildMI(MBB, MI, dl, MF.getSubtarget().getInstrInfo()->get(MovDepotOpcode),
             NVPTX::VRFrameLocal)
         .addImm(MF.getFunctionNumber());

+    bool SharedStackPointerInit =
+        MF.getFunction()->hasFnAttribute("has-nvptx-shared-depot");
+
+    // Only emit a shared depot for the main kernel function.
+    // The other device functions need to get a handle on this shared depot
+    // by interacting with the runtime.
[inline comment by hfinkel: In other places in this patch you refer explicitly to OpenMP, so it probably makes sense to say "the OpenMP runtime" here as well (but just saying "the runtime" seems potentially confusing).]
+    if (isKernelFunction(*MF.getFunction()) && SharedStackPointerInit) {
+      // Emits
+      // mov %SHSPL, %shared_depot;
+      // cvta.shared %SHSP, %SHSPL;
+      // For the time being just emit it even if it's not used.
+      unsigned CvtaSharedOpcode =
+          Is64Bit ? NVPTX::cvta_shared_yes_64 : NVPTX::cvta_shared_yes;
+      unsigned MovSharedDepotOpcode =
+          Is64Bit ? NVPTX::MOV_SHARED_DEPOT_ADDR_64 : NVPTX::MOV_SHARED_DEPOT_ADDR;
+      MI = BuildMI(MBB, MI, dl,
+                   MF.getSubtarget().getInstrInfo()->get(CvtaSharedOpcode),
+                   NVPTX::VRShared)
+               .addReg(NVPTX::VRFrameShared);
+      BuildMI(MBB, MI, dl, MF.getSubtarget().getInstrInfo()->get(MovSharedDepotOpcode),
[inline comment by hfinkel: Line too long.]
+              NVPTX::VRFrameShared)
+          .addImm(MF.getFunctionNumber());
+    }
   }
 }

 void NVPTXFrameLowering::emitEpilogue(MachineFunction &MF,
                                       MachineBasicBlock &MBB) const {}

 // This function eliminates ADJCALLSTACKDOWN,
 // ADJCALLSTACKUP pseudo instructions
 MachineBasicBlock::iterator NVPTXFrameLowering::eliminateCallFramePseudoInstr(
     MachineFunction &MF, MachineBasicBlock &MBB,
     MachineBasicBlock::iterator I) const {
   // Simply discard ADJCALLSTACKDOWN,
   // ADJCALLSTACKUP instructions.
   return MBB.erase(I);
 }

lib/Target/NVPTX/NVPTXFunctionDataSharing.h

This file was added.

				//===--- NVPTXFrameLowering.h - Define frame lowering for NVPTX -- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				//
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_NVPTX_NVPTXFUNCTIONDATASHARING_H
				#define LLVM_LIB_TARGET_NVPTX_NVPTXFUNCTIONDATASHARING_H

				namespace llvm {

				class NVPTXFunctionDataSharing : public FunctionPass {
				bool runOnFunction(Function &F) override;
				bool runOnKernelFunction(Function &F);
				bool runOnDeviceFunction(Function &F);

				public:
				static char ID; // Pass identification, replacement for typeid
				NVPTXFunctionDataSharing(const NVPTXTargetMachine *TM = nullptr)
				: FunctionPass(ID), TM(TM) {}
				StringRef getPassName() const override {
				return "Function level data sharing pass.";
				}

				private:
				const NVPTXTargetMachine *TM;
				};
				} // End llvm namespace

				#endif
				No newline at end of file

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp

This file was added.

				//===-- FunctionDataSharing.cpp - Mark pointers as shared -----------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// For all alloca instructions, add a pair of cast to shared address for
				[inline comment by tra: Please add details about what the pass is supposed to do.]
				// each of them. For example,
				//
				// %A = alloca i32
				// store i32 0, i32* %A ; emits st.u32
				//
				// will be transformed to
				//
				// %A = alloca i32
				// %Local = addrspacecast i32* %A to i32 addrspace(3)*
				// %Shared = addrspacecast i32 addrspace(3)* %A to i32*
				// store i32 0, i32 addrspace(5)* %Shared ; emits st.shared.u32
				//
				// And we will rely on NVPTXInferAddressSpaces to combine the last two
				// instructions.
				//
				// This pass is invoked for -O0 only.
				//
				//===----------------------------------------------------------------------===//

				#include "NVPTX.h"
				#include "NVPTXUtilities.h"
				#include "NVPTXTargetMachine.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/IR/Function.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/Type.h"
				#include "llvm/Pass.h"

				using namespace llvm;

				namespace llvm {
				void initializeNVPTXFunctionDataSharingPass(PassRegistry &);
				}

				namespace {
				class NVPTXFunctionDataSharing : public FunctionPass {
				bool runOnFunction(Function &F) override;
				bool runOnKernelFunction(Function &F);
				bool runOnDeviceFunction(Function &F);

				public:
				static char ID; // Pass identification, replacement for typeid
				NVPTXFunctionDataSharing(const NVPTXTargetMachine *TM = nullptr)
				: FunctionPass(ID) {}
				StringRef getPassName() const override {
				return "Function level data sharing pass.";
				}
				};
				} // namespace

				char NVPTXFunctionDataSharing::ID = 1;

				INITIALIZE_PASS(NVPTXFunctionDataSharing, "nvptx-function-data-sharing",
				"Function Data Sharing (NVPTX)", false, false)

				static void markPointerAsShared(Value *Ptr) {
				if (Ptr->getType()->getPointerAddressSpace() == ADDRESS_SPACE_SHARED)
				return;

				// Deciding where to emit the addrspacecast pair.
				// Insert right after Ptr if Ptr is an instruction.
				BasicBlock::iterator InsertPt =
				std::next(cast<Instruction>(Ptr)->getIterator());
				assert(InsertPt != InsertPt->getParent()->end() &&
				"We don't call this function with Ptr being a terminator.");

				auto *PtrInShared = new AddrSpaceCastInst(
				Ptr, PointerType::get(Ptr->getType()->getPointerElementType(),
				ADDRESS_SPACE_SHARED),
				Ptr->getName(), &*InsertPt);
				// Old version
				auto *PtrInGeneric = new AddrSpaceCastInst(PtrInShared, Ptr->getType(),
				Ptr->getName(), &*InsertPt);
				// Replace with PtrInGeneric all uses of Ptr except PtrInShared.
				Ptr->replaceAllUsesWith(PtrInGeneric);
				PtrInShared->setOperand(0, Ptr);
				}

				// =============================================================================
				// Main function for this pass.
				// =============================================================================
				bool NVPTXFunctionDataSharing::runOnKernelFunction(Function &F) {
				bool Modified = false;

				// Skip pass if no data sharing is required.
				if (!F.hasFnAttribute("has-nvptx-shared-depot"))
				return Modified;

				tra: Nit: `return false` would match the intent better.
				for (auto &B : F) {
				for (auto &I : B) {
				auto *AI = dyn_cast<AllocaInst>(&I);
				if (!AI)
				continue;
				if (AI->getType()->isPointerTy() && ptrIsStored(AI)) {
				markPointerAsShared(AI);
				Modified = true;
				}
				}
				}

				return Modified;
				}

				// Device functions only need to copy byval args into local memory.
				bool NVPTXFunctionDataSharing::runOnDeviceFunction(Function &F) {
				return true;
				}

				bool NVPTXFunctionDataSharing::runOnFunction(Function &F) {
				return isKernelFunction(F) ? runOnKernelFunction(F) : runOnDeviceFunction(F);
				}

				FunctionPass *
				llvm::createNVPTXFunctionDataSharingPass(const NVPTXTargetMachine *TM) {
				return new NVPTXFunctionDataSharing(TM);
				}
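
The use-rewriting step in markPointerAsShared can be hard to follow: replaceAllUsesWith redirects every use of Ptr, including the operand of the freshly created cast to shared, so that operand has to be pointed back at Ptr afterwards. The sketch below models this with a toy value graph. `Node`, the free function `replaceAllUsesWith`, and `markAsShared` are illustrative stand-ins invented for this example, not LLVM classes or APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an IR value: a name plus the values it uses as operands.
// Illustrative only; this is not an LLVM class.
struct Node {
  std::string Name;
  std::vector<Node *> Operands;
};

// Toy version of replaceAllUsesWith: every operand slot in the function that
// points at Old is redirected to New.
void replaceAllUsesWith(std::vector<Node *> &Function, Node *Old, Node *New) {
  for (Node *N : Function)
    for (Node *&Op : N->Operands)
      if (Op == Old)
        Op = New;
}

// Follows the order of operations in markPointerAsShared: create the cast
// pair, redirect all uses of Ptr to the generic cast, then repair the one
// use that must keep pointing at Ptr.
void markAsShared(std::vector<Node *> &Function, Node *Ptr, Node *ToShared,
                  Node *ToGeneric) {
  assert(Ptr && ToShared && ToGeneric && "all nodes must be non-null");
  ToShared->Operands = {Ptr};
  ToGeneric->Operands = {ToShared};
  Function.push_back(ToShared);
  Function.push_back(ToGeneric);

  // This also rewrites ToShared's operand, which would create a cycle
  // (ToShared -> ToGeneric -> ToShared) if left as is.
  replaceAllUsesWith(Function, Ptr, ToGeneric);

  // Restore the operand so the chain is Ptr -> ToShared -> ToGeneric.
  ToShared->Operands[0] = Ptr;
}
```

After the call, every former user of the alloca reads through the generic cast, while the shared cast remains the only direct user of the alloca.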

lib/Target/NVPTX/NVPTXInstrInfo.td

Show First 20 Lines • Show All 1,577 Lines • ▼ Show 20 Lines	def MOV_ADDR64 : NVPTXInst<(outs Int64Regs:$dst), (ins imem:$a),
[(set Int64Regs:$dst, (Wrapper tglobaladdr:$a))]>;		[(set Int64Regs:$dst, (Wrapper tglobaladdr:$a))]>;

// Get pointer to local stack.		// Get pointer to local stack.
let hasSideEffects = 0 in {		let hasSideEffects = 0 in {
def MOV_DEPOT_ADDR : NVPTXInst<(outs Int32Regs:$d), (ins i32imm:$num),		def MOV_DEPOT_ADDR : NVPTXInst<(outs Int32Regs:$d), (ins i32imm:$num),
"mov.u32 \t$d, __local_depot$num;", []>;		"mov.u32 \t$d, __local_depot$num;", []>;
def MOV_DEPOT_ADDR_64 : NVPTXInst<(outs Int64Regs:$d), (ins i32imm:$num),		def MOV_DEPOT_ADDR_64 : NVPTXInst<(outs Int64Regs:$d), (ins i32imm:$num),
"mov.u64 \t$d, __local_depot$num;", []>;		"mov.u64 \t$d, __local_depot$num;", []>;
		def MOV_SHARED_DEPOT_ADDR : NVPTXInst<(outs Int32Regs:$d), (ins i32imm:$num),
		"mov.u32 \t$d, __shared_depot$num;", []>;
		def MOV_SHARED_DEPOT_ADDR_64 : NVPTXInst<(outs Int64Regs:$d), (ins i32imm:$num),
		"mov.u64 \t$d, __shared_depot$num;", []>;
}		}


// copyPhysreg is hard-coded in NVPTXInstrInfo.cpp		// copyPhysreg is hard-coded in NVPTXInstrInfo.cpp
let IsSimpleMove=1, hasSideEffects=0 in {		let IsSimpleMove=1, hasSideEffects=0 in {
def IMOV1rr : NVPTXInst<(outs Int1Regs:$dst), (ins Int1Regs:$sss),		def IMOV1rr : NVPTXInst<(outs Int1Regs:$dst), (ins Int1Regs:$sss),
"mov.pred \t$dst, $sss;", []>;		"mov.pred \t$dst, $sss;", []>;
def IMOV16rr : NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$sss),		def IMOV16rr : NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$sss),

lib/Target/NVPTX/NVPTXLowerAlloca.cpp

// %A = alloca i32		// %A = alloca i32
// %Local = addrspacecast i32* %A to i32 addrspace(5)*		// %Local = addrspacecast i32* %A to i32 addrspace(5)*
// %Generic = addrspacecast i32 addrspace(5)* %A to i32*		// %Generic = addrspacecast i32 addrspace(5)* %A to i32*
// store i32 0, i32 addrspace(5)* %Generic ; emits st.local.u32		// store i32 0, i32 addrspace(5)* %Generic ; emits st.local.u32
//		//
// And we will rely on NVPTXInferAddressSpaces to combine the last two		// And we will rely on NVPTXInferAddressSpaces to combine the last two
// instructions.		// instructions.
//		//
		// In the case of OpenMP shared variables, perform the same transformation as
		// for local variables but using the shared address space.
		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "NVPTX.h"		#include "NVPTX.h"
#include "NVPTXUtilities.h"		#include "NVPTXUtilities.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
Show All 32 Lines	if (skipBasicBlock(BB))
return false;		return false;

bool Changed = false;		bool Changed = false;
for (auto &I : BB) {		for (auto &I : BB) {
if (auto allocaInst = dyn_cast<AllocaInst>(&I)) {		if (auto allocaInst = dyn_cast<AllocaInst>(&I)) {
Changed = true;		Changed = true;
auto PTy = dyn_cast<PointerType>(allocaInst->getType());		auto PTy = dyn_cast<PointerType>(allocaInst->getType());
auto ETy = PTy->getElementType();		auto ETy = PTy->getElementType();
auto LocalAddrTy = PointerType::get(ETy, ADDRESS_SPACE_LOCAL);
auto NewASCToLocal = new AddrSpaceCastInst(allocaInst, LocalAddrTy, "");		// In the CUDA case, this is always a local address.
auto GenericAddrTy = PointerType::get(ETy, ADDRESS_SPACE_GENERIC);		// In offloading to a device using OpenMP this may be an
		// address allocated in the shared memory of the device.
		auto *AddrTy = PointerType::get(ETy, ADDRESS_SPACE_LOCAL);
		bool PtrIsStored = ptrIsStored(allocaInst);
		bool RequiresSharedMemory =
		BB.getParent()->hasFnAttribute("has-nvptx-shared-depot");

		// Handle shared args: currently shared args are declared as
		// an alloca in LLVM-IR code generation and lowered to
		// shared memory.
		if (PtrIsStored && RequiresSharedMemory)
		AddrTy = PointerType::get(ETy, ADDRESS_SPACE_SHARED);

		auto NewASCToLocal = new AddrSpaceCastInst(allocaInst, AddrTy, "");
		auto *GenericAddrTy = PointerType::get(ETy, ADDRESS_SPACE_GENERIC);
auto NewASCToGeneric = new AddrSpaceCastInst(NewASCToLocal,		auto NewASCToGeneric = new AddrSpaceCastInst(NewASCToLocal,
GenericAddrTy, "");		GenericAddrTy, "");
NewASCToLocal->insertAfter(allocaInst);		NewASCToLocal->insertAfter(allocaInst);
NewASCToGeneric->insertAfter(NewASCToLocal);		NewASCToGeneric->insertAfter(NewASCToLocal);

		// If a value is shared then the additional conversions are required for
		// correctness.
		if (PtrIsStored && RequiresSharedMemory) {
		allocaInst->replaceAllUsesWith(NewASCToGeneric);
		NewASCToLocal->setOperand(0, allocaInst);
		continue;
		}

for (Value::use_iterator UI = allocaInst->use_begin(),		for (Value::use_iterator UI = allocaInst->use_begin(),
UE = allocaInst->use_end();		UE = allocaInst->use_end();
UI != UE; ) {		UI != UE; ) {
// Check Load, Store, GEP, and BitCast Uses on alloca and make them		// Check Load, Store, GEP, and BitCast Uses on alloca and make them
// use the converted generic address, in order to expose non-generic		// use the converted generic address, in order to expose non-generic
// addrspacecast to NVPTXInferAddressSpaces. For other types		// addrspacecast to NVPTXInferAddressSpaces. For other types
// of instructions this is unnecessary and may introduce redundant		// of instructions this is unnecessary and may introduce redundant
// address cast.		// address cast.
const auto &AllocaUse = *UI++;		const auto &AllocaUse = *UI++;
auto LI = dyn_cast<LoadInst>(AllocaUse.getUser());		auto LI = dyn_cast<LoadInst>(AllocaUse.getUser());
if (LI && LI->getPointerOperand() == allocaInst && !LI->isVolatile()) {		if (LI && LI->getPointerOperand() == allocaInst && !LI->isVolatile()) {
LI->setOperand(LI->getPointerOperandIndex(), NewASCToGeneric);		LI->setOperand(LI->getPointerOperandIndex(), NewASCToGeneric);
continue;		continue;
}		}
auto SI = dyn_cast<StoreInst>(AllocaUse.getUser());		auto SI = dyn_cast<StoreInst>(AllocaUse.getUser());
if (SI && SI->getPointerOperand() == allocaInst && !SI->isVolatile()) {		if (SI && !SI->isVolatile()){
SI->setOperand(SI->getPointerOperandIndex(), NewASCToGeneric);		unsigned Idx;
		if (SI->getPointerOperand() == allocaInst)
		Idx = SI->getPointerOperandIndex();
		else if (SI->getValueOperand() == allocaInst)
		Idx = 0;
		else
continue;		continue;
		SI->setOperand(Idx, NewASCToGeneric);
}		}
auto GI = dyn_cast<GetElementPtrInst>(AllocaUse.getUser());		auto GI = dyn_cast<GetElementPtrInst>(AllocaUse.getUser());
if (GI && GI->getPointerOperand() == allocaInst) {		if (GI && GI->getPointerOperand() == allocaInst) {
GI->setOperand(GI->getPointerOperandIndex(), NewASCToGeneric);		GI->setOperand(GI->getPointerOperandIndex(), NewASCToGeneric);
continue;		continue;
}		}
auto BI = dyn_cast<BitCastInst>(AllocaUse.getUser());		auto BI = dyn_cast<BitCastInst>(AllocaUse.getUser());
if (BI && BI->getOperand(0) == allocaInst) {		if (BI && BI->getOperand(0) == allocaInst) {

lib/Target/NVPTX/NVPTXLowerSharedFrameIndicesPass.cpp

This file was added.

				//===-- NVPTXLowerSharedFrameIndicesPass.cpp - NVPTX lowering ------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a copy of the generic LLVM PrologEpilogInserter pass, modified
				// to remove unneeded functionality and to handle virtual registers. This pass
				// lowers the frame indices to the shared frame index wherever needed.
				hfinkel: Can you be more specific? I believe that we fixed PEI to handle virtual registers, so if that's the only motivation, can we use the regular PEI now?
				//
				//===----------------------------------------------------------------------===//

				#include "NVPTX.h"
				#include "NVPTXUtilities.h"
				#include "NVPTXRegisterInfo.h"
				#include "NVPTXSubtarget.h"
				#include "NVPTXTargetMachine.h"
				#include "llvm/CodeGen/MachineFrameInfo.h"
				#include "llvm/CodeGen/MachineFunction.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/CodeGen/TargetFrameLowering.h"
				#include "llvm/CodeGen/TargetRegisterInfo.h"
				#include "llvm/CodeGen/TargetSubtargetInfo.h"
				#include "llvm/CodeGen/TargetInstrInfo.h"
				#include "llvm/MC/MachineLocation.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/Pass.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"

				using namespace llvm;

				#define DEBUG_TYPE "nvptx-lower-shared-frame-indices"

				namespace {
				class NVPTXLowerSharedFrameIndicesPass : public MachineFunctionPass {
				public:
				static char ID;
				NVPTXLowerSharedFrameIndicesPass() : MachineFunctionPass(ID) {}

				bool runOnMachineFunction(MachineFunction &MF) override;

				private:
				void calculateSharedFrameObjectOffsets(MachineFunction &Fn);
				};
				}

				MachineFunctionPass *llvm::createNVPTXLowerSharedFrameIndicesPass() {
				return new NVPTXLowerSharedFrameIndicesPass();
				}

				char NVPTXLowerSharedFrameIndicesPass::ID = 0;

				static bool isSharedFrame(
				MachineBasicBlock::iterator II,
				MachineFunction &MF) {
				MachineInstr &currentMI = *II;

				if (!currentMI.getOperand(0).isReg())
				return false;

				bool useSharedFrame = false;
				unsigned AllocRegisterNumber = currentMI.getOperand(0).getReg();

				for (MachineBasicBlock &MBB : MF) {
				for (MachineInstr &MI : MBB) {
				if (MI.getOpcode() == NVPTX::cvta_to_shared_yes_64 ||
				MI.getOpcode() == NVPTX::cvta_to_shared_yes) {
				if (AllocRegisterNumber == MI.getOperand(1).getReg()) {
				useSharedFrame = true;
				break;
				}
				}
				}
				}
				return useSharedFrame;
				}

				bool NVPTXLowerSharedFrameIndicesPass::runOnMachineFunction(MachineFunction &MF) {
				bool Modified = false;
				bool IsKernel = isKernelFunction(*MF.getFunction());

				// Skip pass if function is not the kernel.
				if (!IsKernel)
				return Modified;

				// Skip pass if no data sharing is required.
				if (!MF.getFunction()->hasFnAttribute("has-nvptx-shared-depot"))
				return Modified;

				SmallVector<int, 16> SharedFrameIndices;

				calculateSharedFrameObjectOffsets(MF);

				for (MachineBasicBlock &MBB : MF) {
				for (MachineInstr &MI : MBB) {
				for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
				if (!MI.getOperand(i).isFI())
				continue;

				if (i + 1 >= MI.getNumOperands())
				continue;

				bool IsSharedFrame = false;
				int FrameIndex = MI.getOperand(i).getIndex();

				for(int SFI : SharedFrameIndices)
				if (FrameIndex == SFI)
				IsSharedFrame = true;

				if (!IsSharedFrame && isSharedFrame(MI, MF)) {
				SharedFrameIndices.push_back(FrameIndex);
				IsSharedFrame = true;
				}

				if (IsSharedFrame) {
				// Change Frame index to use shared stack.
				MachineFunction &MF = *MI.getParent()->getParent();
				int Offset = MF.getFrameInfo().getObjectOffset(FrameIndex) +
				MI.getOperand(i + 1).getImm();

				// Using I0 as the frame pointer
				// For shared data use the appropriate virtual register: VRShared
				MI.getOperand(i).ChangeToRegister(NVPTX::VRShared, false);
				MI.getOperand(i + 1).ChangeToImmediate(Offset);
				}
				Modified = true;
				}
				}
				}

				return Modified;
				}

				/// AdjustStackOffset - Helper function used to adjust the stack frame offset.
				static inline void
				AdjustStackOffset(MachineFrameInfo &MFI, int FrameIdx,
				bool StackGrowsDown, int64_t &Offset,
				unsigned &MaxAlign) {
				// If the stack grows down, add the object size to find the lowest address.
				if (StackGrowsDown)
				Offset += MFI.getObjectSize(FrameIdx);

				unsigned Align = MFI.getObjectAlignment(FrameIdx);

				// If the alignment of this object is greater than that of the stack, then
				// increase the stack alignment to match.
				MaxAlign = std::max(MaxAlign, Align);

				// Adjust to alignment boundary.
				Offset = (Offset + Align - 1) / Align * Align;

				if (StackGrowsDown) {
				DEBUG(dbgs() << "alloc FI(" << FrameIdx << ") at SP[" << -Offset << "]\n");
				MFI.setObjectOffset(FrameIdx, -Offset); // Set the computed offset
				} else {
				DEBUG(dbgs() << "alloc FI(" << FrameIdx << ") at SP[" << Offset << "]\n");
				MFI.setObjectOffset(FrameIdx, Offset);
				Offset += MFI.getObjectSize(FrameIdx);
				}
				}

				/// This function computes the offset inside the shared stack.
				///
				/// TODO: For simplicity, currently, the offsets coincide with
				/// the local stack frame offsets - the local and stack frame
				/// offsets are the same length.
				void
				NVPTXLowerSharedFrameIndicesPass::calculateSharedFrameObjectOffsets(
				MachineFunction &Fn) {
				const TargetFrameLowering &TFI = *Fn.getSubtarget().getFrameLowering();
				const TargetRegisterInfo *RegInfo = Fn.getSubtarget().getRegisterInfo();

				bool StackGrowsDown =
				TFI.getStackGrowthDirection() == TargetFrameLowering::StackGrowsDown;

				// Loop over all of the stack objects, assigning sequential addresses...
				MachineFrameInfo &MFI = Fn.getFrameInfo();

				// Start at the beginning of the local area.
				// The Offset is the distance from the stack top in the direction
				// of stack growth -- so it's always nonnegative.
				int LocalAreaOffset = TFI.getOffsetOfLocalArea();
				if (StackGrowsDown)
				LocalAreaOffset = -LocalAreaOffset;
				assert(LocalAreaOffset >= 0
				&& "Local area offset should be in direction of stack growth");
				int64_t Offset = LocalAreaOffset;

				// If there are fixed sized objects that are preallocated in the local area,
				// non-fixed objects can't be allocated right at the start of local area.
				// We currently don't support filling in holes in between fixed sized
				// objects, so we adjust 'Offset' to point to the end of last fixed sized
				// preallocated object.
				for (int i = MFI.getObjectIndexBegin(); i != 0; ++i) {
				int64_t FixedOff;
				if (StackGrowsDown) {
				// The maximum distance from the stack pointer is at lower address of
				// the object -- which is given by offset. For down growing stack
				// the offset is negative, so we negate the offset to get the distance.
				FixedOff = -MFI.getObjectOffset(i);
				} else {
				// The maximum distance from the start pointer is at the upper
				// address of the object.
				FixedOff = MFI.getObjectOffset(i) + MFI.getObjectSize(i);
				}
				if (FixedOff > Offset) Offset = FixedOff;
				}

				// NOTE: We do not have a call stack

				unsigned MaxAlign = MFI.getMaxAlignment();

				// No scavenger

				// FIXME: Once this is working, then enable flag will change to a target
				// check for whether the frame is large enough to want to use virtual
				// frame index registers. Functions which don't want/need this optimization
				// will continue to use the existing code path.
				if (MFI.getUseLocalStackAllocationBlock()) {
				unsigned Align = MFI.getLocalFrameMaxAlign();

				// Adjust to alignment boundary.
				Offset = (Offset + Align - 1) / Align * Align;

				DEBUG(dbgs() << "Local frame base offset: " << Offset << "\n");

				// Resolve offsets for objects in the local block.
				for (unsigned i = 0, e = MFI.getLocalFrameObjectCount(); i != e; ++i) {
				std::pair<int, int64_t> Entry = MFI.getLocalFrameObjectMap(i);
				int64_t FIOffset = (StackGrowsDown ? -Offset : Offset) + Entry.second;
				DEBUG(dbgs() << "alloc FI(" << Entry.first << ") at SP[" <<
				FIOffset << "]\n");
				MFI.setObjectOffset(Entry.first, FIOffset);
				}
				// Allocate the local block
				Offset += MFI.getLocalFrameSize();

				MaxAlign = std::max(Align, MaxAlign);
				}

				// No stack protector

				// Then assign frame offsets to stack objects that are not used to spill
				// callee saved registers.
				for (unsigned i = 0, e = MFI.getObjectIndexEnd(); i != e; ++i) {
				if (MFI.isObjectPreAllocated(i) &&
				MFI.getUseLocalStackAllocationBlock())
				continue;
				if (MFI.isDeadObjectIndex(i))
				continue;

				AdjustStackOffset(MFI, i, StackGrowsDown, Offset, MaxAlign);
				}

				// No scavenger

				if (!TFI.targetHandlesStackFrameRounding()) {
				// If we have reserved argument space for call sites in the function
				// immediately on entry to the current function, count it as part of the
				// overall stack size.
				if (MFI.adjustsStack() && TFI.hasReservedCallFrame(Fn))
				Offset += MFI.getMaxCallFrameSize();

				// Round up the size to a multiple of the alignment. If the function has
				// any calls or alloca's, align to the target's StackAlignment value to
				// ensure that the callee's frame or the alloca data is suitably aligned;
				// otherwise, for leaf functions, align to the TransientStackAlignment
				// value.
				unsigned StackAlign;
				if (MFI.adjustsStack() || MFI.hasVarSizedObjects() ||
				(RegInfo->needsStackRealignment(Fn) && MFI.getObjectIndexEnd() != 0))
				StackAlign = TFI.getStackAlignment();
				else
				StackAlign = TFI.getTransientStackAlignment();

				// If the frame pointer is eliminated, all frame offsets will be relative to
				// SP not FP. Align to MaxAlign so this works.
				StackAlign = std::max(StackAlign, MaxAlign);
				unsigned AlignMask = StackAlign - 1;
				Offset = (Offset + AlignMask) & ~uint64_t(AlignMask);
				}

				// Update frame info to pretend that this is part of the stack...
				int64_t StackSize = Offset - LocalAreaOffset;
				MFI.setStackSize(StackSize);
				}
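
The offset computation above rounds in two ways: AdjustStackOffset uses a divide-and-multiply form, while the final stack size uses a mask. The following is a minimal standalone sketch of both, plus the placement step from the upward-growing (non-StackGrowsDown) branch. The helper names `roundUpDiv`, `roundUpMask`, and `placeObject` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Round Offset up to the next multiple of Align. Works for any positive
// Align; same arithmetic as the "Adjust to alignment boundary" step.
int64_t roundUpDiv(int64_t Offset, int64_t Align) {
  assert(Offset >= 0 && Align > 0);
  return (Offset + Align - 1) / Align * Align;
}

// Mask variant, as used when rounding the final stack size. It is only
// correct when Align is a power of two.
int64_t roundUpMask(int64_t Offset, uint64_t Align) {
  assert(Offset >= 0 && Align != 0 && (Align & (Align - 1)) == 0);
  uint64_t Mask = Align - 1;
  return static_cast<int64_t>((static_cast<uint64_t>(Offset) + Mask) & ~Mask);
}

// Hypothetical helper modelling the upward-growing branch: align the running
// offset, assign it to the object, then advance past the object.
int64_t placeObject(int64_t &Offset, int64_t Size, int64_t Align) {
  Offset = roundUpDiv(Offset, Align);
  int64_t Assigned = Offset;
  Offset += Size;
  return Assigned;
}
```

For example, placing a 4-byte object with 4-byte alignment and then an 8-byte object with 8-byte alignment, starting from offset 0, assigns offsets 0 and 8 and leaves the running offset at 16.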

lib/Target/NVPTX/NVPTXRegisterInfo.h

Show All 39 Lines	public:
BitVector getReservedRegs(const MachineFunction &MF) const override;		BitVector getReservedRegs(const MachineFunction &MF) const override;

void eliminateFrameIndex(MachineBasicBlock::iterator MI, int SPAdj,		void eliminateFrameIndex(MachineBasicBlock::iterator MI, int SPAdj,
unsigned FIOperandNum,		unsigned FIOperandNum,
RegScavenger *RS = nullptr) const override;		RegScavenger *RS = nullptr) const override;

unsigned getFrameRegister(const MachineFunction &MF) const override;		unsigned getFrameRegister(const MachineFunction &MF) const override;

		unsigned getSharedFrameRegister(const MachineFunction &MF) const;

ManagedStringPool *getStrPool() const {		ManagedStringPool *getStrPool() const {
return const_cast<ManagedStringPool *>(&ManagedStrPool);		return const_cast<ManagedStringPool *>(&ManagedStrPool);
}		}

const char *getName(unsigned RegNo) const {		const char *getName(unsigned RegNo) const {
std::stringstream O;		std::stringstream O;
O << "reg" << RegNo;		O << "reg" << RegNo;
return getStrPool()->getManagedString(O.str().c_str())->c_str();		return getStrPool()->getManagedString(O.str().c_str())->c_str();

lib/Target/NVPTX/NVPTXRegisterInfo.cpp

Show First 20 Lines • Show All 124 Lines • ▼ Show 20 Lines	void NVPTXRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
// Using I0 as the frame pointer		// Using I0 as the frame pointer
MI.getOperand(FIOperandNum).ChangeToRegister(NVPTX::VRFrame, false);		MI.getOperand(FIOperandNum).ChangeToRegister(NVPTX::VRFrame, false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);		MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
}		}

unsigned NVPTXRegisterInfo::getFrameRegister(const MachineFunction &MF) const {		unsigned NVPTXRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
return NVPTX::VRFrame;		return NVPTX::VRFrame;
}		}

		unsigned NVPTXRegisterInfo::getSharedFrameRegister(const MachineFunction &MF) const {
		hfinkel: Line too long.
		return NVPTX::VRShared;
		}

lib/Target/NVPTX/NVPTXRegisterInfo.td


	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Registers			// Registers
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	// Special Registers used as stack pointer			// Special Registers used as stack pointer
	def VRFrame : NVPTXReg<"%SP">;			def VRFrame : NVPTXReg<"%SP">;
	def VRFrameLocal : NVPTXReg<"%SPL">;			def VRFrameLocal : NVPTXReg<"%SPL">;
				def VRShared : NVPTXReg<"%SPS">;
				def VRFrameShared : NVPTXReg<"%SPSH">;

	// Special Registers used as the stack			// Special Registers used as the stack
	def VRDepot : NVPTXReg<"%Depot">;			def VRDepot : NVPTXReg<"%Depot">;
				def VRSharedDepot : NVPTXReg<"%SharedDepot">;

	// We use virtual registers, but define a few physical registers here to keep			// We use virtual registers, but define a few physical registers here to keep
	// SDAG and the MachineInstr layers happy.			// SDAG and the MachineInstr layers happy.
	foreach i = 0-4 in {			foreach i = 0-4 in {
	def P#i : NVPTXReg<"%p"#i>; // Predicate			def P#i : NVPTXReg<"%p"#i>; // Predicate
	def RS#i : NVPTXReg<"%rs"#i>; // 16-bit			def RS#i : NVPTXReg<"%rs"#i>; // 16-bit
	def R#i : NVPTXReg<"%r"#i>; // 32-bit			def R#i : NVPTXReg<"%r"#i>; // 32-bit
	def RL#i : NVPTXReg<"%rd"#i>; // 64-bit			def RL#i : NVPTXReg<"%rd"#i>; // 64-bit
	Show All 25 Lines
	def Float32Regs : NVPTXRegClass<[f32], 32, (add (sequence "F%u", 0, 4))>;			def Float32Regs : NVPTXRegClass<[f32], 32, (add (sequence "F%u", 0, 4))>;
	def Float64Regs : NVPTXRegClass<[f64], 64, (add (sequence "FL%u", 0, 4))>;			def Float64Regs : NVPTXRegClass<[f64], 64, (add (sequence "FL%u", 0, 4))>;
	def Int32ArgRegs : NVPTXRegClass<[i32], 32, (add (sequence "ia%u", 0, 4))>;			def Int32ArgRegs : NVPTXRegClass<[i32], 32, (add (sequence "ia%u", 0, 4))>;
	def Int64ArgRegs : NVPTXRegClass<[i64], 64, (add (sequence "la%u", 0, 4))>;			def Int64ArgRegs : NVPTXRegClass<[i64], 64, (add (sequence "la%u", 0, 4))>;
	def Float32ArgRegs : NVPTXRegClass<[f32], 32, (add (sequence "fa%u", 0, 4))>;			def Float32ArgRegs : NVPTXRegClass<[f32], 32, (add (sequence "fa%u", 0, 4))>;
	def Float64ArgRegs : NVPTXRegClass<[f64], 64, (add (sequence "da%u", 0, 4))>;			def Float64ArgRegs : NVPTXRegClass<[f64], 64, (add (sequence "da%u", 0, 4))>;

	// Read NVPTXRegisterInfo.cpp to see how VRFrame and VRDepot are used.			// Read NVPTXRegisterInfo.cpp to see how VRFrame and VRDepot are used.
	def SpecialRegs : NVPTXRegClass<[i32], 32, (add VRFrame, VRFrameLocal, VRDepot,			def SpecialRegs : NVPTXRegClass<[i32], 32, (add VRFrame, VRFrameLocal, VRDepot, VRShared, VRFrameShared, VRSharedDepot,
				tra: Line too long.
	(sequence "ENVREG%u", 0, 31))>;			(sequence "ENVREG%u", 0, 31))>;

lib/Target/NVPTX/NVPTXTargetMachine.cpp

void initializeNVVMIntrRangePass(PassRegistry&);		void initializeNVVMIntrRangePass(PassRegistry&);
void initializeNVVMReflectPass(PassRegistry&);		void initializeNVVMReflectPass(PassRegistry&);
void initializeGenericToNVVMPass(PassRegistry&);		void initializeGenericToNVVMPass(PassRegistry&);
void initializeNVPTXAllocaHoistingPass(PassRegistry &);		void initializeNVPTXAllocaHoistingPass(PassRegistry &);
void initializeNVPTXAssignValidGlobalNamesPass(PassRegistry&);		void initializeNVPTXAssignValidGlobalNamesPass(PassRegistry&);
void initializeNVPTXLowerAggrCopiesPass(PassRegistry &);		void initializeNVPTXLowerAggrCopiesPass(PassRegistry &);
void initializeNVPTXLowerArgsPass(PassRegistry &);		void initializeNVPTXLowerArgsPass(PassRegistry &);
void initializeNVPTXLowerAllocaPass(PassRegistry &);		void initializeNVPTXLowerAllocaPass(PassRegistry &);
		void initializeNVPTXFunctionDataSharingPass(PassRegistry &);

} // end namespace llvm		} // end namespace llvm

extern "C" void LLVMInitializeNVPTXTarget() {		extern "C" void LLVMInitializeNVPTXTarget() {
// Register the target.		// Register the target.
RegisterTargetMachine<NVPTXTargetMachine32> X(getTheNVPTXTarget32());		RegisterTargetMachine<NVPTXTargetMachine32> X(getTheNVPTXTarget32());
RegisterTargetMachine<NVPTXTargetMachine64> Y(getTheNVPTXTarget64());		RegisterTargetMachine<NVPTXTargetMachine64> Y(getTheNVPTXTarget64());

// FIXME: This pass is really intended to be invoked during IR optimization,		// FIXME: This pass is really intended to be invoked during IR optimization,
// but it's very NVPTX-specific.		// but it's very NVPTX-specific.
PassRegistry &PR = *PassRegistry::getPassRegistry();		PassRegistry &PR = *PassRegistry::getPassRegistry();
initializeNVVMReflectPass(PR);		initializeNVVMReflectPass(PR);
initializeNVVMIntrRangePass(PR);		initializeNVVMIntrRangePass(PR);
initializeGenericToNVVMPass(PR);		initializeGenericToNVVMPass(PR);
initializeNVPTXAllocaHoistingPass(PR);		initializeNVPTXAllocaHoistingPass(PR);
initializeNVPTXAssignValidGlobalNamesPass(PR);		initializeNVPTXAssignValidGlobalNamesPass(PR);
initializeNVPTXLowerArgsPass(PR);		initializeNVPTXLowerArgsPass(PR);
initializeNVPTXLowerAllocaPass(PR);		initializeNVPTXLowerAllocaPass(PR);
		initializeNVPTXFunctionDataSharingPass(PR);
initializeNVPTXLowerAggrCopiesPass(PR);		initializeNVPTXLowerAggrCopiesPass(PR);
}		}

static std::string computeDataLayout(bool is64Bit) {		static std::string computeDataLayout(bool is64Bit) {
std::string Ret = "e";		std::string Ret = "e";

if (!is64Bit)		if (!is64Bit)
Ret += "-p:32:32";		Ret += "-p:32:32";
▲ Show 20 Lines • Show All 60 Lines • ▼ Show 20 Lines	public:
NVPTXTargetMachine &getNVPTXTargetMachine() const {		NVPTXTargetMachine &getNVPTXTargetMachine() const {
return getTM<NVPTXTargetMachine>();		return getTM<NVPTXTargetMachine>();
}		}

void addIRPasses() override;		void addIRPasses() override;
bool addInstSelector() override;		bool addInstSelector() override;
void addPostRegAlloc() override;		void addPostRegAlloc() override;
void addMachineSSAOptimization() override;		void addMachineSSAOptimization() override;
		void addMachineSSALowering() override;

FunctionPass *createTargetRegisterAllocator(bool) override;		FunctionPass *createTargetRegisterAllocator(bool) override;
void addFastRegAlloc(FunctionPass *RegAllocPass) override;		void addFastRegAlloc(FunctionPass *RegAllocPass) override;
void addOptimizedRegAlloc(FunctionPass *RegAllocPass) override;		void addOptimizedRegAlloc(FunctionPass *RegAllocPass) override;

private:		private:
// If the opt level is aggressive, add GVN; otherwise, add EarlyCSE. This		// If the opt level is aggressive, add GVN; otherwise, add EarlyCSE. This
// function is only called in opt mode.		// function is only called in opt mode.
▲ Show 20 Lines • Show All 84 Lines • ▼ Show 20 Lines	if (getOptLevel() != CodeGenOpt::None)
addPass(createNVPTXImageOptimizerPass());		addPass(createNVPTXImageOptimizerPass());
addPass(createNVPTXAssignValidGlobalNamesPass());		addPass(createNVPTXAssignValidGlobalNamesPass());
addPass(createGenericToNVVMPass());		addPass(createGenericToNVVMPass());

// NVPTXLowerArgs is required for correctness and should be run right		// NVPTXLowerArgs is required for correctness and should be run right
// before the address space inference passes.		// before the address space inference passes.
addPass(createNVPTXLowerArgsPass(&getNVPTXTargetMachine()));		addPass(createNVPTXLowerArgsPass(&getNVPTXTargetMachine()));
if (getOptLevel() != CodeGenOpt::None) {		if (getOptLevel() != CodeGenOpt::None) {
		// Add address space inference passes
addAddressSpaceInferencePasses();		addAddressSpaceInferencePasses();
if (!DisableLoadStoreVectorizer)		if (!DisableLoadStoreVectorizer)
addPass(createLoadStoreVectorizerPass());		addPass(createLoadStoreVectorizerPass());
addStraightLineScalarOptimizationPasses();		addStraightLineScalarOptimizationPasses();
		} else {
		// When the shared depot is generated, even when no optimizations are
		// used, we need to lower certain alloca instructions to the appropriate
		// memory type for correctness.
		addPass(createNVPTXFunctionDataSharingPass(&getNVPTXTargetMachine()));
}		}

// === LSR and other generic IR passes ===		// === LSR and other generic IR passes ===
TargetPassConfig::addIRPasses();		TargetPassConfig::addIRPasses();
// EarlyCSE is not always strong enough to clean up what LSR produces. For		// EarlyCSE is not always strong enough to clean up what LSR produces. For
// example, GVN can combine		// example, GVN can combine
//		//
// %0 = add %a, %b		// %0 = add %a, %b
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	void NVPTXPassConfig::addOptimizedRegAlloc(FunctionPass *RegAllocPass) {
addPass(&StackSlotColoringID);		addPass(&StackSlotColoringID);

// FIXME: Needs physical registers		// FIXME: Needs physical registers
//addPass(&PostRAMachineLICMID);		//addPass(&PostRAMachineLICMID);

printAndVerify("After StackSlotColoring");		printAndVerify("After StackSlotColoring");
}		}

		void NVPTXPassConfig::addMachineSSALowering() {
		// Lower shared frame indices.
		addPass(createNVPTXLowerSharedFrameIndicesPass(), false);
		}

void NVPTXPassConfig::addMachineSSAOptimization() {		void NVPTXPassConfig::addMachineSSAOptimization() {
// Pre-ra tail duplication.		// Pre-ra tail duplication.
if (addPass(&EarlyTailDuplicateID))		if (addPass(&EarlyTailDuplicateID))
printAndVerify("After Pre-RegAlloc TailDuplicate");		printAndVerify("After Pre-RegAlloc TailDuplicate");

// Optimize PHIs before DCE: removing dead PHI cycles may make more		// Optimize PHIs before DCE: removing dead PHI cycles may make more
// instructions dead.		// instructions dead.
addPass(&OptimizePHIsID);		addPass(&OptimizePHIsID);

		// To prevent SSA optimizations from treating shared and local frame
		// indices the same, lower the shared frame indices before those
		// optimizations are applied.
		addMachineSSALowering();

// This pass merges large allocas. StackSlotColoring is a different pass		// This pass merges large allocas. StackSlotColoring is a different pass
// which merges spill slots.		// which merges spill slots.
addPass(&StackColoringID);		addPass(&StackColoringID);

// If the target requests it, assign local variables to stack slots relative		// If the target requests it, assign local variables to stack slots relative
// to one another and simplify frame index references where possible.		// to one another and simplify frame index references where possible.
addPass(&LocalStackSlotAllocationID);		addPass(&LocalStackSlotAllocationID);


lib/Target/NVPTX/NVPTXUtilities.h

	//===-- NVPTXUtilities - Utilities ------------------------------ C++ --====//			//===-- NVPTXUtilities - Utilities ------------------------------ C++ --====//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file contains the declaration of the NVVM specific utility functions.			// This file contains the declaration of the NVVM specific utility functions.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_LIB_TARGET_NVPTX_NVPTXUTILITIES_H			#ifndef LLVM_LIB_TARGET_NVPTX_NVPTXUTILITIES_H
	#define LLVM_LIB_TARGET_NVPTX_NVPTXUTILITIES_H			#define LLVM_LIB_TARGET_NVPTX_NVPTXUTILITIES_H

				#include "NVPTXTargetMachine.h"
				#include "llvm/CodeGen/MachineFunction.h"
	#include "llvm/IR/Function.h"			#include "llvm/IR/Function.h"
	#include "llvm/IR/GlobalVariable.h"			#include "llvm/IR/GlobalVariable.h"
	#include "llvm/IR/IntrinsicInst.h"			#include "llvm/IR/IntrinsicInst.h"
	#include "llvm/IR/Value.h"			#include "llvm/IR/Value.h"
	#include <cstdarg>			#include <cstdarg>
	#include <set>			#include <set>
	#include <string>			#include <string>
	#include <vector>			#include <vector>

	bool getMinCTASm(const Function &, unsigned &);			bool getMinCTASm(const Function &, unsigned &);
	bool getMaxNReg(const Function &, unsigned &);			bool getMaxNReg(const Function &, unsigned &);
	bool isKernelFunction(const Function &);			bool isKernelFunction(const Function &);

	bool getAlign(const Function &, unsigned index, unsigned &);			bool getAlign(const Function &, unsigned index, unsigned &);
	bool getAlign(const CallInst &, unsigned index, unsigned &);			bool getAlign(const CallInst &, unsigned index, unsigned &);

				bool ptrIsStored(Value *Ptr);

	}			}

	#endif			#endif

lib/Target/NVPTX/NVPTXUtilities.cpp

#include <algorithm>		#include <algorithm>
#include <cstring>		#include <cstring>
#include <map>		#include <map>
#include <string>		#include <string>
#include <vector>		#include <vector>

namespace llvm {		namespace llvm {

		#define DEBUG_TYPE "nvptx-utilities"

namespace {		namespace {
typedef std::map<std::string, std::vector<unsigned> > key_val_pair_t;		typedef std::map<std::string, std::vector<unsigned> > key_val_pair_t;
typedef std::map<const GlobalValue *, key_val_pair_t> global_val_annot_t;		typedef std::map<const GlobalValue *, key_val_pair_t> global_val_annot_t;
typedef std::map<const Module *, global_val_annot_t> per_module_annot_t;		typedef std::map<const Module *, global_val_annot_t> per_module_annot_t;
} // anonymous namespace		} // anonymous namespace

static ManagedStatic<per_module_annot_t> annotationCache;		static ManagedStatic<per_module_annot_t> annotationCache;
static sys::Mutex Lock;		static sys::Mutex Lock;
for (int i = 0, n = alignNode->getNumOperands(); i < n; i++) {
return false;		return false;
}		}
}		}
}		}
}		}
return false;		return false;
}		}

		/// Returns true if there are any instructions storing
		/// the address of this pointer.
		bool ptrIsStored(Value *Ptr) {
		hfinkel (Done): Can't you use PointerMayBeCaptured (include/llvm/Analysis/CaptureTracking.h) instead of this function? If so, please do.
		SmallVector<const Value*, 16> PointerAliases;
		PointerAliases.push_back(Ptr);

		SmallVector<const User*, 16> Users;
		for (const Use &U : Ptr->uses())
		Users.push_back(U.getUser());

		for (unsigned I = 0; I < Users.size(); ++I) {
		// Get pointer usage
		const User *FU = Users[I];

		// Check if Ptr or an alias of it is the value being stored, i.e. whether
		// its address is written to memory.
		auto SI = dyn_cast<StoreInst>(FU);
		if (SI) {
		for (auto Alias: PointerAliases)
		if (SI->getValueOperand() == Alias)
		return true;
		continue;
		}

		// TODO: Can loads lead to address being taken?
		// TODO: Can GEPs lead to address being taken?

		// Bitcasts increase aliases of the pointer
		auto BI = dyn_cast<BitCastInst>(FU);
		if (BI) {
		for (const Use &U : BI->uses())
		Users.push_back(U.getUser());
		PointerAliases.push_back(BI);
		continue;
		}

		// TODO:
		// There may be other instructions which increase the number
		// of alias values ex. operations on the address of the alloca.
		// The whole alloca'ed memory region needs to be shared if at
		// least one of the values needs to be shared.
		}

		// No store of the pointer or of one of its aliases was found.
		return false;
		}

} // namespace llvm		} // namespace llvm
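
The traversal in ptrIsStored can be modelled outside LLVM to make its behaviour, and the gap raised in hfinkel's inline comment, easier to see. The following is a minimal sketch, not the patch's code: `Node`, `Kind`, `ptrIsStoredModel` and the three scenario helpers are hypothetical stand-ins for LLVM's Value/User graph. Like the patch, it follows only bitcasts, so an address that escapes through any other instruction goes unnoticed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for LLVM values; none of these types exist in LLVM.
enum class Kind { Alloca, BitCast, Store, Other };

struct Node {
  explicit Node(Kind K) : K(K) {}
  Kind K;
  const Node *ValueOperand = nullptr; // Store only: the value being written
  std::vector<const Node *> Users;    // instructions that use this node
};

// Mirrors the worklist in the patch: returns true if Ptr, or a bitcast of
// it, is the *value* operand of some store.
bool ptrIsStoredModel(const Node *Ptr) {
  std::vector<const Node *> Aliases{Ptr};
  std::vector<const Node *> Worklist(Ptr->Users.begin(), Ptr->Users.end());

  for (std::size_t I = 0; I < Worklist.size(); ++I) {
    const Node *U = Worklist[I];
    if (U->K == Kind::Store) {
      for (const Node *A : Aliases)
        if (U->ValueOperand == A)
          return true;
      continue;
    }
    if (U->K == Kind::BitCast) {
      // A bitcast is another name for the same address: follow its users.
      Worklist.insert(Worklist.end(), U->Users.begin(), U->Users.end());
      Aliases.push_back(U);
    }
    // Any other instruction kind is ignored, as in the patch's TODOs.
  }
  return false;
}

// %c = bitcast %a ; store %c, <somewhere>  => the address is stored.
bool addressStoredViaBitcast() {
  Node A(Kind::Alloca), Cast(Kind::BitCast), St(Kind::Store);
  St.ValueOperand = &Cast;
  Cast.Users.push_back(&St);
  A.Users.push_back(&Cast);
  return ptrIsStoredModel(&A);
}

// store 10, %a  => %a is only the destination; the stored value is a constant.
bool onlyWrittenTo() {
  Node A(Kind::Alloca), St(Kind::Store);
  A.Users.push_back(&St);
  return ptrIsStoredModel(&A);
}

// %g = <other inst> %a ; store %g, <somewhere>  => not followed by the model.
bool addressStoredViaOtherInst() {
  Node A(Kind::Alloca), Gep(Kind::Other), St(Kind::Store);
  St.ValueOperand = &Gep;
  Gep.Users.push_back(&St);
  A.Users.push_back(&Gep);
  return ptrIsStoredModel(&A);
}
```

In this model `addressStoredViaBitcast()` reports true, while `onlyWrittenTo()` and `addressStoredViaOtherInst()` both report false. The last result is a miss that corresponds to the TODOs in the patch, and it is the kind of case a general capture analysis, such as the one suggested in the review, is intended to cover.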

test/CodeGen/NVPTX/insert-shared-depot.ll

This file was added.

				; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s --check-prefix=PTX32
				; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s --check-prefix=PTX64

				; PTX32: {{.*}}kernel()
				; PTX64: {{.*}}kernel()

				tra (Done): You could put common checks under the same label (e.g. `CHECK`) and run tests with `-check-prefixes=PTX32,CHECK`.
				; PTX32: .local .align 8{{.}}.b8{{.}}__local_depot0
				; PTX64: .local .align 8{{.}}.b8{{.}}__local_depot0

				; PTX32: .shared .align 8{{.}}.b8{{.}}__shared_depot0
				; PTX64: .shared .align 8{{.}}.b8{{.}}__shared_depot0

				; PTX32: .reg .b32{{.*}}%SPS;
				; PTX64: .reg .b64{{.*}}%SPS;

				; PTX32: .reg .b32{{.*}}%SPSH;
				; PTX64: .reg .b64{{.*}}%SPSH;

				; PTX32: mov.u32{{.*}}%SPSH, __shared_depot0;
				; PTX64: mov.u64{{.*}}%SPSH, __shared_depot0;

				; PTX32: cvta.shared.u32{{.*}}%SPS, %SPSH;
				; PTX64: cvta.shared.u64{{.*}}%SPS, %SPSH;

				target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
				target triple = "nvptx64-unknown-unknown"

				define void @kernel() #0 {
				; LABEL: @linsert_shared_depot
				%A = alloca i32, align 4
				tra (Done): 'LABEL' is not a check-prefix and `@linsert_shared_depot` is not this function's name, so I'm puzzled what this line is supposed to do. Did you intend `<prefix>-LABEL: @kernel`? This appears in all the test cases in the patch.
				gtbercea (Author, Not Done): This is modeled after the lower-alloca.ll test which has a similar label. The label is always equal to the name of the test file. In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"
				tra (Not Done):
				> This is modeled after the lower-alloca.ll test which has a similar label.
				lower-alloca.ll indeed has the same problem.
				> The label is always equal to the name of the test file.
				I don't think FileCheck has such a feature. Nor do I see anything matching this description in the FileCheck documentation. Nor does it work. See below.
				> In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"
				The line does not check anything right now. In this test FileCheck only pays attention to lines that have CHECK or PTX64/PTX32. This line contains neither and is ignored. You can do an experiment: replace the line with `; LABEL: this should never match` and run the test. I've tried that on lower-alloca.ll and the test, as expected, passes regardless of the nonsense I put after the `LABEL:`.
				%shared_args = alloca i8**, align 8
				call void @callee(i8*** %shared_args)
				store i32 10, i32* %A
				ret void
				}

				declare void @callee(i8***)

				attributes #0 = {"has-nvptx-shared-depot"}

				!nvvm.annotations = !{!0}
				!0 = !{void ()* @kernel, !"kernel", i32 1}
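
tra's point about the bare `; LABEL:` line can be illustrated with a toy model of how a checker decides which lines count as directives. This is a sketch under simplifying assumptions, not FileCheck's implementation: `isCheckDirective` is a hypothetical helper, the suffix list is partial, and the real tool applies stricter matching rules.

```cpp
#include <string>
#include <vector>

// Toy model: a line is a directive only if it contains one of the active
// prefixes immediately followed by ':' or by a known "-SUFFIX:" form.
bool isCheckDirective(const std::string &Line,
                      const std::vector<std::string> &Prefixes) {
  static const char *const Suffixes[] = {":",     "-NEXT:", "-SAME:",
                                         "-NOT:", "-DAG:",  "-LABEL:"};
  for (const std::string &P : Prefixes)
    for (const char *S : Suffixes)
      if (Line.find(P + S) != std::string::npos)
        return true;
  return false;
}
```

With the prefix set `{"PTX32"}`, the line `; PTX32: .reg .b32{{.*}}%SPS;` is recognised, whereas `; LABEL: @linsert_shared_depot` is not, whatever text follows the colon. A line such as `; PTX-LABEL: .visible .entry kernel(` is recognised only when `PTX` is among the active prefixes.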

test/CodeGen/NVPTX/lower-alloca-shared.ll

This file was added.

				; RUN: opt < %s -S -nvptx-lower-alloca -infer-address-spaces | FileCheck %s
				; RUN: llc < %s -march=nvptx64 -mcpu=sm_35 | FileCheck %s --check-prefix PTX

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				target triple = "nvptx64-unknown-unknown"

				define void @kernel() #0 {
				; LABEL: @lower_shared_alloca
				; PTX-LABEL: .visible .entry kernel(
				%A = alloca i32
				; CHECK: addrspacecast i32* %A to i32 addrspace(3)*
				; CHECK: addrspacecast i32 addrspace(3)* %1 to i32*
				; CHECK: store i32 0, i32 addrspace(3)* {{%.+}}
				; PTX: add.u64 {{%rd[0-9]+}}, %SPS, 0;
				; PTX: cvta.to.shared.u64 {{%rd[0-9]+}}, {{%rd[0-9]+}};
				; PTX: st.shared.u32 [{{%rd[0-9]+}}], {{%r[0-9]+}}
				%shared_args = alloca i32**
				call void @callee(i32*** %shared_args)
				%1 = load i32, i32* %shared_args
				%2 = getelementptr inbounds i32, i32* %1, i64 0
				store i32* %A, i32** %2
				store i32 0, i32* %A
				ret void
				}

				declare void @callee(i32***)

				attributes #0 = {"has-nvptx-shared-depot"}

				!nvvm.annotations = !{!0}
				!0 = !{void ()* @kernel, !"kernel", i32 1}

test/CodeGen/NVPTX/no-shared-depot.ll

This file was added.

				; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s --check-prefix=PTX32
				; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s --check-prefix=PTX64

				; PTX32: {{.*}}kernel()
				; PTX64: {{.*}}kernel()

				; PTX32: .local .align 8{{.}}.b8{{.}}__local_depot0
				; PTX64: .local .align 8{{.}}.b8{{.}}__local_depot0

				; PTX32-NOT: .shared .align 8{{.}}.b8{{.}}__shared_depot0
				; PTX64-NOT: .shared .align 8{{.}}.b8{{.}}__shared_depot0

				; PTX32-NOT: .reg .b32{{.*}}%SPS;
				; PTX64-NOT: .reg .b64{{.*}}%SPS;

				; PTX32-NOT: .reg .b32{{.*}}%SPSH;
				; PTX64-NOT: .reg .b64{{.*}}%SPSH;

				; PTX32-NOT: mov.u32{{.*}}%SPSH, __shared_depot0;
				; PTX64-NOT: mov.u64{{.*}}%SPSH, __shared_depot0;

				; PTX32-NOT: cvta.shared.u32{{.*}}%SPS, %SPSH;
				; PTX64-NOT: cvta.shared.u64{{.*}}%SPS, %SPSH;

				target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
				target triple = "nvptx64-unknown-unknown"

				define void @kernel() {
				; LABEL: @linsert_shared_depot
				%A = alloca i32, align 4
				%shared_args = alloca i8**, align 8
				call void @callee(i8*** %shared_args)
				store i32 10, i32* %A
				ret void
				}

				declare void @callee(i8***)

				!nvvm.annotations = !{!0}
				!0 = !{void ()* @kernel, !"kernel", i32 1}

test/CodeGen/NVPTX/nvptx-function-data-sharing.ll

This file was added.

				; RUN: opt < %s -S -nvptx-function-data-sharing -infer-address-spaces | FileCheck %s
				; RUN: llc < %s -march=nvptx64 -mcpu=sm_35 | FileCheck %s --check-prefix PTX

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				target triple = "nvptx64-unknown-unknown"

				define void @kernel() #0 {
				; LABEL: @lower_shared_alloca
				; PTX-LABEL: .visible .entry kernel(
				%A = alloca i32
				; CHECK: addrspacecast i32* %A to i32 addrspace(3)*
				; CHECK: addrspacecast i32 addrspace(3)* %A1 to i32*
				; CHECK: store i32 0, i32 addrspace(3)* {{%.+}}
				; PTX: add.u64 {{%rd[0-9]+}}, %SPS, 0;
				; PTX: cvta.to.shared.u64 {{%rd[0-9]+}}, {{%rd[0-9]+}};
				; PTX: st.shared.u32 [{{%rd[0-9]+}}], {{%r[0-9]+}}
				%shared_args = alloca i32**
				call void @callee(i32*** %shared_args)
				%1 = load i32, i32* %shared_args
				%2 = getelementptr inbounds i32, i32* %1, i64 0
				store i32* %A, i32** %2
				store i32 0, i32* %A
				ret void
				}

				declare void @callee(i32***)

				attributes #0 = {"has-nvptx-shared-depot"}

				!nvvm.annotations = !{!0}
				!0 = !{void ()* @kernel, !"kernel", i32 1}
