
Details

Reviewers

jholewinski
jingyue

Commits

rG9c71150bfbe4: Add NVPTXPeephole pass to reduce unnecessary address cast
rL240587: Add NVPTXPeephole pass to reduce unnecessary address cast

Summary

This patch first changes the register that holds the local address of the
stack frame to %SPL. The new NVPTXPeephole pass then scans for the following
pattern

%vreg0<def> = LEA_ADDRi64 <fi#0>, 4
%vreg1<def> = cvta_to_local %vreg0

and transform it into

%vreg1<def> = LEA_ADDRi64 %VRFrameLocal, 4

Patch by Xuetian Weng
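
The rewrite described in the summary can be modelled outside LLVM. What follows is a minimal sketch, not the pass itself: `ToyInstr`, `Opcode`, `countUses`, and `combineCvtaToLocal` are hypothetical names, and the per-object offset table is an assumption standing in for the real frame layout. The sketch folds the cast into a single address computation relative to the local frame register, and drops the original address computation only when the cast was its sole user.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for machine instructions; none of these are LLVM types.
enum class Opcode { LeaAddrFI, LeaAddrLocal, CvtaToLocal, Other };

struct ToyInstr {
  Opcode Op;
  int Def;  // virtual register defined by this instruction (-1 if none)
  int Src;  // LeaAddrFI: frame index; CvtaToLocal/Other: register read
  long Imm; // LeaAddrFI/LeaAddrLocal: byte offset; unused otherwise
};

// Number of instructions in Block that read virtual register Reg.
int countUses(const std::vector<ToyInstr> &Block, int Reg) {
  int Uses = 0;
  for (const ToyInstr &I : Block)
    if ((I.Op == Opcode::CvtaToLocal || I.Op == Opcode::Other) && I.Src == Reg)
      ++Uses;
  return Uses;
}

// Rewrite each "LeaAddrFI; CvtaToLocal" pair into one LeaAddrLocal whose
// offset is the frame object's offset plus the original immediate.
bool combineCvtaToLocal(std::vector<ToyInstr> &Block,
                        const std::vector<long> &ObjectOffsets) {
  bool Changed = false;
  for (std::size_t I = 0; I < Block.size(); ++I) {
    if (Block[I].Op != Opcode::CvtaToLocal)
      continue;
    // Look for the earlier instruction in the block that defines the source.
    std::size_t DefIdx = Block.size();
    for (std::size_t J = 0; J < I; ++J)
      if (Block[J].Def == Block[I].Src)
        DefIdx = J;
    if (DefIdx == Block.size() || Block[DefIdx].Op != Opcode::LeaAddrFI)
      continue;

    const ToyInstr Prev = Block[DefIdx];
    assert(Prev.Src >= 0 && "frame index must be non-negative");
    const long Offset =
        ObjectOffsets.at(static_cast<std::size_t>(Prev.Src)) + Prev.Imm;
    const bool PrevOnlyFeedsCast = countUses(Block, Prev.Def) == 1;

    // Replace the cast with a local-frame-relative address computation.
    Block[I] = ToyInstr{Opcode::LeaAddrLocal, Block[I].Def, -1, Offset};
    if (PrevOnlyFeedsCast) {
      Block.erase(Block.begin() + static_cast<std::ptrdiff_t>(DefIdx));
      --I; // the rewritten instruction moved down by one slot
    }
    Changed = true;
  }
  return Changed;
}
```

With a single frame object at offset 0, the pair from the summary (`<fi#0>, 4` followed by the cast) collapses to one instruction with offset 4. If another instruction still reads the generic address, the original address computation is kept.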

Diff Detail

Event Timeline

wengxt updated this revision to Diff 27961. Jun 18 2015, 1:56 PM

wengxt retitled this revision from to Add NVPTXPeephole pass to reduce unnecessary address cast.

wengxt updated this object.

wengxt edited the test plan for this revision.

wengxt added reviewers: jholewinski, jingyue.

wengxt added a subscriber: Unknown Object (MLST).

Herald added a subscriber: jholewinski. Jun 18 2015, 1:56 PM

jingyue added inline comments. Jun 18 2015, 4:07 PM

lib/Target/NVPTX/NVPTXFrameLowering.cpp
62–68	Is `NVPTX::VRFrameLocal` 32-bit or 64-bit? You use it for both 32-bit and 64-bit. Does that matter?
lib/Target/NVPTX/NVPTXPeephole.cpp
65	style nit: we prefer putting { on the same line as the function header.
82	The indentation looks strange. Did you run clang-format?
90	Could there be multiple frame indices? `fi#0`, `fi#1`, ...
98	again, move { up one line.
110	The type casts seem unnecessary, aren't they?
test/CodeGen/NVPTX/call-with-alloca-buffer.ll
24	I don't see `cvta.local.u64` removed in any tests, probably because %SP is still used. Is it worthwhile adding a test where the `cvta.local.u64` can indeed be removed, because that's the whole point of this peephole optimization?

eliben added a subscriber: eliben. Jun 18 2015, 4:14 PM

eliben added inline comments.

lib/Target/NVPTX/NVPTXPeephole.cpp
9	Slight rewording/grammar: "In NVPTX, we always use a special frame register which holds local address of frame. NVPTXLowerAlloca may introduce a lot of cvta.to.local instructions. This peephole pass optimizes these cases, for example". Also, I'd clarify a bit more what "holds local address" means.
82	FWIW - since this is a new file, I'd suggest running it through clang-format with the LLVM style setting.

wengxt added inline comments. Jun 18 2015, 5:37 PM

lib/Target/NVPTX/NVPTXFrameLowering.cpp
62–68	See NVPTXAsmPrinter::setAndEmitFunctionVirtualRegisters; the register width changes depending on the arch. NVPTXRegisterInfo.td defines them as i32, but that information does not seem to be used.
lib/Target/NVPTX/NVPTXPeephole.cpp
110	The latter cast is unnecessary; the first one is needed to eliminate ambiguity.

Fix issues raised in comments.

Add a new test case covering the case where cvta.local %SP, %SPL is not emitted.

wengxt added inline comments. Jun 18 2015, 5:45 PM

lib/Target/NVPTX/NVPTXPeephole.cpp
9	Expanded with more details.
65	Fixed
90	Fixed
test/CodeGen/NVPTX/call-with-alloca-buffer.ll
24	A test case for this is added in local-stack-frame.ll

jingyue added inline comments. Jun 18 2015, 9:02 PM

lib/Target/NVPTX/NVPTXFrameLowering.cpp
51	I'd run clang-format on this file again. Indentation changed since your last modification.
lib/Target/NVPTX/NVPTXPeephole.cpp
19	"avoid casting"
90	Did you miss uploading something? I didn't see that anything changed.

wengxt added inline comments. Jun 19 2015, 1:31 PM

lib/Target/NVPTX/NVPTXPeephole.cpp
90	The related change is in CombineCVTAToLocal.

update based on previous comments

LGTM

lib/Target/NVPTX/NVPTXPeephole.cpp
90	ACK'ed

This revision is now accepted and ready to land. Jun 22 2015, 5:48 PM

What is the future expectation for NVPTXPeephole? Are you planning on adding additional transforms? If not, perhaps a more specific name is warranted. Otherwise, LGTM! Thanks!

lib/Target/NVPTX/NVPTXPeephole.cpp
8	Missing separator, e.g. ===----------------------------------------------------------------------===
lib/Target/NVPTX/NVPTXRegisterInfo.td
68	Good catch!

In D10549#193496, @jholewinski wrote:

What is the future expectation for NVPTXPeephole? Are you planning on adding additional transforms? If not, perhaps a more specific name is warranted. Otherwise, LGTM! Thanks!

One thing I currently have in mind is combining other unnecessary address casts, much as this pass does. I've noticed other pointless address casts in the generated code (e.g. the "FIXME" in call-with-alloca-buffer.ll).

update based on comments.

jingyue updated this object. Jun 24 2015, 1:02 PM

jingyue closed this revision. Jun 24 2015, 1:24 PM

Hi jingyue,

I got a regression caused by this patch. What I see is that %SP is still used if references to automatic vars (alloca) are passed to functions but %SP is no longer initialized to anything.

This (broken) code is now being generated:

.visible .entry __omptgt__0_db262_31_(
        .param .u64 __omptgt__0_db262_31__param_0,
        .param .u64 __omptgt__0_db262_31__param_1,
        .param .u64 __omptgt__0_db262_31__param_2,
        .param .u64 __omptgt__0_db262_31__param_3,
        .param .u64 __omptgt__0_db262_31__param_4
)
{
        .local .align 8 .b8     __local_depot0[56];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<18>;
        .reg .s32       %r<31>;
        .reg .s64       %rd<30>;

        mov.u64         %SPL, __local_depot0;
...
        add.u64         %rd14, %SP, 36;
        // Callseq Start 1
        {
        .reg .b32 temp_param_reg;
        // <end>}
        .param .b64 param0;
        st.param.b64    [param0+0], %rd12;
        .param .b32 param1;
        st.param.b32    [param1+0], %r6;
        .param .b32 param2;
        st.param.b32    [param2+0], %r18;
        .param .b64 param3;
        st.param.b64    [param3+0], %rd13;
        .param .b64 param4;
        st.param.b64    [param4+0], %rd10;
        .param .b64 param5;
        st.param.b64    [param5+0], %rd8;
        .param .b64 param6;
        st.param.b64    [param6+0], %rd14;
        .param .b32 param7;
        st.param.b32    [param7+0], %r10;
        .param .b32 param8;
        st.param.b32    [param8+0], %r11;
        call.uni 
        __kmpc_for_static_init_4, 
        (
        param0, 
        param1, 
        param2, 
        param3, 
        param4, 
        param5, 
        param6, 
        param7, 
        param8
        );

%rd14 is computed from %SP, but %SP never gets initialized. Before the patch I used to have:

.visible .entry __omptgt__0_db262_31_(
        .param .u64 __omptgt__0_db262_31__param_0,
        .param .u64 __omptgt__0_db262_31__param_1,
        .param .u64 __omptgt__0_db262_31__param_2,
        .param .u64 __omptgt__0_db262_31__param_3,
        .param .u64 __omptgt__0_db262_31__param_4
)
{
        .local .align 8 .b8     __local_depot0[56];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<18>;
        .reg .s32       %r<31>;
        .reg .s64       %rd<31>;

        mov.u64         %rd30, __local_depot0;
        cvta.local.u64  %SP, %rd30;

%SP is initialized properly (and %SPL is not used at all) so the references are properly generated. I suspect that the changes in this patch have to be reflected in other pieces of the backend. What do you think is the best way to tackle the problem?

Thanks,
Samuel
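
One possible reading of the symptom, offered as a guess rather than a diagnosis: the prologue in this diff guards the `cvta.local` on `MR.hasOneNonDBGUse(NVPTX::VRFrame)`, which holds only when the register has exactly one non-debug use. A function that reads %SP more than once would then skip the initialization. The sketch below models that with a toy use-count table. `ToyRegInfo` and both guard functions are hypothetical, and nothing in this thread shows whether D10844 makes this particular change.

```cpp
#include <cassert>
#include <map>

// Toy use-count table; a stand-in for, not a copy of, MachineRegisterInfo.
struct ToyRegInfo {
  std::map<int, int> NonDbgUses; // register id -> number of non-debug uses

  int uses(int Reg) const {
    auto It = NonDbgUses.find(Reg);
    return It == NonDbgUses.end() ? 0 : It->second;
  }
  bool hasOneNonDBGUse(int Reg) const { return uses(Reg) == 1; }
  bool useEmpty(int Reg) const { return uses(Reg) == 0; }
};

// Guard shaped like the one in emitPrologue in this diff: initialize %SP only
// when it has exactly one use.
bool initsSPWithOneUseGuard(const ToyRegInfo &RI, int SP) {
  assert(RI.uses(SP) >= 0 && "use counts cannot be negative");
  return RI.hasOneNonDBGUse(SP);
}

// Hypothetical alternative: initialize %SP whenever it has any use at all.
bool initsSPWithAnyUseGuard(const ToyRegInfo &RI, int SP) {
  return !RI.useEmpty(SP);
}
```

With two uses of %SP the first guard returns false and the second returns true. The two guards agree only for zero or one use.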

Looks like a bug to me. We'll figure it out. Thank you!

Samuel,

I hope http://reviews.llvm.org/D10844 fixes the issue on your end. LMK if
it doesn't.

Hi jingyue,

Thanks for looking into this.

The issue is solved partially by http://reviews.llvm.org/D10844. Now the address can be accessed inside the function. However, what I read from that address in the callee is different from what I write in the caller.

I've been investigating the issue. With the patch, this is what I get for, e.g., param5, which should point to the constant 63 (I'm filtering out the code for the other parameters):

        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
...
        add.u64         %rd3, %SPL, 0;
...
        mov.u32         %r21, 63;
        st.local.u32    [%rd3], %r21;
...
        add.u64         %rd18, %SP, 32;
...
        .param .b64 param5;
        st.param.b64    [param5+0], %rd18;
...
        call.uni 
        __kmpc_for_static_init_4, 
        (
...
        param5, 
...
        );

The code I had before (that was working), for the same parameter, was:

        mov.u64         %rd29, __local_depot0;
        cvta.local.u64  %SP, %rd29;
...
        add.u64         %rd9, %SP, 32;
        cvta.to.local.u64       %rd3, %rd9;
...
        mov.u32         %r21, 63;
        st.local.u32    [%rd3], %r21;
...
        add.u64         %rd18, %SP, 32;
...
        .param .b64 param5;
        st.param.b64    [param5+0], %rd18;
...
        call.uni 
        __kmpc_for_static_init_4, 
        (
...
        param5, 
...
        );

So apparently, the offsets of the frame are not being produced properly after the patch. Let me know if you need more information.

Thanks!
Samuel
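
The mismatch in the two listings above comes down to two different byte offsets into the same frame: the store goes through %SPL + 0, while the pointer handed to the callee is %SP + 32. The sketch below replays that on a plain byte array. `Depot`, `storeU32`, `loadU32`, and `calleeSees` are hypothetical helpers, the 56-byte size is borrowed from the earlier listing as an assumption, and the buffer is zero-filled only so that the result is deterministic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy stand-in for __local_depot0; the size is an assumption.
using Depot = std::array<unsigned char, 56>;

void storeU32(Depot &D, std::size_t Offset, std::uint32_t Value) {
  assert(Offset + sizeof(Value) <= D.size() && "store out of bounds");
  std::memcpy(D.data() + Offset, &Value, sizeof(Value));
}

std::uint32_t loadU32(const Depot &D, std::size_t Offset) {
  std::uint32_t Value = 0;
  assert(Offset + sizeof(Value) <= D.size() && "load out of bounds");
  std::memcpy(&Value, D.data() + Offset, sizeof(Value));
  return Value;
}

// The caller stores Value at StoreOffset; the callee reads through a pointer
// to ArgOffset. Returns what the callee observes.
std::uint32_t calleeSees(std::size_t StoreOffset, std::size_t ArgOffset,
                         std::uint32_t Value) {
  Depot D{}; // zero-filled here; real local memory would be uninitialized
  storeU32(D, StoreOffset, Value);
  return loadU32(D, ArgOffset);
}
```

`calleeSees(0, 32, 63)` models the code generated with the patch and does not return 63, whereas `calleeSees(32, 32, 63)` models the earlier, working code.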

Hi Samuel,

I hope http://reviews.llvm.org/D10853 can resolve the other issue you found.

FYI, D10853 submitted in r241185.

Seems to be fixed now.

Thanks!
Samuel

Diff 28379

lib/Target/NVPTX/CMakeLists.txt

Show All 16 Lines	set(NVPTXCodeGen_sources
NVPTXGenericToNVVM.cpp		NVPTXGenericToNVVM.cpp
NVPTXISelDAGToDAG.cpp		NVPTXISelDAGToDAG.cpp
NVPTXISelLowering.cpp		NVPTXISelLowering.cpp
NVPTXImageOptimizer.cpp		NVPTXImageOptimizer.cpp
NVPTXInstrInfo.cpp		NVPTXInstrInfo.cpp
NVPTXLowerAggrCopies.cpp		NVPTXLowerAggrCopies.cpp
NVPTXLowerKernelArgs.cpp		NVPTXLowerKernelArgs.cpp
NVPTXLowerAlloca.cpp		NVPTXLowerAlloca.cpp
		NVPTXPeephole.cpp
NVPTXMCExpr.cpp		NVPTXMCExpr.cpp
NVPTXPrologEpilogPass.cpp		NVPTXPrologEpilogPass.cpp
NVPTXRegisterInfo.cpp		NVPTXRegisterInfo.cpp
NVPTXReplaceImageHandles.cpp		NVPTXReplaceImageHandles.cpp
NVPTXSubtarget.cpp		NVPTXSubtarget.cpp
NVPTXTargetMachine.cpp		NVPTXTargetMachine.cpp
NVPTXTargetTransformInfo.cpp		NVPTXTargetTransformInfo.cpp
NVPTXUtilities.cpp		NVPTXUtilities.cpp
NVVMReflect.cpp		NVVMReflect.cpp
)		)

add_llvm_target(NVPTXCodeGen ${NVPTXCodeGen_sources})		add_llvm_target(NVPTXCodeGen ${NVPTXCodeGen_sources})

add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(InstPrinter)		add_subdirectory(InstPrinter)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)

lib/Target/NVPTX/NVPTX.h

	Show First 20 Lines • Show All 65 Lines • ▼ Show 20 Lines
	FunctionPass *createNVPTXFavorNonGenericAddrSpacesPass();			FunctionPass *createNVPTXFavorNonGenericAddrSpacesPass();
	ModulePass *createNVVMReflectPass();			ModulePass *createNVVMReflectPass();
	ModulePass *createNVVMReflectPass(const StringMap<int>& Mapping);			ModulePass *createNVVMReflectPass(const StringMap<int>& Mapping);
	MachineFunctionPass *createNVPTXPrologEpilogPass();			MachineFunctionPass *createNVPTXPrologEpilogPass();
	MachineFunctionPass *createNVPTXReplaceImageHandlesPass();			MachineFunctionPass *createNVPTXReplaceImageHandlesPass();
	FunctionPass *createNVPTXImageOptimizerPass();			FunctionPass *createNVPTXImageOptimizerPass();
	FunctionPass *createNVPTXLowerKernelArgsPass(const NVPTXTargetMachine *TM);			FunctionPass *createNVPTXLowerKernelArgsPass(const NVPTXTargetMachine *TM);
	BasicBlockPass *createNVPTXLowerAllocaPass();			BasicBlockPass *createNVPTXLowerAllocaPass();
				MachineFunctionPass *createNVPTXPeephole();

	bool isImageOrSamplerVal(const Value *, const Module *);			bool isImageOrSamplerVal(const Value *, const Module *);

	extern Target TheNVPTXTarget32;			extern Target TheNVPTXTarget32;
	extern Target TheNVPTXTarget64;			extern Target TheNVPTXTarget64;

	namespace NVPTX {			namespace NVPTX {
	enum DrvInterface {			enum DrvInterface {
	▲ Show 20 Lines • Show All 115 Lines • Show Last 20 Lines

lib/Target/NVPTX/NVPTXFrameLowering.cpp

	Show All 30 Lines

	bool NVPTXFrameLowering::hasFP(const MachineFunction &MF) const { return true; }			bool NVPTXFrameLowering::hasFP(const MachineFunction &MF) const { return true; }

	void NVPTXFrameLowering::emitPrologue(MachineFunction &MF,			void NVPTXFrameLowering::emitPrologue(MachineFunction &MF,
	MachineBasicBlock &MBB) const {			MachineBasicBlock &MBB) const {
	if (MF.getFrameInfo()->hasStackObjects()) {			if (MF.getFrameInfo()->hasStackObjects()) {
	assert(&MF.front() == &MBB && "Shrink-wrapping not yet supported");			assert(&MF.front() == &MBB && "Shrink-wrapping not yet supported");
	// Insert "mov.u32 %SP, %Depot"			// Insert "mov.u32 %SP, %Depot"
	MachineBasicBlock::iterator MBBI = MBB.begin();			MachineInstr *MI = MBB.begin();
				MachineRegisterInfo &MR = MF.getRegInfo();

	// This instruction really occurs before first instruction			// This instruction really occurs before first instruction
	// in the BB, so giving it no debug location.			// in the BB, so giving it no debug location.
	DebugLoc dl = DebugLoc();			DebugLoc dl = DebugLoc();

	MachineRegisterInfo &MRI = MF.getRegInfo();

	// mov %SPL, %depot;			// mov %SPL, %depot;
	// cvta.local %SP, %SPL;			// cvta.local %SP, %SPL;
	if (static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit()) {			if (static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit()) {
	unsigned LocalReg = MRI.createVirtualRegister(&NVPTX::Int64RegsRegClass);			// Check if %SP is actually used
	MachineInstr *MI =			if (MR.hasOneNonDBGUse(NVPTX::VRFrame)) {
	BuildMI(MBB, MBBI, dl, MF.getSubtarget().getInstrInfo()->get(			MI = BuildMI(MBB, MI, dl, MF.getSubtarget().getInstrInfo()->get(
	NVPTX::cvta_local_yes_64),			NVPTX::cvta_local_yes_64),
	NVPTX::VRFrame).addReg(LocalReg);			NVPTX::VRFrame)
				.addReg(NVPTX::VRFrameLocal);
				}

	BuildMI(MBB, MI, dl,			BuildMI(MBB, MI, dl,
	MF.getSubtarget().getInstrInfo()->get(NVPTX::MOV_DEPOT_ADDR_64),			MF.getSubtarget().getInstrInfo()->get(NVPTX::MOV_DEPOT_ADDR_64),
	LocalReg).addImm(MF.getFunctionNumber());			NVPTX::VRFrameLocal)
				.addImm(MF.getFunctionNumber());
	} else {			} else {
	unsigned LocalReg = MRI.createVirtualRegister(&NVPTX::Int32RegsRegClass);			// Check if %SP is actually used
	MachineInstr *MI =			if (MR.hasOneNonDBGUse(NVPTX::VRFrame)) {
	BuildMI(MBB, MBBI, dl,			MI = BuildMI(MBB, MI, dl, MF.getSubtarget().getInstrInfo()->get(
	MF.getSubtarget().getInstrInfo()->get(NVPTX::cvta_local_yes),			NVPTX::cvta_local_yes),
	NVPTX::VRFrame).addReg(LocalReg);			NVPTX::VRFrame)
				.addReg(NVPTX::VRFrameLocal);
				}
	BuildMI(MBB, MI, dl,			BuildMI(MBB, MI, dl,
	MF.getSubtarget().getInstrInfo()->get(NVPTX::MOV_DEPOT_ADDR),			MF.getSubtarget().getInstrInfo()->get(NVPTX::MOV_DEPOT_ADDR),
	LocalReg).addImm(MF.getFunctionNumber());			NVPTX::VRFrameLocal)
				.addImm(MF.getFunctionNumber());
	}			}
	}			}
	}			}

	void NVPTXFrameLowering::emitEpilogue(MachineFunction &MF,			void NVPTXFrameLowering::emitEpilogue(MachineFunction &MF,
	MachineBasicBlock &MBB) const {}			MachineBasicBlock &MBB) const {}

	// This function eliminates ADJCALLSTACKDOWN,			// This function eliminates ADJCALLSTACKDOWN,
	// ADJCALLSTACKUP pseudo instructions			// ADJCALLSTACKUP pseudo instructions
	void NVPTXFrameLowering::eliminateCallFramePseudoInstr(			void NVPTXFrameLowering::eliminateCallFramePseudoInstr(
	MachineFunction &MF, MachineBasicBlock &MBB,			MachineFunction &MF, MachineBasicBlock &MBB,
	MachineBasicBlock::iterator I) const {			MachineBasicBlock::iterator I) const {
	// Simply discard ADJCALLSTACKDOWN,			// Simply discard ADJCALLSTACKDOWN,
	// ADJCALLSTACKUP instructions.			// ADJCALLSTACKUP instructions.
	MBB.erase(I);			MBB.erase(I);
	}			}

lib/Target/NVPTX/NVPTXPeephole.cpp

This file was added.

				//===-- NVPTXPeephole.cpp - NVPTX Peephole Optimizations ------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// In NVPTX, NVPTXFrameLowering emits the following instructions at the
				// beginning of a MachineFunction.
				//
				// mov %SPL, %depot
				// cvta.local %SP, %SPL
				//
				// Because Frame Index is a generic address and alloca can only return generic
				// pointer, without this pass the instructions producing alloca'ed address will
				// be based on %SP. NVPTXLowerAlloca tends to help replace store and load on
				// this address with their .local versions, but this may introduce a lot of
				// cvta.to.local instructions. Performance can be improved if we avoid casting
				// address back and forth and directly calculate local address based on %SPL.
				// This peephole pass optimizes these cases. For example, it transforms the
				// following pattern
				// %vreg0<def> = LEA_ADDRi64 <fi#0>, 4
				// %vreg1<def> = cvta_to_local_yes_64 %vreg0
				//
				// into
				// %vreg1<def> = LEA_ADDRi64 %VRFrameLocal, 4
				//
				// %VRFrameLocal is the virtual register name of %SPL
				//
				//===----------------------------------------------------------------------===//

				#include "NVPTX.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/CodeGen/MachineFrameInfo.h"
				#include "llvm/Target/TargetRegisterInfo.h"
				#include "llvm/Target/TargetInstrInfo.h"

				using namespace llvm;

				#define DEBUG_TYPE "nvptx-peephole"

				namespace llvm {
				void initializeNVPTXPeepholePass(PassRegistry &);
				}

				namespace {
				struct NVPTXPeephole : public MachineFunctionPass {
				public:
				static char ID;
				NVPTXPeephole() : MachineFunctionPass(ID) {
				initializeNVPTXPeepholePass(*PassRegistry::getPassRegistry());
				}

				bool runOnMachineFunction(MachineFunction &MF) override;

				const char *getPassName() const override {
				return "NVPTX optimize redundant cvta.to.local instruction";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				MachineFunctionPass::getAnalysisUsage(AU);
				}
				};
				}

				char NVPTXPeephole::ID = 0;

				INITIALIZE_PASS(NVPTXPeephole, "nvptx-peephole", "NVPTX Peephole", false, false)

				static bool isCVTAToLocalCombinationCandidate(MachineInstr &Root) {
				auto &MBB = *Root.getParent();
				auto &MF = *MBB.getParent();
				// Check current instruction is cvta.to.local
				if (Root.getOpcode() != NVPTX::cvta_to_local_yes_64 &&
				Root.getOpcode() != NVPTX::cvta_to_local_yes)
				return false;

				auto &Op = Root.getOperand(1);
				const auto &MRI = MF.getRegInfo();
				MachineInstr *GenericAddrDef = nullptr;
				if (Op.isReg() && TargetRegisterInfo::isVirtualRegister(Op.getReg())) {
				GenericAddrDef = MRI.getUniqueVRegDef(Op.getReg());
				}

				// Check the register operand is uniquely defined by LEA_ADDRi instruction
				if (!GenericAddrDef || GenericAddrDef->getParent() != &MBB ||
				(GenericAddrDef->getOpcode() != NVPTX::LEA_ADDRi64 &&
				GenericAddrDef->getOpcode() != NVPTX::LEA_ADDRi)) {
				return false;
				}

				// Check the LEA_ADDRi operand is Frame index
				auto &BaseAddrOp = GenericAddrDef->getOperand(1);
				if (BaseAddrOp.getType() == MachineOperand::MO_FrameIndex) {
				return true;
				}

				return false;
				}

				static void CombineCVTAToLocal(MachineInstr &Root) {
				auto &MBB = *Root.getParent();
				auto &MF = *MBB.getParent();
				const auto &MRI = MF.getRegInfo();
				const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
				auto &Prev = *MRI.getUniqueVRegDef(Root.getOperand(1).getReg());

				// Get the correct offset
				int FrameIndex = Prev.getOperand(1).getIndex();
				int Offset = MF.getFrameInfo()->getObjectOffset(FrameIndex) +
				Prev.getOperand(2).getImm();

				MachineInstrBuilder MIB =
				BuildMI(MF, Root.getDebugLoc(), TII->get(Prev.getOpcode()),
				Root.getOperand(0).getReg())
				.addReg(NVPTX::VRFrameLocal)
				.addOperand(MachineOperand::CreateImm(Offset));

				MBB.insert((MachineBasicBlock::iterator)&Root, MIB);

				// Check if MRI has only one non dbg use, which is Root
				if (MRI.hasOneNonDBGUse(Prev.getOperand(0).getReg())) {
				Prev.eraseFromParentAndMarkDBGValuesForRemoval();
				}
				Root.eraseFromParentAndMarkDBGValuesForRemoval();
				}

				bool NVPTXPeephole::runOnMachineFunction(MachineFunction &MF) {
				bool Changed = false;
				// Loop over all of the basic blocks.
				for (auto &MBB : MF) {
				// Traverse the basic block.
				auto BlockIter = MBB.begin();

				while (BlockIter != MBB.end()) {
				auto &MI = *BlockIter++;
				if (isCVTAToLocalCombinationCandidate(MI)) {
				CombineCVTAToLocal(MI);
				Changed = true;
				}
				} // Instruction
				} // Basic Block
				return Changed;
				}

				MachineFunctionPass *llvm::createNVPTXPeephole() { return new NVPTXPeephole(); }
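
The loop in `runOnMachineFunction` advances its iterator before handing the current instruction to `CombineCVTAToLocal`, which may erase that instruction. The same idiom can be shown on a standard container. This is a minimal sketch using `std::list` rather than `MachineBasicBlock`, and `eraseEvens` is a hypothetical name chosen for illustration.

```cpp
#include <cassert>
#include <list>

// Erase every even value while walking the list. The iterator is advanced
// before the current element is examined, so erasing that element does not
// invalidate the loop position.
int eraseEvens(std::list<int> &Values) {
  int Erased = 0;
  auto It = Values.begin();
  while (It != Values.end()) {
    auto Cur = It++; // It already points past the element we may erase
    if (*Cur % 2 == 0) {
      Values.erase(Cur);
      ++Erased;
    }
  }
  assert(It == Values.end() && "the walk must reach the end of the list");
  return Erased;
}
```

Erasing an element that comes before the current one, as the pass does with the defining `LEA_ADDRi` instruction, also leaves the advanced iterator valid in a linked list.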

lib/Target/NVPTX/NVPTXRegisterInfo.td

	Show First 20 Lines • Show All 59 Lines • ▼ Show 20 Lines
	def Float32Regs : NVPTXRegClass<[f32], 32, (add (sequence "F%u", 0, 4))>;			def Float32Regs : NVPTXRegClass<[f32], 32, (add (sequence "F%u", 0, 4))>;
	def Float64Regs : NVPTXRegClass<[f64], 64, (add (sequence "FL%u", 0, 4))>;			def Float64Regs : NVPTXRegClass<[f64], 64, (add (sequence "FL%u", 0, 4))>;
	def Int32ArgRegs : NVPTXRegClass<[i32], 32, (add (sequence "ia%u", 0, 4))>;			def Int32ArgRegs : NVPTXRegClass<[i32], 32, (add (sequence "ia%u", 0, 4))>;
	def Int64ArgRegs : NVPTXRegClass<[i64], 64, (add (sequence "la%u", 0, 4))>;			def Int64ArgRegs : NVPTXRegClass<[i64], 64, (add (sequence "la%u", 0, 4))>;
	def Float32ArgRegs : NVPTXRegClass<[f32], 32, (add (sequence "fa%u", 0, 4))>;			def Float32ArgRegs : NVPTXRegClass<[f32], 32, (add (sequence "fa%u", 0, 4))>;
	def Float64ArgRegs : NVPTXRegClass<[f64], 64, (add (sequence "da%u", 0, 4))>;			def Float64ArgRegs : NVPTXRegClass<[f64], 64, (add (sequence "da%u", 0, 4))>;

	// Read NVPTXRegisterInfo.cpp to see how VRFrame and VRDepot are used.			// Read NVPTXRegisterInfo.cpp to see how VRFrame and VRDepot are used.
	def SpecialRegs : NVPTXRegClass<[i32], 32, (add VRFrame, VRDepot,			def SpecialRegs : NVPTXRegClass<[i32], 32, (add VRFrame, VRFrameLocal, VRDepot,
	(sequence "ENVREG%u", 0, 31))>;			(sequence "ENVREG%u", 0, 31))>;

lib/Target/NVPTX/NVPTXTargetMachine.cpp

Show First 20 Lines • Show All 199 Lines • ▼ Show 20 Lines	bool NVPTXPassConfig::addInstSelector() {

addPass(createLowerAggrCopies());		addPass(createLowerAggrCopies());
addPass(createAllocaHoisting());		addPass(createAllocaHoisting());
addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));		addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));

if (!ST.hasImageHandles())		if (!ST.hasImageHandles())
addPass(createNVPTXReplaceImageHandlesPass());		addPass(createNVPTXReplaceImageHandlesPass());

		addPass(createNVPTXPeephole());

return false;		return false;
}		}

void NVPTXPassConfig::addPostRegAlloc() {		void NVPTXPassConfig::addPostRegAlloc() {
addPass(createNVPTXPrologEpilogPass(), false);		addPass(createNVPTXPrologEpilogPass(), false);
}		}

FunctionPass *NVPTXPassConfig::createTargetRegisterAllocator(bool) {		FunctionPass *NVPTXPassConfig::createTargetRegisterAllocator(bool) {
▲ Show 20 Lines • Show All 72 Lines • Show Last 20 Lines

test/CodeGen/NVPTX/call-with-alloca-buffer.ll

	Show All 14 Lines
	; }			; }

	; CHECK: .visible .entry kernel_func			; CHECK: .visible .entry kernel_func
	define void @kernel_func(float* %a) {			define void @kernel_func(float* %a) {
	entry:			entry:
	%buf = alloca [16 x i8], align 4			%buf = alloca [16 x i8], align 4

	; CHECK: .local .align 4 .b8 __local_depot0[16]			; CHECK: .local .align 4 .b8 __local_depot0[16]
	; CHECK: mov.u64 %rd[[BUF_REG:[0-9]+]]			; CHECK: mov.u64 %SPL
	; CHECK: cvta.local.u64 %SP, %rd[[BUF_REG]]

	; CHECK: ld.param.u64 %rd[[A_REG:[0-9]+]], [kernel_func_param_0]			; CHECK: ld.param.u64 %rd[[A_REG:[0-9]+]], [kernel_func_param_0]
	; CHECK: cvta.to.global.u64 %rd[[A1_REG:[0-9]+]], %rd[[A_REG]]			; CHECK: cvta.to.global.u64 %rd[[A1_REG:[0-9]+]], %rd[[A_REG]]
	; FIXME: casting A1_REG to A2_REG is unnecessary; A2_REG is essentially A_REG			; FIXME: casting A1_REG to A2_REG is unnecessary; A2_REG is essentially A_REG
	; CHECK: cvta.global.u64 %rd[[A2_REG:[0-9]+]], %rd[[A1_REG]]			; CHECK: cvta.global.u64 %rd[[A2_REG:[0-9]+]], %rd[[A1_REG]]
	; CHECK: cvta.local.u64 %rd[[SP_REG:[0-9]+]]			; CHECK: cvta.local.u64 %rd[[SP_REG:[0-9]+]]
	; CHECK: ld.global.f32 %f[[A0_REG:[0-9]+]], [%rd[[A1_REG]]]			; CHECK: ld.global.f32 %f[[A0_REG:[0-9]+]], [%rd[[A1_REG]]]
	; CHECK: st.local.f32 [{{%rd[0-9]+}}], %f[[A0_REG]]			; CHECK: st.local.f32 [{{%rd[0-9]+}}], %f[[A0_REG]]

	Show All 36 Lines

test/CodeGen/NVPTX/local-stack-frame.ll

	; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s --check-prefix=PTX32			; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s --check-prefix=PTX32
	; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s --check-prefix=PTX64			; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s --check-prefix=PTX64

	; Ensure we access the local stack properly			; Ensure we access the local stack properly

	; PTX32: mov.u32 %r{{[0-9]+}}, __local_depot{{[0-9]+}};			; PTX32: mov.u32 %SPL, __local_depot{{[0-9]+}};
	; PTX32: cvta.local.u32 %SP, %r{{[0-9]+}};			; PTX32: cvta.local.u32 %SP, %SPL;
	; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];			; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
	; PTX32: st.volatile.u32 [%SP+0], %r{{[0-9]+}};			; PTX32: st.volatile.u32 [%SP+0], %r{{[0-9]+}};
	; PTX64: mov.u64 %rd{{[0-9]+}}, __local_depot{{[0-9]+}};			; PTX64: mov.u64 %SPL, __local_depot{{[0-9]+}};
	; PTX64: cvta.local.u64 %SP, %rd{{[0-9]+}};			; PTX64: cvta.local.u64 %SP, %SPL;
	; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];			; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
	; PTX64: st.volatile.u32 [%SP+0], %r{{[0-9]+}};			; PTX64: st.volatile.u32 [%SP+0], %r{{[0-9]+}};
	define void @foo(i32 %a) {			define void @foo(i32 %a) {
	%local = alloca i32, align 4			%local = alloca i32, align 4
	store volatile i32 %a, i32* %local			store volatile i32 %a, i32* %local
	ret void			ret void
	}			}

				; PTX32: mov.u32 %SPL, __local_depot{{[0-9]+}};
				; PTX32: cvta.local.u32 %SP, %SPL;
				; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo2_param_0];
				; PTX32: add.u32 %r[[SP_REG:[0-9]+]], %SPL, 0;
				; PTX32: st.local.u32 [%r[[SP_REG]]], %r{{[0-9]+}};
				; PTX64: mov.u64 %SPL, __local_depot{{[0-9]+}};
				; PTX64: cvta.local.u64 %SP, %SPL;
				; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo2_param_0];
				; PTX64: add.u64 %rd[[SP_REG:[0-9]+]], %SPL, 0;
				; PTX64: st.local.u32 [%rd[[SP_REG]]], %r{{[0-9]+}};
				define void @foo2(i32 %a) {
				%local = alloca i32, align 4
				store i32 %a, i32* %local
				call void @bar(i32* %local)
				ret void
				}

				declare void @bar(i32* %a)

				!nvvm.annotations = !{!0}
				!0 = !{void (i32)* @foo2, !"kernel", i32 1}

				; PTX32: mov.u32 %SPL, __local_depot{{[0-9]+}};
				; PTX32-NOT: cvta.local.u32 %SP, %SPL;
				; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo3_param_0];
				; PTX32: add.u32 %r{{[0-9]+}}, %SPL, 0;
				; PTX32: st.local.u32 [%r{{[0-9]+}}], %r{{[0-9]+}};
				; PTX64: mov.u64 %SPL, __local_depot{{[0-9]+}};
				; PTX64-NOT: cvta.local.u64 %SP, %SPL;
				; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo3_param_0];
				; PTX64: add.u64 %rd{{[0-9]+}}, %SPL, 0;
				; PTX64: st.local.u32 [%rd{{[0-9]+}}], %r{{[0-9]+}};
				define void @foo3(i32 %a) {
				%local = alloca [3 x i32], align 4
				%1 = bitcast [3 x i32]* %local to i32*
				%2 = getelementptr inbounds i32, i32* %1, i32 %a
				store i32 %a, i32* %2
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

Add NVPTXPeephole pass to reduce unnecessary address cast
ClosedPublic
