This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Extend the SI Load/Store optimizer to combine more things.
Closed, Public

Authored by sheredom on Nov 2 2018, 11:27 AM.

Details

Reviewers

nhaehnle
arsenm

Commits

rG76504a4c5e19: [AMDGPU] Extend the SI Load/Store optimizer to combine more things.
rL348937: [AMDGPU] Extend the SI Load/Store optimizer to combine more things.

Summary

I've extended the load/store optimizer to be able to produce dwordx3 loads and stores, and also enable it to produce dwordx8 and dwordx16 sgpr loads. This change allows many more load/stores to be combined, and results in much more optimal code for our hardware.

Diff Detail

Event Timeline

sheredom created this revision. Nov 2 2018, 11:27 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 5 others. Nov 2 2018, 11:27 AM

I think these cases should mostly be handled by an IR pass that merges load/store intrinsics, and we need to fix the handling of 3-element vectors in SelectionDAG.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
106	No global static initializer

I'm concerned that x8 and x16 loads will significantly increase SGPR usage and therefore SGPR spilling. We have a shader database with over 70 games and benchmarks and I guess the results will not be good after this is committed.

There is another case that can be optimized: Loading {f32, f32, skip, f32} and {f32, skip, f32, f32}. Those can be done with x4 loads for both scalar and vector instructions. The cost is 1 more used VGPR or SGPR. Also, register allocation may reuse the unused register immediately, which will cause unnecessary s_waitcnt after the load and may hurt us.
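To make the suggested case concrete, here is a minimal sketch of the check it implies: given the dword offsets actually read, decide whether one x4 load covers them and how many lanes it would skip. The helper name `skippedLanesForX4` is hypothetical and not part of the patch, and the sketch ignores alignment and register-pressure costs.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper, not part of the patch. Given the dword offsets that are
// actually read, return the number of skipped lanes a single x4 load would
// carry, or -1 if the elements cannot be covered by one x4 load.
int skippedLanesForX4(std::vector<unsigned> DwordOffsets) {
  if (DwordOffsets.empty() || DwordOffsets.size() > 4)
    return -1;
  std::sort(DwordOffsets.begin(), DwordOffsets.end());
  // Reading the same dword twice is not a case this sketch handles.
  if (std::adjacent_find(DwordOffsets.begin(), DwordOffsets.end()) !=
      DwordOffsets.end())
    return -1;
  // Number of dwords between the first and last element, inclusive.
  const unsigned Span = DwordOffsets.back() - DwordOffsets.front() + 1;
  if (Span > 4)
    return -1;
  return static_cast<int>(Span - DwordOffsets.size());
}
```

Both patterns from the comment, offsets {0, 1, 3} and {0, 2, 3}, report one skipped lane, which corresponds to the one extra VGPR or SGPR mentioned above.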

My original intent with this pass was to handle the non-adjacent DS writes, with a goal of someday merging into x8 and x16 loads when that is known to be profitable from the better register pressure information available at this later point.

We discussed this in an internal AMD meeting on Monday 5th November 2018 and concluded that, even though I do want the scalar load combining to be brought upstream, it would be better as a separate change so that we can get broader testing across the users of our AMDGPU backend.

This change has been reduced to only allow production of dwordx3, and combining of x3 to turn it into x4.
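As an illustration of the reduced scope, the following standalone sketch mirrors the adjacency and width checks that the diff adds for buffer loads and stores (the `Width0`/`Width1` fields and the `Width <= 4` limit). The `PairInfo` struct and the function names are assumptions made for this sketch, and it omits the GLC/SLC matching and the stricter S_BUFFER_LOAD_IMM rule.

```cpp
#include <cassert>

// Hypothetical stand-in for the fields of CombineInfo that matter here.
struct PairInfo {
  unsigned EltSize;  // bytes per element (4 for a dword)
  unsigned Offset0;  // byte offset of the first access
  unsigned Offset1;  // byte offset of the second access
  unsigned Width0;   // dwords covered by the first access
  unsigned Width1;   // dwords covered by the second access
};

// Two accesses are adjacent when one ends exactly where the other begins.
bool areAdjacent(const PairInfo &P) {
  assert(P.EltSize != 0 && "element size must be non-zero");
  if (P.Offset0 % P.EltSize != 0 || P.Offset1 % P.EltSize != 0)
    return false;
  const unsigned EltOffset0 = P.Offset0 / P.EltSize;
  const unsigned EltOffset1 = P.Offset1 / P.EltSize;
  return EltOffset0 + P.Width0 == EltOffset1 ||
         EltOffset1 + P.Width1 == EltOffset0;
}

// The merged access may be at most four dwords wide (x1 through x4).
bool widthFits(const PairInfo &P) { return P.Width0 + P.Width1 <= 4; }

bool canMerge(const PairInfo &P) { return areAdjacent(P) && widthFits(P); }
```

With these rules an x2 followed by an adjacent x1 yields an x3, and an x3 followed by an adjacent x1 yields an x4, while an x4 plus anything is rejected.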

The huge switch statements are a poster child for the generic SearchableTables, somewhat analogous to what already exists for MIMGInstructions. Sketching it out:

class LoadStoreBaseOpcode {
  LoadStoreBaseOpcode BaseOpcode = !cast<LoadStoreBaseOpcode>(NAME);
  bit Srsrc;
  bit Sbase;
  ...
}

def LoadStoreBaseOpcode : GenericEnum {
  let FilterClass = "LoadStoreBaseOpcode";
}

class LoadStoreOpcode {
  Instruction Opcode;
  LoadStoreBaseOpcode BaseOpcode;
  bits<8> Width;
}

def LoadStoreOpcodeTable : GenericTable {
  let FilterClass = "LoadStoreOpcode";
  let CppTypeName = "LoadStoreOpcode";
  let Fields = ["Opcode", "BaseOpcode", "Width"];
  GenericEnum TypeOf_BaseOpcode = LoadStoreBaseOpcode;

  let PrimaryKey = ["BaseOpcode", "Width"];
  let PrimaryKeyName = "getLoadStoreOpcode";
} 

... and so on ...
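For readers unfamiliar with searchable tables, the C++ side of such a definition is roughly a sorted array plus a binary-search lookup keyed on the primary key. The sketch below is hand-written to show that shape only; the enumerators, the table rows, and the exact signature of `getLoadStoreOpcode` are assumptions, not the code TableGen would actually emit.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <tuple>

// All enumerators and table rows below are hypothetical stand-ins.
enum BaseOp : uint8_t { BufferLoadOffset, BufferStoreOffset };
enum Opcode : uint16_t { LOAD_X1, LOAD_X2, LOAD_X3, LOAD_X4, STORE_X1, STORE_X2 };

struct LoadStoreOpcode {
  BaseOp Base;
  uint8_t Width;
  Opcode Opc;
};

// Rows must stay sorted by (Base, Width) for the binary search to work.
static const LoadStoreOpcode Table[] = {
    {BufferLoadOffset, 1, LOAD_X1},   {BufferLoadOffset, 2, LOAD_X2},
    {BufferLoadOffset, 3, LOAD_X3},   {BufferLoadOffset, 4, LOAD_X4},
    {BufferStoreOffset, 1, STORE_X1}, {BufferStoreOffset, 2, STORE_X2},
};

// Returns the matching row, or nullptr if no opcode exists for that key.
const LoadStoreOpcode *getLoadStoreOpcode(BaseOp Base, uint8_t Width) {
  const auto Key = std::make_tuple(Base, Width);
  const auto *It = std::lower_bound(
      std::begin(Table), std::end(Table), Key,
      [](const LoadStoreOpcode &Row, const std::tuple<BaseOp, uint8_t> &K) {
        return std::make_tuple(Row.Base, Row.Width) < K;
      });
  if (It == std::end(Table) || It->Base != Base || It->Width != Width)
    return nullptr;
  return It;
}
```

The point of the suggestion is that a lookup like this replaces a hand-maintained switch per opcode family, so adding a new width only means adding a table row.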

Not a complete review yet, but I need to sign off.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78	Why is this needed?
82	Do those actually occur like this in practice?

sheredom added inline comments. Nov 8 2018, 9:25 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78	I use it for the cases where the intrinsic does not match any of the intrinsics that we can do optimizations on.
82	Do you mean do I see idxen loads in workloads? Yup!

nhaehnle added inline comments. Nov 8 2018, 10:18 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
82	Mergeable ones? What pattern in the high-level source does that correspond to?

Fixed review comments:

Made the switch statements autogenerated by BUFInstructions.td tablegen.
Removed idxen mappings (will try and add them back in later in their own commit).
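The base-opcode mapping in BUFInstructions.td works by name: the `MUBUFGetBaseOpcode` class in the diff substitutes `DWORDX2`, `DWORDX3` and `DWORDX4` with `DWORD` in the record name. A rough C++ equivalent of that string rewrite is sketched below purely for illustration; `replaceAll` and `getBaseOpcodeName` are hypothetical helpers, and the real mapping happens at TableGen time, not at run time.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of From in S with To (hypothetical helper).
std::string replaceAll(std::string S, const std::string &From,
                       const std::string &To) {
  if (From.empty())
    return S;
  for (std::size_t Pos = S.find(From); Pos != std::string::npos;
       Pos = S.find(From, Pos + To.size()))
    S.replace(Pos, From.size(), To);
  return S;
}

// Mirror of the nested !subst calls: X4 first, then X3, then X2.
std::string getBaseOpcodeName(const std::string &Name) {
  return replaceAll(
      replaceAll(replaceAll(Name, "DWORDX4", "DWORD"), "DWORDX3", "DWORD"),
      "DWORDX2", "DWORD");
}
```

Every width variant of an instruction therefore shares one base name, which is what lets the table be searched by base opcode and dword count.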

LGTM

This revision is now accepted and ready to land. Dec 12 2018, 2:12 AM

Closed by commit rL348937: [AMDGPU] Extend the SI Load/Store optimizer to combine more things. (authored by sheredom). Dec 12 2018, 8:18 AM

This revision was automatically updated to reflect the committed changes.

Please have a look at https://bugs.llvm.org/show_bug.cgi?id=40129

Thanks!

Revision Contents

Path	Size
lib/Target/AMDGPU/BUFInstructions.td	41 lines
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp	680 lines
lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h	18 lines
lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp	43 lines
test/CodeGen/AMDGPU/cvt_f32_ubyte.ll	6 lines
test/CodeGen/AMDGPU/early-if-convert-cost.ll	3 lines
test/CodeGen/AMDGPU/insert_vector_elt.ll	6 lines
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.ll	30 lines
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll	65 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll	114 lines
test/CodeGen/AMDGPU/merge-stores.ll	25 lines
test/CodeGen/AMDGPU/store-global.ll	3 lines
test/CodeGen/AMDGPU/store-v3i64.ll	3 lines

Diff 176566

lib/Target/AMDGPU/BUFInstructions.td

[280 lines not shown]	multiclass MTBUF_Pseudo_Stores<string opName, RegisterClass vdataClass,
}		}
}		}


//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// MUBUF classes		// MUBUF classes
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		class MUBUFGetBaseOpcode<string Op> {
		string ret = !subst("DWORDX2", "DWORD",
		!subst("DWORDX3", "DWORD",
		!subst("DWORDX4", "DWORD", Op)));
		}

class MUBUF_Pseudo <string opName, dag outs, dag ins,		class MUBUF_Pseudo <string opName, dag outs, dag ins,
string asmOps, list<dag> pattern=[]> :		string asmOps, list<dag> pattern=[]> :
InstSI<outs, ins, "", pattern>,		InstSI<outs, ins, "", pattern>,
SIMCInstr<opName, SIEncodingFamily.NONE> {		SIMCInstr<opName, SIEncodingFamily.NONE> {

let isPseudo = 1;		let isPseudo = 1;
let isCodeGenOnly = 1;		let isCodeGenOnly = 1;
let Size = 8;		let Size = 8;
let UseNamedOperandTable = 1;		let UseNamedOperandTable = 1;

string Mnemonic = opName;		string Mnemonic = opName;
string AsmOperands = asmOps;		string AsmOperands = asmOps;

		Instruction Opcode = !cast<Instruction>(NAME);
		Instruction BaseOpcode = !cast<Instruction>(MUBUFGetBaseOpcode<NAME>.ret);

let VM_CNT = 1;		let VM_CNT = 1;
let EXP_CNT = 1;		let EXP_CNT = 1;
let MUBUF = 1;		let MUBUF = 1;
let Uses = [EXEC];		let Uses = [EXEC];
let hasSideEffects = 0;		let hasSideEffects = 0;
let SchedRW = [WriteVMEM];		let SchedRW = [WriteVMEM];

let AsmMatchConverter = "cvtMubuf";		let AsmMatchConverter = "cvtMubuf";

bits<1> offen = 0;		bits<1> offen = 0;
bits<1> idxen = 0;		bits<1> idxen = 0;
bits<1> addr64 = 0;		bits<1> addr64 = 0;
bits<1> lds = 0;		bits<1> lds = 0;
bits<1> has_vdata = 1;		bits<1> has_vdata = 1;
bits<1> has_vaddr = 1;		bits<1> has_vaddr = 1;
bits<1> has_glc = 1;		bits<1> has_glc = 1;
bits<1> glc_value = 0; // the value for glc if no such operand		bits<1> glc_value = 0; // the value for glc if no such operand
bits<1> has_srsrc = 1;		bits<1> has_srsrc = 1;
bits<1> has_soffset = 1;		bits<1> has_soffset = 1;
bits<1> has_offset = 1;		bits<1> has_offset = 1;
bits<1> has_slc = 1;		bits<1> has_slc = 1;
bits<1> has_tfe = 1;		bits<1> has_tfe = 1;
		bits<4> dwords = 0;
}		}

class MUBUF_Real <bits<7> op, MUBUF_Pseudo ps> :		class MUBUF_Real <bits<7> op, MUBUF_Pseudo ps> :
InstSI <ps.OutOperandList, ps.InOperandList, ps.Mnemonic # ps.AsmOperands, []> {		InstSI <ps.OutOperandList, ps.InOperandList, ps.Mnemonic # ps.AsmOperands, []> {

let isPseudo = 0;		let isPseudo = 0;
let isCodeGenOnly = 0;		let isCodeGenOnly = 0;

[57 lines not shown]	(ins vdataClass:$vdata, vaddrClass:$vaddr, SReg_128:$srsrc,
SCSrc_b32:$soffset, offset:$offset, GLC:$glc, SLC:$slc)		SCSrc_b32:$soffset, offset:$offset, GLC:$glc, SLC:$slc)
);		);
dag ret = !con(		dag ret = !con(
!if(!empty(vdataList), InsNoData, InsData),		!if(!empty(vdataList), InsNoData, InsData),
!if(isLds, (ins), (ins TFE:$tfe))		!if(isLds, (ins), (ins TFE:$tfe))
);		);
}		}

		class getMUBUFDwords<RegisterClass regClass> {
		string regClassAsInt = !cast<string>(regClass);
		int ret =
		!if(!eq(regClassAsInt, !cast<string>(VGPR_32)), 1,
		!if(!eq(regClassAsInt, !cast<string>(VReg_64)), 2,
		!if(!eq(regClassAsInt, !cast<string>(VReg_96)), 3,
		!if(!eq(regClassAsInt, !cast<string>(VReg_128)), 4,
		0))));
		}

class getMUBUFIns<int addrKind, list<RegisterClass> vdataList=[], bit isLds = 0> {		class getMUBUFIns<int addrKind, list<RegisterClass> vdataList=[], bit isLds = 0> {
dag ret =		dag ret =
!if(!eq(addrKind, BUFAddrKind.Offset), getMUBUFInsDA<vdataList, [], isLds>.ret,		!if(!eq(addrKind, BUFAddrKind.Offset), getMUBUFInsDA<vdataList, [], isLds>.ret,
!if(!eq(addrKind, BUFAddrKind.OffEn), getMUBUFInsDA<vdataList, [VGPR_32], isLds>.ret,		!if(!eq(addrKind, BUFAddrKind.OffEn), getMUBUFInsDA<vdataList, [VGPR_32], isLds>.ret,
!if(!eq(addrKind, BUFAddrKind.IdxEn), getMUBUFInsDA<vdataList, [VGPR_32], isLds>.ret,		!if(!eq(addrKind, BUFAddrKind.IdxEn), getMUBUFInsDA<vdataList, [VGPR_32], isLds>.ret,
!if(!eq(addrKind, BUFAddrKind.BothEn), getMUBUFInsDA<vdataList, [VReg_64], isLds>.ret,		!if(!eq(addrKind, BUFAddrKind.BothEn), getMUBUFInsDA<vdataList, [VReg_64], isLds>.ret,
!if(!eq(addrKind, BUFAddrKind.Addr64), getMUBUFInsDA<vdataList, [VReg_64], isLds>.ret,		!if(!eq(addrKind, BUFAddrKind.Addr64), getMUBUFInsDA<vdataList, [VReg_64], isLds>.ret,
(ins))))));		(ins))))));
[44 lines not shown]	class MUBUF_Load_Pseudo <string opName,

let Constraints = !if(HasTiedDest, "$vdata = $vdata_in", "");		let Constraints = !if(HasTiedDest, "$vdata = $vdata_in", "");
let mayLoad = 1;		let mayLoad = 1;
let mayStore = 0;		let mayStore = 0;
let maybeAtomic = 1;		let maybeAtomic = 1;
let Uses = !if(isLds, [EXEC, M0], [EXEC]);		let Uses = !if(isLds, [EXEC, M0], [EXEC]);
let has_tfe = !if(isLds, 0, 1);		let has_tfe = !if(isLds, 0, 1);
let lds = isLds;		let lds = isLds;
		let dwords = getMUBUFDwords<vdataClass>.ret;
}		}

// FIXME: tfe can't be an operand because it requires a separate		// FIXME: tfe can't be an operand because it requires a separate
// opcode because it needs an N+1 register class dest register.		// opcode because it needs an N+1 register class dest register.
multiclass MUBUF_Pseudo_Loads<string opName, RegisterClass vdataClass,		multiclass MUBUF_Pseudo_Loads<string opName, RegisterClass vdataClass,
ValueType load_vt = i32,		ValueType load_vt = i32,
SDPatternOperator ld = null_frag,		SDPatternOperator ld = null_frag,
bit TiedDest = 0,		bit TiedDest = 0,
[47 lines not shown]	: MUBUF_Pseudo<opName,
getMUBUFIns<addrKindCopy, [vdataClassCopy]>.ret,		getMUBUFIns<addrKindCopy, [vdataClassCopy]>.ret,
" $vdata, " # getMUBUFAsmOps<addrKindCopy>.ret # "$glc$slc$tfe",		" $vdata, " # getMUBUFAsmOps<addrKindCopy>.ret # "$glc$slc$tfe",
pattern>,		pattern>,
MUBUF_SetupAddr<addrKindCopy> {		MUBUF_SetupAddr<addrKindCopy> {
let PseudoInstr = opName # "_" # getAddrName<addrKindCopy>.ret;		let PseudoInstr = opName # "_" # getAddrName<addrKindCopy>.ret;
let mayLoad = 0;		let mayLoad = 0;
let mayStore = 1;		let mayStore = 1;
let maybeAtomic = 1;		let maybeAtomic = 1;
		let dwords = getMUBUFDwords<vdataClass>.ret;
}		}

multiclass MUBUF_Pseudo_Stores<string opName, RegisterClass vdataClass,		multiclass MUBUF_Pseudo_Stores<string opName, RegisterClass vdataClass,
ValueType store_vt = i32,		ValueType store_vt = i32,
SDPatternOperator st = null_frag> {		SDPatternOperator st = null_frag> {

def _OFFSET : MUBUF_Store_Pseudo <opName, BUFAddrKind.Offset, vdataClass,		def _OFFSET : MUBUF_Store_Pseudo <opName, BUFAddrKind.Offset, vdataClass,
[(st store_vt:$vdata, (MUBUFOffset v4i32:$srsrc, i32:$soffset,		[(st store_vt:$vdata, (MUBUFOffset v4i32:$srsrc, i32:$soffset,
[1,583 lines not shown]	let SubtargetPredicate = HasPackedD16VMem in {
defm TBUFFER_LOAD_FORMAT_D16_XY : MTBUF_Real_AllAddr_vi <0x09>;		defm TBUFFER_LOAD_FORMAT_D16_XY : MTBUF_Real_AllAddr_vi <0x09>;
defm TBUFFER_LOAD_FORMAT_D16_XYZ : MTBUF_Real_AllAddr_vi <0x0a>;		defm TBUFFER_LOAD_FORMAT_D16_XYZ : MTBUF_Real_AllAddr_vi <0x0a>;
defm TBUFFER_LOAD_FORMAT_D16_XYZW : MTBUF_Real_AllAddr_vi <0x0b>;		defm TBUFFER_LOAD_FORMAT_D16_XYZW : MTBUF_Real_AllAddr_vi <0x0b>;
defm TBUFFER_STORE_FORMAT_D16_X : MTBUF_Real_AllAddr_vi <0x0c>;		defm TBUFFER_STORE_FORMAT_D16_X : MTBUF_Real_AllAddr_vi <0x0c>;
defm TBUFFER_STORE_FORMAT_D16_XY : MTBUF_Real_AllAddr_vi <0x0d>;		defm TBUFFER_STORE_FORMAT_D16_XY : MTBUF_Real_AllAddr_vi <0x0d>;
defm TBUFFER_STORE_FORMAT_D16_XYZ : MTBUF_Real_AllAddr_vi <0x0e>;		defm TBUFFER_STORE_FORMAT_D16_XYZ : MTBUF_Real_AllAddr_vi <0x0e>;
defm TBUFFER_STORE_FORMAT_D16_XYZW : MTBUF_Real_AllAddr_vi <0x0f>;		defm TBUFFER_STORE_FORMAT_D16_XYZW : MTBUF_Real_AllAddr_vi <0x0f>;
} // End HasUnpackedD16VMem.		} // End HasUnpackedD16VMem.

		def MUBUFInfoTable : GenericTable {
		let FilterClass = "MUBUF_Pseudo";
		let CppTypeName = "MUBUFInfo";
		let Fields = ["Opcode", "BaseOpcode", "dwords", "has_vaddr", "has_srsrc", "has_soffset"];

		let PrimaryKey = ["Opcode"];
		let PrimaryKeyName = "getMUBUFOpcodeHelper";
		}

		def getMUBUFInfoFromOpcode : SearchIndex {
		let Table = MUBUFInfoTable;
		let Key = ["Opcode"];
		}

		def getMUBUFInfoFromBaseOpcodeAndDwords : SearchIndex {
		let Table = MUBUFInfoTable;
		let Key = ["BaseOpcode", "dwords"];
		}

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[37 lines not shown]
// cluster of loads have offsets that are too large to fit in the 8-bit		// cluster of loads have offsets that are too large to fit in the 8-bit
// offsets, but are close enough to fit in the 8 bits, we can add to the base		// offsets, but are close enough to fit in the 8 bits, we can add to the base
// pointer and use the new reduced offsets.		// pointer and use the new reduced offsets.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
		#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "Utils/AMDGPUBaseInfo.h"		#include "Utils/AMDGPUBaseInfo.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
[12 lines not shown]
#include <iterator>		#include <iterator>
#include <utility>		#include <utility>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "si-load-store-opt"		#define DEBUG_TYPE "si-load-store-opt"

namespace {		namespace {

class SILoadStoreOptimizer : public MachineFunctionPass {
enum InstClassEnum {		enum InstClassEnum {
DS_READ_WRITE,		UNKNOWN,
		Inline comment (nhaehnle): Why is this needed?
		Inline comment (sheredom): I use it for the cases where the intrinsic does not match any of the intrinsics that we can do optimizations on.
		DS_READ,
		DS_WRITE,
S_BUFFER_LOAD_IMM,		S_BUFFER_LOAD_IMM,
BUFFER_LOAD_OFFEN,		BUFFER_LOAD_OFFEN = AMDGPU::BUFFER_LOAD_DWORD_OFFEN,
		Inline comment (nhaehnle): Do those actually occur like this in practice?
		Inline comment (sheredom): Do you mean do I see idxen loads in workloads? Yup!
		Inline comment (nhaehnle): Mergeable ones? What pattern in the high-level source does that correspond to?
BUFFER_LOAD_OFFSET,		BUFFER_LOAD_OFFSET = AMDGPU::BUFFER_LOAD_DWORD_OFFSET,
BUFFER_STORE_OFFEN,		BUFFER_STORE_OFFEN = AMDGPU::BUFFER_STORE_DWORD_OFFEN,
BUFFER_STORE_OFFSET,		BUFFER_STORE_OFFSET = AMDGPU::BUFFER_STORE_DWORD_OFFSET,
		BUFFER_LOAD_OFFEN_exact = AMDGPU::BUFFER_LOAD_DWORD_OFFEN_exact,
		BUFFER_LOAD_OFFSET_exact = AMDGPU::BUFFER_LOAD_DWORD_OFFSET_exact,
		BUFFER_STORE_OFFEN_exact = AMDGPU::BUFFER_STORE_DWORD_OFFEN_exact,
		BUFFER_STORE_OFFSET_exact = AMDGPU::BUFFER_STORE_DWORD_OFFSET_exact,
};		};

		enum RegisterEnum {
		SBASE = 0x1,
		SRSRC = 0x2,
		SOFFSET = 0x4,
		VADDR = 0x8,
		ADDR = 0x10,
		};

		class SILoadStoreOptimizer : public MachineFunctionPass {
struct CombineInfo {		struct CombineInfo {
MachineBasicBlock::iterator I;		MachineBasicBlock::iterator I;
MachineBasicBlock::iterator Paired;		MachineBasicBlock::iterator Paired;
unsigned EltSize;		unsigned EltSize;
unsigned Offset0;		unsigned Offset0;
unsigned Offset1;		unsigned Offset1;
		Inline comment (arsenm): No global static initializer
		unsigned Width0;
		unsigned Width1;
unsigned BaseOff;		unsigned BaseOff;
InstClassEnum InstClass;		InstClassEnum InstClass;
bool GLC0;		bool GLC0;
bool GLC1;		bool GLC1;
bool SLC0;		bool SLC0;
bool SLC1;		bool SLC1;
bool UseST64;		bool UseST64;
bool IsX2;
SmallVector<MachineInstr*, 8> InstsToMove;		SmallVector<MachineInstr *, 8> InstsToMove;
};		};

private:		private:
const GCNSubtarget *STM = nullptr;		const GCNSubtarget *STM = nullptr;
const SIInstrInfo *TII = nullptr;		const SIInstrInfo *TII = nullptr;
const SIRegisterInfo *TRI = nullptr;		const SIRegisterInfo *TRI = nullptr;
MachineRegisterInfo *MRI = nullptr;		MachineRegisterInfo *MRI = nullptr;
AliasAnalysis *AA = nullptr;		AliasAnalysis *AA = nullptr;
unsigned CreatedX2;		bool OptimizeAgain;

static bool offsetsCanBeCombined(CombineInfo &CI);		static bool offsetsCanBeCombined(CombineInfo &CI);
		static bool widthsFit(const CombineInfo &CI);
		static unsigned getNewOpcode(const CombineInfo &CI);
		static std::pair<unsigned, unsigned> getSubRegIdxs(const CombineInfo &CI);
		const TargetRegisterClass *getTargetRegisterClass(const CombineInfo &CI);
		unsigned getOpcodeWidth(const MachineInstr &MI);
		InstClassEnum getInstClass(unsigned Opc);
		unsigned getRegs(unsigned Opc);

bool findMatchingInst(CombineInfo &CI);		bool findMatchingInst(CombineInfo &CI);

unsigned read2Opcode(unsigned EltSize) const;		unsigned read2Opcode(unsigned EltSize) const;
unsigned read2ST64Opcode(unsigned EltSize) const;		unsigned read2ST64Opcode(unsigned EltSize) const;
MachineBasicBlock::iterator mergeRead2Pair(CombineInfo &CI);		MachineBasicBlock::iterator mergeRead2Pair(CombineInfo &CI);

unsigned write2Opcode(unsigned EltSize) const;		unsigned write2Opcode(unsigned EltSize) const;
unsigned write2ST64Opcode(unsigned EltSize) const;		unsigned write2ST64Opcode(unsigned EltSize) const;
MachineBasicBlock::iterator mergeWrite2Pair(CombineInfo &CI);		MachineBasicBlock::iterator mergeWrite2Pair(CombineInfo &CI);
MachineBasicBlock::iterator mergeSBufferLoadImmPair(CombineInfo &CI);		MachineBasicBlock::iterator mergeSBufferLoadImmPair(CombineInfo &CI);
MachineBasicBlock::iterator mergeBufferLoadPair(CombineInfo &CI);		MachineBasicBlock::iterator mergeBufferLoadPair(CombineInfo &CI);
unsigned promoteBufferStoreOpcode(const MachineInstr &I, bool &IsX2,
bool &IsOffen) const;
MachineBasicBlock::iterator mergeBufferStorePair(CombineInfo &CI);		MachineBasicBlock::iterator mergeBufferStorePair(CombineInfo &CI);

public:		public:
static char ID;		static char ID;

SILoadStoreOptimizer() : MachineFunctionPass(ID) {		SILoadStoreOptimizer() : MachineFunctionPass(ID) {
initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());		initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());
}		}
[12 lines not shown]	public:
}		}
};		};

} // end anonymous namespace.		} // end anonymous namespace.

INITIALIZE_PASS_BEGIN(SILoadStoreOptimizer, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(SILoadStoreOptimizer, DEBUG_TYPE,
"SI Load Store Optimizer", false, false)		"SI Load Store Optimizer", false, false)
INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)		INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
INITIALIZE_PASS_END(SILoadStoreOptimizer, DEBUG_TYPE,		INITIALIZE_PASS_END(SILoadStoreOptimizer, DEBUG_TYPE, "SI Load Store Optimizer",
"SI Load Store Optimizer", false, false)		false, false)

char SILoadStoreOptimizer::ID = 0;		char SILoadStoreOptimizer::ID = 0;

char &llvm::SILoadStoreOptimizerID = SILoadStoreOptimizer::ID;		char &llvm::SILoadStoreOptimizerID = SILoadStoreOptimizer::ID;

FunctionPass *llvm::createSILoadStoreOptimizerPass() {		FunctionPass *llvm::createSILoadStoreOptimizerPass() {
return new SILoadStoreOptimizer();		return new SILoadStoreOptimizer();
}		}

static void moveInstsAfter(MachineBasicBlock::iterator I,		static void moveInstsAfter(MachineBasicBlock::iterator I,
ArrayRef<MachineInstr*> InstsToMove) {		ArrayRef<MachineInstr *> InstsToMove) {
MachineBasicBlock *MBB = I->getParent();		MachineBasicBlock *MBB = I->getParent();
++I;		++I;
for (MachineInstr *MI : InstsToMove) {		for (MachineInstr *MI : InstsToMove) {
MI->removeFromParent();		MI->removeFromParent();
MBB->insert(I, MI);		MBB->insert(I, MI);
}		}
}		}

[9 lines not shown]	if (Op.isReg()) {
PhysRegUses.insert(Op.getReg());		PhysRegUses.insert(Op.getReg());
}		}
}		}
}		}

static bool memAccessesCanBeReordered(MachineBasicBlock::iterator A,		static bool memAccessesCanBeReordered(MachineBasicBlock::iterator A,
MachineBasicBlock::iterator B,		MachineBasicBlock::iterator B,
const SIInstrInfo *TII,		const SIInstrInfo *TII,
AliasAnalysis * AA) {		AliasAnalysis *AA) {
// RAW or WAR - cannot reorder		// RAW or WAR - cannot reorder
// WAW - cannot reorder		// WAW - cannot reorder
// RAR - safe to reorder		// RAR - safe to reorder
return !(A->mayStore() || B->mayStore()) ||		return !(A->mayStore() || B->mayStore()) ||
TII->areMemAccessesTriviallyDisjoint(A, B, AA);		TII->areMemAccessesTriviallyDisjoint(A, B, AA);
}		}

// Add MI and its defs to the lists if MI reads one of the defs that are		// Add MI and its defs to the lists if MI reads one of the defs that are
// already in the list. Returns true in that case.		// already in the list. Returns true in that case.
static bool		static bool addToListsIfDependent(MachineInstr &MI, DenseSet<unsigned> &RegDefs,
addToListsIfDependent(MachineInstr &MI,
DenseSet<unsigned> &RegDefs,
DenseSet<unsigned> &PhysRegUses,		DenseSet<unsigned> &PhysRegUses,
SmallVectorImpl<MachineInstr*> &Insts) {		SmallVectorImpl<MachineInstr *> &Insts) {
for (MachineOperand &Use : MI.operands()) {		for (MachineOperand &Use : MI.operands()) {
// If one of the defs is read, then there is a use of Def between I and the		// If one of the defs is read, then there is a use of Def between I and the
// instruction that I will potentially be merged with. We will need to move		// instruction that I will potentially be merged with. We will need to move
// this instruction after the merged instructions.		// this instruction after the merged instructions.
//		//
// Similarly, if there is a def which is read by an instruction that is to		// Similarly, if there is a def which is read by an instruction that is to
// be moved for merging, then we need to move the def-instruction as well.		// be moved for merging, then we need to move the def-instruction as well.
// This can only happen for physical registers such as M0; virtual		// This can only happen for physical registers such as M0; virtual
// registers are in SSA form.		// registers are in SSA form.
if (Use.isReg() &&		if (Use.isReg() &&
((Use.readsReg() && RegDefs.count(Use.getReg())) ||		((Use.readsReg() && RegDefs.count(Use.getReg())) ||
(Use.isDef() && TargetRegisterInfo::isPhysicalRegister(Use.getReg()) &&		(Use.isDef() && TargetRegisterInfo::isPhysicalRegister(Use.getReg()) &&
PhysRegUses.count(Use.getReg())))) {		PhysRegUses.count(Use.getReg())))) {
Insts.push_back(&MI);		Insts.push_back(&MI);
addDefsUsesToList(MI, RegDefs, PhysRegUses);		addDefsUsesToList(MI, RegDefs, PhysRegUses);
return true;		return true;
}		}
}		}

return false;		return false;
}		}

static bool		static bool canMoveInstsAcrossMemOp(MachineInstr &MemOp,
canMoveInstsAcrossMemOp(MachineInstr &MemOp,
ArrayRef<MachineInstr*> InstsToMove,		ArrayRef<MachineInstr *> InstsToMove,
const SIInstrInfo *TII,		const SIInstrInfo *TII, AliasAnalysis *AA) {
AliasAnalysis *AA) {
assert(MemOp.mayLoadOrStore());		assert(MemOp.mayLoadOrStore());

for (MachineInstr *InstToMove : InstsToMove) {		for (MachineInstr *InstToMove : InstsToMove) {
if (!InstToMove->mayLoadOrStore())		if (!InstToMove->mayLoadOrStore())
continue;		continue;
if (!memAccessesCanBeReordered(MemOp, *InstToMove, TII, AA))		if (!memAccessesCanBeReordered(MemOp, *InstToMove, TII, AA))
return false;		return false;
}		}
return true;		return true;
}		}

bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI) {		bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI) {
// XXX - Would the same offset be OK? Is there any reason this would happen or		// XXX - Would the same offset be OK? Is there any reason this would happen or
// be useful?		// be useful?
if (CI.Offset0 == CI.Offset1)		if (CI.Offset0 == CI.Offset1)
return false;		return false;

// This won't be valid if the offset isn't aligned.		// This won't be valid if the offset isn't aligned.
if ((CI.Offset0 % CI.EltSize != 0) || (CI.Offset1 % CI.EltSize != 0))		if ((CI.Offset0 % CI.EltSize != 0) || (CI.Offset1 % CI.EltSize != 0))
return false;		return false;

unsigned EltOffset0 = CI.Offset0 / CI.EltSize;		unsigned EltOffset0 = CI.Offset0 / CI.EltSize;
unsigned EltOffset1 = CI.Offset1 / CI.EltSize;		unsigned EltOffset1 = CI.Offset1 / CI.EltSize;
CI.UseST64 = false;		CI.UseST64 = false;
CI.BaseOff = 0;		CI.BaseOff = 0;

// Handle SMEM and VMEM instructions.		// Handle SMEM and VMEM instructions.
if (CI.InstClass != DS_READ_WRITE) {		if ((CI.InstClass != DS_READ) && (CI.InstClass != DS_WRITE)) {
unsigned Diff = CI.IsX2 ? 2 : 1;		return (EltOffset0 + CI.Width0 == EltOffset1 ||
return (EltOffset0 + Diff == EltOffset1 ||		EltOffset1 + CI.Width1 == EltOffset0) &&
EltOffset1 + Diff == EltOffset0) &&
CI.GLC0 == CI.GLC1 &&		CI.GLC0 == CI.GLC1 &&
(CI.InstClass == S_BUFFER_LOAD_IMM || CI.SLC0 == CI.SLC1);		(CI.InstClass == S_BUFFER_LOAD_IMM || CI.SLC0 == CI.SLC1);
}		}

// If the offset in elements doesn't fit in 8-bits, we might be able to use		// If the offset in elements doesn't fit in 8-bits, we might be able to use
// the stride 64 versions.		// the stride 64 versions.
if ((EltOffset0 % 64 == 0) && (EltOffset1 % 64) == 0 &&		if ((EltOffset0 % 64 == 0) && (EltOffset1 % 64) == 0 &&
isUInt<8>(EltOffset0 / 64) && isUInt<8>(EltOffset1 / 64)) {		isUInt<8>(EltOffset0 / 64) && isUInt<8>(EltOffset1 / 64)) {
[25 lines not shown]	if (isUInt<8>(OffsetDiff)) {
CI.Offset0 = EltOffset0 - CI.BaseOff / CI.EltSize;		CI.Offset0 = EltOffset0 - CI.BaseOff / CI.EltSize;
CI.Offset1 = EltOffset1 - CI.BaseOff / CI.EltSize;		CI.Offset1 = EltOffset1 - CI.BaseOff / CI.EltSize;
return true;		return true;
}		}

return false;		return false;
}		}

		bool SILoadStoreOptimizer::widthsFit(const CombineInfo &CI) {
		const unsigned Width = (CI.Width0 + CI.Width1);
		switch (CI.InstClass) {
		default:
		return Width <= 4;
		case S_BUFFER_LOAD_IMM:
		switch (Width) {
		default:
		return false;
		case 2:
		case 4:
		return true;
		}
		}
		}

		unsigned SILoadStoreOptimizer::getOpcodeWidth(const MachineInstr &MI) {
		const unsigned Opc = MI.getOpcode();

		if (TII->isMUBUF(MI)) {
		return AMDGPU::getMUBUFDwords(Opc);
		}

		switch (Opc) {
		default:
		return 0;
		case AMDGPU::S_BUFFER_LOAD_DWORD_IMM:
		return 1;
		case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
		return 2;
		case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
		return 4;
		}
		}

		InstClassEnum SILoadStoreOptimizer::getInstClass(unsigned Opc) {
		if (TII->isMUBUF(Opc)) {
		const int baseOpcode = AMDGPU::getMUBUFBaseOpcode(Opc);

		// If we couldn't identify the opcode, bail out.
		if (baseOpcode == -1) {
		return UNKNOWN;
		}

		switch (baseOpcode) {
		default:
		return UNKNOWN;
		case AMDGPU::BUFFER_LOAD_DWORD_OFFEN:
		return BUFFER_LOAD_OFFEN;
		case AMDGPU::BUFFER_LOAD_DWORD_OFFSET:
		return BUFFER_LOAD_OFFSET;
		case AMDGPU::BUFFER_STORE_DWORD_OFFEN:
		return BUFFER_STORE_OFFEN;
		case AMDGPU::BUFFER_STORE_DWORD_OFFSET:
		return BUFFER_STORE_OFFSET;
		case AMDGPU::BUFFER_LOAD_DWORD_OFFEN_exact:
		return BUFFER_LOAD_OFFEN_exact;
		case AMDGPU::BUFFER_LOAD_DWORD_OFFSET_exact:
		return BUFFER_LOAD_OFFSET_exact;
		case AMDGPU::BUFFER_STORE_DWORD_OFFEN_exact:
		return BUFFER_STORE_OFFEN_exact;
		case AMDGPU::BUFFER_STORE_DWORD_OFFSET_exact:
		return BUFFER_STORE_OFFSET_exact;
		}
		}

		switch (Opc) {
		default:
		return UNKNOWN;
		case AMDGPU::S_BUFFER_LOAD_DWORD_IMM:
		case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
		case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
		return S_BUFFER_LOAD_IMM;
		case AMDGPU::DS_READ_B32:
		case AMDGPU::DS_READ_B64:
		case AMDGPU::DS_READ_B32_gfx9:
		case AMDGPU::DS_READ_B64_gfx9:
		return DS_READ;
		case AMDGPU::DS_WRITE_B32:
		case AMDGPU::DS_WRITE_B64:
		case AMDGPU::DS_WRITE_B32_gfx9:
		case AMDGPU::DS_WRITE_B64_gfx9:
		return DS_WRITE;
		}
		}

		unsigned SILoadStoreOptimizer::getRegs(unsigned Opc) {
		if (TII->isMUBUF(Opc)) {
		unsigned result = 0;

		if (AMDGPU::getMUBUFHasVAddr(Opc)) {
		result |= VADDR;
		}

		if (AMDGPU::getMUBUFHasSrsrc(Opc)) {
		result |= SRSRC;
		}

		if (AMDGPU::getMUBUFHasSoffset(Opc)) {
		result |= SOFFSET;
		}

		return result;
		}

		switch (Opc) {
		default:
		return 0;
		case AMDGPU::S_BUFFER_LOAD_DWORD_IMM:
		case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
		case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
		return SBASE;
		case AMDGPU::DS_READ_B32:
		case AMDGPU::DS_READ_B64:
		case AMDGPU::DS_READ_B32_gfx9:
		case AMDGPU::DS_READ_B64_gfx9:
		case AMDGPU::DS_WRITE_B32:
		case AMDGPU::DS_WRITE_B64:
		case AMDGPU::DS_WRITE_B32_gfx9:
		case AMDGPU::DS_WRITE_B64_gfx9:
		return ADDR;
		}
		}

bool SILoadStoreOptimizer::findMatchingInst(CombineInfo &CI) {		bool SILoadStoreOptimizer::findMatchingInst(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
MachineBasicBlock::iterator E = MBB->end();		MachineBasicBlock::iterator E = MBB->end();
MachineBasicBlock::iterator MBBI = CI.I;		MachineBasicBlock::iterator MBBI = CI.I;

unsigned AddrOpName[3] = {0};		const unsigned Opc = CI.I->getOpcode();
int AddrIdx[3];		const InstClassEnum InstClass = getInstClass(Opc);
const MachineOperand *AddrReg[3];
		if (InstClass == UNKNOWN) {
		return false;
		}

		const unsigned Regs = getRegs(Opc);

		unsigned AddrOpName[5] = {0};
		int AddrIdx[5];
		const MachineOperand *AddrReg[5];
unsigned NumAddresses = 0;		unsigned NumAddresses = 0;

switch (CI.InstClass) {		if (Regs & ADDR) {
case DS_READ_WRITE:
AddrOpName[NumAddresses++] = AMDGPU::OpName::addr;		AddrOpName[NumAddresses++] = AMDGPU::OpName::addr;
break;		}
case S_BUFFER_LOAD_IMM:
		if (Regs & SBASE) {
AddrOpName[NumAddresses++] = AMDGPU::OpName::sbase;		AddrOpName[NumAddresses++] = AMDGPU::OpName::sbase;
break;		}
case BUFFER_LOAD_OFFEN:
case BUFFER_STORE_OFFEN:		if (Regs & SRSRC) {
AddrOpName[NumAddresses++] = AMDGPU::OpName::srsrc;
AddrOpName[NumAddresses++] = AMDGPU::OpName::vaddr;
AddrOpName[NumAddresses++] = AMDGPU::OpName::soffset;
break;
case BUFFER_LOAD_OFFSET:
case BUFFER_STORE_OFFSET:
AddrOpName[NumAddresses++] = AMDGPU::OpName::srsrc;		AddrOpName[NumAddresses++] = AMDGPU::OpName::srsrc;
		}

		if (Regs & SOFFSET) {
AddrOpName[NumAddresses++] = AMDGPU::OpName::soffset;		AddrOpName[NumAddresses++] = AMDGPU::OpName::soffset;
break;		}

		if (Regs & VADDR) {
		AddrOpName[NumAddresses++] = AMDGPU::OpName::vaddr;
}		}

for (unsigned i = 0; i < NumAddresses; i++) {		for (unsigned i = 0; i < NumAddresses; i++) {
AddrIdx[i] = AMDGPU::getNamedOperandIdx(CI.I->getOpcode(), AddrOpName[i]);		AddrIdx[i] = AMDGPU::getNamedOperandIdx(CI.I->getOpcode(), AddrOpName[i]);
AddrReg[i] = &CI.I->getOperand(AddrIdx[i]);		AddrReg[i] = &CI.I->getOperand(AddrIdx[i]);

// We only ever merge operations with the same base address register, so don't		// We only ever merge operations with the same base address register, so
// bother scanning forward if there are no other uses.		// don't bother scanning forward if there are no other uses.
if (AddrReg[i]->isReg() &&		if (AddrReg[i]->isReg() &&
(TargetRegisterInfo::isPhysicalRegister(AddrReg[i]->getReg()) ||		(TargetRegisterInfo::isPhysicalRegister(AddrReg[i]->getReg()) ||
MRI->hasOneNonDBGUse(AddrReg[i]->getReg())))		MRI->hasOneNonDBGUse(AddrReg[i]->getReg())))
return false;		return false;
}		}

++MBBI;		++MBBI;

DenseSet<unsigned> RegDefsToMove;		DenseSet<unsigned> RegDefsToMove;
DenseSet<unsigned> PhysRegUsesToMove;		DenseSet<unsigned> PhysRegUsesToMove;
addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);		addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);

for ( ; MBBI != E; ++MBBI) {		for (; MBBI != E; ++MBBI) {
if (MBBI->getOpcode() != CI.I->getOpcode()) {		const bool IsDS = (InstClass == DS_READ) || (InstClass == DS_WRITE);

		if ((getInstClass(MBBI->getOpcode()) != InstClass) ||
		(IsDS && (MBBI->getOpcode() != Opc))) {
// This is not a matching DS instruction, but we can keep looking as		// This is not a matching DS instruction, but we can keep looking as
// long as one of these conditions are met:		// long as one of these conditions are met:
// 1. It is safe to move I down past MBBI.		// 1. It is safe to move I down past MBBI.
// 2. It is safe to move MBBI down past the instruction that I will		// 2. It is safe to move MBBI down past the instruction that I will
// be merged into.		// be merged into.

if (MBBI->hasUnmodeledSideEffects()) {		if (MBBI->hasUnmodeledSideEffects()) {
// We can't re-order this instruction with respect to other memory		// We can't re-order this instruction with respect to other memory
// operations, so we fail both conditions mentioned above.		// operations, so we fail both conditions mentioned above.
return false;		return false;
}		}

if (MBBI->mayLoadOrStore() &&		if (MBBI->mayLoadOrStore() &&
(!memAccessesCanBeReordered(CI.I, MBBI, TII, AA) ||		(!memAccessesCanBeReordered(CI.I, MBBI, TII, AA) ||
!canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))) {		!canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))) {
// We fail condition #1, but we may still be able to satisfy condition		// We fail condition #1, but we may still be able to satisfy condition
// #2. Add this instruction to the move list and then we will check		// #2. Add this instruction to the move list and then we will check
// if condition #2 holds once we have selected the matching instruction.		// if condition #2 holds once we have selected the matching instruction.
CI.InstsToMove.push_back(&*MBBI);		CI.InstsToMove.push_back(&*MBBI);
addDefsUsesToList(*MBBI, RegDefsToMove, PhysRegUsesToMove);		addDefsUsesToList(*MBBI, RegDefsToMove, PhysRegUsesToMove);
continue;		continue;
}		}

[... 27 lines not shown ...]	for (unsigned i = 0; i < NumAddresses; i++) {
if (AddrReg[i]->isImm() != AddrRegNext.isImm() ||		if (AddrReg[i]->isImm() != AddrRegNext.isImm() ||
AddrReg[i]->getImm() != AddrRegNext.getImm()) {		AddrReg[i]->getImm() != AddrRegNext.getImm()) {
Match = false;		Match = false;
break;		break;
}		}
continue;		continue;
}		}

// Check same base pointer. Be careful of subregisters, which can occur with		// Check same base pointer. Be careful of subregisters, which can occur
// vectors of pointers.		// with vectors of pointers.
if (AddrReg[i]->getReg() != AddrRegNext.getReg() ||		if (AddrReg[i]->getReg() != AddrRegNext.getReg() ||
AddrReg[i]->getSubReg() != AddrRegNext.getSubReg()) {		AddrReg[i]->getSubReg() != AddrRegNext.getSubReg()) {
Match = false;		Match = false;
break;		break;
}		}
}		}

if (Match) {		if (Match) {
int OffsetIdx = AMDGPU::getNamedOperandIdx(CI.I->getOpcode(),		int OffsetIdx =
AMDGPU::OpName::offset);		AMDGPU::getNamedOperandIdx(CI.I->getOpcode(), AMDGPU::OpName::offset);
CI.Offset0 = CI.I->getOperand(OffsetIdx).getImm();		CI.Offset0 = CI.I->getOperand(OffsetIdx).getImm();
		CI.Width0 = getOpcodeWidth(*CI.I);
CI.Offset1 = MBBI->getOperand(OffsetIdx).getImm();		CI.Offset1 = MBBI->getOperand(OffsetIdx).getImm();
		CI.Width1 = getOpcodeWidth(*MBBI);
CI.Paired = MBBI;		CI.Paired = MBBI;

if (CI.InstClass == DS_READ_WRITE) {		if ((CI.InstClass == DS_READ) || (CI.InstClass == DS_WRITE)) {
CI.Offset0 &= 0xffff;		CI.Offset0 &= 0xffff;
CI.Offset1 &= 0xffff;		CI.Offset1 &= 0xffff;
} else {		} else {
CI.GLC0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::glc)->getImm();		CI.GLC0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::glc)->getImm();
CI.GLC1 = TII->getNamedOperand(*MBBI, AMDGPU::OpName::glc)->getImm();		CI.GLC1 = TII->getNamedOperand(*MBBI, AMDGPU::OpName::glc)->getImm();
if (CI.InstClass != S_BUFFER_LOAD_IMM) {		if (CI.InstClass != S_BUFFER_LOAD_IMM) {
CI.SLC0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::slc)->getImm();		CI.SLC0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::slc)->getImm();
CI.SLC1 = TII->getNamedOperand(*MBBI, AMDGPU::OpName::slc)->getImm();		CI.SLC1 = TII->getNamedOperand(*MBBI, AMDGPU::OpName::slc)->getImm();
}		}
}		}

// Check both offsets fit in the reduced range.		// Check both offsets fit in the reduced range.
// We also need to go through the list of instructions that we plan to		// We also need to go through the list of instructions that we plan to
// move and make sure they are all safe to move down past the merged		// move and make sure they are all safe to move down past the merged
// instruction.		// instruction.
if (offsetsCanBeCombined(CI))		if (widthsFit(CI) && offsetsCanBeCombined(CI))
if (canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))		if (canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))
return true;		return true;
}		}

// We've found a load/store that we couldn't merge for some reason.		// We've found a load/store that we couldn't merge for some reason.
// We could potentially keep looking, but we'd need to make sure that		// We could potentially keep looking, but we'd need to make sure that
// it was safe to move I and also all the instruction in InstsToMove		// it was safe to move I and also all the instruction in InstsToMove
// down past this instruction.		// down past this instruction.
[... 10 lines not shown ...]	if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_READ2_B32 : AMDGPU::DS_READ2_B64;		return (EltSize == 4) ? AMDGPU::DS_READ2_B32 : AMDGPU::DS_READ2_B64;
return (EltSize == 4) ? AMDGPU::DS_READ2_B32_gfx9 : AMDGPU::DS_READ2_B64_gfx9;		return (EltSize == 4) ? AMDGPU::DS_READ2_B32_gfx9 : AMDGPU::DS_READ2_B64_gfx9;
}		}

unsigned SILoadStoreOptimizer::read2ST64Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::read2ST64Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32 : AMDGPU::DS_READ2ST64_B64;		return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32 : AMDGPU::DS_READ2ST64_B64;

return (EltSize == 4) ?		return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32_gfx9
AMDGPU::DS_READ2ST64_B32_gfx9 : AMDGPU::DS_READ2ST64_B64_gfx9;		: AMDGPU::DS_READ2ST64_B64_gfx9;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeRead2Pair(		MachineBasicBlock::iterator
CombineInfo &CI) {		SILoadStoreOptimizer::mergeRead2Pair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();

// Be careful, since the addresses could be subregisters themselves in weird		// Be careful, since the addresses could be subregisters themselves in weird
// cases, like vectors of pointers.		// cases, like vectors of pointers.
const auto AddrReg = TII->getNamedOperand(CI.I, AMDGPU::OpName::addr);		const auto AddrReg = TII->getNamedOperand(CI.I, AMDGPU::OpName::addr);

const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdst);		const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdst);
const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdst);		const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdst);

unsigned NewOffset0 = CI.Offset0;		unsigned NewOffset0 = CI.Offset0;
unsigned NewOffset1 = CI.Offset1;		unsigned NewOffset1 = CI.Offset1;
unsigned Opc = CI.UseST64 ?		unsigned Opc =
read2ST64Opcode(CI.EltSize) : read2Opcode(CI.EltSize);		CI.UseST64 ? read2ST64Opcode(CI.EltSize) : read2Opcode(CI.EltSize);

unsigned SubRegIdx0 = (CI.EltSize == 4) ? AMDGPU::sub0 : AMDGPU::sub0_sub1;		unsigned SubRegIdx0 = (CI.EltSize == 4) ? AMDGPU::sub0 : AMDGPU::sub0_sub1;
unsigned SubRegIdx1 = (CI.EltSize == 4) ? AMDGPU::sub1 : AMDGPU::sub2_sub3;		unsigned SubRegIdx1 = (CI.EltSize == 4) ? AMDGPU::sub1 : AMDGPU::sub2_sub3;

if (NewOffset0 > NewOffset1) {		if (NewOffset0 > NewOffset1) {
// Canonicalize the merged instruction so the smaller offset comes first.		// Canonicalize the merged instruction so the smaller offset comes first.
std::swap(NewOffset0, NewOffset1);		std::swap(NewOffset0, NewOffset1);
std::swap(SubRegIdx0, SubRegIdx1);		std::swap(SubRegIdx0, SubRegIdx1);
}		}

assert((isUInt<8>(NewOffset0) && isUInt<8>(NewOffset1)) &&		assert((isUInt<8>(NewOffset0) && isUInt<8>(NewOffset1)) &&
(NewOffset0 != NewOffset1) &&		(NewOffset0 != NewOffset1) && "Computed offset doesn't fit");
"Computed offset doesn't fit");

const MCInstrDesc &Read2Desc = TII->get(Opc);		const MCInstrDesc &Read2Desc = TII->get(Opc);

const TargetRegisterClass *SuperRC		const TargetRegisterClass *SuperRC =
= (CI.EltSize == 4) ? &AMDGPU::VReg_64RegClass : &AMDGPU::VReg_128RegClass;		(CI.EltSize == 4) ? &AMDGPU::VReg_64RegClass : &AMDGPU::VReg_128RegClass;
unsigned DestReg = MRI->createVirtualRegister(SuperRC);		unsigned DestReg = MRI->createVirtualRegister(SuperRC);

DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

unsigned BaseReg = AddrReg->getReg();		unsigned BaseReg = AddrReg->getReg();
unsigned BaseSubReg = AddrReg->getSubReg();		unsigned BaseSubReg = AddrReg->getSubReg();
unsigned BaseRegFlags = 0;		unsigned BaseRegFlags = 0;
if (CI.BaseOff) {		if (CI.BaseOff) {
unsigned ImmReg = MRI->createVirtualRegister(&AMDGPU::SGPR_32RegClass);		unsigned ImmReg = MRI->createVirtualRegister(&AMDGPU::SGPR_32RegClass);
BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)		BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)
.addImm(CI.BaseOff);		.addImm(CI.BaseOff);

BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);		BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BaseRegFlags = RegState::Kill;		BaseRegFlags = RegState::Kill;

TII->getAddNoCarry(*MBB, CI.Paired, DL, BaseReg)		TII->getAddNoCarry(*MBB, CI.Paired, DL, BaseReg)
.addReg(ImmReg)		.addReg(ImmReg)
.addReg(AddrReg->getReg(), 0, BaseSubReg);		.addReg(AddrReg->getReg(), 0, BaseSubReg);
BaseSubReg = 0;		BaseSubReg = 0;
}		}

MachineInstrBuilder Read2 = BuildMI(*MBB, CI.Paired, DL, Read2Desc, DestReg)		MachineInstrBuilder Read2 =
		BuildMI(*MBB, CI.Paired, DL, Read2Desc, DestReg)
.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr		.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr
.addImm(NewOffset0) // offset0		.addImm(NewOffset0) // offset0
.addImm(NewOffset1) // offset1		.addImm(NewOffset1) // offset1
.addImm(0) // gds		.addImm(0) // gds
.cloneMergedMemRefs({&CI.I, &CI.Paired});		.cloneMergedMemRefs({&CI.I, &CI.Paired});

(void)Read2;		(void)Read2;

const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);

// Copy to the old destination registers.		// Copy to the old destination registers.
BuildMI(*MBB, CI.Paired, DL, CopyDesc)		BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
[... 10 lines not shown ...]	SILoadStoreOptimizer::mergeRead2Pair(CombineInfo &CI) {

LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');
return Next;		return Next;
}		}

unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9 : AMDGPU::DS_WRITE2_B64_gfx9;		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9
		: AMDGPU::DS_WRITE2_B64_gfx9;
}		}

unsigned SILoadStoreOptimizer::write2ST64Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::write2ST64Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32 : AMDGPU::DS_WRITE2ST64_B64;		return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32
		: AMDGPU::DS_WRITE2ST64_B64;

return (EltSize == 4) ?		return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32_gfx9
AMDGPU::DS_WRITE2ST64_B32_gfx9 : AMDGPU::DS_WRITE2ST64_B64_gfx9;		: AMDGPU::DS_WRITE2ST64_B64_gfx9;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeWrite2Pair(		MachineBasicBlock::iterator
CombineInfo &CI) {		SILoadStoreOptimizer::mergeWrite2Pair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();

// Be sure to use .addOperand(), and not .addReg() with these. We want to be		// Be sure to use .addOperand(), and not .addReg() with these. We want to be
// sure we preserve the subregister index and any register flags set on them.		// sure we preserve the subregister index and any register flags set on them.
const MachineOperand AddrReg = TII->getNamedOperand(CI.I, AMDGPU::OpName::addr);		const MachineOperand *AddrReg =
const MachineOperand Data0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::data0);		TII->getNamedOperand(*CI.I, AMDGPU::OpName::addr);
const MachineOperand *Data1		const MachineOperand *Data0 =
= TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::data0);		TII->getNamedOperand(*CI.I, AMDGPU::OpName::data0);
		const MachineOperand *Data1 =
		TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::data0);

unsigned NewOffset0 = CI.Offset0;		unsigned NewOffset0 = CI.Offset0;
unsigned NewOffset1 = CI.Offset1;		unsigned NewOffset1 = CI.Offset1;
unsigned Opc = CI.UseST64 ?		unsigned Opc =
write2ST64Opcode(CI.EltSize) : write2Opcode(CI.EltSize);		CI.UseST64 ? write2ST64Opcode(CI.EltSize) : write2Opcode(CI.EltSize);

if (NewOffset0 > NewOffset1) {		if (NewOffset0 > NewOffset1) {
// Canonicalize the merged instruction so the smaller offset comes first.		// Canonicalize the merged instruction so the smaller offset comes first.
std::swap(NewOffset0, NewOffset1);		std::swap(NewOffset0, NewOffset1);
std::swap(Data0, Data1);		std::swap(Data0, Data1);
}		}

assert((isUInt<8>(NewOffset0) && isUInt<8>(NewOffset1)) &&		assert((isUInt<8>(NewOffset0) && isUInt<8>(NewOffset1)) &&
(NewOffset0 != NewOffset1) &&		(NewOffset0 != NewOffset1) && "Computed offset doesn't fit");
"Computed offset doesn't fit");

const MCInstrDesc &Write2Desc = TII->get(Opc);		const MCInstrDesc &Write2Desc = TII->get(Opc);
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

unsigned BaseReg = AddrReg->getReg();		unsigned BaseReg = AddrReg->getReg();
unsigned BaseSubReg = AddrReg->getSubReg();		unsigned BaseSubReg = AddrReg->getSubReg();
unsigned BaseRegFlags = 0;		unsigned BaseRegFlags = 0;
if (CI.BaseOff) {		if (CI.BaseOff) {
unsigned ImmReg = MRI->createVirtualRegister(&AMDGPU::SGPR_32RegClass);		unsigned ImmReg = MRI->createVirtualRegister(&AMDGPU::SGPR_32RegClass);
BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)		BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)
.addImm(CI.BaseOff);		.addImm(CI.BaseOff);

BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);		BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BaseRegFlags = RegState::Kill;		BaseRegFlags = RegState::Kill;

TII->getAddNoCarry(*MBB, CI.Paired, DL, BaseReg)		TII->getAddNoCarry(*MBB, CI.Paired, DL, BaseReg)
.addReg(ImmReg)		.addReg(ImmReg)
.addReg(AddrReg->getReg(), 0, BaseSubReg);		.addReg(AddrReg->getReg(), 0, BaseSubReg);
BaseSubReg = 0;		BaseSubReg = 0;
}		}

MachineInstrBuilder Write2 = BuildMI(*MBB, CI.Paired, DL, Write2Desc)		MachineInstrBuilder Write2 =
		BuildMI(*MBB, CI.Paired, DL, Write2Desc)
.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr		.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr
.add(*Data0) // data0		.add(*Data0) // data0
.add(*Data1) // data1		.add(*Data1) // data1
.addImm(NewOffset0) // offset0		.addImm(NewOffset0) // offset0
.addImm(NewOffset1) // offset1		.addImm(NewOffset1) // offset1
.addImm(0) // gds		.addImm(0) // gds
.cloneMergedMemRefs({&CI.I, &CI.Paired});		.cloneMergedMemRefs({&CI.I, &CI.Paired});

moveInstsAfter(Write2, CI.InstsToMove);		moveInstsAfter(Write2, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);		MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();

LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');
return Next;		return Next;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeSBufferLoadImmPair(		MachineBasicBlock::iterator
CombineInfo &CI) {		SILoadStoreOptimizer::mergeSBufferLoadImmPair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
unsigned Opcode = CI.IsX2 ? AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM :		const unsigned Opcode = getNewOpcode(CI);
AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM;
		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI);

const TargetRegisterClass *SuperRC =
CI.IsX2 ? &AMDGPU::SReg_128RegClass : &AMDGPU::SReg_64_XEXECRegClass;
unsigned DestReg = MRI->createVirtualRegister(SuperRC);		unsigned DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);		unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);

BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg)		BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg)
.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::sbase))		.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::sbase))
.addImm(MergedOffset) // offset		.addImm(MergedOffset) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.cloneMergedMemRefs({&CI.I, &CI.Paired});		.cloneMergedMemRefs({&CI.I, &CI.Paired});

unsigned SubRegIdx0 = CI.IsX2 ? AMDGPU::sub0_sub1 : AMDGPU::sub0;		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);
unsigned SubRegIdx1 = CI.IsX2 ? AMDGPU::sub2_sub3 : AMDGPU::sub1;		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);
// Handle descending offsets
if (CI.Offset0 > CI.Offset1)
std::swap(SubRegIdx0, SubRegIdx1);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::sdst);		const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::sdst);
const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::sdst);		const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::sdst);

BuildMI(*MBB, CI.Paired, DL, CopyDesc)		BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr Copy1 = BuildMI(MBB, CI.Paired, DL, CopyDesc)		MachineInstr Copy1 = BuildMI(MBB, CI.Paired, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, CI.InstsToMove);		moveInstsAfter(Copy1, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);		MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();
return Next;		return Next;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferLoadPair(		MachineBasicBlock::iterator
CombineInfo &CI) {		SILoadStoreOptimizer::mergeBufferLoadPair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
unsigned Opcode;

if (CI.InstClass == BUFFER_LOAD_OFFEN) {		const unsigned Opcode = getNewOpcode(CI);
Opcode = CI.IsX2 ? AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN :
AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN;
} else {
Opcode = CI.IsX2 ? AMDGPU::BUFFER_LOAD_DWORDX4_OFFSET :
AMDGPU::BUFFER_LOAD_DWORDX2_OFFSET;
}

const TargetRegisterClass *SuperRC =		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI);
CI.IsX2 ? &AMDGPU::VReg_128RegClass : &AMDGPU::VReg_64RegClass;
		// Copy to the new source register.
unsigned DestReg = MRI->createVirtualRegister(SuperRC);		unsigned DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);		unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);

auto MIB = BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg);		auto MIB = BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg);

if (CI.InstClass == BUFFER_LOAD_OFFEN)		const unsigned Regs = getRegs(Opcode);

		if (Regs & VADDR)
MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::vaddr));		MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::vaddr));

MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::srsrc))		MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::srsrc))
.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::soffset))		.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::soffset))
.addImm(MergedOffset) // offset		.addImm(MergedOffset) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.addImm(CI.SLC0) // slc		.addImm(CI.SLC0) // slc
.addImm(0) // tfe		.addImm(0) // tfe
.cloneMergedMemRefs({&CI.I, &CI.Paired});		.cloneMergedMemRefs({&CI.I, &CI.Paired});

unsigned SubRegIdx0 = CI.IsX2 ? AMDGPU::sub0_sub1 : AMDGPU::sub0;		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);
unsigned SubRegIdx1 = CI.IsX2 ? AMDGPU::sub2_sub3 : AMDGPU::sub1;		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);
// Handle descending offsets
if (CI.Offset0 > CI.Offset1)
std::swap(SubRegIdx0, SubRegIdx1);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdata);		const auto Dest0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdata);
const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdata);		const auto Dest1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdata);

BuildMI(*MBB, CI.Paired, DL, CopyDesc)		BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr Copy1 = BuildMI(MBB, CI.Paired, DL, CopyDesc)		MachineInstr Copy1 = BuildMI(MBB, CI.Paired, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, CI.InstsToMove);		moveInstsAfter(Copy1, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);		MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();
return Next;		return Next;
}		}

unsigned SILoadStoreOptimizer::promoteBufferStoreOpcode(		unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI) {
const MachineInstr &I, bool &IsX2, bool &IsOffen) const {		const unsigned Width = CI.Width0 + CI.Width1;
IsX2 = false;
IsOffen = false;

switch (I.getOpcode()) {		switch (CI.InstClass) {
case AMDGPU::BUFFER_STORE_DWORD_OFFEN:		default:
IsOffen = true;		return AMDGPU::getMUBUFOpcode(CI.InstClass, Width);
return AMDGPU::BUFFER_STORE_DWORDX2_OFFEN;		case UNKNOWN:
case AMDGPU::BUFFER_STORE_DWORD_OFFEN_exact:		llvm_unreachable("Unknown instruction class");
IsOffen = true;		case S_BUFFER_LOAD_IMM:
return AMDGPU::BUFFER_STORE_DWORDX2_OFFEN_exact;		switch (Width) {
case AMDGPU::BUFFER_STORE_DWORDX2_OFFEN:		default:
IsX2 = true;
IsOffen = true;
return AMDGPU::BUFFER_STORE_DWORDX4_OFFEN;
case AMDGPU::BUFFER_STORE_DWORDX2_OFFEN_exact:
IsX2 = true;
IsOffen = true;
return AMDGPU::BUFFER_STORE_DWORDX4_OFFEN_exact;
case AMDGPU::BUFFER_STORE_DWORD_OFFSET:
return AMDGPU::BUFFER_STORE_DWORDX2_OFFSET;
case AMDGPU::BUFFER_STORE_DWORD_OFFSET_exact:
return AMDGPU::BUFFER_STORE_DWORDX2_OFFSET_exact;
case AMDGPU::BUFFER_STORE_DWORDX2_OFFSET:
IsX2 = true;
return AMDGPU::BUFFER_STORE_DWORDX4_OFFSET;
case AMDGPU::BUFFER_STORE_DWORDX2_OFFSET_exact:
IsX2 = true;
return AMDGPU::BUFFER_STORE_DWORDX4_OFFSET_exact;
}
return 0;		return 0;
		case 2:
		return AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM;
		case 4:
		return AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM;
		}
		}
		}

		std::pair<unsigned, unsigned>
		SILoadStoreOptimizer::getSubRegIdxs(const CombineInfo &CI) {
		if (CI.Offset0 > CI.Offset1) {
		switch (CI.Width0) {
		default:
		return std::make_pair(0, 0);
		case 1:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub1, AMDGPU::sub0);
		case 2:
		return std::make_pair(AMDGPU::sub2, AMDGPU::sub0_sub1);
		case 3:
		return std::make_pair(AMDGPU::sub3, AMDGPU::sub0_sub1_sub2);
		}
		case 2:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub1_sub2, AMDGPU::sub0);
		case 2:
		return std::make_pair(AMDGPU::sub2_sub3, AMDGPU::sub0_sub1);
		}
		case 3:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub1_sub2_sub3, AMDGPU::sub0);
		}
		}
		} else {
		switch (CI.Width0) {
		default:
		return std::make_pair(0, 0);
		case 1:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub0, AMDGPU::sub1);
		case 2:
		return std::make_pair(AMDGPU::sub0, AMDGPU::sub1_sub2);
		case 3:
		return std::make_pair(AMDGPU::sub0, AMDGPU::sub1_sub2_sub3);
		}
		case 2:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub0_sub1, AMDGPU::sub2);
		case 2:
		return std::make_pair(AMDGPU::sub0_sub1, AMDGPU::sub2_sub3);
		}
		case 3:
		switch (CI.Width1) {
		default:
		return std::make_pair(0, 0);
		case 1:
		return std::make_pair(AMDGPU::sub0_sub1_sub2, AMDGPU::sub3);
		}
		}
		}
}		}
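
Note on the table in `getSubRegIdxs`: every case follows one rule. The access with the lower offset takes the low dwords of the merged register, and the other access takes the dwords that follow. A minimal standalone sketch of that rule is given here. `DwordRange` and `mergedDwordRanges` are hypothetical names used only for illustration; the patch itself returns `AMDGPU::sub*` indices from nested switches and does not compute ranges this way.

```cpp
#include <utility>

// Half-open range of dwords inside the merged register:
// [First, First + Count).
struct DwordRange {
  unsigned First;
  unsigned Count;
};

// Hypothetical helper: returns the dword ranges occupied by the first and
// second access. Widths are in dwords; only the ordering of the two offsets
// matters, not their absolute values.
std::pair<DwordRange, DwordRange> mergedDwordRanges(unsigned Width0,
                                                    unsigned Width1,
                                                    unsigned Offset0,
                                                    unsigned Offset1) {
  if (Offset0 > Offset1) {
    // Descending offsets: the second access sits in the low dwords.
    return {DwordRange{Width1, Width0}, DwordRange{0, Width1}};
  }
  // Ascending offsets: the first access sits in the low dwords.
  return {DwordRange{0, Width0}, DwordRange{Width0, Width1}};
}
```

For example, `Width0 = 1`, `Width1 = 2` with `Offset0 > Offset1` places the first access at dword 2 and the second at dwords 0 and 1, which corresponds to the `(sub2, sub0_sub1)` entry in the table above.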

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferStorePair(		const TargetRegisterClass *
CombineInfo &CI) {		SILoadStoreOptimizer::getTargetRegisterClass(const CombineInfo &CI) {
		if (CI.InstClass == S_BUFFER_LOAD_IMM) {
		switch (CI.Width0 + CI.Width1) {
		default:
		return nullptr;
		case 2:
		return &AMDGPU::SReg_64_XEXECRegClass;
		case 4:
		return &AMDGPU::SReg_128RegClass;
		case 8:
		return &AMDGPU::SReg_256RegClass;
		case 16:
		return &AMDGPU::SReg_512RegClass;
		}
		} else {
		switch (CI.Width0 + CI.Width1) {
		default:
		return nullptr;
		case 2:
		return &AMDGPU::VReg_64RegClass;
		case 3:
		return &AMDGPU::VReg_96RegClass;
		case 4:
		return &AMDGPU::VReg_128RegClass;
		}
		}
		}

		MachineBasicBlock::iterator
		SILoadStoreOptimizer::mergeBufferStorePair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
bool Unused1, Unused2;
unsigned Opcode = promoteBufferStoreOpcode(*CI.I, Unused1, Unused2);

unsigned SubRegIdx0 = CI.IsX2 ? AMDGPU::sub0_sub1 : AMDGPU::sub0;		const unsigned Opcode = getNewOpcode(CI);
unsigned SubRegIdx1 = CI.IsX2 ? AMDGPU::sub2_sub3 : AMDGPU::sub1;

// Handle descending offsets		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);
if (CI.Offset0 > CI.Offset1)		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
std::swap(SubRegIdx0, SubRegIdx1);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the new source register.		// Copy to the new source register.
const TargetRegisterClass *SuperRC =		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI);
CI.IsX2 ? &AMDGPU::VReg_128RegClass : &AMDGPU::VReg_64RegClass;
unsigned SrcReg = MRI->createVirtualRegister(SuperRC);		unsigned SrcReg = MRI->createVirtualRegister(SuperRC);

const auto Src0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdata);		const auto Src0 = TII->getNamedOperand(CI.I, AMDGPU::OpName::vdata);
const auto Src1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdata);		const auto Src1 = TII->getNamedOperand(CI.Paired, AMDGPU::OpName::vdata);

BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)		BuildMI(*MBB, CI.Paired, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
.add(*Src0)		.add(*Src0)
.addImm(SubRegIdx0)		.addImm(SubRegIdx0)
.add(*Src1)		.add(*Src1)
.addImm(SubRegIdx1);		.addImm(SubRegIdx1);

auto MIB = BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode))		auto MIB = BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode))
.addReg(SrcReg, RegState::Kill);		.addReg(SrcReg, RegState::Kill);

if (CI.InstClass == BUFFER_STORE_OFFEN)		const unsigned Regs = getRegs(Opcode);

		if (Regs & VADDR)
MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::vaddr));		MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::vaddr));

MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::srsrc))		MIB.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::srsrc))
.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::soffset))		.add(TII->getNamedOperand(CI.I, AMDGPU::OpName::soffset))
.addImm(std::min(CI.Offset0, CI.Offset1)) // offset		.addImm(std::min(CI.Offset0, CI.Offset1)) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.addImm(CI.SLC0) // slc		.addImm(CI.SLC0) // slc
.addImm(0) // tfe		.addImm(0) // tfe
.cloneMergedMemRefs({&CI.I, &CI.Paired});		.cloneMergedMemRefs({&CI.I, &CI.Paired});
[... 16 lines not shown ...]	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end(); I != E;) {
MachineInstr &MI = *I;		MachineInstr &MI = *I;

// Don't combine if volatile.		// Don't combine if volatile.
if (MI.hasOrderedMemoryRef()) {		if (MI.hasOrderedMemoryRef()) {
++I;		++I;
continue;		continue;
}		}

		const unsigned Opc = MI.getOpcode();

CombineInfo CI;		CombineInfo CI;
CI.I = I;		CI.I = I;
unsigned Opc = MI.getOpcode();		CI.InstClass = getInstClass(Opc);
if (Opc == AMDGPU::DS_READ_B32 || Opc == AMDGPU::DS_READ_B64 ||
Opc == AMDGPU::DS_READ_B32_gfx9 || Opc == AMDGPU::DS_READ_B64_gfx9) {

CI.InstClass = DS_READ_WRITE;		switch (CI.InstClass) {
		default:
		break;
		case DS_READ:
CI.EltSize =		CI.EltSize =
(Opc == AMDGPU::DS_READ_B64 || Opc == AMDGPU::DS_READ_B64_gfx9) ? 8 : 4;		(Opc == AMDGPU::DS_READ_B64 || Opc == AMDGPU::DS_READ_B64_gfx9) ? 8
		: 4;
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeRead2Pair(CI);		I = mergeRead2Pair(CI);
} else {		} else {
++I;		++I;
}		}

continue;		continue;
} else if (Opc == AMDGPU::DS_WRITE_B32 || Opc == AMDGPU::DS_WRITE_B64 ||		case DS_WRITE:
Opc == AMDGPU::DS_WRITE_B32_gfx9 ||		CI.EltSize =
Opc == AMDGPU::DS_WRITE_B64_gfx9) {		(Opc == AMDGPU::DS_WRITE_B64 || Opc == AMDGPU::DS_WRITE_B64_gfx9) ? 8
CI.InstClass = DS_READ_WRITE;		: 4;
CI.EltSize
= (Opc == AMDGPU::DS_WRITE_B64 || Opc == AMDGPU::DS_WRITE_B64_gfx9) ? 8 : 4;

if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeWrite2Pair(CI);		I = mergeWrite2Pair(CI);
} else {		} else {
++I;		++I;
}		}

continue;		continue;
}		case S_BUFFER_LOAD_IMM:
if (Opc == AMDGPU::S_BUFFER_LOAD_DWORD_IMM ||
Opc == AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM) {
// EltSize is in units of the offset encoding.
CI.InstClass = S_BUFFER_LOAD_IMM;
CI.EltSize = AMDGPU::getSMRDEncodedOffset(*STM, 4);		CI.EltSize = AMDGPU::getSMRDEncodedOffset(*STM, 4);
CI.IsX2 = Opc == AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM;
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeSBufferLoadImmPair(CI);		I = mergeSBufferLoadImmPair(CI);
if (!CI.IsX2)		OptimizeAgain |= (CI.Width0 + CI.Width1) < 16;
CreatedX2++;
} else {		} else {
++I;		++I;
}		}
continue;		continue;
}		case BUFFER_LOAD_OFFEN:
if (Opc == AMDGPU::BUFFER_LOAD_DWORD_OFFEN ||		case BUFFER_LOAD_OFFSET:
Opc == AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN ||		case BUFFER_LOAD_OFFEN_exact:
Opc == AMDGPU::BUFFER_LOAD_DWORD_OFFSET ||		case BUFFER_LOAD_OFFSET_exact:
Opc == AMDGPU::BUFFER_LOAD_DWORDX2_OFFSET) {
if (Opc == AMDGPU::BUFFER_LOAD_DWORD_OFFEN ||
Opc == AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN)
CI.InstClass = BUFFER_LOAD_OFFEN;
else
CI.InstClass = BUFFER_LOAD_OFFSET;

CI.EltSize = 4;		CI.EltSize = 4;
CI.IsX2 = Opc == AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN ||
Opc == AMDGPU::BUFFER_LOAD_DWORDX2_OFFSET;
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeBufferLoadPair(CI);		I = mergeBufferLoadPair(CI);
if (!CI.IsX2)		OptimizeAgain |= (CI.Width0 + CI.Width1) < 4;
CreatedX2++;
} else {		} else {
++I;		++I;
}		}
continue;		continue;
}		case BUFFER_STORE_OFFEN:
		case BUFFER_STORE_OFFSET:
bool StoreIsX2, IsOffen;		case BUFFER_STORE_OFFEN_exact:
if (promoteBufferStoreOpcode(*I, StoreIsX2, IsOffen)) {		case BUFFER_STORE_OFFSET_exact:
CI.InstClass = IsOffen ? BUFFER_STORE_OFFEN : BUFFER_STORE_OFFSET;
CI.EltSize = 4;		CI.EltSize = 4;
CI.IsX2 = StoreIsX2;
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeBufferStorePair(CI);		I = mergeBufferStorePair(CI);
if (!CI.IsX2)		OptimizeAgain |= (CI.Width0 + CI.Width1) < 4;
CreatedX2++;
} else {		} else {
++I;		++I;
}		}
continue;		continue;
}		}

++I;		++I;
}		}
Show All 17 Lines	bool SILoadStoreOptimizer::runOnMachineFunction(MachineFunction &MF) {

assert(MRI->isSSA() && "Must be run on SSA");		assert(MRI->isSSA() && "Must be run on SSA");

LLVM_DEBUG(dbgs() << "Running SILoadStoreOptimizer\n");		LLVM_DEBUG(dbgs() << "Running SILoadStoreOptimizer\n");

bool Modified = false;		bool Modified = false;

for (MachineBasicBlock &MBB : MF) {		for (MachineBasicBlock &MBB : MF) {
CreatedX2 = 0;		do {
Modified |= optimizeBlock(MBB);		OptimizeAgain = false;

// Run again to convert x2 to x4.
if (CreatedX2 >= 1)
Modified |= optimizeBlock(MBB);		Modified |= optimizeBlock(MBB);
		} while (OptimizeAgain);
}		}

return Modified;		return Modified;
}		}
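
The `do { ... } while (OptimizeAgain)` loop above replaces the old one-shot `CreatedX2` rerun: a block is reprocessed for as long as some merge produced a result that is still narrower than the widest instruction. Below is a minimal, self-contained sketch of that fixed-point idea. `Access`, `optimizeOnce`, and `optimizeToFixedPoint` are hypothetical names used only for illustration, not part of the patch; offsets and widths are in dwords, and only immediately adjacent entries are paired, which is much simpler than what `findMatchingInst` does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one memory access: start offset and width,
// both measured in dwords.
struct Access {
  unsigned Offset;
  unsigned Width;
};

// One pass over the block. Each access is merged with its immediate successor
// when the two are contiguous and the combined width fits in MaxWidth.
// OptimizeAgain is set when a merged result is still narrower than MaxWidth,
// because a later pass might be able to widen it further.
static bool optimizeOnce(std::vector<Access> &Block, unsigned MaxWidth,
                         bool &OptimizeAgain) {
  bool Modified = false;
  std::vector<Access> Out;
  for (std::size_t I = 0; I < Block.size();) {
    if (I + 1 < Block.size() &&
        Block[I].Offset + Block[I].Width == Block[I + 1].Offset &&
        Block[I].Width + Block[I + 1].Width <= MaxWidth) {
      unsigned Width = Block[I].Width + Block[I + 1].Width;
      Out.push_back({Block[I].Offset, Width});
      OptimizeAgain |= Width < MaxWidth;
      Modified = true;
      I += 2;
    } else {
      Out.push_back(Block[I]);
      ++I;
    }
  }
  Block = Out;
  return Modified;
}

// Repeat the pass until no merge leaves room for further widening. Every pass
// that sets OptimizeAgain has removed at least one entry, so the loop ends.
static bool optimizeToFixedPoint(std::vector<Access> &Block,
                                 unsigned MaxWidth) {
  bool Modified = false;
  bool OptimizeAgain;
  do {
    OptimizeAgain = false;
    Modified |= optimizeOnce(Block, MaxWidth, OptimizeAgain);
  } while (OptimizeAgain);
  return Modified;
}
```

With `MaxWidth` set to 4, three contiguous single-dword accesses first become an x2 plus an x1, and the next pass turns them into a single x3, which the old "run once more if an x2 was created" logic could not express in general.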

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h

	LLVM_READONLY			LLVM_READONLY
	int getMIMGOpcode(unsigned BaseOpcode, unsigned MIMGEncoding,			int getMIMGOpcode(unsigned BaseOpcode, unsigned MIMGEncoding,
	unsigned VDataDwords, unsigned VAddrDwords);			unsigned VDataDwords, unsigned VAddrDwords);

	LLVM_READONLY			LLVM_READONLY
	int getMaskedMIMGOp(unsigned Opc, unsigned NewChannels);			int getMaskedMIMGOp(unsigned Opc, unsigned NewChannels);

	LLVM_READONLY			LLVM_READONLY
				int getMUBUFBaseOpcode(unsigned Opc);

				LLVM_READONLY
				int getMUBUFOpcode(unsigned BaseOpc, unsigned Dwords);

				LLVM_READONLY
				int getMUBUFDwords(unsigned Opc);

				LLVM_READONLY
				bool getMUBUFHasVAddr(unsigned Opc);

				LLVM_READONLY
				bool getMUBUFHasSrsrc(unsigned Opc);

				LLVM_READONLY
				bool getMUBUFHasSoffset(unsigned Opc);

				LLVM_READONLY
	int getMCOpcode(uint16_t Opcode, unsigned Gen);			int getMCOpcode(uint16_t Opcode, unsigned Gen);

	void initDefaultAMDKernelCodeT(amd_kernel_code_t &Header,			void initDefaultAMDKernelCodeT(amd_kernel_code_t &Header,
	const MCSubtargetInfo *STI);			const MCSubtargetInfo *STI);

	amdhsa::kernel_descriptor_t getDefaultAmdhsaKernelDescriptor();			amdhsa::kernel_descriptor_t getDefaultAmdhsaKernelDescriptor();

	bool isGroupSegment(const GlobalValue *GV);			bool isGroupSegment(const GlobalValue *GV);

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

	int getMaskedMIMGOp(unsigned Opc, unsigned NewChannels) {			int getMaskedMIMGOp(unsigned Opc, unsigned NewChannels) {
	const MIMGInfo *OrigInfo = getMIMGInfo(Opc);			const MIMGInfo *OrigInfo = getMIMGInfo(Opc);
	const MIMGInfo *NewInfo =			const MIMGInfo *NewInfo =
	getMIMGOpcodeHelper(OrigInfo->BaseOpcode, OrigInfo->MIMGEncoding,			getMIMGOpcodeHelper(OrigInfo->BaseOpcode, OrigInfo->MIMGEncoding,
	NewChannels, OrigInfo->VAddrDwords);			NewChannels, OrigInfo->VAddrDwords);
	return NewInfo ? NewInfo->Opcode : -1;			return NewInfo ? NewInfo->Opcode : -1;
	}			}

				struct MUBUFInfo {
				uint16_t Opcode;
				uint16_t BaseOpcode;
				uint8_t dwords;
				bool has_vaddr;
				bool has_srsrc;
				bool has_soffset;
				};

				#define GET_MUBUFInfoTable_DECL
				#define GET_MUBUFInfoTable_IMPL
				#include "AMDGPUGenSearchableTables.inc"

				int getMUBUFBaseOpcode(unsigned Opc) {
				const MUBUFInfo *Info = getMUBUFInfoFromOpcode(Opc);
				return Info ? Info->BaseOpcode : -1;
				}

				int getMUBUFOpcode(unsigned BaseOpc, unsigned Dwords) {
				const MUBUFInfo *Info = getMUBUFInfoFromBaseOpcodeAndDwords(BaseOpc, Dwords);
				return Info ? Info->Opcode : -1;
				}

				int getMUBUFDwords(unsigned Opc) {
				const MUBUFInfo *Info = getMUBUFOpcodeHelper(Opc);
				return Info ? Info->dwords : 0;
				}

				bool getMUBUFHasVAddr(unsigned Opc) {
				const MUBUFInfo *Info = getMUBUFOpcodeHelper(Opc);
				return Info ? Info->has_vaddr : false;
				}

				bool getMUBUFHasSrsrc(unsigned Opc) {
				const MUBUFInfo *Info = getMUBUFOpcodeHelper(Opc);
				return Info ? Info->has_srsrc : false;
				}

				bool getMUBUFHasSoffset(unsigned Opc) {
				const MUBUFInfo *Info = getMUBUFOpcodeHelper(Opc);
				return Info ? Info->has_soffset : false;
				}
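
The helpers above all follow one pattern: look a row up in a generated table, then return one field or a sentinel (`-1`, `0`, or `false`) when the opcode is not a MUBUF instruction. The sketch below imitates that pattern with a hand-written array so that it compiles on its own. The opcode values, `InfoRow`, `Table`, the `findBy...` functions, and `getMergedOpcode` are assumptions made for illustration; in the patch the table and its search functions come from `AMDGPUGenSearchableTables.inc`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcode numbers; the real values are generated by TableGen.
enum : uint16_t { LOAD_X1 = 10, LOAD_X2 = 11, LOAD_X3 = 12, LOAD_X4 = 13 };

// Cut-down analogue of MUBUFInfo: an opcode, the base opcode of its family,
// and the number of dwords it transfers.
struct InfoRow {
  uint16_t Opcode;
  uint16_t BaseOpcode;
  uint8_t Dwords;
};

static const InfoRow Table[] = {
    {LOAD_X1, LOAD_X1, 1},
    {LOAD_X2, LOAD_X1, 2},
    {LOAD_X3, LOAD_X1, 3},
    {LOAD_X4, LOAD_X1, 4},
};

// Linear search stands in for the generated lookup functions.
static const InfoRow *findByOpcode(unsigned Opc) {
  for (const InfoRow &Row : Table)
    if (Row.Opcode == Opc)
      return &Row;
  return nullptr;
}

static const InfoRow *findByBaseAndDwords(unsigned BaseOpc, unsigned Dwords) {
  for (const InfoRow &Row : Table)
    if (Row.BaseOpcode == BaseOpc && Row.Dwords == Dwords)
      return &Row;
  return nullptr;
}

int getBaseOpcode(unsigned Opc) {
  const InfoRow *Info = findByOpcode(Opc);
  return Info ? Info->BaseOpcode : -1;
}

int getOpcode(unsigned BaseOpc, unsigned Dwords) {
  const InfoRow *Info = findByBaseAndDwords(BaseOpc, Dwords);
  return Info ? Info->Opcode : -1;
}

int getDwords(unsigned Opc) {
  const InfoRow *Info = findByOpcode(Opc);
  return Info ? Info->Dwords : 0;
}

// Opcode that covers both accesses, or -1 when the two opcodes belong to
// different families or no instruction of the combined width exists.
int getMergedOpcode(unsigned OpcA, unsigned OpcB) {
  int Base = getBaseOpcode(OpcA);
  if (Base < 0 || Base != getBaseOpcode(OpcB))
    return -1;
  return getOpcode(Base, getDwords(OpcA) + getDwords(OpcB));
}
```

Keying the second lookup on the base opcode plus a dword count is what lets a caller ask for "the same instruction, but three dwords wide" without enumerating every opcode pair by hand.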

	// Wrapper for Tablegen'd function. enum Subtarget is not defined in any			// Wrapper for Tablegen'd function. enum Subtarget is not defined in any
	// header files, so we need to wrap it in a function that takes unsigned			// header files, so we need to wrap it in a function that takes unsigned
	// instead.			// instead.
	int getMCOpcode(uint16_t Opcode, unsigned Gen) {			int getMCOpcode(uint16_t Opcode, unsigned Gen) {
	return getMCOpcodeGen(Opcode, static_cast<Subtarget>(Gen));			return getMCOpcodeGen(Opcode, static_cast<Subtarget>(Gen));
	}			}

	namespace IsaInfo {			namespace IsaInfo {

test/CodeGen/AMDGPU/cvt_f32_ubyte.ll

Show All 30 Lines	define amdgpu_kernel void @load_v2i8_to_v2f32(<2 x float> addrspace(1)* noalias %out, <2 x i8> addrspace(1)* noalias %in) nounwind {
%cvt = uitofp <2 x i8> %load to <2 x float>		%cvt = uitofp <2 x i8> %load to <2 x float>
store <2 x float> %cvt, <2 x float> addrspace(1)* %out, align 16		store <2 x float> %cvt, <2 x float> addrspace(1)* %out, align 16
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_v3i8_to_v3f32:		; GCN-LABEL: {{^}}load_v3i8_to_v3f32:
; GCN: {{buffer|flat}}_load_dword [[VAL:v[0-9]+]]		; GCN: {{buffer|flat}}_load_dword [[VAL:v[0-9]+]]
; GCN-NOT: v_cvt_f32_ubyte3_e32		; GCN-NOT: v_cvt_f32_ubyte3_e32
; GCN-DAG: v_cvt_f32_ubyte2_e32 v{{[0-9]+}}, [[VAL]]		; GCN-DAG: v_cvt_f32_ubyte2_e32 v[[HIRESULT:[0-9]+]], [[VAL]]
; GCN-DAG: v_cvt_f32_ubyte1_e32 v[[HIRESULT:[0-9]+]], [[VAL]]		; GCN-DAG: v_cvt_f32_ubyte1_e32 v{{[0-9]+}}, [[VAL]]
; GCN-DAG: v_cvt_f32_ubyte0_e32 v[[LORESULT:[0-9]+]], [[VAL]]		; GCN-DAG: v_cvt_f32_ubyte0_e32 v[[LORESULT:[0-9]+]], [[VAL]]
; GCN: buffer_store_dwordx2 v{{\[}}[[LORESULT]]:[[HIRESULT]]{{\]}},		; GCN: buffer_store_dwordx3 v{{\[}}[[LORESULT]]:[[HIRESULT]]{{\]}},
define amdgpu_kernel void @load_v3i8_to_v3f32(<3 x float> addrspace(1)* noalias %out, <3 x i8> addrspace(1)* noalias %in) nounwind {		define amdgpu_kernel void @load_v3i8_to_v3f32(<3 x float> addrspace(1)* noalias %out, <3 x i8> addrspace(1)* noalias %in) nounwind {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <3 x i8>, <3 x i8> addrspace(1)* %in, i32 %tid		%gep = getelementptr <3 x i8>, <3 x i8> addrspace(1)* %in, i32 %tid
%load = load <3 x i8>, <3 x i8> addrspace(1)* %gep, align 4		%load = load <3 x i8>, <3 x i8> addrspace(1)* %gep, align 4
%cvt = uitofp <3 x i8> %load to <3 x float>		%cvt = uitofp <3 x i8> %load to <3 x float>
store <3 x float> %cvt, <3 x float> addrspace(1)* %out, align 16		store <3 x float> %cvt, <3 x float> addrspace(1)* %out, align 16
ret void		ret void
}		}

test/CodeGen/AMDGPU/early-if-convert-cost.ll

	; GCN: v_add_i32_e32			; GCN: v_add_i32_e32
	; GCN: v_add_i32_e32			; GCN: v_add_i32_e32
	; GCN: s_mov_b64 vcc, [[CMP]]			; GCN: s_mov_b64 vcc, [[CMP]]

	; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc			; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc
	; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc			; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc
	; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc			; GCN: v_cndmask_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, vcc

	; GCN-DAG: buffer_store_dword v			; GCN-DAG: buffer_store_dwordx3
	; GCN-DAG: buffer_store_dwordx2
	define amdgpu_kernel void @test_vccnz_ifcvt_triangle96(<3 x i32> addrspace(1)* %out, <3 x i32> addrspace(1)* %in, float %cnd) #0 {			define amdgpu_kernel void @test_vccnz_ifcvt_triangle96(<3 x i32> addrspace(1)* %out, <3 x i32> addrspace(1)* %in, float %cnd) #0 {
	entry:			entry:
	%v = load <3 x i32>, <3 x i32> addrspace(1)* %in			%v = load <3 x i32>, <3 x i32> addrspace(1)* %in
	%cc = fcmp oeq float %cnd, 1.000000e+00			%cc = fcmp oeq float %cnd, 1.000000e+00
	br i1 %cc, label %if, label %endif			br i1 %cc, label %if, label %endif

	if:			if:
	%u = add <3 x i32> %v, %v			%u = add <3 x i32> %v, %v

test/CodeGen/AMDGPU/insert_vector_elt.ll

	; GCN-LABEL: {{^}}dynamic_insertelement_v3f32:			; GCN-LABEL: {{^}}dynamic_insertelement_v3f32:
	; GCN-DAG: v_mov_b32_e32 [[CONST:v[0-9]+]], 0x40a00000			; GCN-DAG: v_mov_b32_e32 [[CONST:v[0-9]+]], 0x40a00000
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC3:[^,]+]], [[IDX:s[0-9]+]], 2			; GCN-DAG: v_cmp_ne_u32_e64 [[CC3:[^,]+]], [[IDX:s[0-9]+]], 2
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC3]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC3]]
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC2:[^,]+]], [[IDX]], 1			; GCN-DAG: v_cmp_ne_u32_e64 [[CC2:[^,]+]], [[IDX]], 1
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC2]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC2]]
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC1:[^,]+]], [[IDX]], 0			; GCN-DAG: v_cmp_ne_u32_e64 [[CC1:[^,]+]], [[IDX]], 0
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC1]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, [[CONST]], v{{[0-9]+}}, [[CC1]]
	; GCN-DAG: buffer_store_dwordx2 v			; GCN-DAG: buffer_store_dwordx3 v
	; GCN-DAG: buffer_store_dword v
	define amdgpu_kernel void @dynamic_insertelement_v3f32(<3 x float> addrspace(1)* %out, <3 x float> %a, i32 %b) nounwind {			define amdgpu_kernel void @dynamic_insertelement_v3f32(<3 x float> addrspace(1)* %out, <3 x float> %a, i32 %b) nounwind {
	%vecins = insertelement <3 x float> %a, float 5.000000e+00, i32 %b			%vecins = insertelement <3 x float> %a, float 5.000000e+00, i32 %b
	store <3 x float> %vecins, <3 x float> addrspace(1)* %out, align 16			store <3 x float> %vecins, <3 x float> addrspace(1)* %out, align 16
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}dynamic_insertelement_v4f32:			; GCN-LABEL: {{^}}dynamic_insertelement_v4f32:
	; GCN-DAG: v_mov_b32_e32 [[CONST:v[0-9]+]], 0x40a00000			; GCN-DAG: v_mov_b32_e32 [[CONST:v[0-9]+]], 0x40a00000

	; GCN-LABEL: {{^}}dynamic_insertelement_v3i32:			; GCN-LABEL: {{^}}dynamic_insertelement_v3i32:
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC3:[^,]+]], [[IDX:s[0-9]+]], 2			; GCN-DAG: v_cmp_ne_u32_e64 [[CC3:[^,]+]], [[IDX:s[0-9]+]], 2
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC3]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC3]]
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC2:[^,]+]], [[IDX]], 1			; GCN-DAG: v_cmp_ne_u32_e64 [[CC2:[^,]+]], [[IDX]], 1
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC2]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC2]]
	; GCN-DAG: v_cmp_ne_u32_e64 [[CC1:[^,]+]], [[IDX]], 0			; GCN-DAG: v_cmp_ne_u32_e64 [[CC1:[^,]+]], [[IDX]], 0
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC1]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 5, v{{[0-9]+}}, [[CC1]]
	; GCN-DAG: buffer_store_dwordx2 v			; GCN-DAG: buffer_store_dwordx3 v
	; GCN-DAG: buffer_store_dword v
	define amdgpu_kernel void @dynamic_insertelement_v3i32(<3 x i32> addrspace(1)* %out, <3 x i32> %a, i32 %b) nounwind {			define amdgpu_kernel void @dynamic_insertelement_v3i32(<3 x i32> addrspace(1)* %out, <3 x i32> %a, i32 %b) nounwind {
	%vecins = insertelement <3 x i32> %a, i32 5, i32 %b			%vecins = insertelement <3 x i32> %a, i32 5, i32 %b
	store <3 x i32> %vecins, <3 x i32> addrspace(1)* %out, align 16			store <3 x i32> %vecins, <3 x i32> addrspace(1)* %out, align 16
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}dynamic_insertelement_v4i32:			; GCN-LABEL: {{^}}dynamic_insertelement_v4i32:
	; GCN: s_load_dword [[SVAL:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, {{0x11|0x44}}			; GCN: s_load_dword [[SVAL:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, {{0x11|0x44}}

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.ll

Show First 20 Lines • Show All 187 Lines • ▼ Show 20 Lines	main_body:
%r1 = extractelement <2 x float> %vr1, i32 0		%r1 = extractelement <2 x float> %vr1, i32 0
%r2 = extractelement <2 x float> %vr1, i32 1		%r2 = extractelement <2 x float> %vr1, i32 1
%r3 = extractelement <2 x float> %vr2, i32 0		%r3 = extractelement <2 x float> %vr2, i32 0
%r4 = extractelement <2 x float> %vr2, i32 1		%r4 = extractelement <2 x float> %vr2, i32 1
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true)		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true)
ret void		ret void
}		}

		;CHECK-LABEL: {{^}}buffer_load_x3_offen_merged:
		;CHECK-NEXT: %bb.
		;CHECK-NEXT: buffer_load_dwordx3 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:4
		;CHECK: s_waitcnt
		define amdgpu_ps void @buffer_load_x3_offen_merged(<4 x i32> inreg %rsrc, i32 %a) {
		main_body:
		%a1 = add i32 %a, 4
		%a2 = add i32 %a, 12
		%vr1 = call <2 x float> @llvm.amdgcn.buffer.load.v2f32(<4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)
		%r3 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)
		%r1 = extractelement <2 x float> %vr1, i32 0
		%r2 = extractelement <2 x float> %vr1, i32 1
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float undef, i1 true, i1 true)
		ret void
		}

;CHECK-LABEL: {{^}}buffer_load_x1_offset_merged:		;CHECK-LABEL: {{^}}buffer_load_x1_offset_merged:
;CHECK-NEXT: %bb.		;CHECK-NEXT: %bb.
;CHECK-NEXT: buffer_load_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4		;CHECK-NEXT: buffer_load_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
;CHECK-NEXT: buffer_load_dwordx2 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:28		;CHECK-NEXT: buffer_load_dwordx2 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:28
;CHECK: s_waitcnt		;CHECK: s_waitcnt
define amdgpu_ps void @buffer_load_x1_offset_merged(<4 x i32> inreg %rsrc) {		define amdgpu_ps void @buffer_load_x1_offset_merged(<4 x i32> inreg %rsrc) {
main_body:		main_body:
%r1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)		%r1 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
Show All 18 Lines	main_body:
%r1 = extractelement <2 x float> %vr1, i32 0		%r1 = extractelement <2 x float> %vr1, i32 0
%r2 = extractelement <2 x float> %vr1, i32 1		%r2 = extractelement <2 x float> %vr1, i32 1
%r3 = extractelement <2 x float> %vr2, i32 0		%r3 = extractelement <2 x float> %vr2, i32 0
%r4 = extractelement <2 x float> %vr2, i32 1		%r4 = extractelement <2 x float> %vr2, i32 1
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true)		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true)
ret void		ret void
}		}

		;CHECK-LABEL: {{^}}buffer_load_x3_offset_merged:
		;CHECK-NEXT: %bb.
		;CHECK-NEXT: buffer_load_dwordx3 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
		;CHECK: s_waitcnt
		define amdgpu_ps void @buffer_load_x3_offset_merged(<4 x i32> inreg %rsrc) {
		main_body:
		%vr1 = call <2 x float> @llvm.amdgcn.buffer.load.v2f32(<4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
		%r3 = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)
		%r1 = extractelement <2 x float> %vr1, i32 0
		%r2 = extractelement <2 x float> %vr1, i32 1
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float undef, i1 true, i1 true)
		ret void
		}

declare float @llvm.amdgcn.buffer.load.f32(<4 x i32>, i32, i32, i1, i1) #0		declare float @llvm.amdgcn.buffer.load.f32(<4 x i32>, i32, i32, i1, i1) #0
declare <2 x float> @llvm.amdgcn.buffer.load.v2f32(<4 x i32>, i32, i32, i1, i1) #0		declare <2 x float> @llvm.amdgcn.buffer.load.v2f32(<4 x i32>, i32, i32, i1, i1) #0
declare <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32>, i32, i32, i1, i1) #0		declare <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32>, i32, i32, i1, i1) #0
declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0		declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0

attributes #0 = { nounwind readonly }		attributes #0 = { nounwind readonly }

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll

	Show First 20 Lines • Show All 141 Lines • ▼ Show 20 Lines
	define amdgpu_ps void @buffer_store_x2_offen_merged(<4 x i32> inreg %rsrc, i32 %a, <2 x float> %v1, <2 x float> %v2) {			define amdgpu_ps void @buffer_store_x2_offen_merged(<4 x i32> inreg %rsrc, i32 %a, <2 x float> %v1, <2 x float> %v2) {
	%a1 = add i32 %a, 4			%a1 = add i32 %a, 4
	%a2 = add i32 %a, 12			%a2 = add i32 %a, 12
	call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)
	ret void			ret void
	}			}

				;CHECK-LABEL: {{^}}buffer_store_x3_offen_merged:
				;CHECK-NOT: s_waitcnt
				;CHECK: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:28
				define amdgpu_ps void @buffer_store_x3_offen_merged(<4 x i32> inreg %rsrc, i32 %a, float %v1, float %v2, float %v3) {
				%a1 = add i32 %a, 28
				%a2 = add i32 %a, 32
				%a3 = add i32 %a, 36
				call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v3, <4 x i32> %rsrc, i32 0, i32 %a3, i1 0, i1 0)
				ret void
				}

				;CHECK-LABEL: {{^}}buffer_store_x3_offen_merged2:
				;CHECK-NOT: s_waitcnt
				;CHECK: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:4
				define amdgpu_ps void @buffer_store_x3_offen_merged2(<4 x i32> inreg %rsrc, i32 %a, <2 x float> %v1, float %v2) {
				%a1 = add i32 %a, 4
				%a2 = add i32 %a, 12
				call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)
				ret void
				}

				;CHECK-LABEL: {{^}}buffer_store_x3_offen_merged3:
				;CHECK-NOT: s_waitcnt
				;CHECK: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:4
				define amdgpu_ps void @buffer_store_x3_offen_merged3(<4 x i32> inreg %rsrc, i32 %a, float %v1, <2 x float> %v2) {
				%a1 = add i32 %a, 4
				%a2 = add i32 %a, 8
				call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 %a1, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 %a2, i1 0, i1 0)
				ret void
				}

	;CHECK-LABEL: {{^}}buffer_store_x1_offset_merged:			;CHECK-LABEL: {{^}}buffer_store_x1_offset_merged:
	;CHECK-NOT: s_waitcnt			;CHECK-NOT: s_waitcnt
	;CHECK-DAG: buffer_store_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4			;CHECK-DAG: buffer_store_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
	;CHECK-DAG: buffer_store_dwordx2 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:28			;CHECK-DAG: buffer_store_dwordx2 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:28
	define amdgpu_ps void @buffer_store_x1_offset_merged(<4 x i32> inreg %rsrc, float %v1, float %v2, float %v3, float %v4, float %v5, float %v6) {			define amdgpu_ps void @buffer_store_x1_offset_merged(<4 x i32> inreg %rsrc, float %v1, float %v2, float %v3, float %v4, float %v5, float %v6) {
	call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 8, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 8, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.f32(float %v3, <4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v3, <4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.f32(float %v4, <4 x i32> %rsrc, i32 0, i32 16, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v4, <4 x i32> %rsrc, i32 0, i32 16, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.f32(float %v5, <4 x i32> %rsrc, i32 0, i32 28, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v5, <4 x i32> %rsrc, i32 0, i32 28, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.f32(float %v6, <4 x i32> %rsrc, i32 0, i32 32, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.f32(float %v6, <4 x i32> %rsrc, i32 0, i32 32, i1 0, i1 0)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}buffer_store_x2_offset_merged:			;CHECK-LABEL: {{^}}buffer_store_x2_offset_merged:
	;CHECK-NOT: s_waitcnt			;CHECK-NOT: s_waitcnt
	;CHECK: buffer_store_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4			;CHECK: buffer_store_dwordx4 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
	define amdgpu_ps void @buffer_store_x2_offset_merged(<4 x i32> inreg %rsrc, <2 x float> %v1,<2 x float> %v2) {			define amdgpu_ps void @buffer_store_x2_offset_merged(<4 x i32> inreg %rsrc, <2 x float> %v1, <2 x float> %v2) {
	call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
	call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)			call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)
	ret void			ret void
	}			}

				;CHECK-LABEL: {{^}}buffer_store_x3_offset_merged:
				;CHECK-NOT: s_waitcnt
				;CHECK-DAG: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
				define amdgpu_ps void @buffer_store_x3_offset_merged(<4 x i32> inreg %rsrc, float %v1, float %v2, float %v3) {
				call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 8, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v3, <4 x i32> %rsrc, i32 0, i32 12, i1 0, i1 0)
				ret void
				}

				;CHECK-LABEL: {{^}}buffer_store_x3_offset_merged2:
				;CHECK-NOT: s_waitcnt
				;CHECK-DAG: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:4
				define amdgpu_ps void @buffer_store_x3_offset_merged2(<4 x i32> inreg %rsrc, float %v1, <2 x float> %v2) {
				call void @llvm.amdgcn.buffer.store.f32(float %v1, <4 x i32> %rsrc, i32 0, i32 4, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v2, <4 x i32> %rsrc, i32 0, i32 8, i1 0, i1 0)
				ret void
				}

				;CHECK-LABEL: {{^}}buffer_store_x3_offset_merged3:
				;CHECK-NOT: s_waitcnt
				;CHECK-DAG: buffer_store_dwordx3 v[{{[0-9]}}:{{[0-9]}}], off, s[0:3], 0 offset:8
				define amdgpu_ps void @buffer_store_x3_offset_merged3(<4 x i32> inreg %rsrc, <2 x float> %v1, float %v2) {
				call void @llvm.amdgcn.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 0, i32 8, i1 0, i1 0)
				call void @llvm.amdgcn.buffer.store.f32(float %v2, <4 x i32> %rsrc, i32 0, i32 16, i1 0, i1 0)
				ret void
				}
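
The `x3` store tests above cover both orderings of an x1 and an x2 store, and in each case the expected offset is the lower of the two. The arithmetic the tests check can be sketched as follows. `BufferAccess` and `tryMergePair` are hypothetical names, offsets are in bytes, and the sketch ignores everything else the optimizer must verify (matching resource and address operands, cache bits, intervening instructions).

```cpp
#include <cassert>
#include <utility>

// Hypothetical description of one buffer access: byte offset and the number
// of dwords it touches.
struct BufferAccess {
  unsigned Offset;
  unsigned Dwords;
};

// Merge two accesses when they are contiguous in either order and the result
// is at most four dwords wide. The merged access starts at the lower offset,
// mirroring std::min(CI.Offset0, CI.Offset1) in the patch.
bool tryMergePair(BufferAccess A, BufferAccess B, BufferAccess &Out) {
  if (A.Offset > B.Offset)
    std::swap(A, B);
  if (A.Offset + 4 * A.Dwords != B.Offset)
    return false; // a gap or an overlap between the two accesses
  unsigned Dwords = A.Dwords + B.Dwords;
  if (Dwords > 4)
    return false; // no buffer instruction wider than x4
  Out = {A.Offset, Dwords};
  return true;
}
```

For example, an x2 store at byte offset 8 followed by an x1 store at offset 16 yields a three-dword access at offset 8, which corresponds to the `buffer_store_dwordx3 ... offset:8` line expected by `buffer_store_x3_offset_merged3`.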

	declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1) #0			declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1) #0
	declare void @llvm.amdgcn.buffer.store.v2f32(<2 x float>, <4 x i32>, i32, i32, i1, i1) #0			declare void @llvm.amdgcn.buffer.store.v2f32(<2 x float>, <4 x i32>, i32, i32, i1, i1) #0
	declare void @llvm.amdgcn.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i1, i1) #0			declare void @llvm.amdgcn.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i1, i1) #0
	declare <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32>, i32, i32, i1, i1) #1			declare <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32>, i32, i32, i1, i1) #1

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind readonly }			attributes #1 = { nounwind readonly }

test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll

This file was added.

				;RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck %s

				;CHECK-LABEL: {{^}}s_buffer_load_imm:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x4
				define amdgpu_ps void @s_buffer_load_imm(<4 x i32> inreg %desc) {
				main_body:
				%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)
				%bitcast = bitcast i32 %load to float
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_load_index:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
				define amdgpu_ps void @s_buffer_load_index(<4 x i32> inreg %desc, i32 inreg %index) {
				main_body:
				%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast i32 %load to float
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_loadx2_imm:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x40
				define amdgpu_ps void @s_buffer_loadx2_imm(<4 x i32> inreg %desc) {
				main_body:
				%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 64, i32 0)
				%bitcast = bitcast <2 x i32> %load to <2 x float>
				%x = extractelement <2 x float> %bitcast, i32 0
				%y = extractelement <2 x float> %bitcast, i32 1
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_loadx2_index:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
				define amdgpu_ps void @s_buffer_loadx2_index(<4 x i32> inreg %desc, i32 inreg %index) {
				main_body:
				%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <2 x i32> %load to <2 x float>
				%x = extractelement <2 x float> %bitcast, i32 0
				%y = extractelement <2 x float> %bitcast, i32 1
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_loadx4_imm:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0xc8
				define amdgpu_ps void @s_buffer_loadx4_imm(<4 x i32> inreg %desc) {
				main_body:
				%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 200, i32 0)
				%bitcast = bitcast <4 x i32> %load to <4 x float>
				%x = extractelement <4 x float> %bitcast, i32 0
				%y = extractelement <4 x float> %bitcast, i32 1
				%z = extractelement <4 x float> %bitcast, i32 2
				%w = extractelement <4 x float> %bitcast, i32 3
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_loadx4_index:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
				define amdgpu_ps void @s_buffer_loadx4_index(<4 x i32> inreg %desc, i32 inreg %index) {
				main_body:
				%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <4 x i32> %load to <4 x float>
				%x = extractelement <4 x float> %bitcast, i32 0
				%y = extractelement <4 x float> %bitcast, i32 1
				%z = extractelement <4 x float> %bitcast, i32 2
				%w = extractelement <4 x float> %bitcast, i32 3
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_load_imm_mergex2:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x4
				define amdgpu_ps void @s_buffer_load_imm_mergex2(<4 x i32> inreg %desc) {
				main_body:
				%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)
				%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)
				%x = bitcast i32 %load0 to float
				%y = bitcast i32 %load1 to float
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;CHECK-LABEL: {{^}}s_buffer_load_imm_mergex4:
				;CHECK-NOT: s_waitcnt;
				;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x8
				define amdgpu_ps void @s_buffer_load_imm_mergex4(<4 x i32> inreg %desc) {
				main_body:
				%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)
				%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 12, i32 0)
				%load2 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 16, i32 0)
				%load3 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 20, i32 0)
				%x = bitcast i32 %load0 to float
				%y = bitcast i32 %load1 to float
				%z = bitcast i32 %load2 to float
				%w = bitcast i32 %load3 to float
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
				ret void
				}

				declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1)
				declare i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32>, i32, i32)
				declare <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32>, i32, i32)
				declare <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32>, i32, i32)

test/CodeGen/AMDGPU/merge-stores.ll

[158 lines not shown; inside @merge_global_store_4_constants_mixed_i32_f32]
 store i32 11, i32 addrspace(1)* %out.gep.1.bc
 store float 2.0, float addrspace(1)* %out.gep.2
 store i32 17, i32 addrspace(1)* %out.gep.3.bc
 store float 8.0, float addrspace(1)* %out
 ret void
 }

 ; GCN-LABEL: {{^}}merge_global_store_3_constants_i32:
-; SI-DAG: buffer_store_dwordx2
-; SI-DAG: buffer_store_dword
+; SI-DAG: buffer_store_dwordx3
+; SI-NOT: buffer_store_dwordx2
 ; SI-NOT: buffer_store_dword
 ; GCN: s_endpgm
 define amdgpu_kernel void @merge_global_store_3_constants_i32(i32 addrspace(1)* %out) #0 {
 %out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
 %out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2

 store i32 123, i32 addrspace(1)* %out.gep.1
 store i32 456, i32 addrspace(1)* %out.gep.2
[92 lines not shown; inside @merge_global_store_4_adjacent_loads_i32]
 store i32 %x, i32 addrspace(1)* %out
 store i32 %y, i32 addrspace(1)* %out.gep.1
 store i32 %z, i32 addrspace(1)* %out.gep.2
 store i32 %w, i32 addrspace(1)* %out.gep.3
 ret void
 }

 ; GCN-LABEL: {{^}}merge_global_store_3_adjacent_loads_i32:
-; SI-DAG: buffer_load_dwordx2
-; SI-DAG: buffer_load_dword v
+; SI-DAG: buffer_load_dwordx3
 ; GCN: s_waitcnt
-; SI-DAG: buffer_store_dword v
-; SI-DAG: buffer_store_dwordx2 v
+; SI-DAG: buffer_store_dwordx3 v
 ; GCN: s_endpgm
 define amdgpu_kernel void @merge_global_store_3_adjacent_loads_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
 %out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
 %out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
 %in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1
 %in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 2

 %x = load i32, i32 addrspace(1)* %in
[268 lines not shown; inside @merge_global_store_6_constants_i32]
 store i32 11, i32 addrspace(1)* %idx4, align 4
 %idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5
 store i32 123, i32 addrspace(1)* %idx5, align 4
 ret void
 }

 ; GCN-LABEL: {{^}}merge_global_store_7_constants_i32:
 ; GCN: buffer_store_dwordx4
-; GCN: buffer_store_dwordx2
-; GCN: buffer_store_dword v
+; GCN: buffer_store_dwordx3
 define amdgpu_kernel void @merge_global_store_7_constants_i32(i32 addrspace(1)* %out) {
 store i32 34, i32 addrspace(1)* %out, align 4
 %idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
 store i32 999, i32 addrspace(1)* %idx1, align 4
 %idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
 store i32 65, i32 addrspace(1)* %idx2, align 4
 %idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
 store i32 33, i32 addrspace(1)* %idx3, align 4
[30 lines not shown]
 }

 ; This requires handling of scalar_to_vector for v2i64 to avoid
 ; scratch usage.
 ; FIXME: Should do single load and store

 ; GCN-LABEL: {{^}}copy_v3i32_align4:
 ; GCN-NOT: SCRATCH_RSRC_DWORD
-; GCN-DAG: buffer_load_dword v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:8
-; GCN-DAG: buffer_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
+; GCN-DAG: buffer_load_dwordx3 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
 ; GCN-NOT: offen
 ; GCN: s_waitcnt vmcnt
 ; GCN-NOT: offen
-; GCN-DAG: buffer_store_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
-; GCN-DAG: buffer_store_dword v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:8
+; GCN-DAG: buffer_store_dwordx3 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}

 ; GCN: ScratchSize: 0{{$}}
 define amdgpu_kernel void @copy_v3i32_align4(<3 x i32> addrspace(1)* noalias %out, <3 x i32> addrspace(1)* noalias %in) #0 {
 %vec = load <3 x i32>, <3 x i32> addrspace(1)* %in, align 4
 store <3 x i32> %vec, <3 x i32> addrspace(1)* %out
 ret void
 }

[10 lines not shown]
 define amdgpu_kernel void @copy_v3i64_align4(<3 x i64> addrspace(1)* noalias %out, <3 x i64> addrspace(1)* noalias %in) #0 {
 %vec = load <3 x i64>, <3 x i64> addrspace(1)* %in, align 4
 store <3 x i64> %vec, <3 x i64> addrspace(1)* %out
 ret void
 }

 ; GCN-LABEL: {{^}}copy_v3f32_align4:
 ; GCN-NOT: SCRATCH_RSRC_DWORD
-; GCN-DAG: buffer_load_dword v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:8
-; GCN-DAG: buffer_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
+; GCN-DAG: buffer_load_dwordx3 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
 ; GCN-NOT: offen
 ; GCN: s_waitcnt vmcnt
 ; GCN-NOT: offen
-; GCN-DAG: buffer_store_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
-; GCN-DAG: buffer_store_dword v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:8
+; GCN-DAG: buffer_store_dwordx3 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
 ; GCN: ScratchSize: 0{{$}}
 define amdgpu_kernel void @copy_v3f32_align4(<3 x float> addrspace(1)* noalias %out, <3 x float> addrspace(1)* noalias %in) #0 {
 %vec = load <3 x float>, <3 x float> addrspace(1)* %in, align 4
 %fadd = fadd <3 x float> %vec, <float 1.0, float 2.0, float 4.0>
 store <3 x float> %fadd, <3 x float> addrspace(1)* %out
 ret void
 }

[21 lines not shown]

test/CodeGen/AMDGPU/store-global.ll

[267 lines not shown]
 entry:
 %0 = insertelement <2 x float> <float 0.0, float 0.0>, float %a, i32 0
 %1 = insertelement <2 x float> %0, float %b, i32 1
 store <2 x float> %1, <2 x float> addrspace(1)* %out
 ret void
 }

 ; FUNC-LABEL: {{^}}store_v3i32:
-; SIVI-DAG: buffer_store_dwordx2
-; SIVI-DAG: buffer_store_dword v
+; SIVI-DAG: buffer_store_dwordx3

 ; GFX9-DAG: global_store_dwordx2
 ; GFX9-DAG: global_store_dword v

 ; EG-DAG: MEM_RAT_CACHELESS STORE_RAW {{T[0-9]+\.[XYZW]}}, {{T[0-9]+\.[XYZW]}},
 ; EG-DAG: MEM_RAT_CACHELESS STORE_RAW {{T[0-9]+\.XY}}, {{T[0-9]+\.[XYZW]}},
 define amdgpu_kernel void @store_v3i32(<3 x i32> addrspace(1)* %out, <3 x i32> %a) nounwind {
 store <3 x i32> %a, <3 x i32> addrspace(1)* %out, align 16
[120 lines not shown]

test/CodeGen/AMDGPU/store-v3i64.ll

[83 lines not shown]
 ; GCN: ds_write_b8
 ; GCN: ds_write_b8
 define amdgpu_kernel void @local_store_v3i64_unaligned(<3 x i64> addrspace(3)* %out, <3 x i64> %x) {
 store <3 x i64> %x, <3 x i64> addrspace(3)* %out, align 1
 ret void
 }

 ; GCN-LABEL: {{^}}global_truncstore_v3i64_to_v3i32:
-; GCN-DAG: buffer_store_dwordx2
-; GCN-DAG: buffer_store_dword v
+; GCN-DAG: buffer_store_dwordx3
 define amdgpu_kernel void @global_truncstore_v3i64_to_v3i32(<3 x i32> addrspace(1)* %out, <3 x i64> %x) {
 %trunc = trunc <3 x i64> %x to <3 x i32>
 store <3 x i32> %trunc, <3 x i32> addrspace(1)* %out
 ret void
 }

 ; GCN-LABEL: {{^}}global_truncstore_v3i64_to_v3i16:
 ; GCN-DAG: buffer_store_short
[26 lines not shown]

Revision Contents

Diff 176566

lib/Target/AMDGPU/BUFInstructions.td

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

test/CodeGen/AMDGPU/cvt_f32_ubyte.ll

test/CodeGen/AMDGPU/early-if-convert-cost.ll

test/CodeGen/AMDGPU/insert_vector_elt.ll

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.ll

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll

test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll

test/CodeGen/AMDGPU/merge-stores.ll

test/CodeGen/AMDGPU/store-global.ll

test/CodeGen/AMDGPU/store-v3i64.ll
