This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add a function attribute that shrinks buggy s_buffer opcodes on GFX9
Abandoned, Public

Authored by mareko on Jan 15 2018, 9:41 AM.

Details

Reviewers
arsenm
nhaehnle
Summary

The function attribute shouldn't be set for shaders running in an untrusted environment.

42952 affected shaders.
Code size in affected shaders: -12.64%

Diff Detail

Event Timeline

mareko created this revision. Jan 15 2018, 9:41 AM
arsenm added inline comments. Jan 15 2018, 10:04 AM
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
165–166

I don't understand why the function attribute is needed for a hardware bug. If this happens to help other targets that aren't working around the bug, we should add an optimization to do this somewhere.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

We shouldn't be putting bug workarounds in an optimization pass. This should probably be part of the initial selection.

mareko added inline comments. Jan 15 2018, 10:23 AM
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
165–166

Can you be more specific? I don't understand. Note that Mesa may set the function attribute differently for each shader, because it depends on whether the shader is trusted or untrusted.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

What initial selection are you talking about? Note that the placement of this pass is perfect, because the new code needs to run after SILoadStoreOptimizer.

arsenm added inline comments. Jan 15 2018, 10:29 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

One option would be to do this in the DAG and split the intrinsics before getting selected in the first place, or we could do this in AdjustInstrPostInstrSelection or in a new bug workaround pass.

Why does it need to be after SILoadStoreOptimizer? Is it just because it will try to merge and form these? If it's buggy it should just not do that in the first place.
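
For reference, the post-selection hook mentioned above has this shape in the AMDGPU backend. The signature is the real TargetLowering interface, but the body below is only an illustrative sketch with hypothetical helpers, not anything from this patch:

    // Real override point (SITargetLowering already implements this hook
    // for other post-selection fix-ups). Both helpers are hypothetical.
    void SITargetLowering::AdjustInstrPostInstrSelection(MachineInstr &MI,
                                                         SDNode *Node) const {
      if (needsGfx9SBufferWorkaround(MI)) // hypothetical predicate
        applyGfx9SBufferWorkaround(MI);   // hypothetical rewrite
    }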

mareko added inline comments. Jan 15 2018, 11:23 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

The intrinsic is translated into s_buffer_load_dword only. There are no intrinsics for the xN opcodes - this is actually the optimal situation because we have SILoadStoreOptimizer. SILoadStoreOptimizer merges s_buffer_load_dword into x2 and x4 - this is the only place that generates the xN opcodes. This new code needs to run after that to convert s_buffer_load into s_load for xN opcodes where N >= 2.
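
A minimal sketch of the flow described here, assuming a hypothetical attribute name ("amdgpu-buggy-s-buffer") and a hypothetical rewrite helper; the opcode names and iteration idioms are real LLVM, the rest is illustrative and not the actual patch:

    static void rewriteAsSLoad(MachineInstr &MI); // hypothetical helper

    // Only SILoadStoreOptimizer forms the xN opcodes, so matching them is
    // sufficient once this runs after that pass.
    static bool isMergedSBufferLoad(const MachineInstr &MI) {
      switch (MI.getOpcode()) {
      case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX8_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX16_IMM:
        return true;
      default:
        return false;
      }
    }

    static void workAroundGfx9SBufferBug(MachineFunction &MF) {
      // Untrusted shaders never set the attribute (see the summary), so
      // they keep the s_buffer form.
      if (!MF.getFunction().hasFnAttribute("amdgpu-buggy-s-buffer"))
        return;
      for (MachineBasicBlock &MBB : MF)
        for (MachineInstr &MI : llvm::make_early_inc_range(MBB))
          if (isMergedSBufferLoad(MI))
            rewriteAsSLoad(MI); // replace with the equivalent s_load_dwordxN
    }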

dstuttard added inline comments. Jan 17 2018, 1:05 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

I'm planning to add intrinsics that will allow multi-dword s_buffer_loads, as this is considerably easier for our front-end. However, I guess the placement of this work-around means that this won't matter.

I agree that placing the work-around after SILoadStoreOptimizer looks like the best option, but whether it should be a separate bug workaround pass as Matt suggests is arguable - Matt, do you think it would be cleaner to have it pulled into a separate pass?

nhaehnle added inline comments. Jan 24 2018, 9:18 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?

By the way, we may want to assume that buffers are aligned to e.g. 16 bytes and make this dependent on the alignment, e.g. if the low 4 bits of the saddr are known to be zero, we should still be able to use x4.
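
For constant offsets, the suggested check reduces to a low-bit test. A tiny sketch, with illustrative names:

    // A byte offset is a multiple of 4*N iff its low log2(4*N) bits are
    // zero; for an x4 load (N = 4) that means the low 4 bits.
    static bool offsetAllowsXN(uint64_t ByteOffset, unsigned N) {
      return (ByteOffset & (4 * N - 1)) == 0;
    }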

mareko added inline comments. Jan 25 2018, 12:11 PM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

> Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?

That's the current behavior, and it increases code size by 14.4%.

Loads are only guaranteed to be aligned to 4 bytes.

nhaehnle accepted this revision. Jan 26 2018, 8:15 AM

Okay, I see it now, and I can live with it. Please still consider the alternative alignment approach - it is a bit more restrictive in what it can do, but it could be enabled unconditionally.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

> Loads are only guaranteed to be aligned to 4 bytes.

We could easily increase the required alignment on UBOs in radeonsi, let's say to 16 or even 64 bytes. I believe some other drivers set it to 256 bytes, which is the maximum allowed by the OpenGL spec. Then checking alignment for constant offsets becomes trivial, and for non-constant offsets we could still use computeKnownBits.
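
A sketch of the non-constant case using computeKnownBits, written against the current SelectionDAG API (this is not part of the patch under review):

    #include "llvm/CodeGen/SelectionDAG.h"
    #include "llvm/Support/KnownBits.h"
    using namespace llvm;

    // Offset is the byte-offset operand feeding the s_buffer_load.
    static bool offsetKnown16Aligned(SelectionDAG &DAG, SDValue Offset) {
      KnownBits Known = DAG.computeKnownBits(Offset);
      // Low 4 bits known zero => the offset is provably a multiple of 16,
      // so forming an x4 load would be safe under a 16-byte UBO alignment.
      return Known.countMinTrailingZeros() >= 4;
    }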

This revision is now accepted and ready to land. Jan 26 2018, 8:15 AM
mareko added inline comments. Jan 28 2018, 4:23 PM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

The problem is that even if we set the required UBO alignment to 16, there is no guarantee that s_buffer_load_dwordx4 will be aligned to 16: s_buffer_load_dwordx4 is constructed from multiple s_buffer_load_dword instructions that load consecutive dwords, and nothing guarantees that the first one is aligned to 16. Of course, we could enforce that the literal offset be a multiple of 16, but I think that would decrease the number of opportunities for load opcode merging.
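
A concrete illustration (hypothetical offsets) of how a merge can end up under-aligned even with a 16-byte-aligned UBO:

    s_buffer_load_dword   off:4
    s_buffer_load_dword   off:8
    s_buffer_load_dword   off:12
    s_buffer_load_dword   off:16
      -> merged: s_buffer_load_dwordx4 off:4   ; 4 is not a multiple of 16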

nhaehnle added inline comments. Jan 30 2018, 2:53 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

Yes, that's precisely what I meant. It provides fewer opportunities, yes, but it could be enabled unconditionally without a special flag, so it's still a useful trade-off.

mareko added inline comments. Jan 30 2018, 4:10 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

I'm a little concerned about app compatibility if we increase the alignment. 256 bytes was for ancient hardware.

arsenm requested changes to this revision. Feb 1 2018, 9:18 AM
arsenm added inline comments.
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

This cannot be in this optimization pass. This must be done somewhere else.

This revision now requires changes to proceed. Feb 1 2018, 9:18 AM
mareko abandoned this revision. Feb 1 2018, 9:38 AM

The workaround is not needed.