This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add MIMG NSA threshold configuration attribute
ClosedPublic

Authored by critson on Sep 27 2022, 6:31 PM.

Details

Reviewers

foad
rampitec

Commits

rG266b5dbc5dd4: [AMDGPU] Add MIMG NSA threshold configuration attribute

Summary

Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.
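
As a rough illustration of how the two configuration sources interact, the sketch below models the precedence used by `GCNSubtarget::getNSAThreshold` in this diff: the command-line flag wins, then a positive function attribute, then the default of 3, with explicit values clamped to at least 2. The function name `resolveNsaThreshold` and its parameters are hypothetical stand-ins; the real code reads a `cl::opt` and a function attribute rather than plain arguments.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model of the lookup order in this patch (not LLVM code).
// CmdLineSet/CmdLineValue stand in for -amdgpu-nsa-threshold;
// AttrValue stands in for the "amdgpu-nsa-threshold" function attribute,
// with -1 meaning "attribute not present".
inline unsigned resolveNsaThreshold(bool CmdLineSet, unsigned CmdLineValue,
                                    int AttrValue) {
  if (CmdLineSet)
    return std::max(CmdLineValue, 2u); // command line overrides everything
  if (AttrValue > 0)
    return static_cast<unsigned>(std::max(AttrValue, 2));
  return 3; // default threshold
}

// Usage checks for the override order described above.
inline void checkResolveExamples() {
  assert(resolveNsaThreshold(false, 0, -1) == 3); // nothing set: default
  assert(resolveNsaThreshold(false, 0, 2) == 2);  // attribute only
  assert(resolveNsaThreshold(true, 4, 2) == 4);   // flag overrides attribute
  assert(resolveNsaThreshold(true, 1, -1) == 2);  // clamped to at least 2
}
```

The clamp to 2 reflects the `std::max(..., 2)` calls in the patch; a threshold below 2 is never returned.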

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. Sep 27 2022, 6:31 PM

Herald added a project: Restricted Project. Sep 27 2022, 6:31 PM

Herald added subscribers: kosarev, kerbowa, hiraditya and 8 others.

critson requested review of this revision. Sep 27 2022, 6:31 PM

Herald added a project: Restricted Project. Sep 27 2022, 6:31 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B189060: Diff 463388. Sep 27 2022, 7:09 PM

Seems reasonable.

I would also be in favour of changing the default to 2. That would tend to introduce fewer "mov"s to shuffle data around, at the expense of using a larger encoding on average for mimg instructions. But I think that is a good trade-off to make for the sake of performance.
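
To make the effect of the threshold concrete, here is a minimal sketch of the `UseNSA` condition from this diff: NSA is chosen when the subtarget supports it and the address count lies between the threshold and the maximum NSA size. The helper name `useNsa` is hypothetical, and the maximum size of 5 used in the checks is an illustrative assumption, not a value taken from the patch.

```cpp
#include <cassert>

// Hypothetical stand-alone version of the UseNSA predicate (not LLVM code).
inline bool useNsa(bool HasNsaEncoding, unsigned NumAddrs, unsigned Threshold,
                   unsigned MaxSize) {
  return HasNsaEncoding && NumAddrs >= Threshold && NumAddrs <= MaxSize;
}

// Usage checks; MaxSize = 5 is an assumed value for illustration only.
inline void checkUseNsaExamples() {
  const unsigned MaxSize = 5;
  assert(useNsa(true, 2, 2, MaxSize));   // 2D sample, threshold 2: NSA
  assert(!useNsa(true, 2, 3, MaxSize));  // 2D sample, threshold 3: packed
  assert(useNsa(true, 3, 3, MaxSize));   // 3D sample, threshold 3: NSA
  assert(!useNsa(true, 3, 4, MaxSize));  // 3D sample, threshold 4: packed
  assert(!useNsa(false, 3, 2, MaxSize)); // no NSA support on the subtarget
}
```

These cases line up with the new test file: with threshold 2 the 2D sample uses the `[v1, v0]` NSA form, while with threshold 3 it needs a `v_mov_b32` to build the contiguous `v[1:2]` operand.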

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
959	Use max()
963	Use max()

This revision is now accepted and ready to land. Sep 28 2022, 1:39 AM

In D134780#3820209, @foad wrote:

Seems reasonable.

I would also be in favour of changing the default to 2. That would tend to introduce fewer "mov"s to shuffle data around, at the expense of using a larger encoding on average for mimg instructions. But I think that is a good trade-off to make for the sake of performance.

Thank you for the quick review!
I did test the idea of using default 2, but I am not convinced yet that the benefit is universal enough.
Out of ~20000 pipelines, ~2000 had higher VGPR usage with threshold 2 and ~1000 had lower VGPR usage.
When VGPR usage decreases, it does so in more beneficial ways, though.
Spilling decreased in 2 pipelines, and only increased in 1.

Out of ~20000 pipelines, ~2000 had higher VGPR usage with threshold 2 and ~1000 had lower VGPR usage.

That's weird. I can't see why enabling NSA would consistently cause higher vgpr usage.

Address reviewer feedback

LGTM, thanks.

This revision was landed with ongoing or failed builds. Sep 28 2022, 4:04 AM

Closed by commit rG266b5dbc5dd4: [AMDGPU] Add MIMG NSA threshold configuration attribute (authored by critson).

This revision was automatically updated to reflect the committed changes.

critson added a commit: rG266b5dbc5dd4: [AMDGPU] Add MIMG NSA threshold configuration attribute.

Harbormaster completed remote builds in B189129: Diff 463490. Sep 28 2022, 4:30 AM

foad added inline comments. Sep 28 2022, 5:34 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
59	On the "don't repeat yourself" principle you could either remove the cl::init here...
959	I don't think you need the .getValue().
965	... or return NSAThreshold here.
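
One way to read the second suggestion is sketched below: the fallback returns the option's own initial value, so the literal 3 appears only where the option is created. `UnsignedOption`, `makeNsaThresholdOption`, and `nsaThresholdDry` are hypothetical names used for illustration; this is a simplified stand-in and not the real `cl::opt` interface.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a command-line option (not LLVM's cl::opt).
struct UnsignedOption {
  unsigned Value;       // initial value, overwritten when the flag is passed
  unsigned Occurrences; // number of times the flag appeared
};

// The default of 3 is written once, here.
inline UnsignedOption makeNsaThresholdOption() { return {3, 0}; }

// AttrValue is the function attribute value, or -1 when absent.
inline unsigned nsaThresholdDry(const UnsignedOption &Opt, int AttrValue) {
  if (Opt.Occurrences > 0)
    return std::max(Opt.Value, 2u);
  if (AttrValue > 0)
    return static_cast<unsigned>(std::max(AttrValue, 2));
  return Opt.Value; // untouched option still holds its initial value
}

// Usage checks.
inline void checkDryExamples() {
  UnsignedOption Opt = makeNsaThresholdOption();
  assert(nsaThresholdDry(Opt, -1) == 3); // default comes from the option
  assert(nsaThresholdDry(Opt, 2) == 2);  // attribute applies
  Opt.Value = 4;
  Opt.Occurrences = 1;
  assert(nsaThresholdDry(Opt, 2) == 4);  // flag overrides attribute
}
```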

Out of ~20000 pipelines, ~2000 had higher VGPR usage with threshold 2 and ~1000 had lower VGPR usage.

That's weird. I can't see why enabling NSA would consistently cause higher vgpr usage.

FYI I looked at one case where this happens and it was caused by GCNNSAReassign making strange (well, different) decisions. So now I need to try to understand what that pass does.

In D134780#3821260, @foad wrote:

Out of ~20000 pipelines, ~2000 had higher VGPR usage with threshold 2 and ~1000 had lower VGPR usage.

That's weird. I can't see why enabling NSA would consistently cause higher vgpr usage.

FYI I looked at one case where this happens and it was caused by GCNNSAReassign making strange (well, different) decisions. So now I need to try to understand what that pass does.

Yes, I think we probably need to revisit the logic in GCNNSAReassign.
I seem to remember it lacks some support for subregisters, which might be part of the issue.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
59	True, it would be nice to have this number only once.
959	Pretty sure it is required, because std::max does not know how to handle cl::opt.
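
The point about `.getValue()` can be illustrated with a small stand-alone sketch. `std::max` is a template over a single type `T`, and template argument deduction does not apply implicit conversions, so passing a wrapper object and an `unsigned` literal gives conflicting deductions. `UnsignedOpt` below is a hypothetical wrapper with an implicit conversion, used here as a stand-in; it is not LLVM's `cl::opt`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrapper that converts implicitly to unsigned (not cl::opt).
struct UnsignedOpt {
  unsigned Value;
  operator unsigned() const { return Value; }
  unsigned getValue() const { return Value; }
};

inline unsigned clampWithGetValue(const UnsignedOpt &Opt) {
  // Both arguments are unsigned, so T deduces cleanly as unsigned.
  return std::max(Opt.getValue(), 2u);
}

inline unsigned clampWithExplicitType(const UnsignedOpt &Opt) {
  // Naming T explicitly allows the implicit conversion on the first argument.
  return std::max<unsigned>(Opt, 2u);
}

// std::max(Opt, 2u) would not compile: T would have to be both UnsignedOpt
// and unsigned, and deduction does not consider user-defined conversions.

// Usage checks.
inline void checkClampExamples() {
  UnsignedOpt Low = {1u};
  UnsignedOpt High = {4u};
  assert(clampWithGetValue(Low) == 2);
  assert(clampWithGetValue(High) == 4);
  assert(clampWithExplicitType(Low) == 2);
  assert(clampWithExplicitType(High) == 4);
}
```

Either form works; the patch uses the `.getValue()` spelling.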

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp	7 lines
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp	15 lines
llvm/lib/Target/AMDGPU/GCNSubtarget.h	4 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	2 lines
llvm/test/CodeGen/AMDGPU/amdgpu-nsa-threshold.ll	285 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.nsa.ll	21 lines

Diff 463498

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp

[...]
/// want a selected instruction entering RegBankSelect. In order to avoid		/// want a selected instruction entering RegBankSelect. In order to avoid
/// defining a multitude of intermediate image instructions, directly hack on		/// defining a multitude of intermediate image instructions, directly hack on
/// the intrinsic's arguments. In cases like a16 addresses, this requires		/// the intrinsic's arguments. In cases like a16 addresses, this requires
/// padding now unnecessary arguments with $noreg.		/// padding now unnecessary arguments with $noreg.
bool AMDGPULegalizerInfo::legalizeImageIntrinsic(		bool AMDGPULegalizerInfo::legalizeImageIntrinsic(
MachineInstr &MI, MachineIRBuilder &B, GISelChangeObserver &Observer,		MachineInstr &MI, MachineIRBuilder &B, GISelChangeObserver &Observer,
const AMDGPU::ImageDimIntrinsicInfo *Intr) const {		const AMDGPU::ImageDimIntrinsicInfo *Intr) const {

		const MachineFunction &MF = *MI.getMF();
const unsigned NumDefs = MI.getNumExplicitDefs();		const unsigned NumDefs = MI.getNumExplicitDefs();
const unsigned ArgOffset = NumDefs + 1;		const unsigned ArgOffset = NumDefs + 1;
bool IsTFE = NumDefs == 2;		bool IsTFE = NumDefs == 2;
// We are only processing the operands of d16 image operations on subtargets		// We are only processing the operands of d16 image operations on subtargets
// that use the unpacked register layout, or need to repack the TFE result.		// that use the unpacked register layout, or need to repack the TFE result.

// TODO: Do we need to guard against already legalized intrinsics?		// TODO: Do we need to guard against already legalized intrinsics?
const AMDGPU::MIMGBaseOpcodeInfo *BaseOpcode =		const AMDGPU::MIMGBaseOpcodeInfo *BaseOpcode =
[...]	bool AMDGPULegalizerInfo::legalizeImageIntrinsic(
if (IsA16 || IsG16) {		if (IsA16 || IsG16) {
if (Intr->NumVAddrs > 1) {		if (Intr->NumVAddrs > 1) {
SmallVector<Register, 4> PackedRegs;		SmallVector<Register, 4> PackedRegs;

packImage16bitOpsToDwords(B, MI, PackedRegs, ArgOffset, Intr, IsA16,		packImage16bitOpsToDwords(B, MI, PackedRegs, ArgOffset, Intr, IsA16,
IsG16);		IsG16);

// See also below in the non-a16 branch		// See also below in the non-a16 branch
const bool UseNSA = ST.hasNSAEncoding() && PackedRegs.size() >= 3 &&		const bool UseNSA = ST.hasNSAEncoding() &&
		PackedRegs.size() >= ST.getNSAThreshold(MF) &&
PackedRegs.size() <= ST.getNSAMaxSize();		PackedRegs.size() <= ST.getNSAMaxSize();

if (!UseNSA && PackedRegs.size() > 1) {		if (!UseNSA && PackedRegs.size() > 1) {
LLT PackedAddrTy = LLT::fixed_vector(2 * PackedRegs.size(), 16);		LLT PackedAddrTy = LLT::fixed_vector(2 * PackedRegs.size(), 16);
auto Concat = B.buildConcatVectors(PackedAddrTy, PackedRegs);		auto Concat = B.buildConcatVectors(PackedAddrTy, PackedRegs);
PackedRegs[0] = Concat.getReg(0);		PackedRegs[0] = Concat.getReg(0);
PackedRegs.resize(1);		PackedRegs.resize(1);
}		}
[...]	if (IsA16 || IsG16) {
// do so, so force non-NSA for the common 2-address case as a heuristic.		// do so, so force non-NSA for the common 2-address case as a heuristic.
//		//
// SIShrinkInstructions will convert NSA encodings to non-NSA after register		// SIShrinkInstructions will convert NSA encodings to non-NSA after register
// allocation when possible.		// allocation when possible.
//		//
// TODO: we can actually allow partial NSA where the final register is a		// TODO: we can actually allow partial NSA where the final register is a
// contiguous set of the remaining addresses.		// contiguous set of the remaining addresses.
// This could help where there are more addresses than supported.		// This could help where there are more addresses than supported.
const bool UseNSA = ST.hasNSAEncoding() && CorrectedNumVAddrs >= 3 &&		const bool UseNSA = ST.hasNSAEncoding() &&
		CorrectedNumVAddrs >= ST.getNSAThreshold(MF) &&
CorrectedNumVAddrs <= ST.getNSAMaxSize();		CorrectedNumVAddrs <= ST.getNSAMaxSize();

if (!UseNSA && Intr->NumVAddrs > 1)		if (!UseNSA && Intr->NumVAddrs > 1)
convertImageAddrToPacked(B, MI, ArgOffset + Intr->VAddrStart,		convertImageAddrToPacked(B, MI, ArgOffset + Intr->VAddrStart,
Intr->NumVAddrs);		Intr->NumVAddrs);
}		}

int Flags = 0;		int Flags = 0;
[...]

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp

[...]	static cl::opt<bool> EnableVGPRIndexMode(
"amdgpu-vgpr-index-mode",		"amdgpu-vgpr-index-mode",
cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),		cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),
cl::init(false));		cl::init(false));

static cl::opt<bool> UseAA("amdgpu-use-aa-in-codegen",		static cl::opt<bool> UseAA("amdgpu-use-aa-in-codegen",
cl::desc("Enable the use of AA during codegen."),		cl::desc("Enable the use of AA during codegen."),
cl::init(true));		cl::init(true));

		static cl::opt<unsigned> NSAThreshold("amdgpu-nsa-threshold",
		cl::desc("Number of addresses from which to enable MIMG NSA."),
		cl::init(3), cl::Hidden);
		foad: On the "don't repeat yourself" principle you could either remove the cl::init here...
		critson: True, it would be nice to have this number only once.

GCNSubtarget::~GCNSubtarget() = default;		GCNSubtarget::~GCNSubtarget() = default;

GCNSubtarget &		GCNSubtarget &
GCNSubtarget::initializeSubtargetDependencies(const Triple &TT,		GCNSubtarget::initializeSubtargetDependencies(const Triple &TT,
StringRef GPU, StringRef FS) {		StringRef GPU, StringRef FS) {
// Determine default and user-specified characteristics		// Determine default and user-specified characteristics
//		//
// We want to be able to turn these off, but making this a subtarget feature		// We want to be able to turn these off, but making this a subtarget feature
[...]
}		}

std::unique_ptr<ScheduleDAGMutation>		std::unique_ptr<ScheduleDAGMutation>
GCNSubtarget::createFillMFMAShadowMutation(const TargetInstrInfo *TII) const {		GCNSubtarget::createFillMFMAShadowMutation(const TargetInstrInfo *TII) const {
return EnablePowerSched ? std::make_unique<FillMFMAShadowMutation>(&InstrInfo)		return EnablePowerSched ? std::make_unique<FillMFMAShadowMutation>(&InstrInfo)
: nullptr;		: nullptr;
}		}

		unsigned GCNSubtarget::getNSAThreshold(const MachineFunction &MF) const {
		if (NSAThreshold.getNumOccurrences() > 0)
		return std::max(NSAThreshold.getValue(), 2u);
		foad: Use max()
		foad: I don't think you need the .getValue().
		critson: Pretty sure it is required, because std::max does not know how to handle cl::opt.

		int Value = AMDGPU::getIntegerAttribute(MF.getFunction(), "amdgpu-nsa-threshold", -1);
		if (Value > 0)
		return std::max(Value, 2);
		foad: Use max()

		return 3;
		foad: ... or return NSAThreshold here.
		}

const AMDGPUSubtarget &AMDGPUSubtarget::get(const MachineFunction &MF) {		const AMDGPUSubtarget &AMDGPUSubtarget::get(const MachineFunction &MF) {
if (MF.getTarget().getTargetTriple().getArch() == Triple::amdgcn)		if (MF.getTarget().getTargetTriple().getArch() == Triple::amdgcn)
return static_cast<const AMDGPUSubtarget&>(MF.getSubtarget<GCNSubtarget>());		return static_cast<const AMDGPUSubtarget&>(MF.getSubtarget<GCNSubtarget>());
else		else
return static_cast<const AMDGPUSubtarget&>(MF.getSubtarget<R600Subtarget>());		return static_cast<const AMDGPUSubtarget&>(MF.getSubtarget<R600Subtarget>());
}		}

const AMDGPUSubtarget &AMDGPUSubtarget::get(const TargetMachine &TM, const Function &F) {		const AMDGPUSubtarget &AMDGPUSubtarget::get(const TargetMachine &TM, const Function &F) {
if (TM.getTargetTriple().getArch() == Triple::amdgcn)		if (TM.getTargetTriple().getArch() == Triple::amdgcn)
return static_cast<const AMDGPUSubtarget&>(TM.getSubtarget<GCNSubtarget>(F));		return static_cast<const AMDGPUSubtarget&>(TM.getSubtarget<GCNSubtarget>(F));
else		else
return static_cast<const AMDGPUSubtarget&>(TM.getSubtarget<R600Subtarget>(F));		return static_cast<const AMDGPUSubtarget&>(TM.getSubtarget<R600Subtarget>(F));
}		}

llvm/lib/Target/AMDGPU/GCNSubtarget.h

	[...]
	}			}

	void adjustSchedDependency(SUnit Def, int DefOpIdx, SUnit Use, int UseOpIdx,			void adjustSchedDependency(SUnit Def, int DefOpIdx, SUnit Use, int UseOpIdx,
	SDep &Dep) const override;			SDep &Dep) const override;

	// \returns true if it's beneficial on this subtarget for the scheduler to			// \returns true if it's beneficial on this subtarget for the scheduler to
	// cluster stores as well as loads.			// cluster stores as well as loads.
	bool shouldClusterStores() const { return getGeneration() >= GFX11; }			bool shouldClusterStores() const { return getGeneration() >= GFX11; }

				// \returns the number of address arguments from which to enable MIMG NSA
				// on supported architectures.
				unsigned getNSAThreshold(const MachineFunction &MF) const;
	};			};

	} // end namespace llvm			} // end namespace llvm

	#endif // LLVM_LIB_TARGET_AMDGPU_GCNSUBTARGET_H			#endif // LLVM_LIB_TARGET_AMDGPU_GCNSUBTARGET_H

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[...]	SDValue SITargetLowering::lowerImage(SDValue Op,
//		//
// SIShrinkInstructions will convert NSA encodings to non-NSA after register		// SIShrinkInstructions will convert NSA encodings to non-NSA after register
// allocation when possible.		// allocation when possible.
//		//
// TODO: we can actually allow partial NSA where the final register is a		// TODO: we can actually allow partial NSA where the final register is a
// contiguous set of the remaining addresses.		// contiguous set of the remaining addresses.
// This could help where there are more addresses than supported.		// This could help where there are more addresses than supported.
bool UseNSA = ST->hasFeature(AMDGPU::FeatureNSAEncoding) &&		bool UseNSA = ST->hasFeature(AMDGPU::FeatureNSAEncoding) &&
VAddrs.size() >= 3 &&		VAddrs.size() >= (unsigned)ST->getNSAThreshold(MF) &&
VAddrs.size() <= (unsigned)ST->getNSAMaxSize();		VAddrs.size() <= (unsigned)ST->getNSAMaxSize();
SDValue VAddr;		SDValue VAddr;
if (!UseNSA)		if (!UseNSA)
VAddr = getBuildDwordsVector(DAG, DL, VAddrs);		VAddr = getBuildDwordsVector(DAG, DL, VAddrs);

SDValue True = DAG.getTargetConstant(1, DL, MVT::i1);		SDValue True = DAG.getTargetConstant(1, DL, MVT::i1);
SDValue False = DAG.getTargetConstant(0, DL, MVT::i1);		SDValue False = DAG.getTargetConstant(0, DL, MVT::i1);
SDValue Unorm;		SDValue Unorm;
[...]

llvm/test/CodeGen/AMDGPU/amdgpu-nsa-threshold.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefix=ATTRIB %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-nsa-threshold=2 -verify-machineinstrs < %s | FileCheck -check-prefix=FORCE-2 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-nsa-threshold=3 -verify-machineinstrs < %s | FileCheck -check-prefix=FORCE-3 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-nsa-threshold=4 -verify-machineinstrs < %s | FileCheck -check-prefix=FORCE-4 %s

				; Note: command line argument should override function attribute.

				define amdgpu_ps <4 x float> @sample_2d_nsa2(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %t, float %s) #2 {
				; ATTRIB-LABEL: sample_2d_nsa2:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], [v1, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_2d_nsa2:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_2d_nsa2:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_2d_nsa2:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				define amdgpu_ps <4 x float> @sample_3d_nsa2(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %r, float %s, float %t) #2 {
				; ATTRIB-LABEL: sample_3d_nsa2:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_3d_nsa2:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_3d_nsa2:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_3d_nsa2:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v3, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:3], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.3d.v4f32.f32(i32 15, float %s, float %t, float %r, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				define amdgpu_ps <4 x float> @sample_2d_nsa3(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %t, float %s) #3 {
				; ATTRIB-LABEL: sample_2d_nsa3:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: v_mov_b32_e32 v2, v0
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_2d_nsa3:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_2d_nsa3:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_2d_nsa3:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				define amdgpu_ps <4 x float> @sample_3d_nsa3(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %r, float %s, float %t) #3 {
				; ATTRIB-LABEL: sample_3d_nsa3:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_3d_nsa3:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_3d_nsa3:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_3d_nsa3:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v3, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:3], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.3d.v4f32.f32(i32 15, float %s, float %t, float %r, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				define amdgpu_ps <4 x float> @sample_2d_nsa4(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %t, float %s) #4 {
				; ATTRIB-LABEL: sample_2d_nsa4:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: v_mov_b32_e32 v2, v0
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_2d_nsa4:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_2d_nsa4:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_2d_nsa4:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v2, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				define amdgpu_ps <4 x float> @sample_3d_nsa4(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %r, float %s, float %t) #4 {
				; ATTRIB-LABEL: sample_3d_nsa4:
				; ATTRIB: ; %bb.0: ; %main_body
				; ATTRIB-NEXT: s_mov_b32 s12, exec_lo
				; ATTRIB-NEXT: s_wqm_b32 exec_lo, exec_lo
				; ATTRIB-NEXT: v_mov_b32_e32 v3, v0
				; ATTRIB-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; ATTRIB-NEXT: image_sample v[0:3], v[1:3], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; ATTRIB-NEXT: s_waitcnt vmcnt(0)
				; ATTRIB-NEXT: ; return to shader part epilog
				;
				; FORCE-2-LABEL: sample_3d_nsa4:
				; FORCE-2: ; %bb.0: ; %main_body
				; FORCE-2-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-2-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-2-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-2-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-2-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-2-NEXT: s_waitcnt vmcnt(0)
				; FORCE-2-NEXT: ; return to shader part epilog
				;
				; FORCE-3-LABEL: sample_3d_nsa4:
				; FORCE-3: ; %bb.0: ; %main_body
				; FORCE-3-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-3-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-3-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; FORCE-3-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-3-NEXT: image_sample v[0:3], [v1, v2, v0], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-3-NEXT: s_waitcnt vmcnt(0)
				; FORCE-3-NEXT: ; return to shader part epilog
				;
				; FORCE-4-LABEL: sample_3d_nsa4:
				; FORCE-4: ; %bb.0: ; %main_body
				; FORCE-4-NEXT: s_mov_b32 s12, exec_lo
				; FORCE-4-NEXT: s_wqm_b32 exec_lo, exec_lo
				; FORCE-4-NEXT: v_mov_b32_e32 v3, v0
				; FORCE-4-NEXT: s_and_b32 exec_lo, exec_lo, s12
				; FORCE-4-NEXT: image_sample v[0:3], v[1:3], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_3D
				; FORCE-4-NEXT: s_waitcnt vmcnt(0)
				; FORCE-4-NEXT: ; return to shader part epilog
				main_body:
				%v = call <4 x float> @llvm.amdgcn.image.sample.3d.v4f32.f32(i32 15, float %s, float %t, float %r, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
				ret <4 x float> %v
				}

				declare <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
				declare <4 x float> @llvm.amdgcn.image.sample.3d.v4f32.f32(i32, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1

				attributes #1 = { nounwind readonly }
				attributes #2 = { nounwind readonly "amdgpu-nsa-threshold"="2" }
				attributes #3 = { nounwind readonly "amdgpu-nsa-threshold"="3" }
				attributes #4 = { nounwind readonly "amdgpu-nsa-threshold"="4" }

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.nsa.ll

	; RUN: llc -march=amdgcn -mcpu=gfx1010 -mattr=-nsa-encoding -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX10-NONSA %s			; RUN: llc -march=amdgcn -mcpu=gfx1010 -mattr=-nsa-encoding -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX10-NONSA %s
	; RUN: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,GFX1010-NSA %s			; RUN: llc -march=amdgcn -mcpu=gfx1010 -amdgpu-nsa-threshold=32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX10-NONSA %s
	; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,GFX1030-NSA %s			; RUN: llc -march=amdgcn -mcpu=gfx1010 -amdgpu-nsa-threshold=2 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,NSA-T2 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,NSA-T3,GFX1010-NSA %s
				; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,NSA-T3,GFX1030-NSA %s
	; RUN: llc -march=amdgcn -mcpu=gfx1100 -mattr=-nsa-encoding -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX11-NONSA %s			; RUN: llc -march=amdgcn -mcpu=gfx1100 -mattr=-nsa-encoding -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX11-NONSA %s
	; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,GFX11-NSA %s			; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-nsa-threshold=32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NONSA,GFX11-NONSA %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-nsa-threshold=2 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,NSA-T2 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,NSA,NSA-T3,GFX11-NSA %s

				; Default NSA threshold is 3 addresses
	; GCN-LABEL: {{^}}sample_2d:			; GCN-LABEL: {{^}}sample_2d:
	;			; NONSA: v_mov_b32_e32 v2, v0
	; TODO: use NSA here			; NONSA: image_sample v[0:3], v[1:2],
	; GCN: v_mov_b32_e32 v2, v0			; NSA-T2: image_sample v[0:3], [v1, v0],
	;			; NSA-T3: v_mov_b32_e32 v2, v0
	; GCN: image_sample v[0:3], v[1:2],			; NSA-T3: image_sample v[0:3], v[1:2],
	define amdgpu_ps <4 x float> @sample_2d(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %t, float %s) {			define amdgpu_ps <4 x float> @sample_2d(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %t, float %s) {
	main_body:			main_body:
	%v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)			%v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 0, i32 0, i32 0)
	ret <4 x float> %v			ret <4 x float> %v
	}			}

	; GCN-LABEL: {{^}}sample_3d:			; GCN-LABEL: {{^}}sample_3d:
	; NONSA: v_mov_b32_e32 v3, v0			; NONSA: v_mov_b32_e32 v3, v0
	[...]