This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Fold omod into instructions
ClosedPublic

Authored by arsenm on Feb 21 2017, 9:15 AM.


Details

Reviewers

artem.tamazov
mareko

Diff Detail

Event Timeline

arsenm created this revision. Feb 21 2017, 9:15 AM

Herald added subscribers: tpr, dstuttard, tony-tye and 4 others. Feb 21 2017, 9:15 AM

arsenm added a child revision: D30212: AMDGPU: Use clamp with f64. Feb 21 2017, 9:37 AM

Comments need to be updated at least.

lib/Target/AMDGPU/SIFoldOperands.cpp
790–792	* Output modifiers are not compatible with output denorms, i.e.:
  - If output denorms are allowed (in the HW MODE register), then any output modifier is ignored.
  - If output denorms are not allowed, then denorms will be flushed to +/-0 first. Then, if the output modifier is non-zero, -0 will be forced to +0 prior to applying the modifier.
* Output modifiers are not IEEE compliant (-0*x=+0).
* Output modifiers are ignored by hardware if the IEEE bit is set in the HW MODE register.
* The above applies to all supported floating types, including f16, f32, f64.
829	Nope. Please see the comment above.
853	Yes. If IEEE is set, OMOD does not work.
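
The behavior described in these inline comments can be modeled in a few lines of C++. This is only a sketch of the comment text, not of the hardware or of the patch: `OMod`, `ModeBits` and `applyOMod` are hypothetical names, and only f32 is modeled.

```cpp
#include <cassert>
#include <cmath>

// All names here are hypothetical; they model the review comment only.
enum class OMod { None, Mul2, Mul4, Div2 };

// Stand-in for the two HW MODE register settings the comment mentions.
struct ModeBits {
  bool IEEE;          // IEEE bit set
  bool OutputDenorms; // output denormals allowed
};

// Apply an output modifier to an already computed f32 result.
inline float applyOMod(float Result, OMod M, ModeBits Mode) {
  // The modifier is ignored when IEEE is set or output denormals are allowed.
  if (M == OMod::None || Mode.IEEE || Mode.OutputDenorms)
    return Result;

  // Output denormals are flushed to a zero of the same sign first...
  if (std::fpclassify(Result) == FP_SUBNORMAL)
    Result = std::copysign(0.0f, Result);
  // ...and -0 is then forced to +0 before the modifier is applied.
  if (Result == 0.0f)
    Result = std::fabs(Result);

  switch (M) {
  case OMod::Mul2: return Result * 2.0f;
  case OMod::Mul4: return Result * 4.0f;
  case OMod::Div2: return Result * 0.5f;
  case OMod::None: break;
  }
  return Result;
}

// A separate multiply keeps the sign of zero; the modeled omod does not.
inline void checkSignedZero() {
  const ModeBits NonIEEE{false, false};
  assert(std::signbit(-0.0f * 2.0f));
  assert(!std::signbit(applyOMod(-0.0f, OMod::Mul2, NonIEEE)));
}
```

Under this model, replacing a separate multiply by 2.0 with an output modifier changes the result for a -0 input, which is why the fold is only sound when signed zeros may be ignored.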

This revision now requires changes to proceed. Feb 22 2017, 2:57 AM

Don't fold if the IEEE bit is set. Don't fold unless no-signed-zeros-fp-math is enabled.
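
The two conditions in this update combine into a single per-function gate. A minimal sketch, assuming a hypothetical `FoldContext` struct that stands in for the state the pass queries (the names are illustrative and do not come from the patch):

```cpp
#include <cassert>

// Hypothetical summary of the per-function state the pass would consult.
struct FoldContext {
  bool IEEEBit;             // IEEE bit enabled for the function
  bool NoSignedZerosFPMath; // no-signed-zeros-fp-math is enabled
};

// omod may be folded only when the IEEE bit is clear and signed zeros
// do not need to be preserved.
constexpr bool mayFoldOMod(FoldContext C) {
  return !C.IEEEBit && C.NoSignedZerosFPMath;
}

// Walk the four combinations; only one of them permits the fold.
inline void checkFoldGate() {
  assert(mayFoldOMod(FoldContext{false, true}));
  assert(!mayFoldOMod(FoldContext{true, true}));
  assert(!mayFoldOMod(FoldContext{false, false}));
  assert(!mayFoldOMod(FoldContext{true, false}));
}
```

The per-opcode denormal check is a separate condition and is not part of this sketch.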

Other than that, nice work.

lib/Target/AMDGPU/AMDGPUCallingConv.td
38	Why are the calling conventions being changed?

arsenm added inline comments. Feb 23 2017, 4:03 PM

lib/Target/AMDGPU/AMDGPUCallingConv.td
38	I needed a way to get an f16 input into a graphics shader. This would just assert on unhandled value type before. I can commit this separately

mareko added inline comments. Feb 24 2017, 1:42 AM

lib/Target/AMDGPU/AMDGPUCallingConv.td
38	OK. I guess you can keep it here.

LGTM

This revision is now accepted and ready to land. Feb 27 2017, 8:38 AM

r296372

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUCallingConv.td	4 lines
lib/Target/AMDGPU/AMDGPUMachineFunction.h	5 lines
lib/Target/AMDGPU/AMDGPUMachineFunction.cpp	3 lines
lib/Target/AMDGPU/SIFoldOperands.cpp	144 lines
test/CodeGen/AMDGPU/clamp-omod-special-case.mir	293 lines
test/CodeGen/AMDGPU/omod.ll	286 lines

Diff 89459

lib/Target/AMDGPU/AMDGPUCallingConv.td

	Show All 11 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	// Inversion of CCIfInReg			// Inversion of CCIfInReg
	class CCIfNotInReg<CCAction A> : CCIf<"!ArgFlags.isInReg()", A> {}			class CCIfNotInReg<CCAction A> : CCIf<"!ArgFlags.isInReg()", A> {}

	// Calling convention for SI			// Calling convention for SI
	def CC_SI : CallingConv<[			def CC_SI : CallingConv<[

	CCIfInReg<CCIfType<[f32, i32] , CCAssignToReg<[			CCIfInReg<CCIfType<[f32, i32, f16] , CCAssignToReg<[
	SGPR0, SGPR1, SGPR2, SGPR3, SGPR4, SGPR5, SGPR6, SGPR7,			SGPR0, SGPR1, SGPR2, SGPR3, SGPR4, SGPR5, SGPR6, SGPR7,
	SGPR8, SGPR9, SGPR10, SGPR11, SGPR12, SGPR13, SGPR14, SGPR15,			SGPR8, SGPR9, SGPR10, SGPR11, SGPR12, SGPR13, SGPR14, SGPR15,
	SGPR16, SGPR17, SGPR18, SGPR19, SGPR20, SGPR21, SGPR22, SGPR23,			SGPR16, SGPR17, SGPR18, SGPR19, SGPR20, SGPR21, SGPR22, SGPR23,
	SGPR24, SGPR25, SGPR26, SGPR27, SGPR28, SGPR29, SGPR30, SGPR31,			SGPR24, SGPR25, SGPR26, SGPR27, SGPR28, SGPR29, SGPR30, SGPR31,
	SGPR32, SGPR33, SGPR34, SGPR35, SGPR36, SGPR37, SGPR38, SGPR39			SGPR32, SGPR33, SGPR34, SGPR35, SGPR36, SGPR37, SGPR38, SGPR39
	]>>>,			]>>>,

	CCIfInReg<CCIfType<[i64] , CCAssignToRegWithShadow<			CCIfInReg<CCIfType<[i64] , CCAssignToRegWithShadow<
	[ SGPR0, SGPR2, SGPR4, SGPR6, SGPR8, SGPR10, SGPR12, SGPR14,			[ SGPR0, SGPR2, SGPR4, SGPR6, SGPR8, SGPR10, SGPR12, SGPR14,
	SGPR16, SGPR18, SGPR20, SGPR22, SGPR24, SGPR26, SGPR28, SGPR30,			SGPR16, SGPR18, SGPR20, SGPR22, SGPR24, SGPR26, SGPR28, SGPR30,
	SGPR32, SGPR34, SGPR36, SGPR38 ],			SGPR32, SGPR34, SGPR36, SGPR38 ],
	[ SGPR1, SGPR3, SGPR5, SGPR7, SGPR9, SGPR11, SGPR13, SGPR15,			[ SGPR1, SGPR3, SGPR5, SGPR7, SGPR9, SGPR11, SGPR13, SGPR15,
	SGPR17, SGPR19, SGPR21, SGPR23, SGPR25, SGPR27, SGPR29, SGPR31,			SGPR17, SGPR19, SGPR21, SGPR23, SGPR25, SGPR27, SGPR29, SGPR31,
	SGPR33, SGPR35, SGPR37, SGPR39 ]			SGPR33, SGPR35, SGPR37, SGPR39 ]
	>>>,			>>>,

	// 32*4 + 4 is the minimum for a fetch shader consumer with 32 inputs.			// 32*4 + 4 is the minimum for a fetch shader consumer with 32 inputs.
	CCIfNotInReg<CCIfType<[f32, i32] , CCAssignToReg<[			CCIfNotInReg<CCIfType<[f32, i32, f16] , CCAssignToReg<[
				mareko: Why are the calling conventions being changed?
				arsenm: I needed a way to get an f16 input into a graphics shader. This would just assert on unhandled value type before. I can commit this separately.
				mareko: OK. I guess you can keep it here.
	VGPR0, VGPR1, VGPR2, VGPR3, VGPR4, VGPR5, VGPR6, VGPR7,			VGPR0, VGPR1, VGPR2, VGPR3, VGPR4, VGPR5, VGPR6, VGPR7,
	VGPR8, VGPR9, VGPR10, VGPR11, VGPR12, VGPR13, VGPR14, VGPR15,			VGPR8, VGPR9, VGPR10, VGPR11, VGPR12, VGPR13, VGPR14, VGPR15,
	VGPR16, VGPR17, VGPR18, VGPR19, VGPR20, VGPR21, VGPR22, VGPR23,			VGPR16, VGPR17, VGPR18, VGPR19, VGPR20, VGPR21, VGPR22, VGPR23,
	VGPR24, VGPR25, VGPR26, VGPR27, VGPR28, VGPR29, VGPR30, VGPR31,			VGPR24, VGPR25, VGPR26, VGPR27, VGPR28, VGPR29, VGPR30, VGPR31,
	VGPR32, VGPR33, VGPR34, VGPR35, VGPR36, VGPR37, VGPR38, VGPR39,			VGPR32, VGPR33, VGPR34, VGPR35, VGPR36, VGPR37, VGPR38, VGPR39,
	VGPR40, VGPR41, VGPR42, VGPR43, VGPR44, VGPR45, VGPR46, VGPR47,			VGPR40, VGPR41, VGPR42, VGPR43, VGPR44, VGPR45, VGPR46, VGPR47,
	VGPR48, VGPR49, VGPR50, VGPR51, VGPR52, VGPR53, VGPR54, VGPR55,			VGPR48, VGPR49, VGPR50, VGPR51, VGPR52, VGPR53, VGPR54, VGPR55,
	VGPR56, VGPR57, VGPR58, VGPR59, VGPR60, VGPR61, VGPR62, VGPR63,			VGPR56, VGPR57, VGPR58, VGPR59, VGPR60, VGPR61, VGPR62, VGPR63,
	▲ Show 20 Lines • Show All 89 Lines • Show Last 20 Lines

lib/Target/AMDGPU/AMDGPUMachineFunction.h

Show All 25 Lines	class AMDGPUMachineFunction : public MachineFunctionInfo {
/// Number of bytes in the LDS that are being used.		/// Number of bytes in the LDS that are being used.
unsigned LDSSize;		unsigned LDSSize;

// FIXME: This should probably be removed.		// FIXME: This should probably be removed.
/// Start of implicit kernel args		/// Start of implicit kernel args
unsigned ABIArgOffset;		unsigned ABIArgOffset;

bool IsKernel;		bool IsKernel;
		bool NoSignedZerosFPMath;

public:		public:
AMDGPUMachineFunction(const MachineFunction &MF);		AMDGPUMachineFunction(const MachineFunction &MF);

uint64_t allocateKernArg(uint64_t Size, unsigned Align) {		uint64_t allocateKernArg(uint64_t Size, unsigned Align) {
assert(isPowerOf2_32(Align));		assert(isPowerOf2_32(Align));
KernArgSize = alignTo(KernArgSize, Align);		KernArgSize = alignTo(KernArgSize, Align);

Show All 23 Lines	public:
unsigned getLDSSize() const {		unsigned getLDSSize() const {
return LDSSize;		return LDSSize;
}		}

bool isKernel() const {		bool isKernel() const {
return IsKernel;		return IsKernel;
}		}

		bool hasNoSignedZerosFPMath() const {
		return NoSignedZerosFPMath;
		}

unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalValue &GV);		unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalValue &GV);
};		};

}		}
#endif		#endif

lib/Target/AMDGPU/AMDGPUMachineFunction.cpp

	Show All 14 Lines
	AMDGPUMachineFunction::AMDGPUMachineFunction(const MachineFunction &MF) :			AMDGPUMachineFunction::AMDGPUMachineFunction(const MachineFunction &MF) :
	MachineFunctionInfo(),			MachineFunctionInfo(),
	LocalMemoryObjects(),			LocalMemoryObjects(),
	KernArgSize(0),			KernArgSize(0),
	MaxKernArgAlign(0),			MaxKernArgAlign(0),
	LDSSize(0),			LDSSize(0),
	ABIArgOffset(0),			ABIArgOffset(0),
	IsKernel(MF.getFunction()->getCallingConv() == CallingConv::AMDGPU_KERNEL ||			IsKernel(MF.getFunction()->getCallingConv() == CallingConv::AMDGPU_KERNEL ||
	MF.getFunction()->getCallingConv() == CallingConv::SPIR_KERNEL) {			MF.getFunction()->getCallingConv() == CallingConv::SPIR_KERNEL),
				NoSignedZerosFPMath(MF.getTarget().Options.NoSignedZerosFPMath) {
	// FIXME: Should initialize KernArgSize based on ExplicitKernelArgOffset,			// FIXME: Should initialize KernArgSize based on ExplicitKernelArgOffset,
	// except reserved size is not correctly aligned.			// except reserved size is not correctly aligned.
	}			}

	unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,			unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,
	const GlobalValue &GV) {			const GlobalValue &GV) {
	auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));			auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));
	if (!Entry.second)			if (!Entry.second)
	Show All 16 Lines

lib/Target/AMDGPU/SIFoldOperands.cpp

//===-- SIFoldOperands.cpp - Fold operands --- ----------------------------===//		//===-- SIFoldOperands.cpp - Fold operands --- ----------------------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
/// \file		/// \file
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
		#include "SIMachineFunctionInfo.h"
#include "llvm/CodeGen/LiveIntervalAnalysis.h"		#include "llvm/CodeGen/LiveIntervalAnalysis.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"

▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	void foldOperand(MachineOperand &OpToFold,
SmallVectorImpl<FoldCandidate> &FoldList,		SmallVectorImpl<FoldCandidate> &FoldList,
SmallVectorImpl<MachineInstr *> &CopiesToReplace) const;		SmallVectorImpl<MachineInstr *> &CopiesToReplace) const;

void foldInstOperand(MachineInstr &MI, MachineOperand &OpToFold) const;		void foldInstOperand(MachineInstr &MI, MachineOperand &OpToFold) const;

const MachineOperand *isClamp(const MachineInstr &MI) const;		const MachineOperand *isClamp(const MachineInstr &MI) const;
bool tryFoldClamp(MachineInstr &MI);		bool tryFoldClamp(MachineInstr &MI);

		std::pair<const MachineOperand *, int> isOMod(const MachineInstr &MI) const;
		bool tryFoldOMod(MachineInstr &MI);

public:		public:
SIFoldOperands() : MachineFunctionPass(ID) {		SIFoldOperands() : MachineFunctionPass(ID) {
initializeSIFoldOperandsPass(*PassRegistry::getPassRegistry());		initializeSIFoldOperandsPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

StringRef getPassName() const override { return "SI Fold Operands"; }		StringRef getPassName() const override { return "SI Fold Operands"; }
Show All 40 Lines	default:
return false;		return false;
}		}
}		}

FunctionPass *llvm::createSIFoldOperandsPass() {		FunctionPass *llvm::createSIFoldOperandsPass() {
return new SIFoldOperands();		return new SIFoldOperands();
}		}

static bool isSafeToFold(const MachineInstr &MI) {		static bool isFoldableCopy(const MachineInstr &MI) {
switch (MI.getOpcode()) {		switch (MI.getOpcode()) {
case AMDGPU::V_MOV_B32_e32:		case AMDGPU::V_MOV_B32_e32:
case AMDGPU::V_MOV_B32_e64:		case AMDGPU::V_MOV_B32_e64:
case AMDGPU::V_MOV_B64_PSEUDO: {		case AMDGPU::V_MOV_B64_PSEUDO: {
// If there are additional implicit register operands, this may be used for		// If there are additional implicit register operands, this may be used for
// register indexing so the source register operand isn't simply copied.		// register indexing so the source register operand isn't simply copied.
unsigned NumOps = MI.getDesc().getNumOperands() +		unsigned NumOps = MI.getDesc().getNumOperands() +
MI.getDesc().getNumImplicitUses();		MI.getDesc().getNumImplicitUses();
▲ Show 20 Lines • Show All 579 Lines • ▼ Show 20 Lines	for (auto I = MRI.use_instr_nodbg_begin(Reg), E = MRI.use_instr_nodbg_end();
I != E; ++I) {		I != E; ++I) {
if (++Count > 1)		if (++Count > 1)
return false;		return false;
}		}

return true;		return true;
}		}

// FIXME: Does this need to check IEEE bit on function?
bool SIFoldOperands::tryFoldClamp(MachineInstr &MI) {		bool SIFoldOperands::tryFoldClamp(MachineInstr &MI) {
const MachineOperand *ClampSrc = isClamp(MI);		const MachineOperand *ClampSrc = isClamp(MI);
if (!ClampSrc || !hasOneNonDBGUseInst(*MRI, ClampSrc->getReg()))		if (!ClampSrc || !hasOneNonDBGUseInst(*MRI, ClampSrc->getReg()))
return false;		return false;

MachineInstr *Def = MRI->getVRegDef(ClampSrc->getReg());		MachineInstr *Def = MRI->getVRegDef(ClampSrc->getReg());
if (!TII->hasFPClamp(*Def))		if (!TII->hasFPClamp(*Def))
return false;		return false;
MachineOperand *DefClamp = TII->getNamedOperand(*Def, AMDGPU::OpName::clamp);		MachineOperand *DefClamp = TII->getNamedOperand(*Def, AMDGPU::OpName::clamp);
if (!DefClamp)		if (!DefClamp)
return false;		return false;

DEBUG(dbgs() << "Folding clamp " << *DefClamp << " into " << *Def << '\n');		DEBUG(dbgs() << "Folding clamp " << *DefClamp << " into " << *Def << '\n');

// Clamp is applied after omod, so it is OK if omod is set.		// Clamp is applied after omod, so it is OK if omod is set.
DefClamp->setImm(1);		DefClamp->setImm(1);
MRI->replaceRegWith(MI.getOperand(0).getReg(), Def->getOperand(0).getReg());		MRI->replaceRegWith(MI.getOperand(0).getReg(), Def->getOperand(0).getReg());
MI.eraseFromParent();		MI.eraseFromParent();
return true;		return true;
}		}

		static int getOModValue(unsigned Opc, int64_t Val) {
		switch (Opc) {
		case AMDGPU::V_MUL_F32_e64: {
		switch (static_cast<uint32_t>(Val)) {
		case 0x3f000000: // 0.5
		return SIOutMods::DIV2;
		case 0x40000000: // 2.0
		return SIOutMods::MUL2;
		case 0x40800000: // 4.0
		return SIOutMods::MUL4;
		default:
		return SIOutMods::NONE;
		}
		}
		case AMDGPU::V_MUL_F16_e64: {
		switch (static_cast<uint16_t>(Val)) {
		case 0x3800: // 0.5
		return SIOutMods::DIV2;
		case 0x4000: // 2.0
		return SIOutMods::MUL2;
		case 0x4400: // 4.0
		return SIOutMods::MUL4;
		default:
		return SIOutMods::NONE;
		}
		}
		default:
		llvm_unreachable("invalid mul opcode");
		}
		}

		// FIXME: Does this really not support denormals with f16?
		// FIXME: Does this need to check IEEE mode bit? SNaNs are generally not
		// handled, so will anything other than that break?
		artem.tamazov: Output modifiers are not compatible with output denorms, i.e.: if output denorms are allowed (in the HW MODE register), then any output modifier is ignored; if output denorms are not allowed, then denorms will be flushed to +/-0 first. Then, if the output modifier is non-zero, -0 will be forced to +0 prior to applying the modifier. Output modifiers are not IEEE compliant (-0*x=+0). Output modifiers are ignored by hardware if the IEEE bit is set in the HW MODE register. The above applies to all supported floating types, including f16, f32, f64.
		std::pair<const MachineOperand *, int>
		SIFoldOperands::isOMod(const MachineInstr &MI) const {
		unsigned Op = MI.getOpcode();
		switch (Op) {
		case AMDGPU::V_MUL_F32_e64:
		case AMDGPU::V_MUL_F16_e64: {
		// If output denormals are enabled, omod is ignored.
		if ((Op == AMDGPU::V_MUL_F32_e64 && ST->hasFP32Denormals()) ||
		(Op == AMDGPU::V_MUL_F16_e64 && ST->hasFP16Denormals()))
		return std::make_pair(nullptr, SIOutMods::NONE);

		const MachineOperand *RegOp = nullptr;
		const MachineOperand *ImmOp = nullptr;
		const MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		const MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
		if (Src0->isImm()) {
		ImmOp = Src0;
		RegOp = Src1;
		} else if (Src1->isImm()) {
		ImmOp = Src1;
		RegOp = Src0;
		} else
		return std::make_pair(nullptr, SIOutMods::NONE);

		int OMod = getOModValue(Op, ImmOp->getImm());
		if (OMod == SIOutMods::NONE ||
		TII->hasModifiersSet(MI, AMDGPU::OpName::src0_modifiers) ||
		TII->hasModifiersSet(MI, AMDGPU::OpName::src1_modifiers) ||
		TII->hasModifiersSet(MI, AMDGPU::OpName::omod) ||
		TII->hasModifiersSet(MI, AMDGPU::OpName::clamp))
		return std::make_pair(nullptr, SIOutMods::NONE);

		return std::make_pair(RegOp, OMod);
		}
		case AMDGPU::V_ADD_F32_e64:
		case AMDGPU::V_ADD_F16_e64: {
		// If output denormals are enabled, omod is ignored.
		artem.tamazov: Nope. Please see the comment above.
		if ((Op == AMDGPU::V_ADD_F32_e64 && ST->hasFP32Denormals()) ||
		(Op == AMDGPU::V_ADD_F16_e64 && ST->hasFP16Denormals()))
		return std::make_pair(nullptr, SIOutMods::NONE);

		// Look through the DAGCombiner canonicalization fmul x, 2 -> fadd x, x
		const MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		const MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);

		if (Src0->isReg() && Src1->isReg() && Src0->getReg() == Src1->getReg() &&
		Src0->getSubReg() == Src1->getSubReg() &&
		!TII->hasModifiersSet(MI, AMDGPU::OpName::src0_modifiers) &&
		!TII->hasModifiersSet(MI, AMDGPU::OpName::src1_modifiers) &&
		!TII->hasModifiersSet(MI, AMDGPU::OpName::clamp) &&
		!TII->hasModifiersSet(MI, AMDGPU::OpName::omod))
		return std::make_pair(Src0, SIOutMods::MUL2);

		return std::make_pair(nullptr, SIOutMods::NONE);
		}
		default:
		return std::make_pair(nullptr, SIOutMods::NONE);
		}
		}

		// FIXME: Does this need to check IEEE bit on function?
		artem.tamazov: Yes. If IEEE is set, OMOD does not work.
		bool SIFoldOperands::tryFoldOMod(MachineInstr &MI) {
		const MachineOperand *RegOp;
		int OMod;
		std::tie(RegOp, OMod) = isOMod(MI);
		if (OMod == SIOutMods::NONE || !RegOp->isReg() ||
		RegOp->getSubReg() != AMDGPU::NoSubRegister ||
		!hasOneNonDBGUseInst(*MRI, RegOp->getReg()))
		return false;

		MachineInstr *Def = MRI->getVRegDef(RegOp->getReg());
		MachineOperand *DefOMod = TII->getNamedOperand(*Def, AMDGPU::OpName::omod);
		if (!DefOMod || DefOMod->getImm() != SIOutMods::NONE)
		return false;

		// Clamp is applied after omod. If the source already has clamp set, don't
		// fold it.
		if (TII->hasModifiersSet(*Def, AMDGPU::OpName::clamp))
		return false;

		DEBUG(dbgs() << "Folding omod " << MI << " into " << *Def << '\n');

		DefOMod->setImm(OMod);
		MRI->replaceRegWith(MI.getOperand(0).getReg(), Def->getOperand(0).getReg());
		MI.eraseFromParent();
		return true;
		}

bool SIFoldOperands::runOnMachineFunction(MachineFunction &MF) {		bool SIFoldOperands::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(*MF.getFunction()))		if (skipFunction(*MF.getFunction()))
return false;		return false;

MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
ST = &MF.getSubtarget<SISubtarget>();		ST = &MF.getSubtarget<SISubtarget>();
TII = ST->getInstrInfo();		TII = ST->getInstrInfo();
TRI = &TII->getRegisterInfo();		TRI = &TII->getRegisterInfo();

		const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();

		// omod is ignored by hardware if IEEE bit is enabled. omod also does not
		// correctly handle signed zeros.
		//
		// TODO: Check nsz on instructions when fast math flags are preserved to MI
		// level.
		bool IsIEEEMode = ST->enableIEEEBit(MF) || !MFI->hasNoSignedZerosFPMath();

for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();		for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
BI != BE; ++BI) {		BI != BE; ++BI) {

MachineBasicBlock &MBB = *BI;		MachineBasicBlock &MBB = *BI;
MachineBasicBlock::iterator I, Next;		MachineBasicBlock::iterator I, Next;
for (I = MBB.begin(); I != MBB.end(); I = Next) {		for (I = MBB.begin(); I != MBB.end(); I = Next) {
Next = std::next(I);		Next = std::next(I);
MachineInstr &MI = *I;		MachineInstr &MI = *I;

if (!isSafeToFold(MI)) {		if (!isFoldableCopy(MI)) {
// TODO: Try omod also.		if (IsIEEEMode || !tryFoldOMod(MI))
tryFoldClamp(MI);		tryFoldClamp(MI);
continue;		continue;
}		}

MachineOperand &OpToFold = MI.getOperand(1);		MachineOperand &OpToFold = MI.getOperand(1);
bool FoldingImm = OpToFold.isImm() || OpToFold.isFI();		bool FoldingImm = OpToFold.isImm() || OpToFold.isFI();

// FIXME: We could also be folding things like TargetIndexes.		// FIXME: We could also be folding things like TargetIndexes.
if (!FoldingImm && !OpToFold.isReg())		if (!FoldingImm && !OpToFold.isReg())
Show All 22 Lines
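
The hexadecimal immediates matched in `getOModValue` above, and the decimal operands that appear in the MIR test below (`1065353216`, `1056964608`), are raw IEEE-754 bit patterns. The following sketch checks that correspondence on the host; `floatBits`, `halfBitsToFloat` and `checkOModImmediates` are hypothetical helper names, and the f16 decoder is an assumption-laden simplification that skips infinities and NaNs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit pattern of an IEEE-754 binary32 value.
inline uint32_t floatBits(float F) {
  static_assert(sizeof(uint32_t) == sizeof(float), "expects 32-bit float");
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

// Decode a binary16 bit pattern (zero, denormal and normal values only;
// infinities and NaNs are deliberately not handled in this sketch).
inline float halfBitsToFloat(uint16_t H) {
  const int Exp = (H >> 10) & 0x1f;
  const int Man = H & 0x3ff;
  const float Mag =
      Exp == 0 ? std::ldexp(static_cast<float>(Man), -24)
               : std::ldexp(static_cast<float>(Man | 0x400), Exp - 25);
  return (H & 0x8000) ? -Mag : Mag;
}

inline void checkOModImmediates() {
  // f32 cases from getOModValue.
  assert(floatBits(0.5f) == 0x3f000000u);
  assert(floatBits(2.0f) == 0x40000000u);
  assert(floatBits(4.0f) == 0x40800000u);
  // f16 cases from getOModValue.
  assert(halfBitsToFloat(0x3800) == 0.5f);
  assert(halfBitsToFloat(0x4000) == 2.0f);
  assert(halfBitsToFloat(0x4400) == 4.0f);
  // Decimal operands used in the MIR test: 1.0 and 0.5.
  assert(floatBits(1.0f) == 1065353216u);
  assert(floatBits(0.5f) == 1056964608u);
}
```

Read this way, the `V_MUL_F32_e64 ... 1056964608` instructions in the MIR test are multiplications by 0.5, the pattern that `isOMod` maps to `SIOutMods::DIV2`.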

test/CodeGen/AMDGPU/clamp-omod-special-case.mir

# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-fold-operands %s -o - | FileCheck -check-prefix=GCN %s		# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-fold-operands %s -o - | FileCheck -check-prefix=GCN %s
--- |		--- |
define amdgpu_kernel void @v_max_self_clamp_not_set_f32(float addrspace(1)* %out, float addrspace(1)* %aptr) {		define amdgpu_ps void @v_max_self_clamp_not_set_f32() #0 {
ret void		ret void
}		}

define amdgpu_kernel void @v_clamp_omod_already_set_f32(float addrspace(1)* %out, float addrspace(1)* %aptr) {		define amdgpu_ps void @v_clamp_omod_already_set_f32() #0 {
ret void		ret void
}		}

		define amdgpu_ps void @v_omod_mul_omod_already_set_f32() #0 {
		ret void
		}

		define amdgpu_ps void @v_omod_mul_clamp_already_set_f32() #0 {
		ret void
		}

		define amdgpu_ps void @v_omod_add_omod_already_set_f32() #0 {
		ret void
		}

		define amdgpu_ps void @v_omod_add_clamp_already_set_f32() #0 {
		ret void
		}

		attributes #0 = { nounwind "no-signed-zeros-fp-math"="false" }

...		...
---		---
# GCN-LABEL: name: v_max_self_clamp_not_set_f32		# GCN-LABEL: name: v_max_self_clamp_not_set_f32
# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec		# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
# GCN-NEXT: %21 = V_MAX_F32_e64 0, killed %20, 0, killed %20, 0, 0, implicit %exec		# GCN-NEXT: %21 = V_MAX_F32_e64 0, killed %20, 0, killed %20, 0, 0, implicit %exec

name: v_max_self_clamp_not_set_f32		name: v_max_self_clamp_not_set_f32
tracksRegLiveness: true		tracksRegLiveness: true
▲ Show 20 Lines • Show All 109 Lines • ▼ Show 20 Lines	bb.0 (%ir-block.0):
%16 = REG_SEQUENCE killed %4, 17, %12, 18		%16 = REG_SEQUENCE killed %4, 17, %12, 18
%18 = COPY %26		%18 = COPY %26
%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec		%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec
%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec		%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
%21 = V_MAX_F32_e64 0, killed %20, 0, killed %20, 1, 3, implicit %exec		%21 = V_MAX_F32_e64 0, killed %20, 0, killed %20, 1, 3, implicit %exec
BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec		BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec
S_ENDPGM		S_ENDPGM
...		...
		---
		# Don't fold a mul that looks like an omod if itself has omod set

		# GCN-LABEL: name: v_omod_mul_omod_already_set_f32
		# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		# GCN-NEXT: %21 = V_MUL_F32_e64 0, killed %20, 0, 1056964608, 0, 3, implicit %exec
		name: v_omod_mul_omod_already_set_f32
		tracksRegLiveness: true
		registers:
		- { id: 0, class: sgpr_64 }
		- { id: 1, class: sreg_32_xm0 }
		- { id: 2, class: sgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: sreg_64_xexec }
		- { id: 5, class: sreg_64_xexec }
		- { id: 6, class: sreg_32 }
		- { id: 7, class: sreg_32 }
		- { id: 8, class: sreg_32_xm0 }
		- { id: 9, class: sreg_64 }
		- { id: 10, class: sreg_32_xm0 }
		- { id: 11, class: sreg_32_xm0 }
		- { id: 12, class: sgpr_64 }
		- { id: 13, class: sgpr_128 }
		- { id: 14, class: sreg_32_xm0 }
		- { id: 15, class: sreg_64 }
		- { id: 16, class: sgpr_128 }
		- { id: 17, class: vgpr_32 }
		- { id: 18, class: vreg_64 }
		- { id: 19, class: vgpr_32 }
		- { id: 20, class: vgpr_32 }
		- { id: 21, class: vgpr_32 }
		- { id: 22, class: vgpr_32 }
		- { id: 23, class: vreg_64 }
		- { id: 24, class: vgpr_32 }
		- { id: 25, class: vreg_64 }
		- { id: 26, class: vreg_64 }
		liveins:
		- { reg: '%sgpr0_sgpr1', virtual-reg: '%0' }
		- { reg: '%vgpr0', virtual-reg: '%3' }
		body: |
		bb.0 (%ir-block.0):
		liveins: %sgpr0_sgpr1, %vgpr0

		%3 = COPY %vgpr0
		%0 = COPY %sgpr0_sgpr1
		%4 = S_LOAD_DWORDX2_IMM %0, 9, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%5 = S_LOAD_DWORDX2_IMM %0, 11, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%24 = V_ASHRREV_I32_e32 31, %3, implicit %exec
		%25 = REG_SEQUENCE %3, 1, %24, 2
		%10 = S_MOV_B32 61440
		%11 = S_MOV_B32 0
		%12 = REG_SEQUENCE killed %11, 1, killed %10, 2
		%13 = REG_SEQUENCE killed %5, 17, %12, 18
		%14 = S_MOV_B32 2
		%26 = V_LSHL_B64 killed %25, 2, implicit %exec
		%16 = REG_SEQUENCE killed %4, 17, %12, 18
		%18 = COPY %26
		%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec
		%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		%21 = V_MUL_F32_e64 0, killed %20, 0, 1056964608, 0, 3, implicit %exec
		BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec
		S_ENDPGM

		...
		---
		# Don't fold a mul that looks like an omod if itself has clamp set
		# This might be OK, but would require folding the clamp at the same time.
		# GCN-LABEL: name: v_omod_mul_clamp_already_set_f32
		# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		# GCN-NEXT: %21 = V_MUL_F32_e64 0, killed %20, 0, 1056964608, 1, 0, implicit %exec

		name: v_omod_mul_clamp_already_set_f32
		tracksRegLiveness: true
		registers:
		- { id: 0, class: sgpr_64 }
		- { id: 1, class: sreg_32_xm0 }
		- { id: 2, class: sgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: sreg_64_xexec }
		- { id: 5, class: sreg_64_xexec }
		- { id: 6, class: sreg_32 }
		- { id: 7, class: sreg_32 }
		- { id: 8, class: sreg_32_xm0 }
		- { id: 9, class: sreg_64 }
		- { id: 10, class: sreg_32_xm0 }
		- { id: 11, class: sreg_32_xm0 }
		- { id: 12, class: sgpr_64 }
		- { id: 13, class: sgpr_128 }
		- { id: 14, class: sreg_32_xm0 }
		- { id: 15, class: sreg_64 }
		- { id: 16, class: sgpr_128 }
		- { id: 17, class: vgpr_32 }
		- { id: 18, class: vreg_64 }
		- { id: 19, class: vgpr_32 }
		- { id: 20, class: vgpr_32 }
		- { id: 21, class: vgpr_32 }
		- { id: 22, class: vgpr_32 }
		- { id: 23, class: vreg_64 }
		- { id: 24, class: vgpr_32 }
		- { id: 25, class: vreg_64 }
		- { id: 26, class: vreg_64 }
		liveins:
		- { reg: '%sgpr0_sgpr1', virtual-reg: '%0' }
		- { reg: '%vgpr0', virtual-reg: '%3' }
		body: |
		bb.0 (%ir-block.0):
		liveins: %sgpr0_sgpr1, %vgpr0

		%3 = COPY %vgpr0
		%0 = COPY %sgpr0_sgpr1
		%4 = S_LOAD_DWORDX2_IMM %0, 9, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%5 = S_LOAD_DWORDX2_IMM %0, 11, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%24 = V_ASHRREV_I32_e32 31, %3, implicit %exec
		%25 = REG_SEQUENCE %3, 1, %24, 2
		%10 = S_MOV_B32 61440
		%11 = S_MOV_B32 0
		%12 = REG_SEQUENCE killed %11, 1, killed %10, 2
		%13 = REG_SEQUENCE killed %5, 17, %12, 18
		%14 = S_MOV_B32 2
		%26 = V_LSHL_B64 killed %25, 2, implicit %exec
		%16 = REG_SEQUENCE killed %4, 17, %12, 18
		%18 = COPY %26
		%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec
		%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		%21 = V_MUL_F32_e64 0, killed %20, 0, 1056964608, 1, 0, implicit %exec
		BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec
		S_ENDPGM

		...
		---
		# Don't fold an add that looks like an omod if itself has omod set

		# GCN-LABEL: name: v_omod_add_omod_already_set_f32
		# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		# GCN-NEXT: %21 = V_ADD_F32_e64 0, killed %20, 0, killed %20, 0, 3, implicit %exec
		name: v_omod_add_omod_already_set_f32
		tracksRegLiveness: true
		registers:
		- { id: 0, class: sgpr_64 }
		- { id: 1, class: sreg_32_xm0 }
		- { id: 2, class: sgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: sreg_64_xexec }
		- { id: 5, class: sreg_64_xexec }
		- { id: 6, class: sreg_32 }
		- { id: 7, class: sreg_32 }
		- { id: 8, class: sreg_32_xm0 }
		- { id: 9, class: sreg_64 }
		- { id: 10, class: sreg_32_xm0 }
		- { id: 11, class: sreg_32_xm0 }
		- { id: 12, class: sgpr_64 }
		- { id: 13, class: sgpr_128 }
		- { id: 14, class: sreg_32_xm0 }
		- { id: 15, class: sreg_64 }
		- { id: 16, class: sgpr_128 }
		- { id: 17, class: vgpr_32 }
		- { id: 18, class: vreg_64 }
		- { id: 19, class: vgpr_32 }
		- { id: 20, class: vgpr_32 }
		- { id: 21, class: vgpr_32 }
		- { id: 22, class: vgpr_32 }
		- { id: 23, class: vreg_64 }
		- { id: 24, class: vgpr_32 }
		- { id: 25, class: vreg_64 }
		- { id: 26, class: vreg_64 }
		liveins:
		- { reg: '%sgpr0_sgpr1', virtual-reg: '%0' }
		- { reg: '%vgpr0', virtual-reg: '%3' }
		body: |
		bb.0 (%ir-block.0):
		liveins: %sgpr0_sgpr1, %vgpr0

		%3 = COPY %vgpr0
		%0 = COPY %sgpr0_sgpr1
		%4 = S_LOAD_DWORDX2_IMM %0, 9, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%5 = S_LOAD_DWORDX2_IMM %0, 11, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%24 = V_ASHRREV_I32_e32 31, %3, implicit %exec
		%25 = REG_SEQUENCE %3, 1, %24, 2
		%10 = S_MOV_B32 61440
		%11 = S_MOV_B32 0
		%12 = REG_SEQUENCE killed %11, 1, killed %10, 2
		%13 = REG_SEQUENCE killed %5, 17, %12, 18
		%14 = S_MOV_B32 2
		%26 = V_LSHL_B64 killed %25, 2, implicit %exec
		%16 = REG_SEQUENCE killed %4, 17, %12, 18
		%18 = COPY %26
		%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec
		%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		%21 = V_ADD_F32_e64 0, killed %20, 0, killed %20, 0, 3, implicit %exec
		BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec
		S_ENDPGM

		...
		---
		# Don't fold an add that looks like an omod if itself has clamp set
		# This might be OK, but would require folding the clamp at the same time.
		# GCN-LABEL: name: v_omod_add_clamp_already_set_f32
		# GCN: %20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		# GCN-NEXT: %21 = V_ADD_F32_e64 0, killed %20, 0, killed %20, 1, 0, implicit %exec

		name: v_omod_add_clamp_already_set_f32
		tracksRegLiveness: true
		registers:
		- { id: 0, class: sgpr_64 }
		- { id: 1, class: sreg_32_xm0 }
		- { id: 2, class: sgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: sreg_64_xexec }
		- { id: 5, class: sreg_64_xexec }
		- { id: 6, class: sreg_32 }
		- { id: 7, class: sreg_32 }
		- { id: 8, class: sreg_32_xm0 }
		- { id: 9, class: sreg_64 }
		- { id: 10, class: sreg_32_xm0 }
		- { id: 11, class: sreg_32_xm0 }
		- { id: 12, class: sgpr_64 }
		- { id: 13, class: sgpr_128 }
		- { id: 14, class: sreg_32_xm0 }
		- { id: 15, class: sreg_64 }
		- { id: 16, class: sgpr_128 }
		- { id: 17, class: vgpr_32 }
		- { id: 18, class: vreg_64 }
		- { id: 19, class: vgpr_32 }
		- { id: 20, class: vgpr_32 }
		- { id: 21, class: vgpr_32 }
		- { id: 22, class: vgpr_32 }
		- { id: 23, class: vreg_64 }
		- { id: 24, class: vgpr_32 }
		- { id: 25, class: vreg_64 }
		- { id: 26, class: vreg_64 }
		liveins:
		- { reg: '%sgpr0_sgpr1', virtual-reg: '%0' }
		- { reg: '%vgpr0', virtual-reg: '%3' }
		body: |
		bb.0 (%ir-block.0):
		liveins: %sgpr0_sgpr1, %vgpr0

		%3 = COPY %vgpr0
		%0 = COPY %sgpr0_sgpr1
		%4 = S_LOAD_DWORDX2_IMM %0, 9, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%5 = S_LOAD_DWORDX2_IMM %0, 11, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
		%24 = V_ASHRREV_I32_e32 31, %3, implicit %exec
		%25 = REG_SEQUENCE %3, 1, %24, 2
		%10 = S_MOV_B32 61440
		%11 = S_MOV_B32 0
		%12 = REG_SEQUENCE killed %11, 1, killed %10, 2
		%13 = REG_SEQUENCE killed %5, 17, %12, 18
		%14 = S_MOV_B32 2
		%26 = V_LSHL_B64 killed %25, 2, implicit %exec
		%16 = REG_SEQUENCE killed %4, 17, %12, 18
		%18 = COPY %26
		%17 = BUFFER_LOAD_DWORD_ADDR64 %26, killed %13, 0, 0, 0, 0, 0, implicit %exec
		%20 = V_ADD_F32_e64 0, killed %17, 0, 1065353216, 0, 0, implicit %exec
		%21 = V_ADD_F32_e64 0, killed %20, 0, killed %20, 1, 0, implicit %exec
		BUFFER_STORE_DWORD_ADDR64 killed %21, %26, killed %16, 0, 0, 0, 0, 0, implicit %exec
		S_ENDPGM

		...

test/CodeGen/AMDGPU/omod.ll

This file was added.

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
				; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s

				; IEEE bit is enabled for compute kernels, so omod shouldn't be used.
				; GCN-LABEL: {{^}}v_omod_div2_f32_enable_ieee_signed_zeros:
				; GCN: {{buffer|flat}}_load_dword [[A:v[0-9]+]]
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, [[A]]{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_kernel void @v_omod_div2_f32_enable_ieee_signed_zeros(float addrspace(1)* %out, float addrspace(1)* %aptr) #4 {
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
				%out.gep = getelementptr float, float addrspace(1)* %out, i32 %tid
				%a = load float, float addrspace(1)* %gep0
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5
				store float %div2, float addrspace(1)* %out.gep
				ret void
				}

				; IEEE bit is enabled for compute kernels, so omod shouldn't be used even though nsz is allowed.
				; GCN-LABEL: {{^}}v_omod_div2_f32_enable_ieee_nsz:
				; GCN: {{buffer|flat}}_load_dword [[A:v[0-9]+]]
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, [[A]]{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_kernel void @v_omod_div2_f32_enable_ieee_nsz(float addrspace(1)* %out, float addrspace(1)* %aptr) #0 {
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
				%out.gep = getelementptr float, float addrspace(1)* %out, i32 %tid
				%a = load float, float addrspace(1)* %gep0
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5
				store float %div2, float addrspace(1)* %out.gep
				ret void
				}

				; Without the IEEE bit, only fold if signed zeros are not significant. Here they are, so don't fold.
				; GCN-LABEL: {{^}}v_omod_div2_f32_signed_zeros:
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_div2_f32_signed_zeros(float %a) #4 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_div2_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, 1.0 div:2{{$}}
				define amdgpu_ps void @v_omod_div2_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_mul2_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, 1.0 mul:2{{$}}
				define amdgpu_ps void @v_omod_mul2_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 2.0
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_mul4_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, 1.0 mul:4{{$}}
				define amdgpu_ps void @v_omod_mul4_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 4.0
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_mul4_multi_use_f32:
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 4.0, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_mul4_multi_use_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 4.0
				store float %div2, float addrspace(1)* undef
				store volatile float %add, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_mul4_dbg_use_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, 1.0 mul:4{{$}}
				define amdgpu_ps void @v_omod_mul4_dbg_use_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				call void @llvm.dbg.value(metadata float %add, i64 0, metadata !4, metadata !9), !dbg !10
				%div2 = fmul float %add, 4.0
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; Clamp is applied after omod, folding both into instruction is OK.
				; GCN-LABEL: {{^}}v_clamp_omod_div2_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, 1.0 clamp div:2{{$}}
				define amdgpu_ps void @v_clamp_omod_div2_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5

				%max = call float @llvm.maxnum.f32(float %div2, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				store float %clamp, float addrspace(1)* undef
				ret void
				}

				; Cannot fold omod into clamp
				; GCN-LABEL: {{^}}v_omod_div2_clamp_f32:
				; GCN: v_add_f32_e64 [[ADD:v[0-9]+]], v0, 1.0 clamp{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_div2_clamp_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%max = call float @llvm.maxnum.f32(float %add, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				%div2 = fmul float %clamp, 0.5
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_div2_abs_src_f32:
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; GCN: v_mul_f32_e64 v{{[0-9]+}}, |[[ADD]]|, 0.5{{$}}
				define amdgpu_ps void @v_omod_div2_abs_src_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%abs.add = call float @llvm.fabs.f32(float %add)
				%div2 = fmul float %abs.add, 0.5
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_add_self_clamp_f32:
				; GCN: v_add_f32_e64 v{{[0-9]+}}, v0, v0 clamp{{$}}
				define amdgpu_ps void @v_omod_add_self_clamp_f32(float %a) #0 {
				%add = fadd float %a, %a
				%max = call float @llvm.maxnum.f32(float %add, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				store float %clamp, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_add_clamp_self_f32:
				; GCN: v_max_f32_e64 [[CLAMP:v[0-9]+]], v0, v0 clamp{{$}}
				; GCN: v_add_f32_e32 v{{[0-9]+}}, [[CLAMP]], [[CLAMP]]{{$}}
				define amdgpu_ps void @v_omod_add_clamp_self_f32(float %a) #0 {
				%max = call float @llvm.maxnum.f32(float %a, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				%add = fadd float %clamp, %clamp
				store float %add, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_add_abs_self_f32:
				; GCN: v_add_f32_e32 [[X:v[0-9]+]], 1.0, v0
				; GCN: v_add_f32_e64 v{{[0-9]+}}, |[[X]]|, |[[X]]|{{$}}
				define amdgpu_ps void @v_omod_add_abs_self_f32(float %a) #0 {
				%x = fadd float %a, 1.0
				%abs.x = call float @llvm.fabs.f32(float %x)
				%add = fadd float %abs.x, %abs.x
				store float %add, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_add_abs_x_x_f32:

				; GCN: v_add_f32_e32 [[X:v[0-9]+]], 1.0, v0
				; GCN: v_add_f32_e64 v{{[0-9]+}}, |[[X]]|, [[X]]{{$}}
				define amdgpu_ps void @v_omod_add_abs_x_x_f32(float %a) #0 {
				%x = fadd float %a, 1.0
				%abs.x = call float @llvm.fabs.f32(float %x)
				%add = fadd float %abs.x, %x
				store float %add, float addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_add_x_abs_x_f32:
				; GCN: v_add_f32_e32 [[X:v[0-9]+]], 1.0, v0
				; GCN: v_add_f32_e64 v{{[0-9]+}}, [[X]], |[[X]]|{{$}}
				define amdgpu_ps void @v_omod_add_x_abs_x_f32(float %a) #0 {
				%x = fadd float %a, 1.0
				%abs.x = call float @llvm.fabs.f32(float %x)
				%add = fadd float %x, %abs.x
				store float %add, float addrspace(1)* undef
				ret void
				}

				; Don't fold omod into another omod.
				; GCN-LABEL: {{^}}v_omod_div2_omod_div2_f32:
				; GCN: v_add_f32_e64 [[ADD:v[0-9]+]], v0, 1.0 div:2{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_div2_omod_div2_f32(float %a) #0 {
				%add = fadd float %a, 1.0
				%div2.0 = fmul float %add, 0.5
				%div2.1 = fmul float %div2.0, 0.5
				store float %div2.1, float addrspace(1)* undef
				ret void
				}

				; Don't fold omod if denorms enabled
				; GCN-LABEL: {{^}}v_omod_div2_f32_denormals:
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; GCN: v_mul_f32_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_div2_f32_denormals(float %a) #2 {
				%add = fadd float %a, 1.0
				%div2 = fmul float %add, 0.5
				store float %div2, float addrspace(1)* undef
				ret void
				}

				; Don't fold omod if denorms enabled for add form.
				; GCN-LABEL: {{^}}v_omod_mul2_f32_denormals:
				; GCN: v_add_f32_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; GCN: v_add_f32_e32 v{{[0-9]+}}, [[ADD]], [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_mul2_f32_denormals(float %a) #2 {
				%add = fadd float %a, 1.0
				%mul2 = fadd float %add, %add
				store float %mul2, float addrspace(1)* undef
				ret void
				}

				; Don't fold omod if denorms enabled
				; GCN-LABEL: {{^}}v_omod_div2_f16_denormals:
				; VI: v_add_f16_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; VI: v_mul_f16_e32 v{{[0-9]+}}, 0.5, [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_div2_f16_denormals(half %a) #0 {
				%add = fadd half %a, 1.0
				%div2 = fmul half %add, 0.5
				store half %div2, half addrspace(1)* undef
				ret void
				}

				; Don't fold omod if denorms enabled for add form.
				; GCN-LABEL: {{^}}v_omod_mul2_f16_denormals:
				; VI: v_add_f16_e32 [[ADD:v[0-9]+]], 1.0, v0{{$}}
				; VI: v_add_f16_e32 v{{[0-9]+}}, [[ADD]], [[ADD]]{{$}}
				define amdgpu_ps void @v_omod_mul2_f16_denormals(half %a) #0 {
				%add = fadd half %a, 1.0
				%mul2 = fadd half %add, %add
				store half %mul2, half addrspace(1)* undef
				ret void
				}

				; GCN-LABEL: {{^}}v_omod_div2_f16_no_denormals:
				; VI-NOT: v0
				; VI: v_add_f16_e64 [[ADD:v[0-9]+]], v0, 1.0 div:2{{$}}
				define amdgpu_ps void @v_omod_div2_f16_no_denormals(half %a) #3 {
				%add = fadd half %a, 1.0
				%div2 = fmul half %add, 0.5
				store half %div2, half addrspace(1)* undef
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x() #1
				declare float @llvm.fabs.f32(float) #1
				declare float @llvm.floor.f32(float) #1
				declare float @llvm.minnum.f32(float, float) #1
				declare float @llvm.maxnum.f32(float, float) #1
				declare float @llvm.amdgcn.fmed3.f32(float, float, float) #1
				declare double @llvm.fabs.f64(double) #1
				declare double @llvm.minnum.f64(double, double) #1
				declare double @llvm.maxnum.f64(double, double) #1
				declare half @llvm.fabs.f16(half) #1
				declare half @llvm.minnum.f16(half, half) #1
				declare half @llvm.maxnum.f16(half, half) #1
				declare void @llvm.dbg.value(metadata, i64, metadata, metadata) #1

				attributes #0 = { nounwind "no-signed-zeros-fp-math"="true" }
				attributes #1 = { nounwind readnone }
				attributes #2 = { nounwind "target-features"="+fp32-denormals" "no-signed-zeros-fp-math"="true" }
				attributes #3 = { nounwind "target-features"="-fp64-fp16-denormals" "no-signed-zeros-fp-math"="true" }
				attributes #4 = { nounwind "no-signed-zeros-fp-math"="false" }

				!llvm.dbg.cu = !{!0}
				!llvm.module.flags = !{!2, !3}

				!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, isOptimized: true, runtimeVersion: 0, emissionKind: NoDebug)
				!1 = !DIFile(filename: "/tmp/foo.cl", directory: "/dev/null")
				!2 = !{i32 2, !"Dwarf Version", i32 4}
				!3 = !{i32 2, !"Debug Info Version", i32 3}
				!4 = !DILocalVariable(name: "add", arg: 1, scope: !5, file: !1, line: 1)
				!5 = distinct !DISubprogram(name: "foo", scope: !1, file: !1, line: 1, type: !6, isLocal: false, isDefinition: true, scopeLine: 2, flags: DIFlagPrototyped, isOptimized: true, unit: !0)
				!6 = !DISubroutineType(types: !7)
				!7 = !{null, !8}
				!8 = !DIBasicType(name: "float", size: 32, align: 32)
				!9 = !DIExpression()
				!10 = !DILocation(line: 1, column: 42, scope: !5)
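The conditions these tests exercise can be summarized in a small model. The following is only a sketch in Python, not the code in SIFoldOperands.cpp: the function names `f32_bits`, `omod_for_multiplier` and `can_fold_omod` are hypothetical, and the omod operand encoding (0 = none, 1 = mul:2, 2 = mul:4, 3 = div:2) is an assumption inferred from the MIR operands shown in the tests.

```python
import struct

# Values of the omod operand as they appear in the MIR tests.
# Assumed encoding: 0 = none, 1 = mul:2, 2 = mul:4, 3 = div:2.
OMOD_NONE, OMOD_MUL2, OMOD_MUL4, OMOD_DIV2 = 0, 1, 2, 3


def f32_bits(value):
    """Return the IEEE-754 single-precision bit pattern of value as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


# Multipliers that have a matching output modifier, keyed by f32 bit pattern.
_OMOD_BY_BITS = {
    f32_bits(0.5): OMOD_DIV2,
    f32_bits(2.0): OMOD_MUL2,
    f32_bits(4.0): OMOD_MUL4,
}


def omod_for_multiplier(imm_bits):
    """Map an f32 immediate (given as bits) to an omod value, or OMOD_NONE."""
    return _OMOD_BY_BITS.get(imm_bits, OMOD_NONE)


def can_fold_omod(ieee, no_signed_zeros, denormals, def_non_debug_uses,
                  def_has_clamp_or_omod, mul_has_modifiers):
    """Hypothetical legality check mirroring the cases the tests cover."""
    # The review notes omod is ignored with the IEEE bit set, is incompatible
    # with output denormals, and does not preserve the sign of zero.
    if ieee or denormals or not no_signed_zeros:
        return False
    # The defining instruction must feed only the multiply (debug uses aside).
    if def_non_debug_uses != 1:
        return False
    # No clamp/omod already on the def, and no abs/neg/clamp/omod on the mul.
    return not (def_has_clamp_or_omod or mul_has_modifiers)
```

For instance, `f32_bits(1.0)` is 1065353216, the immediate that appears in the `V_ADD_F32_e64` operands of the MIR tests, and `omod_for_multiplier(f32_bits(0.5))` gives the value 3 seen in the instruction that already carries an omod.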