This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare::replaceMulWithMul24().
ClosedPublic

Authored by abinavpp on Oct 14 2021, 9:12 PM.


Details

Reviewers

arsenm
foad
rampitec

Commits

rGde3038400b16: [AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare…

Summary

isU24() and isI24() each call numBitsUnsigned() or numBitsSigned() to
make their decision. This change replaces them with direct numBits calls
so that the result can be reused for the > 32 bit width cases.
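
The effect of the change can be sketched outside of LLVM. The following is a minimal illustration, not the code from the patch: `numBitsUnsigned` here is a hypothetical stand-in that counts the significant bits of a concrete integer (the real function queries known-bits analysis on an IR `Value`), and `NumBitsCalls`, `canUseMulU24Old` and `canUseMulU24New` are names invented for the example. It only shows why keeping the bit counts in locals halves the number of queries on the > 32 bit path.

```cpp
#include <cstdint>

// Hypothetical stand-in for AMDGPUCodeGenPrepare::numBitsUnsigned(): counts
// the significant bits of a concrete value and records how often it is called.
static unsigned NumBitsCalls = 0;

static unsigned numBitsUnsigned(uint64_t V) {
  ++NumBitsCalls;
  unsigned Bits = 0;
  while (V) {
    ++Bits;
    V >>= 1;
  }
  return Bits;
}

// Old shape: the predicate discards the bit count it computed.
static bool isU24(uint64_t V) { return numBitsUnsigned(V) <= 24; }

static bool canUseMulU24Old(uint64_t LHS, uint64_t RHS, unsigned Size) {
  if (!(isU24(LHS) && isU24(RHS)))
    return false;
  // The wide-result check has to query both operands a second time.
  if (Size > 32 && numBitsUnsigned(LHS) + numBitsUnsigned(RHS) > 32)
    return false;
  return true;
}

// New shape: query once per operand and reuse the cached counts.
static bool canUseMulU24New(uint64_t LHS, uint64_t RHS, unsigned Size) {
  unsigned LHSBits = 0, RHSBits = 0;
  if (!((LHSBits = numBitsUnsigned(LHS)) <= 24 &&
        (RHSBits = numBitsUnsigned(RHS)) <= 24))
    return false;
  if (Size > 32 && LHSBits + RHSBits > 32)
    return false;
  return true;
}
```

Both functions return the same answer for the same inputs; for a 64-bit multiply whose operands both pass the 24-bit test, the old shape makes four queries and the new shape makes two.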

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

abinavpp created this revision. Oct 14 2021, 9:12 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Oct 14 2021, 9:12 PM

abinavpp requested review of this revision. Oct 14 2021, 9:12 PM

Herald added a project: Restricted Project. Oct 14 2021, 9:12 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B129002: Diff 379913. Oct 14 2021, 9:12 PM

abinavpp mentioned this in D111523: [AMDGPU] Fix 24 bit mul intrinsic generation for > 32 bit result. Oct 14 2021, 9:17 PM

foad added inline comments. Oct 15 2021, 2:04 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
512	I realize this check was already in isI24, but I don't see the point of it. I think the only effect it has is: on a machine that has mul_i24 but not mul_u24, we will fail to do this optimization, for no good reason.

Addressed review comment.

LGTM, thanks.

This revision is now accepted and ready to land. Oct 15 2021, 7:10 AM

This revision was landed with ongoing or failed builds. Oct 15 2021, 7:20 AM

Closed by commit rGde3038400b16: [AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare… (authored by abinavpp).

This revision was automatically updated to reflect the committed changes.

abinavpp added a commit: rGde3038400b16: [AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare….

Harbormaster completed remote builds in B129055: Diff 379992. Oct 15 2021, 7:41 AM

Revision Contents

Path: llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

Size: 28 lines

Diff 379996

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

class AMDGPUCodeGenPrepare : public FunctionPass,
/// the result of the shift operation back to \p I's original type.		/// the result of the shift operation back to \p I's original type.
///		///
/// \returns True.		/// \returns True.
bool promoteUniformBitreverseToI32(IntrinsicInst &I) const;		bool promoteUniformBitreverseToI32(IntrinsicInst &I) const;


unsigned numBitsUnsigned(Value *Op, unsigned ScalarSize) const;		unsigned numBitsUnsigned(Value *Op, unsigned ScalarSize) const;
unsigned numBitsSigned(Value *Op, unsigned ScalarSize) const;		unsigned numBitsSigned(Value *Op, unsigned ScalarSize) const;
bool isI24(Value *V, unsigned ScalarSize) const;
bool isU24(Value *V, unsigned ScalarSize) const;

/// Replace mul instructions with llvm.amdgcn.mul.u24 or llvm.amdgcn.mul.s24.		/// Replace mul instructions with llvm.amdgcn.mul.u24 or llvm.amdgcn.mul.s24.
/// SelectionDAG has an issue where an and asserting the bits are known		/// SelectionDAG has an issue where an and asserting the bits are known
bool replaceMulWithMul24(BinaryOperator &I) const;		bool replaceMulWithMul24(BinaryOperator &I) const;

/// Perform same function as equivalently named function in DAGCombiner. Since		/// Perform same function as equivalently named function in DAGCombiner. Since
/// we expand some divisions here, we need to perform this before obscuring.		/// we expand some divisions here, we need to perform this before obscuring.
bool foldBinOpIntoSelect(BinaryOperator &I) const;		bool foldBinOpIntoSelect(BinaryOperator &I) const;

unsigned AMDGPUCodeGenPrepare::numBitsSigned(Value *Op,		unsigned AMDGPUCodeGenPrepare::numBitsSigned(Value *Op,
unsigned ScalarSize) const {		unsigned ScalarSize) const {
// In order for this to be a signed 24-bit value, bit 23, must		// In order for this to be a signed 24-bit value, bit 23, must
// be a sign bit.		// be a sign bit.
return ScalarSize - ComputeNumSignBits(Op, *DL, 0, AC);		return ScalarSize - ComputeNumSignBits(Op, *DL, 0, AC);
}		}

bool AMDGPUCodeGenPrepare::isI24(Value *V, unsigned ScalarSize) const {
return ScalarSize >= 24 && // Types less than 24-bit should be treated
// as unsigned 24-bit values.
numBitsSigned(V, ScalarSize) < 24;
}

bool AMDGPUCodeGenPrepare::isU24(Value *V, unsigned ScalarSize) const {
return numBitsUnsigned(V, ScalarSize) <= 24;
}

static void extractValues(IRBuilder<> &Builder,		static void extractValues(IRBuilder<> &Builder,
SmallVectorImpl<Value *> &Values, Value *V) {		SmallVectorImpl<Value *> &Values, Value *V) {
auto *VT = dyn_cast<FixedVectorType>(V->getType());		auto *VT = dyn_cast<FixedVectorType>(V->getType());
if (!VT) {		if (!VT) {
Values.push_back(V);		Values.push_back(V);
return;		return;
}		}

bool AMDGPUCodeGenPrepare::replaceMulWithMul24(BinaryOperator &I) const {

Value *LHS = I.getOperand(0);		Value *LHS = I.getOperand(0);
Value *RHS = I.getOperand(1);		Value *RHS = I.getOperand(1);
IRBuilder<> Builder(&I);		IRBuilder<> Builder(&I);
Builder.SetCurrentDebugLocation(I.getDebugLoc());		Builder.SetCurrentDebugLocation(I.getDebugLoc());

Intrinsic::ID IntrID = Intrinsic::not_intrinsic;		Intrinsic::ID IntrID = Intrinsic::not_intrinsic;

if (ST->hasMulU24() && isU24(LHS, Size) && isU24(RHS, Size)) {		unsigned LHSBits = 0, RHSBits = 0;

		if (ST->hasMulU24() && (LHSBits = numBitsUnsigned(LHS, Size)) <= 24 &&
		(RHSBits = numBitsUnsigned(RHS, Size)) <= 24) {
// The mul24 instruction yields the low-order 32 bits. If the original		// The mul24 instruction yields the low-order 32 bits. If the original
// result and the destination is wider than 32 bits, the mul24 would		// result and the destination is wider than 32 bits, the mul24 would
// truncate the result.		// truncate the result.
if (Size > 32 &&		if (Size > 32 && LHSBits + RHSBits > 32)
numBitsUnsigned(LHS, Size) + numBitsUnsigned(RHS, Size) > 32) {
return false;		return false;
}

IntrID = Intrinsic::amdgcn_mul_u24;		IntrID = Intrinsic::amdgcn_mul_u24;
} else if (ST->hasMulI24() && isI24(LHS, Size) && isI24(RHS, Size)) {		} else if (ST->hasMulI24() &&
		(LHSBits = numBitsSigned(LHS, Size)) < 24 &&
		(RHSBits = numBitsSigned(RHS, Size)) < 24) {
// The original result is positive if its destination is wider than 32 bits		// The original result is positive if its destination is wider than 32 bits
// and its highest set bit is at bit 31. Generating mul24 and sign-extending		// and its highest set bit is at bit 31. Generating mul24 and sign-extending
// it would yield a negative value.		// it would yield a negative value.
if (Size > 32 && numBitsSigned(LHS, Size) + numBitsSigned(RHS, Size) > 30) {		if (Size > 32 && LHSBits + RHSBits > 30)
return false;		return false;
}

IntrID = Intrinsic::amdgcn_mul_i24;		IntrID = Intrinsic::amdgcn_mul_i24;
} else		} else
return false;		return false;

SmallVector<Value *, 4> LHSVals;		SmallVector<Value *, 4> LHSVals;
SmallVector<Value *, 4> RHSVals;		SmallVector<Value *, 4> RHSVals;
SmallVector<Value *, 4> ResultVals;		SmallVector<Value *, 4> ResultVals;
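
The two thresholds in the > 32 bit checks (32 for the unsigned case, 30 for the signed case) follow from the comments in the diff: the 24-bit multiply keeps only the low-order 32 bits of the product. A small sketch with concrete numbers makes the arithmetic visible. The helpers below (`mulU24`, `mulI24Bits`, `signExtend24`, `signExtend32`) are hypothetical models written for this illustration, assuming the behavior described in those comments; they are not LLVM or hardware APIs.

```cpp
#include <cstdint>

// Hypothetical model of mul_u24: multiply the low 24 bits of each operand
// as unsigned values and keep the low 32 bits of the product.
static uint32_t mulU24(uint32_t A, uint32_t B) {
  uint64_t P = uint64_t(A & 0xFFFFFF) * uint64_t(B & 0xFFFFFF);
  return static_cast<uint32_t>(P);
}

// Interpret the low 24 bits of V as a signed 24-bit value.
static int32_t signExtend24(uint32_t V) {
  V &= 0xFFFFFF;
  return (V & 0x800000) ? int32_t(V) - (1 << 24) : int32_t(V);
}

// Hypothetical model of mul_i24: multiply the sign-extended low 24 bits of
// each operand and return the low 32 bits of the product as a bit pattern.
static uint32_t mulI24Bits(uint32_t A, uint32_t B) {
  int64_t P = int64_t(signExtend24(A)) * int64_t(signExtend24(B));
  return static_cast<uint32_t>(static_cast<uint64_t>(P));
}

// Sign-extend a 32-bit pattern to 64 bits, as the wider result would be.
static int64_t signExtend32(uint32_t Bits) {
  return (Bits & 0x80000000u) ? int64_t(Bits) - (int64_t(1) << 32)
                              : int64_t(Bits);
}
```

Unsigned: 0xFFFF * 0xFFFF uses 16 + 16 = 32 bits and the product 0xFFFE0001 survives intact, while 0x10000 * 0x10000 uses 17 + 17 = 34 bits and the true product 2^32 is truncated to 0. Signed: (-2^15) * (-2^15) has 15 + 15 = 30 bits and yields 2^30 correctly, while (-2^15) * (-2^16) has 15 + 16 = 31 bits and the true product 2^31 sets bit 31, so sign-extending the 32-bit result gives -2^31 instead.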