The 24-bit mul intrinsics yield the low-order 32 bits of the product. We should only
do the transformation if the operands are known to be no wider than 24
bits and the result is known to be no wider than 32 bits.
This fixes https://github.com/RadeonOpenCompute/ROCm/issues/1383, which is tracked by SWDEV-273166.
There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.
No, I think this patch makes sense. We were only checking that the inputs fit in 24 bits. A full 24-bit multiply would have a 48 bit result, which you could safely extend to 64 bits. But mul24 gives you a truncated 32-bit result, which you can't safely extend to 64 bits.
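To make this concrete, here is a small standalone C++ illustration (not from the patch) of why the truncated low 32 bits of a 24-bit multiply cannot simply be zero-extended back to 64 bits:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Both operands fit in 24 bits (unsigned).
  uint64_t a = 0xFFFFFF; // 2^24 - 1
  uint64_t b = 0xFFFFFF; // 2^24 - 1

  // What a 64-bit mul should produce: the full 48-bit product.
  uint64_t full = a * b; // 0xFFFFFE000001

  // What mul24 produces: only the low-order 32 bits of the product.
  // Zero-extending that to 64 bits cannot recover the lost high bits.
  uint64_t truncated = static_cast<uint32_t>(a * b); // 0xFE000001

  printf("full      = 0x%llx\n", (unsigned long long)full);
  printf("truncated = 0x%llx\n", (unsigned long long)truncated);
  return 0;
}
```

For a 32-bit destination the truncation is harmless, because a plain mul i32 would wrap in exactly the same way; the problem only arises when the destination is wider than 32 bits.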
Should we check numBits(LHS) + numBits(RHS) <= DstTy.sizeInBits() || Size <= 32 instead?
No, that would still go wrong for the original case of multiplying two 24-bit numbers to get a 64-bit result, because the intermediate 48-bit result will get truncated to 32 bits by the mul24 instruction before it gets extended to 64 bits.
I think numBits(LHS) + numBits(RHS) <= 32 || Size <= 32 might be OK. (Assuming you have already checked numBits(LHS) <= 24 and numBits(RHS) <= 24.)
We only need to make sure that we're conforming to the semantics of the 24-bit
low-order mul instructions in the AMDGPU ISA. I agree with @foad; we should do
the following:
    if (mul is wider than 32 bits) {
      if (numBits(a) > 24 || numBits(b) > 24)
        return false;
      // Check if numBits(mul(a, b)) > 32
      if (numBits(a) + numBits(b) > 32)
        return false;
    }
Ideally, we're supposed to split (mul i64 a, b) into (build_pair i64 (mul24hi a,
b), (mul24 a, b)), like getMul24() in AMDGPUISelLowering.cpp does. I'm not
sure about doing that in LLVM IR.
You now need to add tests where we prove the result fits in 32 bits and so use a 24-bit mul.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:523
Should we use a signed check for signed mul?
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:454
I think it would make more sense for this to return ScalarSize - ComputeNumSignBits(Op, *DL, 0, AC) - 1. This is the maximum size that Op could be truncated to, and sign extended to give you the original value. This is analogous to numBitsUnsigned, which tells you the maximum size you could truncate to and then zero extend to get the original value.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:515
This comment just tells you what the "if" condition is, it doesn't really explain why.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:516
I don't think > 31 is safe here. For example it will transform this function:

    define amdgpu_ps i64 @test(i64 %x, i64 %y) {
      %xx = trunc i64 %x to i16
      %xxx = sext i16 %xx to i64
      %yy = trunc i64 %y to i17
      %yyy = sext i17 %yy to i64
      %mul = mul i64 %xxx, %yyy
      ret i64 %mul
    }

into:

    define amdgpu_ps i64 @test(i64 %x, i64 %y) {
      %xx = trunc i64 %x to i16
      %xxx = sext i16 %xx to i64
      %yy = trunc i64 %y to i17
      %yyy = sext i17 %yy to i64
      %1 = trunc i64 %xxx to i32
      %2 = trunc i64 %yyy to i32
      %3 = call i32 @llvm.amdgcn.mul.i24(i32 %1, i32 %2)
      %mul = sext i32 %3 to i64
      ret i64 %mul
    }

If %x is -0x8000 and %y is -0x10000 then the result should be +0x80000000, but the transformed code would return -0x80000000. I think this condition needs to be > 30 instead.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:517
isU24 already includes a call to numBitsUnsigned, so maybe remove the isU24 and isI24 wrappers altogether and just call the lower level function once for each argument:

    unsigned LHSBits = numBitsUnsigned(LHS, Size);
    unsigned RHSBits = numBitsUnsigned(RHS, Size);
    if (ST->hasMulU24() && LHSBits <= 24 && RHSBits <= 24 &&
        (Size <= 32 || LHSBits + RHSBits <= 32)) {
      IntrID = Intrinsic::amdgcn_mul_u24;
    }

... and similarly for the signed case.
Do we not emit 24 bit mulhi in the IR pass? Could we just start emitting the full 48 bit computation?
I couldn't find the 24-bit mulhi intrinsic in IntrinsicsAMDGPU.td. At the
moment, AMDGPUTargetLowering::performMulCombine() of AMDGPUISelLowering.cpp is
creating the 48-bit mul if this bails out.
Ideally we should generate the 48-bit mul here like getMul24() in
AMDGPUISelLowering.cpp does. I'm not sure about the right approach for that. If
we create a 24-bit mulhi intrinsic, then store the 24-bit mul and 24-bit mulhi
results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64
uses with { i16, i32 }? Otherwise, we can create a 48-bit mul intrinsic of
type i64 (i32, i32) and split it during lowering. What do you think?
You would have to add IR intrinsics for this. The only reason we do this in the IR is due to weakness in the known bits handling in the DAG since it can't see across blocks. We'll get some simple cases in the DAG now, but miss more complex cases.
> If we create a 24-bit mulhi intrinsic, then store the 24-bit mul and 24-bit mulhi results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64 uses with { i16, i32 }? Otherwise, we can create a 48-bit mul intrinsic of type i64 (i32, i32) and split it during lowering. What do you think?
The result of the intrinsic should be 32-bit. The canonical way to merge two integer values in the IR would be to do an extend, shift, and or.
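For illustration, here is a rough IRBuilder sketch (not code from this patch) of that extend/shift/or merge. The buildMul48 helper name is made up, and the high-half intrinsic it calls (written as Intrinsic::amdgcn_mulhi_u24) is the hypothetical 24-bit mulhi discussed above, which does not exist in IntrinsicsAMDGPU.td at this point:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
using namespace llvm;

// Hypothetical sketch: build the full 48-bit product of two 24-bit unsigned
// operands as an i64, using the existing low-half mul24 intrinsic and a
// proposed high-half intrinsic, then merging the two i32 results.
static Value *buildMul48(IRBuilder<> &Builder, Value *LHS, Value *RHS) {
  Type *I64 = Builder.getInt64Ty();

  // Low-order 32 bits of the 48-bit product.
  Value *Lo = Builder.CreateIntrinsic(Intrinsic::amdgcn_mul_u24, {},
                                      {LHS, RHS});
  // High 16 bits of the 48-bit product, in the low bits of an i32.
  // amdgcn_mulhi_u24 is the intrinsic this thread is proposing, not
  // something that exists at the time of this discussion.
  Value *Hi = Builder.CreateIntrinsic(Intrinsic::amdgcn_mulhi_u24, {},
                                      {LHS, RHS});

  // Canonical merge: zero-extend both halves to i64, shift the high half
  // up by 32, and or the pieces together.
  Value *Lo64 = Builder.CreateZExt(Lo, I64);
  Value *Hi64 = Builder.CreateShl(Builder.CreateZExt(Hi, I64), 32);
  return Builder.CreateOr(Hi64, Lo64);
}
```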
Referencing the ROCm 5.2 issue, which seems to be (still?) affected by the problem being fixed here.