This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix 24 bit mul intrinsic generation for > 32 bit result.
Closed, Public

Authored by abinavpp on Oct 10 2021, 11:23 PM.

Details

Summary

The 24-bit mul intrinsics yield only the low-order 32 bits of the product. We
should only do the transformation if the operands are known to be no wider
than 24 bits and the result is known to be no wider than 32 bits.
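A minimal standalone sketch of the hazard, assuming the helper mul_u24 below models the semantics of llvm.amdgcn.mul.u24 (low-order 32 bits of the 24-bit product); it is an illustration, not code from the patch:

#include <cassert>
#include <cstdint>

// mul_u24 models V_MUL_U32_U24: multiply the low 24 bits of each operand
// and keep only the low-order 32 bits of the product.
static uint32_t mul_u24(uint32_t a, uint32_t b) {
  return (uint32_t)((uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF));
}

int main() {
  uint64_t a = 0x800000, b = 0x800000; // both fit in 24 bits (2^23)
  uint64_t full = a * b;               // 2^46, needs 47 bits
  uint64_t widened = mul_u24(a, b);    // zero extend the 32-bit result
  assert(full == 0x400000000000ULL);
  assert(widened == 0);                // the low 32 bits of 2^46 are zero
  return 0;
}

Both operands fit in 24 bits, yet zero extending the 32-bit mul24 result to 64 bits loses the product entirely.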

Diff Detail

Event Timeline

abinavpp created this revision.Oct 10 2021, 11:23 PM
abinavpp requested review of this revision.Oct 10 2021, 11:23 PM
Herald added a project: Restricted Project.Oct 10 2021, 11:23 PM
abinavpp updated this revision to Diff 378581.Oct 11 2021, 12:36 AM

Reworded the comment.

abinavpp retitled this revision from [AMDGPU] Don't emit 24 bit mul for > 32 bit result. to [AMDGPU] Don't emit 24 bit mul intrinsic for > 32 bit result..Oct 11 2021, 12:37 AM
abinavpp edited the summary of this revision. (Show Details)

There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.

foad added a comment.Oct 11 2021, 12:25 PM

> There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.

No, I think this patch makes sense. We were only checking that the inputs fit in 24 bits. A full 24-bit multiply would have a 48-bit result, which you could safely extend to 64 bits. But mul24 gives you a truncated 32-bit result, which you can't safely extend to 64 bits.

>> There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.

> No, I think this patch makes sense. We were only checking that the inputs fit in 24 bits. A full 24-bit multiply would have a 48-bit result, which you could safely extend to 64 bits. But mul24 gives you a truncated 32-bit result, which you can't safely extend to 64 bits.

Should we check numBits(LHS) + numBits(RHS) <= DstTy.sizeInBits() || Size <= 32 instead?

foad added a comment.Oct 12 2021, 12:48 AM

>>> There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.

>> No, I think this patch makes sense. We were only checking that the inputs fit in 24 bits. A full 24-bit multiply would have a 48-bit result, which you could safely extend to 64 bits. But mul24 gives you a truncated 32-bit result, which you can't safely extend to 64 bits.

> Should we check numBits(LHS) + numBits(RHS) <= DstTy.sizeInBits() || Size <= 32 instead?

No, that would still go wrong for the original case of multiplying two 24-bit numbers to get a 64-bit result, because the intermediate 48-bit result will get truncated to 32 bits by the mul24 instruction before it gets extended to 64 bits.

foad added a comment.Oct 12 2021, 12:51 AM

I think numBits(LHS) + numBits(RHS) <= 32 || Size <= 32 might be OK. (Assuming you have already checked numBits(LHS) <= 24 and numBits(RHS) <= 24.)
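As a hedged, self-contained check of the two candidate guards on the failing case (the names just mirror the discussion; DstTy.sizeInBits() and Size are the same quantity here):

#include <cassert>

int main() {
  unsigned LHSBits = 24, RHSBits = 24; // numBits of each operand
  unsigned Size = 64;                  // i64 destination, DstTy.sizeInBits()

  bool proposed = LHSBits + RHSBits <= Size || Size <= 32; // 48 <= 64 passes
  bool fixed = LHSBits + RHSBits <= 32 || Size <= 32;      // 48 > 32 fails

  assert(proposed); // would wrongly allow the transformation
  assert(!fixed);   // correctly bails out
  return 0;
}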

We only need to make sure that we're conforming to the semantics of the 24-bit
low-order mul instructions in the AMDGPU ISA. I agree with @foad; we should do
the following:

if (mul is wider than 32 bits) {
  if (numBits(a) > 24 || numBits(b) > 24)
    return false;

  // Check if numBits(mul(a, b)) > 32
  if (numBits(a) + numBits(b) > 32)
    return false;
}

Ideally, we're supposed to split (mul i64 a, b) into (build_pair i64 (mul24hi a,
b), (mul24 a, b)), the way getMul24() in AMDGPUISelLowering.cpp does. I'm not
sure about doing that in LLVM IR.

abinavpp updated this revision to Diff 379027.Oct 12 2021, 7:11 AM

Addressed review comments.

abinavpp retitled this revision from [AMDGPU] Don't emit 24 bit mul intrinsic for > 32 bit result. to [AMDGPU] Fix 24 bit mul intrinsic generation for > 32 bit result..Oct 12 2021, 7:12 AM
abinavpp edited the summary of this revision. (Show Details)
foad added inline comments.Oct 12 2021, 7:38 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
511–512

These checks are already done below, and done better because they can use the u24 or i24 instruction as appropriate.

515

This check probably needs to be done after we've decided whether to use the u24 or i24 instruction.

abinavpp updated this revision to Diff 379046.Oct 12 2021, 8:15 AM
abinavpp marked 2 inline comments as done.

Addressed review comments.

You now need to add tests where we prove the result fits in 32 bits and use the 24-bit mul.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
534

Should we use a signed check for signed mul?

abinavpp updated this revision to Diff 379328.Oct 13 2021, 3:35 AM
abinavpp marked an inline comment as done.

Addressed review comments.

This revision is now accepted and ready to land.Oct 13 2021, 9:50 AM
foad added inline comments.Oct 14 2021, 6:39 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
454

I think it would make more sense for this to return ScalarSize - ComputeNumSignBits(Op, *DL, 0, AC) + 1. This is the minimum size that Op could be truncated to, and sign extended to give you the original value. (For example, -0x8000 as an i64 has 49 sign bits, so this gives 64 - 49 + 1 = 16: it can be truncated to i16 and sign extended back.) This is analogous to numBitsUnsigned, which tells you the minimum size you could truncate to and then zero extend to get the original value.

515

This comment just tells you what the "if" condition is; it doesn't really explain why.

517

isU24 already includes a call to numBitsUnsigned, so maybe remove the isU24 and isI24 wrappers altogether and just call the lower-level function once for each argument:

unsigned LHSBits = numBitsUnsigned(LHS, Size);
unsigned RHSBits = numBitsUnsigned(RHS, Size);
if (ST->hasMulU24() && LHSBits <= 24 && RHSBits <= 24 &&
    (Size <= 32 || LHSBits + RHSBits <= 32)) {
  IntrID = Intrinsic::amdgcn_mul_u24;
}

... and similarly for the signed case.
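A hedged sketch of that signed case (not code from this review), assuming numBitsSigned returns the minimum signed width including the sign bit, per the suggestion for line 454 above:

unsigned LHSBits = numBitsSigned(LHS, Size);
unsigned RHSBits = numBitsSigned(RHS, Size);
// A product of a W1-bit and a W2-bit signed value needs at most W1 + W2
// signed bits, so the same 32-bit bound applies.
if (ST->hasMulI24() && LHSBits <= 24 && RHSBits <= 24 &&
    (Size <= 32 || LHSBits + RHSBits <= 32)) {
  IntrID = Intrinsic::amdgcn_mul_i24;
}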

523

I don't think > 31 is safe here. For example, it will transform this function:

define amdgpu_ps i64 @test(i64 %x, i64 %y) {
  %xx = trunc i64 %x to i16
  %xxx = sext i16 %xx to i64
  %yy = trunc i64 %y to i17
  %yyy = sext i17 %yy to i64
  %mul = mul i64 %xxx, %yyy
  ret i64 %mul
}

into:

define amdgpu_ps i64 @test(i64 %x, i64 %y) {
  %xx = trunc i64 %x to i16
  %xxx = sext i16 %xx to i64
  %yy = trunc i64 %y to i17
  %yyy = sext i17 %yy to i64
  %1 = trunc i64 %xxx to i32
  %2 = trunc i64 %yyy to i32
  %3 = call i32 @llvm.amdgcn.mul.i24(i32 %1, i32 %2)
  %mul = sext i32 %3 to i64
  ret i64 %mul
}

If %x is -0x8000 and %y is -0x10000 then the result should be +0x80000000, but the transformed code would return -0x80000000.

I think this condition needs to be > 30 instead.
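The arithmetic of that counterexample can be checked with a small standalone C++ program (a verification sketch, not part of the patch):

#include <cassert>
#include <cstdint>

int main() {
  int64_t x = -0x8000;  // fits in i16
  int64_t y = -0x10000; // fits in i17
  int64_t full = x * y; // +0x80000000, which needs 33 signed bits

  // What the transformed code computes: the product truncated to 32 bits
  // (the mul.i24 result), then sign extended back to 64 bits.
  int32_t lo32 = (int32_t)(uint32_t)(x * y);
  int64_t widened = lo32;

  assert(full == 0x80000000LL);
  assert(widened == -0x80000000LL); // the sign flipped
  return 0;
}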

abinavpp added inline comments.Oct 14 2021, 9:17 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
515

Addressed in D111823.

517

Addressed in D111864.

523

Addressed in D111823.

Do we not emit 24-bit mulhi in the IR pass? Could we just start emitting the full 48-bit computation?

abinavpp added a comment (edited).Oct 18 2021, 11:47 PM

> Do we not emit 24-bit mulhi in the IR pass? Could we just start emitting the full 48-bit computation?

I couldn't find a 24-bit mulhi intrinsic in IntrinsicsAMDGPU.td. At the moment,
AMDGPUTargetLowering::performMulCombine() in AMDGPUISelLowering.cpp creates the
48-bit mul if this pass bails out.

Ideally we should generate the 48-bit mul here the way getMul24() in
AMDGPUISelLowering.cpp does. I'm not sure about the right approach for that. If
we create a 24-bit mulhi intrinsic and store the 24-bit mul and 24-bit mulhi
results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64
uses with the { i16, i32 } value? Alternatively, we could create a 48-bit mul
intrinsic of type i64 (i32, i32) and split it during lowering. What do you think?

>> Do we not emit 24-bit mulhi in the IR pass? Could we just start emitting the full 48-bit computation?

> I couldn't find a 24-bit mulhi intrinsic in IntrinsicsAMDGPU.td. At the moment, AMDGPUTargetLowering::performMulCombine() in AMDGPUISelLowering.cpp creates the 48-bit mul if this pass bails out.

You would have to add IR intrinsics for this. The only reason we do this in the IR is due to weakness in the known bits handling in the DAG since it can't see across blocks. We'll get some simple cases in the DAG now, but miss more complex cases.

> Ideally we should generate the 48-bit mul here the way getMul24() in AMDGPUISelLowering.cpp does. I'm not sure about the right approach for that. If we create a 24-bit mulhi intrinsic and store the 24-bit mul and 24-bit mulhi results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64 uses with the { i16, i32 } value? Alternatively, we could create a 48-bit mul intrinsic of type i64 (i32, i32) and split it during lowering. What do you think?

The result of the intrinsic should be 32-bit. The canonical way to merge two integer values in the IR would be an extend, a shift, and an or.
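For illustration, a hedged model of that extend/shift/or idiom in plain C++ (mul24_hi stands in for the not-yet-existing mulhi intrinsic; these are models, not the real intrinsics):

#include <cassert>
#include <cstdint>

// mul24_lo models mul24: the low 32 bits of the 48-bit product.
// mul24_hi models a hypothetical mul24hi: bits 32..47 of that product.
static uint32_t mul24_lo(uint32_t a, uint32_t b) {
  return (uint32_t)((uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF));
}
static uint32_t mul24_hi(uint32_t a, uint32_t b) {
  return (uint32_t)(((uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF)) >> 32);
}

// Extend, shift and or: rebuild the full product from the two halves.
static uint64_t mul24_full(uint32_t a, uint32_t b) {
  return ((uint64_t)mul24_hi(a, b) << 32) | mul24_lo(a, b);
}

int main() {
  assert(mul24_full(0x800000, 0x800000) == 0x400000000000ULL); // 2^23 * 2^23
  return 0;
}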

>>> Do we not emit 24-bit mulhi in the IR pass? Could we just start emitting the full 48-bit computation?

>> I couldn't find a 24-bit mulhi intrinsic in IntrinsicsAMDGPU.td. At the moment, AMDGPUTargetLowering::performMulCombine() in AMDGPUISelLowering.cpp creates the 48-bit mul if this pass bails out.

> You would have to add IR intrinsics for this. The only reason we do this in the IR is due to weakness in the known bits handling in the DAG since it can't see across blocks. We'll get some simple cases in the DAG now, but miss more complex cases.

>> Ideally we should generate the 48-bit mul here the way getMul24() in AMDGPUISelLowering.cpp does. I'm not sure about the right approach for that. If we create a 24-bit mulhi intrinsic and store the 24-bit mul and 24-bit mulhi results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64 uses with the { i16, i32 } value? Alternatively, we could create a 48-bit mul intrinsic of type i64 (i32, i32) and split it during lowering. What do you think?

> The result of the intrinsic should be 32-bit. The canonical way to merge two integer values in the IR would be an extend, a shift, and an or.

Addressed in D112394 and D112395.

Herald added a project: Restricted Project.Jul 6 2022, 9:09 AM

Referencing a ROCm 5.2 issue, which seems to be (still?) affected by the problem being fixed here.