The 24-bit mul intrinsics yield the low-order 32 bits of the product. We should only
do the transformation if the operands are known to be no wider than 24
bits and the result is known to be no wider than 32 bits.
This fixes https://github.com/RadeonOpenCompute/ROCm/issues/1383, which is tracked by SWDEV-273166.
There should be nothing wrong with mul24 regardless of the destination type. If a 64-bit mul fits in 24 bits, it can still use a 24-bit mul, just extended to 64 bits.
No, I think this patch makes sense. We were only checking that the inputs fit in 24 bits. A full 24-bit multiply would have a 48 bit result, which you could safely extend to 64 bits. But mul24 gives you a truncated 32-bit result, which you can't safely extend to 64 bits.
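To make this concrete, here is a small standalone C++ illustration (not from the patch) of why the truncated low 32 bits of a 24-bit multiply cannot simply be zero-extended back to 64 bits:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Both operands fit in 24 bits (unsigned).
  uint64_t a = 0xFFFFFF; // 2^24 - 1
  uint64_t b = 0xFFFFFF; // 2^24 - 1

  // What a 64-bit mul should produce: the full 48-bit product.
  uint64_t full = a * b; // 0xFFFFFE000001

  // What mul24 produces: only the low-order 32 bits of the product.
  // Zero-extending that to 64 bits cannot recover the lost high bits.
  uint64_t truncated = static_cast<uint32_t>(a * b); // 0xFE000001

  printf("full      = 0x%llx\n", (unsigned long long)full);
  printf("truncated = 0x%llx\n", (unsigned long long)truncated);
  return 0;
}
```

For a 32-bit destination the truncation is harmless, because a plain mul i32 would wrap in exactly the same way; the problem only arises when the destination is wider than 32 bits.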
Should we check numBits(LHS) + numBits(RHS) <= DstTy.sizeInBits() || Size <= 32 instead?
No, that would still go wrong for the original case of multiplying two 24-bit numbers to get a 64-bit result, because the intermediate 48-bit result will get truncated to 32 bits by the mul24 instruction before it gets extended to 64 bits.
I think numBits(LHS) + numBits(RHS) <= 32 || Size <= 32 might be OK. (Assuming you have already checked numBits(LHS) <= 24 and numBits(RHS) <= 24.)
We only need to make sure that we're conforming to the semantics of the 24-bit
low-order mul instructions in the AMDGPU ISA. I agree with @foad; we should do
the following:
    if (mul is wider than 32 bits) {
      if (numBits(a) > 24 || numBits(b) > 24)
        return false;
      // Check if numBits(mul(a, b)) > 32
      if (numBits(a) + numBits(b) > 32)
        return false;
    }
Ideally, we're supposed to split (mul i64 a, b) into (build_pair i64 (mul24hi a,
b), (mul24 a, b)), like getMul24() in AMDGPUISelLowering.cpp does. I'm not
sure about doing that in LLVM IR.
You now need to add tests where we prove the result fits in 32 bits and so use a 24-bit mul.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:523
Should we use a signed check for signed mul?
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:454
I think it would make more sense for this to return ScalarSize - ComputeNumSignBits(Op, *DL, 0, AC) - 1. This is the maximum size that Op could be truncated to, and sign extended to give you the original value. This is analogous to numBitsUnsigned, which tells you the maximum size you could truncate to and then zero extend to get the original value.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:515
This comment just tells you what the "if" condition is, it doesn't really explain why.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:516
I don't think > 31 is safe here. For example it will transform this function:

    define amdgpu_ps i64 @test(i64 %x, i64 %y) {
      %xx = trunc i64 %x to i16
      %xxx = sext i16 %xx to i64
      %yy = trunc i64 %y to i17
      %yyy = sext i17 %yy to i64
      %mul = mul i64 %xxx, %yyy
      ret i64 %mul
    }

into:

    define amdgpu_ps i64 @test(i64 %x, i64 %y) {
      %xx = trunc i64 %x to i16
      %xxx = sext i16 %xx to i64
      %yy = trunc i64 %y to i17
      %yyy = sext i17 %yy to i64
      %1 = trunc i64 %xxx to i32
      %2 = trunc i64 %yyy to i32
      %3 = call i32 @llvm.amdgcn.mul.i24(i32 %1, i32 %2)
      %mul = sext i32 %3 to i64
      ret i64 %mul
    }

If %x is -0x8000 and %y is -0x10000 then the result should be +0x80000000, but the transformed code would return -0x80000000. I think this condition needs to be > 30 instead.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:517
isU24 already includes a call to numBitsUnsigned, so maybe remove the isU24 and isI24 wrappers altogether and just call the lower level function once for each argument:

    unsigned LHSBits = numBitsUnsigned(LHS, Size);
    unsigned RHSBits = numBitsUnsigned(RHS, Size);
    if (ST->hasMulU24() && LHSBits <= 24 && RHSBits <= 24 &&
        (Size <= 32 || LHSBits + RHSBits <= 32)) {
      IntrID = Intrinsic::amdgcn_mul_u24;
    }

... and similarly for the signed case.
Do we not emit 24 bit mulhi in the IR pass? Could we just start emitting the full 48 bit computation?
I couldn't find the 24-bit mulhi intrinsic in IntrinsicsAMDGPU.td. At the
moment, AMDGPUTargetLowering::performMulCombine() of AMDGPUISelLowering.cpp is
creating the 48-bit mul if this bails out.
Ideally we should generate the 48-bit mul here like getMul24() in
AMDGPUISelLowering.cpp does. I'm not sure about the right approach for that. If
we create a 24-bit mulhi intrinsic, then store the 24-bit mul and 24-bit mulhi
results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64
uses with { i16, i32 }? Otherwise, we can create a 48-bit mul intrinsic of
type i64 (i32, i32) and split it during lowering. What do you think?
You would have to add IR intrinsics for this. The only reason we do this in the IR is due to weakness in the known bits handling in the DAG since it can't see across blocks. We'll get some simple cases in the DAG now, but miss more complex cases.
> If we create a 24-bit mulhi intrinsic, then store the 24-bit mul and 24-bit mulhi results in a { i16, i32 } struct, how would we replaceAllUsesWith() the i64 uses with { i16, i32 }? Otherwise, we can create a 48-bit mul intrinsic of type i64 (i32, i32) and split it during lowering. What do you think?
The result of the intrinsic should be 32-bit. The canonical way to merge two integer values in the IR would be to do an extend, shift, and or.
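For illustration, here is a rough IRBuilder sketch (not code from this patch) of that extend/shift/or merge. The buildMul48 helper name is made up, and the high-half intrinsic it calls (written as Intrinsic::amdgcn_mulhi_u24) is the hypothetical 24-bit mulhi discussed above, which does not exist in IntrinsicsAMDGPU.td at this point:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
using namespace llvm;

// Hypothetical sketch: build the full 48-bit product of two 24-bit unsigned
// operands as an i64, using the existing low-half mul24 intrinsic and a
// proposed high-half intrinsic, then merging the two i32 results.
static Value *buildMul48(IRBuilder<> &Builder, Value *LHS, Value *RHS) {
  Type *I64 = Builder.getInt64Ty();

  // Low-order 32 bits of the 48-bit product.
  Value *Lo = Builder.CreateIntrinsic(Intrinsic::amdgcn_mul_u24, {},
                                      {LHS, RHS});
  // High 16 bits of the 48-bit product, in the low bits of an i32.
  // amdgcn_mulhi_u24 is the intrinsic this thread is proposing, not
  // something that exists at the time of this discussion.
  Value *Hi = Builder.CreateIntrinsic(Intrinsic::amdgcn_mulhi_u24, {},
                                      {LHS, RHS});

  // Canonical merge: zero-extend both halves to i64, shift the high half
  // up by 32, and or the pieces together.
  Value *Lo64 = Builder.CreateZExt(Lo, I64);
  Value *Hi64 = Builder.CreateShl(Builder.CreateZExt(Hi, I64), 32);
  return Builder.CreateOr(Hi64, Lo64);
}
```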
Referencing the ROCm 5.2 issue, which seems to be (still?) affected by the problem being fixed here.