This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Try to combine FMUL with FDIV
ClosedPublic

Authored by jaykang10 on Jul 28 2023, 7:03 AM.

Details

Summary

gcc generates less instructions than llvm from below example.

float foo(int state) {
    return (float)state / 2;
}

gcc output

foo:
  scvtf s0, w0, #1
  ret

llvm output

foo:
  scvtf s0, w0
  fmov s1, #0.50000000
  fmul s0, s0, s1
  ret

gcc converts the float division to float multiplication like X / C --> X * (1 / C), and it has a pattern aarch64_scvtfsisf2_mult with float multiplication for scvtf.
llvm also converts fdiv to fmul in InstCombine pass like X / C --> X * (1 / C) but it does not have ISel codes with fmul for scvtf.
If fmul's constant operand is the reciprocal of a power of 2 like (1/2^n) and the other operand is SINT_TO_FP, we can try X * (1 / C) --> X / C because it will be matched with scvtf patterns with fixed-point.
With this patch, the llvm's output is as below.

foo:
        scvtf   s0, w0, #1
        ret

Diff Detail

Event Timeline

jaykang10 created this revision.Jul 28 2023, 7:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 28 2023, 7:03 AM
jaykang10 requested review of this revision.Jul 28 2023, 7:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 28 2023, 7:03 AM

Hi @jaykang10, perhaps I'm being over-cautious here, but I'm not sure of the licensing implications of posting GCC source code in the commit message for a LLVM patch? It feels like this might be problematic.

Hi @jaykang10, perhaps I'm being over-cautious here, but I'm not sure of the licensing implications of posting GCC source code in the commit message for a LLVM patch? It feels like this might be problematic.

Ah, sorry. I did not know that. Let me remove the gcc source code.
Thanks for letting me know @david-arm

jaykang10 edited the summary of this revision. (Show Details)Jul 28 2023, 7:11 AM

All of these examples seem to canonicalize to fmul in the midend: https://godbolt.org/z/hqPv3azjf
Is it worth keeping the currently lowering for fdiv(sitofp)? Or should we just change that to work with fmul?

jaykang10 added a comment.EditedJul 28 2023, 10:30 AM

All of these examples seem to canonicalize to fmul in the midend: https://godbolt.org/z/hqPv3azjf
Is it worth keeping the currently lowering for fdiv(sitofp)? Or should we just change that to work with fmul?

I guess you are mentioning performFDivCombine function. As you can see, the function convert the fdiv(sitofp) into Intrinsic::aarch64_neon_vcvtfxs2fp which is vector version and it is matched with below pattern.

multiclass SIMDFPScalarRShift<bit U, bits<5> opc, string asm> {
  let Predicates = [HasNEON, HasFullFP16] in {
  def h : BaseSIMDScalarShift<U, opc, {0,0,1,?,?,?,?},
                              FPR16, FPR16, vecshiftR16, asm, []> { 
    let Inst{19-16} = imm{3-0};
  }
  } // Predicates = [HasNEON, HasFullFP16]
  def s : BaseSIMDScalarShift<U, opc, {0,1,?,?,?,?,?},
                              FPR32, FPR32, vecshiftR32, asm, []> { 
    let Inst{20-16} = imm{4-0};
  }
  def d : BaseSIMDScalarShift<U, opc, {1,?,?,?,?,?,?},
                              FPR64, FPR64, vecshiftR64, asm, []> { 
    let Inst{21-16} = imm{5-0};
  }
}
...
defm SCVTF  : SIMDFPScalarRShift<0, 0b11100, "scvtf">;
...
def : Pat<(int_aarch64_neon_vcvtfxs2fp FPR32:$Rn, vecshiftR32:$imm),
          (SCVTFs FPR32:$Rn, vecshiftR32:$imm)>;

As you can see on the multiclass SIMDFPScalarRShift, the MIR definition expects FPR register classes for input/output. It causes COPY MIR, which is fmov, between FPR and GPR.
For scalar version, AArch64 target has below patterns.

multiclass IntegerToFP<bit isUnsigned, string asm, SDPatternOperator node> {
...
  def SWSri: BaseIntegerToFP<isUnsigned, GPR32, FPR32, fixedpoint_f32_i32, asm,
                             [(set FPR32:$Rd,
                                   (fdiv (node GPR32:$Rn),
                                         fixedpoint_f32_i32:$scale))]> {
    let Inst{31} = 0; // 32-bit GPR flag
    let Inst{23-22} = 0b00; // 32-bit FPR flag
    let scale{5} = 1;
  }
...
defm SCVTF : IntegerToFP<0, "scvtf", any_sint_to_fp>;

We need to keep fdiv(sitofp) node to match above pattern.
In order to use current patterns for scalar version, I have converted fmul to fdiv using dagcombine with fmul.
Do you want to add some code, which handles fmul, in performFDivCombine? I am not sure whether that is better than this patch or not...

Oh I see, I hadn't spotted performFDivCombine. It was the scalar patterns I was thinking about, via IntegerToFP and fixedpoint_f32_i32 and SelectCVTFixedPosOperand.

Does it make sense to keep them selection fdiv, or should they always just match fmul? It would seem we only need one, and fmul is more canonical.

Oh I see, I hadn't spotted performFDivCombine. It was the scalar patterns I was thinking about, via IntegerToFP and fixedpoint_f32_i32 and SelectCVTFixedPosOperand.

Ah, sorry, I thought you mentioned the performFDivCombine function because I was not able to find the custom lowering code for FDIV except SVE.

Does it make sense to keep them selection fdiv, or should they always just match fmul? It would seem we only need one, and fmul is more canonical.

If possible, I would like to keep the existing fdiv pattern in this patch.
I saw @samtebbs changed the SelectCVTFixedPosOperand function so I added him as a reviewer and if possible, I would like to get his opinion too.

Matt added a subscriber: Matt.Jul 31 2023, 12:10 PM

I think this is good as it is, although I'm not 100% sure on the fact that we need to get it converted to aarch64_neon_vcvtfxs2fp first, as if something goes wrong there then we'll miss out on this optimisation. If you can think of a way to circumvent the need for that and go directly to the scvtf that would be good, otherwise this looks good to me.

I think this is good as it is, although I'm not 100% sure on the fact that we need to get it converted to aarch64_neon_vcvtfxs2fp first, as if something goes wrong there then we'll miss out on this optimisation. If you can think of a way to circumvent the need for that and go directly to the scvtf that would be good, otherwise this looks good to me.

Thanks for comment.
There is a comment in AArch64InstrInfo.td as below and it looks there was TableGen issue to generate the simd type instruction directly.

defm FCVTZS : SIMDFPScalarRShift<0, 0b11111, "fcvtzs">;
defm FCVTZU : SIMDFPScalarRShift<1, 0b11111, "fcvtzu">;
defm SCVTF  : SIMDFPScalarRShift<0, 0b11100, "scvtf">;
defm UCVTF  : SIMDFPScalarRShift<1, 0b11100, "ucvtf">;
// Codegen patterns for the above. We don't put these directly on the
// instructions because TableGen's type inference can't handle the truth.
// Having the same base pattern for fp <--> int totally freaks it out.

Does it make sense to keep them selection fdiv, or should they always just match fmul? It would seem we only need one, and fmul is more canonical.

Additionally, multiclass FPToIntegerScaled and multiclass IntegerToFP share the SelectCVTFixedPosOperand as below.

class fixedpoint_i32<ValueType FloatVT>
  : Operand<FloatVT>,
    ComplexPattern<FloatVT, 1, "SelectCVTFixedPosOperand<32>", [fpimm, ld]> {
  let EncoderMethod = "getFixedPointScaleOpValue";
  let DecoderMethod = "DecodeFixedPointScaleImm32";
  let ParserMatchClass = Imm1_32Operand;
}
...

def fixedpoint_f32_i64 : fixedpoint_i64<f32>;
...

multiclass FPToIntegerScaled<bits<2> rmode, bits<3> opcode, string asm,
                             SDPatternOperator OpN> {
...
  def SXSri : BaseFPToInteger<0b00, rmode, opcode, FPR32, GPR64,
                              fixedpoint_f32_i64, asm,
              [(set GPR64:$Rd, (OpN (fmul FPR32:$Rn,
                                          fixedpoint_f32_i64:$scale)))]> {
    let Inst{31} = 1; // 64-bit GPR flag
  }
...
multiclass IntegerToFP<bit isUnsigned, string asm, SDPatternOperator node> {
...
  def SXSri: BaseIntegerToFP<isUnsigned, GPR64, FPR32, fixedpoint_f32_i64, asm,
                             [(set FPR32:$Rd,
                                   (fdiv (node GPR64:$Rn),
                                         fixedpoint_f32_i64:$scale))]> {
    let Inst{31} = 1; // 64-bit GPR flag
    let Inst{23-22} = 0b00; // 32-bit FPR flag
  }

The SelectCVTFixedPosOperand is for fcvt and checks the constant is 2^fbits.
If we keep the fdiv for scvtf, we can use SelectCVTFixedPosOperand.
If we replace the fdiv with fmul for scvtf, we need other complex pattern which checks the constant 1/2^fbits.
GCC uses fmul for fcvt and scvft but it has two patterns for 2^n and 1/2^n.
From my personal opinion, as current implementation, it would also be good to keep one complex pattern with fmul and fdiv.

Does it make sense to keep them selection fdiv, or should they always just match fmul? It would seem we only need one, and fmul is more canonical.

Additionally, multiclass FPToIntegerScaled and multiclass IntegerToFP share the SelectCVTFixedPosOperand as below.

class fixedpoint_i32<ValueType FloatVT>
  : Operand<FloatVT>,
    ComplexPattern<FloatVT, 1, "SelectCVTFixedPosOperand<32>", [fpimm, ld]> {
  let EncoderMethod = "getFixedPointScaleOpValue";
  let DecoderMethod = "DecodeFixedPointScaleImm32";
  let ParserMatchClass = Imm1_32Operand;
}
...

def fixedpoint_f32_i64 : fixedpoint_i64<f32>;
...

multiclass FPToIntegerScaled<bits<2> rmode, bits<3> opcode, string asm,
                             SDPatternOperator OpN> {
...
  def SXSri : BaseFPToInteger<0b00, rmode, opcode, FPR32, GPR64,
                              fixedpoint_f32_i64, asm,
              [(set GPR64:$Rd, (OpN (fmul FPR32:$Rn,
                                          fixedpoint_f32_i64:$scale)))]> {
    let Inst{31} = 1; // 64-bit GPR flag
  }
...
multiclass IntegerToFP<bit isUnsigned, string asm, SDPatternOperator node> {
...
  def SXSri: BaseIntegerToFP<isUnsigned, GPR64, FPR32, fixedpoint_f32_i64, asm,
                             [(set FPR32:$Rd,
                                   (fdiv (node GPR64:$Rn),
                                         fixedpoint_f32_i64:$scale))]> {
    let Inst{31} = 1; // 64-bit GPR flag
    let Inst{23-22} = 0b00; // 32-bit FPR flag
  }

The SelectCVTFixedPosOperand is for fcvt and checks the constant is 2^fbits.
If we keep the fdiv for scvtf, we can use SelectCVTFixedPosOperand.
If we replace the fdiv with fmul for scvtf, we need other complex pattern which checks the constant 1/2^fbits.
GCC uses fmul for fcvt and scvft but it has two patterns for 2^n and 1/2^n.
From my personal opinion, as current implementation, it would also be good to keep one complex pattern with fmul and fdiv.

I see. That makes sense, but we may need to take that route anyway. I worry that if we do it this way we will just end up in a loop, transforming fdiv to fmul and back again.

There is a generic DAG combine in this bit of code that does the inverse transform: https://github.com/llvm/llvm-project/blob/c2093b85044d87805c39267c65ac9032d5454e0e/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L16543. It currently only triggers with UnsafeFPMath or AllowReciprocal, which is probably why it doesn't come up in the tests. According to the InstCombine version it should be fine for any constant that has an exact inverse (which seems the same as what you have here too), so should be more generally applicable.

I think my vote would still be for changing IntegerToFP to use fmul with a difference ComplexPat, but if you do go this route it will need some way of preventing the infinite folding back and forth.

Allen added a subscriber: Allen.Aug 2 2023, 2:35 AM

I see. That makes sense, but we may need to take that route anyway. I worry that if we do it this way we will just end up in a loop, transforming fdiv to fmul and back again.

There is a generic DAG combine in this bit of code that does the inverse transform: https://github.com/llvm/llvm-project/blob/c2093b85044d87805c39267c65ac9032d5454e0e/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L16543. It currently only triggers with UnsafeFPMath or AllowReciprocal, which is probably why it doesn't come up in the tests. According to the InstCombine version it should be fine for any constant that has an exact inverse (which seems the same as what you have here too), so should be more generally applicable.

I think my vote would still be for changing IntegerToFP to use fmul with a difference ComplexPat, but if you do go this route it will need some way of preventing the infinite folding back and forth.

Sorry for late and thanks for comment.
If possible, I would like to re-use existing patterns.
Any more comments or any objection for this patch?

Sorry for late and thanks for comment.
If possible, I would like to re-use existing patterns.
Any more comments or any objection for this patch?

Unfortunately I believe this will get stuck in loops some of the time. Try https://godbolt.org/z/c6badjcb1 for example, when it runs under Ofast.

Unfortunately I believe this will get stuck in loops some of the time. Try https://godbolt.org/z/c6badjcb1 for example, when it runs under Ofast.

Ah, Thanks for pointing out the Ofast option! I misunderstood one of your previous comments. I can see the loop between tryCombineFMULWithFDIV and DAGCombiner::visitFDIV.
I can see below comment for the precision on DAGCombiner::visitFDIV. That's what I want to check for this patch.

SDValue DAGCombiner::visitFDIV(SDNode *N) {
...
  if (Options.UnsafeFPMath || Flags.hasAllowReciprocal()) {
    // fold (fdiv X, c2) -> fmul X, 1/c2 if losing precision is acceptable.
    if (auto *N1CFP = dyn_cast<ConstantFPSDNode>(N1))

Let me add a complex pattern with Options.UnsafeFPMath || Flags.hasAllowReciprocal().

jaykang10 updated this revision to Diff 549007.Aug 10 2023, 6:12 AM

Following @dmgreen's comment, added a complex pattern and updated patterns with it.

jaykang10 updated this revision to Diff 549012.Aug 10 2023, 6:17 AM
jaykang10 added a comment.EditedAug 10 2023, 6:31 AM

@dmgreen As you can see, DAGCombiner::visitFDIV does the conversion with Options.UnsafeFPMath || Flags.hasAllowReciprocal() so the patterns with fdiv are not selected without the options.
In order to use only fmul, we could need to remove the Options.UnsafeFPMath || Flags.hasAllowReciprocal(). Alternatively, We could need to support patterns with both fmul and fdiv with different complex patterns. How do you think about it?

@dmgreen As you can see, DAGCombiner::visitFDIV does the conversion with Options.UnsafeFPMath || Flags.hasAllowReciprocal() so the patterns with fdiv are not selected without the options.
In order to use only fmul, we could need to remove the Options.UnsafeFPMath || Flags.hasAllowReciprocal(). Alternatively, We could need to support patterns with both fmul and fdiv with different complex patterns. How do you think about it?

There is an instcombine version here: https://github.com/llvm/llvm-project/blob/c2093b85044d87805c39267c65ac9032d5454e0e/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp#L1546. It seems to do the transform whenever the value has an exact inverse. // If the constant divisor has an exact inverse, this is always safe. Alive doesn't run to completion to prove it (https://alive2.llvm.org/ce/z/mxheqK), but doesn't come up with a failure in that time. Can we use the same logic to always to the transform? I don't know of any counter examples where it wouldn't be true.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
3686

Could this be shared with SelectCVTFixedPosOperand? Maybe with a flag to specify whether the getExactInverse needs to be performed. If not them maybe the ConstantFPSDNode/ConstantPoolSDNode stuff can be pulled out and shared?

@dmgreen As you can see, DAGCombiner::visitFDIV does the conversion with Options.UnsafeFPMath || Flags.hasAllowReciprocal() so the patterns with fdiv are not selected without the options.
In order to use only fmul, we could need to remove the Options.UnsafeFPMath || Flags.hasAllowReciprocal(). Alternatively, We could need to support patterns with both fmul and fdiv with different complex patterns. How do you think about it?

There is an instcombine version here: https://github.com/llvm/llvm-project/blob/c2093b85044d87805c39267c65ac9032d5454e0e/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp#L1546. It seems to do the transform whenever the value has an exact inverse. // If the constant divisor has an exact inverse, this is always safe. Alive doesn't run to completion to prove it (https://alive2.llvm.org/ce/z/mxheqK), but doesn't come up with a failure in that time. Can we use the same logic to always to the transform? I don't know of any counter examples where it wouldn't be true.

Ok, good.
Let me update this patch with the exact inverse.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
3686

Yep, let me try to share the function.

jaykang10 updated this revision to Diff 549312.Aug 11 2023, 2:54 AM

Following @dmgreen's comment, updated patch.

@dmgreen As you can see, there are some regressions.
After exact inverse for fmul, the convertToInteger sets IsExact to false in some cases.
Maybe, we could need to keep the fdiv patterns too...

@dmgreen As you can see, there are some regressions.
After exact inverse for fmul, the convertToInteger sets IsExact to false in some cases.
Maybe, we could need to keep the fdiv patterns too...

I think that any test with a fdiv, that would by instcombine be converted to a fmul (like https://godbolt.org/z/P1bd73TK7) are OK to leave as regressions. We would not see the regression in practice as instcombine has already performed the conversion to a fmul.

Do you know which cases that would not cover and would be left over if the fdiv was not handled?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16717 ↗(On Diff #549312)

I think this can be dropped. If we want this transform for scalars it would be best to reuse the logic in DAGCombine.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
709–711

This can maybe be removed, as the class is only used as a ComplexPattern, not as a assembly Operand?

llvm/test/CodeGen/AArch64/svtcf-fmul-fdiv-combine.ll
5

Can you add some fp16 variants in here too.

I think that any test with a fdiv, that would by instcombine be converted to a fmul (like https://godbolt.org/z/P1bd73TK7) are OK to leave as regressions. We would not see the regression in practice as instcombine has already performed the conversion to a fmul.

Do you know which cases that would not cover and would be left over if the fdiv was not handled?

Thanks for comments.
If we add some patterns with fdiv, we can avoid the regressions. For example,

def : Pat<(f16 (fdiv (f16 (any_sint_to_fp (i32 GPR32:$Rn))), fixedpoint_f16_i32:$scale)),
          (SCVTFSWHri GPR32:$Rn, fixedpoint_f16_i32:$scale)>;
def : Pat<(f32 (fdiv (f32 (any_sint_to_fp (i32 GPR32:$Rn))), fixedpoint_f32_i32:$scale)),
          (SCVTFSWSri GPR32:$Rn, fixedpoint_f32_i32:$scale)>;
def : Pat<(f64 (fdiv (f64 (any_sint_to_fp (i32 GPR32:$Rn))), fixedpoint_f64_i32:$scale)),
          (SCVTFSWDri GPR32:$Rn, fixedpoint_f64_i32:$scale)>;

Let me add these patterns in update. If you do not like it, please let me know. Let me remove them.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16717 ↗(On Diff #549312)

Let me remove it.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
709–711

Let me remove it.

llvm/test/CodeGen/AArch64/svtcf-fmul-fdiv-combine.ll
5

Let me add fp16 tests.

jaykang10 updated this revision to Diff 549917.Aug 14 2023, 7:00 AM

Following @dmgreen's comment, updated patch.

dmgreen accepted this revision.Aug 15 2023, 1:35 AM

I like it, the new patterns look good to me. LGTM

llvm/lib/Target/AArch64/AArch64InstrFormats.td
5046

Should these be recip too? I'm not sure they need to be, but it might be better for them to be consistent.

This revision is now accepted and ready to land.Aug 15 2023, 1:35 AM
jaykang10 added inline comments.Aug 15 2023, 2:25 AM
llvm/lib/Target/AArch64/AArch64InstrFormats.td
5046

Sorry for mistake. I did not update it.
It looks the TableGen does not complain about the inconsistency between complex pattern operands in InOperandList and Pattern.
As you mentioned, it would be just good to use same thing for consistent.

This revision was landed with ongoing or failed builds.Aug 15 2023, 2:59 AM
This revision was automatically updated to reflect the committed changes.