This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold IEEE `fmul`/`fdiv` by Pow2 to `add`/`sub` of exp
AbandonedPublic

Authored by goldstein.w.n on Jul 6 2023, 8:25 PM.

Details

Summary

This implements:

(fmul C, (uitofp Pow2))
    -> (bitcast_to_FP (add (bitcast_to_INT C), Log2(Pow2) << mantissa))
(fdiv C, (uitofp Pow2))
    -> (bitcast_to_FP (sub (bitcast_to_INT C), Log2(Pow2) << mantissa))

The motivation is mostly fdiv where 2^(-p) is a fairly common
expression.

The patch is intentionally conservative about the transform, only
doing so if we

  1. have IEEE floats
  2. C is normal
  3. add/sub of max(Log2(Pow2)) stays in the min/max exponent bounds.

Alive2 can't realistically prove this, but I did test the float16/float32
cases (within the bounds of the above rules) exhaustively.
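As an illustration only (Python, not the patch's C++; the helper names here are invented), the bit-level identity behind the fold can be checked for float32:

```python
import struct

MANTISSA_BITS = 23  # float32 significand width

def f32_bits(x):
    # reinterpret a float as its IEEE-754 binary32 bit pattern
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fmul_pow2_via_int(c, log2_pow2):
    # (fmul C, 2^k) as an integer add on the exponent field; only valid
    # for normal C when the result stays within the exponent bounds
    return bits_f32(f32_bits(c) + (log2_pow2 << MANTISSA_BITS))

def fdiv_pow2_via_int(c, log2_pow2):
    # (fdiv C, 2^k) as an integer sub on the exponent field
    return bits_f32(f32_bits(c) - (log2_pow2 << MANTISSA_BITS))

assert fmul_pow2_via_int(1.5, 3) == 1.5 * 8.0
assert fdiv_pow2_via_int(100.0, 4) == 100.0 / 16.0
```

This only holds under the three conditions above; subnormals, infinities, and NaNs break the exponent-field arithmetic.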

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jul 6 2023, 8:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2023, 8:25 PM
goldstein.w.n requested review of this revision.Jul 6 2023, 8:25 PM

Sounds plausible, but I'll leave this to the FP experts...

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
573

Why does this exclude scalable vectors?

625

Should call this with DoFold = false first, otherwise it'll create dead instructions. They'll get removed, but I think this will lead to an infinite combine loop.

Please add a test where this fails to fold after some sub-part has folded.

633–639

Do we do much floatbits wrangling in the middle-end? This feels like something that should be done later, but I don't know if there are precedents.

arsenm added a comment.Jul 7 2023, 2:37 AM

My current thought is this is a worse canonical form for floating point operations and would be better done in codegen. This is going to interfere with FMA matching, plus computeKnownFPClass is going to give up on the integer ops and cast.

This is also similar to a codegen combine I’ve been planning on doing to fold fmul by power of 2 to ldexp. That has the same FMA interference properties, but is easier to deal with. Have you considered introducing ldexp instead and breaking that in codegen if ldexp isn’t legal?

arsenm added a comment.Jul 7 2023, 3:59 AM

Do we do much floatbits wrangling in the middle-end? This feels like something that should be done later, but I don't know if there are precedents.

I actually think we need to do the opposite and turn int-to-fp, not fp-to-int (e.g. D151939 and parents).

goldstein.w.n added a comment.EditedJul 7 2023, 8:27 AM

My current thought is this is a worse canonical form for floating point operations and would be better done in codegen. This is going to interfere with FMA matching, plus computeKnownFPClass is going to give up on the integer ops and cast.

Is there a reason we can't handle bitcasts in computeKnownFPClass using knownbits from the integer type coming in?

I would agree with you if we were replacing the int-to-fp with a fp-to-int, but there is no float except for the constant until the fdiv/fmul.
is fmul/fdiv by constant really preferable to add/sub of a constant?

This is also similar to a codegen combine I’ve been planning on doing to fold fmul by power of 2 to ldexp. That has the same FMA interference properties, but is easier to deal with. Have you considered introducing ldexp instead and breaking that in codegen if ldexp isn’t legal?

That does seem more principled. Then we get to remove all the casts.

Is (fmul/fdiv C, (uitofp Pow2)) directly equivalent to (ldexp C, +/- Log2(Pow2))? (negative Log2(Pow2) if fdiv)

arsenm added a comment.Jul 7 2023, 9:04 AM

My current thought is this is a worse canonical form for floating point operations and would be better done in codegen. This is going to interfere with FMA matching, plus computeKnownFPClass is going to give up on the integer ops and cast.

Is there a reason we can't handle bitcasts in computeKnownFPClass using knownbits from the integer type coming in?

The cases that are recognizable would be better served by turning the bitcasts into floating point operations. Adding in some integer add and sub is quite obscuring and would make that more difficult

I would agree with you if we were replacing the int-to-fp with a fp-to-int, but there is no float except for the constant until the fdiv/fmul.
is fmul/fdiv by constant really preferable to add/sub of a constant?

Downstream users of the fmul / fdiv can understand the original float operation but the new integer thing is hopeless. If the original fmul had an fadd user, we could no longer match FMA without a lot more effort. We do have to undo some canonicalizations already for FMA formation,
but they're simple cases (like fmul x, 2 -> fadd x, x)
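That particular canonicalization is easy to spot-check numerically (Python doubles standing in for any IEEE type):

```python
# fmul x, 2.0 and fadd x, x round identically in IEEE arithmetic,
# including when both overflow to infinity
for x in (0.1, -3.5, 1e-300, 1e308):
    assert x * 2.0 == x + x
```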

This is also similar to a codegen combine I’ve been planning on doing to fold fmul by power of 2 to ldexp. That has the same FMA interference properties, but is easier to deal with. Have you considered introducing ldexp instead and breaking that in codegen if ldexp isn’t legal?

That does seem more principled. Then we get to remove all the casts.

Is (fmul/fdiv C, (uitofp Pow2)) directly equivalent to (ldexp C, +/- Log2(Pow2))? (negative Log2(Pow2) if fdiv)

Yes. I'm pretty sure all the edge cases work out the same.
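A quick numeric spot-check of that equivalence (Python doubles; a sanity check, not a proof):

```python
import math

# fmul/fdiv by an exact power of 2 matches ldexp with +/- its log2
for c in (1.5, -3.25, 1e-30, 6.1e4):
    for k in range(10):
        pow2 = 1 << k
        assert c * pow2 == math.ldexp(c, k)   # fmul case
        assert c / pow2 == math.ldexp(c, -k)  # fdiv case
```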


Okay, am going to abandon this patch and replace with transform to ldexp.

Will add ldexp -> shift/sub in backend.

goldstein.w.n added a comment.EditedJul 7 2023, 10:53 AM


Working on adding ldexp to the backend, but notice that we lose the important information that exp <= Type->getScalarSizeInBits().
We can't propagate an assume to the backend either.
I think to do this, we need to add a flag to ldexp marking that exp is a log of some sort (which would let us reason about whether it may cause denormals/etc).
What do you think?

Edit:
We also lose information about whether this is for fdiv or fmul, which makes the bounds more constrained, although this is probably less important.

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
573

Just to keep it simple. Was mostly concerned we might have an issue deducing the proper integer type for bitcasting the FP constant if we allowed scalable vecs.

What do you think?

("You" being @arsenm.)

Working on adding ldexp to backend, but notice that we lose the important information that exp <= Type->getScalarSizeInBits()

The useful exponent range is limited by ldexp. The only useful values are between -exp_min - mantissa size and exp_max; outside of that it saturates to 0/inf.
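That saturation can be spot-checked from Python, with the caveat that Python's math.ldexp reports the overflow side as OverflowError rather than returning inf as C's ldexp does:

```python
import math

# far below the useful range: quietly saturates to zero
assert math.ldexp(1.0, -5000) == 0.0

# far above the useful range: C's ldexp would saturate to +inf;
# Python raises OverflowError instead
try:
    math.ldexp(1.0, 5000)
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed
```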

We can't propagate an assume to the backend either.
Think to do this, need to add a flag to ldexp if exp is log of some sort (allows us to reason about whether it may cause denorm/etc).

The exp is interpreted as the log value.

What do you think?

Edit:
We lose also information about whether this is for fdiv or fmul which makes the bounds more constrained, although this is probably less important.

Does that matter? Can we unconditionally turn fdiv into fmul if the divisor is known to be an exact power of 2?

I do think fmul with power of 2 constant in the general case should be more canonical than ldexp. We should turn ldexp with constant exponent into fmuls, and back into ldexp if the target wants.


The exp is interpreted as the log value.

Yeah, but that log value can be anything. We were only able to bound
it before because we actually took the log of a value that maxed out at Type->getScalarSizeInBits().
In the backend, it can be anything, no?
I.e. if we have:

%pow = shl i64 1, %cnt
%pow_fp = uitofp i64 %pow to double
%mul = fmul double %d, %pow_fp

We convert to:

%mul = ldexp(%d, %cnt)

But when we see this in the backend, we have no way of bounding %cnt.

Edit:
We lose also information about whether this is for fdiv or fmul which makes the bounds more constrained, although this is probably less important.

Does that matter? Can we unconditionally turn fdiv into fmul if the divisor is known to be an exact power of 2?

I do think fmul with power of 2 constant in the general case should be more canonical than ldexp. We should turn ldexp with constant exponent into fmuls, and back into ldexp if the target wants.

The thing is, the power is not constant.
The fold is:
(fmul FP_Constant, (uitofp Pow2_Int_Variable))

Can we unconditionally turn fdiv into fmul if the divisor is known to be an exact power of 2?

Not sure; the best I can come up with is to just take the reciprocal.

Would people be more open to the patch in InstCombine if it was only for fdiv? The concerns about making things less canonical have been mostly centered around fmul.

D154805 has the backend impl; going to keep this open for a moment in case people prefer just removing fmul from this patch.

Can we unconditionally turn fdiv into fmul if the divisor is known to be an exact power of 2?

Not sure; the best I can come up with is to just take the reciprocal.

Seems to work for power of 2 known <= max_exp. Much better to do that rather than try to handle both
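The reason the reciprocal route is exact: the reciprocal of an in-range power of two is itself a power of two, so the fdiv and the fmul round identically. A quick check in Python:

```python
# 1/2**k is exact in binary FP for in-range k, so dividing by a power
# of two and multiplying by its reciprocal give bit-identical results
for k in range(64):
    r = 1.0 / (1 << k)
    for x in (3.14159, -7.5, 1e-3):
        assert x / (1 << k) == x * r
```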


So you mean handle fdiv only and do the following:

(fdiv C, (uitofp Pow2))
    -> (fmul C, (bitcast_to_FP (sub (FP_One), Log2(Pow2) << mantissa)))

?
Otherwise we still have the division to create the reciprocal. In that case it seems we might as well also drop the fmul.

But I'm planning to abandon this patch, as it's now implemented in the backend.
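For illustration, the reciprocal constant in that rewrite can be materialized by subtracting from the exponent field of 1.0; a float32 sketch with invented helper names:

```python
import struct

MANTISSA_BITS = 23  # float32 significand width

def f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def rcp_pow2_f32(log2_pow2):
    # build 2**-k by subtracting k from the exponent field of 1.0,
    # mirroring the (sub (FP_One), Log2(Pow2) << mantissa) step
    return bits_f32(f32_bits(1.0) - (log2_pow2 << MANTISSA_BITS))

assert rcp_pow2_f32(4) == 1.0 / 16.0
```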


Yes. It seems InstCombine already turns fdiv by constant power of 2 to fmul, but not the unknown-but-known-to-be-pow2 case here


In the constant case, RCP(Constant) is free; here we need 3-4 instructions. Is that really more canonical?
I would argue we might as well drop the fmul entirely and stick with the original transform (for fdiv only),
or just drop this entirely and rely on the backend patch.

ping. Would very much like to get this in. Even if we end up re-routing through ldexp in the future, I think the base of this will remain the same.

goldstein.w.n added a comment.EditedSep 2 2023, 1:14 PM

@arsenm I added some AMDGPU tests + some tests of ldexp so it's easier to compare the results of this patch.
I would also be happy to add a patch so this code can also generate ldexp for the fmul case if the target prefers it, if that helps address your concerns.

Edit: Realized this is the wrong commit I've been pinging on.

goldstein.w.n abandoned this revision.Sep 2 2023, 1:54 PM

Wrong commit... I've been messaging on :(