This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Fold IEEE `fmul`/`fdiv` by Pow2 to `add`/`sub` of exp
ClosedPublic

Authored by goldstein.w.n on Jul 9 2023, 4:01 PM.

Details

Summary

Note: This is moving D154678 which previously implemented this in
InstCombine. Concerns where brought up that this was de-canonicalizing
and really targeting a codegen improvement, so placing in DAGCombiner.

This implements:

(fmul C, (uitofp Pow2))
    -> (bitcast_to_FP (add (bitcast_to_INT C), Log2(Pow2) << mantissa))
(fdiv C, (uitofp Pow2))
    -> (bitcast_to_FP (sub (bitcast_to_INT C), Log2(Pow2) << mantissa))

The motivation is mostly fdiv where 2^(-p) is a fairly common
expression.

The patch is intentionally conservative about the transform, only
doing so if we:

  1. have IEEE floats
  2. C is normal
  3. add/sub of max(Log2(Pow2)) stays in the min/max exponent bounds.

Alive2 can't realistically prove this, but did test float16/float32
cases (within the bounds of the above rules) exhaustively.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jul 9 2023, 4:01 PM
goldstein.w.n requested review of this revision.Jul 9 2023, 4:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2023, 4:01 PM
RKSimon added inline comments.Jul 10 2023, 3:18 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3975

Describe/tag the purpose of each argument or use the same names in the patterns to make it more obvious.

3981

(style) Remove the (void) lines

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16152

Can this happen? Make it an assert if you must.

16154

Remove MaxDepth and use SelectionDAG::MaxRecursionDepth

16158

If we allow bitcast can't we end with cases where we've flipped between vectors/scalars or changed vectorelementcount?

16178

You can use ISD::matchUnaryPredicate to support non-uniform vectors here - see isDivisorPowerOfTwo for a similar case.

16180

Standard practice in DAG is to increment Depth at a recursive functional call, not inside it.

16201

Use Depth + 1 here and adjust the depth tolerance above

16209

Use Depth + 1 here and adjust the depth tolerance above

16211

Use Depth + 1 here and adjust the depth tolerance above

16220

Use Depth + 1 here and adjust the depth tolerance above

16222

Use Depth + 1 here and adjust the depth tolerance above

16256

Don't you need to ensure that the scalar size in bits of stays the same through the bitcast? AFAICT you should only accept bitcasts that casts from fp to/from int (both scalar/vector and same scalar width)?

goldstein.w.n marked 13 inline comments as done.Jul 10 2023, 10:40 AM
goldstein.w.n added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16158

Dropped bitcast peeking here. It doesn't really add any value b.c we can't handle any of the cases that bitcast would be needed (Int <-> Vec <-> FP).

16256

When we takelog2 it will cast the new SDValue to proper integer type. But dropping peekthroughbitcasts here as it doesn't really add any value.

goldstein.w.n marked 2 inline comments as done.

Variety of fixes

Can you add a test where the original fmul would fold into an fma? I think you're worse off doing this if you're interfering with FMA formation

llvm/include/llvm/CodeGen/TargetLowering.h
3971

I think this should just use the fmul case, and turn the compatible fdivs to fmul

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16188

You don't need to bother special casing the splat case. getConstant on a vector will just split this out to a build_vector of constants anyway

16201

I'd kind of expect there to be a helper that does this already?

16216

Move isOneConstant as the last condition

16488

Can you add a test where this w

goldstein.w.n added inline comments.Jul 11 2023, 2:17 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16488

Where this w?

arsenm added inline comments.Jul 11 2023, 2:18 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16488

I moved the comment to the main box, the fma formation one

goldstein.w.n marked 4 inline comments as done.Jul 11 2023, 4:19 PM
goldstein.w.n added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
3971

I'm not sure I understand. We will need an fdiv no matter what, at the very least for the reciprocal of the pow2.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16201

There is DAG.getZExtOrTrunc but it doesn't check bitcast or if oldvt == newvt.

16488

Okay, I added an fma test fmul_pow_shl_cnt_vec_preserve_fma and it does so (the call to this transform is explicitly after the fma fold. I added a comment as well to be clearer about that).

goldstein.w.n marked an inline comment as done.

Cleanup + make FMA stuff more clear

arsenm added inline comments.Jul 17 2023, 4:31 PM
llvm/include/llvm/CodeGen/TargetLowering.h
3971

I mean you can rewrite fdiv by power of 2 as fmul by 2 to the negative power

goldstein.w.n added inline comments.Jul 17 2023, 4:46 PM
llvm/include/llvm/CodeGen/TargetLowering.h
3971

but 2 ^ -N requires a division. I.e 1 / (2 ^ N) so I don't see the benefit.

Have you looked at using the existing DAG::isKnownToBeAPowerOfTwo and DAGCombiner::BuildLogBase2 methods?

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16258

Why not make this a DAGCombiner method directly?

llvm/lib/Target/X86/X86ISelLowering.cpp
25042

Add a TODO - as I'm sure we can be more aggressive here, vXi64 uittofp in particular on pre-AVX512 targets, but smaller integers are often worth it since we'd have to extend them any way for the int-to-fp conversion.

goldstein.w.n added a comment.EditedJul 30 2023, 6:49 PM

Have you looked at using the existing DAG::isKnownToBeAPowerOfTwo and DAGCombiner::BuildLogBase2 methods?

Edit: after looking a little more, think this is a better approach. Currently working on a patch to improve both those functions, then will rebase this on top.
Just did. That would work in a sense. The rationale of keeping seperate is that takeLog2 as is, is basically guranteed to return an expression thats as or less expensive as the pow2 op. I.e
if we have (min a, b), we might return (min log2_a, log2_b) but wouldn't return that if we didn't already find a min.

Think that highlights this might be more re-appropriately named findInexpensiveLog2 (changed).

Looking at our current combine ability we also seem to not optimize trivial log2 cases of BuildLogBase2 i.e:

declare i32 @llvm.ctlz.i32(i32, i1)

define i32 @trivial_log2(i32 %x) {
  %s = shl i32 1, %x
  %r = call i32 @llvm.ctlz.i32(i32 %s, i1 true)
  ret i32 %r
}

Doesn't optimize out the shl/ctlz.

I can follow up with patches to integrate takeLog2 into buildLog2Base

goldstein.w.n marked an inline comment as done.Aug 1 2023, 12:34 AM

Have you looked at using the existing DAG::isKnownToBeAPowerOfTwo and DAGCombiner::BuildLogBase2 methods?

Done, includes a bit of a refactor for both those methods as they where lacking.

goldstein.w.n marked an inline comment as done.Aug 1 2023, 12:39 AM

Use buildlogbase2, put in DAGComber::, add todo

RKSimon added inline comments.Aug 15 2023, 3:09 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3984

This should probably be inside TargetLoweringBase near the top of the file - somewhere near shouldProduceAndByConstByHoistingConstFromShiftsLHSOfAnd - that's where we put the per-target optimization controls switches.

goldstein.w.n marked an inline comment as done.Aug 15 2023, 4:32 PM
arsenm added inline comments.Aug 15 2023, 4:44 PM
llvm/include/llvm/CodeGen/TargetLowering.h
822–823

I think the fdiv by power of 2 should be generally converted to fmul by inverse power of 2 in the first place

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16160

ANY_EXTEND seems dangerous here

goldstein.w.n marked an inline comment as done.Aug 16 2023, 12:02 AM
goldstein.w.n added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
822–823

Maybe although you still need recipricol which is still division. But would rather make that tradeoff a seperate patch either way.

Remove any_extend

@arsenm Do you have any more feedback?

@arsenm Do you have any more feedback?

I think you're worse off doing this if ldexp is legal (not that we have the combine to form ldexp here)

goldstein.w.n added a comment.EditedAug 23 2023, 2:38 PM

@arsenm Do you have any more feedback?

I think you're worse off doing this if ldexp is legal (not that we have the combine to form ldexp here)

why is that? ldexp gets lowered as mul/div and its not really possible to transform ldexp into this.
EDIT: Am happy to add TODO to look into again should ldexp support be fleshed out.

why is that? ldexp gets lowered as mul/div and its not really possible to transform ldexp into this.

It doesn't lower to mul/div (and it may need a range check, at least for div). Most targets use the libcall

It's approximately

fmul x, pow2_k => ldexp(x, log2(pow2_k)
fmul x, pow2_k => ldexp(x, -log2(pow2_k)

with a possible additional precondition for the exponent range

goldstein.w.n added a comment.EditedAug 23 2023, 9:02 PM

why is that? ldexp gets lowered as mul/div and its not really possible to transform ldexp into this.

It doesn't lower to mul/div (and it may need a range check, at least for div). Most targets use the libcall

Err was thinking implemented with, but thats wrong too.
Looks like ldexp is implemented with the same trick as here (at least glibc/llvm-libc).
We are essentially inlining the fast path.
So don't really see how this could cause a slowdown given it gets to skip all the
checks + is inlined.
I can see the argument against having it in instcombine, but as its implemented here,
its at the end of the line so the speak and there isn't much left for it to potentially
de-optimize.

It's approximately

fmul x, pow2_k => ldexp(x, log2(pow2_k)
fmul x, pow2_k => ldexp(x, -log2(pow2_k)

with a possible additional precondition for the exponent range

I tried implemented this through ldexp but ALOT of powers of two are known from shifts
and its quite useful to do the analysis when its still in shift state as we can easily limit
the exponent. (The very fact that its (UINT_MAX * float) lets us bound stuff, OTOH (float << UINT_MAX) is unweildy).

If there where flags attached to ldexp that bounded the exponent I agree it would be
best to get to ldexp in InstCombine then use ldexp for the lowering.

Rebase + add todo for ldexp

@arsenm I added some AMDGPU tests + some tests of ldexp so its easier to compare the results of this patch.
I would also be happy to add patch so this code can also generate ldexp for the fmul case if the target prefers it if that helps address your concerns.

RKSimon accepted this revision.Sep 19 2023, 10:00 AM

LGTM - please can you raise an issue describing the missing (fmul x, pow2_k) - > (ldexp x, (log2 pow2_k)) fold

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
27116

(style) It'd be better to move the takeInexpensiveLog2 static implementation directly above BuildLogBase2 if that is its only user.

This revision is now accepted and ready to land.Sep 19 2023, 10:00 AM

LGTM - please can you raise an issue describing the missing (fmul x, pow2_k) - > (ldexp x, (log2 pow2_k)) fold

Done: see https://github.com/llvm/llvm-project/issues/66794

goldstein.w.n marked an inline comment as done.

Move takeInexpensiveLog2 to closer to its use-case

dmgreen added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
27170–27171

Hello. It looks like these should be the other way around? Otherwise CurVT might be == NewVT, but ToCast has now looked through a zext.
Testcase: https://godbolt.org/z/xd7TbGa6r

goldstein.w.n added inline comments.Sep 22 2023, 8:52 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
27170–27171

Yup, fix will be posted shortly!