This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Fold select(cond,binop(x,y),binop(x,z)) -> binop(x,select(cond,y,z))
ClosedPublic

Authored by RKSimon on Jul 15 2021, 5:05 AM.

Details

Summary

Similar to the folds performed in InstCombinerImpl::foldSelectOpOp, this attempts to push a select further up to help merge a pair of binops.

I'm primarily interested in select(cond,add(x,y),add(x,z)) folds to help expose pointer math (see https://bugs.llvm.org/show_bug.cgi?id=51069 etc.) but I've tried to use the more generic isBinOp().

Diff Detail

Event Timeline

RKSimon created this revision.Jul 15 2021, 5:05 AM
RKSimon requested review of this revision.Jul 15 2021, 5:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 15 2021, 5:05 AM
lebedev.ri accepted this revision.Jul 15 2021, 6:15 AM
lebedev.ri added a reviewer: tra.

Looks good to me.
I think NVPTX tests should just be adjusted beforehand.

llvm/test/CodeGen/NVPTX/fast-math.ll
147 ↗(On Diff #358924)

I think in all of these you want to have different divisors.

This revision is now accepted and ready to land.Jul 15 2021, 6:15 AM
RKSimon edited the summary of this revision. (Show Details)Jul 15 2021, 8:04 AM
This revision was landed with ongoing or failed builds.Jul 15 2021, 8:19 AM
This revision was automatically updated to reflect the committed changes.
tra added inline comments.Jul 15 2021, 10:29 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

This looks like it may be a performance regression. Judging by the name of the tests, we do expect to see reciprocal and multiplies, not div.

tra added inline comments.Jul 15 2021, 10:31 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

Never mind. These changes are not in the final version of the patch.

lebedev.ri added inline comments.Jul 15 2021, 10:34 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

Then you were already missing some kind of an inverse transform,
because middle-end would have already done this fold:
https://godbolt.org/z/cb38zEj9Y

RKSimon added inline comments.Jul 15 2021, 10:41 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

I made a kuldge fix in rG3cc38703d5ab05be0b01c31f829d19b47f183c5f so it still tests the combineRepeatedFPDivisors code path which appeared to be their purpose. But as @lebedev.ri said, we're unlikely to see the original test code patterns from real world code as InstCombine will have already broken the pattern.

tra added inline comments.Jul 15 2021, 11:02 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

This version of the patch resulted in div showing up in the PTX. The godbolt example above looks good and still generates rcp/mul/mul, which is what we want.

If the generated PTX changes to use div once godbolt picks up these changes, that would be a regression. Changing the test just hides the problem. It might've been better to disable the test with a comment that the test is still correct, but that there may be an issue in NVPTX which needs to be fixed.

I'm not sure yet why/how your change affectes NVPTX. I'll check what's going on.

RKSimon added inline comments.Jul 15 2021, 11:09 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

This shows the llc output before/after opt -O3 https://godbolt.org/z/dz4PGvjEf

lebedev.ri added inline comments.Jul 15 2021, 11:12 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

The godbolt example above looks good and still generates rcp/mul/mul, which is what we want.

No it doesn't:
https://godbolt.org/z/P43Y5E76E

tra added inline comments.Jul 15 2021, 11:19 AM
llvm/test/CodeGen/NVPTX/fast-math.ll
150 ↗(On Diff #358924)

Your original link does produce correct PTX without div for me. However, RKSimon's example shows that there's indeed something weird going on that prevents lowering to multiplication by reciprocal.

To make it clear - I have no issue with this patch. It's apparently just exposed an issue in the NVPTX back-end. I'll deal with it.

tra added a comment.EditedJul 15 2021, 4:14 PM

Is it intentional that two fdiv arcp get folded into fdiv w/o arcp?

This is part of what made the difference for NVPTX tests.
Debug output from llc on IR here https://godbolt.org/z/zcrh5n8n8:

SelectionDAG has 21 nodes:
  t0: ch = EntryToken
    t14: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_3', undef:i32
  t15: f32 = extract_vector_elt t14, Constant:i32<0>
            t4: v1i8,ch = load<(dereferenceable invariant load (s8) from `i1 addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_0', undef:i32
          t5: i8 = extract_vector_elt t4, Constant:i32<0>
        t6: i1 = truncate t5
            t8: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_1', undef:i32
          t9: f32 = extract_vector_elt t8, Constant:i32<0>
        t16: f32 = fdiv arcp t9, t15
            t11: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_2', undef:i32
          t12: f32 = extract_vector_elt t11, Constant:i32<0>
        t17: f32 = fdiv arcp t12, t15
      t18: f32 = select t6, t16, t17
    t19: ch = NVPTXISD::StoreRetval<(store (s32), align 1)> t0, Constant:i32<0>, t18
  t20: ch = NVPTXISD::RET_FLAG t19

Combining: t20: ch = NVPTXISD::RET_FLAG t19

Combining: t19: ch = NVPTXISD::StoreRetval<(store (s32), align 1)> t0, Constant:i32<0>, t18

Combining: t18: f32 = select t6, t16, t17
Creating new node: t21: f32 = select t6, t9, t12
Creating new node: t22: f32 = fdiv t21, t15
 ... into: t22: f32 = fdiv t21, t15

Without arcp we have no choice now but to lower into a regular div instruction.

That said, even if we were to preserve arcp, we'd run into the second issue.
NVPTX itself does not know how to lower FDIV32rr_prec arcp correctly and lowers it as a regular div.

Previously two divs+select were combined into two multiplications by reciprocal and that was what made it look like we can lower div to multiplication by reciprocal.

Is it intentional that two fdiv arcp get folded into fdiv w/o arcp?

Hm, i guess we need to intersect fast-math flags from both original instructions into the new instruction.

This is part of what made the difference for NVPTX tests.

SelectionDAG has 21 nodes:
  t0: ch = EntryToken
    t14: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_3', undef:i32
  t15: f32 = extract_vector_elt t14, Constant:i32<0>
            t4: v1i8,ch = load<(dereferenceable invariant load (s8) from `i1 addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_0', undef:i32
          t5: i8 = extract_vector_elt t4, Constant:i32<0>
        t6: i1 = truncate t5
            t8: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_1', undef:i32
          t9: f32 = extract_vector_elt t8, Constant:i32<0>
        t16: f32 = fdiv arcp t9, t15
            t11: v1f32,ch = load<(dereferenceable invariant load (s32) from `float addrspace(101)* null`, addrspace 101)> t0, TargetExternalSymbol:i32'repeated_div_recip_a_param_2', undef:i32
          t12: f32 = extract_vector_elt t11, Constant:i32<0>
        t17: f32 = fdiv arcp t12, t15
      t18: f32 = select t6, t16, t17
    t19: ch = NVPTXISD::StoreRetval<(store (s32), align 1)> t0, Constant:i32<0>, t18
  t20: ch = NVPTXISD::RET_FLAG t19

Combining: t20: ch = NVPTXISD::RET_FLAG t19

Combining: t19: ch = NVPTXISD::StoreRetval<(store (s32), align 1)> t0, Constant:i32<0>, t18

Combining: t18: f32 = select t6, t16, t17
Creating new node: t21: f32 = select t6, t9, t12
Creating new node: t22: f32 = fdiv t21, t15
 ... into: t22: f32 = fdiv t21, t15

Without arcp we have no choice now but to lower into a regular div instruction.

That said, even if we were to preserve arcp, we'd run into the second issue.
NVPTX itself does not know how to lower FDIV32rr_prec arcp correctly and lowers it as a regular div.

Previously two divs+select were combined into two multiplications by reciprocal and that was what made it look like we can lower div to multiplication by reciprocal.

OK, I'll look at adding intersected flag preservation tomorrow - cheers.

@tra Common flag propagation should be working now - in rGfcb710a7ad4f I've added back the fast-math tests that I tweaked (I've kept the tweaked test variant as well) so it should show the missing mvptx arcp lowering. Shout on this ticket if I've missed anything.

tra added a comment.Jul 19 2021, 10:04 AM

@tra Common flag propagation should be working now - in rGfcb710a7ad4f I've added back the fast-math tests that I tweaked (I've kept the tweaked test variant as well) so it should show the missing mvptx arcp lowering. Shout on this ticket if I've missed anything.

Thank you. I've confirmed that arcp does get propagated now.