This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] try repeated fdiv divisor transform before building estimate
ClosedPublic

Authored by spatel on Apr 25 2019, 1:54 PM.

Details

Summary

This was originally part of D61028, but it's an independent diff.

If we do the repeated divisor reciprocal transform before producing an estimate sequence, then we have an opportunity to use scalar fdiv. On x86, the trade-off is 1 divss vs. 5 vector FP ops in the default estimate sequence. On recent chips (Skylake, Ryzen), the full-precision division is only 3 cycle throughput, so that's probably the better perf default option and avoids problems from x86's inaccurate estimates.

The last 2 tests show that users still have the option to override the defaults by using the function attributes for reciprocal estimates, but we can potentially make those faster by converting vector ops (including ymm ops) to scalar math.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Apr 25 2019, 1:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 25 2019, 1:54 PM
RKSimon accepted this revision.Apr 26 2019, 12:40 AM

LGTM - thanks for splitting this off from D61028

This revision is now accepted and ready to land.Apr 26 2019, 12:40 AM
This revision was automatically updated to reflect the committed changes.
spatel reopened this revision.May 1 2019, 9:44 AM

Reopening - I reverted this at rL359695 because it can cause an infinite loop from opposing combines.
There's a proposal to solve that in D61384, so I'll merge that here and try again.

This revision is now accepted and ready to land.May 1 2019, 9:44 AM
spatel planned changes to this revision.May 1 2019, 9:44 AM
spatel updated this revision to Diff 197596.May 1 2019, 11:14 AM
spatel added a reviewer: nemanjai.

Patch updated:
Add a check for a vector splat of "1.0" when bailing out of the repeated divisor transform. I suspect this bug may be present independently of the original change here (moving the order of the transforms), but I didn't try to create another test case to prove that.

A reduction of the infinite looping test from D61384 was added at rL359709. I don't think we need to add a specific helper to find a splat of "1.0" through a constant pool load as is proposed in D61384 - a simpler use of the existing isConstOrConstSplatFP() avoids the problem.

This revision is now accepted and ready to land.May 1 2019, 11:14 AM
nemanjai accepted this revision.May 1 2019, 7:45 PM

Interestingly enough, I tried this (or some variant of this) and it didn't work for me. Clearly I had done something wrong when trying it. In any case, I've run this against the code we were originally spinning on and it's all good. LGTM and thanks for fixing so quickly.

This revision was automatically updated to reflect the committed changes.