This is an archive of the discontinued LLVM Phabricator instance.

[NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32 PHI's, O(n^2) algo is faster (geomean -0.08%)
ClosedPublic

Authored by lebedev.ri on Sep 9 2020, 12:11 PM.

Details

Summary

This is functionally equivalent to the old implementation.

As per https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=4739e6e4eb54d3736e6457249c0919b30f6c855a&stat=instructions
this is a clear geomean compile-time regression-free win with overall geomean of -0.08%

32 PHI's appears to be the sweet spot; both the 16 and 64 performed worse:
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=c4efe1fbbfdf0305ac26cd19eacb0c7774cdf60e&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=e4989d1c67010d3339d1a40ff5286a31f10cfe82&stat=instructions

If we have more PHI's than that, we fall-back to the original DenseSet-based implementation,
so the not-so-fast cases will still be handled.

However compile-time isn't the main motivation here.
I can name at least 3 limitations of this CSE:

  1. Assumes that all PHI nodes have incoming basic blocks in the same order (can be fixed while keeping the DenseMap)
  2. Does not special-handle undef incoming values (i don't see how we can do this with hashing)
  3. Does not special-handle backedge incoming values (maybe can be fixed by hashing backedge as some magical value)

Thoughts?

Diff Detail

Event Timeline

lebedev.ri created this revision.Sep 9 2020, 12:11 PM
lebedev.ri requested review of this revision.Sep 9 2020, 12:11 PM

Drop some more dead code..

lebedev.ri edited the summary of this revision. (Show Details)Sep 9 2020, 12:18 PM
lebedev.ri edited the summary of this revision. (Show Details)Sep 9 2020, 12:29 PM
Harbormaster completed remote builds in B71129: Diff 290789.
nikic added a comment.Sep 9 2020, 1:13 PM

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before. It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before.

Yes.

It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)

Before doing that, there is one more thing i guess i should try first - it doesn't make sense to reprocess *all* phis,
we should partition them by the type, and only reprocess the group in which we CSE'd.

lebedev.ri planned changes to this revision.Sep 9 2020, 2:43 PM

Will update tomorrow once http://llvm-compile-time-tracker.com/ processes variants.

lebedev.ri edited the summary of this revision. (Show Details)Sep 9 2020, 11:48 PM
lebedev.ri added a comment.EditedSep 9 2020, 11:52 PM

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before. It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)

Ok, i feel dumber than usual now.
I've accidentally pushed more than the commit in question, so those numbers included some unrelated changes.
The real numbers are: https://llvm-compile-time-tracker.com/compare.php?from=25f3cc0ced1759af1911c2446ac40fab4f5e5571&to=5f12d8b73ac75a487c454f4a197f6bb69305bfed&stat=instructions
And it's much more clear win than what we've looked at https://llvm-compile-time-tracker.com/compare.php?from=25f3cc0ced1759af1911c2446ac40fab4f5e5571&to=b61307f6f13cb1020d0940c2d3103a0a2493b2ce&stat=instructions
ReleaseLTO-g is now also a -0.08 improvement.
The only compile-time regressions are:

  • -O3: tramp3d-v4 118935M 118946M (+0.01%)
  • ReleaseLTO-g (link only): kimwitu++ 38596M 38620M (+0.06%)
  • ReleaseLTO-g (link only): consumer-typeset 41942M 41949M (+0.02%)

With that in light, do we still want cut-offs?

lebedev.ri requested review of this revision.Sep 9 2020, 11:52 PM

The point of the cutoff wouldn't be to handle anything in the LLVM testsuite; it would be to prevent it from blowing up on someone's giant machine-generated function which isn't represented in the testuite.

lebedev.ri retitled this revision from [NFC] EliminateDuplicatePHINodes(): drop DenseMap-driven CSE in favor of quadratic algorithmn to [NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32 PHI's, O(n^2) algo is faster (geomean -0.08%).Sep 13 2020, 2:09 PM
lebedev.ri edited the summary of this revision. (Show Details)
This comment was removed by lebedev.ri.

The point of the cutoff wouldn't be to handle anything in the LLVM testsuite; it would be to prevent it from blowing up on someone's giant machine-generated function which isn't represented in the testuite.

Done, thanks, PTAL.

This revision is now accepted and ready to land.Sep 16 2020, 7:19 PM

LGTM

Cool, thank you for the review!

This revision was landed with ongoing or failed builds.Sep 17 2020, 1:29 AM
This revision was automatically updated to reflect the committed changes.