This is an archive of the discontinued LLVM Phabricator instance.

[LoopInstSimplify] Re-implement the core logic of loop-instsimplify to be both simpler and substantially more efficient.
ClosedPublic

Authored by chandlerc on May 26 2018, 2:55 AM.

Details

Summary

Rather than use a hand-rolled iteration technique that isn't quite the
same as RPO, use the pre-built RPO loop body traversal utility.

Once visiting the loop body in RPO, we can assert that we visit defs
before uses reliably. When this is the case, the only need to iterate is
when simplifying a def that is used by a PHI node along a back-edge.
With this patch, the first pass over the loop body is just a complete
simplification of every instruction across the loop body. When we
encounter a use of a simplified instruction that stems from a PHI node
in the loop body that has already been visited (due to some cyclic CFG,
potentially the loop itself, or a nested loop, or unstructured control
flow), we recall that specific PHI node for the second iteration.
Nothing else needs to be preserved from iteration to iteration.

On the second and later iterations, only instructions known to have
simplified inputs are considered, each time starting from a set of PHIs
that had simplified inputs along the backedges.

Dead instructions are collected along the way, but deleted in a batch at
the end of each iteration making the iterations themselves substantially
simpler. This uses a new batch API for recursively deleting dead
instructions.

This alsa changes the routine to visit subloops. Because simplification
is fundamentally transitive, we may need to visit the entire loop body,
including subloops, to handle knock-on simplification.

I've added a basic test file that helps demonstrate that all of these
changes work. It includes both straight-forward loops with
simplifications as well as interesting PHI-structures, CFG-structures,
and a nested loop case.

Diff Detail

Event Timeline

chandlerc created this revision.May 26 2018, 2:55 AM
chandlerc updated this revision to Diff 148711.May 26 2018, 2:55 AM

Fix some formatting goofs that snuck in w/o a run of clang-format to clean them up.

asbirlea accepted this revision.May 29 2018, 10:46 AM

LGTM. Looks like a clean simplification to me.

llvm/lib/Transforms/Scalar/LoopInstSimplify.cpp
131

Nit: "an iteration over all instructions in all blocks." perhaps.
For code readability it would be easier to understand IMO. Alternatively " end loop over instructions", " end loop over blocks" for above closing braces, but that seems too much.

This revision is now accepted and ready to land.May 29 2018, 10:46 AM
sanjoy accepted this revision.May 29 2018, 12:07 PM
sanjoy added inline comments.
llvm/lib/Transforms/Scalar/LoopInstSimplify.cpp
92

How about pulling out ToSimplify->empty() into a bool IsFirstIteration? I think that will make the intent clearer.

118

Might be worth asserting that if !L.contains(UserI) then UserI is a PHI node. Up to you.

asbirlea added inline comments.May 29 2018, 1:10 PM
llvm/lib/Transforms/Utils/Local.cpp
448

assert for isInstructionTriviallyDead too?

chandlerc marked 4 inline comments as done.May 29 2018, 1:18 PM

All suggestions implemented, submitting now, thanks!

This revision was automatically updated to reflect the committed changes.
mzolotukhin added inline comments.
llvm/trunk/lib/Transforms/Scalar/LoopInstSimplify.cpp
82 ↗(On Diff #148973)

I wonder if it would be more efficient to iterate through ToSimplify instead of all instructions in all blocks. What do you think?

chandlerc added inline comments.May 30 2018, 9:57 AM
llvm/trunk/lib/Transforms/Scalar/LoopInstSimplify.cpp
82 ↗(On Diff #148973)

I think it's trickier than that, but let me know if you see a way that seems more promising.

We need to visit defs before uses to reliably converge on the simplified result without revisiting instructions even more times. And we don't know all of the instructions we will need to visit, we only know the first ones. Each time we simplify an instruction we (potentially) grow the ToSimplify set.

So to only look at the instructions in the ToSimplify set, I think we'd need a system like the following:

  • First build a mapping from every instruction in the loop body to an integer so that they can be cheaply sorted in RPO. We could do this in the first iteration.
  • Build a sorted worklist in addition to a ToSimplify set, and every time we add an instruction to the ToSimplify set, insert it into the worklist in the correct (based on sort) position.
  • Walk that worklist rather than all the instructions in the loop body.

To build a worklist like this which has good algorithmic properties, what you'd probably want is to make ToSimplify actually a std::set or some other ordered search tree with cheap in-order insertion because we'll constantly be inserting into all kinds of different positions.

A std::set (or similar) tree data structure combined with the std::less needing to do two hash table lookups to get the sort key seemed like it would end up having a really frustratingly high overhead, and that's why I didn't immediately jump to this solution.

Another approach which doesn't have any better *algorithmic* properties in the worst case than the current one but might have hilariously better practical properties would be as follows:

  • Build a vector of instruction pointers in RPO order during the initial traversal, and a map from instruction to vector index.
  • Instead of using a ToSimplify set, use a sparse bitvector where the bit represents that the instruction at this index is in the set.
  • Walk the sparse bitvector in order in all subsequent iterations, and then use the index of the set bits to find the instruction at that position in the RPO.

Technically, in the worst case, this is the same as the current approach -- for a loop with N instructions that we iterate on M times we do O(N*M) work. But with this approach, that work involves looking a bit in a fairly cache-optimized data structure rather than walking to a linked list node and testing it in a hash table. And as the iteration becomes sparse, we actually get algorithmic improvements that would likely help in the average case.

The down side is that it is much more complex and requires a decent amount of memory, which is why I didn't implement it at first.

I'm happy to pursue either of these if you and others thing it is worthwhile. I'm not sure we'll ever have a test case where this shows up as a practical problem, but that doesn't mean one won't show up. =D I mostly had trouble making the trade-offs here, and this patch seemed a strict improvement.

I'll admit I'm more reluctant about the first approach as I suspect its algorithmic scaling will be completely undermined by the practical cost of maintaining the sorted data structure.

But the second approach is super appealing. I didn't think super carefully about it when I wrote this the first time to convince myself it would be effective, but I'm happy to go do this work if you think its worthwhile based on my description.

mzolotukhin added inline comments.May 30 2018, 11:14 AM
llvm/trunk/lib/Transforms/Scalar/LoopInstSimplify.cpp
82 ↗(On Diff #148973)

A std::set (or similar) tree data structure combined with the std::less needing to do two hash table lookups to get the sort key seemed like it would end up having a really frustratingly high overhead, and that's why I didn't immediately jump to this solution.

I was thinking about priority_queue here, which should have all the desired properties and supposedly less overhead, than set. We still will have to spend a lot of memory for enumerating all instructions, so maybe it's not worth it after all.

...this patch seemed a strict improvement.

I don't argue with that :)