This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] limit overflow intrinsic matching to a single basic block
ClosedPublic

Authored by spatel on Apr 24 2019, 8:59 AM.

Details

Summary

Using/updating a dominator tree to match math overflow patterns may be very expensive in compile-time, so just handle the single-block case.

Also, we were restarting the iterator loops when doing the overflow intrinsic transforms by marking the dominator tree for update. That was done to prevent iterating over a removed instruction. But we can postpone the deletion using the existing "RemovedInsts" structure, and that means we don't need to update the DT.

See post-commit thread for rL354298 for more details:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190422/646276.html

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Apr 24 2019, 8:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 24 2019, 8:59 AM

I think I'm missing something basic here. CGP will iterate until done, including trivial DCE. Given that, why not just leave the instructions in the IR and let the next iteration remove them? It would trigger an extra iteration, which might be undesirable, but would it be incorrect? That would seem better than invalidating the DomTree, which just forces a new iteration anyways...

Actually, it doesn't look like it includes trivial DCE. Maybe it should?

Assuming the answer is that yet, that would be correct, then this patch becomes a smallish optimization over an extra iteration right?

Thinking through the logic, we don't need to worry about the cmp instruction since the caller advanced the iterator beyond it. So, the only concern is the binary operator. But correct me if I'm wrong, don't you require the BO to dominate the Cmp? If so, then haven't we already iterated past that instruction? How do we have an iterator invalidation problem at all?

I think I'm missing something basic here. CGP will iterate until done, including trivial DCE. Given that, why not just leave the instructions in the IR and let the next iteration remove them? It would trigger an extra iteration, which might be undesirable, but would it be incorrect? That would seem better than invalidating the DomTree, which just forces a new iteration anyways...

Actually, it doesn't look like it includes trivial DCE. Maybe it should?

Assuming the answer is that yet, that would be correct, then this patch becomes a smallish optimization over an extra iteration right?

Yes, that sounds right. If CGP included DCE, then it would be easier to just rely on that for cleanup. Unfortunately, it doesn't. I don't know the full history here, but I suspect that since CGP was intended to be a temporary hack (see opening comment in the source file), nobody has bothered to invest in doing better.

But since this patch didn't solve the compile-time problem anyway, I'm going to propose a bigger fix anyway...

Thinking through the logic, we don't need to worry about the cmp instruction since the caller advanced the iterator beyond it. So, the only concern is the binary operator. But correct me if I'm wrong, don't you require the BO to dominate the Cmp? If so, then haven't we already iterated past that instruction? How do we have an iterator invalidation problem at all?

No, the binop does not necessarily dominate the compare if they are within a single block. If they are independent ops (no def-use relationship), then we don't know which order they are in within that block. We have regression tests with those patterns.

The history of these overflow transforms is that we wanted to do these as canonicalizations in instcombine, but that caused perf regressions. So then the add overflow transform got dumped here in CGP with a TLI hook to allow it. Then, I noticed that we weren't getting that transform consistently with some constant operands or the subtract overflow sibling. So I improved the pattern matching, and idealistically/naively tried to allow the transforms in all cross-block cases as long as the target said it was legal.

But that still caused perf regressions, so the transform was limited in 2 ways: (1) avoid hoisting the binop into a dominating cmp block and (2) avoid hoisting either op unless we had a direct predecessor/successor relationship.

At this point, I'm going to propose that we just give up on any cross-block attempts. I don't have any perf data to show that we got any wins from that, and I don't think anyone has the will to fix the backend problems and compile-time issues.

spatel updated this revision to Diff 196682.Apr 25 2019, 10:56 AM
spatel retitled this revision from [CodeGenPrepare] delay instruction deletion for efficiency to [CodeGenPrepare] limit overflow intrinsic matching to a single basic block.
spatel edited the summary of this revision. (Show Details)

Patch updated:
Don't try to match this pattern across blocks. That means we are not using a dominator tree at all, so if this doesn't restore compile-time perf, I'm not sure what would. :)

Limiting this transform to a single block means it really shouldn't be in CGP anymore. It could be implemented in SDAG. But I'd prefer to leave that to a follow-up patch once we confirm that this (1) restores compile-time and (2) does not cause perf regressions.

hfinkel added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
1191 ↗(On Diff #196682)

I'm actually not sure why this would be true, at least in the multiple-blocks case. If they're in the same block, then the dominance check can be expensive, but that's because of the scanning involved. And, if that's true, then an OBB cache might address the problem. In any case, I don't think that having point (3) here is useful, could be misleading, and isn't necessary given the first two points.

1206 ↗(On Diff #196682)

If the DT was expensive, I'd actually suspect it was this scanning that made it so on large blocks. Does this happen multiple times such that a OBB cache would help?

spatel marked an inline comment as done.May 2 2019, 2:06 PM
spatel added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
1206 ↗(On Diff #196682)

I only have 1 test case to play with (I'll try to attach it to this review for reference), and it doesn't have large blocks. It's a ~66K line function with ~6K blocks.

From what I can tell, the problem is that we reset the DT on any made change up at line 472. That seems like overkill, but I don't have the motivation to sort out which of the transforms in the loop could bypass that reset.

spatel added a comment.May 2 2019, 2:08 PM

- this is the test file provided by @yrouban that showed the compile-time regression and would improve by avoiding DT altogether as proposed in this patch.

hfinkel added inline comments.May 2 2019, 2:11 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
1206 ↗(On Diff #196682)

Okay. If you're going to mention that DT is slow, please put in the comment something about the fact that it's slow because we need to recompute it a lot (and, perhaps, why).

spatel updated this revision to Diff 197867.May 2 2019, 2:22 PM
spatel marked an inline comment as done.

Patch updated:
Added more details to the code comment about using the dominator tree. Ie, if anyone wants to use it for this transform (or possibly others), they are probably going to have to alter the function-level CGP loop to avoid resetting the DT on every change.

hfinkel accepted this revision.May 2 2019, 5:24 PM

LGTM

This revision is now accepted and ready to land.May 2 2019, 5:24 PM
This revision was automatically updated to reflect the committed changes.
spatel reopened this revision.May 3 2019, 10:43 AM

Reopening because the commit was reverted at rL359908. Must've missed some detail about using the RemovedInsts structure.

This revision is now accepted and ready to land.May 3 2019, 10:43 AM
spatel planned changes to this revision.May 3 2019, 10:44 AM
This revision was not accepted when it landed; it landed in state Changes Planned.May 4 2019, 5:44 AM
This revision was automatically updated to reflect the committed changes.