This is an archive of the discontinued LLVM Phabricator instance.

New pass: guard widening
ClosedPublic

Authored by sanjoy on May 10 2016, 5:10 PM.

Details

Summary

Implement guard widening in LLVM. Description from GuardWidening.cpp:

The semantics of the @llvm.experimental.guard intrinsic lets LLVM
transform it so that it fails more often that it did before the
transform. This optimization is called "widening" and can be used hoist
and common runtime checks in situations like these:

%cmp0 = 7 u< Length
call @llvm.experimental.guard(i1 %cmp0) [ "deopt"(...) ]
call @unknown_side_effects()
%cmp1 = 9 u< Length
call @llvm.experimental.guard(i1 %cmp1) [ "deopt"(...) ]
...

to

%cmp0 = 9 u< Length
call @llvm.experimental.guard(i1 %cmp0) [ "deopt"(...) ]
call @unknown_side_effects()
...

If %cmp0 is false, @llvm.experimental.guard will "deoptimize" back
to a generic implementation of the same function, which will have the
correct semantics from that point onward. It is always _legal_ to
deoptimize (so replacing %cmp0 with false is "correct"), though it may
not always be profitable to do so.

Diff Detail

Event Timeline

sanjoy updated this revision to Diff 56840.May 10 2016, 5:10 PM
sanjoy retitled this revision from to New pass: guard widening.
sanjoy updated this object.
sanjoy added a subscriber: llvm-commits.
reames edited edge metadata.May 10 2016, 6:19 PM

A few initial comments mostly looking at the test cases. I'll take a look at the actual code separately.

My meta concern here is that I think you're trying to be way too aggressive here. "One pass to rule them all" is not generally a good approach. I suspect that smaller changes to InstCombine, EarlyCSE, GVN, and LICM would end up covering most of the interesting cases. Even if we decide to go with a single check widening pass, I'd prefer to see it start dead simple and evolve.

test/Transforms/GuardWidening/basic.ll
27

Add a check not to show the second one being removed.

43

Did you mean trivial here? An icmp is pretty easy to move around...

98

Doing this here feels weird. This should be an addition to loop unswitching or possibly even LICM itself.

122

Note sure what you're getting at here. The branch condition could still correlate. Are you assuming healing? If so, comment.

167

Wait what? A would absolutely expect cond_1 and cond_2 to be combined here. Why would I ever want to not do that?

A few initial comments mostly looking at the test cases. I'll take a look at the actual code separately.

My meta concern here is that I think you're trying to be way too aggressive here. "One pass to rule them all" is not generally a good approach. I suspect that smaller changes to InstCombine, EarlyCSE, GVN, and LICM would end up covering most of the interesting cases. Even if we decide to go with a single check widening pass, I'd prefer to see it start dead simple and evolve.

I'd say the pass in its current form is already fairly simple (~200 LOC, if you don't count the administrative overhead). There may be more optimal ways of spreading out the logic across other passes, but before I do that I want to be clear about what works and what doesn't, from a performance standpoint. Having a separate pass also lets me avoid making other passes pay the compile-time costs associated with widening (e.g. if I wanted to do cross-block widening in GVN/EarlyCSE, I'd have to compute the post-dom tree and loop info in GVN/EarlyCSE, something it doesn't do today).

test/Transforms/GuardWidening/basic.ll
27

Will do.

43

No, I did mean "non-trivial" (with "trivial" meaning "nothing needs to be moved"), but I see why that can sound weird. I'll s/non-trivial/trivial/ :)

98

This bit falls out of the cost/benefit logic we already have. We can consider doing better / more nuanced version of widening in LICM and LoopUnswitch if needed.

122

Yes, I've assumed healing (not yet implemented); will add a comment.

167

cond_1 and cond_2 are (intentionally) volatile loads, so you cannot hoist the computation producing cond_2 to before the first guard (and hence can't widen the first one to check cond_1).

majnemer added inline comments.
lib/Transforms/Scalar/GuardWidening.cpp
82–96

Your enum is defined in a class, I think you can forgo the WS prefix if you'd like.

186–187

match(&I, m_Intrinsic<Intrinsic::experimental_guard>()) ?

358–359

Is this saving anything over BinaryOperator::CreateAnd(Cond0, Cond1, "wide.chk", InsertPt);?

sanjoy updated this revision to Diff 56953.May 11 2016, 1:03 PM
sanjoy marked 2 inline comments as done.
sanjoy edited edge metadata.
  • Address review

Realized I had an unsubmitted comment

lib/Transforms/Scalar/GuardWidening.cpp
82–96

I think I'll keep it for now, it reads a little better when I use it without qualification.

atrick edited edge metadata.May 16 2016, 6:39 PM

I did a fairly quick review. Your pass looks pretty great.

I don't like to see unbounded recursion through multiple operands:

+ return all_of(Inst->operands(), [&](Value *Op) {
+ return isAvailableAt(Op, Loc);
+ });

...having had to fix many of these problems in LLVM when a
pathological case leads to exponential compile time. As a rule, I
think you either need a very small limit on the recursion, or a
visited set.

You're evaluating every guard against all dominating guard intrinsics:

+ for (unsigned i = 0, e = DFSI.getPathLength(); i != e; ++i) {
+ const auto &GuardsInCurBB = GuardsInBlock.find(CurBB)->second;
+ auto I = GuardsInCurBB.begin();
+ auto E = GuardsInCurBB.end();
+ for (auto *Candidate : make_range(I, E)) {

It's a nice simple design, but I wonder how well it scales (with
massive inlining). Since you're only matching guards with the same LHS
cmp operand, you could use a scoped hashtable instead. That way most
guards don't need to be searched. OTOH you would be maintaining a
complex side data structure.

I did a fairly quick review. Your pass looks pretty great.

I don't like to see unbounded recursion through multiple operands:

+ return all_of(Inst->operands(), [&](Value *Op) {
+ return isAvailableAt(Op, Loc);
+ });

...having had to fix many of these problems in LLVM when a
pathological case leads to exponential compile time. As a rule, I
think you either need a very small limit on the recursion, or a
visited set.

For now I'll add a cap on the recursion. This is something I expect
to revisit once I have this running over Real Code(TM).

You're evaluating every guard against all dominating guard intrinsics:

+ for (unsigned i = 0, e = DFSI.getPathLength(); i != e; ++i) {
+ const auto &GuardsInCurBB = GuardsInBlock.find(CurBB)->second;
+ auto I = GuardsInCurBB.begin();
+ auto E = GuardsInCurBB.end();
+ for (auto *Candidate : make_range(I, E)) {

It's a nice simple design, but I wonder how well it scales (with
massive inlining). Since you're only matching guards with the same LHS

Honestly, I suspect it won't scale very well. My priority now is to run some
real world code through this, and get a stronger sense of what
matters; and use that information to help drive improvements here.

cmp operand, you could use a scoped hashtable instead. That way most
guards don't need to be searched. OTOH you would be maintaining a
complex side data structure.

reames requested changes to this revision.May 16 2016, 6:53 PM
reames edited edge metadata.
reames added inline comments.
include/llvm/Transforms/Scalar/GuardWidening.h
13

You should expand this comment. Specifically, clarify exceptions around re-execution and healing.

Actually, never mind. This is available in the source.

lib/Transforms/Scalar/GuardWidening.cpp
121

Interface wise, this might be clearer as returning an optional<Value*> or just a Value* which can be null.

127

Generating extra IR just for checking profitability seems less than idea. Could this be combined into a single function which does the replacement if profitable and nothing otherwise?

214

Please add an assert that curBB has been populated in GuardsInBlock. Your lazy population scheme is fine, but the code doesn't make it *obvious* that's it correct.

226

You have an assumption here that the guards are recorded in the order they appeared in the original basic block. Please document that near the definition of the data structure.

230

As far as I understand it, there is never a case where widening one pair of guards would prevent you from merging that widened guard into another. You code is structured as if this is a possibility and it results in excessive generality.

The excess generality here is causing you to implement an O(G^2) algorithm for something which should be a collection of a O(N) algorithms. This is unacceptable.

254

In general, you *should* be able to simply delete the second guard here. The fact you can't, indicates there's a problem with the algorithm you've chosen.

288

This really seems likely to be excessive generality again. You don't need a generic hoisting algorithm.

301

Buggy in the face of unreachable code:
%a = add i32 %a, 1

359

Is there possibly a missing return here?

This revision now requires changes to proceed.May 16 2016, 6:53 PM

Sanjoy and I talked about this one offline and I wanted to highlight a key point which is not obvious from the review thread:

This code is not intended to be long term solution to guard widening. This is essentially an experimental piece of code whose purposes is to inform what real optimization passes need to be extended to handle the majority of cases which arrive in practice. It was deliberately designed to be simple and effective at the cost of compile time.

My own review was from the perspective of this being a "real" optimization pass. I don't know if we have a history of landing and evolving such experimental pieces of code in tree. I certainly wouldn't feel comfortable approving such a thing since I have a clear conflict of interest, but I'm also not opposed to the notion either.

What do others think? Is having a pass that is expected to be experimental and unlikely to survive in its current form in tree a good idea?

IMO this seems as good a form as any for being in-tree, and I think it makes sense to have in-tree given that it's really part of the implementation of the guard intrinsic. You could "hide" the code in other passes, but it's still specific to your compiler until others start using the intrinsic. This way, the experimental feature is self-contained.

Also, I'm curious how integrating it with other passes would actually improve the compile time.

If you're talking about adding this pass to the standard pipeline, that's a different story.

sanjoy marked an inline comment as done.May 17 2016, 1:48 PM

The non- O(n^2) bits have been addressed. As discussed above, there still is pending work to make this scale well with production ready workloads (I've noted this in the pass itself).

lib/Transforms/Scalar/GuardWidening.cpp
121

That won't work (the return value does not indicate if Result was set) -- there really are two independent results.

127

It doesn't generate extra IR if InsertPt is nullptr.

226

That was obvious at the construction site, so I've added an assert here.

301

Added an assert.

359

No, we're falling through to the return false; below.

sanjoy updated this revision to Diff 57515.May 17 2016, 1:50 PM
sanjoy edited edge metadata.
  • Address review comments

For the record, even for an experimental pass, I'm objecting to O(2^n) recursion not O(n^2).

sanjoy updated this revision to Diff 57525.May 17 2016, 2:49 PM
sanjoy edited edge metadata.
  • Address review: add a fairly low cutoff to makeAvailableAt
  • Fix a missing check on Index in the assert checking that GuardsInCurBB is in order

That really is a low threshold :) LLVM likes to use #6 to limit search depth in other places.

That really is a low threshold :) LLVM likes to use #6 to limit search depth in other places.

I'd say it is sufficient for now; I'll bump it a bit higher if needed later, or do something more targeted (e.g. only hoist icmp's and arithmetic, since that's the only bits the rest of the pass can handle).

chandlerc edited edge metadata.May 17 2016, 6:17 PM

Sanjoy and I talked about this one offline and I wanted to highlight a key point which is not obvious from the review thread:

This code is not intended to be long term solution to guard widening. This is essentially an experimental piece of code whose purposes is to inform what real optimization passes need to be extended to handle the majority of cases which arrive in practice. It was deliberately designed to be simple and effective at the cost of compile time.

My own review was from the perspective of this being a "real" optimization pass. I don't know if we have a history of landing and evolving such experimental pieces of code in tree. I certainly wouldn't feel comfortable approving such a thing since I have a clear conflict of interest, but I'm also not opposed to the notion either.

What do others think? Is having a pass that is expected to be experimental and unlikely to survive in its current form in tree a good idea?

I think this is fine. We have several passes in-tree that are not realistic passes to add to the default pipeline:

  • reg2mem
  • mem2reg
  • bb-vectorize

Probably others that I'm forgetting. Some of these are testing or experimentation passes. Some are prototypes that help us understand a problem. I think bb-vectorize is a great example of this. Maybe it doesn't fit the cost model for the majority of targets and/or users, or the compile time hit is just too significant, but it teaches us about how to do non-loop vectorization and is valuable to have in tree IMO as a consequence.

I'm perfectly happy with this pass going in in just such a fashion. I would carefully document at the top of the file and elsewhere that this is *not* intended to be a practical or production-ready pass, but is intended to somewhat brute force expose opportunities to make it clear where we should focus other efforts.

-Chandler

That really is a low threshold :) LLVM likes to use #6 to limit search depth in other places.

I'd say it is sufficient for now; I'll bump it a bit higher if needed later, or do something more targeted (e.g. only hoist icmp's and arithmetic, since that's the only bits the rest of the pass can handle).

Personally, I find it odd that you don't want a set as opposed to a limit...

(Peanut gallery comment -- shouldn't impact the progress of the patch)

With SmallPtrSet you don't have to worry much about the overhead of the visited set. But I understand keeping the code super simple in situations like this when you're really just pattern matching your bounds checks.

Just to make the picture I have in my head explicit, ideally, the path forward will be:

  1. Figure out what cases actually matter in practice
  2. Write test cases for those, and add logic as needed

Repeat (1) and (2) till we think we have adequate coverage.

  1. Trim down the brute-force O(N^2) algorithm to something nicer while still making sure we get the important cases
sanjoy updated this revision to Diff 57689.May 18 2016, 3:35 PM
sanjoy edited edge metadata.

Use ptrset to avoid exponential behavior, and add test case for it.

(The motivation being that I very quickly ran into a case where a
hoisting threshold of 2 was too small :) )

reames accepted this revision.May 18 2016, 3:42 PM
reames edited edge metadata.

LGTM given the previous discussion about this being an experimental pass.

lib/Transforms/Scalar/GuardWidening.cpp
38

Extend: In particular, it is known to have O(N^2) running time and will not scale to large numbers of guards.

This revision is now accepted and ready to land.May 18 2016, 3:42 PM
This revision was automatically updated to reflect the committed changes.