Page MenuHomePhabricator

[InstCombine] Simplify @llvm.usub.with.overflow+non-zero check (PR43251)
AcceptedPublic

Authored by lebedev.ri on Mon, Sep 9, 7:59 AM.

Details

Summary

This is again motivated by D67122 sanitizer check enhancement.
That patch seemingly worsens -fsanitize=pointer-overflow
overhead from 25% to 50%, which strongly implies missing folds.

In this particular case, given

char* test(char& base, unsigned long offset) {
  return &base - offset;
}

it will end up producing something like
https://godbolt.org/z/luGEju
which after optimizations reduces down to roughly

declare void @use64(i64)
define i1 @test(i8* dereferenceable(1) %base, i64 %offset) {
  %base_int = ptrtoint i8* %base to i64
  %adjusted = sub i64 %base_int, %offset
  call void @use64(i64 %adjusted)
  %not_null = icmp ne i64 %adjusted, 0
  %no_underflow = icmp ule i64 %adjusted, %base_int
  %no_underflow_and_not_null = and i1 %not_null, %no_underflow
  ret i1 %no_underflow_and_not_null
}

Without D67122 there was no %not_null,
and in this particular case we can "get rid of it", by merging two checks:
Here we are checking: Base u>= Offset && (Base u- Offset) != 0, but that is simply Base u> Offset

Alive proofs:
https://rise4fun.com/Alive/QOs

The @llvm.usub.with.overflow pattern itself is not handled here
because this is the main pattern, that we currently consider canonical.

https://bugs.llvm.org/show_bug.cgi?id=43251

Diff Detail

Event Timeline

lebedev.ri created this revision.Mon, Sep 9, 7:59 AM
vsk added inline comments.Mon, Sep 9, 1:02 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

Does the 'or of icmps' case arise anywhere?

lebedev.ri marked 2 inline comments as done.Mon, Sep 9, 1:15 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

Define arise. We can trivially get one from another via De Morgan laws:
(a && b) ? c : d <--> !(a && b) ? d : c <--> (!a || !b) ? d : c,
It's *really* bad idea to intentionally not handle such nearby patterns.

vsk added inline comments.Mon, Sep 9, 1:24 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

As in, do unsigned underflow checks tend to contain this specific or-of-icmp pattern, or more generally whether programs in the wild tend to.

If this isn't the case, then it seems like there's a compile-time cost for optimizing the pattern without any compensating benefit.

lebedev.ri marked 2 inline comments as done.Mon, Sep 9, 1:29 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

I'm not sure what the question is.

xbolva00 added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

If a compile time cost is concern, put it to AgressiveInstCombine - but I dont think we should do it in this case since there is nothing expensive like ValueTracking here..

I like Roman’s current code as is.

vsk added inline comments.Mon, Sep 9, 3:59 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

I'm asking: why bother optimizing this pattern, or what is the positive case for doing so (apart from "it's possible to do so")? Essentially my previous comment, with a question mark at the end :).

lebedev.ri marked an inline comment as done.Tue, Sep 10, 1:03 AM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

I'm still unable to answer the question because i'm not sure about which pattern specifically you are talking about.
We can't just hope that the pattern we will always get is (a && b) ? c : d , it can trivially be converted to the or-form.

lebedev.ri edited the summary of this revision. (Show Details)Tue, Sep 10, 6:04 AM
lebedev.ri edited the summary of this revision. (Show Details)Tue, Sep 10, 6:36 AM

Patch updated: handle two more predicates
https://rise4fun.com/Alive/gjT

vsk added inline comments.Tue, Sep 10, 12:11 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

Stepping back a bit, I think the approach here should be driven by data.

You're right to be concerned about a change in the pipeline causing a performance regression ubsan. I don't think the answer to that is "introduce folds for all kinds of patterns the compiler *might* see", because then we'd endlessly be spinning our wheels and writing folds that aren't necessarily helpful (just extra maintenance & compile-time burden for no benefit).

Instead, the approach should be to focus on the actual end goal. We can check in a simple -fsanitize=pointer-overflow benchmark, perhaps your rawspeed bencharmk, to the llvm test suite. There are tons of bots that monitor the performance changes in the test suite (close to) every commit. Then, it makes sense to land targeted changes to improve performance. If there's ever a regression, we will actually know for sure, and be able to narrow it down, introduce a targeted fix, etc.

Taking a narrower focus again, I thought it was clear that we were discussing the 'or' form, but let me rephrase. Are the changes in "foldOrOfICmps" necessary to improve performance on your benchmark?

xbolva00 added inline comments.Tue, Sep 10, 12:49 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

just extra maintenance

Well, we could leave it as is, but after some weeks/months/years we would realize we need this fold and somebody may implement it from the scratch instead of current few lines.

compile-time burden

Well, I quite suprised that this concern is repeated so often for instcombine's fold but major new things introduced to other passes just go in.

lebedev.ri added inline comments.Tue, Sep 10, 12:55 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

Are the changes in "foldOrOfICmps" necessary to improve performance on your benchmark?

Ok, thank you for your question.
I'm going to answer honestly: i don't know, i didn't check, and more importantly i don't care.

By short-shortsightedly handling only the most obvious pattern
one guarantees that the work will have to be redone to

  • spot underoptimized pattern
  • recognize that it can be folded
  • come up with proof
  • write tests
  • find where to patch
  • patch
  • submit patch & get it through review.

That all is time-wasteful given it could have been
trivially solved when the initial pattern was being added.
Granted, not *all* variations will be caught,
but most obvious ones will be.

Even in my limited time contributing to instcombine pass
i have seen this play out, and it isn't fun to find
that it is due to not considering even basic commutativity
(and the question at hand,
(a && b) ? c : d <--> !(a && b) ? d : c <--> (!a || !b) ? d : c
is basic commutativity)

And finally, 'just as a glimpse of things to come',
that question will not exist (in it's current form at least)
when one day the peep-hole pass is SMT driven.
(well, more specifically all the folds would be pre-auto-deduced)

Let me know if that answers the question?

xbolva00 added inline comments.Tue, Sep 10, 1:00 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

when one day the peep-hole pass is SMT driven.

compile time would be crazy :D

lebedev.ri marked 2 inline comments as done.Tue, Sep 10, 1:09 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

when one day the peep-hole pass is SMT driven.

compile time would be crazy :D

Yes, correct observation :D thus

(well, more specifically all the folds would be pre-auto-deduced)

so it wouldn't actually do any SMT during opt..

vsk added a subscriber: majnemer.Tue, Sep 10, 1:42 PM
vsk added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

To your first concern, llvm contributors are generally pretty good at spotting opportunities to reuse code, so I'm not too worried. Besides, @lebedev.ri's already done the work and left a paper trail :).

Second, note that InstCombine really is one of the biggest (top 3, IIRC) compile-time bears in the mid-level pipeline (http://lists.llvm.org/pipermail/llvm-dev/2017-March/111257.html). The marginal cost of introducing a new fold to InstCombine may be low, but adding folds without regard for the time-benefit tradeoff is what gets us into trouble.

2268

So, "i didn't check, and more importantly i don't care" is not a great approach here :). And it's not at all clear to me that work will have to be redone: perhaps only a subset of these changes will ever be needed! Imho we should let performance data and real-world examples guide us, not intuitions and hunches.

Perhaps I'm not making the case for this well. Paging @majnemer -- as the code owner, wdyt?

lebedev.ri removed a reviewer: vsk.Tue, Sep 10, 1:56 PM
lebedev.ri marked an inline comment as done.
lebedev.ri added a subscriber: vsk.
lebedev.ri marked 9 inline comments as done.Tue, Sep 10, 2:49 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2268

I have nothing further to add here.
To me it's exactly the same question as "shall we only match (X * Y) ult Z or shall we also match Z ugt (X * Y)?"

lebedev.ri marked an inline comment as done.
lebedev.ri edited the summary of this revision. (Show Details)

Split off instsimplify changes

Rebased by swapping patches around.

@vsk @majnemer please can you specify, is this patch blocked by you?

This stand-off is really super unproductive for forward progress.

It both blocks the UBSan patch itself, and makes adding further folds
that are needed for that patch both more complicated
(it's best not to have many patches in-flight),
and makes them "illegal" - i don't overlook similar commutative cases in them.

:[

xbolva00 added a comment.EditedWed, Sep 18, 2:31 PM

Btw, I am in favour of this patch.

It both blocks the UBSan patch itself,

Well, I dont think so. ubsan patch works now, right? these folds just makes overhead a bit smaller, right? I would not connect ubsan patch with these folds. We could have ubsan patch in tree and these folds would land a bit later - no problem I think. Or miss I something?

xbolva00 accepted this revision.Wed, Sep 18, 2:32 PM
This revision is now accepted and ready to land.Wed, Sep 18, 2:32 PM

Btw, I am in favour of this patch.

It both blocks the UBSan patch itself,

Well, I dont think so. ubsan patch works now, right? these folds just makes overhead a bit smaller, right? I would not connect ubsan patch with these folds. We could have ubsan patch in tree and these folds would land a bit later - no problem I think. Or miss I something?

No, you are correct, these aren't *required* for the ubsan patch, but they are
very much welcomed - i suspect they will reduce the extra overhead significantly.
(Random guess - down from extra 26% to extra 10%, but i won't know until these folds are in place.)
(for optimized builds, not much can help -O0 :S)