This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Shift amount reassociation in shifty sign bit test (PR43595)
ClosedPublic

Authored by lebedev.ri on Oct 13 2019, 11:24 AM.

Details

Summary

This problem consists of several parts:

  • Basic sign bit extraction - trunc? (?shr %x, (bitwidth(x)-1)). This is trivial and easy to do; we already have a fold for it.
  • Shift amount reassociation - if we have two identical shifts, and we can simplify the sum of their shift amounts, then we can likely perform them as a single shift. But this is finicky: it has one-use restrictions, and the shift opcodes must be identical.
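The two component folds can be illustrated with plain C++ on 32-bit integers (an illustrative sketch of the arithmetic, not the InstCombine code itself):

```cpp
#include <cassert>
#include <cstdint>

// Basic sign bit extraction: a logical right shift by bitwidth-1
// leaves just the sign bit, so the result is 1 for negative inputs
// (when reinterpreted as signed) and 0 otherwise.
uint32_t signBitLogical(uint32_t X) { return X >> 31; }

// Shift amount reassociation: two same-direction shifts whose amounts
// sum to less than the bit width collapse into a single shift.
uint32_t twoShifts(uint32_t X, unsigned A, unsigned B) { return (X >> A) >> B; }
uint32_t oneShift(uint32_t X, unsigned A, unsigned B) { return X >> (A + B); }
```

The reassociation only holds while A + B stays below the bit width; past that, the IR shift would be poison, which is why the real fold has to check the summed amount.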

But there is a super-pattern where both of these work together to produce a sign bit test from two shifts plus a comparison.
We do indeed already handle this in most cases. But since we get that fold transitively, it has one-use restrictions.
What's worse, in this case the right shifts aren't required to be identical, and we can't handle that transitively:

If the total shift amount is bitwidth-1, only the sign bit will remain in the output value.
But if we look at this from the perspective of two shifts, we can't fold - we can't possibly know what bit pattern
we'd produce via two shifts. It will be *some* kind of mask produced from the original sign bit, but we just can't
tell its shape: https://rise4fun.com/Alive/cM0 https://rise4fun.com/Alive/9IN

But it will *only* contain the sign bit and zeros. So from the perspective of a sign bit test, we're good:
https://rise4fun.com/Alive/FRz https://rise4fun.com/Alive/qBU
Superb!
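As a concrete sanity check of that claim, here is an illustrative C++ model of the super-pattern, deliberately mixing an arithmetic and a logical right shift whose amounts sum to bitwidth-1 (this assumes the compiler implements signed right shift as an arithmetic shift, which all mainstream compilers do):

```cpp
#include <cassert>
#include <cstdint>

// The super-pattern: two right shifts (of possibly different kinds)
// followed by a != 0 comparison. Here the first shift is arithmetic
// and the second is logical; the kinds need not match.
bool viaTwoShifts(uint32_t X, unsigned C1, unsigned C2) {
  uint32_t Shifted = (uint32_t)((int32_t)X >> C1) >> C2;
  return Shifted != 0;
}

// The fold's claimed result: when C1 + C2 == bitwidth-1, the pattern
// above is just a test of the sign bit of X.
bool signBitTest(uint32_t X) { return (int32_t)X < 0; }
```

Whenever C1 + C2 == 31, the two functions agree for every input, even though the intermediate value after the two shifts is some mask whose exact shape depends on the shift kinds.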

So the simplest solution is to extend reassociateShiftAmtsOfTwoSameDirectionShifts() with a
pseudo-analysis mode that ignores extra uses and only checks whether a) those are two right shifts
and b) they end up with a bitwidth(x)-1 shift amount, and returns either the original value whose
sign we are checking, or null.
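As an illustration of what such an analysis-only mode has to decide (a hypothetical, simplified standalone sketch - the real reassociateShiftAmtsOfTwoSameDirectionShifts() operates on IR instructions and APInt shift amounts, not raw integers):

```cpp
#include <cassert>

// Simplified sketch: given the bit width and the constant shift amounts
// of two right shifts, decide whether their combination isolates the
// sign bit, i.e. whether the amounts sum to exactly bitwidth-1.
bool isSignBitExtraction(unsigned BitWidth, unsigned ShAmt0, unsigned ShAmt1) {
  // An individual out-of-range shift amount would already be poison in
  // IR, so treat it as "no match".
  if (ShAmt0 >= BitWidth || ShAmt1 >= BitWidth)
    return false;
  return ShAmt0 + ShAmt1 == BitWidth - 1;
}
```

In the analysis mode, a match means the caller can rewrite the whole shift+compare chain as a direct sign bit test of the original value; a non-match returns null and leaves the IR alone.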

There is no functional change to the existing behavior of reassociateShiftAmtsOfTwoSameDirectionShifts().

https://bugs.llvm.org/show_bug.cgi?id=43595

Diff Detail

Event Timeline

lebedev.ri created this revision. Oct 13 2019, 11:24 AM

Any thoughts on this?

Any thoughts on this?

It raises the usual questions: cost/complexity vs. benefit, and how much further do we want to take this pattern matching before moving it somewhere else (AggressiveInstCombine)?
I don't think we'll reach consensus on that in this patch, so I've just noted a small improvement.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
29–30

Code comment typo and difficult to parse. Maybe better:
AnalyzeForSignBitExtraction indicates that we will only analyze whether this pattern has any two right-shifts that sum to one less than the original bit width.

Any thoughts on this?

It raises the usual questions: cost/complexity vs. benefit, and how much further do we want to take this pattern matching before moving it somewhere else (AggressiveInstCombine)?
I don't think we'll reach consensus on that in this patch

Indeed. I was hoping that it wouldn't come to this patch; I thought the general folds
(reassociateShiftAmtsOfTwoSameDirectionShifts(), foldShiftIntoShiftInAnotherHandOfAndInICmp())
would handle it, but no, sadly I'm still seeing regressions (although smaller!) in the target benchmark,
and analysis of the IR points to *this* missing pattern. I can't yet tell what other folds are needed, though.

so I've just noted a small improvement.

Thanks!

lebedev.ri marked 2 inline comments as done.

Adjust comment wording as proposed.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
29–30

Indeed, that is better, and it fixes the param name.

sadly I'm still seeing regressions (although smaller!) in the target benchmark, and analysis of the IR points to *this* missing pattern

But does this or the base pattern show up somewhere else (test-suite/clang/chromium) besides your benchmark?

I think this is a very rare pattern, and it is not good that we are adding another rare pattern to InstCombine, which is already a slow pass. I still think AggressiveInstCombine is a better solution - or improve your code of interest manually.

sadly I'm still seeing regressions (although smaller!) in the target benchmark, and analysis of the IR points to *this* missing pattern

But does this or the base pattern show up somewhere else (test-suite/clang/chromium) besides your benchmark?

It's not "your benchmark"; it is real code. I'm not pulling it out of thin air.

I think this is a very rare pattern, and it is not good that we are adding another rare pattern to InstCombine, which is already a slow pass. I still think AggressiveInstCombine is a better solution - or improve your code of interest manually.

sadly I'm still seeing regressions (although smaller!) in the target benchmark, and analysis of the IR points to *this* missing pattern

But does this or the base pattern show up somewhere else (test-suite/clang/chromium) besides your benchmark?

It's not "your benchmark"; it is real code. I'm not pulling it out of thin air.

I don't think the intent was to disrespect whatever code this pattern showed up in, but to ask if there are any stats for common clang benchmarks. That's the only way we can draw a line on these kinds of decisions. Every pattern is presumably showing up somewhere real for someone, so we can't distinguish where it belongs just based on existence.

My opinion (and others have disagreed with this) is that we always want the fold somewhere; it's just a question of where does it go to minimize cost/maximize benefit.

I don't think the intent was to disrespect whatever code this pattern showed up in, but to ask if there are any stats for common clang benchmarks. That's the only way we can draw a line on these kinds of decisions. Every pattern is presumably showing up somewhere real for someone, so we can't distinguish where it belongs just based on existence.

Strong +1.

lebedev.ri added a comment. Edited Oct 18 2019, 2:53 PM

sadly I'm still seeing regressions (although smaller!) in the target benchmark, and analysis of the IR points to *this* missing pattern

But does this or the base pattern show up somewhere else (test-suite/clang/chromium) besides your benchmark?

It's not "your benchmark"; it is real code. I'm not pulling it out of thin air.

I don't think the intent was to disrespect whatever code this pattern showed up in,

Sorry, yes, I think I agree.

but to ask if there are any stats for common clang benchmarks. That's the only way we can draw a line on these kinds of decisions. Every pattern is presumably showing up somewhere real for someone, so we can't distinguish where it belongs just based on existence.

This doesn't show up in the test-suite with no externals, though we have previously
established that the plain test-suite has poor coverage for the code patterns in question.
To be noted, this 'benchmark' is proposed for addition as per rL345166, so it isn't entirely outlandish to consider it.

To be noted, said library is, to my knowledge, one of only two in its class (camera raw image decoding,
counting only open-source libs), with the other one using this one, so it's moderately widely used in the wild.
Not sure if that counts...

My opinion (and others have disagreed with this) is that we always want the fold somewhere;
it's just a question of where does it go to minimize cost/maximize benefit.

Ignoring the -O3 issue, I'm not really confident AggressiveInstCombine is a great fit for these problems.
The thing is, these folds are not one-off "fold - yay, no fold - shrug" situations.
There isn't much interest in this fold happening by itself;
the really important part is that it will pave the road for other folds to happen.
In particular, this fold will allow e.g. rL373964 to happen.
The situation is likely similar for all the other "aggressive" folds I've added.

Ignoring the -O3 issue, I'm not really confident AggressiveInstCombine is a great fit for these problems.
The thing is, these folds are not one-off "fold - yay, no fold - shrug" situations.
There isn't much interest in this fold happening by itself;
the really important part is that it will pave the road for other folds to happen.
In particular, this fold will allow e.g. rL373964 to happen.
The situation is likely similar for all the other "aggressive" folds I've added.

Yes, I understand this patch alone doesn't look like a big change. (We'd move the whole reassociateShiftAmtsOfTwoSameDirectionShifts() if we decide to move anything?)

But it should be ok. We can assume that regular and aggressive will work together, and we can make sure that doesn't break by adding a test within PhaseOrdering. Obviously, it's more work to do it that way, so that's probably why we haven't done it often.

An alternative could be to guard this code (and possibly many more) with:

if (ExpensiveCombines)

?

Ignoring the -O3 issue, I'm not really confident AggressiveInstCombine is a great fit for these problems.
The thing is, these folds are not one-off "fold - yay, no fold - shrug" situations.
There isn't much interest in this fold happening by itself;
the really important part is that it will pave the road for other folds to happen.
In particular, this fold will allow e.g. rL373964 to happen.
The situation is likely similar for all the other "aggressive" folds I've added.

Yes, I understand this patch alone doesn't look like a big change.

To recap, the main problem is not that they use InstSimplify,
but that SimplifyAssociativeBinOp() uses recursive reasoning, correct?

In this case, before we ever call SimplifyAddInst(), we need to match 3 instructions - the != 0 plus 2 right shifts.
reassociateShiftAmtsOfTwoSameDirectionShifts(), for example, requires just 2 instructions;
foldShiftIntoShiftInAnotherHandOfAndInICmp() - 4 instructions;
dropRedundantMaskingOfLeftShiftInput() - 3/4 instructions.

So *this* case is certainly not the worst;
I'm honestly not sure what has re-triggered this discussion yet again...

(We'd move the whole reassociateShiftAmtsOfTwoSameDirectionShifts() if we decide to move anything?)
But it should be ok. We can assume that regular and aggressive will work together,
and we can make sure that doesn't break by adding a test within PhaseOrdering.
Obviously, it's more work to do it that way, so that's probably why we haven't done it often.

To be perfectly honest, I don't really see / look forward to that happening,
because I don't really believe a single pass of AIC followed by IC will be enough to still handle everything.
I suspect it would need to be IC-AIC-IC-AIC-IC, and right now AIC only runs once.

An alternative could be to guard this code (and possibly many more) with:

if (ExpensiveCombines)

?

Hm, opt -instcombine does result in ExpensiveCombines=1; opt/clang -O0/-O1/-O2 doesn't; opt/clang -O3 does.
So InstCombine's ExpensiveCombines is basically in-pass AIC (i.e. they run at the same opt levels),
just without all the issues that arise from not running combines to completion.
I suppose this could work, but I *really* don't like the -O3 limitation (it should be -O2 in these cases,
but not for the sole existing transform guarded by ExpensiveCombines).

All that being said, if it's really enough of a concern to warrant immediate action,
can't we simply expose "Depth" in llvm::SimplifyAddInst()
and call it with a depth that either disallows recursive reasoning
or severely limits it? Some fast paths would likely need to be added then.
That may be the best outcome considering the options...

spatel accepted this revision. Oct 19 2019, 1:44 PM

Ignoring the -O3 issue, I'm not really confident AggressiveInstCombine is a great fit for these problems.
The thing is, these folds are not one-off "fold - yay, no fold - shrug" situations.
There isn't much interest in this fold happening by itself;
the really important part is that it will pave the road for other folds to happen.
In particular, this fold will allow e.g. rL373964 to happen.
The situation is likely similar for all the other "aggressive" folds I've added.

Yes, I understand this patch alone doesn't look like a big change.

To recap, the main problem is not that they use InstSimplify,
but that SimplifyAssociativeBinOp() uses recursive reasoning, correct?

Any time we call into InstSimplify, we may be incurring expensive analysis either by recursion directly or via ValueTracking, so that's part of it. The other part is that we add a few lines in every patch, and then end up with 100+ lines of code to deal with some rare patterns.

In this case, before we ever call SimplifyAddInst(), we need to match 3 instructions - the != 0 plus 2 right shifts.
reassociateShiftAmtsOfTwoSameDirectionShifts(), for example, requires just 2 instructions;
foldShiftIntoShiftInAnotherHandOfAndInICmp() - 4 instructions;
dropRedundantMaskingOfLeftShiftInput() - 3/4 instructions.

So *this* case is certainly not the worst;
I'm honestly not sure what has re-triggered this discussion yet again...

It's the accumulating effect, and I agree it's out-of-scope to beat up on this patch alone, so I won't hold it up. But we should take a step back, do some measurements, and decide how things can be split up.

(We'd move the whole reassociateShiftAmtsOfTwoSameDirectionShifts() if we decide to move anything?)
But it should be ok. We can assume that regular and aggressive will work together,
and we can make sure that doesn't break by adding a test within PhaseOrdering.
Obviously, it's more work to do it that way, so that's probably why we haven't done it often.

To be perfectly honest, I don't really see / look forward to that happening,
because I don't really believe a single pass of AIC followed by IC will be enough to still handle everything.
I suspect it would need to be IC-AIC-IC-AIC-IC, and right now AIC only runs once.

Hmm...I don't remember it being that way, but you're right. I thought it was running interleaved with instcombine at some point.

After all that, LGTM. :)

This revision is now accepted and ready to land. Oct 19 2019, 1:44 PM

Ignoring the -O3 issue, I'm not really confident AggressiveInstCombine is a great fit for these problems.
The thing is, these folds are not one-off "fold - yay, no fold - shrug" situations.
There isn't much interest in this fold happening by itself;
the really important part is that it will pave the road for other folds to happen.
In particular, this fold will allow e.g. rL373964 to happen.
The situation is likely similar for all the other "aggressive" folds I've added.

Yes, I understand this patch alone doesn't look like a big change.

To recap, the main problem is not that they use InstSimplify,
but that SimplifyAssociativeBinOp() uses recursive reasoning, correct?

Any time we call into InstSimplify, we may be incurring expensive analysis either by recursion directly or via ValueTracking, so that's part of it. The other part is that we add a few lines in every patch, and then end up with 100+ lines of code to deal with some rare patterns.

Right.

In this case, before we ever call SimplifyAddInst(), we need to match 3 instructions - the != 0 plus 2 right shifts.
reassociateShiftAmtsOfTwoSameDirectionShifts(), for example, requires just 2 instructions;
foldShiftIntoShiftInAnotherHandOfAndInICmp() - 4 instructions;
dropRedundantMaskingOfLeftShiftInput() - 3/4 instructions.

So *this* case is certainly not the worst;
I'm honestly not sure what has re-triggered this discussion yet again...

It's the accumulating effect

Certainly true, okay.

, and I agree it's out-of-scope to beat up on this patch alone, so I won't hold it up. But we should take a step back, do some measurements, and decide how things can be split up.

Any thoughts on the approach I suggested:

All that being said, if it's really enough of a concern to warrant immediate action,
can't we simply expose "Depth" in llvm::SimplifyAddInst()
and call it with a depth that either disallows recursive reasoning
or severely limits it? Some fast paths would likely need to be added then.
That may be the best outcome considering the options...

?

(We'd move the whole reassociateShiftAmtsOfTwoSameDirectionShifts() if we decide to move anything?)
But it should be ok. We can assume that regular and aggressive will work together,
and we can make sure that doesn't break by adding a test within PhaseOrdering.
Obviously, it's more work to do it that way, so that's probably why we haven't done it often.

To be perfectly honest, I don't really see / look forward to that happening,
because I don't really believe a single pass of AIC followed by IC will be enough to still handle everything.
I suspect it would need to be IC-AIC-IC-AIC-IC, and right now AIC only runs once.

Hmm...I don't remember it being that way, but you're right. I thought it was running interleaved with instcombine at some point.

Ah, I see why it was suggested then.
Good; I believe we have reached an understanding of why such a move might be troubling.

After all that, LGTM. :)

This revision was automatically updated to reflect the committed changes.

@spatel

, and I agree it's out-of-scope to beat up on this patch alone, so I won't hold it up. But we should take a step back, do some measurements, and decide how things can be split up.

Any thoughts on the approach I suggested:

All that being said, if it's really enough of a concern to warrant immediate action,
can't we simply expose "Depth" in llvm::SimplifyAddInst()
and call it with a depth that either disallows recursive reasoning
or severely limits it? Some fast paths would likely need to be added then.
That may be the best outcome considering the options...

?

Up

@spatel

, and I agree it's out-of-scope to beat up on this patch alone, so I won't hold it up. But we should take a step back, do some measurements, and decide how things can be split up.

Any thoughts on the approach I suggested:

All that being said, if it's really enough of a concern to warrant immediate action,
can't we simply expose "Depth" in llvm::SimplifyAddInst()
and call it with a depth that either disallows recursive reasoning
or severely limits it? Some fast paths would likely need to be added then.
That may be the best outcome considering the options...

?

Up

Still catching up on the mail backlog...
Exposing/varying the depth param doesn't seem like the right way to approach this to me. My fear is that we'd end up with semi-random uses of caller depths that wouldn't age well: we might handle all of the cases in *this* patch with depth=1, but then someone notices their code needs depth=2 to get the optimization and changes it. The depth param is just a hack based on a magic number, so it feels wrong to expand on that hack. I don't think we can distinguish compile-time cost at that level anyway (but it would be great to have data). Apart from all of that, messing with the recursion depth doesn't address the underlying problem: the code is just too big/complicated.