
[InstCombine] Introducing Aggressive Instruction Combine pass
ClosedPublic

Authored by aaboud on Sep 27 2017, 5:19 AM.

Details

Summary

This new approach is a replacement for D37195.

Motivation:
The InstCombine algorithm runs to a fixed point (it keeps iterating until the code stops changing), so each transformation inside the outer loop must have very small complexity (preferably O(1)).

Problem
There are instruction patterns that cost more than O(1) to identify and transform, so they cannot be added to the InstCombine algorithm.

Solution
An AggressiveInstCombine pass that runs separately from the InstCombine pass and can perform expression-pattern optimizations, each of which may take up to O(n) time, where n is the number of instructions in the function.

This patch introduces:

  1. The AggressiveInstCombiner class, the main pass that runs all expression pattern optimizations.
  2. The TruncInstCombine class, the first expression pattern optimization added.

TruncInstCombine currently supports only simple instructions: add, sub, and, or, xor, and mul.
The main differences between this and InstCombine's canEvaluateTruncated are that this one supports:

  1. Instructions with multiple uses, as long as all of them are dominated by the truncated instruction.
  2. Truncating to a different width than the original trunc instruction requires (this is useful when we can reduce the expression width even without eliminating any instruction).
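To make the intent concrete, the width-reduction reasoning can be modeled numerically (a minimal Python sketch with illustrative values; the actual pass operates on LLVM IR):

```python
# Sketch: why trunc(op(x, y)) can be rewritten as op(trunc(x), trunc(y))
# for wrapping ops like add/mul/and. Values and widths are illustrative,
# not taken from the patch.

def trunc(v, bits):
    """Model of an LLVM trunc: keep only the low 'bits' bits."""
    return v & ((1 << bits) - 1)

x, y = 0xDEADBEEF12345678, 0x0F0F0F0F0F0F0F0F  # two i64 values

# Wide evaluation followed by a single trunc to i16 ...
wide = trunc((x * y + x) & ((1 << 64) - 1), 16)

# ... equals narrow evaluation where every operand is truncated first.
narrow = trunc(trunc(x, 16) * trunc(y, 16) + trunc(x, 16), 16)

assert wide == narrow
```

This modular-arithmetic identity is what justifies shrinking the whole expression tree for the simple wrapping instructions; shifts and division need extra care, which is why they are deferred to follow-up patches.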

Next, I will add the following support to the TruncInstCombine class, each item in a separate patch:

  1. select, shufflevector, extractelement, insertelement
  2. udiv, urem
  3. shl, lshr, ashr
  4. phi node (and loop handling)

Diff Detail

Event Timeline

zvi added inline comments.Nov 2 2017, 7:07 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
54 ↗(On Diff #121149)

IIRC, in one of the previous revisions of this patch there was a comment explaining why these cast instructions are skipped? Can you please revive it or add a new one?

64 ↗(On Diff #121149)

Could we avoid pushing constants, and maybe even generalize to avoid values that are not instructions? At least for these cases it may be OK, but not for others, such as divide, where we need to be more cautious.

202 ↗(On Diff #121149)

At this point a mapping for IOp must exist in InstInfoMap, right? If so, please use lookup() or find(). Also, better to avoid searching for the same key more than once.

209 ↗(On Diff #121149)

lookup()

227 ↗(On Diff #121149)

IMO this would be more readable and guarantee that code changes won't lead to uses of stale values of MinBitWidth:

MinBitWidth = OrigBitwidth;

and drop the return

352 ↗(On Diff #121149)

I may have missed this, but why defer insertion of instruction to the BB to this point? And consider using the IRBuilder instead of calling ::Create* above and below.

aaboud marked 15 inline comments as done.Nov 2 2017, 8:03 AM

Thanks Zvi for the comments.
I will upload a new patch with most of the comments fixed.
See few answers below.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
54 ↗(On Diff #121149)

Not sure if there is any importance to the order.
I just did what InstructionCombiningPass does!

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
40 ↗(On Diff #121149)

This is what InstCombine preserves.

44 ↗(On Diff #121149)

llvm::ReplaceInstWithInst also deletes the old instructions, which we are not ready to delete at this point.

64 ↗(On Diff #121149)

I prefer not to complicate this function: it should return the operands that are relevant to the optimization, and the caller should check whether each relevant operand is a constant or an instruction. There will be cases where even a constant needs to be evaluated and cannot be skipped immediately.
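The point about constants can be sketched numerically (illustrative widths and values, not from the patch): when the expression width is reduced, a constant operand has to be re-emitted at the new width rather than skipped.

```python
# Sketch: constant operands must be rewritten, not skipped, when the
# expression width is reduced. Widths and values are illustrative.

def shrink_const(value, new_bits):
    """Re-emit a constant at the reduced width (like truncating a ConstantInt)."""
    return value & ((1 << new_bits) - 1)

# trunc (and i64 %x, 0xFF00FF) to i16  ->  and i16 %x16, 0x00FF
assert shrink_const(0xFF00FF, 16) == 0x00FF

x = 0x123456789ABCDEF0
wide = (x & 0xFF00FF) & 0xFFFF                    # wide op, then trunc
narrow = (x & 0xFFFF) & shrink_const(0xFF00FF, 16)  # trunc operands first
assert wide == narrow
```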

aaboud updated this revision to Diff 121318.Nov 2 2017, 8:16 AM

Addressed Zvi's Comments.

aaboud updated this revision to Diff 121320.Nov 2 2017, 8:30 AM

Minor fix, forgot to use IRBuilder in one case in the previous patch.

zvi added a comment.Nov 2 2017, 8:32 AM

Thanks, Amjad! This patch LGTM, but I think it would be best to wait for an LGTM from one of the assigned reviewers.

aaboud updated this revision to Diff 121332.Nov 2 2017, 9:30 AM

Minor typo update.

This is the latest version, in which I have addressed all the previous comments.
Please let me know if you have any final comments, or whether I can go ahead and commit the patch.

Thanks

zvi commandeered this revision.Nov 9 2017, 12:44 AM
zvi added a reviewer: aaboud.

Commandeering this patch while Amjad is away on a few weeks of vacation.
Also sending a friendly ping to the reviewers. AFAIK, all comments were addressed as of the latest revision of this patch. Please let me know if I missed anything. Thanks.

craig.topper added inline comments.Nov 14 2017, 8:59 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
62 ↗(On Diff #121332)

What about the new pass manager?

lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h
89 ↗(On Diff #121332)

This sentence reads funny

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
80 ↗(On Diff #121332)

Can you consistently use auto with dyn_cast throughout this patch?

175 ↗(On Diff #121332)

LLVM style prefers "auto *IOp". We don't want auto to hide the fact that it's a pointer. Please scrub the whole patch for this.

206 ↗(On Diff #121332)

In the case of vectors, is this using legal scalar integer types to constrain the vector element type? I'm not sure if that's the right behavior. The legal scalar types don't necessarily imply anything about vector types.

I think we generally try to avoid creating vector types that didn't appear in the IR.

210 ↗(On Diff #121332)

Isn't this MinBitWidth == TruncBitWidth?

lib/Transforms/IPO/PassManagerBuilder.cpp
750 ↗(On Diff #121332)

What about the new pass builder for the new pass manager?

zvi added inline comments.Nov 16 2017, 4:45 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
206 ↗(On Diff #121332)

That's a good point. Will look into this.

210 ↗(On Diff #121332)

Your observation is correct, but the comment is also correct, and it explains something that may not be obvious.
At this point we have finished visiting the expression tree starting from the truncate's operand, and found MinBitWidth by taking the max of all min-bit-width requirements of the predecessors. Since this is a truncate instruction, by construction of the expression tree the computed MinBitWidth can never be less than the truncate's return type's size in bits; in other words, MinBitWidth >= TruncBitWidth.
The case of MinBitWidth > TruncBitWidth is handled in the then-block just above, and what remains is to handle MinBitWidth == TruncBitWidth in the else-block here.
I can put this in a comment if you think it would be helpful.
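The bottom-up computation described above can be sketched roughly as follows (the node layout here is hypothetical, not TruncInstCombine's actual InstInfoMap):

```python
# Sketch of the bottom-up MinBitWidth computation described above:
# take the max of all min-bit-width requirements of the predecessors.
# The node representation is hypothetical, for illustration only.

def min_bit_width(node):
    kind = node["kind"]
    if kind == "leaf":
        # e.g. a function argument known to need only 'min_bits' bits
        return node["min_bits"]
    if kind == "trunc":
        # An inner trunc requires at least its destination width.
        return node["dst_bits"]
    # Simple wrapping ops (add/sub/mul/and/or/xor): max over all operands.
    return max(min_bit_width(op) for op in node["ops"])

# Model of: add (zext i8 %a to i64), (zext i8 %b to i64)
expr = {"kind": "add", "ops": [
    {"kind": "leaf", "min_bits": 8},
    {"kind": "leaf", "min_bits": 8},
]}
assert min_bit_width(expr) == 8  # the whole expression fits in 8 bits
```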

zvi updated this revision to Diff 123159.Nov 16 2017, 4:47 AM
zvi edited the summary of this revision. (Show Details)

Address some of Craig's recent comments.

zvi marked 3 inline comments as done.Nov 16 2017, 4:49 AM
craig.topper added inline comments.Nov 16 2017, 10:50 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
210 ↗(On Diff #121332)

I was just saying that it should say TruncBitWidth not ValidBitWidth right?

zvi added inline comments.Nov 19 2017, 12:48 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
210 ↗(On Diff #121332)

Ah, right :)

zvi updated this revision to Diff 123509.Nov 19 2017, 12:15 PM

Address the last of Craig's comments:

  • Thanks, @lsaba, for porting the pass to the new PassManager.
  • Removed shrinkage of vector types until we sort out if it is generally allowed to shrink element types of vector operations.
  • Some minor fixes to comments.
zvi marked 10 inline comments as done.Nov 19 2017, 12:19 PM
zvi updated this revision to Diff 123510.Nov 19 2017, 12:23 PM

Rebase on ToT. NFC in this revision.

escha added a subscriber: escha.Nov 27 2017, 11:40 AM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.
  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

davide added a subscriber: davide.Nov 27 2017, 1:45 PM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

zvi updated this revision to Diff 124519.Nov 27 2017, 11:34 PM

Add missing AggressiveInstCombine.h and fix missing 'opt' dependency. Thanks, @lsaba, for noticing.

zvi added a comment.Nov 27 2017, 11:59 PM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.

Sometimes it's LLVM, and sometimes it's the frontend that is required to extend small typed values before performing operations.

  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

zvi added a comment.Nov 28 2017, 12:03 AM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

In D38313#937267, @zvi wrote:

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

There *are* TLI hooks; they're just not as widely used in the DAG as they could be.

I'm warning you with regard to this patch because the DAG may inadvertently undo a lot of the optimizations you're doing here. This isn't an objection, just something that might be worth looking at later given past experiences in trying to do similar optimizations.

zvi added a comment.Nov 29 2017, 11:10 PM
In D38313#937269, @zvi wrote:

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

Measured CTMark and internal tests and was not able to observe significant compile time changes with -flto. Below are the results for CTMark:

Workload          | ToT: stdev/avg of 10 runs [%] | This patch: stdev/avg of 10 runs [%] | Compile-time speedup over ToT (higher is better for this patch)
7zip              | 0.19% | 0.19% | 0.999
Bullet            | 0.30% | 0.37% | 0.998
ClamAV            | 0.39% | 0.19% | 1.000
SPASS             | 0.52% | 0.33% | 1.000
consumer-typeset  | 0.27% | 0.36% | 0.999
kimwitu++         | 0.45% | 0.49% | 0.998
lencod            | 0.20% | 0.51% | 1.001
mafft             | 0.63% | 0.29% | 1.006
sqlite3           | 0.70% | 0.82% | 1.002
tramp3d-v4        | 1.23% | 1.78% | 0.990
craig.topper added inline comments.Dec 12 2017, 10:24 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

I'm not sure you've addressed my vector concerns. The first part of this 'if' would create a new type for vectors by using getSmallestLegalIntType.

217 ↗(On Diff #124519)

I think the !Vector check that was here previously was correct. We don't do isLegalInteger checks on the scalar types of vectors. For vectors we assume that if the type was present in the IR, the transform is fine. In this block, TruncBitWidth == MinBitWidth, so the type existed in the original IR. My vector concerns were about the block above, where we create a new type.

Thanks Zvi for addressing all comments and questions while I am away.
Craig, please, see answers for your questions inlined below.

Thanks,
Amjad

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

I think that in the "else" part, where I kept the same behavior as the original InstCombine code, we might also end up creating a vector type that was not in the IR, because for vector types we do not check for type legality.
So why should we behave differently in this case?

Regarding the scalar check, it might be redundant, but not always: even if the "trunc" instruction is performed on a vector type, the evaluated expression might contain scalar operations (due to the "insertelement" instruction, which will be supported in the next few patches).

Furthermore, my assumption is that the codegen legalizer will promote the illegal vector type back to the original type (or to a smaller one); in both cases we will not get worse code than what we started with!
Is that assumption too optimistic?

217 ↗(On Diff #124519)

I agree with Craig, need to change back to:

if (!DstTy->isVectorTy() && FromLegal && !ToLegal)
aaboud added inline comments.Dec 19 2017, 4:47 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

Just to emphasize I am adding two examples.

Case 1 (MinBitWidth == TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = mul <2 x i64> %A1, %A1
%C1 = extractelement <2 x i64> %B1, i32 0
%D1 = extractelement <2 x i64> %B1, i32 1
%E1 = add i64 %C1, %D1
%T = trunc i64 %E1 to i16
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = mul <2 x i16> %A2, %A2
%C2 = extractelement <2 x i16> %B2, i32 0
%D2 = extractelement <2 x i16> %B2, i32 1
%T = add i16 %C2, %D2

Case 2 (MinBitWidth > TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = lshr <2 x i64> %A1, <i64 8, i64 8>
%C1 = mul <2 x i64> %B1, %B1
%T = trunc <2 x i64> %C1 to <2 x i8>
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = lshr <2 x i16> %A2, <i16 8, i16 8>
%C2 = mul <2 x i16> %B2, %B2
%T = trunc <2 x i16> %C2 to <2 x i8>

Notice that in both cases the "new" vector type (<2 x i16>) in the transformed IR did not exist in the original IR.

Don't you think we should perform these transformations and reduce the expression type width?
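As a numeric sanity check on Case 1 above, the narrow and wide evaluations agree modulo 2^16 (a Python model of the IR, with the widths from the example; this says nothing about codegen quality, only value correctness):

```python
# Numeric model of Case 1: trunc-to-i16 of the wide evaluation equals
# the evaluation done entirely in i16. Inputs model the <2 x i32> %X.
M64, M16 = (1 << 64) - 1, (1 << 16) - 1

def case1_wide(x0, x1):
    # zext i32->i64, mul, extractelement x2, add, trunc to i16
    c = (x0 * x0) & M64
    d = (x1 * x1) & M64
    return ((c + d) & M64) & M16

def case1_narrow(x0, x1):
    # trunc i32->i16 first, then mul/add in i16
    a0, a1 = x0 & M16, x1 & M16
    c = (a0 * a0) & M16
    d = (a1 * a1) & M16
    return (c + d) & M16

for x0, x1 in [(0, 0), (0xFFFFFFFF, 1), (0x12345678, 0x9ABCDEF0)]:
    assert case1_wide(x0, x1) == case1_narrow(x0, x1)
```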

Taking your first example and increasing the element count to get legal types

define i16 @foo(<8 x i32> %X) {
  %A1 = zext <8 x i32> %X to <8 x i64>
  %B1 = mul <8 x i64> %A1, %A1
  %C1 = extractelement <8 x i64> %B1, i32 0
  %D1 = extractelement <8 x i64> %B1, i32 1
  %E1 = add i64 %C1, %D1
  %T = trunc i64 %E1 to i16
  ret i16 %T
}

define i16 @bar(<8 x i32> %X) {
  %A2 = trunc <8 x i32> %X to <8 x i16>
  %B2 = mul <8 x i16> %A2, %A2
  %C2 = extractelement <8 x i16> %B2, i32 0
  %D2 = extractelement <8 x i16> %B2, i32 1
  %T = add i16 %C2, %D2
  ret i16 %T
}

Then running that through llc with avx2. I get worse code for bar than foo. Vector truncates on x86 aren't good. There is no truncate instruction until avx512 and even then its 2 uops.

I can "fix" that by ignoring cases where a zext/sext would turn into a truncate for vector types. The check needed is:
"For each zext/sext instruction with vector type that has a single use, its source type's size in bits must not be less than the chosen MinBitWidth."

This will prevent creating a truncate, which was not in the IR before, on new vector types (or on any vector type).
However, we will still have zext/sext to a new vector type that was not in the IR before.

Does that solve the problem?

P.S. If you still insist on preventing this pass from creating new vector types, the solution is:

  1. Do not support extractelement/insertelement.
  2. Do not accept expressions with vector type truncate instruction, where the MinBitWidth > TruncBitWidth.

@spatel, what do you think about vector types here?

@spatel, what do you think about vector types here?

I’m not at a dev machine, so I can’t try any experiments. But we’ve had something like this question come up in one of my vector cmp + select patches. Ideally, we’d always shrink those as we do with scalars, but as noted, we may not have good backend support to undo the transform. Given that it’s not a clear win, I think it’s best to limit the vector transforms in this initial patch. Then, we can enable those in a follow-up patch if there are known wins and deal with any regressions without risking the main (?) scalar motivating cases.

aaboud commandeered this revision.Dec 21 2017, 2:02 PM
aaboud edited reviewers, added: zvi; removed: aaboud.

Thanks to Zvi for helping me make progress with this review while I was on vacation.
I will continue as the author from here.

aaboud updated this revision to Diff 127940.Dec 21 2017, 2:05 PM

Addressed Craig and Sanjay comments:

  1. Restored the support for vector types.
  2. Made sure that this transformation will not create a new vector type. This is achieved by reducing vector-typed expressions only when MinBitWidth == TruncBitWidth.
zvi added inline comments.Dec 24 2017, 6:08 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
1 ↗(On Diff #127940)

Should there be negative tests for the vector cases that are not permitted to transform?

aaboud added inline comments.Dec 26 2017, 12:24 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
1 ↗(On Diff #127940)

There should be.
However, such tests need instructions such as lshr, ashr, udiv or urem, i.e., instructions that increase the MinBitWidth that we can truncate the expression to.
So I will add such tests in the following patches, once I add support for these instructions.

@mzolotukhin may want to comment on this one before it goes in, as he's spending a large part of his time on compile-time work. Please wait for his opinion.

aaboud added a comment.Jan 6 2018, 3:39 AM

Hi,
I uploaded a new version about a week ago with the required change for not generating new vector type.
Please, let me know if you have any other comments.

Thanks,
Amjad

I think the patch looks good now with the vector fix. Did you hear anything from @mzolotukhin about compile time?

Hi, and sorry for the late reply; I've just returned from the holiday break.
The numbers posted before look good. I wonder, though, whether it would make sense to run this pass only at -O3. I assume that even if the pass spends very little time now, it will grow in the future, and the compile-time costs might become noticeable.

Michael


Thanks Michael for the feedback.
As you said, the pass spends very little time; can't we decide on moving it to -O3 in the future, when/if other heavy optimizations are added to this pass?
And even then, we could decide to run some of them at -O2 and the rest at -O3.

Would that work for you?

Thanks,
Amjad

As you said, the pass spends very little time; can't we decide on moving it to -O3 in the future, when/if other heavy optimizations are added to this pass?
And even then, we could decide to run some of them at -O2 and the rest at -O3.

My main concern with that is that it's actually really hard to demote something to lower optlevels retroactively.

For instance, right now it would make sense to move some parts of existing InstCombine out of O0. In practice it's a very time consuming task to do so (to find the right pieces, to do all the measurements, to agree in the community on the acceptable regressions etc.). And there is usually no single big heavy part that we can just move out and solve all the compile time issues - we have many small pieces that just sum up to something big.

I expect that to be the case with this pass as well - people will add more stuff, but each individual piece would contribute only a little, so it'll be hard to say "yep, this one goes to -O3, others can stay at O0/O2". So I think it's worth moving the whole pass to -O3 now rather than in the future (and how bad is it for other optlevels not to have it? is it really critical?).

Michael

aaboud updated this revision to Diff 131053.Jan 23 2018, 6:10 AM

Moved the aggressive-inst-combine pass to run with -O3.
I prefer to make this change now in order to get approval to commit the pass; in the future, once the pass is complete, we can argue for enabling it with -O2 in a separate discussion.

Please, let me know if you have any more concerns regarding this patch.

Thanks.

No concerns from my side, thanks for making the change!

Michael

If there are no more concerns, can I get approval?
@craig.topper

This revision is now accepted and ready to land.Jan 24 2018, 2:18 AM
This revision was automatically updated to reflect the committed changes.