This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Introducing Aggressive Instruction Combine pass
ClosedPublic

Authored by aaboud on Sep 27 2017, 5:19 AM.

Details

Summary

This new approach is a replacement for D37195.

Motivation:
InstCombine algorithm runs with high complexity (as it keeps running till there is no change to the code), thus it requires that each piece inside the outer loop, have a very small complexity (preferable O(1)).

Problem
There are instruction patterns that costs more than O(1) to identify and modify, thus it cannot be added to the InstCombine algorithm.

Solution
AggressiveInstCombine pass, which will run separately from InstCombine pass, and can perform expression pattern optimizations, which might take each up to O(n) complexity, where n is number of instructions in the function.

I introduced in this patch:

  1. The AggressiveInstCombiner class is the main pass that will run all expression pattern optimizations.
  2. The TruncInstCombine class, first added expression pattern optimization.

TruncInstCombine currently supports only simple instructions such as: add, sub, and, or, xor and mul.
The main difference between this and the canEvaluateTruncated from InstCombine is that this one supports:

  1. Instructions with multi-use, as long as all are dominated by the truncated instruction.
  2. Truncating to different width than the original trunc instruction requires. (this is useful if we reduce the expression width, even if we are not eliminating any instruction).

Next I will add to this TruncInstCombine class the following support, each in a separate patch:

  1. select, shufflevector, extractelement, insertelement
  2. udiv, urem
  3. shl, lshr, ashr
  4. phi node (and loop handling)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
aaboud updated this revision to Diff 116798.Sep 27 2017, 5:21 AM

Fixed "no new line at end of file" issue.

nlopes added a subscriber: nlopes.Sep 28 2017, 6:45 AM
craig.topper edited edge metadata.Oct 6 2017, 4:13 PM

Do you have tests for "Truncating to different width than the original trunc instruction requires. (this is useful if we reduce the expression width, even if we are not eliminating any instruction)."

lib/Transforms/InstCombine/InstCombinePatterns.cpp
215 ↗(On Diff #116798)

smaller*

223 ↗(On Diff #116798)

infinit*

252 ↗(On Diff #116798)

destination*

255 ↗(On Diff #116798)

Does this call isLegalInteger using the scalar bit width for vectors? Not sure that's valid.

259 ↗(On Diff #116798)

destination*

397 ↗(On Diff #116798)

dominates*

"we" looks unnecessarry

Hi Amjad,

I didn't do a thorough drill down of all the changes (I'll leave that to Craig and Sanjay), but skimmed through it. The design seems reasonable, as long as others are ok with the instcombine plugin.
I can't tell all the effects this will have, but take care that the zext->(operations...)->trunc optimization doesn't remove a zext on a load that feed into an operation such that we turn a "movz[bw] mem -> reg" into "mov[bw] mem -> [bw]reg", since this can introduce extra merge/false dependence hits on most newer Intel architectures.

Zia.

lib/Transforms/InstCombine/InstCombinePatterns.cpp
146 ↗(On Diff #116798)

Are you planning on doing these peepholes here? If not, is there any other reason for leaving these cases in this switch?

180 ↗(On Diff #116798)

Same as above. Any reason for not breaking for trunc/zext/sext?

aaboud marked 6 inline comments as done.Oct 15 2017, 5:55 AM

Thanks Craig and Zia for the review and sorry for the late answer.
Please, see answers below.

Do you have tests for "Truncating to different width than the original trunc instruction requires. (this is useful if we reduce the expression width, even if we are not eliminating any instruction)."

I do have tests, but this is not possible yet with this patch, we need the shift/udiv/urem instructions to get this case. I will introduce that in next patches.
In this patch, I just implemented the infrastructure for this optimization.

Hi Amjad,

I didn't do a thorough drill down of all the changes (I'll leave that to Craig and Sanjay), but skimmed through it. The design seems reasonable, as long as others are ok with the instcombine plugin.
I can't tell all the effects this will have, but take care that the zext->(operations...)->trunc optimization doesn't remove a zext on a load that feed into an operation such that we turn a "movz[bw] mem -> reg" into "mov[bw] mem -> [bw]reg", since this can introduce extra merge/false dependence hits on most newer Intel architectures.

Zia.

I believe that we have passes in the codegen to make sure we will not suffer from these false dependencies, right?
I do not believe that instcombine should look that far to recognize such cases.

lib/Transforms/InstCombine/InstCombinePatterns.cpp
146 ↗(On Diff #116798)

I do need to handle these cases here as I am calling "getRelevantOperands()" function from "getMinBitWidth()" function for all supported cases (without a switch).

180 ↗(On Diff #116798)

I can add the break here, I agree that it makes code faster.

255 ↗(On Diff #116798)

Can you explain, what you mean by "that's not valid"?
Are you referring to the algorithm?

Are you concerned of cases where a scalar type is legal (e.g. i32), while a vector type is not (e.g. <8xi32>), or the opposite direction?

I believe that my algorithm should not mind about the number of elements in the type, only about the width.
The reason for this check, is to keep the old case, where we will not reduce a legal to non-legal type.

Do you think that we should do a more accurate check?

aaboud updated this revision to Diff 119079.Oct 15 2017, 5:55 AM
aaboud marked an inline comment as done.

Addressed Craig and Zia comments.

craig.topper added inline comments.Oct 16 2017, 4:39 PM
lib/Transforms/InstCombine/InstCombinePatterns.cpp
255 ↗(On Diff #116798)

I guess my point was just that the legality of the scalar type is unrelated to the legality of the vector type. In x86 for example. i64 isn't legal in 32-bit mode, but v2i64 is. If I remember right the existing code doesn't even call isLegalInteger for vector types? I'd check but someone is hammering the server I normally use.

aaboud added inline comments.Oct 18 2017, 8:23 AM
lib/Transforms/InstCombine/InstCombinePatterns.cpp
255 ↗(On Diff #116798)

I agree that I have a logical bug here.
I was trying to keep the same semantic of original code:

if ((DestTy->isVectorTy() || shouldChangeType(SrcTy, DestTy)) &&
      canEvaluateTruncated(Src, DestTy, *this, &CI)) {

I guess, I need to ignore vector types. (as in above code) and fix the logic to this:

if (MinBitWidth > ValidBitWidth) {
  Type *Ty = DL.getSmallestLegalIntType(DstTy->getContext(), MinBitWidth);
  if (!Ty)
    return nullptr;
  // update minimum bit-width with the new destination type bit-width.
  MinBitWidth = Ty->getScalarSizeInBits();
} else { // MinBitWidth == ValidBitWidth
  if (!DestTy->isVectorTy() && !shouldChangeType(SrcTy, DestTy))
    return nullptr;
}
Type *NewDstSclTy = IntegerType::get(DstTy->getContext(), MinBitWidth);
spatel edited edge metadata.

If I'm seeing this correctly, it's an independent pass within InstCombine. It sits outside InstCombine's iteration loop, so it doesn't interact with the rest of the pass. What is the advantage of this approach vs. making a standalone pass?

If I'm seeing this correctly, it's an independent pass within InstCombine. It sits outside InstCombine's iteration loop, so it doesn't interact with the rest of the pass. What is the advantage of this approach vs. making a standalone pass?

Hi Sanjay,
This code, which I am suggested below, intends to replace (and improve) the "canEvaluateTruncated" from the current instCombine implementation.
The logic of this code is similar to what instcombine does, but different in the way of implementation.
Saying that, I believe that running all current instcombine tests with this new functionality is a must, in order to make that possible, it is obvious that we need to be part of instcombine pass.

Note that the implementation is done in a way that moving it to a separate pass can be done with zero effort, but in a cost of ignoring/dropping few hundreds of LIT tests.

Do you think still that it should be in a separate pass?

aaboud added a subscriber: zvi.Oct 25 2017, 2:25 AM
aaboud updated this revision to Diff 120219.Oct 25 2017, 2:36 AM

Updated patch according to Craig comment. (Fixed minor logical bug)

Saying that, I believe that running all current instcombine tests with this new functionality is a must, in order to make that possible, it is obvious that we need to be part of instcombine pass.

Note that the implementation is done in a way that moving it to a separate pass can be done with zero effort, but in a cost of ignoring/dropping few hundreds of LIT tests.

I agree that testing must be done, but I don't see how that makes it obvious that this should be part of instcombine? If you're concerned that something else in instcombine will inhibit or invert this transform, you could add tests under test/Transforms/PhaseOrdering/ to make sure that doesn't happen. I think you've done the hard part (the code itself) already. :)

The major disadvantage of being in instcombine is that this code will be running 5-6 times in a typical pipeline when it probably doesn't need to.

Do you think still that it should be in a separate pass?

I don't know enough to say. I'm stating a concern based on feedback I've gotten about the size and cost of InstCombine (eg, http://lists.llvm.org/pipermail/llvm-dev/2017-May/113184.html , http://lists.llvm.org/pipermail/llvm-dev/2017-September/117151.html ). I'll let more experienced developers (cc'ing @nlopes @hfinkel @regehr ... ) decide what's the right way forward.

Also, I know that others are exploring ways to auto-regenerate some portion of InstCombine for greater efficiency, so it might be worth considering if/how that goal interacts with a large patch like this one.

majnemer added inline comments.Oct 25 2017, 7:54 AM
lib/Transforms/InstCombine/InstCombineInternal.h
782 ↗(On Diff #120219)

Ditto.

lib/Transforms/InstCombine/InstCombinePatterns.cpp
90 ↗(On Diff #120219)

default;

aaboud updated this revision to Diff 120258.Oct 25 2017, 8:09 AM
aaboud marked 2 inline comments as done.

Answered David's comment.

Saying that, I believe that running all current instcombine tests with this new functionality is a must, in order to make that possible, it is obvious that we need to be part of instcombine pass.

Note that the implementation is done in a way that moving it to a separate pass can be done with zero effort, but in a cost of ignoring/dropping few hundreds of LIT tests.

I agree that testing must be done, but I don't see how that makes it obvious that this should be part of instcombine? If you're concerned that something else in instcombine will inhibit or invert this transform, you could add tests under test/Transforms/PhaseOrdering/ to make sure that doesn't happen. I think you've done the hard part (the code itself) already. :)

The major disadvantage of being in instcombine is that this code will be running 5-6 times in a typical pipeline when it probably doesn't need to.

Is that a bad thing to run this code 5-6 times? it has O(n) complexity where n = number of instructions in the function.
Would not it catch more patterns once we run other optimizations?
I need this code to run at least twice, one time on regular compilation and another time as part of LTO optimization.
Would it be better if I move the code to a new pass "PatternInstructionCombinePass", but run that new pass from "addInstructionCombiningPass"? It will still run 5-6 times!

Do you think still that it should be in a separate pass?

I don't know enough to say. I'm stating a concern based on feedback I've gotten about the size and cost of InstCombine (eg, http://lists.llvm.org/pipermail/llvm-dev/2017-May/113184.html , http://lists.llvm.org/pipermail/llvm-dev/2017-September/117151.html ). I'll let more experienced developers (cc'ing @nlopes @hfinkel @regehr ... ) decide what's the right way forward.

Also, I know that others are exploring ways to auto-regenerate some portion of InstCombine for greater efficiency, so it might be worth considering if/how that goal interacts with a large patch like this one.

Notice that this code is part of InstCombine pass, but it is totally separated, I do not see why any change to InstCombine will affect this, unless you think we can auto-generate this optimization!

To summarize, I really do not mind to move it to separate pass, but I would like to make this optimization committed as soon as possible.
I appreciate your review and direction, please, advice with the best way you think I should implement this optimization.

Thanks,
Amjad

To summarize, I really do not mind to move it to separate pass, but I would like to make this optimization committed as soon as possible.
I appreciate your review and direction, please, advice with the best way you think I should implement this optimization.

As I said, I'm deferring to others on the way forward, so if everyone else thinks this is good, then I'm not objecting. Others have looked at the code closer than me, so I'll let them provide more feedback and/or approval.

To summarize, I really do not mind to move it to separate pass, but I would like to make this optimization committed as soon as possible.
I appreciate your review and direction, please, advice with the best way you think I should implement this optimization.

As I said, I'm deferring to others on the way forward, so if everyone else thinks this is good, then I'm not objecting. Others have looked at the code closer than me, so I'll let them provide more feedback and/or approval.

Thanks Sanjay,
I appreciate your help.
I will be waiting a little more for others to comment on this patch.

zvi added a comment.EditedOct 26 2017, 6:45 AM

Amjad, some questions about where do we want this work to evolve to.

Looking at the following example:

define i16 @foo(i8 %B, i16 %A, i1* %pbit) {
           %zext = zext i8 %B to i32
         %mul = mul i32 %zext, %zext
       %LSB = and i32 %mul, 255 ; <----------- First use
     %cmp = icmp ne i32 %LSB, 0
   store i1 %cmp, i1* %pbit
       %sext = sext i16 %A to i32
     %and = and i32 %mul, %sext ; <---------- Second use
   %trunc = trunc i32 %and to i16
   ret i16 %trunc
 }

The indentation should help with seeing the two branches in the expression DAG. The 'mul' is used by both 'and' instructions.
The most 'type-shrunken' form with no duplications should be this:

define i16 @foo(i8 %B, i16 %A, i1* %pbit) {

%zext = zext i8 %B to i16
%mul = mul i16 %zext, %zext
%LSB = and i16 %mul, 255
%cmp = icmp ne i16 %LSB, 0
store i1 %cmp, i1* %pbit, align 1
%and = and i16 %mul, %A
ret i16 %and

}

I assume that this patch will not cover this case because the bottom-up traversal starting from the 'trunc' will only visit one branch of the two. The bail-out condition

if (UI != CurrentTruncInst && !InstInfoMap.count(UI))
          return nullptr;

in getBestTruncatedType() will fire because one of the 'mul' users will not be mapped in the expressions DAG. If, for example, the expression DAG were constructed by traversing both defs and uses, it would have covered the entire function in the above example, and there would be sufficient information to deduce that both branches of the expression DAG can be shrunk.
Does this make sense?

lib/Transforms/InstCombine/InstCombinePatterns.cpp
62 ↗(On Diff #120258)

Didn't see any actual uses of AC and DT. Will they be used in a follow-up patch? Even if yes, better remove to avoid breaking -Werror builds

139 ↗(On Diff #120258)

Instead of populating a container, considering returning a pair of begin,end iterators to the operands of interest (or a op_range). Of course, this will only work if the operands of interest are consecutive.

260 ↗(On Diff #120258)

Please add a comment explaining the considerations for bailing out when MinBitWidth == ValidBitWidth

270 ↗(On Diff #120258)

I am commenting about this because it got me a bit confused. I think the correct term would be post-dominated or "users that dominate the truncate instruction".

271 ↗(On Diff #120258)

*expression

311 ↗(On Diff #120258)

Better call .lookup() rather than operator[] to avoid side effects in asserts

320 ↗(On Diff #120258)

Better call .lookup() rather than operator[] to avoid side effects in asserts
312

358 ↗(On Diff #120258)

Consider moving this 'if' block under the appropriate case statement above.

359 ↗(On Diff #120258)

STLExtras.h can help with saving a few bytes of code :)

auto Entry = find(Worklist, I);
aaboud marked 8 inline comments as done.Oct 26 2017, 7:45 AM

Thanks Zvi for the comments.
I fixed most of them and will upload a new patch soon.

lib/Transforms/InstCombine/InstCombinePatterns.cpp
62 ↗(On Diff #120258)

Indeed, I will be using them in next patches.
I do not think there is a warning/error issue with unused class members, but I will remove them from this patch and add them later when they are actually used.

139 ↗(On Diff #120258)

It will not work for the PHINode instruction.

aaboud updated this revision to Diff 120419.Oct 26 2017, 7:51 AM
aaboud marked an inline comment as done.

Addressed Zvi's comments.

Saying that, I believe that running all current instcombine tests with this new functionality is a must, in order to make that possible, it is obvious that we need to be part of instcombine pass.

Note that the implementation is done in a way that moving it to a separate pass can be done with zero effort, but in a cost of ignoring/dropping few hundreds of LIT tests.

I agree that testing must be done, but I don't see how that makes it obvious that this should be part of instcombine? If you're concerned that something else in instcombine will inhibit or invert this transform, you could add tests under test/Transforms/PhaseOrdering/ to make sure that doesn't happen. I think you've done the hard part (the code itself) already. :)

The major disadvantage of being in instcombine is that this code will be running 5-6 times in a typical pipeline when it probably doesn't need to.

Is that a bad thing to run this code 5-6 times? it has O(n) complexity where n = number of instructions in the function.
Would not it catch more patterns once we run other optimizations?
I need this code to run at least twice, one time on regular compilation and another time as part of LTO optimization.
Would it be better if I move the code to a new pass "PatternInstructionCombinePass", but run that new pass from "addInstructionCombiningPass"? It will still run 5-6 times!

Do you think still that it should be in a separate pass?

I don't know enough to say. I'm stating a concern based on feedback I've gotten about the size and cost of InstCombine (eg, http://lists.llvm.org/pipermail/llvm-dev/2017-May/113184.html , http://lists.llvm.org/pipermail/llvm-dev/2017-September/117151.html ). I'll let more experienced developers (cc'ing @nlopes @hfinkel @regehr ... ) decide what's the right way forward.

Also, I know that others are exploring ways to auto-regenerate some portion of InstCombine for greater efficiency, so it might be worth considering if/how that goal interacts with a large patch like this one.

Notice that this code is part of InstCombine pass, but it is totally separated, I do not see why any change to InstCombine will affect this, unless you think we can auto-generate this optimization!

To summarize, I really do not mind to move it to separate pass,

I think that this should be a separate pass. Everything in InstCombine is currently part of InstCombine's fixed-point iteration scheme. Having some things in InstCombine that are, and some that aren't, seems unnecessarily confusing. If this essentially runs as a separate pass, then I think we should just make it one, and then we can schedule it as we see fit.

Also, I'll note that this optimization seems very similar to PPCTargetLowering::DAGCombineTruncBoolExt, which optimized cases like trunc(binary-ops(binary-ops(zext(x), zext(y)), ...) and perhaps also PPCTargetLowering::DAGCombineExtBoolTrunc, which optimizes cases like zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...). Those implementation are not recursive, but use a work queue, and I suggest doing the same here.

but I would like to make this optimization committed as soon as possible.
I appreciate your review and direction, please, advice with the best way you think I should implement this optimization.

Thanks,
Amjad

zvi added a comment.Oct 30 2017, 4:32 AM

I know you decided to move this to a new pass, but here's a couple of more comments that will be relevant.

test/Transforms/InstCombine/trunc_pattern.ll
13 ↗(On Diff #120419)

dominated -> post-dominated

18 ↗(On Diff #120419)

May a nitpick of mine, but i find it easier to follow the check's when they arer right above the related sequence of instructions. Another option is to be break down to multiple functions.

aaboud updated this revision to Diff 120816.Oct 30 2017, 7:12 AM
aaboud marked 2 inline comments as done.
aaboud retitled this revision from [InstCombine] Introducing Pattern Instruction Combine plug-in into InstCombine pass to [InstCombine] Introducing Aggressive Instruction Combine pass.
aaboud edited the summary of this revision. (Show Details)

Separated the implementation from InstCombine pass and introduced a new pass called AggressiveInstCombine, which is called only twice (compared InstCombine, which is called ~6 times), one time as part of function simplification passes and second time as part of LTO optimization passes.

hfinkel added inline comments.Oct 30 2017, 7:02 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
77

You should use a worklist here, not recursion. Then they'll be no need for the depth limit (or it can be very large).

114

You can mention specific things here. Comments like "TODO: more work" are not helpful. You have a list in your patch description:

select, shufflevector, extractelement, insertelement
udiv, urem
shl, lshr, ashr
phi node (and loop handling)

158

If we have a DAG of instructions with multiple trunc outputs, we'll end up walking the DAG once per trunc output?

182

If your goal is only to produce legal integer widths (which might miss some cases for vectorization), then I think that you should just fold this information into getMinBitWidth so that getMinBitWidth returns the minimum *legal* bit width.

282

Please write actual messages in llvm_unreachable. Perhaps something like:

llvm_unreachable("Unhandled instruction");
aaboud marked 4 inline comments as done.Nov 1 2017, 10:50 AM

Thanks Hal for the review.
I will update the patch with the changes you ask for.
Also, see one answer below.

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
158

That is is correct, and we will not be able to perform the transformation for any of these trunc instructions.
This means that the complexity of this pass is at most O(n^2), in the worst case where n/2 is the DAG nodes and n/2 are trunc instructions that are using/truncating the DAG result.

Should we worry about this case? Can we solve it?

aaboud updated this revision to Diff 121149.Nov 1 2017, 10:53 AM

Addressed Hal's comments.

  1. Remove all recursive functions and replaced with Worklist + Stack containers. Note, this is needed for handling loops (which will be added with the PHINode patch).
  2. Improved the compile time performance of the pass, by implementing an early exist for multi-used instructions that are not post-dominated by the TruncInst.
zvi added a comment.Nov 2 2017, 7:07 AM

Some more minor comments

docs/Passes.rst
19

For the sake of consistency with other title, please add more dashes for alignment with title.

21

*patterns *expressions

24

*reduces the width of expressions

27

Maybe also say that the pattern-scan may cover the entire functions as opposed to the locality-limited instcombine patterns?

include/llvm/InitializePasses.h
2

These declarations are not perfectly sorted, but still maybe try to move this up.

include/llvm/Transforms/Scalar.h
8

*expressions *patterns

lib/LTO/LLVMBuild.txt
2

Please preserve sorted ordering

lib/Passes/LLVMBuild.txt
1–22

Please preserve sorted ordering

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
32

Can you please rephrase this so that it is obvious that the pass can be run more than once and that the internal mechanism limits pattern matchers to run once? Maybe something like this

It differs than  the Instcombine in that each pattern combiner is run only once as opposed to instcombine's multi-iteration ...
55

Is there an importance of order to the calls? If not, maybe group the set/addPreserve* calls and addRequired call to two group separated by a newline. IMHO it's more readable.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h
22

Move two lines up

70

post-dominated

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
41

+1 for preserving DebugInfo. Is there other metadata that may need to be copied? I can't think of anything in particular, just wanted to raise the possibility.

45

Is it possible to reuse llvm::ReplaceInstWithInst instead of doing the low-level work explicitly?

55

IIRC, in one of the previous revisions of this patch there was a comment explaining why these cast instructions are skipped? Can you please revive it or add a new one?

zvi added inline comments.Nov 2 2017, 7:07 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
65

Could we avoid pushing constants and maybe even generalize to avoid values that are not instructions? At least for these cases it may be ok, not for others, such as divide where we need to be more cautious

203

At this point a mappting for IOp must exist in InstInfoMap, right? Then please use lookup() or find(). Also better avoid searching for same key more than once.

210

lookup()

228

IMO this would be more readable and guarantee that code changes won't lead to uses of stale values of MinBitWidth:

MinBitWidth = OrigBitwidth;

and drop the return

353

I may have missed this, but why defer insertion of instruction to the BB to this point? And consider using the IRBuilder instead of calling ::Create* above and below.

aaboud marked 15 inline comments as done.Nov 2 2017, 8:03 AM

Thanks Zvi for the comments.
I will upload a new patch with most of the comments fixed.
See few answers below.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
55

Not sure if there is any importance to the order.
I just did what InstructionCombiningPass does!

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
41

This is what InstCombine preserves.

45

llvm::ReplaceInstWithInst also deletes the old instructions, which we are not ready to delete at this point.

65

I prefer not to complicate this function, it should return operands that are related to the optimization, the caller should check each relevant operand if it is a constant or instruction, there will be cases where even a constant need to be evaluated and will not skipped immediately.

aaboud updated this revision to Diff 121318.Nov 2 2017, 8:16 AM

Addressed Zvi's Comments.

aaboud updated this revision to Diff 121320.Nov 2 2017, 8:30 AM

Minor fix, forgot to use IRBuilder in one case in the previous patch.

zvi added a comment.Nov 2 2017, 8:32 AM

Thanks, Amjad! This patch LGTM, but i think it would be best to wait for an LGTM from one of the assigned reviewers.

aaboud updated this revision to Diff 121332.Nov 2 2017, 9:30 AM

Minor typo update.

This is the last version, where I answered all the previous comments.
Please, let me know if you have any final comment or if I can go ahead and commit the patch.

Thanks

zvi commandeered this revision.Nov 9 2017, 12:44 AM
zvi added a reviewer: aaboud.

Commandeering this patch while Amjad is away on a few weeks of vacation.
Also sending a friendly ping to the reviewers. AFAIK, all comments were addressed as of the latest revision of this patch. Please let me know if i missed anything. Thanks.

craig.topper added inline comments.Nov 14 2017, 8:59 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
63

What about the new pass manager?

lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h
90

This sentence reads funny

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
81

Can you consistently use auto with dyn_cast throughout this patch.

176

LLVM style preferes "auto *IOp". We don't want auto to hide the fact that its a pointer. Please scrub the whole patch for this.

207

In the case of vectors, is this using legal scalar integer types to constrain the vector element type? I'm not sure if that's the right behavior. The legal scalar types don't necessarily imply anything about vector types.

I think we generally try to avoid creating vector types that didn't appear in the IR.

211

Isn't this MinBitWidth == TruncBitWidth?

lib/Transforms/IPO/PassManagerBuilder.cpp
3

What about the new pass builder for the new pass manager?

zvi added inline comments.Nov 16 2017, 4:45 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
207

That' s a good point. Will look into this.

211

Your observation is correct but the comment is also correct and it explains something that may not be obvious.
At this point, we completed visiting the expression tree starting from the truncate's operand and found 'MinBitWidth' by taking the max of all min-bit-width requirements of the predecessors. Now, since this is a truncate instruction, and by construction of the expression tree it follows that the computed MinBitWidth can never be less than the truncate's return types's size in bits or in other words MinBitWidth >= TruncBitWidth.
The case of MinBitWidth > TruncBitWidth is hand;ed in the then block just above, and what remains in to handle the MinBitWidth == TruncBitWidth in the else block here.
I can put this in a comment if you think it would be helpful.

zvi updated this revision to Diff 123159.Nov 16 2017, 4:47 AM
zvi edited the summary of this revision. (Show Details)

Address some of Craig's recent comments.

zvi marked 3 inline comments as done.Nov 16 2017, 4:49 AM
craig.topper added inline comments.Nov 16 2017, 10:50 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
211

I was just saying that it should say TruncBitWidth not ValidBitWidth right?

zvi added inline comments.Nov 19 2017, 12:48 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
211

Ah, right :)

zvi updated this revision to Diff 123509.Nov 19 2017, 12:15 PM

Address the last of Craig's comments:

  • Thanks, @lsaba, for porting the pass to the new PassManager.
  • Removed shrinkage of vector types until we sort out if it is generally allowed to shrink element types of vector operations.
  • Some minor fixes to comments.
zvi marked 10 inline comments as done.Nov 19 2017, 12:19 PM
zvi updated this revision to Diff 123510.Nov 19 2017, 12:23 PM

Rebase on ToT. NFC in this revision.

escha added a subscriber: escha.Nov 27 2017, 11:40 AM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.
  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

davide added a subscriber: davide.Nov 27 2017, 1:45 PM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

zvi updated this revision to Diff 124519.Nov 27 2017, 11:34 PM

Add missing AggressiveInstCombine.h and fix missing 'opt' dependency. Thanks, @lsaba, for noticing.

zvi added a comment.Nov 27 2017, 11:59 PM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.

Sometimes it's LLVM, and sometimes it's the frontend that is required to extend small typed values before performing operations.

  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

zvi added a comment.Nov 28 2017, 12:03 AM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

In D38313#937267, @zvi wrote:

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.

Sometimes it's LLVM, and sometimes it's the frontend that is required to extend small typed values before performing operations.

  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

There *are* TLI hooks; they're just not as widely used in the DAG as they could be.

I'm warning you with regard to this patch because the DAG may inadvertently undo a lot of the optimizations you're doing here. This isn't an objection, just something that might be worth looking at later given past experiences in trying to do similar optimizations.

zvi added a comment.Nov 29 2017, 11:10 PM
In D38313#937269, @zvi wrote:

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

Measured CTMark and internal tests and was not able to observe significant compile time changes with -flto. Below are the results for CTMark:

WorkloadToT: stdev/average of 10 runs [%]This patch: stdev/average of 10 runs [%]Average compile-time speedup of this patch over ToT (higher is better for this patch)
7zip0.19%0.19%0.999
Bullet0.30%0.37%0.998
ClamAV0.39%0.19%1.000
SPASS0.52%0.33%1.000
consumer-typeset0.27%0.36%0.999
kimwitu++0.45%0.49%0.998
lencod0.20%0.51%1.001
mafft0.63%0.29%1.006
sqlite30.70%0.82%1.002
tramp3d-v41.23%1.78%0.990
craig.topper added inline comments.Dec 12 2017, 10:24 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205

I'm not sure you've addressed my vector concerns. The first part of this 'if' would create a new type for vectors by using getSmallestLegalIntType.

217

I think the !Vector check that was here was correct previously. We don't do isLegalnteger checks on the scalar types of vectors. For vectors we assume that if the type was present in the IR, the transform is fine. In this block, TruncWidth == MinBitWidth so the type existed in the original IR. My vector concerns were about the block above where we create new a new type.

Thanks Zvi for addressing all comments and questions while I am away.
Craig, please, see answers for your questions inlined below.

Thanks,
Amjad

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205

I think that in the "else" part, the one that I kept the same behavior as the original instcombine code, we might end up creating a vector type that was not in the IR, as for vector types we do not check for type legality.
So, why in this case we should behave differently?

Regarding the scalar check, it might be redundant, but not always, because even if the "trunc" instruction is performed on vector type, the evaluated expression might contain scalar operations (due to the "insertelement" instruction, which will be supported in next few patches).

Furthermore, my assumption is that codegen legalizer will promote the illegal vector type back to the original type (or to a smaller one), in both cases we will not get worse code than the one we started with!
Is that assumption too optimistic?

217

I agree with Craig, need to change back to:

if (!DstTy->isVectorTy() && FromLegal && !ToLegal)
aaboud added inline comments.Dec 19 2017, 4:47 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205

Just to emphasize I am adding two examples.

Case 1 (MinBitWidth == TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = mul <2 x i64> %A1, %A1
%C1 = extractelement <2 x i64> %B1, i32 0
%D1 = extractelement <2 x i64> %B1, i32 1
%E1 = add i64 %C1, %D1
%T = trunc i64 %E1 to i16
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = mul <2 x i16> %X, %X
%C2 = extractelement <2 x i16> %B2, i32 0
%D2 = extractelement <2 x i16> %B2, i32 1
%T = add i16 %C1, %D1

Case 2 (MinBitWidth > TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = lshr <2 x i64> %A1, <i64 8, i64 8>
%C1 = mul <2 x i64> %B1, %B1
%T = trunc <2 x i64> %C1 to <2 x i8>
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = lshr <2 x i16> %A2, <i32 8, i32 8>
%C2 = mul <2 x i16> %B2, %B2
%T = trunc <2 x i16> %C2 to <2 x i8>

Notice that in both cases the "new" vector type (<2 x i16>) in the transformed IR did not exist in the original IR.

Do not you think that we should perform these transformations and reduce the expression type width?

Taking your first example and increasing the element count to get legal types

define i16 @foo(<8 x i32> %X) {
  %A1 = zext <8 x i32> %X to <8 x i64>
  %B1 = mul <8 x i64> %A1, %A1
  %C1 = extractelement <8 x i64> %B1, i32 0
  %D1 = extractelement <8 x i64> %B1, i32 1
  %E1 = add i64 %C1, %D1
  %T = trunc i64 %E1 to i16
  ret i16 %T
}

define i16 @bar(<8 x i32> %X) {
  %A2 = trunc <8 x i32> %X to <8 x i16>
  %B2 = mul <8 x i16> %A2, %A2
  %C2 = extractelement <8 x i16> %B2, i32 0
  %D2 = extractelement <8 x i16> %B2, i32 1
  %T = add i16 %C2, %D2
  ret i16 %T
}

Then running that through llc with avx2. I get worse code for bar than foo. Vector truncates on x86 aren't good. There is no truncate instruction until avx512 and even then its 2 uops.

Taking your first example and increasing the element count to get legal types

define i16 @foo(<8 x i32> %X) {
  %A1 = zext <8 x i32> %X to <8 x i64>
  %B1 = mul <8 x i64> %A1, %A1
  %C1 = extractelement <8 x i64> %B1, i32 0
  %D1 = extractelement <8 x i64> %B1, i32 1
  %E1 = add i64 %C1, %D1
  %T = trunc i64 %E1 to i16
  ret i16 %T
}

define i16 @bar(<8 x i32> %X) {
  %A2 = trunc <8 x i32> %X to <8 x i16>
  %B2 = mul <8 x i16> %A2, %A2
  %C2 = extractelement <8 x i16> %B2, i32 0
  %D2 = extractelement <8 x i16> %B2, i32 1
  %T = add i16 %C2, %D2
  ret i16 %T
}

Then running that through llc with avx2. I get worse code for bar than foo. Vector truncates on x86 aren't good. There is no truncate instruction until avx512 and even then its 2 uops.

I can "fix" that by ignoring cases where zext/sext will turn into a truncate for vector types, the check need to be done is:
"For each zext/sext instruction with vector type that have one usage, its source type size in bitwidth should be not less than the chosen MinBitWidth".

This will prevent creating the truncate, which was not in the IR before, on new vector types (or any vector type).
However, we will still have zext/sext to new vector type that was not in the IR before.

Does that solve the problem?

P.S. if you still insist on prevent this pass from creating new vector types, the solution is:

  1. Do not support extractelement/insertelement.
  2. Do not accept expressions with vector type truncate instruction, where the MinBitWidth > TruncBitWidth.

@spatel, what do you think about vector types here?

@spatel, what do you think about vector types here?

I’m not at a dev machine, so I can’t try any experiments. But we’ve had something like this question come up in one of my vector cmp + select patches. Ideally, we’d always shrink those as we do with scalars, but as noted, we may not have good backend support to undo the transform. Given that it’s not a clear win, I think it’s best to limit the vector transforms in this initial patch. Then, we can enable those in a follow-up patch if there are known wins and deal with any regressions without risking the main (?) scalar motivating cases.

aaboud commandeered this revision.Dec 21 2017, 2:02 PM
aaboud edited reviewers, added: zvi; removed: aaboud.

Thanks for Zvi for helping me progress with this review while I am on vacation.
I will continue as an author from here.

aaboud updated this revision to Diff 127940.Dec 21 2017, 2:05 PM

Addressed Craig and Sanjay comments:

  1. Retrieve the support for vector types.
  2. Make sure that this transformation will not create a new vector type. This is achieved by allowing reducing expression with vector type only when MinBitWidth == TruncBitWidth.
zvi added inline comments.Dec 24 2017, 6:08 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
2

Should there be negative tests for the vector cases that are not permitted to transform?

aaboud added inline comments.Dec 26 2017, 12:24 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
2

There should be.
However, such tests needs instructions such as lshr, ashr, udiv or urem, i.e., instructions that increase the MinBitWidth that we can truncate the expression to.
So, I will add such test in the following patches, once I add support to these instructions.

@mzolotukhin may want to comment on this one before it goes in as he's spending large part of his time doing compile time work. Please wait for his opinion.

aaboud added a comment.Jan 6 2018, 3:39 AM

Hi,
I uploaded a new version about a week ago with the required change for not generating new vector type.
Please, let me know if you have any other comments.

Thanks,
Amjad

I think the patch looks good now with the vector fix. Did you hear anything from @mzolotukhin about compile time?

Hi and sorry for the late reply, I've just returned from the holidays break.
The numbers posted before look good. I wonder though if it would make sense to only run this pass on -O3. I assume that even if now the pass spends very little time, it will grow in future and the compile-time costs might become noticeable.

Michael

Hi and sorry for the late reply, I've just returned from the holidays break.
The numbers posted before look good. I wonder though if it would make sense to only run this pass on -O3. I assume that even if now the pass spends very little time, it will grow in future and the compile-time costs might become noticeable.

Michael

Thanks Michael for the feedback.
As you said, the pass spend very little time, cannot we decide on moving it to -O3 in the future when/if other heavy optimizations is added to this pass?
And even then, we can decide to run part of them on -O2 and the rest on -O3.

Would that work for you?

Thanks,
Amjad

As you said, the pass spend very little time, cannot we decide on moving it to -O3 in the future when/if other heavy optimizations is added to this pass?
And even then, we can decide to run part of them on -O2 and the rest on -O3.

My main concern with that is that it's actually really hard to demote something to lower optlevels retroactively.

For instance, right now it would make sense to move some parts of existing InstCombine out of O0. In practice it's a very time consuming task to do so (to find the right pieces, to do all the measurements, to agree in the community on the acceptable regressions etc.). And there is usually no single big heavy part that we can just move out and solve all the compile time issues - we have many small pieces that just sum up to something big.

I expect that to be the case with this pass as well - people will add more stuff, but each individual piece would contribute only a little, so it'll be hard to say "yep, this one goes to -O3, others can stay at O0/O2". So I think it's worth moving the whole pass to -O3 now rather than in the future (and how bad is it for other optlevels not to have it? is it really critical?).

Michael

aaboud updated this revision to Diff 131053.Jan 23 2018, 6:10 AM

Moved the aggressive-inst-combine pass to run with -O3.
I prefer to make this change now, in order to get approval to commit the pass, and in the future, once the pass is complete, to argue enabling it with -O2, in a separate discussion.

Please, let me know if you have any more concerns regarding this patch.

Thanks.

No concerns from my side, thanks for making the change!

Michael

If there is no more concerns, can I get approval?
@craig.topper

This revision is now accepted and ready to land.Jan 24 2018, 2:18 AM
This revision was automatically updated to reflect the committed changes.