
[InstCombine] Introducing Aggressive Instruction Combine pass
ClosedPublic

Authored by aaboud on Sep 27 2017, 5:19 AM.

Details

Summary

This new approach is a replacement for D37195.

Motivation:
The InstCombine algorithm runs to a fixed point (it keeps iterating until the code stops changing), so each transformation inside the outer loop must have very small complexity (preferably O(1)).

Problem
There are instruction patterns that cost more than O(1) to identify and transform, so they cannot be added to the InstCombine algorithm.

Solution
An AggressiveInstCombine pass that runs separately from the InstCombine pass and can perform expression-pattern optimizations, each of which may take up to O(n) time, where n is the number of instructions in the function.

This patch introduces:

  1. The AggressiveInstCombiner class, the main pass that runs all expression pattern optimizations.
  2. The TruncInstCombine class, the first expression pattern optimization added.

TruncInstCombine currently supports only simple instructions: add, sub, and, or, xor, and mul.
The main differences between this and InstCombine's canEvaluateTruncated are that this one supports:

  1. Instructions with multiple uses, as long as all of them are dominated by the truncated instruction.
  2. Truncating to a different width than the original trunc instruction requires (this is useful when we can reduce the expression width even without eliminating any instruction).
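To make the intent concrete, the width-reduction reasoning can be modeled numerically (a minimal Python sketch with illustrative values; the actual pass operates on LLVM IR):

```python
# Sketch: why trunc(op(x, y)) can be rewritten as op(trunc(x), trunc(y))
# for wrapping ops like add/mul/and. Values and widths are illustrative,
# not taken from the patch.

def trunc(v, bits):
    """Model of an LLVM trunc: keep only the low 'bits' bits."""
    return v & ((1 << bits) - 1)

x, y = 0xDEADBEEF12345678, 0x0F0F0F0F0F0F0F0F  # two i64 values

# Wide evaluation followed by a single trunc to i16 ...
wide = trunc((x * y + x) & ((1 << 64) - 1), 16)

# ... equals narrow evaluation where every operand is truncated first.
narrow = trunc(trunc(x, 16) * trunc(y, 16) + trunc(x, 16), 16)

assert wide == narrow
```

This modular-arithmetic identity is what justifies shrinking the whole expression tree for the simple wrapping instructions; shifts and division need extra care, which is why they are deferred to follow-up patches.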

Next, I will add the following support to the TruncInstCombine class, each item in a separate patch:

  1. select, shufflevector, extractelement, insertelement
  2. udiv, urem
  3. shl, lshr, ashr
  4. phi node (and loop handling)

Diff Detail

Event Timeline

zvi added inline comments.Nov 2 2017, 7:07 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
54 ↗(On Diff #121149)

IIRC, in one of the previous revisions of this patch there was a comment explaining why these cast instructions are skipped? Can you please revive it or add a new one?

64 ↗(On Diff #121149)

Could we avoid pushing constants, and maybe even generalize to avoid values that are not instructions? At least for these cases it may be OK, but not for others, such as divide, where we need to be more cautious.

202 ↗(On Diff #121149)

At this point a mapping for IOp must exist in InstInfoMap, right? If so, please use lookup() or find(). Also, better to avoid searching for the same key more than once.

209 ↗(On Diff #121149)

lookup()

227 ↗(On Diff #121149)

IMO this would be more readable and guarantee that code changes won't lead to uses of stale values of MinBitWidth:

MinBitWidth = OrigBitwidth;

and drop the return

352 ↗(On Diff #121149)

I may have missed this, but why defer insertion of instruction to the BB to this point? And consider using the IRBuilder instead of calling ::Create* above and below.

aaboud marked 15 inline comments as done.Nov 2 2017, 8:03 AM

Thanks Zvi for the comments.
I will upload a new patch with most of the comments fixed.
See few answers below.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
54 ↗(On Diff #121149)

Not sure if there is any importance to the order.
I just did what InstructionCombiningPass does!

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
40 ↗(On Diff #121149)

This is what InstCombine preserves.

44 ↗(On Diff #121149)

llvm::ReplaceInstWithInst also deletes the old instructions, which we are not ready to delete at this point.

64 ↗(On Diff #121149)

I prefer not to complicate this function: it should return the operands that are relevant to the optimization, and the caller should check whether each relevant operand is a constant or an instruction. There will be cases where even a constant needs to be evaluated and cannot be skipped immediately.
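The point about constants can be sketched numerically (illustrative widths and values, not from the patch): when the expression width is reduced, a constant operand has to be re-emitted at the new width rather than skipped.

```python
# Sketch: constant operands must be rewritten, not skipped, when the
# expression width is reduced. Widths and values are illustrative.

def shrink_const(value, new_bits):
    """Re-emit a constant at the reduced width (like truncating a ConstantInt)."""
    return value & ((1 << new_bits) - 1)

# trunc (and i64 %x, 0xFF00FF) to i16  ->  and i16 %x16, 0x00FF
assert shrink_const(0xFF00FF, 16) == 0x00FF

x = 0x123456789ABCDEF0
wide = (x & 0xFF00FF) & 0xFFFF                    # wide op, then trunc
narrow = (x & 0xFFFF) & shrink_const(0xFF00FF, 16)  # trunc operands first
assert wide == narrow
```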

aaboud updated this revision to Diff 121318.Nov 2 2017, 8:16 AM

Addressed Zvi's Comments.

aaboud updated this revision to Diff 121320.Nov 2 2017, 8:30 AM

Minor fix, forgot to use IRBuilder in one case in the previous patch.

zvi added a comment.Nov 2 2017, 8:32 AM

Thanks, Amjad! This patch LGTM, but I think it would be best to wait for an LGTM from one of the assigned reviewers.

aaboud updated this revision to Diff 121332.Nov 2 2017, 9:30 AM

Minor typo update.

This is the latest version, in which I have addressed all the previous comments.
Please let me know if you have any final comments, or whether I can go ahead and commit the patch.

Thanks

zvi commandeered this revision.Nov 9 2017, 12:44 AM
zvi added a reviewer: aaboud.

Commandeering this patch while Amjad is away on a few weeks of vacation.
Also sending a friendly ping to the reviewers. AFAIK, all comments were addressed as of the latest revision of this patch. Please let me know if I missed anything. Thanks.

craig.topper added inline comments.Nov 14 2017, 8:59 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
62 ↗(On Diff #121332)

What about the new pass manager?

lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h
89 ↗(On Diff #121332)

This sentence reads funny

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
80 ↗(On Diff #121332)

Can you consistently use auto with dyn_cast throughout this patch?

175 ↗(On Diff #121332)

LLVM style prefers "auto *IOp". We don't want auto to hide the fact that it's a pointer. Please scrub the whole patch for this.

206 ↗(On Diff #121332)

In the case of vectors, is this using legal scalar integer types to constrain the vector element type? I'm not sure if that's the right behavior. The legal scalar types don't necessarily imply anything about vector types.

I think we generally try to avoid creating vector types that didn't appear in the IR.

210 ↗(On Diff #121332)

Isn't this MinBitWidth == TruncBitWidth?

lib/Transforms/IPO/PassManagerBuilder.cpp
750 ↗(On Diff #121332)

What about the new pass builder for the new pass manager?

zvi added inline comments.Nov 16 2017, 4:45 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
206 ↗(On Diff #121332)

That's a good point. Will look into this.

210 ↗(On Diff #121332)

Your observation is correct, but the comment is also correct, and it explains something that may not be obvious.
At this point we have finished visiting the expression tree starting from the truncate's operand, and found MinBitWidth by taking the max of all min-bit-width requirements of the predecessors. Since this is a truncate instruction, by construction of the expression tree the computed MinBitWidth can never be less than the truncate's return type's size in bits; in other words, MinBitWidth >= TruncBitWidth.
The case of MinBitWidth > TruncBitWidth is handled in the then-block just above, and what remains is to handle MinBitWidth == TruncBitWidth in the else-block here.
I can put this in a comment if you think it would be helpful.
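The bottom-up computation described above can be sketched roughly as follows (the node layout here is hypothetical, not TruncInstCombine's actual InstInfoMap):

```python
# Sketch of the bottom-up MinBitWidth computation described above:
# take the max of all min-bit-width requirements of the predecessors.
# The node representation is hypothetical, for illustration only.

def min_bit_width(node):
    kind = node["kind"]
    if kind == "leaf":
        # e.g. a function argument known to need only 'min_bits' bits
        return node["min_bits"]
    if kind == "trunc":
        # An inner trunc requires at least its destination width.
        return node["dst_bits"]
    # Simple wrapping ops (add/sub/mul/and/or/xor): max over all operands.
    return max(min_bit_width(op) for op in node["ops"])

# Model of: add (zext i8 %a to i64), (zext i8 %b to i64)
expr = {"kind": "add", "ops": [
    {"kind": "leaf", "min_bits": 8},
    {"kind": "leaf", "min_bits": 8},
]}
assert min_bit_width(expr) == 8  # the whole expression fits in 8 bits
```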

zvi updated this revision to Diff 123159.Nov 16 2017, 4:47 AM
zvi edited the summary of this revision. (Show Details)

Address some of Craig's recent comments.

zvi marked 3 inline comments as done.Nov 16 2017, 4:49 AM
craig.topper added inline comments.Nov 16 2017, 10:50 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
210 ↗(On Diff #121332)

I was just saying that it should say TruncBitWidth not ValidBitWidth right?

zvi added inline comments.Nov 19 2017, 12:48 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
210 ↗(On Diff #121332)

Ah, right :)

zvi updated this revision to Diff 123509.Nov 19 2017, 12:15 PM

Address the last of Craig's comments:

  • Thanks, @lsaba, for porting the pass to the new PassManager.
  • Removed shrinkage of vector types until we sort out if it is generally allowed to shrink element types of vector operations.
  • Some minor fixes to comments.
zvi marked 10 inline comments as done.Nov 19 2017, 12:19 PM
zvi updated this revision to Diff 123510.Nov 19 2017, 12:23 PM

Rebase on ToT. NFC in this revision.

escha added a subscriber: escha.Nov 27 2017, 11:40 AM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.
  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

davide added a subscriber: davide.Nov 27 2017, 1:45 PM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

zvi updated this revision to Diff 124519.Nov 27 2017, 11:34 PM

Add missing AggressiveInstCombine.h and fix missing 'opt' dependency. Thanks, @lsaba, for noticing.

zvi added a comment.Nov 27 2017, 11:59 PM

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.

Sometimes it's LLVM, and sometimes it's the frontend that is required to extend small typed values before performing operations.

  1. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + (*a >> 16) & 15;
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

zvi added a comment.Nov 28 2017, 12:03 AM

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

In D38313#937267, @zvi wrote:

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

There *are* TLI hooks; they're just not as widely used in the DAG as they could be.

I'm warning you with regard to this patch because the DAG may inadvertently undo a lot of the optimizations you're doing here. This isn't an objection, just something that might be worth looking at later given past experiences in trying to do similar optimizations.

zvi added a comment.Nov 29 2017, 11:10 PM
In D38313#937269, @zvi wrote:

I'm really worried that the compile time hit of this for LTO will be non-negligible. Do you have numbers?

Will follow-up on this.

Measured CTMark and internal tests and was not able to observe significant compile time changes with -flto. Below are the results for CTMark:

Workload          | ToT: stdev/avg of 10 runs [%] | This patch: stdev/avg of 10 runs [%] | Compile-time speedup over ToT (higher is better for this patch)
7zip              | 0.19% | 0.19% | 0.999
Bullet            | 0.30% | 0.37% | 0.998
ClamAV            | 0.39% | 0.19% | 1.000
SPASS             | 0.52% | 0.33% | 1.000
consumer-typeset  | 0.27% | 0.36% | 0.999
kimwitu++         | 0.45% | 0.49% | 0.998
lencod            | 0.20% | 0.51% | 1.001
mafft             | 0.63% | 0.29% | 1.006
sqlite3           | 0.70% | 0.82% | 1.002
tramp3d-v4        | 1.23% | 1.78% | 0.990
craig.topper added inline comments.Dec 12 2017, 10:24 PM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

I'm not sure you've addressed my vector concerns. The first part of this 'if' would create a new type for vectors by using getSmallestLegalIntType.

217 ↗(On Diff #124519)

I think the !Vector check that was here previously was correct. We don't do isLegalInteger checks on the scalar types of vectors. For vectors we assume that if the type was present in the IR, the transform is fine. In this block, TruncBitWidth == MinBitWidth, so the type existed in the original IR. My vector concerns were about the block above, where we create a new type.

Thanks Zvi for addressing all comments and questions while I am away.
Craig, please, see answers for your questions inlined below.

Thanks,
Amjad

lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

I think that in the "else" part, where I kept the same behavior as the original InstCombine code, we might also end up creating a vector type that was not in the IR, because for vector types we do not check for type legality.
So why should we behave differently in this case?

Regarding the scalar check, it might be redundant, but not always: even if the "trunc" instruction is performed on a vector type, the evaluated expression might contain scalar operations (due to the "insertelement" instruction, which will be supported in the next few patches).

Furthermore, my assumption is that the codegen legalizer will promote the illegal vector type back to the original type (or to a smaller one); in both cases we will not get worse code than what we started with!
Is that assumption too optimistic?

217 ↗(On Diff #124519)

I agree with Craig, need to change back to:

if (!DstTy->isVectorTy() && FromLegal && !ToLegal)
aaboud added inline comments.Dec 19 2017, 4:47 AM
lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
205 ↗(On Diff #124519)

Just to emphasize I am adding two examples.

Case 1 (MinBitWidth == TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = mul <2 x i64> %A1, %A1
%C1 = extractelement <2 x i64> %B1, i32 0
%D1 = extractelement <2 x i64> %B1, i32 1
%E1 = add i64 %C1, %D1
%T = trunc i64 %E1 to i16
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = mul <2 x i16> %A2, %A2
%C2 = extractelement <2 x i16> %B2, i32 0
%D2 = extractelement <2 x i16> %B2, i32 1
%T = add i16 %C2, %D2

Case 2 (MinBitWidth > TruncBitWidth):

%A1 = zext <2 x i32> %X to <2 x i64>
%B1 = lshr <2 x i64> %A1, <i64 8, i64 8>
%C1 = mul <2 x i64> %B1, %B1
%T = trunc <2 x i64> %C1 to <2 x i8>
  =>
%A2 = trunc <2 x i32> %X to <2 x i16>
%B2 = lshr <2 x i16> %A2, <i16 8, i16 8>
%C2 = mul <2 x i16> %B2, %B2
%T = trunc <2 x i16> %C2 to <2 x i8>

Notice that in both cases the "new" vector type (<2 x i16>) in the transformed IR did not exist in the original IR.

Don't you think we should perform these transformations and reduce the expression type width?
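As a numeric sanity check on Case 1 above, the narrow and wide evaluations agree modulo 2^16 (a Python model of the IR, with the widths from the example; this says nothing about codegen quality, only value correctness):

```python
# Numeric model of Case 1: trunc-to-i16 of the wide evaluation equals
# the evaluation done entirely in i16. Inputs model the <2 x i32> %X.
M64, M16 = (1 << 64) - 1, (1 << 16) - 1

def case1_wide(x0, x1):
    # zext i32->i64, mul, extractelement x2, add, trunc to i16
    c = (x0 * x0) & M64
    d = (x1 * x1) & M64
    return ((c + d) & M64) & M16

def case1_narrow(x0, x1):
    # trunc i32->i16 first, then mul/add in i16
    a0, a1 = x0 & M16, x1 & M16
    c = (a0 * a0) & M16
    d = (a1 * a1) & M16
    return (c + d) & M16

for x0, x1 in [(0, 0), (0xFFFFFFFF, 1), (0x12345678, 0x9ABCDEF0)]:
    assert case1_wide(x0, x1) == case1_narrow(x0, x1)
```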

Taking your first example and increasing the element count to get legal types

define i16 @foo(<8 x i32> %X) {
  %A1 = zext <8 x i32> %X to <8 x i64>
  %B1 = mul <8 x i64> %A1, %A1
  %C1 = extractelement <8 x i64> %B1, i32 0
  %D1 = extractelement <8 x i64> %B1, i32 1
  %E1 = add i64 %C1, %D1
  %T = trunc i64 %E1 to i16
  ret i16 %T
}

define i16 @bar(<8 x i32> %X) {
  %A2 = trunc <8 x i32> %X to <8 x i16>
  %B2 = mul <8 x i16> %A2, %A2
  %C2 = extractelement <8 x i16> %B2, i32 0
  %D2 = extractelement <8 x i16> %B2, i32 1
  %T = add i16 %C2, %D2
  ret i16 %T
}

Then running that through llc with avx2. I get worse code for bar than foo. Vector truncates on x86 aren't good. There is no truncate instruction until avx512 and even then its 2 uops.

I can "fix" that by ignoring cases where a zext/sext would turn into a truncate for vector types. The check needed is:
"For each zext/sext instruction with vector type that has a single use, its source type's size in bits must not be less than the chosen MinBitWidth."

This will prevent creating a truncate, which was not in the IR before, on new vector types (or on any vector type).
However, we will still have zext/sext to a new vector type that was not in the IR before.

Does that solve the problem?

P.S. If you still insist on preventing this pass from creating new vector types, the solution is:

  1. Do not support extractelement/insertelement.
  2. Do not accept expressions with vector type truncate instruction, where the MinBitWidth > TruncBitWidth.

@spatel, what do you think about vector types here?

@spatel, what do you think about vector types here?

I’m not at a dev machine, so I can’t try any experiments. But we’ve had something like this question come up in one of my vector cmp + select patches. Ideally, we’d always shrink those as we do with scalars, but as noted, we may not have good backend support to undo the transform. Given that it’s not a clear win, I think it’s best to limit the vector transforms in this initial patch. Then, we can enable those in a follow-up patch if there are known wins and deal with any regressions without risking the main (?) scalar motivating cases.

aaboud commandeered this revision.Dec 21 2017, 2:02 PM
aaboud edited reviewers, added: zvi; removed: aaboud.

Thanks to Zvi for helping me make progress with this review while I was on vacation.
I will continue as the author from here.

aaboud updated this revision to Diff 127940.Dec 21 2017, 2:05 PM

Addressed Craig and Sanjay comments:

  1. Restored the support for vector types.
  2. Made sure that this transformation will not create a new vector type. This is achieved by reducing vector-typed expressions only when MinBitWidth == TruncBitWidth.
zvi added inline comments.Dec 24 2017, 6:08 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
1 ↗(On Diff #127940)

Should there be negative tests for the vector cases that are not permitted to transform?

aaboud added inline comments.Dec 26 2017, 12:24 AM
test/Transforms/AggressiveInstCombine/trunc_multi_uses.ll
1 ↗(On Diff #127940)

There should be.
However, such tests need instructions such as lshr, ashr, udiv or urem, i.e., instructions that increase the MinBitWidth that we can truncate the expression to.
So I will add such tests in the following patches, once I add support for these instructions.

@mzolotukhin may want to comment on this one before it goes in, as he's spending a large part of his time on compile-time work. Please wait for his opinion.

aaboud added a comment.Jan 6 2018, 3:39 AM

Hi,
I uploaded a new version about a week ago with the required change for not generating new vector type.
Please, let me know if you have any other comments.

Thanks,
Amjad

I think the patch looks good now with the vector fix. Did you hear anything from @mzolotukhin about compile time?

Hi, and sorry for the late reply; I've just returned from the holiday break.
The numbers posted before look good. I wonder, though, whether it would make sense to run this pass only at -O3. I assume that even if the pass spends very little time now, it will grow in the future, and the compile-time costs might become noticeable.

Michael


Thanks Michael for the feedback.
As you said, the pass spends very little time; can't we decide on moving it to -O3 in the future, when/if other heavy optimizations are added to this pass?
And even then, we could decide to run some of them at -O2 and the rest at -O3.

Would that work for you?

Thanks,
Amjad

As you said, the pass spends very little time; can't we decide on moving it to -O3 in the future, when/if other heavy optimizations are added to this pass?
And even then, we could decide to run some of them at -O2 and the rest at -O3.

My main concern with that is that it's actually really hard to demote something to lower optlevels retroactively.

For instance, right now it would make sense to move some parts of existing InstCombine out of O0. In practice it's a very time consuming task to do so (to find the right pieces, to do all the measurements, to agree in the community on the acceptable regressions etc.). And there is usually no single big heavy part that we can just move out and solve all the compile time issues - we have many small pieces that just sum up to something big.

I expect that to be the case with this pass as well - people will add more stuff, but each individual piece would contribute only a little, so it'll be hard to say "yep, this one goes to -O3, others can stay at O0/O2". So I think it's worth moving the whole pass to -O3 now rather than in the future (and how bad is it for other optlevels not to have it? is it really critical?).

Michael

aaboud updated this revision to Diff 131053.Jan 23 2018, 6:10 AM

Moved the aggressive-inst-combine pass to run with -O3.
I prefer to make this change now in order to get approval to commit the pass; in the future, once the pass is complete, we can argue for enabling it with -O2 in a separate discussion.

Please, let me know if you have any more concerns regarding this patch.

Thanks.

No concerns from my side, thanks for making the change!

Michael

If there are no more concerns, can I get approval?
@craig.topper

This revision is now accepted and ready to land.Jan 24 2018, 2:18 AM
This revision was automatically updated to reflect the committed changes.