This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Allow common type conversions to i8/i16
ClosedPublic

Authored by dmgreen on Jan 23 2018, 8:15 AM.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 23 2018, 8:15 AM

Let me know if I should add more tests/change the condition/whathaveyou.

Let me cross-ref back to the llvm-dev thread for the motivating example and perf results so far:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120522.html
...and add some reviewers who likely have better intuition about the potential backend effects on non-x86 targets.
Reminder: this should have no effect on x86 because i8/i16 are already data-layout-legal there.

lib/Transforms/InstCombine/InstructionCombining.cpp
159–160

See Alex Bradbury's comment in the llvm-dev thread about i32 too. Maybe we generalize this for ToLegal as:

bool ToLegal = ToWidth == 1 || isPowerOf2_32(ToWidth) || DL.isLegalInteger(ToWidth);
test/Transforms/InstCombine/should-change-type.ll
4

Should be able to reduce that string to just: "n32" ?

6

trunk -> trunc

spatel added inline comments.Jan 23 2018, 10:04 AM
lib/Transforms/InstCombine/InstructionCombining.cpp
159–160

pow2 isn't quite right; nobody wants i2 or i4. :)

dmgreen updated this revision to Diff 131212.Jan 24 2018, 2:39 AM
dmgreen edited the summary of this revision. (Show Details)
dmgreen added inline comments.Jan 24 2018, 2:44 AM
lib/Transforms/InstCombine/InstructionCombining.cpp
149

I almost wrote this as a FIXME:

159–160

I took the simple route. Not sure what's really best here. ARM/AArch64 has LDRB/LDRH for loading bytes/halfwords, same for loads, UXTB/UXTH for extends etc. These types feel like they are almost "semi-legal".

test/Transforms/InstCombine/select-bitext.ll
117 ↗(On Diff #131212)

Although longer, this produced the same final assembly as the old function on any target I tried.

spatel added inline comments.Jan 24 2018, 10:33 AM
test/Transforms/InstCombine/select-bitext.ll
117 ↗(On Diff #131212)

We canonicalize trunc+sext to shifts, and we have a similar fold for:

// ashr (shl (zext X), C), C --> sext X

...so I think we can add a fold for this too (guarded by shouldChangeType).

https://rise4fun.com/Alive/NVU

I'll post that for review, so we can eliminate this sequence as a potential cause of regressions.

spatel added inline comments.Jan 24 2018, 12:50 PM
test/Transforms/InstCombine/select-bitext.ll
117 ↗(On Diff #131212)

Actually, adding a fold for this won't work because we transform the other way (to shifts) in InstCombineCasts:

DEBUG(dbgs() << "ICE: EvaluateInDifferentType converting expression type"
          " to avoid sign extend: " << CI << '\n');
    // We need to emit a shl + ashr to do the sign extend.

So if we try to reduce to the form with sext, we'll infinite loop on a case like this:

target datalayout = "n8:16:32"

define i16 @trunc_ashr(i32 %x) {
  %t = trunc i32 %x to i16
  %s = shl i16 %t, 8
  %a = ashr i16 %s, 8
  ret i16 %a
}

It doesn't show up in the test here because this file doesn't specify a data-layout.

I added a different transform that is guarded by shouldChangeType() at rL323437 .
The tests added in that patch will be affected by this patch, so please rebase. I don't have any other suggestions, so if there are no other comments, let's try it?

I think the difference in AArch64 code with this patch for the examples in PR35792 would be:

$ ./opt  35792.ll -S | ./llc -o - -mtriple=aarch64
julia_a_62828:                          // @julia_a_62828
	mov	w8, w0
	sub	w9, w0, #1              // =1
	and	x0, x9, x8
	ret
bad:                                    // @bad
	orr	w8, wzr, #0xffff
	add	w8, w0, w8
	and	w0, w0, w8
	ret

vs. after this patch:

$ ./opt -instcombine  35792.ll -S | ./llc -o - -mtriple=aarch64
julia_a_62828:                          // @julia_a_62828
	sub	w8, w0, #1              // =1
	and	w0, w8, w0
	ret
bad:                                    // @bad
	sub	w8, w0, #1              // =1
	and	w0, w8, w0
	ret
lib/Transforms/InstCombine/InstructionCombining.cpp
147–150

or -> of ?

dmgreen updated this revision to Diff 131592.Jan 26 2018, 8:27 AM
dmgreen added a subscriber: kparzysz.

Thanks for the info. Unfortunately it looks like I was previously building without all backends (or an intervening commit has changed things?), and a test in Hexagon (CodeGen/Hexagon/loop-idiom/pmpy-mod.ll) is failing due to this. It looks like the code entering into the hexagon loop recogniser is shorter, but is no longer recognised and transformed to a single intrinsic.

Does anyone know if there a more sensible way to do this? Perhaps just for the ARM/AArch64 backends, not for all targets, no matter how odd they are. Or do we think this is the best way to do things?

The sensible thing to do would be to have a well-defined canonical form of the IR, instead of "the most strength-reduced". This is an example why lack of it is only creating issues, and how having idiom-recognition that goes beyond a trivial "memcpy" or "memset" is notoriously difficult to maintain. We could still have the "maximum combining", but after idiom recognition has run.

The code in the polynomial multiplication recognition has even evolved to have its own expression transformations to "undo" various changes that the combining had developed since the initial pattern matching was written. I could change that code again, but unless there is a plan in place to fix this ongoing issue, this is not going to be a high priority for me.

Does anyone know if there a more sensible way to do this? Perhaps just for the ARM/AArch64 backends, not for all targets, no matter how odd they are. Or do we think this is the best way to do things?

The target-specific hack would be to change the data-layout for those targets, so i8 and i16 are legal ints. I don't know if that would cause any conflicts with legalization though. You could try that locally and see if anything breaks?

Yes, making them legal types in the datalayout was one of the things I tried. It made things (a little) worse in the quick tests I ran, but didn't seem to fail on anything. It was only a quick set of tests, and the changes vs this were only quite small. It obviously has larger reaching implications that this just in instcombine.

The purist in me is looking for something a little better, but I think what we have here is a sensible approach to take. We can always change it if not.

Krzysztof, thanks for the info. It sounds like having a canonical representation we all agree on would be a hard problem to solve :) I thinks that is how this all began, wanting a canonicalisation of selects vs cfg that played better with GVN/similar passes.

It appears that the loop idiom recogniser is not picking this up because it does not know how to handle the trunc's that are now part of the CFG. There seem to be a number of changes in the IR going into the idiom recogniser, like doing icmp eq 1 vs icmp eq 0, doing and(xor()) vs xor(and()), etc. They are all being handled fine, except for the new trunk nodes it just doesn't know anything about.

Krzysztof, thanks for the info. It sounds like having a canonical representation we all agree on would be a hard problem to solve :) I thinks that is how this all began, wanting a canonicalisation of selects vs cfg that played better with GVN/similar passes.

It appears that the loop idiom recogniser is not picking this up because it does not know how to handle the trunc's that are now part of the CFG. There seem to be a number of changes in the IR going into the idiom recogniser, like doing icmp eq 1 vs icmp eq 0, doing and(xor()) vs xor(and()), etc. They are all being handled fine, except for the new trunk nodes it just doesn't know anything about.

I'm not familiar with LIR, but if this isn't matching for Hexagon now, then isn't it already a problem for other targets? Ie, aren't we failing to recognize the idiom for x86 because it must already be transformed into the form that is not recognized? Is it difficult/undesirable to make the matching more flexible?

Krzysztof, thanks for the info. It sounds like having a canonical representation we all agree on would be a hard problem to solve :) I thinks that is how this all began, wanting a canonicalisation of selects vs cfg that played better with GVN/similar passes.

It appears that the loop idiom recogniser is not picking this up because it does not know how to handle the trunc's that are now part of the CFG. There seem to be a number of changes in the IR going into the idiom recogniser, like doing icmp eq 1 vs icmp eq 0, doing and(xor()) vs xor(and()), etc. They are all being handled fine, except for the new trunk nodes it just doesn't know anything about.

I'm not familiar with LIR, but if this isn't matching for Hexagon now, then isn't it already a problem for other targets? Ie, aren't we failing to recognize the idiom for x86 because it must already be transformed into the form that is not recognized? Is it difficult/undesirable to make the matching more flexible?

Oops - disregard. I see now that we're talking about a target-specific LIR that produces a target-specific intrinsic.

So let me adjust the question: if source code was written in a way that this new variant of IR with truncs was already created (independent of this patch), then we're already failing to match, right? Or is there something that guarantees that that pattern is not created from source in the first place?

So let me adjust the question: if source code was written in a way that this new variant of IR with truncs was already created (independent of this patch), then we're already failing to match, right? Or is there something that guarantees that that pattern is not created from source in the first place?

The motivating source had short types (uint8_t and uint16_t) or equivalent, if I remember correctly, and the passes running prior to LIR "legalized" the types to i32. The pattern matching code was written to detect a loop that does polynomial multiplication and polynomial division (with a remainder). Originally it was matching the code that was generated from the sources at the time (and the operating type was i32 after extensions). Over time, changes in combining caused it to fail without us noticing (we didn't have that testcase, just something with a pre-generated IR going directly to the LIR pass). Due to the complexity of the LIR code, it was left alone, and a pre-processing step was added to it to convert the IR back to the form that it looked for, essentially undoing the combiner's changes. This was done on a "local copy" of the IR, so to speak, so it didn't really undo it for the rest of the passes. If the match succeeded, the loop would go away, otherwise, the IR would stay (and the "local copy" would be deleted).

Krzysztof, thanks for the info. It sounds like having a canonical representation we all agree on would be a hard problem to solve :) I thinks that is how this all began, wanting a canonicalisation of selects vs cfg that played better with GVN/similar passes.

What I want out of the canonical form is predictability. I could not care less if it's suboptimal, I just want to have a form that I can match against, and only at a specified point in the optimization sequence. All the aggressive optimizations can still happen, but after idiom recognizers had their chance to take a look at the code. In other words, I'd advocate to split these transformations that make code look predictable from those that make it optimized.

Going back to this issue. I'm not opposed to the change itself, if it enables more things to happen then it sounds like a good idea. At the same time we (Hexagon) cannot give up that LIR pass. If it's only a question of truncates, it may be easy to change the LIR code, but I don't want to be finding myself in this situation over and over again. The current model simply does not support idiom recognition.

OK. For some technical details. This is what we used to have heading into the recogniser (slightly edited for clarity):

%v64 = phi i8 [ 0, %b2 ], [ %v45, %b9 ]
%v53 = phi i16 [ %a1, %b2 ], [ %v43, %b9 ]
%v42 = phi i8 [ %a0, %b2 ], [ %a, %b9 ]

%0 = and i8 %v42, 1
%v11 = zext i8 %0 to i32
%1 = and i16 %v53, 1
%v14 = zext i16 %1 to i32
%v15 = xor i32 %v14, %v11

%a = lshr i8 %v42, 1
%v21 = icmp eq i32 %v15, 1
%b = lshr i16 %v53, 1
%v36 = xor i16 %b, -24575
%v43 = select i1 %v21, i16 %v36, i16 %b
%v45 = add nuw nsw i8 %v64, 1
%v8 = icmp ult i8 %v45, 8
br i1 %v8, label %b9, label %b46

This is what it is now:

%v64 = phi i8 [ 0, %b2 ], [ %v45, %b9 ]
%v53 = phi i16 [ %a1, %b2 ], [ %v43, %b9 ]
%v42 = phi i8 [ %a0, %b2 ], [ %a, %b9 ]

%0 = trunc i16 %v53 to i8
%v111 = xor i8 %v42, %0
%v15 = and i8 %v111, 1

%a = lshr i8 %v42, 1
%v21 = icmp eq i8 %v15, 0
%b = lshr i16 %v53, 1
%v36 = xor i16 %b, -24575
%v43 = select i1 %v21, i16 %b, i16 %v36
%v45 = add nuw nsw i8 %v64, 1
%v8 = icmp ult i8 %v45, 8
br i1 %v8, label %b9, label %b46

I managed to hackily get things working by adding Trunc as cases to isPromotableTo, commutesWithShift and keepsHighBitsZero, and then doing something like "if(isa<TruncInst>(Var)) Var = Var->getOperand(0)" in scanSelect. I can't claim these changes are complete or properly thought through or tested at-all :-/ but they did get at least this test case to work correctly again.

I was impressed that the rest of the pass just worked correctly, even with the other knock-on changes. When I was looking through it, I felt like an "is equivalent to" matcher would have been useful. Might be very hard to make in practice though.

Oh brilliant. What a champ! Looks there were a few more changes than just my hacks. Thank you for the fixes, much appreciated.

I did a build and all tests now pass.

spatel accepted this revision.Jan 31 2018, 6:25 AM

I did a build and all tests now pass.

Let's try it then. LGTM.

This revision is now accepted and ready to land.Jan 31 2018, 6:25 AM
This revision was automatically updated to reflect the committed changes.

Thanks. Let see how it goes.

There may be some blow back from this one, let me know if anything comes up.

Alas, looks like one came up already

Causing timeouts on at least some ppc64le, with some other suspicious looking cases. Reverted in rL323959. I'll take a look into why.

spatel reopened this revision.Feb 1 2018, 7:51 AM

Reopening the review since this was reverted.

This revision is now accepted and ready to land.Feb 1 2018, 7:51 AM
spatel requested changes to this revision.Feb 1 2018, 7:52 AM
This revision now requires changes to proceed.Feb 1 2018, 7:52 AM
dmgreen updated this revision to Diff 132575.Feb 2 2018, 6:17 AM

OK. So the reason it was looping is two parts of InstCombinePhi acting against one another.

FoldPHIArgsIntoPHI (guarded by shouldChangeType) would sink from phi i8(trunc a, trunc b) to trunc(phi i16(a,b)), and SliceUpIllegalIntegerPHI (guarded by !isLegalType) would do the opposite.

I've updated back to an older version here, where we only do the conversion if ToWidth < FromWidth, breaking the cycle. I've also added a test for the infinite recursion case and ran an arm bootstrap/testsuite/other testing, without finding any problems.

I also had a go at ran a ppc64le bootstrap, but couldn't get it to cross-compile correctly, so just used a dummy linker. All the compiles did OK, no timeouts.

spatel accepted this revision.Feb 2 2018, 7:23 AM

LGTM.

test/Transforms/InstCombine/phi-timeout.ll
2

I recommend just checking the expected (current) output on this test rather than debug output. That potentially gives us more bot coverage...if we inf-loop, we'll probably know sooner from a 'release' buildbot.

This revision is now accepted and ready to land.Feb 2 2018, 7:23 AM
dmgreen updated this revision to Diff 132634.Feb 2 2018, 11:20 AM

Update test. Thanks for the suggestion. I was copying one of the existing tests, but I think you are right, this way is better.

I'll try to run some more tests overnight to be sure this is OK.

This revision was automatically updated to reflect the committed changes.