This allows conversions to i8/i16/i32 (very common cases) even if the
resulting type is not legal. This can often open up extra combine
opportunities.
Details

- Reviewers: spatel, efriedma, hfinkel, bogner, arsenm, t.p.northover, sdardis, asb
- Commits:
  - rG7174023f5730: [InstCombine] Allow common type conversions to i8/i16/i32
  - rGe11f0545db7a: [InstCombine] Allow common type conversions to i8/i16/i32
  - rL324174: [InstCombine] Allow common type conversions to i8/i16/i32
  - rL323951: [InstCombine] Allow common type conversions to i8/i16/i32

Diff Detail

- Repository: rL LLVM

Event Timeline
Let me cross-ref back to the llvm-dev thread for the motivating example and perf results so far:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120522.html
...and add some reviewers who likely have better intuition about the potential backend effects on non-x86 targets.
Reminder: this should have no effect on x86 because i8/i16 are already data-layout-legal there.
lib/Transforms/InstCombine/InstructionCombining.cpp
156–157 ↗ (On Diff #131076): See Alex Bradbury's comment in the llvm-dev thread about i32 too. Maybe we generalize this for ToLegal as:

    bool ToLegal = ToWidth == 1 || isPowerOf2_32(ToWidth) || DL.isLegalInteger(ToWidth);
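That suggested generalization can be sketched as a standalone predicate. This is only an illustration: FakeDataLayout and toLegal are made-up names standing in for the real DataLayout API, not LLVM code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for DataLayout: the set of native integer widths
// from the target's "n..." datalayout spec (e.g. {32} for "n32").
struct FakeDataLayout {
  std::set<unsigned> NativeWidths;
  bool isLegalInteger(unsigned Width) const {
    return NativeWidths.count(Width) != 0;
  }
};

static bool isPowerOf2_32(uint32_t V) { return V && (V & (V - 1)) == 0; }

// The generalized ToLegal check from the comment above, as a free function:
// accept i1, any power-of-two width, or a datalayout-native width.
static bool toLegal(unsigned ToWidth, const FakeDataLayout &DL) {
  return ToWidth == 1 || isPowerOf2_32(ToWidth) || DL.isLegalInteger(ToWidth);
}
```

On an "n32"-style target this admits i8/i16 (powers of two) as well as i32 (native) and rejects e.g. i24; note it also admits i2 and i4, which is the objection raised in the next comment.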
test/Transforms/InstCombine/should-change-type.ll
3 ↗ (On Diff #131076): Should be able to reduce that string to just "n32"?
5 ↗ (On Diff #131076): trunk -> trunc
lib/Transforms/InstCombine/InstructionCombining.cpp
156–157 ↗ (On Diff #131076): pow2 isn't quite right; nobody wants i2 or i4. :)
lib/Transforms/InstCombine/InstructionCombining.cpp
149 ↗ (On Diff #131212): I almost wrote this as a FIXME:
156–157 ↗ (On Diff #131076): I took the simple route. Not sure what's really best here. ARM/AArch64 has LDRB/LDRH for loading bytes/halfwords, the same for stores, and UXTB/UXTH for extends. These types feel like they are almost "semi-legal".
test/Transforms/InstCombine/select-bitext.ll
117 ↗ (On Diff #131212): Although longer, this produced the same final assembly as the old function on any target I tried.
test/Transforms/InstCombine/select-bitext.ll
117 ↗ (On Diff #131212): We canonicalize trunc+sext to shifts, and we have a similar fold for:

    // ashr (shl (zext X), C), C --> sext X

...so I think we can add a fold for this too (guarded by shouldChangeType): https://rise4fun.com/Alive/NVU
I'll post that for review, so we can eliminate this sequence as a potential cause of regressions.
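As a concrete sanity check of that fold (plain C++ mirroring the IR semantics, not LLVM code): for an 8-bit value zero-extended to 32 bits, shifting left then arithmetic-shifting right by 24 reproduces sign extension.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// ashr (shl (zext X), 24), 24 for X : i8 zero-extended to i32.
static int32_t viaShifts(uint8_t X) {
  uint32_t Shl = static_cast<uint32_t>(X) << 24; // shl (zext X), 24
  int32_t Signed;
  std::memcpy(&Signed, &Shl, sizeof Signed);     // reinterpret bits as signed
  return Signed >> 24;                           // ashr (arithmetic on common ABIs)
}

// sext X to i32 for X : i8.
static int32_t viaSext(uint8_t X) {
  return static_cast<int32_t>(static_cast<int8_t>(X));
}
```

Checking all 256 i8 inputs shows the two agree, matching the Alive proof linked above.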
test/Transforms/InstCombine/select-bitext.ll
117 ↗ (On Diff #131212): Actually, adding a fold for this won't work because we transform the other way (to shifts) in InstCombineCasts:

    DEBUG(dbgs() << "ICE: EvaluateInDifferentType converting expression type"
                    " to avoid sign extend: " << CI << '\n');
    // We need to emit a shl + ashr to do the sign extend.

So if we try to reduce to the form with sext, we'll infinite loop on a case like this:

    target datalayout = "n8:16:32"

    define i16 @trunc_ashr(i32 %x) {
      %t = trunc i32 %x to i16
      %s = shl i16 %t, 8
      %a = ashr i16 %s, 8
      ret i16 %a
    }

It doesn't show up in the test here because this file doesn't specify a data-layout.
I added a different transform that is guarded by shouldChangeType() at rL323437.
The tests added in that patch will be affected by this patch, so please rebase. I don't have any other suggestions, so if there are no other comments, let's try it?
I think the difference in AArch64 code with this patch for the examples in PR35792 would be:
    $ ./opt 35792.ll -S | ./llc -o - -mtriple=aarch64
    julia_a_62828:              // @julia_a_62828
            mov  w8, w0
            sub  w9, w0, #1     // =1
            and  x0, x9, x8
            ret
    bad:                        // @bad
            orr  w8, wzr, #0xffff
            add  w8, w0, w8
            and  w0, w0, w8
            ret
vs. after this patch:
    $ ./opt -instcombine 35792.ll -S | ./llc -o - -mtriple=aarch64
    julia_a_62828:              // @julia_a_62828
            sub  w8, w0, #1     // =1
            and  w0, w8, w0
            ret
    bad:                        // @bad
            sub  w8, w0, #1     // =1
            and  w0, w8, w0
            ret
lib/Transforms/InstCombine/InstructionCombining.cpp
147 ↗ (On Diff #131212): or -> of ?
Thanks for the info. Unfortunately it looks like I was previously building without all backends (or an intervening commit has changed things?), and a test in Hexagon (CodeGen/Hexagon/loop-idiom/pmpy-mod.ll) is failing due to this. It looks like the code entering the Hexagon loop recogniser is shorter, but is no longer recognised and transformed to a single intrinsic.
Does anyone know if there is a more sensible way to do this? Perhaps enable it just for the ARM/AArch64 backends, rather than for all targets, no matter how odd they are. Or do we think this is the best way to do things?
The sensible thing to do would be to have a well-defined canonical form of the IR, instead of "the most strength-reduced" form. This is an example of why the lack of one keeps creating issues, and of why idiom recognition that goes beyond a trivial "memcpy" or "memset" is notoriously difficult to maintain. We could still have the "maximum combining", but after idiom recognition has run.
The code in the polynomial multiplication recognition has even evolved to have its own expression transformations to "undo" various changes that the combining had developed since the initial pattern matching was written. I could change that code again, but unless there is a plan in place to fix this ongoing issue, this is not going to be a high priority for me.
The target-specific hack would be to change the data-layout for those targets, so i8 and i16 are legal ints. I don't know if that would cause any conflicts with legalization though. You could try that locally and see if anything breaks?
Yes, making them legal types in the datalayout was one of the things I tried. It made things (a little) worse in the quick tests I ran, but didn't seem to break anything. It was only a quick set of tests, and the differences vs. this patch were quite small. It obviously has further-reaching implications than this change, which is confined to instcombine.
The purist in me is looking for something a little better, but I think what we have here is a sensible approach to take. We can always change it if not.
Krzysztof, thanks for the info. It sounds like having a canonical representation we all agree on would be a hard problem to solve :) I think that is how this all began: wanting a canonicalisation of selects vs the CFG that played better with GVN and similar passes.
It appears that the loop idiom recogniser is not picking this up because it does not know how to handle the truncs that are now part of the IR. There are a number of other changes in the IR going into the idiom recogniser, like icmp eq 1 vs icmp eq 0, and(xor()) vs xor(and()), etc. Those are all handled fine; it's only the new trunc instructions it doesn't know anything about.
I'm not familiar with LIR, but if this isn't matching for Hexagon now, then isn't it already a problem for other targets? I.e., aren't we failing to recognize the idiom for x86 because it must already be transformed into the form that is not recognized? Is it difficult/undesirable to make the matching more flexible?
Oops - disregard. I see now that we're talking about a target-specific LIR that produces a target-specific intrinsic.
So let me adjust the question: if source code was written in a way that this new variant of IR with truncs was already created (independent of this patch), then we're already failing to match, right? Or is there something that guarantees that that pattern is not created from source in the first place?
The motivating source had short types (uint8_t and uint16_t) or equivalent, if I remember correctly, and the passes running prior to LIR "legalized" the types to i32. The pattern matching code was written to detect a loop that does polynomial multiplication and polynomial division (with a remainder). Originally it was matching the code that was generated from the sources at the time (and the operating type was i32 after extensions). Over time, changes in combining caused it to fail without us noticing (we didn't have that testcase, just something with a pre-generated IR going directly to the LIR pass). Due to the complexity of the LIR code, it was left alone, and a pre-processing step was added to it to convert the IR back to the form that it looked for, essentially undoing the combiner's changes. This was done on a "local copy" of the IR, so to speak, so it didn't really undo it for the rest of the passes. If the match succeeded, the loop would go away, otherwise, the IR would stay (and the "local copy" would be deleted).
What I want out of the canonical form is predictability. I could not care less if it's suboptimal, I just want to have a form that I can match against, and only at a specified point in the optimization sequence. All the aggressive optimizations can still happen, but after idiom recognizers had their chance to take a look at the code. In other words, I'd advocate to split these transformations that make code look predictable from those that make it optimized.
Going back to this issue. I'm not opposed to the change itself, if it enables more things to happen then it sounds like a good idea. At the same time we (Hexagon) cannot give up that LIR pass. If it's only a question of truncates, it may be easy to change the LIR code, but I don't want to be finding myself in this situation over and over again. The current model simply does not support idiom recognition.
OK, on to some technical details. This is what we used to have heading into the recogniser (slightly edited for clarity):
    %v64 = phi i8 [ 0, %b2 ], [ %v45, %b9 ]
    %v53 = phi i16 [ %a1, %b2 ], [ %v43, %b9 ]
    %v42 = phi i8 [ %a0, %b2 ], [ %a, %b9 ]
    %0 = and i8 %v42, 1
    %v11 = zext i8 %0 to i32
    %1 = and i16 %v53, 1
    %v14 = zext i16 %1 to i32
    %v15 = xor i32 %v14, %v11
    %a = lshr i8 %v42, 1
    %v21 = icmp eq i32 %v15, 1
    %b = lshr i16 %v53, 1
    %v36 = xor i16 %b, -24575
    %v43 = select i1 %v21, i16 %v36, i16 %b
    %v45 = add nuw nsw i8 %v64, 1
    %v8 = icmp ult i8 %v45, 8
    br i1 %v8, label %b9, label %b46
This is what it is now:
    %v64 = phi i8 [ 0, %b2 ], [ %v45, %b9 ]
    %v53 = phi i16 [ %a1, %b2 ], [ %v43, %b9 ]
    %v42 = phi i8 [ %a0, %b2 ], [ %a, %b9 ]
    %0 = trunc i16 %v53 to i8
    %v111 = xor i8 %v42, %0
    %v15 = and i8 %v111, 1
    %a = lshr i8 %v42, 1
    %v21 = icmp eq i8 %v15, 0
    %b = lshr i16 %v53, 1
    %v36 = xor i16 %b, -24575
    %v43 = select i1 %v21, i16 %b, i16 %v36
    %v45 = add nuw nsw i8 %v64, 1
    %v8 = icmp ult i8 %v45, 8
    br i1 %v8, label %b9, label %b46
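The key rewrite between the two versions can be checked numerically in isolation (plain C++ mirroring the IR, not the actual pass): the old form computes (a&1)^(b&1) and selects on == 1, while the new form computes (a^b)&1 and selects on == 0 with the select arms swapped, so both pick the same value.

```cpp
#include <cassert>
#include <cstdint>

// Old IR shape: xor(zext(a & 1), zext(b & 1)); icmp eq 1; select %cmp, %t, %f
static uint16_t oldSelect(uint8_t A, uint16_t B, uint16_t T, uint16_t F) {
  uint32_t V15 = static_cast<uint32_t>(A & 1) ^ static_cast<uint32_t>(B & 1);
  return (V15 == 1) ? T : F;
}

// New IR shape: and(xor(a, trunc(b)), 1); icmp eq 0; select %cmp, %f, %t
// (arms swapped to match the inverted compare)
static uint16_t newSelect(uint8_t A, uint16_t B, uint16_t T, uint16_t F) {
  uint8_t V15 = static_cast<uint8_t>((A ^ static_cast<uint8_t>(B)) & 1);
  return (V15 == 0) ? F : T;
}
```

Exhaustively comparing the low bits of both operands shows the two select chains always agree, so the IR change is purely a reassociation plus a compare inversion.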
I managed to hackily get things working by adding Trunc as a case to isPromotableTo, commutesWithShift and keepsHighBitsZero, and then doing something like "if (isa<TruncInst>(Var)) Var = Var->getOperand(0)" in scanSelect. I can't claim these changes are complete, properly thought through, or tested at all :-/ but they did get at least this test case working correctly again.
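That last workaround amounts to a "look through truncs" step in the matcher. A toy sketch of the idea (Node and the opcode strings are illustrative; this is not the LIR code):

```cpp
#include <cassert>
#include <string>

// Toy node standing in for llvm::Instruction; only the shape matters here.
struct Node {
  std::string Opcode;       // e.g. "trunc", "lshr", "arg"
  Node *Operand = nullptr;  // the single operand the matcher follows
};

// When the matcher reaches a trunc, step through to its source operand and
// keep matching there; anything else is returned unchanged.
static Node *lookThroughTrunc(Node *V) {
  while (V && V->Opcode == "trunc")
    V = V->Operand;
  return V;
}
```

A more principled fix would also verify that the dropped high bits don't matter to the pattern being matched, which is what the isPromotableTo/keepsHighBitsZero changes are approximating.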
I was impressed that the rest of the pass just worked correctly, even with the other knock-on changes. When I was looking through it, I felt like an "is equivalent to" matcher would have been useful. Might be very hard to make in practice though.
Oh brilliant, what a champ! Looks like there were a few more changes needed than just my hacks. Thank you for the fixes, much appreciated.
I did a build and all tests now pass.
Thanks. Let's see how it goes.
There may be some blowback from this one; let me know if anything comes up.
Alas, it looks like one came up already.
It was causing timeouts on at least some ppc64le bots, along with some other suspicious-looking cases. Reverted in rL323959. I'll look into why.
OK. So the reason it was looping is two parts of InstCombinePhi acting against one another.
FoldPHIArgsIntoPHI (guarded by shouldChangeType) would sink a phi i8 (trunc a, trunc b) to trunc (phi i16 (a, b)), and SliceUpIllegalIntegerPHI (guarded by !isLegalType) would do the opposite.
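A toy fixpoint model of that interaction shows why the ToWidth < FromWidth guard terminates; the widths, legality set, and function shapes here are illustrative, not the real InstCombine code.

```cpp
#include <cassert>

// On an "n32"-style target, neither i8 nor i16 is a legal integer type.
static bool isLegal(unsigned Width) { return Width == 32; }

// State: false = "phi i8 (trunc, trunc)" form, true = "trunc (phi i16)" form.
// Run both rewrites to a fixpoint; return iterations used, or Cap if the two
// transforms keep undoing each other (the reported infinite loop).
static int iterationsToFixpoint(bool AllowWidening, int Cap) {
  const unsigned Narrow = 8, Wide = 16;
  bool WideForm = false;
  for (int I = 0; I < Cap; ++I) {
    bool Changed = false;
    // FoldPHIArgsIntoPHI: sink the truncs, widening the phi (guarded by
    // shouldChangeType; the fix only permits ToWidth < FromWidth).
    bool ShouldChange = AllowWidening || Wide < Narrow;
    if (!WideForm && ShouldChange) { WideForm = true; Changed = true; }
    // SliceUpIllegalIntegerPHI: split the phi back up when its type is
    // not legal for the target.
    if (WideForm && !isLegal(Wide)) { WideForm = false; Changed = true; }
    if (!Changed) return I; // reached a fixpoint
  }
  return Cap; // never settled
}
```

With widening allowed the two rewrites ping-pong forever; with the ToWidth < FromWidth guard the widening step never fires on an illegal destination, so the worklist settles immediately.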
I've updated back to an older version here, where we only do the conversion if ToWidth < FromWidth, breaking the cycle. I've also added a test for the infinite-recursion case, and run an ARM bootstrap, the test suite, and other testing without finding any problems.
I also had a go at running a ppc64le bootstrap, but couldn't get it to cross-compile correctly, so I just used a dummy linker. All the compiles went OK, no timeouts.
LGTM.
test/Transforms/InstCombine/phi-timeout.ll
2 ↗ (On Diff #132575): I recommend just checking the expected (current) output on this test rather than debug output. That potentially gives us more bot coverage... if we inf-loop, we'll probably know sooner from a 'release' buildbot.
Update test. Thanks for the suggestion. I was copying one of the existing tests, but I think you are right, this way is better.
I'll try to run some more tests overnight to be sure this is OK.