
[SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand.
ClosedPublic

Authored by az on Jul 10 2018, 12:55 PM.

Details

Summary

One of the transformations performed by SimplifyIndVar is to widen instructions, eliminate sign/zero-extend instructions, and reduce the number of truncate instructions generated when legal. Consider the following common C code fragment within a loop:
p = *(base + x*i); // i is the loop induction variable

If x is a load instruction that is hoisted outside the loop by LICM, then SimplifyIndVar generates optimal code by not emitting any truncate instructions after widening. If x is not hoisted, the generated code is sub-optimal, mainly because SimplifyIndVar relies on scalar evolution, which cannot handle loop-variant expressions. This patch handles the non-hoisted case and generates code similar to the hoisted case (see the output of the .ll file with and without the patch).
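A minimal, hypothetical C sketch of the two cases described above (function names and the volatile qualifier are invented for illustration; volatile simply stands in for any reason LICM cannot hoist the load):

```c
// Identical address arithmetic in both loops; the only difference is
// whether the multiply's first operand is a load hoisted out of the loop.
long sum_hoisted(const char *base, const int *xp, int n) {
  long s = 0;
  int x = *xp; // load hoisted (as LICM would do); the multiply operand
               // is loop-invariant, so widening needs no truncate
  for (int i = 0; i < n; ++i)
    s += base[(long)x * i];
  return s;
}

long sum_not_hoisted(const char *base, volatile const int *xp, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += base[(long)*xp * i]; // load stays in the loop; before this
                              // patch the widened multiply picked up
                              // an extra trunc/sext pair
  return s;
}
```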

The performance effect of redundant truncate and extend instructions can be large, both on strength reduction and on the backend when choosing the appropriate addressing mode.

No performance change on SPEC for ARM Cortex-A72, but a significant improvement on a proprietary benchmark. Note that an alternative to this patch is to move the SimplifyIndVar pass after PRE, because PRE hoists more loop-invariant code than LICM, given that it uses a less conservative but more expensive version of alias analysis. We opted not to change the pass ordering, which can affect performance more than a local change.

Diff Detail

Repository
rL LLVM

Event Timeline

az created this revision.Jul 10 2018, 12:55 PM

How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?

Note that an alternative to this patch is to move SimplifyIndVar pass after PRE because PRE hoists more loop invariant code than LICM given that it uses a less conservative but expensive version of alias analysis.

That might work in some cases, but not in general, so this is probably worth solving anyway...

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1408 ↗(On Diff #154858)

Not sure the getExtendExpr helper is actually buying anything, given you have to check the extension kind anyway.

1460 ↗(On Diff #154858)

It probably isn't a good idea to create a new multiply without erasing the old one.

llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll
310 ↗(On Diff #154858)

Please fix the testcase so this load isn't dead (to make the testcase less fragile).

az updated this revision to Diff 155520.Jul 13 2018, 3:55 PM

How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?

For the unit test case added in this patch, it does not show much in terms of performance. It only shows how SimplifyIndVar, without the patch, generates different code for similar multiply instructions (%mul = ... and %mul1 = ...) that differ only in whether the first operand is hoisted outside the loop. For the hoisted case, it widens the multiply, inserts a sign extend outside the loop, and removes the sign-extend instruction that comes after the mul. In the non-hoisted case, it adds a truncate instruction and leaves the sign extend after the multiply. This patch tries to make SimplifyIndVar generate the same code for both cases, especially since the code for the hoisted case is more efficient and cleaner to work with for the passes that run later. To see a performance improvement due to this patch, consider this C code:
struct info1 { int C; };
struct info2 { int data; };

void foo(struct info1 *in, struct info2 *out, int N, unsigned char *p) {
  int p0, p1, p2;
  for (int x = 1; x < N; ++x) {
    p0 = *(p + (x + 1) * in->C);
    p1 = *(p + (x - 1) * in->C);
    p2 = *(p + (x - 2) * in->C);
    out[N + x].data = p0 - p1 + p2;
  }
  return;
}
Without the patch, here is the AArch64 assembly:

ldr     w9, [x0]
add     x11, x1, w2, sxtw #2
mov     w12, w2
mov     w8, wzr
add     x11, x11, #4            // =4
neg     w10, w9
sub     x12, x12, #1            // =1
lsl     w13, w9, #1

.LBB0_2: // %for.body

                                    // =>This Inner Loop Header: Depth=1
add     w15, w13, w8
ldrb    w14, [x3, w8, sxtw]
ldrb    w15, [x3, w15, sxtw]
add     w16, w10, w8
ldrb    w16, [x3, w16, sxtw]
add     w8, w8, w9
subs    x12, x12, #1            // =1
sub     w14, w15, w14
add     w14, w14, w16
str     w14, [x11], #4
b.ne    .LBB0_2

.LBB0_3: // %for.cond.cleanup

ret

With the patch, here is the generated AArch64 assembly:

ldrsw   x8, [x0]
add     x10, x1, w2, sxtw #2
mov     w12, w2
sub     x12, x12, #1            // =1
add     x10, x10, #4            // =4
neg     x9, x8
lsl     x11, x8, #1

.LBB0_2: // %for.body

                                    // =>This Inner Loop Header: Depth=1
ldrb    w13, [x3, x11]
ldrb    w15, [x3]
ldrb    w14, [x3, x9]
add     x3, x3, x8
sub     w13, w13, w15
add     w13, w13, w14
subs    x12, x12, #1            // =1
str     w13, [x10], #4
b.ne    .LBB0_2

.LBB0_3: // %for.cond.cleanup

ret

There is a performance improvement with the patch because most of the variables involved in computing the addresses for the ldrb instructions are now computed outside the loop. The redundant truncate and sign-extend instructions that feed into loop strength reduction, and in particular induction-variable rewriting, prevent that pass from generating the most efficient code.

az marked an inline comment as done.Jul 13 2018, 3:59 PM
az added inline comments.Jul 13 2018, 4:03 PM
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1408 ↗(On Diff #154858)

I took the code that does the non-scalar-evolution legality check (opcode = {add..}, call to hasNoSignWrap(), etc.) and put it, as is, in a function called getExtendExpr that both the existing code and the patch code call. It may have a little redundancy, but I can rewrite it if needed by inlining it again and replicating only the legality checks I need.

1460 ↗(On Diff #154858)

Actually, this is cloning the use instruction, not removing it yet. The old instruction is removed via the defining instruction when we return up the call chain, if it has no other uses. This is my understanding of how widening the use works. However, I am adding some code in the new revision to immediately remove the new instruction when it is not useful. I used to leave the instruction unused, hoping it would be removed by dead code elimination.

Your example doesn't really help make the case for this patch. The load in that test is actually loop-invariant; we just don't figure that out until after indvars transforms the induction variable. Probably LICM could be fixed to handle this case earlier. Then ultimately, the multiply goes away; the extra operation you're trying to get rid of is actually the sign-extension of a PHI node created by LSR. LSR and/or SCEV could probably be fixed so this produces an i64 PHI instead. Either of those fixes would be more straightforward and more obviously profitable.

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1408 ↗(On Diff #154858)

Okay, it's probably fine as-is.

1460 ↗(On Diff #154858)

We need to make sure to avoid the situation where after this transform runs, there's an i32 multiply and an i64 multiply. It's possible IV rewriting could lead to this sort of situation anyway in certain edge cases, I guess (haven't looked too closely), but multiplies are relatively expensive, and i64 multiplies are more expensive than i32 multiplies.

Actually, more generally, we probably need to weigh the extra cost of the multiply; on multiple targets, an i64 multiply is substantially slower than an i32 multiply.

az updated this revision to Diff 157396.Jul 25 2018, 4:39 PM
az added a comment.Jul 25 2018, 4:44 PM

Your example doesn't really help make the case for this patch. The load in that test is actually loop-invariant; we just don't figure that out until after indvars transforms the induction variable. Probably LICM could be fixed to handle this case earlier. Then ultimately, the multiply goes away; the extra operation you're trying to get rid of is actually the sign-extension of a PHI node created by LSR. LSR and/or SCEV could probably be fixed so this produces an i64 PHI instead. Either of those fixes would be more straightforward and more obviously profitable.

My example is a greatly simplified version of the actual benchmark, but you are absolutely right that we should ideally solve this problem in LICM. If we could, SimplifyIndVar would work with cleanly hoisted loads and LICM would be improved in general. That was my original approach too, and a good portion of the work on this performance issue was spent trying to improve LICM/AA. What I found is that LICM uses a simple/fast but conservative alias analysis (AliasSet). I would have to make major changes to that alias analysis to solve my problem, and they would be unlikely to be accepted given the strong push-back against adding too much complexity to AliasSet. I also thought about making LICM use the more accurate but expensive MemDep alias analysis, in a similar way to how GVN-based PRE uses it (MemDep gives good alias-analysis information for my real benchmark). But given that the LICM pass runs numerous times, this would add substantial compile time. Since I did not have an acceptable fix in LICM/AA, I went to the next best place for a fix, which is SimplifyIndVar. It is the next best place because: 1) it is where the truncate instruction and the inefficient code are first generated; 2) it is an early pass, and I prefer to clean up code early rather than let inefficient IR flow through later passes; and 3) it makes the IndVarSimplify widening optimization more robust. Consider two similar C examples whose only difference is that one has the load inside the loop and the other has it outside (because LICM could not hoist it, whether due to a real legality issue or a LICM limitation, as in my case). Widening generates quite different IR for the two cases, even though the position of the load is irrelevant to widening itself. In other words, we want it to generate similar code for closely similar input IR.
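A hypothetical C sketch distilling the aliasing problem described above (function name invented; structs follow the earlier example):

```c
struct info1 { int C; };
struct info2 { int data; };

// The store to out[i].data never modifies in->C when the two objects
// do not overlap, but a conservative alias analysis cannot prove
// that, so the load of in->C stays inside the loop and SimplifyIndVar
// sees a loop-variant multiply operand.
int sum_inner_load(struct info1 *in, struct info2 *out, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) {
    out[i].data = i;  // store that the analysis must disambiguate
    s += in->C;       // loop-invariant in practice, yet not hoisted
  }
  return s;
}
```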

az added inline comments.Jul 25 2018, 4:57 PM
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1460 ↗(On Diff #154858)

Good catch.
I updated the patch so that it does not generate two versions of the instruction, an i32 and an i64 (for the example below, two versions of the multiply would be a possibility). I have done so by making sure that the instruction in question is consumed by a sign/zero-extend instruction, so that widening gets the full benefit and can eliminate the extend. Having said that, it is quite possible the original code suffers from the same issue of generating multiple versions of instructions; I will look at it after this patch.
As for the general idea of adding a cost model for the widening optimization, it is worth investigating (maybe along with the other optimizations within SimplifyIndVar, in case they do not have a cost model). If you are aware of such examples, please share, but I think you have a good point: we are transforming the code by widening and eliminating some instructions without checking profitability.

az marked an inline comment as done.Jul 25 2018, 4:58 PM
az updated this revision to Diff 157816.Jul 27 2018, 5:20 PM

Some cleanup and updates to comments.

az added a reviewer: sebpop.Aug 23 2018, 2:09 PM
sebpop added inline comments.Sep 4 2018, 11:43 AM
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1400 ↗(On Diff #157816)

clang-format

1422 ↗(On Diff #157816)

Variable names should start with a capital letter.

1430 ↗(On Diff #157816)

Rewrite like so:

if (!isa<SCEVSignExtendExpr>(ExtendOperExpr))
  return false;
1436 ↗(On Diff #157816)

Idem: use the shorter isa<> format.

1471 ↗(On Diff #157816)

Can you please remove all the unneeded parentheses?
Also, I find it clearer if you first check for all the cases that exit the loop with WideningUseful = false; break;

1475 ↗(On Diff #157816)

There is a path on which we would transform the code in this RAUW statement and then, in a later iteration, fail in the else/break clause.
Can we restructure this loop so that the analysis phase is split from the code-generation part? Maybe by using a vector of the things to be replaced. The analysis part that may fail with return false; should be moved before // Generating a widening use instruction.
You could also split all the codegen into a separate function that does not fail, and call it from below...

1597 ↗(On Diff #157816)

... and call it from here.

if (analysisFails())
  return nullptr;
codeGenWiden();
sebpop added inline comments.Sep 4 2018, 11:48 AM
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1597 ↗(On Diff #157816)

this should be:

if (analysisSucceeds()) {
   codeGenWiden();
   return nullptr;
}
junlim added a subscriber: junlim.Sep 5 2018, 8:28 AM
az updated this revision to Diff 164243.Sep 6 2018, 10:45 AM
az marked 5 inline comments as done.Sep 6 2018, 10:50 AM
az added inline comments.
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1471 ↗(On Diff #157816)

Removed one. I tried to remove another that should be fine based on operator-precedence rules, but I got a compiler warning, so it is slightly better but still parenthesis-heavy.

1597 ↗(On Diff #157816)

I did what you have in mind, which is completely separating the analysis from the codegen, but I kept them in the same function with clearly marked analysis and codegen phases, mainly because they share code that may need to be re-executed if separated. I can split them into two functions if you still think this small enhancement to widening needs a separate analysis function and codegen function.

sebpop added inline comments.Sep 6 2018, 12:42 PM
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1597 ↗(On Diff #157816)

Let's split the function into two smaller ones.

1440 ↗(On Diff #164243)

Let's break this down into smaller statements that can be read with ease; here is my suggestion:

if (ExtKind == SignExtended) {
  for () {
    auto *User = ...;
    if (isa<SExtInst>(User) && User->getType() == WideType)
      continue;
    return false;
  }
} else { // ExtKind == ZeroExtended
  for () {
    auto *User = ...;
    if (isa<ZExtInst>(User) && User->getType() == WideType)
      continue;
    return false;
  }
}
az updated this revision to Diff 164285.Sep 6 2018, 2:06 PM
az marked 2 inline comments as done.
sebpop accepted this revision.Sep 6 2018, 4:18 PM

The patch looks good to me.
Please address the last two comments and then apply.

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1440 ↗(On Diff #164285)

Let's simplify the loop like so:

for () {
  ZExtInst *User = dyn_cast<ZExtInst>(U.getUser());
  if (!User || User->getType() != WideType)
    return false;
}

and the same for the signExtend loop.

1507 ↗(On Diff #164285)

Please remove the return statement.

This revision is now accepted and ready to land.Sep 6 2018, 4:18 PM
az updated this revision to Diff 164504.Sep 7 2018, 1:43 PM
az marked 2 inline comments as done.
This revision was automatically updated to reflect the committed changes.