This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction.
AbandonedPublic

Authored by mcrosier on Mar 6 2015, 12:26 PM.

Details

Summary

This patch attempts to eliminate redundant loads whose addresses are dependent upon a select instruction.

First, look for patterns such as:

(load (gep (select (cond, ptr_a, ptr_b)) idx0, idx1, ..., idxN))
(load (gep ptr_op, (select (cond, idx_a, idx_b)), idx1 ..., idxN))
(load (gep ptr_op, idx0, (select (cond, idx_a, idx_b)), ..., idxN))

Then find similar loads and perform a transformation such as:

Constant zero index with different pointer operands:

a = (load (gep ptr_a 0 0))
b = (load (gep ptr_b 0 0))
c = (load (gep (select (cond, ptr_a, ptr_b)) 0 0))
-->
a = (load (gep ptr_a, 0, 0))
b = (load (gep ptr_b, 0, 0))
c = (select (cond, a, b))

and

GEPs differ by only a single index:

a = (load (gep ptr_op, idx, idx_a))
b = (load (gep ptr_op, idx, idx_b))
c = (load (gep ptr_op, idx, (select (cond, idx_a, idx_b))))
-->
a = (load (gep ptr_op, idx, idx_a))
b = (load (gep ptr_op, idx, idx_b))
c = (select (cond, a, b))

See the PR for additional information.
http://llvm.org/bugs/show_bug.cgi?id=21052

This work is based on a patch done by Tilmann Scheller (attached to PR21052).

Basically, I moved the patch from InstCombine to GVN, so I would have dependency information. I also made the solution a bit more general.

This is a WIP, but I wanted to get some feedback before continuing. After considering the suggestions from Nick L. and David M. (see the bug report), I didn't see how those suggestions would benefit this patch (at least not in local scope).

Here are the tasks that remain:

  1. Use use-def and def-use chains rather than iterate over instructions.
  2. Allow the base load to be fed by stores as well as loads.
  3. Perform additional correctness/perf testing.

Please feel free to comment.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 21385.Mar 6 2015, 12:26 PM
mcrosier retitled this revision from to [GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction..
mcrosier updated this object.
mcrosier edited the test plan for this revision. (Show Details)
mcrosier set the repository for this revision to rL LLVM.
mcrosier added a subscriber: Unknown Object (MLST).
mcrosier added a subscriber: jmolloy.
mcrosier updated this revision to Diff 21867.Mar 12 2015, 12:34 PM
mcrosier added a reviewer: dberlin.
mcrosier removed rL LLVM as the repository for this revision.

Revised patch with the following changes:
-Reworked the code so that we now use def-use and use-def, rather than iterate over all instructions in the block.
-Store can now feed the select. I.e.,

a = (load (gep ptr_a 0 0))
store b (gep ptr_b 0 0))
c = (load (gep (select (cond, ptr_a, ptr_b)) 0 0))
-->
a = (load (gep ptr_a, 0, 0))
store b (gep ptr_b 0 0))
c = (select (cond, a, b))

There were no correctness or performance regressions across spec2000, spec2006, and eembc on an A57 device. The only benchmarks that hit this transform are spec2000/gcc, spec2000/vpr, and spec2006/dealII. I saw a ~1% improvement in vpr, but that's basically noise. :o/

Also, adding Daniel B. to the reviews as he seems to be working in this area as of late.

All, please have a look.

Chad

mcrosier abandoned this revision.Mar 16 2015, 5:41 AM
reames added a subscriber: reames.Mar 24 2015, 5:17 PM

A few comments inline. I didn't spot any obvious problems with the code, but it feels like you're trying to solve a problem in one step that should probably be sub-divided into several. I suggested one possible division, but there might be others that make more sense.

lib/Transforms/Scalar/GVN.cpp
1855

Another way you might be able to phrase this would be to ask GVN if it had a value which would be CSEd with a newly created GEP at the current instruction.

1868

This restriction feels like a hack.

1903

Ignoring your code entirely for a moment....

Your starting IR fragment seems really odd to me. I'd have expected this to be canonicalized as:
a_addr = gep ptr_op idx idx_a
b_addr = gep ptr_op idx idx_b
a = (ld|st a_addr
b = (ld|st b_addr
c = select (cond, a_addr, b_addr)
c = load c_addr

This form would make your transform far more obvious once we actually got to PRE.

Thoughts?

1915

Why only one level?

reames added inline comments.Mar 24 2015, 5:19 PM
lib/Transforms/Scalar/GVN.cpp
2025

Ah, you need to check for a local dependence if you're making that assumption. Spotted that immediately after sending last comments. :)

dberlin edited edge metadata.Mar 24 2015, 5:31 PM

So, this is an awfully specific redundancy to detect in GVN.

The more general case doesn't use select's but phi's, and is tested in gcc
with GVNPRE with testcases like:

int
foo (int i)
{

int a, b;
if (i)
    a = 3, b = 2;
else
    a = 2, b = 3;
return a + b;

}

and more complex testcase that involve expressions.
But the basic idea is the same - if both values are computed already, the
end is redundant.
(here, the result is constant along both edges)

GVN doesn't touch this (and wouldn't with your patch):

; Function Attrs: nounwind readnone ssp uwtable
define i32 @foo(i32 %i) #0 {

%1 = icmp eq i32 %i, 0
%. = select i1 %1, i32 2, i32 3
%.1 = select i1 %1, i32 3, i32 2
%2 = add nsw i32 %., %.1
ret i32 %2

}

GCC treats this as a GVNPRE problem rather than a GVN one, because it
involves computing available values along edges.

Putting these as selects, sadly, makes it harder for things like PRE to see
this (though you could treat the selects are false blocks with incoming
edges)
It would compute the values are available along each edge already, and
insert a no-cost phi to merge them.

In any case, if we want it, sure. I have no objections as even in GVN,
this is trivial to add to new gvn, it just costs time as a special case
until i rewrite PRE based on GVN.

I've abandoned this patch (originally abandoned on March 16th). Overall, I saw no significant performance improvements across SPEC2K/SPEC2K6 and after further testing there was one significant regression (i.e., 6% on spec2000/vpr), which was the workload I was targeting. While the patch does remove the redundant load, that redundancy was replaced by another set of redundant instructions, fcmp/csel. IIRC, csel instructions are bad for A57 devices in general.

I agree that a more general solution should be considered and if the new PRE pass can enable that solution, I'm in no rush to push this patch forward.

Regardless, I do appreciate everyones feedback!

Out of curiosity, at what phase do you see these redundancies?
GCC used to canonicalize on the CFG version and do select formation late
specifically to address some of these issues.

Not to hijack this discussion, but this is related to something I’ve been looking at (and have discussed with Hal and James previously), which is select vs. phi canonicalization.

Currently SimplifyCFG does most of the converting of phis to selects in IR. There is also early and late if-conversion in some backends, but let’s ignore that for now.

CodeGenPrepare does the reverse (i.e. select -> phi/branch) for some targets and some selects.

SimplifyCFG uses a target instruction cost model of the speculated instructions to decide whether to do phi -> select conversion. It currently uses a cutoff of 2*Basic if I recall correctly.

I’d like to start a discussion on what the “right” canonical representation of these operations is over the phases of optimization. It seems to me that keeping branches around longer would be better since it would allow global optimizations to only have to worry about phis. It would also create blocks where instructions could be sunk. For example:

int selects(int a, int b, int c, int d) {

int x1, x2, x3;

x1 = a / b;

x2 = b / c;

x3 = c / d;

if (a < b) {

    x1 = 0;

    x2 = 0;

    x3 = 0;

}

return x1 + x2 + x3;

If we don’t convert the above to selects, then the CodeSinking pass is able to sink the divides into an else block that it creates. (As an aside this brings up the question of canonical location of instructions w.r.t. sinking/forwarding, which I haven’t seen discussed either).

I believe the downside to keeping phis/branches in the IR (someone please correct me if I’m wrong here), is that we break up the scope of later non-global passes. I don’t have a feel for how big of a problem this is or how hard it would be to fix by e.g. extending these passes to work on single-entry single-exit regions or even superblocks.

Any thoughts on this?

Geoff Berry

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

From: Daniel Berlin [mailto:dberlin@dberlin.org]
Sent: Wednesday, March 25, 2015 10:25 AM
To: reviews+D8120+public+d2c9388bc8837c4c@reviews.llvm.org
Cc: tilmann.scheller@googlemail.com; Nick Lewycky; David Majnemer; Hal Finkel; Philip Reames; James Molloy; mssimpso@codeaurora.org; gberry@codeaurora.org; Commit Messages and Patches for LLVM
Subject: Re: [PATCH] [GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction.

Out of curiosity, at what phase do you see these redundancies?

GCC used to canonicalize on the CFG version and do select formation late specifically to address some of these issues.