This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] reassociate loop invariant GEP chains to enable LICM
ClosedPublic

Authored by sebpop on Nov 10 2017, 9:01 AM.

Details

Summary

This change brings performance of zlib up by 10%. The example below is from a
hot loop in longest_match() from zlib.

do.body:

%cur_match.addr.0 = phi i32 [ %cur_match, %entry ], [ %2, %do.cond ]
%idx.ext = zext i32 %cur_match.addr.0 to i64
%add.ptr = getelementptr inbounds i8, i8* %win, i64 %idx.ext
%add.ptr2 = getelementptr inbounds i8, i8* %add.ptr, i64 %idx.ext1
%add.ptr3 = getelementptr inbounds i8, i8* %add.ptr2, i64 -1

In this example %idx.ext1 is a loop invariant. It will be moved above the use of
loop induction variable %idx.ext such that it can be hoisted out of the loop by
LICM. The operands that have dependences carried by the loop will be sinked down
in the GEP chain. This patch will produce the following output:

do.body:

%cur_match.addr.0 = phi i32 [ %cur_match, %entry ], [ %2, %do.cond ]
%idx.ext = zext i32 %cur_match.addr.0 to i64
%add.ptr = getelementptr inbounds i8, i8* %win, i64 %idx.ext1
%add.ptr2 = getelementptr inbounds i8, i8* %add.ptr, i64 -1
%add.ptr3 = getelementptr inbounds i8, i8* %add.ptr2, i64 %idx.ext

Diff Detail

Event Timeline

DIVYA created this revision.Nov 10 2017, 9:01 AM
DIVYA added a comment.EditedNov 10 2017, 9:10 AM

This patch helps to reduce the number of instructions generated by llvm for aarch64 for the longest_match() hottest functions in zlib-ng library.

Assembly code for the longest_match function before applying the patch.
The code contains 2 adds inside the loop
.LBB0_7: // %do.body37

                                //   in Loop: Header=BB0_8 Depth=2
add             x26, x8, x20
add             x10, x26, x9
ldurh   w10, [x10, #-1]
cmp             w10, w28, uxth
b.eq    .LBB0_10

After applying the patch
.LBB0_8: // %if.then49

                                      //   Parent Loop BB0_5 Depth=1
                                      // =>  This Inner Loop Header: Depth=2

add   x26, x8, x20
ldrh    w10, [x26, x9]
cmp   w10, w28, uxth
b.ne  .LBB0_8

.LBB0_11:

DIVYA added a comment.EditedNov 10 2017, 9:13 AM

Perf stat result before applying the patch for aarch64 for the minizip executable of zlib-ng library(Used llvm/build/lib as the source file to be compressed)

267701.266460      task-clock (msec)         #    0.503 CPUs utilized
         35,437      context-switches          #    0.132 K/sec
            147      cpu-migrations            #    0.001 K/sec
            176      page-faults               #    0.001 K/sec
294,334,720,322      cycles                    #    1.099 GHz
326,901,036,222      instructions              #    1.11  insns per cycle

Perf stat result before applying the patch for aarch64 or the minizip executable of zlib-ng library

265034.440160      task-clock (msec)         #    0.500 CPUs utilized
         35,480      context-switches          #    0.134 K/sec
            161      cpu-migrations            #    0.001 K/sec
            180      page-faults               #    0.001 K/sec
291,392,526,820      cycles                    #    1.099 GHz
321,014,404,980      instructions              #    1.10  insns per cycle
DIVYA edited the summary of this revision. (Show Details)Nov 10 2017, 9:18 AM
mgrang added a subscriber: mgrang.Nov 10 2017, 11:03 AM
mgrang added inline comments.
lib/Transforms/InstCombine/InstructionCombining.cpp
1663

nit: Period after comment.

1697–1713

Please consider using early exits: https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code.

Something like:

if (!LI)
  return nullptr;
<main logic goes here>
1701

nit: Period after comment.

efriedma edited edge metadata.Nov 10 2017, 11:22 AM

See also https://reviews.llvm.org/D8911 / https://bugs.llvm.org/show_bug.cgi?id=23163, which originally introduced this check. I don't think checking whether the operands are loop-invariant really solves that problem... you still get exactly the same code duplication, just not inside the innermost loop.

lib/Transforms/InstCombine/InstructionCombining.cpp
1703

Constants are loop-invariant; no need to check for them explicitly.

test/Transforms/InstCombine/gep-combine-loop-invariant.ll
3

Please write tests to only run one pass; putting multiple passes in the RUN line makes it confusing to figure out what you're actually expecting each individual pass to do. (You can add a comment to explain, if it isn't obvious why a transform is profitable.)

DIVYA updated this revision to Diff 122667.Nov 13 2017, 8:55 AM
DIVYA marked 5 inline comments as done.
DIVYA added a comment.Nov 13 2017, 9:06 AM

For the testcase in https://bugs.llvm.org/show_bug.cgi?id=23163, the patch will not create extra instructions , since the operands are not loop invariants.
And for the cases where the extra instructions are produced, gep(gep ...) merging optimization happens only when the second operands are loop invariant , and hence LICM pass will move them out of the loop.
Also, I am checking if the first operand of Src is not loop invariant.Since , if the first operand of Src Gep is loop invariant and both the second operands are also loop invariants, then they shouldn't be combined as LICM will anyway hoist them out of the loop and combining will only create extra instructions.

And for the cases where the extra instructions are produced, gep(gep ...) merging optimization happens only when the second operands are loop invariant , and hence LICM pass will move them out of the loop.

"L" here is the innermost loop relative to the GEP. The innermost loop might not be the important loop (e.g. it might have a low trip count and get unrolled). And even if the addition is hoisted out of the hottest loop, you're still increasing codesize and register pressure.

Do you have numbers for LLVM testsuite or SPEC or something?

lib/Transforms/InstCombine/InstructionCombining.cpp
1712

CreateAdd can't return nullptr.

mgrang added inline comments.Nov 13 2017, 5:19 PM
lib/Transforms/InstCombine/InstructionCombining.cpp
1707

Too many parentheses. Can be simplified:

if (!L->isLoopInvariant(GO1) || !L->isLoopInvariant(SO1) ||
    L->isLoopInvariant(Src->getOperand(0)))
sebpop added a subscriber: sebpop.Mar 8 2018, 1:24 PM

The motivation behind this patch is that it brings up the performance
of gzip deflate by 10%: this pattern occurs in the hot spot of
longest_match() in zlib.

To answer Eli's question, here are the performance numbers
for the spec2000 int with and without the patch on A72 firefly:
a positive value is a speedup percent, a negative value is a slowdown percent

164.gzip .1000
175.vpr -1.5400
176.gcc -.5200
177.mesa .6200
179.art 2.1500
181.mcf -.7100
183.equake 1.2800
186.crafty .5400
188.ammp -.8700
197.parser .0400
252.eon .4700
253.perlbmk .7200
254.gap -.8300
255.vortex -.3400
256.bzip2 -.0400
300.twolf .4800

I will update the patch with all corrections asked in previous comments.

sebpop commandeered this revision.Mar 8 2018, 1:33 PM
sebpop updated this revision to Diff 137649.
sebpop added a reviewer: DIVYA.

Update patch on today's tree.
Fixed parentheses.

sebpop updated this revision to Diff 137652.Mar 8 2018, 1:46 PM
sebpop edited the summary of this revision. (Show Details)

The 175.vpr result looks bad.

I'm still not convinced this is the right approach; if you look at the testcase as a reassociation problem, you can optimize it without duplicating code.

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1726 ↗(On Diff #137649)

CreateAdd (still) can't return null.

sebpop added a comment.Mar 8 2018, 2:00 PM

The 175.vpr result looks bad.

That's noise, I just reran that benchmark and the 1% slowdown is within the noise level: in the new run vpr shows a speedup of 1.47%.

I'm still not convinced this is the right approach; if you look at the testcase as a reassociation problem, you can optimize it without duplicating code.

Could you please explain how you want to optimize this sequence?
I also don't see where "duplicating code" comes from.

Also looking at the AVX512 example that needed to be amended, you see that
this pattern matches at quite some spots in there, and that it transforms two geps
into only one.

I also don't see where "duplicating code" comes from.

You're inserting an "add" instruction without removing any other instructions; in general, this hurts codesize/performance. This is why the combine is restricted in the first place. You might get lucky sometimes and simplify away the extra instruction, but you can't rely on that.

The "reassociation" way to optimize this is to consider the following two instructions:

%add.ptr = getelementptr inbounds i8, i8* %win, i64 %idx.ext
%add.ptr2 = getelementptr inbounds i8, i8* %add.ptr, i64 %idx.ext1

These can be reassociated into the following:

%tmp = getelementptr inbounds i8, i8* %win, i64 %idx.ext1
%add.ptr2 = getelementptr inbounds i8, i8* %tmp, i64 %idx.ext

The first GEP then gets hoisted out of the loop.

sebpop updated this revision to Diff 137834.Mar 9 2018, 1:58 PM
sebpop retitled this revision from [InstCombine] Allowing GEP Instructions with loop Invariant operands to combine to [InstCombine] reassociate loop invariant GEP chains to enable LICM.
sebpop edited the summary of this revision. (Show Details)

Rewrote the patch as suggested by Eli.

efriedma added inline comments.Mar 9 2018, 2:10 PM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1683 ↗(On Diff #137834)

I think you need to check Src->hasOneUse()?

sebpop updated this revision to Diff 137869.Mar 9 2018, 4:24 PM
sebpop marked an inline comment as done.Mar 19 2018, 8:16 AM

Ping.

This revision is now accepted and ready to land.Mar 23 2018, 2:52 PM